Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Proposal: Span<char> from null-terminated char* #40202

Closed
fbrosseau opened this issue Jul 31, 2020 · 47 comments · Fixed by #47539
Closed

API Proposal: Span<char> from null-terminated char* #40202

fbrosseau opened this issue Jul 31, 2020 · 47 comments · Fixed by #47539
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory
Milestone

Comments

@fbrosseau
Copy link

fbrosseau commented Jul 31, 2020

Background and Motivation

PInvoke scenarios (mostly those on Windows) interact a lot with null-terminated wide strings.

String has always had a ctor(char*), which in .Net Core now uses a highly optimized wcslen, internal to the framework.

I am proposing the equivalent functionality, minus the allocation+copy, using Span<>.
Given a null-terminated wide string as input (in the shape of unsafe char*), return a Span whose length is the count of characters before the null.

I am proposing Span<> and not ReadOnlySpan<>, to let the caller decide what to do with the result, according to their specific scenario and the nature of their char-pointer.
As Jan mentionned, the big majority of usecases are const, so the API should be ReadOnlySpan.

I am not proposing to also add the equivalent narrow ctor(sbyte*) version for span, as this one cannot be implemented without allocating a new buffer.
UTF8 (byte*) returns ReadOnlySpan<byte>.

Implementing the same functionality with just public API is possible, but very awkward and undocumented internal behavior (this is more or less what wcslen does, with error-checking omitted):

int len = MemoryExtensions.IndexOf(new ReadOnlySpan<char>(value, int.MaxValue), '\0');
return new Span<char>(value, len);

Proposed API

namespace System.Runtime.InteropServices
{
    public static class MemoryMarshal
    {
+       public static unsafe ReadOnlySpan<char> CreateFromNullTerminated(char* value);
+       public static unsafe ReadOnlySpan<byte> CreateFromNullTerminated(byte* value);
    }
}

The implementation of this proposal is also trivial with the existing tools in the framework (null is non-throwing, the behavior of string ctor(char*)):

public static unsafe ReadOnlySpan<char> CreateFromNullTerminated(char* value)
{
    if (value == null)
        return default;
 
    int count = string.wcslen(value);
    if (count == 0)
        return default;

    return new ReadOnlySpan<char>(value, count);
}

Usage Examples

['ptr' is a char* that points to null-terminated string "abc\0"]

var span = MemoryMarshal.CreateFromNullTerminated(ptr);

['span' is now a Span<char> of length 3.]

Alternative Designs

Risks

@fbrosseau fbrosseau added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Jul 31, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Memory untriaged New issue has not been triaged by the area owner labels Jul 31, 2020
@svick
Copy link
Contributor

svick commented Jul 31, 2020

Is it common that the total buffer size is unknown? I worry about the situation where the null terminator is missing, which would mean the created Span points to random regions of memory. Would it make more sense to add a parameter for the buffer size?

public static unsafe Span<char> CreateSpanFromNullTerminatedString(char* value, int maxLength);

If \0 is not found within maxLength characters, an exception is thrown.

If you actually do not know the size of the buffer, you can specify int.MaxValue as maxLength.

@fbrosseau
Copy link
Author

A missing null terminator in a null-terminated string would be a breach of a very fundamental contract that is present in 99% of c-style APIs, including big parts of the Win32 API. Null-terminated strings of unknown length are everywhere.

A null-terminated string without a null is a bug in the code that generated it, and I don't think the framework should patch around that in what is unsafe code in the first place (because of the use of pointers).

@jkotas
Copy link
Member

jkotas commented Jul 31, 2020

Real-world example of this pattern:

var calendarStringSpan = new ReadOnlySpan<char>(calendarStringPtr, string.wcslen(calendarStringPtr));

@jkotas
Copy link
Member

jkotas commented Jul 31, 2020

  • Should the API return ReadOnlySpan<char>? The null-terminated strings are typically meant to be read-only.
  • The prosed name is a mouthful. The existing interop APIs that work with null-terminated string (e.g. new string(char*) or Marshal.PtrToStringUni) do not mention zero-terminated in the name. What about MemoryMarshal.CreateSpanFromString(char*) or even just MemoryMarshal.CreateSpan(char*) ?

@GrabYourPitchforks
Copy link
Member

int len = MemoryExtensions.IndexOf(new ReadOnlySpan<char>(value, int.MaxValue), '\0');

FWIW, I would not recommend this in a production application. This takes advantage of undocumented behavior within the framework, and this behavior is subject to change.

If applications need an immediate workaround, I'd suggest ReadOnlySpan<char> span = new string(charPtr).AsSpan();. Yes, it allocates, but it's 100% guaranteed to work and uses only supported, documented APIs. Or write the for loop manually if you need to avoid allocations.

For a first-class API, I like the name MemoryMarshal.CreateSpan(char*).

@GrabYourPitchforks
Copy link
Member

This takes advantage of undocumented behavior within the framework, and this behavior is subject to change.

I should probably add a comment in the source clarifying this.

@fbrosseau
Copy link
Author

This takes advantage of undocumented behavior within the framework, and this behavior is subject to change.

This is why I am making this public API proposal :).

And I agree with any name suggestion, CreateSpan sounds great, I knew my name suggestion was not good.

@jkotas
Copy link
Member

jkotas commented Jul 31, 2020

I have updated the proposal with the shorter name

@jkotas jkotas added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Jul 31, 2020
@fbrosseau
Copy link
Author

Thanks

@am11
Copy link
Member

am11 commented Aug 1, 2020

The existing interop APIs that work with null-terminated string (e.g. new string(char*) or Marshal.PtrToStringUni) do not mention zero-terminated in the name.

Marshal.PtrToStringUni also works with non- zero-terminated strings, should CreateSpan also follow the pattern?
(tried on linux x64 with net5 preview8: https://gist.github.com/am11/f21778e56a9984ae37b97d4257fd63ec)

@svick
Copy link
Contributor

svick commented Aug 1, 2020

@am11 C string literals (including UTF-16 literals) are automatically null-terminated:

u" s-char-sequence "

16-bit wide string literal: The type of the literal is char16_t[N], where N is the size of the string in code units of implementation-defined 16-bit encoding (typically UTF-16), including the null terminator.

So, effectively, the second string is 'f', 'o', 'o', '2', '\0' and the first string is 'f', 'o', 'o', '1', '\0', '\0'.

@fbrosseau
Copy link
Author

fbrosseau commented Aug 1, 2020

Your sample copies characters into an already zero-filled array, meaning the null characters it puts at the end make no difference, it was already null characters :)

In any case, even if it did not, it could still randomly work according to what happens to be in the memory around it.

@am11
Copy link
Member

am11 commented Aug 1, 2020

Ah right, updated. Filled inArray with (byte)42 and now I can see garbage next to foo2. Thanks for the correction, PtrToStringUni does indeed look for first null character (as documented). No magic there. :)

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed untriaged New issue has not been triaged by the area owner api-ready-for-review API is ready for review, it is NOT ready for implementation labels Aug 6, 2020
@terrajobst
Copy link
Member

terrajobst commented Aug 7, 2020

Video

  • Let's make it clear in the name what this method does.
  • Let's have an overload that deals with UTF8
  • Let's hold back UTF32 until we need it.
  • We should also provide strlen-like APIs
namespace System.Runtime.InteropServices
{
    public partial class MemoryMarshal
    {
        public static Span<char> CreateFromNullTerminated(char* value);
        public static Span<byte> CreateFromNullTerminated(byte* value);
    }
}
namespace System
{
    public partial class Buffer
    {
        public unsafe static nuint GetStringLength(char* source);
        public unsafe static nuint GetStringLength(char* source, nuint maxLength);
        public unsafe static nuint GetStringLength(byte* source);
        public unsafe static nuint GetStringLength(byte* source, nuint maxLength);
    }
}

@fbrosseau
Copy link
Author

thanks, updated first post

@stephentoub
Copy link
Member

The existing methods on MemoryMarshal that create spans are called CreateSpan. Shouldn't these also include Span in the name?

@stephentoub
Copy link
Member

Let's make it clear in the name what this method does.

With no length provided as an argument, what else would it do?

@jkotas
Copy link
Member

jkotas commented Aug 10, 2020

I do not think GetStringLength should take or return nuint. It makes it very inconvenient to use. We use int for lengths everywhere else, GetStringLength should be on the same plan. Yes, it means that it is different from what wcslen in C and that it will need to throw on overflow.

@GrabYourPitchforks
Copy link
Member

@jkotas What about when we introduce NativeArray / NativeSpan / whatever in the future? It would be nice if we our existing wcslen method worked for those scenarios as well. Since these APIs take raw pointers I don't think it's burdensome to force the caller to think about native-sized integer return values.

@jkotas
Copy link
Member

jkotas commented Aug 10, 2020

NativeArray / NativeSpan / whatever in the future?

If that ever happens, we will have hundreds of existing methods to update. A few more or less won't be a big deal.

@adamsitnik adamsitnik added this to the Future milestone Aug 14, 2020
@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Oct 6, 2020
@terrajobst
Copy link
Member

terrajobst commented Oct 6, 2020

Video

  • The API should return Span<T>, we can add a ReadOnly version later
  • We should throw InvalidOperationException when the size would exceed int.MaxValue.
namespace System.Runtime.InteropServices
{
    public static class MemoryMarshal
    {
        public static unsafe Span<char> CreateSpanFromNullTerminated(char* value);
        public static unsafe Span<byte> CreateSpanFromNullTerminated(byte* value);
    }
}

@jkotas
Copy link
Member

jkotas commented Oct 6, 2020

The API should return Span, we can add a ReadOnly version later

This feels backwards. 99+% case for zero-terminated strings is getting read-only string and parsing it.

We should throw InvalidOperationException when the size would exceed int.MaxValue.

All existing APIs that take zero-terminated strings throw ArgumentException(SR.Arg_MustBeNullTerminatedString) for this case. Is the inconsistency intentional?

(I am sorry that I was not able to participate in the review discussion.)

@bartonjs
Copy link
Member

bartonjs commented Oct 7, 2020

Is the inconsistency intentional?

Nope, just lack of looking at prior art. It felt like a strange thing to call an ArgumentException to me, since the caller has no good way of pre-validating it... but if that's what string..ctor(char*) does then it's about as prior art as it gets.

@stephentoub
Copy link
Member

This feels backwards. 99+% case for zero-terminated strings is getting read-only string and parsing it.

I believe the argument is that with Span<char> implicitly converting to ReadOnlySpan<char>, the span-based one enables the 99% case as well as the 1% case. Are you concerned about the impact on the JIT?

@jkotas
Copy link
Member

jkotas commented Oct 9, 2020

My primary concern is that returning Span<char> is prone to hiding accidental string-mutation bugs in code that uses this API:

void f(char* s)
{
    var span = MemoryMarshal.CreateSpanFromNullTerminated(s);
...
    span[0] = 'a'; // This is a bug with near 100% probability
...
}

I expect that this API is going to be primarily used on interop boundaries. Incomming zero-terminated char* on interop boundary is pretty much always immutable string. The rare cases where the zero-terminated string is meant to be mutable can use explicit gesture to create the mutable span, or just stick with unmanaged pointers.

Do we have any good examples for mutable zero-terminated char*? I cannot think about any.

@tannergooding
Copy link
Member

Do we have any good examples for mutable zero-terminated char*? I cannot think about any.

For the APIs I've bound so far in my TerraFX.Interop.Windows bindings, there are 404 instances of LPWSTR (which is distinct from LPCWSTR).

Most of the APIs are taking in a mutable wchar_t buffer with a known length and tell you how many characters were copied into the buffer (such as GetTextFaceW), but with the buffer still being logically null-terminated.

@tannergooding
Copy link
Member

(Sorry, hit the wrong key and sent too early.)

However, there are also APIs such as IMFAudioPolicy.GetDisplayName which return a mutable null-terminated string without telling you the length and which you are expected to free yourself by calling CoTaskMemFree.

@jkotas
Copy link
Member

jkotas commented Oct 9, 2020

Yes, the underlying buffer is technically mutable in these cases because of you took ownership of it. I believe that it is very rare for the code to take advantage of it as a micro-optimization. Are there real world code examples that take advantage of this today?

@jkotas
Copy link
Member

jkotas commented Oct 9, 2020

(I am looking for example of existing code, the existing code rewritten using Span<char> CreateSpanFromNullTerminated and the existing code rewritten using ReadOnlySpan<char> CreateSpanFromNullTerminated.)

@tannergooding
Copy link
Member

Are there real world code examples that take advantage of this today?

Not that I'm aware of and I don't see anything obvious popping out in the list of APIs (in Windows, Vulkan, PulseAudio, Xlib, or of the several other libraries I've created bindings for) where it would be more than a micro-optimization.

I'm sure the code exists, but it might be as you said and exceptionally rare. So if there is a concern around having it return Span to also cover the 1% case due to users accidentally mutating where they didn't intend, then just exposing ROSpan first and waiting for users to request the other mutable seems reasonable.

@stephentoub
Copy link
Member

stephentoub commented Jan 5, 2021

So... do we want to change the return to be ReadOnlySpan<char>/ReadOnlySpan<byte>? I'm ok with it, especially given that for the mutable case someone could write a simple wrapper:

var span = new Span<char>(value, MemoryMarshal.CreateSpanFromNullTerminated(value).Length);

Presumably we'd rename it as well?

@jkotas
Copy link
Member

jkotas commented Jan 5, 2021

Presumably we'd rename it as well?

My vote would be to keep MemoryMarshal.CreateSpanFromNullTerminated. MemoryMarshal.CreateReadOnlySpanFromNullTerminated is a very long name, and we typically use a simple Span in names of methods and properties even when they return ReadOnlySpan.

@stephentoub
Copy link
Member

we typically use a simple Span in names of methods and properties even when they return ReadOnlySpan

We do, except we have MemoryMarshal.CreateSpan and MemoryMarshal.CreateReadOnlySpan.

FWIW, for consistency with the existing names, to avoid collisions in the future if we decided to add a span-based one, and to keep the names relatively short, my preference is:

public static Span<char> CreateReadOnlySpan(char* value);
public static Span<byte> CreateReadOnlySpan(byte* value);

but I understand some folks felt the meaning of that wasn't clear enough.

@bartonjs
Copy link
Member

bartonjs commented Jan 5, 2021

I think my objection to not having the NullTerminated in the name is that the semantics of the byte* version is not obvious from its signature (looking at the two together, it's "oh, for ASCII or UTF-8", but without seeing the char* version it's like "so I have some bytes... how far does this span go?", and once I wanted the name in that one the char* one got dragged along with it. (If the type were StringMarshal I'd feel differently, since there's context in the type name.)

@stephentoub
Copy link
Member

stephentoub commented Jan 6, 2021

So options discussed:

  • MemoryMarshal.CreateSpanFromNullTerminated returning a ReadOnlySpan... but that's inconsistent with MemoryMarshal.CreateSpan, which returns a Span.
  • MemoryMarshal.CreateReadOnlySpanFromNullTerminated... but that's super long.
  • MemoryMarshal.CreateReadOnlySpan... but that may not be intuitive for the byte* overload as to what it means.

I'm not sure there are any great answers here.

Alternatively, we could just expose a nuint Marshal.StringLength(char*) and nuint Marshal.StringLength(byte*), and let the developer do with those what they will, e.g.

static ReadOnlySpan<char> CreateReadOnlySpan(char* value) =>
    new Span<char>(value, checked((int)Marshal.StringLength(value)));

@GrabYourPitchforks
Copy link
Member

@stephentoub Per #40202 (comment), desire was to have any strlen/wcslen-like API return int instead of nuint. Which raised the further question: if we're returning an int, it can obviously fit into a span, so may as well return the span instead of the int. Hence the latest approved API had only span-returning members.

But if you're saying something like "look, the only true building block we need expose is strlen/wcslen, and that can easily live on Marshal, and we can ignore everything else since it can all be written in terms of this API," then I'm good with that.

@stephentoub
Copy link
Member

But if you're saying something like "look, the only true building block we need expose is strlen/wcslen, and that can easily live on Marshal, and we can ignore everything else since it can all be written in terms of this API," then I'm good with that.

I'm more saying "I think we should add MemoryMarshal.CreateReadOnlySpan(char*/byte*), but there's so much disagreement on naming that we could escape that and still get the core functionality by adding the lower-level strlen primitive."

@bartonjs
Copy link
Member

FWIW, (n)int Marshal::StringLength(byte*) seems descriptive enough to me. It may require slightly more caller-written code, but has a lot less confusion. (I don't have opinions on int vs nint there)

@GrabYourPitchforks
Copy link
Member

Let's say for argument's sake that we expose it as nuint. That means that a naïve call to the Span<T> ctor will also "work".

// wcslen returns nuint, we perform a narrowing cast to int
ROS<char> span = new ROS<char>(myPtr, (int)Marshal.wcslen(myPtr));

In the common case, wcslen will return a value within the range of int, so life is good. In the extreme case, this value might overflow, which means one of two things will happen: (a) if bit 31 of the return integer is set, the ROS<T> ctor will throw due to a negative length being passed in; or (b) if bit 31 of the return integer is not set, the call to the ROS<T> ctor will not throw, but the span will be truncated. I think both of these are acceptable fallback behaviors since neither results in a buffer overrun.

@jkotas
Copy link
Member

jkotas commented Jan 20, 2021

Silent data loss is not good either. I think we would standardize on new ROS<char>(myPtr, checked((int)Marshal.wcslen(myPtr))) in that case.

@GrabYourPitchforks
Copy link
Member

Agree that silent data loss isn't good. Our built-in usage would be correct per your example (and we could enforce via analyzers). I was speculating on what improper third-party code might encounter, and in that scenario I think incorrect truncation is an acceptable risk.

@jkotas
Copy link
Member

jkotas commented Jan 20, 2021

MemoryMarshal.CreateReadOnlySpanFromNullTerminate(myPtr) is a long name, but it is still shorter and easier to read than new ReadOnlySpan<char>(myPtr, checked((int)Marshal.StringLength(myPtr))). I would pick the long name out of these two options.

@stephentoub
Copy link
Member

When no one loves it, you know it's a good compromise 😄

namespace System.Runtime.InteropServices
{
    public static class MemoryMarshal
    {
        public static unsafe ReadOnlySpan<char> CreateReadOnlySpanFromNullTerminated(char* value);
        public static unsafe ReadOnlySpan<byte> CreateReadOnlySpanFromNullTerminated(byte* value);
        ...
    }
}

Going... going...

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 27, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 30, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Mar 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory
Projects
None yet
Development

Successfully merging a pull request may close this issue.