API Proposal: Span<char> from null-terminated char* #40202
Comments
Is it common that the total buffer size is unknown? I worry about the situation where the null terminator is missing, which would mean the created public static unsafe Span<char> CreateSpanFromNullTerminatedString(char* value, int maxLength); If If you actually do not know the size of the buffer, you can specify |
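A minimal sketch of the bounded variant suggested above, assuming the caller knows an upper bound on the buffer size. The class name `BoundedScan` and the exception choices are assumptions made for illustration; this is not a framework API.

```csharp
using System;

public static unsafe class BoundedScan
{
    // Hypothetical helper: scans at most maxLength chars and throws
    // if no terminator is found inside that window.
    public static Span<char> CreateSpanFromNullTerminatedString(char* value, int maxLength)
    {
        if (value == null)
            return default; // empty span for a null pointer

        if (maxLength < 0)
            throw new ArgumentOutOfRangeException(nameof(maxLength));

        // Only the first maxLength chars are ever read.
        int index = new ReadOnlySpan<char>(value, maxLength).IndexOf('\0');
        if (index < 0)
            throw new ArgumentException("No null terminator within maxLength.", nameof(value));

        // The returned span excludes the terminator itself.
        return new Span<char>(value, index);
    }
}
```

With a bound, a missing terminator becomes an exception instead of a read past the end of the buffer.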
A missing null terminator in a null-terminated string would be a breach of a very fundamental contract that is present in 99% of c-style APIs, including big parts of the Win32 API. Null-terminated strings of unknown length are everywhere. A null-terminated string without a null is a bug in the code that generated it, and I don't think the framework should patch around that in what is unsafe code in the first place (because of the use of pointers). |
Real-world example of this pattern: runtime/src/libraries/System.Private.CoreLib/src/System/Globalization/CalendarData.Icu.cs Line 436 in ed6eda5
FWIW, I would not recommend this in a production application. This takes advantage of undocumented behavior within the framework, and this behavior is subject to change. If applications need an immediate workaround, I'd suggest For a first-class API, I like the name |
I should probably add a comment in the source clarifying this. |
This is why I am making this public API proposal :). And I agree with any name suggestion, CreateSpan sounds great, I knew my name suggestion was not good. |
I have updated the proposal with the shorter name |
Thanks |
@am11 C string literals (including UTF-16 literals) are automatically null-terminated:
So, effectively, the second string is |
Your sample copies characters into an already zero-filled array, meaning the null characters it puts at the end make no difference, it was already null characters :) In any case, even if it did not, it could still randomly work according to what happens to be in the memory around it. |
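The point about zero-filled arrays can be sketched with managed arrays only. All names here (`TerminatorTest`, `ZeroFilled`, `SentinelFilled`) are made up for illustration, and the sentinel character `'x'` is an arbitrary choice.

```csharp
using System;

public static class TerminatorTest
{
    // Index of the first '\0' in the buffer, or -1 if there is none.
    public static int LengthToTerminator(char[] buffer) => Array.IndexOf(buffer, '\0');

    // New arrays are zero-initialized, so a terminator is "found"
    // even though none was ever written explicitly.
    public static char[] ZeroFilled(string text)
    {
        var buffer = new char[text.Length + 4];
        text.CopyTo(0, buffer, 0, text.Length);
        return buffer;
    }

    // Pre-filling with a non-zero sentinel makes the test meaningful:
    // the terminator is only present if the code under test writes it.
    public static char[] SentinelFilled(string text, bool writeTerminator)
    {
        var buffer = new char[text.Length + 4];
        Array.Fill(buffer, 'x');
        text.CopyTo(0, buffer, 0, text.Length);
        if (writeTerminator)
            buffer[text.Length] = '\0';
        return buffer;
    }
}
```

For `"abc"`, the zero-filled buffer reports a length of 3 without any explicit terminator, while the sentinel-filled buffer reports -1 unless the terminator is written. With a raw pointer there is no -1 case: the scan would simply continue into whatever memory follows.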
Ah right, updated. Filled inArray with |
namespace System.Runtime.InteropServices
{
public partial class MemoryMarshal
{
public static Span<char> CreateFromNullTerminated(char* value);
public static Span<byte> CreateFromNullTerminated(byte* value);
}
}
namespace System
{
public partial class Buffer
{
public unsafe static nuint GetStringLength(char* source);
public unsafe static nuint GetStringLength(char* source, nuint maxLength);
public unsafe static nuint GetStringLength(byte* source);
public unsafe static nuint GetStringLength(byte* source, nuint maxLength);
}
} |
thanks, updated first post |
The existing methods on MemoryMarshal that create spans are called |
With no length provided as an argument, what else would it do? |
I do not think |
@jkotas What about when we introduce NativeArray / NativeSpan / whatever in the future? It would be nice if our existing wcslen method worked for those scenarios as well. Since these APIs take raw pointers I don't think it's burdensome to force the caller to think about native-sized integer return values. |
If that ever happens, we will have hundreds of existing methods to update. A few more or less won't be a big deal. |
namespace System.Runtime.InteropServices
{
public static class MemoryMarshal
{
public static unsafe Span<char> CreateSpanFromNullTerminated(char* value);
public static unsafe Span<byte> CreateSpanFromNullTerminated(byte* value);
}
} |
This feels backwards. 99+% case for zero-terminated strings is getting read-only string and parsing it.
All existing APIs that take zero-terminated strings throw (I am sorry that I was not able to participate in the review discussion.) |
Nope, just lack of looking at prior art. It felt like a strange thing to call an ArgumentException to me, since the caller has no good way of pre-validating it... but if that's what |
I believe the argument is that with |
My primary concern is that returning
void f(char* s)
{
    var span = MemoryMarshal.CreateSpanFromNullTerminated(s);
    ...
    span[0] = 'a'; // This is a bug with near 100% probability
    ...
}
I expect that this API is going to be primarily used on interop boundaries. Incoming zero-terminated Do we have any good examples for mutable zero-terminated |
For the APIs I've bound so far in my TerraFX.Interop.Windows bindings, there are 404 instances of Most of the APIs are taking in a mutable |
(Sorry, hit the wrong key and sent too early.) However, there are also APIs such as |
Yes, the underlying buffer is technically mutable in these cases because you took ownership of it. I believe that it is very rare for the code to take advantage of it as a micro-optimization. Are there real world code examples that take advantage of this today? |
(I am looking for example of existing code, the existing code rewritten using |
Not that I'm aware of and I don't see anything obvious popping out in the list of APIs (in Windows, Vulkan, PulseAudio, Xlib, or of the several other libraries I've created bindings for) where it would be more than a micro-optimization. I'm sure the code exists, but it might be as you said and exceptionally rare. So if there is a concern around having it return |
So... do we want to change the return to be var span = new Span<char>(value, MemoryMarshal.CreateSpanFromNullTerminated(value).Length); Presumably we'd rename it as well? |
My vote would be to keep |
We do, except we have FWIW, for consistency with the existing names, to avoid collisions in the future if we decided to add a span-based one, and to keep the names relatively short, my preference is:
public static Span<char> CreateReadOnlySpan(char* value);
public static Span<byte> CreateReadOnlySpan(byte* value);
but I understand some folks felt the meaning of that wasn't clear enough. |
I think my objection to not having the NullTerminated in the name is that the semantics of the |
So options discussed:
I'm not sure there are any great answers here. Alternatively, we could just expose a static ReadOnlySpan<char> CreateReadOnlySpan(char* value) =>
new Span<char>(value, checked((int)Marshal.StringLength(value))); |
@stephentoub Per #40202 (comment), desire was to have any strlen/wcslen-like API return int instead of nuint. Which raised the further question: if we're returning an int, it can obviously fit into a span, so may as well return the span instead of the int. Hence the latest approved API had only span-returning members. But if you're saying something like "look, the only true building block we need expose is strlen/wcslen, and that can easily live on |
I'm more saying "I think we should add |
FWIW, |
Let's say for argument's sake that we expose it as // wcslen returns nuint, we perform a narrowing cast to int
ROS<char> span = new ROS<char>(myPtr, (int)Marshal.wcslen(myPtr)); In the common case, wcslen will return a value within the range of int, so life is good. In the extreme case, this value might overflow, which means one of two things will happen: (a) if bit 31 of the return integer is set, the |
Silent data loss is not good either. I think we would standardize on |
Agree that silent data loss isn't good. Our built-in usage would be correct per your example (and we could enforce via analyzers). I was speculating on what improper third-party code might encounter, and in that scenario I think incorrect truncation is an acceptable risk. |
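The overflow cases in this exchange can be sketched without any pointer code. `Marshal.wcslen` in the comment above is hypothetical, so the helper below (`NarrowingDemo`, an assumed name) takes the native-sized length directly and shows the two ways of narrowing it to `int`.

```csharp
using System;

public static class NarrowingDemo
{
    // Unchecked narrowing keeps only the low 32 bits: no exception,
    // but the result may be negative or silently truncated.
    public static int Truncate(nuint length) => unchecked((int)length);

    // Checked narrowing throws OverflowException for any value
    // that does not fit in an int.
    public static int Checked(nuint length) => checked((int)length);
}
```

A length of `0x80000000` has bit 31 set, so `Truncate` yields `int.MinValue`; passing that to a span constructor throws `ArgumentOutOfRangeException`, because span lengths cannot be negative. In a 64-bit process, a length such as `0x100000005` truncates to 5, which is the silent data loss case. `Checked` throws for both.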
When no one loves it, you know it's a good compromise 😄
namespace System.Runtime.InteropServices
{
public static class MemoryMarshal
{
public static unsafe ReadOnlySpan<char> CreateReadOnlySpanFromNullTerminated(char* value);
public static unsafe ReadOnlySpan<byte> CreateReadOnlySpanFromNullTerminated(byte* value);
...
}
}
Going... going... |
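A short usage sketch of the shape above, assuming a runtime where `MemoryMarshal.CreateReadOnlySpanFromNullTerminated` is available (.NET 6 or later). The wrapper class `NullTerminatedUsage` and its method names are made up for illustration.

```csharp
using System;
using System.Runtime.InteropServices;

public static unsafe class NullTerminatedUsage
{
    // Wraps a null-terminated UTF-16 string without copying it, then
    // reports its length and contents. Only ToString() allocates.
    public static string Describe(char* value)
    {
        ReadOnlySpan<char> span = MemoryMarshal.CreateReadOnlySpanFromNullTerminated(value);
        return $"{span.Length}:{span.ToString()}";
    }

    // Same idea for a null-terminated byte string (for example UTF-8).
    public static int CountBytes(byte* value)
        => MemoryMarshal.CreateReadOnlySpanFromNullTerminated(value).Length;
}
```

Because the result is a `ReadOnlySpan<T>`, the `span[0] = 'a'` mistake raised earlier in the thread no longer compiles.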
Background and Motivation
PInvoke scenarios (mostly those on Windows) interact a lot with null-terminated wide strings.
String has always had a ctor(char*), which in .NET Core now uses a highly optimized wcslen, internal to the framework.
I am proposing the equivalent functionality, minus the allocation+copy, using Span<>.
Given a null-terminated wide string as input (in the shape of unsafe char*), return a Span whose length is the count of characters before the null.
I am proposing Span<> and not ReadOnlySpan<>, to let the caller decide what to do with the result, according to their specific scenario and the nature of their char-pointer.
As Jan mentioned, the vast majority of use cases are const, so the API should be ReadOnlySpan.
I am not proposing to also add the equivalent narrow ctor(sbyte*) version for span, as this one cannot be implemented without allocating a new buffer.
UTF8 (byte*) returns ReadOnlySpan<byte>.
Implementing the same functionality with just public API is possible, but very awkward, and it relies on undocumented internal behavior (this is more or less what wcslen does, with error-checking omitted):
Proposed API
The implementation of this proposal is also trivial with the existing tools in the framework (null is non-throwing, the behavior of string ctor(char*)):
Usage Examples
Alternative Designs
Risks