Utf8String design proposal #2350
Comments
Does that mean var ss = s.Substring(s.IndexOf(',')); Would be a double traversal? i.e. any use of |
Yes, I know this is dated from the future! :) |
@benaadams No, it's a single traversal, just like if s were typed as |
But if |
APIs that operate on indices (like (I get that it might be confusing since enumeration of |
Thanks, Levi. Some questions/comments:
I don't understand how this is possible. With Utf8String as a reference type, getting the data into it will necessitate a memcpy at a minimum, which is not O(1).
I would expect a requirement would also be being able to query the total length in bytes in O(1) (which is also possible with string).
This is already making some trade-offs. If I've read the data off the wire, I already have it in some memory, which I can then process as a
Why is the to-memory conversion called
I'm surprised not to see overloads of methods like Contains (IndexOf, EndsWith, etc.) that accept
Presumably
What does the return value mean? Is that the number of the byte offset of the
From a design discussion perspective, I would think we'd want this outline to represent the ultimate shape we want, and the implementations can throw NotImplementedException until the functionality is available (before it ships).
What's the plan for integration of this with the existing unicode support in .NET? For example, how do I get a
Similar questions related to the APIs on And, presumably we wouldn't define any APIs (outside of For me, it also begs the question why do we need both? If we're going to have
I don't see why we'd place this restriction. Arrays don't guarantee null termination but are pinnable. Lots of types don't guarantee null termination but are pinnable.
I would hope that before or as part of enabling this, we add support for
I didn't understand this part. Don't both Windows and ICU provide UTF8-based support in addition to the UTF16-based support that's currently being used?
Equivalents for |
Not that I know of. Windows is, with very good legacy reasons, very UTF-16/UCS-2 focused. |
What about Seems like not implementing this would make it difficult for existing ecosystems to adopt this type. |
|
The signature First, an array must be allocated for the return value. Then, each element in the array must be a copy of each match, into a newly-allocated buffer, as If I understand this correctly, except for the trivial case when the separator is not present at all, this signature would basically require copying the whole input string. Would it make sense to return a custom enumerator of |
I think the biggest issue with the proposed API is confusion between UTF8 code units and Unicode scalar values, especially when it comes to lengths and indexes. Would it make sense to alleviate that confusion by more explicit names, like

```csharp
[EditorBrowsable(EditorBrowsableState.Never)]
public static Utf8String DangerousCreateWithoutValidation(ReadOnlySpan<byte> value);
```

Is
Would this mean that if I write
What is the relationship between
There was an issue about creating
Where can I find those comments? I didn't find the More generally, with this proposal we will have: |
Yes, this is a typo.
This is possible via
I struggled with this, and the reason I ultimately decided not to include it is because I think the majority of calls to these methods involve searching for literal substrings, and I'd rather rely on a one-time compiler conversion of the search target from UTF-16 to UTF-8 than a constantly-reoccurring runtime conversion from UTF-16 to UTF-8. I'm concerned that the presence of these overloads would encourage callers to inadvertently use a slow path that requires transcoding. We can go over this in Friday's discussion.
I had planned APIs like
Check the comment at the top of https://github.com/dotnet/corefxlab/blob/utf8string/src/System.Text.Utf8/System/Text/StringSegment.cs. It explains in detail why I think this type provides significant benefits that we can't get simply from using
We do violate the specification in a few cases. For instance,
|
I didn't know that, interesting.
As far as I can tell, that commit is about
That doesn't sound like a good enough reason to have two different types to me, especially since you can create an invalid |
This proposal assumes that |
I see that it's already committed, but can I just go on record as saying that This type really ought to be named I'm mostly OK with the rest of it, though it would be nice if |
"Unicode Scalar Value" is the term Unicode uses for this.
"Character" doesn't really mean anything (Unicode lists 4 different meanings) and would be easily confused with "Code Point" is closer, but that term includes invalid Unicode Scalar Values (the range from U+D800 to U+DFFF). |
The question is, what would the C# keyword be? ( |
This. Will there be a language word for the type? If there is, you can call the type @svick yeah, I know that "chartacter" is nearly meaningless hence my suggesting it. I prefer "code point" because how on Earth are you going to prevent me from writing invalid values to a |
Nobody's stopping you. In fact, there's a public static factory that skips validation and allows you to create such an invalid value. But if you do this you're now violating the contractual guarantees offered by the type, I'd recommend not doing this. :) To be clear, creating an invalid Unlike the |
Sure, great, but a lot of the data being read into these structures will be coming from external sources. Very happy to hear that there are no validation steps being taken as the data is read in (because it would be horribly expensive), but still very concerned about:
BUT there is no guarantee - you've said so in your previous statement. There's an assumption, but no guarantee; so let's be careful how we describe this. |
The The Exception: If you construct a The The reason for the difference is that it's going to be common to construct a So when I use the term "contractual guarantee", it's really shorthand for "This API behaves as expected as long as the caller didn't do anything untoward while constructing the instance. If the API misbehaves, take it up with whoever constructed this instance, as they expressly ignored the overloads that tried to save them from themselves and went straight for the 'I know what I'm doing' APIs." |
FWIW, the reason for this design is that it means that consumers of these types don't have to worry about any of this. Just call the APIs like normal and trust that they'll give you sane values. If you take This philosophy is different from the |
I suppose those are safe-enough trade-offs. Still, too bad the name has to be so unwieldy. 🤷♂️ |
The name doesn't have to be unwieldy. If there's consensus that it should be named |
We should not call it a "Rune" if it's not a representation for the Unicode Code Point, i.e. let's not hijack a good unambiguous term and use it for something else. |
ᚺᛖᛚᛚᛟ᛫ᚹᛟᚱᛚᛞ |
I think in graphemics (branch of science studying writing) rune is indeed a grapheme. I think in software engineering, rune is a code point. But possibly it's not such a clear cut as I think. The point I was trying to make is using "rune" to mean Unicode Scalar would be at least yet another overload of the word "rune". |
Thanks for your reply, I really appreciate the open discussion here 🙂 If you say that a "char-size agnostic" option could be implemented, that would be an awesome option as it could be progressive through the ecosystem. Thanks for the discussion, I'll keep following the thread here, but barring a blocker, I'd go for the bold idea of letting things break and get updated incrementally until we get a better .NET. My estimate is that 99% of the time, upgrading a library or an application to the new version will be as easy as enabling the flag (as opposed to the nullable story where you had to break many things). I'd say that most .NET devs don't care about how a string is represented in memory, the same way they don't need to worry about the size of an int and other internal implementations, until they reach the limit or work with low-level code, as you probably do on a daily basis.
How would this handle the user string table (that is C# string literals)? Today, it is UTF16 encoded. I would have assumed that if we switched to UTF8, then it would become UTF8 encoded rather than carrying both or having a conversion cost on startup by the runtime. |
I think it would stay UTF16 encoded. Utf8 optimized runtime mode would pay for the conversion, but that should not be a problem. For JITed cases, the conversion cost is minuscule compared to the cost of JITing and type loading. For AOT cases, we can store the converted string in the AOT binary if it makes a difference. Also, keeping the user string blob UTF16 encoded allows the agnostic binaries to work with existing tools and on older runtimes in the "I know what I'm doing" mode. There are a number of encoding inefficiencies in the IL format. If we believe that it is important to do something about them and rev the format, it should be a separate exercise.
Bit of a long thread so is there a quick summary of why we feel like we need My mental model of this approach, based on discussions we had back in the Midori days, is that the path forward here would be to change Is the issue that we feel like we'd need to update to much code that is already written in terms of Let's leave issues of how C# uses |
I think the whole discussion, especially since @jkotas intervened, moved to discuss a lower level approach that is not aiming to solve the problem through an additive API, but through changing the lower lever storage for the regular |
Yes, it is the crux of the problem. We have a lot of APIs and code written in terms of string and char today. We are trying to find the best way to move the code to Utf8, while maximizing the performance benefits of Utf8 and minimizing the additional cognitive load on the platform.
I agree that this design would be an option. I think the downside of this approach is that we would need to add
I do not think that we would be ever able to deprecate |
Could this be solved with some JIT level marshalling? Whenever we make a native call which involves a string, we have to marshal the string with a full copy anyway. Could the JIT be used so that it ensures that a string passed to a method in a "legacy" assembly is converted to a 16-bit per character string and any callbacks to a newer assembly is converted to utf8. A mechanism could be added to treat an assembly as utf8 safe in the cases where an unmaintained assembly only uses string in a way which is safe (e.g. passes it through to a .NET assembly such as calling a File api with the string). This would ensure everything is safe, although with a potential performance cost. It would be up to the application developer to decide that their own code is safe and that the performance hit is worth it. Some helper classes could help alleviate a lot of the performance issues in targeted places, e.g. a class for newer apps to use that represents a cached UTF16 converted string which marshalling code recognizes and can use for the cached utf16 representation for hot code paths with a lot of reuse. |
I have commented on it in #2350 (comment) . I do not think it is feasible to solve it via JIT level marshaling. I believe the marshaling would be too expensive and it would have to be done for too many situations. |
What if from the runtime's perspective, the UTF-8 string type were a completely separate type, and the choice between encodings would lie not with the CLR, but with the compiler? For example, the |
@Serentty I don't know how it works internally, but I'm under the impression that the runtime is a very small part to change compared to the BCL, which would need to be recompiled and re-checked, because there are assumptions that char represents a UTF-16 code unit (16 bits). As jaredpar objected, that would create a branch in the ecosystem, and you would need to choose one or the other of the versions of .NET. If you need an old library that won't get compiled for the new .NET, but have other "new libraries" that use the new string representation, you'll get stuck. jkotas' proposal seems like a more reasonable approach, where the BCL would become "string representation agnostic" over time, and encourage other libraries to do the same. You would only be able to "upgrade" if all your libs are agnostic, but with a less performant but still working fallback.
@mconnew what do you think about a Asking this specifically because only a 4-byte WORD has sufficient space to fit any Unicode value. Of course, we'd have to give up on constant time access to |
@whoisj That would waste a lot of memory for most of the characters, and would require conversions to and from UTF-8 and UTF-16 everywhere outside the .NET world. What you're suggesting is basically a UTF-32 encoding, but UTF-8 is the most compatible with other pieces of software, in my opinion.
@whoisj, in addition to what Jeremy said, it still wouldn't be sufficient. Unicode can use composed characters. For example é can be represented by the two-character pair of the letter e (U+0065) and the combining acute accent (U+0301). If using the composed (as opposed to the single (U+00E9)) form of the character, you still need to use two chars to represent this. This is a simple version, but things like emojis use composition. For example, all the people and hand emojis where you can choose the skin tone are a composed character with a skin-tone-specifying character combined with the image-representing character. There is no single char representation.
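For example, with today's APIs:

```csharp
string composed    = "e\u0301";   // 'e' followed by COMBINING ACUTE ACCENT (U+0301)
string precomposed = "\u00E9";    // 'é' as the single code point U+00E9

Console.WriteLine(composed.Length);                        // 2
Console.WriteLine(precomposed.Length);                     // 1
Console.WriteLine(composed == precomposed);                // False (ordinal comparison)
Console.WriteLine(composed.Normalize() == precomposed);    // True (normalized to Form C)
```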
I wonder how much mileage you could get out of implicit conversion operators to convert ReadOnlySpan<Rune> or Span<Rune> to byte[], ReadOnlySpan<byte> and old-fashioned strings. Add in some useful methods which mirror the String methods such as Replace etc. As long as you can get to/from a byte[] or Span<byte>, you have your IO covered. Any existing APIs in libraries which need a string would be converted, but I don't think that should cost much more than if you just went to string to begin with, as all you have done is defer the conversion to UTF-16. As long as you cached the conversion and had some container to hold the Span<Rune> with its cache (or used a weak reference table) I think you could minimize the cost. Char is the least useful representation of Unicode. It's neither the raw representation of what gets sent over a stream (file or network), nor a representation of single entities that you see on the screen. It's a halfway representation which has no useful purpose in isolation, unlike byte[] or Rune[].
Iterating over bytes is an O(1) operation, iterating over UTF-8 values is doable in a reasonable time, but I don't know what it takes to iterate over runes? How does one know a composition is possible? Does it need to use a lookup table? However, I'm OK for the inclusion of a |
I think it's a property of the code point, similar to upper and lower case.
I'm not sure someone would iterate over a UTF-8 string to write every rune to a stream. I don't think I've ever seen someone iterate over a string and write single chars to a stream. After all, a stream does not take Runes as input but bytes.
This would be lossy. As @mconnew mentioned earlier, sometimes you need to look at a pair of As Jan said, we're looking into whether it would make sense to back |
Have a look at how many classes use one of the overloads of |
@mconnew, I think you're missing the point. There's no reason a My suggestion was Seems like I might also need to remind some of you that there are platforms besides Windows, and on those platforms forcing all string data into UTF-16 requires a re-encoding of that data and an increase in its memory footprint.
NOTE: Sorry for a wrong comment with no content from me, if you received it.
This proposal has for sure some drawbacks, for example all |
I've implemented UTF-8 based string types in C++ and C# many times. In every case the optimal path is to keep the indexer returning the internal type value; for example (in the case where the underlying data type is a byte):

```csharp
public byte this[int index] { get; }
```

Example usage would be seeking the next newline character, which absolutely doesn't require a character-by-character search, but merely a byte-by-byte seek.
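For example, a newline seek over a UTF-8 buffer is a plain byte search, because ASCII byte values never occur inside a multi-byte UTF-8 sequence:

```csharp
ReadOnlySpan<byte> utf8 = new byte[] { 0x68, 0x69, 0xC3, 0xA9, 0x0A, 0x21 };  // "hié\n!"

int newlineIndex = utf8.IndexOf((byte)'\n');            // byte-wise seek, no decoding
ReadOnlySpan<byte> firstLine = utf8.Slice(0, newlineIndex);
```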
Even when seeking a character like In 99% of string parsing cases (the most likely reason code is using the indexer), reading a When code needs to pull each Unicode character out of a |
@jkotas shall we transfer this to runtime or runtimelab as we are archiving this repo? |
The discussion in this issue is too long and GitHub has trouble rendering it. I think we should close this issue and start a new one in dotnet/runtime.
Utf8String design discussion - last edited 14-Sep-19
Utf8String design overview
Audience and scenarios
`Utf8String` and related concepts are meant for modern internet-facing applications that need to speak "the language of the web" (or i/o in general, really). Currently applications spend some amount of time transcoding into formats that aren't particularly useful, which wastes CPU cycles and memory.

A naive way to accomplish this would be to represent UTF-8 data as `byte[]`/`Span<byte>`, but this leads to a usability pit of failure. Developers would then become dependent on situational awareness and code hygiene to be able to know whether a particular `byte[]` instance is meant to represent binary data or UTF-8 textual data, leading to situations where it's very easy to write code like `byte[] imageData = ...; imageData.ToUpperInvariant();`. This defeats the purpose of using a typed language.

We want to expose enough functionality to make the `Utf8String` type usable and desirable by our developer audience, but it's not intended to serve as a full drop-in replacement for its sibling type `string`. For example, we might add `Utf8String`-related overloads to existing APIs in the `System.IO` namespace, but we wouldn't add an overload `Assembly.LoadFrom(Utf8String assemblyName)`.

In addition to networking and i/o scenarios, it's expected that there will be an audience who will want to use `Utf8String` for interop scenarios, especially when interoperating with components written in Rust or Go. Both of these languages use UTF-8 as their native string representation, and providing a type which can be used as a data exchange type for that audience will make their scenarios a bit easier.

Finally, we should afford power developers the opportunity to improve their throughput and memory utilization by limiting data copying where feasible. This doesn't imply that we must be allocation-free or zero-copy for every scenario. But it does imply that we should investigate common operations and consider alternative ways of performing these tasks as long as it doesn't compromise the usability of the mainline scenarios.

It's important to call out that `Utf8String` is not intended to be a replacement for `string`. The standard UTF-16 `string` will remain the core primitive type used throughout the .NET ecosystem and will enjoy the largest supported API surface area. We expect that developers who use `Utf8String` in their code bases will do so deliberately, either because they're working in one of the aforementioned scenarios or because they find other aspects of `Utf8String` (such as its API surface or behavior guarantees) desirable.

Design decisions and type API
To make internal `Utf8String` implementation details easier, and to allow consumers to better reason about the type's behavior, the `Utf8String` type maintains the following invariants:

- Instances are immutable. Once data is copied to the `Utf8String` instance, it is unchanging for the lifetime of the instance. All members on `Utf8String` are thread-safe.
- Instances are heap-allocated. This is a standard reference type, like `string` and `object`.
- The backing data is guaranteed well-formed UTF-8. It can be round-tripped through `string` (or any other Unicode-compatible encoding) and back without any loss of fidelity. It can be passed verbatim to any other component whose contract requires that it operate only on well-formed UTF-8 data.
- The backing data is null-terminated. If the `Utf8String` instance is pinned, the resulting `byte*` can be passed to any API which takes an `LPCUTF8STR` parameter. (Like `string`, `Utf8String` instances can contain embedded nulls.)

These invariants help shape the proposed API and usage examples as described throughout this document.
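As a rough illustration of what these invariants mean for callers (the constructor shapes shown are assumptions based on the experimental package, not a final API):

```csharp
// Construction copies and validates the data once; afterwards the instance is
// immutable, well-formed, and null-terminated.
Utf8String fromUtf16 = new Utf8String("caf\u00E9");    // transcoded from UTF-16

byte[] rawUtf8 = { 0x63, 0x61, 0x66, 0xC3, 0xA9 };     // "café" already encoded as UTF-8
Utf8String fromBytes = new Utf8String(rawUtf8);        // copied and validated
```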
Non-allocating types
While `Utf8String` is an allocating, heap-based, null-terminated type, there are scenarios where a developer may want to represent a segment (or "slice") of UTF-8 data from an existing buffer without incurring an allocation.

The `Utf8Segment` (alternative name: `Utf8Memory`) and `Utf8Span` types can be used for this purpose. They represent a view into UTF-8 data, with the following guarantees:

These types have `Utf8String`-like methods hanging off of them as instance methods where appropriate. Additionally, they can be projected as `ROM<byte>` and `ROS<byte>` for developers who want to deal with the data at the raw binary level or who want to call existing extension methods on the `ROM` and `ROS` types.

Since `Utf8Segment` and `Utf8Span` are standalone types distinct from `ROM` and `ROS`, they can have behaviors that developers have come to expect from string-like types. For example, `Utf8Segment` (unlike `ROM<char>` or `ROM<byte>`) can be used as a key in a dictionary without jumping through hoops:
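A sketch of the intended usage, assuming the `Utf8StringComparer` support described later in this document (the member names shown are illustrative):

```csharp
// Utf8Segment keys work directly; no conversion to string or ROM<byte> is needed.
var headerValues = new Dictionary<Utf8Segment, Utf8String>(Utf8StringComparer.OrdinalIgnoreCase);

void Record(Utf8Segment name, Utf8String value)
{
    headerValues[name] = value;
}
```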
`Utf8Span` instances can be compared against each other:
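For example (a sketch; the comparison overloads are assumed to mirror the UTF-16 `string` APIs):

```csharp
static bool IsSameToken(Utf8Span left, Utf8Span right)
    => left.Equals(right, StringComparison.OrdinalIgnoreCase);
```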
An alternative design that was considered was to introduce a type `Char8` that would represent an 8-bit code unit - it would serve as the elemental type of `Utf8String` and its slices. However, `ReadOnlyMemory<Char8>` and `ReadOnlySpan<Char8>` were a bit unwieldy for a few reasons.

First, there was confusion as to what `ROS<Char8>` actually meant when the developer could use `ROS<byte>` for everything. Was `ROS<Char8>` actually providing guarantees that `ROS<byte>` couldn't? (No.) When would I ever want to use a lone `Char8` by itself rather than as part of a larger sequence? (You probably wouldn't.)

Second, it introduced a complication that if you had a `ROM<Char8>`, it couldn't be converted to a `ROM<byte>`. This impacted the ability to perform text manipulation and then act on the data in a binary fashion, such as sending it across the network.

Creating segment types
Segment types can be created safely from `Utf8String` backing objects. As mentioned earlier, we enforce that data in the UTF-8 segment types is well-formed. This implies that an instance of a segment type cannot represent data that has been sliced in the middle of a multibyte boundary. Calls to slicing APIs will throw an exception if the caller tries to slice the data in such a manner.

The `Utf8Segment` type introduces additional complexity in that it could be torn in a multi-threaded application, and that tearing may invalidate the well-formedness assumption by causing the torn segment to begin or end in the middle of a multi-byte UTF-8 subsequence. To resolve this issue, any instance method on `Utf8Segment` (including its projection to `ROM<byte>`) must first validate that the instance has not been torn. If the instance has been torn, an exception is thrown. This check is O(1) algorithmic complexity.

It is possible that the developer will want to create a `Utf8Segment` or `Utf8Span` instance from an existing buffer (such as a pooled buffer). There are zero-cost APIs to allow this to be done; however, they are unsafe because they easily allow the developer to violate invariants held by these types.

If the developer wishes to call the unsafe factories, they must ensure that the following three invariants hold.
1. The provided buffer (`ROM<byte>` or `ROS<byte>`) remains "alive" and immutable for the duration of the `Utf8Segment` or `Utf8Span`'s existence. Whichever component receives a `Utf8Segment` or `Utf8Span` - however the instance has been created - must never observe that the underlying contents change or that dereferencing the contents might result in an AV or other undefined behavior.
2. The provided buffer contains only well-formed UTF-8 data, and the boundaries of the buffer do not split a multibyte UTF-8 sequence.
3. For `Utf8Segment` in particular, the caller must not create a `Utf8Segment` instance wrapped around a `ROM<byte>` in circumstances where the component which receives the newly created `Utf8Segment` might tear it. The reason for this is that the "check that the `Utf8Segment` instance was not torn across a multi-byte subsequence" protection is only reliable when the `Utf8Segment` instance is backed by a `Utf8String`. The `Utf8Segment` type makes a best effort to offer protection for other backing buffers, but this protection is not ironclad in those scenarios. This could lead to a violation of invariant (2) immediately above.

The type design here - including the constraints placed on segment types and the elimination of the `Char8` type - also draws inspiration from the Go, Swift, and Rust communities.

Supporting types
Like `StringComparer`, there's also a `Utf8StringComparer` which can be passed into the `Dictionary<,>` and `HashSet<>` constructors. This `Utf8StringComparer` also implements `IEqualityComparer<Utf8Segment>`, which allows using `Utf8Segment` instances directly as the keys inside dictionaries and other collection types.

The `Dictionary<,>` class is also being enlightened to understand that these types have both non-randomized and randomized hash code calculation routines. This allows dictionaries instantiated with TKey = `Utf8String` or TKey = `Utf8Segment` to enjoy the same performance optimizations as dictionaries instantiated with TKey = `string`.

Finally, the `Utf8StringComparer` type has convenience methods to compare `Utf8Span` instances against one another. This will make it easier to compare texts using specific cultures, even if that specific culture is not the current thread's active culture.

Manipulating UTF-8 data
CoreFX and Azure scenarios
- What exchange types do we use when passing around UTF-8 data into and out of Framework APIs?
- How do we generate UTF-8 data in a low-allocation manner?
- How do we apply a series of transformations to UTF-8 data in a low-allocation manner? Leave everything as `Span<byte>`, use a special `Utf8StringBuilder` type, or something else?
- Do we need to support UTF-8 string interpolation?
- If we have builders, who is ultimately responsible for lifetime management?
Perhaps we should look at `ValueStringBuilder` for inspiration. A `MutableUtf8Buffer` type would be promising, but we'd need to be able to generate `Utf8Span` slices from it, and if the buffer is being modified continually the spans could end up holding invalid data. Some folks will also want to perform operations in-place.
Sample operations on arbitrary buffers
(Devs may want to perform these operations on arbitrary byte buffers, even if those buffers aren't guaranteed to contain valid UTF-8 data.)
- Validate that buffer contains well-formed UTF-8 data.
- Convert ASCII data to upper / lower in-place, leaving all non-ASCII data untouched.
- Split on byte patterns. (Probably shouldn't split on runes or UTF-8 string data, since we can't guarantee data is well-formed UTF-8.)
These operations could be on the newly-introduced `System.Text.Unicode.Utf8` static class. They would take `ROS<byte>` and `Span<byte>` as input parameters because they can operate on arbitrary byte buffers. Their runtime performance would be subpar compared to similar methods on `Utf8String`, `Utf8Span`, or other types where we can guarantee that no invalid data will be seen, as the APIs which operate on raw byte buffers would need to be defensive and would probably operate over the input in an iterative fashion rather than in bulk. One potential behavior could be skipping over invalid data and leaving it unchanged as part of the operation.
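A runnable sketch of the kind of raw-buffer helpers such a static class could host, written here against APIs that already ship (`System.Text.Rune`); the proposed `System.Text.Unicode.Utf8` class itself is not assumed:

```csharp
using System;
using System.Buffers;
using System.Text;

static class RawUtf8Operations
{
    // Returns true only if the entire buffer is well-formed UTF-8.
    public static bool IsWellFormed(ReadOnlySpan<byte> buffer)
    {
        while (!buffer.IsEmpty)
        {
            if (Rune.DecodeFromUtf8(buffer, out _, out int bytesConsumed) != OperationStatus.Done)
            {
                return false;
            }
            buffer = buffer.Slice(bytesConsumed);
        }
        return true;
    }

    // Uppercases ASCII bytes in place; all non-ASCII bytes are left untouched.
    public static void ToUpperAsciiInPlace(Span<byte> buffer)
    {
        for (int i = 0; i < buffer.Length; i++)
        {
            if (buffer[i] >= (byte)'a' && buffer[i] <= (byte)'z')
            {
                buffer[i] -= 0x20;
            }
        }
    }
}
```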
Sample Utf8StringBuilder implementation for private use

Code samples and metadata representation
The C# compiler could detect support for UTF-8 strings by looking for the existence of the `System.Utf8String` type and the appropriate helper APIs on `RuntimeHelpers` as called out in the samples below. If these APIs don't exist, then the target framework does not support the concept of UTF-8 strings.

Literals
Literal UTF-8 strings would appear as regular strings in source code, but would be prefixed by a u as demonstrated below. The u prefix would denote that the return type of this literal string expression should be `Utf8String` instead of `string`.

The u prefix would also be combinable with the @ prefix and the $ prefix (more on this below).
Additionally, literal UTF-8 strings must be well-formed Unicode strings.
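For illustration, under the proposed syntax (not valid C# today):

```csharp
Utf8String greeting = u"Hello there!";          // UTF-8 literal
Utf8String path     = u@"C:\logs\latest.txt";   // combinable with the verbatim @ prefix
```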
Three alternative designs were considered. One was to use RVA statics (through `ldsflda`) instead of literal UTF-16 strings (through `ldstr`) before calling a "load from RVA" method on `RuntimeHelpers`. The overhead of using RVA statics is somewhat greater than the overhead of using the normal UTF-16 string table, so the normal UTF-16 string literal table should still be the more optimized case for small-ish strings, which we believe to be the common case.

Another alternative considered was to introduce a new opcode `ldstr.utf8`, which would act as a UTF-8 equivalent to the normal `ldstr` opcode. This would be a breaking change to the .NET tooling ecosystem, and the ultimate decision was that there would be too much pain to the ecosystem to justify the benefit.

The third alternative considered was to smuggle UTF-8 data in through a normal UTF-16 string in the string table, then call a `RuntimeHelpers` method to reinterpret the contents. This would result in a "garbled" string for anybody looking at the raw IL. While that in itself isn't terrible, there is the possibility that smuggling UTF-8 data in this manner could result in a literal string which has ill-formed UTF-16 data. Not all .NET tooling is resilient to this. For example, xunit's test runner produces failures if it sees attributes initialized from literal strings containing ill-formed UTF-16 data. There is a risk that other tooling would behave similarly, potentially modifying the DLL in such a manner that errors only manifest themselves at runtime. This could result in difficult-to-diagnose bugs.

We may wish to reconsider this decision in the future. For example, if we see that it is common for developers to use large UTF-8 literal strings, maybe we'd want to dynamically switch to using RVA statics for such strings. This would lower the resulting DLL size. However, this would add extra complexity to the compilation process, so we'd want to tread lightly here.
Constant handling
String concatenation
There would be APIs on `Utf8String` which mirror the `string.Concat` APIs. The compiler should special-case the `+` operator to call the appropriate n-ary overload of `Concat`.

Since we expect use of `Utf8String` to be "deliberate" when compared to `string` (see the beginning of this document), we should consider that a developer who is using UTF-8 wants to stay in UTF-8 during concatenation operations. This means that if there's a line which involves the concatenation of both a `Utf8String` and a `string`, the final type post-concatenation should be `Utf8String`.

This is still open for discussion, as the behavior may be surprising to people. Another alternative is to produce a build warning if somebody tries to mix-and-match UTF-8 strings and UTF-16 strings in a single concatenation expression.
If string interpolation is added in the future, this shouldn't result in ambiguity. The `$` interpolation operator will be applied to a literal `Utf8String` or a literal `string`, and that would dictate the overall return type of the operation.

Equality comparisons
There are standard `==` and `!=` operators defined on the `Utf8String` class.

The C# compiler should special-case when either side of an equality expression is known to be a literal null object, and if so the compiler should emit a referential check against the null object instead of calling the operator method. This matches the `if (myString == null)` behavior that the `string` type enjoys today.

Additionally, equality / inequality comparisons between `Utf8String` and `string` should produce compiler warnings, as they will never succeed.

I attempted to define `operator ==(Utf8String a, string b)` so that I could slap `[Obsolete]` on it and generate the appropriate warning, but this had the side effect of disallowing the user to write the code `if (myUtf8String == null)` since the compiler couldn't figure out which overload of `operator ==` to call. This was also one of the reasons I had opened dotnet/csharplang#2340.

Marshaling behaviors
Like the `string` type, the `Utf8String` type shall be marshalable across p/invoke boundaries. The corresponding unmanaged type shall be `LPCUTF8` (equivalent to a `BYTE*` pointing to null-terminated UTF-8 data) unless a different unmanaged type is specified in the p/invoke signature.

If a different `[MarshalAs]` representation is specified, the stub routine creates a temporary copy in the desired representation, performs the p/invoke, then destroys the temporary copy or allows the GC to reclaim the temporary copy.

If a `Utf8String` must be marshaled from native-to-managed (e.g., a reverse p/invoke takes place on a delegate which has a `Utf8String` parameter), the stub routine is responsible for fixing up invalid UTF-8 data before creating the `Utf8String` instance (or it may let the `Utf8String` constructor perform the fixup automatically).

Unmanaged routines must not modify the contents of any `Utf8String` instance marshaled across the p/invoke boundary. `Utf8String` instances are assumed to be immutable once created, and violating this assumption could cause undefined behaviors within the runtime.

There is no default marshaling behavior for `Utf8Segment` or `Utf8Span` since they are not guaranteed to be null-terminated. If in the future the runtime allows marshaling `{ReadOnly}Span<T>` across a p/invoke boundary (presumably as a non-null-terminated array equivalent), library authors may fetch the underlying `ReadOnlySpan<byte>` from the `Utf8Segment` or `Utf8Span` instance and directly marshal that span across the p/invoke boundary.
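Today the closest equivalent is an explicit `[MarshalAs(UnmanagedType.LPUTF8Str)]` on a `string` parameter, which transcodes on every call; under this proposal a `Utf8String` parameter would marshal as `LPCUTF8` by default with no temporary copy. A sketch (the native library name and entry point are placeholders):

```csharp
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    // Existing pattern: the marshaler creates a temporary UTF-8 copy of the UTF-16 string.
    [DllImport("nativelib", EntryPoint = "log_message")]
    internal static extern void LogMessageFromUtf16(
        [MarshalAs(UnmanagedType.LPUTF8Str)] string message);

    // Proposed pattern: the Utf8String's null-terminated buffer is passed directly.
    [DllImport("nativelib", EntryPoint = "log_message")]
    internal static extern void LogMessage(Utf8String message);
}
```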
Automatic coercion of UTF-16 literals to UTF-8 literals

If possible, it would be nice if UTF-16 literals (not arbitrary `string` instances) could be automatically coerced to UTF-8 literals (via the `ldstr / call` routines mentioned earlier). This coercion would only be considered if attempting to leave the data as a `string` would have caused a compilation error. This could help eliminate some errors resulting from developers forgetting to put the u prefix in front of the string literal, and it could make the code cleaner. Some examples follow.

UTF-8 String interpolation
The string interpolation feature is undergoing significant churn (see dotnet/csharplang#2302). I envision that when a final design is chosen, there would be a UTF-8 counterpart for symmetry. The internal `IUtf8Formattable` interface as proposed above is being designed partly with this feature in mind in order to allow single-allocation `Utf8String` interpolation.

ustring contextual language keyword
For simplicity, we may want to consider a contextual language keyword which corresponds to the `System.Utf8String` type. The exact name is still up for debate, as is whether we'd want it at all, but we could consider something like the below.

The name `ustring` is intended to invoke "Unicode string". Another leading candidate was `utf8`. We may wish not to ship with this keyword support in v1 of the `Utf8String` feature. If we opt not to do so we should be mindful of how we might be able to add it in the future without introducing breaking changes.

An alternative design is to use a `u` suffix instead of a `u` prefix. I'm mostly impartial to this, but there is a nice symmetry to having the characters `u`, `$`, and `@` all available as prefixes on literal strings.

We could also drop the `u` prefix entirely and rely solely on type targeting. This has implications for string interpolation, as it wouldn't be possible to prepend both the `(ustring)` coercion hint and the `$` interpolation operator simultaneously.

Switching and pattern matching
If a value whose type is statically known to be `Utf8String` is passed to a `switch` statement, the corresponding `case` statements should allow the use of literal `Utf8String` values.

Since pattern matching operates on input values of arbitrary types, I'm pessimistic that pattern matching will be able to take advantage of target typing. This may instead require that developers specify the u prefix on `Utf8String` literals if they wish such values to participate in pattern matching.

A brief interlude on indexers and IndexOf

`Utf8String` and related types do not expose an elemental indexer (`this[int]`) or a typical `IndexOf` method because they're trying to rid the developer of the notion that bytewise indices into UTF-8 buffers can be treated equivalently as charwise indices into UTF-16 buffers. Consider the naïve implementation of a typical "string split" routine as presented below.
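A naïve routine of that shape might look like this (an illustrative reconstruction, not the original sample):

```csharp
// Naïve: assumes the matched text always occupies exactly target.Length chars.
static (string Before, string After) SplitFirst(string source, string target)
{
    int idx = source.IndexOf(target, StringComparison.OrdinalIgnoreCase);
    if (idx < 0)
    {
        return (source, string.Empty);
    }

    // The arithmetic below is the problem discussed next: for culture-sensitive or
    // InvariantCulture comparers the match may be shorter or longer than target.Length.
    return (source.Substring(0, idx), source.Substring(idx + target.Length));
}
```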
One subtlety of the above code is that when culture-sensitive or case-insensitive comparers are used (such as OrdinalIgnoreCase in the above example), the target string doesn't have to be an exact char-for-char match of a sequence present in the source string. For example, consider the UTF-16 string "GREEN" (`[ 0047 0052 0045 0045 004E ]`). Performing an OrdinalIgnoreCase search for the substring "e" (`[ 0065 ]`) will result in a match, as 'e' (`U+0065`) and 'E' (`U+0045`) compare as equal under an OrdinalIgnoreCase comparer.

As another example, consider the UTF-16 string "preſs" (`[ 0070 0072 0065 017F 0073 ]`), whose fourth character is the Latin long s 'ſ' (`U+017F`). Performing an OrdinalIgnoreCase search for the substring "S" (`[ 0053 ]`) will result in a match, as 'ſ' (`U+017F`) and 'S' (`U+0053`) compare as equal under an OrdinalIgnoreCase comparer.

There are also scenarios where the length of the match within the search string might not be equal to the length of the target string. Consider the UTF-16 string "encyclopædia" (`[ 0065 006E 0063 0079 0063 006C 006F 0070 00E6 0064 0069 0061 ]`), whose ninth character is the ligature 'æ' (`U+00E6`). Performing an InvariantCultureIgnoreCase search for the substring "ae" (`[ 0061 0065 ]`) will result in a match at index 8, as "æ" (`[ 00E6 ]`) and "ae" (`[ 0061 0065 ]`) compare as equal under an InvariantCultureIgnoreCase comparer.

This result is interesting and should give us pause. Since `"æ".Length == 1` and `"ae".Length == 2`, the arithmetic at the end of the method will actually result in the wrong substrings being returned to the caller.
Due to the nature of UTF-16 (used by `string`), when performing an Ordinal or an OrdinalIgnoreCase comparison, the length of the matched substring within the source will always have a `char` count equal to `target.Length`. The length mismatch as demonstrated by "encyclopædia" above can only happen with a culture-sensitive comparer or any of the InvariantCulture comparers.

However, in UTF-8, these same guarantees do not hold. Under UTF-8, only when performing an Ordinal comparison is there a guarantee that the length of the matched substring within the source will have a `byte` count equal to the target. All other comparers - including OrdinalIgnoreCase - have the behavior that the byte length of the matched substring can change (either shrink or grow) when compared to the byte length of the target string.

As an example of this, consider the string "preſs" from earlier, but this time in its UTF-8 representation (`[ 70 72 65 C5 BF 73 ]`). Performing an OrdinalIgnoreCase search for the target UTF-8 string "S" (`[ 53 ]`) will match on the `[ C5 BF ]` portion of the source string. (This is the UTF-8 representation of the letter 'ſ'.) To properly split the source string along this search target, the caller needs to know not only where the match was, but also how long the match was within the original source string.
This fundamental problem is why `Utf8String` and related types don't expose a standard `IndexOf` function or a standard `this[int]` indexer. It's still possible to index directly into the underlying byte buffer by using an API which projects the data as a `ROS<byte>`. But for splitting operations, these types instead offer a simpler API that performs the split on the caller's behalf, handling the length adjustments appropriately. For callers who want the equivalent of `IndexOf`, the types instead provide `TryFind` APIs that return a `Range` instead of a typical integral index value. This `Range` represents the matching substring within the original source string, and new C# language features make it easy to take this result and use it to create slices of the original source input string.

This also addresses feedback that was given in a previous prototype: users weren't sure how to interpret the result of the `IndexOf` method. (Is it a byte count? Is it a char count? Is it something else?) Similarly, there was confusion as to what parameters should be passed to a `this[int]` indexer or a `Substring(int, int)` method. By having the APIs promote use of `Range` and related C# language features, this confusion should subside. Power developers can inspect the `Range` instance directly to extract raw byte offsets if needed, but most devs shouldn't need to query such information.

API usage samples
Scenario: Split an incoming string of the form "LastName, FirstName" into individual FirstName and LastName components.
Additionally, the `SplitResult` struct returned by `Utf8Span.Split` implements both a standard `IEnumerable<T>` pattern and the C# deconstruct pattern, which allows it to be used separately from enumeration for simple cases where only a small handful of values are returned.

Scenario: Split a comma-delimited input into substrings, then perform an operation with each substring.
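A sketch of both scenarios against the proposed shapes (`AsSpan`, the `Split` separator overload, and `Trim` are assumptions, not final API):

```csharp
// Scenario 1: deconstruct a "LastName, FirstName" pair.
Utf8String line = new Utf8String("Doe, John");
(Utf8Span lastName, Utf8Span firstName) = line.AsSpan().Split(',');

// Scenario 2: enumerate the pieces of a comma-delimited input.
Utf8String csv = new Utf8String("alpha, beta, gamma");
foreach (Utf8Span field in csv.AsSpan().Split(','))
{
    Console.WriteLine(field.Trim().ToString());   // ToString() transcodes for display only
}
```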
Miscellaneous topics and open questions
What about comparing UTF-16 and UTF-8 data?
Currently there is a set of APIs `Utf8String.AreEquivalent` which will decode sequences of UTF-16 and UTF-8 data and compare them for ordinal equality. The general code pattern is below.
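A sketch of that pattern (the exact `AreEquivalent` overload shown is assumed):

```csharp
Utf8String utf8Value = new Utf8String("hello");
string utf16Value = "hello";

if (Utf8String.AreEquivalent(utf8Value, utf16Value))
{
    // Both sides decode to the same sequence of Unicode scalar values.
}
```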
operator==(Utf8String, string)
overload which would allow easy==
comparison of UTF-8 and UTF-16 data? There are three main downsides to this which caused me to vote no, but I'm open to reconsideration.The compiler would need to special-case
if (myUtf8String == null)
, which would now be ambiguous between the two overloads. (If the compiler is already special-casing null checks, this is a non-issue.)The performance of UTF-16 to UTF-8 comparison is much worse than the performance of UTF-16 to UTF-16 (or UTF-8 to UTF-8) comparison. When the representation is the same on both sides, certain shortcuts can be implemented to avoid the O(n) comparison, and even the O(n) comparison itself can be implemented as a simple memcmp operation. When the representations are heterogeneous, the opportunity for taking shortcuts is much more restricted, and the O(n) comparison itself has a higher constant factor. Developers might not expect such a performance characteristic from an equality operator.
Comparing a
Utf8String
against a literal string would no longer go through the fast path, as target typing would cause the compiler to emit a call tooperator==(Utf8String, string)
instead ofoperator==(Utf8String, Utf8String)
. The comparison itself would then have the lower performance described by bullet (2) above.One potential upside to having such a comparison is that it would prevent developers from using the antipattern
if (myUtf8String.ToString() == someString)
, which would result in unnecessary allocations. If we are concerned about this antipattern one way to address it would be through a Code Analyzer.What if somebody passes invalid data to the "skip validation" factories?
When calling the "unsafe" APIs, callers are fully responsible for ensuring that the invariants are maintained. Our debug builds could double-check some of these invariants (such as the initial
Utf8String
creation consisting only of well-formed data). We could also consider allowing applications to opt-in to these checks at runtime by enabling an MDA or other diagnostic facility. But as a guiding principle, when "unsafe" APIs are called the Framework should trust the developer and should have as little overhead as possible.Consider consolidating the unsafe factory methods under a single unsafe type.
This would prevent pollution of the type's normal API surface and could help write tools which audit use of a single "unsafe" type.
Some of the methods may need to be extension methods instead of normal static factories. (Example: Unsafe slicing routines, should we choose to expose them.)
Potential APIs to enlighten
System namespace
Include
Utf8String
/Utf8Span
overloads onConsole.WriteLine
. Additionally, perhaps introduce an APIConsole.ReadLineUtf8
.System.Data.* namepace
Include generalized support for serializing Utf8String properties as a primitive with appropriate mapping to
nchar
ornvarchar
.System.Diagnostics.* namespace
Enlighten
EventSource
so that a caller can writeUtf8String
/Utf8Span
instances cheaply. Additionally, some types likeActivitySpanId
already haveROS<byte>
ctors; overloads can be introduced here.System.Globalization.* namespace
The
CompareInfo
type has many members which operate onstring
instances. These should be spanified foremost, andUtf8String
/Utf8Span
overloads should be added. Good candidates areCompare
,GetHashCode
,IndexOf
,IsPrefix
, andIsSuffix
.The
TextInfo
type has members which should be treated similarly.ToLower
andToUpper
are good candidates. Can we get away without enlighteningToTitleCase
?System.IO.* namespace
BinaryReader
andBinaryWriter
should have overloads which operate onUtf8String
andUtf8Span
. These overloads could potentially be cheaper than the normalstring
/ROS<char>
based overloads, since the reader / writer instances may in fact be backed by UTF-8 under the covers. If this is the case then writing is simple projection, and reading is validation (faster than transcoding).File
:WriteAllLines
,WriteAllText
,AppendAllText
, etc. are good candidates for overloads to be added. On the read side, there'sReadAllTextUtf8
andReadAllLinesUtf8
.TextReader.ReadLine
andTextWriter.Write
are also good candidates to overload. This follows the same general premise asBinaryReader
andBinaryWriter
as mentioned above.Should we also enlighten
SerialPort
or GPIO APIs? I'm not sure if UTF-8 is a bottleneck here.System.Net.Http.* namespace
Introduce
Utf8StringContent
, which automatically sets the charset header. This type already exists in the System.Utf8String.Experimental package.System.Text.* namespace
UTF8Encoding
: Overload candidates areGetChars
,GetString
, andGetCharCount
(ofUtf8String
orUtf8Span
). These would be able to skip validation after transcoding as long as the developer hasn't subclassed the type.Rune
: AddToUtf8String
API. AddIsDefined
API to query the OS's NLS tables (could help with databases and other components that need to adhere to strict case / comparison processing standards).TextEncoder
: AddEncode(Utf8String): Utf8String
andFindFirstIndexToEncode(Utf8Span): Index
. This is useful for HTML-escaping, JSON-escaping, and related operations.Utf8JsonReader
: Add read APIs (GetUtf8String
) and overloads to both the ctor andValueTextEquals
.JsonEncodedText
: Add anEncodedUtf8String
property.Regex is a bit of a special case because there has been discussion about redoing the regex stack all-up. If we did proceed with redoing the stack, then it would make sense to add first-class support for UTF-8 here.