Today, when converting utf16 text to utf8 text (or vice versa), we have 2 options:
- Use higher-level methods that are inefficient
- Use lower-level methods that are very efficient but harder to use
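To make the trade-off concrete, here is a minimal sketch of the two options side by side. The class and method names (`TranscodeOptions`, `HighLevel`, `LowLevel`) are made up for illustration; only `Encoding.UTF8.GetBytes` and `Utf8.FromUtf16` are real framework APIs, and the error handling is one possible choice, not a recommendation.

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Unicode;

// Hypothetical illustration class; not part of the framework.
public static class TranscodeOptions
{
    // Higher-level: a single call, but it allocates a new byte[] per conversion.
    public static byte[] HighLevel(string text) => Encoding.UTF8.GetBytes(text);

    // Lower-level: writes into a caller-supplied buffer without allocating,
    // but the caller must interpret the status and the read/written counts.
    public static int LowLevel(ReadOnlySpan<char> source, Span<byte> destination)
    {
        OperationStatus status = Utf8.FromUtf16(
            source, destination, out int charsRead, out int bytesWritten,
            replaceInvalidSequences: true, isFinalBlock: true);

        if (status != OperationStatus.Done)
        {
            // For example, DestinationTooSmall when the buffer is undersized.
            throw new InvalidOperationException(
                $"Stopped after {charsRead} chars with status {status}.");
        }

        return bytesWritten;
    }
}
```

With the lower-level call, sizing the destination, looping on `DestinationTooSmall`, and passing `isFinalBlock: false` for partial input all become the caller's job, which is the "harder to use" part.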
Things are mostly great when the source text is in a single string or char[], but less great when the text is a stream of data, or broken into chunks. One of the problems is that we have multiple ways to represent these chunks of data and not enough APIs that support converting from one to the other.
- Utf16
  - string
  - char[]
  - ReadOnly/Span/Memory<char>
  - StringBuilder
  - StreamReader/StreamWriter
  - ReadOnlySequence<char>
- Utf8
  - byte[]
  - ReadOnly/Span/Memory<byte>
  - ReadOnlySequence<byte>
  - IBufferWriter<byte>
  - Stream
  - PipeWriter/PipeReader
- Rune

APIs with encoding operations:
- Encoding.UTF8
- Rune.*
- Utf8.*
What further complicates things is that there sometimes aren't good conversions between some of these types (and none that are allocation-free).
Here was an example that came up recently:
    public async Task ExecuteAsync(HttpContext httpContext)
    {
        httpContext.Response.ContentType = $"{MediaTypeNames.Text.Csv}; charset=utf-8";

        if (!string.IsNullOrWhiteSpace(filename))
        {
            httpContext.Response.Headers.Append("Content-Disposition", $"attachment; filename=\"{filename}.csv\"");
        }

        // Don't dispose of writer as that will result in the synchronous version of
        // httpContext.Response.Body.Write to be called which is not supported.
        var writer = new StreamWriter(httpContext.Response.Body, Encoding.UTF8, bufferSize: 4096, leaveOpen: true);
        var stringBuilder = new StringBuilder();

        if (headerRowAction is not null)
        {
            headerRowAction.Invoke(stringBuilder);
            await writer.WriteLineAsync(stringBuilder, cancellationToken);
        }

        await foreach (var item in items.WithCancellation(cancellationToken))
        {
            stringBuilder.Clear();
            itemToRowAction.Invoke(item, stringBuilder);
            await writer.WriteLineAsync(stringBuilder, cancellationToken);
        }

        await writer.FlushAsync(cancellationToken);
    }
There are tons of copies here: StringBuilder ->(copy) StreamWriter (char[] ->(transcode) byte[]) -> HttpResponseStream (copy).
This could be improved by writing the StringBuilder (assuming that's the right public API here) directly to the underlying HttpResponse buffer (PipeWriter/IBufferWriter<byte>). The problem is that I'm now stuck writing the complex encoding logic to translate utf16 to utf8 myself.
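For a sense of what that hand-written logic looks like, here is a rough sketch that transcodes a StringBuilder chunk by chunk into an `IBufferWriter<byte>`. The extension method `WriteUtf8To` and its class are hypothetical names, not an existing or proposed API; the sketch relies on `StringBuilder.GetChunks()` and the `EncodingExtensions.Convert` overload that accepts an `IBufferWriter<byte>`.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper; not a framework API.
public static class StringBuilderUtf8Extensions
{
    // Transcodes the builder's UTF-16 chunks directly into the writer as
    // UTF-8, without materializing an intermediate string or char[].
    public static long WriteUtf8To(this StringBuilder builder, IBufferWriter<byte> writer)
    {
        // The Encoder is stateful, so a surrogate pair split across two
        // chunks is carried over to the next call instead of being replaced.
        Encoder encoder = Encoding.UTF8.GetEncoder();
        long totalBytes = 0;

        foreach (ReadOnlyMemory<char> chunk in builder.GetChunks())
        {
            encoder.Convert(chunk.Span, writer, flush: false, out long bytesUsed, out _);
            totalBytes += bytesUsed;
        }

        // Final call with no input and flush: true so any leftover encoder
        // state (for example, a dangling high surrogate) is written out.
        encoder.Convert(ReadOnlySpan<char>.Empty, writer, flush: true, out long flushed, out _);
        return totalBytes + flushed;
    }
}
```

Since `PipeWriter` implements `IBufferWriter<byte>`, something like `stringBuilder.WriteUtf8To(httpContext.Response.BodyWriter)` followed by a flush would skip the StreamWriter copy. Every caller still has to know about chunk enumeration, encoder state, and the final flush, which is the friction described here.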
davidfowl changed the title from "Add more higher level efficient methods for converting between Utf16 and Utf8" to "Add more efficient higher-level methods for converting between Utf16 and Utf8" on Dec 4, 2024.
This was another example https://www.reddit.com/r/dotnet/comments/1gx11ex/reading_streams_efficiently/.
PS: We did some of this work in System.Memory for ReadOnlySequence<char> a while back (https://learn.microsoft.com/en-us/dotnet/api/system.text.encodingextensions?view=net-8.0). We'd need to do similar work to expand the set of types that can participate in this conversion (OR we can push IBufferWriter lower into the stack 😄).

I'd love to turn this into an API proposal once we get a handle on the problem.
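For reference, this is roughly how the existing `EncodingExtensions` methods are used with sequences today. The wrapper class `SequenceTranscoding` and its method names are hypothetical and only there to keep the sketch self-contained; the `GetBytes` and `GetString` extension methods are the real ones from the linked documentation.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical wrapper used only for illustration.
public static class SequenceTranscoding
{
    // Encodes a (possibly multi-segment) UTF-16 sequence into the writer as
    // UTF-8 and returns the number of bytes written.
    public static long EncodeTo(in ReadOnlySequence<char> chars, IBufferWriter<byte> writer)
        => Encoding.UTF8.GetBytes(in chars, writer);

    // Decodes a (possibly multi-segment) UTF-8 sequence into a string.
    public static string Decode(in ReadOnlySequence<byte> bytes)
        => Encoding.UTF8.GetString(in bytes);
}
```

The gap is that types such as StringBuilder and StreamWriter have no equivalent entry points, so callers have to bridge them by hand.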