Add more efficient higher-level methods for converting between Utf16 and Utf8 #110388

Open
davidfowl opened this issue Dec 4, 2024 · 1 comment
Labels
area-System.IO untriaged New issue has not been triaged by the area owner

Comments


davidfowl commented Dec 4, 2024

Today, when converting UTF-16 text to UTF-8 text (or vice versa), we have two options:

  • Use higher-level methods that are inefficient
  • Use lower-level methods that are very efficient but harder to use
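
To make the trade-off concrete, here is a minimal sketch of both options for a single contiguous input. `TranscodeOptions`, `HighLevel`, and `LowLevel` are hypothetical names used only for illustration; the calls underneath (`Encoding.UTF8.GetBytes` and `Utf8.FromUtf16`) are existing .NET APIs.

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Unicode;

public static class TranscodeOptions
{
    // Higher-level: a single call, but it allocates a new byte[] on every conversion.
    public static byte[] HighLevel(string text) => Encoding.UTF8.GetBytes(text);

    // Lower-level: writes into a caller-supplied buffer without allocating,
    // but the caller has to size the buffer and interpret the OperationStatus.
    public static int LowLevel(ReadOnlySpan<char> source, Span<byte> destination)
    {
        OperationStatus status = Utf8.FromUtf16(
            source, destination, out _, out int bytesWritten,
            replaceInvalidSequences: true, isFinalBlock: true);

        // With replacement enabled and isFinalBlock set, the only non-Done
        // outcome is a destination that is too small.
        if (status != OperationStatus.Done)
        {
            throw new ArgumentException("Destination buffer is too small.", nameof(destination));
        }

        return bytesWritten;
    }
}
```

The lower-level path gets harder once the input arrives in chunks, because the caller must also track `charsRead` and pass `isFinalBlock: false` until the last chunk.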

Things are mostly great when the source text is in a single string or char[], but less great when the text is a stream of data, or broken into chunks. One of the problems is that we have multiple ways to represent these chunks of data and not enough APIs that support converting from one to the other.

  • Utf16
      • string
      • char[]
      • ReadOnly/Span/Memory<char>
      • StringBuilder
      • StreamReader/StreamWriter
      • ReadOnlySequence<char>
  • Utf8
      • byte[]
      • ReadOnly/Span/Memory<byte>
      • ReadOnlySequence<byte>
      • IBufferWriter<byte>
      • Stream
      • PipeWriter/PipeReader
  • Rune
  • APIs with encoding operations
      • Encoding.UTF8
      • Rune.*
      • Utf8.*

What further complicates things is that some of these types have no good conversions between them (and none that are allocation-free).

Here was an example that came up recently:

public async Task ExecuteAsync(HttpContext httpContext)
{
    httpContext.Response.ContentType = $"{MediaTypeNames.Text.Csv}; charset=utf-8";
    if (!string.IsNullOrWhiteSpace(filename))
    {
        httpContext.Response.Headers.Append("Content-Disposition", $"attachment; filename=\"{filename}.csv\"");
    }

    // Don't dispose the writer: disposing it would call the synchronous
    // httpContext.Response.Body.Write, which is not supported.
    var writer = new StreamWriter(httpContext.Response.Body, Encoding.UTF8, bufferSize: 4096, leaveOpen: true);

    var stringBuilder = new StringBuilder();
    if (headerRowAction is not null)
    {
        headerRowAction.Invoke(stringBuilder);
        await writer.WriteLineAsync(stringBuilder, cancellationToken);
    }

    await foreach (var item in items.WithCancellation(cancellationToken))
    {
        stringBuilder.Clear();
        itemToRowAction.Invoke(item, stringBuilder);
        await writer.WriteLineAsync(stringBuilder, cancellationToken);
    }

    await writer.FlushAsync(cancellationToken);
}

There are tons of copies here: StringBuilder ->(copy) StreamWriter (char[] ->(transcode) byte[]) ->(copy) HttpResponseStream.

This could be improved by writing the StringBuilder (assuming that's the right public API here) directly to the underlying HttpResponse buffer (PipeWriter/IBufferWriter<byte>). The problem is that I'm then stuck writing complex encoding logic to transcode UTF-16 to UTF-8.
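
As a rough sketch of what that encoding logic looks like today, the helper below walks the StringBuilder's chunks and transcodes each one straight into an IBufferWriter<byte>. `StringBuilderUtf8Extensions.WriteUtf8` is a hypothetical name, not an existing .NET API; it is built on `StringBuilder.GetChunks()` and the `EncodingExtensions.Convert` overload that takes an `Encoder` and an `IBufferWriter<byte>`.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class StringBuilderUtf8Extensions
{
    // Hypothetical helper (not part of .NET): transcodes the builder's UTF-16
    // contents to UTF-8 directly into the writer and returns the byte count.
    public static long WriteUtf8(this StringBuilder builder, IBufferWriter<byte> writer)
    {
        // A stateful encoder is needed because a surrogate pair can be split
        // across two chunks; the encoder carries the high surrogate forward.
        Encoder encoder = Encoding.UTF8.GetEncoder();
        long total = 0;

        foreach (ReadOnlyMemory<char> chunk in builder.GetChunks())
        {
            // flush = false: more input may follow this chunk.
            encoder.Convert(chunk.Span, writer, false, out long bytesUsed, out bool _);
            total += bytesUsed;
        }

        // flush = true with empty input: emits any state the encoder still holds.
        encoder.Convert(ReadOnlySpan<char>.Empty, writer, true, out long flushed, out bool _);
        return total + flushed;
    }
}
```

In the example above this would presumably be called with `httpContext.Response.BodyWriter` (a PipeWriter, which implements IBufferWriter<byte>) followed by an awaited flush. Note that `GetChunks()` returns a struct enumerator, so the loop has to live in a synchronous helper like this one rather than span an `await`.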

Another example: https://www.reddit.com/r/dotnet/comments/1gx11ex/reading_streams_efficiently/.

PS: We did some of this work in System.Memory for ReadOnlySequence<char> a while back: https://learn.microsoft.com/en-us/dotnet/api/system.text.encodingextensions?view=net-8.0. We'd need to do similar work to expand the set of types that can participate in this conversion (OR we can push IBufferWriter lower into the stack 😄).
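
For reference, here is a small sketch of that existing EncodingExtensions support: encoding a multi-segment ReadOnlySequence<char> into an IBufferWriter<byte> in one call. `SequenceTranscoding`, `CharSegment`, `FromChunks`, and `Encode` are hypothetical names made up for the example; `Encoding.UTF8.GetBytes(in ReadOnlySequence<char>, IBufferWriter<byte>)` is the extension method from the linked page.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class SequenceTranscoding
{
    // Minimal segment type so the example can build a two-segment sequence.
    private sealed class CharSegment : ReadOnlySequenceSegment<char>
    {
        public CharSegment(ReadOnlyMemory<char> memory) => Memory = memory;

        public CharSegment Append(ReadOnlyMemory<char> memory)
        {
            var next = new CharSegment(memory) { RunningIndex = RunningIndex + Memory.Length };
            Next = next;
            return next;
        }
    }

    // Builds a sequence whose segments are the two strings, without copying them.
    public static ReadOnlySequence<char> FromChunks(string first, string second)
    {
        var head = new CharSegment(first.AsMemory());
        CharSegment tail = head.Append(second.AsMemory());
        return new ReadOnlySequence<char>(head, 0, tail, tail.Memory.Length);
    }

    // Existing API: transcodes the whole sequence into the writer and
    // returns the number of bytes written.
    public static long Encode(in ReadOnlySequence<char> chars, IBufferWriter<byte> writer)
        => Encoding.UTF8.GetBytes(chars, writer);
}
```

The gap described in this issue is that StringBuilder, StreamReader, and friends have no equally direct path into this API without first being adapted to a span or sequence.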

I'd love to turn this into an API proposal once we get a handle on the problem.

dotnet-policy-service bot added the untriaged label Dec 4, 2024

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

davidfowl changed the title from "Add more higher level efficient methods for converting between Utf16 and Utf8" to "Add more efficient higher-level methods for converting between Utf16 and Utf8" Dec 4, 2024