Skip to content
This repository has been archived by the owner on Aug 2, 2023. It is now read-only.

Consider exposing Utf8Utility #2792

Closed
Symbai opened this issue Nov 30, 2019 · 5 comments
Closed

Consider exposing Utf8Utility #2792

Symbai opened this issue Nov 30, 2019 · 5 comments

Comments

@Symbai
Copy link

Symbai commented Nov 30, 2019

I was trying out the Utf8String which has been added to 5.0 as it seems. Unfortunately creating an Utf8String from a byte array throws an ArgumentException if the byte array contains invalid utf8 code. While this makes sense in a way (in other ways not as Utf8Encoder.GetString does not throw) it sadly is a huge problem when performance matters.

The code reveals that there is a simple call to Utf8Utility.IsWellFormedUtf8 in the ctor and if it returns false it throws the exception. I would like to call this myself and skip Utf8String creation / exception. Unfortunately it is internal only.

https://github.com/dotnet/runtime/blob/4f9ae42d861fcb4be2fcd5d3d55d5f227d30e723/src/coreclr/src/System.Private.CoreLib/src/System/Utf8String.Construction.cs#L75-L80

Is there a chance to expose it or avoid exception thrown?

@ahsonkhan
Copy link
Member

ahsonkhan commented Dec 3, 2019

I believe you can call the TryCreateFrom static factory API which will return false if the bytes were invalid UTF-8 (rather than throwing), which will help you avoid the creation/exception. That way, you don't need to expose internal workhorse methods like Utf8Utility.IsWellFormedUtf8.

  public static bool TryCreateFrom(ReadOnlySpan<byte> buffer, [NotNullWhen(true)] out Utf8String? value)

https://github.com/dotnet/runtime/blob/4f9ae42d861fcb4be2fcd5d3d55d5f227d30e723/src/coreclr/src/System.Private.CoreLib/src/System/Utf8String.Construction.cs#L275

cc @GrabYourPitchforks

@Symbai
Copy link
Author

Symbai commented Dec 3, 2019

Thank you. You are right, I missed this api. It doesn't throw and works perfectly fine for my purpose. I'll close this issue unless there is anything else someone wants to start a discussion from.

@Symbai Symbai closed this as completed Dec 3, 2019
@GrabYourPitchforks
Copy link
Member

@Symbai If there's a particular API you'd like exposed (e.g., bool IsWellFormedUtf8(ROS<byte>)), definitely let us know! We plan on exposing some of these as the Utf8String work gets further fleshed out, but it's always good to have feedback like what you've given so that we can prioritize work appropriately.

@Symbai
Copy link
Author

Symbai commented Dec 3, 2019

@GrabYourPitchforks I think my use case is too specific but as you're asking. Making GetIndexOfFirstInvalidUtf8Sequence public would be helpful. I am parsing memory chunks and need to predict what data type they represent. Iterating through a 1GB of memory for example and trying to guess whether the bytes at a specific position could be an utf8 string is currently expensive as I have to create a string first using Text.Encoding.Utf8.GetString and then see if the result makes sense (i.e. contains a minimum of chars or numbers). With help of Utf8string I can reduce much of the costs, see below, but exposing GetIndexOfFirstInvalidUtf8Sequence would help on making my prediction better. I could determine where the valid utf8 sequence starts and ends and so I could make another prediction whether it IS (could be) an utf8 string and I just only thrown in too many bytes.

On the other hand I could imagine this API may be useful for others as well to isolate the invalid sequence and deal with the them / the valid rest. IsWellFormedUtf8 may not be that much important when we have TryCreate unless there is any good performance benefit because calling TryCreate also calls IsWellFormatedUtf8. Maybe someone wants to run a benchmark to get some numbers.

Benchmark on my scenario with 1GB of memory chunk using Utf8String.TryCreate:

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Utf8_String 6.309 s 0.1245 s 0.1704 s 876000.0000 - - 6.4 GB
Encoding_UTF8_GetString 55.232 s 0.8865 s 0.7858 s 4646000.0000 5000.0000 - 33.93 GB

So thank all of you for adding Utf8String. I'm really excited for the release.

@GrabYourPitchforks
Copy link
Member

I opened dotnet/runtime#502 to track your suggestion and added the appropriate labels so that it doesn't fall off the radar. Thanks for your feedback. :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants