UTF8 Parsing and Formatting #23831
Comments
What's the practical use case of the |
@GrabYourPitchforks, the caller would need to do work, what amounts to parsing or at best searching, to know which characters to exclude from the slice. |
@KrzysztofCwalina Callers generally shouldn't have problems isolating to-be-parsed values because they are normally unambiguously delimited (e.g., via newline) or already split out (e.g., via Request.QueryString). What's an example of a practical application where an API like ReadUInt64 would be used and where the caller wouldn't expect failure when a character like '$' is encountered mid-number? And is this expected to be common enough to justify burdening all callers with checking the out parameter to see whether the entire input was consumed? |
It's not whether it's a problem or not. It's that seeking to the delimiter and then slicing has cost. Why would we want to pay this cost in this low level API that is not optimized for usability but rather for cost/performance? |
I’m concerned about luring callers into a pit of failure. My gut tells me most callers would expect this to behave exactly like Int32.Parse, but operating on Span instead of String. |
We have such APIs being added directly to Int32, i.e. TryParse overload that takes Span |
That addresses my concern then. 👍 |
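To make the trade-off in this exchange concrete, here is a minimal sketch of a caller that relies on bytesConsumed instead of searching for the delimiter first. It assumes the System.Buffers.Text.Utf8Parser.TryParse overload for ulong as it later shipped, which may differ from the shape being debated at this point in the thread; TrySum is a hypothetical helper written only for illustration.

```csharp
using System;
using System.Buffers.Text;
using System.Text;

// Hypothetical helper: sums comma-separated unsigned integers held in a UTF8 buffer.
// The parser stops at the first non-digit and reports how far it got, so the
// caller never scans ahead for the ',' before parsing.
static bool TrySum(ReadOnlySpan<byte> utf8, out ulong sum)
{
    sum = 0;
    while (!utf8.IsEmpty)
    {
        if (!Utf8Parser.TryParse(utf8, out ulong value, out int bytesConsumed))
            return false;                      // no digits at the current position

        sum += value;
        utf8 = utf8.Slice(bytesConsumed);      // skip what the parser consumed

        if (utf8.IsEmpty)
            break;                             // entire input consumed
        if (utf8[0] != (byte)',')
            return false;                      // e.g. a '$' in the middle of a number
        utf8 = utf8.Slice(1);                  // skip the delimiter
    }
    return true;
}

byte[] input = Encoding.UTF8.GetBytes("10,20,30");
Console.WriteLine(TrySum(input, out ulong total) ? total.ToString() : "malformed");
```

Note that the caller, not the parser, decides whether leftover bytes are an error: here anything other than a comma after a number is rejected.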
@KrzysztofCwalina - How are callers of TryFormatter supposed to know how big a Span to provide? The API doesn't provide the usual amenity of returning a required size. |
@KrzysztofCwalina - What are the post-condition guarantees about the value |
After some testing of the integer formatters, it appears that the post-call state in the case of too-small a buffer is:
|
The callers don't have to know the size of the buffer. If TryFormat returns false, the caller will expand the buffer and call the method again. I think we should set bytesWritten to 0 when we fail in TryFormat, but we don't clear partially written data. i.e. what you said above :-) |
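The grow-and-retry pattern described in this reply can be sketched as follows. This assumes the Utf8Formatter.TryFormat overload for long as it later shipped in System.Buffers.Text; FormatWithRetry, the starting size, and the doubling policy are hypothetical choices made for illustration only.

```csharp
using System;
using System.Buffers.Text;
using System.Text;

// Hypothetical helper: formats a long into UTF8 bytes, enlarging the buffer
// each time TryFormat reports that the destination was too small.
static byte[] FormatWithRetry(long value)
{
    byte[] buffer = new byte[4];               // deliberately small to force retries
    int bytesWritten;
    while (!Utf8Formatter.TryFormat(value, buffer, out bytesWritten))
    {
        // Contents of the old buffer are not relied upon after a failed call;
        // allocate a larger one and format again from scratch.
        buffer = new byte[buffer.Length * 2];
    }
    return buffer.AsSpan(0, bytesWritten).ToArray();
}

Console.WriteLine(Encoding.UTF8.GetString(FormatWithRetry(long.MinValue)));
```

A real caller would more likely rent buffers from a pool than allocate on every retry; the loop structure stays the same.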
Ditch type from name and use overloads?
public static class Utf8Parser
{
public static bool TryParse(ReadOnlySpan<byte> text, out bool value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out byte value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out DateTime value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out DateTimeOffset value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out double value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out decimal value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out Guid value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out short value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out int value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out long value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out sbyte value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out float value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out TimeSpan value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out ushort value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out uint value, out int bytesConsumed, char format = null);
public static bool TryParse(ReadOnlySpan<byte> text, out ulong value, out int bytesConsumed, char format = null);
} |
|
The Should it be nullable or declared |
Yes, that was discussed in the review and is just a typo. It should be default. @atsushikan, it looks like the compiler throws an error even for |
FYI: The API review discussion was recorded - see https://youtu.be/OZnaGV2omvI?t=2977 (38 min duration) |
* Produce Utf8Parser and Utf8Formatter
  Fixes https://github.com/dotnet/corefx/issues/24607
  Remaining debt (cut for time):
  Parsing Integers with the "N" format https://github.com/dotnet/corefx/issues/24986
  Some questions to be resolved as to whether to be compatible (BCL doesn't care where you put the commas) or correct.
  Format of floating point is still a wrapper hack https://github.com/dotnet/corefx/issues/25077
  The portable DoubleToNumber() code was never ported to C# (though the big block comment advertising it was).
* PR feedback.
  - Move StandardFormat to System.Buffers
  - Mark it readonly
  - Fix spelling: "Seperator"
  - Deduplicate namespace in ref .cs
* PR feedback.
  - Assert that a culture-invariant ToString() on double produced ASCII characters only
  - Add magic literal comments for the 4-byte compares in DateTimeOffset parsing.
  - More "seperator" vs. "separator"
  - Rename formatter benchmarks to be you know, "formatter"-like.
* PR feedback.
  - Improve perf of long.MinValue path in TryFormat(long)
  - Make TryParseDateTimeOffset compare exact casing for RFC1123 formats ('R' and 'l')
* PR feedback.
  - Lots of small items in the last round of feedback.
* PR feedback (ThrowHelper, AllocHelper, random easy stuff)
* PR feedback (StandardFormat.Parse/ToString())
  Removed the extra allocations from this path (though I still can't imagine anyone who cares enough about perf to use this parser wanting to use these apis) and addressed the outstanding feedback surrounding these.
  Removed the 1.2% false positive noise from Utf8Parser code coverage. It's now at 100%.
* Replace 'buffer' text in XML docs
Why are they named Utf8xxx and not Asciixxxx? |
@panost Just a mistake. We discussed internally a few weeks ago but it's too late in the release cycle to do anything about it. |
I don't think it's a mistake. We used to have overloads on these types that let the formatter/parser control the "culture" of the input/output. I hope that we will add these overloads at some point. Once the overloads are in, these types will parse/write data that is truly UTF8, i.e. outside of the ASCII range. See the following test: https://github.com/dotnet/corefxlab/blob/master/tests/System.Text.Primitives.Tests/Parsing/PrimitiveParserIntegerTests.cs#L483 |
@KrzysztofCwalina Your answer made me re-evaluate the usefulness of this approach. Has anyone done a benchmark using those methods, decoding/encoding real-world data, for example CSV or JSON files? My concern is that using a char array buffer (like TextReader/Writer), we do one encoding (or decoding) call every 4096 bytes (or whatever the buffer size is). Using those methods we do skip the encoding/decoding overhead for some fields, but we increase the number of encoding/decoding calls we make for the rest of the data. For example, using http://download.geonames.org/export/dump/countryInfo.txt the first 37 lines of data require one call of Encoding.GetChars, but if I were using the UTF8Parsers approach that would be 37 * 16 = 592 calls to Encoding.GetChars. Is it worth it? |
Utf8Parser does not do any encoding/decoding, i.e. it does not call Encoding.GetChars. It natively knows which UTF8 bytes correspond to which digit (when it parses numbers). It is significantly faster than TextWriter/Reader. |
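As an illustration of what "natively knows which UTF8 bytes correspond to which digit" can mean, here is a simplified sketch of parsing decimal digits straight from bytes with no Encoding.GetChars call. It is not the library's implementation; TryParseDigits is a hypothetical helper, and it relies only on the fact that the ASCII digits '0' through '9' are single bytes 0x30 through 0x39 in UTF8.

```csharp
using System;
using System.Text;

// Hypothetical helper: parses leading decimal digits directly from UTF8 bytes.
// No transcoding to UTF16 takes place; each byte is mapped to a digit by
// subtracting (byte)'0' and range-checking the result.
static bool TryParseDigits(ReadOnlySpan<byte> utf8, out uint value, out int bytesConsumed)
{
    value = 0;
    bytesConsumed = 0;
    ulong accumulator = 0;

    while (bytesConsumed < utf8.Length)
    {
        // Bytes below '0' wrap to a large uint, so a single compare rejects
        // everything outside '0'..'9'.
        uint digit = (uint)(utf8[bytesConsumed] - (byte)'0');
        if (digit > 9)
            break;

        accumulator = accumulator * 10 + digit;
        if (accumulator > uint.MaxValue)
        {
            bytesConsumed = 0;                 // overflow: report failure
            return false;
        }
        bytesConsumed++;
    }

    if (bytesConsumed == 0)
        return false;                          // no digits at all

    value = (uint)accumulator;
    return true;
}

byte[] sample = Encoding.UTF8.GetBytes("42abc");
bool ok = TryParseDigits(sample, out uint parsed, out int consumed);
Console.WriteLine($"{ok} {parsed} {consumed}");
```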
@KrzysztofCwalina That will be true if you have to parse a file that contains numbers only (or any other datatype that Utf8Parser supports). But that's rarely the case. You have to deal with text fields/regions also, and you have to decode each of them using Encoding.GetChars (or probably Decoder.Convert, since you have to support streaming). How many more calls? It depends on the file you are parsing. |
Yes, it depends on many things, but we are writing a JSON parser using these APIs and the parser is significantly faster than other parsers that pre-transcode. |
UTF8 Parsing
The current .NET Framework parsing APIs (e.g. int.TryParse) can parse text represented by System.String (UTF16).
These APIs work great in many scenarios, e.g. parsing text contained in GUI application's text boxes. They are not suitable for processing modern network protocols, which are often text (e.g. JSON, HTTP headers), but encoded with UTF8/ASCII, not UTF16, and contained in byte buffers, not strings. Because of this, all modern web servers written for the .NET platform either don't use the current BCL parsing APIs or take a performance hit to transcode between UTF8 and UTF16, and to copy from buffers to strings.
To address modern networking scenarios better, we will provide parsing APIs that:
String
to parse
Proposed API
The APIs are policy free (not specific to scenarios, higher level frameworks, etc.) and optimized for speed, and will look like the following (per data type being parsed; the sample shows Int32 APIs):
Cultures
APIs listed in this proposal support only the invariant culture. In corefxlab, we have APIs that support culture-aware formatting and parsing of basic primitive types, but these APIs are not part of this proposal, as they are not strictly needed by SignalR.
ParsedFormat
ParsedFormat is an efficient (non-allocating, preparsed) representation of the standard format strings. It's used in the formatting APIs. If default(ParsedFormat) is passed, a type-specific default (typically 'G') is used.
For the parsing APIs, we pass only a char, as the "precision" component is meaningless when parsing. Furthermore, format specifiers that differ only in the upper/lowercasing of the output are treated as equivalent by the Parse methods. Thus, passing 'X' or 'x' to TryParseInt32 is equivalent (the parser will accept upper case, lower case, or mixed-case hex characters). If default(char) is passed, a type-specific default (typically 'G') is used.
APIs that attempt to parse all possible formats were considered, but this approach would add overhead and make the future addition of formats an event that changes prior API behavior.
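A short sketch of the format-character behavior described above. It uses the Utf8Parser.TryParse overload for int as it later shipped in System.Buffers.Text, not the TryParseInt32 name used in this proposal; ParseWithFormat is a hypothetical wrapper added only to keep the example compact.

```csharp
using System;
using System.Buffers.Text;
using System.Text;

// Hypothetical wrapper: parses an int from UTF8 text using a single format
// character, returning null when the parse fails.
static int? ParseWithFormat(string text, char format = default)
{
    byte[] utf8 = Encoding.ASCII.GetBytes(text);
    return Utf8Parser.TryParse(utf8, out int value, out _, format) ? value : (int?)null;
}

// 'X' and 'x' are treated as the same request, and the digits themselves
// may be upper case, lower case, or mixed.
Console.WriteLine(ParseWithFormat("ff", 'X'));
Console.WriteLine(ParseWithFormat("FF", 'x'));
Console.WriteLine(ParseWithFormat("fF", 'X'));

// default(char) selects the type-specific default, which for integers is decimal.
Console.WriteLine(ParseWithFormat("255"));
```

Because the caller names the format up front, the parser never has to guess whether "ff" was meant as hex, which is the overhead the last paragraph above says the design avoids.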
More information about the parsing APIs is available here