UTF8 Parsing and Formatting #23831

ghost · 2017-10-12T16:05:04Z

UTF8 Parsing

The current .NET Framework parsing APIs (e.g. int.TryParse) can parse text represented by System.String (UTF16).

string text = ...
if(int.TryParse(text, out int value)) { ... }

These APIs work great in many scenarios, e.g. parsing text contained in a GUI application's text boxes. They are not suitable for processing modern network protocols, which are often text-based (e.g. JSON, HTTP headers) but encoded as UTF8/ASCII, not UTF16, and contained in byte buffers, not strings. Because of this, all modern web servers written for the .NET platform either don't use the current BCL parsing APIs or take a performance hit to transcode between UTF8 and UTF16, and to copy from buffers to strings.

To address modern networking scenarios better, we will provide parsing APIs that:

Can parse byte buffers without the need to have a String to parse
Can parse text encoded as UTF8 (and possibly other encodings)
Can parse without any GC heap allocations

Proposed API

namespace System.Buffers.Text
{
    public struct ParsedFormat
    {
        public ParsedFormat(char symbol, byte precision=NoPrecision);

        public char Symbol { get; }

        public bool HasPrecision { get; }
        public byte Precision { get; }

        public bool IsDefault { get; }

        public static ParsedFormat Parse(ReadOnlySpan<char> format);
        public static ParsedFormat Parse(string format);

        public static implicit operator ParsedFormat (char symbol);

        public const byte MaxPrecision = (byte)99;
        public const byte NoPrecision = (byte)255;
    }

    public static class Utf8Formatter
    {
        public static bool TryFormat(bool value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(byte value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(DateTime value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(DateTimeOffset value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(decimal value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(double value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(Guid value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(short value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(int value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(long value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(sbyte value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(float value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(TimeSpan value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(ushort value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(uint value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(ulong value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
    }

    public static class Utf8Parser
    {
        public static bool TryParseBoolean(ReadOnlySpan<byte> text, out bool value, out int bytesConsumed, char format = null);
        public static bool TryParseByte(ReadOnlySpan<byte> text, out byte value, out int bytesConsumed, char format = null);
        public static bool TryParseDateTime(ReadOnlySpan<byte> text, out DateTime value, out int bytesConsumed, char format = null);
        public static bool TryParseDateTimeOffset(ReadOnlySpan<byte> text, out DateTimeOffset value, out int bytesConsumed, char format = null);
        public static bool TryParseDouble(ReadOnlySpan<byte> text, out double value, out int bytesConsumed, char format = null);
        public static bool TryParseDecimal(ReadOnlySpan<byte> text, out decimal value, out int bytesConsumed, char format = null);
        public static bool TryParseGuid(ReadOnlySpan<byte> text, out Guid value, out int bytesConsumed, char format = null);
        public static bool TryParseInt16(ReadOnlySpan<byte> text, out short value, out int bytesConsumed, char format = null);
        public static bool TryParseInt32(ReadOnlySpan<byte> text, out int value, out int bytesConsumed, char format = null);
        public static bool TryParseInt64(ReadOnlySpan<byte> text, out long value, out int bytesConsumed, char format = null);
        public static bool TryParseSByte(ReadOnlySpan<byte> text, out sbyte value, out int bytesConsumed, char format = null);
        public static bool TryParseSingle(ReadOnlySpan<byte> text, out float value, out int bytesConsumed, char format = null);
        public static bool TryParseTimeSpan(ReadOnlySpan<byte> text, out TimeSpan value, out int bytesConsumed, char format = null);
        public static bool TryParseUInt16(ReadOnlySpan<byte> text, out ushort value, out int bytesConsumed, char format = null);
        public static bool TryParseUInt32(ReadOnlySpan<byte> text, out uint value, out int bytesConsumed, char format = null);
        public static bool TryParseUInt64(ReadOnlySpan<byte> text, out ulong value, out int bytesConsumed, char format = null);
    }
}

The APIs are policy free (not specific to scenarios, higher level frameworks, etc.) and optimized for speed, and will look like the following (per data type being parsed; the sample shows Int32 APIs):

public static bool TryFormat(int value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
public static bool TryParseInt32(ReadOnlySpan<byte> text, out int value, out int bytesConsumed, char format = null);
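As a rough illustration of how the two halves fit together, the sketch below formats an int into a UTF8 byte buffer and parses it back without ever creating a string. It assumes the names adopted later in this thread and shipped in System.Buffers.Text (`Utf8Formatter.TryFormat` and the `Utf8Parser.TryParse` overloads) rather than the `TryParseInt32` spelling proposed above; `RoundTripSketch` and `RoundTrip` are hypothetical names used only for this example.

```csharp
using System;
using System.Buffers.Text;

public static class RoundTripSketch
{
    // Formats an int as UTF8 bytes on the stack, then parses those bytes back.
    // No string and no GC heap allocation is involved on the success path.
    public static int RoundTrip(int value)
    {
        Span<byte> buffer = stackalloc byte[11]; // enough for "-2147483648"
        if (!Utf8Formatter.TryFormat(value, buffer, out int bytesWritten))
            throw new InvalidOperationException("Buffer too small.");

        ReadOnlySpan<byte> text = buffer.Slice(0, bytesWritten);
        if (!Utf8Parser.TryParse(text, out int parsed, out int bytesConsumed)
            || bytesConsumed != bytesWritten)
            throw new FormatException("Unexpected parse failure.");

        return parsed;
    }
}
```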

Cultures

APIs listed in this proposal support only the invariant culture. In corefxlab, we have APIs that support culture-aware formatting and parsing of basic primitive types, but these APIs are not part of this proposal, as they are not strictly needed by SignalR.

ParsedFormat

ParsedFormat is an efficient (non-allocating, preparsed) representation of the standard format strings. It's used in the formatting APIs. If default(ParsedFormat) is passed, a type-specific default (typically 'G') is used.

For the parsing APIs, we pass only a char, as the "precision" component is meaningless when parsing. Furthermore, format specifiers that differ only in the casing of the output they produce are treated as equivalent by the Parse methods. Thus, passing 'X' or 'x' to TryParseInt32 is equivalent (the parser will accept upper-case, lower-case, or mixed-case hex characters). If default(char) is passed, a type-specific default (typically 'G') is used.
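A minimal sketch of the casing rule described above, assuming the shipped `Utf8Parser.TryParse` overload for int in System.Buffers.Text (the proposal's `TryParseInt32`); `HexParseSketch` and `TryParseHex` are hypothetical helper names. Under that assumption, 'X' and 'x' should behave identically and mixed-case input such as "1fA" should parse.

```csharp
using System;
using System.Buffers.Text;

public static class HexParseSketch
{
    // Parses UTF8/ASCII hex digits into an int. The format char selects hex
    // parsing; per the proposal, 'X' and 'x' are interchangeable for parsing.
    public static bool TryParseHex(byte[] utf8Text, char format, out int value)
    {
        return Utf8Parser.TryParse(utf8Text, out value, out int bytesConsumed, format);
    }
}
```

For example, calling `TryParseHex` on the ASCII bytes of "1fA" with either 'X' or 'x' is expected to yield 0x1FA.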

APIs that attempt to parse all possible formats were considered, but this approach would add overhead and make the future addition of formats an event that changes prior API behavior.

More information about the parsing APIs is available here


KrzysztofCwalina · 2017-10-12T21:14:26Z

cc: @GrabYourPitchforks, @ahsonkhan, @joshfree, @terrajobst

GrabYourPitchforks · 2017-10-17T07:27:08Z

What's the practical use case of the bytesConsumed out parameter? As far as I can tell that out param just contains the index of the first invalid (e.g., non-digit) character, but presumably if the caller didn't want the parser to try parsing those characters it wouldn't have included them in the slice it provided.

KrzysztofCwalina · 2017-10-17T15:25:49Z

@GrabYourPitchforks, the caller would need to do work, what amounts to parsing or at best searching, to know which characters to exclude from the slice.
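To illustrate the point, here is a hedged sketch of a caller that walks a comma-separated list of integers using only bytesConsumed, without first scanning for each delimiter. It assumes the shipped `Utf8Parser.TryParse` overload for int; `ConsumedSketch` and `TrySum` are hypothetical names, and the comma-separated input format is invented for the example.

```csharp
using System;
using System.Buffers.Text;

public static class ConsumedSketch
{
    // Sums integers in text such as "10,20,30". The parser stops at the first
    // byte that is not part of a number and reports how far it got, so the
    // caller never has to search for the comma before parsing.
    public static bool TrySum(ReadOnlySpan<byte> utf8Text, out long sum)
    {
        sum = 0;
        while (!utf8Text.IsEmpty)
        {
            if (!Utf8Parser.TryParse(utf8Text, out int value, out int bytesConsumed))
                return false; // no number at the current position

            sum += value;
            utf8Text = utf8Text.Slice(bytesConsumed);

            if (!utf8Text.IsEmpty)
            {
                // Anything other than a comma after a number is an error here.
                if (utf8Text[0] != (byte)',')
                    return false;
                utf8Text = utf8Text.Slice(1);
            }
        }
        return true;
    }
}
```

Note that this sketch tolerates a trailing comma and treats empty input as a sum of zero; a stricter caller would reject both.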

GrabYourPitchforks · 2017-10-17T15:56:44Z

@KrzysztofCwalina callers generally shouldn't have problems isolating to-be-parsed values because they are normally unambiguously delimited (e.g., via newline) or already split out (e.g., via Request.QueryString). What's an example of a practical application where an API like ReadUInt64 would be used and where the caller wouldn't expect failure when a character like '$' is encountered mid-number? And is this expected to be common enough that we're going to place a burden on all callers that they'll need to check the out parameter to see if the entire input was consumed?

KrzysztofCwalina · 2017-10-18T15:44:36Z

callers generally shouldn't have problems isolating to-be-parsed values because they are normally unambiguously delimited

It's not whether it's a problem or not. It's that seeking to the delimiter and then slicing has cost. Why would we want to pay this cost in this low level API that is not optimized for usability but rather for cost/performance?

GrabYourPitchforks · 2017-10-18T16:00:31Z

I’m concerned about luring callers into a pit of failure. My gut tells me most callers would expect this to behave exactly like Int32.Parse, but operating on Span instead of String.

KrzysztofCwalina · 2017-10-18T16:31:15Z

We have such APIs being added directly to Int32, i.e. a TryParse overload that takes a Span.

GrabYourPitchforks · 2017-10-18T16:34:50Z

That addresses my concern then. 👍

ghost · 2017-10-18T18:42:21Z

@KrzysztofCwalina - how are callers of TryFormat supposed to know how big a Span to provide? The API doesn't provide the usual amenity of returning a required size.

ghost · 2017-10-18T18:49:18Z

@KrzysztofCwalina - What are the post-condition guarantees about the value bytesWritten if the buffer is too small to hold the formatted text?

ghost · 2017-10-18T23:41:09Z

After some testing of the integer formatters, it appears that the post-call state in the case of too-small a buffer is:

bytesWritten is always zero
but, in certain corner cases, some partial (and useless) characters may still have been written to the output buffer.

KrzysztofCwalina · 2017-10-19T15:49:34Z

The callers don't have to know the size of the buffer. If TryFormat returns false, the caller will expand the buffer and call the method again.

I think we should set bytesWritten to 0 when we fail in TryFormat, but we don't clear partially written data.

i.e. what you said above :-)
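The grow-and-retry pattern described here might look like the following sketch. It assumes the shipped `Utf8Formatter.TryFormat` overload for long; `RetrySketch`, `FormatWithRetry`, and the doubling policy are illustrative choices, not something the API prescribes.

```csharp
using System;
using System.Buffers.Text;

public static class RetrySketch
{
    // Formats a long as UTF8, doubling the buffer until TryFormat succeeds.
    public static byte[] FormatWithRetry(long value, int initialSize)
    {
        byte[] buffer = new byte[Math.Max(1, initialSize)];
        int bytesWritten;
        while (!Utf8Formatter.TryFormat(value, buffer, out bytesWritten))
        {
            // On failure bytesWritten is 0 and the buffer may hold partial
            // output, so nothing is reused: allocate a larger buffer and retry.
            buffer = new byte[buffer.Length * 2];
        }
        // Return only the bytes that were actually written.
        return buffer.AsSpan(0, bytesWritten).ToArray();
    }
}
```

A real caller on a hot path would more likely rent buffers from a pool or size the first attempt to the type's maximum formatted length, so that the retry branch is rarely taken.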

benaadams · 2017-10-24T18:09:43Z

Ditch type from name and use overloads?

 public static class Utf8Parser 
 {
    public static bool TryParse(ReadOnlySpan<byte> text, out bool value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out byte value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out DateTime value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out DateTimeOffset value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out double value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out decimal value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out Guid value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out short value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out int value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out long value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out sbyte value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out float value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out TimeSpan value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out ushort value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out uint value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out ulong value, out int bytesConsumed, char format = null);
}

terrajobst · 2017-10-24T18:29:08Z

Video

ParsedFormat -> StandardFormat
ParsedFormat should implement IEquatable<ParsedFormat>
Follow @benaadams's proposal and drop the type names from methods on Utf8Parser
Rename the format parameter on Utf8Parser to standardFormat

nil4 · 2017-10-24T19:30:02Z

The format parameter is a struct in both Utf8Formatter and Utf8Parser: A value of type '<null>' cannot be used as a default parameter because there are no standard conversions to type 'char'.

Should it be nullable or declared = default instead?

ahsonkhan · 2017-10-24T19:31:01Z

The format parameter is a struct in both Utf8Formatter and Utf8Parser: A value of type '<null>' cannot be used as a default parameter because there are no standard conversions to type 'char'.

Should it be nullable or declared = default instead?

Yes, that was discussed in the review and is just a typo. It should be default.

@atsushikan, it looks like the compiler throws an error even for char format = null

karelz · 2017-10-24T20:18:11Z

FYI: The API review discussion was recorded - see https://youtu.be/OZnaGV2omvI?t=2977 (38 min duration)

* Produce Utf8Parser and Utf8Formatter

Fixes https://github.com/dotnet/corefx/issues/24607

Remaining debt (cut for time):

Parsing integers with the "N" format
https://github.com/dotnet/corefx/issues/24986
Some questions to be resolved as to whether to be compatible (BCL doesn't care where you put the commas) or correct.

Format of floating point is still a wrapper hack
https://github.com/dotnet/corefx/issues/25077
The portable DoubleToNumber() code was never ported to C# (though the big block comment advertising it was).

* PR feedback.
- Move StandardFormat to System.Buffers
- Mark it readonly
- Fix spelling: "Seperator"
- Deduplicate namespace in ref .cs

* PR feedback.
- Assert that a culture-invariant ToString() on double produced ASCII characters only
- Add magic literal comments for the 4-byte compares in DateTimeOffset parsing.
- More "seperator" vs. "separator"
- Rename formatter benchmarks to be, you know, "formatter"-like.

* PR feedback.
- Improve perf of long.MinValue path in TryFormat(long)
- Make TryParseDateTimeOffset compare exact casing for Rfc1123 formats ('R' and 'l')

* PR feedback.
- Lots of small items in the last round of feedback.

* PR feedback (ThrowHelper, AllocHelper, random easy stuff)

* PR feedback (StandardFormat.Parse/ToString())
Removed the extra allocations from this path (though I still can't imagine anyone who cares enough about perf to use this parser wanting to use these APIs) and addressed the outstanding feedback surrounding these.
Removed the 1.2% false positive noise from Utf8Parser code coverage. It's now at 100%.

* Replace 'buffer' text in XML docs

panost · 2018-05-18T15:26:45Z

Why are they named Utf8xxx and not Asciixxxx?
From a quick look at the source code it appears that they support any ASCII compatible encoding.

GrabYourPitchforks · 2018-05-18T15:33:48Z

@panost Just a mistake. We discussed internally a few weeks ago but it's too late in the release cycle to do anything about it.

KrzysztofCwalina · 2018-05-21T17:41:14Z

I don't think it's a mistake. We used to have overloads on these types that let the formatter/parser control the "culture" of the input/output. I hope that we will add these overloads at some point. Once the overloads are in, these types will parse/write data that is truly UTF8, i.e. outside of the ASCII range.

See the following test: https://github.com/dotnet/corefxlab/blob/master/tests/System.Text.Primitives.Tests/Parsing/PrimitiveParserIntegerTests.cs#L483

panost · 2018-05-22T08:54:50Z

@KrzysztofCwalina Your answer made me re-evaluate the usefulness of this approach.

Has anyone done a benchmark using those methods, decoding/encoding real-world data, for example CSV or JSON files?
I mean files where you have an integer or a date here and there, but the vast majority of the data is textual.

My concern is that using a char array buffer (like TextReader/Writer), we do one encoding (or decoding) call every 4096 bytes (or whatever the buffer size is). Using those methods we do skip the encoding/decoding overhead for some fields, but we increase the number of encoding/decoding calls we make for the rest of the data.

For example using http://download.geonames.org/export/dump/countryInfo.txt
(has 3 integer fields and 16 text)

The first 37 lines of data require one call to Encoding.GetChars, but if I were using the Utf8Parser approach that would be 37 * 16 = 592 calls to Encoding.GetChars.

Is it worth it?

KrzysztofCwalina · 2018-05-22T15:04:43Z

Utf8Parser does not do any encoding/decoding, i.e. it does not call Encoding.GetChars. It natively knows which UTF8 bytes correspond to which digit (when it parses numbers). It is significantly faster than TextWriter/Reader.
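To make "natively knows which UTF8 bytes correspond to which digit" concrete, here is a simplified sketch of digit parsing directly over bytes. This is not the library's actual implementation; `DigitSketch` and `TryParseDigits` are hypothetical, and the sketch handles only unsigned decimal digits. The point is that ASCII digits '0' through '9' are the byte values 0x30 through 0x39, so no transcoding step is needed.

```csharp
using System;

public static class DigitSketch
{
    // Parses leading decimal digits from UTF8/ASCII bytes into a uint.
    // Stops at the first non-digit byte and reports how many bytes were used.
    public static bool TryParseDigits(ReadOnlySpan<byte> utf8Text, out uint value, out int bytesConsumed)
    {
        value = 0;
        bytesConsumed = 0;
        ulong accumulator = 0;

        while (bytesConsumed < utf8Text.Length)
        {
            // Subtracting '0' maps digit bytes to 0..9; any other byte wraps
            // to a value above 9 after the unchecked cast to uint.
            uint digit = unchecked((uint)(utf8Text[bytesConsumed] - (byte)'0'));
            if (digit > 9)
                break;

            accumulator = accumulator * 10 + digit;
            if (accumulator > uint.MaxValue)
            {
                bytesConsumed = 0; // overflow: report failure with nothing consumed
                return false;
            }
            bytesConsumed++;
        }

        if (bytesConsumed == 0)
            return false; // no digits at the start of the input

        value = (uint)accumulator;
        return true;
    }
}
```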

panost · 2018-05-22T16:51:23Z

@KrzysztofCwalina That will be true, if you have to parse a file that contains numbers only (or any other datatype that Utf8Parser supports).

But that's rarely the case. You have to deal with text fields/regions also and you have to decode each of them using Encoding.GetChars (or probably Decoder.Convert since you have to support streaming).

How many more calls? It depends on the file you are parsing.

KrzysztofCwalina · 2018-05-22T20:11:59Z

Yes, it depends on many things, but we are writing a JSON parser using these APIs and the parser is significantly faster than other parsers that pre-transcode.

ghost assigned ghost and KrzysztofCwalina Oct 12, 2017

KrzysztofCwalina changed the title ~~UTF8 Parsing api~~ UTF8 Parsing and Formatting Oct 12, 2017

ghost closed this as completed in dotnet/corefx#25078 Nov 7, 2017

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 2.1.0 milestone Jan 31, 2020

dotnet locked as resolved and limited conversation to collaborators Dec 20, 2020

This issue was closed.
