UTF8 Parsing and Formatting #23831

Closed
ghost opened this issue Oct 12, 2017 · 24 comments · Fixed by dotnet/corefx#25078
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Buffers

@ghost

ghost commented Oct 12, 2017

UTF8 Parsing

The current .NET Framework parsing APIs (e.g. int.TryParse) can parse text represented by System.String (UTF16).

string text = ...
if(int.TryParse(text, out int value)) { ... }

These APIs work great in many scenarios, e.g. parsing text contained in a GUI application's text boxes. They are not suitable for processing modern network protocols, which are often text (e.g. JSON, HTTP headers) but are encoded as UTF8/ASCII, not UTF16, and are contained in byte buffers, not strings. Because of this, all modern web servers written for the .NET platform either don't use the current BCL parsing APIs or take a performance hit to transcode between UTF8 and UTF16 and to copy from buffers to strings.

To address modern networking scenarios better, we will provide parsing APIs that:

  • Can parse byte buffers without the need to have a String to parse
  • Can parse text encoded as UTF8 (and possibly other encodings)
  • Can parse without any GC heap allocations

Proposed API

namespace System.Buffers.Text
{
    public struct ParsedFormat
    {
        public ParsedFormat(char symbol, byte precision=NoPrecision);

        public char Symbol { get; }

        public bool HasPrecision { get; }
        public byte Precision { get; }

        public bool IsDefault { get; }

        public static ParsedFormat Parse(ReadOnlySpan<char> format);
        public static ParsedFormat Parse(string format);

        public static implicit operator ParsedFormat (char symbol);

        public const byte MaxPrecision = (byte)99;
        public const byte NoPrecision = (byte)255;
    }

    public static class Utf8Formatter
    {
        public static bool TryFormat(bool value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(byte value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(DateTime value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(DateTimeOffset value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(decimal value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(double value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(Guid value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(short value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(int value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(long value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(sbyte value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(float value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(TimeSpan value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(ushort value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(uint value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
        public static bool TryFormat(ulong value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
    }

    public static class Utf8Parser
    {
        public static bool TryParseBoolean(ReadOnlySpan<byte> text, out bool value, out int bytesConsumed, char format = null);
        public static bool TryParseByte(ReadOnlySpan<byte> text, out byte value, out int bytesConsumed, char format = null);
        public static bool TryParseDateTime(ReadOnlySpan<byte> text, out DateTime value, out int bytesConsumed, char format = null);
        public static bool TryParseDateTimeOffset(ReadOnlySpan<byte> text, out DateTimeOffset value, out int bytesConsumed, char format = null);
        public static bool TryParseDouble(ReadOnlySpan<byte> text, out double value, out int bytesConsumed, char format = null);
        public static bool TryParseDecimal(ReadOnlySpan<byte> text, out decimal value, out int bytesConsumed, char format = null);
        public static bool TryParseGuid(ReadOnlySpan<byte> text, out Guid value, out int bytesConsumed, char format = null);
        public static bool TryParseInt16(ReadOnlySpan<byte> text, out short value, out int bytesConsumed, char format = null);
        public static bool TryParseInt32(ReadOnlySpan<byte> text, out int value, out int bytesConsumed, char format = null);
        public static bool TryParseInt64(ReadOnlySpan<byte> text, out long value, out int bytesConsumed, char format = null);
        public static bool TryParseSByte(ReadOnlySpan<byte> text, out sbyte value, out int bytesConsumed, char format = null);
        public static bool TryParseSingle(ReadOnlySpan<byte> text, out float value, out int bytesConsumed, char format = null);
        public static bool TryParseTimeSpan(ReadOnlySpan<byte> text, out TimeSpan value, out int bytesConsumed, char format = null);
        public static bool TryParseUInt16(ReadOnlySpan<byte> text, out ushort value, out int bytesConsumed, char format = null);
        public static bool TryParseUInt32(ReadOnlySpan<byte> text, out uint value, out int bytesConsumed, char format = null);
        public static bool TryParseUInt64(ReadOnlySpan<byte> text, out ulong value, out int bytesConsumed, char format = null);
    }
}

The APIs are policy free (not specific to scenarios, higher level frameworks, etc.) and optimized for speed, and will look like the following (per data type being parsed; the sample shows Int32 APIs):

public static bool TryFormat(int value, Span<byte> buffer, out int bytesWritten, ParsedFormat format=null);
public static bool TryParseInt32(ReadOnlySpan<byte> text, out int value, out int bytesConsumed, char format = null);
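
As a rough illustration of how the pair of Int32 APIs above might be used together, here is a minimal sketch. It assumes the names that came out of the review later in this thread (Utf8Parser.TryParse with the type name dropped, and the format parameter declared as default rather than null); the Utf8RoundTrip class and ParseThenFormat method are hypothetical names used only for this example, and the final string is created purely to make the result visible.

```csharp
using System;
using System.Buffers.Text;
using System.Text;

// Hypothetical helper, not part of the proposal.
public static class Utf8RoundTrip
{
    // Parses an Int32 from the start of a UTF-8 buffer, then formats it back
    // into a stack-allocated buffer without creating an intermediate string.
    public static string ParseThenFormat(ReadOnlySpan<byte> utf8Text)
    {
        if (!Utf8Parser.TryParse(utf8Text, out int value, out int bytesConsumed))
        {
            return "parse failed";
        }

        // 16 bytes is enough for any Int32 in the default format.
        Span<byte> buffer = stackalloc byte[16];
        if (!Utf8Formatter.TryFormat(value, buffer, out int bytesWritten))
        {
            return "buffer too small";
        }

        // Transcoding here is only for display in this example.
        string text = Encoding.UTF8.GetString(buffer.Slice(0, bytesWritten));
        return text + " (consumed " + bytesConsumed + ")";
    }
}
```

Calling ParseThenFormat on the UTF-8 bytes of "1234;" parses the four digits and leaves the trailing ';' unconsumed, which is what bytesConsumed reports.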

Cultures

APIs listed in this proposal support only the invariant culture. In corefxlab, we have APIs that support culture-aware formatting and parsing of basic primitive types, but these APIs are not part of this proposal, as they are not strictly needed by SignalR.

ParsedFormat

ParsedFormat is an efficient (non-allocating, preparsed) representation of the standard format strings. It's used in the formatting APIs. If default(ParsedFormat) is passed, a type-specific default (typically 'G') is used.
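
To make the symbol-plus-precision idea concrete, here is a small sketch. It assumes the type under its reviewed name, StandardFormat in System.Buffers (the rename from ParsedFormat is recorded later in this thread); the FormatDemo class and FormatPadded method are hypothetical names for this example only.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;

// Hypothetical helper, not part of the proposal.
public static class FormatDemo
{
    // Formats an Int32 with the 'D' symbol and the given precision,
    // i.e. the preparsed equivalent of the format string "D<digits>".
    public static string FormatPadded(int value, byte digits)
    {
        var format = new StandardFormat('D', digits);

        // 128 bytes covers the maximum precision (99) plus a sign.
        Span<byte> buffer = stackalloc byte[128];
        if (!Utf8Formatter.TryFormat(value, buffer, out int bytesWritten, format))
        {
            return null;
        }

        // Transcoding here is only for display in this example.
        return Encoding.ASCII.GetString(buffer.Slice(0, bytesWritten));
    }
}
```

The same value can be obtained from a format string with StandardFormat.Parse("D4"), which does the string parsing once so that the formatter itself never has to.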

For the parsing APIs, we pass only a char, as the "precision" component is meaningless when parsing. Furthermore, format specifiers that differ only in the casing of their output are treated as equivalent by the Parse methods. Thus, passing 'X' or 'x' to TryParseInt32 is equivalent (the parser will accept uppercase, lowercase, or mixed-case hex characters). If default(char) is passed, a type-specific default (typically 'G') is used.
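
A short sketch of the casing rule described above, assuming the reviewed method name Utf8Parser.TryParse; HexCaseDemo and TryParseHex are hypothetical names for this example. The extra length check is this example's own policy, not something the parser enforces.

```csharp
using System;
using System.Buffers.Text;

// Hypothetical helper, not part of the proposal.
public static class HexCaseDemo
{
    // Parses hexadecimal text with the given format symbol ('X' or 'x').
    // Both symbols accept uppercase, lowercase, and mixed-case digits.
    public static bool TryParseHex(ReadOnlySpan<byte> utf8Text, char format, out int value)
    {
        return Utf8Parser.TryParse(utf8Text, out value, out int bytesConsumed, format)
            && bytesConsumed == utf8Text.Length; // require the whole input to be hex digits
    }
}
```

With this helper, the UTF-8 bytes of "ff", "FF", and "fF" should all yield the same value whether 'X' or 'x' is passed.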

APIs that attempt to parse all possible formats were considered, but this approach would add overhead and make the future addition of formats an event that changes prior API behavior.

More information about the parsing APIs is available here

@ghost ghost assigned ghost and KrzysztofCwalina Oct 12, 2017
@KrzysztofCwalina KrzysztofCwalina changed the title from "UTF8 Parsing api" to "UTF8 Parsing and Formatting" Oct 12, 2017

@GrabYourPitchforks
Member

What's the practical use case of the bytesConsumed out parameter? As far as I can tell that out param just contains the index of the first invalid (e.g., non-digit) character, but presumably if the caller didn't want the parser to try parsing those characters it wouldn't have included them in the slice it provided.

@KrzysztofCwalina
Member

KrzysztofCwalina commented Oct 17, 2017

@GrabYourPitchforks, the caller would need to do work that amounts to parsing, or at best searching, to know which characters to exclude from the slice.

@GrabYourPitchforks
Member

@KrzysztofCwalina callers generally shouldn't have problems isolating to-be-parsed values because they are normally unambiguously delimited (e.g., via newline) or already split out (e.g., via Request.QueryString). What's an example of a practical application where an API like ReadUInt64 would be used and where the caller wouldn't expect failure when a character like '$' is encountered mid-number? And is this expected to be common enough that we're going to place a burden on all callers that they'll need to check the out parameter to see if the entire input was consumed?

@KrzysztofCwalina
Member

callers generally shouldn't have problems isolating to-be-parsed values because they are normally unambiguously delimited

It's not whether it's a problem or not. It's that seeking to the delimiter and then slicing has cost. Why would we want to pay this cost in this low level API that is not optimized for usability but rather for cost/performance?
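
One way to picture the argument above is a caller that walks a comma-delimited buffer and lets the parser itself discover where each number ends, instead of first seeking to the delimiter and slicing. This is only a sketch under assumptions: it uses the reviewed name Utf8Parser.TryParse, DelimitedInts and ParseCommaSeparated are hypothetical names, and the List<int> is there to keep the example short (a real low-allocation caller would presumably consume the values in place).

```csharp
using System;
using System.Buffers.Text;
using System.Collections.Generic;

// Hypothetical helper, not part of the proposal.
public static class DelimitedInts
{
    // Parses text such as "12,34,56" in a single pass over the bytes.
    public static List<int> ParseCommaSeparated(ReadOnlySpan<byte> utf8Text)
    {
        var values = new List<int>();
        while (!utf8Text.IsEmpty)
        {
            // The parser stops at the first non-digit and reports how far it got.
            if (!Utf8Parser.TryParse(utf8Text, out int value, out int bytesConsumed))
            {
                throw new FormatException("Expected an integer.");
            }
            values.Add(value);

            // Advance past the number; no separate search for ',' was needed.
            utf8Text = utf8Text.Slice(bytesConsumed);
            if (utf8Text.IsEmpty)
            {
                break;
            }
            if (utf8Text[0] != (byte)',')
            {
                throw new FormatException("Expected ','.");
            }
            utf8Text = utf8Text.Slice(1);
        }
        return values;
    }
}
```

A caller that does want Int32.Parse-like strictness can still get it by checking that bytesConsumed equals the input length.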

@GrabYourPitchforks
Member

I’m concerned about luring callers into a pit of failure. My gut tells me most callers would expect this to behave exactly like Int32.Parse, but operating on Span instead of String.

@KrzysztofCwalina
Member

We have such APIs being added directly to Int32, i.e. a TryParse overload that takes a Span.

@GrabYourPitchforks
Member

That addresses my concern then. 👍

@ghost

ghost commented Oct 18, 2017

@KrzysztofCwalina - how are callers of TryFormat supposed to know how big a Span to provide? The API doesn't provide the usual amenity of returning a required size.

@ghost

ghost commented Oct 18, 2017

@KrzysztofCwalina - What are the post-condition guarantees about the value of bytesWritten if the buffer is too small to hold the formatted text?

@ghost

ghost commented Oct 18, 2017

After some testing of the integer formatters, it appears that the post-call state in the case of a too-small buffer is:

  • bytesWritten is always zero
  • but, in certain corner cases, some partial (and useless) characters may still have been written to the output buffer.

@KrzysztofCwalina
Member

KrzysztofCwalina commented Oct 19, 2017

The callers don't have to know the size of the buffer. If TryFormat returns false, the caller will expand the buffer and call the method again.

I think we should set bytesWritten to 0 when we fail in TryFormat, but we don't clear partially written data.

i.e. what you said above :-)
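
The expand-and-retry pattern described above could look roughly like the following. This is a sketch, not a recommendation: GrowAndRetry and FormatWithRetry are hypothetical names, the doubling strategy is an assumption of this example, and plain arrays are used instead of a pool to keep it short.

```csharp
using System;
using System.Buffers.Text;
using System.Text;

// Hypothetical helper, not part of the proposal.
public static class GrowAndRetry
{
    // Formats a long, doubling the buffer each time TryFormat reports failure.
    public static string FormatWithRetry(long value, int initialSize = 4)
    {
        byte[] buffer = new byte[Math.Max(1, initialSize)];
        int bytesWritten;

        // On failure bytesWritten is 0 and the buffer may hold partial output,
        // so nothing from the failed attempt is reused.
        while (!Utf8Formatter.TryFormat(value, buffer, out bytesWritten))
        {
            buffer = new byte[buffer.Length * 2];
        }

        // Transcoding here is only for display in this example.
        return Encoding.UTF8.GetString(buffer, 0, bytesWritten);
    }
}
```

For example, FormatWithRetry(long.MinValue, 1) has to grow the buffer several times before the 20-character result fits.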

@benaadams
Member

benaadams commented Oct 24, 2017

Ditch type from name and use overloads?

 public static class Utf8Parser 
 {
    public static bool TryParse(ReadOnlySpan<byte> text, out bool value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out byte value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out DateTime value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out DateTimeOffset value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out double value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out decimal value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out Guid value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out short value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out int value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out long value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out sbyte value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out float value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out TimeSpan value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out ushort value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out uint value, out int bytesConsumed, char format = null);
    public static bool TryParse(ReadOnlySpan<byte> text, out ulong value, out int bytesConsumed, char format = null);
}

@terrajobst
Member

terrajobst commented Oct 24, 2017

Video

  • ParsedFormat -> StandardFormat
  • ParsedFormat should implement IEquatable<ParsedFormat>
  • Follow @benaadams's proposal and drop the type names from methods on Utf8Parser
  • Rename the format parameter on Utf8Parser to standardFormat

@nil4
Contributor

nil4 commented Oct 24, 2017

The format parameter is a struct in both Utf8Formatter and Utf8Parser: A value of type '<null>' cannot be used as a default parameter because there are no standard conversions to type 'char'.

Should it be nullable or declared = default instead?

@ahsonkhan
Member

ahsonkhan commented Oct 24, 2017

The format parameter is a struct in both Utf8Formatter and Utf8Parser: A value of type '<null>' cannot be used as a default parameter because there are no standard conversions to type 'char'.

Should it be nullable or declared = default instead?

Yes, that was discussed in the review and is just a typo. It should be default.
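
For readers unfamiliar with the distinction, here is a tiny sketch of why default works where null does not: char is a value type, so its optional-parameter default has to be a value of that type. DefaultParameterDemo and Describe are hypothetical names used only for this illustration.

```csharp
using System;

// Hypothetical helper, not part of the proposal.
public static class DefaultParameterDemo
{
    // 'char format = null' does not compile; 'default' yields '\0',
    // which the callee can treat as "use the type-specific default".
    public static string Describe(char format = default)
    {
        return format == default(char)
            ? "type-specific default"
            : "format '" + format + "'";
    }
}
```

The same applies to a struct-typed format parameter on the formatter side: default(T) is always available for a value type, while null is not.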

@atsushikan, it looks like the compiler throws an error even for char format = null

@karelz
Member

karelz commented Oct 24, 2017

FYI: The API review discussion was recorded - see https://youtu.be/OZnaGV2omvI?t=2977 (38 min duration)

ghost referenced this issue in dotnet/corefx Nov 7, 2017
* Produce Utf8Parser and Utf8Formatter

Fixes https://github.com/dotnet/corefx/issues/24607

Remaining debt (cut for time):

Parsing Integers with the "N" format

  https://github.com/dotnet/corefx/issues/24986

  Some questions to be resolved as to whether to be compatible
  (BCL doesn't care where you put the commas) or correct.

Format of floating point is still a wrapper hack

  https://github.com/dotnet/corefx/issues/25077

  The portable DoubleToNumber() code was never ported
  to C# (though the big block comment advertising it
  was).

* PR feedback.

- Move StandardFormat to System.Buffers

- Mark it readonly

- Fix spelling: "Seperator"

- Deduplicate namespace in ref .cs

* PR feedback.

- Assert that a culture-invariant ToString() on double produced ASCII characters only

- Add magic literal comments for the 4-byte compares in DateTimeOffset parsing.

- More "seperator" vs. "separator"

- Rename formatter benchmarks to be you know, "formatter"-like.

* PR feedback.

- Improve perf of long.MinValue path in TryFormat(long)

- Make TryParseDateTimeOffset compare exact casing for Rfc1123 formats ('R' and 'l')

* PR feedback.

- Lots of small items in the last round of feedback.

* PR feedback (ThrowHelper, AllocHelper, random easy stuff)

* PR feedback (StandardFormat.Parse/ToString())

Removed the extra allocations from this path
(though I still can't imagine anyone who cares
enough about perf to use this parser wanting
to use these apis) and addressed the outstanding
feedback surrounding these.

Removed the 1.2% false positive noise
from Utf8Parser code coverage. It's now at 100%.

* Replace 'buffer' text in XML docs
@panost

panost commented May 18, 2018

Why are they named Utf8xxx and not Asciixxxx?
From a quick look at the source code it appears that they support any ASCII compatible encoding.

@GrabYourPitchforks
Member

@panost Just a mistake. We discussed internally a few weeks ago but it's too late in the release cycle to do anything about it.

@KrzysztofCwalina
Member

KrzysztofCwalina commented May 21, 2018

I don't think it's a mistake. We used to have overloads on these types that let the formatter/parser control the "culture" of the input/output. I hope that we will add these overloads at some point. Once the overloads are in, these types will parse/write data that is truly UTF8, i.e. outside of the ASCII range.

See the following test: https://github.com/dotnet/corefxlab/blob/master/tests/System.Text.Primitives.Tests/Parsing/PrimitiveParserIntegerTests.cs#L483

@panost

panost commented May 22, 2018

@KrzysztofCwalina Your answer made me re-evaluate the usefulness of this approach.

Has anyone done a benchmark using those methods, decoding/encoding real-world data, for example CSV or JSON files?
I mean files where you do have an integer or a date here and there, but the vast majority of the data is textual.

My concern is that using a char array buffer (like TextReader/Writer), we do one encoding (or decoding) call every 4096 bytes (or whatever the buffer size is). Using those methods we do skip the encoding/decoding overhead for some fields, but we increase the number of encoding/decoding calls we make for the rest of the data.

For example using http://download.geonames.org/export/dump/countryInfo.txt
(has 3 integer fields and 16 text)

the first 37 lines of data require one call to Encoding.GetChars, but if I were using the Utf8Parser approach that would be 37 * 16 = 592 calls to Encoding.GetChars

Is it worth it?

@KrzysztofCwalina
Member

KrzysztofCwalina commented May 22, 2018

Utf8Parser does not do any encoding/decoding, i.e. it does not call Encoding.GetChars. It natively knows which UTF8 bytes correspond to which digit (when it parses numbers). It is significantly faster than TextWriter/Reader.
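
To illustrate what "natively knows which UTF8 bytes correspond to which digit" can mean, here is a simplified sketch of digit parsing done directly on bytes. It is not the actual Utf8Parser implementation; DigitSketch and TryParseDigits are hypothetical names, and the overflow handling is this example's own choice.

```csharp
using System;

// Hypothetical helper; a simplified illustration, not the real parser.
public static class DigitSketch
{
    // Reads decimal digits straight from UTF-8 bytes. In UTF-8 (and ASCII)
    // the digits '0'..'9' are the single bytes 0x30..0x39, so no transcoding
    // to UTF-16 chars is needed.
    public static bool TryParseDigits(ReadOnlySpan<byte> utf8Text, out uint value, out int bytesConsumed)
    {
        value = 0;
        bytesConsumed = 0;
        ulong accumulator = 0;

        while (bytesConsumed < utf8Text.Length)
        {
            // Bytes below '0' wrap to a large uint, so one comparison
            // rejects everything outside '0'..'9'.
            uint digit = (uint)(utf8Text[bytesConsumed] - (byte)'0');
            if (digit > 9)
            {
                break;
            }

            accumulator = accumulator * 10 + digit;
            if (accumulator > uint.MaxValue)
            {
                bytesConsumed = 0; // overflow: report failure, nothing consumed
                return false;
            }
            bytesConsumed++;
        }

        value = (uint)accumulator;
        return bytesConsumed > 0;
    }
}
```

Text fields still have to be decoded if the application needs them as strings; the sketch only shows why numeric fields do not require a call to Encoding.GetChars.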

@panost

panost commented May 22, 2018

@KrzysztofCwalina That will be true, if you have to parse a file that contains numbers only (or any other datatype that Utf8Parser supports).

But that's rarely the case. You have to deal with text fields/regions also and you have to decode each of them using Encoding.GetChars (or probably Decoder.Convert since you have to support streaming).

How many more calls? It depends on the file you are parsing.

@KrzysztofCwalina
Member

Yes, it depends on many things, but we are writing a JSON parser using these APIs and the parser is significantly faster than other parsers that pre-transcode.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 31, 2020
@dotnet dotnet locked as resolved and limited conversation to collaborators Dec 20, 2020
This issue was closed.

10 participants