Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Productize Utf8Parser and Utf8Formatter #25078

Merged
8 commits merged into from Nov 7, 2017
Merged

Productize Utf8Parser and Utf8Formatter #25078

8 commits merged into from Nov 7, 2017

Conversation

ghost
Copy link

@ghost ghost commented Nov 6, 2017

Fixes https://github.com/dotnet/corefx/issues/24607

Remaining debt (cut for time):

Parsing Intgers with the "N" format

https://github.com/dotnet/corefx/issues/24986

Some questions to be resolved as to whether to be compatible
(BCL doesn't care where you put the commas) or correct.

Format of floating point is still a wrapper hack

https://github.com/dotnet/corefx/issues/25077

The portable DoubleToNumber() code was never ported
to C# (though the big block comment advertising it
was).

@ghost ghost added the area-System.Memory label Nov 6, 2017
@ghost ghost added this to the 2.1.0 milestone Nov 6, 2017
@ghost ghost self-assigned this Nov 6, 2017
@ghost ghost requested review from ahsonkhan and KrzysztofCwalina November 6, 2017 15:55
@ghost ghost changed the title Produce Utf8Parser and Utf8Formatter Productize Utf8Parser and Utf8Formatter Nov 6, 2017
@@ -281,6 +281,66 @@ public static class BinaryPrimitives

namespace System.Buffers.Text
{
public struct StandardFormat : IEquatable<StandardFormat>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we talked about this type being in System.Buffers namespace. I know that SignalR team wanted to use it as a format specifier for non-textual output formats.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
}

namespace System.Buffers.Text
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why open the same namespace as above?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge artifact.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

/// Represents a standard formatting string without using an actual String. A StandardFormat consists of a character (such as 'G', 'D' or 'X')
/// and an optional precision ranging from 0..99, or the special value NoPrecision.
/// </summary>
public struct StandardFormat : IEquatable<StandardFormat>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applicable structs were recently marked readonly throughout the repo (#24997). Looks like this could be readonly as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

/// <summary>
/// Precision values for format that don't use a precision, or for when the precision is to be unspecified.
/// </summary>
public const byte MaxPrecision = 99;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want this to be a const? What if we want to change that in the future? i.e. should be be simply a readonly field?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it doesn't need to exist at all? We usually publish argument range limits in documentation rather than as IL metadata.

/// Represents a standard formatting string without using an actual String. A StandardFormat consists of a character (such as 'G', 'D' or 'X')
/// and an optional precision ranging from 0..99, or the special value NoPrecision.
/// </summary>
public struct StandardFormat : IEquatable<StandardFormat>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth making it a readonly struct

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

public const byte Space = (byte)' ';
public const byte Hyphen = (byte)'-';

public const byte Seperator = (byte)',';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seperator => Separator

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

/// </summary>
public static StandardFormat Parse(ReadOnlySpan<char> format)
{
return Parse(new string(format.ToArray()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this seems backwards. I think we should implement this method without allocations and then call it from the Parse(string) overload.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, that was some leftover debt too - it was using Utf16Parser which we weren't productizing and uint.TryParse(ReadOnlySpan) isn't available in NetStandard 1.1.

/// We don't have access to Math.DivRem, so this is a copy of the implementation.
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static long DivMod(long numerator, long denominator, out long modulo)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last time I looked, our JIT was generating two idiv instructions for this. @AndyAyersMS, it would be great if we could fix it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree we should fix; but there are challenges. See dotnet/coreclr#757 for some notes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int CountDigits(long n)
{
if (n == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be if (n < 10)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was probably meant to handle the special case of "0" - probably ok to let it cover the "1"-"9" case too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, "n" is signed so it n < 10 isn't the right test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, < > double check is probably not worth?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not unchecked((ulong)n) < 10, which should be the equivalent of (n >= 0 && n < 10), but only a single comparison

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the equivalent - it would need to be Abs(n) for it to do the check we need.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I didn't realize the dual check was n > -10 && n < 10 (which would be: Abs(n) < 10), not n >= 0 && n < 10 (which would be: unchecked((ulong)n) < 10)

// 05/25/2017 10:30:15 -08:00
//
private static bool TryFormatDateTimeG(DateTime value, TimeSpan offset, Span<byte> buffer, out int bytesWritten)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth Debug.Asserting that the format is indeed 'G'

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have the format here...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the github source viewer long line clipping bites again :-)

if (format.Precision == StandardFormat.NoPrecision)
formatString = format.Symbol.ToString();
else
formatString = format.Symbol.ToString() + format.Precision;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we allocate string of length 1 + numberOfDigits, pin it, set pStr[0] = symbol, and then format the precision into it? Also, could we optimize for the case of the default format?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't think it was kosher to mutate a string outside of CoreLib.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine as long as we implement it in one place: StandarfFormat.ToString(). I think it will be quite common for people to want to do the same thing, i.e. convert StandardFormat to string representation,

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really? I thought conversion between format strings and StandardFormat wouldn't be done by anyone who actually cares about performance. I'd see ToString() as a debugging aid.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is on the floor right now but I'm pretty sureString.Copy() was only introduced in NetStandard 2.0 and I've been bitten before by the "there's no way the tooling will intern this" only to have it happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When building for .NET Core, we can do the more efficient thing, using the new String.Create method, etc. But even without that, this is only a few chars in length, right? Why not just use temporary stack space to build up the chars and then create a string from that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


for (int i = 0; i < length; i++)
{
buffer[i] = (byte)(utf16Text[i]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to Debug.Assert that the chars are ASCII?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

uint dowString = (dow0 << 24) | (dow1 << 16) | (dow2 << 8) | comma;
switch (dowString)
{
case 0x53554E2c: dayOfWeek = DayOfWeek.Sunday; break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding a comment explaining what these magic numbers are.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

{
case 'R':
case 'l':
return TryParseDateTimeOffsetR(text, out value, out bytesConsumed);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ok with passing in 'R' and then successfully parsing lower cased date time?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've generally been "case-insensitive" for parsing but given that this is a RFC wire format, maybe an exception is warranted. Waiting to see what others have to say on this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made them 'strict case' for now.

[InlineData(12837467L)] // standard format
[InlineData(-9223372036854775808L)] // min value
[InlineData(9223372036854775807L)] // max value
private static void ParserInt64(long value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do the results compare to UTF16 routines in mscorlib?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improvement

44% // 12837467L
-26% // long.MinValue
13% // long.MaxValue

("long.MinValue" has a special path just for it in FormatInt64D() that delegates to the unsigned formatter - looks like this is an outlier case. 'Course, if we really think people will benchmark that a lot, it is easy to write a super-fast path when it's only for one specific value...)

set { Flags = (Flags & ~ScaleMask) | ((uint)value << ScaleShift); }
}

// Sign mask for the flags field. A value of zero in this bit indicates a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A value of zero in this bit indicates

Which bit exactly?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0x80000000. That's the assigned value of SignMask.

<ItemGroup>
<!-- Common or Common-branched source files -->
<Compile Include="$(CommonPath)\System\NotImplemented.cs">
<Link>Common\System\NotImplemented.cs</Link>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we specify the complete path here rather than using Link property? What does it do?
i.e. Why not just Compile Include="..\Common\src\System\NotImplemented.cs" />?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$(CommonPath) is how it's done in all the other .csproj files that include this file. Personally, my care factor here is low - this was copy-paste.

public const byte NoPrecision = byte.MaxValue;

/// <summary>
/// Precision values for format that don't use a precision, or for when the precision is to be unspecified.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The xml comments for MaxPrecision and NoPrecision are mismatched.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

/// <summary>
/// Create a StandardFormat.
/// </summary>
/// <param name="symbol">A type-specific formatting characeter such as 'G', 'D' or 'X'</param>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

characeter spelling

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

public static implicit operator StandardFormat(char symbol) => new StandardFormat(symbol);

/// <summary>
/// Converts a classic .NET format string into a StandardFormat
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update comment to reflect ROS<char> as input.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term "string" is used generically here, not specifically referrring to System.String. This is just an overload of the one that takes String - I don't think a different XML comment is needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. Although, it is qualified as a "classic .NET format string".

if (!(standardFormat == default(char) || standardFormat == 'G' || standardFormat == 'l'))
throw new FormatException(SR.Argument_BadFormatSpecifier);

if (text.Length >= 4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation doesn't produce different results based on the standardFormat.
Is any upper/lower case combination of the characters within true considered valid for both 'G' and 'l'?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

public static partial class Utf8Parser
{
/// <summary>
/// Parses a Byte at the start of a Utf8 string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update xml comment

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

}

/// <summary>
/// Parses a SByte at the start of a Utf8 string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here, update the comment

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

return TryParseUInt16X(text, out Unsafe.As<short, ushort>(ref value), out bytesConsumed);

default:
throw new FormatException(SR.Argument_BadFormatSpecifier);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use throw helper to help with inlining.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return true;
}

private static bool TryParseUInt32D(ReadOnlySpan<byte> text, out uint value, out int bytesConsumed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider applying the same loop unrolling optimization as TryParseInt32D (https://github.com/dotnet/corefx/pull/25078/files#diff-5e90497df430f6541664f0a780acac46R199).

Same with TryParseUInt16D and TryParseByteD.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR. We need to get this in before I leave for vacation and this sort of thing is too much code churn for a commit this size.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed :)
Already pushing the bar for what can be reviewed within the browser.


private static bool TryParseNumber(ReadOnlySpan<byte> text, ref NumberBuffer number, out int bytesConsumed, ParseNumberOptions options, out bool textUsedExponentNotation)
{
Debug.Assert(number.Digits[0] == 0 && number.Scale == 0 && !number.IsNegative, "Number not initialized to default(NumberBuffer)");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the Debug.Assert condition and comment inverted here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the condition isn't and the MSDN example (and usability) says the text should describe what went wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. That makes sense.

if number == default, it will be Debug.Assert(true,""), and hence all is good.
if number != default, it will be Debug.Assert(false,""), and we see that message.


using Xunit;

namespace System.Buffers.Text.Tests
Copy link
Member

@ahsonkhan ahsonkhan Nov 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please post the code coverage numbers for the Formatters and Parsers?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Utf8Formatter: 100%

FormattingHelper: 100%

Utf8Parser: 98.8% (the 1.2% are NotImplementedException and Debug.Assert paths.)

ParserHelper: 100%

MutableDecimal: 100%

System.Number: 98.2% (the 1.8% are paths that are unreachable in System.Memory but may be reachable in ProjectN where this code came from. I did not bother to analyze the much larger Project N codebase and I don't want to introduce unnecessary diffs here.)

System.NumberBuffer: 45.4% (the remaining 55.6% is the ToString() method which only exists so that NumberBuffer displays nicely in the debugger.)

StandardFormat: 100%

IntPtr pMemory;
try
{
pMemory = Marshal.AllocHGlobal(int.MaxValue);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


[Fact]
[OuterLoop]
public static void TestParser2GiBOverflow()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands right now, it's not specific to Windows...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was suggesting excluding running this test on Linux. We have seen an issue in the past with trying to use OOM exception to skip allocation on Linux:

On Linux, the allocation can succeed even if there is not enough memory but then the test may get killed by the OOM killer at the time the memory is accessed which triggers the full memory allocation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excluded

// Intentionally placed in the global namespace so that we don't have to see an annoying long type name
// in the xunit output.
//
internal sealed class TestException : Exception
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is general purpose and can potentially be used in other tests too. Move this up to src/System.Memory/tests.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

[InlineData("12837467")] // standard parse
[InlineData("-2147483648")] // min value
[InlineData("2147483647")] // max value
private static void ParserInt32(string text)
Copy link
Member

@ahsonkhan ahsonkhan Nov 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about adding baseline to compare performance against int.TryParse?

Also, can we port over the variable length performance tests like https://github.com/dotnet/corefxlab/blob/master/tests/System.Text.Primitives.Tests/Parsing/PrimitiveParserInt32PerfTests.cs#L121?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure how much usage the perf tests in CoreFx got so I only added the most critical scenarios and mostly so we'd have a set place to add more. We can beef this up for sure - but it might be better to do it on an as-needed basis. There's also a lot of stuff in that test assembly.

<Compile Include="..\Common\src\System\MutableDecimal.cs" />
</ItemGroup>
<ItemGroup>
<Compile Include="ParsersAndFormatters\Formatter\FormatterTestData.cs" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we will likely have Encoders, it might make more sense to just have the following directories at the root, rather than nested within "ParsersAndFormatters":

  • Formatter
  • Parser
  • Encoder

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure - the Formatter and Parser test-bed has a lot of shared infrastructure that wouldn't be useful for Encoders.

public char ParseSynonymFor { get; set; } = default;
}

internal static partial class TestData
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split into separate file

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's useful for the SupportedFormats-related metadata to be in the same file as SupportedFormats struct.


byte[] dayAbbrev = DayAbbreviationsLowercase[(int)value.DayOfWeek];
Unsafe.Add(ref utf8Bytes, 0) = dayAbbrev[0];
Unsafe.Add(ref utf8Bytes, 1) = dayAbbrev[1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reason why we are using Unsafe code here? The JIT should be able to eliminate the bounds checks here - because of it should be able to see that Span is at least 29 bytes given the precondition above. If the JIT does not do this optimization, we should get it fixed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was that way from the start when it was added to CoreFxLab. It was likely a combination of following the pattern used by the variable-lengthed formatters and the fact that we needed the pinnable reference anyway to call the WriteDigits helpers. @shiftylogic - if you're still following this, was there a JIT-related reason for using Unsafe.Add here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching to the Span indexer on this does seem to slow it down 1-2% (fast Span.)

/// </summary>
public static StandardFormat Parse(ReadOnlySpan<char> format)
{
return Parse(new string(format.ToArray()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When building System.Memory for .NET Core, we should avoid the extra allocations here, e.g.

return Parse(new string(format));

Even better, invert these so that the string-based overload delegates to the ReadOnlySpan<char>-based overload, and then even the string allocation wouldn't be necessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inverted.


if (format.Length > 1)
{
if (!byte.TryParse(format.Substring(1), out precision))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On .NET Core this can use format.AsReadOnlySpan().Slice(1) and the new byte.TryParse that accepts a span.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Parse() method doesn't really merit an #if-bifurcated implementation - it was easier just to throw in the simple inline parser.

if (format.Length > 1)
{
if (!byte.TryParse(format.Substring(1), out precision))
throw new FormatException("format");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nameof(format)... but why is the message just "format"? Shouldn't there be some resource string outlining the problem, or else this would be an ArgumentException of some kind?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added resource string.

throw new FormatException("format");

if (precision > MaxPrecision)
throw new FormatException("precision");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same questions as I had above.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same answer.

{
long div = numerator / denominator;
modulo = numerator - (div * denominator);
return div;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already the implementation of Math.DivRem in .NET Core:
https://github.com/dotnet/coreclr/blob/master/src/mscorlib/shared/System/Math.cs#L82-L90

            long div = a / b;
            result = a - (div * b);
            return div;

if (bookEnds && format.Symbol == 'B')
Unsafe.Add(ref utf8Bytes, idx++) = CloseBrace;
else if (bookEnds && format.Symbol == 'P')
Unsafe.Add(ref utf8Bytes, idx++) = CloseParen;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (bookEnds)
{
    if (format.Symbol == 'B')
    {
        Unsafe.Add(ref utf8Bytes, idx++) = CloseBrace;
    }
    else if (format.Symbol == 'P')
    {
        Unsafe.Add(ref utf8Bytes, idx++) = CloseParen;
    }
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Fixes https://github.com/dotnet/corefx/issues/24607

Remaining debt (cut for time):

Parsing Intgers with the "N" format

  https://github.com/dotnet/corefx/issues/24986

  Some questions to be resolved as to whether to be compatible
  (BCL doesn't care where you put the commas) or correct.

Format of floating point is still a wrapper hack

  https://github.com/dotnet/corefx/issues/25077

  The portable DoubleToNumber() code was never ported
  to C# (though the big block comment advertising it
  was).
- Move StandardFormat to System.Buffers

- Mark it readonly

- Fix spelling: "Seperator"

- Deduplicate namespace in ref .cs
- Assert that a culture-invariant ToString() on double produced ASCII characters only

- Add magic literal comments for the 4-byte compares in DateTimeOffset parsing.

- More "seperator" vs. "separator"

- Rename formatter benchmarks to be you know, "formatter"-like.
- Improve perf of long.MinValue path in TryFormat(long)

- Make TryParseDateTimeOffset compare exact casing for Rfx1123 formats ('R' and 'l')
- Lots of small items in the last round of feedback.
Removed the extra allocations from this path
(though I still can't imagine anyone who cares
enough about perf to use this parser wanting
to use these apis) and addressed the outstanding
feedback surrounding these.

Removed the 1.2% false positive noise
from Utf8Parser code coverage. It's now at 100%.
@@ -121,7 +121,7 @@ public static bool TryFormat(DateTimeOffset value, Span<byte> buffer, out int by
return TryFormatDateTimeG(value.DateTime, offset, buffer, out bytesWritten);

default:
throw new FormatException(SR.Argument_BadFormatSpecifier);
return ThrowHelper.TryFormatThrowFormatException(out bytesWritten);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: assign bytesWritten to default outside the switch statement so that we don't have to have an out parameter on the ThrowHelper. It looks strange that it is returning an out parameter.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd be writing bytesWritten twice and we'd still need a strange looking return type on the helper to avoid ruining the 100% code coverage on Utf8Formatter. ThrowHelper is always going to look strange unless you're one the people deep into JIT-fu. It still looks strange to me.

// Officially, the default is "G" but "G without a precision is equivalent to "D" so that's why we're calling the "D" helper.
return TryFormatInt64D(value, format.Precision, buffer, out bytesWritten);
// Officially, the default is "G" but "G without a precision is equivalent to "D" and so that's why we're using "D" (eliminates an unnecessary HasPrecision check)
format = 'D';
Copy link
Member

@ahsonkhan ahsonkhan Nov 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overriding the entire struct and calling the constructor here increases the method assembly size.

What is wrong with the following?

char formatChar = format.Symbol;
if (format.IsDefault)
{
   formatChar = 'D';
}
switch (formatChar)
{
   ...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's wrong is that you'll pass the wrong precision down to the subformatter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, if format.IsDefault is true, the precision == 0. But we want NoPrecision, not 0. Got it.

}

/// <summary>
/// Converts a classic .NET format string into a StandardFormat
/// </summary>
public static StandardFormat Parse(string format) => format == null ? default : Parse(SpanExtensions.AsReadOnlySpan(format)); //@todo: Change back to extension syntax once the ambiguous reference with CoreLib is eliminated.
Copy link
Member

@ahsonkhan ahsonkhan Nov 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain why this is causing an ambiguous reference?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand the issue. If there is an ambiguous reference, how will it be eliminated?

We reference CoreLib and then the extension methods forward to them.
https://github.com/dotnet/corefx/blob/master/src/System.Memory/src/System/SpanExtensions.Fast.cs#L48

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One extension method is defined on SpanExtensions, the other on Span. Presumably, the fix is to fix CoreLib - either by making its own copy of the extension method internal or by moving it to the correct type. It's out of the scope of this PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or perhaps the CoreLib AsReadOnlySpan shouldn't be an extension method.

byte precision = Precision;
if (precision != NoPrecision)
{
if (precision >= 100)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this considered valid if MaxPrecision is 99? I thought MaxLength would be 3, not 4.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToString() cannot throw exceptions and there's no way to prevent a mischievous app from hand-crafting a struct with illegal values in it. So yes, it has to be robust in this situation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding this argument. If you corrupt the data structure with unsafe code or the equivalent, then yeah, this might give you an invalid answer, e.g. 23 instead of 123. That's fine. It won't corrupt anything, it'll just give an invalid answer for invalid input.

else
{
uint parsedPrecision = 0;
for (int srcIndex = 1; srcIndex < format.Length; srcIndex++)
Copy link
Member

@ahsonkhan ahsonkhan Nov 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows for parsing of strings like "D000...0009". Is that desired behavior? Should we strictly only accept [char][1-9][0-9]?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not - the regular BCL parsing routines allow leading zeros in their format string.

@@ -10,7 +10,7 @@ public static partial class Utf8Formatter
/// Formats a Boolean as a UTF8 string.
/// </summary>
/// <param name="value">Value to format</param>
/// <param name="buffer">Buffer to receive UTF8 string</param>
/// <param name="buffer">Buffer to write the UTF8-formatted value to</param>
/// <param name="bytesWritten">Receives the length of the formatted text in bytes</param>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same nit about using "receive" in the xml comments applies to other parameters, like bytesWritten.

@@ -78,7 +78,7 @@ public static partial class Utf8Formatter
/// Formats a DateTimeOffset as a UTF8 string.
/// </summary>
/// <param name="value">Value to format</param>
/// <param name="buffer">Buffer to receive UTF8 string</param>
/// <param name="buffer">Buffer to write the UTF8-formatted value to</param>
/// <param name="bytesWritten">Receives the length of the formatted text in bytes</param>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here and elsewhere.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't block on this: feel free to submit a PR for the XML docs - probably faster if you do it with the wording you feel works best.

@ghost ghost merged commit d9924a5 into dotnet:master Nov 7, 2017
@ghost ghost deleted the stp branch November 7, 2017 23:40
@@ -0,0 +1,54 @@
// Licensed to the .NET Foundation under one or more agreements.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in corefx/src/Common/ instead of System.Memory/Common/...?

Copy link
Author

@ghost ghost Nov 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the concept is a bit too icky to advertise broadly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atsushikan, would you be ok with me moving this file within the src directory?

So, from System.Memory/Common/src/System/MutableDecimal.cs to System.Memory/src/System/MutableDecimal.cs? It seems strange to put this in a "Commons" directory and yet it is only accessed within System.Memory. At the very least, I think we should move it inside System.Memory/src, if we want to keep it in Common. Right now, System.Memory is the only one with a folder other than pkg/ref/src/tests at the root.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also weird for a test project to reach into a sibling src directory. If the current setup is really objectionable, then go ahead and move it to the global Common directory. We have other narrowly-targeted stuff up there so this isn't that different.

using System.Runtime.InteropServices;
using System.Runtime.CompilerServices;

namespace System
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these internal Number structs have to be in the System namespace or can they be moved to System.Buffers.Text?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's internal so namespace doesn't strictly matter but I'd also prefer not to introduce unnecessary diff's between these and the originals in CoreRT.

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Produce Utf8Parser and Utf8Formatter

Fixes https://github.com/dotnet/corefx/issues/24607

Remaining debt (cut for time):

Parsing Intgers with the "N" format

  https://github.com/dotnet/corefx/issues/24986

  Some questions to be resolved as to whether to be compatible
  (BCL doesn't care where you put the commas) or correct.

Format of floating point is still a wrapper hack

  https://github.com/dotnet/corefx/issues/25077

  The portable DoubleToNumber() code was never ported
  to C# (though the big block comment advertising it
  was).

* PR feedback.

- Move StandardFormat to System.Buffers

- Mark it readonly

- Fix spelling: "Seperator"

- Deduplicate namespace in ref .cs

* PR feedback.

- Assert that a culture-invariant ToString() on double produced ASCII characters only

- Add magic literal comments for the 4-byte compares in DateTimeOffset parsing.

- More "seperator" vs. "separator"

- Rename formatter benchmarks to be you know, "formatter"-like.

* PR feedback.

- Improve perf of long.MinValue path in TryFormat(long)

- Make TryParseDateTimeOffset compare exact casing for Rfx1123 formats ('R' and 'l')

* PR feedback.

- Lots of small items in the last round of feedback.

* PR feedback (ThrowHelper, AllocHelper, random easy stuff)

* PR feedback (StandardFormat.Parse/ToString())

Removed the extra allocations from this path
(though I still can't imagine anyone who cares
enough about perf to use this parser wanting
to use these apis) and addressed the outstanding
feedback surrounding these.

Removed the 1.2% false positive noise
from Utf8Parser code coverage. It's now at 100%.

* Replace 'buffer' text in XML docs


Commit migrated from dotnet/corefx@d9924a5
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
8 participants