Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce allocations in VersionConverter #55564

Merged

Conversation

N0D4N
Copy link
Contributor

@N0D4N N0D4N commented Jul 13, 2021

Resolves #55179

I didn't find any info on creating custom benchmarks to measure performance of VersionConverter for current implementation of .NET runtime and custom build with proposed changes, so I would much appreciate if you can point to docs on how to do it, so I can create and run benchmark and attach its results here.

Few notes on implementation:

  • In proposed version, converter currently doesn't check for reader.TokenType being string, probably we can add this check, the only problem is ThrowHelper.GetInvalidOperationException_ExpectedString returns exception instead of throwing it, so we would have throw statement directly in Read method, but i guess it's fine since this method is overriden and won't be inlined anyway;
  • Unlike current implementation of Write method, proposed one is more fragile to changes inside Version class, since it needs to know what size of buffer it needs to allocate, ans therefore is more dependent on concrete implementation like, Version only having 4 components and components being Int32, so probably additional Debug.Assert's can be added to make sure they would fail in case of breaking changes inside Version class, but unfortunately I can not think of any.

@ghost
Copy link

ghost commented Jul 13, 2021

Tagging subscribers to this area: @eiriktsarpalis, @layomia
See info in area-owners.md if you want to be subscribed.

Issue Details

Resolves #55179

I didn't find any info on creating custom benchmarks to measure performance of VersionConverter for current implementation of .NET runtime and custom build with proposed changes, so I would much appreciate if you can point to docs on how to do it, so I can create and run benchmark and attach its results here.

Few notes on implementation:

  • In proposed version, checking for version component being valid (be greater or equal to zero) is done before creating new Version object, and therefore its "fragile" to changes in implementation of Version, another approach would be pass all version components to constructor of Version, even if they are negative numbers, catch all exceptions and rethrow them as JsonException.
    Also, in this case we can rent buffer from ArrayPool if reader has ValueSequence instead of calling ToArray on it to remove allocations completely, and return pooled buffer to ArrayPool in finally block, since we would already be having overhead of try/catch block;
  • In proposed version, converter currently doesn't check for reader.TokenType being string, probably we can add this check, the only problem is ThrowHelper.GetInvalidOperationException_ExpectedString returns exception instead of throwing it, so we would have throw statement directly in Read method, but i guess it's fine since this method is overriden and won't be inlined anyway;
  • Unlike current implementation this one if more fragile to changes inside Version class, since old implementation relies on common Parse(string) and ToString() methods, which determine correctness of input and output inside Version class, and proposed implementation is more dependent on concrete details like, Version only having 4 components and components being Int32, so probably additional Debug.Assert's can be added to make sure they would fail in case of breaking changes inside Version class, but unfortunately I can not think of any.
Author: N0D4N
Assignees: -
Labels:

area-System.Text.Json

Milestone: -

@dnfadmin
Copy link

dnfadmin commented Jul 13, 2021

CLA assistant check
All CLA requirements met.

int maxCharCount = JsonReaderHelper.s_utf8Encoding.GetMaxCharCount(source.Length);
char[]? pooledChars = null;
Span<char> charBuffer = maxCharCount * sizeof(char) <= JsonConstants.StackallocThreshold
? stackalloc char[maxCharCount]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the guidance is to just use JsonConstants.StackallocThreshold for the size as the jitter can optimize when calling the method.

cc @GrabYourPitchforks

Copy link
Contributor Author

@N0D4N N0D4N Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case Encoding.GetChars() can write not complete bytes source to char buffer. Indeed, max length of string representation of valid Version object is 43 chars aka 86 bytes which is always lower than current JsonConstants.StackallocThreshold value.
But if we want to get string representation only of valid Versions be successfully converted to chars, shouldn't we then just use max length of string of valid Version object * sizeof(char) (86 bytes), and so we will stackalloc even less than JsonConstants.StackallocThreshold

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you know that the maximum length is x bytes or chars for any specific data type, it's fine to hardcode that value here. It'll result in somewhat more efficient stack usage than relying on the fallback const.

Copy link
Contributor Author

@N0D4N N0D4N Jul 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only know that maximum length of string of valid Version coming from CLR is 43 chars, but incoming data can be of any length or it can have whitespaces as pointed below, and potentially can cause exception when getting chars from encoding in case when char buffer cant hold all data from byte source. In case of netstandard2.0 target such exception wont be thrown since buffer will be allocated accounting not max length of version but max char count coming from byte source.

string source = $"{int.MaxValue}.{int.MaxValue}.{int.MaxValue}.{int.MaxValue}";
var s_utf8Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
byte[] ut8Bytes = s_utf8Encoding.GetBytes(source);
Span<char> chars = stackalloc char[source.Length - 1]; // Buffer too small to contain all incoming data
s_utf8Encoding.GetChars(ut8Bytes, chars);
//The line above fails with unhandled exception
//Unhandled exception. System.ArgumentException:
//The output char buffer is too small to contain the decoded characters,
//encoding 'Unicode (UTF-8)' fallback 'System.Text.DecoderExceptionFallback'. (Parameter 'chars')


#if BUILDING_INBOX_LIBRARY

ReadOnlySpan<byte> source = reader.HasValueSequence ? reader.ValueSequence.ToArray() : reader.ValueSpan;
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're accessing the raw bytes and performing the conversion to chars yourself, this could introduce a security vulnerability because it could allow invalid JSON through the system. Version.TryParse accepts a wider range of characters than the JSON spec allows.

@layomia - You'll recognize this as the same feedback I gave for the recent TimeSpan optimizations. This is now the second PR I've seen that uses this pattern.

Layomi, please schedule some time to audit all the optimizations that were checked in as part of the 6.0 wave and ensure this type of pattern was not actually committed anywhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version.TryParse accepts a wider range of characters than the JSON spec allows

Example? Is this about whitespace?

Copy link
Member

@GrabYourPitchforks GrabYourPitchforks Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just whitespace, but nulls and other weirdness.

string s = "\n10\0.\v20\t\0\0.\r\n30\0"; // lots of invalid JSON
Version v = new Version(s);
Console.WriteLine(v); // prints 10.20.30

Edit: It's very much in the same spirit of the warning comment I put on string.ReplaceLineEndings. Methods like int.Parse, Version.Parse, and others are meant to parse human-readable representations of these values, not protocol-compliant representations of these values. It's up to the caller to ensure that the payload follows the requirements of whatever protocol is being parsed. The earlier version of this code forced the creation of a string, which meant it would ride atop the JSON stack's existing implicit validation logic. The new version bypasses this, going down to the raw bytes, which means that the caller (this routine) is now also taking responsibility for protocol validation.

Copy link
Member

@stephentoub stephentoub Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just whitespace, but nulls and other weirdness

This is what I meant by whitespace. Int32.{Try}Parse by default allows for a subset of what char.IsWhitespace considers valid, and also ignores trailing nulls. Version.TryParse just inherits that.

What, specifically, is the security vulnerability concern with not having validation fail because of such characters? e.g. how would that manifest as an attack?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We make a guarantee that using default deserialization parameters and built-in converters, the incoming JSON payload is fully validated before a full object graph can be hydrated. This guarantee exists to provide support for heterogeneous environments. For instance: you have a frontend performing an initial pass over the input data, including basic consistency checks and request authentication; then that payload is forwarded to a backend for real processing. The frontend should not allow non-compliant data to reach the backend, as the backend might not be resilient against processing that data, or it might store the data in a manner that corrupts the database. (There were vulnerabilities in DasBlog, aspnet 4.x, and Exchange due to two parsers in a heterogeneous environment interpreting the same payload in different manners.)

One quasi-by-design hole in the System.Text.Json stack is that string data is not validated until the string itself is materialized. For most parsers and converters this behavior is just fine, as they get the string (which validates!), then parse it to create the final result. But in the case of hyper-optimized converters like the one under discussion here, they're dropping down to the raw byte level to get at the underlying data and avoid string materialization, which can suppress the normal correctness guarantees we make for the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In proposed version i added failing fast if string has any escaping.
If string data with escaping is still valid Version - then this failing should be removed, and replaced with unescaping in case if string has any escaping. This way byte data will basically follow the same path as when you call reader.GetString except calling Encoding.GetChars instead of Encoding.GetString in the end. But I can't reproduce validation on materialization that string does, that you are talking about, with the example invalid json you presented earlier.

string source = "\n10\0.\v20\t\0\0.\r\n30\0"; // lots of invalid JSON
var s_utf8Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
byte[] ut8Bytes = s_utf8Encoding.GetBytes(source);
string convertedString = s_utf8Encoding.GetString(ut8Bytes);
Span<char> chars = stackalloc char[source.Length];
s_utf8Encoding.GetChars(ut8Bytes, chars);
Console.WriteLine(chars.SequenceEqual(convertedString) && chars.SequenceEqual(source)); // Prints true
Version versionFromSource = new Version(source);
Version versionFromConvertedStr = new Version(convertedString);
Version versionFromChars = Version.Parse(chars);
Console.WriteLine(versionFromSource.Equals(versionFromConvertedStr) && versionFromSource.Equals(versionFromChars) 
                    && versionFromConvertedStr.Equals(versionFromChars)); // Prints true

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GrabYourPitchforks, thanks, I get all that, but I'm not clear how that's applicable to this specific case. We're not talking about taking unvalidated json and storing it somewhere that expects validated json; we're talking about ending up with a well-formed Version object that discarded the unnecessary whitespace/nulls. I'm not saying that's not divergent from the json spec, but I'm also not seeing how it results in a vulnerability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, we're taking the approach of disallowing all leading or trailing trivia across all the primitive types that we support across the STJ stack. This involves implementing all the necessary validation prior to calling an underlying lower-level API that does the actual data parsing, e.g. Utf8Parser. This has been a manual process repeated for each supported data type, and we'll do the same for this PR.

please schedule some time to audit all the optimizations that were checked in as part of the 6.0 wave

@GrabYourPitchforks the TimeSpan PRs (#54186, #55350) and this PR are the only candidates for .NET 6 where we process data types at this level so I think these code reviews might suffice.

Here are some somewhat related PRs in case something jumps out:

@terrajobst terrajobst added the community-contribution Indicates that the PR has been added by a community member label Jul 19, 2021
Copy link
Contributor

@layomia layomia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. The implementation needs to support parsing escaped data.

namespace System.Text.Json.Serialization.Converters
{
internal sealed class VersionConverter : JsonConverter<Version>
{
public override Version Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options)
{
if (reader._stringHasEscaping)
{
ThrowHelper.ThrowJsonException();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to parse escaped data for all data types. You could use #54186 (comment) and commits on that PR (as well as #55350) as a reference

Copy link
Contributor Author

@N0D4N N0D4N Jul 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this in 84ab81a. Also added checks if first and last chars are digits, and ported this checks to netstandard2.0


#if BUILDING_INBOX_LIBRARY

ReadOnlySpan<byte> source = reader.HasValueSequence ? reader.ValueSequence.ToArray() : reader.ValueSpan;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, we're taking the approach of disallowing all leading or trailing trivia across all the primitive types that we support across the STJ stack. This involves implementing all the necessary validation prior to calling an underlying lower-level API that does the actual data parsing, e.g. Utf8Parser. This has been a manual process repeated for each supported data type, and we'll do the same for this PR.

please schedule some time to audit all the optimizations that were checked in as part of the 6.0 wave

@GrabYourPitchforks the TimeSpan PRs (#54186, #55350) and this PR are the only candidates for .NET 6 where we process data types at this level so I think these code reviews might suffice.

Here are some somewhat related PRs in case something jumps out:

@N0D4N N0D4N force-pushed the #55179/reduce-allocations-in-version-converter branch from 07b6a9a to 84ab81a Compare July 20, 2021 11:38
@N0D4N N0D4N requested a review from layomia August 1, 2021 08:29
@eiriktsarpalis eiriktsarpalis self-assigned this Aug 9, 2021
@eiriktsarpalis
Copy link
Member

Hey @N0D4N, we're going to move this to 7.0.0 if that's ok. I expect to circle back and review this after we are done with pending 6.0.0 work.

@eiriktsarpalis eiriktsarpalis added this to the 7.0.0 milestone Aug 9, 2021
@N0D4N
Copy link
Contributor Author

N0D4N commented Aug 9, 2021

@eiriktsarpalis, It's alright with me. The only issue i see - later we would need to make sure that behavior of refactored converter's Deserialize method would be the same as publicly shipped behavior. Current implementation allows trailing, leading and whitespaces between digits, like 1. 1.1 , which isn't desired behavior, if i understood #55564 (comment) correctly, so we would need to make sure that we won't break the consumers code later on, since some of them can be locked on such behavior.

@eiriktsarpalis
Copy link
Member

Hi @N0D4N, main is now the .NET 7 development branch, so we should be able resume review. Would you be able to take a look at the PR feedback?

@N0D4N
Copy link
Contributor Author

N0D4N commented Sep 20, 2021

Hi @eiriktsarpalis. Can you be more specific what PR feedback you are talking about? Since I'm pretty sure that I've dealt with all PR feedback:

Though I'd need to update PR branch with commits from main to make sure that everything still works. Currently failed tests on CI are unrelated I believe.
Also can you please take a look at my comment #55564 (comment) and say if it is alright if Version in JSON has leading and trailing whitespace(s) since .NET 6 will be shipped with such behavior, and if it's desired behavior i would update checks so first and last char could possibly be not only digits, but also a whitespace.

@eiriktsarpalis
Copy link
Member

@layomia would you be able to take another look? @N0D4N would it be possible to rebase your changes on top of latest main? I can also do it for you if you wish.

throw ThrowHelper.GetFormatException(DataType.Version);
}

Span<byte> stackSpan = stackalloc byte[isEscaped ? MaximumEscapedVersionLength : MaximumVersionLength];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could maxLength be used here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed same approach as in TimeSpanConverter

int maximumLength = isEscaped ? MaximumEscapedTimeSpanFormatLength : MaximumTimeSpanFormatLength;
ReadOnlySpan<byte> source = stackalloc byte[0];
if (reader.HasValueSequence)
{
ReadOnlySequence<byte> valueSequence = reader.ValueSequence;
long sequenceLength = valueSequence.Length;
if (!JsonHelpers.IsInRangeInclusive(sequenceLength, MinimumTimeSpanFormatLength, maximumLength))
{
throw ThrowHelper.GetFormatException(DataType.TimeSpan);
}
Span<byte> stackSpan = stackalloc byte[isEscaped ? MaximumEscapedTimeSpanFormatLength : MaximumTimeSpanFormatLength];

As far as i know JIT can optimize stackallocing const amount of bytes, but i'm not sure if it can optimize with maxLength, since it can be only one of two const values, but by itself it isn't const.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @GrabYourPitchforks do you reckon the JIT could be more optimal with the second pattern (N) here? maxLength could only be a const value.

@N0D4N N0D4N force-pushed the #55179/reduce-allocations-in-version-converter branch from d861306 to 11efcbd Compare September 21, 2021 07:22
@N0D4N
Copy link
Contributor Author

N0D4N commented Sep 21, 2021

Rebased on top of main
@eiriktsarpalis @layomia can you clarify if VersionConverter should be able to convert data with leading/trailing whitespaces or it should throw exception?
Because if it should be able to convert i would need to update this PR by adding additional checks and move few test cases where VersionConverter should throw to where it should be able to convert successfully.
And if it should throw, then fix should be done to .NET 6 before it's stable release, since current implementation in .NET 6, which this PR pretty much replaces, allows leading/trailing whitespaces.

@layomia layomia added the breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. label Sep 24, 2021
@ghost ghost added the needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet label Sep 24, 2021
@ghost
Copy link

ghost commented Sep 24, 2021

Added needs-breaking-change-doc-created label because this PR has the breaking-change label.

When you commit this breaking change:

  1. Create and link to this PR and the issue a matching issue in the dotnet/docs repo using the breaking change documentation template, then remove this needs-breaking-change-doc-created label.
  2. Ask a committer to mail the .NET Breaking Change Notification DL.

Tagging @dotnet/compat for awareness of the breaking change.

@layomia
Copy link
Contributor

layomia commented Sep 24, 2021

@N0D4N we want to throw when we have leading/trailing white space, just like with DateTime, Guid, and related STJ support. Since this would break folks who are dependent on lenient behavior for whitespace, it is better to take the change early in .NET 7 (now) so we can watch out for feedback (to revert if necessary), rather than port the change to 6.0 now and potentially cause noise.

@layomia
Copy link
Contributor

layomia commented Sep 24, 2021

cc @dotnet/area-system-text-json thoughts on the breaking change highlighted in #55564 (comment)? I support it since it's an edge case, it's early in the release, and there's an easy workaround (custom converter with the previous implementation).

@steveharter
Copy link
Member

cc @dotnet/area-system-text-json thoughts on the breaking change highlighted in #55564 (comment)? I support it since it's an edge case, it's early in the release, and there's an easy workaround to get around it (custom converter with the previous implementation).

I favor consistency here, so if we are throwing in the majority of places already then we should throw here as well. Also, from a perf perspective, it is faster to write a parser that doesn't have to check for such whitespace, so I think that should be the default at least.

Although technically breaking, I think it would be very rare to have leading\trailing whitespace within the quotes. The work-around is as mentioned is to add a custom converter that allows it.

@eiriktsarpalis
Copy link
Member

@layomia I'd support it as well.

@N0D4N
Copy link
Contributor Author

N0D4N commented Sep 28, 2021

Should i open an issue at dotnet/docs that will describe breaking change in this PR?

@layomia layomia merged commit cb78a89 into dotnet:main Sep 28, 2021
@layomia
Copy link
Contributor

layomia commented Sep 28, 2021

@N0D4N feel free to add to dotnet/docs#26292. I've also mailed the .NET Breaking Change Notification DL.

@N0D4N N0D4N deleted the #55179/reduce-allocations-in-version-converter branch September 28, 2021 17:13
@N0D4N
Copy link
Contributor Author

N0D4N commented Sep 28, 2021

Thanks for your feedback, everyone!

@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
@eiriktsarpalis eiriktsarpalis removed the needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet label Sep 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.Json breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove string allocations in VersionConverter in System.Text.Json
9 participants