Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-allocating view on string properties values of JSON in System.Text.Json.Utf8JsonReader #54410

Closed
Tracked by #63762
N0D4N opened this issue Jun 18, 2021 · 24 comments · Fixed by #69580
Closed
Tracked by #63762
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Json Cost:M Work that requires one engineer up to 2 weeks Priority:2 Work that is important, but not critical for the release Team:Libraries User Story A single user-facing feature. Can be grouped under an epic.
Milestone

Comments

@N0D4N
Copy link
Contributor

N0D4N commented Jun 18, 2021

Background and Motivation

Currently there is no way to get view on string property value of JSON without allocating string, except cases when string property is in fact number, DateTime or anything that System.Buffers.Text.Utf8Parser supports.
But many converters even inside System.Text.Json need string representation of string property only to parse it and don't use anywhere further, for example, such converters are:

Non-internally non-allocating view on string properties can be used for creating custom StringConverter which will be using custom StringPool for example, which will operate on small set of strings but not known at compile time.
My proposal is to add methods to Utf8JsonReader which will accept buffer of chars where value of string property will be written to.

Proposed API

namespace System.Text.Json
{
    public ref partial struct Utf8JsonReader
    {
        /* Existing APIs */
        public ReadOnlySpan<byte> ValueSpan { get; }
        public ReadOnlySequence<byte> ValueSequence { get; }

        public bool ValueIsEscaped { get; } // Whether the JSON string contains escaped characters
        public bool HasValueSequence { get; } // The string can either be stored in a span or a ReadOnlySequence
        
        public string? GetString(); // How we currently decode JSON strings

        /* Proposed new APIs */
        public void GetString(scoped Span<byte> utf8Destination, out int bytesWritten);
        public void GetString(scoped Span<char> destination, out int charsWritten);
    }

    public partial class JsonEncodedText
    {
         public static void Unescape(ReadOnlySpan<byte> utf8Value, Span<byte> utf8Destination, out int bytesWritten);
    }
}

Usage Examples

Get an allocation-free view of the unescaped UTF8 string

Span<byte> buffer = stackalloc byte[SomeUpperBound];
reader.GetString(buffer, out int bytesWritten); // handles both ValueSpan and ValueSequence representations, 
                                                // throws if source buffer length exceeds that of the target buffer.
ReadOnlySpan<byte> unescapedUtf8Value = buffer.Slice(0, bytesWritten);

Handling of ValueSpan representations only:

Debug.Assert(!reader.HasValueSequence);

ReadOnlySpan<byte> unescapedBuffer = stackalloc byte[0];
if (reader.ValueIsEscaped)
{
      Span<byte> buffer = stackalloc byte[SomeUpperBound];
      JsonEncodedText.Unescape(reader.ValueSpan, buffer, out int bytesWritten);
      unescapedBuffer = intermediate.Slice(0, bytesWritten);
}
else
{
      // avoid copying to an intermediate buffer if escaping is not needed
      unescapedBuffer = reader.ValueSpan;
}

Copying to char buffers

char[] buffer = ArrayPool<char>.Rent(maxLength);

// buffer length needs to be at least as long as reader.ValueSpan/ValueSequence to succeed
reader.GetString(buffer, out int charsWritten);

// consume & return the buffer as usual

Alternative Designs

Can't think of any.

Risks

Name GetChars can be confusing for some users, maybe there can be other, better fit for such method?

Notes

What should happen in case when provided buffer is not of sufficient length? Should exception be thrown or buffer should be written to max, and when its capacity is full method should return?

@N0D4N N0D4N added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Jun 18, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Text.Json untriaged New issue has not been triaged by the area owner labels Jun 18, 2021
@ghost
Copy link

ghost commented Jun 18, 2021

Tagging subscribers to this area: @eiriktsarpalis, @layomia
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and Motivation

Currently there is no way to get view on string property value of JSON without allocating string, except cases when string property is in fact number, DateTime or anything that System.Buffers.Text.Utf8Parser supports.
But many converters even inside System.Text.Json need string representation of string property only to parse it and don't use anywhere further, for example, such converters are:

Non-internally non-allocating view on string properties can be used for creating custom StringConverter which will be using custom StringPool for example, which will operate on small set of strings but not known at compile time.
My proposal is to add methods to Utf8JsonReader which will accept buffer of chars where value of string property will be written to.

Proposed API

namespace System.Text.Json
{
    public ref partial struct Utf8JsonReader
    {
+     public void GetChars(Span<char> buffer, out int written){}
+     public void GetChars(char[] buffer, out int written){} // Overload for convenience 
+     public bool TryGetChars(Span<char> buffer, out int written){}  
+     public bool TryGetChars(char[] buffer, out int written){} // Overload for convenience

//     Optional
+     public char[] GetChars(){} // will allocate buffer inside itself
+     public bool TryGetChars(out char[] result){} // will allocate buffer inside itself
    }
}

Usage Examples

// somehow consumer of API need to figure out length of buffer
// may be it somehow can be deducted from Utf8JsonReader.ValueSpan / Utf8JsonReader.ValueSequence lengthes?
Span<char> buffer = stackalloc char[length];

// OR

char[] buffer = ArrayPool<char>.Rent(length);

reader.GetChars(buffer/*.AsSpan()*/, out int written);

// Case with parsing
var classInstance = ClassThatSupportsParsingFromSpanOfChars.Parse(buffer.Slice(0, written);
// Optionally return buffer to ArrayPool<char>
return classInstance;

// OR

// Case with `StringPool` from Microsoft.Toolkit.HighPerformance package 
// (https://github.com/windows-toolkit/WindowsCommunityToolkit/blob/main/Microsoft.Toolkit.HighPerformance)
string pooledNonAllocatedString = StringPool.Shared.GetOrAdd(buffer.Slice(0, written));
// Optionally return buffer to ArrayPool<char>
return pooledNonAllocatedString;

Alternative Designs

Can't think of any.

Risks

Name GetChars can be confusing for some users, maybe there can be other, better fit for such method?

Notes

What should happen in case when provided buffer is not of sufficient length? Should exception be thrown or buffer should be written to max, and when its capacity is full method should return?

Author: N0D4N
Assignees: -
Labels:

api-suggestion, area-System.Text.Json, untriaged

Milestone: -

@eiriktsarpalis
Copy link
Member

This is already possible using the ValueSpan and ValueSequence properties:

Utf8JsonReader reader = ...;
char[] rentedBuffer;
int charLength;

if (reader.HasValueSequence)
{
    ReadOnlySequence<byte> bytes = reader.ValueSequence;
    int maxCharLength = Encoding.UTF8.GetMaxCharCount(checked((int)bytes.Length));
    rentedBuffer = ArrayPool<char>.Shared.Rent(maxCharLength);
    charLength = Encoding.UTF8.GetChars(bytes, rentedBuffer);
}
else
{
    ReadOnlySpan<byte> bytes = reader.ValueSpan;
    int maxCharLength = Encoding.UTF8.GetMaxCharCount(bytes.Length);
    rentedBuffer = ArrayPool<char>.Shared.Rent(maxCharLength);
    charLength = Encoding.UTF8.GetChars(bytes, rentedBuffer);
}

Span<char> copiedChars = rentedBuffer.AsSpan(0, charLength);
try
{
    // consume the chars
}
finally
{
    copiedChars.Clear();
    ArrayPool<char>.Shared.Return(rentedBuffer);
}

I'm not sure how the proposed methods might simplify the above. As you have already mentioned allocating the right char buffer size would still require accessing the raw bytes and decoding the character length. One might also argue that Span<char> might not be the appropriate target type when the source uses ReadOnlySequence, and an IBufferWriter<char> is preferable.

@huoyaoyuan
Copy link
Member

This is already possible using the ValueSpan and ValueSequence properties

ValueSpan does not handle escapes, right?

@N0D4N
Copy link
Contributor Author

N0D4N commented Jun 18, 2021

This is already possible using the ValueSpan and ValueSequence properties

As @huoyaoyuan noticed, current implementation of Utf8JsonReader.GetString() method uses public method from internal class JsonReaderHelper.GetUnescapedString(ReadOnlySpan<byte> utf8Source, int idx) used for unescaping bytes, so i assume Encoding.UTF8.GetChars(ReadOnlySpan<byte> bytes, Span<char> chars) doesn't do that, so in case of Utf8JsonReader reading bytes that need unescaping your approach will fail, I assume, however I am not sure 100%.
So either JsonReaderHelper.GetUnescapedString method needs to be made public so it can be used by consumers, together with _stringHasEscaping field of Utf8JsonReader, which is also used by GetString() method, to determine if string property value has escaping and don't do this computation twice.
Or they can be left as is, and proposed above API can be added.

As you have already mentioned allocating the right char buffer size would still require accessing the raw bytes and decoding the character length.

In case of EnumConverter it seems to already enumerate all Enums values inside constructor so it can determine max length that string representation of enum value can have, store it and then allocate buffer of such size, indeed, EnumConverter instances will add in size needed to store such length, but i think it isn't outweighing the benefits of non allocating string every time you need to parse Enum from json not by numeric representation, but by text representation.
CharConverter can allocate buffer of size 1 obviously.
I am not sure about how big length of Version string can be but I am pretty sure it has some limit defined at compile time.
In case of non-internal consumption of proposed API, consumer needs to know max length of values that can come in JSON and create buffer accordingly to such knowledge, but i guess it's fine since proposed API is pretty niche, in my opinion.

One might also argue that Span might not be the appropriate target type when the source uses ReadOnlySequence, and an IBufferWriter is preferable.

I don't know much IBufferWriter<char> and its specifics, and that's why i proposed used Spans, that are widespread i think.
Also GetString() method internally calls .ToArray() in case when source uses ReadOnlySequence so i guess its fine, but I am more then welcome to have overload for IBufferWriter<char> OR make GetChars method accept only IBufferWriter<char> as long as it will be simple and obvious to use.

@N0D4N
Copy link
Contributor Author

N0D4N commented Jun 18, 2021

I've looked into IBufferWriter<T> and it seems, its "default" implementation is ArrayBufferWriter<T> which is class and allocates array inside itself, which kinda kills the idea of proposed API (no/less allocations), PipeWriter is another implementation of IBufferWriter<T> actually implements IBufferWriter<byte> so it can't be used in our case, since we need the buffer of chars.
I've also looked into ReadOnlySequence<byte> and it seems it can be enumerated and its segments can be escaped/transcoded/etc and easily written to Span<char>/char[] without any need for IBufferWriter<char>, and without need to call .ToArray() on it, like current implementation of GetString() does.
Probably another implementation of IBufferWriter<char> can be created but I can't see the point of it.

@eiriktsarpalis
Copy link
Member

This is already possible using the ValueSpan and ValueSequence properties

ValueSpan does not handle escapes, right?

Right, it does not.

Probably another implementation of IBufferWriter can be created but I can't see the point of it.

I suppose it addresses the theoretical concern that the source ReadOnlySequence<byte> is too large to fit in an array. In my example above it will materialize as an overflow exception.

@N0D4N
Copy link
Contributor Author

N0D4N commented Jun 18, 2021

I suppose it addresses the theoretical concern that the source ReadOnlySequence is too large to fit in an array. In my example above it will materialize as an overflow exception.

I guess in such cases ReadOnlySequence<byte> can be enumerated, its content, can be unescaped/transcoded and then written to buffer with checking for buffer size. In case when buffer is filled, but there still is content inside ValueSequence, method should return or throw exception, as I've written inside Notes section of issue.
In my opinion proposed API isn't supposed to work with very big strings, so either of above mentioned approaches is fine.
For future-proofing custom implementation of IBufferWriter<char>, which will incapsulate Span<char>/char[] inside, is fine, problem is it won't be able to have Span<char> as field, since Span<T> is ref struct and therefore can only be a field inside another ref structs and ref struct can't implement interfaces.
So overloads of GetChars method couldn't make non-allocating transformations on passed parameters, that will be passed to another overload with actual implementation, so method would need two implementations one for Span<char>/char[] and one for IBufferWriter<char>. And i think it's pretty important to have overload that accepts Span<char>, since it's the only easy way to stackalloc at the moment.

@eiriktsarpalis eiriktsarpalis added api-needs-work API needs work before it is approved, it is NOT ready for implementation and removed untriaged New issue has not been triaged by the area owner labels Jul 16, 2021
@eiriktsarpalis eiriktsarpalis added this to the 7.0.0 milestone Jul 16, 2021
@eiriktsarpalis eiriktsarpalis added the Priority:1 Work that is critical for the release, but we could probably ship without label Oct 14, 2021
@eiriktsarpalis
Copy link
Member

We should also include an overload that accepts Span<byte> for unescaped Utf8 encoded strings.

See #1563 (comment)

@N0D4N
Copy link
Contributor Author

N0D4N commented Oct 25, 2021

We should also include an overload that accepts Span for unescaped Utf8 encoded strings.

See #1563 (comment)

Should i update proposal?

Also i thought of another usage for proposed API internally, it can replace usage of JsonReaderHelper.GetUnescapedSpan which allocates an array, in cases where we know max length of valid string value, in example such cases are JsonConverters for numbers, when they are converting value from string and are calling Get*WithQuotes which are internally calling GetUnescapedSpan, e. g. Utf8JsonReader.GetInt64WithQuotes

@eiriktsarpalis eiriktsarpalis removed the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Nov 16, 2021
@eiriktsarpalis eiriktsarpalis added User Story A single user-facing feature. Can be grouped under an epic. Cost:M Work that requires one engineer up to 2 weeks Priority:2 Work that is important, but not critical for the release Team:Libraries and removed Priority:1 Work that is critical for the release, but we could probably ship without labels Jan 13, 2022
@eiriktsarpalis
Copy link
Member

eiriktsarpalis commented May 11, 2022

The concern specifically raised in #23433 is set to be addressed in C# 11 via scoped parameter values. @jaredpar has indicated that it should be possible to add the scoped annotation post hoc, so we shouldn't be blocked on the language feature being shipped.

@eiriktsarpalis
Copy link
Member

eiriktsarpalis commented May 11, 2022

I'm proposing the following API for this change:

namespace System.Text.Json
{
    public ref partial struct Utf8JsonReader
    {
        public bool ValueIsEscaped { get; } // API already approved via https://github.com/dotnet/runtime/issues/45167

        public void GetString(scoped Span<byte> utf8Destination, out int bytesWritten);
        public void GetString(scoped Span<char> destination, out int charsWritten);
    }

    public partial class JsonEncodedText
    {
         public static void Unescape(ReadOnlySpan<byte> utf8Value, Span<byte> utf8Destination, out int bytesWritten);
    }
}

Additional notes:

cc @davidfowl

@N0D4N
Copy link
Contributor Author

N0D4N commented May 12, 2022

StringHasEscaping property was proposed and approved in #45167, so i believe it's out of scope of this proposal.

Shouldn't JsonEncodedText.Unescape be in some more general place for unescaping UTF-8, or is escaped UTF-8 text coming to .NET - JSON formatted?

Perhaps both GetString and Unescape should return some sort of enum describing status of operation, something like

enum JsonUnescapingResult
{
  Success = 0,
  TooSmallDestinationSize = 1
}

or is intended behavior is to throw exception is such cases, like Encoding.GetChars does?

// Decodes a range of bytes in a byte array into a range of characters in a
// character array. An exception occurs if the character array is not large
// enough to hold the complete decoding of the bytes. The
// GetCharCount method can be used to determine the exact number of
// characters that will be produced for a given range of bytes.
// Alternatively, the GetMaxCharCount method can be used to
// determine the maximum number of characters that will be produced for a
// given number of bytes, regardless of the actual byte values.
//
public abstract int GetChars(byte[] bytes, int byteIndex, int byteCount,
char[] chars, int charIndex);

@eiriktsarpalis
Copy link
Member

StringHasEscaping property was proposed and approved in #45167, so i believe it's out of scope of this proposal.

Ah excellent, I had forgotten about that one. I'm updating its milestone given that the two feature are related.

Shouldn't JsonEncodedText.Unescape be in some more general place for unescaping UTF-8, or is escaped UTF-8 text coming to .NET - JSON formatted?

My understanding is that this functionality specifically concerns unescaping JSON strings, and different rules might govern unescaping in different contexts.

or is intended behavior is to throw exception is such cases, like Encoding.GetChars does?

That's the idea. Currently we only check that the destination buffer is at least as large as the source buffer and throw if it isn't. I think it's good enough for most intents and purposes.

@eiriktsarpalis
Copy link
Member

@N0D4N I've updated your original post and I'm marking the issue as ready for review.

@eiriktsarpalis eiriktsarpalis added api-ready-for-review API is ready for review, it is NOT ready for implementation blocking Marks issues that we want to fast track in order to unblock other important work and removed api-needs-work API needs work before it is approved, it is NOT ready for implementation labels May 12, 2022
@terrajobst
Copy link
Member

terrajobst commented May 17, 2022

Video

  • GetString
    • We don't need scoped -- marking the methods readonly is sufficient
    • Should return the units written instead of providing them via out
    • Should be named CopyString
  • It seems we don't need JsonEncodedText.Unescape
    • CopyString already unescapes
namespace System.Text.Json;

public ref partial struct Utf8JsonReader
{
    // Existing APIs
    // public ReadOnlySpan<byte> ValueSpan { get; }
    // public ReadOnlySequence<byte> ValueSequence { get; }
    // 
    // public bool ValueIsEscaped { get; } // Whether the JSON string contains escaped characters
    // public bool HasValueSequence { get; } // The string can either be stored in a span or a ReadOnlySequence
    //
    // public string? GetString(); // How we currently decode JSON strings

    public readonly int CopyString(Span<byte> utf8Destination);
    public readonly int CopyString(Span<char> destination);
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels May 17, 2022
@jaredpar
Copy link
Member

jaredpar commented May 17, 2022

We don't need scoped -- marking the methods readonly is sufficient

readonly and scoped do very different things here. For example lacking scoped you cannot do the following:

void M(Utf8JsonReader reader) {
  Span<byte> span = stackalloc byte[42];
  reader.CopyString(span); // Error
}

If the definition is int CopyString(scoped Span<byte> dest) though that code works fine

@terrajobst
Copy link
Member

terrajobst commented May 17, 2022

@tannergooding tested this and concluded the compiler accepts this when CopyString() is marked as readonly (because the type couldn't assign it to any fields). Of course, this wouldn't work for cases where we need to assign unrelated fields, but it seems CopyString() would never have to do this.

@jaredpar
Copy link
Member

Ah that's right cause the in forces it to ref readonly and that falls outside the method args must match

@jaredpar
Copy link
Member

Is the policy then to essentially use scoped as little as possible? Essentially if it doesn't enable stackalloc scenarios then omit it?

In my mind it's better to always use scoped when the argument isn't captured. It's more declarative about the situation. Having the readonly here allow the scenario feels more indirect and subtle. I don't feel incredibly strongly about this but want to have my head in the right place when reviewing future APIs

@tannergooding
Copy link
Member

Is the policy then to essentially use scoped as little as possible? Essentially if it doesn't enable stackalloc scenarios then omit it?

This was a consideration around S.T.Json needing to target .NET Standard and the possibility (even if niche) around what to do if scoped doesn't ship in C# 11 (how would it impact the API surface, etc).

I'd agree it'd be better to explicitly declare it as scoped still to help enforce that its non-capturing, but we can do/consider that once the feature exists and is usable.

@jaredpar
Copy link
Member

This was a consideration around S.T.Json needing to target .NET Standard

This won't be an issue. The compiler synthesizes the metadata supporting scoped parameters and hence it should work on any target framework.

the possibility (even if niche) around what to do if scoped doesn't ship in C# 11 (how would it impact the API surface, etc).

Understood. It's a sensible concern.

I'd agree it'd be better to explicitly declare it as scoped still to help enforce that its non-capturing, but we can do/consider that once the feature exists and is usable

👍

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 19, 2022
@eiriktsarpalis eiriktsarpalis removed the blocking Marks issues that we want to fast track in order to unblock other important work label May 19, 2022
@terrajobst
Copy link
Member

terrajobst commented May 19, 2022

If the compiler synthesizes the attribute then I don't think they are any (real) .NET Standard concerns (assuming scoped is just a custom attribute on the parameter). Worst case, the consumer doesn't use C# 11 which means the caller can't use the API with a stack allocated span. The caller can, of course, upgrade to C# 11 to use this API this way.

While readonly would work here too I dislike the idea that we sometimes use scoped and sometimes readonly. For one, we wouldn't always be able to use it and secondary we aren't really training ourselves or our consumers to be explicit about the provided contract.

Said differently, I'd say whenever we see an API accepting a Span<T> on a ref struct we should ask ourselves should this be marked as scoped? I think that's more useful.

@jaredpar
Copy link
Member

, I'd say whenever we see an API accepting a Span on a ref struct we should ask ourselves should this be marked as scoped? I think that's more useful.

💯

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 24, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Json Cost:M Work that requires one engineer up to 2 weeks Priority:2 Work that is important, but not critical for the release Team:Libraries User Story A single user-facing feature. Can be grouped under an epic.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants