Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Utf8JsonReader.ValueIsEscaped property #45167

Closed
Tracked by #63762
ycrumeyrolle opened this issue Nov 24, 2020 · 19 comments · Fixed by #69580
Closed
Tracked by #63762

Add Utf8JsonReader.ValueIsEscaped property #45167

ycrumeyrolle opened this issue Nov 24, 2020 · 19 comments · Fixed by #69580
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Json help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@ycrumeyrolle
Copy link

The Utf8JsonReader struct has an internal field _stringHasEscaping. It is useful for knowing whether the current string value need to be unescaped or not.

Proposed API

namespace System.Text.Json
{
    public ref struct Utf8JsonReader {
+        public bool StringHasEscaping => _stringHasEscaping;
     }
}

Usage Examples

This might not be useful for standard usages of the Utf8JsonReader.
But exposing this field would avoid for low-level scenarios to search for backslash with reader.ValueSpan.IndexOf((byte)'\\') != -1.
A possible scenario is a JsonDocument-like class that that does not require all the parsing complexity, like a FlatJsonDocument for JSON documents with a depth of 1.

    // hypothetical structure that record the position of the token
    JsonTokenPosition tokenPosition =  new JsonTokenPosition
    {
        Position = reader.TokenStartIndex, 
        Legth = reader.ValueSpan.Length, 
        NeedEscaping = reader.StringHasEscaping
    };

Afterward :

    var span = utf8String.Slice(tokenPosition.Position, tokenPosition.Length);
    if (tokenPosition.NeedEscaping) // <= currently it require a call to span.IndexOf((byte)'\\') != -1
    {
        Unescape(span)
    }

It is currently used in JsonSerializer.cs and in JsonDocument.cs.

Risks

No risk identified.

@ycrumeyrolle ycrumeyrolle added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Nov 24, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Text.Json untriaged New issue has not been triaged by the area owner labels Nov 24, 2020
@layomia
Copy link
Contributor

layomia commented Dec 1, 2020

This could be done. We'd like to see concrete use-cases e.g. an app or library which is blocked on this functionality in order to prioritize something like this.

@layomia layomia removed the untriaged New issue has not been triaged by the area owner label Dec 1, 2020
@layomia layomia added this to the Future milestone Dec 1, 2020
@ycrumeyrolle
Copy link
Author

We have this case here
We parse in a similar way than the JsonDocument, but more specificaly to our use case.
In term of performance, the redundant call of the method .IndexOf() takes about 1% of the parsing time.

This is not a blocker, just an enhancement.

@eiriktsarpalis
Copy link
Member

eiriktsarpalis commented Oct 21, 2021

One use case might be authoring zero-allocation converters from JSON string values, like some of the ones we have inbox. However, that would require also potentially exposing the unescape helpers.

@ycrumeyrolle perhaps instead the API proposal in #54410 could serve to address your use case?

@ycrumeyrolle
Copy link
Author

Not really because I try to avoid to parse the data when not required. By using a zero-allocation converter, unescaping operation would take some time and I will not use the result except the length.
reader.ValueSpan.IndexOf((byte)'\\') != -1 will be faster than unescaping.

@eiriktsarpalis
Copy link
Member

I'd support exposing the property as proposed. @ahsonkhan @layomia @krwq any objections?

@krwq
Copy link
Member

krwq commented Oct 22, 2021

  1. If it's exposed I'd prefer we called it NeedsUnescaping/HasEncodedCharacters or similar
  2. I'd personally prefer we exposed utilities for how this would be used rather than this information directly (i.e. TryGetUnescapedValueSpan)

@eiriktsarpalis
Copy link
Member

I'd personally prefer we exposed utilities for how this would be used rather than this information directly (i.e. TryGetUnescapedValueSpan)

I tend to agree, but we're already exposing the raw escaped data directly via ValueSpan and ValueSequence. As such it becomes essential to supplement these with additional data for appropriate handling (as is the case with HasValueSequence currently)

@krwq krwq added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Oct 25, 2021
@terrajobst
Copy link
Member

terrajobst commented Oct 26, 2021

Video

  • It should return false when the current token isn't a string.
  • It would be beneficial to align the name with ValueSpan. This also solves the issue if we ever have any other token types that are escaped, so let's rename this to ValueIsEscaped
  • Do we need an Unescape API that allows unescaping in a non-allocating way. We can treat this a separate API request.
namespace System.Text.Json
{
    public partial ref struct Utf8JsonReader
    {
        public bool ValueIsEscaped { get; }
    }
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Oct 26, 2021
@ahsonkhan
Copy link
Member

I'd support exposing the property as proposed. @ahsonkhan @layomia @krwq any objections?

Nope. The conclusion looks good, based on the naming and behavioral discussion :)

@ahsonkhan
Copy link
Member

.... Although the name ValueIsEscaped might be interpreted by a caller that all the characters are escaped (or that all characters from some character set, like non-ASCII, are escaped). The actual semantics here are actually something like ValueContainsSomeEscapedCharacters, i.e. just that at least one character happened to be escaped.

@terrajobst Is that worth word-smithing some more? Is something like ValueContainsEscapedBytes too verbose and not worth the clarity?

@eiriktsarpalis
Copy link
Member

Is something like ValueContainsEscapedBytes too verbose and not worth the clarity?

The use case is somewhat esoteric so I wouldn't object to making it longer if it does improve clarity.

@eiriktsarpalis eiriktsarpalis added the help wanted [up-for-grabs] Good issue for external contributors label Oct 27, 2021
@krwq
Copy link
Member

krwq commented Oct 27, 2021

@ycrumeyrolle are you interested in adding the new API + tests?

@terrajobst
Copy link
Member

@ahsonkhan @eiriktsarpalis

Is something like ValueContainsEscapedBytes too verbose and not worth the clarity?

The use case is somewhat esoteric so I wouldn't object to making it longer if it does improve clarity.

Keep in mind that clarity in naming has a trade-off: the more specific you are in describing the operation, the more you constrain yourself in what you can do moving forward without violating the promise the name is making. I don't have a good intuition here, but I generally agree with @eiriktsarpalis' point regarding this being advanced enough that making the name longer is generally not a bad thing.

@ycrumeyrolle
Copy link
Author

@ycrumeyrolle are you interested in adding the new API + tests?

Ok!

@eiriktsarpalis eiriktsarpalis assigned ycrumeyrolle and unassigned krwq Oct 29, 2021
@ahsonkhan
Copy link
Member

ahsonkhan commented Oct 29, 2021

Is something like ValueContainsEscapedBytes too verbose and not worth the clarity?

The use case is somewhat esoteric so I wouldn't object to making it longer if it does improve clarity.

Keep in mind that clarity in naming has a trade-off: the more specific you are in describing the operation, the more you constrain yourself in what you can do moving forward without violating the promise the name is making. I don't have a good intuition here, but I generally agree with @eiriktsarpalis Eirik Tsarpalis FTE' point regarding this being advanced enough that making the name longer is generally not a bad thing.

If we want it to be a bit smaller, with the same semantic meaning, this could be better: ValueContainsEscapedBytes -> ValueHasEscapedBytes.

The only downside I see naming it as ValueHasEscapedBytes, instead of ValueIsEscaped, is that we are now putting "Bytes" in the name, but I think that aspect is already there given the type has Utf8 in it's name, so doesn't really constrain things all that much.

If we care, and since we are open to dropping the "Is" infix for a bool property (e.g. we already have HasValueSequence), we could also just do something like: ValueHasEscaping (which matches the name of the original private field _stringHasEscaping, just generalized to any token value).

Imo, the area owners (and others) can make a call between:

  • ValueHasEscapedBytes or
  • ValueHasEscaping

@eiriktsarpalis, @terrajobst, @ycrumeyrolle thoughts between the two?

@eiriktsarpalis
Copy link
Member

I don't have a strong opinion on the naming. The existing naming conventions in Utf8JsonReader APIs seem to suggest that "Value" implies raw UTF-8 bytes and as such probably qualifying it with a "Bytes" suffix seems redundant. To @terrajobst's original point, this API would only concern itself with raw bytes so I don't think adding the qualification would introduce damage from the perspective of potential generalizations in the future.

OTOH I fail to see a huge difference between ValueHasEscaping and ValueIsEscaped. If there's an argument to be made about precedent vis a vis the HasValueSequence name, I suppose we'd be talking about HasValueEscaping but that doesn't feel too right.

@ahsonkhan
Copy link
Member

I think either ValueHasEscaping and ValueIsEscaped is OK.

The small concern I see with the latter, is that it might be interpreted to promise some validation/escaping on the token value, which doesn't match user expectations, and then the user might incorrectly use this property to do something like:

// I have a "safe" encoder that I want to run all JSON tokens through that need to be escaped.

if (reader.ValueIsEscaped)
{
  // Incorrectly skip escaping it using S.T.E.W, since apparently it is already fully escaped, and safe to send as is
  // When, in reality, all this checked for, is if *some* character happened to be escaped within it with `\`.
}
else
{
  string actualValue = reader.GetString(); // unescapes it
  var mySafeEncoder = JavaScriptEncoder.Create(UnicodeRanges.BasicLatin);
  string safeEscaped = mySafeEncoder.Encode(actualValue); // now this string is safe to send for some particular use case
}

@GrabYourPitchforks do you think such a concern with the name is unfounded here?

@deeprobin
Copy link
Contributor

@ycrumeyrolle are you interested in adding the new API + tests?

Ok!

Still working on this?

@ycrumeyrolle
Copy link
Author

I should be able to do this within the next two weeks.

@eiriktsarpalis eiriktsarpalis modified the milestones: Future, 7.0.0 May 12, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 19, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 24, 2022
@eiriktsarpalis eiriktsarpalis changed the title Expose the field _stringHasEscaping of the Utf8JsonReader struct Add Utf8JsonReader.ValueIsEscaped property May 24, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Json help wanted [up-for-grabs] Good issue for external contributors
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants