Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Enable JsonEncodedText to be initialized with already encoded bytes. #67850

Closed
olivier-spinelli opened this issue Apr 11, 2022 · 12 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.Json

Comments

@olivier-spinelli
Copy link

olivier-spinelli commented Apr 11, 2022

Background and motivation

I'm trying to efficiently serialize/deserialize JsonEncodedText: I just need to write the EncodedUtf8Bytes and read then back as bytes, letting the constructor expanding the string. Unfortunately, the constructor that does this is private (called by the Encode
methods and this makes perfect sense since this ctor would be clearly ambiguous):

private JsonEncodedText(byte[] utf8Value)
{
   Debug.Assert](utf8Value != null);
   _value = JsonReaderHelper.GetTextFromUtf8(utf8Value);
   _utf8Value = utf8Value;
}

There is no way to implement a fake encoder to override the default encoder of the Encode method without going unsafe:

sealed class NoEncoder : JavaScriptEncoder
{
   public override int MaxOutputCharactersPerInputCharacter => 1;

   public override unsafe int FindFirstCharacterToEncode( char* text, int textLength ) => -1;

   public override unsafe bool TryEncodeUnicodeScalar( int unicodeScalar, char* buffer, int bufferLength, out int numberOfCharactersWritten )
   {
       Throw.NotSupportedException();
   }

   public override bool WillEncode( int unicodeScalar ) => false;
}

A simple static method with an explicit name such as the one below will be great and as safe as it can be:

public static JsonEncodedText CreateFromEncodedUtf8Bytes( ReadOnlySpan<byte> bytes!! ) => new JsonEncodedText( bytes );

API Proposal

Add these 2 public static methods to the JsonEncodedText class.

public static JsonEncodedText CreateFromEncodedUtf8Bytes( ReadOnlySpan<byte> bytes ) => new JsonEncodedText( bytes.ToArray() );

public static JsonEncodedText CreateFromEncodedUtf8Bytes( byte[] bytes!! ) => new JsonEncodedText( bytes );

API Usage

The first goal is to avoid serializing a string in UTF-16 (that will be encoded in UTF-8) and restore a string (that will be read as UTF-8 and encoded in UTF-16 that will need be encoded back to UTF-8!)...

Net effect: 0 vs. 1 encoding during write and 1 encoding vs. 2 during read...
(and only the required allocations since with the CreateFromEncodedUtf8Bytes( byte[] bytes!! ) the ownership of the buffer is
given to the JsonEncodedText.)

        /// <summary>
        /// Writes a <see cref="JsonEncodedText"/>.
        /// </summary>
        /// <param name="this">This writer.</param>
        /// <param name="t">The text.</param>
        public static void Write( this ICKBinaryWriter @this, JsonEncodedText t )
        {
            @this.WriteNonNegativeSmallInt32( t.EncodedUtf8Bytes.Length );
            @this.Write( t.EncodedUtf8Bytes );
        }

        /// <summary>
        /// Reads a <see cref="JsonEncodedText"/>.
        /// </summary>
        /// <param name="this">This reader.</param>
        /// <returns>The read text.</returns>
        public static JsonEncodedText ReadJsonEncodedText( this ICKBinaryReader @this )
        {
            int len = @this.ReadNonNegativeSmallInt32();
            return JsonEncodedText.Encode( @this.ReadBytes( len ) );
        }

Alternative Designs

N/A

Risks

None that I can imagine...

@olivier-spinelli olivier-spinelli added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Apr 11, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Text.Encoding untriaged New issue has not been triaged by the area owner labels Apr 11, 2022
@ghost
Copy link

ghost commented Apr 11, 2022

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

I'm trying to efficiently serialize/deserialize JsonEncodedText: I just need to write the EncodedUtf8Bytes and read then back as bytes, letting the constructor expanding the string. Unfortunately, the constructor that does this is private (called by the Encode
methods and this makes perfect sense since this ctor would be clearly ambiguous):

private JsonEncodedText(byte[] utf8Value)
{
   Debug.Assert](utf8Value != null);
   _value = JsonReaderHelper.GetTextFromUtf8(utf8Value);
   _utf8Value = utf8Value;
}

There is no way to implement a fake encoder to override the default encoder of the Encode method without going unsafe:

sealed class NoEncoder : JavaScriptEncoder
{
   public override int MaxOutputCharactersPerInputCharacter => 1;

   public override unsafe int FindFirstCharacterToEncode( char* text, int textLength ) => -1;

   public override unsafe bool TryEncodeUnicodeScalar( int unicodeScalar, char* buffer, int bufferLength, out int numberOfCharactersWritten )
   {
       Throw.NotSupportedException();
   }

   public override bool WillEncode( int unicodeScalar ) => false;
}

A simple static method with an explicit name such as the one below will be great and as safe as it can be:

public static JsonEncodedText CreateFromEncodedUtf8Bytes( ReadOnlySpan<byte> bytes!! ) => new JsonEncodedText( bytes );

API Proposal

Add these 2 public static methods to the JsonEncodedText class.

public static JsonEncodedText CreateFromEncodedUtf8Bytes( ReadOnlySpan<byte> bytes ) => new JsonEncodedText( bytes.ToArray() );

public static JsonEncodedText CreateFromEncodedUtf8Bytes( byte[] bytes!! ) => new JsonEncodedText( bytes );

API Usage

The use of the span below is totally optional. The first goal is to avoid serializing a string in UTF-16 (that will be encoded in UTF-8) and restore a string (that will be read as UTF-8 and encoded in UTF-16 that will need be encoded back to UTF-8!)...

Net effect: 0 vs. 1 encoding during write and 1 encoding vs. 2 during read...
(and only the required allocations since with the CreateFromEncodedUtf8Bytes( byte[] bytes!! ) the ownership of the buffer is
given to the JsonEncodedText.)

        /// <summary>
        /// Writes a <see cref="JsonEncodedText"/>.
        /// </summary>
        /// <param name="this">This writer.</param>
        /// <param name="t">The text.</param>
        public static void Write( this ICKBinaryWriter @this, JsonEncodedText t )
        {
            @this.WriteNonNegativeSmallInt32( t.EncodedUtf8Bytes.Length );
            @this.Write( t.EncodedUtf8Bytes );
        }

        /// <summary>
        /// Reads a <see cref="JsonEncodedText"/>.
        /// </summary>
        /// <param name="this">This reader.</param>
        /// <returns>The read text.</returns>
        public static JsonEncodedText ReadJsonEncodedText( this ICKBinaryReader @this )
        {
            int len = @this.ReadNonNegativeSmallInt32();
            return JsonEncodedText.Encode( @this.ReadBytes( len ) );
        }

Alternative Designs

N/A

Risks

None that I can imagine...

Author: olivier-spinelli
Assignees: -
Labels:

api-suggestion, area-System.Text.Encoding, untriaged

Milestone: -

@ghost
Copy link

ghost commented May 23, 2022

Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

I'm trying to efficiently serialize/deserialize JsonEncodedText: I just need to write the EncodedUtf8Bytes and read then back as bytes, letting the constructor expanding the string. Unfortunately, the constructor that does this is private (called by the Encode
methods and this makes perfect sense since this ctor would be clearly ambiguous):

private JsonEncodedText(byte[] utf8Value)
{
   Debug.Assert](utf8Value != null);
   _value = JsonReaderHelper.GetTextFromUtf8(utf8Value);
   _utf8Value = utf8Value;
}

There is no way to implement a fake encoder to override the default encoder of the Encode method without going unsafe:

sealed class NoEncoder : JavaScriptEncoder
{
   public override int MaxOutputCharactersPerInputCharacter => 1;

   public override unsafe int FindFirstCharacterToEncode( char* text, int textLength ) => -1;

   public override unsafe bool TryEncodeUnicodeScalar( int unicodeScalar, char* buffer, int bufferLength, out int numberOfCharactersWritten )
   {
       Throw.NotSupportedException();
   }

   public override bool WillEncode( int unicodeScalar ) => false;
}

A simple static method with an explicit name such as the one below will be great and as safe as it can be:

public static JsonEncodedText CreateFromEncodedUtf8Bytes( ReadOnlySpan<byte> bytes!! ) => new JsonEncodedText( bytes );

API Proposal

Add these 2 public static methods to the JsonEncodedText class.

public static JsonEncodedText CreateFromEncodedUtf8Bytes( ReadOnlySpan<byte> bytes ) => new JsonEncodedText( bytes.ToArray() );

public static JsonEncodedText CreateFromEncodedUtf8Bytes( byte[] bytes!! ) => new JsonEncodedText( bytes );

API Usage

The first goal is to avoid serializing a string in UTF-16 (that will be encoded in UTF-8) and restore a string (that will be read as UTF-8 and encoded in UTF-16 that will need be encoded back to UTF-8!)...

Net effect: 0 vs. 1 encoding during write and 1 encoding vs. 2 during read...
(and only the required allocations since with the CreateFromEncodedUtf8Bytes( byte[] bytes!! ) the ownership of the buffer is
given to the JsonEncodedText.)

        /// <summary>
        /// Writes a <see cref="JsonEncodedText"/>.
        /// </summary>
        /// <param name="this">This writer.</param>
        /// <param name="t">The text.</param>
        public static void Write( this ICKBinaryWriter @this, JsonEncodedText t )
        {
            @this.WriteNonNegativeSmallInt32( t.EncodedUtf8Bytes.Length );
            @this.Write( t.EncodedUtf8Bytes );
        }

        /// <summary>
        /// Reads a <see cref="JsonEncodedText"/>.
        /// </summary>
        /// <param name="this">This reader.</param>
        /// <returns>The read text.</returns>
        public static JsonEncodedText ReadJsonEncodedText( this ICKBinaryReader @this )
        {
            int len = @this.ReadNonNegativeSmallInt32();
            return JsonEncodedText.Encode( @this.ReadBytes( len ) );
        }

Alternative Designs

N/A

Risks

None that I can imagine...

Author: olivier-spinelli
Assignees: -
Labels:

api-suggestion, area-System.Text.Json, untriaged

Milestone: -

@eiriktsarpalis
Copy link
Member

eiriktsarpalis commented May 24, 2022

Do the JsonEncodedText.Encode methods not suit your use case? Nevermind I misread your description.

There is no way to implement a fake encoder to override the default encoder of the Encode method without going unsafe:

I believe @GrabYourPitchforks might have thoughts about making it possible to extend the STEW abstractions without writing unsafe code.

the ownership of the buffer is given to the JsonEncodedText.

It seems clear to me that the current design is very explicitly trying to avoid this. I wouldn't expect the performance of creating JsonEncodedText issue to be an issue since they are generally expected to be pre-encoded and cached for future serialization loops. If that's not the case, what prevents you from passing your pre-encoded utf8 buffer directly to the underlying writer?

@olivier-spinelli
Copy link
Author

The data is alreay encoded (since it's coming from a UTF-8 serialized stream/pipe). Calling Encode on it (even the one with
ReadOnlySpan<byte>) will trigger a double encoding (it calls NeedsEscaping that has good chances to return a "yes") that will definitely destoy the data .

https://source.dot.net/#System.Text.Json/System/Text/Json/JsonEncodedText.cs,105

@eiriktsarpalis
Copy link
Member

@olivier-spinelli I just updated my response as you were posting yours. Please see my new feedback, in particular the final paragraph.

@olivier-spinelli
Copy link
Author

olivier-spinelli commented May 24, 2022

Seen and thanks!

If that's not the case, what prevents you from passing your pre-encoded utf8 buffer directly to the underlying writer?

I surely can for write but then to read it back I need to convert the utf-8 to string before calling the Encode (that converts back the string to... the same content as the original one) to resurect a JsonEncodedText.

@eiriktsarpalis
Copy link
Member

eiriktsarpalis commented May 24, 2022

to read it back I need to convert the utf-8 to string before calling the Encode (that converts back the string to... the same content as the original one) to resurect a JsonEncodedText.

#54410 would make it easier to read JSON strings the need to convert to a string. So you might be able to do something like

char[] myBuffer = ArrayPool<char>.Shared.Rent(128);
int charsRead = reader.CopyString(myBuffer);
ReadOnlySpan<char> source = myBuffer.Slice(bytesRead);

if (_charText.SequenceEquals(source))
{
   /* handle the value */
}

ArrayPool<char>.Shared.Return(myBuffer);

@olivier-spinelli
Copy link
Author

Unfortunately I'm writing/reading from a binary reader and not through the Utf8JsonReader. I'm trying to efficiently serialize/deserialize JsonEncodedText...

@eiriktsarpalis
Copy link
Member

I don't think such a scenario is within scope for this particular type. It's meant as a helper struct aiding serialization performance of System.Text.Json, but it's not itself intended for serialization/deserialization. If performance is important enough, I would recommend writing your own custom struct instead.

@ghost ghost removed the untriaged New issue has not been triaged by the area owner label May 24, 2022
@olivier-spinelli
Copy link
Author

olivier-spinelli commented May 24, 2022

"helper struct aiding serialization performance" can be for other kind of serializers. Another struct will not interoperate with the rest of the code/library: a conversion to/from JsonEncodedText will be needed (and since there is no way to initialize a JsonEncodedText with both its bytes and string without conversions, this is a dead end).

(Ooops I didn't notice you closed the issue.)
Sad.
And for 2 poor static helpers that will be jitted leading to 0 size impacts... Is it the fact that buffer validation is skipped in such case that is annoying?

@eiriktsarpalis
Copy link
Member

"helper struct aiding serialization performance" can be for other kind of serializers.

I don't believe that is the case. We won't support scenaria where this struct is used outside of System.Text.Json, unless used within the contracts of the already exposed APIs.

Ultimately, the reason that I closed this is that we generally don't ship factory methods that transfer ownership of buffers to newly created instances, since that would be fairly error prone.

@olivier-spinelli
Copy link
Author

Noted. Thanks for your time!

@ghost ghost locked as resolved and limited conversation to collaborators Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.Json
Projects
None yet
Development

No branches or pull requests

3 participants