Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

We should base64 encode byte[] when writing Json #29299

Closed
BrennanConroy opened this issue Apr 17, 2019 · 17 comments · Fixed by dotnet/corefx#38048
Closed

We should base64 encode byte[] when writing Json #29299

BrennanConroy opened this issue Apr 17, 2019 · 17 comments · Fixed by dotnet/corefx#38048
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Json enhancement Product code improvement that does NOT require public API changes/additions
Milestone

Comments

@BrennanConroy
Copy link
Member

When writing/serializing a byte[] to Json we were expecting to see a base64 encoded string.

Similarly when reading/deserializing, a base64 encoded string should be readable into a byte[].

@joshfree
Copy link
Member

we'll need to weigh this JSON API against the existing 3.0 json backlog; I know @ahsonkhan had it on his original proposal but it was pushed out of 3.0 based on (no-need-)feedback originally

/cc @rynowak @layomia @steveharter

@ycrumeyrolle
Copy link

With JWT/JWK, the byte|] data is expected to be Base64-URL encode.
For example https://tools.ietf.org/html/rfc7518#section-6.4.1

The "k" (key value) parameter contains the value of the symmetric (or
other single-valued) key. It is represented as the base64url
encoding of the octet sequence containing the key value.

@ahsonkhan
Copy link
Member

ahsonkhan commented May 9, 2019

I will fish out and modify the original base64 api proposal from https://github.com/dotnet/corefx/issues/33552 on the writer which can then be leveraged by the serializer.

public void WriteValueAsBase64(ReadOnlySpan utf8Bytes);

Similarly when reading/deserializing, a base64 encoded string should be readable into a byte[].

On deserializing, sure, since it would know the type is byte[]. I am not sure if it makes sense to provide such an API on the reader since as far as it is concerned, the token is just a JSON string. The caller would know that it's base64 encoded, and could use the Base64 UTF-8 parser to decode it (using Base64.DecodeFromUtf8 and passing it the ValueSpan property).

@ahsonkhan ahsonkhan self-assigned this May 9, 2019
@ahsonkhan
Copy link
Member

ahsonkhan commented May 30, 2019

API Proposal:

public sealed partial class Utf8JsonWriter : IDisposable
{
    // Alternate names:
    // - WriteBytesAsString{Value}
    // - Bytes/Base64 to be suffix isn't as clear - WriteStringAsBase64?
    // Chose bytes as parameter name following the existing Base64 APIs, maybe bytesValue?
    // Can't overload existing WriteString APIs since we already have existing APIs that take ROS<byte>

    public void WriteAsBase64String(ReadOnlySpan<byte> utf8PropertyName, ReadOnlySpan<byte> bytes) { }
    public void WriteAsBase64String(JsonEncodedText propertyName, ReadOnlySpan<byte> bytes) { }
    public void WriteAsBase64String(ReadOnlySpan<char> propertyName,ReadOnlySpan<byte> bytes) { }
    public void WriteAsBase64String(string propertyName, ReadOnlySpan<byte> bytes) { }

    public void WriteAsBase64StringValue(ReadOnlySpan<byte> bytes) { }
}

public ref partial struct Utf8JsonReader
{
    // Alternate names: GetBase64EncodedBytes

    // Throws invalid operation exception for invalid base 64 string, should it throw format exception?
    public byte[] GetBytes() { throw null; }
    public bool TryGetBytes(out byte[] value) { throw null; }
}

public readonly partial struct JsonElement
{
    // Alternate names: GetBase64EncodedBytes

    // Throws invalid operation exception for invalid base 64 string, should it throw format exception?
    public byte[] GetBytes() { throw null; }
    public bool TryGetBytes(out byte[] value) { throw null; }
}

cc @JamesNK, @bartonjs

@ycrumeyrolle
Copy link

Do you expect to escape the base64 string?

@ahsonkhan
Copy link
Member

Do you expect to escape the base64 string?

Good question, and yes, we would follow the default escaping behavior.

@terrajobst
Copy link
Member

terrajobst commented May 30, 2019

Video

  • Utf8JsonWriter.WriteAsBase64String() -> WriteBase64String()
  • Utf8JsonReader.GetBytes() -> GetBytesFromBase64()
  • Utf8JsonReader.TryGetBytes() -> TryGetBytesFromBase64()
  • JsonElement.GetBytes() -> GetBytesFromBase64()
  • JsonElement.TryGetBytes() -> TryGetBytesFromBase64()

@ahsonkhan
Copy link
Member

Re-opening since the (de)serializer work is still left. The APIs have been added.

@gfoidl
Copy link
Member

gfoidl commented May 31, 2019

we would follow the default escaping behavior.

this means data -> base64 -> escaping?
If so, are there any plans to go data -> escaped base64? This would be better perf-wise (especially for large(r) data), as it's O(n) instead of O(2n).

@ahsonkhan
Copy link
Member

ahsonkhan commented May 31, 2019

If so, are there any plans to go data -> escaped base64? This would be better perf-wise (especially for large(r) data), as it's O(n) instead of O(2n).

Not currently, but if you have suggestions or perf improvement strategy, I would be interested in how we can leverage them. The only concern is we need to be able to use the user-provided JavascriptEncoder as a writer option to do the escaping.

@gfoidl
Copy link
Member

gfoidl commented May 31, 2019

Outline of the idea:
If the user doesn't provide a JavaScriptEncoder, so that the DefaultJavaScriptEncoder will be used (is that correct?), hence we can use a modified base64-encoding*, where the lookup will incorporate the escaping.
If the user provides a JavaScriptEncoder fallback to the current behaviour.

When the base64-PR got merged into the json-serializer, I'll try to prototype something.

* base64 is [a-z|A-Z|0-9|+|/|=], base64Url is [a-z|A-Z|0-9|-|_|]. So if base64Url is used for encoding/decoding, no further escaping with the JavaScriptEncoder is needed. In https://github.com/gfoidl/Base64 I've implemented base64Url, so that code could be ported to corefx and used here.

@ahsonkhan
Copy link
Member

Remainder of the behavioral change was fixed in dotnet/corefx#38106.

@ahsonkhan
Copy link
Member

ahsonkhan commented May 31, 2019

So if base64Url is used for encoding/decoding, no further escaping with the JavaScriptEncoder is needed.

We likely want to stick to base64 encoding as the default (even though base64url has the benefit of not requiring escaping by default), to remain consistent with user expectations here. These encodings are not easily interchangeable, so it will break existing usage which depends on reading/writing base64.

When the base64-PR got merged into the json-serializer, I'll try to prototype something.

Sounds good. If ready, we can use it as an internal implementation detail of the Json library for 3.0, especially if the pef wins are obvious.

@gfoidl
Copy link
Member

gfoidl commented Jun 1, 2019

Hm, my initial thought of special handling while base64-encoding may not be the best route. Especially since we got simd-based base64, as this relies on a 3 input bytes -> four output bytes, and this isn't the case with javascript escaping anymore.
While still possible to build this into a base64-encoder, I believe this will be bad perf-wise.

A better solution may be to introduce a Base64JavaScriptEncoder, as this one only needs special handling for + (-> \u002B) and / (-> \/). The rest of the base64-characters remain.
This should give a perf-boost over the DefaultJavaScriptEncoder.

So if the user doesn't provide a custom JavaScriptEncoder, the new Base64JavaScriptEncoder will be used, otherwise fallback to the default / current escaping.

Is this a way to pursue?

@gfoidl
Copy link
Member

gfoidl commented Jun 1, 2019

For the Base64JavaScriptEncoder I made a quick iteration and compared this with the DefaultJavaScriptEncoder.

Method RawDataLen Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated
Default 10 136.85 ns 1.2497 ns 1.1689 ns 1.00 0.0100 - - 64 B
Base64JavaScriptEncoder 10 63.90 ns 0.4735 ns 0.4197 ns 0.47 0.0101 - - 64 B
Default 100 1,296.50 ns 20.8229 ns 19.4778 ns 1.00 0.5722 - - 3592 B
Base64JavaScriptEncoder 100 398.44 ns 1.6652 ns 1.4761 ns 0.31 0.0482 - - 304 B
Default 1000 11,807.81 ns 214.1532 ns 189.8412 ns 1.00 5.5542 0.0458 - 34960 B
Base64JavaScriptEncoder 1000 3,158.43 ns 16.1928 ns 14.3545 ns 0.27 0.4539 0.0038 - 2872 B

So this looks quite promising, and if this route can be gone, the plan would be:

  1. make Base64JavaScriptEncoder available in corefx (which namespace should this be -- it's a internal class)
  2. plumb this into the json-writer (as outlined above)
  3. improve the perf of Base64JavaScriptEncoder (maybe vectorize it a bit) (in a follow-up PR)

@ahsonkhan
Copy link
Member

So this looks quite promising, and if this route can be gone, the plan would be

Sounds good to me. Can you also establish a baseline by submitting writer (and serializer/deserializer) perf test related to base64 encoded byte arrays to dotnet/performance? We should check that in and then add the perf improvements.

Add something similar to these:
https://github.com/dotnet/performance/blob/021411d229876d5208bbce9983a284949f67a210/src/benchmarks/micro/corefx/System.Text.Json/Serializer/ReadJson.cs#L17
https://github.com/dotnet/performance/blob/021411d229876d5208bbce9983a284949f67a210/src/benchmarks/micro/corefx/System.Text.Json/Serializer/WriteJson.cs#L17
And more targeted:
https://github.com/dotnet/performance/tree/master/src/benchmarks/micro/corefx/System.Text.Json/Utf8JsonWriter

@gfoidl
Copy link
Member

gfoidl commented Jun 4, 2019

Will do this during this week. I'll CC you on the PR then.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 3.0 milestone Feb 1, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Json enhancement Product code improvement that does NOT require public API changes/additions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants