Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: UnicodeJsonEncoder #87153

Open
davidmatson opened this issue Jun 5, 2023 · 4 comments
Open

[API Proposal]: UnicodeJsonEncoder #87153

davidmatson opened this issue Jun 5, 2023 · 4 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.Json wishlist Issue we would like to prioritize, but we can't commit we will get to it yet
Milestone

Comments

@davidmatson
Copy link

davidmatson commented Jun 5, 2023

Background and motivation

There's no built-in implementation that allows characters from all languages to be kept readable, and unnecessary escaping to be avoided when the caller knows recipients parse JSON correctly.

For additional context, see:
#42847
#86800
#87138

API Proposal

namespace System.Text.Encodings.Web
{
    internal sealed class UnicodeJsonEncoder : JavaScriptEncoder
    {
        internal static readonly UnicodeJsonEncoder Singleton = new UnicodeJsonEncoder();

        private readonly bool _preferHexEscape;
        private readonly bool _preferUppercase;

        public UnicodeJsonEncoder()
            : this(preferHexEscape: false, preferUppercase: false)
        {
        }

        public UnicodeJsonEncoder(bool preferHexEscape, bool preferUppercase)
        {
            _preferHexEscape = preferHexEscape;
            _preferUppercase = preferUppercase;
        }

        // Implementations of base class members.
    }
}
namespace System.Text.Encodings.Web
{
    public abstract class JavaScriptEncoder : TextEncoder
    {
        // Existing members

        public static JavaScriptEncoder Unicode => UnicodeJsonEncoder.Singleton;
    }
}

PR #87147 has additional implementation details.

API Usage

// some typed variable with the JSON object to serialize, called "data"
string json = JsonSerializer.Serialize(data new JsonSerializerOptions {  Encoder = JavaScriptEncoder.Unicode });

Or, to force hex escapes (\uxxxx) rather than two-character escapes (for example, "):

// some typed variable with the JSON object to serialize, called "data"
string json = JsonSerializer.Serialize(data new JsonSerializerOptions {  Encoder = new UnicodeJsonEncoder(preferHexEscape: true, preferUppercase; false) }); // or other values for those bools

Alternative Designs

No response

Risks

Similar to UnsafeRelaxedJsonEncoder, but see #87138.

Callers need to ensure two things:

  1. If this JSON output is embedded inside another language (HTML, SQL, C#, etc.), the text is correctly escaped according to that language's requirements.
  2. The recipient and intermediaries follow the JSON spec correctly (can handle any Unicode characters that are unescaped like they do when such characters are escaped).
@davidmatson davidmatson added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Jun 5, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 5, 2023
@ghost
Copy link

ghost commented Jun 5, 2023

Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

There's no built-in implementation that allows characters from all languages to be kept readable, and unnecessary escaping to be avoided when the caller knows recipients parse JSON correctly.

For additional context, see:
#42847
#86800
#87138

API Proposal

namespace System.Text.Encodings.Web
{
    internal sealed class UnicodeJsonEncoder : JavaScriptEncoder
    {
        internal static readonly UnicodeJsonEncoder Singleton = new UnicodeJsonEncoder();

        private readonly bool _preferHexEscape;
        private readonly bool _preferUppercase;

        public UnicodeJsonEncoder()
            : this(preferHexEscape: false, preferUppercase: false)
        {
        }

        public UnicodeJsonEncoder(bool preferHexEscape, bool preferUppercase)
        {
            _preferHexEscape = preferHexEscape;
            _preferUppercase = preferUppercase;
        }

        // Implementations of base class members.
    }
}
namespace System.Text.Encodings.Web
{
    public abstract class JavaScriptEncoder : TextEncoder
    {
        // Existing members

        public static JavaScriptEncoder Unicode => UnicodeJsonEncoder.Singleton;
    }
}

PR #87147 has additional implementation details.

API Usage

// some typed variable with the JSON object to serialize, called "data"
string json = JsonSerializer.Serialize(data new JsonSerializerOptions {  Encoder = JavaScriptEncoder.Unicode });

Or, to force hex escapes (\uxxxx) rather than two-character escapes (for example, "):

// some typed variable with the JSON object to serialize, called "data"
string json = JsonSerializer.Serialize(data new JsonSerializerOptions {  Encoder = new UnicodeJsonEncoder(preferHexEscape: true, preferUppercase; false) }); // or other values for those bools

Alternative Designs

No response

Risks

Similar to UnsafeRelaxedJsonEncoder, but see ##87138.

Callers need to ensure two things:

  1. If this JSON output is embedded inside another language (HTML, SQL, C#, etc.), the text is correctly escaped according to that language's requirements.
  2. The recipient and intermediaries follow the JSON spec correctly (can handle any Unicode characters that are unescaped like they do when such characters are escaped).
Author: davidmatson
Assignees: -
Labels:

api-suggestion, area-System.Text.Json

Milestone: -

@eiriktsarpalis eiriktsarpalis added wishlist Issue we would like to prioritize, but we can't commit we will get to it yet and removed untriaged New issue has not been triaged by the area owner labels Jun 6, 2023
@eiriktsarpalis eiriktsarpalis added this to the Future milestone Jun 6, 2023
@eiriktsarpalis
Copy link
Member

Thanks. Because we have too many open issues describing variations of the same problem, I'm going to keep this one open and close #42847, #87138, #86810 and #86805. Feel free to incorporate any context from the other issues that you feel is missing so that it can be reviewed holistically.

@rjgotten
Copy link

rjgotten commented Jun 6, 2023

This API proposal is for a minimal encoder, stating it leaves it up to the caller to further escape content correctly for embedding in whatever other container language they need.

However, there's a problem with that for embedding in HTML inside 'script islets' - i.e. inside <script type="application/json"> tags. Anything goes inside such tags, except a closing </script> tag. (Including all possible permutations involving white-space etc.)

The way to escape HTML content is to use entity encoding, e.g. to escape < use either the named &lt; or the numeric &#60; escape. However, this will specifically not work inside <script> tags.

E.g. while

<script type="application/json">
{ "foo" : "&lt;/script>" }
</script>

is suitably enough escaped, when the content is read back via the DOM (e.g. via .innerText; .text or .textContent ) it will still verbatim produce the string { "foo" : "&lt;/script>" } with the encoded character remaining.

It's decidedly non-trivial to decide what entity-encoding signifies an encoded parameter that needs decoding - and what doesn't. Maybe your JSON contains a series of resource string translations for a technical editing application that talks about how to represent HTML entities and was meant to contain an entity-encoded example?

So to get this right, your encoder has to encode all occurences of entity-like sequences and then you have to decode them again when attempting to read this stuff back.

Would be better if the new API proposal would be extended to allow specifying additional characters that should be encoded with \u encoded sequences. In that case, callers that know the JSON is destined to be placed inside such HTML script islets could add < to the to-be-encoded set of characters and have them be encoded directly within the JSON itself - and then that would suffice.

@davidmatson
Copy link
Author

@rjgotten - that's a very interesting case.

From looking at the API surface, I believe the same question applies to the existing JavaScriptEncoder.UnsafeRelaxedJsonEscaping.

string data = "</script>";
string json = JsonSerializer.Serialize(data, new JsonSerializerOptions {
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping });
Console.WriteLine(json);

produces:

"</script>"

Would be better if the new API proposal would be extended to allow specifying additional characters that should be encoded with \u encoded sequences. In that case, callers that know the JSON is destined to be placed inside such HTML script islets could add < to the to-be-encoded set of characters and have them be encoded directly within the JSON itself - and then that would suffice.

I think that's an interesting option to consider. I'd tend to leave that functionality out of this API for simplicity. As far as I can tell, JavaScriptEncoder.UnsafeRelaxedJsonEscaping also does not escape these characters and cannot be customized to escape them (without subclassing) - I'd tend to do the same here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.Json wishlist Issue we would like to prioritize, but we can't commit we will get to it yet
Projects
None yet
Development

No branches or pull requests

3 participants