Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

YamlStream.Load with JSON with emojis (even escaped) fails: "While scanning a quoted scalar, found invalid Unicode character escape code." #838

Closed
cmeeren opened this issue Aug 24, 2023 · 13 comments · Fixed by #841

Comments

@cmeeren
Copy link

cmeeren commented Aug 24, 2023

I am serializing something to JSON with System.Text.Json, and then converting it to YAML using YamlStream. However, if the JSON contains an emoji, even if it's escaped, YamlStream.Load throws:

(Line: 1, Col: 1, Idx: 0) - (Line: 1, Col: 4, Idx: 3): While scanning a quoted scalar, found invalid Unicode character escape code.

Code to reproduce:

open System.IO
open System.Text.Json
open YamlDotNet.RepresentationModel

let input = "👍"

let json = JsonSerializer.Serialize input

assert (json = "\"\\uD83D\\uDC4D\"")

let yamlStream = YamlStream()
yamlStream.Load(new StringReader(json))
@cmeeren cmeeren changed the title Loading JSON with emojis (even escaped) fails YamlStream.Load with JSON with emojis (even escaped) fails: While scanning a quoted scalar, found invalid Unicode character escape code. Aug 24, 2023
@cmeeren cmeeren changed the title YamlStream.Load with JSON with emojis (even escaped) fails: While scanning a quoted scalar, found invalid Unicode character escape code. YamlStream.Load with JSON with emojis (even escaped) fails: "While scanning a quoted scalar, found invalid Unicode character escape code." Aug 24, 2023
@EdwardCooke
Copy link
Collaborator

Those utf-8 codes are actually invalid codes for utf-8. According to Wikipedia anyways. I wonder if the dot net core json library sees your character as 2 separate characters because it might be utf-16. I’m not able to actually dig in to this too much right now, but I’ll try and take a closer look tonight. I wonder if you can specify the encoding in the json serializer and set it to utf-16 and see what happens.

@cmeeren
Copy link
Author

cmeeren commented Aug 24, 2023

UTF-16? But JSON is always UTF-8 by definition, isn't it? In any case, I can find nothing in System.Text.Json about UTF-16. Let me know if I'm wrong.

@cmeeren
Copy link
Author

cmeeren commented Aug 24, 2023

Also, AFAIK many emojis are composed of 2 or even more unicode characters. This SO question and answers may be helpful.

Finally, I highly doubt that System.Text.Json (the official .NET JSON serialization API) is doing things wrong. That would require very concrete proof.

@EdwardCooke
Copy link
Collaborator

This will also help me in getting a fix in

dotnet/runtime#42847
There’s some other linked issues that should help clarify the nuances in Unicode.

not sure when I’ll get it done though.

@cmeeren
Copy link
Author

cmeeren commented Aug 25, 2023

Is there any workaround I can apply now, for converting JSON to YAML? The emojis etc. doesn't have to be unescaped; I'm OK with anything that preserves the escape codes.

I simply want my JSON content converted to YAML (tweaked using a YAML visitor). Now it seems that the YAML conversion is attempting to decode escaped stuff, and is failing at that.

@cmeeren
Copy link
Author

cmeeren commented Aug 25, 2023

Possibly relevant: The Wikipedia article on JSON, section "Character encoding", says:

JSON exchange in an open ecosystem must be encoded in UTF-8. The encoding supports the full Unicode character set, including those characters outside the Basic Multilingual Plane (U+0000 to U+FFFF). However, if escaped, those characters must be written using UTF-16 surrogate pairs. For example, to include the Emoji character U+1F610 😐 NEUTRAL FACE in JSON:

{ "face": "😐" }
// or
{ "face": "\uD83D\uDE10" }

The latter is exactly what System.Text.Json does. And since YAML is a superset of JSON, I would expect any YAML implementation, such as YamlDotNet, to support such surrogate pairs.

@cmeeren
Copy link
Author

cmeeren commented Aug 25, 2023

Ideally I would like unescaped output (i.e., emojis in the YAML), but at least the following workaround lets me preserve the surrogate pair escape codes:

let unicodeEscapeCodePlaceholder = Guid.NewGuid().ToString()

let escapeUnicodeEscapeCodes (str: string) =
    str.Replace(@"\u", unicodeEscapeCodePlaceholder)

let unEscapeUnicodeEscapeCodes (str: string) =
    str.Replace(unicodeEscapeCodePlaceholder, @"\u")

let formatAsYaml json =
    let json = escapeUnicodeEscapeCodes json
    let yaml = (* load with YamlStream and transform to YAML *)
    unEscapeUnicodeEscapeCodes yaml

@EdwardCooke
Copy link
Collaborator

I definitely do agree that utf8 surrogate pairs should be supported. I’m hoping to have some time this weekend to look at it. I did find where it was throwing a fit though.

@cmeeren
Copy link
Author

cmeeren commented Aug 25, 2023

Happy to hear it! 😁

@edwardcookemacu
Copy link

I've been doing a lot of research on this, through the unicode spec and all that. The surrogate pairs are a way of putting utf-16 and utf-32 characters in a utf-8 file. JSON does this by escaping them, in raw UTF-8 files, its done at the byte level (from my understanding). We just need to support the escaped version, just like the YAML spec shows (if I recall). It has a lot to deal with bit masking and what not, it's pretty complicated so it may take some time, but it may go quicker than I think. I'll let you know when I have a PR ready for it by linking it to this issue.

@cmeeren
Copy link
Author

cmeeren commented Aug 28, 2023

Thanks! I have no idea what the internals of YamlDotNet is doing, but wouldn't it work to just keep the loaded text as-is and not attempting to decode the escape sequences, which is seems like it's doing now?

@EdwardCooke
Copy link
Collaborator

This fix is going to be released in the latest nuget package which should be available in about 10-15 minutes.

@cmeeren
Copy link
Author

cmeeren commented Aug 29, 2023

It works. Thanks a lot! 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants