Default JSON escaping is biased against other languages #86805
Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis

Issue Details

Description

Default JSON escaping doesn't escape English but does escape other languages.

Reproduction Steps

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Json;

class Program
{
    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        List<string> data = new List<string>();
        data.Add("English");
        data.Add("日本語");
        data.Add("עברית");
        data.Add("ქართული ენა");
        data.Add("中文");
        data.Add("𑄌𑄋𑄴𑄟𑄳𑄦 𑄃𑄧𑄏𑄛𑄖𑄴");
        string actual = JsonSerializer.Serialize(data);
        Console.WriteLine(actual);
    }
}
```

Expected behavior

All languages are treated equally:

```json
["English","日本語","עברית","ქართული ენა","中文","𑄌𑄋𑄴𑄟𑄳𑄦 𑄃𑄧𑄏𑄛𑄖𑄴"]
```

Actual behavior

Some languages are not:

```json
["English","\u65E5\u672C\u8A9E","\u05E2\u05D1\u05E8\u05D9\u05EA","\u10E5\u10D0\u10E0\u10D7\u10E3\u10DA\u10D8 \u10D4\u10DC\u10D0","\u4E2D\u6587","\uD804\uDD0C\uD804\uDD0B\uD804\uDD34\uD804\uDD1F\uD804\uDD33\uD804\uDD26 \uD804\uDD03\uD804\uDD27\uD804\uDD0F\uD804\uDD1B\uD804\uDD16\uD804\uDD34"]
```

For English speakers, try using the following encoder for a while to get a feel for the significance of this impact on other languages:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Encodings.Web;
using System.Text.Json;

class MaximalJsonEncoder : JavaScriptEncoder
{
    public override int MaxOutputCharactersPerInputCharacter => 12;

    public override unsafe int FindFirstCharacterToEncode(char* text, int textLength)
    {
        if (textLength == 0)
        {
            return -1;
        }
        return 0;
    }

    public override unsafe bool TryEncodeUnicodeScalar(int unicodeScalar, char* buffer, int bufferLength,
        out int numberOfCharactersWritten)
    {
        Rune rune = new Rune(unicodeScalar);
        const int unicodeEscapeLength = 6; // "\\u0000".Length
        int lengthNeeded = rune.Utf16SequenceLength * unicodeEscapeLength;
        if (bufferLength < lengthNeeded)
        {
            numberOfCharactersWritten = 0;
            return false;
        }
        const int maxCharsPerScalar = 2;
        char* utf16Buffer = stackalloc char[maxCharsPerScalar];
        int utf16Length = rune.EncodeToUtf16(new Span<char>(utf16Buffer, maxCharsPerScalar));
        Span<char> span = new Span<char>(buffer, bufferLength);
        for (int index = 0; index < utf16Length; ++index)
        {
            char toEncode = utf16Buffer[index];
            span[0] = '\\';
            span[1] = 'u';
            span[2] = ToHexDigit((toEncode & 0xf000) >> 12);
            span[3] = ToHexDigit((toEncode & 0xf00) >> 8);
            span[4] = ToHexDigit((toEncode & 0xf0) >> 4);
            span[5] = ToHexDigit(toEncode & 0xf);
            span = span.Slice(unicodeEscapeLength);
        }
        numberOfCharactersWritten = lengthNeeded;
        return true;
    }

    public override bool WillEncode(int unicodeScalar)
    {
        return true;
    }

    static char ToHexDigit(int value)
    {
        if (value > 0xf)
        {
            throw new ArgumentOutOfRangeException(nameof(value));
        }
        if (value < 0xa)
        {
            return (char)(value + '0');
        }
        else
        {
            return (char)(value - 0xa + 'a');
        }
    }
}

class Program
{
    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Dictionary<string, string> data = new Dictionary<string, string>();
        data.Add("text", "Hello, World!");
        string actual = JsonSerializer.Serialize(data, new JsonSerializerOptions { Encoder = new MaximalJsonEncoder() });
        Console.WriteLine(actual);
    }
}
```

Regression?

No response

Known Workarounds

A relaxed encoder helps with some languages but not all:

```csharp
string actual = JsonSerializer.Serialize(data, new JsonSerializerOptions { Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping });
```

A custom encoder is currently needed to handle all languages equally.

Configuration

No response

Other information

It could be easy to think that encoding doesn't matter, because it's the same semantic meaning from a programmatic standpoint. But consider the following JSON files:

```json
{"text":"Hello, World!"}
```

vs.

```json
{"\u0074\u0065\u0078\u0074":"\u0048\u0065\u006c\u006c\u006f\u002c\u0020\u0057\u006f\u0072\u006c\u0064\u0021"}
```

Even though a parser will treat them as meaning the same thing, these are clearly not equally good JSON files, for two key reasons:
.NET should treat other languages equally, and not relegate some to be treated like the second example above.
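This default isn't unique to .NET. Python's `json` module makes the same ASCII-escaping choice by default via its `ensure_ascii` flag, which makes it a convenient way to see both forms side by side (Python is used here only because the behaviour is language-agnostic):

```python
import json

data = ["English", "日本語", "中文"]

# Default: ensure_ascii=True, analogous to JavaScriptEncoder.Default.
escaped = json.dumps(data)
# Analogous to an encoder that leaves non-ASCII characters alone.
unescaped = json.dumps(data, ensure_ascii=False)

print(escaped)    # ["English", "\u65e5\u672c\u8a9e", "\u4e2d\u6587"]
print(unescaped)  # ["English", "日本語", "中文"]

# Both forms parse back to identical data; only size and readability differ.
assert json.loads(escaped) == json.loads(unescaped) == data
```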
I think you should measure the difference in UTF-8 as used on the web, instead of 16-bit or 32-bit Unicode... That would greatly reduce the encoding difference. And if you happen to use plain ASCII instead of UTF-8, the requested form of not escaping would not be supported at all. My guess is that this is the reason things work this way right now.
Right, but I think saying it "is biased against other languages" because it only leaves English alone is misleading. ASCII is used as the basis for a lot of different encoding schemes and is therefore much less likely to be misinterpreted even if the encoding scheme is mixed up. Furthermore, its unique characteristic of the MSB always being zero makes it easy to detect that a string consists only of ASCII characters.

Decoding errors due to mismatched encodings are actually something you do observe at times, especially on Windows, which uses different encodings based on the locale selected. Referred to as 'mojibake' - literally "broken letters" in Japanese (or here in Korea, '뷁어', literally '뷁 language', because that character occurs quite commonly if you misinterpret Shift_JIS-encoded strings, the default encoding for Japanese Windows, as CP949, the default encoding for Korean Windows) - you may encounter this problem if you transfer text files encoded in a non-Unicode encoding and then open them on a machine that uses a different default encoding. While it is true that UTF-8 is becoming the de-facto standard for encoding anything, making the output string ASCII-only helps avoid the problem altogether. It is an unfortunate side-effect of the fact that, for a good while, we did not have a single standard ruling over all the fragmented sets of encodings for non-English languages.

The basis for making this aggressive escaping the default behaviour appears to be 'defence in depth' - though I'm not sure if escaping non-ASCII characters has anything to do with this (the only attacks I can find regarding JSON escaping are XSS attacks in HTML). However, for a 'default behaviour', which is very hard to change, I do think it is better to be safe than sorry.
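The mojibake effect described above is easy to reproduce. A minimal sketch, in Python, of exactly the Shift_JIS-misread-as-CP949 scenario:

```python
# Japanese text encoded with Shift_JIS (the Japanese Windows default)...
raw = "日本語".encode("shift_jis")

# ...then misinterpreted as CP949 (the Korean Windows default) yields mojibake:
# broken Hangul and replacement characters instead of the original text.
garbled = raw.decode("cp949", errors="replace")
print(garbled)
assert garbled != "日本語"
```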
From what I see, most languages should be fine when using UnsafeRelaxedJsonEscaping. I am not suggesting that this is not a problem that needs solving; I do think it would be very nice to leave these unescaped as well, for easier debugging without having to paste the JSON string into an un-escape tool or something. But I think it deserves a separate issue if we really want that (or we could revive the issue above, perhaps).
In my opinion, this is what matters the most. JSON is intended for machine consumption, and the less fragile we make it, the better, especially for a default behaviour.
Right, if you count the characters, but not if you consider the actual size of the encoded characters. With UTF-8, a non-ASCII character takes 2 to 4 'code units' (in other words, bytes), depending on where the character is located within the Unicode table. A CJK character takes 3 code units compared to 6 code units when escaped, so really you're looking at a 2x size difference; for things outside the BMP it would be 4 code units unescaped and 12 code units escaped (two \uXXXX escapes for the surrogate pair), a 3x difference. It would be a 6x difference size-wise if you were to escape ASCII characters as well, which is technically allowed, but we don't do that by default and no sane person would, so it would be pointless to argue about that case.
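The size arithmetic above can be checked directly (shown in Python for brevity; the byte counts are properties of UTF-8 and of JSON's `\uXXXX` escapes, not of any particular library):

```python
import json

# A BMP CJK character: 3 UTF-8 bytes raw vs 6 ASCII bytes escaped (2x).
assert len("中".encode("utf-8")) == 3
assert len(json.dumps("中")) - len('""') == 6   # "\u4e2d"

# A supplementary-plane character: 4 UTF-8 bytes raw vs 12 bytes escaped
# (two \uXXXX escapes for the surrogate pair), a 3x difference.
assert len("𑄌".encode("utf-8")) == 4
assert len(json.dumps("𑄌")) - len('""') == 12  # "\ud804\udd0c"
```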
Right, but how often do you need to read the raw JSON file, apart from during development / debugging? We could also argue we should enable indentation by default if we were to consider readability. Again, for a 'default' behaviour covering the widest demand, I think it is acceptable to produce escaped strings. All those points aside, it has been 4 major releases since System.Text.Json was introduced.
JSON must always be sent as Unicode. So both of the following are irrelevant:
Not true, for two reasons:
A "Unicode code point" does not mean it has to be encoded in one of the 'Unicode Encoding Forms' (basically, UTF-8 / UTF-16 / UTF-32). Direct quote from the Unicode Glossary:
While it is confusing that the term 'Unicode' is used interchangeably for a multitude of things, what it really means is that only values that have a place in the Unicode codespace are allowed in valid JSON. It does not imply that a codepoint not assigned a character is disallowed (that implication would make every addition of characters in a newer Unicode revision render every spec-conforming library non-conforming until it added support for the newer revision! - also see '9. String' of ECMA-404, which mentions that the JSON standard allows unassigned codepoints within strings), nor does it limit JSON to only one of the 'Unicode Encoding Forms'. Yes, for practical reasons, UTF-8 / UTF-16 will be the most prevalent, but this does not stop you from using other encodings, as long as you can map characters encoded in a different encoding to a Unicode codepoint.
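A quick illustration of that point (Python; trivially true, but it shows the JSON text itself is a sequence of code points, independent of which encoding form happens to store it):

```python
import json

doc = '["日本語", "中文"]'

# The same JSON text survives a round trip through any Unicode encoding form;
# the parser only ever sees the code points, never the bytes.
for encoding in ("utf-8", "utf-16", "utf-32"):
    assert json.loads(doc.encode(encoding).decode(encoding)) == ["日本語", "中文"]
```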
Had a conversation with @GrabYourPitchforks, who gave some interesting insights on some of the reasons why these decisions were made:
If you know that the client you're talking to handles these cases correctly, you can configure the escaping behaviour to not escape any of the characters. For example, ASP.NET can get away with turning off most of the escaping, as it can be sure the client is interpreting things correctly. Apparently many of the "big tech" services such as Facebook / Twitter / Google / Instagram do this as well, just in case something bad does happen.
The implication of this is that if we wanted to write characters outside the BMP without escaping, this allow-list would need to include the entirety of the supplementary planes (16x the size of the BMP), which would get pretty huge and lead to a considerable increase in the assembly size of the library, for a feature without enough demand.
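The 16x figure comes straight from the structure of the Unicode codespace (a quick check, in Python):

```python
BMP_SIZE = 0x10000             # plane 0: U+0000..U+FFFF
CODESPACE_SIZE = 0x110000      # U+0000..U+10FFFF
SUPPLEMENTARY_SIZE = CODESPACE_SIZE - BMP_SIZE

# 16 supplementary planes, each the size of the BMP.
assert SUPPLEMENTARY_SIZE == 16 * BMP_SIZE
```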
Does this give any reason why languages other than English need to be treated as "unsafe" by default?
Is there a reason there has to be a list at all? Can all JSON characters that the RFC doesn't require to be escaped simply be left unescaped?
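For reference, RFC 8259 only requires escaping the quotation mark, the backslash, and the control characters U+0000 through U+001F; everything else may appear literally. A minimal sketch of such an encoder, in Python (the helper name is hypothetical):

```python
import json

def minimal_escape(s: str) -> str:
    """Escape only what RFC 8259 section 7 requires; leave everything else literal."""
    out = []
    for ch in s:
        if ch == '"':
            out.append('\\"')
        elif ch == '\\':
            out.append('\\\\')
        elif ord(ch) < 0x20:
            out.append(f"\\u{ord(ch):04x}")  # control characters must be escaped
        else:
            out.append(ch)                   # all other characters, any language
    return "".join(out)

encoded = '"' + minimal_escape('中文 and "quotes"\n') + '"'
assert json.loads(encoded) == '中文 and "quotes"\n'
```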
Yeah, the only reason I've seen is to help with devs who put JSON in HTML and forget to HTML-escape. Saying other languages aren't as first-class as English doesn't seem justified by that scenario.
I think that's exactly what this issue is, though - a separate issue to treat other languages as first-class as we treat English.
I'd suggest trying the above MaximalJsonEncoder to feel the impact. A major reason to use JSON is that it's human-readable - unless you use an encoder that's biased against other languages and you happen to write in one of those languages.
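For anyone who can't easily run the C# encoder above, a rough Python equivalent of the same idea (escape every UTF-16 code unit; the function name is hypothetical) gives a quick feel for the result:

```python
import json

def maximal_escape(s: str) -> str:
    """Escape every UTF-16 code unit as \\uXXXX, like MaximalJsonEncoder does."""
    units = s.encode("utf-16-be")  # big-endian pairs of bytes, one pair per code unit
    return "".join(
        f"\\u{(units[i] << 8) | units[i + 1]:04x}" for i in range(0, len(units), 2)
    )

escaped = maximal_escape("Hello, World!")
print('"' + escaped + '"')

# Still parses to the original string, but is unreadable - which is how
# escaped output looks to readers of many non-English languages today.
assert json.loads('"' + escaped + '"') == "Hello, World!"
```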
As mentioned in other issues, folks keep using Newtonsoft rather than switching to this API because of these issues.
Tagging subscribers to this area: @dotnet/area-system-text-encodings-web
In order to understand your requirements better, is this issue proposing something different compared to what #42847 is tracking?
Frankly, I'm not sure why you keep bringing 'language bias' into the discussion. It has nothing to do with the language. It is precisely just the encoding - how characters from those other languages are encoded. Yes, we escape every non-ASCII character. Yes, ASCII only has English letters in it. Does that mean we escape every non-ASCII character because 'we are biased against other languages'? No. It is because having everything encoded in ASCII minimises the chances of anything potentially going wrong.

People make all sorts of weird assumptions about characters and encodings, which end up as misbehaving programs. This kind of mishandling happens all the time with non-English characters everywhere, and is not limited to the JSON space. One similar example is path handling: programs that don't work when the path includes a space or non-ASCII characters. It's so common that when you get some program written abroad and it doesn't work, one of the suggested solutions, apart from "have you tried turning it off and on again", is "make sure the path doesn't include any Hangul". I have no clue how people manage to get this wrong all the time, yet it happens.

What happens if the JSON-encoded string goes through some generic text pre-processor before the JSON gets transmitted to the client, and that pre-processor doesn't know how to process non-ASCII characters properly and ends up mangling them? The same thing as with path handling. It shouldn't happen, I don't even know how they get it wrong, but because people make wrong assumptions it happens, and things end up not working correctly. The wrong assumption doesn't have to be in your code - it could be in someone else's, and that's what we're trying to protect against by escaping by default.
If anything, I would argue that escaping by default protects us from the impact of the "biases" other people have against non-English languages, which result in non-ASCII characters getting handled incorrectly. Imagine this situation: you want to communicate with an API that takes JSON text as the body. You use a default behaviour of not escaping anything, because "what can possibly go wrong". Only to find out that somewhere in the client's code they manage to mishandle non-ASCII characters, and your request is now broken. Now you have to either: A: make them fix their mistake, or B: escape your JSON so it can be handled even by their flawed system. The opposite (client code having problems handling escaped, all-ASCII text) is much less likely to happen, and escaping is therefore the better "default behaviour", even though it means your string is now less human-readable. You could argue that we shouldn't need to work around someone else's problem, but that's the way they chose when designing
I disagree on this one. JSON being a subset of JavaScript has a nice side-effect of being mostly human-readable, but when exactly do you actually need it to be human-readable, and how much of the total use of JSON consists of that? I don't have any numbers, but I bet the majority of JSON use is data interchange between entities or data storage, not direct human-interfacing use like configuration files that people look into. I'm curious why you need JSON to be human-readable in your case. If you really care about human-readability, JSON lacks a lot of features of a truly human-readable format, like single-quoted strings (to allow for literal double quotes within a string), multi-line strings, and comments. For a human-readable format, you might want to use JSON5 or YAML instead, which are actually designed with human-readability in mind.
To summarize, I see two separate requests in this issue:
I disagree with the first issue. The key point is this: setting something as the default behaviour has huge implications. We don't necessarily know where the code is going to be used, and there are a lot of people who assume the default behaviour is good for everything without thinking about what can go wrong, only to run into problems / vulnerabilities later due to clients that mishandle the data. And there are lots of people that mishandle text encoding / JSON data. This matters all the more for core framework code like System.Text.Json.

The alternative to the escape-by-default behaviour is to add warnings to the API documents saying "you should escape if you fall under this category...". But people don't read documents and will inevitably run into issues. It is better to design APIs in a way that just works without causing any major trouble - the escape-everything-by-default behaviour works fine with any spec-compliant deserializer and also makes non-compliant deserializers work instead of break. The only inconvenience is the human readability, but for the 'default behaviour', causing the least amount of issues should be the first goal, not human-readability.

Regarding the second issue, I do think it would be nice to solve, because even with all I've said, I myself have used JSON in a human-interfacing situation a few times (for a config file). However, I'm not too sure how beneficial it would be, given the cost of implementing / maintaining such a feature compared to the benefit it brings, which is really just the nicety of being able to produce more human-friendly JSON. In the original issue talking about emojis (#42847), the two people who actually had problems with the escaping behaviour turned out to have problems elsewhere that had nothing to do with System.Text.Json.
Yes - that issue proposes allowing emoji / non-BMP characters to be left unescaped, whereas this issue is about what is escaped by default.
Because this decision negatively impacts other languages, in terms of size on disk, size in memory, and readability, and we wrote code that goes out of its way to treat such languages differently.
Having ASCII as the basis is more understandable than most of the way this is documented (which just talks about some Latin characters being the only ones allowed).
We can't have it both ways here. If the justification for this approach is "it shouldn't be needed, but we do it to avoid bugs in other people's code", that reasoning also applies to the emoji bug mentioned above. The extra escaping also causes problems due to bugs in other people's code (in addition to the performance and readability impact). If we want to avoid such problems, we need to avoid unnecessary escaping.
Let's avoid the "if you really care about" tone. Choosing JSON for its human readability is fully compatible with choosing JSON rather than another format that might have additional advantages but also disadvantages.
I don't think .NET should be making assumptions about how customers use JSON. I used it for test data from other languages, and I couldn't read the data, making .NET's default unusable for my purpose. But one customer's specific purpose shouldn't be the issue - .NET shouldn't assume customers don't care about the human-readability of JSON. RE: "I disagree on this one. JSON being a subset of JavaScript has a nice side-effect of being mostly human readable": note that the RFC errata indicates JSON isn't fully a subset of JavaScript. More significantly, key sources say otherwise regarding the human-readability of JSON. Wikipedia's first sentence in the article about JSON: "JSON (JavaScript Object Notation, pronounced /ˈdʒeɪsən/; also /ˈdʒeɪˌsɒn/) is an open standard file and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays (or other serializable values)." And JSON.org: "JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate." Note that "easy for humans to read and write" comes before "easy for machines to parse and generate." Prior art here (Json.NET) maintains this property of "human-readable text" / being "easy for humans to read" in all languages.
This issue is specifically about the first item. I filed a separate bug regarding precisely the second item, #86463, and then when the team repeatedly declined to re-open that issue, a request to provide an implementation that doesn't have this extra escaping, #86800, which the team closed as well.
The performance and human-readability implications are the issue here - because .NET "doesn't know where the code is going to be used," concluding performance and human-readability of non-English characters are unimportant seems unwarranted.
As the second issue indicates, there is currently no easy way to do so.
Again, the issue of correctness goes both ways.
We can't both claim these problems "had nothing to do with S.T.Json" and also have us "guard against" "bad programs by default". As noted above, key descriptions of JSON emphasize human-readability, even over ease of machine generation. Trying out the maximal JSON encoder for a while may help in understanding the significance of this issue. I'm hoping that even though those making the decision primarily use English, we can work toward the same outcome that would have occurred if a non-English speaker were considering this issue.
I'm still struggling to understand what this issue is proposing compared to #42847. We clearly can't make this the default, since beyond the concerns already voiced in this thread it would also be a breaking change. So if this is about exposing a new
This issue is about changing the default, given the negative impact on other languages.
It is documented as escaping all non-ASCII characters by default:
I think we're just having different opinions here - and I can totally understand where you are coming from. I'll stop arguing about this point since I think this would lead to an endless debate.
While I do agree that we generally shouldn't have to care about someone else's problem, that kind of decision can have a wide impact on the consumers of an API used by millions of people. Security folks at MSFT apparently decided it is an easy-enough mistake to make and important enough to warrant such drastic measures, I guess (I would also like to hear more about the exact details / reasoning behind those measures, just for curiosity's sake).
Technically true; I omitted that because it's pretty much an exact subset apart from those two characters, later versions of the language were changed to allow them, and I thought it was kind of irrelevant for our discussion.
And regarding better options outside the default, #42847 is only a subset and more focused on changing the existing UnsafeRelaxedJsonEscaping implementation. #86800 is about providing a new implementation designed not to do more escaping than necessary - but that issue was closed, and no issue currently tracks the "provide an implementation with no unnecessary escaping" part.
Yeah, that's the main point here. I think the defense-in-depth security decision incorrectly chose to penalize other languages, just for a defense-in-depth measure that is widely avoided elsewhere (such as Json.NET and XmlSerializer). I think treating other languages equally is more important in choosing a default than extra defense-in-depth.
.... Apart from all the reasons why / why not it should escape by default, the thing is, I am 99% certain the idea of changing the default is simply not going to happen. .NET has a strict guideline for breaking changes, which include:
Given that the security folks made the explicit decision to make it escape a lot of characters by default, and that this is very well documented all around, I think it is very unlikely that this would pass the bar for making breaking changes.
I should also correct/clarify one other earlier statement: this issue is not about only escaping required characters by default. The Default implementation chose to escape some characters to provide defense-in-depth for devs forgetting to escape when placing JSON inside HTML - for example, by escaping < and >. The choice to escape some symbols that are handled specially in other languages is not the issue here. The only issue here is the choice to escape characters in other languages by default, treating them as second-class compared to English characters.
How likely is it that folks have taken a dependency on Greek characters being escaped by default? Is that likelihood more important than the negative default impact on users of other languages? I understand back-compat is an important consideration. I'm hoping that decision is made after first fully understanding the impact on other languages, and after providing a new implementation that would be the default if back-compat hadn't required otherwise.
Note also that the docs indicate users should not take a dependency on some details here, and should expect a change in every release: "The global block list is an implementation detail that has changed in every release of .NET Core and in .NET 5. Don't take a dependency on a character being a member of (or not being a member of) the global block list." If we can change some aspects of what characters are escaped in each release, I think we should also consider whether changing Greek, for example, between escaped and non-escaped is the right decision.
It doesn't just have to be Greek, or any other language for that matter: if anyone took a dependency on its aggressive escaping behaviour and made assumptions that depend on it, their code will now be broken!
Right, but that only applies to the global block list, which is totally out of scope for what we are talking about (characters for other languages in general). As the document I quoted above says, there's documentation that clearly states that the default behaviour escapes all non-ASCII characters.
I'd like to think differently - XML has a way to specify which charset is used to encode the document (the encoding declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>), whereas JSON has no such in-band mechanism.
Irrespective of what you believe should have been the correct default behavior, is there a specific use case where you are being blocked by the current defaults?
Default behavior is not blocking - one can opt out of the default. However, getting to what I believe should be the correct default is currently very difficult: 1) even opting in to the other built-in implementation doesn't resolve it (#86463); 2) even if it did, folks will wrongly think it's "unsafe" (#87138); 3) there is no built-in implementation that just follows the specification's requirements (#86800); and 4) writing such an implementation is unnecessarily complicated (#86810), including needing to go back to the spec to escape some characters or else get broken JSON output, since the current interface mixes the concern of choosing which optional characters to encode with the concern of producing actually valid JSON output.
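That last point - that a custom encoder which skips a required escape silently produces broken JSON - is easy to demonstrate (a Python sketch; the deliberately naive encoder here is hypothetical):

```python
import json

def naive_escape(s: str) -> str:
    # Deliberately broken: forgets that '"', '\\' and control characters
    # MUST be escaped per the JSON spec.
    return s

broken = '{"text": "' + naive_escape("line1\nline2") + '"}'
try:
    json.loads(broken)
    raise AssertionError("expected invalid JSON")
except json.JSONDecodeError:
    pass  # the raw control character makes the output invalid JSON
```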
I really do think most languages get covered by using UnsafeRelaxedJsonEscaping:

```csharp
var options3 = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping,
};
jsonString = JsonSerializer.Serialize(weatherForecast, options3);
```

Putting the discussion about what should be the correct default behaviour aside, is there anything else that prevents you from just using UnsafeRelaxedJsonEscaping?
My point is about what would happen if we changed specifically Greek, not if we changed things like >, ', " (as hex), and <. To be very clear, this issue is not about changing the aggressive escaping behavior of such symbols.
The point is we do expect some default encoding behavior to change with each release. Given the impact on other languages, I think it's worth considering whether this specific default encoding behavior can also change - I think communicating "stuff in this overall space changes all the time; don't take dependencies" likely makes back-compat less severe of a question than usual.
I don't think that understanding is correct. Both RFC and ECMA specs define JSON as a sequence of Unicode characters/code points; it doesn't really need an "encoding" attribute to make it Unicode since that's part of the spec. (The RFC, unlike ECMA, even goes so far as to say that it must always use specifically the UTF-8 encoding of Unicode.)
As noted above, we're trading one set of bugs (in others' code) for another, at the expense of other languages.
UnsafeRelaxedJsonEscaping doesn't handle all the languages that matter, and simply adding emoji won't fix that problem. As the above comment notes, folks wrongly think it's wrong to use that class anyway, and implementing the desired behavior correctly is complicated and involves the risk of generating non-JSON output.
I believe .NET can do better than covering only most languages, and even then not via the default, especially when we're writing a data format defined as Unicode. The only reason we're not covering all languages is the extra code we've written.
But this issue is specifically about what the default behavior should be. The other issues cover other questions, such as what non-default implementation is available, and how hard a custom implementation is to write. Let's consider those questions there.
So the decision really falls down to:

What you are suggesting appears to be 1 (not the languages one; the minimal escaping APIs and the like), but it can be hard to even know that such problems might exist in the first place without someone telling you, and when the problem does surface, it's going to be late (and can have serious implications). Providing such options might be fine for personal projects / third-party libraries, but I don't think it meets the bar of standards in the BCL.
I think the title of that issue should be renamed - the underlying limitation is that we don't encode anything outside of the BMP because the cost of adding the table of allowed characters would be too immense, not just because they're emojis. If we fix it for emojis, we ought to fix it for other SMP characters as well.

Whether or not to make that the default behaviour aside, to be crystal clear: it is unclear whether you want an implementation that escapes nothing, or an implementation that doesn't escape the common, reasonable alphabet characters used in many different countries. I believe you want the latter, and I would agree on that, but your other issues and suggestions seem to suggest that you also want the former, so I'm confused.

I assume most of the "unsafe"-ness comes from HTML escapes and potentially other invisible / control characters (assuming the encoding is handled properly; correct me if I'm wrong). Would providing a built-in encoder implementation that leaves most (not necessarily English) alphabets and the like alone, but escapes those potentially dangerous special characters, solve your issue? But then again, docs for the...
Thinking about it while writing, I guess it not being the default, the name having...
To clarify, I think that's a question about what we name other implementations, rather than a question of what the default behavior is. I'll leave that discussion for #87138 to avoid making the discussion here, about the default behavior, even longer : )
I think either option 1 or 2 is compatible with this question - which is just whether we escape characters in other languages, not whether we escape symbols. I'm not suggesting that .NET change the default behavior of escaping things like HTML symbols.
Yeah, if we could retitle that issue, say, to match #86800, that would be great!
In this issue, an implementation that doesn't escape alphabet characters of any language. I'd also like an inbox implementation that doesn't escape anything beyond what's required (#86800), but that's different from this question of what the default should be.
Yeah, I think that's a question about #87138 rather than the question of the default here. Agreed, docs are part of the answer.
Closing in favor of #87153.
Description
Default JSON escaping doesn't escape English but does escape other languages.
Reproduction Steps
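The reproduction code, restored from the collapsed issue details at the top of the thread:

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.Json;

class Program
{
    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        List<string> data = new List<string>();
        data.Add("English");
        data.Add("日本語");
        data.Add("עברית");
        data.Add("ქართული ენა");
        data.Add("中文");
        data.Add("𑄌𑄋𑄴𑄟𑄳𑄦 𑄃𑄧𑄏𑄛𑄖𑄴");

        // Only "English" survives unescaped; every other entry is
        // emitted as \uXXXX sequences by the default encoder.
        string actual = JsonSerializer.Serialize(data);
        Console.WriteLine(actual);
    }
}
```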
Expected behavior
All languages are treated equally.
Actual behavior
Some languages are not.
For English speakers, try using the following encoder for a while to get a feel for the significance of this impact on other languages:
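The encoder shown at the top of the thread is truncated mid-method; the sketch below completes it. The tail of `TryEncodeUnicodeScalar` and the `WillEncode` override are my reconstruction, not the author's original code:

```csharp
using System.Text;
using System.Text.Encodings.Web;

// Escapes every character as \uXXXX, so English text reads the way
// heavily escaped non-English text does under the default encoder.
class MaximalJsonEncoder : JavaScriptEncoder
{
    public override int MaxOutputCharactersPerInputCharacter => 12;

    public override unsafe int FindFirstCharacterToEncode(char* text, int textLength)
        => textLength == 0 ? -1 : 0; // escape starting at the first character

    // Reconstructed: always encode (needed because FindFirstCharacterToEncode
    // flags everything).
    public override bool WillEncode(int unicodeScalar) => true;

    public override unsafe bool TryEncodeUnicodeScalar(int unicodeScalar, char* buffer, int bufferLength,
        out int numberOfCharactersWritten)
    {
        Rune rune = new Rune(unicodeScalar);
        const int unicodeEscapeLength = 6; // "\\u0000".Length
        int lengthNeeded = rune.Utf16SequenceLength * unicodeEscapeLength;
        if (bufferLength < lengthNeeded)
        {
            numberOfCharactersWritten = 0;
            return false;
        }

        // Reconstructed tail: emit a \uXXXX escape for each UTF-16 code unit.
        Span<char> utf16 = stackalloc char[2];
        int units = rune.EncodeToUtf16(utf16);
        numberOfCharactersWritten = 0;
        for (int i = 0; i < units; i++)
        {
            foreach (char c in $"\\u{(int)utf16[i]:X4}")
            {
                buffer[numberOfCharactersWritten++] = c;
            }
        }
        return true;
    }
}
```

Compile with unsafe code enabled and plug it in via `new JsonSerializerOptions { Encoder = new MaximalJsonEncoder() }`.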
Regression?
No response
Known Workarounds
A relaxed encoder helps with some languages but not all:
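In code, the relaxed workaround (adapted from the snippet quoted earlier in the thread) looks like this; note that the supplementary-plane Chakma text from the repro is still escaped, because nothing outside the BMP is allowed through:

```csharp
using System.Text.Encodings.Web;
using System.Text.Json;

var options = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping,
};

// BMP text such as "日本語" and "עברית" now passes through unescaped,
// but "𑄌" (U+1110C, outside the BMP) is still \uXXXX-escaped.
string json = JsonSerializer.Serialize(new[] { "English", "日本語", "𑄌" }, options);
Console.WriteLine(json);
```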
A custom encoder is currently needed to handle all languages equally.
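A minimal sketch of such a custom encoder, assuming only the characters RFC 8259 requires (control characters, `"`, and `\`) need escaping. The class name is hypothetical and this is illustrative only; as the discussion above notes, a production implementation is complicated (for instance, this sketch does nothing about unpaired surrogates, which can yield non-JSON output):

```csharp
using System.Text.Encodings.Web;
using System.Text.Json;

// Hypothetical encoder: leaves every language's characters alone and
// escapes only what the JSON grammar itself demands.
class AllLanguagesEncoder : JavaScriptEncoder
{
    public override int MaxOutputCharactersPerInputCharacter => 6; // "\uXXXX"

    public override bool WillEncode(int unicodeScalar)
        => unicodeScalar < 0x20 || unicodeScalar == '"' || unicodeScalar == '\\';

    public override unsafe int FindFirstCharacterToEncode(char* text, int textLength)
    {
        for (int i = 0; i < textLength; i++)
        {
            if (WillEncode(text[i]))
            {
                return i;
            }
        }
        return -1; // nothing needs escaping
    }

    public override unsafe bool TryEncodeUnicodeScalar(int unicodeScalar, char* buffer, int bufferLength,
        out int numberOfCharactersWritten)
    {
        // Short escapes where JSON defines them, \uXXXX otherwise.
        string escape = unicodeScalar switch
        {
            '"' => "\\\"",
            '\\' => "\\\\",
            '\b' => "\\b",
            '\f' => "\\f",
            '\n' => "\\n",
            '\r' => "\\r",
            '\t' => "\\t",
            _ => $"\\u{unicodeScalar:X4}",
        };

        numberOfCharactersWritten = 0;
        if (bufferLength < escape.Length)
        {
            return false;
        }
        foreach (char c in escape)
        {
            buffer[numberOfCharactersWritten++] = c;
        }
        return true;
    }
}
```

Use it the same way as the built-in encoders: `new JsonSerializerOptions { Encoder = new AllLanguagesEncoder() }`.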
Configuration
No response
Other information
It could be easy to think that encoding doesn't matter, because the semantic meaning is the same from a programmatic standpoint. But consider two renderings of the same JSON data: one with every non-ASCII character escaped, and one written as plain Unicode text.
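As an illustration, reconstructed from the repro data (the issue's original examples are not preserved in this thread), the default escaped output:

```json
["English", "\u65E5\u672C\u8A9E", "\u05E2\u05D1\u05E8\u05D9\u05EA"]
```

versus the same data written as plain Unicode:

```json
["English", "日本語", "עברית"]
```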
Even though a parser will treat them as meaning the same thing, these are clearly not equally good JSON files, for two key reasons:
.NET should treat other languages equally, and not relegate some to be treated like the second example above.