Decode with utf8 by default for non-text (or all?) content types #175

Open
cdvv7788 opened this issue Jul 13, 2018 · 20 comments
Labels
type-enhancement A request for a change that isn't a bug

Comments

@cdvv7788

I am requesting some information from a server, which returns the following (using Postman):

Headers:
Allow →GET, HEAD, OPTIONS
CF-RAY →439e4801cf3db955-MIA
Connection →keep-alive
Content-Encoding →gzip
Content-Type →application/json

Body:
{
"name": "SARA LUCIA OSSA PEÑA",
}

But using http, I get the following (with a simple get('url') request):
{
"name": "SARA LUCIA OSSA PE�A",
}

To fix this, I had to do something like:
utf8.decode(response.bodyBytes)

This works as expected, and the information is retrieved fine. It is, however, a pain to set up (and inconsistent with post, where utf8 is the default encoding).

Is there a better way to handle this? An argument to get that forces the encoding? Shouldn't application/json assume utf8 by default?

I came up with the solution after reading https://pub.dartlang.org/documentation/http/latest/http/Response-class.html and the description of the body property. It is probably decoding the body with the wrong encoding.
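A minimal sketch of the symptom described above, using only `dart:convert`. The sample string is taken from the report; the names `bodyBytes`, `decodedAsLatin1` and `decodedAsUtf8` are made up for illustration, and the sketch assumes the server sent UTF-8 bytes without a charset parameter.

```dart
import 'dart:convert';

/// Bytes a server would send for the JSON in the report, encoded as UTF-8.
final List<int> bodyBytes = utf8.encode('{"name": "SARA LUCIA OSSA PEÑA"}');

/// What a Latin-1 fallback produces: every byte becomes one character,
/// so the two-byte sequence for "Ñ" turns into two unrelated characters.
String decodedAsLatin1() => latin1.decode(bodyBytes);

/// What the workaround from the report produces: the original text.
String decodedAsUtf8() => utf8.decode(bodyBytes);
```

Printing both results side by side shows the garbled name in the first and the intact name in the second.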

Anyway, thanks for the hard work. Awesome library.

@cdvv7788 cdvv7788 changed the title GET utf8 json Response application/json utf8 not decoding correctly Jul 13, 2018
@zoechi

zoechi commented Jul 15, 2018

Good comments about this in
https://stackoverflow.com/questions/9254891/what-does-content-type-application-json-charset-utf-8-really-mean

I don't think it's the HTTP client's job to do the decoding. The header is no guarantee that the content will be of that type.
You can always build your own wrapper that does the UTF8-decoding for you so you don't have to repeat yourself.
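One possible shape for such a wrapper, as a sketch only: `decodeJsonUtf8` is a hypothetical name, and it takes the raw bytes (for example `response.bodyBytes`) so that it depends on nothing but `dart:convert`.

```dart
import 'dart:convert';

/// Hypothetical helper: decodes raw response bytes as UTF-8, then parses JSON.
/// Throws a FormatException if the bytes are not valid UTF-8 or not valid JSON.
Object? decodeJsonUtf8(List<int> bodyBytes) =>
    jsonDecode(utf8.decode(bodyBytes));
```

A caller would pass `response.bodyBytes` instead of reading `response.body`, which keeps the charset decision in one place.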

@cdvv7788
Author

@zoechi The client has to try to decode using a best effort approach. The http client is already decoding, but using the wrong encoding. There are 2 attributes in the response, body and bodyBytes.
In the same link you sent, it is mentioned:

Designating the encoding is somewhat redundant for JSON, since the default (only?) encoding for JSON is UTF-8

Knowing that the content type is JSON should be enough for the client to interpret it, or at least to avoid interpreting it with the wrong encoding. This is not an edge case: JSON is probably the most popular format for API communication at the moment. You are right about the wrapper, but if the headers are coherent, the client should be able to handle the body properly too.

@ghost

ghost commented Aug 24, 2018

So, the situation is a bit murky, so bear with me..

The first layer in any of this is HTTP.
HTTP, AFAICT, defines the default character set to be ISO-8859-1, which is why the Content-Type charset parameter exists as an explicit override.
In Content-Type all parameters (charset included) are optional, but specific types may define their own required parameters.

JSON, in turn, explicitly does not define the charset parameter.
JSON originally declared that it "shall be encoded in Unicode" with the default byte encoding being UTF-8.
But that's a rather vague declaration, which is perhaps why they later amended it to be a requirement that JSON be encoded in UTF-8.

Next, the Dart http API defines (as mentioned above) two ways of interacting with the response data:

body → String
The body of the response as a string. This is converted from bodyBytes using the charset parameter of the Content-Type header field, if available. If it's unavailable or if the encoding name is unknown, latin1 is used by default, as per RFC 2616.

bodyBytes → Uint8List
The bytes comprising the body of this response.

So as discussed above, HTTP in absence of a defined charset is assumed to be encoded in ISO-8859-1 (Latin-1). And body from its description is consistent with this behaviour.
If the server response sets the Content-Type header to application/json; charset=utf-8 the body should work as expected.

The problem, of course, is that there are servers out there that do not set charset for JSON. That is valid, but it falls into a grey area between the two specs:

  • JSON is always supposed to be UTF-8, and for that reason says you don't need to set charset, but ..
  • HTTP is always by default ISO-8859-1, unless the charset is explicitly set.

A "smart" HTTP client could choose to follow the JSON definition closer than the HTTP definition and simply say any application/json is by default UTF-8 - technically violating the HTTP standard.
However, the most robust solution is ultimately for the server to explicitly state the charset which is valid according to both standards.
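The two fallback policies discussed above can be sketched with `dart:convert` alone. This is not the source of package:http; `charsetOf`, `documentedFallback` and `jsonAwareFallback` are hypothetical names, and the Content-Type parsing is deliberately simplistic.

```dart
import 'dart:convert';

/// Hypothetical helper: returns the charset parameter of a Content-Type
/// header value, or null when there is none.
String? charsetOf(String? contentType) {
  if (contentType == null) return null;
  for (final part in contentType.split(';').skip(1)) {
    final kv = part.split('=');
    if (kv.length == 2 && kv[0].trim().toLowerCase() == 'charset') {
      return kv[1].trim().replaceAll('"', '');
    }
  }
  return null;
}

/// The behaviour described in the `body` documentation quoted above:
/// use the charset if it is known, otherwise fall back to latin1.
Encoding documentedFallback(String? contentType) =>
    Encoding.getByName(charsetOf(contentType)) ?? latin1;

/// The "smart" variant: an explicit charset still wins, but a bare
/// application/json is treated as UTF-8 instead of latin1.
Encoding jsonAwareFallback(String? contentType) {
  final named = Encoding.getByName(charsetOf(contentType));
  if (named != null) return named;
  final mediaType = contentType?.split(';').first.trim().toLowerCase();
  return mediaType == 'application/json' ? utf8 : latin1;
}
```

The only case where the two functions differ is a bare `application/json` header, which is exactly the case reported in this issue.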

As for this bug I'm inclined to say that http is working as intended, though the standards are a bit at odds with each other on this one.
@cdvv7788, if you are able to you could add charset to your Content-Type on the server which should fix your issue.
Alternatively if you're stuck with your server as-is, I recommend you try something like the httpserver example:

  // Uses dart:io's HttpClient and decodes the response stream as UTF-8.
  HttpClientRequest request = await HttpClient().post(_host, 4049, path)
    ..headers.contentType = ContentType.json
    ..write(jsonEncode(jsonData));
  HttpClientResponse response = await request.close();
  await response.transform(utf8.decoder).forEach(print);

Hope it helps.

I'll close this issue assuming all open questions are resolved.

@ghost ghost closed this as completed Aug 24, 2018
@ghost ghost added the closed-as-intended Closed as the reported issue is expected behavior label Aug 24, 2018
@cdvv7788
Author

Got it. Thanks for this.

@tomchristie

So as discussed above, HTTP in absence of a defined charset is assumed to be encoded in ISO-8859-1

Note that applies to "text" media types. JSON is "application/json".

Correct clients should treat JSON (or any other non-text media type) responses as bytestrings, rather than text. (ie. use response.bodyBytes.)

@ghost

ghost commented Aug 27, 2018

@tomchristie, right. I also elaborated a bit on this in #186, but basically saying the same but with more words. :)

@fabiocarneiro

I still believe this is bad behavior and have added comments to #186

cimadai pushed a commit to cimadai/openapi-generator that referenced this issue Dec 18, 2018
…o avoid text garbling.

The original processing uses `response.body` to deserialize JSON.
However, this is decoded as latin1 if the header contains only "application/json" instead of "application/json; charset=utf-8".

Because of this behavior, if the response body is encoded as UTF-8 but the header doesn't contain a charset, the body is garbled.

cf: dart-lang/http#175

Since playframework 2.6 returns "Content-Type: application/json" without "charset=utf-8", I changed this parsing algorithm.
wing328 pushed a commit to OpenAPITools/openapi-generator that referenced this issue Jan 9, 2019
…o avoid text garbling. (#1700)

* fix: force to decode as utf-8 when header contains application/json to avoid text garbling.
* fix: force to decode as utf-8 when header contains application/json to avoid text garbling on error.
A-Joshi pushed a commit to ihsmarkitoss/openapi-generator that referenced this issue Feb 27, 2019
…o avoid text garbling. (OpenAPITools#1700)

* fix: force to decode as utf-8 when header contains application/json to avoid text garbling.
* fix: force to decode as utf-8 when header contains application/json to avoid text garbling on error.
MarcoSavaglia added a commit to MarcoSavaglia/protoc-gen-twirp_dart that referenced this issue Oct 28, 2019
By default, the Dart HTTP decoding will assume that the charset is ISO-8859-1 (latin-1). This causes emojis, certain apostrophe characters, etc. to be decoded incorrectly.
See here for extended discussion on why:
dart-lang/http#175

Essentially, this occurs when the content-type header returns without a charset.
mterring pushed a commit to happy-co/protoc-gen-twirp_dart that referenced this issue Oct 28, 2019
@renatoathaydes

This error also happens when the content-type is text/html (even when the HTML content says it's encoded as utf-8), and image/svg+xml (which also declares utf-8 in the content), for example. The fact that HTTP establishes a default encoding that's not UTF-8 is a sign of its age: today, I doubt you could do better than use UTF-8 as default for any text you get online.

@gsouf

gsouf commented Apr 26, 2020

According to the standard for json, you are not actually allowed to use latin1 for the encoding of the contents. JSON content must be encoded as unicode, be it UTF-8, UTF-16, or UTF-32 (big or little endian).
(https://stackoverflow.com/questions/9254891/what-does-content-type-application-json-charset-utf-8-really-mean)

I'm stuck with a server I have no control over, and it does not return a header specifying that the JSON content is UTF-8. UTF-8 is implicitly expected.

@renatoathaydes

@cskau-g the interpretation that HTTP uses ISO for text content is outdated and that requirement has been removed from the HTTP spec:

Appendix B of RFC-7231:

 The default charset of ISO-8859-1 for text media types has been
   removed; the default is now whatever the media type definition says.

Furthermore, the relevant part of the current spec does not mention at all a default charset to be applied to textual representations for any media-type:

https://tools.ietf.org/html/rfc7231#section-3.1.1.2

The JSON RFC, meanwhile, determines that the charset when used in conjunction with application/json should have no effect:

Note:  No "charset" parameter is defined for this registration.
      Adding one really has no effect on compliant recipients.

It has also been amended to make UTF-8 mandatory in the case of data transmitted over a network, which is the primary use-case for HTTP:

https://tools.ietf.org/html/rfc8259#appendix-A

 Section 8.1 was changed to require the use of UTF-8 when
      transmitted over a network.

Section 8.1:

JSON text exchanged between systems that are not part of a closed
   ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8
   when transmitting JSON text.  However, the vast majority of JSON-
   based software implementations have chosen to use the UTF-8 encoding,
   to the extent that it is the only encoding that achieves
   interoperability.

Hopefully, this is enough to show that the currently most widely used data exchange format on the internet is not supported correctly by the Dart HTTP client. Please consider changing this behavior, as keeping it as is will only hurt Dart's standing for no good reason.

@natebosch natebosch changed the title Response application/json utf8 not decoding correctly Decode with utf8 by default for non-text (or all?) content types Jul 15, 2021
@natebosch
Member

Reopening to track - I do think we should consider changing the defaults since most users are likely to benefit.

Note that the expected pattern to use today when you know the result is json is jsonDecode(utf8.decode(response.bodyBytes)). We should consider changing it so that jsonDecode(response.body) works as well.

@natebosch natebosch reopened this Jul 15, 2021
@natebosch natebosch added type-enhancement A request for a change that isn't a bug and removed closed-as-intended Closed as the reported issue is expected behavior labels Jul 16, 2021
@natebosch
Member

Changing the default for all responses, or even for non-text responses, is breaking. At least one internal usage is impacted.

Changing the default only when the content type is application/json is more narrow and may be safer.

@fabiocarneiro

As stated before in #186, this behavior is wrong and should be corrected. It doesn't matter whether it breaks backward compatibility or not. Release a new major version if that is necessary.

In 2018 we put a lot of effort into explaining HTTP here and it was just ignored. Had it been taken into consideration at the time, the change would be widely adopted by now. How many more years do we need to wait?

@crimsonvspurple

I believe both Firefox and Chrome have, for quite a while, treated application/json as utf-8 by default (e.g., https://bugzilla.mozilla.org/show_bug.cgi?id=741776).

Some systems, such as Spring Boot, have even deprecated application/json; charset=utf-8 (https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/http/MediaType.html#APPLICATION_JSON_UTF8_VALUE).

Processing JSON as non-UTF-8 by default makes no sense. Please make utf-8 the default. Thank you.

@0xNF

0xNF commented Feb 18, 2022

I'd like to raise this issue again -- like @renatoathaydes noted, RFC7231 (circa 2014) supersedes 2616 (circa 1999) to make interpretation of application/json as anything except utf-8 an incorrect implementation of the specification.

I understand the suggested way to access json data from a response is to use the jsonDecode(utf8.decode(response.bodyBytes)) pattern, but this part of the implementation not only bites dart beginners who don't know that particular piece of lore, but is also flatly wrong from the perspective of modern RFC compliance.

@miDeb

miDeb commented Aug 28, 2023

Hi, is there any status update on making utf8 the default for decoding json responses? It's not fun to discover that jsonDecode(response.body) is not standards compliant and should always have been jsonDecode(utf8.decode(response.bodyBytes)) everywhere in our application. Maybe the addition of a .json getter on http.Response that does the right thing could also be a possible improvement.
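A rough sketch of what such a getter could look like. To stay self-contained it is written against a stand-in class, `ResponseLike`, rather than package:http's `Response`; both `ResponseLike` and the `JsonBody` extension are hypothetical and not part of any released API.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Stand-in for the part of a response this sketch needs.
/// This is NOT package:http's Response class.
class ResponseLike {
  ResponseLike(this.bodyBytes);

  /// The raw bytes of the response body.
  final Uint8List bodyBytes;
}

/// Hypothetical extension adding a `json` getter.
extension JsonBody on ResponseLike {
  /// Decodes the raw bytes as UTF-8 and parses them as JSON,
  /// ignoring any charset parameter in the headers.
  Object? get json => jsonDecode(utf8.decode(bodyBytes));
}
```

In application code the same extension could presumably be declared on the real response type, since it only relies on `bodyBytes`.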

@0xNF

0xNF commented Aug 28, 2023

You should be using response.bodyBytes instead of response.body, because the latter will try to decode into a string, which may cause exceptions that you aren't expecting.

@miDeb

miDeb commented Aug 28, 2023

Thanks @0xNF for the correction, I mistyped (utf8.decode(response.body) wouldn't have made sense, as it wouldn't even compile).

@daenney

daenney commented Feb 2, 2024

RFC 8259, the current RFC reference for application/json in the IANA Media Type Registry, obsoletes 7159 and states in section 8.1 Character encoding

JSON text exchanged between systems that are not part of a closed
ecosystem MUST be encoded using UTF-8 [[RFC3629](https://www.rfc-editor.org/rfc/rfc3629)].

Previous specifications of JSON have not required the use of UTF-8
when transmitting JSON text. However, the vast majority of JSON-
based software implementations have chosen to use the UTF-8 encoding,
to the extent that it is the only encoding that achieves
interoperability.

@renatoathaydes

@daenney I mentioned this almost 4 years ago: #175 (comment)

I suspect Google would have too much work to do if this was changed, hence it will probably stay as it is even when it's clearly failing to follow the specs.
