Decode with utf8 by default for non-text (or all?) content types #175

cdvv7788 · 2018-07-13T20:19:00Z

I am requesting some information from a server, which returns the following (using Postman):

Headers:
Allow →GET, HEAD, OPTIONS
CF-RAY →439e4801cf3db955-MIA
Connection →keep-alive
Content-Encoding →gzip
Content-Type →application/json

Body:
{
"name": "SARA LUCIA OSSA PEÑA",
}

But, using http, I am getting the following (using a simple request get('url')):
{
"name": "SARA LUCIA OSSA PEÃ�A",
}

To fix this, I had to do something like:
UTF8.decode(response.bodyBytes)

This works as expected, and the information is retrieved fine. This, however, is a pain to set up (and inconsistent with post, where utf8 is used as the default encoding).
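The mismatch described above can be reproduced without a server. The following is a minimal sketch using only dart:convert, assuming the body bytes are UTF-8 and the client falls back to latin1 because the Content-Type carries no charset; the sample JSON string is just the payload from this report.

```dart
import 'dart:convert';

void main() {
  // Bytes as a server would send them for UTF-8 encoded JSON.
  // "\u00D1" is "Ñ", which UTF-8 encodes as the two bytes 0xC3 0x91.
  final bytes = utf8.encode('{"name": "SARA LUCIA OSSA PE\u00D1A"}');

  // latin1 maps every byte to exactly one character, so the two bytes
  // of "Ñ" turn into two unrelated characters (the garbled output).
  final garbled = latin1.decode(bytes);

  // Decoding the same bytes as UTF-8 restores the original text.
  final correct = utf8.decode(bytes);

  print(garbled == correct); // false
  print(jsonDecode(correct)['name']);
}
```

With a real response, `bytes` would be `response.bodyBytes`.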

Is there a better way to handle this? An argument to get to force the encoding? Shouldn't application/json assume utf8 by default?

I came up with the solution after reading https://pub.dartlang.org/documentation/http/latest/http/Response-class.html and the body property. Probably it is decoding the body with the wrong encoding.

Anyway, thanks for the hard work. Awesome library.

zoechi · 2018-07-15T16:56:33Z

Good comments about this in
https://stackoverflow.com/questions/9254891/what-does-content-type-application-json-charset-utf-8-really-mean

I don't think it's the HTTP client's job to do the decoding. The header is no guarantee that the content will be of that type.
You can always build your own wrapper that does the UTF8-decoding for you so you don't have to repeat yourself.
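A wrapper of the kind suggested here could look roughly like the sketch below. The name `decodeJsonBody` is hypothetical and not part of any package; it takes the raw bytes (for example `response.bodyBytes`) so that the sketch depends only on dart:convert.

```dart
import 'dart:convert';

/// Hypothetical helper: always treats the body bytes as UTF-8 and parses
/// them as JSON, regardless of what the Content-Type header says.
Object? decodeJsonBody(List<int> bodyBytes) =>
    jsonDecode(utf8.decode(bodyBytes));

void main() {
  // Stand-in for response.bodyBytes.
  final bodyBytes = utf8.encode('{"name": "SARA LUCIA OSSA PE\u00D1A"}');

  final data = decodeJsonBody(bodyBytes) as Map<String, dynamic>;
  print(data['name']);
}
```

Note that `utf8.decode` throws a `FormatException` on bytes that are not valid UTF-8, so callers that cannot trust the server may want to catch it.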

cdvv7788 · 2018-07-16T13:09:13Z

@zoechi The client has to try to decode using a best-effort approach. The http client is already decoding, but using the wrong encoding. There are 2 attributes in the response, body and bodyBytes.
In the same link you sent, it is mentioned:

Designating the encoding is somewhat redundant for JSON, since the default (only?) encoding for JSON is UTF-8

Knowing that the content type is JSON should be enough for the client to interpret it, or at least to avoid interpreting it with the wrong encoding. This is not an edge case: JSON is probably the most popular format for API communication at the moment. You are right on the wrapper part, but if the headers are coherent, the client should be able to handle the body properly too.

ghost · 2018-08-24T10:36:00Z

So, the situation is a bit murky, so bear with me..

The first layer in any of this is HTTP.
HTTP, AFAICT, defines the default character set to be ISO-8859-1, which is why the Content-Type charset parameter exists as an explicit override.
In Content-Type all parameters (charset included) are optional, but specific types may define their own required parameters.

JSON, in turn, explicitly does not define the charset parameter.
JSON originally declared that it "shall be encoded in Unicode" with the default byte encoding being UTF-8.
But that's a rather vague declaration, which is perhaps why they later amended it to be a requirement that JSON be encoded in UTF-8.

Next, the Dart http API defines (as mentioned above) two ways of interacting with the response data:

body → String
The body of the response as a string. This is converted from bodyBytes using the charset parameter of the Content-Type header field, if available. If it's unavailable or if the encoding name is unknown, latin1 is used by default, as per RFC 2616.

bodyBytes → Uint8List
The bytes comprising the body of this response.

So as discussed above, HTTP in absence of a defined charset is assumed to be encoded in ISO-8859-1 (Latin-1). And body from its description is consistent with this behaviour.
If the server response sets the Content-Type header to application/json; charset=utf-8 the body should work as expected.

The problem of course is that there are servers out there that do not set charset for JSON (which is valid), but which is also a bit of a grey area in between the two specs:

JSON is always supposed to be UTF-8, and for that reason says you don't need to set charset, but ..
HTTP is always by default ISO-8859-1, unless the charset is explicitly set.

A "smart" HTTP client could choose to follow the JSON definition closer than the HTTP definition and simply say any application/json is by default UTF-8 - technically violating the HTTP standard.
However, the most robust solution is ultimately for the server to explicitly state the charset which is valid according to both standards.
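The two policies contrasted above can be sketched as a single function. This is an illustration only, not the http package's actual implementation: `encodingFor` and its `jsonDefaultsToUtf8` flag are hypothetical names, and the header parsing is deliberately naive (it does not handle quoted parameter values).

```dart
import 'dart:convert';

/// Hypothetical helper: picks a decoder from a raw Content-Type value.
/// An explicit, known charset always wins. Otherwise the fallback is
/// latin1, unless [jsonDefaultsToUtf8] is set and the type is JSON.
Encoding encodingFor(String contentType, {bool jsonDefaultsToUtf8 = false}) {
  final parts = contentType.split(';').map((p) => p.trim()).toList();
  final mediaType = parts.first.toLowerCase();

  for (final param in parts.skip(1)) {
    final eq = param.indexOf('=');
    if (eq < 0) continue;
    if (param.substring(0, eq).trim().toLowerCase() == 'charset') {
      // Encoding.getByName returns null for names it does not know.
      final named = Encoding.getByName(param.substring(eq + 1).trim());
      if (named != null) return named;
    }
  }

  if (jsonDefaultsToUtf8 && mediaType == 'application/json') return utf8;
  return latin1;
}

void main() {
  print(encodingFor('application/json; charset=utf-8').name);
  print(encodingFor('application/json').name);
  print(encodingFor('application/json', jsonDefaultsToUtf8: true).name);
}
```

The first call follows the explicit charset, the second shows the latin1 fallback described in the Response.body documentation quoted above, and the third shows the "smart" behaviour for application/json.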

As for this bug I'm inclined to say that http is working as intended, though the standards are a bit at odds with each other on this one.
@cdvv7788, if you are able to you could add charset to your Content-Type on the server which should fix your issue.
Alternatively if you're stuck with your server as-is, I recommend you try something like the httpserver example:

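  // Requires: import 'dart:convert'; import 'dart:io';
  // _host, path and jsonData are placeholders defined elsewhere.
  HttpClientRequest request = await HttpClient().post(_host, 4049, path)
    ..headers.contentType = ContentType.json
    ..write(jsonEncode(jsonData));
  HttpClientResponse response = await request.close();
  // Decode the raw response bytes as UTF-8, whatever charset the header gives.
  await response.transform(utf8.decoder).forEach(print);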

Hope it helps.

I'll close this issue assuming all open questions are resolved.

cdvv7788 · 2018-08-24T13:09:35Z

Got it. Thanks for this.

tomchristie · 2018-08-27T09:36:28Z

So as discussed above, HTTP in absence of a defined charset is assumed to be encoded in ISO-8859-1

Note that applies to "text" media types. JSON is "application/json".

Correct clients should treat JSON (or any other non-text media type) responses as bytestrings, rather than text. (ie. use response.bodyBytes.)

ghost · 2018-08-27T10:44:35Z

@tomchristie, right. I also elaborated a bit on this in #186, but basically saying the same but with more words. :)

fabiocarneiro · 2018-08-28T02:09:02Z

I still believe this is a bad behavior and added comments to #186

…o avoid text garbling. (#1700) * fix: force to decode as utf-8 when header contains application/json to avoid text garbling. The original processing is using `response.body` to deserialize as json. However, this is decoded by latin1 if the header contains only "application/json" instead of "application/json; charset=utf-8". Because of this behavior, if the response body is encoded UTF-8 but the headers doesn't contain charset, the body will garbling. cf: dart-lang/http#175 Since playframework 2.6 returns "Content-Type: application/json" without "charset=utf-8", I changed this parsing algolithm. * fix: force to decode as utf-8 when header contains application/json to avoid text garbling on error.

By default, the Dart HTTP decoding will assume that the charset is ISO-8859-1 (latin-1). This causes emojis, certain apostrophe characters, etc to not function correctly. See here for extended discussion on why: dart-lang/http#175 Essentially, this occurs when the content-type header returns without a charset.

renatoathaydes · 2020-04-22T15:13:39Z

This error also happens when the content-type is text/html (even when the HTML content says it's encoded as utf-8), and image/svg+xml (which also declares utf-8 in the content), for example. The fact that HTTP establishes a default encoding that's not UTF-8 is a sign of its age: today, I doubt you could do better than use UTF-8 as default for any text you get online.

gsouf · 2020-04-26T16:17:26Z

According to the standard for json, you are not actually allowed to use latin1 for the encoding of the contents. JSON content must be encoded as unicode, be it UTF-8, UTF-16, or UTF-32 (big or little endian).
(https://stackoverflow.com/questions/9254891/what-does-content-type-application-json-charset-utf-8-really-mean)

I'm stuck with a server I don't control, and it does not return a header specifying that the JSON content is UTF-8. That is implicitly expected.

renatoathaydes · 2020-04-26T17:46:30Z

@cskau-g the interpretation that HTTP uses ISO for text content is outdated and that requirement has been removed from the HTTP spec:

Appendix B of RFC-7231:

 The default charset of ISO-8859-1 for text media types has been
   removed; the default is now whatever the media type definition says.

Furthermore, the relevant part of the current spec does not mention at all a default charset to be applied to textual representations for any media-type:

https://tools.ietf.org/html/rfc7231#section-3.1.1.2

The JSON RFC, meanwhile, determines that the charset when used in conjunction with application/json should have no effect:

Note:  No "charset" parameter is defined for this registration.
      Adding one really has no effect on compliant recipients.

It has also been amended to make UTF-8 mandatory in the case of data transmitted over a network, which is the primary use-case for HTTP:

https://tools.ietf.org/html/rfc8259#appendix-A

 Section 8.1 was changed to require the use of UTF-8 when
      transmitted over a network.

Section 8.1:

JSON text exchanged between systems that are not part of a closed
   ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8
   when transmitting JSON text.  However, the vast majority of JSON-
   based software implementations have chosen to use the UTF-8 encoding,
   to the extent that it is the only encoding that achieves
   interoperability.

Hopefully, this is enough to show that the currently most widely used data exchange format on the internet is not supported correctly by the Dart http client. Please consider changing this behavior, as keeping it as it is will only hurt Dart's standing for no good reason.

natebosch · 2021-07-15T22:16:53Z

Reopening to track - I do think we should consider changing the defaults since most users are likely to benefit.

Note that the expected pattern to use today when you know the result is json is jsonDecode(utf8.decode(response.bodyBytes)). We should consider changing it so that jsonDecode(response.body) works as well.
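For reference, the two-step pattern mentioned here can also be written as one fused converter from dart:convert. The sketch below compares both; `jsonFromUtf8` is a hypothetical name, and `bodyBytes` stands in for `response.bodyBytes`.

```dart
import 'dart:convert';

// Fusing the UTF-8 decoder with the JSON decoder yields a single
// converter that goes straight from bytes to decoded JSON.
final Converter<List<int>, Object?> jsonFromUtf8 =
    utf8.decoder.fuse(json.decoder);

void main() {
  // Stand-in for response.bodyBytes.
  final bodyBytes = utf8.encode('{"name": "SARA LUCIA OSSA PE\u00D1A"}');

  // The two-step pattern from the comment above.
  final viaPattern = jsonDecode(utf8.decode(bodyBytes)) as Map<String, dynamic>;

  // The fused equivalent.
  final viaFused = jsonFromUtf8.convert(bodyBytes) as Map<String, dynamic>;

  print(viaPattern['name'] == viaFused['name']); // true
}
```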

natebosch · 2021-07-20T20:26:47Z

Changing the default for all responses, or even for non-text responses, is breaking. At least one internal usage is impacted.

Changing the default only when the content type is application/json is more narrow and may be safer.

fabiocarneiro · 2021-07-20T20:34:20Z

As stated before in #186, this behavior is wrong and should be corrected. It doesn't matter if it breaks bc or not. Release a new major if that is necessary.

In 2018 we discussed this and put a lot of effort into explaining HTTP, and it was just ignored. If it had been taken into consideration at the time, everyone would have adopted the change by now. How many more years do we need to wait?

crimsonvspurple · 2022-01-10T13:27:16Z

I believe both Firefox and Chrome have, for quite a while, treated application/json as UTF-8 by default (e.g., https://bugzilla.mozilla.org/show_bug.cgi?id=741776).

Some systems have even deprecated application/json; charset=utf-8 such as Spring Boot ( https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/http/MediaType.html#APPLICATION_JSON_UTF8_VALUE ).

Processing JSON as non-UTF-8 by default makes no sense. Please make UTF-8 the default. Thank you.

0xNF · 2022-02-18T10:55:43Z

I'd like to raise this issue again -- like @renatoathaydes noted, RFC7231 (circa 2014) supersedes 2616 (circa 1999) to make interpretation of application/json as anything except utf-8 an incorrect implementation of the specification.

I understand the suggested way to access json data from a response is to use the jsonDecode(utf8.decode(response.bodyBytes)) pattern, but this part of the implementation not only bites dart beginners who don't know that particular piece of lore, but is also flatly wrong from the perspective of modern RFC compliance.

miDeb · 2023-08-28T07:15:55Z

Hi, is there any status update on making utf8 the default for decoding json responses? It's not fun to discover that jsonDecode(response.body) is not standards compliant and should always have been jsonDecode(utf8.decode(response.bodyBytes)) everywhere in our application. Maybe the addition of a .json getter on http.Response that does the right thing could also be a possible improvement.

0xNF · 2023-08-28T07:20:05Z

You should be using response.bodyBytes instead of response.body, because the latter will try to decode into a string, which may cause exceptions that you aren't expecting.

miDeb · 2023-08-28T07:21:43Z

Thanks @0xNF for the correction, I mistyped (it wouldn't have made sense to write utf8.decode(response.body), as that wouldn't even compile).

daenney · 2024-02-02T12:05:47Z

RFC 8259, the current RFC reference for application/json in the IANA Media Type Registry, obsoletes RFC 7159 and states in section 8.1, Character Encoding:

JSON text exchanged between systems that are not part of a closed
ecosystem MUST be encoded using UTF-8 [[RFC3629](https://www.rfc-editor.org/rfc/rfc3629)].

Previous specifications of JSON have not required the use of UTF-8
when transmitting JSON text. However, the vast majority of JSON-
based software implementations have chosen to use the UTF-8 encoding,
to the extent that it is the only encoding that achieves
interoperability.

renatoathaydes · 2024-02-07T13:15:23Z

@daenney I mentioned this almost 4 years ago: #175 (comment)

I suspect Google would have too much work to do if this was changed, hence it will probably stay as it is even when it's clearly failing to follow the specs.

cdvv7788 changed the title ~~GET utf8 json~~ Response application/json utf8 not decoding correctly Jul 13, 2018

ghost closed this as completed Aug 24, 2018

ghost added the closed-as-intended Closed as the reported issue is expected behavior label Aug 24, 2018

ghost mentioned this issue Aug 24, 2018

Misinterpretation of rfc2616 in response.dart #186

Closed

cdvv7788 mentioned this issue Aug 27, 2018

JSONRenderer charset encode/django-rest-framework#2891

Closed

ifndefdeadmau5 mentioned this issue Dec 12, 2018

Expose bodyBytes from http.response in client.dart zino-hofmann/graphql-flutter#141

Closed

cimadai mentioned this issue Dec 18, 2018

Bug fix: force to decode as utf-8 when header contains application/json to avoid text garbling. OpenAPITools/openapi-generator#1700

Merged

MaikuB mentioned this issue May 30, 2019

fix decoding of response from HNPWA API so it is read as UTF-8 brianegan/hnpwa_client#1

Merged

MarcoSavaglia mentioned this issue Oct 28, 2019

Update API json decoding to use utf8 by default happy-co/protoc-gen-twirp_dart#2

Merged

luiz-simples mentioned this issue Mar 5, 2020

Response application/json utf8 not decoding correctly furaiev/amazon-cognito-identity-dart-2#29

Merged

sanekyy mentioned this issue May 27, 2020

UTF-8 as default charset for response body decoding f3ath/json-api-dart#94

Closed

damianham mentioned this issue Sep 8, 2020

Explicitly declare the UTF-8 charset for json responses amberframework/amber#1231

Merged

fearhq mentioned this issue Nov 19, 2020

Encoding issues while generating schema from endpoint JetBrains/js-graphql-intellij-plugin#414

Open

FabulousGee mentioned this issue Feb 21, 2021

Wrong/missing charset on JSON response atk4/ui#1607

Closed

se-bastiaan mentioned this issue Jul 11, 2021

Charset content-type parameter in API responses svthalia/concrexit#1808

Closed

This was referenced Jul 13, 2021

Need to override application/json default content decoder to UTF-8 #494

Closed

Default encoding for Content-Type application/json should be UTF-8 #367

Closed

natebosch changed the title ~~Response application/json utf8 not decoding correctly~~ Decode with utf8 by default for non-text (or all?) content types Jul 15, 2021

natebosch reopened this Jul 15, 2021

natebosch added type-enhancement A request for a change that isn't a bug and removed closed-as-intended Closed as the reported issue is expected behavior labels Jul 16, 2021

natebosch mentioned this issue Jul 16, 2021

Json Content-Type and explicity encoding declaration #455

Closed

natebosch mentioned this issue Aug 5, 2021

add default charset of application/json to utf8 dart-lang/http_parser#51

Closed

natebosch mentioned this issue Apr 13, 2022

is there a bug related to "Content-Type": "application/x-www-form-urlencoded"? #686

Closed

daniel-jones-deepl mentioned this issue May 5, 2022

encoding DeepLcom/deepl-python#21

Closed

tsmethurst mentioned this issue Feb 2, 2024

[feature] Include charset in Content-Type header superseriousbusiness/gotosocial#2598

Closed

KRTirtho mentioned this issue Jun 1, 2024

Some text is garbled KRTirtho/spotube#1463

Closed
