Encoding headers with ASCII breaks applications expecting them to be encoded in latin-1 #1778

jrast · 2018-05-08T21:24:07Z

Applications such as Flask that expect headers to be encoded as latin-1, as specified in RFC 5987, break under gunicorn because gunicorn encodes headers as ASCII.

Issue #1353 was closed with the comment

> I'm going to close this as it's well documented around the web that HTTP headers should be ASCII, unfortunately.

which is wrong according to RFC 5987: headers should be encoded as ISO-8859-1 (aka latin-1). However, a comment in that issue also mentioned commit 5f4ebd2, in which the encoding of the headers was switched back from latin-1 to ASCII based on a section in RFC 7230 which states that US-ASCII should be used:

> Historically, HTTP has allowed field content with text in the
> ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
> through use of [RFC2047] encoding. In practice, most HTTP header
> field values use only a subset of the US-ASCII charset [USASCII].
> Newly defined header fields SHOULD limit their field values to
> US-ASCII octets. [...]

Now, I'm not familiar with all these RFC specifications and don't know which is the right way to do it. However, since other applications expect latin-1 as specified in RFC 5987 while gunicorn implements the version given in RFC 7230, it might be worth discussing this, even if only to clarify which track gunicorn is on.

Related Issues in gunicorn:
#1353 : wsgi.py > send_headers: encoding problem.

Related Issues in Flask:
pallets/flask#2766 : send_file: latin-1 encoding not compatible with gunicorn
pallets/flask#2223 : Fix send_file's attachment_filename to work with non-ascii filenames

Related RFC:
https://tools.ietf.org/html/rfc5987#section-1

benoitc · 2018-05-09T00:20:56Z

gunicorn implements the latest specification (and it used to work with previous versions of flask).

If we want to relax it for old applications, we may want to try encoding to latin1 if encoding to ascii didn't work. Thoughts? cc @tilgovi
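The fallback proposed here could look roughly like the sketch below. It is only an illustration of the idea, not gunicorn's code; the function name `encode_header_line` is made up for this example.

```python
def encode_header_line(line):
    """Encode one header line, preferring ASCII and falling back to Latin-1.

    Hypothetical helper for illustration only; not part of gunicorn.
    """
    try:
        return line.encode("ascii")
    except UnicodeEncodeError:
        # Latin-1 covers code points 0-255. Anything above that
        # (for example an emoji) still raises UnicodeEncodeError.
        return line.encode("latin-1")


print(encode_header_line("X-Name: foo"))
print(encode_header_line("X-Name: f\u00f6o"))
```

Because ASCII is a strict subset of Latin-1, the bytes produced are the same as encoding with Latin-1 directly; the two-step form only matters if the ASCII failure is used as a hook, for example to log.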

benoitc · 2018-05-09T00:23:11Z

To be more complete, gunicorn uses the updated HTTP 1.1 spec (7230 and related). That could indeed be documented.

tilgovi · 2018-05-09T01:21:48Z

From my position, the spec says "SHOULD" and PEP 333 mentions latin-1 so I think Gunicorn could support latin-1 even if applications are advised not to use non-ASCII characters.

That's my initial reading, but I'll wait for others to weigh in.

tilgovi · 2018-05-09T01:25:21Z

> I think Gunicorn could support latin-1

By this I mean only that Gunicorn would not be breaking any spec. Right now Gunicorn is more strict than it needs to be. It is not necessarily wrong to be so strict. We can decide together.

jrast · 2018-05-09T06:28:28Z

RFC 7230 says that new headers "SHOULD" use ASCII. The statement that most headers use ASCII right now seems to be just information about the current situation. But I think this sentence is more important:

> Historically, HTTP has allowed field content with text in the
> ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
> through use of [RFC2047] encoding.

So, latin-1 is clearly allowed.

If RFC 7230 says that new headers "SHOULD" use ASCII, the question is: who defines the headers? Is it the application using gunicorn, or is it gunicorn itself? As I understand it, the application defines the headers.

benoitc · 2018-05-09T09:53:55Z

What annoys me is that the change was made 3 years ago without many people being blocked by it across the different frameworks, until that recent change in flask. I would have thought that in 2018 all applications would follow a specification published in 2014. Part of the issue is that PEP 3333 allows ISO-8859-1, aka latin-1.

Anyway, I don't object to being more relaxed about it. I propose the following:

allow encoding to latin-1
if latin-1 is spotted, log a warning to let the developer know they should instead encode the filenames using RFC 2616 or use ASCII to comply with RFC 7230

Thoughts ?
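A minimal sketch of this two-step proposal, assuming the standard `logging` module is used for the warning. The logger name and the function `encode_header_value` are hypothetical and not taken from gunicorn.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
log = logging.getLogger("header-encoding-sketch")


def encode_header_value(name, value):
    """Allow Latin-1 header values, but warn when ASCII is not enough."""
    try:
        return value.encode("ascii")
    except UnicodeEncodeError:
        log.warning(
            "Header %r contains non-ASCII characters; "
            "RFC 7230 recommends limiting field values to US-ASCII.",
            name,
        )
        # Still raises UnicodeEncodeError for code points above 255.
        return value.encode("latin-1")


print(encode_header_value("X-Test", "f\u00f6o"))
```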

davidism · 2018-05-09T14:00:49Z

@benoitc sorry for causing trouble! It's hard to keep track of all the different specs. 😓

If you show that warning, I predict a regular stream of issues being opened on Flask about it though. I think it makes more sense to just allow Latin-1 silently, as it's part of the PEP.

benoitc · 2018-05-18T22:10:29Z

@davidism well that's kind of the point: bugging the user until they make the correct change ;) I think that would be better but I have no strong opinion on it. @tilgovi @berkerpeksag ?

tilgovi · 2018-05-18T23:01:29Z

@davidism no worries :) Thanks for bringing it up.

I'm inclined towards being tolerant in this case, since it aligns with the PEP and with popular frameworks.

berkerpeksag · 2018-05-19T09:18:58Z

> I'm inclined towards being tolerant in this case, since it aligns with the PEP and with popular frameworks.

+1

benoitc · 2018-05-19T12:28:15Z

Until when the status quo would be kept though... yes users will open some bugs which prompted our change. I don’t see why it couldn’t be also fixed in other libraries and the pep updated.

At least we should document it somewhere so users using a strict client won’t be surprised as well. Also if we do the change it will have to be in a major version since it’s a breaking change.

tilgovi · 2018-05-23T01:30:15Z

I'm happy to make this change. Do we need a major version? It makes requirements looser for applications. Existing applications should continue to work. Applications which did not work before, but which violate the "SHOULD" will now work.

In any case, I can document it somehow.

tilgovi · 2018-05-23T01:31:16Z

Maybe 19.9? We can keep making 19.9.x or 19.x.y releases for Python 2 maintenance critical bug fixes for a bit, but still keep 20 for Python 3 only?

davidism · 2018-05-23T03:41:42Z

I've been thinking about this more, and I'm ok switching Flask and Werkzeug to use ASCII here.

I still think it's probably good if Gunicorn supports it, since it's part of the PEP. I'm assuming you've never had an issue reported about this before now so it seems like it's probably not well known anyway. If you're worried about breaking changes, I can handle it on my side only.

I do plan on implementing ASGI in Werkzeug, which leaves the header encoding to the application, so that would be a good time for me to fix inconsistent use of ASCII vs Latin-1 on my side.

jrast · 2018-05-23T21:02:37Z

I support the idea of @davidism: Flask should follow the recommendation and use ASCII, but gunicorn should also follow PEP 3333 and allow Latin-1, even if other specifications do not recommend doing so.

davidism · 2018-05-24T14:32:34Z

I guess I should have done this sooner, but I just tried sending a unicode filename with Flask and Gunicorn and it worked fine: pallets/flask#2766 (comment). Now I'm not clear where the issue is.

benoitc · 2018-05-26T20:18:55Z

@tilgovi 19.9 would be ok. also #1151 that was closed by the latest change will need to be addressed.

Related to the last comment of @davidism , @jrast what is the filename that was failing? I think we should have a test to reproduce the issue before making any change.

jrast · 2018-05-28T06:41:10Z

I can't remember exactly and I don't have a logfile at hand at the moment. But the problem arose with filenames containing German umlauts, like äöü. For example, a filename like "Sterntour-Glärnischhütte.xlsx" would cause problems because of the "ä" and "ü" in the filename.
I can search for the logfile if you need more details.
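To make the failure concrete, here is a small sketch that encodes a header line containing that filename with both codecs. The exact header line is an assumption for illustration; the real header Flask produced is not shown in this thread.

```python
filename = "Sterntour-Glärnischhütte.xlsx"
# Assumed header shape, for illustration only.
header = 'Content-Disposition: attachment; filename="%s"' % filename

# Latin-1 can represent "ä" (0xE4) and "ü" (0xFC), so this succeeds.
latin1_bytes = header.encode("latin-1")

# ASCII cannot, so this raises UnicodeEncodeError.
try:
    header.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print("ascii failed:", exc)

print(latin1_bytes)
```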

davidism · 2018-05-28T12:58:06Z

Thanks, that filename caused the issue. The name used in the test in Flask triggered the filename* behavior, which ended up as ASCII. Your example encoded to Latin-1, so there was no extra encoding done and it was passed on as-is.
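The `filename*` form mentioned above is the RFC 5987 extended parameter, which keeps the header pure ASCII by percent-encoding the UTF-8 bytes. A rough sketch of choosing between the two forms based on an ASCII check follows; `content_disposition` is a hypothetical helper, not Flask's implementation, and it ignores quoting of special characters in the plain form.

```python
from urllib.parse import quote


def content_disposition(filename):
    """Return an ASCII-only Content-Disposition value (simplified sketch)."""
    try:
        filename.encode("ascii")
    except UnicodeEncodeError:
        # RFC 5987 ext-value: charset, empty language tag, then the
        # percent-encoded UTF-8 bytes of the filename.
        return "attachment; filename*=UTF-8''" + quote(filename, safe="")
    return 'attachment; filename="%s"' % filename


print(content_disposition("report.xlsx"))
print(content_disposition("Sterntour-Glärnischhütte.xlsx"))
```

Checking against ASCII rather than Latin-1 is what makes the umlaut filename take the `filename*` branch, so the resulting value can be encoded by a server that only accepts ASCII.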

jrast · 2018-05-29T22:13:35Z

Just for reference: @davidism fixed the encoding in flask with pull request pallets/flask#2804, which will be released in flask 1.0.3. This will make flask compatible with the current behaviour of gunicorn.

On this occasion, I want to thank @benoitc, @tilgovi and @davidism (and all other involved devs!) for their work on gunicorn and flask. I am positively surprised by the discussion which unfolded in this issue! The issue was addressed very quickly, the discussion was based on facts, and a good solution for both packages was found very quickly! I'm always surprised how quickly issues are addressed in the open source world and that, most of the time, good solutions can be found for all involved parties. I think a fix would have taken much longer if both packages had been closed source libraries owned by some big corporations. Thank you very much and keep up the great work!

georgexsh · 2018-07-12T08:44:50Z

flask/werkzeug will encode then decode cookie values to Unicode with latin1:
https://github.com/pallets/werkzeug/blob/master/werkzeug/http.py#L1107-L1108

> If a unicode string is returned it’s tunneled through latin1 as required by PEP 3333.
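"Tunneling through latin1" means the real bytes (for example UTF-8) are carried in a `str` whose code points are all below 256, which is what PEP 3333 requires of header strings. A minimal sketch of the round trip, with hypothetical helper names:

```python
def tunnel_through_latin1(text, charset="utf-8"):
    """Encode text to bytes, then present those bytes as a Latin-1 str."""
    return text.encode(charset).decode("latin-1")


def untunnel(native, charset="utf-8"):
    """Reverse the tunneling to recover the original text."""
    return native.encode("latin-1").decode(charset)


tunneled = tunnel_through_latin1("ö")
print(repr(tunneled))                    # two code points, both below 256
print(tunneled.encode("latin-1"))        # the original UTF-8 bytes
print(untunnel(tunneled))
```

A tunneled string like this always encodes as Latin-1, but it fails to encode as ASCII whenever the underlying bytes include values above 0x7F.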

benoitc · 2018-07-12T13:22:56Z

@georgexsh so what's your expectation?

georgexsh · 2018-07-13T05:47:09Z

@benoitc to point out that flask/werkzeug still has behavior incompatible with gunicorn. I will be happier if #1778 (comment) gets implemented.

btw, your reply is very agile, I admire your diligence.

davidism · 2018-07-13T06:32:29Z

Given that it's never had a problem before between that code and Gunicorn, I'm not too concerned, although I may switch more things to ASCII eventually. Do you have a real failing example?

georgexsh · 2018-07-13T10:14:56Z

from werkzeug.wrappers import Request, Response

class MyResp(Response):
    charset = 'utf-8-sig'

@Request.application
def application(request):
    resp = MyResp()
    resp.set_cookie('foo', 'bar')
    return resp

Running it with gunicorn yields an error:

Traceback (most recent call last):
  File "gunicorn/workers/sync.py", line 135, in handle
    self.handle_request(listener, req, client, addr)
  File "gunicorn/workers/sync.py", line 183, in handle_request
    resp.close()
  File "gunicorn/http/wsgi.py", line 409, in close
    self.send_headers()
  File "gunicorn/http/wsgi.py", line 329, in send_headers
    util.write(self.sock, util.to_bytestring(header_str, "ascii"))
  File "gunicorn/util.py", line 507, in to_bytestring
    return value.encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 155-157: ordinal not in range(128)

the error is from this header:

Set-Cookie: ï»¿foo="\357\273\277bar"; Path=ï»¿%EF%BB%BF/?%EF%BB%BF#%EF%BB%BF

When werkzeug dumps cookies with utf-8-sig, a BOM is added. It can be decoded as 8-bit latin1 but cannot be re-encoded as 7-bit ASCII:

>>> ''.encode('utf-8-sig').decode('latin1')
'ï»¿'

>>> ''.encode('utf-8-sig').decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

there is more detail in apache/superset#5377

But I think this is a bug in the application (superset): it should not rely on flask/werkzeug to encode the CSV string to bytes, but should encode it itself.

davidism · 2018-07-13T12:38:15Z

That's definitely an issue with that application.

javabrett · 2018-11-09T00:51:22Z

Some notes on the behaviour differences between Werkzeug(+Flask) and Gunicorn(+Flask):

| Application attempts to set header value | Werkzeug/0.14.1+Flask | Gunicorn 19.9/master |
| --- | --- | --- |
| `b'f\xc3\xb6o'.decode("utf-8", "strict") = "föo"` | 200 OK, header value is sent, contains bytes `\x66\xf6\x6f = föo` | 500, `UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 164: ordinal not in range(128)` |
| `b'Swan: \xF0\x9F\xA6\xA2'.decode("utf-8", "strict") = "Swan: (swan glyph)"` | 200 but empty response, errors out while encoding headers - `UnicodeEncodeError: 'latin-1' codec can't encode character '\U0001f9a2' in position 19: ordinal not in range(256)` | 500, `UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f9a2' in position 169: ordinal not in range(128)` |

Note that when writing-out the response headers, Werkzeug delegates to Python builtin BaseHTTPRequestHandler:

https://github.com/pallets/werkzeug/blob/7477be2853df70a022d9613e765581b9411c3c39/werkzeug/serving.py#L235

... which uses latin-1:

https://github.com/python/cpython/blob/fd512d76456b65c529a5bc58d8cfe73e4a10de7a/Lib/http/server.py#L516

That will explain the behaviour differences.

My reading of the considerable number and history of RFCs on this suggests that applications and servers should be restricting themselves to US-ASCII, but that (perhaps for historical reasons) latin-1 is allowed, but clients should treat such octets as opaque.

Either way, ideally Gunicorn and Werkzeug should behave the same here if possible, and it seems that a latin-1 character should not error the response. Gunicorn has a superior way of serializing the header values: if a non-latin-1 character is present, it detects this before committing 200 OK (which is what happens in Werkzeug) and can send 500 instead. But it appears to me that it should allow latin-1.
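The "detect before committing" behaviour can be sketched as follows: build and encode the complete response head first, and only fall back to an error response if encoding fails. This is a simplified illustration under the assumption that Latin-1 is the target encoding; `serialize_response_head` is a hypothetical function and does not reproduce gunicorn's `send_headers`.

```python
def serialize_response_head(status, headers):
    """Encode the status line and all headers in one step.

    headers is a list of (name, value) string pairs.
    """
    lines = ["HTTP/1.1 %s" % status]
    lines.extend("%s: %s" % (name, value) for name, value in headers)
    head = "\r\n".join(lines) + "\r\n\r\n"
    try:
        return head.encode("latin-1")
    except UnicodeEncodeError:
        # Nothing has been written to the socket yet, so the server
        # is still free to answer with an error status instead.
        return b"HTTP/1.1 500 Internal Server Error\r\nContent-Length: 0\r\n\r\n"


print(serialize_response_head("200 OK", [("X-Name", "f\u00f6o")]))
print(serialize_response_head("200 OK", [("X-Name", "Swan: \U0001f9a2")]))
```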

javabrett · 2018-11-09T01:56:16Z

Another possibly interesting check in corelibs:

https://github.com/python/cpython/blob/fd512d76456b65c529a5bc58d8cfe73e4a10de7a/Lib/wsgiref/validate.py#L118

header_re = re.compile(r'^[a-zA-Z][a-zA-Z0-9\-_]*$')
bad_header_value_re = re.compile(r'[\000-\037]')

So Python WSGI is very strict on characters allowed in the header name (alpha-num, hyphen, underscore), and bans non-printables from values.
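As a quick illustration, the two patterns quoted above can be applied to a few sample headers. The `check_header` wrapper is hypothetical and only combines the two regexes; it is not how `wsgiref.validate` itself reports problems.

```python
import re

# Patterns as quoted above from wsgiref/validate.py.
header_re = re.compile(r'^[a-zA-Z][a-zA-Z0-9\-_]*$')
bad_header_value_re = re.compile(r'[\000-\037]')


def check_header(name, value):
    """Return True if the name matches and the value has no control chars."""
    return bool(header_re.search(name)) and not bad_header_value_re.search(value)


print(check_header("Content-Type", "text/plain"))
print(check_header("Bad Header", "x"))        # space in the name
print(check_header("X-Test", "a\r\nb"))       # control characters in the value
print(check_header("X-Test", "f\u00f6o"))     # latin-1 character in the value
```

Note that these patterns say nothing about characters above 0x7F in values, so a latin-1 value such as "föo" passes this particular check.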

This commit reverts one aspect changed by 5f4ebd2 (benoitc#1151); header-values are again encoded as latin-1 and not ascii. Test is restored but uses a latin-1-mappable test-character, not a general utf8 character. Fixed benoitc#1778. Signed-off-by: Brett Randall <javabrett@gmail.com>


jrast mentioned this issue May 24, 2018

send_file: latin-1 encoding not compatible with gunicorn pallets/flask#2766

Closed

georgexsh mentioned this issue Jul 12, 2018

CSV_EXPORT.encoding config result in gunicorn exception apache/superset#5377

Closed

javabrett mentioned this issue Nov 9, 2018

Encode header values using latin-1, not ascii #1914

Merged

javabrett mentioned this issue Nov 11, 2018

new bug: 500 code returned in case request with utf-8 symbols sent (see link below) postmanlabs/httpbin#446

Open

zhaoyongjie mentioned this issue Mar 12, 2019

fix: Gunicorn raise encoding error apache/superset#7010

Closed

berkerpeksag closed this as completed in #1914 Apr 18, 2019

crazyplum mentioned this issue Feb 18, 2020

CSV download fails if query tab has atypical characters apache/superset#9141

Closed

bojiang mentioned this issue Jul 3, 2020

[FIX] decode headers with latin1 bentoml/BentoML#864

Merged

carltongibson mentioned this issue Jul 13, 2020

CookieMiddleware should decode() using latin-1. django/channels#1450

Closed

euri10 mentioned this issue Jul 22, 2020

Fix issues surrounding X-Forwarded-For header in ProxyHeadersMIddleware encode/uvicorn#701

Merged
