Encode header values using latin-1, not ascii #1914

javabrett · 2018-11-09T01:58:28Z

This commit reverts one aspect changed by 5f4ebd2 (#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.
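
A minimal sketch of the behaviour difference, for illustration only. The helper names `encode_header_value` and `fails` are hypothetical and are not Gunicorn's actual code; the sketch only contrasts the two codecs on a latin-1-mappable character and on a character outside latin-1:

```python
def encode_header_value(value, encoding="latin-1"):
    """Encode a header value to bytes; raises UnicodeEncodeError if unmappable."""
    return value.encode(encoding)


def fails(value, encoding):
    """Return True if the value cannot be encoded with the given codec."""
    try:
        encode_header_value(value, encoding)
    except UnicodeEncodeError:
        return True
    return False


latin1_value = "caf\u00e9"   # U+00E9 is mappable to latin-1
euro_value = "\u20ac5"       # U+20AC (euro sign) is not in latin-1

encoded = encode_header_value(latin1_value)           # single byte 0xE9 for the accent
ascii_rejects_latin1 = fails(latin1_value, "ascii")   # ascii cannot represent U+00E9
latin1_rejects_euro = fails(euro_value, "latin-1")    # latin-1 cannot represent U+20AC
```

This is also why the restored test needs a latin-1-mappable character: a general UTF-8 character such as the euro sign would still fail to encode.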

Fixed #1778.

Per question in #791, is there a best way to automate the test?

tilgovi · 2018-11-18T22:20:57Z

Just to check: your investigation showed wsgiref being much more strict. There seemed to be some consensus that Gunicorn could support latin-1, but did we agree that it should? Or should we remain strict?

javabrett · 2018-11-19T00:49:56Z

wsgiref

Just to check: your investigation showed wsgiref being much more strict.

My main observation on wsgiref is per #1778:

https://github.com/python/cpython/blob/fd512d76456b65c529a5bc58d8cfe73e4a10de7a/Lib/wsgiref/validate.py#L118

header_re = re.compile(r'^[a-zA-Z][a-zA-Z0-9\-_]*$')
bad_header_value_re = re.compile(r'[\000-\037]')

... and my recollection is that it allows non-ASCII, latin-1 characters in header values. The header name indeed seems very strict per the regex above: no non-ASCII is allowed there. The values regex bans non-printables, but that's all. We probably need to check that Gunicorn blocks the non-printables too.
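
The two patterns quoted above can be exercised directly. This is a sketch that copies the regexes from the quoted lines rather than importing wsgiref; the helper names `name_ok` and `value_ok` are hypothetical:

```python
import re

# Patterns copied from the wsgiref/validate.py lines quoted above.
header_re = re.compile(r'^[a-zA-Z][a-zA-Z0-9\-_]*$')
bad_header_value_re = re.compile(r'[\000-\037]')


def name_ok(name):
    """A header name passes only if the whole name matches header_re."""
    return header_re.search(name) is not None


def value_ok(value):
    """A header value passes if it contains no character in \\000-\\037."""
    return bad_header_value_re.search(value) is None


results = {
    "ascii name": name_ok("X-Custom"),
    "non-ascii name": name_ok("X-Caf\u00e9"),
    "latin-1 value": value_ok("caf\u00e9"),
    "control char value": value_ok("bad\x00value"),
}
```

Note that the value pattern only rejects the listed control characters; it does not itself check that a value is mappable to latin-1.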

PEP 3333

PEP 3333, which appears to be current, replacing PEP 333 (although it does refer to historical RFCs namely RFC 2616), says:

Note also that strings passed to start_response() as a status or as response headers must follow RFC 2616 with respect to encoding. That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding.

... and later, perhaps non-normative:

Do not be confused however: even if Python's str type is actually Unicode "under the hood", the content of native strings must still be translatable to bytes via the Latin-1 encoding! (See the section on Unicode Issues later in this document for more details.)

I can't see where PEP 3333 allows implementations to be stricter with regard to encoding, and therefore ban non-ASCII values, so I suppose that any server which does so is not strictly PEP 3333-compliant? Maybe this is opinion rather than fact, and I don't intend it to be dramatic. But if I follow PEP 3333 in my application, go close to the rails, and use (deprecated) latin-1 non-ASCII characters in my header values, this will fail on Gunicorn as things stand.

RFC 2616

Obsoleted RFC, replaced by multiple current RFCs, especially RFC 7230. Mentioned specifically here because it is mentioned in PEP 3333, perhaps due to timing or some error.

In terms of header-values, allows any octet other than unprintable control characters.

RFC 7230

RFC 7230 seeks to clarify header-value encoding and allowed characters:

Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.

Nonetheless, despite being frowned-upon and deprecated, obs-text remains in the BNF production for the headers. Maybe it depends on how you interpret obsolescence in the RFCs. Recall that PEP 3333 does not yet refer to this newer RFC.
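
As a rough illustration of what keeping obs-text in the grammar means at the octet level, here is a simplified check. It assumes that field content consists of VCHAR, obs-text (%x80-FF), SP and HTAB, and it deliberately ignores obs-fold and the rules about leading and trailing whitespace; the function names are hypothetical:

```python
def is_allowed_field_octet(octet):
    """True if the octet is VCHAR, obs-text, SP or HTAB (simplified view)."""
    return (
        0x21 <= octet <= 0x7E       # VCHAR: visible US-ASCII characters
        or 0x80 <= octet <= 0xFF    # obs-text: deprecated but still in the grammar
        or octet in (0x20, 0x09)    # SP / HTAB between visible characters
    )


def field_value_octets_ok(raw):
    """Check every octet of an already-encoded header value."""
    return all(is_allowed_field_octet(octet) for octet in raw)


ok_ascii = field_value_octets_ok(b"text/html; charset=utf-8")
ok_obs_text = field_value_octets_ok("caf\u00e9".encode("latin-1"))
ok_control = field_value_octets_ok(b"bad\x00value")
```

Under this reading, a latin-1 encoded value passes while a value containing a control octet does not.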

Summary

Although non-ASCII characters are now frowned-upon officially in current RFCs, they remain in the BNF productions.
There are newer, official ways of encoding unicode in-general, with various levels of adoption and support.
PEP 3333 remains bound to the obsoleted RFC 2616, and explicitly states latin-1 encoding for header values. There doesn't appear to be an official option to elect to support a smaller set of octets e.g. US-ASCII only and remain within the spec.

daghoidahl · 2018-12-18T16:32:00Z

OWASP Secure Coding Practices Checklist recommends:

Verify that header values in both requests and responses contain only ASCII characters

I have tried to search for actual exploits that are possible if one doesn't follow this practice, but have failed so far. Is the OWASP recommendation too strict?

kngenie · 2019-01-30T22:58:49Z

I'd like to cast a vote for this change, as encoding with "ascii" is preventing us from running our application, Wayback Machine, on Python 3. Wayback Machine needs to "play back" archived HTTP responses that often contain non-ascii characters (not even latin-1) in their headers. We cannot do it transparently under an ASCII requirement. I know this is a rather marginal kind of application, but I hope it is worthwhile for Gunicorn to support.
Thank you so much for your work.

tilgovi · 2019-01-31T00:05:49Z

@kngenie are you going to need even more leniency than latin-1, or is it acceptable for you to strip or encode the non-latin-1 headers?

kngenie · 2019-04-17T20:22:11Z

Sorry for the slow response - latin1 is sufficient. We don't even bother identifying the actual encoding of archived header values, and treat them as latin1-encoded strings (the equivalent of bytes, effectively). latin1 should be able to cover all cases. Thank you.
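
The "equivalent of bytes" point can be sketched as a round trip: every octet 0-255 maps to exactly one code point under latin-1, so decoding and re-encoding is lossless whatever the original encoding was. The archived value below is made up for the example:

```python
# Every possible octet survives a latin-1 decode/encode round trip.
all_octets = bytes(range(256))
as_text = all_octets.decode("latin-1")
round_tripped = as_text.encode("latin-1")

# Hypothetical archived header bytes that were originally UTF-8.
archived = "\u65e5\u672c\u8a9e".encode("utf-8")
native = archived.decode("latin-1")    # reads as mojibake, but loses nothing
replayed = native.encode("latin-1")    # identical to the archived bytes
```

The native string is not meaningful text, but the bytes written back to the wire are the ones that were archived.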

berkerpeksag · 2019-04-17T21:38:06Z

The values regex is banning non-printables but that's all.

This might be a bug in wsgiref caused by the Python 3 migration (I don't remember what PEP 3333 says about this at the moment). In Python 2, bad_header_value_re was more strict because its default mode was ASCII-only. Due to changes in Python 3 (unicode -> str), the re.ASCII flag should be passed to emulate Python 2's ASCII-only mode.
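
One way to test this hypothesis is to compile the quoted pattern with and without the flag and compare. The `re` documentation describes re.ASCII as affecting the shorthand classes (\w, \W, \b, \B, \d, \D, \s, \S), so an explicit range like this one may not be affected; the sample strings are made up for the check:

```python
import re

pattern = r'[\000-\037]'   # the value pattern quoted earlier in the thread
default_re = re.compile(pattern)
ascii_re = re.compile(pattern, re.ASCII)

samples = ["caf\u00e9", "plain", "bad\x1fvalue"]

# True where the pattern finds a forbidden character in the sample.
default_hits = [default_re.search(s) is not None for s in samples]
ascii_hits = [ascii_re.search(s) is not None for s in samples]

flag_changes_result = default_hits != ascii_hits
```

If `flag_changes_result` is false for these samples, the flag alone would not make this pattern reject latin-1 characters.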

javabrett · 2019-04-17T23:47:51Z

Just checking-in on whether there are any outstanding asks here, concerns etc.

berkerpeksag · 2019-04-18T01:23:30Z

Thank you!

benoitc · 2019-04-18T04:02:38Z

That's quite a breaking change against the HTTP 1.1 spec and the last few years. I would rather think of it as an option that reintroduces latin-1 for those who need it, like latin1_headers or something. Thoughts?

javabrett · 2019-04-18T04:30:59Z

@benoitc thanks for your comment.

The PR was made with reliance on RFC 7230, which whilst clearly deprecating non-ascii characters in header-values, retains the (deprecated) obs-text, production %x80-FF, in allowed header-values. The deprecation also specifically mentions new headers/values, perhaps suggesting that existing ones are somehow exempt-from the deprecation, or are to be treated less strictly? No definition of old/new either.

So perhaps it goes to the question of how should implementations deal with such deprecation in the spec. Since the spec allows the characters, I assume we have to support such header values without choking. The spec warns that values containing such characters should be treated as "opaque", but that is an application concern.

Maybe Gunicorn could log something for non-ascii values, but that is possibly an extra cost to go to for small return.

You might have a different reading of the HTTP 1.1 RFC.

As you suggest, an option also seems like a reasonable compromise.
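
For discussion, the option and the logging idea could be combined roughly as follows. Only the name `latin1_headers` comes from this thread; the function, its default and the logger name are hypothetical, and this is not an existing Gunicorn setting:

```python
import logging

log = logging.getLogger("header_sketch")  # hypothetical logger name


def encode_header_value(value, latin1_headers=True):
    """Encode as ASCII when possible; fall back to latin-1 only if allowed."""
    try:
        return value.encode("ascii")
    except UnicodeEncodeError:
        if not latin1_headers:
            raise  # strict mode: non-ASCII values are rejected
    # Tolerant mode: note the deprecated characters, then encode as latin-1.
    log.warning("non-ASCII header value encoded as latin-1: %r", value)
    return value.encode("latin-1")


plain = encode_header_value("text/html", latin1_headers=False)
tolerant = encode_header_value("caf\u00e9")

try:
    encode_header_value("caf\u00e9", latin1_headers=False)
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True
```

Pure-ASCII values take the first path and never pay for the logging call, which limits the extra cost mentioned above to the non-ASCII case.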

tilgovi · 2019-04-28T01:07:55Z

I would not oppose an option, but I like having a tolerant default. Frameworks might take a stricter stance, but I think it's okay that a server, such as Gunicorn, be tolerant and support the deprecated characters, by default.

javabrett force-pushed the 1778-encode-headers-using-latin-1 branch from 8903dc3 to d892185 Compare November 9, 2018 02:09

tilgovi self-requested a review November 29, 2018 05:05

javabrett force-pushed the 1778-encode-headers-using-latin-1 branch from d892185 to fdc8423 Compare January 23, 2019 03:50

benoitc force-pushed the master branch from 5a43f72 to fe7632f Compare January 24, 2019 22:19

javabrett force-pushed the 1778-encode-headers-using-latin-1 branch from fdc8423 to 5b3456c Compare January 31, 2019 00:02

tilgovi approved these changes Jan 31, 2019

javabrett force-pushed the 1778-encode-headers-using-latin-1 branch from 5b3456c to 63c6861 Compare February 22, 2019 05:04

berkerpeksag approved these changes Apr 18, 2019

berkerpeksag merged commit 879651b into benoitc:master Apr 18, 2019

javabrett deleted the 1778-encode-headers-using-latin-1 branch April 18, 2019 01:34

twoyang0917 mentioned this pull request Mar 14, 2024

Encode header values using latin-1, not ascii pgjones/hypercorn#204

Closed
