Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode header values using latin-1, not ascii #1914

Merged

Conversation

javabrett
Copy link
Collaborator

This commit reverts one aspect changed by 5f4ebd2 (#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.

Fixed #1778.

Per question in #791, is there a best way to automate the test?

@javabrett javabrett force-pushed the 1778-encode-headers-using-latin-1 branch from 8903dc3 to d892185 Compare November 9, 2018 02:09
@tilgovi
Copy link
Collaborator

tilgovi commented Nov 18, 2018

Just to check: your investigation showed wsgiref being much more strict. There seemed to be some consensus that Gunicorn could support latin-1, but did we agree that it should? Or should we remain strict?

@javabrett
Copy link
Collaborator Author

javabrett commented Nov 19, 2018

wsgiref

Just to check: your investigation showed wsgiref being much more strict.

My main observation on wsgiref is per #1778:

https://github.com/python/cpython/blob/fd512d76456b65c529a5bc58d8cfe73e4a10de7a/Lib/wsgiref/validate.py#L118

header_re = re.compile(r'^[a-zA-Z][a-zA-Z0-9\-_]*$')
bad_header_value_re = re.compile(r'[\000-\037]')

... and my recollection is that it is allowing non-ASCII, latin-1 characters in header-values. The header name indeed seems very strict per the regex above - no non-ASCII allowed there. The values regex is banning non-printables but that's all. We probably need to check the Gunicorn blocks the non-printables too.

PEP 3333

PEP 3333, which appears to be current, replacing PEP 333 (although it does refer to historical RFCs namely RFC 2616), says:

Note also that strings passed to start_response() as a status or as response headers must follow RFC 2616 with respect to encoding. That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding.

... and later, perhaps non-normative:

Do not be confused however: even if Python's str type is actually Unicode "under the hood", the content of native strings must still be translatable to bytes via the Latin-1 encoding! (See the section on Unicode Issues later in this document for more details.)

I can't see where PEP 3333 allows implementations to choose to be more-strict with regards to encoding, and therefore ban non-ASCII values, so I suppose that any server which does is not strictly PEP 3333? Maybe this is opinion rather than fact, and I don't intend it to be dramatic. But if I follow PEP 3333 in my application and go close to the rails and use (deprecated) latin-1 non-ASCII characters in my header values, this will fail on Gunicorn as things stand.

RFC 2616

Obsoleted RFC, replaced by multiple current RFCs, especially RFC 7230. Mentioned specifically here because it is mentioned in PEP 3333, perhaps due to timing or some error.

In terms of header-values, allows any octet other than unprintable control characters.

RFC 7230

RFC 7230 seeks to clarify header-value encoding and allowed characters:

Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.

Nonetheless, despite being frowned-upon and deprecated, obs-text remains in the BNF production for the headers. Maybe it depends on how you interpret obsolescence in the RFCs. Recall that PEP 3333 does not yet refer to this newer RFC.

Summary

  • Although non-ASCII characters are now frowned-upon officially in current RFCs, they remain in the BNF productions.
  • There are newer, official ways of encoding unicode in-general, with various levels of adoption and support.
  • PEP 3333 remains bound to the obsoleted RFC 2616, and explicitly states latin-1 encoding for header values. There doesn't appear to be an official option to elect to support a smaller set of octets e.g. US-ASCII only and remain within the spec.

@tilgovi tilgovi self-requested a review November 29, 2018 05:05
@daghoidahl
Copy link

OWASP Secure Coding Practices Checklist recommends:

  1. Verify that header values in both requests and responses contain only ASCII characters

I have tried to search for actual exploits if one doesn't follow this practice, but failed so far. Is the OWASP recommendation to strict?

@kngenie
Copy link

kngenie commented Jan 30, 2019

I'd like to cast a vote for this change, as encoding with "ascii" is preventing us from running our application, Wayback Machine, on Python 3. Wayback Machine needs to "play back" archived HTTP responses that often contain non-ascii characters (not even latin-1) in their header. We cannot do it transparently under ASCII requirement. I know this is rather a marginal kind of application, but I hope it is worthwhile for Gunicorn to support.
Thank you so much for your work.

@javabrett javabrett force-pushed the 1778-encode-headers-using-latin-1 branch from fdc8423 to 5b3456c Compare January 31, 2019 00:02
@tilgovi
Copy link
Collaborator

tilgovi commented Jan 31, 2019

@kngenie are you going to need even more leniency than latin-1, or is it acceptable for you to strip or encode the non-latin-1 headers?

This commit reverts one aspect changed by 5f4ebd2 (benoitc#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.

Fixed benoitc#1778.

Signed-off-by: Brett Randall <javabrett@gmail.com>
@javabrett javabrett force-pushed the 1778-encode-headers-using-latin-1 branch from 5b3456c to 63c6861 Compare February 22, 2019 05:04
@kngenie
Copy link

kngenie commented Apr 17, 2019

Sorry for slow response - latin1 is sufficient. We don't even bother identifying the actual encoding of archived header values, and treating them as latin1 encoded string (equivalent of bytes, effectively). latin1 should be able to cover all cases. Thank you.

@berkerpeksag
Copy link
Collaborator

The values regex is banning non-printables but that's all.

This might be a bug in wsgiref caused by Python 3 migration (I don't remember what PEP 3333 says about this at the moment) In Python 2, bad_header_value_re was more strict because its default mode was ASCII-only. Due to changes in Python 3 (unicode -> str) the re.ASCII flag should be passed to emulate Python 2's ASCII-only mode.

@javabrett
Copy link
Collaborator Author

Just checking-in on whether there are any outstanding asks here, concerns etc.

@berkerpeksag berkerpeksag merged commit 879651b into benoitc:master Apr 18, 2019
@berkerpeksag
Copy link
Collaborator

Thank you!

@javabrett javabrett deleted the 1778-encode-headers-using-latin-1 branch April 18, 2019 01:34
@benoitc
Copy link
Owner

benoitc commented Apr 18, 2019

That's quite a breaking change agains the HTTP 1.1 spec and the last years. I would rather think it as an option that re-introduce latin for those who need it. like latin1_headers or something. Thoughts?

@javabrett
Copy link
Collaborator Author

javabrett commented Apr 18, 2019

@benoitc thanks for your comment.

The PR was made with reliance on RFC 7230, which whilst clearly deprecating non-ascii characters in header-values, retains the (deprecated) obs-text, production %x80-FF, in allowed header-values. The deprecation also specifically mentions new headers/values, perhaps suggesting that existing ones are somehow exempt-from the deprecation, or are to be treated less strictly? No definition of old/new either.

So perhaps it goes to the question of how should implementations deal with such deprecation in the spec. Since the spec allows the characters, I assume we have to support such header values without choking. The spec warns that values containing such characters should be treated as "opaque", but that is an application concern.

Maybe Gunicorn could log something for non-ascii values, but that is possibly an extra cost to go to for small return.

You might have a different reading of the HTTP 1.1 RFC.

As you suggest, an option also seems like a reasonable compromise.

@tilgovi
Copy link
Collaborator

tilgovi commented Apr 28, 2019

I would not oppose an option, but I like having a tolerant default. Frameworks might take a stricter stance, but I think it's okay that a server, such as Gunicorn, be tolerant and support the deprecated characters, by default.

di pushed a commit to di/gunicorn that referenced this pull request Sep 21, 2019
This commit reverts one aspect changed by 5f4ebd2 (benoitc#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.

Fixed benoitc#1778.

Signed-off-by: Brett Randall <javabrett@gmail.com>
(cherry picked from commit 879651b)
tilgovi pushed a commit that referenced this pull request Oct 13, 2019
This commit reverts one aspect changed by 5f4ebd2 (#1151);
header-values are again encoded as latin-1 and not ascii. Test is restored but uses
a latin-1-mappable test-character, not a general utf8 character.

Fixed #1778.

Signed-off-by: Brett Randall <javabrett@gmail.com>
(cherry picked from commit 879651b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Encoding headers with ASCII breaks applications expecting them to be encoded in latin-1
6 participants