New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad Request when sending utf-8 encoded http path under python3 #1577

Closed
rslinckx opened this Issue Aug 24, 2017 · 8 comments

Comments

Projects
None yet
3 participants
@rslinckx
Contributor

rslinckx commented Aug 24, 2017

This is the sent data:

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 8080))
print(s.send('GET /à%20k HTTP/1.0\r\n\r\n'.encode('utf-8')))
print(s.recv(100000).decode('ascii'))

This is the response

HTTP/1.1 400 Bad Request
Connection: close
Content-Type: text/html
Content-Length: 181

<html>
  <head>
    <title>Bad Request</title>
  </head>
  <body>
    <h1><p>Bad Request</p></h1>
    Invalid HTTP Version 'Invalid HTTP Version: '%20k HTTP/1.0''
  </body>
</html>

This is because the request line is first decoded as latin1 using the _compat:bytes_to_str this causes the "à" to be returned as "\xc3\xa0", then the request line is split using line.split(None, 2) which will consider the \xa0 (non breaking space) as whitespace and strip it, thus rendering the request line invalid.

A first attempt would be to use line.split(' ', 2) but then the split will no longer eat up all consecutive whitespaces and may introduce other bugs.

I'm not sure what would be the best solution here.

@rslinckx

This comment has been minimized.

Contributor

rslinckx commented Aug 24, 2017

Note that python2 is unaffected because bytes_to_str is a no-op in this case:

python2:

>>> 'GET /\xc3\xa0%20 HTTP'.split(None, 2)
['GET', '/\xc3\xa0%20', 'HTTP']

python3:

>>> b'GET /\xc3\xa0%20 HTTP'.decode('latin1').split(None, 2)
['GET', '/Ã', '%20 HTTP']
@tilgovi

This comment has been minimized.

Collaborator

tilgovi commented Aug 24, 2017

Could we split before decoding? [part.decode('latin1') for part in b'GET /\xc3\xa0%20 HTTP'.split(None, 2)]?

@rslinckx

This comment has been minimized.

Contributor

rslinckx commented Aug 24, 2017

That would at least preserve the python2 behavior, i don't know if there's an encoding that might break if split on space, i guess not.

@tilgovi

This comment has been minimized.

Collaborator

tilgovi commented Aug 24, 2017

I would think not or else the HTTP request line would be malformed, since the encoding there is pretty strict.

@tilgovi

This comment has been minimized.

Collaborator

tilgovi commented Aug 24, 2017

Thanks for raising this issue. Would you be willing to make a PR for it?

@benoitc

This comment has been minimized.

Owner

benoitc commented Aug 24, 2017

mmm shouldn't the start line be encoded in us-ascii though ?

http://httpwg.org/specs/rfc7230.html#rfc.section.3

@rslinckx

This comment has been minimized.

Contributor

rslinckx commented Aug 25, 2017

This section describe exactly the problem gunicorn is having in this case:

A recipient MUST parse an HTTP message as a sequence of octets in an encoding that is a superset of US-ASCII [USASCII]. Parsing an HTTP message as a stream of Unicode characters, without regard for the specific encoding, creates security vulnerabilities due to the varying ways that string processing libraries handle invalid multibyte character sequences that contain the octet LF (%x0A). String-based parsers can only be safely used within protocol elements after the element has been extracted from the message, such as within a header field-value after message parsing has delineated the individual fields.

In any case the parsing of the request line should be done on bytes.

It's debatable if the request line should ascii-only. The spec is a bit unclear. However there are clients in the wild that just send those request lines and I'd rather handle them in the application which can do further processing to handle them properly.

@rslinckx

This comment has been minimized.

Contributor

rslinckx commented Aug 25, 2017

I made a pull request with a test case which failed under py3 and now passes.

I'm not sure about bytes_to_str() when raising InvalidRequestLine. I fear that having byte strings as message is not expected hence the conversion.

@benoitc benoitc closed this in 15e901a Sep 2, 2017

benoitc added a commit that referenced this issue Sep 2, 2017

Merge pull request #1578 from rslinckx/master
http/message: Split request line as bytes to avoid splitting on 0x0A. Fixes #1577

fofanov pushed a commit to fofanov/gunicorn that referenced this issue Mar 16, 2018

fofanov pushed a commit to fofanov/gunicorn that referenced this issue Mar 16, 2018

Merge pull request benoitc#1578 from rslinckx/master
http/message: Split request line as bytes to avoid splitting on 0x0A. Fixes benoitc#1577
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment