Automatic URL percent decoding can break S3 downloads #4307

Closed

JustAnotherArchivist opened this issue Oct 30, 2019 · 12 comments

@JustAnotherArchivist (Contributor)

Long story (not so) short

aiohttp (or rather yarl?) automatically decodes certain percent encodings. While that is legal and in fact recommended per RFC 7230 (§2.7.3), it may break retrieval in certain cases. The specific issue I was having involved AWS S3 URLs with signatures. Unfortunately, the real-world example of this is no longer available as the website in question (picosong) shut down a few days ago. However, this does not actually rely on S3, and I have included an alternative example server with similar behaviour below.

The site in question redirected to pre-signed S3 URLs. These look like https://bucket.s3.amazonaws.com/path/filename?AWSAccessKeyId=X&Signature=Y&Expires=Z, and the parameters are used for authentication of the request. The key thing for this issue is that the signature calculation includes the filename, and specifically, it includes the filename in one particular encoding. For example, if the signature is generated using the filename foo.txt, requesting it as foo%2Etxt causes a signature mismatch, even though the two URLs are equivalent per RFC 7230. This is a quirk of S3, but a perfectly legal one as far as I can tell, since URL normalisation is optional.

If such a redirect now points to a file named foo%27bar, the next request is sent by aiohttp for foo'bar instead, because the percent encoding of the single quote is decoded automatically. This makes it impossible to download such files with aiohttp without a workaround (see below). The problem is not actually limited to redirects, as the steps to reproduce below show; however, in the redirect case it is more complicated to work around.
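To illustrate the decoding with yarl directly, here is a minimal sketch (behaviour as observed with yarl 1.3.0; the bucket URL is made up):

import yarl

# During normalisation, %27 (the single quote) is decoded ...
print(yarl.URL('https://bucket.s3.amazonaws.com/foo%27bar').raw_path)
# -> /foo'bar

# ... unless the URL is marked as already encoded.
print(yarl.URL('https://bucket.s3.amazonaws.com/foo%27bar', encoded=True).raw_path)
# -> /foo%27bar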

Steps to reproduce

Have a web server that accepts a request for /%61 (e.g. HTTP 200) but not for /a (e.g. HTTP 403). In fact, any character from the set a-zA-Z0-9-._~!$'()*,+&=;@: would do from what I've seen.
(Yes, those two URLs are equivalent per RFC, and normal HTTP servers would handle them equivalently. If this bothers you, please replace it with /%61?filenamehash=<MD5("/%61")> or similar, where the server checks if the hash matches the path; this is a simplified version of the real-world S3 scenario described above.)
Here's a very simple example server, intentionally independent of aiohttp:

import asyncio


class HttpServerProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        # Accept only the percent-encoded request target /%61;
        # the normalised form /a (or anything else) gets a 403.
        if data.startswith(b'GET /%61 HTTP/1.1\r\n'):
            self.transport.write(b'\r\n'.join([
                b'HTTP/1.1 200 OK',
                b'Content-Type: text/plain',
                b'Content-Length: 0',
                b'',
                b'',
            ]))
        else:
            self.transport.write(b'\r\n'.join([
                b'HTTP/1.1 403 Forbidden',
                b'Content-Type: text/plain',
                b'Content-Length: 0',
                b'',
                b'',
            ]))


loop = asyncio.get_event_loop()
coro = loop.create_server(HttpServerProtocol, '127.0.0.1', 8080)
server = loop.run_until_complete(coro)
print('Serving on {}'.format(server.sockets[0].getsockname()))
try:
    loop.run_forever()
except KeyboardInterrupt:
    pass
server.close()
loop.run_until_complete(server.wait_closed())
loop.close()

Now try to request /%61 with aiohttp:

import aiohttp
import asyncio


async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://127.0.0.1:8080/%61') as response:
            print(response.status)


loop = asyncio.get_event_loop()
loop.run_until_complete(fetch())
loop.close()

Expected behaviour

200

Actual behaviour

403 because the server receives a GET /a HTTP/1.1 request instead of GET /%61 HTTP/1.1

Workaround

The workaround is to first convert the URL to a yarl.URL with the encoded flag set to True:

        async with session.get(yarl.URL('http://127.0.0.1:8080/%61', encoded=True)) as response:

Of course, this requires that the URL is indeed encoded correctly, so this workaround will not be usable in the general case.

In the case of redirects, as in the example described above, this gets more complicated as you have to implement your own logic for following redirects: fetch with allow_redirects = False, check the response status code, get the Location header, throw it into yarl.URL with the encoded flag, repeat.
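A minimal sketch of that loop, assuming the original URL and every Location header are already correctly encoded (the function name, status list, and redirect limit are illustrative choices here, not aiohttp API):

import yarl


async def fetch_without_requoting(session, url, max_redirects=10):
    # Treat the URL as already encoded so that yarl does not normalise it.
    url = yarl.URL(url, encoded=True)
    for _ in range(max_redirects):
        async with session.get(url, allow_redirects=False) as response:
            if response.status in (301, 302, 303, 307, 308) and 'Location' in response.headers:
                # Take the Location header verbatim, again bypassing normalisation.
                url = url.join(yarl.URL(response.headers['Location'], encoded=True))
                continue
            return response.status, await response.read()
    raise RuntimeError('Too many redirects')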

Closing thoughts

This is really an issue on the interface between aiohttp and yarl, not a pure aiohttp or pure yarl problem. I decided to file it against aiohttp since that is where the problem becomes visible to users.

I have no idea what the proper fix for this issue is. Presumably, there should be a way to disable URL normalisation, which would leave those percent-encoded characters as they are; I assume that would be a new kwarg to yarl.URL. Perhaps that should even be the default, since it would match the behaviour of browsers (at least Firefox) as far as I can tell; but if so, I suppose that default would reside in aiohttp, and yarl itself should still default to normalisation.

Your environment

Python 3.6.8 on Debian
aiohttp: I primarily experienced and investigated this issue with version 2.3.10, but it is still present in the current 3.6.2. (I have not tested the 4.x branch.)
yarl: 1.3.0

@asvetlov (Member)

asvetlov commented Nov 1, 2019

Did you try yarl.URL(url_str, encoded=True)?

@JustAnotherArchivist (Contributor, Author)

Yes, see "Workaround" above. But that can only be used if you're sure that the URL is indeed encoded correctly, and you can't rely on that in the general case.

@asvetlov (Member)

asvetlov commented Nov 1, 2019

Are you asking for an option to disable URL normalization, like URL(str, normalize=False)? It could percent-encode all forbidden characters but keep all valid symbols untouched, even if they are not normalized.

I think the normalization is pretty important, so it should be enabled by default. If people want to skip this step, it should require an explicit action, such as passing a normalize argument.

@JustAnotherArchivist (Contributor, Author)

Yes, disabling normalisation is what I was thinking. And it should be possible to do this through aiohttp as well so that you don't have to manually handle redirects.

I agree that normalisation is important, but why does it have to be enabled by default? Do you have any examples of things that break when it isn't used?

@asvetlov (Member)

asvetlov commented Nov 3, 2019

In an ideal world, people would always use only normalized URLs.
Realistically, they are lazy humans, very often not networking experts.
The library should do its best to help them.
Normalization can hurt only when the HTTP message needs to be signed and the signature covers HTTP headers. For that case it would be better to provide an API like a prepared request, which contains all headers to be sent plus the normalized URL.
The signing should be done at that step.

@JustAnotherArchivist (Contributor, Author)

It's worth mentioning that before RFC 7230 was released in 2014, normalisation (at least in the context of HTTP) had not been standardised at all. So older web servers can be a source of legitimately non-normalised URLs. And of course not all maintained servers have been "fixed" since the release of RFC 7230.

By the way, at least Firefox does not normalise URLs before sending the request, even though it displays the normalised URL in the URL bar (and also in the dev tools until you look at the request details). I have not tested any other browsers, but that is another indication that it's probably a good idea to disable normalisation. Users should be able to expect that a request that works in a browser also works in aiohttp.

I agree that the signing in the example is broken and should use the normalised URL, but in the case mentioned in the original report, that URL comes from a remote server, and I can't influence that process in any way.

@asvetlov (Member)

asvetlov commented Nov 7, 2019

We can disable normalization on redirect handling.

@JustAnotherArchivist (Contributor, Author)

Yes, that seems like a reasonable solution.

@asvetlov (Member)

Fixed by yarl 1.6, I hope.

@derlih (Contributor)

derlih commented Dec 23, 2020

@JustAnotherArchivist I'm closing this issue. If you still have this problem, it can be reopened.

@x0day

x0day commented Aug 17, 2021

Is there some way to disable this? The bug still happens with some S3 download services when the download URL comes from the Location header, and it does not occur with requests or httpx.

@x0day

x0day commented Aug 17, 2021

Solved by the requote_redirect_url parameter in ClientSession.
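For reference, a minimal sketch of that (requote_redirect_url defaults to True; note that it only affects URLs taken from Location headers during redirects, so a pre-encoded initial URL still needs yarl.URL(..., encoded=True) as in the workaround above):

import aiohttp


async def fetch(url):
    # Do not re-quote/normalise redirect target URLs from Location headers.
    async with aiohttp.ClientSession(requote_redirect_url=False) as session:
        async with session.get(url) as response:
            return response.status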
