Automatic URL percent decoding can break S3 downloads #4307

Closed

JustAnotherArchivist opened this issue Oct 30, 2019 · 12 comments

@JustAnotherArchivist (Contributor)

Long story (not so) short

aiohttp (or rather yarl?) automatically decodes certain percent encodings. While that is legal and in fact recommended per RFC 7230 (§2.7.3), it may break retrieval in certain cases. The specific issue I was having involved AWS S3 URLs with signatures. Unfortunately, the real-world example of this is no longer available as the website in question (picosong) shut down a few days ago. However, this does not actually rely on S3, and I have included an alternative example server with similar behaviour below.

The site in question redirected to pre-signed S3 URLs. These look like https://bucket.s3.amazonaws.com/path/filename?AWSAccessKeyId=X&Signature=Y&Expires=Z, and the parameters are used for authentication of the request. The key thing for this issue is that the signature calculation includes the filename, and specifically, it includes the filename in one particular encoding. For example, if the signature is generated using the filename foo.txt, requesting it as foo%2Etxt causes a signature mismatch, even though the two URLs are equivalent per RFC 7230. This is a quirk of S3, but a perfectly legal one as far as I can tell, since URL normalisation is optional.

If such a redirect now points to a file named foo%27bar, the next request is sent by aiohttp for foo'bar instead, because the percent encoding of the single quote is decoded automatically. This makes it impossible to download such files with aiohttp without a workaround (see below). The problem is not actually limited to redirects, as the steps to reproduce below show; however, in the redirect case it is more complicated to work around.
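To illustrate the decoding with yarl directly, here is a minimal sketch (behaviour as observed with yarl 1.3.0; the bucket URL is made up):

import yarl

# During normalisation, %27 (the single quote) is decoded ...
print(yarl.URL('https://bucket.s3.amazonaws.com/foo%27bar').raw_path)
# -> /foo'bar

# ... unless the URL is marked as already encoded.
print(yarl.URL('https://bucket.s3.amazonaws.com/foo%27bar', encoded=True).raw_path)
# -> /foo%27bar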

Steps to reproduce

Have a web server that accepts a request for /%61 (e.g. HTTP 200) but not for /a (e.g. HTTP 403). In fact, any character from the set a-zA-Z0-9-._~!$'()*,+&=;@: would do from what I've seen.
(Yes, those two URLs are equivalent per RFC, and normal HTTP servers would handle them equivalently. If this bothers you, please replace it with /%61?filenamehash=<MD5("/%61")> or similar, where the server checks if the hash matches the path; this is a simplified version of the real-world S3 scenario described above.)
Here's a very simple example server, intentionally independent of aiohttp:

import asyncio


class HttpServerProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        # Accept only the percent-encoded request target /%61;
        # the normalised form /a (or anything else) gets a 403.
        if data.startswith(b'GET /%61 HTTP/1.1\r\n'):
            self.transport.write(b'\r\n'.join([
                b'HTTP/1.1 200 OK',
                b'Content-Type: text/plain',
                b'Content-Length: 0',
                b'',
                b'',
            ]))
        else:
            self.transport.write(b'\r\n'.join([
                b'HTTP/1.1 403 Forbidden',
                b'Content-Type: text/plain',
                b'Content-Length: 0',
                b'',
                b'',
            ]))


loop = asyncio.get_event_loop()
coro = loop.create_server(HttpServerProtocol, '127.0.0.1', 8080)
server = loop.run_until_complete(coro)
print('Serving on {}'.format(server.sockets[0].getsockname()))
try:
    loop.run_forever()
except KeyboardInterrupt:
    pass
server.close()
loop.run_until_complete(server.wait_closed())
loop.close()

Now try to request /%61 with aiohttp:

import aiohttp
import asyncio


async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://127.0.0.1:8080/%61') as response:
            print(response.status)


loop = asyncio.get_event_loop()
loop.run_until_complete(fetch())
loop.close()

Expected behaviour

200

Actual behaviour

403 because the server receives a GET /a HTTP/1.1 request instead of GET /%61 HTTP/1.1

Workaround

The workaround is to first convert the URL to a yarl.URL with the encoded flag set to True:

        async with session.get(yarl.URL('http://127.0.0.1:8080/%61', encoded=True)) as response:

Of course, this requires that the URL is indeed encoded correctly, so this workaround will not be usable in the general case.

In the case of redirects, as in the example described above, this gets more complicated as you have to implement your own logic for following redirects: fetch with allow_redirects = False, check the response status code, get the Location header, throw it into yarl.URL with the encoded flag, repeat.
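A minimal sketch of that loop, assuming the original URL and every Location header are already correctly encoded (the function name, status list, and redirect limit are illustrative choices here, not aiohttp API):

import yarl


async def fetch_without_requoting(session, url, max_redirects=10):
    # Treat the URL as already encoded so that yarl does not normalise it.
    url = yarl.URL(url, encoded=True)
    for _ in range(max_redirects):
        async with session.get(url, allow_redirects=False) as response:
            if response.status in (301, 302, 303, 307, 308) and 'Location' in response.headers:
                # Take the Location header verbatim, again bypassing normalisation.
                url = url.join(yarl.URL(response.headers['Location'], encoded=True))
                continue
            return response.status, await response.read()
    raise RuntimeError('Too many redirects')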

Closing thoughts

This is really an issue on the interface between aiohttp and yarl, not a pure aiohttp or pure yarl problem. I decided to file it against aiohttp since that is where the problem becomes visible to users.

I have no idea what the proper fix for this issue is. Presumably, there should be a way to disable URL normalisation, which would leave those percent-encoded characters as they are; I assume that would be a new kwarg to yarl.URL. Perhaps that should even be the default, since it would match the behaviour of browsers (at least Firefox) as far as I can tell; but if so, I suppose that default would reside in aiohttp, and yarl itself should still default to normalisation.

Your environment

Python 3.6.8 on Debian
aiohttp: I primarily experienced and investigated this issue with version 2.3.10, but it is still present in the current 3.6.2. (I have not tested the 4.x branch.)
yarl: 1.3.0

@asvetlov (Member)

asvetlov commented Nov 1, 2019

Did you try yarl.URL(url_str, encoded=True)?

@JustAnotherArchivist (Contributor, Author)

Yes, see "Workaround" above. But that can only be used if you're sure that the URL is indeed encoded correctly, and you can't rely on that in the general case.

@asvetlov (Member)

asvetlov commented Nov 1, 2019

Are you asking for an option to disable URL normalization, like URL(str, normalize=False)? It could percent-encode all forbidden characters but keep all valid symbols untouched, even if they are not normalized.

I think the normalization is pretty important, so it should be enabled by default. If people want to skip this step, it should require an explicit action, such as passing a normalize argument.

@JustAnotherArchivist (Contributor, Author)

Yes, disabling normalisation is what I was thinking. And it should be possible to do this through aiohttp as well so that you don't have to manually handle redirects.

I agree that normalisation is important, but why does it have to be enabled by default? Do you have any examples of things that break when it isn't used?

@asvetlov (Member)

asvetlov commented Nov 3, 2019

In an ideal world, people would always use only normalized URLs.
Realistically, they are lazy humans, very often not networking experts.
The library should do its best to help them.
Normalization can hurt only when the HTTP message needs to be signed and the signature covers HTTP headers. For that case it would be better to provide an API like a prepared request, which contains all headers to be sent plus the normalized URL.
The signing should be done at that step.

@JustAnotherArchivist (Contributor, Author)

It's worth mentioning that before RFC 7230 was released in 2014, normalisation (at least in the context of HTTP) had not been standardised at all. So older web servers can be a source of legitimately non-normalised URLs. And of course not all maintained servers have been "fixed" since the release of RFC 7230.

By the way, at least Firefox does not normalise URLs before sending the request, even though it displays the normalised URL in the URL bar (and also in the dev tools until you look at the request details). I have not tested any other browsers, but that is another indication that it's probably a good idea to disable normalisation. Users should be able to expect that a request that works in a browser also works in aiohttp.

I agree that the signing in the example is broken and should use the normalised URL, but in the case mentioned in the original report, that URL comes from a remote server, and I can't influence that process in any way.

@asvetlov (Member)

asvetlov commented Nov 7, 2019

We can disable normalization on redirect handling.

@JustAnotherArchivist (Contributor, Author)

Yes, that seems like a reasonable solution.

@asvetlov (Member)

Fixed by yarl 1.6, I hope.

@derlih (Contributor)

derlih commented Dec 23, 2020

@JustAnotherArchivist I'm closing this issue. If you still have this problem, it can be reopened.

@x0day

x0day commented Aug 17, 2021

Is there some way to disable this? The bug still happens with some S3 download services when the download URL comes from the Location header, and it does not occur with requests or httpx.

@x0day

x0day commented Aug 17, 2021

Solved by the requote_redirect_url parameter in ClientSession.
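For reference, a minimal sketch of that (requote_redirect_url defaults to True; note that it only affects URLs taken from Location headers during redirects, so a pre-encoded initial URL still needs yarl.URL(..., encoded=True) as in the workaround above):

import aiohttp


async def fetch(url):
    # Do not re-quote/normalise redirect target URLs from Location headers.
    async with aiohttp.ClientSession(requote_redirect_url=False) as session:
        async with session.get(url) as response:
            return response.status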
