Automatic URL percent decoding can break S3 downloads #4307
Comments
Did you try
Yes, see "Workaround" above. But that can only be used if you're sure that the URL is indeed encoded correctly, and you can't rely on that in the general case.
Are you asking for optional disabling of URL normalization? I think normalization is pretty important; it should be enabled by default.
Yes, disabling normalisation is what I was thinking. And it should be possible to do this through aiohttp as well, so that you don't have to handle redirects manually. I agree that normalisation is important, but why does it have to be enabled by default? Do you have any examples of things that break when it isn't used?
In an ideal world, people would always use only normalized URLs.
It's worth mentioning that before RFC 7230 was released in 2014, normalisation (at least in the context of HTTP) had not been standardised at all. So older web servers can be a source of legitimately non-normalised URLs. And of course not all maintained servers have been "fixed" since the release of RFC 7230.

By the way, at least Firefox does not normalise URLs before sending the request, even though it displays the normalised URL in the URL bar (and also in the dev tools, until you look at the request details). I have not tested any other browsers, but that is another indication that it's probably a good idea to disable normalisation. Users should be able to expect that a request that works in a browser also works in aiohttp.

I agree that the signing in the example is broken and should use the normalised URL, but in the case mentioned in the original report, that URL comes from a remote server, and I can't influence that process in any way.
We can disable normalization on redirect handling. |
Yes, that seems like a reasonable solution. |
Fixed by yarl 1.6 I hope |
@JustAnotherArchivist I'm closing this issue |
Is there some way to disable this? This bug still happens with some S3 download services when the download URL is in the `Location` header, and the bug does not exist in requests or httpx.
Solved by the `requote_redirect_url` parameter in `ClientSession`.
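For anyone landing here later, a minimal sketch of how that parameter is used (the target URL is a placeholder, not from this thread):

```python
import asyncio

import aiohttp


async def main() -> None:
    # requote_redirect_url=False makes aiohttp pass Location headers
    # through as-is instead of re-quoting (and partially decoding) them.
    async with aiohttp.ClientSession(requote_redirect_url=False) as session:
        # Placeholder URL: substitute the service that redirects to S3.
        async with session.get("https://example.com/download") as resp:
            print(resp.status)
```

Run it with `asyncio.run(main())`.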
Long story (not so) short
aiohttp (or rather yarl?) automatically decodes certain percent encodings. While that is legal and in fact recommended per RFC 7230 (§2.7.3), it may break retrieval in certain cases. The specific issue I was having involved AWS S3 URLs with signatures. Unfortunately, the real-world example of this is no longer available as the website in question (picosong) shut down a few days ago. However, this does not actually rely on S3, and I have included an alternative example server with similar behaviour below.
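The decoding is easy to observe with yarl alone (a minimal sketch; the host is a placeholder):

```python
from yarl import URL

# "%61" is a percent-encoded "a". Since "a" is an unreserved character,
# yarl's normalisation decodes it while parsing the URL.
url = URL("https://example.com/%61")
print(url.raw_path)  # the path that would actually be sent on the wire
```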
The site in question redirected to pre-signed S3 URLs. These look like `https://bucket.s3.amazonaws.com/path/filename?AWSAccessKeyID=X&Signature=Y&Expires=Z`, and the parameters are used for authentication of the request. The key thing for this issue is that the signature calculation includes the filename, and specifically, it includes the filename exactly in a particular encoding. For example, if the signature is generated using the filename `foo.txt`, requesting it as `foo%2Etxt` causes a signature mismatch, even though the URLs are equivalent per RFC 7230. This is a quirk of S3, but unfortunately a perfectly legal quirk as far as I can tell, since URL normalisation is optional.

If such a redirect now points to a file named `foo%27bar`, the next request is sent by aiohttp using `foo'bar` instead, because the percent encoding of the single quote is decoded automatically. This means that it is impossible to download such files with aiohttp without a workaround (see below). This is actually not specific to redirects, as shown in the steps to reproduce (StR) below. However, in the case of redirects, it's more complicated to work around the problem.

Steps to reproduce
Have a web server that accepts a request for `/%61` (e.g. HTTP 200) but not for `/a` (e.g. HTTP 403). In fact, any character from the set `a-zA-Z0-9-._~!$'()*,+&=;@:` would do from what I've seen.

(Yes, those two URLs are equivalent per the RFC, and normal HTTP servers would handle them equivalently. If this bothers you, please replace it with `/%61?filenamehash=<MD5("/%61")>` or similar, where the server checks whether the hash matches the path; this is a simplified version of the real-world S3 scenario described above.)

Here's a very simple example server, intentionally independent of aiohttp:
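Such a server can be sketched with just the standard library (a reconstruction, not the original snippet; the port choice is mine):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Accept only the *encoded* form of the path. A client that
        # normalises "%61" to "a" before sending receives a 403.
        status = 200 if self.path == "/%61" else 403
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args) -> None:
        pass  # keep the example quiet


def run(port: int = 8080) -> None:
    # Serve forever on localhost; stop with Ctrl+C.
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Start it with `run()`; a raw `GET /%61` gets a 200, while `GET /a` gets a 403.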
Now try to request `/%61` with aiohttp.

Expected behaviour
200
Actual behaviour
403
because the server receives a `GET /a HTTP/1.1` request instead of `GET /%61 HTTP/1.1`.
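The client side can be sketched as follows (a reconstruction, not the original snippet; it assumes the example server is listening on 127.0.0.1, and the port is my choice):

```python
import asyncio

import aiohttp


async def request_encoded_path(port: int = 8080) -> int:
    # aiohttp (via yarl) normalises "%61" to "a" before sending, so the
    # example server sees "GET /a HTTP/1.1" and answers 403, not 200.
    async with aiohttp.ClientSession() as session:
        async with session.get(f"http://127.0.0.1:{port}/%61") as resp:
            return resp.status
```

With the example server running, `asyncio.run(request_encoded_path())` returns the 403 described above rather than the expected 200.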
Workaround
The workaround is to first convert the URL to a `yarl.URL` with the `encoded` flag set to `True`.

Of course, this requires that the URL is indeed encoded correctly, so this workaround will not be usable in the general case.
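A sketch of the workaround, including the manual redirect handling for the redirect case (the helper name and redirect limit are my own choices, not from the report):

```python
import aiohttp
from yarl import URL

REDIRECT_CODES = {301, 302, 303, 307, 308}


async def fetch_raw(session: aiohttp.ClientSession, url: str,
                    max_redirects: int = 10) -> aiohttp.ClientResponse:
    # encoded=True tells yarl to trust the given percent encodings and
    # skip normalisation, so "%61" stays "%61" on the wire.
    target = URL(url, encoded=True)
    for _ in range(max_redirects):
        resp = await session.get(target, allow_redirects=False)
        if resp.status not in REDIRECT_CODES or "Location" not in resp.headers:
            return resp
        resp.release()
        # Relative Location values would need joining against resp.url;
        # omitted here for brevity.
        target = URL(resp.headers["Location"], encoded=True)
    raise RuntimeError("too many redirects")
```

As noted above, this only works when the input URL is already correctly encoded.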
In the case of redirects, as in the example described above, this gets more complicated, as you have to implement your own logic for following redirects: fetch with `allow_redirects=False`, check the response status code, get the `Location` header, throw it into `yarl.URL` with the `encoded` flag, and repeat.

Closing thoughts
This is really an issue on the interface between aiohttp and yarl, not a pure aiohttp or pure yarl problem. I decided to file it against aiohttp since that is where the problem becomes visible to users.
I have no idea what the proper fix for this issue is. Presumably, there should be a way to disable URL normalisation, which would leave those percent-encoded characters as they are; I assume that would be a new kwarg to `yarl.URL`. Perhaps that should even be the default, since it would match the behaviour of browsers (at least Firefox) as far as I can tell; but if so, I guess this would reside in aiohttp, and yarl itself should still default to normalisation.

Your environment
Python 3.6.8 on Debian
aiohttp: I primarily experienced and investigated this issue with version 2.3.10, but it is still present in the current 3.6.2. (I have not tested the 4.x branch.)
yarl: 1.3.0