http/wsgi.py: do not unquote path before injecting into environ #1211
This is a follow-up to #930, where we decided that WSGI
In the current aiohttp implementation we find (https://github.com/KeepSafe/aiohttp/blob/master/aiohttp/wsgi.py):
This leads to a class of problems where escaped and non-escaped parts are not easily distinguishable anymore on framework-level, as vividly clarified by the following issue: https://code.djangoproject.com/ticket/15718
There, the issue appeared with Django+mod_wsgi. In our case, the issue appeared with Falcon+Gunicorn, but in effect it's exactly the same problem.
I think the most important insight is from aio-libs/aiohttp#177
So, I went ahead and simply removed the unquoting operation (exactly as done in aiohttp's wsgi.py). I ran tests against Python 3.4, and nothing broke.
This change affects three worker types (async, gthread, sync):
I think we can just merge this, mainly motivated by the fact that aiohttp uses the same method.
Still, it would be nice to add a test that breaks with the old behavior and passes with the new one. However, I am not familiar enough with the gunicorn test structure.
And of course the question is if applications out there in the world rely on that behavior ... :/
In our application we have to be able to distinguish real slashes from escaped slashes on framework level. That is, we have to use a custom branch of Gunicorn right now -- otherwise that information is lost once requests enter framework level.
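To illustrate the information loss described here, a minimal sketch using only the standard library follows. The helper names `split_after_unquote` and `split_before_unquote` are made up for this example and are not part of Gunicorn or Falcon.

```python
from urllib.parse import unquote


def split_after_unquote(raw_path):
    """Unquote the whole path first, then split it into segments.

    This mimics what a framework sees when the server has already
    unquoted the path: an escaped slash looks like a real separator.
    """
    return unquote(raw_path).strip("/").split("/")


def split_before_unquote(raw_path):
    """Split the raw path on real slashes first, then unquote each segment.

    An escaped slash (%2F) stays inside its segment.
    """
    return [unquote(segment) for segment in raw_path.strip("/").split("/")]


# Both raw paths collapse to the same segments once unquoted up front.
print(split_after_unquote("/files/a%2Fb"))
print(split_after_unquote("/files/a/b"))

# Splitting before unquoting keeps the escaped slash inside one segment.
print(split_before_unquote("/files/a%2Fb"))
```

With an already-unquoted path, the two requests above are indistinguishable at framework level, which is the problem this PR is about.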
Gunicorn (and other servers like Werkzeug) follows PEP 3333, which implicitly requires PATH_INFO and friends to be unquoted. If you don't unquote PATH_INFO, you are going to get a broken PATH_INFO. See https://www.python.org/dev/peps/pep-3333/#url-reconstruction for the URL reconstruction algorithm:
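The path-related step of that reconstruction can be sketched roughly as below. This is a simplified illustration, not the PEP's full algorithm (scheme, host, port and query string are left out), and the helper name `reconstruct_path` is made up for this example. It shows why the reconstruction only round-trips when PATH_INFO is already unquoted: `quote()` is applied again, so a still-quoted value gets double-encoded.

```python
from urllib.parse import quote


def reconstruct_path(environ):
    """Rebuild the path part of the request URL from a WSGI environ.

    Follows the approach of quoting SCRIPT_NAME and PATH_INFO, which
    assumes both values are stored unquoted in the environ.
    """
    script_name = environ.get("SCRIPT_NAME", "")
    path_info = environ.get("PATH_INFO", "")
    return quote(script_name) + quote(path_info)


# Server unquoted the path: reconstruction yields the original escaping.
print(reconstruct_path({"PATH_INFO": "/a b"}))

# Server left the path quoted: the percent sign itself gets escaped.
print(reconstruct_path({"PATH_INFO": "/a%20b"}))
```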
@berkerpeksag yes, the problem is the "implicitly". It's not explicit, so it leaves some room for interpretation.
Obviously, certain applications need to be able to distinguish escaped from non-escaped entities in the original request path. That's why mod_wsgi introduced the configuration variable
Now, what would you propose? Let me try to summarize:
It seems like double-encoding is the most reliable solution to that for applications.
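A minimal sketch of what double-encoding could look like, assuming the client controls how path segments are built and the application splits PATH_INFO itself. All function names here are hypothetical; `server_unquote` only stands in for the single unquoting pass the server performs.

```python
from urllib.parse import quote, unquote


def client_encode_segment(value):
    """Client side: percent-encode a path segment twice.

    safe="" makes sure the slash itself is escaped as well.
    """
    return quote(quote(value, safe=""), safe="")


def server_unquote(raw_path):
    """Stand-in for the one unquoting pass done before PATH_INFO is set."""
    return unquote(raw_path)


def app_decode_segments(path_info):
    """Application side: split on real slashes, then undo the second layer."""
    return [unquote(segment) for segment in path_info.strip("/").split("/")]


raw_path = "/files/" + client_encode_segment("a/b")
path_info = server_unquote(raw_path)

print(raw_path)                        # what goes over the wire
print(path_info)                       # what the framework receives
print(app_decode_segments(path_info))  # the slash survives inside one segment
```

The cost is that every client has to know about the convention, and the application must apply the second decoding step consistently.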
Applications that need the raw URI can get it from the RAW_URI variable that gunicorn puts in the WSGI environ. This is how it has been done for a while. I would stick with that approach so we keep complying strictly with the WSGI spec. IMO we should have the same behaviour in all workers. Thoughts?
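For reference, reading the raw URI from the environ could look roughly like this. It is only a sketch: `raw_path_segments` is a made-up helper, RAW_URI is gunicorn-specific rather than part of PEP 3333, and the fallback branch assumes any other server has already unquoted PATH_INFO.

```python
from urllib.parse import unquote, urlsplit


def raw_path_segments(environ):
    """Return path segments with escaped slashes kept inside a segment.

    Uses gunicorn's RAW_URI when present; otherwise falls back to
    PATH_INFO, where escaped and real slashes can no longer be told apart.
    """
    raw_uri = environ.get("RAW_URI")
    if raw_uri is None:
        # Not running under gunicorn (assumption): PATH_INFO is already unquoted.
        return environ.get("PATH_INFO", "").strip("/").split("/")
    # Drop the query string, split on real slashes, then unquote each piece.
    raw_path = urlsplit(raw_uri).path
    return [unquote(segment) for segment in raw_path.strip("/").split("/")]


print(raw_path_segments({"RAW_URI": "/files/a%2Fb?x=1", "PATH_INFO": "/files/a/b"}))
print(raw_path_segments({"PATH_INFO": "/files/a/b"}))
```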
@benoitc thanks for the pointer towards
So, this PR can be closed, I guess. I'm not exactly sure what you mean by
Is that not the case as of now?
Just FYI, in our application, we are now relying on clients/API consumers to replace slashes that are supposed to be transparent to the URL template engine of our WSGI application with a