Requesting page with non-ASCII URL fails in Jinja template rendering #2328
Comments
@bzar yes, please! A patch that expects only UTF-8/ASCII sounds great. We should only be passing unicode through to Jinja.
Digging into this I have uncovered the following:
Should this "fix" be reverted or at least parametrized?
@bzar we should revert it. The legacy templates have been removed, but there is still support for extensions using the legacy template engine. Write the PR and we can ask @davidread to test whether it breaks their legacy templates. If not, we should backport this fix too.
I've read and reread this, but I have lots of questions about what is meant by failure and why, and I would need to dig a lot to understand how to test this. We've switched to Jinja templates for the search page now, so I'm not in a good position to test it anyway, I'm afraid. Sorry.
Testing it is as simple as requesting The failure is a "500 internal server error" reply caused by rendering a non-ASCII Python byte string (not a unicode object) inside a Jinja template, namely the request URL stored in CKAN_CURRENT_URL. I'll prepare the pull request.
This seems to have returned. The same I'm looking into it, but printing
Jinja only accepts unicode or ASCII. ckan/lib/base.py relays requested URLs in raw format to Jinja, causing UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 20: ordinal not in range(128) if the URL or any of its parts is rendered to the page.
This issue is visible, for example, when searching for datasets using a browser that does not auto-escape non-ASCII characters. For example, IE does not escape while Firefox does. The bug is easily reproduced by searching datasets with "ä" as the search term using IE, or by using curl:
curl -v "http://demo.ckan.org/dataset?q=ä"
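The failure mode can be illustrated outside CKAN. The following is a minimal sketch, not CKAN code: it assumes the server receives the query as UTF-8 bytes and makes explicit the ASCII decode that Python 2 performs implicitly when a byte string is mixed with unicode. The URL path is shortened for illustration, so the byte position in the message differs from the one reported above.

```python
# -*- coding: utf-8 -*-
# Assumption: the raw request URL arrives as UTF-8 encoded bytes.
raw_url = u"/dataset?q=ä".encode("utf-8")

error = None
try:
    # Python 2 performs this ASCII decode implicitly when a byte string
    # is combined with unicode text, e.g. during template rendering.
    raw_url.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The first byte of the encoded "ä" (0xc3) is outside the ASCII range, which is why the decode fails at that position.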
URLs and any other data used in Jinja templates should be decoded into unicode before rendering. This is not a trivial fix, since correctly identifying an unknown encoding in all cases is borderline, if not outright, impossible.
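A short sketch of why detection is ambiguous: the same two bytes are valid in both UTF-8 and latin-1 but decode to different text, so the bytes alone do not reveal which encoding the client used. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# The two bytes a UTF-8 client sends for the character "ä".
raw = b"\xc3\xa4"

as_utf8 = raw.decode("utf-8")      # one character: U+00E4
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3, U+00A4

# Both decodes succeed, yet the results differ.
print(len(as_utf8), len(as_latin1))
```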
One possibility would be to assume requests come in UTF-8 or ASCII and just decode everything as if it were UTF-8 (ASCII-only text is byte-for-byte identical in UTF-8). This would not work for other encodings like latin-1, but would be a start.
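The UTF-8/ASCII-only approach could look roughly like the sketch below. The helper name to_unicode is hypothetical and not part of CKAN. Substituting the replacement character for undecodable bytes instead of raising is an assumption made here so that a latin-1 request produces garbled text and not a 500 error.

```python
# -*- coding: utf-8 -*-
def to_unicode(value, encoding="utf-8"):
    """Hypothetical helper: return text suitable for passing to a template.

    Byte strings are decoded as UTF-8 (which also covers pure ASCII).
    Bytes that are not valid UTF-8 become U+FFFD instead of raising.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, "replace")
    # Already unicode text: pass through unchanged.
    return value


# UTF-8 input decodes cleanly; latin-1 input degrades but does not raise.
url_text = to_unicode(b"/dataset?q=\xc3\xa4")
latin1_text = to_unicode(b"\xe4")
```

A stricter variant could drop the "replace" argument and answer with a 400 response on UnicodeDecodeError, which would reject malformed requests instead of rendering a replacement character.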
I can create a patch and pull request for the UTF-8/ASCII-only solution if it's considered acceptable. Otherwise I'd like to request comments on other options. BeautifulSoup has one solution for a few encodings.