Requesting page with non-ASCII URL fails in Jinja template rendering #2328

Closed
bzar opened this issue Feb 27, 2015 · 6 comments

@bzar
Contributor

bzar commented Feb 27, 2015

Jinja only accepts unicode or ASCII

ckan/lib/base.py relays requested URLs in raw format to Jinja, causing UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 20: ordinal not in range(128) if URL or its parts are rendered to page.

This issue is visible, for example, when searching for datasets using a browser that does not auto-escape non-ASCII characters: IE does not escape them, while Firefox does. The bug is easily reproduced by searching datasets with "ä" as the search term using IE, or by using curl: curl -v "http://demo.ckan.org/dataset?q=ä"

URLs and any other data used in Jinja templates should be decoded into unicode before rendering. This is not a trivial fix, since correctly identifying an unknown encoding in all cases is borderline, if not outright, impossible.

One possibility would be to assume requests come in UTF-8 or ASCII and just decode everything as if it was UTF-8 (since only-ASCII UTF-8 is the same as the same text in ASCII). This would not work for other encodings like latin-1, but would be a start.
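
A minimal sketch of the UTF-8/ASCII assumption described above. It is written for Python 3 for illustration only (CKAN ran on Python 2 at the time), the helper name `to_text` is hypothetical and not part of CKAN, and replacing undecodable bytes instead of raising is an assumption of this sketch:

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as text, assuming byte strings are UTF-8 (or ASCII)."""
    if isinstance(value, bytes):
        # ASCII is a subset of UTF-8, so pure-ASCII input decodes unchanged.
        # errors="replace" substitutes U+FFFD for bytes that are not valid
        # UTF-8 (for example latin-1 input) instead of raising an exception.
        return value.decode(encoding, errors="replace")
    # Already text: pass through untouched.
    return value


utf8_url = to_text(b"/dataset?q=\xc3\xa4")    # UTF-8 bytes for "ä"
ascii_url = to_text(b"/dataset?q=a")          # plain ASCII
latin1_url = to_text(b"/dataset?q=\xe4")      # latin-1 "ä", not valid UTF-8
print(utf8_url, ascii_url, latin1_url)
```

The latin-1 case shows the limitation mentioned above: the input is not rejected, but the "ä" is lost and a replacement character is rendered in its place.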

I can create a patch and pull request for the UTF-8/ASCII-only solution if it's considered acceptable. Otherwise I'd like to request comments on other options. BeautifulSoup has one solution for a few encodings.

@wardi wardi self-assigned this Mar 3, 2015
@wardi
Contributor

wardi commented Mar 3, 2015

@bzar yes, please! A patch that expects only UTF-8/ASCII sounds great. We should only be passing unicode through to jinja.

@bzar
Contributor Author

bzar commented Mar 5, 2015

Digging into this I have uncovered the following:

  1. Pylons already assumes all data is UTF-8 and has decoded it into unicode by default since Pylons 0.9.6
  2. The non-unicode string rendered by the template is CKAN_CURRENT_URL, defined here
  3. Changing CKAN_CURRENT_URL to unicode breaks "url_for" provided by "routes" library
  4. Current code base already has a fix for this that escapes the UTF-8 url into valid ASCII here
  5. This fix has been commented out in this commit because it affected legacy templates

Should this "fix" be reverted or at least parametrized?
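
For illustration, escaping a UTF-8 URL into valid ASCII (point 4 above) can be sketched as below. This is not the commented-out CKAN code: the helper name `escape_url` and the choice of safe characters are assumptions, and the sketch uses Python 3's `urllib.parse.quote` rather than the Python 2 `urllib.quote` that CKAN would have used at the time:

```python
from urllib.parse import quote, unquote


def escape_url(url):
    """Percent-encode non-ASCII characters so the result is pure ASCII."""
    # Characters listed in `safe` are left untouched. "%" is included so
    # that sequences which are already escaped are not escaped a second time.
    return quote(url, safe="/?&=:%")


escaped = escape_url("/dataset?q=ä")
print(escaped)
print(unquote(escaped))
```

Because the escaped form contains only ASCII characters, it can be handed to code that expects a plain byte string and still be rendered by a template without a decoding step.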

@wardi
Contributor

wardi commented Mar 5, 2015

@bzar we should revert it. The legacy templates have been removed.

There is still support for extensions using the legacy template engine. Write the PR and we can ask @davidread to test if it breaks their legacy templates. If not, we should backport this fix too.

@davidread
Contributor

I've read and reread this, but I have lots of questions about what is meant by failure and why, and it would require a lot of digging for me to understand how to test this. We've switched to Jinja templates for the search page now, so it wouldn't be much help for me to test it anyway, I'm afraid. Sorry.

@bzar
Contributor Author

bzar commented Mar 6, 2015

Testing it is as simple as requesting /dataset?q=ä. Make sure the HTTP client you use doesn't auto-escape the URL (Firefox, for one, does this; IE and curl don't, as mentioned in the Pylons manual).

The failure is a "500 internal server error" reply caused by rendering a non-ASCII Python byte string (not a unicode object) inside a Jinja template, namely the request URL stored in CKAN_CURRENT_URL.
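
The underlying error can be reproduced in isolation. The sketch below uses Python 3, where the ASCII decode has to be requested explicitly; on Python 2 the same decode happened implicitly when a byte string was mixed with unicode text. The variable names are illustrative and not taken from CKAN:

```python
raw_url = b"/dataset?q=\xc3\xa4"  # UTF-8 bytes for "/dataset?q=ä"

message = None
try:
    # Treating the raw bytes as ASCII fails on the first byte >= 0x80.
    raw_url.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The reported position is the index of the first non-ASCII byte, so it depends on the length of the URL before the "ä".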

I'll prepare the pull request.

@bzar
Contributor Author

bzar commented Jan 14, 2019

This seems to have returned. The same curl -v "https://demo.ckan.org/dataset?q=ä" results in a 503.

I'm looking into it, but printing qs before and after the urllib.quote call seems to show the issue well, when compared to the correctly escaped curl -v "https://demo.ckan.org/dataset?q=%C3%A4".
