Requesting page with non-ASCII URL fails in Jinja template rendering #2328

bzar · 2015-02-27T11:36:18Z

ckan/lib/base.py relays requested URLs in raw format to Jinja, causing UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 20: ordinal not in range(128) if URL or its parts are rendered to page.

This issue is visible, for example, when searching for datasets using a browser that does not auto-escape non-ASCII characters. IE, for instance, does not escape while Firefox does. The bug is easily reproduced by searching datasets with "ä" as the search term using IE, or by using curl: curl -v "http://demo.ckan.org/dataset?q=ä"

URLs and any other data used in Jinja templates should be decoded into unicode before rendering. This is not a trivial fix, since correctly identifying an unknown encoding in all cases is borderline, if not outright, impossible.

One possibility would be to assume requests come in UTF-8 or ASCII and just decode everything as if it was UTF-8 (since only-ASCII UTF-8 is the same as the same text in ASCII). This would not work for other encodings like latin-1, but would be a start.
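A minimal sketch of that UTF-8/ASCII-only approach, written for Python 3 rather than the Python 2 code base discussed here. The helper name `to_text` is hypothetical and not part of CKAN; replacing undecodable bytes instead of raising is an assumption made for illustration.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings as UTF-8 (hypothetical helper)."""
    if isinstance(value, bytes):
        # errors="replace" substitutes U+FFFD for bytes that are not valid UTF-8,
        # so input in another encoding (e.g. latin-1) degrades instead of raising.
        return value.decode(encoding, errors="replace")
    return value


ascii_only = to_text(b"/dataset?q=a")           # pure ASCII is valid UTF-8
utf8_query = to_text(b"/dataset?q=\xc3\xa4")    # "ä" encoded as UTF-8
latin1_query = to_text(b"/dataset?q=\xe4")      # "ä" encoded as latin-1
```

The latin-1 input does not crash, but its "ä" is lost and comes out as the replacement character, which is the limitation mentioned above.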

I can create a patch and pull request for the UTF-8/ASCII-only solution if it's considered acceptable. Otherwise I'd like to request comments on other options. BeautifulSoup has one solution for a few encodings.

wardi · 2015-03-03T13:05:51Z

@bzar yes, please! A patch that expects only UTF-8/ASCII sounds great. We should only be passing unicode through to jinja.

bzar · 2015-03-05T11:57:47Z

Digging into this I have uncovered the following:

- Pylons already assumes all data is UTF-8 and decodes it into unicode by default since Pylons 0.9.6
- The non-unicode string rendered by the template is CKAN_CURRENT_URL, defined here
- Changing CKAN_CURRENT_URL to unicode breaks "url_for" provided by the "routes" library
- The current code base already has a fix for this that escapes the UTF-8 URL into valid ASCII here
- This fix has been commented out in this commit because it affected legacy templates

Should this "fix" be reverted or at least parametrized?
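For reference, a rough sketch of what escaping a UTF-8 URL into valid ASCII looks like. It uses Python 3's `urllib.parse.quote` (the counterpart of Python 2's `urllib.quote`); the function name `quote_url` and the chosen set of safe characters are assumptions for illustration, not the actual CKAN code.

```python
from urllib.parse import quote


def quote_url(url):
    """Percent-encode non-ASCII characters as UTF-8 (hypothetical helper)."""
    # Keep common URL delimiters intact; everything else outside the
    # unreserved set is percent-encoded byte by byte.
    return quote(url, safe="/?&=:#")


escaped = quote_url("/dataset?q=ä")
double_escaped = quote_url(escaped)
```

Note that "%" is not in the safe set here, so an already escaped URL would be escaped a second time; whether that matters depends on where in the request handling the quoting happens.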

wardi · 2015-03-05T13:00:13Z

@bzar we should revert it. The legacy templates have been removed.

There is still support for extensions using the legacy template engine. Write the PR and we can ask @davidread to test if it breaks their legacy templates. If not, we should backport this fix too.

davidread · 2015-03-05T14:13:46Z

I've read and read this, but have lots of questions about what is meant by failure and why, and it would require me to dig a lot to understand how to test this. We've switched to Jinja templates for the search page now, so I'm afraid I wouldn't be much help testing it anyway. Sorry.

bzar · 2015-03-06T08:08:34Z

Testing it is as simple as requesting "/dataset?q=ä". Make sure the HTTP client you use doesn't auto-escape the URL (Firefox, for one, does this; IE and curl don't, as mentioned in the Pylons manual).

The failure is a "500 internal server error" reply caused by rendering a non-ASCII Python string (not a unicode object) inside a Jinja template, namely the request URL stored in CKAN_CURRENT_URL.
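The underlying error can be illustrated outside CKAN. In Python 2 the byte string is decoded implicitly with the ASCII codec when it is mixed with unicode; the sketch below does the same decode explicitly in Python 3. The variable names are made up for the example.

```python
# Raw request URL as bytes, with "ä" sent unescaped as UTF-8 (0xc3 0xa4).
raw_url = b"/dataset?q=\xc3\xa4"

try:
    # Equivalent of the implicit ASCII decode that happens in Python 2
    # when a non-unicode string is rendered into a unicode template.
    raw_url.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the encoding the bytes actually use works fine.
decoded = raw_url.decode("utf-8")
```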

I'll prepare the pull request.

bzar · 2019-01-14T14:27:36Z

This seems to have returned. The same curl -v "https://demo.ckan.org/dataset?q=ä" results in a 503.

I'm looking into it, but printing qs before and after the urllib.quote call seems to show the issue well, when compared to the correctly escaped curl -v "https://demo.ckan.org/dataset?q=%C3%A4".

wardi self-assigned this Mar 3, 2015

bzar mentioned this issue Mar 17, 2015

Fix #2328 by enabling URL quoting #2337

Merged

wardi closed this as completed in 8f85ef4 Mar 30, 2015

antitoxic mentioned this issue Apr 2, 2015

Problems with utf8 in URL (for example, problems with tags in Cyrillic) governmentbg/opendata#10

Open

This was referenced Jan 15, 2019

Handling incorrect URL encoding in requests #4619

Open

Ensure URL encoding for all requests #4621

Open
