Properly handle non-UTF8 percent encoded characters in query part #19

Closed
asvetlov opened this issue Nov 4, 2016 · 8 comments

Comments

@asvetlov
Member

asvetlov commented Nov 4, 2016

For reference, see aio-libs/aiohttp#1364

@asvetlov
Member Author

asvetlov commented Nov 7, 2016

Fixed by 537caa8 and 23ec6d8

asvetlov closed this as completed Nov 7, 2016

@asvetlov
Member Author

asvetlov commented Nov 7, 2016

Fixed in yarl 0.6.0

@kmike

kmike commented Nov 10, 2016

Just FYI: in URLs, the path and fragment are always encoded to UTF-8 and percent-escaped, while query parameters are encoded to the page encoding (i.e. the encoding of the page the URL was extracted from) and then percent-escaped. So you can't extract a URL like 'http://example.com?имя=вася' from a web page as a unicode string and then percent-escape it (e.g. to send an HTTP request) without specifying the page encoding. Likewise, if the URL is already percent-escaped in the page body, you need to know the original page encoding in order to unescape it, even if you already have the page body as unicode.
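
To make the difference concrete, here is a small standard-library sketch (it does not use yarl) that percent-escapes the same query value twice: once as UTF-8 and once as windows-1251. The choice of `cp1251` as the page encoding is only an assumption for illustration.

```python
from urllib.parse import quote, unquote

value = "вася"

# Escaped the way a UTF-8 page (or a UTF-8-only library) would do it.
utf8_escaped = quote(value, encoding="utf-8")

# Escaped the way a browser would do it for a link found on a
# windows-1251 page (assumed page encoding for this example).
cp1251_escaped = quote(value, encoding="cp1251")

print(utf8_escaped)    # %D0%B2%D0%B0%D1%81%D1%8F
print(cp1251_escaped)  # %E2%E0%F1%FF

# Unescaping only gives the original text back when the same
# encoding is used on both sides.
restored = unquote(cp1251_escaped, encoding="cp1251")
```

The two escaped forms share no bytes, so a server expecting one will not recognize the other.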

@asvetlov
Member Author

Sorry, I don't follow you.
yarl tries to percent-encode unsafe symbols assuming that the encoding is UTF-8 (which is a good default).
Already percent-encoded chars are obviously safe.

On decoding, symbols that were percent-encoded in a non-UTF-8 encoding are simply left as is.
Decoded URL parts are mostly useful for displaying info to the user; they are never used as a transport format.
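
The decoding behavior described above (escapes that are not valid UTF-8 are left untouched) can be illustrated with a standard-library sketch. This is not yarl's actual implementation; `lenient_unquote` is a hypothetical helper name, and it treats each run of consecutive escapes as one unit.

```python
import re
from urllib.parse import unquote_to_bytes

# One or more consecutive %XX escapes.
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def lenient_unquote(text):
    """Decode percent-escapes for display, keeping non-UTF-8 runs as is."""

    def replace(match):
        raw = unquote_to_bytes(match.group(0))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: keep the escaped form unchanged.
            return match.group(0)

    return _ESCAPE_RUN.sub(replace, text)


utf8_query = lenient_unquote("name=%D0%B2%D0%B0%D1%81%D1%8F")
cp1251_query = lenient_unquote("name=%E2%E0%F1%FF")
```

Here `utf8_query` becomes readable text, while `cp1251_query` stays percent-escaped because its bytes do not form valid UTF-8. Note that the sketch also decodes escaped delimiters such as `%2F`, so the result is suitable for display only.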

@kmike

kmike commented Nov 10, 2016

> yarl tries to percent-encode unsafe symbols assuming that the encoding is UTF-8 (which is a good default).

My point is that this works if you're using yarl to prepare a URL for sending to a server which uses UTF-8, but it doesn't work if the server uses another encoding to serve its HTML pages. It means that currently yarl can't be used in the following case:

  1. download a page, decode it to unicode;
  2. extract a URL from an href attribute;
  3. use yarl to make it absolute and to percent-escape it (make it ASCII-only);
  4. follow the link by sending a request with this ASCII-only absolute URL to a server.

Maybe it is not directly related to this issue; sorry if I misread it.
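
A rough illustration of why step 3 goes wrong, using only the standard library: the client escapes the query value as UTF-8, but a server that serves its pages in another encoding decodes the bytes with that encoding. The server-side `cp1251` decoding here is an assumption made for the example.

```python
from urllib.parse import quote, unquote

query_value = "вася"

# Step 3 with a UTF-8-only escaper.
sent = quote(query_value, encoding="utf-8")

# What a server that expects windows-1251 (assumed) reads back.
seen_by_server = unquote(sent, encoding="cp1251")

# What happens when both sides agree on the page encoding.
round_trip = unquote(quote(query_value, encoding="cp1251"), encoding="cp1251")

print(seen_by_server == query_value)  # False
print(round_trip == query_value)      # True
```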

@kmike

kmike commented Nov 10, 2016

Even if decoding is used only for display, it is good to have it correct :)

@asvetlov
Member Author

I believe it's not directly related, but you've raised an interesting question.
What change could solve your request?
Adding an encoding parameter to the yarl constructor for processing non-UTF-8 URLs?

> in URLs, the path and fragment are always encoded to UTF-8 and percent-escaped, while query parameters are encoded to the page encoding (i.e. the encoding of the page the URL was extracted from) and then percent-escaped.

Could you point to an RFC for this rule? Or is it only de facto behavior?

@kmike

kmike commented Nov 10, 2016

> Adding an encoding parameter to the yarl constructor for processing non-UTF-8 URLs?

Yep, I think that's the way to go. It should also be made clear that this encoding is the encoding of the page the URL was extracted from. Path and fragment should still use UTF-8, not this encoding, and the hostname should still use IDNA.

> Could you point to an RFC for this rule? Or is it only de facto behavior?

It is the de facto behavior of all browsers; like all other such gotchas, it is written down in https://url.spec.whatwg.org/ (see https://url.spec.whatwg.org/#query-encoding-example).
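
The per-component rule proposed above (page encoding for the query, UTF-8 for path and fragment, IDNA for the hostname) could be sketched with the standard library as follows. `escape_url` and its `page_encoding` parameter are hypothetical names for illustration and are not part of yarl's API; the sketch ignores userinfo and raises if a character cannot be represented in the page encoding.

```python
from urllib.parse import quote, urljoin, urlsplit, urlunsplit


def escape_url(base_url, href, page_encoding="utf-8"):
    """Resolve href against base_url and return an ASCII-only URL.

    Hypothetical helper: page_encoding is the encoding of the page
    the href was extracted from.
    """
    parts = urlsplit(urljoin(base_url, href))

    # Hostname: IDNA (Python's built-in codec), never the page encoding.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Path and fragment: always UTF-8. "%" stays safe so that existing
    # escapes are not escaped a second time.
    path = quote(parts.path, safe="/%", encoding="utf-8")
    fragment = quote(parts.fragment, safe="%", encoding="utf-8")

    # Query: the encoding of the source page.
    query = quote(parts.query, safe="=&%", encoding=page_encoding)

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


url = escape_url("http://example.com/index.html", "/поиск?имя=вася", "cp1251")
print(url)
# http://example.com/%D0%BF%D0%BE%D0%B8%D1%81%D0%BA?%E8%EC%FF=%E2%E0%F1%FF
```

Keeping the encoding as an explicit argument mirrors the point made in the thread: the correct value cannot be inferred from the URL string alone, only from the page it came from.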
