Mention encoding in curl_easy_escape docs #1612
Comments
URIs are by definition (RFC 3986) ASCII only, so there is no "encoding" to speak of. How do you suggest we make this clearer?
Perhaps I misunderstand it, then. I thought URL-encoding was used to map non-ASCII arguments into ASCII ones. For example, when posting a form:
> curl_escape("food=寿司")
"food%3D%E5%AF%BF%E5%8F%B8"
> curl_unescape("food%3D%E5%AF%BF%E5%8F%B8")
"food=寿司"
But this requires that we know the character encoding of the output, right? Maybe I'm mistaken.
Welcome to the mess of URLs. libcurl supports URLs as defined by RFC 3986 (with some "extensions"), while browsers (mostly) support the WHATWG URL spec. This is a reason for "interesting" differences, and I've collected a few of them in a URL interop issues document. A URL in libcurl cannot legally contain any 8-bit characters, as that's not allowed by the spec! (The exception to this rule is the host name part, which libcurl will decode and handle.) But libcurl doesn't filter out 8-bit characters; it is liberal and will instead accept them and just pass them on as-is. libcurl assumes that you pass in a valid URL that you want to work with. If you want to pass on "寿司" (or similar) in a URL, you probably want to encode it using percent encoding, somehow. The libcurl escape/unescape functions will URL-encode/decode for you, but they both simply work on binary data and have no knowledge or awareness of specific encodings.
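To illustrate the point that escaping works on bytes rather than characters, here is a minimal sketch. It uses Python's standard `urllib.parse` module instead of libcurl, purely for brevity; it is not how libcurl is implemented, and the helper name `escape_bytes` is hypothetical. The same text produces different percent-encoded output depending on which character encoding produced the bytes, and the unescaped result is again just bytes.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def escape_bytes(data: bytes) -> str:
    """Percent-encode every byte except ASCII letters, digits and "_.-~"."""
    return quote_from_bytes(data, safe="")


text = "food=寿司"

# The caller decides how the text becomes bytes; the escaper never knows.
utf8_escaped = escape_bytes(text.encode("utf-8"))
sjis_escaped = escape_bytes(text.encode("shift_jis"))

print(utf8_escaped)                  # food%3D%E5%AF%BF%E5%8F%B8
print(utf8_escaped == sjis_escaped)  # False

# Unescaping yields raw bytes; turning them back into text requires
# knowing the encoding that was used before escaping.
raw = unquote_to_bytes(utf8_escaped)
print(raw.decode("utf-8") == text)   # True
```

Decoding `raw` with any encoding other than the one used before escaping would give different text or fail outright, which is why the receiver has to know (or assume) the encoding.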
Perhaps it's useful to warn the user (especially on Windows) that the convention is UTF-8, and that the server might be expecting the URL-encoded text to be UTF-8. See also this discussion. Note that browsers, even on Windows, always use UTF-8 when posting a form or using JavaScript. For example the default output is:
I understand that in |
Again, browsers think the WHATWG URL spec defines how URLs work, but that is not at all a universal law, so they are bound to function differently from all the world's URL-using software that is written to work with the IETF/W3C URI specs. Note that in your four examples, the encoded versions (the ones on the right) are the URL-formatted ones, and the versions of the strings before encoding are just strings. Since libcurl works with URLs, it also assumes that the encoding is already done. The URL you set is the URL you want.
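As a rough sketch of what "the encoding is already done" means for the caller, the snippet below builds a complete, ASCII-only URL before it would be handed to any HTTP library. It uses Python's standard `urllib.parse` module for illustration rather than libcurl; the helper name `build_url` and the host `example.com` are assumptions, and UTF-8 is chosen explicitly because that is the convention discussed in this thread.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_url(base: str, params: dict) -> str:
    """Attach params to base as a percent-encoded (UTF-8) query string."""
    parts = urlsplit(base)
    # The caller picks the character encoding here, before any URL exists.
    query = urlencode(params, encoding="utf-8")
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


url = build_url("https://example.com/search", {"food": "寿司"})
print(url)            # https://example.com/search?food=%E5%AF%BF%E5%8F%B8
print(url.isascii())  # True
```

The resulting string contains only ASCII, so it is the kind of value a URL-consuming library can pass on unchanged.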
Can you suggest any wording that you think might've helped you? I assume you mean that these words should be added to the |
@bagder excuse my lack of understanding about this topic. My main concern is not so much
I think some naive users (like me) might mistakenly assume that I don't think this is a completely unreasonable expectation; the section about
So servers will expect strings to be posted as url-encoded UTF-8 unless specifically requested otherwise by
This is just a suggestion; perhaps I am still misunderstanding the topic :D In that case, feel free to close this issue.
Thanks, I edited the |
The manual pages for curl_easy_escape and curl_easy_unescape should mention which character encoding is used for const char * url if we escape e.g. a Chinese word. I assume it is UTF-8, which means that e.g. on Windows the user needs an additional call to iconv() to convert it to the native encoding. Currently this is not obvious.
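The extra conversion step described above can be sketched as follows. This is an illustration in Python's standard library, not libcurl code: the helper names `escape_as_utf8` and `unescape_to_native` are hypothetical, and `cp932` is only an assumed example of a non-UTF-8 native encoding. The decode/encode pair plays the role that iconv() would play in C.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def escape_as_utf8(native_bytes: bytes, native_encoding: str) -> str:
    """Transcode native bytes to UTF-8, then percent-encode them."""
    text = native_bytes.decode(native_encoding)
    return quote_from_bytes(text.encode("utf-8"), safe="")


def unescape_to_native(escaped: str, native_encoding: str) -> bytes:
    """Percent-decode, treat the result as UTF-8, then transcode to native."""
    text = unquote_to_bytes(escaped).decode("utf-8")
    return text.encode(native_encoding)


native = "寿司".encode("cp932")  # bytes as a non-UTF-8 program might hold them
escaped = escape_as_utf8(native, "cp932")
print(escaped)                                          # %E5%AF%BF%E5%8F%B8
print(unescape_to_native(escaped, "cp932") == native)   # True
```

Without the transcoding step, the escaped output would carry the native bytes as-is, and a server expecting UTF-8 would misread them.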