New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tool: Improve deducing remote name from URL #155
Conversation
Before this, when using the tool's `-O`/`--remote-name`, the query string would be included in the filename and it wouldn't be URL-escaped. Anything to the left of path seperators is stripped. Control characters raise an error. Encoding is left unaddressed. In Unix, filenames encoding is not defined, so that seems reasonable. Windows/NTFS uses UTF-16, but as we don't know the input encoding (although UTF-8 seems to be being standardised around) any conversion would be a guess.
Okay, well, clearly I have. Obviously compiles fine here, will investigate later. |
I can reproduce the failure. Why don't you use diff --git a/src/tool_operhlp.c b/src/tool_operhlp.c
index bbe6219..344331c 100644
--- a/src/tool_operhlp.c
+++ b/src/tool_operhlp.c
@@ -160,12 +160,10 @@ CURLcode get_url_file_name(char **filename, const char *url)
*filename = strsep(filename, "?#");
/* URL-decode the filename, stripping control characters */
- res = Curl_urldecode(NULL, *filename, 0, filename, NULL, TRUE);
+ *filename = curl_easy_unescape(NULL, *filename, 0, NULL);
Curl_safefree(tofree);
- if(res != CURLE_OK) {
- *filename = NULL;
- return res;
- }
+ if(!*filename)
+ return CURLE_URL_MALFORMAT;
/* Get everything to the right of any path seperators */
pc = strrchr(*filename, '/'); |
I did consider that, but looked at the implementation and saw it didn't set |
Right, unfortunately you can't use We might want to export a new @bagder help? |
As @ghedo says, Curl_* symbols are libcurl private functions and cannot be used outside of the library. But I would like to take a step back and discuss the idea of this patch first. Is it to match what a browser would do? Because if it is, it likely means decoding the percent encoding into Unicode and then the decoder cannot just blindly reject what it considers ASCII control codes because they could be part of valid UTF8 sequences. Unfortunately, in the curl project we consider a URL to be RFC3986 which is called a URI within the IETF, while in a browser what is called a "URL" by most people is possibly closer to RFC3987 ("IRI") but none of them follow that very closely but instead have their own meaning of what a URL is, I think most closely specified by the WHATWG group. This is a sort of Pandora's box. |
Unless I'm reading the RFC wrong, bytes without the high bit set can't be part of UTF-8 sequences, unless they do actually represent the ASCII control characters. So they should be fine to error on, unless something's percent-encoded UTF-16 or something. I appreciate this isn't a straight-forward area though! The idea is indeed to match general browser behaviour. The difficulty of decoding anything of course is that we don't know what the input encoding is. I figured that because file names are specified to be a binary blob in POSIX, passing the byte sequence through (after cleaning) was the best idea. If the encoding is bad somehow, the user will notice and rename the file. I'm guessing detecting the encoding is going to be pretty difficult, especially with such a short string to do any kind of statistical analysis on. |
Ok, then that is not a problem, thanks for that. I'm really lost in Unicode land. I believe the best docs to read to understand what the browsers do is the WHATWG's doc on URLs: https://url.spec.whatwg.org/ |
Before this, when using the tool's
-O
/--remote-name
, the querystring would be included in the filename and it wouldn't be URL-escaped.
Anything to the left of path seperators is stripped. Control characters
raise an error. Encoding is left unaddressed. In Unix, filenames
encoding is not defined, so that seems reasonable. Windows/NTFS uses
UTF-16, but as we don't know the input encoding (although UTF-8 seems to
be being standardised around) any conversion would be a guess.
I've tried to follow all the docs/style guides/etc that I can see, but I'm new
to the project and not immensely seasoned in C, so apologies if I've messed
something up!