tool: Improve deducing remote name from URL #155

dwb · 2015-03-07T18:55:44Z

Before this, when using the tool's -O/--remote-name, the query
string would be included in the filename and it wouldn't be URL-escaped.
Anything to the left of path seperators is stripped. Control characters
raise an error. Encoding is left unaddressed. In Unix, filenames
encoding is not defined, so that seems reasonable. Windows/NTFS uses
UTF-16, but as we don't know the input encoding (although UTF-8 seems to
be being standardised around) any conversion would be a guess.

I've tried to follow all the docs/style guides/etc that I can see, but I'm new
to the project and not immensely seasoned in C, so apologies if I've messed
something up!

Before this, when using the tool's `-O`/`--remote-name`, the query string would be included in the filename and it wouldn't be URL-escaped. Anything to the left of path seperators is stripped. Control characters raise an error. Encoding is left unaddressed. In Unix, filenames encoding is not defined, so that seems reasonable. Windows/NTFS uses UTF-16, but as we don't know the input encoding (although UTF-8 seems to be being standardised around) any conversion would be a guess.

dwb · 2015-03-07T19:01:21Z

Okay, well, clearly I have. Obviously compiles fine here, will investigate later.

ghedo · 2015-03-08T11:18:30Z

I can reproduce the failure. Why don't you use curl_easy_unescape() instead? Like:

diff --git a/src/tool_operhlp.c b/src/tool_operhlp.c
index bbe6219..344331c 100644
--- a/src/tool_operhlp.c
+++ b/src/tool_operhlp.c
@@ -160,12 +160,10 @@ CURLcode get_url_file_name(char **filename, const char *url)
   *filename = strsep(filename, "?#");

   /* URL-decode the filename, stripping control characters */
-  res = Curl_urldecode(NULL, *filename, 0, filename, NULL, TRUE);
+  *filename = curl_easy_unescape(NULL, *filename, 0, NULL);
   Curl_safefree(tofree);
-  if(res != CURLE_OK) {
-    *filename = NULL;
-    return res;
-  }
+  if(!*filename)
+    return CURLE_URL_MALFORMAT;

   /* Get everything to the right of any path seperators */
   pc = strrchr(*filename, '/');

dwb · 2015-03-08T13:17:17Z

I did consider that, but looked at the implementation and saw it didn't set reject_ctrl, which I wanted to use to handle NULLs safely.

ghedo · 2015-03-08T16:30:01Z

Right, unfortunately you can't use Curl_* symbols outside of libcurl, they are not supposed to be exported.

We might want to export a new curl_easy_unescape2() function, or curl_easy_unescape() could set reject_ctrl to TRUE instead (not sure what kind of breakage this would cause).

@bagder help?

bagder · 2015-03-08T22:38:02Z

As @ghedo says, Curl_* symbols are libcurl private functions and cannot be used outside of the library.

But I would like to take a step back and discuss the idea of this patch first. Is it to match what a browser would do? Because if it is, it likely means decoding the percent encoding into Unicode and then the decoder cannot just blindly reject what it considers ASCII control codes because they could be part of valid UTF8 sequences.

Unfortunately, in the curl project we consider a URL to be RFC3986 which is called a URI within the IETF, while in a browser what is called a "URL" by most people is possibly closer to RFC3987 ("IRI") but none of them follow that very closely but instead have their own meaning of what a URL is, I think most closely specified by the WHATWG group.

This is a sort of Pandora's box.

dwb · 2015-03-09T15:19:26Z

Unless I'm reading the RFC wrong, bytes without the high bit set can't be part of UTF-8 sequences, unless they do actually represent the ASCII control characters. So they should be fine to error on, unless something's percent-encoded UTF-16 or something. I appreciate this isn't a straight-forward area though!

The idea is indeed to match general browser behaviour. The difficulty of decoding anything of course is that we don't know what the input encoding is. I figured that because file names are specified to be a binary blob in POSIX, passing the byte sequence through (after cleaning) was the best idea. If the encoding is bad somehow, the user will notice and rename the file. I'm guessing detecting the encoding is going to be pretty difficult, especially with such a short string to do any kind of statistical analysis on.

bagder · 2015-03-13T07:47:32Z

Ok, then that is not a problem, thanks for that. I'm really lost in Unicode land.
I'm still hesitant to export a new function in the libcurl API only for this, and re-using the source code seems tricky. We can debate this further, but I think the quickest route forward (for now at least) is that you copy enough code from the lib code to the tool code to get this functionality in there.

I believe the best docs to read to understand what the browsers do is the WHATWG's doc on URLs: https://url.spec.whatwg.org/

bagder self-assigned this Mar 7, 2015

bagder added the needs-info label Mar 26, 2015

bagder closed this Apr 19, 2015

lock bot locked as resolved and limited conversation to collaborators Jan 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tool: Improve deducing remote name from URL #155

tool: Improve deducing remote name from URL #155

dwb commented Mar 7, 2015

dwb commented Mar 7, 2015

ghedo commented Mar 8, 2015

dwb commented Mar 8, 2015

ghedo commented Mar 8, 2015

bagder commented Mar 8, 2015

dwb commented Mar 9, 2015

bagder commented Mar 13, 2015

tool: Improve deducing remote name from URL #155

tool: Improve deducing remote name from URL #155

Conversation

dwb commented Mar 7, 2015

dwb commented Mar 7, 2015

ghedo commented Mar 8, 2015

dwb commented Mar 8, 2015

ghedo commented Mar 8, 2015

bagder commented Mar 8, 2015

dwb commented Mar 9, 2015

bagder commented Mar 13, 2015