Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tool: Improve deducing remote name from URL #155

Closed
wants to merge 1 commit into from

Conversation

dwb
Copy link

@dwb dwb commented Mar 7, 2015

Before this, when using the tool's -O/--remote-name, the query
string would be included in the filename and it wouldn't be URL-escaped.
Anything to the left of path seperators is stripped. Control characters
raise an error. Encoding is left unaddressed. In Unix, filenames
encoding is not defined, so that seems reasonable. Windows/NTFS uses
UTF-16, but as we don't know the input encoding (although UTF-8 seems to
be being standardised around) any conversion would be a guess.

I've tried to follow all the docs/style guides/etc that I can see, but I'm new
to the project and not immensely seasoned in C, so apologies if I've messed
something up!

Before this, when using the tool's `-O`/`--remote-name`, the query
string would be included in the filename and it wouldn't be URL-escaped.
Anything to the left of path seperators is stripped. Control characters
raise an error. Encoding is left unaddressed. In Unix, filenames
encoding is not defined, so that seems reasonable. Windows/NTFS uses
UTF-16, but as we don't know the input encoding (although UTF-8 seems to
be being standardised around) any conversion would be a guess.
@dwb
Copy link
Author

dwb commented Mar 7, 2015

Okay, well, clearly I have. Obviously compiles fine here, will investigate later.

@bagder bagder self-assigned this Mar 7, 2015
@ghedo
Copy link
Contributor

ghedo commented Mar 8, 2015

I can reproduce the failure. Why don't you use curl_easy_unescape() instead? Like:

diff --git a/src/tool_operhlp.c b/src/tool_operhlp.c
index bbe6219..344331c 100644
--- a/src/tool_operhlp.c
+++ b/src/tool_operhlp.c
@@ -160,12 +160,10 @@ CURLcode get_url_file_name(char **filename, const char *url)
   *filename = strsep(filename, "?#");

   /* URL-decode the filename, stripping control characters */
-  res = Curl_urldecode(NULL, *filename, 0, filename, NULL, TRUE);
+  *filename = curl_easy_unescape(NULL, *filename, 0, NULL);
   Curl_safefree(tofree);
-  if(res != CURLE_OK) {
-    *filename = NULL;
-    return res;
-  }
+  if(!*filename)
+    return CURLE_URL_MALFORMAT;

   /* Get everything to the right of any path seperators */
   pc = strrchr(*filename, '/');

@dwb
Copy link
Author

dwb commented Mar 8, 2015

I did consider that, but looked at the implementation and saw it didn't set reject_ctrl, which I wanted to use to handle NULLs safely.

@ghedo
Copy link
Contributor

ghedo commented Mar 8, 2015

Right, unfortunately you can't use Curl_* symbols outside of libcurl, they are not supposed to be exported.

We might want to export a new curl_easy_unescape2() function, or curl_easy_unescape() could set reject_ctrl to TRUE instead (not sure what kind of breakage this would cause).

@bagder help?

@bagder
Copy link
Member

bagder commented Mar 8, 2015

As @ghedo says, Curl_* symbols are libcurl private functions and cannot be used outside of the library.

But I would like to take a step back and discuss the idea of this patch first. Is it to match what a browser would do? Because if it is, it likely means decoding the percent encoding into Unicode and then the decoder cannot just blindly reject what it considers ASCII control codes because they could be part of valid UTF8 sequences.

Unfortunately, in the curl project we consider a URL to be RFC3986 which is called a URI within the IETF, while in a browser what is called a "URL" by most people is possibly closer to RFC3987 ("IRI") but none of them follow that very closely but instead have their own meaning of what a URL is, I think most closely specified by the WHATWG group.

This is a sort of Pandora's box.

@dwb
Copy link
Author

dwb commented Mar 9, 2015

Unless I'm reading the RFC wrong, bytes without the high bit set can't be part of UTF-8 sequences, unless they do actually represent the ASCII control characters. So they should be fine to error on, unless something's percent-encoded UTF-16 or something. I appreciate this isn't a straight-forward area though!

The idea is indeed to match general browser behaviour. The difficulty of decoding anything of course is that we don't know what the input encoding is. I figured that because file names are specified to be a binary blob in POSIX, passing the byte sequence through (after cleaning) was the best idea. If the encoding is bad somehow, the user will notice and rename the file. I'm guessing detecting the encoding is going to be pretty difficult, especially with such a short string to do any kind of statistical analysis on.

@bagder
Copy link
Member

bagder commented Mar 13, 2015

Ok, then that is not a problem, thanks for that. I'm really lost in Unicode land.
I'm still hesitant to export a new function in the libcurl API only for this, and re-using the source code seems tricky. We can debate this further, but I think the quickest route forward (for now at least) is that you copy enough code from the lib code to the tool code to get this functionality in there.

I believe the best docs to read to understand what the browsers do is the WHATWG's doc on URLs: https://url.spec.whatwg.org/

@bagder bagder closed this Apr 19, 2015
@lock lock bot locked as resolved and limited conversation to collaborators Jan 19, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Development

Successfully merging this pull request may close these issues.

None yet

3 participants