Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Content-Disposition: Add support for UTF-8 encoded filename* field. #1995

Closed
wants to merge 1 commit into from

Conversation

reidwagner
Copy link

Closes #1888.

If -J, --remote-header-name is specified, and filename* is given in Content-Disposition, it is decoded by libcurl's curl_easy_unescape according to RFC 3986 and used as the filename for -O, --remote-name. A UTF-8 filename* takes precedence over an ASCII filename (no asterisk) if both are specified.

I tested this by serving a file with an encoded filename* field in the Content-Disposition header as specified in RFC 6266, and verifying that it saved on my filesystem as expected. I tested this with filename* alone, and both filename and filename*.

I wasn't able to write a test in tests/data that allowed me save the output file in tests/log, as --remote-name saves specifically in the current directory.

@bagder
Copy link
Member

bagder commented Oct 16, 2017

This is still red because of this:

tool_cb_hdr.c: In function ‘tool_header_cb’:
tool_cb_hdr.c:127:9: error: conversion to ‘int’ from ‘size_t’ may alter its value [-Werror=conversion]
         filename = curl_easy_unescape(outs->config->easy, filename, len, NULL);
         ^
cc1: all warnings being treated as errors
make[2]: *** [curl-tool_cb_hdr.o] Error 1


@reidwagner
Copy link
Author

reidwagner commented Oct 17, 2017

Okay, I added a cast to int because it's safe to assume that will be large enough. I also added a Curl_safefree on the unencoded filename buffer if it is replaced by an encoded filename.

/* no match, find next parameter */
while((p < end) && (*p != ';'))
p++;
continue;
}
p += 9;
p += 8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be strict, you removed the check for the = so filenamefoo for example would be incorrectly accepted

@@ -19,6 +19,7 @@
* KIND, either express or implied.
*
***************************************************************************/
#include <curl/curl.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to include curl.h it is included automatically by tool_setup

@reidwagner
Copy link
Author

Good catch. I backed up and went with a simpler approach. I also filled in a missing piece of the "ext-value" from RFC 5987.

Both "filename-parm" and "charset" need to be case insensitive (RFC 6266, 5987), so I used curl_strnequal for the new parameter, as well as to replace memcmp for filename=.

If -J, --remote-header-name is specified, and filename* is given in Content-Disposition as utf-8 with a blank language tag, e.g. filename*=utf-8''foo%C2%AE, it is decoded by libcurl's curl_easy_unescape according to RFC 3986 and used as the filename for -O, --remote-name.  A UTF-8 filename* takes precedence over an ASCII filename (no asterisk) if both are specified.

Replacing memcmp with curl_strnequal also adds case insensitivity to filename and filename* parameters as is specified in RFC6266 section 4.3.

Closes curl#1888
Reported-by: devbazilio
@jay
Copy link
Member

jay commented Nov 5, 2017

I wonder how this would work on Windows. I think we would need to set a separate flag is_utf8_filename and then something like this:

diff --git a/src/tool_cb_wrt.c b/src/tool_cb_wrt.c
index 6716ba5..c16b945 100644
--- a/src/tool_cb_wrt.c
+++ b/src/tool_cb_wrt.c
@@ -36,15 +36,32 @@ bool tool_create_output_file(struct OutStruct *outs)
 {
   struct GlobalConfig *global = outs->config->global;
   FILE *file;
+#ifdef WIN32
+  wchar_t *u16filename = NULL;
+#endif
 
   if(!outs->filename || !*outs->filename) {
     warnf(global, "Remote filename has no length!\n");
     return FALSE;
   }
 
+#ifdef WIN32
+  if(outs->is_utf8_filename) {
+    if(!utf8_to_utf16(filename, &u16filename)) {
+      warnf(global, "Can't convert UTF-8 filename to UTF-16\n");
+      return FALSE;
+    }
+  }
+#endif
+
   if(outs->is_cd_filename) {
     /* don't overwrite existing files */
-    file = fopen(outs->filename, "rb");
+#ifdef WIN32
+    if(u16filename)
+      file = _wfopen(u16filename, L"rb");
+    else
+#endif
+      file = fopen(outs->filename, "rb");
     if(file) {
       fclose(file);
       warnf(global, "Refusing to overwrite %s: %s\n", outs->filename,
@@ -54,7 +71,12 @@ bool tool_create_output_file(struct OutStruct *outs)
   }
 
   /* open file for writing */
-  file = fopen(outs->filename, "wb");
+#ifdef WIN32
+  if(u16filename)
+    file = _wfopen(u16filename, L"wb");
+  else
+#endif
+    file = fopen(outs->filename, "wb");
   if(!file) {
     warnf(global, "Failed to create the file %s: %s\n", outs->filename,
           strerror(errno));

How does this work on other operating systems, do they just recognize the encoding?

@bagder
Copy link
Member

bagder commented Nov 6, 2017

Most Linux systems are setup to mount and assume that file systems are UTF8. But that's not necessarily true everywhere and you can certainly mount file systems using other encodings.

@ForNeVeR
Copy link

ForNeVeR commented Jan 1, 2018

curl doesn't work at all In my Windows 10 environment if I built if from the branch of this PR:

$ cd curl\winbuild
$ git checkout reidwagner/filename_encoding_support
$ nmake /f .\Makefile.vc mode=dll
$ ..\builds\libcurl-vc-x86-release-dll-ipv6-sspi-winssl\bin\curl.exe http://example.org/

It prints the example.org content (as intended) and then loops forever, entirely eating one CPU core (as not intended).

(At the same time, the version from the master branch works for me, so it looks like nothing's wrong with my environment, and that's a real trouble with the code in the branch.)

@bagder
Copy link
Member

bagder commented Jan 30, 2018

Yeah, just this problem is harder than what's being done so far in this PR. Pretending everything works fine as-is if it comes UTF-8 encoded is not going to work either since the presumption that the file system is using UTF-8 is wrong too often. Windows specifically typically doesn't encode files UTF-8.

@reidwagner
Copy link
Author

Okay, that was a naive assumption in the original PR.

So a first order solution might be conditionally including code handling UTF-8 encoding for Windows (like above) and known POSIX operating systems. Then in the latter case, statfs() could be used to determine the filesystem encoding.

Or, would it just be simpler to make this opt-in, so a user that knows what they're doing can include UTF-8 support with a command line flag?

@jay
Copy link
Member

jay commented Jan 31, 2018

It prints the example.org content (as intended) and then loops forever, entirely eating one CPU core (as not intended).

I'm unable to reproduce.

@bagder bagder added the stale label Apr 26, 2018
@stale stale bot closed this May 10, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Aug 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Development

Successfully merging this pull request may close these issues.

4 participants