Skip to content

tool_operate: for -O, use a default filename when the URL has none #13988

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

bagder
Copy link
Member

@bagder bagder commented Jun 23, 2024

curl sets the filename to the last directory part of the URL or if that also is missing to default (without extension) for this situation.

Instead of returning error.

Add test 690 and 691 to verify

  • code
  • documentation
  • test cases

@github-actions github-actions bot added the tests label Jun 23, 2024
@bagder bagder added the feature-window A merge of this requires an open feature window label Jun 23, 2024
@bagder bagder force-pushed the bagder/remote-default branch from 0e32c92 to cc1100e Compare June 23, 2024 10:23
@bagder
Copy link
Member Author

bagder commented Jun 23, 2024

Alternatively, we could consider picking the filename from the last directory part in the URL or similar, but that is harder to explain (also: there is not always a directory).

It cannot use the content-type or similar since it needs to be 100% deterministic based on the given URL.

Note: --no-clobber can be used fine even when the file name is picked like this.

@bagder bagder force-pushed the bagder/remote-default branch from 35c32cd to 7314ddb Compare June 24, 2024 12:07
@njdoyle
Copy link

njdoyle commented Jun 24, 2024

If it were up to me, I would prefer a filename other than default. The word default, to me, is a property of something, not the thing itself. I would choose a filename like response or response_body or curl_response or something rather than default. I think these filenames would also help alleviate confusion when a user unfamiliar with this behaviour encounters this for the first time.

@bagder
Copy link
Member Author

bagder commented Jun 24, 2024

Yeah, default is definitely not the best. I think I prefer curl_response out of the ones so far because:

  1. mentions curl (answers who saved it)
  2. says it is a response
  3. does not use an extension (because people will think the extension means something)

@dfandrich
Copy link
Contributor

As a point of reference, wget uses the name index.html in this case, no matter if the Content-Type: is text/html or anything else. But, if an index.html file already exists, it gets a new name with a numeral appended so an existing index.html file doesn't get overwritten. That's the down side of curl using a common name like index.html since the contents would get overwritten by default which might be unexpected. Something like curl_response seems better in the curl case for this reason because it's a name that's unlikely to exist except if written by curl and it's clear afterward where it came from, even if it's a little ugly.

@bagder bagder changed the title tool_operate: for -O, use "default" as filename when the URL has none tool_operate: for -O, use a default filename when the URL has none Jun 25, 2024
@bagder
Copy link
Member Author

bagder commented Jun 25, 2024

It was suggested that we should effectively enable --no-clobber when we pick a name this way.

I'm not sure that's a good idea even if I like the thinking.

@bagder bagder self-assigned this Jun 30, 2024
@bagder bagder force-pushed the bagder/remote-default branch from ce57eb8 to 41b28dd Compare July 4, 2024 08:52
@samueloph
Copy link
Contributor

It was suggested that we should effectively enable --no-clobber when we pick a name this way.

Having --no-clobber only be set for the default URL case can cause confusion on the user's side, for example, consider how the current manpage entry for -O would have to be rewritten to accommodate this:

The remote filename to use for saving is extracted from the given URL, nothing else,
and if it already exists it is overwritten. If you want the server to be able to
choose the filename refer to -J, --remote-header-name which can be used in addition
to this option. If the server chooses a filename and that name already exists it is
not overwritten.

This also means that if curl were to have --no-clobber for the default filename case, it might be better to have it set on all cases then, but we'd be talking about a behavior change which could be unexpected too.

I believe users overall would still prefer --no-clobber enabled on all cases, but if that's not possible, it becomes a matter of consistent behavior.
Will it be too much for an user to remember that --no-clobber is set if -J is used OR when curl goes with the default filename? Will the user struggle to understand on which cases curl picks the default filename and fail to set --no-clobber when they want to?
This might force the user to end up always explicitly setting either --no-clobber or --clobber in order to avoid any risk of uncertainty.

I don't really know which one is best, I'm commenting just to expose how I see the issue and hopefully help.

bagder added 2 commits August 3, 2024 20:08
... or pick the last directory part from the path if available.

Instead of returning error.

Add test 690 to verify. Test 76 and 2036: no longer apply.
... or pick the last directory part from the path if available.

Instead of returning error.

Add test 690 to verify. Test 76 and 2036: no longer apply.

Closes #13988
@BrianInglis
Copy link
Contributor

BrianInglis commented Sep 15, 2024

Could we please use curl-response rather than curl_response to make it easier for users to type, or better the host name e.g. curl.se, curl-se, or curl[-.]se[-.]index and append media-type/mime-type suffix e.g. .html?
Added similar comment to @wCurl #4

@BrianInglis
Copy link
Contributor

BrianInglis commented Sep 16, 2024

Could translate back from response content-type: header media-type/mime-type, for example:

$ curl -I curl.se
...
HTTP/2 200
server: nginx/1.21.1
content-type: text/html
...

to file type suffix extension using shared-mime-info data in /usr/share/mime/packages/freedesktop.org.xml which gives a list of glob patterns for each mime-type, for example:

$ awk '/<mime-type\stype="text\/html">/,/<\/mime-type>/' /usr/share/mime/packages/freedesktop.org.xml
  <mime-type type="text/html">
    <comment>HTML document</comment>
    <comment xml:lang="zh_TW">HTML 文件</comment>
    <comment xml:lang="zh_CN">HTML 文档</comment>
...
    <comment xml:lang="en_GB">HTML document</comment>
...
    <acronym>HTML</acronym>
    <expanded-acronym>HyperText Markup Language</expanded-acronym>
    <sub-class-of type="text/plain"/>
    <magic>
      <match type="string" value="&lt;!DOCTYPE HTML" offset="0:256"/>
...
    </magic>
    <magic priority="40">
      <match type="string" value="&lt;!--" offset="0"/>
      <match type="string" value="&lt;TITLE" offset="0:256"/>
      <match type="string" value="&lt;title" offset="0:256"/>
    </magic>
    <glob pattern="*.html" weight="80"/>
    <glob pattern="*.htm" weight="80"/>
  </mime-type>

The code could be something equivalent to this awk command:

$ awk '/<mime-type\s+type="[^"]+"[^>]*>/,/<\/mime-type>/ {
  if (!found) found = match( $0, "<mime-type type=\"" mime_type "\"");
  if (found && /<glob\s+pattern="/) {
    sub( /^\s*<glob\s+pattern="\*/, "");
    sub( /".*$/, "");
    print;
    exit; # exit on first match
  }
}' mime_type="text/html" /usr/share/mime/packages/freedesktop.org.xml
.html

Added similar comment to @wCurl curl/wcurl#4

@bagder
Copy link
Member Author

bagder commented Sep 16, 2024

-O with no other flags is meant to decide a name based on what is known at the time (the URL), without letting the server have a say. Ie without relying on response headers.

-J / --remote-header-name exists for selecting a name based on headers.

better the host name

I avoided that because of all the potential issues with IDN etc

@BrianInglis
Copy link
Contributor

Using the host name the user entered/provided should be no issue, whereas the IDN 2003/2008 encoded name is NOT User Friendly(TM)!
My interactive curl alias has -LRsSZ --no-progress-meter and -JOR is one of my favourite additions!
Using -J we get stdout, using -O we get curl_response, but combining them with -JO we just get an error:

$ curl -LJORsSvZ --no-progress-meter curl.se
* Host curl.se:80 was resolved.
* IPv6: 2a04:4e42:600::347, 2a04:4e42:c00::347, 2a04:4e42:400::347, 2a04:4e42:e00::347, 2a04:4e42:a00::347, 2a04:4e42:800::347, 2a04:4e42::347, 2a04:4e42:200::347
* IPv4: 151.101.193.91, 151.101.1.91, 151.101.65.91, 151.101.129.91
*   Trying [2a04:4e42:600::347]:80...
* Connected to curl.se () port 80
> GET / HTTP/1.1
> Host: curl.se
> User-Agent: curl/8.10.0
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 301 Moved Permanently
< Connection: close
< Content-Length: 0
< Server: Varnish
< Retry-After: 0
< Location: https://curl.se/
< Accept-Ranges: bytes
< Date: Mon, 16 Sep 2024 21:42:39 GMT
< Via: 1.1 varnish
< X-Served-By: cache-bfi-krnt7300065-BFI
< X-Cache: HIT
< X-Cache-Hits: 0
< X-Timer: S1726522960.714981,VS0,VE0
< alt-svc: h3=":443";ma=86400,h3-29=":443";ma=86400,h3-27=":443";ma=86400
<
* shutting down connection #0
* Clear auth, redirects to port from 80 to 443
* Issue another request to this URL: 'https://curl.se/'
* Host curl.se:443 was resolved.
* IPv6: 2a04:4e42:600::347, 2a04:4e42:c00::347, 2a04:4e42:400::347, 2a04:4e42:e00::347, 2a04:4e42:a00::347, 2a04:4e42:800::347, 2a04:4e42::347, 2a04:4e42:200::347
* IPv4: 151.101.193.91, 151.101.1.91, 151.101.65.91, 151.101.129.91
*   Trying [2a04:4e42:600::347]:443...
* Connected to curl.se () port 443
* ALPN: curl offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
*  CAfile: /etc/pki/tls/certs/ca-bundle.crt
*  CApath: none
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [19 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [2558 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [264 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [36 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [36 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 / X25519 / RSASSA-PSS
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=curl.se
*  start date: Aug 15 01:42:54 2024 GMT
*  expire date: Nov 13 01:42:53 2024 GMT
*  subjectAltName: host "curl.se" matched cert's "curl.se"
*  issuer: C=US; O=Let's Encrypt; CN=R11
*  SSL certificate verify ok.
*   Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 2: Public key type RSA (4096/152 Bits/secBits), signed using sha256WithRSAEncryption
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [201 bytes data]
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://curl.se/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: curl.se]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [user-agent: curl/8.10.0]
* [HTTP/2] [1] [accept: */*]
} [5 bytes data]
> GET / HTTP/2
> Host: curl.se
> User-Agent: curl/8.10.0
> Accept: */*
>
* Request completely sent off
{ [5 bytes data]
< HTTP/2 200
< server: nginx/1.21.1
< content-type: text/html
< x-frame-options: SAMEORIGIN
< last-modified: Mon, 16 Sep 2024 10:05:02 GMT
< etag: "2c56-62239b4cf7e5e"
< cache-control: max-age=60
< expires: Mon, 16 Sep 2024 10:06:09 GMT
< x-content-type-options: nosniff
< content-security-policy: default-src 'self' curl.haxx.se www.curl.se curl.se; style-src 'unsafe-inline' 'self' curl.haxx.se www.curl.se curl.se; require-trusted-types-for 'script';
< strict-transport-security: max-age=31536000
< via: 1.1 varnish, 1.1 varnish
< accept-ranges: bytes
< date: Mon, 16 Sep 2024 21:42:40 GMT
< age: 53
< x-served-by: cache-bma1620-BMA, cache-bfi-krnt7300022-BFI
< x-cache: HIT, HIT
< x-cache-hits: 14, 1
< x-timer: S1726522960.003332,VS0,VE1
< vary: Accept-Encoding
< alt-svc: h3=":443";ma=86400,h3-29=":443";ma=86400,h3-27=":443";ma=86400
< content-length: 11350
<
} [5 bytes data]
* client returned ERROR on write of 1369 bytes
{ [5 bytes data]
* Connection #1 to host curl.se left intact
curl: (23) client returned ERROR on write of 1369 bytes

@BrianInglis
Copy link
Contributor

Amusingly, when I try your curl(1) example, I get an IDN host name:

$ \curl ‐JO https://example.com/file  # \curl overrides alias options
curl: (6) Could not resolve host: xn--jo-u1t
<!doctype html>
<html>
<head>
    <title>Example Domain</title>
...

@bagder
Copy link
Member Author

bagder commented Sep 16, 2024

Using the host name the user entered/provided should be no issue, whereas the IDN 2003/2008 encoded name is NOT User Friendly(TM)!

What if the user provides an IDN name? In Swedish doing curl http://räksmörgås.se is easy and even user-friendly.

Using -J we get stdout

Then something is funny.

combining them with -JO we just get an error

Then something is wrong.

@BrianInglis
Copy link
Contributor

Sorry, by IDN I thought you meant the encoded ASCII name xn--... rather than the non-ASCII name, which in Canada may also be registered with any (possibly limited to valid French, but I don't see why that should be so) alternative variations in accents you like e.g. SystémàtïcSW.ab.ca, but resolve to the unaccented ASCII letters SystematicSW.... ;^>
I definitely keep getting errors from command lines like:
$ /bin/curl -LJOv curl.se
but they do seem to be going via a varnish cache:
< via: 1.1 varnish, 1.1 varnish

@BrianInglis
Copy link
Contributor

Thanks Daniel,
I was not sure if your "Something is funny/wrong" meant with my build or curl itself.
I am just glad it was reproducible and fixable, at some effort, which I appreciate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmdline tool feature-window A merge of this requires an open feature window tests
Development

Successfully merging this pull request may close these issues.

5 participants