handling URL with http:/// (3 slashes between protocol and domain) #791

Closed
cjbern opened this Issue May 5, 2016 · 13 comments

@cjbern
cjbern commented May 5, 2016 edited

I did this

The URL below points at a live webserver that returns a malformed Location field in its 301 HTTP response.

curl -I -L "http://bozardford.net"

I expected the following

Successful redirect to http://www.bozardford.net

despite the extra slash between the protocol and the domain name in the Location field of the 301 response header: "http:///www.bozardford.net"

Firefox 45 and Chrome 49 both handle the malformed Location field by ignoring the extra slash and redirecting to what was evidently meant.

What I got

HTTP/1.1 301 Moved Permanently
Server: Apache-Coyote/1.1
Location: http:///www.bozardford.net
Connection: close
Content-Length: 0
Date: Thu, 05 May 2016 02:10:29 GMT
X-DDC-Arch-Trace: ,HttpResponse

curl: (6) Could not resolve host: http

curl/libcurl version

curl 7.35.0 (x86_64-pc-linux-gnu) libcurl/7.35.0 OpenSSL/1.0.1f zlib/1.2.8 libidn/1.28 librtmp/2.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp smtp smtps telnet tftp 
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP

Also happens in pycurl:

'PycURL/7.19.5.3 libcurl/7.35.0 OpenSSL/1.0.1f zlib/1.2.8 libidn/1.28 librtmp/2.3'

I haven't had time to build a more recent libcurl version to test this on, but I haven't found any previous mention of this problem in the issues on GitHub or in an internet search.

operating system

Ubuntu 14.04 LTS

% uname -a
Linux 3.13.0-43-generic #72-Ubuntu SMP Mon Dec 8 19:35:06 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Like I said, current browsers appear to handle Location fields malformed in this way in a manner that matches the obvious intent. It would be nice for curl to do this as well, instead of my having to build specialized redirect handling into my own code.

@jay
Member
jay commented May 5, 2016

I can confirm that both Firefox and Chrome will skip seemingly any number of slashes after the scheme. Is that some legacy issue? I don't think it's correct to do that.

@cjbern
cjbern commented May 5, 2016 edited

I agree that it isn't correct, and RFC 7230 makes a missing host in an absolute http URI explicitly invalid now (MUST NOT send, if received, MUST treat as invalid). Earlier RFCs never seemed to make that explicit or clear enough, apparently.

Worse, this comment implies that Firefox used to reject HTTP URLs with the wrong number of slashes after the scheme: http://superuser.com/questions/352133/why-do-file-urls-start-with-3-slashes#comment388378_352134

Even worse: this autocorrection appears intentional in Chrome and is wontfix:
https://bugs.chromium.org/p/chromium/issues/detail?id=385645
https://blog.it-securityguard.com/google-chrome-security-multiple-leading-slashes-in-urls-may-confuse-some-server-side-xss-filters/

I don't have a strong opinion on whether you should implement the same autocorrection in curl after researching the above. Like I said it would be nice and it would save me some time, but I can code a workaround.

@bagder bagder added the HTTP label May 5, 2016
@bagder
Member
bagder commented May 5, 2016

First, let's not add file:/// to the confusion; for that scheme, three slashes is the correct form. http:// is different: an http:// URI cannot work without a host name.

RFC 3986 dictates how URIs work and there's no room for three slashes for HTTP.

But sure, Chrome and Firefox most probably do this for "web compatibility", as we like to call it: enough other clients break the specs that you want to do it too, as otherwise users get upset and think your product is flawed. It is similar to handling spaces in Location: headers, which we already do for exactly that reason.

So, I would not be against a patch that makes curl act like the popular browsers in this regard. After all, people use curl to mimic browsers to a large extent and not acting like browsers in this aspect makes curl not deliver that promise for these users.

@bagder
Member
bagder commented May 5, 2016

Firefox does in fact accept one or more slashes for HTTP and HTTPS redirects. I just tested redirects with one slash and with 10 slashes in Firefox; they all redirect fine.

@tomuta
tomuta commented May 6, 2016

On that note, it would be nice to have a CURLOPT_REDIRECTFUNCTION option so that an application could easily implement its own behavior by rewriting the redirect URL or rejecting it, causing the transfer to fail. With such an option, one could support malformed URLs like these.

@jay
Member
jay commented May 7, 2016 edited

@tomuta You can use CURLINFO_REDIRECT_URL to manually redirect.

#define MAXREDIRS  50

  CURLcode res;
  int redir_count;
  /* leave CURLOPT_FOLLOWLOCATION off so the transfer stops at each
     redirect and CURLINFO_REDIRECT_URL gets filled in */
  for(redir_count = 0; redir_count < MAXREDIRS; ++redir_count) {
    char *url = NULL;
    res = curl_easy_perform(curl);
    /* stop on error or when there is no further redirect to follow */
    if(res || curl_easy_getinfo(curl, CURLINFO_REDIRECT_URL, &url) || !url)
      break;
    /* redirect needed. this is where you could make a copy of the url and
       modify that copy before setting it */
    curl_easy_setopt(curl, CURLOPT_URL, url);
  }
  if(redir_count == MAXREDIRS) {
    fprintf(stderr, "\nError: Maximum (%d) redirects followed\n", MAXREDIRS);
  }
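
For the "modify that copy" step, a helper along these lines could collapse the extra slashes before re-setting the URL. This is a hypothetical sketch, not a libcurl API (and note it must not be used on file:/// URLs, where three slashes are correct):

#include <stdlib.h>
#include <string.h>

/* hypothetical helper, not part of libcurl: collapse any run of slashes
   right after the scheme separator down to exactly two. Returns a newly
   allocated string the caller must free(), or NULL on failure. */
static char *fix_extra_slashes(const char *url)
{
  const char *sep = strstr(url, "://");
  char *fixed = malloc(strlen(url) + 1);
  if(!fixed)
    return NULL;
  if(!sep) {
    strcpy(fixed, url); /* no scheme separator: leave the URL untouched */
  }
  else {
    size_t n = (size_t)(sep - url) + 3; /* scheme plus "://" */
    const char *rest = sep + 3;
    while(*rest == '/') /* skip any extra slashes before the host */
      rest++;
    memcpy(fixed, url, n);
    strcpy(fixed + n, rest);
  }
  return fixed;
}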
@bagder
Member
bagder commented May 8, 2016 edited

To allow one, two or three slashes, something like this could be applied:

diff --git a/lib/url.c b/lib/url.c
index 70ccd0f..f07dd39 100644
--- a/lib/url.c
+++ b/lib/url.c
@@ -4133,16 +4133,22 @@ static CURLcode parseurlandfillconn(struct SessionHandle *data,

     protop = "file"; /* protocol string */
   }
   else {
     /* clear path */
+    char slashbuf[4];
     path[0]=0;

-    if(2 > sscanf(data->change.url,
-                   "%15[^\n:]://%[^\n/?]%[^\n]",
-                   protobuf,
-                   conn->host.name, path)) {
+    rc = sscanf(data->change.url,
+                "%15[^\n:]:%3[/]%[^\n/?]%[^\n]",
+                protobuf, slashbuf, conn->host.name, path);
+    if(2 == rc) {
+      failf(data, "Bad URL");
+      return CURLE_URL_MALFORMAT;
+    }
+    if(3 > rc) {

       /*
        * The URL was badly formatted, let's try the browser-style _without_
        * protocol specified like 'http://'.
        */
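
As a standalone illustration of what that pattern accepts (hypothetical test code using local buffers, not curl's internals):

#include <stdio.h>

/* sketch of how the sscanf pattern from the patch above classifies URLs
   by slash count: rc >= 3 parses fine, rc == 2 means slashes matched but
   no valid host followed ("Bad URL"), and rc < 2 falls through to the
   browser-style no-scheme guess */
int main(void)
{
  const char *urls[] = {
    "http://example.com/path",   /* two slashes: fine */
    "http:///example.com/path",  /* three slashes: %3[/] eats all three */
    "http:////example.com/path", /* four: host would start with '/', rc==2 */
    "example.com/path",          /* no scheme at all: fallback, rc < 2 */
  };
  int i;
  for(i = 0; i < 4; i++) {
    char protobuf[16] = "", slashbuf[4] = "", host[256] = "", path[256] = "";
    int rc = sscanf(urls[i], "%15[^\n:]:%3[/]%255[^\n/?]%255[^\n]",
                    protobuf, slashbuf, host, path);
    printf("%-28s rc=%d proto=%s host=%s\n", urls[i], rc, protobuf, host);
  }
  return 0;
}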
@bagder
Member
bagder commented May 9, 2016

Apparently browsers support any number of slashes. They do that because their spec says so. And the spec says so because they do that.

whatwg/url#118

@bagder bagder changed the title from curl handling of HTTP 301 redirection fails when response location header starts with http:///<domain> (3 slashes between protocol and domain) to handling URL with http:/// (3 slashes between protocol and domain) May 16, 2016
@bagder bagder self-assigned this May 17, 2016
@bagder
Member
bagder commented May 17, 2016

The rant

curl has actually never been very strict or particular with its URL parsing. I mean, it even accepts URLs on the command line with the "scheme://" part completely left out, which has never been considered a URL by anyone. It also parses only the bare minimum it needs to do its job, which means it accepts other sorts of illegal URLs as well if you want it to.

I've given this a lot of thought and I've discussed the WHATWG-URL "standard" widely and intensely over the last couple of days.

I think it would be a tactical mistake to give up completely and accept the WHATWG-URL as a standard. They run and write their "standard" as they see fit, only for browsers, without proper concern for the entire web and its ecosystem. That said, hopefully they will come around at some point and we can work on converting their doc into a "real" standard. That would be a huge benefit for the web.

This said, I think we need to be realists and adapt to the world around us, and when the WHATWG clearly says these URLs are fine and a huge portion of browsers accept them, it forces us to act. Sure, we can say they're not RFC 3986 compliant and refuse to work with them. But who'd be happy with that in the long run? I don't think we in the curl project have enough power to make such a stance have any effect on the servers and providers that send back broken URLs in headers. They will just curse curl and continue to successfully use browsers against said servers.

The intent

I intend to merge a patch similar to what I described above, after the curl 7.49.0 release, to give us time to test it out and get a feel for it. It will accept one, two or three slashes only; it will complain in the verbose output about anything that isn't two slashes; and it will rewrite the URL internally to the correct form, so that extracting the URL or passing it onward to proxies etc. still uses the correct format.
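
Assuming the patch rewrites the URL as described, an application should then be able to read the normalized form back out with the existing CURLINFO_EFFECTIVE_URL option; a minimal sketch using this bug's URL:

#include <stdio.h>
#include <curl/curl.h>

/* sketch, assuming a libcurl with the proposed patch: set the broken
   three-slash URL, perform the transfer, and read back the URL curl
   actually used, which should be the rewritten two-slash form */
int main(void)
{
  CURL *curl;
  curl_global_init(CURL_GLOBAL_DEFAULT);
  curl = curl_easy_init();
  if(curl) {
    char *effective = NULL;
    curl_easy_setopt(curl, CURLOPT_URL, "http:///www.bozardford.net");
    if(!curl_easy_perform(curl) &&
       !curl_easy_getinfo(curl, CURLINFO_EFFECTIVE_URL, &effective) &&
       effective)
      printf("effective URL: %s\n", effective);
    curl_easy_cleanup(curl);
  }
  curl_global_cleanup();
  return 0;
}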

@jay
Member
jay commented May 17, 2016

How is /// any more likely than //// ? They both seem really really unlikely. Is there some common server configuration error which causes the former?

@bagder
Member
bagder commented May 17, 2016

Very anecdotal "evidence" only, so more of a hunch or a guess.

We've seen the former (in this bug report) and not the latter. When I've complained to whatwg people, some of them have hinted that URLs "like this" (unclear how many slashes that implies) are found to at least a measurable extent, and finally I'm just guessing that fewer slashes are more likely than more: /// as a typo is more common than //// simply because the first is a single-character mistake and the second means twice as many.

If we at a later point reconsider and have a reason to start accepting more slashes, then there's nothing preventing us from revisiting this topic.

@bch
Contributor
bch commented May 17, 2016

I could see "///*" from misconfigured CMSs or otherwise auto-generated code. I think somebody earlier wondered if Mozilla had telemetry on this. Is there data, or just educated guesses and anecdotes?

@bagder
Member
bagder commented May 19, 2016

Is there data, or just educated guesses and anecdotes?

There has been no data provided in this discussion, just random people (me included) making up random statements. I've mentioned that it would be possible to add a counter in Firefox or similar, but (A) I'm not sure it would be accepted by the maintainers of that code, (B) I don't feel like writing that code, and (C) I fear whatever number came out of it wouldn't make a difference in the end.

@bagder bagder added a commit that closed this issue May 30, 2016
@bagder bagder URL parser: allow URLs to use one, two or three slashes
Mostly in order to support broken web sites that redirect to broken URLs
that are accepted by browsers.

Browsers are typically even more lenient than this, as per the WHATWG URL
spec they should allow an _infinite_ number. I tested 8000 slashes with
Firefox and it just worked.

Added test cases 1141, 1142 and 1143 to verify the new parser.

Closes #791
5409e1d
@bagder bagder closed this in 5409e1d May 30, 2016