Inappropriately converting / -> /index.html while mirroring sites with Slurping #248

Closed
GoogleCodeExporter opened this Issue Apr 6, 2015 · 7 comments

GoogleCodeExporter commented Apr 6, 2015

From jmaessen:

I've been collecting a fresh slurp, since we're now doing more than we were 
the last time I did so.  But in looking at the logs, I've realized we're 
running into an odd problem:

When we ask Apache for a URI ending in /, such as http://www.ibm.com/ , Apache 
sees the URL and says "hey, a directory, I'd better append index.html".  So we 
end up asking the web for http://www.ibm.com/index.html , which is great except 
that this page says "302, try http://www.ibm.com/ instead".  So we end up 
failing to fetch quite a lot of content, because Apache corrupts the URI as it 
proxies it through the slurper.  I presume (but don't know for sure) that 
mod_proxy doesn't have the same flaw.  I'm not 100% sure of the mechanics inside 
Apache that cause this to happen; does anyone know more?

Original issue reported on code.google.com by sligocki@google.com on 21 Mar 2011 at 8:18

GoogleCodeExporter commented Apr 6, 2015

Confirmed for trunk build:

$ curl -x localhost:8080 -v http://www.ibm.com/
* About to connect() to proxy localhost port 8080 (#0)
*   Trying ::1... connected
* Connected to localhost (::1) port 8080 (#0)
> GET http://www.ibm.com/ HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
> Host: www.ibm.com
> Accept: */*
> Proxy-Connection: Keep-Alive
> 
< HTTP/1.1 302 Found
< Date: Wed, 16 Mar 2011 20:31:36 GMT
< Server: Apache/2.2.16 (Unix) DAV/2
< Location: http://www.ibm.com/
< Content-Length: 203
< Content-Type: text/html
< 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://www.ibm.com/">here</a>.</p>
</body></html>
* Connection #0 to host localhost left intact
* Closing connection #0

But it is not broken in latest release:

$ curl -x localhost:80 -v http://www.ibm.com/
* About to connect() to proxy localhost port 80 (#0)
*   Trying ::1... connected
* Connected to localhost (::1) port 80 (#0)
> GET http://www.ibm.com/ HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
> Host: www.ibm.com
> Accept: */*
> Proxy-Connection: Keep-Alive
> 
< HTTP/1.1 302 Found
< Date: Wed, 16 Mar 2011 20:20:07 GMT
< Server: IBM_HTTP_Server
< Location: http://www.ibm.com/us/en/
< Cache-Control: no-cache, must-revalidate
< Pragma: no-cache
< Expires: Mon, 01 Jan 1990 00:00:20 GMT
< Vary: Accept-Encoding
< Content-Length: 209
< Content-Type: text/html
< 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://www.ibm.com/us/en/">here</a>.</p>
</body></html>
* Connection #0 to host localhost left intact
* Closing connection #0

Original comment by sligocki@google.com on 21 Mar 2011 at 8:19

GoogleCodeExporter commented Apr 6, 2015

This also affects playback of older slurps -- we simply 404 on them.

Original comment by morlov...@google.com on 21 Mar 2011 at 8:23

GoogleCodeExporter commented Apr 6, 2015

I believe I know why this happens; I'm running into it myself now.  It's 
due to this call sequence:
   apache_slurp.cc: SlurpUrl()
   InstawebContext::MakeRequestUrl()
   ap_construct_url()
This occurs, in my debugger, with a request that has these fields:
     the_request = 0x7c10d0 "GET http://www.vip-chicks.de/ HTTP/1.1", 
     hostname = 0x7601a0 "www.vip-chicks.de", 
     unparsed_uri = 0x7b9550 "/index.html", 
     uri = 0x7b9570 "/index.html", 
     parsed_uri.path = "/index.html"
     main != NULL
The 'main' pointer refers to a request where:
     unparsed_uri = 0x75f4f0 "http://www.vip-chicks.de/", 
     uri = 0x74e980 "/", 
     parsed_uri.path = 0x74e980 "/", 
So I think a good fix is, in MakeRequestUrl, to follow the 'main' pointer 
until it is null before looking at the 'uri' fields.
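
A minimal sketch of that idea, not the actual mod_pagespeed change.  It assumes 
a stand-in struct (FakeRequest, hypothetical) carrying only the request_rec 
fields shown in the debugger output above, and the helper names 
TopLevelRequest and OriginalUri are made up for illustration:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the handful of request_rec fields that matter
// here.  The real struct is declared in Apache's httpd.h.
struct FakeRequest {
  const FakeRequest* main;   // Parent request, or nullptr for the top-level one.
  const char* unparsed_uri;  // URI as received, before any rewriting.
  const char* uri;           // Path portion of the URI.
};

// Follows the 'main' chain up to the request the client actually sent.
const FakeRequest* TopLevelRequest(const FakeRequest* r) {
  assert(r != nullptr);
  while (r->main != nullptr) {
    r = r->main;
  }
  return r;
}

// Returns the path of the top-level request, ignoring any "/index.html"
// that was appended on a subrequest.
std::string OriginalUri(const FakeRequest* r) {
  return TopLevelRequest(r)->uri;
}
```

With the two requests from the debugger dump (a subrequest for "/index.html" 
whose 'main' has uri "/"), OriginalUri on the subrequest yields "/", which is 
the path that should be sent to the origin.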

Original comment by jmara...@google.com on 23 Mar 2011 at 2:23

GoogleCodeExporter commented Apr 6, 2015

Original comment by jmara...@google.com on 23 Mar 2011 at 3:16

GoogleCodeExporter commented Apr 6, 2015

fix coming....

Original comment by jmara...@google.com on 23 Mar 2011 at 3:50

  • Changed state: Started

GoogleCodeExporter commented Apr 6, 2015

Original comment by jmara...@google.com on 24 Mar 2011 at 12:51

  • Changed state: Fixed

GoogleCodeExporter commented Apr 6, 2015

Original comment by jmara...@google.com on 6 May 2011 at 4:39

  • Changed title: Inappropriately converting / -> /index.html while mirroring sites with Slurping