Redirect to relative path not working without appending '/' #26

elakamarcus · 2017-11-24T09:56:15Z

Snippet from lines 652-655. When i use the script to some sites that will send a "Location: /folder/page.html" the script would try http:/folder/page.html. After adding '/' to the URL, if missing, this is resolved.

# Must have protocol
url = "http://#{url}" if url !~ /^http(s)?:\/\//
url = url+"/" if url !~ /\/$/

The culprit, line 262:

base_url = uri.to_s[0, uri.to_s.rindex('/')]

p.s. my apologies if this is fixed in some version i have missed or break some other dependency. I've only tested with a few sites and parameters.

digininja · 2017-11-24T10:59:40Z

It doesn't look like you are using the current version from git, give that a try. It may have the same bug as the code looks the same but see what happens.

…

On Fri, 24 Nov 2017 at 09:56 elakamarcus ***@***.***> wrote: Snippet from lines 652-655. When i use the script to some sites that will send a "Location: /folder/page.html" the script would try http:/folder/page.html. After adding '/' to the URL, if missing, this is resolved. # Must have protocol url = "http://#{url}" if url !~ /^http(s)?:\/\// url = url+"/" if url !~ /\/$/ The culprit, line 262: base_url = uri.to_s[0, uri.to_s.rindex('/')] p.s. my apologies if this is fixed in some version i have missed or break some other dependency. I've only tested with a few sites and parameters. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#26>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAHJWS9WzadHRFzk64eS1gPA1TZJ6tqBks5s5pK_gaJpZM4QpmPP> .

elakamarcus · 2017-11-24T11:20:23Z

Thanks, I’ll try that. It’s likely the case that I’m not on the right version.

digininja · 2017-11-24T11:52:46Z

The line numbers don't match up so I don't think you are.

…

On Fri, 24 Nov 2017 at 11:20 elakamarcus ***@***.***> wrote: Thanks, I’ll try that. It’s likely the case that I’m not on the right version. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#26 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAHJWc28CjgALZh3IRENLMmSB4VZAxqwks5s5qZ3gaJpZM4QpmPP> .

elakamarcus · 2017-11-24T15:08:51Z

I just did a git clone, and got the same issue. After i added the code as shown above (snipped below also), it worked.

url = url+"/" if url !~ /\/$/

Sorry I maybe wasn't clear. In my initial post, the snippet is code I've already added to my local version of your script.

digininja · 2017-11-24T16:12:08Z

Your snippet is adding a / to the end of the URL if there isn't one already there, wouldn't that break the majority of the redirects?

/test.php

would become

/test.php/

which is wrong.

elakamarcus · 2017-11-24T22:48:28Z

At the moment inputting a target url like this: http://www.site.com
and get a redir location tag Location: /welcome/index.php
will result in cewl trying to go http:/welcome/index.php
which is wrong.

what i did solved my immediate hurdle. I did not work on a general case, my expectation is that is what these issues are for.

digininja · 2017-11-24T22:59:28Z

I've just managed to reproduce your issue. Your fix still doesn't seem to make sense from just reading it but I'll have a look at it in the morning and see if it makes more sense when I see it in context. Odd this bug hasn't been seen before as it seems like a fairly common redirect to do.

…

On Fri, 24 Nov 2017 at 22:48 elakamarcus ***@***.***> wrote: At the moment inputting a target url like this: http://www.site.com and get a redir location tag Location: /welcome/index.php will result in cewl trying to go http:/welcome/index.php which is wrong. what i did solved my immediate hurdle. I did not work on a general case, my expectation is that is what these issues are for. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#26 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAHJWZ_kzAatKYMFcEAD3UhGnP2XhuxGks5s50e8gaJpZM4QpmPP> .

digininja · 2017-11-24T23:03:15Z

Just seen this in the code:

 702 # The spider doesn't work properly if there isn't a / on the end
 703 if url !~ /\/$/
 704     # Commented out for Yori
 705    #  url = "#{url}/"
 706 end

I can't remember who Yori is or why he wanted it commented out but looks like I knew it was a bug a while ago.

digininja · 2017-11-24T23:08:24Z

I've put the line back in in the master branch and updated the version number.

Now to wait for the comment reminding me why the line was commented out.

Thanks for the report.

elakamarcus · 2017-11-24T23:12:08Z

Yeah, that's the part i noticed and used. But as you say, it only solved this situation outlined in my comment above.

elakamarcus · 2017-11-24T23:12:26Z

Appreciate you looking into it. Thank you.

elakamarcus · 2017-12-04T00:56:45Z

I left that 'adding / to the end' commented out, and instead I did the below modification on my local copy. See if it causes less issues:

      if res.redirect?
        puts "Redirect URL" if @debug
        if uri.to_s[-1] == '/'
                base_url = uri.to_s[0, uri.to_s.rindex('/')]
        else
                base_url = uri.to_s
        end
        new_url = URI.parse(construct_complete_url(base_url, res['Location']))

        # If auth is used then a name:pass@ gets added, this messes the tree
        # up so easiest to just remove it
        current_uri = uri.to_s.gsub(/:\/\/[^:]*:[^@]*@/, "://")
        @next_urls.push current_uri => new_url.to_s
      elsif res.code == "401"
        puts "Authentication required, can't continue on this branch - #{uri}" if @verbose
      else
        block.call(res)
      end

digininja · 2017-12-04T07:47:00Z

If people complain this fix breaks things then I'll have some test cases to work with to find out why it breaks things.

…

On Mon, 4 Dec 2017, 00:56 elakamarcus, ***@***.***> wrote: I left that 'adding / to the end' commented out, and instead I did the below modification on my local copy. See if it causes less issues: if res.redirect? puts "Redirect URL" if @debug if uri.to_s[-1] == '/' base_url = uri.to_s[0, uri.to_s.rindex('/' )] else base_url = uri.to_s end new_url = URI.parse(construct_complete_url(base_url, res['Location'])) # If auth is used then a name:pass@ gets added, this messes the tree # up so easiest to just remove it current_uri = uri.to_s.gsub(/:\/\/[^:]*:[^@]*@/, "://") @next_urls.push current_uri => new_url.to_s elsif res.code == "401" puts "Authentication required, can't continue on this branch - #{uri}" if @verbose else block.call(res) end — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#26 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAHJWfGnf8lh7wsKkYrcMKFOz2xLzuTVks5s80NNgaJpZM4QpmPP> .

spadou · 2018-09-22T22:25:49Z

Always adding a trailing / is a problem if the server doesn't handle the URL with the / the same way that without the /. Most of the time this isn't visible because the server will serve the same content for both URLs or do a redirect but when this isn't the case it will break.

For example it isn't possible to crawl a Wikipedia page because adding the trailing / will show a missing article page instead of the correct page:

https://en.wikipedia.org/wiki/Example    -> show the correct article page
https://en.wikipedia.org/wiki/Example/   -> show the missing article page

Forcing the trailing / will probably also breaks things if the URL contains query string parameters.

The trailing / should probably only be added to construct a full URL from a relative location to avoid breaking the others cases.

deargle · 2018-10-17T06:23:28Z

Just ran into the same wikipedia-crawling problem as @SamuelPadou . Trailing slash returns 404.

Xm0d3us · 2019-01-04T22:17:27Z

Same issue here. Trailing slash makes it impossible to crawl Wikipedia

digininja · 2019-01-04T23:00:21Z

I've just tried the following two commands and both work fine:

CeWL $ ./cewl.rb   https://en.wikipedia.org --verbose

CeWL $ ./cewl.rb   https://en.wikipedia.org/ --verbose

What problem are you having? Are you definitely crawling the right URL, not one that redirects to different domains before showing content?

deargle · 2019-01-04T23:06:12Z

Try these examples: #26 (comment)

digininja · 2019-01-04T23:19:55Z

OK, got rid of it in fc249dc

I wish I could change spiders, this one has so many little niggles that I've had to work around I've lost track of what fixes do what and why.

digininja closed this as completed Apr 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redirect to relative path not working without appending '/' #26

Redirect to relative path not working without appending '/' #26

elakamarcus commented Nov 24, 2017

digininja commented Nov 24, 2017 via email

elakamarcus commented Nov 24, 2017

digininja commented Nov 24, 2017 via email

elakamarcus commented Nov 24, 2017

digininja commented Nov 24, 2017

elakamarcus commented Nov 24, 2017

digininja commented Nov 24, 2017 via email

digininja commented Nov 24, 2017 •

edited

Loading

digininja commented Nov 24, 2017

elakamarcus commented Nov 24, 2017

elakamarcus commented Nov 24, 2017

elakamarcus commented Dec 4, 2017

digininja commented Dec 4, 2017 via email

spadou commented Sep 22, 2018

deargle commented Oct 17, 2018

Xm0d3us commented Jan 4, 2019

digininja commented Jan 4, 2019

deargle commented Jan 4, 2019

digininja commented Jan 4, 2019

Redirect to relative path not working without appending '/' #26

Redirect to relative path not working without appending '/' #26

Comments

elakamarcus commented Nov 24, 2017

digininja commented Nov 24, 2017 via email

elakamarcus commented Nov 24, 2017

digininja commented Nov 24, 2017 via email

elakamarcus commented Nov 24, 2017

digininja commented Nov 24, 2017

elakamarcus commented Nov 24, 2017

digininja commented Nov 24, 2017 via email

digininja commented Nov 24, 2017 • edited Loading

digininja commented Nov 24, 2017

elakamarcus commented Nov 24, 2017

elakamarcus commented Nov 24, 2017

elakamarcus commented Dec 4, 2017

digininja commented Dec 4, 2017 via email

spadou commented Sep 22, 2018

deargle commented Oct 17, 2018

Xm0d3us commented Jan 4, 2019

digininja commented Jan 4, 2019

deargle commented Jan 4, 2019

digininja commented Jan 4, 2019

digininja commented Nov 24, 2017 •

edited

Loading