Some links cause an endless loop #115

Closed
mfairchild365 opened this issue Jan 10, 2017 · 0 comments · Fixed by #116

Comments

@mfairchild365
Contributor

Take this scenario:

A website has a badly coded relative link, which keeps appending the link's path to the end of the URL. It would look something like this: http://www.example.com/contact-us/contact-us/contact-us/contact-us/contact-us/contact-us/contact-us/contact-us/... (with the contact-us/ segment repeated many more times).

SiteMaster has logic to prevent such a URL from stalling the system. It does this by taking the md5 hash of the URL and looking for another page in the same scan that has the same hash.

This does not appear to be working correctly in some circumstances. I believe this is because of URI truncation in the database.

The script first looks for the full URL, but the page then gets inserted with the truncated URL. Because the truncated URL has a different md5 hash than the full URL, the lookup never finds the existing page, and duplicate pages end up being scanned.
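The mismatch described above can be reproduced with a small sketch. This is not SiteMaster's code: `FakePageStore`, `uri_hash`, and the 255-character `MAX_URI_LENGTH` are hypothetical stand-ins (the real field length is not given in this issue), and the store is assumed to hash whatever URI it actually kept after truncation.

```python
import hashlib

# Hypothetical column width; the real uri field length is not stated in this issue.
MAX_URI_LENGTH = 255


def uri_hash(uri: str) -> str:
    """Return the md5 hex digest of a URI."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()


class FakePageStore:
    """Stand-in for the pages table: URIs are truncated on insert."""

    def __init__(self) -> None:
        self.hashes = set()

    def has_page(self, uri: str) -> bool:
        # The lookup hashes the URI exactly as the caller passed it in.
        return uri_hash(uri) in self.hashes

    def insert(self, uri: str) -> None:
        # The fixed-width column silently truncates, so the stored hash
        # corresponds to the truncated URI (an assumption of this sketch).
        stored = uri[:MAX_URI_LENGTH]
        self.hashes.add(uri_hash(stored))


store = FakePageStore()
long_url = "http://www.example.com/" + "contact-us/" * 40

store.insert(long_url)
found_full = store.has_page(long_url)                        # lookup by full URL
found_truncated = store.has_page(long_url[:MAX_URI_LENGTH])  # lookup by stored form
```

Under these assumptions, `found_full` is `False` even though the page was just inserted, so the same over-long URL would be treated as new on every pass.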

mfairchild365 added a commit to mfairchild365/site_master that referenced this issue Jan 10, 2017
fixes UNLSiteMaster#115

URLs that were longer than the mysql uri field length were being truncated, resulting in the same URL being scanned over and over again while ignoring the distinct page limit. This fixes that issue by truncating the URL before it even gets to mysql.
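A minimal sketch of the idea in that commit message, truncating before the URL is hashed or stored, might look like the following. The names `normalize_uri`, `should_scan`, and the 255-character `MAX_URI_LENGTH` are illustrative assumptions, not the actual change in #116.

```python
import hashlib

# Hypothetical column width; the real uri field length is not stated in this issue.
MAX_URI_LENGTH = 255


def normalize_uri(uri: str) -> str:
    """Truncate the URI to the column width before anything else sees it."""
    return uri[:MAX_URI_LENGTH]


def uri_hash(uri: str) -> str:
    """Hash the normalized URI so lookups and inserts agree."""
    return hashlib.md5(normalize_uri(uri).encode("utf-8")).hexdigest()


seen_hashes = set()


def should_scan(uri: str) -> bool:
    """Return True only the first time a (normalized) URI is encountered."""
    digest = uri_hash(uri)
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True


base = "http://www.example.com/"
first = should_scan(base + "contact-us/" * 40)   # over-long URL, first sighting
second = should_scan(base + "contact-us/" * 41)  # one segment longer, same prefix
short = should_scan(base + "contact-us/")        # short URLs are unaffected
```

Because both over-long URLs share the same first `MAX_URI_LENGTH` characters, they collapse to one hash and the second is skipped, which is what stops the loop. The trade-off is that any two distinct URLs sharing that prefix are also treated as the same page.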