
Download pages from robots.txt in SubdomainDataExtractor #10

Open
eranzim opened this issue Jan 22, 2018 · 2 comments


eranzim commented Jan 22, 2018

Since robots.txt often includes interesting pages, we should make sure the extractor script gets them recursively (they could just be downloaded with another wget -r, which would ensure they end up in the dump if they exist). At this point, it might be worth moving the whole "get entire website" logic to a separate script file, or at least into its own function.
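A minimal sketch of what that could look like, assuming the target base URL is passed as the first argument (the script name, variable names, and output directory below are placeholders, not part of the existing extractor):

```bash
#!/bin/bash
# Sketch: fetch robots.txt for a target and wget -r every path it references.
# Usage (assumed): ./robots_fetch.sh https://example.com [output_dir]
TARGET="$1"
OUTDIR="${2:-robots_pages}"
mkdir -p "$OUTDIR"

wget -q -O "$OUTDIR/robots.txt" "$TARGET/robots.txt"

# Take every Allow/Disallow path, drop comments and empty values, de-duplicate,
# then download each one recursively so the pages are there if they exist.
grep -iE '^(allow|disallow):' "$OUTDIR/robots.txt" \
  | sed -e 's/#.*$//' -e 's/^[^:]*:[[:space:]]*//' \
  | grep -v '^[[:space:]]*$' | sort -u \
  | while read -r path; do
      wget -r -np -P "$OUTDIR" "$TARGET$path"
    done
```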


eranzim commented Jan 22, 2018

Note that some robots.txt files are more complex than the common ones. Examples:
https://www.facebook.com/robots.txt (contains comments, multiple user-agents, and a mix of Allow and Disallow rules)
https://www.google.com/robots.txt (also mixes Allow and Disallow, includes Sitemap references, and uses wildcards and special characters: *, ?, =, $)
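A sketch of the extra handling such files would need, assuming robots.txt has already been downloaded to the current directory; Sitemap URLs are collected separately, and Allow/Disallow values are cut at the first wildcard or anchor character so only a literal path prefix remains for wget:

```bash
#!/bin/bash
# Sketch: extract usable paths from a complex robots.txt (assumed filenames).
ROBOTS="robots.txt"

# Sitemap lines carry absolute URLs and can be fetched as-is.
grep -iE '^sitemap:' "$ROBOTS" \
  | sed 's/^[^:]*:[[:space:]]*//' > sitemaps.txt

# Allow/Disallow values: drop inline comments, strip the directive prefix,
# then cut at the first wildcard/anchor character (*, ?, $) and keep the
# literal prefix.
grep -iE '^(allow|disallow):' "$ROBOTS" \
  | sed -e 's/#.*$//' -e 's/^[^:]*:[[:space:]]*//' -e 's/[*?$].*//' \
  | grep -v '^[[:space:]]*$' | sort -u > robots_paths.txt
```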


eranzim commented Jun 21, 2021

Sometimes downloading robots.txt returns the same content as index.html (or potentially another page entirely), so validate the file before starting to work with it.
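One possible heuristic, sketched below (the helper name and directive list are assumptions, not existing code): reject the file if it looks like HTML, or if its first meaningful line does not start with a known robots.txt directive.

```bash
#!/bin/bash
# Sketch: heuristic check that a downloaded file really is a robots.txt.
is_valid_robots() {
  local file="$1"
  # Reject files that are really an HTML page (e.g. index.html served for every path).
  if grep -qiE '<(!doctype|html)' "$file"; then
    return 1
  fi
  # The first non-comment, non-blank line should be a robots.txt directive.
  local first_line
  first_line=$(grep -vE '^[[:space:]]*(#|$)' "$file" | head -n 1)
  echo "$first_line" | grep -qiE '^(user-agent|sitemap|allow|disallow|crawl-delay):'
}

# Usage sketch:
# is_valid_robots robots.txt || { echo "robots.txt looks like something else, skipping"; exit 0; }
```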
