
Download pages from robots.txt in SubdomainDataExtractor #10

Open
eranzim opened this issue Jan 22, 2018 · 2 comments


eranzim commented Jan 22, 2018

Since robots.txt often includes interesting pages, we should make sure the extractor script gets them recursively (they could just be downloaded with another wget -r, which would ensure they end up in the dump if they exist). At this point, it might be worth moving the whole "get entire website" logic to a separate script file, or at least into its own function.
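A minimal sketch of what that could look like, assuming the target base URL is passed as the first argument (the script name, variable names, and output directory below are placeholders, not part of the existing extractor):

```bash
#!/bin/bash
# Sketch: fetch robots.txt for a target and wget -r every path it references.
# Usage (assumed): ./robots_fetch.sh https://example.com [output_dir]
TARGET="$1"
OUTDIR="${2:-robots_pages}"
mkdir -p "$OUTDIR"

wget -q -O "$OUTDIR/robots.txt" "$TARGET/robots.txt"

# Take every Allow/Disallow path, drop comments and empty values, de-duplicate,
# then download each one recursively so the pages are there if they exist.
grep -iE '^(allow|disallow):' "$OUTDIR/robots.txt" \
  | sed -e 's/#.*$//' -e 's/^[^:]*:[[:space:]]*//' \
  | grep -v '^[[:space:]]*$' | sort -u \
  | while read -r path; do
      wget -r -np -P "$OUTDIR" "$TARGET$path"
    done
```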


eranzim commented Jan 22, 2018

Note that some robots.txt files are more complex than the common ones. Examples:
https://www.facebook.com/robots.txt (contains comments, multiple user-agents, and a mix of Allow and Disallow rules)
https://www.google.com/robots.txt (also mixes Allow and Disallow, includes Sitemap references, and uses wildcards and special characters: *, ?, =, $)
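A sketch of the extra handling such files would need, assuming robots.txt has already been downloaded to the current directory; Sitemap URLs are collected separately, and Allow/Disallow values are cut at the first wildcard or anchor character so only a literal path prefix remains for wget:

```bash
#!/bin/bash
# Sketch: extract usable paths from a complex robots.txt (assumed filenames).
ROBOTS="robots.txt"

# Sitemap lines carry absolute URLs and can be fetched as-is.
grep -iE '^sitemap:' "$ROBOTS" \
  | sed 's/^[^:]*:[[:space:]]*//' > sitemaps.txt

# Allow/Disallow values: drop inline comments, strip the directive prefix,
# then cut at the first wildcard/anchor character (*, ?, $) and keep the
# literal prefix.
grep -iE '^(allow|disallow):' "$ROBOTS" \
  | sed -e 's/#.*$//' -e 's/^[^:]*:[[:space:]]*//' -e 's/[*?$].*//' \
  | grep -v '^[[:space:]]*$' | sort -u > robots_paths.txt
```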


eranzim commented Jun 21, 2021

Sometimes downloading robots.txt returns the same content as index.html (or potentially another page entirely), so validate the file before starting to work with it.
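One possible heuristic, sketched below (the helper name and directive list are assumptions, not existing code): reject the file if it looks like HTML, or if its first meaningful line does not start with a known robots.txt directive.

```bash
#!/bin/bash
# Sketch: heuristic check that a downloaded file really is a robots.txt.
is_valid_robots() {
  local file="$1"
  # Reject files that are really an HTML page (e.g. index.html served for every path).
  if grep -qiE '<(!doctype|html)' "$file"; then
    return 1
  fi
  # The first non-comment, non-blank line should be a robots.txt directive.
  local first_line
  first_line=$(grep -vE '^[[:space:]]*(#|$)' "$file" | head -n 1)
  echo "$first_line" | grep -qiE '^(user-agent|sitemap|allow|disallow|crawl-delay):'
}

# Usage sketch:
# is_valid_robots robots.txt || { echo "robots.txt looks like something else, skipping"; exit 0; }
```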
