Latest commit f6e8bc5 Oct 13, 2017
Type Name Latest commit message Commit time
README.md robots.txt analysis code Oct 19, 2017
analyze_badbots.py robots.txt analysis code Oct 19, 2017
analyze_googlebot.py robots.txt analysis code Oct 19, 2017
analyze_jobads.py robots.txt analysis code Oct 19, 2017
crawler.py robots.txt analysis code Oct 19, 2017
requirements.txt robots.txt analysis code Oct 19, 2017
robots_database.py robots.txt analysis code Oct 19, 2017

README.md

Robots.txt Analysis

This folder contains the supporting code for my analysis of the world's leading robots.txt files.

Crawling the robots.txt files

I'm using the grab Python package to download all the files. I highly recommend enabling async DNS resolution; otherwise the downloads will be bottlenecked on the default single-threaded DNS lookup.

You will also need the list of the top 1 million sites from Alexa (available here).
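As a rough illustration of how that list might be turned into download targets, here is a minimal sketch. It assumes each row of top-1m.csv has the shape `rank,domain` with no header row, and that robots.txt is requested over plain HTTP; `robots_urls` is a hypothetical helper name, not something taken from crawler.py.

```python
import csv
import io


def robots_urls(csv_file, limit=None):
    """Yield (rank, robots.txt URL) pairs from rows shaped like `rank,domain`.

    The row layout is an assumption about the Alexa CSV, not confirmed here.
    """
    for i, row in enumerate(csv.reader(csv_file)):
        if limit is not None and i >= limit:
            break
        if len(row) < 2:
            continue  # skip blank or malformed rows
        rank, domain = row[0], row[1].strip()
        yield int(rank), f"http://{domain}/robots.txt"


# Small in-memory stand-in for top-1m.csv.
sample = io.StringIO("1,google.com\n2,youtube.com\n")
pairs = list(robots_urls(sample))
```

With a real file, the same generator could be fed `open("top-1m.csv", newline="")` instead of the in-memory sample.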

All the downloaded files are written to a local SQLite database for later analysis.
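The actual schema lives in robots_database.py and is not shown here, so the following is only a sketch of the idea using Python's standard sqlite3 module. The table name `robots`, its columns, and the helpers `open_db` and `save_robots` are all hypothetical.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per domain, with the raw file body.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS robots ("
        " domain TEXT PRIMARY KEY, status INTEGER, body TEXT)"
    )
    return conn


def save_robots(conn, domain, status, body):
    """Store one downloaded robots.txt, replacing any earlier copy."""
    conn.execute(
        "INSERT OR REPLACE INTO robots (domain, status, body) VALUES (?, ?, ?)",
        (domain, status, body),
    )
    conn.commit()


conn = open_db()
save_robots(conn, "example.com", 200, "User-agent: *\nDisallow: /private/\n")
row = conn.execute(
    "SELECT status, body FROM robots WHERE domain = ?", ("example.com",)
).fetchone()
```

Keeping the raw body in the database means the analysis scripts can be rerun without crawling again.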

To run the crawler:

python crawler.py --crawl top-1m.csv --threads 50

This will probably take a couple of hours to finish.
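The `--threads` option suggests the downloads run concurrently. The sketch below shows that general pattern with the standard library's thread pool rather than grab, so it is not how crawler.py is implemented. The `crawl` function and the injected `fetch` callable are hypothetical; a stand-in fetcher is used so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl(domains, fetch, threads=50):
    """Call fetch(domain) for every domain on a pool of worker threads.

    Returns a dict mapping domain -> fetched body, or None if fetch raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(fetch, d): d for d in domains}
        for fut in as_completed(futures):
            domain = futures[fut]
            try:
                results[domain] = fut.result()
            except Exception:
                # One failing site should not stop the whole crawl.
                results[domain] = None
    return results


def fake_fetch(domain):
    """Stand-in for a real HTTP download (assumption for this example)."""
    if domain == "broken.example":
        raise IOError("simulated download failure")
    return f"User-agent: *\nDisallow:\n# {domain}"


results = crawl(["a.example", "broken.example"], fake_fetch, threads=4)
```

Recording failures as `None` instead of raising lets a long crawl finish and leaves the failed domains easy to retry later.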

Running the analysis

There is a series of analysis scripts, such as analyze_googlebot.py. Running one prints the results shown in the post.
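The scripts themselves are not reproduced here. As a hedged illustration of the kind of question they might answer, this sketch uses the standard library's `urllib.robotparser` to find which stored robots.txt bodies block a given user agent from the site root. The `blocks_agent` helper and the sample bodies are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_body, user_agent, path="/"):
    """Return True if this robots.txt body disallows user_agent from path."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return not parser.can_fetch(user_agent, path)


# Made-up robots.txt bodies standing in for rows from the database.
bodies = {
    "open.example": "User-agent: *\nDisallow:\n",
    "closed.example": "User-agent: Googlebot\nDisallow: /\n",
}
blocked = sorted(d for d, b in bodies.items() if blocks_agent(b, "Googlebot"))
```

Here `blocked` ends up containing only `closed.example`, since an empty `Disallow:` line permits everything.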