Robots.txt Analysis

This folder contains the supporting code for my analysis of the world's leading robots.txt files.

Crawling the robots.txt files

I'm using the grab Python package to download all the robots.txt files. I highly recommend enabling async DNS resolution; otherwise the downloads will be bottlenecked by the default single-threaded DNS lookups.
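For reference, a minimal sketch of what the download side can look like with grab's Spider class. The Spider/Task classes and the thread_number option are standard grab, but the handler body and URL scheme here are assumptions and not necessarily what crawler.py does; note also that whether DNS resolution is actually asynchronous depends on pycurl/libcurl being built with c-ares rather than on grab itself.

```python
# Sketch only, not the actual crawler.py: fetch robots.txt for many domains
# using grab's Spider, which multiplexes requests across worker threads.
from grab.spider import Spider, Task


class RobotsSpider(Spider):
    def __init__(self, domains, **kwargs):
        super(RobotsSpider, self).__init__(**kwargs)
        self.domains = domains
        self.results = {}

    def task_generator(self):
        for domain in self.domains:
            # Assumes plain HTTP is enough for a first pass.
            yield Task('robots', url='http://%s/robots.txt' % domain)

    def task_robots(self, grab, task):
        # grab.doc.body is the raw response body; in the real crawler this
        # would presumably be written to the SQLite database described below.
        self.results[task.url] = grab.doc.body


# bot = RobotsSpider(domains=['google.com', 'example.com'], thread_number=50)
# bot.run()
```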

You will also need the list of the top 1 million sites from Alexa (available here).
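The Alexa CSV is just rank,domain rows with no header, so extracting the hostnames takes only a few lines (a sketch; the file name matches the crawl command below):

```python
import csv


def load_domains(path='top-1m.csv'):
    """Yield hostnames from the Alexa top-1m CSV; rows look like '1,google.com'."""
    with open(path) as f:
        for rank, domain in csv.reader(f):
            yield domain
```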

All of the downloaded files are dumped to a local SQLite database for later analysis.
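The README doesn't spell out the schema, but a single table is enough. Here is a minimal sketch with an assumed layout; the database file name and the table/column names are illustrative, not necessarily what crawler.py uses.

```python
import sqlite3

conn = sqlite3.connect('robots.sqlite')  # file name is an assumption
conn.execute(
    'CREATE TABLE IF NOT EXISTS robots (domain TEXT PRIMARY KEY, body TEXT)'
)


def save_robots(domain, body):
    # Call once per successful download. With many worker threads, writes
    # should be funnelled through a single thread, since a default sqlite3
    # connection can't be shared across threads.
    conn.execute(
        'INSERT OR REPLACE INTO robots (domain, body) VALUES (?, ?)',
        (domain, body),
    )
    conn.commit()
```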

To run the crawler:

python crawler.py --crawl top-1m.csv --threads 50

This will probably take a couple of hours to finish.

Running the analysis

There is a series of scripts such as analyze_googlebot.py. Running them will print the results shown in the post.
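As an illustration of the shape of those scripts, here is a hedged sketch of a Googlebot-style query over the stored files. The table and column names follow the storage sketch above and are not guaranteed to match the real analyze_googlebot.py.

```python
import sqlite3

conn = sqlite3.connect('robots.sqlite')

total = 0
mentions_googlebot = 0
for domain, body in conn.execute('SELECT domain, body FROM robots'):
    total += 1
    # Count sites whose robots.txt addresses Googlebot in a User-agent line.
    lines = body.lower().splitlines()
    if any(line.strip().startswith('user-agent:') and 'googlebot' in line
           for line in lines):
        mentions_googlebot += 1

print('%d of %d sites have a User-agent rule mentioning Googlebot'
      % (mentions_googlebot, total))
```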