common crawl quick hack examples
This branch is 9 commits behind matpalm:master.
Some quick hacks using the Common Crawl dataset.

links in metadata is an example of using Hadoop streaming with a Python script to extract links from the metadata set.
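A rough sketch of what such a streaming mapper could look like (not the script from this repo). It assumes each input line is `url<TAB>json`, and that the JSON carries a `content.links` list whose entries have an `href` key; both the line layout and the field names are assumptions, so check them against the actual metadata records.

```python
import json


def extract_links(line):
    """Yield (source_url, href) pairs from one 'url<TAB>json' record.

    The 'content' / 'links' / 'href' field names are assumptions about
    the metadata JSON, not something confirmed here.
    """
    url, sep, payload = line.rstrip("\n").partition("\t")
    if not sep:
        return  # no tab separator, so not a record we understand
    try:
        meta = json.loads(payload)
    except ValueError:
        return  # skip records whose JSON does not parse
    if not isinstance(meta, dict):
        return
    content = meta.get("content")
    links = content.get("links") if isinstance(content, dict) else None
    if not isinstance(links, list):
        return
    for link in links:
        href = link.get("href") if isinstance(link, dict) else None
        if href:
            yield url, href


def main(stdin, stdout):
    """Streaming-style mapper: read records, write 'source<TAB>target' lines."""
    for line in stdin:
        for source, target in extract_links(line):
            stdout.write(f"{source}\t{target}\n")

# In a real streaming script the file would end with:
#   import sys; main(sys.stdin, sys.stdout)
```

Skipping malformed records rather than raising keeps one bad line from killing a whole map task, which is usually what you want for a quick hack over crawl data.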

finding names gives a quick overview of the textdata set and presents a simple NLTK app for extracting noun phrases (again using Python streaming).
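For a feel of the noun phrase step, here is a minimal sketch using NLTK's `RegexpParser` chunker. The grammar is the common textbook pattern (optional determiner, adjectives, then nouns) and is an assumption, not necessarily what the app in this repo uses. `noun_phrases` additionally needs the NLTK tokenizer and tagger models to be downloaded; `noun_phrases_from_tagged` works on already tagged tokens and needs no extra data.

```python
import nltk

# Assumed chunk grammar: optional determiner, any adjectives, one or more nouns.
GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"
_chunker = nltk.RegexpParser(GRAMMAR)


def noun_phrases_from_tagged(tagged):
    """Return noun phrases from a list of (word, pos_tag) pairs."""
    tree = _chunker.parse(tagged)
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
    ]


def noun_phrases(text):
    """Tokenize, tag and chunk raw text.

    Requires the NLTK tokenizer and POS tagger models (see nltk.download).
    """
    phrases = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        phrases.extend(noun_phrases_from_tagged(tagged))
    return phrases
```

In a streaming mapper you would call `noun_phrases` on the text of each record and emit one phrase per output line.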

url status codes shows how to run over the metadata set using Java MapReduce to extract URLs and the status codes the crawler received when crawling them.
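The example in this repo is a Java MapReduce job; the sketch below is only a Python analogue of the same idea, for illustration. It assumes `url<TAB>json` input lines and that the status code lives in a top-level `http_result` field, which is an assumption about the metadata JSON.

```python
import json
from collections import Counter

# Assumed name of the JSON field that holds the HTTP status code.
STATUS_FIELD = "http_result"


def url_status(line):
    """Return (url, status) for one 'url<TAB>json' record, or None."""
    url, sep, payload = line.rstrip("\n").partition("\t")
    if not sep:
        return None
    try:
        meta = json.loads(payload)
    except ValueError:
        return None  # unparseable JSON
    if not isinstance(meta, dict):
        return None
    status = meta.get(STATUS_FIELD)
    if status is None:
        return None
    return url, str(status)


def status_counts(lines):
    """Tally how often each status code appears (the 'reduce' side)."""
    counts = Counter()
    for line in lines:
        pair = url_status(line)
        if pair is not None:
            counts[pair[1]] += 1
    return counts
```

`url_status` plays the role of the map step and `status_counts` the reduce step; in an actual MapReduce job the counting would happen per key across reducers rather than in one in-memory `Counter`.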
