common crawl quick hack examples

Some quick hacks using the Common Crawl dataset.

links_in_metadata is an example of using Hadoop streaming with a Python script to extract links from the metadata set.
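To give a feel for what such a streaming mapper might look like, here is a minimal sketch. It is not the script from this repo. It assumes each input line is `url<TAB>json`, and that links sit under a `content` -> `links` list of objects with an `href` key; that layout and the function names `extract_links`, `map_lines` and `main` are assumptions made for illustration, so check them against real records.

```python
import json
import sys


def extract_links(metadata_json, links_path=("content", "links"), href_key="href"):
    """Return the hrefs found in one metadata record.

    The nesting ("content" -> "links" -> [{"href": ...}]) is an ASSUMED
    layout for illustration; adjust links_path / href_key to the real data.
    """
    try:
        node = json.loads(metadata_json)
    except ValueError:
        return []  # skip records that are not valid JSON
    for key in links_path:
        if not isinstance(node, dict):
            return []
        node = node.get(key)
    if not isinstance(node, list):
        return []
    return [link[href_key] for link in node
            if isinstance(link, dict) and href_key in link]


def map_lines(lines):
    """Yield "source_url<TAB>link" for each "url<TAB>json" input line."""
    for line in lines:
        url, sep, payload = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # malformed line: no tab separator
        for href in extract_links(payload):
            yield "%s\t%s" % (url, href)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming entry point: read records on stdin, write pairs to stdout."""
    for out in map_lines(stdin):
        stdout.write(out + "\n")
```

A script used as a streaming mapper would end by calling `main()`. Keeping the parsing in plain functions means it can be tried on a few sample lines locally before running it on a cluster.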

finding_names gives a quick overview of the textdata set and presents a simple NLTK app for extracting noun phrases (again using Python streaming).
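As a rough illustration of noun phrase extraction with NLTK, the sketch below chunks already-tagged tokens with `nltk.RegexpParser`. It is not the app from this repo: the grammar (optional determiner, any adjectives, one or more nouns) and the helper name `noun_phrases` are assumptions, and the input is hand-tagged so the example needs no tagger models.

```python
import nltk

# ASSUMED chunk grammar: optional determiner, adjectives, then nouns.
GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"
CHUNKER = nltk.RegexpParser(GRAMMAR)


def noun_phrases(tagged_tokens):
    """Return noun phrases (as strings) from a list of (word, POS tag) pairs."""
    tree = CHUNKER.parse(tagged_tokens)
    phrases = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        # leaves() gives back the (word, tag) pairs inside this chunk
        phrases.append(" ".join(word for word, _tag in subtree.leaves()))
    return phrases


# Hand-tagged sample sentence (Penn Treebank tags).
sample = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(noun_phrases(sample))
```

On raw text the tagged pairs would typically come from `nltk.pos_tag(nltk.word_tokenize(text))`, which needs the corresponding NLTK data packages to be downloaded first.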

url_status_codes shows how to run Java MapReduce over the metadata set to extract URLs and the status codes the crawler received when crawling them.
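The per-record logic of such a job can be sketched in a few lines. The example here is Python rather than the Java MapReduce the repo uses, purely to show the shape of the map step. It assumes each record arrives as `url<TAB>json` and that the status code lives under a top-level `http_result` key; both the key name and the helper `url_status` are assumptions for illustration.

```python
import json


def url_status(line, status_key="http_result"):
    """Return (url, status) for one "url<TAB>json" line, or None if unusable.

    status_key is an ASSUMED field name; change it to match the real records.
    """
    url, sep, payload = line.rstrip("\n").partition("\t")
    if not sep:
        return None  # no tab separator, so no JSON payload
    try:
        record = json.loads(payload)
    except ValueError:
        return None  # payload is not valid JSON
    if not isinstance(record, dict) or status_key not in record:
        return None
    return url, record[status_key]
```

A mapper built on this would emit the URL as the key and the status code as the value, skipping any record for which the function returns `None`.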