fone4u/common-crawl-quick-hacks

 
 


Some quick hacks using the Common Crawl dataset.

links in metadata is an example of using Hadoop Streaming with a Python script to extract links from the metadata set.
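As a rough illustration of what such a streaming mapper might look like, here is a minimal sketch. It is not the repository's script. It assumes each input line is `url<TAB>json`, and that the JSON carries a `links` list whose entries have an `href` key; both the line layout and the field names are assumptions made for this example.

```python
import json
import sys


def extract_links(line):
    """Yield (source_url, target_url) pairs from one 'url<TAB>json' record.

    The record layout and the "links"/"href" field names are assumptions
    for illustration, not a documented Common Crawl schema.
    """
    url, _, payload = line.rstrip("\n").partition("\t")
    try:
        meta = json.loads(payload)
    except ValueError:
        return  # skip records whose value is not valid JSON
    if not isinstance(meta, dict):
        return
    for link in meta.get("links", []):
        if isinstance(link, dict) and link.get("href"):
            yield url, link["href"]


def main(stream=sys.stdin, out=sys.stdout):
    """Streaming-style loop: read records, write 'source<TAB>target' lines."""
    for line in stream:
        for src, dst in extract_links(line):
            out.write(f"{src}\t{dst}\n")
```

To use this as a Hadoop Streaming mapper, the script would call `main()` at the bottom, so that records arrive on stdin and the emitted pairs go to stdout.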

finding names gives a quick overview of the text data set and presents a simple NLTK app for extracting noun phrases (again using Python with Hadoop Streaming).
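To make the idea of noun phrase extraction concrete, here is a small pure-Python stand-in, not the repository's NLTK code. It assumes the text has already been tokenized and part-of-speech tagged with Penn Treebank tags (for example by `nltk.pos_tag(nltk.word_tokenize(text))`) and collects runs matching the simple pattern "optional determiner, any adjectives, one or more nouns". With NLTK itself, a comparable chunker could be built with `nltk.RegexpParser`; the pattern chosen here is an assumption for illustration.

```python
def noun_phrases(tagged):
    """Return phrases matching  DT? JJ* NN+  from a list of (word, tag) pairs.

    This is a hand-rolled illustration of chunking, assuming Penn Treebank
    tags; it is not the chunker used in the repository.
    """
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":            # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):   # any adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):   # one or more nouns
            k += 1
        if k > j:                           # at least one noun was found
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1                          # no phrase starts here; move on
    return phrases
```

For the tagged sentence "the quick brown fox jumps over the lazy dog", this picks out "the quick brown fox" and "the lazy dog".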

url status codes shows how to run over the metadata set with Java MapReduce to extract URLs and the status codes the crawler received when fetching them.
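The repository does this in Java; the sketch below only illustrates the per-record map logic, written in Python to match the other examples here. It assumes each record is a `url<TAB>json` line and that the status code lives under a field named `http_result`; that field name is an assumption for this example and is exposed as a parameter so it can be changed.

```python
import json


def url_status(line, status_field="http_result"):
    """Return (url, status) for one 'url<TAB>json' record, or None.

    The record layout and the default field name are assumptions for
    illustration, not a documented Common Crawl schema.
    """
    url, _, payload = line.rstrip("\n").partition("\t")
    try:
        meta = json.loads(payload)
    except ValueError:
        return None  # value is not valid JSON
    if not isinstance(meta, dict):
        return None
    status = meta.get(status_field)
    if status is None:
        return None  # record carries no status code
    return url, str(status)
```

In a MapReduce job, the map step would emit the returned pair as its key and value and simply skip records for which the function returns `None`.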
