Useful scripts for attacking the CommonCrawl dataset and WARC/WET/WAT files
CommonCrawl test version of Nutch
Demonstration of using Python to process the Common Crawl dataset with the mrjob framework
News crawling with StormCrawler - stores output as WARC
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
Deployment of pywb as a CommonCrawl Index Server
A library of examples showing how to use the Common Crawl corpus.
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
gzipstream allows Python to process multi-part gzip files from a streaming source
An AWS SDK-backed FileSystem driver for Hadoop
Official mirror of the AWS SDK for Java. For more information on the AWS SDK for Java, see our web site:
The CommonCrawl Crawler Engine and Related MapReduce code
CommonCrawl Project Repository
These MapReduce functions use Common Crawl data to examine how congressional legislation spreads across the internet
Linking entities in the CommonCrawl dataset to Wikipedia concepts
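
Several entries above deal with multi-part gzip input: compressed WARC/WET/WAT files are typically a series of gzip members concatenated into one file. The following is a minimal sketch of reading such a stream member by member using only the Python standard library. It is not the API of gzipstream or of any other project listed here; the function name `iter_gzip_members` and the `chunk_size` parameter are hypothetical, chosen for illustration.

```python
import gzip
import io
import zlib


def iter_gzip_members(stream, chunk_size=16384):
    """Yield the decompressed bytes of each gzip member read from `stream`.

    Hypothetical helper for illustration; `stream` is any object with a
    read(n) method that returns bytes (a file, a socket wrapper, etc.).
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    parts = []
    data = stream.read(chunk_size)
    while data:
        parts.append(decomp.decompress(data))
        if decomp.eof:
            # One member is complete: emit it and start a fresh decompressor.
            yield b"".join(parts)
            parts = []
            # Bytes after the member's trailer belong to the next member.
            data = decomp.unused_data
            decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
            if not data:
                data = stream.read(chunk_size)
        else:
            data = stream.read(chunk_size)


if __name__ == "__main__":
    # Build a small two-member gzip stream in memory and read it back.
    blob = gzip.compress(b"first record") + gzip.compress(b"second record")
    for member in iter_gzip_members(io.BytesIO(blob), chunk_size=8):
        print(member)
```

Because the input is consumed in fixed-size chunks, the sketch does not need a seekable source. A stream that ends in the middle of a member yields nothing for that incomplete member, which a production reader would probably want to report as an error.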