  • Demonstration of using Python to process the Common Crawl dataset with the mrjob framework

    Python 70 49 Updated Dec 2, 2016
  • nutch

    Forked from Aloisius/nutch

    CommonCrawl Test version of Nutch

    Java 4 704 Updated Dec 2, 2016
  • Java 51 Updated Nov 24, 2016
  • Python Updated Nov 7, 2016
  • Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

    Python 1 4 Updated Nov 3, 2016
  • News crawling with StormCrawler (SC) - stores output as WARC

    Java 2 1 Updated Oct 30, 2016
  • Useful scripts for attacking the CommonCrawl dataset and WARC/WET/WAT files

    Python 3 Updated Oct 5, 2016
  • Java 19 Updated Sep 14, 2016
  • Deployment of pywb as a CommonCrawl Index Server

    HTML 4 Updated Aug 5, 2016
  • A library of examples showing how to use the Common Crawl corpus.

    Java 61 44 Updated Aug 5, 2016
  • Teneo

    Forked from Smerity/Teneo
    Java 9 Updated Apr 12, 2016
  • CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

    Java 12 31 Updated Jan 12, 2016
  • Java 34 12 Updated Dec 23, 2015
  • gzipstream allows Python to process multi-part gzip files from a streaming source

    Python 15 14 Updated Oct 8, 2015
  • python-hadoop

    Python 1 95 Updated Jul 27, 2015
  • An AWS SDK-backed FileSystem driver for Hadoop

    Java 29 Updated Jul 7, 2014
  • Official mirror of the AWS SDK for Java.

    Java 1,199 Updated Mar 11, 2014
  • The CommonCrawl Crawler Engine and Related MapReduce code

    Java 178 54 Updated Jul 14, 2013
  • CommonCrawl Project Repository

    C 450 94 Updated Feb 14, 2013
  • Python 7 Updated Jan 15, 2013
  • Java 2 3 Updated Jan 15, 2013
  • Java 2 Updated Jan 15, 2013
  • Java 2 Updated Jan 15, 2013
  • Java 1 1 Updated Jan 15, 2013
  • 1 1 Updated Jan 15, 2013
  • Java 2 Updated Jan 15, 2013
  • Java 3 Updated Jan 13, 2013
  • These MapReduce functions use Common Crawl data to look at the spread of congressional legislation on the internet

    3 Updated Sep 18, 2012
  • JavaScript 1 7 Updated Sep 18, 2012
  • Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts

    Java 4 11 Updated Sep 5, 2012