#CommonCrawl Support Library

##Overview

The commoncrawl source code repository is used as a distribution vehicle for our custom
Hadoop InputFormat (ARCInputFormat, located in org.commoncrawl.hadoop.io). Please refer
to the CommonCrawl website at http://www.commoncrawl.org/ for more details on how to
access our crawl corpus.

See the sample class BasicArcFileReaderSample.java (located in org.commoncrawl.samples) for an
example of how to configure the InputFormat. A more detailed example of how to use it in
the context of a Hadoop Job will be forthcoming.

##Build Notes:

1. You need to define JAVA_HOME, and make sure you have Ant & Maven installed.
2. Set hadoop.path (in build.properties) to point to your Hadoop folder.
3. Make sure you have the thrift compiler (version 0.7.0) installed on your system.
4. If you want to use the Google URL Canonicalization library in a Hadoop job,
copy the shared libraries under lib/native/{Platform} to /usr/local/lib or equivalent.
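
The prerequisites above can be checked before starting a build. The following is a minimal sketch, not part of the repository: the function names `have_tool` and `check_build_prereqs` are hypothetical, and it assumes the Ant, Maven, and Thrift executables are named `ant`, `mvn`, and `thrift` and that it is run from the directory containing build.properties.

```shell
#!/bin/sh
# Sketch only: verifies the prerequisites from the build notes.
# have_tool and check_build_prereqs are illustrative names, not repository scripts.

# Succeeds if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Prints one line per missing prerequisite; returns non-zero if any is missing.
check_build_prereqs() {
    missing=0

    # Step 1: JAVA_HOME must be defined, and Ant and Maven must be installed.
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
        missing=1
    fi
    for tool in ant mvn thrift; do
        if ! have_tool "$tool"; then
            echo "$tool not found on PATH"
            missing=1
        fi
    done

    # Step 2: hadoop.path must be set in build.properties.
    if ! grep -q '^[[:space:]]*hadoop\.path[[:space:]]*[=:]' build.properties 2>/dev/null; then
        echo "hadoop.path is not set in build.properties"
        missing=1
    fi

    # Step 3: the notes call for Thrift 0.7.0; `thrift -version` reports the version.
    if have_tool thrift && ! thrift -version 2>&1 | grep -q '0\.7\.0'; then
        echo "thrift is installed but is not version 0.7.0"
        missing=1
    fi

    return "$missing"
}
```

After sourcing the file, running `check_build_prereqs` from the repository root prints nothing and returns 0 when everything is in place. Step 4 (the native libraries) is optional and is not checked here.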

##Sample Usage:

Once commoncrawl.jar has been built, you can execute a job/sample via the bin/launcher.sh script.
The sample class BasicArcFileReaderSample.java (located in org.commoncrawl.samples) demonstrates
how to configure our InputFormat. To run the BasicArcFileReaderSample against
an ARC file in the corpus (2010/01/07/18/1262876244253_18.arc.gz, for example), run
the following command line:

    bin/launcher.sh org.commoncrawl.samples.BasicArcFileReaderSample {AWS ACCESS KEY} {AWS SECRET KEY} commoncrawl-crawl-002 2010/01/07/18/1262876244253_18.arc.gz
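
To avoid typing the keys on the command line each time, the invocation can be wrapped in a small shell function. This is a sketch under assumptions: `run_arc_sample` and `DRY_RUN` are hypothetical names, and reading the keys from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` is a convention of this wrapper, not something bin/launcher.sh is known to do itself.

```shell
#!/bin/sh
# Sketch only: wraps the sample invocation shown above.
# run_arc_sample and DRY_RUN are illustrative names, not part of the repository.

run_arc_sample() {
    # $1 = bucket name, $2 = path of the ARC file within the bucket
    bucket=$1
    arc_path=$2
    sample_class=org.commoncrawl.samples.BasicArcFileReaderSample

    # Stop with a clear message if either credential is missing.
    : "${AWS_ACCESS_KEY_ID:?set AWS_ACCESS_KEY_ID first}"
    : "${AWS_SECRET_ACCESS_KEY:?set AWS_SECRET_ACCESS_KEY first}"

    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command with the credentials masked instead of running it.
        echo "bin/launcher.sh $sample_class **** **** $bucket $arc_path"
    else
        bin/launcher.sh "$sample_class" \
            "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" "$bucket" "$arc_path"
    fi
}
```

With both variables exported, `run_arc_sample commoncrawl-crawl-002 2010/01/07/18/1262876244253_18.arc.gz` issues the same command as above. Setting `DRY_RUN=1` first shows what would be run without contacting the corpus.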