WARC Input and Output Formats for Hadoop

warc-hadoop is a Java library for working with WARC (Web Archive) files in Hadoop. It provides InputFormats for reading and OutputFormats for writing WARC files in MapReduce jobs (supporting both the 'old' org.apache.hadoop.mapred and the 'new' org.apache.hadoop.mapreduce API).

WARC files are used to record the activity of a web crawler. They include both the HTTP requests that were sent to servers and the HTTP responses that were received (including headers). WARC is an ISO standard (ISO 28500), and is used by the Internet Archive and CommonCrawl, amongst others.

This warc-hadoop library was written in order to explore the CommonCrawl data, a publicly available dump of billions of web pages. The data is made available for free as a public dataset on AWS. If you want to process it, you only pay for the computing capacity needed to process it on AWS, or for the network bandwidth to download it.

Using warc-hadoop

Add the following Maven dependency to your project:

<dependency>
    <groupId>com.martinkl.warc</groupId>
    <artifactId>warc-hadoop</artifactId>
    <version>0.1.0</version>
</dependency>

Now you can import either com.martinkl.warc.mapred.WARCInputFormat or com.martinkl.warc.mapreduce.WARCInputFormat into your Hadoop job, depending on which version of the API you are using. Example usage:

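// Old-API (org.apache.hadoop.mapred) job setup; `conf` is an existing Hadoop Configuration.
JobConf job = new JobConf(conf, CommonCrawlTest.class);

// Input and output locations, with compression enabled for the output files.
FileInputFormat.addInputPath(job, new Path("/path/to/my/input"));
FileOutputFormat.setOutputPath(job, new Path("/path/for/my/output"));
FileOutputFormat.setCompressOutput(job, true);

// Read and write WARC records; the output has no key, only WARCWritable values.
job.setInputFormat(WARCInputFormat.class);
job.setOutputFormat(WARCOutputFormat.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(WARCWritable.class);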

Example of a mapper that emits server responses, using the URL as the key:

public static class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, WARCWritable, Text, WARCWritable> {

    @Override
    public void map(LongWritable key, WARCWritable value, OutputCollector<Text, WARCWritable> collector,
                    Reporter reporter) throws IOException {
        String recordType = value.getRecord().getHeader().getRecordType();
        String targetURL  = value.getRecord().getHeader().getTargetURI();

        // Emit only server responses that have a target URL. Putting the constant
        // first avoids a NullPointerException if the record type is missing.
        if ("response".equals(recordType) && targetURL != null) {
            collector.collect(new Text(targetURL), value);
        }
    }
}

File format parsing

A WARC file consists of a flat sequence of records. Each record may be an HTTP request (recordType = "request"), a response (recordType = "response"), or one of various other types, including metadata. When reading from a WARC file, the records are passed to the mapper one at a time, so the request and the response for the same URL arrive in two separate calls to the map method.
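
To make the record structure concrete, the sketch below parses the header block of a single WARC record in plain Java, independently of this library. It assumes the layout defined by the WARC standard: a version line such as WARC/1.0, followed by "Name: value" fields (WARC-Type holds the record type, WARC-Target-URI the URL) and terminated by a blank line. The class WarcHeaderSketch and its method names are hypothetical and are not part of warc-hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of warc-hadoop: parses one WARC header block.
final class WarcHeaderSketch {

    // Reads the "Name: value" lines after the version line, stopping at the blank line.
    static Map<String, String> parseHeader(String headerBlock) {
        Map<String, String> fields = new LinkedHashMap<>();
        String[] lines = headerBlock.split("\r\n");
        if (lines.length == 0 || !lines[0].startsWith("WARC/")) {
            throw new IllegalArgumentException("not a WARC header block");
        }
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                break; // blank line: end of header, the record content follows
            }
            int colon = lines[i].indexOf(':');
            if (colon < 0) {
                continue; // this sketch skips malformed and continuation lines
            }
            fields.put(lines[i].substring(0, colon).trim(),
                       lines[i].substring(colon + 1).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // A request and its response are two independent records in the file.
        String request = "WARC/1.0\r\nWARC-Type: request\r\n"
                + "WARC-Target-URI: http://example.com/\r\n\r\n";
        String response = "WARC/1.0\r\nWARC-Type: response\r\n"
                + "WARC-Target-URI: http://example.com/\r\n\r\n";
        for (String block : new String[] {request, response}) {
            Map<String, String> header = parseHeader(block);
            System.out.println(header.get("WARC-Type") + " " + header.get("WARC-Target-URI"));
        }
    }
}
```

Both records carry the same target URI, so if you need to associate a request with its response, the URI is one possible join key (for example as the mapper's output key).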

This library currently doesn't perform any parsing of the data inside records, such as the HTTP headers or the HTML body. You can simply read the server's response as an array of bytes. Additional parsing functionality may be added in future versions.
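
Because the record content is handed over as raw bytes, separating the HTTP headers from the body is left to you. Here is a minimal sketch in plain Java, assuming the byte array holds a complete HTTP response (status line, headers, blank line, body). The class HttpResponseSplitter is hypothetical and not part of warc-hadoop, and how you obtain the bytes from a record is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of warc-hadoop: splits raw HTTP response bytes.
final class HttpResponseSplitter {

    // Returns the index of the first body byte (just after CRLF CRLF), or -1 if absent.
    static int bodyOffset(byte[] content) {
        for (int i = 0; i + 3 < content.length; i++) {
            if (content[i] == '\r' && content[i + 1] == '\n'
                    && content[i + 2] == '\r' && content[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    // Returns the status line and headers as text, without the terminating blank line.
    static String headers(byte[] content) {
        int offset = bodyOffset(content);
        int end = offset < 0 ? content.length : offset - 4;
        return new String(content, 0, end, StandardCharsets.ISO_8859_1);
    }

    // Returns the body bytes unchanged, or an empty array if no blank line was found.
    static byte[] body(byte[] content) {
        int offset = bodyOffset(content);
        return offset < 0 ? new byte[0] : Arrays.copyOfRange(content, offset, content.length);
    }

    public static void main(String[] args) {
        byte[] raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(headers(raw));
        System.out.println(body(raw).length + " body bytes");
    }
}
```

The sketch does not decode chunked transfer encoding or compressed bodies; the body bytes are returned exactly as they were recorded.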

WARC files are typically gzip-compressed. Gzip files are not splittable by Hadoop (i.e. an entire file must be processed sequentially; it's not possible to start reading in the middle of a file), so projects like CommonCrawl typically aim for a maximum file size of 1GB (compressed). If you're only doing basic parsing, a file of that size takes less than a minute to process.

When writing WARC files, this library automatically splits output files into gzipped segments of approximately 1GB. You can customize the segment size using the configuration key warc.output.segment.size (the value is the target segment size in bytes).
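
As an illustration of the configuration key, the following sketch converts a target size in MiB to bytes and formats it as a -D command-line property. The -D form only takes effect if your job driver parses Hadoop's generic options (for example by running through ToolRunner), which is an assumption about your job. The class SegmentSizeOption is hypothetical and not part of warc-hadoop.

```java
// Hypothetical helper, not part of warc-hadoop: builds the segment size setting.
final class SegmentSizeOption {

    // Configuration key named in this README; its value is a size in bytes.
    static final String KEY = "warc.output.segment.size";

    // Converts MiB to bytes, failing loudly on overflow instead of wrapping around.
    static long mebibytesToBytes(long mebibytes) {
        return Math.multiplyExact(mebibytes, 1024L * 1024L);
    }

    // Formats the setting as a generic -D option for the hadoop command line.
    static String asGenericOption(long mebibytes) {
        return "-D" + KEY + "=" + mebibytesToBytes(mebibytes);
    }

    public static void main(String[] args) {
        // Target segments of roughly 512 MiB instead of the default of about 1GB.
        System.out.println(asGenericOption(512));
    }
}
```

In a driver that builds its own JobConf, the same byte value could instead be set directly, for example with job.setLong("warc.output.segment.size", bytes).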

Documentation

Javadocs
