Analyzes HTML content for RDFa, Microdata and Microformats


This is a library that analyzes the CommonCrawl dataset for structured
data expressed as RDFa, Microdata or Microformats.
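For illustration, each of these three syntaxes leaves a characteristic fingerprint in the HTML markup. Here's a minimal, self-contained sketch (not part of this library, and not its actual API; the class and method names are hypothetical) of recognizing those fingerprints with simple attribute-pattern matching:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: detect the three structured-data syntaxes by their
// characteristic HTML attribute markers. A real analyzer would parse the DOM.
public class StructuredDataSniffer {

    // RDFa uses attributes such as typeof=, property=, vocab=
    private static final Pattern RDFA =
        Pattern.compile("\\b(typeof|property|vocab)\\s*=", Pattern.CASE_INSENSITIVE);

    // Microdata uses itemscope, itemtype=, itemprop=
    private static final Pattern MICRODATA =
        Pattern.compile("\\bitem(scope|type|prop)\\b", Pattern.CASE_INSENSITIVE);

    // Microformats use well-known class names such as vcard, hcard, h-entry
    private static final Pattern MICROFORMATS =
        Pattern.compile("class\\s*=\\s*\"[^\"]*\\b(vcard|hcard|h-card|h-entry|hreview)\\b",
                        Pattern.CASE_INSENSITIVE);

    // Returns the list of formats whose markers appear in the given HTML.
    public static List<String> detectFormats(String html) {
        List<String> found = new ArrayList<>();
        if (RDFA.matcher(html).find()) found.add("RDFa");
        if (MICRODATA.matcher(html).find()) found.add("Microdata");
        if (MICROFORMATS.matcher(html).find()) found.add("Microformats");
        return found;
    }

    public static void main(String[] args) {
        System.out.println(detectFormats(
            "<div itemscope itemtype=\"http://schema.org/Person\">" +
            "<span itemprop=\"name\">Jane</span></div>"));
        // prints [Microdata]
    }
}
```

In a Hadoop job, a check like this would typically run inside the mapper to tag each crawled page with the formats it contains.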

To build

You'll need to have Apache Ant installed, and once you do, just run:

# ant dist

This compiles the libraries and Hadoop code into an Elastic MapReduce-friendly
JAR at dist/lib/StructuredDataAnalyzer.jar, suitable for use in a custom
JAR-based Elastic MapReduce workflow.

To run locally

You'll need to be running Hadoop. If you don't have it installed, Cloudera
provides a useful set of OS-specific Hadoop packages that make setup easy;
check out their site.

Once you've got Hadoop installed, you can use the 'hadoop jar' command to
execute the analyzer.  Here's the pattern:

hadoop jar <checkout location>/dist/lib/StructuredDataAnalyzer.jar \
   <Amazon AWS access key ID> \
   <Amazon AWS secret access key> \
   <CommonCrawl crawl files to use as input> \
   <HDFS output location>
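Once the job finishes, its results land in the HDFS output location. A common way to inspect them with the standard Hadoop CLI (assuming the job writes the usual part-* files) is:

```shell
# List the job's output directory, then dump the part files.
hadoop fs -ls <HDFS output location>
hadoop fs -cat <HDFS output location>/part-*
```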