Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
This is a library that analyzes the CommonCrawl dataset for structured data expressed as RDFa, Microdata or Microformats. To build -------- You'll need to have Apache Ant (http://ant.apache.org/manual/install.html) installed, and once you do, just run a: # ant dist This step will compile the libraries and Hadoop code into an Elastic MapReduce- friendly JAR at dist/lib/StructuredDataAnalyzer.jar, suitable for use as a custom JAR-based Elastic MapReduce workflow. To run locally -------------- You'll need to be running Hadoop, and if you don't have it installed, Cloudera provides a useful set of OS-specific Hadoop packages which will make it easy. Check out their site: https://ccp.cloudera.com/display/SUPPORT/Downloads Once you've got Hadoop installed, you can use the 'hadoop jar' task to execute the tutorial code. Here's the pattern: hadoop jar <checkout location>/dist/lib/StructuredDataAnalyzer.jar com.digitalbazaar.analyzer.StructuredDataAnalyzer <Amazon AWS access key ID> <Amazon AWS secret access key> <CommonCrawl crawl files to use as input> <HDFS output location>