
Much of this page is out of date. Use the Harvest Workflow document instead to understand how harvesting is accomplished.

Overview

Gulo can be run locally on small data sets using the REPL, or on a Hadoop cluster for big data sets. This wiki page describes how to run it manually on Amazon Elastic MapReduce. Down the road we'll use Pallet for automated provisioning and deployment.

The workflow is:

  1. Harvest Darwin Core Archives locally into a single CSV file and upload to S3
  2. Compile Gulo into a standalone JAR and upload to S3
  3. Create and run a MapReduce job using Amazon AWS console page
  4. Download MapReduce outputs locally and upload to CartoDB

S3 Setup

The following buckets and folders are required on S3:

guloharvest
gulohfs/
  occ
  tax
  loc
  taxloc
gulojar
gulologs
gulotables
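
If these don't exist yet, s3cmd can create the buckets for you (a sketch; S3 bucket names are global, so the names above may already be taken in your account and need adjusting). The folders under gulohfs are just key prefixes and will be populated by the MapReduce outputs:

$ s3cmd mb s3://guloharvest
$ s3cmd mb s3://gulohfs
$ s3cmd mb s3://gulojar
$ s3cmd mb s3://gulologs
$ s3cmd mb s3://gulotables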

Harvest

First let's harvest all Darwin Core Archives listed in the publishers table on CartoDB into a single CSV file and then upload it to Amazon S3.

Fire up your Clojure REPL:

$ lein repl

Then use these commands to harvest:

user=> (use 'gulo.main)
user=> (use 'gulo.harvest)
user=> (Harvest (publishers) "/mnt/hgfs/Data/vertnet/gulo/harvest")

When that's done, you'll have all the records from all the Darwin Core Archives in a single CSV file at /mnt/hgfs/Data/vertnet/gulo/harvest/dwc.csv. Let's upload that to the guloharvest bucket on S3 using s3cmd:

$ s3cmd put /mnt/hgfs/Data/vertnet/gulo/harvest/dwc.csv s3://guloharvest/dwc.csv

Compile

Next we need to compile Gulo and upload the resulting JAR to S3. Make sure you have lein installed and then:

$ lein do clean, deps, uberjar

That compiles Gulo into a standalone JAR in the target/ directory. Let's upload it to S3 using s3cmd:

$ s3cmd put target/gulo-0.1.0-SNAPSHOT-standalone.jar s3://gulojar/gulo-0.1.0-SNAPSHOT-standalone.jar

MapReduce

Now we have the data and JAR uploaded to S3, so let's create a MapReduce job. Go to the Elastic MapReduce console and click the Create New Job Flow button.

The first step is Define Job Flow, where in the Create a Job Flow menu you select Custom JAR and then click Continue.

The second step is Specify Parameters, where you set JAR Location to gulojar/gulo-0.1.0-SNAPSHOT-standalone.jar and JAR Arguments to gulo.main.Shred s3n://guloharvest s3n://gulohfs s3n://gulotables, and then click Continue.

The third step is Configure EC2 Instances where you can keep the defaults and then click Continue.

The fourth step is Advanced Options, where you can keep the defaults except for Amazon S3 Log Path (Optional), which you set to s3n://gulologs/shred, and then click Continue.

The fifth step is Bootstrap Options where you can keep the defaults and then click Continue.

The last step is Review where you click the Create Job Flow button which fires off the MapReduce job.
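
If you'd rather not click through the console, roughly the same job can be started from the command line. The sketch below assumes the AWS CLI is installed and configured; the AMI version, instance type, and instance count are placeholders rather than values taken from this workflow:

$ aws emr create-cluster \
    --name gulo-shred \
    --ami-version 2.4.2 \
    --instance-type m1.large \
    --instance-count 3 \
    --log-uri s3://gulologs/shred \
    --auto-terminate \
    --steps Type=CUSTOM_JAR,Name=Shred,Jar=s3://gulojar/gulo-0.1.0-SNAPSHOT-standalone.jar,Args=[gulo.main.Shred,s3n://guloharvest,s3n://gulohfs,s3n://gulotables]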

You can monitor the status of the cluster using the Elastic MapReduce console. To monitor the status of the MapReduce job itself, click the cluster, then in the Description tab copy the Master Public DNS Name into a browser window and append port :9100 (the Hadoop JobTracker web UI).
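
For example, if the Master Public DNS Name were ec2-12-34-56-78.compute-1.amazonaws.com (a made-up address), you'd point your browser at http://ec2-12-34-56-78.compute-1.amazonaws.com:9100.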

Download

When the cluster finishes, you can download the results, which are located in the occ, tax, loc, and taxloc folders in the gulohfs bucket. Again, just use s3cmd.
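
For example, to pull everything down into a local directory (the local paths here are just placeholders):

$ s3cmd get --recursive s3://gulohfs/occ/ occ/
$ s3cmd get --recursive s3://gulohfs/tax/ tax/
$ s3cmd get --recursive s3://gulohfs/loc/ loc/
$ s3cmd get --recursive s3://gulohfs/taxloc/ taxloc/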

CartoDB

Finally, fire up your REPL again to prepare CartoDB tables for upload and wire them up after they get uploaded:

$ lein repl

In the REPL use these commands to prepare the table:

user=> (use 'gulo.main)
user=> (PrepareTables)

That will zip up the tables into the /mnt/hgfs/Data/vertnet/gulo/tables directory. You can upload each ZIP file directly to CartoDB using the dashboard. When they are all uploaded, make the tables public, and then back in the REPL wire them up (build indexes, etc.):

user=> (use 'gulo.main)
user=> (WireTables)

BOOM. We're done!