John Wieczorek edited this page Oct 12, 2019 · 51 revisions

Index Workflow: https://github.com/VertNet/dwc-indexer/wiki/Index-Workflow

Index History: https://vertnet.cartodb.com/tables/index_history/table

Task Queue Status: http://dwc-indexer.vertnet-portal.appspot.com/mapreduce/status

Quick Guide

Cost before:

https://console.aws.amazon.com/billing/home?#/

Before proceeding, make sure that the table 'resource' is up to date in Carto: update the data set metadata in the VertNet Carto resource_staging table. When complete, use the SQL window to execute

DELETE FROM resource

and then

INSERT INTO resource
(
cartodb_id, title, url, created_at, updated_at, the_geom, eml, dwca, pubdate, orgname, description, emlrights, contact, email, icode, ipt, count, citation, networks, collectioncount, orgcountry, orgstateprovince, orgcity, source_url, migrator, license, lastindexed, gbifdatasetid, gbifpublisherid, doi) 
SELECT 
cartodb_id, title, url, created_at, updated_at, the_geom, eml, dwca, pubdate, orgname, description, emlrights, contact, email, icode, ipt, count::integer, citation, networks, collectioncount, orgcountry, orgstateprovince, orgcity, source_url, migrator, license, lastindexed, gbifdatasetid, gbifpublisherid, doi 
from resource_staging 
where ipt=True and networks like '%Vert%' 
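The two statements above can also be run from a script against the CartoDB SQL API instead of the web SQL window. The sketch below is a hedged illustration: the /api/v2/sql endpoint, the api_key parameter, and the CARTO_API_KEY variable follow the standard CartoDB SQL API convention and are assumptions here, not taken from this wiki.

```shell
# Sketch: run the resource-table rebuild through the CartoDB SQL API instead
# of the web SQL window. The endpoint path (/api/v2/sql), the api_key
# parameter, and the CARTO_API_KEY variable are assumptions following the
# standard CartoDB SQL API convention, not taken from this wiki.
DRY_RUN=1
API="https://vertnet.cartodb.com/api/v2/sql"

run_sql () {
  # $1: the SQL statement to execute
  if [ "$DRY_RUN" = 1 ]; then
    # print the command instead of sending it; $CARTO_API_KEY is left
    # unexpanded so the printed command can be pasted into a shell later
    printf 'curl -s --data-urlencode "q=%s" --data "api_key=$CARTO_API_KEY" %s\n' "$1" "$API"
  else
    curl -s --data-urlencode "q=$1" --data "api_key=$CARTO_API_KEY" "$API"
  fi
}

run_sql "DELETE FROM resource"
# column lists abbreviated for the sketch; use the full INSERT ... SELECT shown above
run_sql "INSERT INTO resource (cartodb_id, title) SELECT cartodb_id, title FROM resource_staging WHERE ipt=True AND networks LIKE '%Vert%'"
```

With DRY_RUN=1 nothing is sent; unset it only after checking the printed commands.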

Start EC2 Instance:

ec2-run-instances ami-5cebac35 -n 1 -k vertnetfmnh --instance-type m3.xlarge -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY

If you get a message such as

Missing argument for option:O (use -h for usage)

then the variables AWS_ACCESS_KEY and AWS_SECRET_KEY are not defined. Create entries for them in ~/.bashrc such as:

export AWS_ACCESS_KEY=[20 character Amazon Web Services Access Key]
export AWS_SECRET_KEY=[41 character secret key for Amazon Web Services]

and then run:

source ~/.bashrc

Log in to EC2 instance:

ssh -i ~/.ssh/vertnetfmnh.pem ubuntu@[].compute-1.amazonaws.com

where [] is the ip address of the instance, which can be found at:

https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:

Check what is already in /mnt/beast. If anything is there, determine where it came from before removing it. Clean the directory, then bootstrap the instance with the software needed for harvest and configure access to the Google Cloud Storage bucket to harvest into:

WARNING: Be very careful that you are actually logged into the AWS server and not in a shell on your local machine before running the following commands. rm -r * is VERY dangerous.

cd /mnt/beast
{careful} rm -r *
cd /home/ubuntu
cp /home/ubuntu/gulo/dev/bootstrap.sh .
chmod a+x bootstrap.sh
./bootstrap.sh
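Given the warning above, a defensive wrapper can make the cleanup step harder to run in the wrong place. This is an illustrative sketch, not part of the gulo tooling; the function name and whitelist are assumptions.

```shell
# Hypothetical guard around the dangerous cleanup step (not part of gulo):
# only empty a directory if it is on an explicit whitelist and actually
# exists, and never remove the mount point itself.
safe_clean () {
  target="$1"
  case "$target" in
    /mnt/beast|/tmp/*) : ;;   # whitelist is illustrative; extend as needed
    *) echo "refusing to clean $target" >&2; return 1 ;;
  esac
  [ -d "$target" ] || { echo "$target is not a directory" >&2; return 1; }
  # delete the contents only, leaving the directory/mount point in place
  find "$target" -mindepth 1 -maxdepth 1 -exec rm -rf {} +
}
```

So `safe_clean /mnt/beast` replaces the bare `rm -r *`, and running it against any other path simply refuses.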

gsutil config

Follow the instructions given, using the authentication code for the project_id "vertnet-portal".

Get the latest harvesting code:

cd /home/ubuntu/gulo
git checkout master
git pull origin

If there have been changes to the Darwin Core Reader that need to be incorporated, update the dwca-reader2-clj project (https://github.com/VertNet/dwca-reader-clj) to use the latest release of the Darwin Core Reader in https://github.com/VertNet/dwca-reader-clj/blob/master/project.clj,

                 [org.gbif/dwca-reader "1.20-SNAPSHOT"]

then set the version of dwca-reader2-clj in the file gulo/project.clj to point to the latest release on Clojars at https://clojars.org/dwca-reader2-clj/dwca-reader-clj

                 [dwca-reader2-clj/dwca-reader-clj "0.20-SNAPSHOT"]
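The version bumps above can be scripted. The helper below is a hypothetical sketch (not part of the gulo repository) that rewrites the dwca-reader2-clj version string in a project.clj:

```shell
# Illustrative helper (not from the gulo repository): bump the
# dwca-reader2-clj dependency version in a project.clj file in place,
# keeping a .bak copy of the original.
bump_dwca_reader () {
  # $1: path to project.clj   $2: new version string
  sed -i.bak -E \
    "s|(\[dwca-reader2-clj/dwca-reader-clj \")[^\"]+(\"\])|\1$2\2|" "$1"
}
```

For example, `bump_dwca_reader gulo/project.clj "0.21-SNAPSHOT"` (the version string here is hypothetical; use whatever the latest release on Clojars actually is).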

Check whether there is space on the default 40 GB drive (roughly 11.33 KB per record, about 11 GB per million records; 26% of the default drive is taken by the system).
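Using the figures above, a quick back-of-envelope check of the space an expected record count will need (the 11.33 KB-per-record figure is this page's estimate):

```shell
# Back-of-envelope disk estimate using this page's figure of ~11.33 KB
# per record (so ~11 GB per million records).
records=1000000
awk -v n="$records" 'BEGIN { printf "%.1f GB\n", n * 11.33 / 1024 / 1024 }'
# prints "10.8 GB"
```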

Create the list of resources to harvest:

vim /home/ubuntu/resource_list.txt

Add resources, one per line. For example:

http://ipt.vertnet.org:8080/ipt/resource.do?r=ccber_mammals

Ready to harvest:

cd /home/ubuntu/gulo
lein repl

Once lein is running:

(use 'gulo.harvest)
(in-ns 'gulo.harvest)
(harvest-all "/mnt/beast" :path-file "/home/ubuntu/resource_list.txt")

If the processing of any resources fails to show the number of records found, such as in the example below, the url provided in the resource_list.txt file is not exactly the same as that in the resource or resource_staging table. This is usually because the Carto table entry does not contain the '.do' in the URL. Correct the discrepancy, using '.do' everywhere, clean up the extraneous Google cloud storage folders, and run any failed harvests again.

"Downloading records from http://ipt.vertnet.org:8080/ipt/resource.do?r=msb_herp"
" records found"
"Writing to /mnt/beast/2017-11-07/null-msb_herp/msb_herp.csv"

If you need to synchronize the resource table in CartoDB with the resource_staging table and the latest metadata from the IPT resources registered there, do the following before harvest-all:

(sync-resource-table)

Note: If any resource log shows " records found" without a count, or any Google Cloud Storage folder for a dataset starts with "null", the most likely cause is a mismatch between the URL given in the resource_list.txt file and the Carto URL fields in resource. The IPT resolves the endpoints "resource?r=" and "resource.do?r=" to the same location, but if the URLs do not match exactly, gulo can't get the information out of the resource table in Carto and the resource harvest is compromised. Make sure the URLs match. A good convention is to always use the pattern "resource?r=" without the ".do".
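Following that convention, a small sketch (hypothetical helper, not part of the gulo tooling) that normalizes a resource list to the "resource?r=" form and strips trailing periods and whitespace, which would also break the exact-match lookup:

```shell
# Sketch: normalize a resource list to the recommended "resource?r=" form
# and strip trailing periods/whitespace, which also break the exact-match
# lookup. Edits the file in place, keeping a .bak copy. The helper name is
# illustrative, not part of the gulo tooling.
normalize_resource_list () {
  # $1: path to resource_list.txt
  sed -i.bak -e 's|resource\.do?r=|resource?r=|g' \
             -e 's|[. ]*$||' "$1"
}
```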

If no errors, done. Terminate the EC2 instance.

The harvester writes files to Google Cloud Storage at:

https://cloud.google.com/console/project/apps~vertnet-portal/storage/vertnet-harvesting/data/

The harvester automatically makes a folder named for the date the harvest was begun, such as:

https://cloud.google.com/console/project/apps~vertnet-portal/storage/vertnet-harvesting/data/2016-07-15/

Folders are made for each harvested data set using a concatenation of the icode field in the CartoDB resource table and the resource short name extracted from the url field in the CartoDB resource table, to create a folder such as:

vertnet-harvesting/data/2016-07-15/AM-am_all

Within the folder for the resource are one or more files, each containing at most 10,000 records, prepared for post-harvest processing (https://github.com/VertNet/post-harvest-processor). These files have the usual names produced by a split command (aa, ab, ac, etc.).
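The chunk naming follows the standard behavior of the Unix split command. A self-contained sketch of the convention (paths are illustrative, not the harvester's actual invocation):

```shell
# Self-contained sketch of the chunking convention: split a CSV into files
# of at most 10000 lines with the default aa, ab, ac... suffixes, then
# recombine them. Paths are illustrative, not the harvester's own.
seq 1 25000 > /tmp/gr_records.csv      # stand-in for a harvested CSV
mkdir -p /tmp/gr_chunks
split -l 10000 /tmp/gr_records.csv /tmp/gr_chunks/
ls /tmp/gr_chunks                      # aa ab ac
cat /tmp/gr_chunks/aa /tmp/gr_chunks/ab /tmp/gr_chunks/ac > /tmp/gr_rejoined.csv
```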

Harvesting Details

Harvesting begins with the content of the resource_staging table in CartoDB. This table is updated manually as new resources become harvestable. The table is used by gulo to harvest metadata from the urls of IPTs (ipt=true) and construct the resource table in CartoDB, from which the harvest is done and from which the portal stats derive.

CartoDB user: vertnet

https://vertnet.cartodb.com/tables/resource_staging/

https://vertnet.cartodb.com/tables/resource/

EC2 Tools and Authorization (one-time)

If not done previously, download and install EC2 API Tools from http://aws.amazon.com/developertools/351

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/SettingUp_CommandLine.html

http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html

Set up the access keys (AWS_ACCESS_KEY and AWS_SECRET_KEY) in ~/.bashrc and paths in ~/.bash_profile.

Login to Amazon Elastic Compute Console at: https://console.aws.amazon.com/ec2/v2/home?region=us-east-1

Go to the Key Pairs panel. For a new access point (a new dev machine, for example), click Create Key Pair and give it a new name. This will download the private key to a file named [keyname].pem on the machine you're logged in from. Move the file to the folder ~/.ssh and change the permissions of the .pem file to read/write for the user only:

chmod 0600 ~/.ssh/vertnetjrw.pem

For further reference see: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html

Harvesting using EC2

Make sure that the environment is set up with variable AWS_ACCESS_KEY and AWS_SECRET_KEY in ~/.bashrc similar to the following:

export AWS_ACCESS_KEY=YOURAWSACCESSKEYHERE
export AWS_SECRET_KEY=YOURAWSSECRETKEYHERE

Use a valid key pair (vn2013-12-17 in the following example; see above) to launch an instance of the image ami-5cebac35 created specifically for harvesting:

ec2-run-instances ami-5cebac35 -n 1 -k vn2013-12-17 --instance-type m3.xlarge -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY

If you get an error such as the following,

Missing argument for option:O (use -h for usage)

the access key variables have not been sourced. Make sure the variables are in ~/.bashrc and then

source ~/.bashrc

Make note of the charges on the EC2 account in order to get the difference after harvesting. Note that charges are mostly for instance time, at $0.45 per hour.

https://portal.aws.amazon.com/gp/aws/developer/account?ie=UTF8&action=activity-summary

Confirm the instance is running at:

https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:

Click on the instance to see the details, including the DNS entry, which can also be determined from:

ec2-describe-instances -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY

It should be something of the form: ec2-50-19-163-49.compute-1.amazonaws.com
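The DNS name can be extracted from captured output mechanically. This sketch assumes only that a name of the form shown above appears somewhere in the ec2-describe-instances output:

```shell
# Sketch: pull the public DNS name out of captured ec2-describe-instances
# output. Assumes only that a name of the form shown above appears in it.
get_public_dns () {
  grep -oE 'ec2-[0-9]+(-[0-9]+)+\.compute-1\.amazonaws\.com' | head -n 1
}
```

For example: `ec2-describe-instances -O $AWS_ACCESS_KEY -W $AWS_SECRET_KEY | get_public_dns`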

ssh into that instance using the key pair name, user ubuntu, and DNS entry:

ssh -i ~/.ssh/vertnetjrw.pem ubuntu@ec2-50-19-163-49.compute-1.amazonaws.com

Once logged in, from the home directory, bootstrap the instance with software needed for harvest:

curl -O https://raw.github.com/VertNet/gulo/master/dev/bootstrap.sh
chmod a+x bootstrap.sh
./bootstrap.sh

At this point there should be a folder called gulo. Make sure that has the latest code:

cd gulo
git checkout master
git pull origin

The default Amazon Machine Image (AMI) has two volumes, of 8 GB and 40 GB respectively; 40 GB is not enough for a full harvest. As of 14 Nov 2013, with 108 resources and 9.5M records totaling 21.14 GB, 100 GB of disk space was sufficient. If space runs out during harvest, errors such as the following will occur:

"ERROR: Resource http://ipt.vertnet.org:8080/ipt/resource.do?r=kstc_schmidt_birds (Failed to create directory within 10000 attempts (tried 1384385786425-0 to 1384385786425-9999))" IllegalStateException Failed to create directory within 10000 attempts (tried 1384385786425-0 to 1384385786425-9999) com.google.common.io.Files.createTempDir (Files.java:443)

To avoid this, before harvesting, create a new volume with enough space in the Volumes panel of the EC2 console (Create Volume; standard configuration, no snapshot). It must be in the same region as the instance. Attach it to the instance you created and determine the attachment location in /dev, such as /dev/xvdf. Make the file system on the drive and mount it with owner ubuntu:

sudo mkfs -t ext3 /dev/xvdf
sudo mount /dev/xvdf /mnt/beast
sudo chown ubuntu:ubuntu /mnt/beast

There should now be a folder called /mnt/beast with owner ubuntu.
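Before harvesting, it is worth confirming that the directory is usable and has enough free space. A hedged sketch (the function name and threshold are illustrative; the ~11 GB-per-million-records rule of thumb is from this page):

```shell
# Sketch: sanity-check a harvest directory before starting. The function
# name is illustrative; the space requirement is a parameter (this page
# suggests roughly 11 GB per million records as a rule of thumb).
check_harvest_dir () {
  # $1: directory   $2: required free space in GB
  dir="$1"; need_gb="$2"
  [ -d "$dir" ] || { echo "$dir missing" >&2; return 1; }
  free_kb=$(df -Pk "$dir" | awk 'NR==2 { print $4 }')
  [ "$free_kb" -ge $(( need_gb * 1024 * 1024 )) ] || {
    echo "only $(( free_kb / 1024 / 1024 )) GB free in $dir" >&2
    return 1
  }
}
```

For example, `check_harvest_dir /mnt/beast 100` before a full harvest.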

Following the guidelines at https://developers.google.com/storage/docs/gsutil_install#authenticate, run

gsutil config

use the authentication code for the project_id "vertnet-portal".

If the harvest will be for a subset of resources, create a list of the access points for the resources in the file /home/ubuntu/resource_list.txt, with one IPT resource URL per line. For example,

http://ipt.vertnet.org:8080/ipt/resource.do?r=ccber_mammals

Ready to harvest:

cd gulo
git checkout master
git pull origin
(screen -m)
lein repl

when the repl comes up with the prompt:

user=> (use 'gulo.harvest)

A harmless SLF4J warning that can be ignored will appear:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
nil
user=>

Enter

user=> (in-ns 'gulo.harvest)

The response will be:

#<Namespace gulo.harvest>
gulo.harvest=>

Note the time if you want to track progress and then harvest:

To harvest a list of resources from the resource_list.txt file:

gulo.harvest=> (harvest-all "/mnt/beast" :path-file "/home/ubuntu/resource_list.txt")

To harvest all registered resources, sync the resource_staging table on CartoDB and populate the resource table at https://vertnet.cartodb.com/tables/resource.

gulo.harvest=> (sync-resource-table)

Check the table to be sure that the counts are all correct. Older IPTs have had trouble supplying metadata, though there was no count problem as of 14 Nov 2013. If there is a problem, set the counts in the resource table to match the counts in the IPT source.

gulo.harvest=> (harvest-all "/mnt/beast")

One can detach/reattach screen to monitor progress as well as ssh in and out.

One can monitor progress on Google Cloud Storage as well. Data will go to a folder with today's date in the form yyyy-mm-dd that can be accessed from the Google Cloud Storage Console:

https://cloud.google.com/console/project/apps~vertnet-portal/storage/vertnet-harvesting/data/

or with gsutil, using, for example:

gsutil ls gs://vertnet-harvesting/data/2014-01-19

Harvest should finish up with something like this:

"Downloading records from http://ipt.vertnet.org:8080/ipt/resource.do?r=amnh_mammals"
"290333 records found"
"Writing to /mnt/beast/2013-12-17/amnh_mammals-80b02e05-a008-4a71-a424-cf0937558e71/amnh_mammals-80b02e05-a008-4a71-a424-cf0937558e71.csv"
"Done harvesting amnh_mammals"
"Copying /mnt/beast/2013-12-17/amnh_mammals-80b02e05-a008-4a71-a424-cf0937558e71 to gs://vn-harvesting/data/2013-12-17/"
"Harvest complete."
nil

Set the Access Control List to make the harvested files publicly readable:

dev:gulo tuco$ gsutil -m setacl -R -a public-read gs://vn-harvesting
You are using a deprecated alias, "setacl", for the "acl" command.
Please use "acl" with the appropriate sub-command in the future. See
"gsutil help acl" for details.

Remember to terminate the EC2 instance (https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances:) and delete any associated volumes (https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Volumes) to avoid incurring ongoing costs when the instance is not being used.