Merge the active develop branch back into master #136

Merged 346 commits on Sep 16, 2016
346 commits
3520270
fix project.clj :excludes on org.slf4j
eightysteele Feb 27, 2013
1ba6010
add json to project.clj dep
eightysteele Feb 27, 2013
a27d6aa
cleanup ipt namespace, add docs
eightysteele Feb 28, 2013
0352bc6
leetle more cleanup
eightysteele Feb 28, 2013
2a192ab
add xml->json java stuff
eightysteele Mar 1, 2013
44115fe
Updated deps
robinkraft Mar 2, 2013
4f2bd90
Added functions and queries to facilitate resource synchronization
robinkraft Mar 2, 2013
a1d1ea0
Merge pull request #41 from VertNet/feature/resource-sync
eightysteele Mar 3, 2013
02d6e4c
Merge pull request #42 from VertNet/feature/harvest
eightysteele Mar 3, 2013
76b8c60
hotfix slf deps and gitignore
eightysteele Mar 3, 2013
c63a731
fold sync queries into ResourceTable protocol.
eightysteele Mar 3, 2013
d1c3448
fix resource map keywords
eightysteele Mar 4, 2013
ff40da7
cleanup
eightysteele Mar 4, 2013
c3b2dff
harvest resource/org to cdb
eightysteele Mar 5, 2013
028d910
sinking records to pail
eightysteele Mar 5, 2013
13556db
sink data and metadata with test-data in ipt namespace
eightysteele Mar 5, 2013
a6267f3
First shot at stats queries
robinkraft Mar 6, 2013
c5cb5e7
Corrected handling of uniques; added queries for total-taxa stat
robinkraft Mar 6, 2013
fdd66ef
Added collection and class queries
robinkraft Mar 6, 2013
c95a484
Merge pull request #45 from VertNet/feature/stats-queries
robinkraft Mar 6, 2013
6f26b15
add Resource protocol
eightysteele Mar 6, 2013
7bf887d
Merge branch 'feature/sync' of github.com:VertNet/gulo into feature/sync
eightysteele Mar 6, 2013
2a001cc
Added docs
robinkraft Mar 6, 2013
48b4693
Simplified thrift unpacking by using generic methods in helper functions
robinkraft Mar 7, 2013
5cb0de2
Added tests and future facts for views.clj
robinkraft Mar 7, 2013
3bb366e
Test data is now bootstrapped before tests run
robinkraft Mar 7, 2013
2460b03
Fixed formatting
robinkraft Mar 7, 2013
ec160f7
Merge pull request #46 from VertNet/feature/stats-queries
robinkraft Mar 7, 2013
c25c7e0
checkpoint commit
eightysteele Mar 13, 2013
5690294
Merge branch 'feature/sync' of github.com:VertNet/gulo into feature/sync
eightysteele Mar 13, 2013
4573b1c
add initial sqlite ns, round out sync
eightysteele Mar 13, 2013
655bbbd
Tweaked test helper functions, added tests for most stats queries
robinkraft Mar 13, 2013
2094d2f
Added test data to resources directory to support views-test namespace
robinkraft Mar 13, 2013
dcf7510
Wrote remaining tests for views namespace
robinkraft Mar 14, 2013
942a1c3
Fixed get-OrganizationProperty-id
robinkraft Mar 14, 2013
3f03283
Regenerated test data
robinkraft Mar 14, 2013
3f7c586
Tweaked isValidTarget to support greater depth in pail directory stru…
robinkraft Mar 14, 2013
6d8170d
harvesting new resources
eightysteele Mar 16, 2013
f6da93c
Removed TODO
robinkraft Mar 20, 2013
473d952
wip on edges
eightysteele Mar 25, 2013
b4bb1e6
Adding support for impailing edges
robinkraft Mar 25, 2013
dfb93a1
Edge processing nearly complete, added pail support
robinkraft Mar 26, 2013
56d5da2
Revert tweak to thrift schema
robinkraft Mar 26, 2013
40b51e7
fixy
eightysteele Mar 26, 2013
b5d56fc
Merge branch 'feature/edges' of github.com:VertNet/gulo into feature/…
eightysteele Mar 26, 2013
fba07e8
Impailing DatasetRecordEdge now works correctly
robinkraft Mar 26, 2013
3685c29
boom
eightysteele Mar 26, 2013
1e7e637
Merge branch 'feature/edges' of github.com:VertNet/gulo into feature/…
eightysteele Mar 26, 2013
df42535
dissoc :url from resource
eightysteele Mar 26, 2013
7bcf398
cleanup
eightysteele Mar 26, 2013
792c088
add exception handling for harvest
eightysteele Mar 26, 2013
762af48
make it fast
eightysteele Mar 26, 2013
55b908d
nits
eightysteele Mar 26, 2013
73127ae
mo nits
eightysteele Mar 26, 2013
4054ce3
cleanup prns
eightysteele Mar 26, 2013
186529b
handle resource without guid
eightysteele Mar 26, 2013
f5667af
add support for sinking to s3 path
eightysteele Mar 27, 2013
4ce4007
Added IUnpackable protocol
robinkraft Mar 28, 2013
f5111c6
Can now collapse record properties into one line per record
robinkraft Mar 28, 2013
78ab037
Merge pull request #49 from VertNet/feature/prep-for-bulkload
robinkraft Mar 28, 2013
836c0d1
harvest csv to s3
eightysteele Mar 28, 2013
9d154f9
Merge branch 'feature/edges' of github.com:VertNet/gulo into feature/…
eightysteele Mar 28, 2013
2e92fde
harvest to textline
eightysteele Mar 30, 2013
0115957
status logging
eightysteele Mar 30, 2013
ba3b2ac
handle eml with zero associated parties
eightysteele Mar 30, 2013
ff04ce8
Modifying stats queries to handle wide source textline fields
robinkraft Apr 4, 2013
2498045
Added test data for stats queries
robinkraft Apr 9, 2013
a66b878
Added ?dummy field to end of harvest fields vector
robinkraft Apr 9, 2013
c86aa18
Removed all support for thrift and pail; rewrote queries to parse tex…
robinkraft Apr 9, 2013
f48df56
Merge pull request #50 from VertNet/feature/new-stats-queries
eightysteele Apr 10, 2013
26dea46
Merge pull request #47 from VertNet/feature/stats-queries
eightysteele Apr 10, 2013
3b810fb
wip
eightysteele Apr 10, 2013
07fffa5
Merge pull request #43 from VertNet/feature/sync
eightysteele Apr 10, 2013
822e675
merged in develop
eightysteele Apr 10, 2013
812a459
Merge pull request #48 from VertNet/feature/edges
eightysteele Apr 10, 2013
9f125ef
hi
eightysteele Apr 10, 2013
df0afc9
remove log
eightysteele Apr 10, 2013
048b780
index on develop: df0afc9 remove log
eightysteele Apr 10, 2013
0d0f9e3
WIP on develop: df0afc9 remove log
eightysteele Apr 10, 2013
8da779a
Pulling in changes lost in pull req from feature/new-stats-queries
robinkraft Apr 10, 2013
608fdfe
Merge pull request #51 from VertNet/feature/semicolon
robinkraft Apr 10, 2013
84f86cc
Merge pull request #52 from VertNet/feature/stats
robinkraft Apr 10, 2013
f9426ef
index on feature/semicolon: 0d0f9e3 WIP on develop: df0afc9 remove log
eightysteele Apr 10, 2013
13d7a5d
WIP on feature/semicolon: 0d0f9e3 WIP on develop: df0afc9 remove log
eightysteele Apr 10, 2013
ee3799c
Merge commit 'stash' into feature/s3-update
eightysteele Apr 10, 2013
a6579d3
Merge pull request #53 from VertNet/feature/s3-update
robinkraft Apr 10, 2013
2ac7fad
index on develop: a6579d3 Merge pull request #53 from VertNet/feature…
eightysteele Apr 10, 2013
b12829a
WIP on develop: a6579d3 Merge pull request #53 from VertNet/feature/s…
eightysteele Apr 10, 2013
1838ba8
Merge pull request #54 from VertNet/feature/fix-harvest
robinkraft Apr 10, 2013
31617c1
Final stats queries and test data
robinkraft Apr 10, 2013
bc68488
hotfix add :s3 to harvest-all
eightysteele Apr 10, 2013
68e2644
Added tests for stats view queries
robinkraft Apr 10, 2013
38c14dc
index on develop: bc68488 hotfix add :s3 to harvest-all
eightysteele Apr 15, 2013
9fd0580
WIP on develop: bc68488 hotfix add :s3 to harvest-all
eightysteele Apr 15, 2013
5156caa
delete files after use, name resources with uuid
eightysteele Apr 16, 2013
16725a4
order by cartodb_id
eightysteele Apr 16, 2013
abddec6
Added defmain to run all stats queries, plus supporting functions and…
robinkraft Apr 17, 2013
23b261c
cleanup
eightysteele Apr 17, 2013
c3e631c
Added note about credentials.json for EMR
robinkraft Apr 17, 2013
b8aae9e
Switched input and output to hfs-textline
robinkraft Apr 17, 2013
215791c
ipt namespace function to fill out resource cdb table
eightysteele Apr 18, 2013
27497ed
add columns from resource table to csv outputs
eightysteele Apr 19, 2013
1dc0e3e
consolidate ipt namespace into harvest namespace
eightysteele Apr 20, 2013
5ced891
tidy up sync
eightysteele Apr 20, 2013
1a2cf1e
doall to sync
eightysteele Apr 20, 2013
b287cd2
better harvest logging
eightysteele Apr 20, 2013
0e10322
cleanup
eightysteele Apr 20, 2013
4fb4764
make sync optional
eightysteele Apr 20, 2013
1ae5c17
clear local archives after processing
eightysteele Apr 22, 2013
e6ca8d1
Updated readme re credentials.json
robinkraft Apr 23, 2013
eb955d4
Merge branch 'feature/harvest-all' of github.com:VertNet/gulo into fe…
robinkraft Apr 23, 2013
1abccc0
Modified harvest-fields to match fields added by harvest-resource
robinkraft Apr 23, 2013
4b9560f
Merge pull request #55 from VertNet/feature/stats
robinkraft Apr 23, 2013
8b8172e
Fixed merge conflict
robinkraft Apr 23, 2013
3f69a7c
Modified queries to reflect latest harvest schema
robinkraft Apr 23, 2013
ad9ddae
Modified test data to reflect harvest schema
robinkraft Apr 23, 2013
3a43c5d
Various tweaks to deps and use/require, project now compiles
robinkraft Apr 23, 2013
8720be1
Updated readme to include info on resources/s3.json
robinkraft Apr 23, 2013
a727e00
Consolidated reading credentials json files in utils
robinkraft Apr 23, 2013
77eb50d
Tweaked readme creds filename
robinkraft Apr 23, 2013
c27b92b
Merge branch 'develop' into feature/fix-deps
robinkraft Apr 23, 2013
0fb4d4d
Removed tests for functions and namespace that no longer exist
robinkraft Apr 23, 2013
0eba94c
Merge pull request #60 from VertNet/feature/fix-deps
eightysteele Apr 24, 2013
860aefc
Replace linebreaks in resource properties with spaces during harvesting
robinkraft Apr 24, 2013
06d27a6
Fix merge conflict
robinkraft Apr 24, 2013
70ca51b
Merge pull request #61 from VertNet/feature/fix-59
robinkraft Apr 25, 2013
8c2d6db
require splitline
eightysteele Apr 30, 2013
8132041
merge
eightysteele Apr 30, 2013
50ec6a1
Merge pull request #56 from VertNet/feature/harvest-all
eightysteele May 1, 2013
d76a983
Cleanup namespaces post merge
robinkraft May 1, 2013
c4cbcc7
Merge pull request #62 from VertNet/feature/cleanup
robinkraft May 1, 2013
dfdc70c
hotfix: check for nil values in remove-line-breaks, index props corre…
eightysteele May 1, 2013
3dd1809
Updated test data to remove duplicated icode field
robinkraft May 15, 2013
e996be8
Support relying on teratorn for common functions, fields and columns
robinkraft May 15, 2013
53f31e9
Removed core.clj - now rolled into teratorn.vertnet
robinkraft May 15, 2013
47ab7f0
Removed thrift.clj - no longer used
robinkraft May 15, 2013
3e9b93f
Remove core tests - no longer needed since core ns is gone
robinkraft May 15, 2013
2452d9d
Removed functions now in teratorn; broke out harvest code into functi…
robinkraft May 15, 2013
4b9d985
Added access to teratorn.common functions
robinkraft May 15, 2013
da4ffc8
Simplified harvest functions/queries to facilitate testing; added tests
robinkraft May 15, 2013
e1e542a
Merge pull request #64 from VertNet/feature/support-teratorn
eightysteele May 15, 2013
fa81fad
Fix emlrights keyword in resource-row
robinkraft May 20, 2013
80fc5e5
Fixing up namespaces, tests, remove pail.clj
robinkraft May 20, 2013
b2278da
Removed shredding, re-fixed emlrights keyword
robinkraft May 20, 2013
60c520a
Updated -Xmx to 14 gigs; -Xms stays at 1 gig to ease local testing
robinkraft May 20, 2013
84f445b
Made s3 bucket an argument to harvesting, cleaned up filename business
robinkraft May 20, 2013
21cde78
Fixed optional creds arg for mk-full-s3-path
robinkraft May 20, 2013
d0bc874
Fixed path format for printing
robinkraft May 20, 2013
4d66749
Fix AWS key names
robinkraft May 20, 2013
347005f
Simplify path creation, fix s3 paths
robinkraft May 20, 2013
5f6844b
Merge pull request #66 from VertNet/feature/cleanup-ns
robinkraft May 22, 2013
c05c8e2
Prepend uuid to each record
robinkraft May 22, 2013
0f31f13
Merge pull request #67 from VertNet/feature/id-hotfix
robinkraft May 22, 2013
d291efb
Update README.md
robinkraft May 24, 2013
6c29de6
Bootstrap script for VertNet harvest/table cluster - put on s3
robinkraft May 28, 2013
cbc3b04
Added comment about location on s3
robinkraft May 28, 2013
80aab02
Remove teratorn dep, move field defs into gulo.fields
robinkraft May 29, 2013
c4183ca
Pull in util functions from Teratorn
robinkraft May 29, 2013
1781bfd
Remove references to teratorn, update ns to use new fields, utils fun…
robinkraft May 29, 2013
318ba64
Update formatting, tweak views to use harvestid instead of id per new…
robinkraft May 29, 2013
e7eaf92
Update tests to include functions from Teratorn, remove refs to terat…
robinkraft May 29, 2013
a8c7aa1
Change to include 500 occs of ttrs_mammals using latest harvesting sc…
robinkraft May 29, 2013
0968898
Update views tests to reflect new test data source
robinkraft May 29, 2013
4ceea7e
Merge pull request #68 from VertNet/feature/remove-teratorn-dep
robinkraft May 30, 2013
34063ea
Tweaks for apt-get update, installing gulo and teratorn
robinkraft May 30, 2013
5dc18e3
Merge pull request #69 from VertNet/feature/bootstrap
robinkraft Jun 4, 2013
d15e753
Citation field will now be harvested and will populate resource table
robinkraft Jun 4, 2013
ee6a14f
Formatting
robinkraft Jun 5, 2013
d5a885a
Merge pull request #70 from VertNet/feature/fix-130
robinkraft Jun 5, 2013
bbb1ba3
Restructure use of doall to avoid NPE during harvesting
robinkraft Jun 5, 2013
68fa8c2
Added print statement on completion
robinkraft Jun 5, 2013
a613977
Merge pull request #71 from VertNet/feature/fix-63
robinkraft Jun 5, 2013
acb50b1
Simplified harvesting for easier testing and harvesting only specific…
robinkraft Jun 11, 2013
82c11ef
Added vars for table names
robinkraft Jun 11, 2013
871ca8b
Merge pull request #73 from VertNet/feature/simplify-harvest
robinkraft Jun 18, 2013
39b88dd
Support networks field
robinkraft Jun 18, 2013
c7b3a3e
Typo hotfix - forma -> format
robinkraft Jun 19, 2013
b944319
Full bootstrap of EC2 instance for harvesting and bulkloading
robinkraft Jun 19, 2013
9e2c8f5
Merge pull request #78 from VertNet/feature/instance-config
robinkraft Jun 19, 2013
ce060b9
Merge pull request #76 from VertNet/feature/fix-75
robinkraft Jun 19, 2013
3dcf73f
Hotfix: remove gulo uberjar - using lein repl not hadoop jar ...
robinkraft Jun 19, 2013
7c81abe
Added functions and queries to support season field, simplified harve…
robinkraft Jun 26, 2013
cc4cf0e
Fixed function and var names
robinkraft Jun 26, 2013
88fae6d
Reverting resource->s3 and doing season processing outside of Cascalog
robinkraft Jun 26, 2013
7cb6adc
Use dwca-reader functions to extract season
robinkraft Jun 26, 2013
40ed25e
Add test for harvesting back in
robinkraft Jun 26, 2013
840e9c8
Check for and handle mal-formed lat/lon/months
robinkraft Jun 26, 2013
2484026
Merge pull request #79 from VertNet/feature/seasons
eightysteele Jun 27, 2013
e122943
get-season-idx returns nil if invalid month
robinkraft Jul 12, 2013
44d5be6
Merge pull request #84 from VertNet/feature/fix-season
robinkraft Jul 12, 2013
ae161d1
Scraping record count now returns '-1' on failure
robinkraft Jul 12, 2013
42cd043
Merge pull request #87 from VertNet/feature/fix-86
robinkraft Jul 12, 2013
ddcfe47
Support harvesting list or file of resource urls
robinkraft Jul 13, 2013
dec6936
Add tests for functions for harvesting individual resources
robinkraft Aug 6, 2013
855ef07
Merge pull request #89 from VertNet/feature/fix-80
robinkraft Aug 6, 2013
017520b
Python script splits and uploads harvested resources to google cloud …
robinkraft Aug 7, 2013
74a3e93
Merge pull request #97 from VertNet/feature/fix-96
robinkraft Aug 7, 2013
48243f1
WIP
robinkraft Aug 8, 2013
45e3c96
Integrated python script into harvest-all workflow; removed all seaso…
robinkraft Aug 8, 2013
39557ba
Merge pull request #99 from VertNet/feature/fix-98
robinkraft Aug 14, 2013
7d08938
Modified harvesting to upload each resource immediately to GCS (see #…
robinkraft Aug 20, 2013
8894a3d
Merge pull request #101 from VertNet/feature/fix-100
robinkraft Aug 20, 2013
87058d2
Added support for collectioncount field
robinkraft Sep 9, 2013
bdd0faa
Refactored getting field info - replaced metadata field-specific func…
robinkraft Sep 9, 2013
3caeb56
Fixed IPT resource page parsing to get record count
robinkraft Sep 11, 2013
f0aa0fc
Merge pull request #102 from VertNet/feature/fix-334
robinkraft Sep 11, 2013
44eb581
Merge branch 'develop' into feature/fix-103
robinkraft Sep 11, 2013
d95e7e1
Tweak page parser to work with multiple versions of IPT
robinkraft Sep 11, 2013
dd9c077
Merge pull request #104 from VertNet/feature/fix-103
robinkraft Sep 11, 2013
3c3c013
Now use metadata from CartoDB to populate orgname
robinkraft Sep 11, 2013
c23741b
Merge pull request #105 from VertNet/feature/fix-webapp-358
robinkraft Sep 11, 2013
366aa0b
networks field now string, not array, so use url->field w/o special h…
robinkraft Sep 11, 2013
9209553
Merge pull request #106 from VertNet/feature/fix-webapp-334
robinkraft Sep 11, 2013
c1a3409
Hotfix to :path-file syntax
robinkraft Sep 11, 2013
9598e08
Hotfix to collectioncount test
robinkraft Sep 11, 2013
6570736
Changing location of harvest file storage and access.
Dec 12, 2013
695dd69
remove -p since gsutil reconfigure handles it
eightysteele Dec 13, 2013
0c3bbb0
Merge pull request #109 from VertNet/feature/harvest-update
eightysteele Dec 13, 2013
7c93cf9
support for local harvest
eightysteele Jan 10, 2014
e91b33d
Implementing CartoDB-recommended change from using DELETE (deprecated…
Feb 6, 2014
dd2fd0f
Oops. Take 2 at implementing CartoDB-recommended change from using DE…
Feb 6, 2014
bf7aec7
Merge pull request #112 from VertNet/feature/truncate
Feb 6, 2014
c4f031b
Removing spurious 7 from locationaccordingto field name.
Apr 26, 2014
a8cd02e
Cleaning up fields synchonized and fields in the harvest.
Jul 12, 2014
7a63d3e
Adding migrator field to resource sync.
Jul 16, 2014
0e0d5d3
Typo.
Jul 17, 2014
3a476a2
Merge pull request #120 from VertNet/feature/fieldcleanup
Jul 17, 2014
afc5599
Attempt to pull some metadata from original source and some from Vert…
Jul 17, 2014
28213ad
Merge pull request #121 from VertNet/feature/fieldcleanup
Jul 17, 2014
13e995d
Adding GBIF dataset and publisher ids as well as license and lastinde…
Oct 1, 2014
a0cfa69
Correcting syntax errors.
Oct 1, 2014
b1b688f
Merge pull request #127 from VertNet/feature/syncupdate
Oct 1, 2014
3105713
Update to latest dwca-reader-clj with proper deps
robinkraft Oct 9, 2014
7148432
Merge branch 'develop' of github.com:VertNet/gulo into develop
robinkraft Oct 9, 2014
7da3078
Bump version, dwca-reader-clj version
robinkraft Dec 9, 2014
b5465e3
First attempt to harvest more metadata from CartoDB along with Darwin…
Dec 19, 2014
99b5637
Merge branch 'develop' of https://github.com/VertNet/gulo into develop
Dec 19, 2014
ad290c6
Removed lastindexed from the set of harvested base fields.
Dec 25, 2014
aa4e91d
Amended the list of Darwin Core fields as given by the Darwin Core re…
Dec 25, 2014
119018a
Further filtering harvest sync to include only resource_staging recor…
Dec 25, 2014
dd284bd
Ameding filter for VertNet in networks field.
Dec 25, 2014
eb57094
Report the resource title instead of resource record count (no longer…
Dec 26, 2014
53592ce
Adding doi from CartoDB resource table to harvest fields.
May 22, 2015
1c804ff
Completing doi integration from CartoDB into harvest fields.
May 22, 2015
b521741
Updated project file to use new clojars repo for dwca-reader.
Oct 10, 2015
b74273e
Correcting repo location for dwca-reader2-clj SNAPSHOT 0.20.
Oct 10, 2015
9f84c56
Changed file and folder naming scheme to remove burdensome uuid. Patt…
Jul 15, 2016
3ab93a6
Merge pull request #135 from VertNet/nouuid
tucotuco Jul 15, 2016
b3c33a7
Re-enabled record count display and cleaned up unnecessary sorting on…
Jul 15, 2016
cf1813c
Just updating gitignore.
Sep 16, 2016
12 changes: 12 additions & 0 deletions .gitignore
@@ -1,7 +1,19 @@
vertnet.pem
pom.xml
*jar
/lib/
/classes/
.lein-deps-sum
.lein-plugins
rm-dwca-reader-clj-jars.sh
creds.json
s3.json
aws.json
target/
#*.*
*.*~
*sublime*
\#*.*\#
.lein*
.nrepl*
.DS_Store
56 changes: 53 additions & 3 deletions README.md
@@ -1,4 +1,54 @@
gulo
====
# What is Gulo?

Shredding Darwin Core Archives with ferocity, strength, and Cascalog.
![](http://3.bp.blogspot.com/-s1vAPdg_zZM/TZ3bnzUZgVI/AAAAAAAACKo/Mk-Tu-Nil74/s1600/animalangry.jpg)

Gulo is the genus of the wolverine, the largest land-dwelling member of the weasel family. It is a stocky, muscular carnivore that resembles a small bear. The wolverine has a reputation for endurance, ferocity, and strength out of proportion to its size, and can battle competitors many times larger than itself.

Gulo is also a VertNet project designed for harvesting Darwin Core Archives, shredding them into small pieces, and loading them into [CartoDB](http://cartodb.com). It's written in the [Clojure](http://clojure.org) programming language and rides on [Cascading](http://www.cascading.org) and [Cascalog](https://github.com/nathanmarz/cascalog) for processing "Big Data" on top of [Hadoop](http://hadoop.apache.org) using [MapReduce](http://research.google.com/archive/mapreduce.html).

# Developing
## AWS credentials

Running Gulo queries with Elastic MapReduce requires adding the following to the file `credentials.json` in the project root:

```json
{
  "access-id": "your_aws_access_id",
  "private-key": "your_aws_private_key",
  "key-pair-file": "~/.ssh/vertnet.pem",
  "key-pair": "vertnet"
}
```

Working with the `gulo.cdb` namespace requires the following in `resources/aws.json`:

```json
{
  "access-id": "your_aws_access_id",
  "secret-key": "your_aws_private_key"
}
```

## CartoDB OAuth credentials

Gulo depends on an authenticated connection to CartoDB. This requires adding the following to `resources/creds.json`:

```json
{
  "key": "your_cartodb_oauth_key",
  "secret": "your_cartodb_oauth_secret",
  "user": "your_cartodb_username",
  "password": "your_cartodb_password"
}
```
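
A missing key in one of these credentials files tends to surface only at run time, so a quick check before starting a job can help. The following is a minimal sketch, not part of Gulo: `missing_keys` is a hypothetical helper that does a plain-text `grep` for each expected key name. It is not a real JSON parser, so it will not catch malformed JSON.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Gulo): report which expected keys are
# missing from a JSON credentials file. This is a text search with grep,
# not a JSON parser, so it only confirms that each key name appears.
missing_keys() {
  local file="$1"; shift
  local key missing=""
  for key in "$@"; do
    # Match "key" followed by optional whitespace and a colon.
    if ! grep -Eq "\"${key}\"[[:space:]]*:" "$file"; then
      missing="${missing}${key} "
    fi
  done
  # Print the missing keys (empty line if none) and return non-zero if any are missing.
  printf '%s\n' "${missing% }"
  [ -z "$missing" ]
}
```

For example, `missing_keys resources/creds.json key secret user password` prints nothing and succeeds when all four keys are present, and `missing_keys resources/aws.json access-id secret-key` does the same for the AWS file.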

## Dependencies

To add BOM bytes to UTF-8 files so that CartoDB can detect the encoding, we use the `uconv` program, which can be installed on Ubuntu like this:

```bash
$ sudo apt-get install apt-file
$ sudo apt-file update
$ apt-file search bin/uconv
$ sudo apt-get install libicu-dev
```
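
To make the BOM step concrete, here is a minimal sketch of what "adding BOM bytes" means at the byte level. It deliberately does not use `uconv` and is not how Gulo does it: `add_bom` is a hypothetical helper that prepends the three UTF-8 BOM bytes (EF BB BF) with `printf` and skips files that already start with them.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Gulo): write a copy of a UTF-8 file that
# starts with the UTF-8 byte order mark (EF BB BF), unless one is already there.
add_bom() {
  local src="$1" dst="$2"
  local head_hex
  # Read the first three bytes as hex to see whether a BOM is already present.
  head_hex=$(head -c 3 "$src" | od -An -tx1 | tr -d ' \n')
  if [ "$head_hex" = "efbbbf" ]; then
    cp "$src" "$dst"
  else
    # \357\273\277 is the octal spelling of EF BB BF.
    { printf '\357\273\277'; cat "$src"; } > "$dst"
  fi
}
```

Usage would look like `add_bom harvest.csv harvest-bom.csv`, where both file names are placeholders. The source and destination must be different files, because the destination is truncated before the source is read.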
30 changes: 30 additions & 0 deletions dev/bootstrap.sh
@@ -0,0 +1,30 @@
# configure EMR cluster for use with VertNet projects
# put on S3 at s3://vnproject/bootstrap-actions/gulo/bootstrap.sh

# install some helpful utilities
sudo apt-get update
sudo apt-get install -y screen s3cmd zip unzip

# Setup for git
git config --global user.name "Whizbang Systems"
git config --global user.email "admin@whizbangsystems.net"

# generate ssh key
ssh-keygen -t rsa -N "" -f /home/hadoop/.ssh/id_rsa -C "admin@whizbangsystems.net"
# private keys must not be readable by group/other, or ssh will refuse to use them
sudo chmod 600 /home/hadoop/.ssh/id_rsa

# Add github to known_hosts
echo "github.com,207.97.227.239 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ==" >> /home/hadoop/.ssh/known_hosts


# simple leiningen install via 'li'
echo "alias li='cd /home/hadoop/bin; wget https://raw.github.com/technomancy/leiningen/stable/bin/lein; chmod u+x lein; ./lein; cd /home/hadoop;'" >> /home/hadoop/.bashrc

# simple uberjarring
echo "alias uj='lein do deps, compile :all, uberjar'" >> /home/hadoop/.bashrc

# simple installs & configs
echo "alias gulo='git clone git://github.com/VertNet/gulo.git'" >> /home/hadoop/.bashrc
echo "alias teratorn='git clone git://github.com/MapofLife/teratorn.git'" >> /home/hadoop/.bashrc

echo "alias dl='wget https://gist.github.com/robinkraft/5666682/download'" >> /home/hadoop/.bashrc
111 changes: 111 additions & 0 deletions dev/ec2-bootstrap.sh
@@ -0,0 +1,111 @@
# Run this script to configure an instance for harvesting and bulkloading.

# install a few things

sudo apt-get update
sudo apt-get -y install screen zip unzip git sqlite3
wget http://s3tools.org/repo/deb-all/stable/s3cmd_1.0.0.orig.tar.gz
tar -xvf s3cmd_1.0.0.orig.tar.gz
cd s3cmd-1.0.0
sudo python setup.py install
cd

# Setup for git
git config --global user.name "David Bloom"
git config --global user.email "dbloom@vertnet.org"

# generate ssh key
ssh-keygen -t rsa -N "" -f /home/$USER/.ssh/id_rsa -C "dbloom@vertnet.org"
# private keys must not be readable by group/other, or ssh will refuse to use them
sudo chmod 600 /home/$USER/.ssh/id_rsa

# Add github to known_hosts
echo "github.com,207.97.227.239 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ==" >> /home/$USER/.ssh/known_hosts

# install Java
sudo apt-get -y install openjdk-7-jre
sudo apt-get -y install openjdk-7-jdk

# make ~/bin directory, add to PATH
mkdir ~/bin
echo "export PATH=/home/$USER/bin:${PATH}" >> ~/.bashrc

# install lein
cd ~/bin
wget https://raw.github.com/technomancy/leiningen/stable/bin/lein
chmod u+x lein
./lein
cd ~/

# install app engine sdk
cd bin
wget http://googleappengine.googlecode.com/files/google_appengine_1.8.0.zip
unzip google_appengine_1.8.0.zip
echo "export PATH=/home/$USER/bin/google_appengine:${PATH}" >> ~/.bashrc
cd

# simple uberjarring via uj command
echo "alias uj='lein do deps, compile :all, uberjar'" >> /home/$USER/.bashrc

# clone projects
git clone git://github.com/VertNet/gulo.git
git clone git://github.com/VertNet/webapp.git

# configure EBS volume
sudo mkfs -t ext3 /dev/xvdb
sudo mkdir /mnt/beast
sudo mount /dev/xvdb /mnt/beast
sudo chown $USER:$USER /mnt/beast

# configure credentials

echo "Configuring CartoDB. Please have your credentials ready and press 'enter' to continue."
read na
echo "Oauth key:"
read OAUTH_KEY
echo "Oauth secret:"
read OAUTH_SECRET
echo "Username:"
read USERNAME
echo "Password:"
read CDB_PASSWORD
echo "API key:"
read API_KEY

echo "{
\"key\": \"$OAUTH_KEY\",
\"secret\": \"$OAUTH_SECRET\",
\"user\": \"$USERNAME\",
\"password\": \"$CDB_PASSWORD\",
\"api_key\": \"$API_KEY\"
}" > ~/gulo/resources/creds.json

echo "Configuring AWS. Please have your credentials ready and press 'enter' to continue. Note that backslashes in your AWS credentials may cause errors."
read na
echo "Access key:"
read ACCESS_ID
echo
echo "Secret key:"
read SECRET_KEY
echo

echo "{
\"access-id\": \"$ACCESS_ID\",
\"secret-key\": \"$SECRET_KEY\"
}" > ~/gulo/resources/aws.json

echo "Keep those AWS credentials handy for configuring s3cmd. Press 'enter' to continue"
read na

s3cmd --configure

# configure app engine credentials

echo "Please enter your App Engine email address: "
read EMAIL
echo "export EMAIL=$EMAIL" >> ~/.bashrc

echo "Please enter your App Engine password: "
read GAE_PASSWORD
echo "export GAE_PASSWORD=$GAE_PASSWORD" >> ~/.bashrc
echo "Credentials are now set up."

echo "Instance configured - go have a beer to celebrate!"
9 changes: 9 additions & 0 deletions dev/genthrift.sh
@@ -0,0 +1,9 @@
#!/bin/sh

# Generates Java code from the vn.thrift DSL. Depends on the Apache Thrift compiler.

rm -rf ../src/jvm/gen-java
rm -rf ../src/jvm/gulo/schema/*
thrift -o "../src/jvm" -r --gen java:hashcode gulo.thrift
mv ../src/jvm/gen-java/gulo/schema ../src/jvm/gulo
rm -rf ../src/jvm/gen-java