Jupyter notebooks that are backed by GUODA's Spark and HDFS infrastructure are currently available at http://jupyter.guoda.bio.
This is an alpha pre-release service. It may be down at times or behave unexpectedly. It runs on a small cluster donated by the ACIS lab and has limited resources. We are very interested in hearing about your experiences with it so we can improve the service and make it more useful to biodiversity researchers. We are also interested in any help you can provide in formatting data sets.
- Create a GitHub user account if you don't already have one.
- Go to http://jupyter.guoda.bio/ and log in. (Note that this currently redirects to an idigbio.org domain; that is expected.)
- Log in with your GitHub credentials.
- Start your Jupyter server if this is the first time you're visiting.
- You'll now be presented with a web-based view of the files in your home directory on the Jupyter server.
- Navigate to the Examples directory by clicking on it.
- Navigate to the collection_date_graph directory.
- We suggest you start by clicking on the 01_iDigBio_Specimens_Collected_Over_Time.ipynb file to launch the Jupyter notebook and read through its contents to start learning how Jupyter and Spark work together.
- When you're done, please choose File | Close and halt on the menu in your notebook. This frees up the cluster resources for other users.
Available data sets
A number of data sets are generated and stored in HDFS. Below is a list of the paths to use to access them.
The 100k and 1M row subsets contain a semi-random sample of the larger data sets. They are ideal for testing algorithms and code quickly without occupying the whole cluster.
| Source | Date | Full Path of Latest |
|---|---|---|
| iDigBio 100k rows | 2017-06-09 | /guoda/data/idigbio-20170609T073048-100k.parquet |
| iDigBio 1M rows | 2017-06-09 | /guoda/data/idigbio-20170609T073048-1M.parquet |
| iDigBio Media Records | 2017-06-11 | /guoda/data/idigbio-media-20170611T154222.parquet |
| iDigBio Media Records 100k rows | 2017-06-11 | /guoda/data/idigbio-media-20170611T154222-100k.parquet |
| iDigBio Media Records 1M rows | 2017-06-11 | /guoda/data/idigbio-media-20170611T154222-1M.parquet |
| GBIF Backbone taxonomy | 2017-05-05 | /tmp/jhpoelen/gbif-backbone.parquet |
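
As a rough illustration of how the paths in the table above might be used from a notebook, the sketch below wraps them in a small lookup and reads one with Spark's Parquet reader. The short keys, the `load_dataset` and `get_session` helpers, and the application name are hypothetical names introduced here, not part of GUODA. The sketch also assumes Spark 2.x or later, where `SparkSession` is available; on older installations `sqlContext.read.parquet(path)` should behave the same way.

```python
# Paths copied from the table above. The short keys are hypothetical labels
# introduced for this sketch; GUODA itself does not define them.
GUODA_PATHS = {
    "idigbio-100k": "/guoda/data/idigbio-20170609T073048-100k.parquet",
    "idigbio-1M": "/guoda/data/idigbio-20170609T073048-1M.parquet",
    "idigbio-media": "/guoda/data/idigbio-media-20170611T154222.parquet",
    "idigbio-media-100k": "/guoda/data/idigbio-media-20170611T154222-100k.parquet",
    "idigbio-media-1M": "/guoda/data/idigbio-media-20170611T154222-1M.parquet",
    "gbif-backbone": "/tmp/jhpoelen/gbif-backbone.parquet",
}


def get_session(app_name="guoda-example"):
    """Create or reuse a SparkSession (assumes Spark 2.x or later)."""
    # Imported inside the function so this module still loads where
    # pyspark is not installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def load_dataset(spark, key):
    """Read one of the listed data sets into a Spark DataFrame.

    `spark` is assumed to be a SparkSession; `key` must be one of the
    hypothetical labels in GUODA_PATHS.
    """
    return spark.read.parquet(GUODA_PATHS[key])
```

In a notebook cell this could be used as `df = load_dataset(get_session(), "idigbio-100k")`, followed by `df.printSchema()` to see the available columns. Starting with a 100k subset and switching the key to a larger data set once the code works keeps cluster usage low while you experiment.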