Jupyter Notebooks

Matthew J Collins edited this page Jun 21, 2017 · 25 revisions

Service Description

Jupyter notebooks that are backed by GUODA's Spark and HDFS infrastructure are currently available at http://jupyter.guoda.bio.

This is an alpha pre-release service. It may be unavailable at times or behave unexpectedly. It runs on a small cluster donated by the ACIS lab and has limited resources. We are very interested in hearing about your experiences with it so we can improve the service and make it more useful to biodiversity researchers. We also welcome any help you can provide in formatting data sets.

Getting started

  1. Create a GitHub user account if you don't already have one.
  2. Go to http://jupyter.guoda.bio/ and log in. (Note that this currently redirects to an idigbio.org domain; that is expected.)
  3. Log in with your GitHub credentials.
  4. Start your Jupyter server if this is the first time you're visiting.
  5. You'll now be presented with a web-based view of the files in your home directory on the Jupyter server.
  6. Navigate to the Examples directory by clicking on it.
  7. Navigate to the collection_date_graph directory.
  8. We suggest starting with the 01_iDigBio_Specimens_Collected_Over_Time.ipynb file: click it to launch the Jupyter notebook, then read through its contents to learn how Jupyter and Spark work together.
  9. When you're done, please choose File | Close and halt on the menu in your notebook. This frees up the cluster resources for other users.

Available data sets

A number of data sets are generated and stored on the HDFS file system. Below is a list of the paths to use to access them.

The 100k and 1M row subsets contain a semi-random sampling of the larger data sets, which makes them ideal for rapidly testing algorithms and code without occupying the whole cluster.

| Source | Date | Full Path of Latest |
| --- | --- | --- |
| iDigBio | 2017-06-09 | /guoda/data/idigbio-20170609T073048.parquet |
| iDigBio 100k rows | 2017-06-09 | /guoda/data/idigbio-20170609T073048-100k.parquet |
| iDigBio 1M rows | 2017-06-09 | /guoda/data/idigbio-20170609T073048-1M.parquet |
| iDigBio Media Records | 2017-06-11 | /guoda/data/idigbio-media-20170611T154222.parquet |
| iDigBio Media Records 100k rows | 2017-06-11 | /guoda/data/idigbio-media-20170611T154222-100k.parquet |
| iDigBio Media Records 1M rows | 2017-06-11 | /guoda/data/idigbio-media-20170611T154222-1M.parquet |
| BHL | 2017-01-01 | /guoda/data/bhl-20170101-0559.parquet |
| GBIF | unknown | /guoda/data/gbif-idigbio.parquet |
| elevation | unknown | /guoda/data/elevation.parquet |
| GloBI | 2017-04-30 | /tmp/jhpoelen/globi.parquet |
| GBIF Backbone taxonomy | 2017-05-05 | /tmp/jhpoelen/gbif-backbone.parquet |
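
The paths listed above can be read into Spark DataFrames from a notebook. The following is a minimal sketch, not the service's documented method. It assumes the cluster runs a Spark version that provides `SparkSession` (Spark 2.0 or later), and the helper names `subset_path` and `load_dataset` are hypothetical, introduced only for illustration. The `-100k` and `-1M` suffix convention is inferred from the iDigBio paths listed here; the other data sets have no listed subsets.

```python
# Full iDigBio path, copied from the list of available data sets.
IDIGBIO_FULL = "/guoda/data/idigbio-20170609T073048.parquet"


def subset_path(full_path, subset):
    """Derive the path of a row-limited subset from a full data set path.

    Assumption: subsets follow the "<name>-100k.parquet" / "<name>-1M.parquet"
    pattern seen in the iDigBio entries; this is inferred, not documented.
    """
    if subset not in ("100k", "1M"):
        raise ValueError("subset must be '100k' or '1M'")
    suffix = ".parquet"
    if not full_path.endswith(suffix):
        raise ValueError("expected a path ending in .parquet")
    return full_path[: -len(suffix)] + "-" + subset + suffix


def load_dataset(path):
    """Read a Parquet data set from HDFS into a Spark DataFrame."""
    # Imported here so subset_path() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    # getOrCreate() reuses a session if the notebook already started one.
    spark = SparkSession.builder.appName("guoda-example").getOrCreate()
    return spark.read.parquet(path)
```

In a notebook cell, something like `df = load_dataset(subset_path(IDIGBIO_FULL, "100k"))` followed by `df.printSchema()` would be a reasonable first step. Starting with the 100k subset keeps the job small while you check your code.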