# Part 1: How to access the GloBI database


**You need to expand a bit on the intro. What is your name and background? Write a brief summary of what you did in both notebooks and why someone would want to read on.**

I am interested in GloBI because it provides data in the form of interactions, unlike the other databases I was exploring. Instead of focusing on one species at a time, it connects different species by describing the interactions between them. By exploring GloBI in depth, we can discover and study patterns in the networks among species.

## What is GloBI?

GloBI does a fantastic job of explaining itself: 
>Global Biotic Interactions (GloBI) provides open access to species interaction data (e.g., predator-prey, pollinator-plant, pathogen-host, parasite-host) by combining existing open datasets using open source software. By providing an infrastructure to capture and share interaction data, individual biologists can focus on gathering new interaction data and analyzing existing datasets without having to spend resources on (re-) building a cyberinfrastructure to do so.

> GloBI is made possible by a community of software engineers, bioinformaticists and biologists. Software engineers such as [Jorrit Poelen](http://linkedin.com/in/jhpoelen), [Göran Bodenschatz](https://github.com/coding46), and [Robert Reiz](https://github.com/reiz) collaborate with bioinformaticists like [Chris Mungall](http://www.mendeley.com/profiles/chris-mungall/), data managers like [Sarah E. Miller](http://github.com/millerse) and biologists like [Jim Simons](http://ccs.tamucc.edu/people/dr-james-d-simons/), [Anne Thessen](http://ronininstitute.org/research-scholars/anne-thessen/), [Jen Hammock](https://orcid.org/0000-0002-9943-2342) and [Brian Hayden](https://sites.google.com/site/haydenresearch/) to capture, provide access to and use interaction data that is provided by [biologists and citizen scientists around the world](/references.html). **GloBI is sustained by an intricate network of thriving open source, open data and open science communities** in addition to receiving donations, grants, awards or being written into grants, including, but not limited to, EOL's [EOL Rubenstein Fellows Program](http://eol.org/info/485) (CRDF EOL-33066-13/F33066, 2013) and the David M. Rubenstein Grant (FOCX-14-60988-1, 2014), and the Smithsonian Institution (SI) (T15CC10297-002, 2016).


## How to access GloBI

The R package [rglobi](https://cran.r-project.org/web/packages/rglobi/index.html) provides programmatic access to the Global Biotic Interactions (GloBI) database.

The package documentation describes it as follows:

> A programmatic interface to the web service methods provided by Global Biotic Interactions (GloBI). GloBI provides access to spatial-temporal species interaction records from sources all over the world. rglobi provides methods to search species interactions by location, interaction type, and taxonomic name. In addition, it supports Cypher, a graph query language, to allow for executing custom queries on the GloBI aggregate species interaction data set.

To use its methods and functions, we need to install and load the `rglobi` package in R.

```r
install.packages("rglobi")
library(rglobi)
```

Users can search species interaction data by location, interaction type, taxonomic name and so on. See the [rglobi vignette](https://cran.r-project.org/web/packages/rglobi/vignettes/rglobi_vignette.html) to learn more about using the package.

Although the R package provides built-in methods and functions, it limits the amount of data returned per request. I experimented with the package for a while, but once I found out that the entire GloBI database is only ~6.5 GB and can easily be downloaded, I decided to take that route for further exploration.



### Accessing all the data

#### Choice 1:
Use pagination, as described in the rglobi vignette: https://github.com/ropensci/rglobi/blob/master/vignettes/rglobi_vignette.Rmd#L410

"By default, the amount of results are limited. If you'd like to retrieve all results, you can used pagination. For instance, to retrieve parasitic interactions using pagination, you can use:


```r
otherkeys = list("limit"=10, "skip"=0)
first_page_of_ten <- get_interactions_by_type(interactiontype = c("hasParasite"), otherkeys = otherkeys)
otherkeys = list("limit"=10, "skip"=10)
second_page_of_ten <- get_interactions_by_type(interactiontype = c("hasParasite"), otherkeys = otherkeys)
```

To exhaust all available interactions, you can keep paging results until the size of the page is less than the limit (e.g., ```nrows(interactions) < limit```)."
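
The paging rule quoted above can be sketched as a loop. This is a minimal Python sketch, not rglobi code: `fetch_page` is a hypothetical stand-in for whatever call returns one page of results given a `limit` and a `skip`, and the in-memory `records` list only simulates a remote source.

```python
from typing import Callable, List


def fetch_all(fetch_page: Callable[[int, int], List[dict]], limit: int = 10) -> List[dict]:
    """Collect every record by paging until a page comes back short."""
    results: List[dict] = []
    skip = 0
    while True:
        page = fetch_page(limit, skip)
        results.extend(page)
        # A page shorter than the limit means nothing is left to fetch.
        if len(page) < limit:
            break
        skip += limit
    return results


# Hypothetical data source: 25 made-up records served from memory.
records = [{"id": i} for i in range(25)]


def fake_fetch_page(limit: int, skip: int) -> List[dict]:
    """Return one page of the made-up records."""
    return records[skip:skip + limit]


all_records = fetch_all(fake_fetch_page, limit=10)
print(len(all_records))  # 25
```

When the total happens to be an exact multiple of the limit, the loop makes one extra request that returns an empty page and then stops.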


#### Choice 2:  
Use the web API: https://github.com/jhpoelen/eol-globi-data/wiki/API  
The API documented at the link above provides access to interaction data so that it can be integrated into wikis, custom webpages or other interaction exploration tools.
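
As a rough illustration of what a request to such an API might look like, the sketch below only builds a query URL and sends nothing over the network. The base URL, the `/interaction` path and the parameter names (`sourceTaxon`, `interactionType`, `type`, `limit`, `skip`) are assumptions on my part and should be checked against the API wiki before use.

```python
from urllib.parse import urlencode

# Assumed base URL; verify against the API wiki before relying on it.
API_BASE = "https://api.globalbioticinteractions.org"


def interaction_url(source_taxon, interaction_type, limit=10, skip=0):
    """Build a query URL for one interaction type from one source taxon."""
    # Parameter names are assumptions, not confirmed by this document.
    params = {
        "sourceTaxon": source_taxon,
        "interactionType": interaction_type,
        "type": "csv",
        "limit": limit,
        "skip": skip,
    }
    return f"{API_BASE}/interaction?{urlencode(params)}"


url = interaction_url("Homo sapiens", "eats")
print(url)
```

If the endpoint behaves as assumed, the resulting URL could be opened in a browser or passed to any HTTP client.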

#### Choice 3:
Download the whole dataset directly at https://www.globalbioticinteractions.org/data  
Datasets are available for download in several formats, including TSV, CSV and N-Quads/RDF. I chose the `.tsv` version. 

## Basic data exploration and characteristics

I ended up going with Choice 3 and explored the dataset with Python in a Jupyter notebook environment. One reason is that I did not want to be limited by the built-in functions of the `rglobi` package; importing the whole dataset lets me explore it however I want. Also, with Choice 3 I work from the same dataset every time, so the results are reproducible.

In [1]:
%matplotlib inline
import pandas as pd

In [2]:
# Takes a few minutes to load.
data = pd.read_csv('~/Desktop/interactions.tsv', delimiter='\t', encoding='utf-8')

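
Because the full file takes a while to load, a streaming pass can give a quick first summary without holding everything in memory. The sketch below uses Python's standard `csv` module instead of pandas. The function name `summarize_interactions` and the tiny sample are made up for illustration; the only thing taken from the dataset is the `interactionTypeName` column name.

```python
import csv
import io
from collections import Counter


def summarize_interactions(lines):
    """Stream tab-separated rows; return (row count, counts per interaction type)."""
    reader = csv.DictReader(lines, delimiter="\t")
    type_counts = Counter()
    row_count = 0
    for row in reader:
        row_count += 1
        type_counts[row["interactionTypeName"]] += 1
    return row_count, type_counts


# Tiny made-up sample standing in for interactions.tsv.
sample = io.StringIO(
    "sourceTaxonName\tinteractionTypeName\ttargetTaxonName\n"
    "A\teats\tB\n"
    "A\teats\tC\n"
    "D\tpollinates\tE\n"
)
rows, counts = summarize_interactions(sample)
print(rows, counts.most_common())
```

For the real file, the same function could be given a handle from `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded `.tsv`.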


In [3]:
data.head()

Unnamed: 0,sourceTaxonId,sourceTaxonIds,sourceTaxonName,sourceTaxonRank,sourceTaxonPathNames,sourceTaxonPathIds,sourceTaxonPathRankNames,sourceTaxonSpeciesName,sourceTaxonSpeciesId,sourceTaxonGenusName,...,eventDateUnixEpoch,argumentTypeId,referenceCitation,referenceDoi,referenceUrl,sourceCitation,sourceNamespace,sourceArchiveURI,sourceDOI,sourceLastSeenAtUnixEpoch
0,EOL:4472733,EOL:4472733 | EOL:4472733,Deinosuchus,genus,Deinosuchus,EOL:4472733,genus,,,Deinosuchus,...,,https://en.wiktionary.org/wiki/support,"Rivera-Sylva H.E., E. Frey and J.R. Guzmán-Gui...",10.4267/2042/28152,,Katja Schulz. 2015. Information about dinosaur...,KatjaSchulz/dinosaur-biotic-interactions,https://github.com/KatjaSchulz/dinosaur-biotic...,,2018-12-14T23:59:22.189Z
1,EOL:4433651,EOL:4433651 | EOL:4433651,Daspletosaurus,genus,Daspletosaurus,EOL:4433651,genus,,,Daspletosaurus,...,,https://en.wiktionary.org/wiki/support,doi:10.1666/0022-3360(2001)075<0401:GCFACT>2.0...,10.1666/0022-3360(2001)075<0401:GCFACT>2.0.CO;2,,Katja Schulz. 2015. Information about dinosaur...,KatjaSchulz/dinosaur-biotic-interactions,https://github.com/KatjaSchulz/dinosaur-biotic...,,2018-12-14T23:59:22.189Z
2,EOL_V2:24210058,EOL_V2:24210058 | OTT:3617018 | GBIF:4975216 |...,Repenomamus robustus,species,Eucarya | Opisthokonta | Metazoa | Eumetazoa |...,EOL:5610326 | EOL:2910700 | EOL:42196910 | EOL...,| | subkingdom | | | | | | | | | supe...,Repenomamus robustus,EOL_V2:24210058,Repenomamus,...,,https://en.wiktionary.org/wiki/support,doi:10.1038/nature03102,10.1038/nature03102,,Katja Schulz. 2015. Information about dinosaur...,KatjaSchulz/dinosaur-biotic-interactions,https://github.com/KatjaSchulz/dinosaur-biotic...,,2018-12-14T23:59:22.189Z
3,EOL:4433892,EOL:4433892 | EOL:4433892,Sinocalliopteryx gigas,species,Sinocalliopteryx gigas,EOL:4433892,species,Sinocalliopteryx gigas,EOL:4433892,,...,,https://en.wiktionary.org/wiki/support,doi:10.1371/journal.pone.0044012,10.1371/journal.pone.0044012,,Katja Schulz. 2015. Information about dinosaur...,KatjaSchulz/dinosaur-biotic-interactions,https://github.com/KatjaSchulz/dinosaur-biotic...,,2018-12-14T23:59:22.189Z
4,EOL:4433892,EOL:4433892 | EOL:4433892,Sinocalliopteryx gigas,species,Sinocalliopteryx gigas,EOL:4433892,species,Sinocalliopteryx gigas,EOL:4433892,,...,,https://en.wiktionary.org/wiki/support,doi:10.1371/journal.pone.0044012,10.1371/journal.pone.0044012,,Katja Schulz. 2015. Information about dinosaur...,KatjaSchulz/dinosaur-biotic-interactions,https://github.com/KatjaSchulz/dinosaur-biotic...,,2018-12-14T23:59:22.189Z


In [9]:
# check the number of rows
len(data)

3445494

In [6]:
# How many Columns
len(data.columns)

80

In [5]:
# What are the 80 columns of this dataset?
data.columns

Index(['sourceTaxonId', 'sourceTaxonIds', 'sourceTaxonName', 'sourceTaxonRank',
       'sourceTaxonPathNames', 'sourceTaxonPathIds',
       'sourceTaxonPathRankNames', 'sourceTaxonSpeciesName',
       'sourceTaxonSpeciesId', 'sourceTaxonGenusName', 'sourceTaxonGenusId',
       'sourceTaxonFamilyName', 'sourceTaxonFamilyId', 'sourceTaxonOrderName',
       'sourceTaxonOrderId', 'sourceTaxonClassName', 'sourceTaxonClassId',
       'sourceTaxonPhylumName', 'sourceTaxonPhylumId',
       'sourceTaxonKingdomName', 'sourceTaxonKingdomId', 'sourceId',
       'sourceOccurrenceId', 'sourceCatalogNumber', 'sourceBasisOfRecordId',
       'sourceBasisOfRecordName', 'sourceLifeStageId', 'sourceLifeStageName',
       'sourceBodyPartId', 'sourceBodyPartName', 'sourcePhysiologicalStateId',
       'sourcePhysiologicalStateName', 'interactionTypeName',
       'interactionTypeId', 'targetTaxonId', 'targetTaxonIds',
       'targetTaxonName', 'targetTaxonRank', 'targetTaxonPathNames',
       'targetTaxonPath

#### How many distinct taxa appear as sources and targets?

In [11]:
# Source taxon
len(data['sourceTaxonId'].unique())

147156

In [12]:
# Target taxon
len(data['targetTaxonId'].unique())

105196

### What interaction types are there?

The backbone of the GloBI database is that all observations are interactions, and each must fit into one of 36 interaction types.

In [13]:
data['interactionTypeName'].unique()

array(['eats', 'preysOn', 'interactsWith', 'pollinates', 'parasiteOf',
       'pathogenOf', 'visitsFlowersOf', 'adjacentTo', 'dispersalVectorOf',
       'hasHost', 'endoparasitoidOf', 'symbiontOf', 'endoparasiteOf',
       'hasVector', 'ectoParasiteOf', 'vectorOf', 'livesOn', 'livesNear',
       'parasitoidOf', 'guestOf', 'livesInsideOf', 'farms',
       'ectoParasitoid', 'inhabits', 'kills', 'hasDispersalVector',
       'livesUnder', 'kleptoparasiteOf', 'hostOf', 'visits', 'eatenBy',
       'flowersVisitedBy', 'preyedUponBy', 'hasParasite', 'pollinatedBy',
       'hasPathogen'], dtype=object)

In [14]:
# number of different types of interaction
len(data['interactionTypeName'].unique())

36

### Unique Interaction types

Each record in GloBI comes from a specific dataset. One of the great things about GloBI is its transparency about *exactly* where each record comes from. GloBI has a system that gathers information from its sources on a daily basis. Because of this, a data provider can fix a mistake on their end and GloBI will incorporate the change into its dataset without any intervention. Several columns indicate a record's origin, but the `sourceNamespace` column is especially interesting: it gives the exact GitHub repository the data comes from.

In [57]:
# Top 10 data sources by number of records contributed.
data['sourceNamespace'].value_counts().head(10)

globalbioticinteractions/fishbase                                            504260
globalbioticinteractions/arthropodEasyCaptureAMNH                            350213
millerse/Wardeh-et-al.-2015                                                  271904
globalbioticinteractions/natural-history-museum-london-interactions-bank     242429
millerse/Dapstrom-integrated-database-and-portal-for-fish-stomach-records    225564
globalbioticinteractions/ices                                                183935
EOL/pseudonitzchia                                                           183773
globalbioticinteractions/noaa-reem                                           122328
millerse/US-National-Parasite-Collection                                      99713
globalbioticinteractions/roopnarine                                           96647
Name: sourceNamespace, dtype: int64

To see where GloBI gets this data from, append the `sourceNamespace` value to `https://github.com/`.  
For example, the largest contributor appears to be FishBase: [github.com/globalbioticinteractions/fishbase](https://github.com/globalbioticinteractions/fishbase). You can also check the status of GloBI's interaction with its data sources here: [https://www.globalbioticinteractions.org/status.html](https://www.globalbioticinteractions.org/status.html).
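
The namespace-to-repository mapping can be written as a small helper. `namespace_to_url` is a name made up for this sketch, and it assumes every `sourceNamespace` value has the `owner/repo` form seen in the top-10 list above.

```python
GITHUB_BASE = "https://github.com/"


def namespace_to_url(namespace: str) -> str:
    """Turn an 'owner/repo' namespace into its GitHub repository URL."""
    return GITHUB_BASE + namespace.strip().strip("/")


for ns in ["globalbioticinteractions/fishbase", "millerse/Wardeh-et-al.-2015"]:
    print(namespace_to_url(ns))
```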

I'm interested in how many unique interaction records, meaning distinct combinations of source taxon, interaction type and target taxon, are found in GloBI.

In [15]:
data.drop_duplicates(['sourceTaxonId', 'interactionTypeName', 'targetTaxonId'], inplace = True)

In [58]:
len(data)

3456395
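
The same idea can be expressed without modifying the data in place, which keeps the total row count and the unique count available side by side. The following is a plain-Python stand-in for the pandas call rather than the notebook's own code; `count_unique_interactions` and the sample records are made up for illustration.

```python
def count_unique_interactions(rows):
    """Count distinct (source taxon, interaction type, target taxon) triples."""
    seen = set()
    for row in rows:
        seen.add((row["sourceTaxonId"], row["interactionTypeName"], row["targetTaxonId"]))
    return len(seen)


# Made-up records; the second row repeats the first triple.
sample_rows = [
    {"sourceTaxonId": "EOL:1", "interactionTypeName": "eats", "targetTaxonId": "EOL:2"},
    {"sourceTaxonId": "EOL:1", "interactionTypeName": "eats", "targetTaxonId": "EOL:2"},
    {"sourceTaxonId": "EOL:1", "interactionTypeName": "preysOn", "targetTaxonId": "EOL:2"},
]
print(len(sample_rows), count_unique_interactions(sample_rows))  # 3 2
```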