**I am Yikang Li, an international student from Tianjin, China. I have just graduated from UC Berkeley with a BA in statistics. I have long been interested in data science, so I have taken every opportunity to work on research projects that improve my practical data mining skills. During this research experience, I worked with other interns to study different natural history databases, and we reviewed and revised each other's work in a GitHub repository. I explored Global Biotic Interactions (GloBI), a natural history database of species interactions, from a data science perspective and conducted exploratory analyses. The immediate products are medium-length reports that will be published online at the Cabinet of Curiosity website.**

I am interested in the GloBI database because it provides data in the form of interactions, which sets it apart from the other databases I explored. Instead of focusing on one species at a time, it connects species by describing the interactions between them. By digging into GloBI, we can discover and study patterns in the networks among species.

GloBI does a fantastic job of explaining itself: 
>Global Biotic Interactions (GloBI) provides open access to species interaction data (e.g., predator-prey, pollinator-plant, pathogen-host, parasite-host) by combining existing open datasets using open source software. By providing an infrastructure to capture and share interaction data, individual biologists can focus on gathering new interaction data and analyzing existing datasets without having to spend resources on (re-) building a cyberinfrastructure to do so.

> GloBI is made possible by a community of software engineers, bioinformaticists and biologists. Software engineers such as [Jorrit Poelen](http://linkedin.com/in/jhpoelen), [Göran Bodenschatz](https://github.com/coding46), and [Robert Reiz](https://github.com/reiz) collaborate with bioinformaticists like [Chris Mungall](http://www.mendeley.com/profiles/chris-mungall/), data managers like [Sarah E. Miller](http://github.com/millerse) and biologists like [Jim Simons](http://ccs.tamucc.edu/people/dr-james-d-simons/), [Anne Thessen](http://ronininstitute.org/research-scholars/anne-thessen/), [Jen Hammock](https://orcid.org/0000-0002-9943-2342) and [Brian Hayden](https://sites.google.com/site/haydenresearch/) to capture, provide access to and use interaction data that is provided by [biologists and citizen scientists around the world](/references.html). **GloBI is sustained by an intricate network of thriving open source, open data and open science communities** in addition to receiving donations, grants, awards or being written into grants, including, but not limited to, EOL's [EOL Rubenstein Fellows Program](http://eol.org/info/485) (CRDF EOL-33066-13/F33066, 2013) and the David M. Rubenstein Grant (FOCX-14-60988-1, 2014), and the Smithsonian Institution (SI) (T15CC10297-002, 2016).


## How to access GloBi

There is a package called [rglobi](https://cran.r-project.org/web/packages/rglobi/index.html) in R which allows us to access the database on Global Biotic Interactions (GloBI).  
  
The package documentation describes it as follows:

> A programmatic interface to the web service methods provided by Global Biotic Interactions (GloBI). GloBI provides access to spatial-temporal species interaction records from sources all over the world. rglobi provides methods to search species interactions by location, interaction type, and taxonomic name. In addition, it supports Cypher, a graph query language, to allow for executing custom queries on the GloBI aggregate species interaction data set.

To use its functions, we need to install the `rglobi` package and load it in R.

```r
install.packages("rglobi")
library(rglobi)
```

Users can search species interaction data by location, interaction type, taxonomic name, and so on. Please check out the [rglobi vignette](https://cran.r-project.org/web/packages/rglobi/vignettes/rglobi_vignette.html) to learn more about using the package.

While the R package provides built-in functions, it limits the number of results returned by default. I experimented with the package quite a bit, but as soon as I found out that the entire GloBI database is only ~6.5 GB and can easily be downloaded, I decided to take that route for further exploration.



### Accessing all the data

#### Choice 1:
Use Pagination: https://github.com/ropensci/rglobi/blob/master/vignettes/rglobi_vignette.Rmd#L410

"By default, the amount of results are limited. If you'd like to retrieve all results, you can used pagination. For instance, to retrieve parasitic interactions using pagination, you can use:


```r
otherkeys = list("limit"=10, "skip"=0)
first_page_of_ten <- get_interactions_by_type(interactiontype = c("hasParasite"), otherkeys = otherkeys)
otherkeys = list("limit"=10, "skip"=10)
second_page_of_ten <- get_interactions_by_type(interactiontype = c("hasParasite"), otherkeys = otherkeys)
```

To exhaust all available interactions, you can keep paging results until the size of the page is less than the limit (e.g., ```nrows(interactions) < limit```)."


#### Choice 2:  
Through the API: https://github.com/jhpoelen/eol-globi-data/wiki/API  
The page linked above documents an API that provides access to interaction data so that it can be integrated into wikis, custom web pages, or other interaction exploration tools.

#### Choice 3:
Download the whole dataset directly at https://www.globalbioticinteractions.org/data  
Datasets are available to download in different formats including tsv, csv and N-Quads/RDF. I chose the `.tsv` version. 

## Basic data exploration and characteristics

I ended up going with Choice 3 and explored the dataset with Python in a Jupyter notebook environment. One reason is that I didn't want to be limited by the built-in functions of the `rglobi` package: importing the whole dataset lets me explore it in whatever way I want. Also, with Choice 3 I work from the same snapshot every time, so the results are reproducible.

If you would like to follow along in a Jupyter notebook, please check out the notebook here: [Notebook](link.com). You will first need to download the interactions.tsv file here: [interactions.tsv.gz](https://depot.globalbioticinteractions.org/snapshot/target/data/tsv/interactions.tsv.gz).

```python
%matplotlib inline
import pandas as pd
```

```python
# Takes a few minutes to load.
# If following along, please download and unzip interactions.tsv.gz from
# https://depot.globalbioticinteractions.org/snapshot/target/data/tsv/interactions.tsv.gz
# The unzipped file is ~6.5 GB.
# Don't forget to change the path to the file's location on your computer.

data = pd.read_csv('~/Desktop/interactions.tsv', delimiter='\t', encoding='utf-8')
```
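
If load time or memory becomes a problem, one possible alternative is to read the compressed file directly, in chunks, and keep only the columns you need. The sketch below is not what the notebook does. It writes a tiny made-up sample file (the rows are invented, only the column names come from the dataset) so that it runs anywhere; with the real data you would point `path` at your downloaded `interactions.tsv.gz`.

```python
import gzip
import os
import tempfile

import pandas as pd

# Made-up sample rows; only the column names are taken from the GloBI dataset.
sample_tsv = (
    "sourceTaxonId\tinteractionTypeName\ttargetTaxonId\treferenceCitation\n"
    "EOL:1\teats\tEOL:2\tsome citation\n"
    "EOL:3\tpreysOn\tEOL:4\tanother citation\n"
    "EOL:5\tpollinates\tEOL:6\tthird citation\n"
)

# Write the sample as a gzip-compressed TSV, mimicking interactions.tsv.gz.
path = os.path.join(tempfile.mkdtemp(), "interactions_sample.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    handle.write(sample_tsv)

wanted = ["sourceTaxonId", "interactionTypeName", "targetTaxonId"]

# pandas infers gzip compression from the .gz extension.
# chunksize makes read_csv return an iterator of small DataFrames.
chunks = pd.read_csv(path, sep="\t", usecols=wanted, dtype=str, chunksize=2)
core = pd.concat(chunks, ignore_index=True)

print(core.shape)
```

Reading every column as `str` avoids mixed-type guessing on large files; you can convert individual columns afterwards if you need numbers or dates.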


```python
# See the first few rows
data.head()
```

```text
,sourceTaxonId,sourceTaxonIds,sourceTaxonName,sourceTaxonRank,sourceTaxonPathNames,sourceTaxonPathIds,sourceTaxonPathRankNames,sourceTaxonSpeciesName,sourceTaxonSpeciesId,sourceTaxonGenusName,...,eventDateUnixEpoch,argumentTypeId,referenceCitation,referenceDoi,referenceUrl,sourceCitation,sourceNamespace,sourceArchiveURI,sourceDOI,sourceLastSeenAtUnixEpoch
0,EOL:4472733,EOL:4472733 | EOL:4472733,Deinosuchus,genus,Deinosuchus,EOL:4472733,genus,,,Deinosuchus,...,,https://en.wiktionary.org/wiki/support,"Rivera-Sylva H.E., E. Frey and J.R. Guzmán-Gui...",10.4267/2042/28152,,Katja Schulz. 2015. Information about dinosaur...,KatjaSchulz/dinosaur-biotic-interactions,https://github.com/KatjaSchulz/dinosaur-biotic...,,2018-12-14T23:59:22.189Z
1,EOL:4433651,EOL:4433651 | EOL:4433651,Daspletosaurus,genus,Daspletosaurus,EOL:4433651,genus,,,Daspletosaurus,...,,https://en.wiktionary.org/wiki/support,doi:10.1666/0022-3360(2001)075<0401:GCFACT>2.0...,10.1666/0022-3360(2001)075<0401:GCFACT>2.0.CO;2,,Katja Schulz. 2015. Information about dinosaur...,KatjaSchulz/dinosaur-biotic-interactions,https://github.com/KatjaSchulz/dinosaur-biotic...,,2018-12-14T23:59:22.189Z
2,EOL_V2:24210058,EOL_V2:24210058 | OTT:3617018 | GBIF:4975216 |...,Repenomamus robustus,species,Eucarya | Opisthokonta | Metazoa | Eumetazoa |...,EOL:5610326 | EOL:2910700 | EOL:42196910 | EOL...,| | subkingdom | | | | | | | | | supe...,Repenomamus robustus,EOL_V2:24210058,Repenomamus,...,,https://en.wiktionary.org/wiki/support,doi:10.1038/nature03102,10.1038/nature03102,,Katja Schulz. 2015. Information about dinosaur...,KatjaSchulz/dinosaur-biotic-interactions,https://github.com/KatjaSchulz/dinosaur-biotic...,,2018-12-14T23:59:22.189Z
3,EOL:4433892,EOL:4433892 | EOL:4433892,Sinocalliopteryx gigas,species,Sinocalliopteryx gigas,EOL:4433892,species,Sinocalliopteryx gigas,EOL:4433892,,...,,https://en.wiktionary.org/wiki/support,doi:10.1371/journal.pone.0044012,10.1371/journal.pone.0044012,,Katja Schulz. 2015. Information about dinosaur...,KatjaSchulz/dinosaur-biotic-interactions,https://github.com/KatjaSchulz/dinosaur-biotic...,,2018-12-14T23:59:22.189Z
4,EOL:4433892,EOL:4433892 | EOL:4433892,Sinocalliopteryx gigas,species,Sinocalliopteryx gigas,EOL:4433892,species,Sinocalliopteryx gigas,EOL:4433892,,...,,https://en.wiktionary.org/wiki/support,doi:10.1371/journal.pone.0044012,10.1371/journal.pone.0044012,,Katja Schulz. 2015. Information about dinosaur...,KatjaSchulz/dinosaur-biotic-interactions,https://github.com/KatjaSchulz/dinosaur-biotic...,,2018-12-14T23:59:22.189Z
```


```python
# Check the number of rows
len(data)
```

```text
965611
```

```python
# How many columns?
len(data.columns)
```

```text
80
```

```python
# What are the 80 columns of this dataset?
data.columns
```

```text
Index(['sourceTaxonId', 'sourceTaxonIds', 'sourceTaxonName', 'sourceTaxonRank',
       'sourceTaxonPathNames', 'sourceTaxonPathIds',
       'sourceTaxonPathRankNames', 'sourceTaxonSpeciesName',
       'sourceTaxonSpeciesId', 'sourceTaxonGenusName', 'sourceTaxonGenusId',
       'sourceTaxonFamilyName', 'sourceTaxonFamilyId', 'sourceTaxonOrderName',
       'sourceTaxonOrderId', 'sourceTaxonClassName', 'sourceTaxonClassId',
       'sourceTaxonPhylumName', 'sourceTaxonPhylumId',
       'sourceTaxonKingdomName', 'sourceTaxonKingdomId', 'sourceId',
       'sourceOccurrenceId', 'sourceCatalogNumber', 'sourceBasisOfRecordId',
       'sourceBasisOfRecordName', 'sourceLifeStageId', 'sourceLifeStageName',
       'sourceBodyPartId', 'sourceBodyPartName', 'sourcePhysiologicalStateId',
       'sourcePhysiologicalStateName', 'interactionTypeName',
       'interactionTypeId', 'targetTaxonId', 'targetTaxonIds',
       'targetTaxonName', 'targetTaxonRank', 'targetTaxonPathNames',
       'targetTaxonPath
```

### How many distinct taxa appear as sources and targets?

You can see that many of the columns start with either "source" or "target". Columns that start with "source" describe the organism or group of organisms that acts upon the "target" organism, and each set of columns offers different ways to describe those organisms. The taxon ID columns link the organisms to an established database of organisms such as the [Encyclopedia of Life](https://eol.org/). The great thing about these columns is that the IDs are unique.

Let's check how many unique organisms or groups of organisms there are in GloBI.

```python
# Source taxon
len(data['sourceTaxonId'].unique())
```

```text
147510
```

```python
# Target taxon
len(data['targetTaxonId'].unique())
```

```text
106613
```
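
One detail worth knowing when counting this way: `len(series.unique())` counts a missing value as one more "taxon", whereas `nunique()` skips missing values. The sketch below illustrates the difference on a made-up toy table (the IDs are invented), and also shows one way to count taxa that appear on either side of an interaction.

```python
import pandas as pd

# Made-up toy data with one missing source ID.
toy = pd.DataFrame({
    "sourceTaxonId": ["EOL:1", "EOL:1", "EOL:2", None],
    "targetTaxonId": ["EOL:2", "EOL:3", "EOL:3", "EOL:4"],
})

# unique() keeps the missing value, nunique() ignores it.
n_sources_with_missing = len(toy["sourceTaxonId"].unique())
n_sources = toy["sourceTaxonId"].nunique()
n_targets = toy["targetTaxonId"].nunique()

# Taxa that appear as a source, a target, or both.
all_ids = pd.concat([toy["sourceTaxonId"], toy["targetTaxonId"]]).dropna()
n_total = all_ids.nunique()

print(n_sources_with_missing, n_sources, n_targets, n_total)
```

On the full dataset the same calls would be applied to `data` instead of `toy`.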

### What interaction types are there?

The source and target organisms are connected by the way in which they interact. This is described by the interaction columns, and in this snapshot every record falls into one of 37 interaction types.

```python
data['interactionTypeName'].unique()
```

```text
array(['eats', 'preysOn', 'interactsWith', 'pollinates', 'parasiteOf',
       'pathogenOf', 'visitsFlowersOf', 'adjacentTo', 'dispersalVectorOf',
       'hasHost', 'endoparasitoidOf', 'symbiontOf', 'endoparasiteOf',
       'hasVector', 'ectoParasiteOf', 'vectorOf', 'livesOn', 'livesNear',
       'parasitoidOf', 'guestOf', 'livesInsideOf', 'farms',
       'ectoParasitoid', 'inhabits', 'kills', 'hasDispersalVector',
       'livesUnder', 'kleptoparasiteOf', 'hostOf', 'eatenBy',
       'flowersVisitedBy', 'preyedUponBy', 'hasParasite', 'pollinatedBy',
       'visits', 'commensalistOf', 'hasPathogen'], dtype=object)
```

```python
# Number of different types of interaction
len(data['interactionTypeName'].unique())
```

```text
37
```
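
Knowing which interaction types exist is a start; a natural follow-up is how often each one occurs. The sketch below shows one way to get that with `value_counts`, using a made-up toy column so it runs on its own. The counts here say nothing about the real dataset.

```python
import pandas as pd

# Made-up toy column; the type names are taken from the list above.
toy = pd.DataFrame({
    "interactionTypeName": ["eats", "eats", "preysOn", "pollinates", "eats"],
})

# Absolute number of records per interaction type, most frequent first.
counts = toy["interactionTypeName"].value_counts()

# The same thing as a share of all records.
shares = toy["interactionTypeName"].value_counts(normalize=True)

print(counts)
print(shares)
```

Running the same two calls on `data['interactionTypeName']` would rank the 37 types by how many records they contribute.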

Each record in GloBI comes from a specific dataset. One of the great things about GloBI is its transparency about *exactly* where the data comes from. GloBI has a system that gathers information from its sources on a daily basis. Because of this, a source dataset can fix a mistake on its end and GloBI will incorporate the change into its dataset without any manual intervention. A few columns tell you where a record came from, but the `sourceNamespace` column is especially interesting because it shows the exact place on GitHub that the data comes from.

```python
# Top 10 data sources ranked by number of records contributed to GloBI
data['sourceNamespace'].value_counts().head(10)
```

```text
globalbioticinteractions/fishbase                                            504260
globalbioticinteractions/arthropodEasyCaptureAMNH                            350213
millerse/Wardeh-et-al.-2015                                                  271904
globalbioticinteractions/natural-history-museum-london-interactions-bank     242429
millerse/Dapstrom-integrated-database-and-portal-for-fish-stomach-records    225564
globalbioticinteractions/ices                                                183935
EOL/pseudonitzchia                                                           183773
globalbioticinteractions/noaa-reem                                           122328
millerse/US-National-Parasite-Collection                                      99713
globalbioticinteractions/roopnarine                                           96647
Name: sourceNamespace, dtype: int64
```

To see where GloBI gets this data from, simply append the namespace (the first column in the output above) to `https://github.com/`.

Example: The largest contributor appears to be FishBase: [github.com/globalbioticinteractions/fishbase](https://github.com/globalbioticinteractions/fishbase). You can also check the status of GloBI's integration with its data sources here: [https://www.globalbioticinteractions.org/status.html](https://www.globalbioticinteractions.org/status.html).
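
Turning a namespace into a link is easy to script. The helper below is a small sketch: `namespace_to_url` is a name made up for this post, and it assumes that every `sourceNamespace` value is an `owner/repository` pair on GitHub, as described above.

```python
def namespace_to_url(namespace: str) -> str:
    """Build a GitHub URL from an owner/repository namespace string."""
    # Assumption: the namespace is always "owner/repository" on github.com.
    return "https://github.com/" + namespace.strip("/")


url = namespace_to_url("globalbioticinteractions/fishbase")
print(url)
```

Applied to the whole column, something like `data['sourceNamespace'].map(namespace_to_url)` would give one link per record, provided the column has no missing values.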

I'm interested in how many unique interaction records are found in GloBI. Many of the columns describe the organisms involved, but the most interesting columns, and really the heart of the database, are `sourceTaxonId`, `interactionTypeName`, and `targetTaxonId`. With these three columns you can see which organism interacts with which, and how.

```python
data[['sourceTaxonId', 'interactionTypeName', 'targetTaxonId']]
```

| | sourceTaxonId | interactionTypeName | targetTaxonId |
|---|---|---|---|
| 0 | EOL:4472733 | eats | EOL_V2:42417811 |
| 1 | EOL:4433651 | eats | EOL_V2:42417811 |
| 2 | EOL_V2:24210058 | eats | EOL:4532049 |
| 3 | EOL:4433892 | eats | EOL_V2:4433896 |
| 4 | EOL:4433892 | eats | EOL:4433563 |
| 5 | EOL:4433551 | eats | EOL:42331729 |
| 6 | EOL:4531246 | eats | EOL_V2:4530741 |
| 7 | EOL:4531246 | eats | EOL:4653801 |
| 8 | EOL:4433582 | eats | EOL_V2:4531936 |
| 9 | EOL:4433881 | preysOn | EOL:4518630 |
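
Because the same interaction can be reported by several sources, counting rows is not the same as counting distinct interactions. One way to count distinct source/interaction/target combinations is sketched below on a made-up toy table (the IDs are invented); it is an illustration, not a result from the real data.

```python
import pandas as pd

cols = ["sourceTaxonId", "interactionTypeName", "targetTaxonId"]

# Made-up toy records: one exact duplicate and one row with a missing target.
toy = pd.DataFrame(
    [
        ["EOL:1", "eats", "EOL:2"],
        ["EOL:1", "eats", "EOL:2"],
        ["EOL:1", "preysOn", "EOL:2"],
        ["EOL:3", "eats", "EOL:2"],
        ["EOL:3", "eats", None],
    ],
    columns=cols,
)

# Drop incomplete rows, then keep one row per distinct combination.
triples = toy[cols].dropna().drop_duplicates()

# Distinct interactions per interaction type.
per_type = triples["interactionTypeName"].value_counts()

print(len(triples))
print(per_type)
```

Each row of `triples` can be read as one edge in a species interaction network, with the interaction type as the edge label.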


## How to search by Organism

Many columns describe the species or order, and you can search by any of them. But if you just want to play around with the dataset at hand, you can simply search using an organism name string. I chose to look up what [Carollia](https://en.wikipedia.org/wiki/Carollia), a genus of short-tailed fruit bats, eats.

```python
# Subset by the term Carollia (na=False skips rows with a missing name)
carollia = data[data['sourceTaxonName'].str.contains('Carollia', na=False)]

# Keep only the 'eats' interactions
carollia = carollia.loc[carollia['interactionTypeName'] == 'eats']

# Show only relevant columns
carollia[['sourceTaxonName','sourceTaxonId', 'interactionTypeName', 'targetTaxonName','targetTaxonId']].head()
```

| | sourceTaxonName | sourceTaxonId | interactionTypeName | targetTaxonName | targetTaxonId |
|---|---|---|---|---|---|
| 785626 | Carollia perspicillata | EOL:327438 | eats | Terminalia catappa | GBIF:3189394 |
| 785683 | Carollia perspicillata | EOL:327438 | eats | Syzygium malaccense | EOL:2508662 |
| 785688 | Carollia perspicillata | EOL:327438 | eats | Syzygium jambos | EOL:2508661 |
| 785727 | Carollia perspicillata | EOL:327438 | eats | Syzygium cumini | EOL:2508660 |
| 785900 | Carollia perspicillata | EOL:327438 | eats | Spondias | EOL:61097 |


From the output above you can see that [Carollia perspicillata](https://en.wikipedia.org/wiki/Seba%27s_short-tailed_bat) eats yummy things like [Terminalia catappa](https://en.wikipedia.org/wiki/Terminalia_catappa), which is a type of nut, and [Syzygium malaccense](https://en.wikipedia.org/wiki/Syzygium_malaccense), an apple-like fruit.
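
The same kind of lookup can be wrapped in a small function so it works for any organism. This is a sketch rather than part of the original notebook: `diet_of` is a name made up here, and the toy table mixes three Carollia rows from the output above with two invented rows so that the example runs without the full dataset.

```python
import pandas as pd


def diet_of(data: pd.DataFrame, name_fragment: str) -> pd.Series:
    """Count what organisms matching name_fragment are recorded as eating."""
    # Plain-text, case-insensitive match; rows with a missing name are skipped.
    is_match = data["sourceTaxonName"].str.contains(
        name_fragment, case=False, na=False, regex=False
    )
    is_eats = data["interactionTypeName"] == "eats"
    return data.loc[is_match & is_eats, "targetTaxonName"].value_counts()


# First three rows come from the Carollia output; the last two are made up.
toy = pd.DataFrame({
    "sourceTaxonName": [
        "Carollia perspicillata",
        "Carollia perspicillata",
        "Carollia perspicillata",
        None,
        "Some other taxon",
    ],
    "interactionTypeName": ["eats", "eats", "eats", "eats", "preysOn"],
    "targetTaxonName": [
        "Terminalia catappa",
        "Syzygium malaccense",
        "Syzygium jambos",
        "Spondias",
        "Terminalia catappa",
    ],
})

diet = diet_of(toy, "carollia")
print(diet)
```

On the full dataset, `diet_of(data, 'Carollia')` would list every recorded food item for the genus together with the number of records behind it.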

