# iDigBio

## What is iDigBio?

iDigBio, which stands for Integrated Digitized Biocollections, is a repository of over 100,000,000 specimen records and over 26,000,000 media records of digitized biological specimens. It provides search and map functions through its GUI, which makes it easy to find many types of information. Here, I will look at biodiversity in San Diego County.

## Accessing the Data

Everything is available through the website's GUI, but I found it more efficient to use the API. iDigBio offers several ways to reach its API; I used the Python client.

### Step 1: Setting up Python

For this tutorial, I used Anaconda to set up a Jupyter notebook locally. Download Anaconda [here](https://www.anaconda.com/download/). Once it is installed, open the application and select Jupyter Notebook, which should take you back to your browser with a locally hosted notebook.

### Step 2: Setting up iDigBio

The [PyPI page for the Python client](https://pypi.org/project/idigbio/) gives installation instructions, but I will repeat them here.

It may benefit you to update pip first. Run `pip install --upgrade pip` in your terminal.

Then, run `pip install idigbio` and let it download.

If you haven't already, also run `pip install 'urllib3[secure]'` and `pip install pandas`. (The quotes keep shells such as zsh from interpreting the square brackets.)
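To confirm the installs worked before opening the notebook, a quick check along these lines can help. This is a minimal sketch using only the standard library; the helper name `missing_packages` is my own and is not part of any of these packages.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Packages this tutorial relies on
required = ["idigbio", "pandas", "urllib3"]

missing = missing_packages(required)
if missing:
    print("Still need to install:", ", ".join(missing))
else:
    print("All packages found.")
```

If anything is listed as missing, rerun the matching `pip install` command in the same environment that Jupyter uses.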

### Step 3: Setting up the Jupyter Notebook

Once each of those is installed, go back to your local Jupyter notebook and run these imports:

In [1]:
import idigbio
import pandas as pd
from IPython.display import Image

### Step 4: Accessing the Actual Data

At this point, we can search for whatever information we need. Here are the basics:

Create JSON data:

`api = idigbio.json()`

`json_output = api.search_records()`

or

Create a pandas DataFrame:

`api = idigbio.pandas()`

`pandas_output = api.search_records()`

The [iDigBio search API wiki](https://github.com/idigbio/idigbio-search-api/wiki) is a useful reference when working with this system.

In [2]:
api_json = idigbio.json()  # Named api_json to make clear which return type we're using
json_output = api_json.search_records()
json_output

{'itemCount': 115173471,
 'lastModified': '2018-10-13T15:13:05.710Z',
 'items': [{'uuid': '061594f4-69a3-41ff-9396-dac55cc8409b',
   'type': 'records',
   'etag': '1ec2d44f898f84f67517c85417ef0269222c4cf1',
   'data': {},
   'indexTerms': {'startdayofyear': 281,
    'country': 'united states',
    'earliestepochorlowestseries': 'eocene',
    'institutionid': 'http://biocol.org/urn:lsid:biocol.org:col:34878',
    'collectioncode': 'invertebrate paleontology',
    'dqs': 0.5797101449275363,
    'countrycode': 'usa',
    'datecollected': '1967-10-08',
    'county': 'lewis',
    'lowestbiostratigraphiczone': 'narizian',
    'hasMedia': False,
    'uuid': '061594f4-69a3-41ff-9396-dac55cc8409b',
    'basisofrecord': 'fossilspecimen',
    'taxonrank': 'species',
    'order': 'neogastropoda',
    'individualcount': 3,
    'highertaxon': 'mollusca; gastropoda; neogastropoda; buccinidae',
    'locality': '[redacted]',
    'occurrenceid': 'urn:catalog:uwbm:invertebratepaleontology:66079',
    'st
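The JSON result is a plain nested dictionary, so pulling a few fields out of each record only needs ordinary dictionary access. Here is a minimal sketch that works on a trimmed-down copy of the output above rather than a live API call; the helper name `pick_fields` is my own.

```python
# A trimmed-down copy of the response shown above (values taken from that output)
sample_output = {
    "itemCount": 115173471,
    "items": [
        {
            "uuid": "061594f4-69a3-41ff-9396-dac55cc8409b",
            "indexTerms": {
                "country": "united states",
                "county": "lewis",
                "datecollected": "1967-10-08",
                "order": "neogastropoda",
            },
        }
    ],
}


def pick_fields(result, fields):
    """Pull selected indexTerms fields out of each record; missing fields become None."""
    rows = []
    for item in result.get("items", []):
        terms = item.get("indexTerms", {})
        row = {"uuid": item.get("uuid")}
        row.update({field: terms.get(field) for field in fields})
        rows.append(row)
    return rows


rows = pick_fields(sample_output, ["country", "county", "datecollected"])
print(rows)
```

The same function should work on the real `json_output`, assuming every record follows the `items` / `indexTerms` layout shown above.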

In [3]:
api_pandas = idigbio.pandas()
pandas_output = api_pandas.search_records()
pandas_output

uuid,basisofrecord,bed,catalognumber,class,collectioncode,collector,continent,coordinateuncertainty,country,countrycode,...,recordset,scientificname,specificepithet,startdayofyear,stateprovince,taxonomicstatus,taxonrank,typestatus,verbatimeventdate,verbatimlocality
061594f4-69a3-41ff-9396-dac55cc8409b,fossilspecimen,,66079,gastropoda,invertebrate paleontology,david nunnallee,north america,5566.0,united states,usa,...,ba77d411-4179-4dbd-b6c1-39b8a71ae795,"parvisipho lewisiana (weaver), 1912",lewisiana,281,washington,,species,,1967-1970,
98d15126-9c7b-49b9-8ab9-04c449184867,fossilspecimen,,66078,gastropoda,invertebrate paleontology,david nunnallee,north america,5566.0,united states,usa,...,ba77d411-4179-4dbd-b6c1-39b8a71ae795,parvisipho sp.,sp.,281,washington,,genus,,1967-1970,
d3747526-4aa9-4193-8864-1eb10c090f9a,fossilspecimen,bear gulch limestone,035520,holocephali,vertpaleo,richard lund & party,north america,,united states,usa,...,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,srianta dawsoni,dawsoni,179,montana,,species,y,,"usa | montana | supercrop quarry, mv7106"
bd20e1ec-0263-46ca-b492-73e3c45cf623,fossilspecimen,,66091,gastropoda,invertebrate paleontology,david nunnallee,north america,5566.0,united states,usa,...,ba77d411-4179-4dbd-b6c1-39b8a71ae795,"whitneyella buwaldana (dickerson), 1915",buwaldana,281,washington,,species,,67-69,
8b1f86b8-2a57-4f9b-a8ae-6208af69c3de,fossilspecimen,bear gulch limestone,046092,holocephali,vertpaleo,richard lund,north america,,united states,usa,...,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,netsepoye hawesi,hawesi,179,montana,,species,y,,"usa | montana | supercrop quarry, mv7106"
f67a501c-53b9-4571-8edd-8a2f6511c776,fossilspecimen,,66189,gastropoda,invertebrate paleontology,david nunnallee,north america,5566.0,united states,usa,...,ba77d411-4179-4dbd-b6c1-39b8a71ae795,"whitneyella buwaldana (dickerson), 1915",buwaldana,281,washington,,species,,67-70,
37ac603e-c90e-4c79-9d3c-8ddfa9ec5746,fossilspecimen,bear gulch limestone,035521,holocephali,vertpaleo,richard lund & party,north america,,united states,usa,...,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,srianta iarlis,iarlis,178,montana,,species,y,,"usa | montana | supercrop quarry, mv7106"
e8eee5ef-de73-474a-bd26-671137f3ea49,fossilspecimen,bear gulch limestone,062790,elasmobranchii,vertpaleo,richard lund,north america,,united states,usa,...,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,falcatus falcatus,falcatus,180,montana,,species,,,usa | montana | bear gulch general
c36f99c6-e734-4b80-9b11-9fe8da29a9d2,fossilspecimen,bear gulch limestone,062891,elasmobranchii,vertpaleo,richard lund,north america,,united states,usa,...,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,falcatus falcatus,falcatus,180,montana,,species,,,"usa | montana | cox ranch quarry, mv7253"
c646d500-7670-4ab5-abb4-8a365bfb24e2,fossilspecimen,bear gulch limestone,035466,elasmobranchii,vertpaleo,richard lund & party,north america,,united states,usa,...,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,falcatus falcatus,falcatus,179,montana,,species,,,"usa | montana | supercrop quarry, mv7106"


Here are some examples of what you can do with the search API (the [documentation](https://pypi.org/project/idigbio/) is on PyPI):

In [6]:
#Look for a certain genus (Ursus, the bears)
record_list_bears = api_pandas.search_records(rq={"scientificname": "ursus"})
record_list_bears.head(10)

uuid,basisofrecord,canonicalname,catalognumber,class,collectioncode,collectionid,collector,continent,coordinateuncertainty,country,...,recordnumber,recordset,scientificname,startdayofyear,stateprovince,taxonid,taxonomicstatus,taxonrank,verbatimeventdate,verbatimlocality
b72127a7-0fab-4882-b713-a1cb8e8c7112,fossilspecimen,ursus,12706,mammalia,vertpaleo,,harold w hamilton,north america,,united states,...,,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,ursus,179.0,west virginia,2433406,accepted,genus,,"usa | west virginia | trout cave entrance, lev..."
1061b2ff-56ac-48fd-8db7-8939e68b8af9,fossilspecimen,ursus,12882,mammalia,vertpaleo,,hamilton & mccrady,north america,,united states,...,,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,ursus,178.0,west virginia,2433406,accepted,genus,,"usa | west virginia | trout cave entrance, lev..."
80b2da27-6560-4fc0-99f5-f3c3012b5548,fossilspecimen,ursus,2507,mammalia,vertpaleo,,andrew carnegie: gift,north america,,united states,...,,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,ursus,179.0,california,2433406,accepted,genus,,"usa | california | rancho la brea, los angeles"
8c60341e-1af8-4f68-bb0b-ebc543985559,fossilspecimen,ursus,68946,mammalia,vertpaleo,,harold w hamilton & crew,north america,,united states,...,,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,ursus,179.0,virginia,2433406,accepted,genus,,"usa | virginia | holston vista cave, 0-6"" level"
ad4e182a-0751-4a29-9a63-22e11d5be620,fossilspecimen,ursus,24303,mammalia,vertpaleo,,s d dean jr,north america,,united states,...,,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,ursus,90.0,virginia,2433406,accepted,genus,,usa | virginia | meadowview cave
d205fbfe-c3d7-4813-a787-6d207e0b093d,fossilspecimen,ursus,12608,mammalia,vertpaleo,,a d mccrady & r h handley,north america,,united states,...,,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,ursus,227.0,west virginia,2433406,accepted,genus,,usa | west virginia | organ-hedricks cave system
e83b31f6-52b3-4104-bf1c-83e923583a02,fossilspecimen,ursus,24322,mammalia,vertpaleo,,h w hamilton & a l ambrose,north america,,united states,...,,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,ursus,209.0,virginia,2433406,accepted,genus,,usa | virginia | vickers cave
66a42706-1d57-4637-ade2-e73d247390c5,fossilspecimen,ursus,12675,mammalia,vertpaleo,,"n k vereschagin, leningrad: exch",europe,,russia,...,,71b8ffab-444e-43f9-9a9c-5c42b0eaa5eb,ursus,179.0,,2433406,accepted,genus,,"russia | pogrebeneya cave, ural mountains"
15906599-7cad-4c19-a2d5-7aeb4c560392,fossilspecimen,ursus,8387,mammalia,v,,sinclair+furlong,north america,,united states,...,,5ab348ab-439a-4697-925c-d6abe0c09b92,ursus,,california,2433406,accepted,genus,,
ff60bd78-0453-4596-8068-59ae888af311,fossilspecimen,ursus,4100,mammalia,v,,sinclair+furlong,north america,,united states,...,,5ab348ab-439a-4697-925c-d6abe0c09b92,ursus,,california,2433406,accepted,genus,,
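The same `rq` pattern can be pointed at a location, such as San Diego County. The sketch below builds the query dictionary separately so it can be inspected before anything is sent. The field names `county`, `stateprovince`, and `scientificname` all appear in the outputs above. The helper names are hypothetical, lowercasing the values is my assumption based on how values appear in those outputs, and the `limit` keyword is my recollection of the client's `search_records` signature, so check both against the documentation.

```python
def county_query(county, state=None, scientificname=None):
    """Build an rq dictionary for a county search (hypothetical helper)."""
    # Assumption: values are lowercased to match how they appear in the outputs above
    rq = {"county": county.lower()}
    if state:
        rq["stateprovince"] = state.lower()
    if scientificname:
        rq["scientificname"] = scientificname.lower()
    return rq


def search_county(county, state=None, scientificname=None, limit=10):
    """Run the query through the pandas client; needs idigbio installed and network access."""
    import idigbio  # imported here so county_query works without the client installed

    api = idigbio.pandas()
    # Assumption: search_records accepts a `limit` keyword
    return api.search_records(rq=county_query(county, state, scientificname), limit=limit)


rq = county_query("San Diego", state="California")
print(rq)

# Uncomment to run against the live API:
# records = search_county("San Diego", state="California")
# records.head(10)
```

Including the state alongside the county is a precaution, since county names are not necessarily unique across states or countries.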


We can also make charts. (This part of the tutorial is messy: it took me quite a bit of work to figure out, since I could not find instructions for it.)

In [4]:
import matplotlib.pyplot as plt
summary_data = api_json.top_records(top_fields = "kingdom")
summary_data

{'kingdom': {'plantae': {'itemCount': 49227078},
  'animalia': {'itemCount': 48864501},
  'fungi': {'itemCount': 5813868},
  'chromista': {'itemCount': 552993},
  'protista': {'itemCount': 306926},
  '1999-01-27': {'itemCount': 224074},
  'protozoa': {'itemCount': 45878},
  'protoctista': {'itemCount': 41501},
  '2004-10-18': {'itemCount': 35920},
  'eubacteria': {'itemCount': 33972}},
 'itemCount': 115173471}

The result is a dictionary of dictionaries of dictionaries, so it took some finagling to pull out the counts I wanted, as the cells below show.

Notice that there are more entries than there are actual kingdoms. Plantae, animalia, etc. check out, but some of the entries are dates or are otherwise mislabeled. I needed to clean this data so that only kingdoms remained, but doing so loses a lot of data: the two date-labeled entries alone hold 259,994 records (224,074 + 35,920), and the six-kingdom list I settled on also leaves out chromista and protoctista. I don't have a great way to avoid this other than asking iDigBio to rearrange this data from the start.
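To see how many records sit under date-like labels, the counts can be tallied directly. This is a small sketch on a copy of the `top_records` output shown above; the name `sample_summary` and the regular expression are my own choices for illustration.

```python
import re

# Copy of the top_records output shown above
sample_summary = {
    "kingdom": {
        "plantae": {"itemCount": 49227078},
        "animalia": {"itemCount": 48864501},
        "fungi": {"itemCount": 5813868},
        "chromista": {"itemCount": 552993},
        "protista": {"itemCount": 306926},
        "1999-01-27": {"itemCount": 224074},
        "protozoa": {"itemCount": 45878},
        "protoctista": {"itemCount": 41501},
        "2004-10-18": {"itemCount": 35920},
        "eubacteria": {"itemCount": 33972},
    },
    "itemCount": 115173471,
}

# Matches labels shaped like YYYY-MM-DD
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

date_like = {
    name: info["itemCount"]
    for name, info in sample_summary["kingdom"].items()
    if DATE_PATTERN.match(name)
}
records_under_dates = sum(date_like.values())

print(date_like)
print(records_under_dates)
```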

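In [1]:
#Finding how many layers I had to go down to find kingdoms
list(summary_data.keys())[0]

NameError: name 'summary_data' is not defined

(This `NameError` appears because the cell was run in a fresh kernel, before `summary_data` existed. Run the `top_records` cell above first.)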

In [6]:
#Investigating that dictionary
summary_data.keys()

dict_keys(['kingdom', 'itemCount'])

In [7]:
#Finding indices of kingdoms
list(list(summary_data.values())[0].keys()).index("plantae")

0
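Rather than counting layers by hand, a short recursive helper can report how deeply a dictionary is nested, and the values can then be reached by key instead of by position. This is a sketch on a cut-down copy of the summary; `nesting_depth` is a name I made up for illustration.

```python
def nesting_depth(value):
    """Number of dictionary layers: 0 for a non-dict, 1 for a flat dict, and so on."""
    if not isinstance(value, dict):
        return 0
    return 1 + max((nesting_depth(v) for v in value.values()), default=0)


# Cut-down copy of the summary shown above
sample_summary = {
    "kingdom": {
        "plantae": {"itemCount": 49227078},
        "animalia": {"itemCount": 48864501},
    },
    "itemCount": 115173471,
}

depth = nesting_depth(sample_summary)
print(depth)

# Chaining keys reaches a count directly, without list() or index()
plantae_count = sample_summary["kingdom"]["plantae"]["itemCount"]
print(plantae_count)
```

Accessing values by key also avoids relying on the order in which the API happens to return its entries.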

In [8]:
#Clean data!
def get_kingdoms(summary):
    """Return {kingdom: record count}, keeping only the names listed in `kingdoms`."""
    kingdoms = ["plantae", "animalia", "fungi", "protista", "protozoa", "eubacteria"]
    # summary["kingdom"] maps each label to {"itemCount": n}; look values up by key, not position
    return {
        name: info["itemCount"]
        for name, info in summary["kingdom"].items()
        if name in kingdoms
    }

blessed = get_kingdoms(summary_data)
blessed
#Called it 'blessed' because it took hours of frustration to get here. I did not include all of the steps
#I took, partly because it would have made this post very long and partly because the working version
#is all you need.

{'plantae': 49227078,
 'animalia': 48864501,
 'fungi': 5813868,
 'protista': 306926,
 'protozoa': 45878,
 'eubacteria': 33972}

Finally, we have a clean set of data. I started with messy data, cleaned it by specifying which "kingdoms" were actually kingdoms, and then looked inside the layers of dictionaries to get the counts I wanted.
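Before plotting, it can help to look at each kingdom's share of the cleaned total, since the counts span several orders of magnitude. This is a minimal sketch using the cleaned numbers from the output above; the variable names are my own.

```python
# Cleaned counts, copied from the output above
kingdom_counts = {
    "plantae": 49227078,
    "animalia": 48864501,
    "fungi": 5813868,
    "protista": 306926,
    "protozoa": 45878,
    "eubacteria": 33972,
}

total = sum(kingdom_counts.values())
shares = {name: count / total for name, count in kingdom_counts.items()}

# Print the shares as percentages, largest first
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12}{share:7.2%}")
```

Note that these shares are relative to the six kingdoms kept here, not to the full 115,173,471 records in iDigBio.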

Let's look at this information. We can make a bar chart. 

In [None]:
plt.bar(range(len(blessed)), list(blessed.values()), align='center')
plt.xticks(range(len(blessed)), list(blessed.keys()))
plt.ylabel("Number of records")
plt.show()