# Biogeography Notebook 1

The goal of this notebook is to access and integrate diverse data sets to visualize correlations and discover patterns that address questions about species’ responses to environmental change. We will use programmatic tools to show how to work with Berkeley resources such as biodiversity data from biocollections and online databases, field stations, climate models, and other environmental data. If you have trouble getting the Jupyter notebook to run, try dropping into [data peer consulting](https://data.berkeley.edu/education/data-peer-consulting).

Before we begin analyzing and visualizing biodiversity data, this introductory notebook will help familiarize you with the basics of programming in Python.

## Table of Contents

Please complete sections 0 and 1 before coming to class.

0 - [Jupyter Notebooks](#jupyter)
    
1 - [Python Basics](#python basics)

2 - [GBIF API](#gbif)


# Part 0: Our Computing Environment, Jupyter notebooks  <a id='jupyter'></a>
This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results. 

### Text cells
In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to learn Markdown, but you might want to.

After you edit a text cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions of the lab.)

### Code cells
Other cells contain code in the Python 3 language. Running a code cell will execute all of the code it contains.

To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, either press ▶| or hold down the `shift` key and press `return` or `enter`.

Try running this cell:

In [1]:
print("Hello, World!")

Hello, World!


And this one:

In [2]:
print("\N{WAVING HAND SIGN}, \N{EARTH GLOBE ASIA-AUSTRALIA}!")

👋, 🌏!


In order to finish the setup for this notebook, run the following cell:

In [3]:
%%capture
!pip install --no-cache-dir shapely
!pip install -U folium

%matplotlib inline
import os
import time
import folium
from datetime import datetime
from shapely.geometry import Point, mapping
from shapely.geometry.polygon import Polygon
import matplotlib as mpl
from matplotlib.collections import PatchCollection
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from datascience import *
from shapely import geometry as sg, wkt
from scripts.espm_module import *
import json
import random
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
import ipywidgets as widgets
plt.style.use('seaborn')

# Part 1: Python basics <a id='python basics'></a>
Before getting into the higher-level analyses we will do on the GBIF and Cal-Adapt data, we need to cover a few of the foundational elements of programming in Python.

#### A. Expressions
The departure point for all programming is the concept of the __expression__. An expression is a combination of variables, operators, and other Python elements that the language interprets and acts upon. Expressions act as a set of instructions to be fed through the interpreter, with the goal of generating specific outcomes. See below for some examples of basic expressions. Keep in mind that most of these just map to your understanding of mathematical expressions:

In [4]:
2 + 2

'me' + ' and I'

12 ** 2

6 + 4

10

You will notice that only the value of the last expression in a cell gets printed out. If you want to see the values of previous expressions, you need to call the `print` function on each of them. Functions use parentheses around their arguments, just like in math!

In [5]:
print(2 + 2)

print('you' + ' and I')

print(12 ** 2)

print(6 + 4)

4
you and I
144
10


#### B. Variables
In the example below, `a` and `b` are Python objects known as __variables__. We are giving an object (in this case, an `integer` and a `float`, two Python data types) a name that we can store for later use. To use that value, we can simply type the name that we stored the value as. Variables are stored within the notebook's environment, meaning stored variable values carry over from cell to cell.

In [6]:
a = 4
b = 10/5

Notice that when you create a variable, unlike what you previously saw with the expressions, it does not print anything out.

We can continue to perform mathematical operations on these variables, which are now placeholders for what we've assigned:

In [7]:
print(a + b)

6.0
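
As a small illustration (not part of the original lab), the built-in `type` function shows which data type a variable holds. The names `count` and `ratio` are made up for this sketch. Note that `/` always produces a `float` in Python 3, even when the division is exact:

```python
count = 4
ratio = 10 / 5

print(type(count))    # an int
print(type(ratio))    # a float, because / always returns a float
print(count + ratio)  # int + float gives a float

# Reassigning a name replaces the value it refers to
count = count + 1
print(count)
```

Running this cell after the ones above leaves `a` and `b` untouched, since it uses different names.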


#### C. Lists
The following few cells will introduce the concept of __lists__.

A list is an ordered collection of objects. Lists allow us to store groups of variables and other objects so they are easy to access and analyze. Check out this [documentation](https://www.tutorialspoint.com/python/python_lists.htm) for an in-depth look at the capabilities of lists.

To initialize a list, you use brackets. Putting objects separated by commas in between the brackets will add them to the list. 

In [8]:
# an empty list
lst = []
print(lst)

# reassigning our empty list to a new list
lst = [1, 3, 6, 'lists', 'are', 'fun', 4]
print(lst)

[]
[1, 3, 6, 'lists', 'are', 'fun', 4]


To access a value in the list, put the index of the item you wish to access in brackets following the variable that stores the list. Lists in Python are zero-indexed, so the indices for `lst` are 0, 1, 2, 3, 4, 5, and 6.

In [9]:
# Elements are selected like this:
example = lst[2]

# The above line selects the 3rd element of lst (list indices are 0-offset) and sets it to a variable named example.
print(example)

6
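
A few other common list operations can be sketched as follows. The list `genera` is a made-up example, not part of the lab data:

```python
# A hypothetical list of genus names, used only for illustration
genera = ['Argia', 'Orchis', 'Tetragnatha']

print(len(genera))    # number of elements in the list
print(genera[0])      # first element (index 0)
print(genera[-1])     # negative indices count from the end

genera.append('Quercus')   # add an element to the end of the list
print(genera[1:3])         # slicing returns a new list with indices 1 and 2
```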


#### D. Dictionaries

Dictionaries store `key`-`value` pairs. Just like in a word dictionary, each key indexes a specific definition.

In [10]:
my_dict = {'python': 'a large heavy-bodied nonvenomous constrictor snake occurring throughout the Old World tropics.'}

We can get a `value` back out by indexing the `key`:

In [11]:
my_dict['python']

'a large heavy-bodied nonvenomous constrictor snake occurring throughout the Old World tropics.'

But like real dictionaries, there can be more than one definition. You can keep a `list`, or even another dictionary within a specific `key`:

In [12]:
my_dict = {'python': ['a large heavy-bodied nonvenomous constrictor snake occurring throughout the Old World tropics.',
                      'a high-level general-purpose programming language.']}

We can index the `list` after the `key`:

In [13]:
my_dict['python'][0]

'a large heavy-bodied nonvenomous constrictor snake occurring throughout the Old World tropics.'

In [14]:
my_dict['python'][1]

'a high-level general-purpose programming language.'
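
Dictionaries can also be extended and looped over. The following is a minimal sketch using a made-up `record` dictionary whose keys resemble the field names we will meet in the GBIF data:

```python
# A small, made-up record for illustration
record = {'genus': 'Argia', 'country': 'Mexico'}

record['year'] = 2019          # add a new key-value pair
print(record['genus'])         # look up a value by its key
print(record.get('county'))    # .get returns None for a missing key instead of raising an error

# Loop over every key-value pair
for key, value in record.items():
    print(key, '->', value)
```

Using `.get` is handy when some records lack a field, which happens often with biodiversity data.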

---

# Part 2: GBIF API<a id='gbif'></a>

Click on the [link](http://www.gbif.org/) to the GBIF website to discover what GBIF can do!

<div class="alert alert-block alert-warning">
**QUESTION 1:** What does GBIF stand for and who is it coordinated by?
</div>

test response

The Global Biodiversity Information Facility has created the [GBIF Web API](http://www.gbif.org/developer/summary), which we can use to get data about different species.

You can think of a Web API call as a fancy URL. What do you think the end of this URL means?

http://api.gbif.org/v1/occurrence/search?year=1800,1899

If you're guessing that it limits the search to the years 1800-1899, you're right! Go ahead and click the URL above. You should see something like this:

```
{"offset":0,"limit":20,"endOfRecords":false,"count":5711947,"results":[{"key":14339704,"datasetKey":"857aa892-f762-11e1-a439-00145eb45e9a","publishingOrgKey":"6bcc0290-6e76-11db-bcd5-b8a03c50a862","publishingCountry":"FR","protocol":"BIOCASE","lastCrawled":"2013-09-07T07:06:34.000+0000","crawlId":1,"extensions":{},"basisOfRecord":"OBSERVATION","taxonKey":2809968,"kingdomKey":6,"phylumKey":7707728,"classKey":196,"orderKey":1169,"familyKey":7689,"genusKey":2849312,"speciesKey":2809968,"scientificName":"Orchis militaris L.","kingdom":"Plantae","phylum":"Tracheophyta","order":"Asparagales","family":"Orchidaceae","genus":"Orchis","species":"Orchis 
```

It might look like a mess, but it's not! This is actually very structured data that can easily be put into a table-like format, though programmers often keep it as is because it is just as easy to work with.

You might be able to pick out the curly braces (`{` and `}`) and think that it's a dictionary. You'd be right, except in this format we call it [JSON](https://en.wikipedia.org/wiki/JSON).
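
To make the link between URLs, JSON, and dictionaries concrete, here is a minimal sketch that runs without contacting the API. It builds the query URL from a dictionary of parameters and then parses a tiny hand-written JSON string (`sample_text`, an assumption made up to mimic the shape of the real response) into a Python dictionary:

```python
import json
from urllib.parse import urlencode

# Build the query string from a dictionary of parameters
base_url = 'http://api.gbif.org/v1/occurrence/search'
query = urlencode({'year': '1800,1899'}, safe=',')   # keep the comma unescaped
url = base_url + '?' + query
print(url)

# A tiny hand-written JSON string that mimics the shape of the response above
sample_text = ('{"offset": 0, "limit": 20, "endOfRecords": false, '
               '"results": [{"genus": "Orchis", "kingdom": "Plantae"}]}')

response = json.loads(sample_text)        # JSON text -> Python dictionary
print(type(response))
print(response['endOfRecords'])           # JSON false becomes Python False
print(response['results'][0]['genus'])    # "results" holds a list of dictionaries
```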

---

## *Argia agrioides*

![argia agrioides](https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Argia_agrioides-Male-1.jpg/220px-Argia_agrioides-Male-1.jpg)

When performing data analysis, it is always important to define a question that you seek the answer to. *The goal of finding the answer to this question will ultimately drive the queries and analysis styles you choose to use/write.*

For this example, we are going to ask: **where has [*Argia agrioides*](https://www.google.com/search?q=Argia+agrioides&rlz=1C1CHBF_enUS734US734&source=lnms&tbm=isch&sa=X&ved=0ahUKEwji9t29kNTWAhVBymMKHWJ-ANcQ_AUICygC&biw=1536&bih=694) (the California Dancer damselfly) been documented? Are there records at any of our field stations?**

The code to query the API has already been written for us! This is often the case with programming: someone has already written the code, so we don't have to. We'll just set up the `GBIFRequest` object and assign it to the variable `req`, short for "request":

In [15]:
req = GBIFRequest()  # creating a request to the API

Great, so how do we make searches? We can use a Python `dictionary` to create our query parameters. We'll ask for the `scientificName` of the California Dancer (*Argia agrioides*):

In [16]:
params = {'scientificName': 'Argia agrioides'}  # setting our parameters (the specific species we want)

Now that we have the parameters, we can feed them to our `req` variable to get back all the pages of data. We'll then make sure that each record has a `decimalLatitude`; otherwise we'll throw it out for now. Lastly, we'll print out the first five records:

In [17]:
params = {'scientificName': 'Argia agrioides'}  # setting our parameters (the specific species we want)
pages = req.get_pages(params)  # using those parameters to complete the request
records = [rec for page in pages for rec in page['results'] if rec.get('decimalLatitude')]  # sift out valid records
records[:5]  # print first 5 records

[{'acceptedScientificName': 'Argia agrioides Calvert, 1895',
  'acceptedTaxonKey': 5051459,
  'basisOfRecord': 'HUMAN_OBSERVATION',
  'catalogNumber': '20373409',
  'class': 'Insecta',
  'classKey': 216,
  'collectionCode': 'Observations',
  'coordinateUncertaintyInMeters': 122.0,
  'country': 'Mexico',
  'countryCode': 'MX',
  'crawlId': 183,
  'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7',
  'datasetName': 'iNaturalist research-grade observations',
  'dateIdentified': '2019-02-15T23:07:23',
  'day': 11,
  'decimalLatitude': 23.05612,
  'decimalLongitude': -109.684839,
  'eventDate': '2019-02-11T15:22:00',
  'eventTime': '20:22:00Z',
  'extensions': {},
  'facts': [],
  'family': 'Coenagrionidae',
  'familyKey': 8577,
  'gbifID': '2005347832',
  'genericName': 'Argia',
  'genus': 'Argia',
  'genusKey': 1422607,
  'geodeticDatum': 'WGS84',
  'http://unknown.org/nick': 'rileywalsh',
  'http://unknown.org/occurrenceDetails': 'https://www.inaturalist.org/observations/20373409',
  '
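
The `get_pages` method comes from the course's `scripts.espm_module`, whose internals are not shown here. As a rough sketch of what paging could look like, assuming it relies on the `offset`, `limit`, and `endOfRecords` fields seen in the API response earlier, the hypothetical helper `iter_pages` below keeps requesting pages until the response reports the end. A fake fetch function with made-up records stands in for the real web request so the example runs offline:

```python
def iter_pages(fetch, params, limit=2):
    """Yield result pages until the response reports endOfRecords."""
    offset = 0
    while True:
        page = fetch(dict(params, offset=offset, limit=limit))
        yield page
        if page['endOfRecords']:
            break
        offset += limit


# Made-up records; the second one is missing its latitude
FAKE_RECORDS = [
    {'key': 1, 'decimalLatitude': 23.05},
    {'key': 2},
    {'key': 3, 'decimalLatitude': 33.97},
]


def fake_fetch(query):
    """Stand-in for a web request: slices FAKE_RECORDS the way a paged API would."""
    start = query['offset']
    stop = start + query['limit']
    return {'offset': start,
            'limit': query['limit'],
            'endOfRecords': stop >= len(FAKE_RECORDS),
            'results': FAKE_RECORDS[start:stop]}


demo_pages = list(iter_pages(fake_fetch, {'scientificName': 'Argia agrioides'}))

# Flatten the pages and keep only records that have a latitude
demo_records = [rec for page in demo_pages for rec in page['results']
                if rec.get('decimalLatitude')]
print(len(demo_pages), [rec['key'] for rec in demo_records])
```

The list comprehension at the end has the same shape as the one in the cell above: the outer loop walks over pages, the inner loop walks over each page's `results`, and the `if` clause drops records without a latitude.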

<div class="alert alert-block alert-warning">
**QUESTION 2:** Why might it be useful to know the documented occurrences of a species? Name one organization that would find this information useful.
</div>

test response

<div class="alert alert-block alert-warning">
**QUESTION 3:** What is the *geographic range* of an organism?
</div>

test response

<div class="alert alert-block alert-warning">
**QUESTION 4:** How do museum records help us to understand how populations are changing?
</div>

test response

### DataFrames

JSON is great, but it might be conceptually easier to make this a table. We'll use the popular [`pandas`](http://pandas.pydata.org/) Python library. In `pandas`, a DataFrame is a table that has several convenient features. For example, we can access the columns of the table like we would `dict`ionaries, and we can also treat the columns and rows themselves as Python `list`s.

In [18]:
records_df = pd.read_json(json.dumps(records))  # converts the JSON above to a dataframe
records_df.head()  # prints the first five rows of the dataframe

Unnamed: 0,acceptedNameUsage,acceptedScientificName,acceptedTaxonKey,accessRights,associatedReferences,basisOfRecord,bibliographicCitation,catalogNumber,class,classKey,...,taxonKey,taxonRank,taxonRemarks,taxonomicStatus,type,verbatimElevation,verbatimEventDate,verbatimLocality,vernacularName,year
0,,"Argia agrioides Calvert, 1895",5051459,,,HUMAN_OBSERVATION,,20373409,Insecta,216,...,5051459,SPECIES,,ACCEPTED,,,2019/02/11 3:22 PM EST,"Los Cabos, Baja California Sur, Mexico",,2019.0
1,,"Argia agrioides Calvert, 1895",5051459,,,HUMAN_OBSERVATION,,20393774,Insecta,216,...,5051459,SPECIES,,ACCEPTED,,,2019/02/15 9:38 AM MST,"Santa Rita Hot Springs, Baja California Sur, M...",,2019.0
2,,"Argia agrioides Calvert, 1895",5051459,,,HUMAN_OBSERVATION,,22377619,Insecta,216,...,5051459,SPECIES,,ACCEPTED,,,2019/04/11 10:56 AM MDT,"Los Cabos, Baja California Sur, Mexico",,2019.0
3,,"Argia agrioides Calvert, 1895",5051459,,,HUMAN_OBSERVATION,,25338726,Insecta,216,...,5051459,SPECIES,,ACCEPTED,,,2019/05/17 1:07 PM MST,"Hassayampa River Preserve, Wickenburg, Maricop...",,2019.0
4,,"Argia agrioides Calvert, 1895",5051459,,,HUMAN_OBSERVATION,,25235332,Insecta,216,...,5051459,SPECIES,,ACCEPTED,,,2019/05/15 2:53 PM -0700,"Riverside County, CA, USA",,2019.0
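
To see how a list of dictionaries maps onto a table, here is a minimal sketch with three made-up records. It passes the list straight to `pd.DataFrame`, which is a different route from the `pd.read_json(json.dumps(...))` call used above but gives the same kind of table:

```python
import pandas as pd

# Three made-up occurrence records, shaped like the GBIF results
toy_records = [
    {'country': 'Mexico', 'year': 2019, 'elevation': 15.0},
    {'country': 'United States of America', 'year': 2018, 'elevation': 420.0},
    {'country': 'United States of America', 'year': 2019},   # no elevation recorded
]

toy_df = pd.DataFrame(toy_records)         # each dictionary becomes one row
print(toy_df.shape)                        # (number of rows, number of columns)
print(toy_df['country'].tolist())          # select a column by name, like a dictionary key
print(toy_df['elevation'].isna().sum())    # fields missing from a record become NaN
```

Each key becomes a column, and a record that lacks a key gets a missing value (`NaN`) in that column, which is why many cells in the real table above are empty.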


Since each column (or row) above can be thought of as a `list`, that means we can use list functions to interact with them! One such function is the `len` function to get the number of elements in a `list`:

In [19]:
len(records_df.columns), len(records_df)

(115, 301)

So we have 115 columns and 301 rows! That's a lot of information. What variables do we have in the columns?

In [20]:
records_df.columns

Index(['acceptedNameUsage', 'acceptedScientificName', 'acceptedTaxonKey',
       'accessRights', 'associatedReferences', 'basisOfRecord',
       'bibliographicCitation', 'catalogNumber', 'class', 'classKey',
       ...
       'taxonKey', 'taxonRank', 'taxonRemarks', 'taxonomicStatus', 'type',
       'verbatimElevation', 'verbatimEventDate', 'verbatimLocality',
       'vernacularName', 'year'],
      dtype='object', length=115)

We can use two methods from `pandas` to do a lot more. The `value_counts()` method tabulates how often each value appears in a column, and `plot.barh()` draws a horizontal bar chart:

In [21]:
records_df['country'].value_counts()

United States of America    251
Mexico                       50
Name: country, dtype: int64
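
A small sketch with made-up values shows what `value_counts()` returns: a pandas Series whose index holds the distinct values and whose entries hold the counts, sorted from most to least frequent:

```python
import pandas as pd

# Made-up country values for illustration
countries = pd.Series(['Mexico', 'United States of America', 'United States of America'])

counts = countries.value_counts()          # most frequent value first
print(counts)
print(counts.index[0], counts.iloc[0])     # the top value and its count

# normalize=True reports proportions instead of raw counts
shares = countries.value_counts(normalize=True)
print(shares)
```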

In [22]:
records_df['country'].value_counts().plot.barh();

In [23]:
records_df['county'].value_counts().plot.barh();

<div class="alert alert-block alert-warning">
**QUESTION 5:** How many counties have only one record of *Argia agrioides*? 
</div>

test response

<div class="alert alert-block alert-warning">
**QUESTION 6:** Stanislaus County has the highest number of records of *Argia agrioides*. Other than high abundance in this county, why else might there be a high number of records here?
</div>

test response

In [24]:
records_df['basisOfRecord'].value_counts().plot.barh();

<div class="alert alert-block alert-warning">
**QUESTION 7:** What are some cautions that should be taken when including human observations? What are the benefits? 
</div>

test response

In [25]:
records_df['collectionCode'].value_counts().plot.barh();

<div class="alert alert-block alert-warning">
**QUESTION 8:** Each museum has a unique institution code (called a collection code). How many records belong to the Essig Museum of Entomology Collection?
</div>

test response

The `groupby()` method allows us to count the values in one column grouped by another, and then color the bar chart differently depending on a variable of our choice:

In [26]:
records_df.groupby(["collectionCode", "basisOfRecord"])['basisOfRecord'].count();

In [27]:
records_df.groupby(["collectionCode", "basisOfRecord"])['basisOfRecord'].count().unstack().plot.barh(stacked=True);
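
To see what the grouped counts look like before they are plotted, here is a minimal sketch on a made-up table. The collection code `MuseumA` is a placeholder, not a real institution code:

```python
import pandas as pd

# Made-up records; 'MuseumA' is a placeholder collection code
toy = pd.DataFrame({
    'collectionCode': ['Observations', 'Observations', 'MuseumA', 'MuseumA', 'MuseumA'],
    'basisOfRecord': ['HUMAN_OBSERVATION', 'HUMAN_OBSERVATION',
                      'PRESERVED_SPECIMEN', 'PRESERVED_SPECIMEN', 'HUMAN_OBSERVATION'],
})

# One count per (collectionCode, basisOfRecord) pair
grouped = toy.groupby(['collectionCode', 'basisOfRecord'])['basisOfRecord'].count()
print(grouped)

# unstack() pivots the inner index level into columns; pairs that never occur become NaN
table = grouped.unstack()
print(table)
```

Each row of `table` becomes one bar in the stacked chart, and each column becomes one colored segment of that bar.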

And we can use `plot.hist()` to make a histogram:

In [28]:
records_df['elevation'].plot.hist();
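
A histogram groups numeric values into equal-width bins and counts how many fall into each bin. As a sketch with made-up elevations, NumPy's `histogram` function returns those counts and bin edges directly, which is the same kind of tally a histogram plot draws as bars:

```python
import numpy as np

# Made-up elevation values for illustration
elevations = np.array([5.0, 12.0, 40.0, 95.0, 310.0, 880.0])

# Split the range from the minimum to the maximum into 3 equal-width bins
counts, edges = np.histogram(elevations, bins=3)
print(counts)   # how many values fall into each bin
print(edges)    # the 4 boundaries of the 3 bins
```

Most of the values land in the first bin here, so a plot of this data would show one tall bar on the left and two short bars to its right.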

<div class="alert alert-block alert-warning">
**QUESTION 9:** What does plotting the elevation indicate about the distribution of *Argia agrioides*? Can you infer anything about the biology of the organism from this information?
</div>

test response

---

<div class="alert alert-block alert-info">
**EXERCISE**: Edit the code below to search for a different species you're interested in, then use the graphing cells below to explore your data!
</div>

In [29]:
my_req = GBIFRequest()  # creating a request to the API
my_params = {'scientificName': 'Tetragnatha versicolor'}  # setting our parameters (the specific species we want)
my_pages = my_req.get_pages(my_params)  # using those parameters to complete the request
my_records = [rec for page in my_pages for rec in page['results'] if rec.get('decimalLatitude')]  # sift out valid records
my_records_df = pd.read_json(json.dumps(my_records))  # make a dataframe
my_records_df.head()  # print first 5 rows

Unnamed: 0,acceptedScientificName,acceptedTaxonKey,accessRights,associatedSequences,basisOfRecord,catalogNumber,class,classKey,collectionCode,collectionID,...,taxonRank,taxonRemarks,taxonomicStatus,type,verbatimCoordinateSystem,verbatimElevation,verbatimEventDate,verbatimLocality,vernacularName,year
0,"Tetragnatha versicolor Walckenaer, 1841",2151836,,,HUMAN_OBSERVATION,11391845,Arachnida,367,Observations,,...,SPECIES,,ACCEPTED,,,,2018-03-28 1:45:25 p. m. GMT-06:00,"Ribera 3438, Riberas del Río, 67160 Guadalupe,...",,2018
1,"Tetragnatha versicolor Walckenaer, 1841",2151836,https://www.fieldmuseum.org/about/copyright-in...,,PRESERVED_SPECIMEN,FMNHINS 0002 993 747,Arachnida,367,Insects,Insects,...,SPECIES,,ACCEPTED,PhysicalObject,,,,,,2017
2,"Tetragnatha versicolor Walckenaer, 1841",2151836,https://www.fieldmuseum.org/about/copyright-in...,,PRESERVED_SPECIMEN,FMNHINS 0002 993 724,Arachnida,367,Insects,Insects,...,SPECIES,,ACCEPTED,PhysicalObject,,,,,,2017
3,"Tetragnatha versicolor Walckenaer, 1841",2151836,http://vertnet.org/resources/norms.html,,PRESERVED_SPECIMEN,UAM:Ento:355974,Arachnida,367,Insect specimens,4,...,SPECIES,,ACCEPTED,PhysicalObject,decimal degrees,,5/13/16,USA: ALASKA: Chilkoot Trail,,2016
4,"Tetragnatha versicolor Walckenaer, 1841",2151836,http://vertnet.org/resources/norms.html,,PRESERVED_SPECIMEN,UAM:Ento:356607,Arachnida,367,Insect specimens,4,...,SPECIES,,ACCEPTED,PhysicalObject,deg. min. sec.,,13 May 2016,"USA: ALASKA: KLGO Skagway, Pullen Creek",,2016


In [30]:
my_records_df['year'].plot.hist();

In [31]:
my_records_df['county'].value_counts().plot.barh();

<div class="alert alert-block alert-warning">
**QUESTION 10:** Which county has the highest number of records?
</div>

test response

In [32]:
my_records_df['elevation'].plot.hist();

<div class="alert alert-block alert-warning">
**QUESTION 11:** What is the elevation range of your organism?
</div>

test response

In [33]:
my_records_df['basisOfRecord'].value_counts().plot.barh();

<div class="alert alert-block alert-warning">
**QUESTION 12:** Which has more: observations or preserved specimens? Why might this be?
</div>

test response

---

You are finished with this notebook! Please run the following cell to generate your submission file.

In [35]:
import gsExport
gsExport.generateSubmission("biogeography_notebook1.ipynb")


---

Notebook developed by: Nina Koo, Natalie Graham, Monica Wilkinson

[Data Science Modules](http://data.berkeley.edu/education/modules)