# Access locations and times of Blue Jay encounters

For this challenge, you will use a database called the [Global
Biodiversity Information Facility (GBIF)](https://www.gbif.org/). GBIF
is compiled from species observation data all over the world, and
includes everything from museum specimens to photos taken by citizen
scientists in their backyards.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Explore GBIF</div></div><div class="callout-body-container callout-body"><p>Before you get started, go to the <a
href="https://www.gbif.org/occurrence/search">GBIF occurrences search
page</a> and explore the data.</p></div></div>

> **Contribute to open data**
>
> You can get your own observations added to GBIF using
> [iNaturalist](https://www.inaturalist.org/)!

### Set up your code to prepare for download

To access GBIF data from Python, you need a package called `pygbif`,
which may not be included in your environment. Install it by running
the cell below:

In [1]:
%%bash
pip install pygbif



<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Import packages</div></div><div class="callout-body-container callout-body"><p>In the imports cell, we’ve included some packages that you will need.
Add imports for packages that will help you:</p>
<ol type="1">
<li>Work with reproducible file paths</li>
<li>Work with tabular data</li>
</ol></div></div>

In [2]:
import pandas as pd
import geopandas as gpd
import os
import pathlib

import time
import zipfile
from getpass import getpass
from glob import glob

import pygbif.occurrences as occ
import pygbif.species as species

In [3]:
# Create data directory in the home folder
data_dir = os.path.join(
    # Home directory
    pathlib.Path.home(),
    # Earth analytics data directory
    'earth-analytics',
    'data',
    # Project directory
    'blue-jay-migration',
)
os.makedirs(data_dir, exist_ok=True)

# Define the directory name for GBIF data, create directory
gbif_dir = os.path.join(data_dir, 'blue-jay-gbif')

os.makedirs(gbif_dir, exist_ok=True)

gbif_path = os.path.join(gbif_dir, 'blue-jay.csv')
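
The same directory layout can also be built with `pathlib` alone. The sketch below is one possible equivalent, not the notebook's own method: the helper name `make_project_dirs` is hypothetical, and it runs against a temporary folder rather than your real home directory.

```python
import tempfile
from pathlib import Path


def make_project_dirs(home):
    """Create the project and GBIF directories under `home`; return both."""
    data_dir = Path(home) / 'earth-analytics' / 'data' / 'blue-jay-migration'
    gbif_dir = data_dir / 'blue-jay-gbif'
    # parents=True builds intermediate folders; exist_ok=True makes reruns safe
    gbif_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, gbif_dir


# Demonstrate in a throwaway directory instead of the real home folder
with tempfile.TemporaryDirectory() as fake_home:
    demo_data_dir, demo_gbif_dir = make_project_dirs(fake_home)
    demo_created = demo_gbif_dir.is_dir()
    demo_relative = demo_gbif_dir.relative_to(fake_home).as_posix()

print(demo_relative)
```

Passing `pathlib.Path.home()` as `home` would reproduce the paths used in the cell above.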

In [4]:
%%bash
find ~/earth-analytics/data/blue-jay-migration/

/home/jovyan/earth-analytics/data/blue-jay-migration/
/home/jovyan/earth-analytics/data/blue-jay-migration/blue-jay-gbif
/home/jovyan/earth-analytics/data/blue-jay-migration/blue-jay-gbif/0003394-241007104925546.csv
/home/jovyan/earth-analytics/data/blue-jay-migration/0003394-241007104925546.zip
/home/jovyan/earth-analytics/data/blue-jay-migration/resolve_ecoregions_dir
/home/jovyan/earth-analytics/data/blue-jay-migration/resolve_ecoregions_dir/ecoregions.cpg
/home/jovyan/earth-analytics/data/blue-jay-migration/resolve_ecoregions_dir/ecoregions.dbf
/home/jovyan/earth-analytics/data/blue-jay-migration/resolve_ecoregions_dir/ecoregions.shp
/home/jovyan/earth-analytics/data/blue-jay-migration/resolve_ecoregions_dir/ecoregions.shx
/home/jovyan/earth-analytics/data/blue-jay-migration/resolve_ecoregions_dir/ecoregions.prj



### Register and log in to GBIF

You will need a [GBIF account](https://www.gbif.org/) to complete this
challenge. You can use your GitHub account to authenticate with GBIF.
Then, run the following code to store your credentials as environment
variables for the current session.

> **Warning**
>
> Your email address **must** match the email you used to sign up for
> GBIF!

> **Tip**
>
> If you accidentally enter your credentials incorrectly, set
> `reset_credentials=True` instead of `reset_credentials=False` and
> re-run the cell.

In [5]:
reset_credentials = False
# GBIF needs a username, password, and email
credentials = dict(
    GBIF_USER=(input, 'username'),
    GBIF_PWD=(getpass, 'password'),
    GBIF_EMAIL=(input, 'email'),
)
for env_variable, (prompt_func, prompt_text) in credentials.items():
    # Delete credential from environment if requested
    if reset_credentials and (env_variable in os.environ):
        os.environ.pop(env_variable)
    # Ask for credential and save to environment
    if env_variable not in os.environ:
        os.environ[env_variable] = prompt_func(prompt_text)

In [75]:
%%bash
echo $GBIF_USER

hannah.rieder
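
To confirm that all three credentials are set without echoing any of them, a small check along these lines may help. The helper `missing_credentials` is a hypothetical name introduced for illustration; the example uses a plain dictionary in place of `os.environ` so it runs anywhere.

```python
import os

REQUIRED_VARIABLES = ('GBIF_USER', 'GBIF_PWD', 'GBIF_EMAIL')


def missing_credentials(environ=os.environ, required=REQUIRED_VARIABLES):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# Example with a plain dict standing in for os.environ
example_env = {'GBIF_USER': 'some-user', 'GBIF_PWD': ''}
print(missing_credentials(example_env))  # ['GBIF_PWD', 'GBIF_EMAIL']
```

Calling `missing_credentials()` with no arguments checks the real environment; an empty list means every credential is present.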


### Get the species key

> **Your task**
>
> 1.  Replace the `species_name` with the name of the species you want
>     to look up
> 2.  Run the code to get the species key

In [6]:
# Query Blue Jay species, Cyanocitta cristata
species_info = species.name_lookup('Cyanocitta cristata', rank='species')

# Get the first result
first_result = species_info['results'][0]
first_result

{'key': 168007494,
 'datasetKey': 'a6c6cead-b5ce-4a4e-8cf5-1542ba708dec',
 'nubKey': 2482593,
 'parentKey': 168007493,
 'parent': 'Cyanocitta',
 'kingdom': 'Animalia',
 'phylum': 'Chordata',
 'order': 'Passeriformes',
 'family': 'Corvidae',
 'genus': 'Cyanocitta',
 'species': 'Cyanocitta cristata',
 'kingdomKey': 167920784,
 'phylumKey': 168001715,
 'classKey': 168003099,
 'orderKey': 168005163,
 'familyKey': 168007364,
 'genusKey': 168007493,
 'speciesKey': 168007494,
 'scientificName': 'Cyanocitta cristata',
 'canonicalName': 'Cyanocitta cristata',
 'nameType': 'SCIENTIFIC',
 'taxonomicStatus': 'ACCEPTED',
 'rank': 'SPECIES',
 'origin': 'SOURCE',
 'numDescendants': 0,
 'numOccurrences': 0,
 'habitats': [],
 'nomenclaturalStatus': [],
 'threatStatuses': [],
 'descriptions': [],
 'vernacularNames': [{'vernacularName': 'blåskrike', 'language': 'nob'}],
 'higherClassificationMap': {'167920784': 'Animalia',
  '168001715': 'Chordata',
  '168003099': 'Aves',
  '168005163': 'Passeriformes',


In [7]:
# Get the Cyanocitta cristata species key (nubKey)
species_key = first_result['nubKey']
species_key

2482593

In [8]:
# Check the result
first_result['species'], species_key

('Cyanocitta cristata', 2482593)
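
The cells above take the first lookup result and read its `nubKey`. That works here, but a lookup result is not guaranteed to include a `nubKey`, so a more defensive variant could scan the results for the first one that has it. The sketch below assumes the results are a list of dictionaries shaped like the output above; `first_backbone_key` and the first sample entry are hypothetical.

```python
def first_backbone_key(results):
    """Return the first `nubKey` found in a list of lookup results."""
    for result in results:
        if 'nubKey' in result:
            return result['nubKey']
    raise LookupError('No result carries a nubKey')


# Hand-written stand-ins for entries of species_info['results']
sample_results = [
    {'key': 1, 'species': 'Cyanocitta cristata'},  # hypothetical entry, no nubKey
    {'key': 168007494, 'nubKey': 2482593, 'species': 'Cyanocitta cristata'},
]
print(first_backbone_key(sample_results))  # 2482593
```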

### Download data from GBIF

::: {.callout-task title="Submit a request to GBIF"}

1.  Replace `csv_file_pattern` with a string that will match **any**
    `.csv` file when used in the `glob` function. HINT: the character
    `*` represents any number of any values except the file separator
    (e.g. `/`)

2.  Add parameters to the GBIF download function, `occ.download()` to
    limit your query to:

    -   observations
    -   from 2023
    -   with spatial coordinates.

3.  Then, run the download. **This can take a few minutes**.

:::
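
The wildcard behavior described in the hint can be checked on a throwaway folder. This is a minimal sketch with made-up file names: `*.csv` matches the `.csv` file at the top level but neither the `.txt` file nor the `.csv` file inside a subfolder.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as demo_dir:
    # Create a.csv, notes.txt, and nested/b.csv as empty files
    os.makedirs(os.path.join(demo_dir, 'nested'))
    for relative_path in ['a.csv', 'notes.txt', os.path.join('nested', 'b.csv')]:
        open(os.path.join(demo_dir, relative_path), 'w').close()

    # `*` does not cross the file separator, so nested/b.csv is not matched
    matches = glob(os.path.join(demo_dir, '*.csv'))
    matched_names = sorted(os.path.basename(match) for match in matches)

print(matched_names)  # ['a.csv']
```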

In [9]:

os.environ['GBIF_DOWNLOAD_KEY'] = '0003394-241007104925546'


In [10]:
# Only download blue jay data once
gbif_pattern = os.path.join(gbif_dir, '*.csv')
if not glob(gbif_pattern):
    # Submit a query to GBIF unless a download key already exists
    if 'GBIF_DOWNLOAD_KEY' not in os.environ:
        gbif_query = occ.download([
            f"speciesKey = {species_key}",
            "year = 2023",
            "hasCoordinate = TRUE",
        ])
        download_key = gbif_query[0]
        os.environ['GBIF_DOWNLOAD_KEY'] = download_key
    else:
        download_key = os.environ['GBIF_DOWNLOAD_KEY']

    # Wait for the download to build, checking the status every 5 seconds
    wait = occ.download_meta(download_key)['status']
    while wait != 'SUCCEEDED':
        time.sleep(5)
        wait = occ.download_meta(download_key)['status']

    # Download GBIF data
    download_info = occ.download_get(download_key, path=data_dir)

    # Unzip GBIF data
    with zipfile.ZipFile(download_info['path']) as download_zip:
        download_zip.extractall(path=gbif_dir)

# Find the extracted .csv file path
gbif_path = glob(gbif_pattern)[0]
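
The waiting loop above keeps polling until the status is `'SUCCEEDED'`, which means it would never stop if the request failed. One way to guard against that is sketched below. It is an illustration, not part of `pygbif`: the function `wait_for_download`, its parameters, and the set of failure states are assumptions, and a simulated status sequence stands in for real calls to GBIF.

```python
import time

# Assumed terminal failure states; check GBIF's documentation for the full list
FAILED_STATES = {'FAILED', 'KILLED', 'CANCELLED'}


def wait_for_download(get_status, poll_seconds=5, timeout_seconds=1800,
                      sleep=time.sleep):
    """Poll `get_status()` until it returns 'SUCCEEDED'.

    Raises RuntimeError on a failure state and TimeoutError if the
    status has not succeeded within `timeout_seconds`.
    """
    waited = 0
    while True:
        status = get_status()
        if status == 'SUCCEEDED':
            return status
        if status in FAILED_STATES:
            raise RuntimeError(f'Download ended with status {status}')
        if waited >= timeout_seconds:
            raise TimeoutError(f'Still {status} after {waited} seconds')
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence so the sketch runs without contacting GBIF
fake_statuses = iter(['PREPARING', 'RUNNING', 'SUCCEEDED'])
final_status = wait_for_download(
    lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)  # SUCCEEDED
```

In the notebook, `get_status` could be `lambda: occ.download_meta(download_key)['status']`, which is the same call the loop above uses.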

### Load the GBIF data into Python as a DataFrame, `gbif_df`


In [11]:
!head -n 2 $gbif_path 

gbifID	datasetKey	occurrenceID	kingdom	phylum	class	order	family	genus	species	infraspecificEpithet	taxonRank	scientificName	verbatimScientificName	verbatimScientificNameAuthorship	countryCode	locality	stateProvince	occurrenceStatus	individualCount	publishingOrgKey	decimalLatitude	decimalLongitude	coordinateUncertaintyInMeters	coordinatePrecision	elevation	elevationAccuracy	depth	depthAccuracy	eventDate	day	month	year	taxonKey	speciesKey	basisOfRecord	institutionCode	collectionCode	catalogNumber	recordNumber	identifiedBy	dateIdentified	license	rightsHolder	recordedBy	typeStatus	establishmentMeans	lastInterpreted	mediaType	issue
4686102821	4fa7b334-ce0d-4e88-aaae-2e0c138d049e	URN:catalog:CLO:EBIRD:OBS1601518595	Animalia	Chordata	Aves	Passeriformes	Corvidae	Cyanocitta	Cyanocitta cristata		SPECIES	Cyanocitta cristata (Linnaeus, 1758)	Cyanocitta cristata		US	Hartman Reserve Nature Center	Iowa	PRESENT	1	e2e717bf-551a-4917-bdc9-4fa0f342c530	42.52484	-92.41017							2023-01-08	8	1	2023	248259

In [12]:
# Load the blue jay GBIF data
gbif_df = pd.read_csv(
    gbif_path, 
    delimiter='\t',
    index_col='gbifID',
    usecols=['gbifID', 'decimalLatitude', 'decimalLongitude', 'month']
)
gbif_df.head()

| gbifID | decimalLatitude | decimalLongitude | month |
|---|---|---|---|
| 4686102821 | 42.52484 | -92.41017 | 1 |
| 4700003912 | 46.159824 | -72.755104 | 1 |
| 4630316247 | 34.93356 | -82.32221 | 2 |
| 4650245696 | 41.33835 | -75.890724 | 2 |
| 4709169271 | 42.951202 | -72.328316 | 3 |


### Convert the gbif_df into a GeoDataFrame.

In [13]:
# Convert the gbif_df into a GeoDataFrame.
# Keep only the month and geometry columns of the gbif_gdf.
gbif_gdf = (
    gpd.GeoDataFrame(
        gbif_df,
        geometry=gpd.points_from_xy(
            gbif_df.decimalLongitude,
            gbif_df.decimalLatitude),
        crs='EPSG:4326')
    [['month', 'geometry']]
)

gbif_gdf


| gbifID | month | geometry |
|---|---|---|
| 4686102821 | 1 | POINT (-92.41017 42.52484) |
| 4700003912 | 1 | POINT (-72.7551 46.15982) |
| 4630316247 | 2 | POINT (-82.32221 34.93356) |
| 4650245696 | 2 | POINT (-75.89072 41.33835) |
| 4709169271 | 3 | POINT (-72.32832 42.9512) |
| ... | ... | ... |
| 4650150283 | 8 | POINT (-89.43662 43.13623) |
| 4816590623 | 10 | POINT (-84.31701 33.79817) |
| 4615093504 | 11 | POINT (-76.13927 44.87485) |
| 4813634349 | 12 | POINT (-72.85446 41.85679) |


### Download the ecoregion data

In [14]:
# Set up the ecoregion boundary URL
ecoregions_url = ("https://storage.googleapis.com"
                  "/teow2016/Ecoregions2017.zip"
                  )

# Set up a path to save the data on your machine
ecoregions_dir = os.path.join(data_dir, 'resolve_ecoregions_dir')

# Make the ecoregions directory
os.makedirs(ecoregions_dir, exist_ok=True)

# Join ecoregions shapefile path
ecoregions_path = os.path.join(ecoregions_dir, 'ecoregions.shp')

# Only download once
if not os.path.exists(ecoregions_path):
    my_gdf = gpd.read_file(ecoregions_url)
    my_gdf.to_file(ecoregions_path)

In [15]:
%%bash
find ~/earth-analytics/data/blue-jay-migration -name '*.shp'

/home/jovyan/earth-analytics/data/blue-jay-migration/resolve_ecoregions_dir/ecoregions.shp


In [39]:
# Open up the ecoregions boundaries
ecoregions_gdf = gpd.read_file(ecoregions_path)

# Plot the ecoregions to check download
#ecoregions_gdf.plot()

In [40]:
ecoregions_gdf.head(2)

|  | OBJECTID | ECO_NAME | BIOME_NUM | BIOME_NAME | REALM | ECO_BIOME_ | NNH | ECO_ID | SHAPE_LENG | SHAPE_AREA | NNH_NAME | COLOR | COLOR_BIO | COLOR_NNH | LICENSE | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Adelie Land tundra | 11.0 | Tundra | Antarctica | AN11 | 1 | 117 | 9.74978 | 0.038948 | Half Protected | #63CFAB | #9ED7C2 | #257339 | CC-BY 4.0 | MULTIPOLYGON (((158.7141 -69.60657, 158.71264 ... |
| 1 | 2.0 | Admiralty Islands lowland rain forests | 1.0 | Tropical & Subtropical Moist Broadleaf Forests | Australasia | AU01 | 2 | 135 | 4.800349 | 0.170599 | Nature Could Reach Half Protected | #70A800 | #38A700 | #7BC141 | CC-BY 4.0 | MULTIPOLYGON (((147.28819 -2.57589, 147.2715 -... |


In [42]:
# Make the ecoregions_gdf easier to work with.
ecoregions_clean_gdf = ecoregions_gdf[['OBJECTID', 'ECO_NAME', 'SHAPE_AREA', 'geometry']]

ecoregions_clean_gdf.head()


|  | OBJECTID | ECO_NAME | SHAPE_AREA | geometry |
|---|---|---|---|---|
| 0 | 1.0 | Adelie Land tundra | 0.038948 | MULTIPOLYGON (((158.7141 -69.60657, 158.71264 ... |
| 1 | 2.0 | Admiralty Islands lowland rain forests | 0.170599 | MULTIPOLYGON (((147.28819 -2.57589, 147.2715 -... |
| 2 | 3.0 | Aegean and Western Turkey sclerophyllous and m... | 13.844952 | MULTIPOLYGON (((26.88659 35.32161, 26.88297 35... |
| 3 | 4.0 | Afghan Mountains semi-desert | 1.355536 | MULTIPOLYGON (((65.48655 34.71401, 65.52872 34... |
| 4 | 5.0 | Ahklun and Kilbuck Upland Tundra | 8.196573 | MULTIPOLYGON (((-160.26404 58.64097, -160.2673... |


### Normalize Data

In [43]:
gbif_ecoregion_gdf = (
    ecoregions_clean_gdf
    .to_crs(gbif_gdf.crs)
    .sjoin(
        gbif_gdf,
        how='inner',
        predicate='contains')
    [['gbifID', 'OBJECTID','month', 'ECO_NAME']]
    )

gbif_ecoregion_gdf

|  | gbifID | OBJECTID | month | ECO_NAME |
|---|---|---|---|---|
| 12 | 4813860902 | 13.0 | 11 | Alberta-British Columbia foothills forests |
| 12 | 4755451365 | 13.0 | 12 | Alberta-British Columbia foothills forests |
| 12 | 4831334232 | 13.0 | 12 | Alberta-British Columbia foothills forests |
| 12 | 4778369322 | 13.0 | 2 | Alberta-British Columbia foothills forests |
| 12 | 4610790022 | 13.0 | 2 | Alberta-British Columbia foothills forests |
| ... | ... | ... | ... | ... |
| 833 | 4816429374 | 839.0 | 9 | Northern Rockies conifer forests |
| 833 | 4774079338 | 839.0 | 7 | Northern Rockies conifer forests |
| 833 | 4773268048 | 839.0 | 2 | Northern Rockies conifer forests |
| 833 | 4730890881 | 839.0 | 11 | Northern Rockies conifer forests |
