# Access locations and times of Hermit Warbler encounters

For this challenge, you will use a database called the [Global
Biodiversity Information Facility (GBIF)](https://www.gbif.org/). GBIF
compiles species observation data from all over the world, and
includes everything from museum specimens to photos taken by citizen
scientists in their backyards.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Explore GBIF</div></div><div class="callout-body-container callout-body"><p>Before you get started, go to the <a
href="https://www.gbif.org/occurrence/search">GBIF occurrences search
page</a> and explore the data.</p></div></div>

> **Contribute to open data**
>
> You can get your own observations added to GBIF using
> [iNaturalist](https://www.inaturalist.org/)!

### Set up your code to prepare for download

To access GBIF data from Python, you need a package called `pygbif`,
which may not be included in your environment. Install it by running
the cell below:

In [1]:
%%bash
pip install pygbif



<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Import packages</div></div><div class="callout-body-container callout-body"><p>In the imports cell, we’ve included some packages that you will need.
Add imports for packages that will help you:</p>
<ol type="1">
<li>Work with reproducible file paths</li>
<li>Work with tabular data</li>
</ol></div></div>

In [1]:
import os
import pathlib
import time
import zipfile
from getpass import getpass
from glob import glob
import pandas as pd

import pygbif.occurrences as occ
import pygbif.species as species

In [74]:
# Create data directory in the home folder
data_dir = os.path.join(
    # Home directory
    pathlib.Path.home(),
    # Earth analytics data directory
    'earth-analytics',
    'data',
    # Project directory
    'species-distribution',
)
os.makedirs(data_dir, exist_ok=True)

# Define the directory name for GBIF data
gbif_dir = os.path.join(data_dir, 'hermit-warbler', '2023')

In [75]:
gbif_dir

'/home/jovyan/earth-analytics/data/species-distribution/hermit-warbler/2023'
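
The same species-and-year layout can also be built with `pathlib` alone. The following is a minimal sketch, not part of the original notebook: `year_dir` is a hypothetical helper, and it writes into a temporary directory so that running it does not touch your real data folder.

```python
import tempfile
from pathlib import Path


def year_dir(base_dir, species_slug, year):
    """Return (and create) the directory for one species and year."""
    path = Path(base_dir) / species_slug / str(year)
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrate in a throwaway directory instead of the real data folder
with tempfile.TemporaryDirectory() as tmp:
    example_dir = year_dir(tmp, 'hermit-warbler', 2023)
    example_parts = example_dir.parts[-2:]
    example_exists = example_dir.is_dir()

print(example_parts)
```

Passing the year as an argument makes it easier to keep each year's download in its own folder.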


### Register and log in to GBIF

You will need a [GBIF account](https://www.gbif.org/) to complete this
challenge. You can use your GitHub account to authenticate with GBIF.
Then, run the following code to save your credentials on your computer.

> **Warning**
>
> Your email address **must** match the email you used to sign up for
> GBIF!

> **Tip**
>
> If you enter your credentials incorrectly, set
> `reset_credentials=True` instead of `reset_credentials=False` and
> rerun the cell to be prompted again.

In [3]:
# Set to True to clear saved credentials and be prompted again
reset_credentials = False
# GBIF needs a username, password, and email.
# Each entry pairs a prompt function with the text it displays;
# gbif_user, gbif_pwd, and gbif_email must be defined before this
# cell runs.
credentials = dict(
    GBIF_USER=(input, gbif_user),
    GBIF_PWD=(getpass, gbif_pwd),
    GBIF_EMAIL=(input, gbif_email),
)

for env_variable, (prompt_func, prompt_text) in credentials.items():
    # Delete credential from environment if requested
    if reset_credentials and (env_variable in os.environ):
        os.environ.pop(env_variable)
    # Ask for credential and save to environment
    if env_variable not in os.environ:
        os.environ[env_variable] = prompt_func(prompt_text)
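
Before submitting a request, it can help to confirm that all three credentials were saved without printing their values. The sketch below is an illustration only: `missing_credentials` and `REQUIRED` are hypothetical names, and a plain dictionary stands in for `os.environ` in the example call.

```python
import os

REQUIRED = ('GBIF_USER', 'GBIF_PWD', 'GBIF_EMAIL')


def missing_credentials(env=os.environ, required=REQUIRED):
    """List the required variable names that are absent or empty."""
    return [name for name in required if not env.get(name)]


# Example with a plain dict standing in for os.environ
example_env = {'GBIF_USER': 'example-user', 'GBIF_EMAIL': ''}
print(missing_credentials(example_env))
```

Calling `missing_credentials()` with no arguments checks the real environment of the running session.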


### Get the species key

> **Your task**
>
> 1.  Replace the species name passed to `species.name_lookup()` with
>     the name of the species you want to look up
> 2.  Run the code to get the species key

In [4]:
# Query species
species_info = species.name_lookup('Setophaga occidentalis', rank='SPECIES')

# Get the first result
first_result = species_info['results'][0]

# Get the species key (nubKey)
species_key = first_result['nubKey']

# Check the result
first_result['species'], species_key

('Setophaga occidentalis', 6092644)
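
Taking the first result assumes that it carries a `nubKey`, which may not hold for every lookup result. A slightly more defensive sketch is shown below; `first_backbone_key` is a hypothetical helper, and `sample_response` is a made-up dictionary that only mimics the fields used above, so it runs without contacting GBIF.

```python
def first_backbone_key(response, name):
    """Return the nubKey of the first result matching `name`, else None."""
    for result in response.get('results', []):
        if result.get('species') == name and 'nubKey' in result:
            return result['nubKey']
    return None


# Hypothetical response; only the fields used above are included
sample_response = {
    'results': [
        {'species': 'Setophaga occidentalis'},  # entry without a nubKey
        {'species': 'Setophaga occidentalis', 'nubKey': 6092644},
    ]
}
print(first_backbone_key(sample_response, 'Setophaga occidentalis'))
```

With real data you would pass `species_info` in place of `sample_response`.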

### Download data from GBIF

> **Submit a request to GBIF**
>
> 1.  Replace `csv_file_pattern` with a string that will match **any**
>     `.csv` file when used in the `glob` function. HINT: the character
>     `*` represents any number of any characters except the file
>     separator (e.g. `/`)
>
> 2.  Add parameters to the GBIF download function, `occ.download()`, to
>     limit your query to:
>
>     -   observations
>     -   from 2023
>     -   with spatial coordinates.
>
> 3.  Then, run the download. **This can take a few minutes**.
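
To see how the `*` wildcard behaves, the sketch below creates a few empty files in a temporary directory and matches them with `glob`. The file names are invented for the illustration; only the pattern `*.csv` comes from the task above.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as tmp:
    # Create two files directly in the directory and one in a subfolder
    os.makedirs(os.path.join(tmp, 'nested'))
    for rel_path in ('a.csv', 'notes.txt', os.path.join('nested', 'b.csv')):
        open(os.path.join(tmp, rel_path), 'w').close()

    # '*' matches any characters except the path separator
    matches = sorted(
        os.path.basename(p) for p in glob(os.path.join(tmp, '*.csv')))

print(matches)
```

Only `a.csv` is matched: `notes.txt` has the wrong extension, and `b.csv` sits one folder deeper than the pattern reaches.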

In [81]:
# Only download once
gbif_pattern = os.path.join(gbif_dir, '*.csv')
if not glob(gbif_pattern):
    # Only submit one request to GBIF
    if 'GBIF_DOWNLOAD_KEY' not in os.environ:
        # Submit query to GBIF using the credentials saved in the
        # environment
        gbif_query = occ.download([
            f"speciesKey = {species_key}",
            "year = 2023",
            "hasCoordinate = TRUE",
        ],
        user=os.environ['GBIF_USER'],
        pwd=os.environ['GBIF_PWD'],
        email=os.environ['GBIF_EMAIL'])
        os.environ['GBIF_DOWNLOAD_KEY'] = gbif_query[0]

    # Wait for the download to build
    download_key = os.environ['GBIF_DOWNLOAD_KEY']
    wait = occ.download_meta(download_key)['status']
    while not wait == 'SUCCEEDED':
        time.sleep(5)
        wait = occ.download_meta(download_key)['status']

    # Download GBIF data
    download_info = occ.download_get(download_key, path=data_dir)

    # Unzip GBIF data
    with zipfile.ZipFile(download_info['path']) as download_zip:
        download_zip.extractall(path=gbif_dir)

# Find the extracted .csv file path
gbif_path = glob(gbif_pattern)[0]

In [82]:
gbif_path

'/home/jovyan/earth-analytics/data/species-distribution/hermit-warbler/2023/0040392-240906103802322.csv'
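
The waiting loop above runs until the status is `SUCCEEDED`, so it would never end if a request failed. One possible variation, sketched below, adds a timeout and stops on a failed status. Everything here is an assumption layered on the original: `wait_for_download` is a hypothetical helper, the status names other than `SUCCEEDED` are assumed rather than taken from this notebook, and a fake status function replaces `occ.download_meta()` so the example runs offline.

```python
import time

# Assumed names for unsuccessful end states
FAILED_STATUSES = {'FAILED', 'KILLED', 'CANCELLED'}


def wait_for_download(get_status, interval=5, timeout=600, sleep=time.sleep):
    """Poll get_status() until it returns 'SUCCEEDED'.

    Raises RuntimeError on a failed status and TimeoutError after
    roughly `timeout` seconds of waiting.
    """
    waited = 0
    while True:
        status = get_status()
        if status == 'SUCCEEDED':
            return status
        if status in FAILED_STATUSES:
            raise RuntimeError(f'GBIF download ended with status {status}')
        if waited >= timeout:
            raise TimeoutError(f'Gave up after {waited} seconds')
        sleep(interval)
        waited += interval


# Stand-in for occ.download_meta(key)['status'], so this runs offline
fake_statuses = iter(['PREPARING', 'RUNNING', 'SUCCEEDED'])
result = wait_for_download(lambda: next(fake_statuses), sleep=lambda s: None)
print(result)
```

In the notebook, `get_status` could be `lambda: occ.download_meta(download_key)['status']`, with `download_key` holding the value stored in `GBIF_DOWNLOAD_KEY`.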

### Load the GBIF data into Python

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Load GBIF data</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Look at the beginning of the file you downloaded using the code
below. What do you think the <strong>delimiter</strong> is?</li>
<li>Run the following code cell. What happens?</li>
<li>Uncomment and modify the parameters of <code>pd.read_csv()</code>
below until your data loads successfully and you have only the columns
you want.</li>
</ol></div></div>

You can use the following code to look at the beginning of your file:

In [78]:
!head -n 2 $gbif_path 

gbifID	datasetKey	occurrenceID	kingdom	phylum	class	order	family	genus	species	infraspecificEpithet	taxonRank	scientificName	verbatimScientificName	verbatimScientificNameAuthorship	countryCode	locality	stateProvince	occurrenceStatus	individualCount	publishingOrgKey	decimalLatitude	decimalLongitude	coordinateUncertaintyInMeters	coordinatePrecision	elevation	elevationAccuracy	depth	depthAccuracy	eventDate	day	month	year	taxonKey	speciesKey	basisOfRecord	institutionCode	collectionCode	catalogNumber	recordNumber	identifiedBy	dateIdentified	license	rightsHolder	recordedBy	typeStatus	establishmentMeans	lastInterpreted	mediaType	issue
4926110476	50c9509d-22c7-4a22-a47d-8c48425ef4a7	https://www.inaturalist.org/observations/236396131	Animalia	Chordata	Aves	Passeriformes	Parulidae	Setophaga	Setophaga occidentalis		SPECIES	Setophaga occidentalis (J.K.Townsend, 1837)	Setophaga occidentalis		MX		México	PRESENT		28eb1a3f-1c15-4a95-931a-4af90ecb574d	19.591737	-99.055292	31.0						2023-11-17T11:45	17	
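
If the delimiter is hard to spot by eye, you can count candidate characters in the header line. The sketch below is a simple heuristic, not a general-purpose parser: `guess_delimiter` is a hypothetical helper, and `header` holds only the first few column names from the output above.

```python
def guess_delimiter(header_line, candidates=('\t', ',', ';', '|')):
    """Return the candidate that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


# First few column names from the header shown above
header = 'gbifID\tdatasetKey\toccurrenceID\tkingdom\tphylum'
print(repr(guess_delimiter(header)))
```

Using `repr()` makes an invisible character such as a tab show up as `'\t'`.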

#### Year 2023

In [79]:
# Load the GBIF data for 2023
gbif_df_2023 = pd.read_csv(
    gbif_path,
    delimiter='\t',
    index_col='gbifID',
    on_bad_lines='skip',
    usecols=[
        'gbifID', 'countryCode', 'stateProvince',
        'decimalLatitude', 'decimalLongitude', 'month', 'year'],
)

gbif_df_2023.head(2)

| gbifID | countryCode | stateProvince | decimalLatitude | decimalLongitude | month | year |
|---|---|---|---|---|---|---|
| 4926110476 | MX | México | 19.591737 | -99.055292 | 11 | 2023 |
| 4926108145 | MX | México | 19.591737 | -99.055292 | 11 | 2023 |


In [80]:
gbif_df_2023.value_counts()

countryCode  stateProvince  decimalLatitude  decimalLongitude  month  year
US           California     34.360283        -118.395890       4      2023    194
             Arizona        32.411125        -110.706480       8      2023    133
                            31.917200        -109.279100       8      2023     97
             Oregon         44.538010        -123.528110       6      2023     96
             Arizona        31.903100        -109.277600       8      2023     78
                                                                             ... 
             Washington     47.022540        -123.133350       6      2023      1
                            47.041466        -122.494354       5      2023      1
                            47.066326        -122.498290       6      2023      1
                            47.075360        -123.270805       5      2023      1
                            47.081253        -122.718056       4      2023      1
Name: count, Length: 81
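
Called on a whole DataFrame, `value_counts()` counts each unique combination of all columns, which is why the output above lists individual coordinates. To summarize by a single column or a pair of columns instead, something like the sketch below may be more useful. The rows in `example_df` are invented for illustration and are not GBIF records.

```python
import pandas as pd

# Hypothetical rows with some of the same columns as the table above
example_df = pd.DataFrame({
    'countryCode': ['US', 'US', 'US', 'MX'],
    'stateProvince': ['California', 'California', 'Arizona', 'México'],
    'month': [4, 4, 8, 11],
})

# Count rows per value of one column instead of per unique row
per_month = example_df['month'].value_counts().sort_index()

# Count rows per combination of two columns
per_state_month = example_df.groupby(['stateProvince', 'month']).size()

print(per_month)
print(per_state_month)
```

The same two lines should work on `gbif_df_2023` if you replace `example_df` with it.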

#### Year 2022

In [54]:
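# Load the GBIF data for 2022
# NOTE: gbif_path must point to the 2022 download; first rerun the
# download cells above with "year = 2022", a matching gbif_dir, and
# GBIF_DOWNLOAD_KEY removed from os.environ
gbif_df_2022 = pd.read_csv(
    gbif_path,
    delimiter='\t',
    index_col='gbifID',
    on_bad_lines='skip',
    usecols=[
        'gbifID', 'countryCode', 'stateProvince',
        'decimalLatitude', 'decimalLongitude', 'month', 'year'],
)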

gbif_df_2022.head(2)

| gbifID | countryCode | stateProvince | decimalLatitude | decimalLongitude | month | year |
|---|---|---|---|---|---|---|
| 4901985213 | US | California | 38.341145 | -119.924987 | 6 | 2022 |
| 4901604313 | US | California | 38.341145 | -119.924987 | 6 | 2022 |


In [55]:
gbif_df_2022.value_counts()

countryCode  stateProvince  decimalLatitude  decimalLongitude  month  year
US           California     34.360283        -118.395890       4      2022    170
             Arizona        32.392628        -110.702470       8      2022    133
             Pennsylvania   39.855350        -75.445404        11     2022    121
             Arizona        32.417030        -110.725080       8      2022    114
             California     32.553860        -117.084620       5      2022    103
                                                                             ... 
             Washington     47.500560        -123.282100       5      2022      1
                            47.472195        -123.838000       7      2022      1
                            47.469204        -122.914185       5      2022      1
                            47.468520        -123.845520       7      2022      1
                            47.464817        -123.852104       7      2022      1
Name: count, Length: 82

#### Year 2021

In [60]:
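# Load the GBIF data for 2021
# NOTE: gbif_path must point to the 2021 download; first rerun the
# download cells above with "year = 2021", a matching gbif_dir, and
# GBIF_DOWNLOAD_KEY removed from os.environ
gbif_df_2021 = pd.read_csv(
    gbif_path,
    delimiter='\t',
    index_col='gbifID',
    on_bad_lines='skip',
    usecols=[
        'gbifID', 'countryCode', 'stateProvince',
        'decimalLatitude', 'decimalLongitude', 'month', 'year'],
)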

gbif_df_2021.head(2)

| gbifID | countryCode | stateProvince | decimalLatitude | decimalLongitude | month | year |
|---|---|---|---|---|---|---|
| 4936044596 | US | California | 40.893766 | -123.769021 | 5 | 2021 |
| 4908633473 | US | California | 40.190598 | -121.102956 | 5 | 2021 |


In [61]:
gbif_df_2021.value_counts()

countryCode  stateProvince  decimalLatitude  decimalLongitude  month  year
US           California     34.360283        -118.39589        4      2021    179
                            32.553860        -117.08462        5      2021    145
                            37.995620        -122.97821        9      2021    125
             Arizona        31.727170        -110.88075        4      2021    103
                            31.917200        -109.27910        8      2021    101
                                                                             ... 
GT           Alta Verapaz   15.306603        -90.45372         10     2021      1
                            15.412526        -90.40979         8      2021      1
US           Washington     47.031920        -123.09377        7      2021      1
                                                               8      2021      1
                            47.040035        -122.49458        5      2021      1
Name: count, Length: 78

#### Year 2020

In [66]:
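# Load the GBIF data for 2020
# NOTE: gbif_path must point to the 2020 download; first rerun the
# download cells above with "year = 2020", a matching gbif_dir, and
# GBIF_DOWNLOAD_KEY removed from os.environ
gbif_df_2020 = pd.read_csv(
    gbif_path,
    delimiter='\t',
    index_col='gbifID',
    on_bad_lines='skip',
    usecols=[
        'gbifID', 'countryCode', 'stateProvince',
        'decimalLatitude', 'decimalLongitude', 'month', 'year'],
)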

gbif_df_2020.head(2)

| gbifID | countryCode | stateProvince | decimalLatitude | decimalLongitude | month | year |
|---|---|---|---|---|---|---|
| 4946437756 | US | California | 32.545422 | -117.073311 | 9 | 2020 |
| 4946233537 | US | California | 35.970859 | -118.485165 | 6 | 2020 |


In [67]:
gbif_df_2020.value_counts()

countryCode  stateProvince  decimalLatitude  decimalLongitude  month  year
US           Arizona        31.917200        -109.279100       8      2020    139
                            31.786108        -109.298134       9      2020    106
                            32.442710        -110.759865       8      2020     74
             California     33.831070        -118.123800       12     2020     71
                            37.768635        -122.474930       9      2020     58
                                                                             ... 
CR           Puntarenas     10.305712        -84.811625        9      2020      1
US           Washington     47.059536        -122.979630       6      2020      1
                            47.060440        -122.970214       5      2020      1
                            47.083153        -122.843710       6      2020      1
MX           Tlaxcala       19.458649        -98.580375        1      2020      1
Name: count, Length: 66

#### Year 2019

In [72]:
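# Load the GBIF data for 2019
# NOTE: gbif_path must point to the 2019 download; first rerun the
# download cells above with "year = 2019", a matching gbif_dir, and
# GBIF_DOWNLOAD_KEY removed from os.environ
gbif_df_2019 = pd.read_csv(
    gbif_path,
    delimiter='\t',
    index_col='gbifID',
    on_bad_lines='skip',
    usecols=[
        'gbifID', 'countryCode', 'stateProvince',
        'decimalLatitude', 'decimalLongitude', 'month', 'year'],
)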

gbif_df_2019.head(2)

| gbifID | countryCode | stateProvince | decimalLatitude | decimalLongitude | month | year |
|---|---|---|---|---|---|---|
| 4910852571 | CA | Ontario | 43.853489 | -78.897137 | 4 | 2019 |
| 4908236882 | US | California | 38.974473 | -120.39124 | 7 | 2019 |


In [73]:
gbif_df_2019.value_counts()

countryCode  stateProvince  decimalLatitude  decimalLongitude  month  year
CA           Ontario        43.853080        -78.896050        4      2019    159
US           Arizona        32.392628        -110.702470       8      2019    127
             California     33.820480        -118.128780       4      2019     94
             Arizona        32.411125        -110.706480       8      2019     86
             Colorado       40.009860        -105.274310       5      2019     75
                                                                             ... 
             California     37.894768        -121.938970       5      2019      1
                            37.893047        -122.243500       7      2019      1
                            37.890327        -122.314766       5      2019      1
                            37.889717        -122.235020       4      2019      1
                            37.913795        -122.284730       9      2019      1
Name: count, Length: 62
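
Once each year is loaded, the yearly tables can be stacked into one DataFrame for comparison. The sketch below shows one way to do that under stated assumptions: `yearly_text` holds two tiny made-up tab-separated files in place of the real downloads, and the variable names are hypothetical.

```python
import io

import pandas as pd

# Two tiny hypothetical tab-separated files standing in for yearly downloads
yearly_text = {
    2022: 'gbifID\tmonth\tyear\n1\t6\t2022\n2\t6\t2022\n',
    2023: 'gbifID\tmonth\tyear\n3\t11\t2023\n',
}

# Read each file with the same settings used above, then stack them
frames = [
    pd.read_csv(io.StringIO(text), delimiter='\t', index_col='gbifID')
    for text in yearly_text.values()
]
gbif_df = pd.concat(frames)

# Count occurrences per year
per_year = gbif_df['year'].value_counts().sort_index()
print(per_year)
```

With the real data, the list passed to `pd.concat()` would be `[gbif_df_2019, gbif_df_2020, gbif_df_2021, gbif_df_2022, gbif_df_2023]`.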