# Migration Data Download

Get occurrence data from the Global Biodiversity Information Facility
(GBIF)

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Import packages</div></div><div class="callout-body-container callout-body"><p>In the imports cell, we’ve included some packages that you will need.
Add imports for packages that will help you:</p>
<ol type="1">
<li>Work with reproducible file paths</li>
<li>Work with tabular data</li>
</ol></div></div>

In [20]:
import time
import zipfile
import os
from getpass import getpass
from glob import glob

import earthpy
import geopandas as gpd
import pandas as pd
from pathlib import Path
import pygbif.occurrences as occ
import pygbif.species as species
import requests
import shutil

For this challenge, you will need to download some data to the computer
you’re working on. We suggest using the `earthpy` library we develop to
manage your downloads, because it builds in best practices for:

1.  Where to store your data
2.  Dealing with archived data like .zip files
3.  Avoiding version control problems
4.  Making sure your code works cross-platform
5.  Avoiding duplicate downloads

If you’re working on one of our assignments through GitHub Classroom, it
also lets us build in some handy defaults so that you can see your data
files while you work.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Create a project folder</div></div><div class="callout-body-container callout-body"><p>The code below will help you get started with making a project
directory:</p>
<ol type="1">
<li>Replace <code>'your-project-directory-name-here'</code> with a
<strong>descriptive</strong> name</li>
<li>Run the cell</li>
<li>The code should have printed out the path to your data files. Check
that your data directory exists and has data in it using the terminal or
your Finder/File Explorer.</li>
</ol></div></div>

> **File structure**
>
> These days, a lot of people find their files by searching for them or
> selecting from a `Bookmarks` or `Recents` list. Whether or not you use
> it, your computer also keeps files in a **tree** structure of folders.
> Put another way, you can organize and find files by travelling along a
> unique **path**, e.g. `My Drive` \> `Documents` \>
> `My awesome project` \> `A project file` where each subsequent folder
> is **inside** the previous one. This is convenient because all the
> files for a project can be in the same place, and both people and
> computers can rapidly locate files they want, provided they remember
> the path.
>
> You may notice that when Python prints out a file path like this, the
> folder names are **separated** by a `/` or `\` (depending on your
> operating system). This character is called the **file separator**,
> and it tells you that the next piece of the path is **inside** the
> previous one.
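
As a small illustration of the file separator, the sketch below uses the pure path classes from `pathlib`, which only manipulate path text and never touch the disk. The folder and file names are made up for the example.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical folder names, used only for illustration
folders = ["Documents", "my-awesome-project", "data.csv"]

# The same pieces joined with each platform's file separator
posix_path = PurePosixPath("/home/user").joinpath(*folders)
windows_path = PureWindowsPath("C:/Users/user").joinpath(*folders)

print(posix_path)    # /home/user/Documents/my-awesome-project/data.csv
print(windows_path)  # C:\Users\user\Documents\my-awesome-project\data.csv

# Each piece of the path is inside the previous one
print(posix_path.parts)
print(windows_path.parent.name)  # my-awesome-project
```

Building paths from pieces like this, rather than typing the separator yourself, is one way to keep code working on more than one operating system.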

In [21]:
# Create data directory
project_dir = Path(r"C:\Users\seism\Documents\GitHub\02-migration-tkbravo\vauxdata")
project_dir.mkdir(parents=True, exist_ok=True)
project = project_dir

# Display the project directory
project_dir

WindowsPath('C:/Users/seism/Documents/GitHub/02-migration-tkbravo/vauxdata')
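
The absolute path in the cell above only exists on one computer. A minimal sketch of a more portable alternative builds the folder from a base directory such as `Path.home()`. The helper name `make_project_dir` and the folder names are hypothetical, not part of `earthpy` or the assignment.

```python
from pathlib import Path
import tempfile


def make_project_dir(base_dir, *folder_names):
    """Create (if needed) and return a project folder under base_dir."""
    project_dir = Path(base_dir).joinpath(*folder_names)
    project_dir.mkdir(parents=True, exist_ok=True)
    return project_dir


# In a notebook you might pass Path.home() as the base directory.
# A temporary directory is used here so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = make_project_dir(tmp, "earth-analytics", "vaux-swift-migration")
    demo_exists = demo_dir.is_dir()
    demo_parts = demo_dir.relative_to(tmp).parts
    print(demo_exists, demo_parts)
```

Because `exist_ok=True` is set, running the cell a second time does not raise an error when the folder is already there.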

### STEP 1: Register and log in to GBIF

You will need a [GBIF account](https://www.gbif.org/) to complete this
challenge. You can use your GitHub account to authenticate with GBIF.
Then, run the following code to enter your credentials for the rest of
your session.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-error"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div></div><div class="callout-body-container callout-body"><p>This code is <strong>interactive</strong>, meaning that it will
<strong>ask you for a response</strong>! The prompt can sometimes be
hard to see if you are using VSCode – it appears at the
<strong>top</strong> of your editor window.</p></div></div>

> **Tip**
>
> If you need to save credentials across multiple sessions, you can
> load them from a file such as `.env`. Make sure to add that file to
> `.gitignore` so you don’t commit your credentials to your
> repository!

> **Warning**
>
> Your email address **must** match the email you used to sign up for
> GBIF!

> **Tip**
>
> If you enter your credentials incorrectly, set `reset=True` instead
> of `reset=False` and rerun the cell to be prompted again.

In [22]:
####--------------------------####
#### DO NOT MODIFY THIS CODE! ####
####--------------------------####
# This code ASKS for your credentials
# and saves them for the rest of the session.
# NEVER put your credentials into your code!!!!

# GBIF needs a username, password, and email
# All 3 need to match the account
reset = False

# Request and store username
if ('GBIF_USER' not in os.environ) or reset:
    os.environ['GBIF_USER'] = input('GBIF username:')

# Securely request and store password
if ('GBIF_PWD' not in os.environ) or reset:
    os.environ['GBIF_PWD'] = getpass('GBIF password:')

# Request and store account email address
if ('GBIF_EMAIL' not in os.environ) or reset:
    os.environ['GBIF_EMAIL'] = input('GBIF email:')
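
For the `.env` approach mentioned in the tip above, the sketch below shows one way to copy `KEY=VALUE` lines from a file into `os.environ` using only the standard library. The function name `load_env_file` and the `DEMO_` variable names are hypothetical, and the values are placeholders rather than real credentials. A dedicated package such as `python-dotenv` handles more edge cases than this sketch does.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(env_path, overwrite=False):
    """Copy KEY=VALUE lines from a file into os.environ."""
    loaded = []
    for line in Path(env_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that isn't KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip("'\"")
        if overwrite or key not in os.environ:
            os.environ[key] = value
            loaded.append(key)
    return loaded


# Demo with a throwaway file and placeholder values (not real credentials)
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text(
        "# GBIF credentials\n"
        "DEMO_GBIF_USER=example-user\n"
        "DEMO_GBIF_EMAIL='user@example.com'\n",
        encoding="utf-8",
    )
    loaded_keys = load_env_file(env_file, overwrite=True)

print(loaded_keys)
```

With `overwrite=False`, variables that are already set in the session are left alone, which mirrors the `reset = False` behavior of the interactive cell.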

### STEP 2: Get the taxon key from GBIF

One of the tricky parts about getting occurrence data from GBIF is that
species often have multiple names in different contexts. Luckily, GBIF
also provides a Name Backbone service that will translate scientific and
colloquial names into unique identifiers. GBIF calls these identifiers
**taxon keys**.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Put the species name, <code>{python}  scientific_name</code>, into
the correct location in the code below.</li>
<li>Examine the object you get back from the species query. What part of
it do you think might be the taxon key?</li>
<li>Extract and save the taxon key</li>
</ol></div></div>

In [23]:
# Query the GBIF Name Backbone for the species
backbone = species.name_backbone(name="Chaetura vauxi")

# Extract and save the taxon key (stored under 'usageKey')
usage_key = backbone["usageKey"]
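
A misspelled name would make the lookup above fail with a bare `KeyError`. The sketch below checks the response before extracting the key. It assumes that an unmatched name comes back with a `matchType` of `'NONE'` and no `usageKey` entry. The helper name `extract_usage_key` is hypothetical, and the dictionary is a hand-written stand-in for a response rather than real API output.

```python
def extract_usage_key(backbone_response):
    """Return the taxon key from a name-backbone style response."""
    # Assumption: an unmatched name comes back with matchType 'NONE'
    # and no 'usageKey' entry.
    if backbone_response.get("matchType") == "NONE":
        raise ValueError("GBIF found no match for this name.")
    if "usageKey" not in backbone_response:
        raise KeyError("Response has no 'usageKey' field.")
    return backbone_response["usageKey"]


# Hand-written stand-in for a response (not real API output)
example_response = {
    "usageKey": 5228612,
    "scientificName": "Chaetura vauxi (J.K.Townsend, 1839)",
    "rank": "SPECIES",
    "matchType": "EXACT",
}
example_key = extract_usage_key(example_response)
print(example_key)
```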

### STEP 3: Download data from GBIF

Downloading GBIF data is a multi-step process. However, we’ve provided
you with a chunk of code that handles the API communications and caches
the download. You’ll still need to customize your search.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Submit a request to GBIF</div></div><div class="callout-body-container callout-body"><ol type="1">
<li><p>Replace <code>csv_file_pattern</code> with a string that will
match <strong>any</strong> <code>.csv</code> file when used in the
<code>.rglob()</code> method. HINT: the character <code>*</code>
matches any number of any characters except the file separator
(e.g. <code>/</code> on UNIX systems).</p></li>
<li><p>Add parameters to the GBIF download function,
<code>occ.download()</code> to limit your query to:</p>
<ul>
<li>observations of <span data-__quarto_custom="true"
data-__quarto_custom_type="Shortcode"
data-__quarto_custom_context="Inline"
data-__quarto_custom_id="8"></span></li>
<li>from <span data-__quarto_custom="true"
data-__quarto_custom_type="Shortcode"
data-__quarto_custom_context="Inline"
data-__quarto_custom_id="9"></span></li>
<li>with spatial coordinates.</li>
</ul></li>
<li><p>Then, run the download. <strong>This can take a few
minutes</strong>. You can check your downloads by logging on to the <a
href="https://www.gbif.org/user/download">GBIF website</a>.</p></li>
</ol></div></div>
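
The search limits are written as short `field = value` strings. A minimal sketch of building that list from variables is shown here. The helper name `build_gbif_query` is hypothetical, and the strings follow the same format as the query in the download cell below.

```python
def build_gbif_query(taxon_key, year, has_coordinate=True):
    """Build a list of 'field = value' predicate strings."""
    return [
        f"taxonKey = {taxon_key}",
        f"hasCoordinate = {str(has_coordinate).lower()}",
        f"year = {year}",
    ]


example_query = build_gbif_query(5228612, 2023)
print(example_query)
```

Keeping the taxon key and year in variables makes it easier to reuse the same cell for another species or year.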

In [24]:
# Only download once: skip everything if a .csv is already in the project
if not any(project.rglob('*.csv')):
    # Only submit one request per session
    if 'GBIF_DOWNLOAD_KEY' not in os.environ:
        # Submit query to GBIF
        gbif_query = occ.download(
            [
                f'taxonKey = {usage_key}',
                'hasCoordinate = true',
                'year = 2023',
            ],
            format='SIMPLE_CSV',
        )
        # Keep only the download key if a (key, details) pair comes back
        gbif_key = (
            gbif_query[0]
            if isinstance(gbif_query, (list, tuple))
            else gbif_query
        )
        os.environ['GBIF_DOWNLOAD_KEY'] = gbif_key

    # Wait for the download to build (with timeout + terminal states)
    dld_key = os.environ['GBIF_DOWNLOAD_KEY']
    deadline = time.time() + 30 * 60  # 30 minutes hard stop
    wait = occ.download_meta(dld_key)['status']
    while wait not in ('SUCCEEDED', 'FAILED', 'KILLED', 'CANCELLED'):
        if time.time() > deadline:
            raise TimeoutError('GBIF polling timed out.')
        time.sleep(5)
        wait = occ.download_meta(dld_key)['status']

    if wait != 'SUCCEEDED':
        raise RuntimeError(f'GBIF download ended with status: {wait}')

    # Alternative polling loop, kept for reference: I couldn't get past
    # this step when the GBIF service was flaky and returned 503 errors.
    # status_url = f'https://api.gbif.org/v1/occurrence/download/{dld_key}'
    # delay = 5
    # while True:
    #     try:
    #         resp = requests.get(status_url, timeout=30)
    #         if resp.status_code in (502, 503, 504):
    #             time.sleep(delay); continue
    #         resp.raise_for_status()
    #         wait = resp.json()['status']
    #     except requests.RequestException:
    #         time.sleep(delay); continue
    #     if wait in ('SUCCEEDED', 'FAILED', 'KILLED'):
    #         break
    #     time.sleep(delay)

    # Download GBIF data
    dld_info = occ.download_get(dld_key, path=project)
    dld_path = dld_info['path']

    # Unzip GBIF data
    with zipfile.ZipFile(dld_path) as dld_zip:
        dld_zip.extractall(path=project)

    # Clean up the .zip file
    os.remove(dld_path)

# Find the extracted .csv file path (first result)
original_gbif_path = next(project.rglob('*.csv'))
original_gbif_path

INFO:Download file size: 3291201 bytes
INFO:On disk at C:\Users\seism\Documents\GitHub\02-migration-tkbravo\vauxdata/0005387-251025141854904.zip


WindowsPath('C:/Users/seism/Documents/GitHub/02-migration-tkbravo/vauxdata/0005387-251025141854904.csv')
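
Polling a remote service can fail temporarily, for example with a 503 response. The sketch below shows a generic polling helper that retries after such errors and still stops at a deadline. The name `poll_until_done` and its parameters are hypothetical. The status check is passed in as a function so the demo can run with simulated responses instead of contacting GBIF. To use it with `pygbif`, you might pass a function that returns `occ.download_meta(key)['status']`, with `retry_on` set to whichever exception types your installed version raises for HTTP errors.

```python
import time


def poll_until_done(get_status, terminal_states, delay=5, timeout=1800,
                    retry_on=(ConnectionError,), sleep=time.sleep):
    """Call get_status() until it returns a terminal state or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            status = get_status()
        except retry_on:
            status = None  # transient failure: wait and try again
        if status in terminal_states:
            return status
        if time.monotonic() > deadline:
            raise TimeoutError("Polling timed out.")
        sleep(delay)


# Simulated status checks: one transient error, then progress to success
responses = iter([ConnectionError("503"), "PREPARING", "RUNNING", "SUCCEEDED"])


def fake_status():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


final_status = poll_until_done(
    fake_status,
    terminal_states=("SUCCEEDED", "FAILED", "KILLED"),
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final_status)
```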

You might notice that the GBIF data filename isn’t very
**descriptive**. At this point, you may want to clean up your data
directory so that you know what the file is later on!

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Replace ‘your-gbif-filename’ with a <strong>descriptive</strong>
name.</li>
<li>Run the cell</li>
<li>Check your data folder. Is it organized the way you want?</li>
</ol></div></div>

In [25]:
# Give the download a descriptive name
gbif_path = project / 'chaetura_vauxi_2023.csv'
# Move file to descriptive path
shutil.move(original_gbif_path, gbif_path)

WindowsPath('C:/Users/seism/Documents/GitHub/02-migration-tkbravo/vauxdata/chaetura_vauxi_2023.csv')
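
When a notebook is rerun from the top, the rename step runs again after the file has already been moved. A minimal sketch of a rename that is safe to repeat is shown here. The helper name `move_if_needed` and the file names are hypothetical, and the demo works in a temporary folder.

```python
import shutil
import tempfile
from pathlib import Path


def move_if_needed(source, destination):
    """Move source to destination unless the destination already exists."""
    source, destination = Path(source), Path(destination)
    if destination.exists():
        return destination  # already renamed on an earlier run
    shutil.move(str(source), str(destination))
    return destination


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "0000000-000000000000000.csv"  # made-up download name
    raw.write_text("gbifID\n", encoding="utf-8")
    tidy = Path(tmp) / "species_year.csv"

    first = move_if_needed(raw, tidy)
    second = move_if_needed(raw, tidy)  # no error on a second run
    moved_ok = tidy.exists() and not raw.exists()

print(moved_ok, first == second)
```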

### STEP 4: Load the GBIF data into Python

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Load GBIF data</div></div><div class="callout-body-container callout-body"><p>Just like you did when wrangling your data from the data subset,
you’ll need to load your GBIF data and convert it to a GeoDataFrame.</p></div></div>

In [26]:
# Preview the header and first record to check the file format
with open(gbif_path, "r", encoding="utf-8", errors="replace") as f:
    for _ in range(2):
        print(f.readline().rstrip("\n"))

gbifID	datasetKey	occurrenceID	kingdom	phylum	class	order	family	genus	species	infraspecificEpithet	taxonRank	scientificName	verbatimScientificName	verbatimScientificNameAuthorship	countryCode	locality	stateProvince	occurrenceStatus	individualCount	publishingOrgKey	decimalLatitude	decimalLongitude	coordinateUncertaintyInMeters	coordinatePrecision	elevation	elevationAccuracy	depth	depthAccuracy	eventDate	day	month	year	taxonKey	speciesKey	basisOfRecord	institutionCode	collectionCode	catalogNumber	recordNumber	identifiedBy	dateIdentified	license	rightsHolder	recordedBy	typeStatus	establishmentMeans	lastInterpreted	mediaType	issue
5840232682	50c9509d-22c7-4a22-a47d-8c48425ef4a7	https://www.inaturalist.org/observations/318065823	Animalia	Chordata	Aves	Apodiformes	Apodidae	Chaetura	Chaetura vauxi		SPECIES	Chaetura vauxi (J.K.Townsend, 1839)	Chaetura vauxi		MX		Chiapas	PRESENT		28eb1a3f-1c15-4a95-931a-4af90ecb574d	16.757107	-93.090818	488.0						2023-03-01T19:10	1	3	2023	5228612	5228612	HUMA
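
The preview above shows that the columns are separated by tabs even though the file ends in `.csv`. A small sketch of checking this in code, by counting candidate separators in the header line, follows. The helper name `guess_delimiter` is hypothetical, and the sample string contains only the first few column names from the header.

```python
def guess_delimiter(header_line, candidates=("\t", ",", ";")):
    """Return the candidate character that appears most often in the line."""
    counts = {sep: header_line.count(sep) for sep in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("No candidate delimiter found in the header.")
    return best


# First few column names from the header printed above, tab-separated
sample_header = "gbifID\tdatasetKey\toccurrenceID\tkingdom"
delimiter = guess_delimiter(sample_header)
print(repr(delimiter))
```

The result can then be passed to the `delimiter` argument of `pd.read_csv()`.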

In [27]:
# Load the GBIF data
gbif_df = pd.read_csv(
    gbif_path, 
    delimiter='\t',
    usecols=["gbifID","decimalLatitude","decimalLongitude","month"]
)

# Convert to GeoDataFrame

gbif_gdf = (
    gpd.GeoDataFrame(
        gbif_df, 
        geometry=gpd.points_from_xy(
            gbif_df["decimalLongitude"], gbif_df["decimalLatitude"]), 
        crs="EPSG:4326")
    # Select the desired columns
    [["gbifID","month","geometry"]]
)
gbif_gdf

# Check results
# gbif_gdf.total_bounds

|  | gbifID | month | geometry |
|---|---|---|---|
| 0 | 5840232682 | 3 | POINT (-93.09082 16.75711) |
| 1 | 5716113565 | 5 | POINT (-123.17495 44.07056) |
| 2 | 5716040825 | 2 | POINT (-99.09634 19.29198) |
| 3 | 5715916092 | 11 | POINT (-96.12075 17.01428) |
| 4 | 5715897879 | 6 | POINT (-85.01513 10.71931) |
| ... | ... | ... | ... |
| 46085 | 4028988425 | 1 | POINT (-79.64917 9.07761) |
| 46086 | 4028921985 | 1 | POINT (-86.96908 20.47673) |
| 46087 | 4028867559 | 2 | POINT (-89.0445 13.66092) |
| 46088 | 4022369471 | 1 | POINT (-89.22574 13.54412) |
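
One quick check on the result is whether the coordinates fall in a plausible range. The sketch below computes a bounding box in the same order that `gbif_gdf.total_bounds` uses (minimum x, minimum y, maximum x, maximum y), using plain Python and only the first five points shown in the output above. The function names are hypothetical.

```python
def total_bounds(points):
    """Return (min_x, min_y, max_x, max_y) for (longitude, latitude) pairs."""
    lons = [lon for lon, _ in points]
    lats = [lat for _, lat in points]
    return (min(lons), min(lats), max(lons), max(lats))


def in_valid_range(points):
    """True if every longitude is in [-180, 180] and latitude in [-90, 90]."""
    return all(-180 <= lon <= 180 and -90 <= lat <= 90 for lon, lat in points)


# The first five points shown in the output above, as (longitude, latitude)
sample_points = [
    (-93.09082, 16.75711),
    (-123.17495, 44.07056),
    (-99.09634, 19.29198),
    (-96.12075, 17.01428),
    (-85.01513, 10.71931),
]
sample_bounds = total_bounds(sample_points)
print(sample_bounds)
print(in_valid_range(sample_points))
```

Swapped latitude and longitude columns are a common mistake when building point geometries, and a bounding box that lands on the wrong part of the globe makes that mistake easy to spot.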


### STEP -1: Wrap up

Don’t forget to store your variables so you can use them in other
notebooks! Replace `var1` and `var2` with the variables you want to save,
separated by spaces.

In [28]:
%store gbif_gdf project 

Stored 'gbif_gdf' (GeoDataFrame)
Stored 'project' (WindowsPath)


Finally, be sure to `Restart` and `Run all` to make sure your notebook
works all the way through!