# Migration Data Download

Get occurrence data from the Global Biodiversity Information Facility
(GBIF)

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Import packages</div></div><div class="callout-body-container callout-body"><p>In the imports cell, we’ve included some packages that you will need.
Add imports for packages that will help you:</p>
<ol type="1">
<li>Work with reproducible file paths</li>
<li>Work with tabular data</li>
</ol></div></div>

In [1]:
!pip install pygbif



In [None]:
import os
import pathlib
import time
import zipfile
from getpass import getpass
from glob import glob
from pathlib import Path
import shutil
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

import pygbif.occurrences as occ
import pygbif.species as species
import requests

For this challenge, you will need to download some data to the computer
you’re working on. We suggest using the `earthpy` library we develop to
manage your downloads, since it encapsulates many best practices for:

1.  Where to store your data
2.  Dealing with archived data like .zip files
3.  Avoiding version control problems
4.  Making sure your code works cross-platform
5.  Avoiding duplicate downloads

If you’re working on one of our assignments through GitHub Classroom, it
also lets us build in some handy defaults so that you can see your data
files while you work.
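
Practices 2 and 5 in the list above (handling archives and avoiding duplicate work) can be illustrated with the standard library alone. The sketch below is not how `earthpy` implements them; `extract_once` is a hypothetical helper, and the archive it unpacks is a tiny placeholder built for the demonstration.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_once(zip_path, target_dir, pattern="*.csv"):
    """Hypothetical helper: unzip only if no matching file exists yet."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    existing = sorted(target_dir.rglob(pattern))
    if existing:
        # Reuse the earlier extraction instead of unzipping again
        return existing, False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(path=target_dir)
    return sorted(target_dir.rglob(pattern)), True


with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_dir = Path(tmp_dir)
    # Build a tiny placeholder archive to stand in for a real download
    demo_zip_path = tmp_dir / "download.zip"
    with zipfile.ZipFile(demo_zip_path, "w") as archive:
        archive.writestr("data.csv", "id\n1\n")

    demo_data_dir = tmp_dir / "data"
    first_files, first_extracted = extract_once(demo_zip_path, demo_data_dir)
    second_files, second_extracted = extract_once(demo_zip_path, demo_data_dir)
    first_names = [p.name for p in first_files]

print(first_extracted, second_extracted)
```

The second call finds the file that the first call extracted and skips the unzip step, which is the same idea the download cell in STEP 3 uses to avoid repeating a slow request.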

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Create a project folder</div></div><div class="callout-body-container callout-body"><p>The code below will help you get started with making a project
directory</p>
<ol type="1">
<li>Replace <code>'your-project-directory-name-here'</code> with a
<strong>descriptive</strong> name</li>
<li>Run the cell</li>
<li>The code should have printed out the path to your data files. Check
that your data directory exists and has data in it using the terminal or
your Finder/File Explorer.</li>
</ol></div></div>

> **File structure**
>
> These days, a lot of people find their files by searching for them or
> selecting from a `Bookmarks` or `Recents` list. Even if you don’t use
> it, your computer also keeps files in a **tree** structure of folders.
> Put another way, you can organize and find files by travelling along a
> unique **path**, e.g. `My Drive` \> `Documents` \>
> `My awesome project` \> `A project file` where each subsequent folder
> is **inside** the previous one. This is convenient because all the
> files for a project can be in the same place, and both people and
> computers can rapidly locate files they want, provided they remember
> the path.
>
> You may notice that when Python prints out a file path like this, the
> folder names are **separated** by a `/` or `\` (depending on your
> operating system). This character is called the **file separator**,
> and it tells you that the next piece of the path is **inside** the
> previous one.
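
To make the idea of a path concrete, here is a small sketch using Python's standard `pathlib` module. The folder names are the illustrative ones from the note above, not real directories on your computer.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath

# Illustrative folder names only (taken from the example path above)
parts = ["My Drive", "Documents", "My awesome project", "A project file"]

# The same path written with each operating system's file separator
posix_path = PurePosixPath(*parts)
windows_path = PureWindowsPath(*parts)

print(posix_path)    # folders separated by /
print(windows_path)  # folders separated by \
print(os.sep)        # the file separator on the computer running this code

# Each folder is inside the previous one, so the parent of a path
# is the same path with its last piece removed
print(posix_path.parent)
```

Building paths from their pieces like this, rather than typing the separator yourself, is what lets the same code run on different operating systems.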

In [6]:
# Create data directory
data_dir = os.path.join(
    #  specific base directory path
    '/Users/niko2485/Library/CloudStorage/OneDrive-UCB-O365/Desktop/Data_ES',
    #  project folder name
    'species_migration_data'
)

# create the folder 
os.makedirs(data_dir, exist_ok=True)

print(data_dir)

/Users/niko2485/Library/CloudStorage/OneDrive-UCB-O365/Desktop/Data_ES/species_migration_data


### STEP 1: Register and log in to GBIF

You will need a [GBIF account](https://www.gbif.org/) to complete this
challenge. You can use your GitHub account to authenticate with GBIF.
Then, run the following code to enter your credentials for the rest of
your session.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-error"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div></div><div class="callout-body-container callout-body"><p>This code is <strong>interactive</strong>, meaning that it will
<strong>ask you for a response</strong>! The prompt can sometimes be
hard to see if you are using VSCode – it appears at the
<strong>top</strong> of your editor window.</p></div></div>

> **Tip**
>
> If you need to save credentials across multiple sessions, you can
> consider loading them in from a file like a `.env`…but make sure to
> add it to .gitignore so you don’t commit your credentials to your
> repository!
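
As a rough illustration of the tip above, the sketch below reads `KEY=VALUE` lines from a `.env`-style file. `load_env_file` is a hypothetical helper written for this example, the values are placeholders, and the demonstration writes into a plain dictionary rather than your real environment. The third-party `python-dotenv` package provides a more complete version of the same idea.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(env_path, environ=os.environ):
    """Hypothetical helper: copy KEY=VALUE lines from a file into environ."""
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # setdefault() keeps values that were already set in this session
        environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))
    return environ


# Demonstration with placeholder values in a temporary file
demo_environ = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = Path(tmp_dir) / ".env"
    env_path.write_text(
        "# GBIF credentials (placeholders)\n"
        "GBIF_USER=example-user\n"
        "GBIF_EMAIL=user@example.org\n"
    )
    load_env_file(env_path, environ=demo_environ)

print(sorted(demo_environ))
```

In a real project you would call `load_env_file('.env')` without the `environ` argument so that the values land in `os.environ`, and you would list `.env` in `.gitignore` before your first commit.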

> **Warning**
>
> Your email address **must** match the email you used to sign up for
> GBIF!

> **Tip**
>
> If you accidentally enter your credentials incorrectly, you can set
> `reset=True` instead of `reset=False` in the cell below and run it again.

In [1]:
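import os
from getpass import getpass

# Never hardcode credentials in a notebook; prompt for them instead.
# The email must be the full address used to sign up for GBIF.
os.environ["GBIF_USER"] = input("GBIF username: ")
os.environ["GBIF_PWD"] = getpass("GBIF password: ")
os.environ["GBIF_EMAIL"] = input("GBIF email: ")
print("GBIF creds set for this session.")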


GBIF creds set for this session.


In [2]:
####--------------------------####
#### DO NOT MODIFY THIS CODE! ####
####--------------------------####
# This code ASKS for your credentials 
# and saves it for the rest of the session.
# NEVER put your credentials into your code!!!!

# GBIF needs a username, password, and email 
# All 3 need to match the account
reset = False

# Request and store username
if (not ('GBIF_USER'  in os.environ)) or reset:
    os.environ['GBIF_USER'] = input('GBIF username:')

# Securely request and store password
if (not ('GBIF_PWD'  in os.environ)) or reset:
    os.environ['GBIF_PWD'] = getpass('GBIF password:')
    
# Request and store account email address
if (not ('GBIF_EMAIL'  in os.environ)) or reset:
    os.environ['GBIF_EMAIL'] = input('GBIF email:')

### STEP 2: Get the taxon key from GBIF

One of the tricky parts about getting occurrence data from GBIF is that
species often have multiple names in different contexts. Luckily, GBIF
also provides a Name Backbone service that will translate scientific and
colloquial names into unique identifiers. GBIF calls these identifiers
**taxon keys**.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Put the species name, <code>scientific_name</code>, into
the correct location in the code below.</li>
<li>Examine the object you get back from the species query. What part of
it do you think might be the taxon key?</li>
<li>Extract and save the taxon key</li>
</ol></div></div>

In [None]:


# Look up the species in the GBIF Name Backbone
backbone = species.name_backbone(name="Catharus ustulatus")
print(backbone)

taxon_key = backbone["usageKey"]
print("Taxon key:", taxon_key)

{'usageKey': 2490821, 'scientificName': 'Catharus ustulatus (Nuttall, 1840)', 'canonicalName': 'Catharus ustulatus', 'rank': 'SPECIES', 'status': 'ACCEPTED', 'confidence': 99, 'matchType': 'EXACT', 'kingdom': 'Animalia', 'phylum': 'Chordata', 'order': 'Passeriformes', 'family': 'Turdidae', 'genus': 'Catharus', 'species': 'Catharus ustulatus', 'kingdomKey': 1, 'phylumKey': 44, 'classKey': 212, 'orderKey': 729, 'familyKey': 5290, 'genusKey': 2490799, 'speciesKey': 2490821, 'class': 'Aves'}
Taxon key: 2490821
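
Because names can be ambiguous, it may be worth checking how well GBIF matched the name before using the key. The sketch below reuses fields copied from the output above; `checked_taxon_key` is a hypothetical helper, and the confidence threshold of 90 is an arbitrary assumption rather than a GBIF recommendation.

```python
# Copied from the name_backbone() output above (trimmed to the fields used)
backbone_match = {
    "usageKey": 2490821,
    "canonicalName": "Catharus ustulatus",
    "rank": "SPECIES",
    "status": "ACCEPTED",
    "confidence": 99,
    "matchType": "EXACT",
}


def checked_taxon_key(match, min_confidence=90):
    """Hypothetical helper: return usageKey only for a confident species match."""
    if match.get("matchType") != "EXACT":
        raise ValueError(f"Not an exact match: {match.get('matchType')}")
    if match.get("rank") != "SPECIES":
        raise ValueError(f"Matched rank is {match.get('rank')}, not SPECIES")
    if match.get("confidence", 0) < min_confidence:
        raise ValueError(f"Low confidence: {match.get('confidence')}")
    return match["usageKey"]


checked_key = checked_taxon_key(backbone_match)
print("Taxon key:", checked_key)
```

A misspelled name would typically come back with a different `matchType` or a lower `confidence`, and the helper would raise an error instead of silently handing back a key for the wrong taxon.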


### STEP 3: Download data from GBIF

Downloading GBIF data is a multi-step process. However, we’ve provided
you with a chunk of code that handles the API communications and caches
the download. You’ll still need to customize your search.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Submit a request to GBIF</div></div><div class="callout-body-container callout-body"><ol type="1">
<li><p>Replace <code>csv_file_pattern</code> with a string that will
match <strong>any</strong> <code>.csv</code> file when used in the
<code>.rglob()</code> method. HINT: the character <code>*</code>
matches any number of any characters except the file separator
(e.g. <code>/</code> on UNIX systems)</p></li>
<li><p>Add parameters to the GBIF download function,
<code>occ.download()</code> to limit your query to:</p>
<ul>
<li>observations of <span data-__quarto_custom="true"
data-__quarto_custom_type="Shortcode"
data-__quarto_custom_context="Inline"
data-__quarto_custom_id="8"></span></li>
<li>from <span data-__quarto_custom="true"
data-__quarto_custom_type="Shortcode"
data-__quarto_custom_context="Inline"
data-__quarto_custom_id="9"></span></li>
<li>with spatial coordinates.</li>
</ul></li>
<li><p>Then, run the download. <strong>This can take a few
minutes</strong>. You can check your downloads by logging on to the <a
href="https://www.gbif.org/user/download">GBIF website</a>.</p></li>
</ol></div></div>
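
To see how a pattern behaves with `.rglob()` before pointing it at real data, you can try it on a throwaway folder. This is a minimal sketch using only the standard library; the file names are hypothetical and exist only inside a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_dir = Path(tmp_dir)
    # Hypothetical files, created only for this demonstration
    (demo_dir / "nested").mkdir()
    (demo_dir / "occurrences.csv").touch()
    (demo_dir / "nested" / "more_occurrences.csv").touch()
    (demo_dir / "notes.txt").touch()

    # glob() looks only in demo_dir; rglob() also searches its subfolders
    top_level = sorted(p.name for p in demo_dir.glob("*.csv"))
    recursive = sorted(p.name for p in demo_dir.rglob("*.csv"))

print(top_level)
print(recursive)
```

The recursive search finds the file in the `nested` folder as well, which matters here because an extracted archive may place its files in a subfolder.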

In [None]:

csv_file_pattern = "*.csv"  
data_dir = Path(data_dir)  # Convert your existing string into a Path


# Only download once
if not any(data_dir.rglob(csv_file_pattern)):
    # Only submit one request
    if 'GBIF_DOWNLOAD_KEY' not in os.environ:
        # Submit query to GBIF
        gbif_query = occ.download(
            [
                f"taxonKey = {taxon_key}",
                "hasCoordinate = true",
                "year >= 2015",
                "year <= 2025"
            ],
            user=os.environ["GBIF_USER"],
            pwd=os.environ["GBIF_PWD"],
            email=os.environ["GBIF_EMAIL"]
        )
        # Take first result
        dkey = gbif_query[0] if isinstance(gbif_query, (list, tuple)) else gbif_query
        os.environ["GBIF_DOWNLOAD_KEY"] = dkey

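    # Wait for the download to build
    dld_key = os.environ["GBIF_DOWNLOAD_KEY"]
    wait = occ.download_meta(dld_key)["status"]
    while wait != "SUCCEEDED":
        # Stop instead of looping forever if GBIF reports a terminal failure
        if wait in ("FAILED", "KILLED", "CANCELLED"):
            raise RuntimeError(f"GBIF download {dld_key} ended with status {wait}")
        print(f"Download status: {wait}. Waiting 5 seconds...")
        time.sleep(5)
        wait = occ.download_meta(dld_key)["status"]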

    # Download GBIF data
    dld_info = occ.download_get(os.environ["GBIF_DOWNLOAD_KEY"], path=data_dir)
    dld_path = dld_info["path"]

    # Unzip GBIF data
    with zipfile.ZipFile(dld_path) as dld_zip:
        dld_zip.extractall(path=data_dir)

    # Clean up the .zip file
    os.remove(dld_path)

# Find the extracted .csv file path (first result)
original_gbif_path = next(data_dir.rglob(csv_file_pattern))
original_gbif_path


Download status: RUNNING. Waiting 5 seconds...

INFO:Download file size: 197337680 bytes
INFO:On disk at /Users/niko2485/Library/CloudStorage/OneDrive-UCB-O365/Desktop/Data_ES/species_migration_data/0048067-251009101135966.zip


PosixPath('/Users/niko2485/Library/CloudStorage/OneDrive-UCB-O365/Desktop/Data_ES/species_migration_data/0048067-251009101135966.csv')
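
The wait loop in the cell above has no upper limit on how long it polls. A bounded variant might look like the sketch below. `wait_for_status` is a hypothetical helper, the failure status names are assumed from GBIF's download API, and a scripted sequence stands in for the real status calls so the example runs without a GBIF account.

```python
import time


def wait_for_status(get_status, timeout_s=600, interval_s=5, sleep=time.sleep):
    """Poll get_status() until SUCCEEDED, a failure status, or a timeout."""
    waited = 0
    while True:
        status = get_status()
        if status == "SUCCEEDED":
            return status
        # Assumed terminal failure states; stop rather than wait forever
        if status in ("FAILED", "KILLED", "CANCELLED"):
            raise RuntimeError(f"Download ended with status {status}")
        if waited >= timeout_s:
            raise TimeoutError(f"Still {status} after {waited} seconds")
        sleep(interval_s)
        waited += interval_s


# Stand-in for occ.download_meta(key)["status"]: a scripted sequence
fake_statuses = iter(["PREPARING", "RUNNING", "RUNNING", "SUCCEEDED"])
result = wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(result)
```

With `pygbif`, the first argument could be `lambda: occ.download_meta(dld_key)["status"]`, leaving `sleep` at its default so that the loop really pauses between checks.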

You might notice that the GBIF data filename isn’t very
**descriptive**…at this point, you may want to clean up your data
directory so that you know what the file is later on!

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Replace ‘your-gbif-filename’ with a <strong>descriptive</strong>
name.</li>
<li>Run the cell</li>
<li>Check your data folder. Is it organized the way you want?</li>
</ol></div></div>

In [None]:
# Give the download a descriptive name
gbif_path = data_dir / 'swainsons_thrush_gbif.csv'  

# Move file to descriptive path
shutil.move(str(original_gbif_path), str(gbif_path))

print(f"GBIF file moved to: {gbif_path}")

GBIF file moved to: /Users/niko2485/Library/CloudStorage/OneDrive-UCB-O365/Desktop/Data_ES/species_migration_data/swainsons_thrush_gbif.csv


### STEP 4: Load the GBIF data into Python

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Load GBIF data</div></div><div class="callout-body-container callout-body"><p>Just like you did when wrangling your data from the data subset,
you’ll need to load your GBIF data and convert it to a GeoDataFrame.</p></div></div>

In [None]:
# Load the GBIF data
df_swainsons = pd.read_csv(gbif_path, sep="\t")  
df_swainsons.head()

Converting to GeoDataFrame

In [None]:
# Keep only rows with valid coordinates
has_coords = (
    df_swainsons["decimalLatitude"].notna()
    & df_swainsons["decimalLongitude"].notna()
)
in_range = (
    df_swainsons["decimalLatitude"].between(-90, 90)
    & df_swainsons["decimalLongitude"].between(-180, 180)
)
df_swainsons = df_swainsons[has_coords & in_range].copy()

# Convert to a GeoDataFrame (x is longitude, y is latitude)
swainsons_gdf = gpd.GeoDataFrame(
    df_swainsons,
    geometry=gpd.points_from_xy(
        df_swainsons["decimalLongitude"], df_swainsons["decimalLatitude"]
    ),
    crs="EPSG:4326"  # WGS84 latitude/longitude
)

# Check results: count the records of each type
basisOfRecord_table = df_swainsons['basisOfRecord'].value_counts()

# Print summary table
print(basisOfRecord_table)

basisOfRecord
HUMAN_OBSERVATION      1954000
MACHINE_OBSERVATION       3761
PRESERVED_SPECIMEN         816
LIVING_SPECIMEN             71
MATERIAL_SAMPLE             14
Name: count, dtype: int64

### STEP -1: Wrap up

Don’t forget to store your variables so you can use them in other
notebooks! Replace `var1` and `var2` with the variables you want to save,
separated by spaces.

In [None]:
%store df_swainsons swainsons_gdf

Finally, be sure to `Restart` and `Run all` to make sure your notebook
works all the way through!