# Migration Data Download

Get Tasiagnunpa occurrence data from the Global Biodiversity Information
Facility (GBIF).

## Set up

To get started on this notebook, you’ll need to restore any variables
from previous notebooks to your workspace.

In [2]:
%store -r

# Import libraries

## Access locations and times of Veery encounters

For this challenge, you will use a database called the [Global
Biodiversity Information Facility (GBIF)](https://www.gbif.org/). GBIF
is compiled from species observation data from all over the world, and
includes everything from museum specimens to photos taken by citizen
scientists in their backyards.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Explore GBIF</div></div><div class="callout-body-container callout-body"><p>Before you get started, go to the <a
href="https://www.gbif.org/occurrence/search">GBIF occurrences search
page</a> and explore the data.</p></div></div>

> **Contribute to open data**
>
> You can get your own observations added to GBIF using
> [iNaturalist](https://www.inaturalist.org/)!

### Set up your code to prepare for download

To access GBIF data from Python, you need a package called `pygbif`,
which may not be included in your environment. Install it by running the
cell below:

In [3]:
%%bash
pip install pygbif
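
If you are not sure whether a package is already in your environment, you can check before installing. The sketch below uses `importlib.util.find_spec` from the standard library; the helper name `is_installed` is made up for this example.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return find_spec(package_name) is not None


# `json` ships with Python, so this check should succeed everywhere
print(is_installed('json'))
# Check for pygbif before deciding whether to run the install cell
print(is_installed('pygbif'))
```

If the second line prints `False`, run the install cell above.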

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Import packages</div></div><div class="callout-body-container callout-body"><p>In the imports cell, we’ve included some packages that you will need.
Add imports for packages that will help you:</p>
<ol type="1">
<li>Work with reproducible file paths</li>
<li>Work with tabular data</li>
</ol></div></div>

In [4]:
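import shutil  # Used later to rename and clean up downloaded files
import time
import zipfile
from getpass import getpass
from glob import glob

import earthpy  # Used later to create the project directory
import pygbif.occurrences as occ
import pygbif.species as species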

### Create a directory for your data

For this challenge, you will need to download some data to the computer
you’re working on. We suggest using the `earthpy` library we develop to
manage your downloads, since it builds in many best practices for:

1.  Where to store your data
2.  Dealing with archived data like .zip files
3.  Avoiding version control problems
4.  Making sure your code works cross-platform
5.  Avoiding duplicate downloads

If you’re working on one of our assignments through GitHub Classroom, it
also lets us build in some handy defaults so that you can see your data
files while you work.
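
To make the idea of avoiding duplicate downloads concrete, here is a minimal sketch of the general pattern: only fetch when the target file is missing. This is not how `earthpy` is implemented; `fetch_once` and `fake_download` are made-up names, and the "download" just returns a short string.

```python
import tempfile
from pathlib import Path


def fetch_once(target_path, fetch):
    """Call `fetch` to create `target_path` only if it does not exist yet."""
    target_path = Path(target_path)
    if not target_path.exists():
        target_path.parent.mkdir(parents=True, exist_ok=True)
        target_path.write_text(fetch())
    return target_path


calls = []


def fake_download():
    """Stand-in for a slow network request (hypothetical)."""
    calls.append(1)
    return 'id,name\n1,example\n'


with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = Path(tmp_dir) / 'data' / 'example.csv'
    fetch_once(data_path, fake_download)
    fetch_once(data_path, fake_download)  # File exists: no second fetch
    print(len(calls))
```

The download cell later in this notebook follows the same idea by checking for an existing `.csv` file before contacting GBIF.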

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Create a project folder</div></div><div class="callout-body-container callout-body"><p>The code below will help you get started with making a project
directory:</p>
<ol type="1">
<li>Replace <code>'your-project-directory-name-here'</code> with a
<strong>descriptive</strong> name</li>
<li>Run the cell</li>
<li>The code should have printed out the path to your data files. Check
that your data directory exists and has data in it using the terminal or
your Finder/File Explorer.</li>
</ol></div></div>

> **File structure**
>
> These days, a lot of people find their files by searching for them or
> selecting from a `Bookmarks` or `Recents` list. Even if you don’t use
> it, your computer also keeps files in a **tree** structure of folders.
> Put another way, you can organize and find files by travelling along a
> unique **path**, e.g. `My Drive` \> `Documents` \>
> `My awesome project` \> `A project file` where each subsequent folder
> is **inside** the previous one. This is convenient because all the
> files for a project can be in the same place, and both people and
> computers can rapidly locate files they want, provided they remember
> the path.
>
> You may notice that when Python prints out a file path like this, the
> folder names are **separated** by a `/` or `\` (depending on your
> operating system). This character is called the **file separator**,
> and it tells you that the next piece of the path is **inside** the
> previous one.
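
As a small illustration of paths and file separators, the sketch below builds the example path from the note above with Python’s `pathlib`. It uses the "pure" path classes, which only format text and never touch your disk, so you can see both separator styles on any operating system.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same folder names from the example above, innermost last
parts = ['My Drive', 'Documents', 'My awesome project', 'A project file']

# Pure paths only format text, so both styles work on any computer
posix_path = PurePosixPath(*parts)
windows_path = PureWindowsPath(*parts)

print(posix_path)    # My Drive/Documents/My awesome project/A project file
print(windows_path)  # My Drive\Documents\My awesome project\A project file

# `.parent` is the folder that the last piece is inside
print(posix_path.parent.name)  # My awesome project
```

In your own code you would normally use `pathlib.Path`, which picks the right separator for the computer the code is running on.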

In [6]:
# Create data directory
project = earthpy.Project(
    project_dirname='your-project-directory-name-here')
# Download sample data
project.get_data()

# Display the project directory
project.project_dir

### Register and log in to GBIF

You will need a [GBIF account](https://www.gbif.org/) to complete this
challenge. You can use your GitHub account to authenticate with GBIF.
Then, run the following code to enter your credentials for the rest of
your session.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-error"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div></div><div class="callout-body-container callout-body"><p>This code is <strong>interactive</strong>, meaning that it will
<strong>ask you for a response</strong>! The prompt can sometimes be
hard to see if you are using VSCode – it appears at the
<strong>top</strong> of your editor window.</p></div></div>

> **Tip**
>
> If you need to save credentials across multiple sessions, you can
> consider loading them in from a file like a `.env`, but make sure to
> add it to `.gitignore` so you don’t commit your credentials to your
> repository!
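
One way to load credentials from a `.env` file is sketched below using only the standard library. The helper `load_env_file`, the variable name `EXAMPLE_GBIF_USER`, and the file contents are all made up for illustration, and the parser assumes simple `KEY=VALUE` lines. The third-party `python-dotenv` package offers a more complete parser if you prefer not to write your own.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(env_path):
    """Copy KEY=VALUE lines from a file into os.environ (hypothetical helper)."""
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an `=`
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, value = line.split('=', 1)
        # Keep any value that is already set in this session
        os.environ.setdefault(key.strip(), value.strip())


# Demonstrate with a throwaway file and a made-up variable name
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = Path(tmp_dir) / '.env'
    env_path.write_text('# example only\nEXAMPLE_GBIF_USER=example-user\n')
    load_env_file(env_path)

print(os.environ['EXAMPLE_GBIF_USER'])
```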

> **Warning**
>
> Your email address **must** match the email you used to sign up for
> GBIF!

> **Tip**
>
> If you enter your credentials incorrectly, change `reset = False` to
> `reset = True` in the cell below and run it again.

In [8]:
####--------------------------####
#### DO NOT MODIFY THIS CODE! ####
####--------------------------####
# This code ASKS for your credentials and saves them for the rest of the session.
# NEVER put your credentials into your code!!!!

# GBIF needs a username, password, and email -- all need to match the account
reset = False

# Request and store username
if (not ('GBIF_USER'  in os.environ)) or reset:
    os.environ['GBIF_USER'] = input('GBIF username:')

# Securely request and store password
if (not ('GBIF_PWD'  in os.environ)) or reset:
    os.environ['GBIF_PWD'] = getpass('GBIF password:')
    
# Request and store account email address
if (not ('GBIF_EMAIL'  in os.environ)) or reset:
    os.environ['GBIF_EMAIL'] = input('GBIF email:')

### Get the species key

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Replace the <code>species_name</code> with the name of the species
you want to look up</li>
<li>Run the code to get the species key</li>
</ol></div></div>

In [9]:
# Query species
species_info = species.name_lookup(species_name, rank='SPECIES')

# Get the first result
first_result = species_info['results'][0]

# Get the species key (speciesKey)
species_key = first_result['speciesKey']

# Check the result
first_result['species'], species_key
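
The cell above takes the first result and assumes it contains a `speciesKey` entry; if it does not, Python raises a `KeyError`. The sketch below shows one defensive alternative that scans for the first result with a key. The helper name `first_species_key` is made up, and `example_response` is a placeholder dictionary that only mimics the nesting of a lookup response. Its name and key are not real GBIF values.

```python
def first_species_key(lookup_response):
    """Return (species, speciesKey) from the first result that has a key."""
    for result in lookup_response.get('results', []):
        if 'speciesKey' in result:
            return result.get('species'), result['speciesKey']
    raise LookupError('No result contained a speciesKey')


# Made-up response with the same nesting; the name and key are placeholders
example_response = {
    'results': [
        {'scientificName': 'Placeholder name'},
        {'species': 'Genus species', 'speciesKey': 12345},
    ]
}
print(first_species_key(example_response))
```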

### Download data from GBIF

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Submit a request to GBIF</div></div><div class="callout-body-container callout-body"><ol type="1">
<li><p>Replace <code>csv_file_pattern</code> with a string that will
match <strong>any</strong> <code>.csv</code> file when used in the
<code>glob</code> function. HINT: the character <code>*</code>
matches any number of characters except the file separator
(e.g. <code>/</code>)</p></li>
<li><p>Add parameters to the GBIF download function,
<code>occ.download()</code> to limit your query to:</p>
<ul>
<li>observations of <span data-__quarto_custom="true"
data-__quarto_custom_type="Shortcode"
data-__quarto_custom_context="Inline"
data-__quarto_custom_id="1"></span></li>
<li>from 2023</li>
<li>with spatial coordinates.</li>
</ul></li>
<li><p>Then, run the download. <strong>This can take a few
minutes</strong>.</p></li>
</ol></div></div>
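
To see how the `*` wildcard behaves before you write your own pattern, here is a small sketch that uses made-up `.txt` and `.md` files in a temporary folder. Note that the pattern matches the file at the top level but not the one inside the subfolder, because `*` does not cross a file separator.

```python
import tempfile
from glob import glob
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    data_dir = Path(tmp_dir)
    # Made-up files: two at the top level, one inside a subfolder
    (data_dir / 'notes.txt').write_text('a')
    (data_dir / 'readme.md').write_text('b')
    (data_dir / 'archive').mkdir()
    (data_dir / 'archive' / 'old.txt').write_text('c')

    # `*` stands in for the file name but does not cross into `archive`
    matches = glob(str(data_dir / '*.txt'))
    match_names = [Path(match).name for match in matches]

print(match_names)  # ['notes.txt']
```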

In [11]:
# Only download once
if not glob(str(project.project_dir / csv_file_pattern)):
    # Only submit a new request if there is no download key yet
    if 'GBIF_DOWNLOAD_KEY' not in os.environ:
        # Submit query to GBIF
        gbif_query = occ.download([
            "speciesKey = ",
            "year = ",
            "hasCoordinate = ",
        ])
        os.environ['GBIF_DOWNLOAD_KEY'] = gbif_query[0]

    # Wait for the download to build
    download_key = os.environ['GBIF_DOWNLOAD_KEY']
    wait = occ.download_meta(download_key)['status']
    while wait != 'SUCCEEDED':
        time.sleep(5)
        wait = occ.download_meta(download_key)['status']

    # Download GBIF data
    download_info = occ.download_get(
        download_key,
        path=project.project_dir)

    # Unzip GBIF data
    with zipfile.ZipFile(download_info['path']) as download_zip:
        download_zip.extractall(path=project.project_dir)

# Find the extracted .csv file path (take the first result)
original_gbif_path = glob(str(project.project_dir / csv_file_pattern))[0]
original_gbif_path

You might notice that the GBIF data filename isn’t very
**descriptive**…at this point, you may want to clean up your data
directory so that you know what the file is later on!
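
A note on the waiting step in the download cell: it keeps checking until the status is `SUCCEEDED`, so it would loop forever if the request never succeeds. The sketch below shows one way to add failure handling and a timeout. The helper `wait_for_status` is made up, the list of failure statuses is an assumption you should check against the GBIF documentation, and a fake status source stands in for GBIF so the example runs offline.

```python
import time


def wait_for_status(get_status, done='SUCCEEDED',
                    failed=('FAILED', 'KILLED', 'CANCELLED'),
                    interval=5, timeout=600):
    """Poll `get_status` until it returns `done`, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == done:
            return status
        if status in failed:
            raise RuntimeError(f'Download ended with status {status}')
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Still {status} after {timeout} seconds')
        time.sleep(interval)


# Stand-in status source so the sketch runs without contacting GBIF
fake_statuses = iter(['PREPARING', 'RUNNING', 'SUCCEEDED'])
final_status = wait_for_status(lambda: next(fake_statuses), interval=0)
print(final_status)  # SUCCEEDED
```

With `pygbif`, the status source could be a function that returns `occ.download_meta(...)['status']` for your download key.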

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Replace <code>'your-gbif-filename-here'</code> with a
<strong>descriptive</strong> name.</li>
<li>Run the cell</li>
<li>Check your data folder. Is it organized the way you want?</li>
</ol></div></div>

In [13]:
# Give the download a descriptive name
gbif_path = project.project_dir / 'your-gbif-filename-here'
shutil.move(original_gbif_path, gbif_path)
# Clean up the downloaded .zip archive (a single file, not a directory)
os.remove(download_info['path'])

### Load the GBIF data into Python

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Load GBIF data</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Look at the beginning of the file you downloaded using the code
below. What do you think the <strong>delimiter</strong> is?</li>
<li>Run the following code cell. What happens?</li>
<li>Uncomment and modify the parameters of <code>pd.read_csv()</code>
below until your data loads successfully and you have only the columns
you want.</li>
</ol></div></div>

You can use the following code to look at the beginning of your file:

In [15]:
!head -n 2 $gbif_path 
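
The `head` command is not available on every operating system (for example, a default Windows command prompt). As a cross-platform alternative, the sketch below reads the first lines of a file in plain Python. The helper name `head` and the sample file contents are made up for this example; to use it on your data, you could call `head(gbif_path)`.

```python
import tempfile
from itertools import islice
from pathlib import Path


def head(file_path, n_lines=2):
    """Return the first `n_lines` lines of a text file as a list."""
    with open(file_path, encoding='utf-8') as text_file:
        return [line.rstrip('\n') for line in islice(text_file, n_lines)]


# Demonstrate on a throwaway file with made-up contents
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / 'sample.csv'
    sample_path.write_text('id;name\n1;first\n2;second\n', encoding='utf-8')
    first_lines = head(sample_path)

print(first_lines)  # ['id;name', '1;first']
```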

In [16]:
# Load the GBIF data
gbif_df = pd.read_csv(
    gbif_path, 
    delimiter='',
    index_col='',
    usecols=[]
)
gbif_df.head()

## Wrap up

Don’t forget to store your variables so you can use them in other
notebooks! Run on its own, `%store` only lists the variables that are
already stored. To save a variable, add its name after `%store` in the
cell below (for example, `%store gbif_path`). Name only the variables
you will need later, especially if you have large objects in memory.

In [18]:
%store

Finally, be sure to `Restart` and `Run all` to make sure your notebook
works all the way through!