# Spring returns to the Great Plains

Mapping Tasiyagnunpa migration

## Set up

To get started on this notebook, you’ll need to restore any variables
from previous notebooks to your workspace.

In [10]:
%store -r

# Import libraries
import earthpy
import pandas as pd
import geopandas as gpd

## STEP 4: Count the number of observations in each ecosystem, during each month of 2023

Much of the data in GBIF is **crowd-sourced**. As a result, the raw
number of observations in each ecosystem each month is not enough – we
need to **normalize** by some measure of **sampling effort**. After all,
we wouldn’t expect the same number of observations at the North Pole as
we would in a National Park, even if there were the same number of
organisms. In this case, we’re normalizing using the average number of
observations for each ecosystem and each month. This should help control
for the number of active observers in each location and time of year.
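
As a rough sketch of that normalization (the symbols are introduced here only for illustration and are not used elsewhere in the notebook), let $n_{e,m}$ be the number of observations in ecoregion $e$ during month $m$:

$$
\hat{n}_{e,m} = \frac{n_{e,m}}{\bar{n}_{e} \, \bar{n}_{m}}
$$

Here $\bar{n}_{e}$ is the mean monthly count for ecoregion $e$, and $\bar{n}_{m}$ is the mean count across ecoregions for month $m$.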

### Set up your analysis

First things first – let’s load your stored variables.

In [11]:
%store -r ecoregions_gdf gbif_gdf

### Identify the ecoregion for each observation

You can combine the ecoregions and the observations **spatially** using
a method called `.sjoin()`, which stands for spatial join.
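
To see what a spatial join does before trying it on the real data, here is a minimal sketch on made-up geometries. The names `regions_gdf`, `points_gdf`, and `region_name`, as well as the shapes and coordinates, are hypothetical and chosen only for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two made-up square "regions" (hypothetical names and shapes)
regions_gdf = gpd.GeoDataFrame(
    {'region_name': ['west', 'east']},
    geometry=[box(0, 0, 1, 1), box(1.5, 0, 2.5, 1)],
    crs='EPSG:4326',
)

# Three made-up observations; the last one falls outside both squares
points_gdf = gpd.GeoDataFrame(
    {'month': [4, 5, 6]},
    geometry=[Point(0.5, 0.5), Point(2.0, 0.5), Point(5.0, 5.0)],
    crs='EPSG:4326',
)

# An inner join keeps only the points that intersect a region
joined_gdf = points_gdf.sjoin(
    regions_gdf, how='inner', predicate='intersects')
print(joined_gdf[['month', 'region_name']])
```

With `how='inner'`, the point that falls outside both squares is dropped. A `how='left'` join on the same tables would keep it, with a missing `region_name`.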

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-read"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Read More</div></div><div class="callout-body-container callout-body"><p>Check out the <a
href="https://geopandas.org/en/stable/docs/user_guide/mergingdata.html#spatial-joins"><code>geopandas</code>
documentation on spatial joins</a> to help you figure this one out. You
can also ask your favorite LLM (large language model, such as ChatGPT).</p></div></div>

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Perform a spatial join</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Identify the correct values for the <code>how=</code> and
<code>predicate=</code> parameters of the spatial join.</li>
<li>Select only the columns you will need for your plot.</li>
<li>Run the code.</li>
</ol></div></div>

In [12]:
gbif_gdf.head()

gbifID,month,geometry
4501319588,5,POINT (-104.94913 40.65778)
4501319649,7,POINT (-105.16398 40.26684)
4697139297,2,POINT (-109.70095 31.56917)
4735897257,4,POINT (-102.27735 40.58295)
4719794206,6,POINT (-104.51592 39.26695)


In [19]:
ecoregions_gdf.head()

ecoregion,SHAPE_AREA,geometry
0,0.038948,"MULTIPOLYGON (((158.7141 -69.60657, 158.71264 ..."
1,0.170599,"MULTIPOLYGON (((147.28819 -2.57589, 147.2715 -..."
2,13.844952,"MULTIPOLYGON (((26.88659 35.32161, 26.88297 35..."
3,1.355536,"MULTIPOLYGON (((65.48655 34.71401, 65.52872 34..."
4,8.196573,"MULTIPOLYGON (((-160.26404 58.64097, -160.2673..."


In [20]:
gbif_ecoregion_gdf = (
    ecoregions_gdf  # left table
    # Match the CRS of the GBIF data and the ecoregions
    .to_crs(gbif_gdf.crs)
    # Find the ecoregion for each observation
    .sjoin(
        gbif_gdf,  # right table
        # how='inner' keeps only matched pairs, how='left' keeps every
        # ecoregion, and how='right' keeps every observation.
        # We don't want ecoregions without observations, so use 'right'.
        how='right',
        predicate='intersects')
    # Column selection is left disabled: the join output already holds
    # only these columns plus the geometry
    #[['ecoregion', 'SHAPE_AREA', 'month', ]]
)
gbif_ecoregion_gdf

gbifID,ecoregion,SHAPE_AREA,month,geometry
4501319588,790.0,49.311356,5,POINT (-104.94913 40.65778)
4501319649,790.0,49.311356,7,POINT (-105.16398 40.26684)
4697139297,162.0,46.807295,2,POINT (-109.70095 31.56917)
4735897257,790.0,49.311356,4,POINT (-102.27735 40.58295)
4719794206,790.0,49.311356,6,POINT (-104.51592 39.26695)
...,...,...,...,...
4796460466,87.0,4.727694,4,POINT (-118.89593 35.43725)
4720342585,172.0,28.732790,4,POINT (-109.28928 40.43625)
4725888708,513.0,86.168963,6,POINT (-111.30072 47.66419)
4512646405,513.0,86.168963,7,POINT (-109.53653 50.8687)


### Count the observations in each ecoregion each month

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Group observations by ecoregion</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Replace <code>columns_to_group_by</code> with a list of columns.
Keep in mind that you will end up with one row for each group – you want
to count the observations in each ecoregion by month.</li>
<li>Select only month/ecoregion combinations that have more than one
occurrence recorded, since a single occurrence could be an error.</li>
<li>Use the <code>.groupby()</code> and <code>.mean()</code> methods to
compute the mean occurrences by ecoregion and by month.</li>
<li>Run the code – it will normalize the number of occurrences by month
and ecoregion.</li>
</ol></div></div>
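
The grouping, filtering, and averaging steps above can be tried on a tiny table first. This is a sketch with made-up values. `toy_df`, `gbif_id`, and the other `toy_` names are hypothetical and not part of the notebook's data.

```python
import pandas as pd

# Made-up observations: one row per sighting (hypothetical values)
toy_df = pd.DataFrame({
    'ecoregion': [1, 1, 1, 2, 2, 2],
    'month':     [4, 4, 5, 4, 4, 4],
    'gbif_id':   [10, 11, 12, 13, 14, 15],
})

# One row per (ecoregion, month) group, counting the sightings in each
toy_counts_df = (
    toy_df
    .groupby(['ecoregion', 'month'])
    .agg(occurrences=('gbif_id', 'count'))
)

# Drop groups with a single sighting
toy_counts_df = toy_counts_df[toy_counts_df.occurrences > 1]

# Mean count per ecoregion, averaged over the remaining months
toy_mean_by_ecoregion = toy_counts_df.groupby('ecoregion').mean()

print(toy_counts_df)
print(toy_mean_by_ecoregion)
```

Grouping by a list of two columns produces a two-level index, which is why the real `occurrence_df` is indexed by both `ecoregion` and `month`.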

In [None]:
occurrence_df = (
    gbif_ecoregion_gdf
    # For each ecoregion, for each month...
    .groupby(['ecoregion', 'month'])
    # ...count the number of occurrences and keep the ecoregion area
    .agg(
        occurrences=('SHAPE_AREA', 'count'),
        area=('SHAPE_AREA', 'first')
    )
)
# Get rid of rare observations (possible misidentification?)
# .copy() avoids a pandas warning when a column is assigned later
occurrence_df = occurrence_df[occurrence_df.occurrences > 1].copy()
display(occurrence_df)

# Take the mean by ecoregion
mean_occurrences_by_ecoregion = (
    occurrence_df
    .groupby(['ecoregion'])
    .mean()
)
display(mean_occurrences_by_ecoregion)

# Take the mean by month
mean_occurrences_by_month = (
    occurrence_df
    .groupby(['month'])
    .mean()
)
display(mean_occurrences_by_month)


ecoregion,month,occurrences,area
12.0,4,5,17.133639
12.0,5,22,17.133639
12.0,6,46,17.133639
12.0,7,5,17.133639
12.0,8,4,17.133639
...,...,...,...
833.0,8,114,35.905513
833.0,9,166,35.905513
833.0,10,75,35.905513
833.0,11,7,35.905513


ecoregion,occurrences,area
12.0,12.428571,17.133639
33.0,3.000000,18.674884
43.0,96.416667,10.853227
59.0,17.181818,7.110701
60.0,3.500000,3.236377
...,...,...
790.0,3716.833333,49.311356
793.0,386.333333,1.695309
796.0,270.100000,14.520123
832.0,68.250000,4.286144


month,occurrences,area
1,289.218182,13.645478
2,276.792453,14.520851
3,282.791045,14.346628
4,562.738462,15.110027
5,911.963636,16.781587
6,636.622642,16.930725
7,319.0,17.388763
8,161.352941,16.311893
9,194.296296,16.475353
10,229.369231,16.899486



In [30]:
mean_occurrences_by_ecoregion

ecoregion,occurrences,area
12.0,12.428571,17.133639
33.0,3.000000,18.674884
43.0,96.416667,10.853227
59.0,17.181818,7.110701
60.0,3.500000,3.236377
...,...,...
790.0,3716.833333,49.311356
793.0,386.333333,1.695309
796.0,270.100000,14.520123
832.0,68.250000,4.286144


### Normalize the observations

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Normalize</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Divide occurrences by the mean occurrences by month AND the mean
occurrences by ecoregion</li>
</ol></div></div>
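
Dividing a table indexed by both ecoregion and month by a table indexed by only one of them relies on index alignment. The sketch below shows one way to do that with `Series.div(..., level=...)` on made-up counts. All `toy_` names and values are hypothetical, and this is only one possible approach.

```python
import pandas as pd

# Made-up counts indexed by (ecoregion, month); values are hypothetical
toy_index = pd.MultiIndex.from_tuples(
    [(1, 4), (1, 5), (2, 4), (2, 5)], names=['ecoregion', 'month'])
toy_counts = pd.Series(
    [10.0, 30.0, 6.0, 2.0], index=toy_index, name='occurrences')

# Mean count for each ecoregion and for each month
toy_mean_by_ecoregion = toy_counts.groupby(level='ecoregion').mean()
toy_mean_by_month = toy_counts.groupby(level='month').mean()

# Divide by each mean, matching values on the named index level
toy_norm = (
    toy_counts
    .div(toy_mean_by_ecoregion, level='ecoregion')
    .div(toy_mean_by_month, level='month')
)
print(toy_norm)
```

In this toy data, ecoregion 1 has far more observations overall, but after normalization the values reflect only when each ecoregion is relatively busy: ecoregion 1 is higher in month 5 and ecoregion 2 is higher in month 4.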

In [32]:
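# Normalize by space and time for sampling effort
occurrence_df['occurrences'] = (
    occurrence_df['occurrences']
    # Divide by the ecoregion mean, matching on the 'ecoregion' index level
    .div(mean_occurrences_by_ecoregion['occurrences'], level='ecoregion')
    # Divide by the month mean, matching on the 'month' index level
    .div(mean_occurrences_by_month['occurrences'], level='month')
)
occurrence_df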

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><p>Make sure to store the new version of your <code>DataFrame</code> for
other notebooks!</p>
<div id="015f18c7" class="cell" data-execution_count="9">
<div class="sourceCode" id="cb1"><pre
class="sourceCode python cell-code"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="op">%</span>store occurrence_df</span></code></pre></div>
</div></div></div>

## Wrap up

Don’t forget to store your variables so you can use them in other
notebooks! This code stores only the two variables named in it. Naming
specific variables is a good idea, especially if you have large objects
in memory that you won’t need in the future.

In [18]:
%store ecoregions_gdf occurrence_df

Stored 'ecoregions_gdf' (GeoDataFrame)
Stored 'occurrence_df' (DataFrame)


Finally, be sure to `Restart` and `Run all` to make sure your notebook
works all the way through!