# Mapping migration

Introduction to vector data operations

## STEP 0: Set up

To get started on this notebook, you’ll need to restore any variables
from previous notebooks to your workspace. To save time and memory, make
sure to specify which variables you want to load.

In [1]:
%store -r ecoregion_gdf gbif_gdf


### Identify the ecoregion for each observation

You can combine the ecoregions and the observations **spatially** using
a method called `.sjoin()`, which stands for spatial join.
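As a rough illustration of what a spatial join does, here is a minimal sketch on toy data. The `regions` and `points` GeoDataFrames, their column names, and the coordinates are all made up for this example; they only stand in for the ecoregions and the GBIF observations.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two hypothetical square "regions" (toy stand-ins for ecoregions)
regions = gpd.GeoDataFrame(
    {"region_name": ["west", "east"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Three hypothetical observations; the last one lies outside both regions
points = gpd.GeoDataFrame(
    {"obs_id": [101, 102, 103]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# Keep one row per (region, point) pair where the region contains the point
joined = regions.sjoin(points, how="inner", predicate="contains")
print(joined[["region_name", "obs_id"]])
```

With `how="inner"`, only pairs that satisfy the predicate are kept, so the observation outside both regions is dropped and each remaining row carries the region's geometry alongside the observation's attributes.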

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-read"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Read More</div></div><div class="callout-body-container callout-body"><p>Check out the <a
href="https://geopandas.org/en/stable/docs/user_guide/mergingdata.html#spatial-joins"><code>geopandas</code>
documentation on spatial joins</a> to help you figure this one out. You
can also ask your favorite LLM (large language model, such as ChatGPT).</p></div></div>

<div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Perform a spatial join</div></div><div class="callout-body-container callout-body"><p>Identify the correct values for the <code>how=</code> and
<code>predicate=</code> parameters of the spatial join.</p></div></div>

In [2]:
gbif_ecoregion_gdf = (
    ecoregion_gdf
    # Match the CRS of the GBIF data and the ecoregions
    .to_crs(gbif_gdf.crs)
    # Find ecoregion for each observation
    .sjoin(
        gbif_gdf,
        how='inner', 
        predicate='contains')
)
gbif_ecoregion_gdf

| Ecoregion_ID | Ecoregion_name | Ecoregion_area | geometry | gbifID | month |
| --- | --- | --- | --- | --- | --- |
| 17.0 | Allegheny Highlands forests | 7.958751 | POLYGON ((-75.40899 43.03974, -75.41289 43.036... | 4226955000 | 3 |
| 17.0 | Allegheny Highlands forests | 7.958751 | POLYGON ((-75.40899 43.03974, -75.41289 43.036... | 4197819387 | 3 |
| 17.0 | Allegheny Highlands forests | 7.958751 | POLYGON ((-75.40899 43.03974, -75.41289 43.036... | 4211517458 | 4 |
| 17.0 | Allegheny Highlands forests | 7.958751 | POLYGON ((-75.40899 43.03974, -75.41289 43.036... | 4233350823 | 7 |
| 17.0 | Allegheny Highlands forests | 7.958751 | POLYGON ((-75.40899 43.03974, -75.41289 43.036... | 4240719419 | 3 |
| ... | ... | ... | ... | ... | ... |
| 789.0 | Western Gulf coastal grasslands | 8.340400 | MULTIPOLYGON (((-97.19822 26.06972, -97.1974 2... | 4220454881 | 1 |
| 789.0 | Western Gulf coastal grasslands | 8.340400 | MULTIPOLYGON (((-97.19822 26.06972, -97.1974 2... | 4192215422 | 1 |
| 789.0 | Western Gulf coastal grasslands | 8.340400 | MULTIPOLYGON (((-97.19822 26.06972, -97.1974 2... | 4234175930 | 1 |
| 796.0 | Western shortgrass prairie | 49.311356 | POLYGON ((-102.11588 43.49942, -102.11336 43.4... | 4253377422 | 12 |


### Count the observations in each ecoregion each month

<div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Group observations by ecoregion</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Replace <code>columns_to_group_by</code> with a list of columns.
Keep in mind that you will end up with one row for each group – you want
to count the observations in each ecoregion by month.</li>
<li>Select only month/ecoregion combinations that have more than one
occurrence recorded, since a single occurrence could be an error.</li>
<li>Use the <code>.groupby()</code> and <code>.mean()</code> methods to
compute the mean occurrences by ecoregion and by month.</li>
<li>Run the code – it will normalize the number of occurrences by month
and ecoregion.</li>
</ol></div></div>
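The grouping steps above can be sketched on a small toy table. The `obs` DataFrame and its column names (`region_id`, `obs_id`, `region_area`) are invented for illustration and are not the notebook's data; the sketch assumes, as in this notebook, that every row of a region carries the same area value.

```python
import pandas as pd

# Hypothetical observations: one row per observation
obs = pd.DataFrame({
    "region_id": [1, 1, 1, 2, 2],
    "month": [3, 3, 4, 3, 3],
    "obs_id": [10, 11, 12, 13, 14],
    "region_area": [2.0, 2.0, 2.0, 4.0, 4.0],
})

# One row per (region, month) group: count observations, keep the area
counts = (
    obs
    .groupby(["region_id", "month"])
    .agg(
        occurrences=("obs_id", "count"),
        area=("region_area", "first"),
    )
)

# Observations per unit area
counts["density"] = counts["occurrences"] / counts["area"]

# Drop groups with a single observation; .copy() makes the result independent
filtered = counts[counts["occurrences"] > 1].copy()
print(filtered)
```

Named aggregation (`new_column=(source_column, function)`) keeps the output column names explicit, which makes the later density calculation easier to read.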

In [3]:
occurrence_df = (
    gbif_ecoregion_gdf
    # Move Ecoregion_ID from the index (set in the wrangle notebook) into a column
    .reset_index()
    # For each ecoregion, for each month...
    .groupby(['Ecoregion_ID', 'month'])
    # ...count the number of occurrences and keep the ecoregion area
    .agg(
        occurrences=('gbifID', 'count'),
        area=('Ecoregion_area', 'first'))
)

# Normalize by area (not originally in the directions)
occurrence_df['density'] = (
    occurrence_df['occurrences']
    / occurrence_df['area']
)

occurrence_df

| Ecoregion_ID | month | occurrences | area | density |
| --- | --- | --- | --- | --- |
| 17.0 | 2 | 5 | 7.958751 | 0.628239 |
| 17.0 | 3 | 322 | 7.958751 | 40.458611 |
| 17.0 | 4 | 244 | 7.958751 | 30.658078 |
| 17.0 | 5 | 207 | 7.958751 | 26.009107 |
| 17.0 | 6 | 75 | 7.958751 | 9.423590 |
| ... | ... | ... | ... | ... |
| 789.0 | 2 | 7 | 8.340400 | 0.839288 |
| 789.0 | 10 | 3 | 8.340400 | 0.359695 |
| 789.0 | 11 | 18 | 8.340400 | 2.158170 |
| 789.0 | 12 | 83 | 8.340400 | 9.951561 |


In [4]:
# Get rid of rare observations (possible misidentification?)
occurrence_df = occurrence_df[occurrence_df.occurrences>1]
occurrence_df

| Ecoregion_ID | month | occurrences | area | density |
| --- | --- | --- | --- | --- |
| 17.0 | 2 | 5 | 7.958751 | 0.628239 |
| 17.0 | 3 | 322 | 7.958751 | 40.458611 |
| 17.0 | 4 | 244 | 7.958751 | 30.658078 |
| 17.0 | 5 | 207 | 7.958751 | 26.009107 |
| 17.0 | 6 | 75 | 7.958751 | 9.423590 |
| ... | ... | ... | ... | ... |
| 789.0 | 2 | 7 | 8.340400 | 0.839288 |
| 789.0 | 10 | 3 | 8.340400 | 0.359695 |
| 789.0 | 11 | 18 | 8.340400 | 2.158170 |
| 789.0 | 12 | 83 | 8.340400 | 9.951561 |


In [5]:
# Take the mean by ecoregion
mean_occurrences_by_ecoregion = (
    occurrence_df
    .groupby('Ecoregion_ID')
    .mean()
)
mean_occurrences_by_ecoregion

| Ecoregion_ID | occurrences | area | density |
| --- | --- | --- | --- |
| 17.0 | 93.181818 | 7.958751 | 11.708096 |
| 33.0 | 133.5 | 16.637804 | 8.023896 |
| 34.0 | 63.333333 | 18.674884 | 3.391364 |
| 35.0 | 70.833333 | 16.43362 | 4.31027 |
| 50.0 | 96.0 | 1.514407 | 63.39114 |
| 95.0 | 7.8 | 40.677412 | 0.191753 |
| 126.0 | 8.333333 | 28.045989 | 0.297131 |
| 135.0 | 16.0 | 33.943895 | 0.471366 |
| 140.0 | 224.636364 | 24.173692 | 9.292596 |
| 151.0 | 61.2 | 36.779324 | 1.663978 |


In [6]:
# Take the mean by month
mean_occurrences_by_month = (
    occurrence_df
    .groupby('month')
    .mean()
)
mean_occurrences_by_month

| month | occurrences | area | density |
| --- | --- | --- | --- |
| 1 | 50.75 | 14.321711 | 7.362722 |
| 2 | 100.958333 | 15.417205 | 13.404983 |
| 3 | 354.741935 | 16.988 | 33.179387 |
| 4 | 252.310345 | 28.982096 | 15.928372 |
| 5 | 305.071429 | 29.03105 | 19.075021 |
| 6 | 77.708333 | 29.074943 | 5.251749 |
| 7 | 48.904762 | 22.528648 | 3.162021 |
| 8 | 31.85 | 28.147429 | 2.18028 |
| 9 | 33.8 | 23.029397 | 2.662009 |
| 10 | 48.321429 | 20.021578 | 4.865157 |


### Normalize the observations

<div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Normalize</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Divide occurrences by the mean occurrences by month AND the mean
occurrences by ecoregion</li>
</ol></div></div>
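To make the arithmetic concrete, here is a minimal sketch of the double normalization on toy numbers. The `density` Series, its index level names, and its values are made up for illustration. It uses `groupby(...).transform("mean")` to broadcast each group mean back onto every row, which is one possible way to line up the values and not necessarily the approach the notebook takes.

```python
import pandas as pd

# Hypothetical densities indexed by (region_id, month)
index = pd.MultiIndex.from_tuples(
    [(1, 3), (1, 4), (2, 3), (2, 4)],
    names=["region_id", "month"],
)
density = pd.Series([2.0, 4.0, 6.0, 8.0], index=index, name="density")

# Mean density for each month, repeated on every row of that month
month_mean = density.groupby(level="month").transform("mean")

# Mean density for each region, repeated on every row of that region
region_mean = density.groupby(level="region_id").transform("mean")

# Divide by both means to remove the overall month and region effects
norm = density / month_mean / region_mean
print(norm)
```

For the first row, the month-3 mean is (2 + 6) / 2 = 4 and the region-1 mean is (2 + 4) / 2 = 3, so the normalized value is 2 / 4 / 3, or about 0.167.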

In [7]:
# occurrence_df was produced by filtering, so take an explicit copy
# before adding a column (avoids pandas' SettingWithCopyWarning)
occurrence_df = occurrence_df.copy()

# Normalize by space and time for sampling effort
occurrence_df['norm_occurrences'] = (
    occurrence_df[['density']]
    / mean_occurrences_by_month[['density']]
    / mean_occurrences_by_ecoregion[['density']]
)
occurrence_df

| Ecoregion_ID | month | occurrences | area | density | norm_occurrences |
| --- | --- | --- | --- | --- | --- |
| 17.0 | 2 | 5 | 7.958751 | 0.628239 | 0.004003 |
| 17.0 | 3 | 322 | 7.958751 | 40.458611 | 0.104149 |
| 17.0 | 4 | 244 | 7.958751 | 30.658078 | 0.164394 |
| 17.0 | 5 | 207 | 7.958751 | 26.009107 | 0.116459 |
| 17.0 | 6 | 75 | 7.958751 | 9.423590 | 0.153259 |
| ... | ... | ... | ... | ... | ... |
| 789.0 | 2 | 7 | 8.340400 | 0.839288 | 0.016217 |
| 789.0 | 10 | 3 | 8.340400 | 0.359695 | 0.019150 |
| 789.0 | 11 | 18 | 8.340400 | 2.158170 | 0.113607 |
| 789.0 | 12 | 83 | 8.340400 | 9.951561 | 0.415483 |


<div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><p>Make sure to store the new version of your <code>DataFrame</code> for
other notebooks!</p>
<div id="f13606e9" class="cell" data-execution_count="9">
<div class="sourceCode" id="cb1"><pre
class="sourceCode python cell-code"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="op">%</span>store occurrence_df</span></code></pre></div>
</div></div></div>

## STEP -1: Wrap up

Don’t forget to store your variables so you can use them in other
notebooks! List the variables you want to save after `%store`,
separated by spaces.

In [8]:
%store occurrence_df 

Stored 'occurrence_df' (DataFrame)


Finally, be sure to `Restart` and `Run all` to make sure your notebook
works all the way through!