# Mapping migration

Introduction to vector data operations

## STEP 0: Set up

To get started on this notebook, you’ll need to restore any variables
from previous notebooks to your workspace. To save time and memory, make
sure to specify which variables you want to load.

In [15]:
%store -r gbif_df gbif_gdf gdf



### Identify the ecoregion for each observation

You can combine the ecoregions and the observations **spatially** using
a method called `.sjoin()`, which stands for spatial join.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-read"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Read More</div></div><div class="callout-body-container callout-body"><p>Check out the <a
href="https://geopandas.org/en/stable/docs/user_guide/mergingdata.html#spatial-joins"><code>geopandas</code>
documentation on spatial joins</a> to help you figure this one out. You
can also ask your favorite LLM (large language model, such as ChatGPT).</p></div></div>

<div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Perform a spatial join</div></div><div class="callout-body-container callout-body"><p>Identify the correct values for the <code>how=</code> and
<code>predicate=</code> parameters of the spatial join.</p></div></div>
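
As a rough illustration of how the two parameters behave, the sketch below joins three made-up points to two made-up square regions. The names `regions_gdf`, `points_gdf`, and `obs_id` are hypothetical, and the geometries are toy data rather than the real ecoregions or GBIF records.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two hypothetical square "ecoregions" (toy data)
regions_gdf = gpd.GeoDataFrame(
    {"eco_code": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Three hypothetical observations; the last one lies outside both squares
points_gdf = gpd.GeoDataFrame(
    {"obs_id": [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# how="left" keeps every observation; unmatched ones get a missing eco_code
left_join = points_gdf.sjoin(regions_gdf, how="left", predicate="within")

# how="inner" keeps only observations that fall inside some region
inner_join = points_gdf.sjoin(regions_gdf, how="inner", predicate="within")

print(left_join[["obs_id", "eco_code"]])
print(inner_join[["obs_id", "eco_code"]])
```

With points on the left side of the join, `predicate="within"` asks whether each point lies inside a polygon. Swapping the two tables would call for `predicate="contains"` instead.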

In [5]:
print(gdf.head())
print(gbif_df.head())
print(gbif_gdf.head())

  eco_code  area_km2                                           geometry
0   NT0702    125589  POLYGON ((-67.98364 -13.79293, -67.98972 -13.7...
1   NT0128    114506  POLYGON ((-72.72287 -3.54497, -72.724 -3.5443,...
2   NT0169     50675  POLYGON ((-65.90067 6.64234, -65.90344 6.64368...
3   NT0124    145963  POLYGON ((-66.18742 6.44344, -66.19053 6.44392...
4   NT1401     40894  POLYGON ((-45.68811 -1.26541, -45.69585 -1.263...
       gbifID                            datasetKey  \
0  4158712344  8a863029-f435-446a-821e-275f4f641165   
1  4923515059  36f15a36-6b45-442e-9c31-cd633423aee0   
2  4923522410  36f15a36-6b45-442e-9c31-cd633423aee0   
3  4923520798  36f15a36-6b45-442e-9c31-cd633423aee0   
4  4923520314  36f15a36-6b45-442e-9c31-cd633423aee0   

                                    occurrenceID   kingdom    phylum class  \
0  https://observation.org/observation/273993634  Animalia  Chordata  Aves   
1           80de1f4e-30b8-4ffc-88e7-654a025164a5  Animalia  Chordata  Aves   
2  

In [17]:
import geopandas as gpd

# Build point geometries from the longitude/latitude columns (WGS 84)
gbif_gdf = gpd.GeoDataFrame(
    gbif_df,
    geometry=gpd.points_from_xy(
        gbif_df['decimalLongitude'], gbif_df['decimalLatitude']),
    crs="EPSG:4326"
)

gbif_ecoregion_gdf = (
    gbif_gdf
    # Match the CRS of the GBIF data and the ecoregions
    .to_crs(gdf.crs)
    # Find ecoregion for each observation; how='left' keeps observations
    # that fall outside every ecoregion, with a missing eco_code
    .sjoin(
        gdf[['eco_code', 'geometry']],
        how='left',
        predicate='within')
)

print(gbif_ecoregion_gdf.columns)
gbif_ecoregion_gdf


Index(['gbifID', 'datasetKey', 'occurrenceID', 'kingdom', 'phylum', 'class',
       'order', 'family', 'genus', 'species', 'infraspecificEpithet',
       'taxonRank', 'scientificName', 'verbatimScientificName',
       'verbatimScientificNameAuthorship', 'countryCode', 'locality',
       'stateProvince', 'occurrenceStatus', 'individualCount',
       'publishingOrgKey', 'decimalLatitude', 'decimalLongitude',
       'coordinateUncertaintyInMeters', 'coordinatePrecision', 'elevation',
       'elevationAccuracy', 'depth', 'depthAccuracy', 'eventDate', 'day',
       'month', 'year', 'taxonKey', 'speciesKey', 'basisOfRecord',
       'institutionCode', 'collectionCode', 'catalogNumber', 'recordNumber',
       'identifiedBy', 'dateIdentified', 'license', 'rightsHolder',
       'recordedBy', 'typeStatus', 'establishmentMeans', 'lastInterpreted',
       'mediaType', 'issue', 'geometry', 'index_right', 'eco_code'],
      dtype='object')


Unnamed: 0,gbifID,datasetKey,occurrenceID,kingdom,phylum,class,order,family,genus,species,...,rightsHolder,recordedBy,typeStatus,establishmentMeans,lastInterpreted,mediaType,issue,geometry,index_right,eco_code
0,4158712344,8a863029-f435-446a-821e-275f4f641165,https://observation.org/observation/273993634,Animalia,Chordata,Aves,Passeriformes,Turdidae,Catharus,Catharus fuscescens,...,Stichting Observation International,User 1751,,,2025-09-24T01:21:44.728Z,,,POINT (-82.5092 41.9137),604.0,NA0414
1,4923515059,36f15a36-6b45-442e-9c31-cd633423aee0,80de1f4e-30b8-4ffc-88e7-654a025164a5,Animalia,Chordata,Aves,Passeriformes,Turdidae,Catharus,Catharus fuscescens,...,The Field Museum of Natural History,D. E. Willard,,,2025-09-19T13:12:11.782Z,,COORDINATE_ROUNDED;INSTITUTION_MATCH_FUZZY;COL...,POINT (-87.61172 41.85281),,
2,4923522410,36f15a36-6b45-442e-9c31-cd633423aee0,da8061c1-aecd-49f8-84a7-31c39358f2cf,Animalia,Chordata,Aves,Passeriformes,Turdidae,Catharus,Catharus fuscescens,...,The Field Museum of Natural History,D. E. Willard,,,2025-09-19T13:12:11.782Z,,COORDINATE_ROUNDED;INSTITUTION_MATCH_FUZZY;COL...,POINT (-87.61172 41.85281),,
3,4923520798,36f15a36-6b45-442e-9c31-cd633423aee0,bf61d7c8-b847-4250-8d5a-798694a2ddfc,Animalia,Chordata,Aves,Passeriformes,Turdidae,Catharus,Catharus fuscescens,...,The Field Museum of Natural History,D. E. Willard,,,2025-09-19T13:12:11.782Z,,COORDINATE_ROUNDED;INSTITUTION_MATCH_FUZZY;COL...,POINT (-87.61172 41.85281),,
4,4923520314,36f15a36-6b45-442e-9c31-cd633423aee0,665d4e55-daaa-40d5-8339-e4b31bb303f4,Animalia,Chordata,Aves,Passeriformes,Turdidae,Catharus,Catharus fuscescens,...,The Field Museum of Natural History,J. Steadham,,,2025-09-19T13:12:11.782Z,,INSTITUTION_MATCH_FUZZY;COLLECTION_MATCH_FUZZY,POINT (-87.63013 41.88036),,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165609,4423534780,b1047888-ae52-4179-9dd5-5448ea342a24,https://data.biodiversitydata.nl/xeno-canto/ob...,Animalia,Chordata,Aves,Passeriformes,Turdidae,Catharus,Catharus fuscescens,...,Paul Driver,Paul Driver,,,2025-09-19T12:27:44.140Z,StillImage;Sound;StillImage,CONTINENT_DERIVED_FROM_COORDINATES,POINT (-75.1254 40.0626),1904.0,NA0411
165610,4524632357,b1047888-ae52-4179-9dd5-5448ea342a24,https://data.biodiversitydata.nl/xeno-canto/ob...,Animalia,Chordata,Aves,Passeriformes,Turdidae,Catharus,Catharus fuscescens,...,Sue Riffe,Sue Riffe,,,2025-09-19T12:27:44.140Z,StillImage;Sound;StillImage,CONTINENT_DERIVED_FROM_COORDINATES,POINT (-89.8049 44.6967),150.0,NA0415
165611,4173211734,b1047888-ae52-4179-9dd5-5448ea342a24,https://data.biodiversitydata.nl/xeno-canto/ob...,Animalia,Chordata,Aves,Passeriformes,Turdidae,Catharus,Catharus fuscescens,...,David Tattersley,David Tattersley,,,2025-09-19T12:27:44.140Z,StillImage;Sound;StillImage,CONTINENT_DERIVED_FROM_COORDINATES,POINT (-82.4753 42.0478),904.0,Lake
165612,4173216429,b1047888-ae52-4179-9dd5-5448ea342a24,https://data.biodiversitydata.nl/xeno-canto/ob...,Animalia,Chordata,Aves,Passeriformes,Turdidae,Catharus,Catharus fuscescens,...,Denis Provencher,Denis Provencher,,,2025-09-19T12:27:44.140Z,Sound;StillImage;StillImage,CONTINENT_DERIVED_FROM_COORDINATES,POINT (-74.5468 46.1638),,
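
Several rows in the table above have no `eco_code` because the left join kept observations that did not fall inside any ecoregion. The sketch below, on a hypothetical four-row table (`joined_df` is an illustrative name, not a variable from this notebook), shows one way to count those rows and confirms that pandas leaves them out of a `groupby` by default.

```python
import pandas as pd

# Hypothetical join result: two matched and two unmatched observations
joined_df = pd.DataFrame({
    "eco_code": ["NA0414", None, None, "NA0411"],
    "month": [5, 5, 5, 6],
})

# Number of observations that did not land in any ecoregion
n_unmatched = joined_df["eco_code"].isna().sum()

# Explicitly keep only matched observations
matched_df = joined_df.dropna(subset=["eco_code"])

# groupby skips rows whose key is missing (dropna=True is the default)
group_sizes = joined_df.groupby("eco_code").size()

print(n_unmatched)
print(group_sizes)
```

Checking the unmatched count before grouping makes it easier to notice when a large share of the observations silently drops out of the analysis.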


### Count the observations in each ecoregion each month

<div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Group observations by ecoregion</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Replace <code>columns_to_group_by</code> with a list of columns.
Keep in mind that you will end up with one row for each group – you want
to count the observations in each ecoregion by month.</li>
<li>Select only month/ecoregion combinations that have more than one
occurrence recorded, since a single occurrence could be an error.</li>
<li>Use the <code>.groupby()</code> and <code>.mean()</code> methods to
compute the mean occurrences by ecoregion and by month.</li>
<li>Run the code – it will normalize the number of occurrences by month
and ecoregion.</li>
</ol></div></div>
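
The steps above can be sketched on a small made-up table before applying them to the real data. Everything here (`obs_df`, `counts_df`, and the values) is hypothetical toy data, and the threshold of more than one occurrence simply follows the wording of step 2.

```python
import pandas as pd

# Hypothetical observations: one row per sighting
obs_df = pd.DataFrame({
    "eco_code": ["A", "A", "A", "B", "B", "A"],
    "month": [5, 5, 6, 5, 5, 5],
    "species": ["x"] * 6,
})

# One row per (ecoregion, month) group with the number of sightings
counts_df = (
    obs_df
    .groupby(["eco_code", "month"])
    .agg(occurrences=("species", "count"))
)

# Keep only groups with more than one occurrence
counts_df = counts_df[counts_df.occurrences > 1]

# Mean occurrences per ecoregion and per month
mean_by_eco = (
    counts_df.groupby("eco_code").agg(mean_occurrences=("occurrences", "mean"))
)
mean_by_month = (
    counts_df.groupby("month").agg(mean_occurrences=("occurrences", "mean"))
)

print(counts_df)
print(mean_by_eco)
print(mean_by_month)
```

Grouping by a list of two columns produces a two-level index, which is why the second round of `.groupby()` calls can refer to `eco_code` and `month` by name.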

In [39]:
occurrence_df = (
    gbif_ecoregion_gdf
    # Select only necessary columns
    [['eco_code', 'month', 'species']]
    # For each ecoregion, for each month...
    .groupby(['eco_code', 'month'])
    # ...count the number of occurrences
    .agg(occurrences=('species', 'count'))
)

# Drop ecoregion/month groups with fewer than 5 occurrences
# (sparse records could be misidentifications)
occurrence_df = occurrence_df[occurrence_df.occurrences >= 5]

# Take the mean by ecoregion
mean_occurrences_by_ecoregion = (
    occurrence_df
    .groupby('eco_code')
    .agg(mean_occurrences=('occurrences', 'mean'))
)

# Take the mean by month
mean_occurrences_by_month = (
    occurrence_df
    .groupby('month')
    .agg(mean_occurrences=('occurrences', 'mean'))
)

print(mean_occurrences_by_ecoregion.head())
print(occurrence_df.head())
print(mean_occurrences_by_month.head())

          mean_occurrences
eco_code                  
Lake            344.666667
NA0401         2029.000000
NA0403          819.428571
NA0407         3621.800000
NA0408           78.200000
                occurrences
eco_code month             
Lake     4               12
         5             1792
         6               29
         7               29
         8               41
       mean_occurrences
month                  
3              8.000000
4            243.733333
5           1552.324324
6           1273.357143
7            719.000000


### Normalize the observations

<div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Normalize</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Divide occurrences by the mean occurrences by month AND the mean
occurrences by ecoregion</li>
</ol></div></div>
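
To see what this double division does, here is a minimal sketch on hypothetical counts. It uses `groupby(...).transform("mean")` as one possible alternative to merging separate mean tables; the table `counts_df` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical counts: ecoregion B always has twice as many records as A,
# and month 6 always has three times as many as month 5
counts_df = pd.DataFrame({
    "eco_code": ["A", "A", "B", "B"],
    "month": [5, 6, 5, 6],
    "occurrences": [10, 30, 20, 60],
})

# transform("mean") returns the group mean aligned to every original row
counts_df["mean_by_eco"] = (
    counts_df.groupby("eco_code")["occurrences"].transform("mean")
)
counts_df["mean_by_month"] = (
    counts_df.groupby("month")["occurrences"].transform("mean")
)

# Divide by both means to remove the ecoregion and month effects
counts_df["norm_occurrences"] = (
    counts_df["occurrences"]
    / counts_df["mean_by_eco"]
    / counts_df["mean_by_month"]
)

print(counts_df)
```

In this toy table every normalized value comes out the same, because the raw counts differ only by an ecoregion factor and a month factor. In real data, what remains after normalization is the variation that those two factors do not explain.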

In [56]:

# Move eco_code and month out of the index so they can serve as merge keys
# (the check keeps the cell safe to re-run)
if "eco_code" not in occurrence_df.columns:
    occurrence_df = occurrence_df.reset_index()

# Give the two mean columns distinct names before merging
mean_by_eco = (
    mean_occurrences_by_ecoregion
    .rename(columns={"mean_occurrences": "mean_by_eco"})
    .reset_index()
)
mean_by_month = (
    mean_occurrences_by_month
    .rename(columns={"mean_occurrences": "mean_by_month"})
    .reset_index()
)

# Attach the ecoregion mean and the month mean to each row,
# dropping copies left over from an earlier run of this cell
occurrence_df = (
    occurrence_df
    .drop(columns=["mean_by_eco", "mean_by_month"], errors="ignore")
    .merge(mean_by_eco, on="eco_code", how="left")
    .merge(mean_by_month, on="month", how="left")
)

# Normalize by both the ecoregion mean and the month mean
occurrence_df["norm_occurrences"] = (
    occurrence_df["occurrences"]
    / occurrence_df["mean_by_eco"]
    / occurrence_df["mean_by_month"]
)

print(occurrence_df.head())

  eco_code  month  occurrences  mean_by_eco  mean_by_month  norm_occurrences
0     Lake      4           12   344.666667     243.733333          0.000143
1     Lake      5         1792   344.666667    1552.324324          0.003349
2     Lake      6           29   344.666667    1273.357143          0.000066
3     Lake      7           29   344.666667     719.000000          0.000117
4     Lake      8           41   344.666667     295.294118          0.000403


<div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><p>Make sure to store the new version of your <code>DataFrame</code> for
other notebooks!</p>
<div id="f13606e9" class="cell" data-execution_count="9">
<div class="sourceCode" id="cb1"><pre
class="sourceCode python cell-code"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="op">%</span>store occurrence_df</span></code></pre></div>
</div></div></div>

## STEP -1: Wrap up

Don’t forget to store your variables so you can use them in other
notebooks! List the variables you want to save after `%store`,
separated by spaces.

In [57]:
%store occurrence_df mean_occurrences_by_ecoregion mean_occurrences_by_month

Stored 'occurrence_df' (DataFrame)
Stored 'mean_occurrences_by_ecoregion' (DataFrame)
Stored 'mean_occurrences_by_month' (DataFrame)


Finally, be sure to `Restart` and `Run all` to make sure your notebook
works all the way through!