# STEP 4: Count the number of observations in each ecosystem during each month of 2023

Much of the data in GBIF is **crowd-sourced**. As a result, the raw
number of observations in each ecosystem each month is not enough; we
need to **normalize** by some measure of **sampling effort**. After all, we
wouldn’t expect the same number of observations in the Arctic as we
would in a National Park, even if there were the same number of Veeries.
In this case, we first divide the observation counts by ecoregion area,
then normalize by the average observation density for each ecoregion and
each month. This should help control for the number of active observers
in each location and time of year.
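
As a sketch of the calculation carried out later in this step, the normalization can be written as follows. The symbols are introduced here for illustration only and do not appear in the notebook: $n_{e,m}$ is the number of observations in ecoregion $e$ during month $m$, $A_e$ is the ecoregion's area (the `SHAPE_AREA` column), $\bar{d}_{e}$ is the mean density for ecoregion $e$ across months, and $\bar{d}_{m}$ is the mean density for month $m$ across ecoregions.

$$
d_{e,m} = \frac{n_{e,m}}{A_e},
\qquad
\tilde{d}_{e,m} = \frac{d_{e,m}}{\bar{d}_{e}\,\bar{d}_{m}}
$$

Here $d_{e,m}$ corresponds to the `density` column and $\tilde{d}_{e,m}$ to the `norm_occurrences` column computed below.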

### Set up your analysis

First things first – let’s load your stored variables.

In [1]:
%store -r

### Identify the ecoregion for each observation

You can combine the ecoregions and the observations **spatially** using
a method called `.sjoin()`, which stands for spatial join.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-read"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Read More</div></div><div class="callout-body-container callout-body"><p>Check out the <a
href="https://geopandas.org/en/stable/docs/user_guide/mergingdata.html#spatial-joins"><code>geopandas</code>
documentation on spatial joins</a> to help you figure this one out. You
can also ask your favorite LLM (Large Language Model, like ChatGPT).</p></div></div>

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Perform a spatial join</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Identify the correct values for the <code>how=</code> and
<code>predicate=</code> parameters of the spatial join.</li>
<li>Select only the columns you will need for your plot.</li>
<li>Run the code.</li>
</ol></div></div>

In [2]:
ecoreg_gdf.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [3]:
gbif_gdf.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [4]:
ecoreg_gdf.head()

ecoregion,OBJECTID,ECO_NAME,BIOME_NUM,BIOME_NAME,REALM,ECO_BIOME_,NNH,ECO_ID,SHAPE_LENG,SHAPE_AREA,NNH_NAME,COLOR,COLOR_BIO,COLOR_NNH,LICENSE,geometry
0,1.0,Adelie Land tundra,11.0,Tundra,Antarctica,AN11,1,117,9.74978,0.038948,Half Protected,#63CFAB,#9ED7C2,#257339,CC-BY 4.0,"MULTIPOLYGON (((158.7141 -69.60657, 158.71264 ..."
1,2.0,Admiralty Islands lowland rain forests,1.0,Tropical & Subtropical Moist Broadleaf Forests,Australasia,AU01,2,135,4.800349,0.170599,Nature Could Reach Half Protected,#70A800,#38A700,#7BC141,CC-BY 4.0,"MULTIPOLYGON (((147.28819 -2.57589, 147.2715 -..."
2,3.0,Aegean and Western Turkey sclerophyllous and m...,12.0,"Mediterranean Forests, Woodlands & Scrub",Palearctic,PA12,4,785,162.523044,13.844952,Nature Imperiled,#FF7F7C,#FE0000,#EE1E23,CC-BY 4.0,"MULTIPOLYGON (((26.88659 35.32161, 26.88297 35..."
3,4.0,Afghan Mountains semi-desert,13.0,Deserts & Xeric Shrublands,Palearctic,PA13,4,807,15.084037,1.355536,Nature Imperiled,#FA774D,#CC6767,#EE1E23,CC-BY 4.0,"MULTIPOLYGON (((65.48655 34.71401, 65.52872 34..."
4,5.0,Ahklun and Kilbuck Upland Tundra,11.0,Tundra,Nearctic,NE11,1,404,22.590087,8.196573,Half Protected,#4C82B6,#9ED7C2,#257339,CC-BY 4.0,"MULTIPOLYGON (((-160.26404 58.64097, -160.2673..."


In [5]:
gbif_ecoregion_gdf = (
    ecoreg_gdf
    # Match the CRS of the GBIF data and the ecoregions
    .to_crs(gbif_gdf.crs)
    # Find ecoregion for each observation
    .sjoin(
        gbif_gdf,
        how='inner',
        predicate='contains')
    # Select the required columns
    [['OBJECTID', 'gbifID', 'ECO_NAME', 'BIOME_NUM', 'BIOME_NAME', 'month', 'SHAPE_AREA']]
)
gbif_ecoregion_gdf

ecoregion,OBJECTID,gbifID,ECO_NAME,BIOME_NUM,BIOME_NAME,month,SHAPE_AREA
12,13.0,4743927038,Alberta-British Columbia foothills forests,5.0,Temperate Conifer Forests,5,17.133639
12,13.0,4621947377,Alberta-British Columbia foothills forests,5.0,Temperate Conifer Forests,5,17.133639
12,13.0,4761090115,Alberta-British Columbia foothills forests,5.0,Temperate Conifer Forests,6,17.133639
12,13.0,4765238615,Alberta-British Columbia foothills forests,5.0,Temperate Conifer Forests,7,17.133639
12,13.0,4630693711,Alberta-British Columbia foothills forests,5.0,Temperate Conifer Forests,6,17.133639
...,...,...,...,...,...,...,...
839,845.0,4633848077,North Atlantic moist mixed forests,4.0,Temperate Broadleaf & Mixed Forests,10,5.586107
839,845.0,4749131402,North Atlantic moist mixed forests,4.0,Temperate Broadleaf & Mixed Forests,9,5.586107
839,845.0,4763942306,North Atlantic moist mixed forests,4.0,Temperate Broadleaf & Mixed Forests,9,5.586107
839,845.0,4746476478,North Atlantic moist mixed forests,4.0,Temperate Broadleaf & Mixed Forests,9,5.586107
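
To see what `how='inner'` and `predicate='contains'` do, here is a minimal sketch on made-up data. The names `toy_regions`, `toy_points`, and `toy_joined`, and all of the geometries, are hypothetical and are not part of the notebook's data; the sketch assumes a version of `geopandas` that provides the `GeoDataFrame.sjoin()` method used above.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two made-up square "ecoregions" (hypothetical geometries)
toy_regions = gpd.GeoDataFrame(
    {"ECO_NAME": ["west square", "east square"]},
    geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1)],
    crs="EPSG:4326",
)

# Three made-up "observations"; the last one lies outside both squares
toy_points = gpd.GeoDataFrame(
    {"gbifID": [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(2.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# Keep only polygon/point pairs where the polygon contains the point
toy_joined = toy_regions.sjoin(toy_points, how="inner", predicate="contains")
print(toy_joined[["ECO_NAME", "gbifID"]])
```

The point at (5, 5) falls outside both squares, so the inner join drops it. Each remaining row pairs a polygon with one point it contains, which is why the joined table above has one row per observation and repeats the ecoregion index.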


### Count the observations in each ecoregion each month

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Group observations by ecoregion</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Replace <code>columns_to_group_by</code> with a list of columns.
Keep in mind that you will end up with one row for each group – you want
to count the observations in each ecoregion by month.</li>
<li>Select only month/ecosystem combinations that have more than one
occurrence recorded, since a single occurrence could be an error.</li>
<li>Use the <code>.groupby()</code> and <code>.mean()</code> methods to
compute the mean occurrences by ecoregion and by month.</li>
<li>Run the code – it will normalize the number of occurrences by month
and ecoregion.</li>
</ol></div></div>

In [12]:
occurrence_df = (
    gbif_ecoregion_gdf
    # Reset the index so 'ecoregion' becomes a regular column
    .reset_index()
    # For each ecoregion, for each month...
    .groupby(['ecoregion', 'month'])
    # ...count the number of occurrences
    .agg(occurrences=('gbifID', 'count'),
         area=('SHAPE_AREA', 'first'))
)

In [13]:
# Normalize by ecoregion area (occurrences per unit of SHAPE_AREA)
occurrence_df['density'] = (
    occurrence_df.occurrences / occurrence_df.area
)
# Get rid of rare observations (possible misidentification?)
occurrence_df = occurrence_df[occurrence_df.occurrences > 1]
occurrence_df
# The number of rows drops from 350 to 308

ecoregion,month,occurrences,area,density
12,5,2,17.133639,0.116729
12,6,2,17.133639,0.116729
12,7,2,17.133639,0.116729
16,4,2,7.958751,0.251296
16,5,2980,7.958751,374.430624
...,...,...,...,...
833,7,293,35.905513,8.160307
833,8,40,35.905513,1.114035
833,9,11,35.905513,0.306360
839,9,25,5.586107,4.475389
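
The grouping, area normalization, and filtering steps above can be sketched on a small made-up table. Everything in `toy_obs` is hypothetical (the IDs, areas, and counts are invented for illustration); only the column names mirror the notebook.

```python
import pandas as pd

# Made-up observations: one row per occurrence record
toy_obs = pd.DataFrame({
    "ecoregion": [12, 12, 12, 16, 16, 16],
    "month": [5, 5, 6, 5, 5, 5],
    "gbifID": [101, 102, 103, 104, 105, 106],
    "SHAPE_AREA": [17.1, 17.1, 17.1, 8.0, 8.0, 8.0],
})

toy_counts = (
    toy_obs
    # One group per ecoregion/month combination
    .groupby(["ecoregion", "month"])
    # Named aggregation: count records, carry the (constant) area along
    .agg(occurrences=("gbifID", "count"),
         area=("SHAPE_AREA", "first"))
)

# Occurrences per unit area
toy_counts["density"] = toy_counts.occurrences / toy_counts.area

# Drop groups with a single record, as in the notebook
toy_counts = toy_counts[toy_counts.occurrences > 1]
print(toy_counts)
```

In this toy table, ecoregion 12 has only one record in month 6, so that group is removed by the filter while the other two groups are kept.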


In [14]:
# Take the mean by ecoregion
mean_occ_ecoregion = (
    occurrence_df
    .groupby('ecoregion')
    .mean()
)
# Take the mean by month
mean_occ_month = (
    occurrence_df
    .groupby('month')
    .mean()
)

In [15]:
mean_occ_ecoregion

ecoregion,occurrences,area,density
12,2.000000,17.133639,0.116729
16,1425.333333,7.958751,179.090084
22,3.000000,3.346216,0.896535
32,930.857143,16.637804,55.948319
33,243.142857,18.674884,13.019779
...,...,...,...
804,6.000000,5.968650,1.005253
827,9.000000,0.610793,14.734931
832,104.500000,4.286144,24.380889
833,235.400000,35.905513,6.556096


### Normalize the observations

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Normalize</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Divide occurrences by the mean occurrences by month AND the mean
occurrences by ecoregion.</li>
</ol></div></div>

In [16]:
# Normalize by space and time for sampling effort
occurrence_df['norm_occurrences'] = (
    occurrence_df[['density']]
    / mean_occ_ecoregion[['density']]
    / mean_occ_month[['density']]
)
occurrence_df

ecoregion,month,occurrences,area,density,norm_occurrences
12,5,2,17.133639,0.116729,0.010811
12,6,2,17.133639,0.116729,0.015697
12,7,2,17.133639,0.116729,0.029305
16,4,2,7.958751,0.251296,0.000080
16,5,2980,7.958751,374.430624,0.022602
...,...,...,...,...,...
833,7,293,35.905513,8.160307,0.036475
833,8,40,35.905513,1.114035,0.014753
833,9,11,35.905513,0.306360,0.001704
839,9,25,5.586107,4.475389,0.056981
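
The division in the cell above appears to rely on `pandas` aligning the `(ecoregion, month)` index with the single-level indexes of the two mean tables. As an alternative sketch, assuming you would rather make that alignment explicit, `.groupby().transform('mean')` returns the group mean repeated for every row, so the division is row by row. The table `toy_df` and its values are made up for illustration and are not from the notebook.

```python
import pandas as pd

# Made-up densities indexed by (ecoregion, month)
toy_df = pd.DataFrame(
    {"density": [2.0, 4.0, 1.0, 3.0]},
    index=pd.MultiIndex.from_tuples(
        [(1, 5), (1, 6), (2, 5), (2, 6)],
        names=["ecoregion", "month"],
    ),
)

# Mean density of each row's ecoregion, repeated on every row of that group
mean_by_ecoregion = toy_df.groupby("ecoregion")["density"].transform("mean")
# Mean density of each row's month, repeated on every row of that group
mean_by_month = toy_df.groupby("month")["density"].transform("mean")

# Divide by both means to normalize across space and time
toy_df["norm_density"] = toy_df["density"] / mean_by_ecoregion / mean_by_month
print(toy_df)
```

Because both `transform` results share the index of `toy_df`, no index alignment between differently shaped tables is needed. The trade-off is that the per-ecoregion and per-month means are not kept as separate summary tables.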


<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><p>Make sure to store the new version of your <code>DataFrame</code> for
other notebooks!</p>
<div id="2e01613b" class="cell" data-execution_count="9">
<div class="sourceCode" id="cb1"><pre
class="sourceCode python cell-code"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="op">%</span>store occurrence_df</span></code></pre></div>
</div></div></div>

In [17]:
%store occurrence_df gbif_ecoregion_gdf

Stored 'occurrence_df' (DataFrame)
Stored 'gbif_ecoregion_gdf' (DataFrame)
