# STEP 4: Count the number of observations in each ecosystem during each month of 2023

Much of the data in GBIF is **crowd-sourced**. As a result, we need not
just the number of observations in each ecosystem each month – we need
to **normalize** by some measure of **sampling effort**. After all, we
wouldn’t expect the same number of observations in the Arctic as we
would in a National Park, even if there were the same number of Veeries.
In this case, we’re normalizing by dividing each count by the average
number of observations for its ecoregion and by the average for its
month. This should help control for the number of active observers in
each location and time of year.
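
Written out, the normalization described above looks roughly like the formula below. The symbols are notation introduced here for illustration and do not appear elsewhere in the notebook: $n_{e,m}$ is the number of observations in ecoregion $e$ during month $m$, $\bar{n}_{e}$ is the mean count for ecoregion $e$ across months, and $\bar{n}_{m}$ is the mean count for month $m$ across ecoregions.

$$
\tilde{n}_{e,m} = \frac{n_{e,m}}{\bar{n}_{e} \, \bar{n}_{m}}
$$

A large $\tilde{n}_{e,m}$ means more observations than the overall activity in that ecoregion and that month would suggest.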

### Set up your analysis

First things first – let’s load your stored variables.

In [3]:
%store -r

### Identify the ecoregion for each observation

You can combine the ecoregions and the observations **spatially** using
a method called `.sjoin()`, which stands for spatial join.

<link rel="stylesheet" type="text/css" href="./assets/styles.css"><div class="callout callout-style-default callout-titled callout-read"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Read More</div></div><div class="callout-body-container callout-body"><p>Check out the <a
href="https://geopandas.org/en/stable/docs/user_guide/mergingdata.html#spatial-joins"><code>geopandas</code>
documentation on spatial joins</a> to help you figure this one out. You
can also ask your favorite LLM (large language model, such as ChatGPT).</p></div></div>

<div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Perform a spatial join</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Identify the correct values for the <code>how=</code> and
<code>predicate=</code> parameters of the spatial join.</li>
<li>Select only the columns you will need for your plot.</li>
<li>Run the code.</li>
</ol></div></div>
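
To get a feel for what `how=` and `predicate=` do before running the join on the real data, here is a minimal sketch on made-up geometries. The two square "ecoregions", the three points, and all variable names are hypothetical and are not part of the GBIF or ecoregion datasets.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two hypothetical square "ecoregions" side by side
toy_regions = gpd.GeoDataFrame(
    {"ECO_NAME": ["West", "East"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Three hypothetical observations; the last one lies outside both squares
toy_obs = gpd.GeoDataFrame(
    {"month": [5, 6, 7]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# how='inner' keeps only region/observation pairs that match;
# predicate='contains' matches a region to each point inside it
toy_joined = toy_regions.sjoin(toy_obs, how="inner", predicate="contains")
print(toy_joined[["ECO_NAME", "index_right", "month"]])
```

The observation that falls outside both squares is dropped by the inner join, and `index_right` holds the index of the matching observation, which is why the notebook renames it to `obs_id`.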

In [4]:
# Combine the two GeoDataFrames with .sjoin() (spatial join).
# The how= argument controls which rows are kept: 'left' keeps every row of
# the first gdf, 'right' keeps every row of the second, and 'inner' keeps
# only rows that have a spatial match in both.
# The predicate= argument is the spatial relationship to test, such as
# 'intersects', 'contains', or 'within'.
gbif_ecoregion_gdf = (
    ecoregions_gdf
    # Match the CRS of the GBIF data and the ecoregions
    .to_crs(gbif_gdf.crs)
    # Find ecoregion for each observation
    .sjoin(
        gbif_gdf,
        how='inner', 
        predicate='contains')
    # Select the required columns
    [['index_right', 'OBJECTID', 'month', 'ECO_NAME']]
    .rename(columns={'OBJECTID': 'ecoregion_id',
                     'index_right': 'obs_id'})
)
gbif_ecoregion_gdf

| ecoregion | obs_id | ecoregion_id | month | ECO_NAME |
|---|---|---|---|---|
| 12 | 108467 | 13.0 | 5 | Alberta-British Columbia foothills forests |
| 12 | 32240 | 13.0 | 5 | Alberta-British Columbia foothills forests |
| 12 | 134218 | 13.0 | 6 | Alberta-British Columbia foothills forests |
| 12 | 38202 | 13.0 | 7 | Alberta-British Columbia foothills forests |
| 12 | 68377 | 13.0 | 6 | Alberta-British Columbia foothills forests |
| ... | ... | ... | ... | ... |
| 839 | 145323 | 845.0 | 10 | North Atlantic moist mixed forests |
| 839 | 94857 | 845.0 | 9 | North Atlantic moist mixed forests |
| 839 | 69125 | 845.0 | 9 | North Atlantic moist mixed forests |
| 839 | 141052 | 845.0 | 9 | North Atlantic moist mixed forests |


### Count the observations in each ecoregion each month

<div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Group observations by ecoregion</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Replace <code>columns_to_group_by</code> with a list of columns.
Keep in mind that you will end up with one row for each group – you want
to count the observations in each ecoregion by month.</li>
<li>Select only month/ecoregion combinations that have more than one
occurrence recorded, since a single occurrence could be an error.</li>
<li>Use the <code>.groupby()</code> and <code>.mean()</code> methods to
compute the mean occurrences by ecoregion and by month.</li>
<li>Run the code – it will normalize the number of occurrences by month
and ecoregion.</li>
</ol></div></div>
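
The grouping and filtering steps in this task can be sketched on a small made-up table first. The data and the `toy_` variable names below are hypothetical; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical joined data: one row per observation
toy_obs = pd.DataFrame({
    "obs_id": [0, 1, 2, 3, 4, 5],
    "ecoregion_id": [13, 13, 13, 17, 17, 17],
    "month": [5, 5, 6, 5, 5, 5],
})

# One row per (ecoregion, month) group, counting the observations in it
toy_counts = (
    toy_obs
    .groupby(["ecoregion_id", "month"])
    .agg(occurrences=("obs_id", "count"))
)

# Keep only groups with more than one observation. The .copy() makes the
# filtered result an independent DataFrame, so adding columns to it later
# does not trigger a SettingWithCopyWarning.
toy_counts = toy_counts[toy_counts.occurrences > 1].copy()
print(toy_counts)
```

Here the group (13, 6) has a single observation and is dropped, leaving (13, 5) with 2 occurrences and (17, 5) with 3.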

In [5]:
# Count how many times the bird was recorded in each ecoregion during each month.
# .groupby() makes one group per (ecoregion_id, month) pair, and .agg() counts
# the obs_id values in each group into a new 'occurrences' column.
occurrence_df = (
    gbif_ecoregion_gdf
    # For each ecoregion, for each month...
    .groupby(['ecoregion_id', 'month'])
    # ...count the number of occurrences
    .agg(occurrences=('obs_id', 'count'))
    )
occurrence_df

| ecoregion_id | month | occurrences |
|---|---|---|
| 13.0 | 5 | 2 |
| 13.0 | 6 | 2 |
| 13.0 | 7 | 2 |
| 17.0 | 4 | 2 |
| 17.0 | 5 | 2980 |
| ... | ... | ... |
| 839.0 | 7 | 293 |
| 839.0 | 8 | 40 |
| 839.0 | 9 | 11 |
| 845.0 | 9 | 25 |


In [6]:
# Get rid of rare observations (possible misidentifications?)
# Keep an ecoregion/month group only if it has more than 1 observation;
# this drops the 42 groups with only 1 observation
occurrence_df = occurrence_df[occurrence_df.occurrences > 1]
occurrence_df

| ecoregion_id | month | occurrences |
|---|---|---|
| 13.0 | 5 | 2 |
| 13.0 | 6 | 2 |
| 13.0 | 7 | 2 |
| 17.0 | 4 | 2 |
| 17.0 | 5 | 2980 |
| ... | ... | ... |
| 839.0 | 7 | 293 |
| 839.0 | 8 | 40 |
| 839.0 | 9 | 11 |
| 845.0 | 9 | 25 |


In [7]:
# Take the mean by ecoregion
mean_occurrences_by_ecoregion = (
    occurrence_df
    .groupby('ecoregion_id')
    .mean()
)
mean_occurrences_by_ecoregion

| ecoregion_id | occurrences |
|---|---|
| 13.0 | 2.000000 |
| 17.0 | 1425.333333 |
| 23.0 | 3.000000 |
| 33.0 | 930.857143 |
| 34.0 | 243.142857 |
| ... | ... |
| 810.0 | 6.000000 |
| 833.0 | 9.000000 |
| 838.0 | 104.500000 |
| 839.0 | 235.400000 |


In [8]:
# Take the mean by month
mean_occurrences_by_month = (
    occurrence_df
    .groupby('month')
    .mean()
)
mean_occurrences_by_month

| month | occurrences |
|---|---|
| 1 | 7.666667 |
| 2 | 7.333333 |
| 3 | 5.666667 |
| 4 | 140.472222 |
| 5 | 1207.169492 |
| 6 | 1041.571429 |
| 7 | 572.894737 |
| 8 | 164.972973 |
| 9 | 260.897436 |
| 10 | 32.825 |


### Normalize the observations

<div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It: Normalize</div></div><div class="callout-body-container callout-body"><ol type="1">
<li>Divide occurrences by the mean occurrences by month AND the mean
occurrences by ecoregion</li>
</ol></div></div>
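
To see why dividing by both means helps, here is a minimal sketch on a made-up table. The counts and `toy_` names are hypothetical; the example assumes that pandas aligns a two-level index with a one-level index when the level names match, which is what makes the division line up by ecoregion and by month.

```python
import pandas as pd

# Hypothetical counts: ecoregion 17 has twice as many observations as
# ecoregion 13 in every month, and month 6 has three times as many as month 5
toy_counts = pd.DataFrame(
    {"occurrences": [2, 6, 4, 12]},
    index=pd.MultiIndex.from_tuples(
        [(13, 5), (13, 6), (17, 5), (17, 6)],
        names=["ecoregion_id", "month"],
    ),
)

# Mean count per ecoregion and per month, as Series
mean_by_ecoregion = toy_counts.groupby("ecoregion_id")["occurrences"].mean()
mean_by_month = toy_counts.groupby("month")["occurrences"].mean()

# Each division aligns on the index level with the matching name
toy_counts["norm_occurrences"] = (
    toy_counts["occurrences"] / mean_by_ecoregion / mean_by_month
)
print(toy_counts)
```

In this toy table every difference between rows comes from a busier ecoregion or a busier month, so all four normalized values come out equal. In real data, whatever variation remains after normalizing is the part that overall activity in that place and month does not explain.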

In [9]:
# Normalize by space and time to account for sampling effort.
# .assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is added directly to the filtered occurrence_df.
occurrence_df = occurrence_df.assign(
    norm_occurrences=(
        occurrence_df['occurrences']
        / mean_occurrences_by_ecoregion['occurrences']
        / mean_occurrences_by_month['occurrences']
    )
)
occurrence_df


| ecoregion_id | month | occurrences | norm_occurrences |
|---|---|---|---|
| 13.0 | 5 | 2 | 0.000828 |
| 13.0 | 6 | 2 | 0.000960 |
| 13.0 | 7 | 2 | 0.001746 |
| 17.0 | 4 | 2 | 0.000010 |
| 17.0 | 5 | 2980 | 0.001732 |
| ... | ... | ... | ... |
| 839.0 | 7 | 293 | 0.002173 |
| 839.0 | 8 | 40 | 0.001030 |
| 839.0 | 9 | 11 | 0.000179 |
| 845.0 | 9 | 25 | 0.005989 |


In [11]:
%store occurrence_df

Stored 'occurrence_df' (DataFrame)


<div class="callout callout-style-default callout-titled callout-task"><div class="callout-header"><div class="callout-icon-container"><i class="callout-icon"></i></div><div class="callout-title-container flex-fill">Try It</div></div><div class="callout-body-container callout-body"><p>Make sure to store the new version of your <code>DataFrame</code> for
other notebooks!</p>
<div id="2e01613b" class="cell" data-execution_count="9">
<div class="sourceCode" id="cb1"><pre
class="sourceCode python cell-code"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="op">%</span>store occurrence_df</span></code></pre></div>
</div></div></div>