#### Introduction
Before fitting a GAM, it is important to ensure the data meets some fundamental
criteria. Model fits can suffer when the sample size is too small, when the gaps
between surveys are too large, or when there are too many zero values. Computational
performance is another consideration. Here the data is filtered to leave only
site/species combinations that are suitable for modelling.

#### Aim
To identify site/species combinations that meet the entry requirements for the
stage 1 GAM. Prepare the data for the stage 1 GAM optimisation and validation.

#### Workflow
1) The data is filtered to leave only site/species combinations with baselines (a record
in the baseline year, 1993). These records will be used to optimise and validate the model.
2) The data is filtered to leave only records between 1983 and 2003 (this reduces
computational expense).
3) Surveys within 5 years of the baseline are identified; at least 1 is required.
4) Surveys within 10 years of the baseline are counted; at least 6 are required.
5) The data is filtered so that no gap of more than 5 years exists between consecutive
surveys.
6) The data is filtered to ensure the proportion of zero values is <= 0.15.
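
Steps 1 and 2 of the workflow can be sketched on a toy frame. This is a minimal illustration only: the site/species codes, years and index values below are invented, and `BASELINE_YEAR` and `WINDOW` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Toy data: invented codes and values, not taken from the UKBMS file
toy = pd.DataFrame({
    'site_code':    [1, 1, 1, 1, 2, 2],
    'species_code': [10, 10, 10, 10, 10, 10],
    'year':         [1980, 1990, 1993, 2001, 1990, 1995],
    'site_index':   [5, 7, 9, 4, 3, 6],
})

BASELINE_YEAR = 1993  # hypothetical constant name
WINDOW = 10           # years either side of the baseline

# Step 1: keep only combinations that have a record in the baseline year
with_baseline = (
    toy.loc[toy['year'] == BASELINE_YEAR, ['site_code', 'species_code']]
    .drop_duplicates()
)
kept = toy.merge(with_baseline, on=['site_code', 'species_code'], how='inner')

# Step 2: hide the baseline record and restrict to the 1983-2003 window
validation = kept[
    (kept['year'] != BASELINE_YEAR)
    & kept['year'].between(BASELINE_YEAR - WINDOW, BASELINE_YEAR + WINDOW)
]
print(validation)
```

Site 2 has no 1993 record, so it is dropped entirely; site 1 keeps only its 1990 and 2001 records.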

In [1]:
# Importing the required packages
import numpy as np
import pandas as pd
import os
from pathlib import Path

# Importing localised file directory
project_root = Path(os.environ['butterfly_project'])

# Importing data
ukbms = pd.read_csv(project_root/'Data'/'UKBMS'/'ukbms_master_v1.csv', index_col=0)

#### Finding Site/Species Combinations with Baselines

In [2]:
site_species_with_base_all = ukbms[ukbms['year']==1993][['site_code', 'species_code']]

# Finding the records of baseline site/species combinations
site_species_with_base_records = (
    ukbms.merge(site_species_with_base_all, 
                on=['site_code', 'species_code'], 
                how='inner')
)

# Filtering to 'hide' baseline year records from baseline site/species combinations
site_species_validation = (
    site_species_with_base_records[
    (site_species_with_base_records['year']!=1993) 
    # To predict the missing baseline, GAMs will only use records within 10 years of the
    # baseline year. This will help to reduce computational processing. 
    & (site_species_with_base_records['year']>=1983) 
    & (site_species_with_base_records['year']<=2003)
    ]
)

# Removing redundant columns
site_species_validation = (
    site_species_validation
    .drop(columns=['country',
                   'site_name',
                   'species',
                   'common_name',
                   'gridreference',
                   'easting',
                   'northing'])
)

#### Assessing Site/Species Combinations Suitable for Stage 1 GAM

#### 1 Data Point Must be Within 5 Years of Baseline Year

In [3]:
survey_within_5 = (
    site_species_validation[
    (site_species_validation['year']>=1988) 
    & (site_species_validation['year']<=1998)
    ]
)

#### 6 Data Points Must be Within 10 Years of the Baseline Year

In [4]:
survey_within_10_agg = (
    site_species_validation
    .groupby(['site_code','species_code']) # unique site/species combinations
    .agg(survey_count=('species_code','count')) # a count is applied to each combination
)

# Filtering by the count
count_within_10 = (
    survey_within_10_agg[
    survey_within_10_agg['survey_count']>=6 # the min count within 10 years is 6
    ] 
)

# Merging the first two conditions: 
counts_filter = (
    survey_within_5 # 1 survey within 5 years
    .merge(count_within_10, # 6 surveys within 10 years of baseline year
           how='inner',
           on=['site_code','species_code'])
    .drop(columns=['survey_count']) # column no longer required
)

See cell above:

Now all remaining site/species combinations have at least 1 survey within 5 years of the baseline year and at least 6 within 10 years.
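
The two count conditions can also be checked in one pass with `groupby(...).transform`, as an alternative to the merge used above. This is a sketch on invented data, not the notebook's method. Unlike 'counts_filter', it keeps every record of an eligible combination rather than only those within 5 years of the baseline.

```python
import pandas as pd

# Toy data: invented site codes and survey years
toy = pd.DataFrame({
    'site_code':    ['A'] * 6 + ['B'] * 6 + ['C'] * 3,
    'species_code': [1] * 15,
    'year': [1984, 1986, 1990, 1995, 1999, 2002,   # A: 6 records, some within 5 years
             1983, 1984, 1985, 1986, 2002, 2003,   # B: 6 records, none within 5 years
             1991, 1992, 1994],                    # C: only 3 records
})
keys = ['site_code', 'species_code']
baseline = 1993

grouped = toy.groupby(keys)['year']

# Number of records per combination (all toy records lie within 10 years of baseline)
n_within_10 = grouped.transform('count')

# Number of records per combination that fall within 5 years of the baseline
n_within_5 = grouped.transform(
    lambda y: y.between(baseline - 5, baseline + 5).sum()
)

eligible = toy[(n_within_10 >= 6) & (n_within_5 >= 1)]
print(eligible['site_code'].unique())
```

Only site A passes both conditions: site B has no survey between 1988 and 1998, and site C has too few surveys.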

See cell below:

'counts_filter' contains only records within 5 years of the baseline. To recover all records
within 10 years of the baseline, 'counts_filter' is joined with 'site_species_validation'.

In [5]:
year_filter = (
    site_species_validation
    .merge(counts_filter[['site_code', 'species_code']].drop_duplicates(),
           how='inner', 
           on=['site_code', 'species_code'])
)

Now all remaining site/species combinations have at least 1 survey within 5 years of 
the baseline year and at least 6 within 10 years, and their records fall within 10 years
either side of the baseline year (1983 to 2003). 

#### Ensuring a Year Gap >5 Years Between Surveys Does Not Exist
This is important to ensure compatibility with the GAM.

In [6]:
# First the data is sorted into site/species combinations and ordered by year.
survey_year_diff = (
    year_filter.sort_values(['site_code',
                          'species_code',
                          'year'])
    .reset_index(drop=True) # index must be reset after records re-ordered
)
    
# A new column is added detailing the number of years between the current and previous 
# survey for that site/species combination
survey_year_diff['diff_previous'] = (
    survey_year_diff
    .groupby(['site_code','species_code'])['year']
    .diff() # computes difference between current and previous row
)

# A new column is added detailing the number of years between the current and following
# survey for that site/species combination
survey_year_diff['diff_next'] = (
    survey_year_diff
    .groupby(['site_code','species_code'])['year']
    .diff(-1)*-1 # computes: (current row value - next row value)*-1
)

In [7]:
survey_year_diff['consecutive_survey_group'] = ( # new column name
    survey_year_diff
    .groupby(['site_code','species_code'])['diff_previous']
    .transform(lambda x: (x>5) # year gap exceeds 5 years
               # A null 'diff_previous' marks the first survey of a site/species
               # combination, which also starts a new survey group
               | (x.isnull()))
    .cumsum() # Creates 'survey group' IDs by adding 1 each time a new group starts.
)

In [8]:
# A count is applied to each survey group to determine the number of 'consecutive 
# surveys' for each site/species combination.
survey_year_diff['consecutive_surveys'] = (
    survey_year_diff
    .groupby(['consecutive_survey_group'])['consecutive_survey_group']
    .transform('count')
)

# Groupings must contain at least 6 records (minimum number of data points required for the GAM). 
records_retained = (
    # removing groups with fewer than 6 data points
    survey_year_diff[survey_year_diff['consecutive_surveys']>=6]
    .reset_index(drop=True)
)

This leaves only groups with a minimum of 6 consecutive surveys that are spaced no more 
than 5 years apart, that have at least 1 survey within 5 years of 1993, and 6 surveys within
10 years of 1993. All records are between 1983 and 2003.

There could still be multiple survey groups for one site/species combination (e.g. 1983-1988 
and 1998-2003).
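
To illustrate how a single combination can split into two survey groups, the sketch below applies the same `diff` and `cumsum` idea to an invented combination surveyed in 1983-1988 and 1998-2003. The column names 'group' and 'group_size' are shortened stand-ins for the notebook's 'consecutive_survey_group' and 'consecutive_surveys'.

```python
import pandas as pd

# Toy data: one invented combination with a 10-year gap in the middle
toy = (
    pd.DataFrame({
        'site_code': ['A'] * 12,
        'species_code': [1] * 12,
        'year': [1983, 1984, 1985, 1986, 1987, 1988,
                 1998, 1999, 2000, 2001, 2002, 2003],
    })
    .sort_values(['site_code', 'species_code', 'year'])
    .reset_index(drop=True)
)
keys = ['site_code', 'species_code']

# Years since the previous survey within each combination (NaN for the first survey)
gap = toy.groupby(keys)['year'].diff()

# A new group starts at the first survey of a combination or after a gap > 5 years
new_group = gap.isna() | (gap > 5)

# The running total of group starts gives each survey group a unique ID
toy['group'] = new_group.cumsum()
toy['group_size'] = toy.groupby('group')['group'].transform('count')

print(toy.groupby('group')['year'].agg(['min', 'max', 'count']))
```

Both groups contain 6 surveys, so both pass the size filter, which is why a further step is needed to keep a single group per combination.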

In [9]:
# A new column is created containing the first cs_group number to appear for each 
# site/species combination.
records_retained['first_cs_group'] = (
    records_retained
    .groupby(['site_code', 'species_code'])['consecutive_survey_group']
    # if more than 1 'consecutive_survey_group' in a site/species combination exists, 
    # the first will be selected using .min()
    .transform('min') 
)

# Remove multiple groupings in each site/species combination by filtering against the 
# 'first_cs_group' column
records_retained = (
    records_retained[
    # If multiple groupings exist, only the first will be retained
    records_retained['first_cs_group']==records_retained['consecutive_survey_group']
    ].reset_index(drop=True) # index is reset following row removal
)

#### Limiting the Proportion of Zero Values in Each Survey Group
The proportion of zero values must be <=15% for a group to be suitable for the GAM.

In [10]:
proportion_zero = (
    records_retained
    .groupby(['consecutive_survey_group'])['site_index']
    # the mean of the boolean mask is the zero count divided by the group size
    .transform(lambda x: (x==0).mean()) 
)

# Adding proportion of zeroes column to main dataframe
records_retained['proportion_zero'] = proportion_zero

In [11]:
# Filtering each site/species combination to leave only those with <=15% zero records
gam_1_accept = (
    records_retained[
    records_retained['proportion_zero']<=0.15 # 'proportion_zero' column is filtered
    ].reset_index(drop=True)
)

This leaves only site/species groups with 6-20 data points. 
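
The zero-proportion rule applied above can be sketched on two invented groups of 10 records each. The column name 'group' stands in for 'consecutive_survey_group', and the index values are made up for illustration.

```python
import pandas as pd

# Toy data: two invented survey groups of 10 records each
toy = pd.DataFrame({
    'group': [1] * 10 + [2] * 10,
    'site_index': [0, 3, 4, 5, 2, 6, 7, 1, 2, 3,    # group 1: 1 zero in 10
                   0, 0, 4, 5, 2, 6, 7, 1, 2, 3],   # group 2: 2 zeros in 10
})

# Proportion of zero values per group, broadcast back onto every row
toy['proportion_zero'] = (
    toy.groupby('group')['site_index']
    .transform(lambda s: (s == 0).mean())
)

# Keep only groups at or below the 15% threshold
accepted = toy[toy['proportion_zero'] <= 0.15]
print(accepted['group'].unique())
```

Group 1 has a zero proportion of 0.10 and is kept; group 2 has 0.20 and is removed.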

In [12]:
print(gam_1_accept['consecutive_surveys'].value_counts())

consecutive_surveys
20    8040
12    5292
11    5181
13    5031
19    4940
17    4641
10    4360
14    4242
18    4230
16    3472
15    3300
9     2745
8     1392
7     1295
6      366
Name: count, dtype: int64


In [13]:
# exporting filtered data to csv file
gam_1_accept.to_csv(project_root/'Data'/'UKBMS'/'gam_optimisation'/'gam_1_accept_validation.csv')

#### Identifying Site/Species Combinations that were Not Suitable for the GAM Method
Baselines will be approximated from this data using another method. 

In [14]:
gam_1_indicator = (
    site_species_validation[['site_code','species_code','year', 'site_index']]
    # Merging against the unique accepted combinations avoids many-to-many row duplication
    .merge(gam_1_accept[['site_code','species_code']].drop_duplicates(), 
           on=['site_code','species_code'], 
           how='left',
           # Creates new column. Reveals records that only appear in left dataframe
           indicator=True) 
)

# filtering df using the indicator column
gam_1_reject = (
    gam_1_indicator[gam_1_indicator['_merge']=='left_only'] # filter
    [['site_code', 'species_code', 'year', 'site_index']]
    .reset_index(drop=True) # rows removed, new index required
)
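
The left-only filter used in the cell above is an anti-join: it keeps the rows of the left frame whose keys do not appear in the right frame. A minimal sketch on invented data, where `all_records` and `accepted_keys` are hypothetical stand-ins for 'site_species_validation' and the accepted combinations:

```python
import pandas as pd

# Toy data: invented records and a single accepted combination
all_records = pd.DataFrame({
    'site_code':    ['A', 'A', 'B', 'B', 'C'],
    'species_code': [1, 1, 1, 1, 2],
    'year':         [1990, 1991, 1990, 1991, 1990],
})
accepted_keys = pd.DataFrame({'site_code': ['A'], 'species_code': [1]})

# indicator=True adds a '_merge' column recording where each row was found
flagged = all_records.merge(
    accepted_keys,
    on=['site_code', 'species_code'],
    how='left',
    indicator=True,
)

# Rows marked 'left_only' have no matching accepted combination
rejected = (
    flagged[flagged['_merge'] == 'left_only']
    .drop(columns='_merge')
    .reset_index(drop=True)
)
print(rejected)
```

Because the right frame holds one row per combination, the merge returns exactly one row per input record.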

In [15]:
# exporting rejected records to a csv file
gam_1_reject.to_csv(project_root/'Data'/'UKBMS'/'gam_optimisation'/'gam_1_reject_validation.csv')