#### Introduction
To build a reliable GAM, the data must first meet some fundamental criteria. Model fit
can suffer when the sample size is too small, the gaps in the data are too large, or
there are too many zero values. Computational performance is another consideration.
Here the data is filtered to leave only site/species combinations that are suitable
for modelling.

#### Aim
To identify site/species combinations that meet the entry requirements for the
stage 1 GAM, and to prepare their data for it.

#### Workflow
1) The dataset is filtered to leave only site/species combinations without a baseline (no 1993 record).
2) The data is filtered to leave only records between 1983 and 2003 (this reduces computational expense).
3) Surveys within 5 years of the baseline are counted (at least 1 is required).
4) Surveys within 10 years of the baseline are counted (at least 6 are required).
5) The data is filtered so that no gap of more than 5 years exists between consecutive surveys.
6) The data is filtered so that the proportion of zero values is <= 0.15.

In [2]:
# Importing the required packages
import numpy as np
import pandas as pd
import os
from pathlib import Path

# Importing localised file directory
project_root = Path(os.environ['butterfly_project'])

# Importing data
ukbms = pd.read_csv(project_root/'Data'/'UKBMS'/'ukbms_master_v1.csv', index_col=0)

#### Filtering to site/species Combinations Without a Baseline

In [3]:
site_species_with_base = ukbms[ukbms['year']==1993][['site_code', 'species_code']]

In [4]:
# Labelling all site/species combinations without a baseline record
site_species_indicator = (
    ukbms
    .merge(site_species_with_base,
           how='left',
           on=['site_code', 'species_code'],
           # adds a '_merge' column; 'left_only' marks records found in the left df only
           indicator=True)
)

In [5]:
# Filtering
site_species_without_base = (
    site_species_indicator[
    # Only records appearing in left df are retained
    site_species_indicator['_merge']=='left_only'
    ].reset_index(drop=True) # index reset following row removal
)
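
The merge-and-filter pattern above is an anti-join. A minimal sketch on a hypothetical toy table (the values below are invented for illustration and are not UKBMS data) shows how `indicator=True` and `'left_only'` remove every record of a combination that has a 1993 baseline:

```python
import pandas as pd

# Hypothetical toy records (illustrative values only)
toy = pd.DataFrame({
    'site_code':    [1, 1, 2, 2],
    'species_code': [10, 10, 10, 10],
    'year':         [1992, 1993, 1992, 1994],
})

# Site/species pairs that have a 1993 (baseline) record
with_base = (
    toy.loc[toy['year'] == 1993, ['site_code', 'species_code']]
    .drop_duplicates()
)

# The left merge keeps every toy row; '_merge' records where each row was found
flagged = toy.merge(with_base,
                    how='left',
                    on=['site_code', 'species_code'],
                    indicator=True)

# Anti-join: keep only rows whose pair never appears in with_base
without_base = (
    flagged[flagged['_merge'] == 'left_only']
    .drop(columns='_merge')
    .reset_index(drop=True)
)

print(without_base)  # only the two site 2 rows remain
```

Site 1 has a 1993 record, so both of its rows are dropped, not just the 1993 row.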

#### Filtering to Years Between 1983 and 2003

In [6]:
# To predict the missing baseline, GAMs will only use records within 10 years of the
# baseline year (1993). This also helps to reduce computational processing.
site_species_review = (
    site_species_without_base[
    (site_species_without_base['year']>=1983) 
    & (site_species_without_base['year']<=2003)
    ].reset_index(drop=True)
)

# Removing redundant columns
site_species_review = site_species_review.drop(columns=['country',
                                                        'site_name',
                                                        'species',
                                                        'common_name',
                                                        'gridreference',
                                                        'easting',
                                                        'northing',
                                                        '_merge'])

#### Assessing site/species Combinations Suitable for Stage 1 GAM

#### At Least 1 Data Point Must be Within 5 Years of the Baseline Year

In [7]:
survey_within_5 = (
    site_species_review[
    (site_species_review['year']>=1988) & (site_species_review['year']<=1998)
    ]
)

#### At Least 6 Data Points Must be Within 10 Years of the Baseline Year

In [8]:
survey_within_10_agg = (
    site_species_review
    .groupby(['site_code','species_code']) # unique site/species combinations
    .agg(survey_count=('species_code','count')) # a count is applied to each combination 
)

# Filtering by the count
count_within_10 = (
    survey_within_10_agg[
    survey_within_10_agg['survey_count']>=6 # the min count within 10 years is 6
    ])

In [None]:
# Combining the first two conditions with an inner merge
counts_filter = (
    survey_within_5 # 1 survey within 5 years
    .merge(count_within_10, # 6 surveys within 10 years of baseline year
           how='inner',
           on=['site_code','species_code'])
    .drop(columns=['survey_count']) # column no longer required
)

See cell above:

Now all remaining site/species combinations have at least 1 survey within 5 years of 
the baseline year and at least 6 within 10 years. 
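
As an illustration of the two count conditions, the sketch below applies them to a hypothetical toy table (invented values, not UKBMS data). The 1988 to 1998 and 1983 to 2003 windows and the threshold of 6 follow the cells above; `Series.between` is used here as a shorthand for the two inequality tests.

```python
import pandas as pd

# Hypothetical toy records, already restricted to 1983-2003
toy = pd.DataFrame({
    'site_code':    [1] * 6 + [2] * 3,
    'species_code': [10] * 6 + [10] * 3,
    'year':         [1984, 1986, 1989, 1990, 1995, 2001, 1983, 1985, 2002],
})

# Condition 1: records within 5 years of the 1993 baseline (bounds inclusive)
within_5 = toy[toy['year'].between(1988, 1998)]

# Condition 2: at least 6 records per site/species within 10 years of baseline
counts = (
    toy.groupby(['site_code', 'species_code'])
    .agg(survey_count=('year', 'count'))
    .reset_index()
)
enough = counts[counts['survey_count'] >= 6]

# Combinations that satisfy both conditions
eligible = (
    within_5[['site_code', 'species_code']]
    .drop_duplicates()
    .merge(enough, how='inner', on=['site_code', 'species_code'])
)

print(eligible)
```

Site 1 passes both conditions (3 surveys in 1988 to 1998, 6 in total); site 2 fails both.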

See cell below:

'counts_filter' contains records within 5 years of the baseline only. To acquire records 
within 10 years of the baseline, 'counts_filter' is joined with 'site_species_review'.

In [None]:
year_filter = (
    site_species_review
    .merge(counts_filter[['site_code', 'species_code']].drop_duplicates(),
           how='inner',
           on=['site_code', 'species_code'])
)

Now all remaining site/species combinations have at least 1 survey within 5 years of 
the baseline year and at least 6 within 10 years, and their records span up to 10 years 
either side of the baseline year (1983 to 2003). 

#### Ensuring That a Year Gap >5 Years Between Surveys Does Not Exist
This is important to ensure compatibility with the GAM.

In [None]:
# First the data is sorted into site/species combinations and ordered by year.
survey_year_diff = (
    year_filter
    .sort_values(['site_code','species_code','year'])
    .reset_index(drop=True) # index must be reset after records re-ordered
)

# A new column is added detailing the number of years between the current and previous 
# survey for that site/species combination
survey_year_diff['diff_previous'] = (
    survey_year_diff
    .groupby(['site_code','species_code'])['year']
    .diff() # computes difference between current and previous row
)

# A new column is added detailing the number of years between the current and following
# survey for that site/species combination
survey_year_diff['diff_next'] = (
    survey_year_diff
    .groupby(['site_code','species_code'])['year']
    .diff(-1)*-1 # computes: (current row value - next row value)*-1
)

In [12]:
survey_year_diff['consecutive_survey_group'] = ( # new column name
    survey_year_diff
    .groupby(['site_code','species_code'])['diff_previous']
    # True where a new survey group starts: either the year gap exceeds 5 years, or
    # 'diff_previous' is null (the first survey of a site/species combination)
    .transform(lambda x: (x>5) | (x.isnull()))
    # The running total of the True flags gives each survey group a unique number
    .cumsum()
)
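
The grouping logic above can be traced on a hypothetical toy table (invented values, not UKBMS data). This sketch builds the flag directly on the column instead of through `groupby(...).transform`. That should give the same result, because the first survey of each combination already has a null `diff_previous`.

```python
import pandas as pd

# Hypothetical toy records: site 1 has a 9-year gap between 1986 and 1995
toy = pd.DataFrame({
    'site_code':    [1] * 5 + [2] * 2,
    'species_code': [10] * 5 + [10] * 2,
    'year':         [1983, 1984, 1986, 1995, 1996, 1990, 1991],
})

toy = (
    toy.sort_values(['site_code', 'species_code', 'year'])
    .reset_index(drop=True)
)

# Years since the previous survey of the same site/species combination
toy['diff_previous'] = toy.groupby(['site_code', 'species_code'])['year'].diff()

# A new group starts at the first survey of a combination (null diff)
# or wherever the gap to the previous survey exceeds 5 years
starts_new_group = (toy['diff_previous'] > 5) | toy['diff_previous'].isnull()

# The cumulative sum of the flags numbers the groups 1, 2, 3, ...
toy['consecutive_survey_group'] = starts_new_group.cumsum()

print(toy)
```

Site 1 is split into groups 1 (1983 to 1986) and 2 (1995 to 1996), and site 2 forms group 3, so group numbers are unique across the whole table.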

In [None]:
# A count is applied to each survey group to determine the number of 'consecutive 
# surveys' for each site/species combination.
survey_year_diff['consecutive_surveys'] = (
    survey_year_diff
    .groupby(['consecutive_survey_group'])['consecutive_survey_group']
    .transform('count')
)

# Each group must contain at least 6 records (the minimum number of data points
# required for the GAM).
records_retained = (
    survey_year_diff[
    # removing groups with fewer than 6 data points
    survey_year_diff['consecutive_surveys']>=6 
    ].reset_index(drop=True)
)

This leaves only groups with a minimum of 6 consecutive surveys that are spaced no more 
than 5 years apart, with at least 1 survey within 5 years of 1993 and at least 6 surveys 
within 10 years of 1993. All records are between 1983 and 2003.

There could still be multiple groups for one site/species combination (e.g. 1983-1988 
and 1998-2003).

In [None]:
# A new column is created containing the first cs_group number to appear for each 
# site/species combination.
records_retained['first_cs_group'] = (
    records_retained
    .groupby(['site_code', 'species_code'])['consecutive_survey_group']
    # if more than 1 'consecutive_survey_group' exists in a site/species combination, 
    # the first (lowest group number) is selected
    .transform('min')
)

In [15]:
# Remove multiple groupings in each site/species combination by filtering against the 
# 'first_cs_group' column
records_retained = (
    records_retained[
    # If multiple groupings exist, only the first will be retained
    records_retained['first_cs_group']==records_retained['consecutive_survey_group']
    ].reset_index(drop=True) # index is reset following row removal
)
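
A minimal sketch of the "keep the first group" step, on a hypothetical toy table (invented values). It assumes, as above, that group numbers increase with year within each site/species combination, so the lowest number is the earliest group.

```python
import pandas as pd

# Hypothetical toy records: site 1 has two surviving groups (1 and 3)
toy = pd.DataFrame({
    'site_code':                [1, 1, 1, 1, 2, 2],
    'species_code':             [10, 10, 10, 10, 10, 10],
    'consecutive_survey_group': [1, 1, 3, 3, 4, 4],
})

# Lowest group number per site/species, broadcast back to every row
first_group = (
    toy.groupby(['site_code', 'species_code'])['consecutive_survey_group']
    .transform('min')
)

# Keep only rows that belong to the first group of their combination
kept = toy[toy['consecutive_survey_group'] == first_group].reset_index(drop=True)

print(kept)
```

Group 3 is dropped because site 1 already has the earlier group 1; site 2 keeps its only group.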

#### Limiting the Proportion of Zero Values in Each Survey Group
The proportion of zero values in each survey group needs to be <=15% for the group to be suitable for the GAM.

In [16]:
proportion_zero = (
    records_retained.groupby(['consecutive_survey_group'])['site_index']
    # a zero count is applied to each group and divided by the group size
    .transform(lambda x: sum(x==0)/len(x))
)

# Adding proportion of zeroes column to main dataframe
records_retained['proportion_zero'] = proportion_zero
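
As a small check of the proportion calculation, the sketch below uses a hypothetical toy table (invented `site_index` values). It takes the mean of the boolean mask `x == 0`, which should equal `sum(x==0)/len(x)`, and then applies the 0.15 threshold.

```python
import pandas as pd

# Hypothetical toy records: group 1 has 1 zero in 8, group 2 has 2 zeros in 4
toy = pd.DataFrame({
    'consecutive_survey_group': [1] * 8 + [2] * 4,
    'site_index':               [0, 5, 3, 2, 7, 1, 4, 6, 0, 0, 3, 2],
})

# Share of zero values in each group, broadcast back to every row
toy['proportion_zero'] = (
    toy.groupby('consecutive_survey_group')['site_index']
    .transform(lambda x: (x == 0).mean())
)

# Groups with at most 15% zeros pass the filter
kept = toy[toy['proportion_zero'] <= 0.15].reset_index(drop=True)

print(toy[['consecutive_survey_group', 'proportion_zero']].drop_duplicates())
```

Group 1 has a zero proportion of 0.125 and is kept; group 2 has 0.5 and is removed.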

In [17]:
# Filtering each site/species combination to leave only those with <=15% zero records
gam_1_accept = (
    records_retained[
    records_retained['proportion_zero']<=0.15 # 'proportion_zero' column is filtered
    ].reset_index(drop=True)
)

This leaves only site/species groups with 6-20 data points. 

In [None]:
print(gam_1_accept['consecutive_surveys'].value_counts())

consecutive_surveys
7     5369
8     5040
9     3978
10    3030
6     2886
14     826
16     656
13     611
11     583
12     552
15     525
17     493
19     456
18     360
20     160
Name: count, dtype: int64


In [19]:
# Creating csv file for stage 1 GAM analysis
gam_1_accept.to_csv(project_root/'Data'/'UKBMS'/'gam_1'/'gam_1_accept.csv')

#### Identifying site/species Combinations that were not Suitable for the GAM Method
Baselines will be approximated from this data using another method.

In [20]:
gam_1_indicator = (
    site_species_review[['site_code','species_code','year', 'site_index']]
    # one row per accepted combination, so matched records are not duplicated
    .merge(gam_1_accept[['site_code','species_code']].drop_duplicates(),
           on=['site_code','species_code'],
           how='left',
           suffixes=['','_drop_col'],
           # Creates new column. Reveals records that only appear in left dataframe
           indicator=True)
)

gam_1_reject = (
    gam_1_indicator[gam_1_indicator['_merge']=='left_only'] # filter
    [['site_code','species_code','year','site_index']] 
    .reset_index(drop=True) # rows removed, new index required
)

In [21]:
# Creating csv file for site/species combinations not accepted for gam stage 1.
gam_1_reject.to_csv(project_root/'Data'/'UKBMS'/'gam_1'/'gam_1_reject.csv')