#### Introduction
Before baseline indices can be predicted, the prediction model must be optimised and
validated. 
pyGAM provides several model selection criteria (GCV, AIC, UBRE). However, these
metrics evaluate the model 'fit' over the whole range of values (the fit is 
optimised by balancing predictive power against complexity). In this context, neither 
the overall fit nor the accuracy with which the model predicts surrounding years 
matters, only the accuracy of its baseline year prediction. For this reason a different 
optimisation workflow is used: each model is used to predict the missing 
baseline year. In another section these predictions are grouped and compared with the 
actual values. Performance metrics (correlation coefficient, RMSE, MSE and ME) and 
visualisations are then produced for each model, and the optimal model is selected 
based on these. 
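
As a rough illustration of the metrics named above, the sketch below computes MSE, RMSE, ME and the Pearson correlation coefficient for paired actual and predicted baseline values. It assumes ME means the signed mean error (predicted minus actual); the function name `baseline_metrics` and the input values are hypothetical and are not taken from the evaluation section.

```python
import math

def baseline_metrics(actual, predicted):
    """Return MSE, RMSE, ME and Pearson r for paired baseline values."""
    n = len(actual)
    # Signed errors: positive values mean the model over-predicts (assumed convention).
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    me = sum(errors) / n
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_a * var_p)
    return {'mse': mse, 'rmse': math.sqrt(mse), 'me': me, 'r': r}

# Made-up values, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
metrics = baseline_metrics(actual, predicted)
```

In this made-up case every prediction is 0.5 too high, so the correlation is perfect while ME and RMSE both expose the bias. This is why correlation alone would not be enough to rank the models.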

#### Aim 
The aim of this section is to generate GAM baseline predictions using different input
parameters (lambda and spline count). The predictive performance of each model is 
assessed in another section. 

#### Workflow
1) The dataset is log transformed.
2) The dataset is filtered by sample size (a separate model is optimised for sample
sizes 6-8, 9-11, 12-14, 15-17, 18-20). 
3) The model input parameters are defined.
4) A 'for loop' is used to generate models and compute predictions. These are stored
in a dictionary.
5) The prediction data is cleaned and exported for performance evaluation.

In [2]:
# Importing packages
import pandas as pd
import numpy as np
from pygam import LinearGAM, s 
import os
from pathlib import Path

# Importing localised file directory
project_root = Path(os.environ['butterfly_project'])

# Importing data
gam_1_accept = pd.read_csv(project_root/'Data'/'UKBMS'/'gam_optimisation'/'gam_1_accept_validation.csv', index_col=0)

#### Log Transforming Site Index Scores Using log1p().
log1p(x) computes log(1 + x), which allows zero values to be included. 
Indices are transformed because the data are strongly right-skewed. 
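
A minimal sketch of the transform and its inverse, using the standard library equivalents of `np.log1p` and `np.expm1`. The index values are made up. The notebook does not state whether predictions are later back-transformed, so the `expm1` step is shown only as an assumption about how log-scale predictions could be returned to the original scale.

```python
import math

# Made-up site index values, including a zero count.
site_index = [0.0, 9.0, 99.0]

# log1p(x) = log(1 + x), so a zero index maps to 0 rather than -inf.
logged = [math.log1p(x) for x in site_index]

# expm1(y) = exp(y) - 1 is the exact inverse of log1p.
restored = [math.expm1(y) for y in logged]
```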

In [None]:
gam_1_accept['log_site_index'] = np.log1p(gam_1_accept['site_index'])

#### Filtering the Dataset by Sample Size
Survey groups are partitioned by the number of consecutive surveys. This is because 
different sample sizes are likely to require different GAM parameters.

In [None]:
gam_1_accept = (
    gam_1_accept[
    (gam_1_accept['consecutive_surveys']>=18) # enter min 'consecutive_surveys' here
    & (gam_1_accept['consecutive_surveys']<=20) # enter max 'consecutive_surveys' here
    ].reset_index(drop=True)
)

#### Defining the Hyperparameters
The parameters to be optimised are lambda and spline count. 
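
The sketch below reproduces the shape of the grid used in this notebook (11 lambda values from 10^3 down to 10^-3, spline counts 4 to 10) with plain Python lists, to show how many models are fitted per survey group. The list names are stand-ins chosen for this example. It also shows that a bare `itertools.product` object can be looped over only once, which matters if a notebook cell is re-run.

```python
from itertools import product

# Stand-in values with the same shape as the notebook's grid:
# 11 lambda exponents from 3 down to -3, and spline counts 4 to 10.
lam_exponents = [round(3 - 0.6 * i, 1) for i in range(11)]
lam_values = [10 ** e for e in lam_exponents]
spline_counts = list(range(4, 11))

# Materialising the product as a list makes it reusable.
grid = list(product(lam_values, spline_counts))
n_combinations = len(grid)

# A bare product() object is exhausted after one pass.
one_shot = product(lam_values, spline_counts)
first_pass = sum(1 for _ in one_shot)
second_pass = sum(1 for _ in one_shot)
```

Each survey group is therefore fitted `n_combinations` times, once per (lambda, spline count) pair.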

In [4]:
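# 'product' is used to create all combinations of the input parameters
from itertools import product

lam = np.flip(np.logspace(-3, 3, 11)) # range of lambda values (10^3 down to 10^-3)
splines = np.arange(4, 11) # range of spline counts (4 to 10)

# Stored as a list so the combinations can be looped over more than once
# (a bare product() iterator is exhausted after a single pass).
combinations = list(product(lam, splines))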

#### Creating a Baseline Model for All site/species Groups and Hyperparameter Combinations
Two loops are created:
- The outer loop runs through all combinations of hyperparameters. Each combination is passed to the inner loop. 
- The inner loop runs through all 'consecutive survey groups'. Each group is fed directly into the GAM. This way predictions for all survey groups are created from a single combination of hyperparameters.
- Once the inner loop is complete, the predictions are stored, and the process repeats for the next combination of hyperparameters.

In [5]:
# An empty dictionary is created. This will store all the model predictions.
parameter_data = {} 
# The outer loop
for lam, splines in combinations: 
    # Creating a unique list of all consecutive survey groups
    all_groups = list(gam_1_accept['consecutive_survey_group'].drop_duplicates()) 
    # Empty lists are created to store the model output. 
    prediction_data = [] # becomes populated by predictions generated in inner loop.
    edof = [] # effective degrees of freedom.

    # Now the inner loop is created. This loops through all consecutive survey groups. 
    for cs_group in all_groups: 
        # gam_1_accept dataset is filtered by 'consecutive_survey_group' in each loop
        cs_dataset = (
            gam_1_accept[
            gam_1_accept['consecutive_survey_group']==cs_group
            ]
        )
        model_all = (
            LinearGAM(s(0, n_splines=splines), lam=lam)
            .fit(cs_dataset[['year']], # x-axis years specific to 'cs_dataset'
                 cs_dataset['log_site_index']) # y-axis indices specific to cs_dataset.
        )
        # Defining the baseline year in a 1x1 df. This is needed because .predict()
        # only accepts 2D input. 
        explanatory_data = pd.DataFrame([1993], columns=['explanatory_data'])
        # using the model to generate a single baseline prediction for each cs_group
        prediction_data.append(
            model_all.predict(explanatory_data)[0] # [0] extracts value from 1D array
        )
        # the edof for each model is extracted
        edof.append(model_all.statistics_['edof'])

    # At the end of every parameter loop, populated lists are aggregated to form a 
    # single df.
    prediction_data = pd.DataFrame({'consecutive_survey_group':all_groups,
                                    'prediction_data':prediction_data, 
                                    'edof':edof})
    # Saving each df of parameter-specific predictions in a dictionary. 
    # The key of each df is the parameters used in the model. 
    parameter_data[(lam, splines)] = prediction_data

#### Cleaning Prediction Data

In [6]:
# To compare baseline prediction data with the actual data, site/species combinations 
# will need to be referenced. For this reason site and species code columns are added. 
for key, value in parameter_data.items():
    parameter_data[key] = (
        value.merge(
            gam_1_accept[[
                'species_code',
                'site_code',
                'consecutive_survey_group']]
            # Each consecutive survey group is formed of multiple years. Only unique 
            # consecutive survey group numbers are required.
            .drop_duplicates(), 
            on='consecutive_survey_group', 
            how='inner')
    )

In [7]:
# Converting lambda values to their base-10 exponents for improved readability.
parameter_data_cleaned = {}
# Loops through all keys and values in 'parameter_data' dictionary.
for (lam, splines), value in parameter_data.items(): 
    # np.log10() is used for the transformation. Data is stored in new dictionary. 
    parameter_data_cleaned[(round(np.log10(lam),1),splines)] = value

In [8]:
# Exporting dictionary
import pickle
with open(project_root/'Data'/'UKBMS'/'gam_optimisation'/'obs_18_19_20'/'obs_18_19_20_parameters.pkl','wb') as file:
    pickle.dump(parameter_data_cleaned, file)
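
A minimal sketch of how the exported dictionary could be read back and queried in the evaluation step. It uses a temporary file and a small stand-in dictionary with plain lists in place of the DataFrames; the file name and values are hypothetical. The keys follow the same `(log10(lambda), n_splines)` layout as `parameter_data_cleaned`.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for parameter_data_cleaned: keys are (log10(lambda), n_splines).
example = {(3.0, 4): [0.1, 0.2], (-2.4, 10): [0.3, 0.4]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'example_parameters.pkl'  # hypothetical file name
    # Write the dictionary in binary mode, as in the export cell.
    with open(path, 'wb') as file:
        pickle.dump(example, file)
    # Read it back in binary mode.
    with open(path, 'rb') as file:
        loaded = pickle.load(file)

# Predictions for one hyperparameter combination are retrieved by tuple key.
lookup = loaded[(-2.4, 10)]
```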