#### Introduction
This notebook uses the optimised model parameters from the GAM stage 1 validation section to 
predict missing baseline indices. 

#### Aim:
To compute baseline predictions using the optimised GAM stage 1 models. 

#### Workflow: 
1) The dataset is filtered by sample size (a separate model is optimised for each of the
sample size bands 6-8, 9-11, 12-14, 15-17 and 18-20).
2) The model input parameters are defined (lambda and spline count).
3) The model is created and predictions generated. 
4) The predictions are stored and exported in csv format. 

In [2]:
# Importing packages
import pandas as pd
import numpy as np
from pygam import LinearGAM, s
import os
from pathlib import Path

# Importing localised file directory
project_root = Path(os.environ['butterfly_project'])

# Importing data
gam_1_accept = pd.read_csv(project_root/'Data'/'UKBMS'/'gam_1'/'gam_1_accept.csv', index_col=0)

#### Log Transforming Site Index Scores using log1p().
log1p(x) computes log(x+1), so zero values can be included (a zero index maps to zero 
rather than being undefined). Indices are transformed because the data are strongly right-skewed. 

In [None]:
gam_1_accept['log_site_index'] = np.log1p(gam_1_accept['site_index'])
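
As a small illustration of why log1p() is used here, the sketch below applies it to a few made-up index values (the numbers are hypothetical, not UKBMS data). It shows that a zero index stays finite and that the result matches log(x+1).

```python
import numpy as np

# Hypothetical site index values, including a zero (illustration only).
site_index = np.array([0.0, 1.0, 9.0, 99.0])

# log1p(x) is equivalent to log(x + 1), so a zero index maps to 0.0
# instead of the -inf that a plain log(0) would give.
log_site_index = np.log1p(site_index)

print(log_site_index)
```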

#### Filtering the Dataset by Sample Size
Survey groups are partitioned by the number of consecutive surveys. This is because 
different sample sizes are likely to require different GAM parameters.

In [None]:
gam_1_accept = (
    gam_1_accept[
    (gam_1_accept['consecutive_surveys']>=12) # enter min 'consecutive_surveys' here
    & (gam_1_accept['consecutive_surveys']<=14) # enter max 'consecutive_surveys' here
    ].reset_index(drop=True)
)
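
The same filter could be wrapped in a small helper so that the band limits are passed as arguments rather than edited in place. This is only a sketch: `filter_by_sample_size` and the toy data are hypothetical names and values, not part of the original notebook. `Series.between` is inclusive at both ends by default, which matches the `>=` / `<=` comparison above.

```python
import pandas as pd

def filter_by_sample_size(df, min_surveys, max_surveys):
    """Keep rows whose 'consecutive_surveys' lies in [min_surveys, max_surveys]."""
    mask = df['consecutive_surveys'].between(min_surveys, max_surveys)
    return df[mask].reset_index(drop=True)

# Hypothetical toy data standing in for gam_1_accept.
toy = pd.DataFrame({
    'consecutive_surveys': [6, 11, 12, 13, 14, 15, 20],
    'site_index': [3, 0, 7, 12, 5, 40, 2],
})

# Select the 12-14 band, as in the cell above.
subset = filter_by_sample_size(toy, 12, 14)
```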

#### Optimal Parameters from Model Optimisation are Defined

In [None]:
lam=10**0.6 # enter optimal lambda value here
splines=4 # enter optimal spline count here

#### Creating a Baseline Model for All Survey Groups with the Optimal Hyperparameters
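A single loop runs through every 'consecutive survey group'. For each group, a GAM is fitted 
to that group's years and log-transformed indices, and one prediction is made for the baseline 
year (1993). 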

In [5]:
# Creating a unique list of all consecutive survey groups.
all_groups = list(gam_1_accept['consecutive_survey_group'].drop_duplicates())
# Empty lists are created to store the model output. 
prediction_data = []
edof = []

# For each group, a subset ('cs_dataset') is obtained by filtering 'gam_1_accept' on the iterator value
for cs_group in all_groups: 
    cs_dataset = (
        gam_1_accept[
        gam_1_accept['consecutive_survey_group']==cs_group
        ]
    )
    model_all = (
        LinearGAM(s(0, n_splines=splines), lam=lam)
        .fit(cs_dataset[['year']], # x-axis years specific to 'cs_dataset'
             cs_dataset['log_site_index']) # y-axis indices specific to cs_dataset.
    )
    # Defining the baseline year in a 1x1 df, so that .predict() receives 2D input
    # (rows x features) in the same layout as the 'year' column used in .fit().
    explanatory_data = pd.DataFrame([1993], columns=['year'])
    # using the model to generate a single baseline prediction for each cs_group
    prediction_data.append(
        model_all.predict(explanatory_data)[0] # [0] extracts value from 1D array
    ) 
    edof.append(model_all.statistics_['edof']) # effective degrees of freedom.

# Populated lists are aggregated to form a single df.
prediction_data = pd.DataFrame({'consecutive_survey_group':all_groups,
                                'prediction_data':prediction_data,
                                'edof':edof})
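
Because the models are fitted to 'log_site_index', the stored predictions are on the log1p scale. If values on the original index scale are needed, one option is to invert the transform with np.expm1(). The sketch below uses a hypothetical frame and a hypothetical column name ('predicted_site_index'); it is not a step taken in the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical log-scale predictions (illustration only).
example_predictions = pd.DataFrame({
    'consecutive_survey_group': [1, 2, 3],
    'prediction_data': [0.0, np.log1p(10.0), np.log1p(250.0)],
})

# expm1(x) = exp(x) - 1 is the inverse of log1p(x).
example_predictions['predicted_site_index'] = np.expm1(
    example_predictions['prediction_data']
)
```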

In [6]:
# Saving predictions to csv
prediction_data.to_csv(project_root/'Data'/'UKBMS'/'gam_1'/'obs_12_13_14'/'obs_12_13_14_predictions.csv')