# Cross validation

This notebook introduces resampling and cross-validation for predictive models in Python, using a random cross-validation approach.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Read the data

Data from [Spatiotemporally explicit model averaging for forecasting of Alaskan groundfish catch](https://onlinelibrary.wiley.com/doi/10.1002/ece3.4488) (data repo [here](https://zenodo.org/record/4987796#.ZHcLL9JBxhE))

The data describe fish catch (multiple fish species) over time in different regions of Alaska.

In [None]:
url = "https://zenodo.org/records/4987796/files/stema_data.csv"
fish = pd.read_csv(url)
fish.head()

-   **CPUE**: target variable, "catch per unit effort"
-   **SST**: sea surface temperature
-   **CV**: the coefficient of variation of SST is used rather than the mean $\rightarrow$ it is an improved measure of seasonal SST because it standardizes scale and lets us consider changes in the variation of SST together with changes in its mean (Hannah Correia, 2018 - Ecology and Evolution)
-   **SST_cvW1-5**: CPUE is influenced by survival in the first year of life. Water temperature affects survival, and juvenile fish are more susceptible to environmental changes than adults. Therefore, CPUE for a given year is likely linked to the winter SST at the juvenile stage. Since this survey targets waters during the summer and the four species covered reach maturity at 5--8 years, SST was lagged by one to five years to capture the effect of SST on the juvenile stages. All five lagged SST measures were included for modeling.

### Data preprocessing

-   `Unnamed: 0` is the record ID (the unnamed index column of the CSV file)
-   `Station` indicates the fishing station

We will not consider these variables in the predictive model, so we remove them below, together with `Latitude` and `Longitude`.

To accommodate variation in SST among stations, each CPUE value has been replicated multiple times. This would defeat our purpose of analysing the data by group (fish species) over space and time: with only one distinct value per group there is no variation, which makes a statistical analysis hard to perform. Therefore, we add to the original CPUE values some random noise whose standard deviation is 10% of the group average (by species, area, year):

In [None]:
fish = fish.drop(['Unnamed: 0', 'Latitude', 'Longitude', 'Station'], axis=1)

In [None]:
## group mean of CPUE and noise standard deviation (10% of the group mean)
fish['avg'] = fish.groupby(['Species', 'Area', 'Year'])['CPUE'].transform('mean')
fish['std'] = 0.1 * fish['avg']

In [None]:
fish['noise'] = np.random.normal(loc=0, scale=fish['std'])
fish['CPUE'] = fish['CPUE'] + fish['noise']

In [None]:
fish = fish.drop(['avg', 'std', 'noise'], axis=1)
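
The effect of the noise step can be checked on a small synthetic table. The sketch below uses made-up values in a hypothetical `toy` data frame (not the groundfish data) and a seeded random generator so that the noise is reproducible; the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# toy data: CPUE replicated within each (Species, Area, Year) group (made-up values)
toy = pd.DataFrame({
    "Species": ["cod"] * 4 + ["pollock"] * 4,
    "Area": ["A"] * 8,
    "Year": [2000] * 8,
    "CPUE": [10.0] * 4 + [50.0] * 4,
})

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# noise standard deviation = 10% of the group mean
group_mean = toy.groupby(["Species", "Area", "Year"])["CPUE"].transform("mean")
toy["CPUE_noisy"] = toy["CPUE"] + rng.normal(loc=0, scale=0.1 * group_mean)

# before: no within-group variation; after: spread that scales with the group mean
print(toy.groupby("Species")[["CPUE", "CPUE_noisy"]].std())
```

Using a seeded generator is a design choice for teaching purposes: without it, every run of the notebook produces slightly different CPUE values and therefore slightly different model results.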

In [None]:
fish.head()

In [None]:
y = np.array(fish['CPUE'])
X = np.array(fish[['Year','SST_cvW', 'SST_cvW5', 'SST_cvW4','SST_cvW3','SST_cvW2','SST_cvW1']])

#### One-hot encoding of categorical variables

In [None]:
categorical_columns = fish.select_dtypes(include=['object']).columns.tolist()
categorical_columns

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first')

In [None]:
## encode the categorical columns and convert the sparse result to a dense array
X_ohe = ohe.fit_transform(fish[categorical_columns]).toarray()

In [None]:
X_ohe

In [None]:
X_ohe.shape

In [None]:
X = np.concatenate((X, X_ohe), axis=1)
X.shape
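
To see what `drop='first'` does, here is a minimal sketch on a made-up column with three levels (the `toy` data frame and its values are illustrative, not part of the fish data). With three levels, only two indicator columns are produced, and the first level in sorted order acts as the baseline.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy categorical column with three levels (made-up values)
toy = pd.DataFrame({"Area": ["north", "south", "west", "south"]})

enc = OneHotEncoder(drop="first")
encoded = enc.fit_transform(toy[["Area"]]).toarray()

print(enc.categories_)  # levels found by the encoder, in sorted order
print(encoded)          # two columns: "north" is the dropped baseline level
```

Dropping one level avoids perfectly collinear indicator columns, which matters for an unregularised linear model with an intercept.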

### Cross validation: training/test split

We start with a simple random split: 90% of the data for training, 10% for testing

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1)

In [None]:
X_train.shape

In [None]:
X_test.shape
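
Because the split is random, the training and test sets change at every run. If a reproducible split is wanted, `train_test_split` accepts a `random_state` argument. The sketch below illustrates this on made-up arrays (`X_toy`, `y_toy` are hypothetical, and the seed value is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(200).reshape(100, 2)  # 100 made-up samples, 2 features
y_toy = np.arange(100)

# the same random_state gives the same split on every run
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.1, random_state=0)
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.1, random_state=0)

print(X_tr.shape, X_te.shape)       # sizes of the training and test sets
print(np.array_equal(X_te, X_te2))  # identical test sets across the two calls
```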

#### Data normalization

We normalise the data using **standardization**: we want our numerical features to have zero mean and unit variance.

First, we subset the training data by taking only the numerical features (first 7 columns). Note that the scaling parameters (mean and standard deviation) are estimated on the **training data only**: this is important, since in real applications the test data are not yet available.

In [None]:
ncols = X_train.shape[1]
X_temp = X_train[:,0:7] ## the last index in the range is not included

In [None]:
X_train.mean(axis=0)

In [None]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_temp)

In [None]:
X_train_scaled.mean(axis=0) ## all features have now zero mean

In [None]:
X_train_scaled.std(axis=0)

In [None]:
X_train_scaled = np.concatenate((X_train_scaled, X_train[:,7:ncols]), axis=1)
X_train_scaled.shape

In [None]:
X_temp = X_test[:,0:7]
X_test_scaled = scaler.transform(X_temp)
X_test_scaled = np.concatenate((X_test_scaled, X_test[:,7:ncols]), axis=1)
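
The manual subset, scale and concatenate steps above can also be expressed with scikit-learn's `ColumnTransformer`. This is an alternative sketch on made-up data, not the approach used in the rest of this notebook; the array names and sizes are assumptions chosen to mirror the layout here (7 numerical columns followed by one-hot columns).

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # arbitrary seed

# made-up data: 7 numerical columns followed by 3 binary (one-hot-like) columns
num_part = rng.normal(loc=5.0, scale=2.0, size=(50, 7))
ohe_part = rng.integers(0, 2, size=(50, 3)).astype(float)
X_toy = np.concatenate((num_part, ohe_part), axis=1)
X_tr, X_te = X_toy[:40], X_toy[40:]

# scale only the first 7 columns, pass the remaining columns through unchanged
ct = ColumnTransformer(
    [("num", StandardScaler(), list(range(7)))],
    remainder="passthrough",
)
X_tr_scaled = ct.fit_transform(X_tr)  # statistics estimated on the training rows only
X_te_scaled = ct.transform(X_te)      # test rows reuse the training statistics

print(X_tr_scaled[:, :7].mean(axis=0).round(6))  # approximately zero
```

The scaled columns come first in the output and the passed-through columns last, which here matches the original column order.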

### Linear regression model

In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train_scaled, y_train)

In [None]:
reg.coef_

In [None]:
print(reg.score(X_train_scaled, y_train)) ## R^2 on the training data

In [None]:
y_hat = reg.predict(X_test_scaled)

plt.scatter(y_hat, y_test, alpha=0.5)
plt.xlabel("predicted CPUE")
plt.ylabel("observed CPUE")
plt.show()

In [None]:
np.corrcoef(y_test,reg.predict(X_test_scaled))
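
Correlation measures how well predictions and observations line up, but not how large the errors are. A minimal sketch of complementary test-set metrics (RMSE and $R^2$) is shown below on simulated data; the data, coefficients and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(1)  # arbitrary seed

# simulated regression problem (made-up coefficients and noise level)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_tr, X_te, y_tr, y_te = X_toy[:160], X_toy[160:], y_toy[:160], y_toy[160:]

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

rmse = np.sqrt(mean_squared_error(y_te, pred))  # error in the units of the target
r2 = r2_score(y_te, pred)                       # proportion of variance explained
corr = np.corrcoef(y_te, pred)[0, 1]            # Pearson correlation

print("RMSE:", round(rmse, 3), "R2:", round(r2, 3), "corr:", round(corr, 3))
```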

## k-fold cross validation

We now implement k-fold cross-validation to measure the performance of a statistical (machine learning) model.

The general scheme is depicted in the image below (from scikit-learn.org):

!['k-fold-cv'](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
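
To make the scheme concrete, the sketch below splits ten made-up sample indices into five folds and prints which indices are used for training and for validation in each fold. Without shuffling, `KFold` assigns consecutive blocks of rows to each validation fold; passing `shuffle=True` randomises the assignment.

```python
import numpy as np
from sklearn.model_selection import KFold

idx = np.arange(10)          # ten made-up sample indices
kf_toy = KFold(n_splits=5)   # no shuffling: folds are consecutive blocks

folds = [(train.tolist(), val.tolist()) for train, val in kf_toy.split(idx)]
for i, (train, val) in enumerate(folds):
    print(f"Fold {i}: train={train} validation={val}")
```

Each sample appears in exactly one validation fold, so every observation is used once for evaluation and four times for training.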

We first define a function that runs the modelling steps on one training/validation split:

- normalise the data
- fit the linear model
- evaluate the performance of the model

(For simplicity, we also normalise the one-hot encoded categorical variables)

In [None]:
def train_test_model(feat_train, targ_train, feat_val, targ_val):

  ## data normalization
  scaler = preprocessing.StandardScaler()
  feat_train_scaled = scaler.fit_transform(feat_train)
  feat_val_scaled = scaler.transform(feat_val)

  ## fit linear regression model
  modfit = LinearRegression().fit(feat_train_scaled, targ_train)

  ## model evaluation
  y_hat = modfit.predict(feat_val_scaled)
  pears_corr = np.corrcoef(targ_val, y_hat)[0,1]
  mse = np.mean((targ_val - y_hat)**2)
  rmse = np.sqrt(mse)

  return pears_corr, rmse

In [None]:
from sklearn.model_selection import KFold

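# KFold split: shuffle the rows so that folds are random rather than consecutive blocks
nsplits = 10
kf = KFold(n_splits=nsplits, shuffle=True, random_state=42)  # arbitrary seed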
res = []

for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}:")

    ## train and validation sets
    val_X = X[test_index,:]
    val_y = y[test_index]

    train_X = X[train_index,:]
    train_y = y[train_index]

    print("size of train set:", len(train_y))
    print("size of validation set:", len(val_y))

    temp = train_test_model(train_X, train_y, val_X, val_y)
    print(temp)

    res.append(temp)


The results have been stored in a list of tuples (Pearson correlation coefficient, RMSE):

In [None]:
res

We can get the average correlation between predictions and observations across the validation folds, and likewise the average RMSE:

In [None]:
avg_corr = np.mean([x[0] for x in res])
print("Average correlation between predictions and observations is", round(avg_corr, 3))

In [None]:
avg_rmse = np.mean([x[1] for x in res])
print("Average RMSE of model predictions is", round(avg_rmse, 3))
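
The same workflow (scale on the training fold, fit, evaluate on the validation fold) can be written more compactly with a scikit-learn `Pipeline` and `cross_val_score`. The sketch below is an alternative on simulated data, not a replacement for the explicit loop above; the data, the seeds and the choice of `shuffle=True` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)  # arbitrary seed

# simulated regression problem (made-up coefficients and noise level)
X_toy = rng.normal(size=(300, 4))
y_toy = X_toy @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=300)

# the pipeline refits the scaler inside each fold, on the training part only
pipe = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# scikit-learn maximises scores, so RMSE is returned with a negative sign
neg_rmse = cross_val_score(pipe, X_toy, y_toy, cv=cv,
                           scoring="neg_root_mean_squared_error")
rmse_per_fold = -neg_rmse

print("RMSE per fold:", rmse_per_fold.round(3))
print("Average RMSE:", round(rmse_per_fold.mean(), 3))
```

Wrapping the scaler and the model in one pipeline guards against accidentally estimating the scaling parameters on validation data.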