# Block cross validation

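Having covered random cross-validation, we now turn to a more advanced topic: cross-validation for data with temporal, spatial, hierarchical or phylogenetic structure (structured data). In such data, observations from the same block (year, area, group) are not independent, so a purely random split can leak information from the training set into the test set.

We use the same fish catch dataset as before.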

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Read data

Data from [Spatiotemporally explicit model averaging for forecasting of Alaskan groundfish catch](https://onlinelibrary.wiley.com/doi/10.1002/ece3.4488) (data repo [here](https://zenodo.org/record/4987796#.ZHcLL9JBxhE)).

The dataset records the catch of multiple fish species over time in different regions of Alaska.

In [None]:
url = "https://zenodo.org/records/4987796/files/stema_data.csv"
fish = pd.read_csv(url)
fish.head()

### Data preprocessing

-   `Unnamed: 0` is the record ID
-   `Station` indicates the fishing station
-   `Latitude` and `Longitude` are geographic coordinates

We will not consider these variables in the predictive model.

To accommodate variation in SST among stations, each CPUE value has been replicated multiple times. This would defeat our purpose of analysing the data by group (fish species) over space and time: with a single distinct value per group there is no within-group variation, which makes a statistical analysis difficult. We therefore drop the unused columns and then add random noise to the original CPUE values, with a standard deviation equal to 10% of the group average (by species, area and year):

In [None]:
fish = fish.drop(['Unnamed: 0', 'Latitude', 'Longitude', 'Station'], axis=1)

In [None]:
## mutate variable
fish['avg'] = fish.groupby(['Species', 'Area', 'Year'])['CPUE'].transform('mean')
fish['std'] = 0.1 * fish['avg']

In [None]:
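## seeded generator so that the added noise is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(42)
fish['noise'] = rng.normal(loc=0, scale=fish['std'])
fish['CPUE'] = fish['CPUE'] + fish['noise']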

In [None]:
fish = fish.drop(['avg', 'std', 'noise'], axis=1)

In [None]:
## sanity check!
fish.head()

## Block validation strategies

We first block by time (longitudinal data), using the variable `Year`:

### 1. Define the data split

We split the data by `Year`. The data are balanced, with 292 records per year, so the last 4 years (2009 onwards) represent 17.39% of the data.

In [None]:
fish['Year'].value_counts()
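
The size of the temporal split can be checked with simple arithmetic. The sketch below assumes 23 years of data in total: this number is not stated in the text, but it is implied by the balanced design and the 17.39% share (4 / 23).

```python
records_per_year = 292   # stated in the text
n_test_years = 4         # the last 4 years are held out
n_years = 23             # assumption: implied by 4 / 23 = 17.39%

n_total = records_per_year * n_years
n_test = records_per_year * n_test_years
test_share = 100 * n_test / n_total

print("total records:", n_total)
print("test records:", n_test)
print("test share (%):", round(test_share, 2))
```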

In [None]:
train_set = fish.loc[fish['Year'] < 2009]
test_set = fish.loc[fish['Year'] >= 2009]

A little sanity check on the data:

In [None]:
test_set['Year'].value_counts()

In [None]:
train_set['Year'].value_counts()

We prepare the arrays for the linear model. `Year` is not used as a predictor, and it cannot be: the test years never appear in the training set.

In [None]:
y_train = np.array(train_set['CPUE'])
X_train = np.array(train_set[['SST_cvW', 'SST_cvW5', 'SST_cvW4','SST_cvW3','SST_cvW2','SST_cvW1']])

### 2. Data preprocessing

#### One-hot encoding of categorical variables

We now one-hot-encode categorical variables in the training set:

In [None]:
from sklearn.preprocessing import OneHotEncoder

categorical_columns = train_set.select_dtypes(include=['object']).columns.tolist()
categorical_columns = [x for x in categorical_columns if x != 'Year']
ohe = OneHotEncoder(drop='first')
X_train_ohe = ohe.fit_transform(train_set[categorical_columns]).toarray()
X_train_ohe

In [None]:
X_train = np.concatenate((X_train, X_train_ohe), axis=1)
X_train.shape
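
The encoder learns its categories from the data it is fitted on, so it should be fitted once on the training set and then only applied (`transform`) to new data. A minimal sketch on made-up species labels (not the fish data) shows what can go wrong if the encoder is refitted on a test set that contains fewer categories:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames with hypothetical values, used only for illustration
train = pd.DataFrame({"Species": ["cod", "pollock", "sablefish", "cod"]})
test = pd.DataFrame({"Species": ["sablefish", "cod"]})

ohe_toy = OneHotEncoder(drop="first")
train_ohe = ohe_toy.fit_transform(train).toarray()  # categories learnt on the training data
test_ohe = ohe_toy.transform(test).toarray()        # same categories reused on the test data

# Refitting on the test data yields a different (narrower) encoding
refit_ohe = OneHotEncoder(drop="first").fit_transform(test).toarray()

print("categories:", ohe_toy.categories_[0])
print("train:", train_ohe.shape, "test (transform):", test_ohe.shape)
print("test (refit):", refit_ohe.shape)
```

With `transform`, the test matrix has the same columns, in the same order, as the training matrix, which is what the fitted regression coefficients expect.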

#### Data normalization

We normalise the data using **standardization**: we want our numerical features to have zero mean and unit variance.

First, we subset the training data, taking only the numerical features (the first 6 columns). Note that we use the **training data for normalization**: this is important because in real applications the test data are not yet available when the model is built.

In [None]:
from sklearn import preprocessing

ncols = X_train.shape[1]
X_temp = X_train[:,0:6] ## the last index in the range is not included

scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_temp)

X_train_scaled = np.concatenate((X_train_scaled, X_train[:,6:ncols]), axis=1)
X_train_scaled.shape
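
Standardization computes z = (x - mean) / std for each feature. To keep test data unseen, the mean and standard deviation should come from the training data only. The following sketch, on toy numbers rather than the fish data, checks that `StandardScaler.transform` does exactly this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature arrays (hypothetical values)
X_tr = np.array([[1.0], [2.0], [3.0]])
X_te = np.array([[4.0], [5.0]])

toy_scaler = StandardScaler().fit(X_tr)     # mean and std estimated on training data
X_te_scaled = toy_scaler.transform(X_te)    # applied, not re-estimated, on test data

# Manual equivalent: StandardScaler uses the population std (ddof=0)
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(X_te_scaled.ravel())
print(np.allclose(X_te_scaled, manual))
```

Note that the scaled test values do not have zero mean or unit variance themselves: they are expressed on the scale of the training data.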

And now the **test set**:

In [None]:
## 1) arrays
y_test = np.array(test_set['CPUE'])
X_test = np.array(test_set[['SST_cvW', 'SST_cvW5', 'SST_cvW4','SST_cvW3','SST_cvW2','SST_cvW1']])

In [None]:
## 2) OHE: reuse the encoder fitted on the training set (transform only, no refit)
X_test_ohe = ohe.transform(test_set[categorical_columns]).toarray()
X_test_ohe

In [None]:
X_test = np.concatenate((X_test, X_test_ohe), axis=1)
X_test.shape

In [None]:
## 3) normalization: reuse the scaler fitted on the training set (transform only, no refit)
X_temp = X_test[:,0:6] ## the last index in the range is not included

X_test_scaled = scaler.transform(X_temp)

X_test_scaled = np.concatenate((X_test_scaled, X_test[:,6:ncols]), axis=1)
X_test_scaled.shape
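
The manual steps above (arrays, encoding, scaling) can also be bundled into a single scikit-learn pipeline, which fits all preprocessing on the training data and reapplies it unchanged to the test data. This is an alternative sketch, not the approach used in this notebook; it runs on synthetic data that only borrows the column names `Year`, `SST_cvW`, `Species` and `CPUE`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the fish data (all values are made up)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Year": rng.integers(2000, 2010, size=n),   # years 2000..2009
    "SST_cvW": rng.normal(size=n),
    "Species": rng.choice(["cod", "pollock", "sablefish"], size=n),
})
is_cod = (toy["Species"] == "cod").astype(float)
toy["CPUE"] = 2.0 * toy["SST_cvW"] + is_cod + rng.normal(scale=0.1, size=n)

# Block by time: the last two years form the test set
toy_train = toy[toy["Year"] < 2008]
toy_test = toy[toy["Year"] >= 2008]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["SST_cvW"]),
    ("cat", OneHotEncoder(drop="first"), ["Species"]),
])
model = Pipeline([("preprocess", preprocess), ("reg", LinearRegression())])

features = ["SST_cvW", "Species"]
model.fit(toy_train[features], toy_train["CPUE"])           # fit scaler, encoder and model
r2_test = model.score(toy_test[features], toy_test["CPUE"])  # transform only, then predict
print("R^2 on held-out years:", round(r2_test, 3))
```

Because the scaler and the encoder live inside the pipeline, they cannot be refitted on the test data by mistake.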

### 3. Define and fit the model


In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train_scaled, y_train)

In [None]:
print(reg.score(X_train_scaled,y_train))

### 4. Get model predictions and evaluate the model

We evaluate the model on the test data (new data: only the last 4 years):

In [None]:
y_hat = reg.predict(X_test_scaled)
dd = np.array([y_test, y_hat])
dd = dd.T

df = pd.DataFrame(dd, columns = ['y_test','y_hat'])

In [None]:
df['Species'] = np.array(test_set['Species'])
df

In [None]:
from plotnine import ggplot, aes, geom_point


plot = (
    ggplot(df, aes(x='y_hat', y='y_test')) +
    geom_point(aes(color='Species'))
)

plot.draw()

In [None]:
df[['y_hat','y_test']].corr()

In [None]:
mse = ((df['y_hat'] - df['y_test'])**2).sum()/(len(df))
rmse = np.sqrt(mse)
avg = df['y_test'].mean()
nrmse = 100*(rmse/avg)

In [None]:
print("The RMSE is ", round(rmse, 3))
print("The NRMSE (normalised RMSE) is", round(nrmse, 2), "%")
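
The two error metrics used above are RMSE = sqrt(mean((y_hat - y_test)^2)) and NRMSE = 100 * RMSE / mean(y_test). Since they are needed again later, they can be wrapped in a small helper. The function name `rmse_nrmse` is hypothetical (it is not part of any library), and the numbers are a toy example:

```python
import numpy as np

def rmse_nrmse(y_true, y_pred):
    """Return (RMSE, NRMSE in % of the mean observed value)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse_value = np.sqrt(np.mean((y_pred - y_true) ** 2))
    nrmse_value = 100 * rmse_value / y_true.mean()
    return rmse_value, nrmse_value

# Toy check: errors are (1, 0, -1), so MSE = 2/3
toy_rmse, toy_nrmse = rmse_nrmse([2.0, 4.0, 6.0], [3.0, 4.0, 5.0])
print(round(toy_rmse, 4), round(toy_nrmse, 2))
```

Normalising by the mean only makes sense when the observed values are positive and not close to zero on average.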

#### Predictions by year?

In [None]:
years = test_set['Year'].unique()
df['Year'] = np.array(test_set['Year'])

In [None]:
df

In [None]:
res = []
for y in years:
  print("processing year",y)

  temp = df.loc[df['Year'] == y]
  lincorr = np.array(temp[['y_hat','y_test']].corr())[0,1]
  mse = ((temp['y_hat'] - temp['y_test'])**2).sum()/(len(temp))
  rmse = np.sqrt(mse)
  avg = temp['y_test'].mean()
  nrmse = 100*(rmse/avg)

  #print("correlation:", round(lincorr,3))
  #print("RMSE:", round(rmse,3))
  #print("NRMSE:", round(nrmse, 2), "%")

  temp = {'year':y,'corr':lincorr, 'rmse':rmse, 'nrmse':nrmse}
  res.append(temp)

In [None]:
res_by_year = pd.DataFrame(res)
res_by_year

#### EXERCISE: let's look at predictions by fish species

In [None]:
## TASK 1: create an array or list of fish species

In [None]:
## TASK 2: add a column with fish species to the dataframe with predictions and observations on the test data

In [None]:
## TASK 3: calculate average prediction metrics by fish species and store the results in a dataframe

In [None]:
## TASK 4: visualize results (table; plot?)

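### Blocking by space

We now block by space, using the variable `Area`: one area (`West Yakutat`) is held out as the test set, and the model is trained on all the other areas.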

In [None]:
## 1) define the split
areas = fish['Area'].unique()
train_set = fish.loc[fish['Area'] != "West Yakutat"]
test_set = fish.loc[fish['Area'] == "West Yakutat"]

In [None]:
## 2) OHE
from sklearn.preprocessing import OneHotEncoder

categorical_columns = train_set.select_dtypes(include=['object']).columns.tolist()
categorical_columns = [x for x in categorical_columns if x != 'Area']
ohe = OneHotEncoder(drop='first')

## training set
y_train = np.array(train_set['CPUE'])
X_train = np.array(train_set[['SST_cvW', 'SST_cvW5', 'SST_cvW4','SST_cvW3','SST_cvW2','SST_cvW1']])
X_train_ohe = ohe.fit_transform(train_set[categorical_columns]).toarray()
X_train = np.concatenate((X_train, X_train_ohe), axis=1)

## test set: reuse the encoder fitted on the training set (transform only, no refit)
y_test = np.array(test_set['CPUE'])
X_test = np.array(test_set[['SST_cvW', 'SST_cvW5', 'SST_cvW4','SST_cvW3','SST_cvW2','SST_cvW1']])
X_test_ohe = ohe.transform(test_set[categorical_columns]).toarray()
X_test = np.concatenate((X_test, X_test_ohe), axis=1)


print("training set size", X_train.shape)
print("test set size", X_test.shape)

In [None]:
## 3) normalization
from sklearn import preprocessing

## training set
ncols = X_train.shape[1]
X_temp = X_train[:,0:6] ## the last index in the range is not included
scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_temp)
X_train_scaled = np.concatenate((X_train_scaled, X_train[:,6:ncols]), axis=1)

## test set: reuse the scaler fitted on the training set (transform only, no refit)
X_temp = X_test[:,0:6] ## the last index in the range is not included
X_test_scaled = scaler.transform(X_temp)
X_test_scaled = np.concatenate((X_test_scaled, X_test[:,6:ncols]), axis=1)

In [None]:
## 4) fit model and get predictions
from sklearn.linear_model import LinearRegression

## fit model
reg = LinearRegression().fit(X_train_scaled, y_train)
print("R^2 is", reg.score(X_train_scaled,y_train))

## predictions
y_hat = reg.predict(X_test_scaled)
dd = np.array([y_test, y_hat])
dd = dd.T
df = pd.DataFrame(dd, columns = ['y_test','y_hat'])
df.shape

In [None]:
## 5) visualize results
from plotnine import ggplot, aes, geom_point


plot = (
    ggplot(df, aes(x='y_hat', y='y_test')) +
    geom_point()
)

plot.draw()

In [None]:
lincorr = np.array(df[['y_hat','y_test']].corr().iloc[0,1])
mse = ((df['y_hat'] - df['y_test'])**2).sum()/(len(df))
rmse = np.sqrt(mse)
avg = df['y_test'].mean()
nrmse = 100*(rmse/avg)

res = pd.DataFrame({'corr':[lincorr], 'rmse':rmse, 'nrmse':nrmse})
res
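
Holding out a single area gives one estimate of predictive performance. A natural extension is to rotate the held-out block over all areas (leave-one-group-out cross-validation). The sketch below uses scikit-learn's `LeaveOneGroupOut` on synthetic data; the area labels, features and coefficients are made up and only illustrate the mechanics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 numerical features, 4 hypothetical areas
rng = np.random.default_rng(1)
n = 300
X_toy = rng.normal(size=(n, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
groups = rng.choice(["area_A", "area_B", "area_C", "area_D"], size=n)

# The scaler sits inside the pipeline, so it is refitted on each training fold only
cv_model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(
    cv_model, X_toy, y_toy,
    groups=groups,
    cv=LeaveOneGroupOut(),
    scoring="neg_root_mean_squared_error",
)
rmse_per_area = -scores  # scikit-learn returns negated errors

# Folds follow the sorted unique group labels
for area, value in zip(np.unique(groups), rmse_per_area):
    print(area, round(value, 3))
```

Each fold trains on three areas and tests on the remaining one, so every area is used once as unseen data.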

---

**Could we block by phylogeny/genetics with this dataset?**

**Is it possible to block along more than one dimension?**