# Block cross validation

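After covering random cross-validation, we now turn to a more advanced topic: cross-validation for structured data, i.e. data with temporal, spatial, hierarchical or phylogenetic structure. With such data, randomly assigned folds can place closely related records in both the training and the test set, so the estimated prediction error tends to be too optimistic. Block cross-validation keeps related records together in the same fold.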
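
As a minimal sketch of the difference, the code below counts how many years appear in both the training and the test part of each fold, first with random folds and then with folds blocked by year. The toy years, the helper `shared_groups` and the use of scikit-learn's `GroupKFold` are illustrative assumptions, not part of the fish analysis.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Toy data: 6 hypothetical years with 4 records each (not the fish dataset)
years = np.repeat(np.arange(2000, 2006), 4)
X_toy = np.arange(len(years)).reshape(-1, 1)

def shared_groups(splits, groups):
    """For each fold, count the groups present in both train and test."""
    return [len(set(groups[tr]) & set(groups[te])) for tr, te in splits]

random_cv = KFold(n_splits=3, shuffle=True, random_state=0)
block_cv = GroupKFold(n_splits=3)

random_counts = shared_groups(random_cv.split(X_toy), years)
block_counts = shared_groups(block_cv.split(X_toy, groups=years), years)

print("random folds, years shared by train and test:", random_counts)
print("blocked folds, years shared by train and test:", block_counts)
```

With blocked folds no year is split between training and test data, whereas random folds usually scatter the records of a year across both.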

We use the same fish catch dataset as before.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Read data

Data from [Spatiotemporally explicit model averaging for forecasting of Alaskan groundfish catch](https://onlinelibrary.wiley.com/doi/10.1002/ece3.4488) (data repo [here](https://zenodo.org/record/4987796#.ZHcLL9JBxhE)).

The data describe the catch of multiple fish species over time in different regions of Alaska.

In [None]:
url = "https://zenodo.org/records/4987796/files/stema_data.csv"
fish = pd.read_csv(url)
fish.head()

### Data preprocessing

-   `V1` is the record ID (pandas reads this unnamed first column as `Unnamed: 0`)
-   `Station` indicates the fishing station
-   `Latitude` and `Longitude` are the geographic coordinates

We will not consider these variables in the predictive model:

In [None]:
fish = fish.drop(['Unnamed: 0', 'Latitude', 'Longitude', 'Station'], axis=1)

To accommodate variation in SST among stations, each CPUE value has been replicated multiple times. This would defeat our purpose of analysing the data by group (fish species) over space and time: with a single distinct value per group there is no within-group variation, which makes a statistical analysis hard to perform. Therefore, we add to the original CPUE values some random noise with standard deviation equal to 10% of the group average (by species, area and year):

In [None]:
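## group mean of CPUE and noise standard deviation (10% of the group mean)
fish['avg'] = fish.groupby(['Species', 'Area', 'Year'])['CPUE'].transform('mean')
fish['std'] = 0.1 * fish['avg']

In [None]:
## add Gaussian noise; no seed is set, so values differ between runs
fish['noise'] = np.random.normal(loc=0, scale=fish['std'])
fish['CPUE'] = fish['CPUE'] + fish['noise']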

In [None]:
fish = fish.drop(['avg', 'std', 'noise'], axis=1)

In [None]:
## sanity check!
fish.head()
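
The effect of the perturbation can be checked on a small example. The sketch below applies the same recipe to a toy table; the species names, CPUE values and seed are hypothetical, and a seeded `np.random.default_rng` generator is used here (an assumption, the cells above use the unseeded `np.random.normal`) so that the result is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # the seed value is an arbitrary choice

# Toy frame with replicated CPUE values per group (hypothetical numbers)
toy = pd.DataFrame({
    "Species": ["cod"] * 4 + ["pollock"] * 4,
    "Area": ["A"] * 8,
    "Year": [2000] * 8,
    "CPUE": [10.0] * 4 + [200.0] * 4,
})

# Noise standard deviation = 10% of the group mean, as in the cells above
group_mean = toy.groupby(["Species", "Area", "Year"])["CPUE"].transform("mean")
noise_sd = (0.1 * group_mean).to_numpy()
toy["CPUE_noisy"] = toy["CPUE"] + rng.normal(loc=0.0, scale=noise_sd)

# Within-group spread: zero before the perturbation, positive after it
print(toy.groupby("Species")[["CPUE", "CPUE_noisy"]].std())
```

Because the noise scale follows the group mean, groups with larger catches receive proportionally larger perturbations.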

The arrays for the linear model are prepared further down, once the data split has been defined.

## Block validation strategies

We first block by time (longitudinal data), using the variable `Year`:

### 1. Define the data split

We split the data by `Year`. The data are balanced, with 292 records per year, so the last 4 years (2009 onwards) represent 17.39% of the data (4 of 23 years).
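
The share of the test block can be verified with a few lines of arithmetic. The sketch below uses a toy `Year` column; the range 1990 to 2012 is an assumption, inferred from the 17.39% figure and the 2009 cutoff rather than read from the data.

```python
import numpy as np
import pandas as pd

# Toy stand-in: 23 hypothetical years (1990-2012) with 292 records each
years = pd.Series(np.repeat(np.arange(1990, 2013), 292), name="Year")

counts = years.value_counts().sort_index()
n_test_years = 4

# Fraction of all records that fall in the last `n_test_years` years
test_share = counts.iloc[-n_test_years:].sum() / counts.sum()
first_test_year = counts.index[-n_test_years]

print(f"first test year: {first_test_year}, test share: {test_share:.2%}")
```

With balanced years the share depends only on the number of years, not on the number of records per year.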

In [None]:
fish['Year'].value_counts()

In [None]:
train_set = fish.loc[fish['Year'] < 2009]
test_set = fish.loc[fish['Year'] >= 2009]
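
The single split above can be extended to several temporal blocks. The sketch below is one possible forward-chaining scheme, not the notebook's method: `year_block_splits` is a hypothetical helper, and the block length and number of folds are arbitrary choices. Each fold tests on a block of consecutive years and trains on all earlier years only.

```python
import numpy as np

def year_block_splits(years, n_test_years=4, n_folds=3):
    """Yield (train_mask, test_mask) pairs for forward-chaining year blocks.

    Hypothetical helper: each fold tests on `n_test_years` consecutive years
    and trains on every year before that block.
    """
    years = np.asarray(years)
    unique_years = np.sort(np.unique(years))
    for fold in range(n_folds, 0, -1):
        stop = len(unique_years) - (fold - 1) * n_test_years
        start = stop - n_test_years
        if start <= 0:
            continue  # no earlier years left to train on
        test_years = unique_years[start:stop]
        train_mask = years < test_years[0]
        test_mask = np.isin(years, test_years)
        yield train_mask, test_mask

# Toy year column: 2 records for each year from 1990 to 2012 (assumed range)
toy_years = np.repeat(np.arange(1990, 2013), 2)
splits = list(year_block_splits(toy_years))

for train_mask, test_mask in splits:
    print("train up to", toy_years[train_mask].max(),
          "| test", toy_years[test_mask].min(), "-", toy_years[test_mask].max())
```

On this toy column the last fold reproduces the split used above (training before 2009, testing from 2009 onwards), while the earlier folds give additional, independent estimates of the forecasting error.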

In [None]:
## response vector and numeric predictors (built from the full table)
y = np.array(fish['CPUE'])
X = np.array(fish[['Year', 'SST_cvW', 'SST_cvW5', 'SST_cvW4', 'SST_cvW3', 'SST_cvW2', 'SST_cvW1']])

#### One-hot encoding of categorical variables

In [None]:
from sklearn.preprocessing import OneHotEncoder

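## string-valued columns are treated as categorical
categorical_columns = fish.select_dtypes(include=['object']).columns.tolist()
## drop the first level of each variable to avoid collinear dummy columns
ohe = OneHotEncoder(drop='first')
X_ohe = ohe.fit_transform(fish[categorical_columns]).toarray()
X_ohe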

In [None]:
X = np.concatenate((X, X_ohe), axis=1)
X.shape
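
Note that `X` and `y` above are built from the full table, while `train_set` and `test_set` are separate data frames. One way to keep the design matrices aligned with the temporal split is sketched below on a toy table. The column `SST`, the species names, the simulated CPUE and the helper `design` are hypothetical, and fitting the encoder on the training block only is a design choice made here, not something taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Toy stand-in for the fish table (hypothetical values)
n = 200
toy = pd.DataFrame({
    "Year": rng.integers(2000, 2013, size=n),
    "SST": rng.normal(size=n),
    "Species": rng.choice(["cod", "pollock", "halibut"], size=n),
})
toy["CPUE"] = 2.0 * toy["SST"] + (toy["Species"] == "cod") + rng.normal(scale=0.1, size=n)

# Temporal block split, same cutoff as in the text
train = toy[toy["Year"] < 2009]
test = toy[toy["Year"] >= 2009]

def design(frame, encoder):
    """Stack numeric predictors and one-hot encoded species side by side."""
    numeric = frame[["Year", "SST"]].to_numpy()
    dummies = encoder.transform(frame[["Species"]]).toarray()
    return np.concatenate((numeric, dummies), axis=1)

# Fit the encoder on the training block only, then reuse it on the test block
ohe = OneHotEncoder(drop="first").fit(train[["Species"]])
X_train, X_test = design(train, ohe), design(test, ohe)
y_train, y_test = train["CPUE"].to_numpy(), test["CPUE"].to_numpy()

model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE on the held-out years: {rmse:.3f}")
```

With the default `handle_unknown='error'`, the encoder raises an error if the test block contains a category that was absent from the training block, which makes such mismatches visible early.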