## Live Notebook 1 - Validation (one more time)

This notebook is intended to be worked on during the online session in individual breakout rooms. The focus is less on generating new code and more on experimenting with the existing layout. Interesting - unexpected as well as expected - results should be recorded by the groups directly in the notebook.
In the last sessions we talked about the train/test method, resampling, k-fold cross-validation and bootstrapping. In this notebook we will run our own experiment(s) to investigate those methods in a bit more detail.

In [None]:
from sklearn.datasets import make_regression, make_multilabel_classification
from sklearn.model_selection import KFold, train_test_split, ShuffleSplit, cross_validate
from mlxtend.evaluate import BootstrapOutOfBag
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('pdf', 'svg')

### Data Creation

First we define the parameters of the generative model. In this case it is a very simple linear model in which we can vary the number of samples `n_samp`, the number of features `n_feat`, the number of informative features `n_inf` (which lets us simulate badly chosen features), the standard deviation of the Gaussian noise `noise_std`, and the `bias`. This is where you should try generating different artificial datasets to see how the data itself influences the output of the different validation strategies.

In [None]:
n_samp = 1000
n_feat = 19
n_inf = 18
noise_std = 5 
bias = 1

In [None]:
X, y, true_coef = make_regression(n_samples=n_samp, n_features=n_feat, random_state=0, noise=noise_std,bias=bias,n_informative=n_inf, coef=True)
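
To see what the generative model returns, the following minimal sketch regenerates the data with the same settings and inspects the shapes and the true coefficients. The `_demo` variable names are only illustrative and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same settings as in the parameter cell above
X_demo, y_demo, coef_demo = make_regression(
    n_samples=1000, n_features=19, n_informative=18,
    noise=5, bias=1, coef=True, random_state=0,
)

print(X_demo.shape, y_demo.shape, coef_demo.shape)

# Uninformative features get a true coefficient of exactly zero,
# so counting non-zero entries recovers n_informative.
n_nonzero = int(np.count_nonzero(coef_demo))
print("non-zero true coefficients:", n_nonzero)
```

Lowering `n_informative` relative to `n_features` increases the number of zero coefficients, which is how badly chosen features are simulated here.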

In [None]:
#randomly split train and test
X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=17)

### Regression Model

To keep the experiments manageable and clear, we use `LinearRegression()` as the estimator. This allows us to focus on performance estimation without having to tune hyper-parameters. You can replace the linear regression with a different algorithm, bearing in mind that its hyper-parameters should be fixed.

In [None]:
reg = LinearRegression()
#reg = DummyRegressor()

### Validation Methods

For easier interpretation and comparison we use the same number of splits for all validation methods (except for the train/test split). In the cell below we address typical values of k (k=5 and k=10) as well as the most "extreme" choices (k=2 and k=n_samples). Besides those values, you can specify the start, end point and step size of a grid of k values.

In [None]:
k_max = 800 #second largest k (max for generating the grid)
ks = np.array([2,5,10]) 
ks = np.append(ks,np.arange(15,k_max, 30)) #grid of k values
ks = np.append(ks,X.shape[0]) #k=n_samples
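
The grid built above can be inspected with a small sketch like the one below. `make_k_grid` is a hypothetical helper that only mirrors the cell above; the approximate test-fold size `n_samples // k` applies to k-fold CV, where k=n_samples corresponds to leave-one-out.

```python
import numpy as np

def make_k_grid(n_samples, k_max=800, start=15, step=30):
    """Hypothetical helper: fixed values, a regular grid, and k = n_samples."""
    ks = np.array([2, 5, 10])
    ks = np.append(ks, np.arange(start, k_max, step))
    ks = np.append(ks, n_samples)
    return ks

ks_demo = make_k_grid(1000)

# Approximate number of test samples per fold in k-fold CV
fold_sizes = 1000 // ks_demo

print(len(ks_demo), "values of k")
print("first:", ks_demo[:5], "last:", ks_demo[-3:])
print("test-fold size for k=2:", fold_sizes[0], "and k=n:", fold_sizes[-1])
```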

#### K Fold CV 

Now we are ready to perform our experiments. We set up a dataframe for easier plotting afterwards and store every single result. The number of folds differs between runs, so we pad the shorter columns with `np.nan`; these entries are ignored later when computing statistics or drawing plots.

In [None]:
#K-Fold on X and y (aka "full" dataset) to estimate performance
kf_df = pd.DataFrame()
kf_cf = pd.DataFrame()
for kk in ks:
    cv = KFold(kk, shuffle=True, random_state=2)
    c_v = cross_validate(reg, X, y,cv=cv, scoring='neg_mean_squared_error', n_jobs=-1, return_estimator=True)
    n = np.empty((np.max(ks)-kk))
    n[:] = np.nan
    kf_df['k=' + str(kk)] = (np.append(-c_v['test_score'], n)).tolist()
    tmp = pd.DataFrame([est.coef_ for est in c_v['estimator']])
    tmp['k'] = kk*np.ones((len(tmp)))
    # DataFrame.append was removed in pandas 2.0; pd.concat works in all versions
    kf_cf = pd.concat([kf_cf, tmp])
kf_cf.columns = ['coef_' + str(i) for i in range(n_feat)] + ['k']
kf_df.head(6)
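
The NaN padding used above can be illustrated on a toy example. The scores below are made-up numbers, chosen only to show that padded entries do not affect the column means.

```python
import numpy as np
import pandas as pd

# Made-up per-fold errors for two runs with a different number of folds
scores = {"k=2": [4.0, 6.0], "k=5": [3.0, 5.0, 4.0, 6.0, 7.0]}
max_len = max(len(vals) for vals in scores.values())

# Pad every column with NaN up to the length of the longest one
padded = pd.DataFrame({
    name: np.append(vals, np.full(max_len - len(vals), np.nan))
    for name, vals in scores.items()
})

col_means = padded.mean()                  # pandas skips NaN by default
nan_mean_k2 = np.nanmean(padded["k=2"])    # explicit NaN-aware mean
print(padded)
print(col_means)
```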

#### Resampling

In resampling we also have to decide on a split ratio between the train and test sets. We could vary this parameter as part of the experiment as well.

In [None]:
#resampling on X and y (aka "full" dataset) to estimate performance
re_df = pd.DataFrame()
re_cf = pd.DataFrame()
test_ratio = .3
for kk in ks:
    cv = ShuffleSplit(n_splits=kk, test_size=test_ratio, random_state=350)
    c_v = cross_validate(reg, X, y,cv=cv, scoring='neg_mean_squared_error', n_jobs=-1, return_estimator=True)
    n = np.empty((np.max(ks)-kk))
    n[:] = np.nan
    re_df['splits=' + str(kk)] = (np.append(-c_v['test_score'], n)).tolist()
    tmp = pd.DataFrame([est.coef_ for est in c_v['estimator']])
    tmp['splits'] = kk*np.ones((len(tmp)))
    # DataFrame.append was removed in pandas 2.0; pd.concat works in all versions
    re_cf = pd.concat([re_cf, tmp])
re_cf.columns = ['coef_' + str(i) for i in range(n_feat)] + ['splits']
re_df.head(6)
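
Varying the split ratio, as suggested above, could look like the following sketch. It assumes a smaller dataset (300 samples) and a fixed number of 20 splits to keep the run time short; both values, as well as the chosen ratios, are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Smaller dataset than in the notebook, for a quick run (assumption)
X_d, y_d = make_regression(n_samples=300, n_features=19, n_informative=18,
                           noise=5, bias=1, random_state=0)

ratio_results = {}
for ratio in (0.1, 0.3, 0.5):
    cv = ShuffleSplit(n_splits=20, test_size=ratio, random_state=350)
    neg_mse = cross_val_score(LinearRegression(), X_d, y_d, cv=cv,
                              scoring="neg_mean_squared_error")
    # Store mean and standard deviation of the MSE over all splits
    ratio_results[ratio] = (-neg_mse.mean(), neg_mse.std())

for ratio, (mean_mse, std_mse) in ratio_results.items():
    print(f"test_size={ratio}: MSE = {mean_mse:.2f} +/- {std_mse:.2f}")
```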

#### Bootstrapping

Bootstrapping is used less prominently in machine learning than in classical statistics, which may be why there is no direct sklearn implementation (anymore). We use **mlxtend**'s `BootstrapOutOfBag` instead.

In [None]:
#bootstrapping on X and y (aka "full" dataset) to estimate performance
bs_df = pd.DataFrame()
bs_cf = pd.DataFrame()
for kk in ks:
    cv = BootstrapOutOfBag(n_splits=int(kk), random_seed=456)
    c_v = cross_validate(reg, X, y,cv=cv, scoring='neg_mean_squared_error', n_jobs=-1, return_estimator=True)
    n = np.empty((np.max(ks)-kk))
    n[:] = np.nan
    bs_df['splits=' + str(kk)] = (np.append(-c_v['test_score'], n)).tolist()
    tmp = pd.DataFrame([est.coef_ for est in c_v['estimator']])
    tmp['splits'] = kk*np.ones((len(tmp)))
    # DataFrame.append was removed in pandas 2.0; pd.concat works in all versions
    bs_cf = pd.concat([bs_cf, tmp])
bs_cf.columns = ['coef_' + str(i) for i in range(n_feat)] + ['splits']
bs_df.head(6)
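
To make the idea of an out-of-bag bootstrap split concrete, here is a minimal NumPy sketch. It is not mlxtend's implementation; `oob_split` is a hypothetical helper that draws the training indices with replacement and uses the samples that were never drawn as the test set.

```python
import numpy as np

def oob_split(n_samples, rng):
    """Hypothetical helper: one out-of-bag bootstrap split."""
    # Training set: n_samples indices drawn with replacement
    train_idx = rng.integers(0, n_samples, size=n_samples)
    # Test set: all indices that never appeared in the training draw
    test_idx = np.setdiff1d(np.arange(n_samples), train_idx)
    return train_idx, test_idx

rng = np.random.default_rng(456)
train_idx, test_idx = oob_split(1000, rng)

oob_fraction = len(test_idx) / 1000
print("unique training samples:", len(np.unique(train_idx)))
print("out-of-bag fraction:", oob_fraction)
```

On average a fraction of about (1 - 1/n)^n, roughly 0.368 for large n, of the samples ends up out of bag, so the test sets here are somewhat larger than with `test_ratio = .3` in the resampling experiment.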

#### Single Train/Test Split

In [None]:
#fit on the training set, evaluate on the held-out test set
y_pred = reg.fit(X_train,y_train).predict(X_test)
test_error = mean_squared_error(y_test,y_pred)
test_error
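
The single split above depends on one `random_state`. As a rough sketch of how much the test-set estimate can move, the split can be repeated with different seeds; the ten seeds used here are an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_s, y_s = make_regression(n_samples=1000, n_features=19, n_informative=18,
                           noise=5, bias=1, random_state=0)

split_errors = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_s, y_s, test_size=0.3, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    split_errors.append(mean_squared_error(y_te, model.predict(X_te)))

# Range of the test MSE across the different random splits
spread = max(split_errors) - min(split_errors)
print("test MSE per seed:", np.round(split_errors, 2))
print("spread (max - min):", round(spread, 2))
```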

#### No Split

In [None]:
#fit on whole dataset (no test set at all)
y_pred = reg.fit(X,y).predict(X)
train_error = mean_squared_error(y,y_pred)
train_error

#### True Error
As we generated the data from a pre-defined model, we can calculate the best possible performance (of a linear model) by multiplying the features with the true coefficients and adding the bias.

In [None]:
#True model on "full data set"
true_error = mean_squared_error(y,bias+np.matmul(X,true_coef))
true_error
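
Two sanity checks follow from this construction and are sketched below on regenerated data. The MSE of the true model only contains the noise, so it should be close to `noise_std**2`. An ordinary least-squares fit evaluated on its own training data cannot have a larger MSE than the true model on the same data, which is why the "no split" error is an optimistic estimate.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

noise_std_demo, bias_demo = 5, 1
X_t, y_t, coef_t = make_regression(
    n_samples=1000, n_features=19, n_informative=18,
    noise=noise_std_demo, bias=bias_demo, coef=True, random_state=0,
)

# Error of the data-generating model: only the Gaussian noise remains
true_mse = mean_squared_error(y_t, bias_demo + X_t @ coef_t)

# Error of a least-squares fit evaluated on its own training data
fitted = LinearRegression().fit(X_t, y_t)
train_mse = mean_squared_error(y_t, fitted.predict(X_t))

print("true model MSE:", round(true_mse, 3))
print("noise variance:", noise_std_demo ** 2)
print("training MSE  :", round(train_mse, 3))
```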

### Plotting the Results

Now we can plot the results of our experiments. As we recorded each error, we calculate the mean and the 95% confidence interval of the MSEs from the different folds/splits/rounds.

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(12,4))
sns.pointplot(data=kf_df, estimator=np.nanmean, ci=95, ax=axs[0])
sns.pointplot(data=re_df,estimator=np.nanmean, ci=95, ax=axs[1])
sns.pointplot(data=bs_df,estimator=np.nanmean, ci=95, ax=axs[2])

axs[0].set_title('k-Fold')
axs[1].set_title('Resampling')
axs[2].set_title('Bootstrap')

# find min and max values for the plot
minmax = [true_error,train_error, test_error]
for ax in axs: 
    minmax = minmax + list((ax.get_yaxis().get_data_interval()))

for ax in axs: 
    ax.plot([0,len(ks)-1],[true_error,true_error],'--r', label='true model')
    ax.plot([0,len(ks)-1],[train_error,train_error],':k', label='no split')
    ax.plot([0,len(ks)-1],[test_error,test_error],':m', label='test-set method')
    ax.set_xticklabels(ax.get_xticklabels(),rotation=70)
    ax.set_ylim(np.min(minmax)-.2,np.max(minmax)+.2)
    ax.set_ylabel('MSE')
    ax.legend()
    ax.grid()
plt.tight_layout()
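
The confidence intervals in these plots are obtained by bootstrapping the recorded errors. A rough NumPy sketch of such a percentile interval is shown below; the scores are made-up numbers, and the number of bootstrap rounds (1000) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up MSE values of one column, including the NaN padding
scores = np.array([24.1, 26.3, 22.8, 27.5, 25.0, np.nan, np.nan])
valid = scores[~np.isnan(scores)]   # drop the padding first

# Resample the valid scores with replacement and record each mean
boot_means = np.array([
    rng.choice(valid, size=valid.size, replace=True).mean()
    for _ in range(1000)
])

mean_mse = valid.mean()
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {mean_mse:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

With only a few folds (small k) the interval is based on very few values and should be read with care.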

We can also check the standard deviation of the regression coefficients throughout the different splits in the different methods.

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(14,4))

low_lim=0
hi_lim = np.max([kf_cf.groupby('k').std().max().max(), re_cf.groupby('splits').std().max().max(), bs_cf.groupby('splits').std().max().max()])
cmap = 'Blues'

sns.heatmap(kf_cf.groupby('k').std(), ax=axs[0],vmin=low_lim, vmax=hi_lim, cmap=cmap,cbar_kws={'label': 'std of coef'})
axs[0].set_title('k-Fold')
axs[0].set_yticklabels(axs[0].get_yticklabels(),rotation=0)

sns.heatmap(re_cf.groupby('splits').std(), ax=axs[1],vmin=low_lim, vmax=hi_lim,cmap=cmap,cbar_kws={'label': 'std of coef'})
axs[1].set_title('Resampling')
axs[1].set_yticklabels(axs[1].get_yticklabels(),rotation=0)
sns.heatmap(bs_cf.groupby('splits').std(), ax=axs[2],vmin=low_lim, vmax=hi_lim,cmap=cmap,cbar_kws={'label': 'std of coef'})
axs[2].set_title('Bootstrap')
axs[2].set_yticklabels(axs[2].get_yticklabels(),rotation=0)

plt.tight_layout()
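
What `groupby(...).std()` computes for the heatmaps can be checked on a toy table. The coefficient values below are made up; note that pandas uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Made-up coefficients from two runs: k=2 (two fits) and k=3 (three fits)
coefs = pd.DataFrame({
    "coef_0": [1.0, 3.0, 2.0, 2.0, 2.0],
    "coef_1": [5.0, 5.0, 4.0, 6.0, 5.0],
    "k":      [2, 2, 3, 3, 3],
})

# One row per k, one column per coefficient
coef_std = coefs.groupby("k").std()
print(coef_std)
```

A value of zero means that every fit in that group returned the same coefficient; larger values indicate that the coefficient depends more strongly on the particular split.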

If you still have not had enough, you can look at the distributions of the dependent variable for individual training and test splits as a final step. In this notebook we have only used random splits, so the distributions of the training and test sets may differ. If you use the test-set method with a single test set, the deviations can be larger, depending on the dataset. Here we refer again to the reading (Applied Predictive Modeling, Chapter 4, or the paper by Westad et al.): when splitting, one should consider the feature space in particular and not only the dependent variable.

In [None]:
kk=5 #k or number of splits
train_inds=[]
test_inds=[]

cv_kf = KFold(kk, shuffle=True, random_state=2)
cv_re = ShuffleSplit(n_splits=kk, test_size=test_ratio, random_state=350)
cv_bs = BootstrapOutOfBag(kk, random_seed=456)
cvs = [cv_kf, cv_re, cv_bs] #order must match the plot titles: k-Fold, Resampling, Bootstrapping

for ct in range(3):
    train_idx, test_idx = [],[]
    for train_index, test_index in cvs[ct].split(X, y):
        train_idx.append(train_index)
        test_idx.append(test_index)
    train_inds.append(train_idx)
    test_inds.append(test_idx)

In [None]:
fold=4 #particular fold to visualize
fig, axs = plt.subplots(1, 3, figsize=(14,4))
axs[0].set_title('k-Fold - fold=' + str(fold))
axs[1].set_title('Resampling - fold=' + str(fold))
axs[2].set_title('Bootstrapping - fold=' + str(fold))
for ct in range(3):
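    # targets of the selected fold, labelled as train/test for the hue
    # (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
    dset = pd.concat([
        pd.DataFrame(data={'set': ['train']*len(train_inds[ct][fold]),'y': y[train_inds[ct][fold]]}),
        pd.DataFrame(data={'set': ['test']*len(test_inds[ct][fold]),'y': y[test_inds[ct][fold]]}),
    ], ignore_index=True)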
    sns.kdeplot(data=dset, x="y", hue="set", ax=axs[ct])
    axs[ct].grid()
plt.tight_layout()
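
Besides looking at the density plots, the train and test targets of each fold can be compared numerically. The sketch below does this for k-fold only and uses mean and standard deviation as simple summaries; the choice of statistics is an assumption, and any other summary could be used instead.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold

X_f, y_f = make_regression(n_samples=1000, n_features=19, n_informative=18,
                           noise=5, bias=1, random_state=0)

rows = []
cv = KFold(5, shuffle=True, random_state=2)
for fold_id, (train_index, test_index) in enumerate(cv.split(X_f, y_f)):
    rows.append({
        "fold": fold_id,
        "n_train": len(train_index),
        "n_test": len(test_index),
        "train_mean": y_f[train_index].mean(),
        "test_mean": y_f[test_index].mean(),
        "train_std": y_f[train_index].std(),
        "test_std": y_f[test_index].std(),
    })

# One row per fold; large gaps between train and test columns hint at
# differently distributed splits
summary = pd.DataFrame(rows)
print(summary.round(2))
```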