<a href="https://colab.research.google.com/github/ab-sa/Statistical-Machine-Learning-2/blob/main/Lecture3_CV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
#!pip install fast_ml
#from fast_ml.model_development import train_valid_test_split
from sklearn.utils import resample
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

Import Credit data

In [None]:
Credit = pd.read_csv('Credit.csv')
print('Dimension of the data: ' + str(Credit.shape))
Credit.head()

Dimension of the data: (400, 12)


Unnamed: 0,ID,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity,Balance
0,1,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian,333
1,2,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian,903
2,3,104.593,7075,514,4,71,11,Male,No,No,Asian,580
3,4,148.924,9504,681,3,36,11,Female,No,No,Asian,964
4,5,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian,331


Three models we will compare:

*   **Model 1**: Balance vs Income
*   **Model 2**: Balance vs Age
*   **Model 3**: Balance vs Income & Age

There is no specific reason for choosing these three models; any other set of candidate models could be compared in the same way.

**Part 1: Random single split**

In [None]:
# 70/15/15 split with train_test_split: 70% for training, then halve the remainder into validation and test sets
X_train, X_rem, y_train, y_rem = train_test_split(Credit[['Income', 'Age']], Credit['Balance'], train_size=0.7,
                                                  random_state=123)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)

print('Dimension of training set: ', X_train.shape)
print('Dimension of validation set: ', X_valid.shape)
print('Dimension of test data: ', X_test.shape)

Dimension of training set:  (280, 2)
Dimension of validation set:  (60, 2)
Dimension of test data:  (60, 2)
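
The two calls to `train_test_split` produce a 70/15/15 split: 70% of the 400 rows go to training, and the remaining 30% is halved into validation and test sets. A minimal sketch of that arithmetic in plain Python (the variable names are illustrative and not part of the notebook):

```python
# Reproduce the split sizes implied by train_size=0.7 followed by test_size=0.5.
n_total = 400                      # rows in the Credit data
n_train = round(0.7 * n_total)     # first split: 70% for training
n_rem = n_total - n_train          # remaining 30%
n_test = round(0.5 * n_rem)        # second split: half of the remainder
n_valid = n_rem - n_test           # the other half

print(n_train, n_valid, n_test)
```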


Fit regression models:

In [None]:
reg_income = LinearRegression().fit(X_train[['Income']], y_train)
reg_age = LinearRegression().fit(X_train[['Age']], y_train)
reg_both = LinearRegression().fit(X_train, y_train)

print('Evaluation on training set:')
print('R-squared of 1st model in training set:  %.3f' % reg_income.score(X_train[['Income']], y_train))
print('R-squared of 2nd model in training set:  %.3f' % reg_age.score(X_train[['Age']], y_train))
print('R-squared of 3rd model in training set:  %.3f' % reg_both.score(X_train, y_train))

print('Evaluation on validation set:')
print('R-squared of 1st model in validation set:  %.3f' % reg_income.score(X_valid[['Income']], y_valid))
print('R-squared of 2nd model in validation set:  %.3f' % reg_age.score(X_valid[['Age']], y_valid))
print('R-squared of 3rd model in validation set:  %.3f' % reg_both.score(X_valid, y_valid))

print('Evaluation on test set:')
print('R-squared of 1st model in test set:  %.3f' % reg_income.score(X_test[['Income']], y_test))
print('R-squared of 2nd model in test set:  %.3f' % reg_age.score(X_test[['Age']], y_test))
print('R-squared of 3rd model in test set:  %.3f' % reg_both.score(X_test, y_test))

Evaluation on training set:
R-squared of 1st model in training set:  0.195
R-squared of 2nd model in training set:  0.001
R-squared of 3rd model in training set:  0.201
Evaluation on validation set:
R-squared of 1st model in validation set:  0.225
R-squared of 2nd model in validation set:  -0.047
R-squared of 3rd model in validation set:  0.216
Evaluation on test set:
R-squared of 1st model in test set:  0.268
R-squared of 2nd model in test set:  -0.007
R-squared of 3rd model in test set:  0.286
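
The negative validation and test values for the Age model are not an error. For a regressor, `score` returns R-squared = 1 - SS_res / SS_tot computed on whatever data is passed in, so it drops below zero whenever the model predicts held-out data worse than the held-out mean would. A small sketch with made-up numbers (the helper `r_squared` is illustrative, not a scikit-learn function):

```python
def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]                      # made-up responses, mean 2.5
print(r_squared(y, [1.0, 2.0, 3.0, 4.0]))     # perfect predictions
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))     # always predicting the mean
print(r_squared(y, [4.0, 3.0, 2.0, 1.0]))     # worse than predicting the mean
```

Only the last case is negative: its squared errors exceed the total variation around the mean.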


**Part 2: Cross-Validation model selection (training and validation)**

Part a: a single round of 5-fold cross-validation:

In [None]:
cv_train_income = cross_val_score(reg_income, X_train[['Income']], y_train, cv=5, scoring='r2')
cv_train_age = cross_val_score(reg_age, X_train[['Age']], y_train, cv=5, scoring='r2')
cv_train_both = cross_val_score(reg_both, X_train, y_train, cv=5, scoring='r2')

print('R-squared from all the 5 CV-splits:')
print(cv_train_income)

print('5-fold CV evaluation:')
print('R-squared of 1st model:  %.3f' % cv_train_income.mean())
print('R-squared of 2nd model:  %.3f' % cv_train_age.mean())
print('R-squared of 3rd model:  %.3f' % cv_train_both.mean())

R-squared from all the 5 CV-splits:
[ 0.01777396  0.28479734  0.35022801  0.17351359 -0.07407509]
5-fold CV evaluation:
R-squared of 1st model:  0.150
R-squared of 2nd model:  -0.023
R-squared of 3rd model:  0.149
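
With `cv=5`, `cross_val_score` splits the 280 training rows into five folds; each fold serves once as the validation set while the model is refit on the other four. The index bookkeeping can be sketched as follows, assuming contiguous, unshuffled folds (the helper `k_fold_indices` is hypothetical and written only for illustration):

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # Spread any remainder over the first `extra` folds.
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

folds = list(k_fold_indices(280, 5))
for train_idx, valid_idx in folds:
    print(len(train_idx), len(valid_idx))
```

Every row is used for validation exactly once, which is why the five scores can be averaged into a single estimate.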


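Use MSPE (i.e., MSE on the validation folds) as the scoring metric. Note that `neg_mean_squared_error` returns the negated MSE, so the printed values are negative and values closer to zero are better: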

In [None]:
cv_train_income = cross_val_score(reg_income, X_train[['Income']], y_train, cv=5, scoring='neg_mean_squared_error')
cv_train_age = cross_val_score(reg_age, X_train[['Age']], y_train, cv=5, scoring='neg_mean_squared_error')
cv_train_both = cross_val_score(reg_both, X_train, y_train, cv=5, scoring='neg_mean_squared_error')

print('MSPE from all the 5 CV-splits:')
print(cv_train_income)

print('5-fold CV evaluation:')
print('MSPE of 1st model:  %.3f' % cv_train_income.mean())
print('MSPE of 2nd model:  %.3f' % cv_train_age.mean())
print('MSPE of 3rd model:  %.3f' % cv_train_both.mean())

MSPE from all the 5 CV-splits:
[-180158.42187158 -184843.65342784 -138043.19447173 -183224.78379393
 -182552.29294358]
5-fold CV evaluation:
MSPE of 1st model:  -173764.469
MSPE of 2nd model:  -213943.901
MSPE of 3rd model:  -174559.561
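
To report the MSPE on its natural scale, flip the sign of the fold scores; the square root of the averaged MSPE then gives an RMSE in the units of Balance. A short sketch using the Model 1 fold scores printed above (the variable names are illustrative):

```python
import math

# Fold scores for Model 1, copied from the output above (negated MSE per fold).
neg_mse_income = [-180158.42187158, -184843.65342784, -138043.19447173,
                  -183224.78379393, -182552.29294358]

mspe_per_fold = [-s for s in neg_mse_income]        # flip the sign back
mspe = sum(mspe_per_fold) / len(mspe_per_fold)      # average over the 5 folds
rmse = math.sqrt(mspe)                              # same units as Balance

print('MSPE of 1st model: %.3f' % mspe)
print('RMSE of 1st model: %.1f' % rmse)
```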


Part b: repeated 5-fold cross-validation (10 repeats with different random splits):

In [None]:
rcv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)

rcv_train_income = cross_val_score(reg_income, X_train[['Income']], y_train, cv=rcv, scoring='r2')
rcv_train_age = cross_val_score(reg_age, X_train[['Age']], y_train, cv=rcv, scoring='r2')
rcv_train_both = cross_val_score(reg_both, X_train, y_train, cv=rcv, scoring='r2')

print('R-squared from all the repeated 5 CV-splits:')
print(rcv_train_income)

print('Repeated 5fold-CV evaluation:')
print('R-squared of 1st model:  %.3f' % rcv_train_income.mean())
print('R-squared of 2nd model:  %.3f' % rcv_train_age.mean())
print('R-squared of 3rd model:  %.3f' % rcv_train_both.mean())

R-squared from all the repeated 5 CV-splits:
[ 0.17194121  0.15616365  0.30756076  0.25092088  0.02412717  0.13184194
  0.1975393   0.27248505  0.14921142  0.10805965 -0.04077381  0.3299576
 -0.01821455  0.1892389   0.26007928  0.0221067   0.21839027  0.17674844
  0.22417827  0.13712557  0.17208348  0.1525262   0.33086803 -0.08061536
  0.16667139  0.31722581  0.20894364  0.04623132  0.15563472  0.12559902
  0.27064257  0.11935011  0.00363911  0.13505857  0.10064417  0.16979493
  0.34564228  0.04201788  0.10426409  0.12511056  0.27825538  0.20612789
  0.13028285  0.01506747  0.19245142  0.28013117  0.28819213  0.24489328
  0.12153946 -0.13886158]
Repeated 5fold-CV evaluation:
R-squared of 1st model:  0.158
R-squared of 2nd model:  -0.037
R-squared of 3rd model:  0.156
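
The 50 scores also show how much a single 5-fold estimate depends on the particular split. The sketch below groups the Model 1 scores printed above into their 10 repeats, assuming the scores are ordered repeat by repeat (five consecutive values per repeat), and compares the per-repeat means with the overall mean:

```python
import statistics

# The 50 R-squared values for Model 1, copied from the output above (10 repeats x 5 folds).
scores = [
    0.17194121, 0.15616365, 0.30756076, 0.25092088, 0.02412717,
    0.13184194, 0.1975393, 0.27248505, 0.14921142, 0.10805965,
    -0.04077381, 0.3299576, -0.01821455, 0.1892389, 0.26007928,
    0.0221067, 0.21839027, 0.17674844, 0.22417827, 0.13712557,
    0.17208348, 0.1525262, 0.33086803, -0.08061536, 0.16667139,
    0.31722581, 0.20894364, 0.04623132, 0.15563472, 0.12559902,
    0.27064257, 0.11935011, 0.00363911, 0.13505857, 0.10064417,
    0.16979493, 0.34564228, 0.04201788, 0.10426409, 0.12511056,
    0.27825538, 0.20612789, 0.13028285, 0.01506747, 0.19245142,
    0.28013117, 0.28819213, 0.24489328, 0.12153946, -0.13886158,
]

n_splits = 5
# One mean per repeat: consecutive blocks of 5 fold scores.
per_repeat = [statistics.mean(scores[i:i + n_splits])
              for i in range(0, len(scores), n_splits)]

print('Mean over all 50 fits: %.3f' % statistics.mean(scores))
print('Lowest single-repeat mean: %.3f' % min(per_repeat))
print('Highest single-repeat mean: %.3f' % max(per_repeat))
print('Std. dev. across the 50 fits: %.3f' % statistics.stdev(scores))
```

Averaging over the repeats smooths out this split-to-split variation, which is the motivation for repeating the procedure.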


**Part 3: Bootstrap**

Single bootstrap split:

In [None]:
# Draw row indices with replacement; rows never drawn form the out-of-bag validation set.
n_obs = len(Credit)
train_id_bs = resample(range(n_obs), random_state=1)
drawn = set(train_id_bs)
valid_id_bs = [item for item in range(n_obs) if item not in drawn]
# Select the predictors by name rather than by column position.
X_train_bs = Credit[['Income', 'Age']].iloc[train_id_bs]
y_train_bs = Credit['Balance'].iloc[train_id_bs]
X_valid_bs = Credit[['Income', 'Age']].iloc[valid_id_bs]
y_valid_bs = Credit['Balance'].iloc[valid_id_bs]

print('Number of unique values in Training set: ', len(set(train_id_bs)))
print(X_train_bs.head())
print(X_train_bs.shape)
print(y_train_bs.head())
print(X_valid_bs.head())
print(X_valid_bs.shape)

Number of unique values in Training set:  259
     Income  Age
37   30.007   69
235  10.503   25
396  13.364   65
72   22.939   47
255  58.063   50
(400, 2)
37     1093
235     191
396     480
72      663
255     118
Name: Balance, dtype: int64
   Income  Age
0  14.891   34
4  55.882   68
5  80.180   77
6  20.996   37
9  71.061   41
(141, 2)
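
The 259 unique rows are close to what sampling with replacement predicts: a given row is left out of a bootstrap sample of size n with probability (1 - 1/n)^n, which is about 0.368 for large n, so roughly 63.2% of the rows are expected to appear at least once. A quick check for n = 400 in plain Python (the names are illustrative):

```python
n = 400
p_out = (1 - 1 / n) ** n            # probability a given row is never drawn
expected_unique = n * (1 - p_out)   # expected number of distinct rows in the sample
expected_oob = n * p_out            # expected number of out-of-bag rows

print('P(row is out-of-bag): %.3f' % p_out)
print('Expected unique rows: %.1f' % expected_unique)
print('Expected out-of-bag rows: %.1f' % expected_oob)
```

The observed counts (259 unique, 141 out-of-bag) are close to these expectations.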


Fit the same linear regression models on the bootstrap splits:

In [None]:
reg_income_bs = LinearRegression().fit(X_train_bs[['Income']], y_train_bs)
reg_age_bs = LinearRegression().fit(X_train_bs[['Age']], y_train_bs)
reg_both_bs = LinearRegression().fit(X_train_bs, y_train_bs)

# Score the models fit on the bootstrap sample (the *_bs estimators), not the Part 1 models.
print('Evaluation on bootstrap training set:')
print('R-squared of 1st model in bootstrap training set:  %.3f' % reg_income_bs.score(X_train_bs[['Income']], y_train_bs))
print('R-squared of 2nd model in bootstrap training set:  %.3f' % reg_age_bs.score(X_train_bs[['Age']], y_train_bs))
print('R-squared of 3rd model in bootstrap training set:  %.3f' % reg_both_bs.score(X_train_bs, y_train_bs))

print('Evaluation on validation set:')
print('R-squared of 1st model in bootstrap validation set:  %.3f' % reg_income_bs.score(X_valid_bs[['Income']], y_valid_bs))
print('R-squared of 2nd model in bootstrap validation set:  %.3f' % reg_age_bs.score(X_valid_bs[['Age']], y_valid_bs))
print('R-squared of 3rd model in bootstrap validation set:  %.3f' % reg_both_bs.score(X_valid_bs, y_valid_bs))

Note: the values below come from scoring the Part 1 models (`reg_income`, `reg_age`, `reg_both`) on the bootstrap sets rather than the bootstrap fits; a least-squares model with an intercept cannot have a negative R-squared on its own training data.

Evaluation on bootstrap training set:
R-squared of 1st model in bootstrap training set:  0.137
R-squared of 2nd model in bootstrap training set:  -0.005
R-squared of 3rd model in bootstrap training set:  0.142
Evaluation on validation set:
R-squared of 1st model in bootstrap validation set:  0.270
R-squared of 2nd model in bootstrap validation set:  -0.005
R-squared of 3rd model in bootstrap validation set:  0.284
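
A single bootstrap split gives one noisy estimate, much like a single train/validation split. One common extension is to repeat the resampling and average the out-of-bag scores. The sketch below shows the idea under loud assumptions: it uses synthetic data rather than the Credit data, a one-predictor least-squares fit written by hand instead of `LinearRegression`, and hypothetical helper names (`fit_line`, `r_squared`).

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot on the data passed in."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

rng = random.Random(1)
n = 200
# Synthetic data (an assumption for illustration, not the Credit data).
x = [rng.uniform(0, 10) for _ in range(n)]
y = [3 + 2 * xi + rng.gauss(0, 1) for xi in x]

oob_scores = []
for _ in range(100):                                   # 100 bootstrap replicates
    train_idx = [rng.randrange(n) for _ in range(n)]   # draw n rows with replacement
    drawn = set(train_idx)
    oob_idx = [i for i in range(n) if i not in drawn]  # rows never drawn
    b0, b1 = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
    preds = [b0 + b1 * x[i] for i in oob_idx]
    oob_scores.append(r_squared([y[i] for i in oob_idx], preds))

print('Mean out-of-bag R-squared: %.3f' % (sum(oob_scores) / len(oob_scores)))
```

Each replicate is scored only on rows the fit never saw, so the averaged score plays the same role as the cross-validated R-squared in Part 2.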
