
Import libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
#!pip install fast_ml
#from fast_ml.model_development import train_valid_test_split
from sklearn.utils import resample
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

Import Credit data

In [None]:
Credit = pd.read_csv('Credit.csv')
print('Dimension of the data: ' + str(Credit.shape))
Credit.head()

Three models we will compare:

*   **Model 1**: Balance vs Income
*   **Model 2**: Balance vs Age
*   **Model 3**: Balance vs Income & Age

These three models are an arbitrary choice for illustration; any other combination of predictors could be compared in the same way.

**Part 1: Random single split**

In [None]:
# using train_test_split:
X_train, X_rem, y_train, y_rem = train_test_split(Credit[['Income', 'Age']], Credit['Balance'], train_size=0.7,
                                                  random_state=123)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)

print('Dimension of training set: ', X_train.shape)
print('Dimension of validation set: ', X_valid.shape)
print('Dimension of test data: ', X_test.shape)
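
The two-step split above should give roughly 70% training, 15% validation, and 15% test data. The sketch below checks this on a synthetic stand-in for the Credit data. The `toy` frame and its values are made up for illustration, and the 400 rows are an assumption that mirrors the `range(400)` used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Credit data (values are made up for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Income': rng.uniform(10, 180, size=400),
    'Age': rng.integers(20, 90, size=400),
})
toy['Balance'] = 6 * toy['Income'] + rng.normal(0, 400, size=400)

# Step 1: keep 70% of the rows for training and set the remaining 30% aside.
X_tr, X_rem, y_tr, y_rem = train_test_split(
    toy[['Income', 'Age']], toy['Balance'], train_size=0.7, random_state=123)
# Step 2: halve the remainder into validation and test sets (15% each).
X_va, X_te, y_va, y_te = train_test_split(
    X_rem, y_rem, test_size=0.5, random_state=123)

split_sizes = {'train': len(X_tr), 'valid': len(X_va), 'test': len(X_te)}
print(split_sizes)

# The three index sets should be disjoint and together cover every row.
all_idx = X_tr.index.union(X_va.index).union(X_te.index)
print('Rows covered:', len(all_idx))
```

Because the three sets never share a row, the test set stays untouched until the final evaluation.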

Fit regression models:

In [None]:
reg_income = LinearRegression().fit(X_train[['Income']], y_train)
reg_age = LinearRegression().fit(X_train[['Age']], y_train)
reg_both = LinearRegression().fit(X_train, y_train)

print('Evaluation on training set:')
print('R-squared of 1st model in training set:  %.3f' % reg_income.score(X_train[['Income']], y_train))
print('R-squared of 2nd model in training set:  %.3f' % reg_age.score(X_train[['Age']], y_train))
print('R-squared of 3rd model in training set:  %.3f' % reg_both.score(X_train, y_train))

print('Evaluation on validation set:')
print('R-squared of 1st model in validation set:  %.3f' % reg_income.score(X_valid[['Income']], y_valid))
print('R-squared of 2nd model in validation set:  %.3f' % reg_age.score(X_valid[['Age']], y_valid))
print('R-squared of 3rd model in validation set:  %.3f' % reg_both.score(X_valid, y_valid))

print('Evaluation on test set:')
print('R-squared of 1st model in test set:  %.3f' % reg_income.score(X_test[['Income']], y_test))
print('R-squared of 2nd model in test set:  %.3f' % reg_age.score(X_test[['Age']], y_test))
print('R-squared of 3rd model in test set:  %.3f' % reg_both.score(X_test, y_test))
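
For reference, `LinearRegression.score` reports R-squared, computed as 1 - SS_res / SS_tot on whatever data it is given. The sketch below recomputes it by hand on made-up data (the arrays `x` and `y` are illustrative, not the Credit data) and compares it with `score` and `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up data: one predictor with a noisy linear response.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * x[:, 0] + rng.normal(0, 2.0, size=200)

# Fit on the first 150 rows, evaluate on the remaining 50.
x_fit, y_fit = x[:150], y[:150]
x_new, y_new = x[150:], y[150:]
model = LinearRegression().fit(x_fit, y_fit)

pred = model.predict(x_new)
ss_res = np.sum((y_new - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_new - y_new.mean()) ** 2)   # total sum of squares on the evaluation data
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(x_new, y_new), r2_score(y_new, pred))
```

On held-out data this quantity can be negative, which happens when the model predicts worse than the mean of the evaluation responses.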

**Part 2: Cross-Validation model selection (training and validation)**

Part a: single cross-validation splits:

In [None]:
cv_train_income = cross_val_score(reg_income, X_train[['Income']], y_train, cv=5, scoring='r2')
cv_train_age = cross_val_score(reg_age, X_train[['Age']], y_train, cv=5, scoring='r2')
cv_train_both = cross_val_score(reg_both, X_train, y_train, cv=5, scoring='r2')

print('R-squared from all the 5 CV-splits:')
print(cv_train_income)

print('5-fold CV evaluation:')
print('R-squared of 1st model:  %.3f' % cv_train_income.mean())
print('R-squared of 2nd model:  %.3f' % cv_train_age.mean())
print('R-squared of 3rd model:  %.3f' % cv_train_both.mean())
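
To make explicit what `cross_val_score` does with `cv=5` for a regressor, here is a minimal sketch that runs the five folds by hand and compares the result. It uses made-up data (`X` and `y` below are illustrative) and relies on the fact that an integer `cv` with a regressor corresponds to an unshuffled `KFold`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Made-up regression data with two predictors.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, size=100)

# Manual 5-fold CV: fit on four folds, score R-squared on the held-out fold.
manual_scores = []
for fit_idx, held_idx in KFold(n_splits=5).split(X):
    model = LinearRegression().fit(X[fit_idx], y[fit_idx])
    manual_scores.append(model.score(X[held_idx], y[held_idx]))
manual_scores = np.array(manual_scores)

# The same procedure through cross_val_score.
auto_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')

print(manual_scores)
print(auto_scores)
```

Each call refits a fresh copy of the estimator per fold, so passing an already fitted model to `cross_val_score` does not leak its earlier fit.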

Use MSPE (mean squared prediction error, i.e., MSE on the held-out folds) as the scoring metric:

In [None]:
cv_train_income = cross_val_score(reg_income, X_train[['Income']], y_train, cv=5, scoring='neg_mean_squared_error')
cv_train_age = cross_val_score(reg_age, X_train[['Age']], y_train, cv=5, scoring='neg_mean_squared_error')
cv_train_both = cross_val_score(reg_both, X_train, y_train, cv=5, scoring='neg_mean_squared_error')

# scikit-learn returns the negative MSE (higher is better), so flip the sign to report MSPE.
print('MSPE from all the 5 CV-splits:')
print(-cv_train_income)

print('5-fold CV evaluation:')
print('MSPE of 1st model:  %.3f' % (-cv_train_income.mean()))
print('MSPE of 2nd model:  %.3f' % (-cv_train_age.mean()))
print('MSPE of 3rd model:  %.3f' % (-cv_train_both.mean()))
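
Since MSPE is in squared units of the response, it can help to report its square root as well. The following sketch, on made-up data (`X` and `y` are illustrative), converts the scores returned under `neg_mean_squared_error` into MSPE and root MSPE per fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Made-up regression data.
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 2))
y = 5.0 * X[:, 0] + rng.normal(0, 1.0, size=120)

neg_mse = cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring='neg_mean_squared_error')
mspe = -neg_mse         # scores are negated MSE values, so flip the sign
rmspe = np.sqrt(mspe)   # root MSPE, in the same units as the response

print('MSPE per fold: ', np.round(mspe, 3))
print('RMSPE per fold:', np.round(rmspe, 3))
print('Mean MSPE: %.3f' % mspe.mean())
```

When comparing models, the one with the smallest mean MSPE is preferred, which is the same as the largest mean of the raw (negative) scores.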

Part b: repeated cross-validation splits:

In [None]:
rcv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)

rcv_train_income = cross_val_score(reg_income, X_train[['Income']], y_train, cv=rcv, scoring='r2')
rcv_train_age = cross_val_score(reg_age, X_train[['Age']], y_train, cv=rcv, scoring='r2')
rcv_train_both = cross_val_score(reg_both, X_train, y_train, cv=rcv, scoring='r2')

print('R-squared from all the repeated 5-fold CV splits:')
print(rcv_train_income)

print('Repeated 5-fold CV evaluation:')
print('R-squared of 1st model:  %.3f' % rcv_train_income.mean())
print('R-squared of 2nd model:  %.3f' % rcv_train_age.mean())
print('R-squared of 3rd model:  %.3f' % rcv_train_both.mean())
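
With `n_splits=5` and `n_repeats=10`, `cross_val_score` returns 50 scores, ordered repeat by repeat. A minimal sketch on made-up data (`X` and `y` are illustrative) that groups the scores by repeat to see how much the CV estimate moves with the random partition:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Made-up regression data.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, 0.5]) + rng.normal(0, 1.0, size=100)

n_splits, n_repeats = 5, 10
rcv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=rcv, scoring='r2')

# One row per repeat, one column per fold; average over the folds of each repeat.
per_repeat = scores.reshape(n_repeats, n_splits).mean(axis=1)

print('Number of scores:', scores.size)
print('Mean R-squared per repeat:', np.round(per_repeat, 3))
print('Spread across repeats (std): %.4f' % per_repeat.std())
```

The spread across repeats gives a rough sense of how much a single 5-fold estimate depends on the particular random split.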

**Part 3: Bootstrap**

Single bootstrap split:

In [None]:
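n = len(Credit)
# Draw n row positions with replacement; rows never drawn form the validation (out-of-bag) set.
train_id_bs = resample(range(n), random_state=1)
train_id_set = set(train_id_bs)
valid_id_bs = [item for item in range(n) if item not in train_id_set]
# Select the predictors by name rather than by column position.
X_train_bs = Credit[['Income', 'Age']].iloc[train_id_bs]
y_train_bs = Credit['Balance'].iloc[train_id_bs]
X_valid_bs = Credit[['Income', 'Age']].iloc[valid_id_bs]
y_valid_bs = Credit['Balance'].iloc[valid_id_bs]

print('Number of unique values in training set: ', len(set(train_id_bs)))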
print(X_train_bs.head())
print(X_train_bs.shape)
print(y_train_bs.head())
print(X_valid_bs.head())
print(X_valid_bs.shape)
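
A bootstrap sample of size n is expected to contain about 1 - (1 - 1/n)^n, or roughly 63.2%, of the distinct rows, and the remaining rows serve as the validation set. The sketch below checks this on plain row positions, with n = 400 assumed to match the `range(400)` above; the names `boot_idx` and `oob_idx` are illustrative.

```python
import numpy as np
from sklearn.utils import resample

n = 400
boot_idx = resample(np.arange(n), random_state=1)   # n draws with replacement
oob_idx = np.setdiff1d(np.arange(n), boot_idx)      # rows never drawn (out-of-bag)

unique_frac = len(np.unique(boot_idx)) / n
expected_frac = 1 - (1 - 1 / n) ** n

print('Observed fraction of distinct rows: %.3f' % unique_frac)
print('Expected fraction of distinct rows: %.3f' % expected_frac)
print('Out-of-bag rows:', len(oob_idx))
```

The observed fraction varies from one bootstrap sample to the next, so it will only approximately match the expected value.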

Fit the same linear regression models on the bootstrap splits:

In [None]:
reg_income_bs = LinearRegression().fit(X_train_bs[['Income']], y_train_bs)
reg_age_bs = LinearRegression().fit(X_train_bs[['Age']], y_train_bs)
reg_both_bs = LinearRegression().fit(X_train_bs, y_train_bs)

# Score the models fitted on the bootstrap sample (the *_bs models), not the Part 1 models.
print('Evaluation on bootstrap training set:')
print('R-squared of 1st model in bootstrap training set:  %.3f' % reg_income_bs.score(X_train_bs[['Income']], y_train_bs))
print('R-squared of 2nd model in bootstrap training set:  %.3f' % reg_age_bs.score(X_train_bs[['Age']], y_train_bs))
print('R-squared of 3rd model in bootstrap training set:  %.3f' % reg_both_bs.score(X_train_bs, y_train_bs))

print('Evaluation on bootstrap validation set:')
print('R-squared of 1st model in bootstrap validation set:  %.3f' % reg_income_bs.score(X_valid_bs[['Income']], y_valid_bs))
print('R-squared of 2nd model in bootstrap validation set:  %.3f' % reg_age_bs.score(X_valid_bs[['Age']], y_valid_bs))
print('R-squared of 3rd model in bootstrap validation set:  %.3f' % reg_both_bs.score(X_valid_bs, y_valid_bs))