## Cross-Validation

**Cross-validation** is a technique in which the dataset is split into multiple folds for repeated training and testing, so that model performance is assessed more reliably than with a single split. We will demonstrate how it is particularly valuable when working with limited data.
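To make the idea of folds concrete, here is a minimal sketch that uses `KFold` on ten hypothetical sample indices (the toy array `indices` is an assumption for illustration, not part of the diabetes data). It shows that each sample lands in the test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical sample indices, purely for illustration
indices = np.arange(10)

kf = KFold(n_splits=5)  # Five contiguous folds, no shuffling
folds = [(train_idx, test_idx) for train_idx, test_idx in kf.split(indices)]

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f'Fold {i}: train = {train_idx}, test = {test_idx}')
```

With five folds, every round trains on 80% of the samples and tests on the remaining 20%.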

Importing the necessary libraries and the built-in dataset

In [1]:
from sklearn.datasets import load_diabetes  # Diabetes dataset
from sklearn.linear_model import LinearRegression  # Linear regression model
from sklearn.model_selection import train_test_split, cross_val_score  # Train-test split and cross-validation
from sklearn.metrics import mean_squared_error  # MSE
import numpy as np  # For numerical operations
import pandas as pd  # For data processing
import warnings
warnings.filterwarnings('ignore')  # Suppress warnings

For the purposes of this demonstration, we will randomly sample only 100 rows of the dataset

In [2]:
data_full = load_diabetes()
df = pd.DataFrame(data_full.data, columns = data_full.feature_names)
df['target'] = data_full.target

data = df.sample(n = 100, random_state = 42)  # Randomly sample 100 rows

# Split into X and y
X = data[data_full.feature_names]
y = data['target']

df.head()

,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0


In [None]:
print('Dimensions of X:', X.shape)  # Checking the number of records and features
print('Dimensions of y:', y.shape)

In [None]:
df.describe().T  # Verifying that the predictors are on the same scale

### Linear Regression with Train-Test Split

Let's split our data using a basic train-test split, i.e., **hold-out validation**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
print('Dimensions of X_train:', X_train.shape)
print('Dimensions of y_train:', y_train.shape)
print('Dimensions of X_test:', X_test.shape)
print('Dimensions of y_test:', y_test.shape)

Let's now fit a linear model to the training data

In [None]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)  # Fitting linear regression model on training data

y_pred_train = lr_model.predict(X_train)  # Training predictions
y_pred_test = lr_model.predict(X_test)  # Testing predictions

mse_train = mean_squared_error(y_train, y_pred_train)  # Training MSE
mse_test = mean_squared_error(y_test, y_pred_test)  # Testing MSE

print('Training MSE =', np.round(mse_train, 2))
print('Testing MSE =', np.round(mse_test, 2))

The testing MSE is far larger than the training MSE, which suggests that our model has overfit the training data. Keep in mind, though, that the test set here contains only 20 rows, so a single hold-out estimate is itself quite noisy.
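The hold-out result depends on which 20 rows happen to land in the test set. A minimal sketch, assuming the same 100-row sample as above, repeats the split with ten different `random_state` values (the seeds 0 to 9 are an arbitrary choice for illustration) to show how much the testing MSE can move from one split to the next.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Rebuild the 100-row sample used in this notebook
data_full = load_diabetes()
df = pd.DataFrame(data_full.data, columns=data_full.feature_names)
df['target'] = data_full.target
data = df.sample(n=100, random_state=42)
X, y = data[data_full.feature_names], data['target']

test_mses = []
for seed in range(10):  # Arbitrary seeds, chosen only for illustration
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    test_mses.append(mean_squared_error(y_te, model.predict(X_te)))

print('Testing MSE per split:', np.round(test_mses, 2))
print('Min =', np.round(np.min(test_mses), 2), '| Max =', np.round(np.max(test_mses), 2))
```

A wide gap between the minimum and maximum is a sign that one split alone is not a dependable estimate, which is the motivation for averaging over folds.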

### Linear Regression with Cross-Validation

Now, we will use cross-validation, dividing the data into five folds. In each of the five rounds, one fold (20% of the data) is held out for testing and the other four (80%) are used for training.

In [None]:
lr_model = LinearRegression()
cv_scores = cross_val_score(lr_model, X, y, cv = 5, scoring = 'neg_mean_squared_error')  # Pass the full X and y; the folds are created internally
print('Cross-validation MSE scores:', np.round(-cv_scores, 2))  # Negate, since the scorer returns negative MSE
print('Mean cross-validation MSE score:', np.round(-np.mean(cv_scores), 2))

The mean cross-validation MSE is far closer to the training MSE than the testing MSE from the single train-test split was
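To see what `cross_val_score` is doing, the sketch below writes out the five-fold loop by hand. It assumes that, for a regressor, passing `cv = 5` corresponds to an unshuffled `KFold(n_splits = 5)`, which is our understanding of scikit-learn's default; the variable names `kf` and `manual_mses` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Rebuild the 100-row sample used in this notebook
data_full = load_diabetes()
df = pd.DataFrame(data_full.data, columns=data_full.feature_names)
df['target'] = data_full.target
data = df.sample(n=100, random_state=42)
X, y = data[data_full.feature_names], data['target']

kf = KFold(n_splits=5)  # Assumed equivalent of cv = 5 for a regressor
manual_mses = []
for train_idx, test_idx in kf.split(X):
    # Fit a fresh model on four folds, evaluate on the held-out fold
    model = LinearRegression().fit(X.iloc[train_idx], y.iloc[train_idx])
    y_pred = model.predict(X.iloc[test_idx])
    manual_mses.append(mean_squared_error(y.iloc[test_idx], y_pred))

cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='neg_mean_squared_error')

print('Manual fold MSEs:', np.round(manual_mses, 2))
print('cross_val_score MSEs:', np.round(-cv_scores, 2))
```

Each fold gets its own freshly fitted model, so no fold is ever scored by a model that saw it during training.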

### Linear Regression with Cross-Validation and Train-Test Split

Ideally, a machine learning pipeline incorporates both cross-validation and a final test on held-out data

We will split the data into training and testing and then set aside the testing data entirely; this will serve as "unseen" data for our cross-validated model. The model will be cross-validated on various folds of only the training set.

In [None]:
lr_model = LinearRegression()
cv_scores = cross_val_score(lr_model, X_train, y_train, cv = 5, scoring = 'neg_mean_squared_error')  # Use training data for cross-validation
print('Cross-validation MSE scores:', np.round(-cv_scores, 2))
print('Mean MSE score:', np.round(-np.mean(cv_scores), 2))
lr_model.fit(X_train, y_train)  # Train on the training data
y_pred_test = lr_model.predict(X_test)  # Predictions on the held-out test data
mse_test = mean_squared_error(y_test, y_pred_test)  # Testing performance
print('Testing MSE =', np.round(mse_test, 2))

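Note that cross-validation does not alter the final model: lr_model is fit on the same training set as in the hold-out section, so the testing MSE is the same as before. What cross-validation adds is an estimate of generalisation error obtained from the training set alone (80% of the total data), averaged over five folds. This estimate depends less on any single split and can be checked against the MSE on the "unseen" test data.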
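One way to read these numbers side by side is to look at the training and validation error within each fold before touching the test set. The following is a minimal sketch, assuming the same sample and split as above; it uses `cross_validate` with `return_train_score = True`, and the names `fold_train_mse`, `fold_val_mse` and `final_model` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate, train_test_split

# Rebuild the 100-row sample and the 80:20 split used in this notebook
data_full = load_diabetes()
df = pd.DataFrame(data_full.data, columns=data_full.feature_names)
df['target'] = data_full.target
data = df.sample(n=100, random_state=42)
X, y = data[data_full.feature_names], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validate on the training set only, keeping the per-fold training scores
cv_results = cross_validate(LinearRegression(), X_train, y_train, cv=5,
                            scoring='neg_mean_squared_error', return_train_score=True)
fold_train_mse = -cv_results['train_score']  # Negate: the scorer returns negative MSE
fold_val_mse = -cv_results['test_score']

# Fit the final model on the whole training set and evaluate once on the test set
final_model = LinearRegression().fit(X_train, y_train)
mse_test = mean_squared_error(y_test, final_model.predict(X_test))

print('Mean fold training MSE =', np.round(fold_train_mse.mean(), 2))
print('Mean fold validation MSE =', np.round(fold_val_mse.mean(), 2))
print('Testing MSE =', np.round(mse_test, 2))
```

If the mean validation MSE sits well above the mean training MSE, that gap points to overfitting without the test set having been used at all.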