
# Train-Test Splits and Cross-Validation

Training a machine learning model means using data to learn patterns and make predictions. A model can, however, fit one particular dataset too closely (overfitting), in which case it may not generalize to new, unseen data. Train-test splits and cross-validation are the standard techniques for detecting and mitigating this risk.

## Train-Test Splits

A train-test split divides the dataset into two parts:

-   **Training Set**: Used to fit the model.
-   **Test Set**: Used to evaluate the model's performance on unseen data.

The split should be made before fitting any preprocessing step (such as scaling or imputation), so that statistics computed from the test set do not leak into training.
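To make the leakage point concrete, here is a minimal sketch on synthetic data. The arrays and the choice of `StandardScaler` are illustrative assumptions, not part of the demonstration later in this section. The scaler's statistics are learned from the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix and target, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# 1) Split first
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# 2) Fit the scaler on the training rows only
scaler = StandardScaler().fit(X_tr)

# 3) Apply the same, already-fitted transformation to both parts
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print("Mean learned from the training set:", scaler.mean_)
print("Mean of the scaled test set (not exactly 0):", X_te_scaled.mean(axis=0))
```

Calling `fit` on the full dataset before splitting would let the test rows influence the scaling parameters, which is the kind of leakage the split is meant to prevent.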

The split ratio can vary, but common practices include:

-   **80/20 Split**: 80% training, 20% testing.
-   **70/30 Split**: 70% training, 30% testing.
-   **90/10 Split**: 90% training, 10% testing.

These ratios keep most of the data for training while still leaving enough held-out data for a meaningful performance estimate.
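As a quick illustration of what these ratios mean in practice, the sketch below applies each `test_size` to 1,000 synthetic rows (an arbitrary size chosen for this example) and records how many rows end up in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 synthetic samples, for illustration only
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

split_sizes = {}
for test_size in (0.2, 0.3, 0.1):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=0
    )
    split_sizes[test_size] = (len(X_tr), len(X_te))
    print(f"test_size={test_size}: {len(X_tr)} training rows, {len(X_te)} test rows")
```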

## Cross-Validation

Most machine learning models have hyperparameters: settings that are not learned from the data and must be chosen by the practitioner. Cross-validation (CV) is a technique that helps to tune these hyperparameters and to assess the model's performance more robustly than a single split allows.

Cross-validation involves splitting the training dataset into multiple subsets (folds) and training the model multiple times:

-   **$K$-Fold Cross-Validation**: The dataset is divided into $K$ subsets (folds). The model is trained $K$ times, each time using $K - 1$ folds for training and 1 fold for validation. The performance is averaged over all $K$ iterations.
-   **Leave-One-Out Cross-Validation (LOOCV)**: A special case of $K$-fold where $K$ is equal to the number of samples in the dataset. Each sample is used once as a validation set while the rest are used for training. This is computationally expensive but can be useful for small datasets.
-   **Stratified Cross-Validation**: Ensures that each fold has the same proportion of classes as the entire dataset. This is particularly useful for imbalanced datasets.
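The three schemes above can be sketched with scikit-learn's splitter classes. The ten-sample toy dataset and its 8-to-2 class imbalance are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold

# Ten toy samples with an imbalanced binary label (8 of class 0, 2 of class 1)
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# K-fold with K=5: every sample lands in a validation fold exactly once
kfold_val = [val_idx for _, val_idx in KFold(n_splits=5).split(X_demo)]
print("K-fold validation indices:", [v.tolist() for v in kfold_val])

# Leave-one-out: as many folds as samples
loo_splits = list(LeaveOneOut().split(X_demo))
print("Number of LOOCV folds:", len(loo_splits))

# Stratified 2-fold: each validation fold keeps the 4:1 class ratio
skf = StratifiedKFold(n_splits=2)
strat_counts = [
    np.bincount(y_demo[val_idx], minlength=2).tolist()
    for _, val_idx in skf.split(X_demo, y_demo)
]
print("Class counts per stratified validation fold:", strat_counts)
```

Note that stratification needs class labels, so it applies to classification targets rather than to the continuous target used in the regression demonstration.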

## Practical Demonstration

We will demonstrate the use of train-test splits and cross-validation using the California housing dataset, which is a regression dataset available in the `sklearn.datasets` module.

We will go through the Machine Learning workflow steps, but skip some of the steps that are not relevant for this demonstration.

-   Loading the dataset

In [None]:
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing(as_frame=True)
df = data.frame

-   Data exploration

In [None]:
import pandas as pd

# Display the first few rows of the dataset
print(df.head())

# Plot the correlation matrix of the features and the target variable
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of California Housing Dataset')
plt.show()

Since the only meaningful correlation with the target variable (house value) is with the `MedInc` (median income) feature, we will keep only this feature for our demonstration.

In [None]:
# Select only the 'MedInc' feature for simplicity
X = df[['MedInc']]
y = df['MedHouseVal']
print(X.head())
print(y.head())

-   Train-test split

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score

# Simple train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This code splits the dataset into training and testing sets, with 20% of the data reserved for testing. The `random_state` parameter ensures that the split is reproducible.
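The effect of `random_state` can be checked directly. The sketch below uses 50 toy rows (an assumption for illustration) and compares the test rows obtained from repeated calls.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(50).reshape(-1, 1)  # toy data, for illustration only
y_demo = np.arange(50)

# Same seed twice: identical test rows
_, X_te_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, X_te_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
# Different seed: very likely a different set of test rows
_, X_te_c, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)

same_seed_match = np.array_equal(X_te_a, X_te_b)
print("Same seed, same split:", same_seed_match)
print("Different seed, same split:", np.array_equal(X_te_a, X_te_c))
```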

Let's now visualize the distribution of the target variable in the training and test sets.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(y_train, bins=30, color='blue', alpha=0.7, label='Train Set')
plt.title('Training Set Target Distribution')
plt.xlabel('Median House Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.histplot(y_test, bins=30, color='orange', alpha=0.7, label='Test Set')
plt.title('Test Set Target Distribution')
plt.xlabel('Median House Value')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

This visualization shows the distribution of house values in the training and test sets. The two histograms should have a similar shape, which suggests that the random split has preserved the overall distribution of the target variable.
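A numeric comparison can complement the histograms. The sketch below is self-contained and therefore uses a synthetic right-skewed target as a stand-in for house values (an assumption for illustration); with the real data, the same `describe()` comparison could be applied to `y_train` and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic right-skewed target standing in for house values (illustration only)
rng = np.random.default_rng(42)
y_demo = pd.Series(rng.lognormal(mean=0.5, sigma=0.5, size=5000), name="target")
X_demo = pd.DataFrame({"feature": rng.normal(size=5000)})

_, _, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Side-by-side summary statistics for the two parts
summary = pd.DataFrame({"train": y_tr.describe(), "test": y_te.describe()})
print(summary.round(3))
```

Similar means, standard deviations, and quartiles in the two columns point to a split that did not distort the target distribution.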

-   Model training and evaluation

We begin by training a simple linear regression model on the single-feature training set and then evaluate it on the test set using the $R^2$ metric (the coefficient of determination), which measures the proportion of the variance in the target variable that the model explains.

In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np

model = LinearRegression()
model.fit(X_train, y_train)
score_test = model.score(X_test, y_test)
print(f"Test set R²: {score_test:.2f}")
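For reference, the $R^2$ returned by `score` is defined as $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ are the predictions and $\bar{y}$ is the mean of the evaluated targets. The sketch below computes it by hand on synthetic one-feature data (an assumption for illustration) and compares it with `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic one-feature regression problem (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = 0.4 * X_demo[:, 0] + 0.5 + rng.normal(scale=1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)

# R^2 = 1 - SS_res / SS_tot, computed by hand on the test set
y_hat = reg.predict(X_te)
ss_res = np.sum((y_te - y_hat) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"Manual R²: {r2_manual:.4f}")
print(f"model.score R²: {reg.score(X_te, y_te):.4f}")
```

A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible on held-out data.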

Cross-validation can give us a more reliable estimate of the model's performance by averaging the results over multiple train-validation splits instead of relying on a single split.

In [None]:
# Cross-validation
model = LinearRegression()
cv_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"Cross-validation R² scores: {cv_scores}")
print(f"CV R²: {np.mean(cv_scores):.2f} +/- {np.std(cv_scores):.2f}")

This code performs 5-fold cross-validation on the entire dataset, providing a more robust estimate of the model's performance. The `cross_val_score` function automatically handles the train-test splits internally, allowing us to focus on the model evaluation.
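One detail worth keeping in mind: when `cv` is an integer and the estimator is a regressor, scikit-learn forms the folds with `KFold` without shuffling, so the folds are contiguous blocks of rows. If the rows are stored in a meaningful order, an explicit shuffled splitter can be passed instead. The sketch below illustrates this on synthetic data deliberately sorted by its feature (an assumption for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data whose rows are sorted by the feature (illustration only)
rng = np.random.default_rng(0)
x_sorted = np.sort(rng.uniform(0, 10, size=500))
X_demo = x_sorted.reshape(-1, 1)
y_demo = 0.4 * x_sorted + rng.normal(scale=1.0, size=500)

reg = LinearRegression()

# cv=5 on a regressor: five contiguous, unshuffled folds
scores_ordered = cross_val_score(reg, X_demo, y_demo, cv=5, scoring="r2")

# Explicit splitter: shuffle the rows before forming the folds
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores_shuffled = cross_val_score(reg, X_demo, y_demo, cv=shuffled_cv, scoring="r2")

print("Unshuffled folds:", np.round(scores_ordered, 2))
print("Shuffled folds:  ", np.round(scores_shuffled, 2))
```

In this sorted example each unshuffled validation fold covers only a narrow range of the feature, so its $R^2$ tends to be lower than with shuffled folds.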

We can use the `cross_val_predict` function to get the predicted values for each fold, which can be useful for further analysis or visualization.

In [None]:
from sklearn.model_selection import cross_val_predict

# Get cross-validated predictions
y_pred = cross_val_predict(model, X_train, y_train, cv=5)
print(f"Cross-validated predictions: {y_pred[:10]}")
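As one possible follow-up analysis, out-of-fold predictions can be compared with the true targets to inspect residuals. The sketch below uses synthetic data (an assumption for illustration) so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_predict

# Synthetic one-feature regression data (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = 0.4 * X_demo[:, 0] + rng.normal(scale=1.0, size=300)

# One out-of-fold prediction per sample: each row is predicted by a model
# that did not see that row during fitting
y_oof = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=5)

residuals = y_demo - y_oof
print(f"Out-of-fold R²: {r2_score(y_demo, y_oof):.2f}")
print(f"Out-of-fold MAE: {mean_absolute_error(y_demo, y_oof):.2f}")
print(f"Largest absolute residual: {np.abs(residuals).max():.2f}")
```

A score computed on the pooled out-of-fold predictions is not necessarily identical to the per-fold average reported by `cross_val_score`, so the two should not be mixed when reporting results.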

The cross-validated $R^2$ is not meaningfully different from the single test-set score. This is expected: cross-validation does not change the model, it only changes how its performance is estimated, and here we are using a very simple model with a single feature. In practice, cross-validation is particularly useful when comparing more complex models or tuning hyperparameters, because it gives a more reliable picture of how well a model generalizes to unseen data.
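To illustrate the hyperparameter-tuning use of cross-validation mentioned earlier, here is a minimal sketch using `GridSearchCV` with a `Ridge` model. The synthetic data, the choice of `Ridge`, and the grid of `alpha` values are all assumptions made for this example, not part of the housing demonstration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data with 10 features (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 10))
true_coef = rng.normal(size=10)
y_demo = X_demo @ true_coef + rng.normal(scale=2.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Hypothetical grid of regularization strengths to compare
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold CV on the training set only; the test set stays untouched
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)

print("Best alpha:", search.best_params_["alpha"])
print(f"Mean CV R² for best alpha: {search.best_score_:.2f}")
print(f"Test set R²: {search.score(X_te, y_te):.2f}")
```

Keeping the test set out of the search means the final test score is still an estimate on data that played no role in choosing the hyperparameter.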

## Hands-on Exercises

Using the Ames Housing dataset:

-   Load the dataset.
-   Explore the dataset and select a few relevant features (e.g., `GrLivArea`, `OverallQual`, `YearBuilt`, `TotalBsmtSF`).
-   Do a standard 80/20 train-test split.
-   Train a `LinearRegression` model and compute $R^2$ on the test set.
-   Use 5-fold cross-validation (`cross_val_score`) on the entire dataset and compare the average $R^2$.
-   Optional: Try changing the `cv` value (e.g., 10) and see how scores vary.

## Summary

In this section, we learned about the importance of train-test splits and cross-validation in machine learning. We demonstrated how to perform a train-test split, train a model, and evaluate its performance using the California housing dataset. We also explored the use of cross-validation to obtain a more robust estimate of model performance.