## What is Cross Validation?

Cross-validation is a technique used in machine learning to evaluate the performance of a model and assess its ability to generalize to new, unseen data. It involves partitioning the available data into subsets, performing training and testing on these subsets iteratively, and then aggregating the evaluation results to obtain an overall performance estimate.

The main idea behind cross-validation is to simulate the model's performance on unseen data by using different subsets of the available data for training and testing. Averaging over several splits gives a more reliable estimate than a single split and makes overfitting or underfitting easier to detect. Cross-validation does not prevent either problem by itself; it only measures how well the model generalizes.

Here's a step-by-step overview of the cross-validation process:

- Partition the data: The available data is divided into K subsets of roughly equal size called "folds" (hence the name K-fold cross-validation). For classification, the folds can also be stratified so that each one preserves the overall class proportions.

- Train and test: Perform K iterations, where in each iteration, one fold is used as the test set and the remaining K-1 folds are used as the training set.

- Model training: Train the model on the training set, using the specified algorithm and hyperparameters.

- Model evaluation: Evaluate the trained model on the test set and calculate the chosen performance metric(s) (e.g., accuracy, mean squared error, etc.).

- Repeat and aggregate: Repeat the train-and-test, training, and evaluation steps until every fold has served as the test set exactly once. Aggregate the performance metrics from the K iterations (typically as a mean and standard deviation) to obtain an overall performance estimate.
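The steps above can be written out as an explicit loop. The following is a minimal sketch, not the notebook's own implementation: the dataset, the noise level, and the names `X_demo`, `y_demo`, and `fold_scores` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Illustrative data; the noise level is an assumption made for this sketch.
X_demo, y_demo = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

# Partition the data into K = 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X_demo):
    # Train on the K-1 training folds.
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Evaluate on the single held-out fold.
    preds = model.predict(X_demo[test_idx])
    fold_scores.append(r2_score(y_demo[test_idx], preds))

# Aggregate the per-fold metrics into an overall estimate.
fold_scores = np.array(fold_scores)
print(f"Per-fold R^2: {fold_scores}")
print(f"Mean: {fold_scores.mean():.4f}, std: {fold_scores.std():.4f}")
```

`cross_val_score` wraps this same loop, so in practice the explicit version is mainly useful when you need access to the fitted models or the predictions of each fold.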

By performing cross-validation, you gain insights into how well the model generalizes to unseen data and can assess its stability and robustness. It helps to provide a more comprehensive evaluation of the model's performance compared to a single train-test split. Additionally, cross-validation can assist in hyperparameter tuning and model selection by comparing the performance of different models or parameter settings.
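As a rough illustration of the hyperparameter-tuning use case, the sketch below compares a few regularization strengths by their mean cross-validated score. The `Ridge` model, the candidate `alpha` values, and the noisy synthetic data are assumptions made for this example and do not appear elsewhere in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Illustrative noisy data (an assumption for this sketch).
X_demo, y_demo = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

# Reuse one splitter so every candidate is scored on identical folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

mean_scores = {}
for alpha in [0.01, 1.0, 100.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X_demo, y_demo, scoring='r2', cv=cv)
    mean_scores[alpha] = scores.mean()
    print(f"alpha={alpha}: mean R^2 = {scores.mean():.4f} (std {scores.std():.4f})")

# Pick the candidate with the highest mean cross-validated score.
best_alpha = max(mean_scores, key=mean_scores.get)
print(f"Best alpha: {best_alpha}")
```

Scoring every candidate on the same folds keeps the comparison fair, because differences in the mean then reflect the settings rather than the luck of the split.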

Common cross-validation techniques include `K-fold cross-validation`, `stratified K-fold cross-validation` (each fold preserves the class proportions, which matters most for imbalanced classification datasets), and `leave-one-out cross-validation` (K equals the number of samples). `Holdout validation` (a single train-test split) is the simplest alternative; strictly speaking it is not cross-validation, because nothing is repeated or averaged. The choice of technique depends on the dataset characteristics, the available computational resources, and the specific requirements of the problem at hand.
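To make the difference between these splitters concrete, here is a small sketch on made-up imbalanced labels. The arrays `X_cls` and `y_cls` are hypothetical and exist only to show how each splitter partitions the data.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0 and 10 of class 1.
y_cls = np.array([0] * 90 + [1] * 10)
X_cls = np.zeros((100, 1))  # feature values do not affect how the splitters partition

# Stratified K-fold: every test fold gets the same share of the minority class.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
stratified_counts = [int(y_cls[test_idx].sum()) for _, test_idx in skf.split(X_cls, y_cls)]
print(f"Minority samples per stratified fold: {stratified_counts}")  # 2 in each fold

# Plain K-fold ignores the labels, so the minority count can vary between folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
plain_counts = [int(y_cls[test_idx].sum()) for _, test_idx in kf.split(X_cls)]
print(f"Minority samples per plain fold: {plain_counts}")

# Leave-one-out: one split per sample, each with a single-sample test set.
loo = LeaveOneOut()
print(f"Leave-one-out splits: {loo.get_n_splits(X_cls)}")  # 100
```

Leave-one-out fits the model once per sample, so its cost grows linearly with the dataset size, which is why it is usually reserved for small datasets.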

In [1]:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold, LeaveOneOut, train_test_split

In [2]:
# Generate a synthetic dataset for regression
X, y = make_regression(n_samples=100, n_features=10, random_state=42)

In [7]:
X[0]

array([ 0.15039379,  0.95042384, -0.75913266, -2.12389572, -0.57690366,
       -0.59939265, -0.52575502, -0.83972184,  0.34175598,  1.87617084])

In [9]:
y[0]

-60.4869547841584

In [3]:
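def apply_cross_validation(X, y, cv_type):
    """Score a LinearRegression model on (X, y) with the chosen validation scheme."""
    scoring = 'r2'
    if cv_type == 'kfold':
        cv = KFold(n_splits=5, shuffle=True, random_state=42)
    elif cv_type == 'stratified':
        # StratifiedKFold needs discrete class labels; with a continuous
        # regression target like this y, cross_val_score raises a ValueError.
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    elif cv_type == 'leaveoneout':
        # Each test fold holds a single sample, where R^2 is undefined,
        # so score with negated mean squared error instead.
        cv = LeaveOneOut()
        scoring = 'neg_mean_squared_error'
    elif cv_type == 'holdout':
        # Holdout is a single 80/20 train-test split, so there is nothing to aggregate.
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        model = LinearRegression()
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)  # R^2 on the held-out 20%
        print(f'Holdout validation score: {score}')
        return
    else:
        raise ValueError(f"Invalid cross-validation type: {cv_type!r}")

    model = LinearRegression()
    scores = cross_val_score(model, X, y, scoring=scoring, cv=cv)

    # Scores are R^2 values, except for 'leaveoneout' (negated MSE).
    print(f'{cv_type.capitalize()} validation scores: {scores}')
    print(f'{cv_type.capitalize()} validation mean score: {scores.mean()}')
    print(f'{cv_type.capitalize()} validation standard deviation: {scores.std()}')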

In [5]:
# Apply K-fold cross-validation
apply_cross_validation(X, y, 'kfold')

# # Apply Stratified K-fold cross-validation
# # (needs class labels; raises a ValueError for this continuous regression target)
# apply_cross_validation(X, y, 'stratified')

# # Apply Leave-One-Out cross-validation
# apply_cross_validation(X, y, 'leaveoneout')

# # Apply Holdout validation
# apply_cross_validation(X, y, 'holdout')

Kfold validation scores: [1. 1. 1. 1. 1.]
Kfold validation mean score: 1.0
Kfold validation standard deviation: 0.0
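
The perfect scores above are expected rather than suspicious: `make_regression` defaults to `noise=0.0`, so `y` is an exact linear function of `X` and a linear model recovers it exactly. As a hedged sketch of a more realistic setting, the snippet below regenerates the data with an arbitrarily chosen `noise=25.0`; the names `X_noisy`, `y_noisy`, and `noisy_scores` are introduced here for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Same setup as before, but with Gaussian noise added to the target.
# The noise level of 25.0 is an arbitrary choice for this sketch.
X_noisy, y_noisy = make_regression(n_samples=100, n_features=10, noise=25.0, random_state=42)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
noisy_scores = cross_val_score(LinearRegression(), X_noisy, y_noisy, scoring='r2', cv=cv)

print(f"K-fold R^2 scores with noise: {noisy_scores}")
print(f"Mean: {noisy_scores.mean():.4f}, std: {noisy_scores.std():.4f}")
```

With noise in the target, the scores drop below 1 and differ from fold to fold, which is the situation in which the mean and standard deviation across folds become informative.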
