<a href="https://colab.research.google.com/github/comparativechrono/Principles-of-Data-Science/blob/main/Week_8/Section_6_Python_Example__Cross_Validation_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 6: Python example - cross-validation techniques

Cross-validation is a cornerstone technique in model validation, crucial for assessing how the results of a statistical analysis will generalize to an independent data set. It is particularly vital in scenarios where the goal is to predict, and one needs to estimate how accurately a predictive model will perform in practice. This section provides a detailed Python example demonstrating how to implement various cross-validation techniques using the scikit-learn library.

1. Setting Up the Environment:

Ensure that your Python environment includes scikit-learn, a versatile library that offers robust tools for cross-validation. Install it via pip if it's not already included:

In [None]:
pip install scikit-learn

2. Importing Required Libraries:

Begin by importing necessary libraries. We'll need scikit-learn for the modeling and cross-validation tools, pandas for data manipulation, and NumPy for numerical operations:

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, cross_val_score, LeaveOneOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

3. Preparing the Data:

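We'll use a synthetic dataset for classification, which scikit-learn can generate easily. Note that `weights=[0.99]` places about 99% of the samples in one class, so the data are heavily imbalanced and accuracy alone will look high even for a model that always predicts the majority class: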

In [None]:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, n_informative=3, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# Convert to DataFrame for easier manipulation
df = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(20)])
df['Target'] = y
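
Because the class imbalance affects how the accuracy scores later in this section should be read, it can help to check it directly. The following is a minimal sketch that regenerates the same data and counts the samples per class; the names `class_counts` and `majority_share` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same call as above: weights=[0.99] asks for roughly 99% of samples in class 0
X, y = make_classification(n_samples=1000, n_features=20, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1,
                           weights=[0.99], flip_y=0, random_state=1)

class_counts = np.bincount(y)                  # number of samples in each class
majority_share = class_counts.max() / len(y)   # accuracy of a majority-class guess

print(f'Samples per class: {class_counts}')
print(f'Accuracy of always predicting the majority class: {majority_share:.3f}')
```

Any accuracy reported by the cross-validation methods below is best compared against this majority-class baseline rather than against 50%.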

4. Implementing Cross-Validation:

K-Fold Cross-Validation:

K-Fold is a widely used cross-validation method that divides the data into 'k' folds (consecutive by default, or shuffled first when `shuffle=True`), ensuring every data point is in a test set exactly once and in a training set 'k-1' times.
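
To make the fold structure concrete, here is a small sketch on a toy array of ten samples. The names `X_small`, `kfold_small`, and `test_counts` are hypothetical and used only for this illustration; shuffling is left off so the folds are consecutive.

```python
import numpy as np
from sklearn.model_selection import KFold

X_small = np.arange(10).reshape(-1, 1)   # ten toy samples, one feature
kfold_small = KFold(n_splits=5)          # no shuffling: folds are consecutive

# Count how many times each sample lands in a test fold
test_counts = np.zeros(len(X_small), dtype=int)
for fold, (train_index, test_index) in enumerate(kfold_small.split(X_small)):
  test_counts[test_index] += 1
  print(f'Fold {fold}: train={train_index}, test={test_index}')

print(f'Times each sample was tested: {test_counts}')
```

Each sample appears in exactly one test fold and in the training set of the other four folds.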

In [None]:
# Setting up 10-Fold Cross-Validation
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
# Model instantiation
model = RandomForestClassifier(n_estimators=100, random_state=1)
# Array to store scores
scores = []
for train_index, test_index in kfold.split(X):
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]
  model.fit(X_train, y_train)
  predictions = model.predict(X_test)
  accuracy = accuracy_score(y_test, predictions)
  scores.append(accuracy)
print(f'K-Fold Cross-Validation Scores: {scores}')
print(f'Mean Accuracy: {np.mean(scores)}')
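
The manual loop above can also be written more compactly with `cross_val_score`, which was imported earlier. The sketch below additionally swaps in `StratifiedKFold`, which is not used in the original example; it is offered as one reasonable option for imbalanced data, since it keeps the class ratio roughly the same in every fold. The data are regenerated so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same synthetic data as in the data preparation step
X, y = make_classification(n_samples=1000, n_features=20, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1,
                           weights=[0.99], flip_y=0, random_state=1)

# StratifiedKFold preserves the class proportions within each fold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1)

# cross_val_score clones the model, fits it on each training split,
# and returns one score per fold
cv_scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')

print(f'Stratified K-Fold Scores: {cv_scores}')
print(f'Mean Accuracy: {cv_scores.mean():.3f} (std: {cv_scores.std():.3f})')
```

Passing `cv=kfold` instead of `cv=skf` would give the same kind of per-fold accuracies as the manual K-Fold loop, without stratification.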

Leave-One-Out Cross-Validation:

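Leave-One-Out (LOO) is a special case of k-fold cross-validation in which the number of folds equals the number of data points: each model is trained on all samples but one and tested on the single held-out sample. It uses nearly all of the data for training in every fit, but it is far more computationally expensive (here, 1000 model fits). Because each test set holds a single sample, every per-fold accuracy is either 0 or 1, so only the mean across folds is informative.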

In [None]:
# Using Leave-One-Out Cross-Validation
loo = LeaveOneOut()
scores_loo = []
for train_index, test_index in loo.split(X):
  X_train, X_test = X[train_index], X[test_index]
  y_train, y_test = y[train_index], y[test_index]
  model.fit(X_train, y_train)
  prediction = model.predict(X_test)
  accuracy = accuracy_score(y_test, prediction)
  scores_loo.append(accuracy)
print(f'Leave-One-Out Cross-Validation Score: {np.mean(scores_loo)}')
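
Since the cost of LOO grows with the number of samples, it can be useful to try it on a small problem first. The following is a minimal sketch under assumptions that differ from the example above: a hypothetical 60-sample dataset (`X_small`, `y_small`) and a single `DecisionTreeClassifier` instead of a random forest, chosen only so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical small dataset so that one fit per sample stays cheap
X_small, y_small = make_classification(n_samples=60, n_features=5,
                                       n_informative=3, n_redundant=0,
                                       random_state=1)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X_small)   # one model fit per sample

tree = DecisionTreeClassifier(random_state=1)
# Each entry is the accuracy on a single held-out sample: 0.0 or 1.0
loo_scores = cross_val_score(tree, X_small, y_small, cv=loo, scoring='accuracy')

print(f'Number of model fits: {n_fits}')
print(f'Leave-One-Out mean accuracy: {loo_scores.mean():.3f}')
```

The number of fits equals the number of samples, which is why LOO on the full 1000-sample dataset with a 100-tree forest takes noticeably longer than 10-fold cross-validation.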

Bootstrap Method:

The bootstrap draws samples from the data at random with replacement and is typically used to estimate summary statistics. Here, each model is trained on a bootstrap sample and evaluated on the rows that were not drawn, which lets us assess the stability of our model's performance.

In [None]:
from sklearn.utils import resample
# Define the number of bootstrap samples
n_iterations = 1000
n_size = int(len(df) * 0.50) # 50% of the dataset
# Model instantiation
model = RandomForestClassifier(n_estimators=100)
scores_bootstrap = []
for i in range(n_iterations):
  # Prepare train and test sets
  train = resample(df, n_samples=n_size)
  test = df[~df.index.isin(train.index)]
  X_train, y_train = train.drop('Target', axis=1), train['Target']
  X_test, y_test = test.drop('Target', axis=1), test['Target']
  # Fit model
  model.fit(X_train, y_train)
  # Evaluate model
  predictions = model.predict(X_test)
  accuracy = accuracy_score(y_test, predictions)
  scores_bootstrap.append(accuracy)
# Calculating the confidence intervals of model accuracy
alpha = 0.95
p = ((1.0-alpha)/2.0) * 100
lower = max(0.0, np.percentile(scores_bootstrap, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(scores_bootstrap, p))
print('Bootstrap Cross-Validation Score (95%% confidence interval): %.1f%% (%.1f%%, %.1f%%)' % (np.mean(scores_bootstrap)*100, lower*100, upper*100))

## Explanation:

In this example:

We set up 1000 iterations (n_iterations) for the bootstrap process, where in each iteration, a subset of data is randomly selected with replacement.

The size of each bootstrap sample is set to 50% of the original data, but this can be adjusted depending on the specifics of your dataset and the stability of the model.

We train a RandomForestClassifier on each of these bootstrap samples.

The model's accuracy is recorded for each iteration.

Finally, we take the 2.5th and 97.5th percentiles of the bootstrap accuracies as an approximate 95% confidence interval, a range that indicates how much the accuracy estimate varies from one resample to the next.

This bootstrap approach helps in understanding the variability and the reliability of the model's performance, complementing other cross-validation techniques like K-Fold and LOO, which primarily focus on model validation and stability across different subsets of the dataset. By integrating bootstrapping, you gain deeper insights into the potential variability in your model’s performance due to sample-specific idiosyncrasies, which is invaluable in practical applications where predictions are subject to uncertainty.
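
The percentile calculation at the end of the bootstrap example can be wrapped in a small helper so that the confidence level is easy to change. This is a sketch only: `percentile_interval` is a hypothetical function name, and `example_scores` is made-up data standing in for `scores_bootstrap`, not a result from the model.

```python
import numpy as np

def percentile_interval(scores, confidence=0.95):
  """Return the (lower, upper) percentile bounds of a collection of scores."""
  tail = (1.0 - confidence) / 2.0 * 100     # e.g. 2.5 for a 95% interval
  lower = np.percentile(scores, tail)
  upper = np.percentile(scores, 100 - tail)
  return lower, upper

# Hypothetical accuracies standing in for scores_bootstrap
rng = np.random.default_rng(1)
example_scores = rng.normal(loc=0.98, scale=0.01, size=1000).clip(0.0, 1.0)

lower, upper = percentile_interval(example_scores, confidence=0.95)
print(f'Mean accuracy: {example_scores.mean():.3f}')
print(f'95% percentile interval: ({lower:.3f}, {upper:.3f})')
```

Calling `percentile_interval(scores_bootstrap)` after the bootstrap loop would give the same bounds as the percentile lines in that cell, before the clamping to the range 0 to 1.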

5. Conclusion:

Implementing cross-validation in Python using scikit-learn is a powerful way to evaluate the generalizability and stability of predictive models. By leveraging techniques like K-Fold and Leave-One-Out cross-validation, data scientists can rigorously assess model performance, helping to ensure that the models are robust and perform well across different subsets of data.

Bootstrapping is a statistical method that estimates quantities about a population by repeatedly resampling the available data with replacement and aggregating the estimates computed on each resample. This technique is particularly useful for assessing the reliability of model estimates and for providing measures of uncertainty such as the standard error and confidence intervals.

These approaches are indispensable in predictive modelling, particularly when dealing with limited data or when aiming to achieve highly accurate predictions.