# Cross-Validation in Python

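Cross-validation assesses model performance by splitting the data into several subsets (folds), training on all but one, testing on the held-out fold, and rotating until every fold has served as the test set once. It does not prevent overfitting by itself, but it helps detect it and gives a more robust estimate of generalization than a single train/test split.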

This notebook demonstrates cross-validation using scikit-learn with a linear regression model on synthetic data.

## 1. Import Libraries

We need scikit-learn for data generation, the model, and the cross-validation utilities.

In [None]:
%pip install scikit-learn

In [1]:
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
import numpy as np

## 2. Generate Synthetic Data

Create a simple regression dataset with 100 samples, 1 feature, and Gaussian noise with a standard deviation of 0.1 added to the target.

In [2]:
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)
print("Data shape:", X.shape, y.shape)

Data shape: (100, 1) (100,)
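
To see how little the noise distorts the underlying linear relationship, one option is to ask `make_regression` for the ground-truth coefficient with `coef=True` and compare it to a fit on the full dataset. The names `X_demo`, `y_demo`, `true_slope`, and `fitted_slope` are introduced here for illustration only. A minimal sketch:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Same settings as the cell above; coef=True additionally returns
# the coefficient of the underlying linear model.
X_demo, y_demo, true_coef = make_regression(
    n_samples=100, n_features=1, noise=0.1, random_state=42, coef=True
)

# Fit on the full dataset and compare the learned slope to the ground truth.
fitted = LinearRegression().fit(X_demo, y_demo)
true_slope = float(np.ravel(true_coef)[0])
fitted_slope = float(fitted.coef_[0])

print(f"True slope:   {true_slope:.3f}")
print(f"Fitted slope: {fitted_slope:.3f}")
```

With noise this small relative to the signal, the two slopes should agree closely, which is why the per-fold errors later in the notebook are small.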


## 3. Initialize Model

Use a linear regression model for this example.

In [3]:
model = LinearRegression()

## 4. Set Up K-Fold Cross-Validation

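Configure 10-fold cross-validation. With 100 samples, each fold trains on 90 samples and tests on the remaining 10. Shuffling randomizes which samples land in each fold, and `random_state=1` makes the splits reproducible.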

In [4]:
cv = KFold(n_splits=10, shuffle=True, random_state=1)
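
To make the splitting concrete, the sketch below iterates over `KFold.split` and reports the size of each train/test partition. It uses a placeholder array (`X_demo`, an assumption for illustration) because only the number of rows matters for the split.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in for the notebook's X: only the row count matters.
X_demo = np.zeros((100, 1))
cv_demo = KFold(n_splits=10, shuffle=True, random_state=1)

fold_sizes = []
all_test_idx = []
for fold, (train_idx, test_idx) in enumerate(cv_demo.split(X_demo)):
    fold_sizes.append((len(train_idx), len(test_idx)))
    all_test_idx.extend(test_idx.tolist())
    if fold < 3:  # print only the first few folds to keep the output short
        print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}")

# Every sample should land in exactly one test fold.
print("Each sample tested once:", sorted(all_test_idx) == list(range(100)))
```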

## 5. Perform Cross-Validation

Evaluate the model using negative mean absolute error (MAE). scikit-learn scorers follow a "higher is better" convention, so error metrics such as MAE are reported with a negative sign. Flip the sign to recover the usual MAE.

In [5]:
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)
print("Cross-validation scores (negative MAE):", scores)
print(f"Mean MAE: {-scores.mean():.3f}")
print(f"Standard deviation: {scores.std():.3f}")

Cross-validation scores (negative MAE): [-0.05488021 -0.08413882 -0.07235758 -0.08132091 -0.08722358 -0.04575729
 -0.06697136 -0.08161812 -0.08234569 -0.06934001]
Mean MAE: 0.073
Standard deviation: 0.013
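
For readers who want to see what `cross_val_score` does under the hood, the loop below is a rough equivalent for this setup: fit a fresh copy of the model on each training partition, then score it on the held-out fold. It is a simplified sketch, not scikit-learn's actual implementation, and names such as `manual_mae` and `library_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Recreate the notebook's data, model, and splitter so the cell is self-contained.
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)
model = LinearRegression()
cv = KFold(n_splits=10, shuffle=True, random_state=1)

manual_mae = []
for train_idx, test_idx in cv.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    preds = fold_model.predict(X[test_idx])
    manual_mae.append(mean_absolute_error(y[test_idx], preds))
manual_mae = np.array(manual_mae)

# cross_val_score returns negated MAE, so flip the sign before comparing.
library_scores = cross_val_score(
    model, X, y, scoring="neg_mean_absolute_error", cv=cv
)
print("Manual loop matches cross_val_score:",
      np.allclose(manual_mae, -library_scores))
```

Because the splitter uses a fixed integer `random_state`, both passes see the same folds, so the two sets of scores should agree.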


## 6. Interpretation

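- **Mean MAE**: Average absolute error across the 10 folds (0.073 here, in the units of `y`). Lower is better.
- **Standard Deviation**: Spread of the per-fold scores (0.013 here). A small value relative to the mean indicates consistent performance across folds.

Because each sample is used for testing exactly once, the estimate depends less on one particular split than a single train/test evaluation would.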
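
One way to put the mean MAE in context is to compare it with the error that the noise alone would produce. For zero-mean Gaussian noise with standard deviation σ, the expected absolute error is σ·√(2/π), which is roughly 0.08 for σ = 0.1. The sketch below computes that reference value next to the cross-validated MAE. The variable names (`noise_std`, `noise_floor_mae`, `mean_mae`) are assumptions for illustration, and with only 10 test samples per fold some deviation from the reference value is expected.

```python
import math
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

noise_std = 0.1  # the `noise` argument passed to make_regression

# Expected absolute value of zero-mean Gaussian noise: sigma * sqrt(2 / pi).
noise_floor_mae = noise_std * math.sqrt(2 / math.pi)

# Recreate the notebook's evaluation so the cell is self-contained.
X, y = make_regression(n_samples=100, n_features=1, noise=noise_std, random_state=42)
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(
    LinearRegression(), X, y, scoring="neg_mean_absolute_error", cv=cv
)
mean_mae = -scores.mean()

print(f"Noise-only reference MAE: {noise_floor_mae:.3f}")
print(f"Cross-validated mean MAE: {mean_mae:.3f}")
```

A cross-validated MAE close to the noise-only reference suggests the linear model captures essentially all of the learnable structure in this synthetic dataset.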