# Validation and Data Leakage Lab

One of the most common mistakes in machine learning is improper validation. This can lead to models that seem great during development but fail in the real world.

**What you'll learn:**
- Why proper validation matters
- How to use cross-validation
- What data leakage is and how to avoid it
- Common mistakes that cause leakage

**Time:** ~30 minutes

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, r2_score
from sklearn.datasets import fetch_california_housing, load_breast_cancer

np.random.seed(42)
print("Setup complete!")

---

## Part 1: Why Validation Matters

The goal of ML is to build models that work on **new, unseen data**. If we only evaluate on training data, we can't tell if our model is actually learning or just memorizing.

### Overfitting Demo

Let's see what happens when we evaluate on training data vs test data.

In [None]:
# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Train a very deep decision tree (prone to overfitting)
overfit_model = DecisionTreeRegressor(max_depth=None, random_state=42)  # No depth limit!
overfit_model.fit(X_train, y_train)

# Evaluate on training data
train_score = overfit_model.score(X_train, y_train)
print(f"Training R²: {train_score:.4f}")

# Evaluate on test data
test_score = overfit_model.score(X_test, y_test)
print(f"Test R²:     {test_score:.4f}")

**Wow!** The model gets a nearly perfect score on the training data but a much worse one on the test data. This is **overfitting**: the model memorized the training data instead of learning general patterns.

If we had only looked at training performance, we'd think we had a great model!

In [None]:
# Compare with a simpler model
simple_model = DecisionTreeRegressor(max_depth=5, random_state=42)
simple_model.fit(X_train, y_train)

print("Simple Decision Tree (max_depth=5):")
print(f"  Training R²: {simple_model.score(X_train, y_train):.4f}")
print(f"  Test R²:     {simple_model.score(X_test, y_test):.4f}")

The simpler model has a lower training score but a similar (or better!) test score, and a much smaller gap between the two. It generalizes better.
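To see the trade-off across more than two settings, here is a minimal sketch that sweeps `max_depth` and records the training and test scores for each value. It assumes the small `load_diabetes` dataset bundled with scikit-learn as a stand-in for the housing data, and the names `X_demo`, `y_demo`, and `depth_results` are illustrative only.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: the bundled diabetes dataset stands in for the housing data
X_demo, y_demo = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Record (train R², test R²) for trees of increasing depth
depth_results = {}
for depth in [1, 2, 3, 5, 10, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    depth_results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

print(f"{'max_depth':<10} {'Train R²':<10} {'Test R²':<10}")
for depth, (train_r2, test_r2) in depth_results.items():
    print(f"{str(depth):<10} {train_r2:<10.4f} {test_r2:<10.4f}")
```

The training score can only tell you how well the tree fits the data it has seen; the gap between the two columns is what hints at overfitting.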

---

## Part 2: Cross-Validation

A single train/test split can be unreliable: results depend on which data ends up in which set. **Cross-validation** reduces this dependence by averaging over multiple splits.
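A quick way to see how much a single split can move the result is to repeat the split with different seeds. The sketch below assumes the bundled `load_diabetes` dataset (small, so the effect is easy to see); `split_scores` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: a small bundled dataset, where split-to-split variation is visible
X_demo, y_demo = load_diabetes(return_X_y=True)

# Same model, same data, ten different random splits
split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    split_scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

split_scores = np.array(split_scores)
print(f"Test R² across 10 splits: min={split_scores.min():.4f}, "
      f"max={split_scores.max():.4f}, std={split_scores.std():.4f}")
```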

### K-Fold Cross-Validation

The data is split into K parts ("folds"). We train K times, each time using a different fold as the test set:

```
Fold 1: [TEST] [train] [train] [train] [train]
Fold 2: [train] [TEST] [train] [train] [train]
Fold 3: [train] [train] [TEST] [train] [train]
Fold 4: [train] [train] [train] [TEST] [train]
Fold 5: [train] [train] [train] [train] [TEST]
```

This gives us 5 scores instead of 1, and every data point gets to be in the test set exactly once.
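The diagram above can be reproduced with `KFold` directly. This is a small sketch on a toy array of 10 rows (the array and the name `test_counts` are made up for illustration); it prints which rows land in the test fold each time and counts how often each row is tested.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 rows, 2 columns (values are irrelevant, only the row indices matter)
X_toy = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)  # no shuffling, so folds are consecutive blocks of rows
test_counts = np.zeros(len(X_toy), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy), start=1):
    test_counts[test_idx] += 1
    print(f"Fold {fold}: test rows {test_idx.tolist()}, train rows {train_idx.tolist()}")

print(f"Times each row was in a test fold: {test_counts.tolist()}")
```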

In [None]:
# Simple cross-validation with sklearn
model = LinearRegression()

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print("5-Fold Cross-Validation Results:")
print(f"  Scores: {scores}")
print(f"  Mean:   {scores.mean():.4f}")
print(f"  Std:    {scores.std():.4f}")

Now we have a more reliable estimate of model performance:
- **Mean** tells us the expected performance
- **Std** tells us how much it varies (lower is better)
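The spread across folds also depends on how the folds are built. For a regressor, `cv=5` uses consecutive, unshuffled folds, so if the rows are stored in some meaningful order the folds may differ systematically. A hedged sketch of the alternative, passing a shuffled `KFold` object instead of an integer (shown on the bundled `load_diabetes` data as an assumption; variable names are illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = load_diabetes(return_X_y=True)

# Default: consecutive folds in the stored row order
default_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring='r2')

# Alternative: shuffle the rows before cutting them into folds
cv_shuffled = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv_shuffled, scoring='r2'
)

print(f"Unshuffled: mean={default_scores.mean():.4f}, std={default_scores.std():.4f}")
print(f"Shuffled:   mean={shuffled_scores.mean():.4f}, std={shuffled_scores.std():.4f}")
```

Shuffling is not always appropriate (for example, with time-ordered data), so treat it as a choice to make deliberately rather than a default.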

In [None]:
# Compare models using cross-validation
from sklearn.ensemble import RandomForestRegressor

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree (depth=5)': DecisionTreeRegressor(max_depth=5, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=-1)
}

print("Model Comparison (5-fold CV):")
print(f"{'Model':<25} {'Mean R²':<10} {'Std':<10}")
print("-" * 45)

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"{name:<25} {scores.mean():<10.4f} {scores.std():<10.4f}")

---

## Part 3: Data Leakage

**Data leakage** occurs when information that would not be available at prediction time, such as test-set statistics or the target itself, is used to build the model. This leads to overly optimistic performance estimates that don't hold up in reality.

### Common Causes of Leakage

1. **Preprocessing before splitting** — Fitting scalers/encoders on all data
2. **Target leakage** — Features that contain information about the target that wouldn't be available at prediction time
3. **Train-test contamination** — Test data somehow influencing training
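To make the first cause concrete: a fitted scaler stores statistics (mean and standard deviation) of whatever rows it was fit on. The sketch below, which assumes the bundled breast cancer dataset and illustrative variable names, compares a scaler fit on all rows with one fit on the training rows only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_all = StandardScaler().fit(X_demo)   # sees the test rows too
scaler_train = StandardScaler().fit(X_tr)   # sees training rows only

# The stored means differ, because one of them was computed with test rows included
mean_shift = np.abs(scaler_all.mean_ - scaler_train.mean_)
print(f"Rows seen: all={scaler_all.n_samples_seen_}, train={scaler_train.n_samples_seen_}")
print(f"Largest difference in a stored feature mean: {mean_shift.max():.4f}")
```

Any model trained on data transformed by `scaler_all` has, indirectly, been shaped by the test rows.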

### Demo: Preprocessing Leakage

Let's see what happens when we scale data incorrectly.

In [None]:
# WRONG WAY: Scale before splitting
print("=== WRONG: Scale before split ===")

scaler_wrong = StandardScaler()
X_scaled_wrong = scaler_wrong.fit_transform(X)  # Fit on ALL data - LEAKAGE!

X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(
    X_scaled_wrong, y, test_size=0.2, random_state=42
)

model_wrong = LinearRegression()
model_wrong.fit(X_train_w, y_train_w)
score_wrong = model_wrong.score(X_test_w, y_test_w)
print(f"Test R²: {score_wrong:.4f}")

In [None]:
# RIGHT WAY: Split first, then scale
print("\n=== RIGHT: Split first, then scale ===")

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler_right = StandardScaler()
X_train_scaled = scaler_right.fit_transform(X_train_r)  # Fit only on training data
X_test_scaled = scaler_right.transform(X_test_r)         # Transform test data

model_right = LinearRegression()
model_right.fit(X_train_scaled, y_train_r)
score_right = model_right.score(X_test_scaled, y_test_r)
print(f"Test R²: {score_right:.4f}")

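In this case, the two scores are essentially identical: ordinary least squares with an intercept makes the same predictions however the features are linearly rescaled, so the scaler's statistics cannot change the result. For scale-sensitive algorithms (like neural networks, SVMs, or regularized models), and especially on small datasets, fitting the scaler on all the data lets test-set statistics influence training and can bias the score.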

**The principle is important:** Always split first, then preprocess.
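Scaling is a mild case. A preprocessing step that looks at the target, such as feature selection, can leak far more. The sketch below is an illustration under stated assumptions, not part of the lab's datasets: it builds pure-noise features with random labels (so no model should beat chance), then compares selecting features on all rows against selecting inside each training fold. `SelectKBest` with `f_classif` is used as one possible selector.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Assumption: synthetic noise data, so any accuracy above ~50% is an artifact
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 5000))
y_noise = rng.integers(0, 2, size=200)

# WRONG: pick the 20 "best" features using ALL rows, including future test rows
X_preselected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
wrong_acc, right_acc = [], []
for train_idx, test_idx in kf.split(X_noise):
    y_tr, y_te = y_noise[train_idx], y_noise[test_idx]

    # WRONG: the features were already chosen with knowledge of the test rows
    clf = LogisticRegression(max_iter=1000).fit(X_preselected[train_idx], y_tr)
    wrong_acc.append(clf.score(X_preselected[test_idx], y_te))

    # RIGHT: choose features using the training rows of this fold only
    selector = SelectKBest(f_classif, k=20).fit(X_noise[train_idx], y_tr)
    clf = LogisticRegression(max_iter=1000).fit(selector.transform(X_noise[train_idx]), y_tr)
    right_acc.append(clf.score(selector.transform(X_noise[test_idx]), y_te))

print(f"Select before split (leaky): {np.mean(wrong_acc):.4f}")
print(f"Select inside each fold:     {np.mean(right_acc):.4f}")
```

Since the labels are random, the honest estimate should hover around chance, while the leaky one typically looks far better than it has any right to.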

### Demo: Target Leakage (The Dangerous Kind)

Target leakage is when a feature directly or indirectly contains information about the target. Let's create a dramatic example.

In [None]:
# Load breast cancer dataset (classification)
cancer = load_breast_cancer()
X_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y_cancer = cancer.target  # 0 = malignant, 1 = benign

print(f"Original features: {X_cancer.shape[1]}")
print(f"Target distribution: {np.bincount(y_cancer)}")

In [None]:
# Simulate a "leaky" feature: treatment_started
# In reality, this would only be known AFTER diagnosis (our target)
# So it shouldn't be used as a feature for prediction

# Create a leaky feature: "treatment started" is highly correlated with being diagnosed
np.random.seed(42)
X_leaky = X_cancer.copy()
# Malignant cases (y=0) mostly have treatment started
# Benign cases (y=1) mostly don't have treatment
X_leaky['treatment_started'] = np.where(
    y_cancer == 0,
    np.random.choice([0, 1], size=len(y_cancer), p=[0.1, 0.9]),  # 90% of malignant have treatment
    np.random.choice([0, 1], size=len(y_cancer), p=[0.9, 0.1])   # 10% of benign have treatment
)

print("Created 'treatment_started' feature (THIS IS LEAKY!)")

In [None]:
# Split the data
X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(
    X_leaky, y_cancer, test_size=0.2, random_state=42
)

# Train with the leaky feature
model_leaky = RandomForestClassifier(n_estimators=50, random_state=42)
model_leaky.fit(X_train_l, y_train_l)

print("=== Model WITH Leaky Feature ===")
print(f"Training Accuracy: {model_leaky.score(X_train_l, y_train_l):.4f}")
print(f"Test Accuracy:     {model_leaky.score(X_test_l, y_test_l):.4f}")

In [None]:
# Look at feature importances
importance_leaky = pd.DataFrame({
    'Feature': X_leaky.columns,
    'Importance': model_leaky.feature_importances_
}).sort_values('Importance', ascending=False).head(10)

print("\nTop 10 Feature Importances:")
print(importance_leaky)

Check where the leaky feature (`treatment_started`) lands: it should rank at or near the top. The model is essentially "cheating" by using information that wouldn't be available in a real prediction scenario.

In [None]:
# Now train without the leaky feature
X_clean = X_cancer.copy()  # No leaky feature

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_clean, y_cancer, test_size=0.2, random_state=42
)

model_clean = RandomForestClassifier(n_estimators=50, random_state=42)
model_clean.fit(X_train_c, y_train_c)

print("=== Model WITHOUT Leaky Feature ===")
print(f"Training Accuracy: {model_clean.score(X_train_c, y_train_c):.4f}")
print(f"Test Accuracy:     {model_clean.score(X_test_c, y_test_c):.4f}")

The clean model may score slightly lower, but its score is **honest**. It is much closer to what you'd actually see in production.

### How to Detect Target Leakage

Warning signs:
- **Unrealistically high performance** (especially if you're new to the problem)
- **A single feature dominates** importance
- **A feature that seems too good**: ask "would I have this at prediction time?"
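One way to act on these warning signs is to score each feature on its own and look for any that is suspiciously predictive. The following is a rough screening sketch, not a definitive test: it assumes the breast cancer data plus a hypothetical `leaky_flag` column (the target with about 5% of labels flipped), and uses a depth-1 decision tree per feature.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

cancer_demo = load_breast_cancer()
X_screen = pd.DataFrame(cancer_demo.data, columns=cancer_demo.feature_names)
y_screen = cancer_demo.target

# Hypothetical leaky column: a copy of the target with roughly 5% of values flipped
rng = np.random.default_rng(0)
flip = rng.random(len(y_screen)) < 0.05
X_screen['leaky_flag'] = np.where(flip, 1 - y_screen, y_screen)

# Cross-validated accuracy of a one-split tree trained on each feature alone
single_feature_acc = {}
for col in X_screen.columns:
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    single_feature_acc[col] = cross_val_score(
        stump, X_screen[[col]], y_screen, cv=5, scoring='accuracy'
    ).mean()

ranking = pd.Series(single_feature_acc).sort_values(ascending=False)
print(ranking.head(5))
```

A high single-feature score is not proof of leakage (some real measurements are simply informative), so each flagged feature still needs the "would I have this at prediction time?" question.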

---

## Part 4: Proper Validation Pipeline

Here's a checklist for proper validation:

### ✅ Validation Checklist

1. **Split data first** — Before any preprocessing
2. **Fit preprocessors on training data only** — Then transform both train and test
3. **Check for leaky features** — Would this info be available at prediction time?
4. **Use cross-validation** — Don't rely on a single split
5. **Keep a holdout test set** — Final evaluation on data never used during development
6. **Be skeptical of great results** — If it seems too good, it probably is

In [None]:
# Example of a proper pipeline using sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create a pipeline that handles preprocessing correctly
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Cross-validation automatically handles the train/test split correctly for each fold
scores = cross_val_score(pipeline, X_cancer, y_cancer, cv=5, scoring='accuracy')

print("Proper Pipeline with Cross-Validation:")
print(f"  Scores: {scores}")
print(f"  Mean:   {scores.mean():.4f}")
print(f"  Std:    {scores.std():.4f}")

Using a Pipeline ensures that preprocessing is done correctly within each cross-validation fold — the scaler is fit only on the training portion of each fold.
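The cell above covers most of the checklist but not the holdout set (item 5). A minimal sketch of the full workflow, under the assumption that the breast cancer data is used again and with illustrative variable names: set aside a holdout set first, cross-validate the pipeline on the remaining development data, and score the holdout set once at the end.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)

# Step 1: set the holdout set aside before doing anything else
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# Step 2: preprocessing lives inside the pipeline, so it is fit on training rows only
workflow = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Step 3: model development uses cross-validation on the development data only
dev_scores = cross_val_score(workflow, X_dev, y_dev, cv=5, scoring='accuracy')
print(f"CV accuracy on development data: {dev_scores.mean():.4f} (std {dev_scores.std():.4f})")

# Step 4: refit on all development data, then evaluate once on the holdout set
workflow.fit(X_dev, y_dev)
holdout_acc = workflow.score(X_holdout, y_holdout)
print(f"Holdout accuracy: {holdout_acc:.4f}")
```

If the holdout score is used to pick between models, it stops being an unbiased estimate, which is why it is evaluated only once here.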

---

## Summary

### Key Takeaways

| Concept | What it means |
|---------|---------------|
| **Overfitting** | Model works on training data but fails on new data |
| **Cross-validation** | Multiple train/test splits for reliable evaluation |
| **Data leakage** | Using information that wouldn't be available in production |
| **Pipeline** | Bundles preprocessing and model so preprocessing is fit only on training data |

### Rules to Remember

1. **Never evaluate only on training data**
2. **Split first, preprocess second**
3. **Fit on train, transform on both**
4. **If results seem too good, investigate**
5. **Use pipelines to avoid mistakes**

---

## Exercises

1. **Cross-validation experiment:** Try different values of `cv` (3, 5, 10). How do the mean and std change?

2. **Create your own leaky feature:** Add a feature that's derived from the target (e.g., `y + small_noise`). See how it affects the model.

3. **Pipeline practice:** Create a pipeline with a scaler and Random Forest. Use cross-validation to evaluate it.

---

## Next Steps

Ready to submit to Kaggle? Continue to the **[Kaggle Submission Lab](./kaggle_submission.ipynb)**.