# Bias-Variance Tradeoff and Learning Curves

## 📚 Learning Objectives

By completing this notebook, you will:
- Understand bias and variance in machine learning
- Analyze bias-variance tradeoff through learning curves
- Identify and handle overfitting/underfitting
- Select optimal model complexity using validation sets

## 🔗 Prerequisites

- ✅ Understanding of regression and model evaluation
- ✅ Python 3.8+ installed

---

## Official Structure Reference

This notebook covers practical activities from **Course 04, Unit 2**:
- Analyzing bias-variance tradeoff through learning curves
- Identifying and handling overfitting/underfitting
- Selecting optimal model complexity using validation sets
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 2 Practical Content

---

## Introduction to Bias-Variance Tradeoff

- **Bias**: Error from overly simplistic assumptions about the data (underfitting)
- **Variance**: Error from sensitivity to small fluctuations in the training set (overfitting)
- **Tradeoff**: Balancing model complexity against generalization so that total error is minimized
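
The two error sources can be estimated directly by simulation. The sketch below is not part of the original notebook: it assumes the same data-generating process used later (`sin(x)` plus Gaussian noise with standard deviation 0.3 on a fixed grid), refits each polynomial on many fresh noise draws, and then measures how far the average prediction sits from the true function (squared bias) and how much individual fits scatter around that average (variance). The helper name `estimate_bias_variance` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def estimate_bias_variance(degree, n_repeats=200, noise=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit by simulation."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0, 10, 100).reshape(-1, 1)  # fixed design, as in the notebook
    x_test = np.linspace(0, 10, 50).reshape(-1, 1)    # fixed evaluation points
    f_train = np.sin(x_train).ravel()                 # noise-free target (assumed known)
    f_test = np.sin(x_test).ravel()

    preds = np.empty((n_repeats, len(x_test)))
    for i in range(n_repeats):
        # Only the noise changes between repetitions
        y_train = f_train + noise * rng.standard_normal(len(f_train))
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[i] = model.fit(x_train, y_train).predict(x_test)

    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_test) ** 2)  # average squared bias over x_test
    variance = np.mean(preds.var(axis=0))         # average prediction variance over x_test
    return bias_sq, variance


bias_variance = {d: estimate_bias_variance(d) for d in (1, 3, 10)}
for d, (b, v) in bias_variance.items():
    print(f"degree={d:2d}  bias^2={b:.4f}  variance={v:.4f}")
```

Under these assumptions, squared bias should shrink and variance should grow as the degree increases, which is the tradeoff described above.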


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

print("✅ Libraries imported successfully!")


✅ Libraries imported successfully!


In [2]:
# Generate sample data
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * np.random.randn(100)

print(f"Data shape: X={X.shape}, y={y.shape}")


Data shape: X=(100, 1), y=(100,)


## Part 1: Learning Curves - Analyzing Model Performance


In [3]:
# Create learning curves for different model complexities
degrees = [1, 3, 10]
models = {}

for degree in degrees:
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_squared_error'
    )
    
    # Scores are negated MSE, so flip the sign of the means
    models[degree] = {
        'train_sizes': train_sizes,
        'train_mean': -train_scores.mean(axis=1),
        'train_std': train_scores.std(axis=1),
        'val_mean': -val_scores.mean(axis=1),
        'val_std': val_scores.std(axis=1)
    }
    
    print(f"Degree {degree}:")
    print(f"  Final Train MSE: {models[degree]['train_mean'][-1]:.4f}")
    print(f"  Final Val MSE: {models[degree]['val_mean'][-1]:.4f}")
    print(f"  Gap (overfitting indicator): {models[degree]['val_mean'][-1] - models[degree]['train_mean'][-1]:.4f}")


Degree 1:
  Final Train MSE: 0.4825
  Final Val MSE: 0.8638
  Gap (overfitting indicator): 0.3812
Degree 3:
  Final Train MSE: 0.2482
  Final Val MSE: 11.9686
  Gap (overfitting indicator): 11.7204
Degree 10:
  Final Train MSE: 0.0634
  Final Val MSE: 455.6392
  Gap (overfitting indicator): 455.5758

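The validation errors above grow by orders of magnitude with the degree. One likely contributor is the fold layout: `X` is sorted, and `cv=5` on a regressor uses contiguous, unshuffled K-fold splits, so each validation fold covers an x-range the model never saw during training. Polynomials extrapolate poorly, and the penalty grows with the degree. The sketch below reruns the same experiment with shuffled folds; the `KFold(shuffle=True, random_state=42)` splitter and the `shuffled_results` name are assumptions added here, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data as the notebook
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * np.random.randn(100)

# Assumption: shuffled folds, so every fold samples the whole x-range
cv = KFold(n_splits=5, shuffle=True, random_state=42)

shuffled_results = {}
for degree in [1, 3, 10]:
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=cv,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_squared_error'
    )
    shuffled_results[degree] = {
        'train_sizes': train_sizes,
        'train_mean': -train_scores.mean(axis=1),
        'val_mean': -val_scores.mean(axis=1)
    }
    print(f"Degree {degree}: "
          f"train MSE={shuffled_results[degree]['train_mean'][-1]:.4f}, "
          f"val MSE={shuffled_results[degree]['val_mean'][-1]:.4f}")
```

With interpolation rather than extrapolation being measured, the validation errors should be far smaller than the values printed above, and the ranking of the three models may change.
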

## Part 2: Identifying Overfitting and Underfitting

Let's compare model performance to identify optimal complexity.


In [4]:
# Analyze bias-variance tradeoff
print("=" * 60)
print("Bias-Variance Analysis:")
print("=" * 60)

# Degree 1 (Underfitting - High Bias)
print("\nDegree 1 (Linear - Underfitting):")
print("  Characteristic: High bias, low variance")
print("  Train MSE: {:.4f}".format(models[1]['train_mean'][-1]))
print("  Val MSE: {:.4f}".format(models[1]['val_mean'][-1]))
print("  Gap: {:.4f} (small gap, but both high = underfitting)".format(
    models[1]['val_mean'][-1] - models[1]['train_mean'][-1]))

# Degree 3 (lower training error, but validation error far above it)
print("\nDegree 3 (Cubic):")
print("  Characteristic: Lower bias than degree 1, higher variance")
print("  Train MSE: {:.4f}".format(models[3]['train_mean'][-1]))
print("  Val MSE: {:.4f}".format(models[3]['val_mean'][-1]))
print("  Gap: {:.4f} (val far above train = poor generalization)".format(
    models[3]['val_mean'][-1] - models[3]['train_mean'][-1]))

# Degree 10 (Overfitting - High Variance)
print("\nDegree 10 (Overfitting - High Variance):")
print("  Characteristic: Low bias, high variance")
print("  Train MSE: {:.4f}".format(models[10]['train_mean'][-1]))
print("  Val MSE: {:.4f}".format(models[10]['val_mean'][-1]))
print("  Gap: {:.4f} (large gap = overfitting)".format(
    models[10]['val_mean'][-1] - models[10]['train_mean'][-1]))


Bias-Variance Analysis:

Degree 1 (Linear - Underfitting):
  Characteristic: High bias, low variance
  Train MSE: 0.4825
  Val MSE: 0.8638
  Gap: 0.3812 (small gap, but both high = underfitting)

Degree 3 (Cubic):
  Characteristic: Lower bias than degree 1, higher variance
  Train MSE: 0.2482
  Val MSE: 11.9686
  Gap: 11.7204 (val far above train = poor generalization)

Degree 10 (Overfitting - High Variance):
  Characteristic: Low bias, high variance
  Train MSE: 0.0634
  Val MSE: 455.6392
  Gap: 455.5758 (large gap = overfitting)
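

Comparing three hand-picked degrees only hints at where the best complexity lies. A more systematic option is to sweep the degree and pick the value with the lowest cross-validated error, using `validation_curve` (imported at the top of the notebook but not used so far). The sketch below is an illustration under assumptions, not the notebook's own method: the degree range 1 to 10, the shuffled `KFold` splitter, and the names `degree_range`, `val_mse`, and `best_degree` are all choices made here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data as the notebook
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * np.random.randn(100)

model = Pipeline([
    ('poly', PolynomialFeatures()),
    ('linear', LinearRegression())
])

# Assumption: shuffled folds, because X is sorted
cv = KFold(n_splits=5, shuffle=True, random_state=42)
degree_range = np.arange(1, 11)

# 'poly__degree' addresses the degree parameter of the 'poly' pipeline step
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name='poly__degree', param_range=degree_range,
    cv=cv, scoring='neg_mean_squared_error'
)

train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
best_degree = int(degree_range[np.argmin(val_mse)])

for d, tr, va in zip(degree_range, train_mse, val_mse):
    print(f"degree={d:2d}  train MSE={tr:.4f}  val MSE={va:.4f}")
print(f"Lowest validation MSE at degree {best_degree}")
```

Training error can only stay flat or fall as the degree rises, so the selection has to rest on the validation column.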


## Summary

### Key Concepts:
1. **Bias**: Model too simple, cannot capture patterns (underfitting)
2. **Variance**: Model too complex, memorizes noise (overfitting)
3. **Learning Curves**: Show train/validation performance vs training set size
4. **Optimal Complexity**: Balance where validation error is minimized

### How to Use:
- **Large gap between train and validation error**: Overfitting → reduce complexity, add regularization, or gather more data
- **Both train and validation error high**: Underfitting → increase complexity
- **Both train and validation error similar and low**: Good model
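
These three readings are easier to make from a plot than from final-point numbers. The sketch below draws one learning-curve panel per degree. It is an illustration only: the shuffled `KFold` splitter, the log-scaled y-axis, and the starting training fraction of 0.2 are assumptions chosen here so the panels stay readable.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data as the notebook
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * np.random.randn(100)

degrees = [1, 3, 10]
cv = KFold(n_splits=5, shuffle=True, random_state=42)  # assumption: shuffled folds

fig, axes = plt.subplots(1, len(degrees), figsize=(15, 4), sharey=True)
for ax, degree in zip(axes, degrees):
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=cv,
        train_sizes=np.linspace(0.2, 1.0, 9),
        scoring='neg_mean_squared_error'
    )
    # Scores are negated MSE, so flip the sign before plotting
    ax.plot(sizes, -train_scores.mean(axis=1), 'o-', label='Train MSE')
    ax.plot(sizes, -val_scores.mean(axis=1), 's-', label='Validation MSE')
    ax.set_title(f'Degree {degree}')
    ax.set_xlabel('Training set size')
    ax.set_yscale('log')
    ax.legend()
axes[0].set_ylabel('MSE (log scale)')
fig.tight_layout()
```

Two curves that plateau close together at a high error suggest underfitting, while a training curve well below a validation curve that is still falling suggests overfitting that more data may reduce.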

**Reference:** Course 04, Unit 2: "Analyzing bias-variance tradeoff through learning curves" and "Identifying and handling overfitting/underfitting"
