<a href="https://colab.research.google.com/github/asupraja3/ml-ng-notebooks/blob/main/Overfitting_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Optional Lab (Ungraded): Overfitting — Colab Notebook

**Goals**  
In this lab, you will:
- See how **model complexity** (e.g., polynomial degree) affects **bias/variance**.  
- Explore the impact of **noise** and **outliers** on overfitting.  
- Use **regularization** (L2) to mitigate overfitting.  
- Compare performance on **train vs. test** sets.  
- Try both **regression** and **classification** examples.

> This re-creation follows the spirit of the "Overfitting" optional lab from Andrew Ng's Machine Learning course. Interactive controls and commented code let you see how each setting changes the fit.

In [None]:
#@title Setup (installs and imports)
%pip -q install ipywidgets scikit-learn --upgrade

import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive_output, IntSlider, FloatSlider, Dropdown, Checkbox, VBox, HBox, Button
from IPython.display import display, clear_output
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score, log_loss

%config InlineBackend.figure_format = 'retina'
plt.rcParams.update({"figure.figsize": (7, 4.5), "axes.grid": True, "grid.alpha": 0.3})
rng = np.random.default_rng(42)
print("Setup complete.")


## What is Overfitting?

Overfitting happens when a model learns **noise** or **idiosyncrasies** in the training set rather than the underlying pattern.  
Symptoms:
- Very **low training error** but **high test error**.
- Model appears **too wiggly/complex** relative to the amount and quality of data.

**Causes**
- High model complexity (e.g., high polynomial degree, very deep trees).
- Small datasets or high noise.
- Outliers that pull the fit away from the underlying trend.

**Solutions**
- Get **more data** (reduces variance).
- **Reduce model complexity** (e.g., lower degree).
- Apply **regularization** (e.g., L2).
- **Feature selection** / remove noisy features.
- Use **cross-validation** to choose model complexity, and **early stopping** for models that are trained iteratively.
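
The train/test symptom described above can be checked numerically. The following is a minimal, non-interactive sketch, assuming the same quadratic ground truth as this lab's regression generator; the sample size, seed, split, and the helper name `fit_and_score` are illustrative choices, not part of the original lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumed setup: 30 noisy samples of y = 0.5*x^2 - 3x + 10 (seed chosen arbitrarily).
demo_rng = np.random.default_rng(0)
x = demo_rng.uniform(0.0, 30.0, size=30)
y = 0.5 * x**2 - 3.0 * x + 10.0 + demo_rng.normal(0.0, 10.0, size=30)

# Hold out the last 10 points as a test set (x is already in random order).
X_train, y_train = x[:20].reshape(-1, 1), y[:20]
X_test, y_test = x[20:].reshape(-1, 1), y[20:]

def fit_and_score(degree):
    """Fit an unregularized polynomial of the given degree; return (train MSE, test MSE)."""
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    return (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )

scores = {d: fit_and_score(d) for d in (1, 2, 12)}
for d, (mse_train, mse_test) in scores.items():
    print(f"degree={d:2d}  train MSE={mse_train:12.1f}  test MSE={mse_test:12.1f}")
```

In exact arithmetic the training error cannot increase with the degree, because every lower-degree polynomial is a special case of a higher-degree one. The test column is the one that reveals whether the extra flexibility generalizes.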

In [None]:
#@title Helpers: Data generators for regression and classification
def make_regression_data(n=30, noise=10.0, x_min=0.0, x_max=30.0):
    """
    Create 1-D regression data from a (hidden) quadratic with noise.
    y_true = 0.5*x^2 - 3x + 10
    Returns: x (n,), y (n,), y_true(x) (n,)
    """
    x = rng.uniform(x_min, x_max, size=n)
    y_true = 0.5 * x**2 - 3.0 * x + 10.0
    y = y_true + rng.normal(0.0, noise, size=n)
    return x, y, y_true

def make_classification_data(n=60, noise=0.25):
    """
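    Create 2-D classification data with a curved (wiggly) class boundary.
    The label is the sign of a smooth function of (x0, x1) plus Gaussian noise,
    so the classes overlap near the boundary whenever noise > 0.
    Returns: X (n,2), y (n,), and the noise-free function f(x0, x1) whose zero
    level set is the 'ideal' boundary.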
    """
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    f = lambda x0, x1: (2*x0**3 - x0) + np.sin(3*x1)
    logits = f(X[:,0], X[:,1]) + rng.normal(0, noise, size=n)
    y = (logits > 0).astype(int)
    return X, y, f


## Interactive Demo — Regression (Polynomial + L2)

Use the controls to:
- Change **polynomial degree** (complexity).
- Adjust **noise** (data uncertainty).
- Add **L2 regularization** (λ).  
- See train/test **MSE** and compare the **fit curve** to the (hidden) ideal.
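
For reference, the L2 penalty can be written out. The expression below is the objective documented for scikit-learn's `Ridge`, with its `alpha` parameter playing the role of λ, as in the playground code. The symbols $\phi$ (the scaled polynomial feature vector), $d$ (its length), and $m$ (the number of training points) are notation introduced here for illustration. The course states its cost with extra $1/(2m)$ factors, so a given λ value should not be assumed to mean the same thing in both formulations.

$$
\min_{w,\,b}\; \sum_{i=1}^{m} \Bigl(y^{(i)} - w^\top \phi\bigl(x^{(i)}\bigr) - b\Bigr)^2 \;+\; \lambda \sum_{j=1}^{d} w_j^2
$$

The intercept $b$ is not penalized, and setting $\lambda = 0$ recovers ordinary least squares.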

In [None]:
#@title Regression Playground
degree = IntSlider(value=3, min=0, max=12, step=1, description='Degree')
n_samples = IntSlider(value=30, min=10, max=200, step=5, description='Samples')
noise = FloatSlider(value=10.0, min=0.0, max=50.0, step=1.0, description='Noise')
lam = FloatSlider(value=0.0, min=0.0, max=100.0, step=1.0, description='λ (L2)')
test_frac = FloatSlider(value=0.3, min=0.1, max=0.6, step=0.05, description='Test frac')

def plot_regression(degree, n_samples, noise, lam, test_frac):
    # 1) Data (redrawn from the global rng on every slider change, so points differ between redraws)
    x, y, y_true = make_regression_data(n=n_samples, noise=noise)
    idx = rng.permutation(n_samples)
    n_test = int(n_samples * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    x_train, y_train = x[train_idx], y[train_idx]
    x_test, y_test = x[test_idx], y[test_idx]

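    # 2) Pipeline: polynomial features -> scale to unit variance -> ridge (L2)
    #    include_bias=True keeps degree=0 usable (a constant-only model); for higher
    #    degrees the constant column is redundant with fit_intercept=True.
    #    with_mean=False rescales each feature without centering it.
    #    alpha=lam is the L2 strength; lam=0 reduces to ordinary least squares.
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=True)),
        ("scale", StandardScaler(with_mean=False)),
        ("ridge", Ridge(alpha=lam, fit_intercept=True, random_state=0))
    ])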

    X_train = x_train.reshape(-1, 1)
    X_test  = x_test.reshape(-1, 1)
    model.fit(X_train, y_train)

    # 3) Evaluate
    y_pred_train = model.predict(X_train)
    y_pred_test  = model.predict(X_test)
    mse_train = mean_squared_error(y_train, y_pred_train)
    mse_test  = mean_squared_error(y_test, y_pred_test)

    # 4) Plot
    xx = np.linspace(x.min(), x.max(), 300).reshape(-1, 1)
    y_fit = model.predict(xx)
    y_ideal = 0.5 * xx[:,0]**2 - 3.0 * xx[:,0] + 10.0

    plt.figure()
    plt.scatter(x_train, y_train, label="train", alpha=0.8)
    plt.scatter(x_test,  y_test,  label="test",  alpha=0.8)
    plt.plot(xx, y_ideal, linestyle="--", label="y_ideal")
    plt.plot(xx, y_fit, label="y_fit")
    plt.title(f"Regression: degree={degree}, λ={lam:.1f} | MSE train={mse_train:.1f}, test={mse_test:.1f}")
    plt.xlabel("x"); plt.ylabel("y"); plt.legend()
    plt.show()

interactive_output_reg = interactive_output(
    plot_regression,
    {"degree": degree, "n_samples": n_samples, "noise": noise, "lam": lam, "test_frac": test_frac}
)

display(VBox([HBox([degree, n_samples]), HBox([noise, lam, test_frac]), interactive_output_reg]))
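
To see why a larger λ smooths the curve, it can help to look at the fitted weights directly. The sketch below is an illustration under assumed settings (degree 8, three λ values, an arbitrary seed), not part of the original lab; it prints the L2 norm of the ridge weights on the standardized polynomial features.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumed data: 30 noisy samples from the lab's quadratic (seed is arbitrary).
demo_rng = np.random.default_rng(1)
x = demo_rng.uniform(0.0, 30.0, size=30)
y = 0.5 * x**2 - 3.0 * x + 10.0 + demo_rng.normal(0.0, 10.0, size=30)
X = x.reshape(-1, 1)

coef_norms = {}
for lam in (0.01, 1.0, 100.0):
    model = make_pipeline(
        PolynomialFeatures(degree=8, include_bias=False),
        StandardScaler(),
        Ridge(alpha=lam),
    )
    model.fit(X, y)
    # L2 norm of the weights on the standardized polynomial features.
    coef_norms[lam] = float(np.linalg.norm(model.named_steps["ridge"].coef_))
    print(f"lambda={lam:6.2f}  ||w||_2={coef_norms[lam]:10.2f}")
```

The norm shrinks as λ grows, which is the mechanism behind the smoother curve: large, mutually cancelling weights on high-degree terms become too expensive.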


## Interactive Demo — Classification (Logistic Regression + Polynomial Features)

Use the controls to:
- Change **polynomial degree** (decision boundary complexity).
- Adjust **noise** (how mixed the classes are).
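- Apply **L2 regularization**: scikit-learn's `C` is the inverse of the regularization strength, so the code sets `C = 1/λ` (λ = 0 turns the penalty off) and the plot reports λ for intuition.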

We visualize the decision boundary and report **train/test accuracy** and **log loss**.
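
Accuracy and log loss measure different things, which a tiny worked example makes concrete. The labels and probabilities below are made up by hand for illustration; the sketch computes log loss from its definition and compares it with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Hypothetical labels and predicted P(y=1), chosen by hand for illustration.
y_true = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.6, 0.4, 0.1])

# Accuracy only looks at which side of 0.5 each probability falls on.
y_pred = (p_hat >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)

# Log loss averages -[y*log(p) + (1-y)*log(1-p)], so confidence matters too.
manual_ll = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
sklearn_ll = log_loss(y_true, p_hat, labels=[0, 1])

print(f"accuracy={acc:.2f}  log loss (manual)={manual_ll:.4f}  (sklearn)={sklearn_ll:.4f}")
```

Two models with the same accuracy can have very different log loss if one of them is confidently wrong on a few points, which is why an overfit boundary often shows up in test log loss before it shows up in test accuracy.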

In [None]:
#@title Classification Playground
degree_c = IntSlider(value=3, min=1, max=10, step=1, description='Degree')
n_samples_c = IntSlider(value=80, min=30, max=400, step=10, description='Samples')
noise_c = FloatSlider(value=0.25, min=0.0, max=1.0, step=0.05, description='Noise')
lam_c = FloatSlider(value=1.0, min=0.0, max=50.0, step=0.5, description='λ (L2)')
test_frac_c = FloatSlider(value=0.3, min=0.1, max=0.6, step=0.05, description='Test frac')

def plot_classification(degree_c, n_samples_c, noise_c, lam_c, test_frac_c):
    # 1) Data
    X, y, f_ideal = make_classification_data(n=n_samples_c, noise=noise_c)
    idx = rng.permutation(n_samples_c)
    n_test = int(n_samples_c * test_frac_c)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test   = X[test_idx],  y[test_idx]

    # 2) Model: poly -> scale -> logistic regression (L2)
    #    scikit-learn takes C = 1/λ; λ = 0 maps to C = inf, i.e. no penalty.
    C = np.inf if lam_c == 0 else 1.0/lam_c
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree_c, include_bias=True)),
        ("scale", StandardScaler(with_mean=False)),
        ("logreg", LogisticRegression(penalty="l2", C=C, solver="lbfgs", max_iter=500, fit_intercept=True))
    ])

    model.fit(X_train, y_train)

    prob_train = model.predict_proba(X_train)[:,1]
    prob_test  = model.predict_proba(X_test)[:,1]
    y_pred_train = (prob_train >= 0.5).astype(int)
    y_pred_test  = (prob_test  >= 0.5).astype(int)
    acc_train = accuracy_score(y_train, y_pred_train)
    acc_test  = accuracy_score(y_test,  y_pred_test)
    ll_train  = log_loss(y_train, prob_train, labels=[0,1])
    ll_test   = log_loss(y_test,  prob_test,  labels=[0,1])

    # 3) Grid for boundary
    x0_min, x1_min = X[:,0].min()-0.2, X[:,1].min()-0.2
    x0_max, x1_max = X[:,0].max()+0.2, X[:,1].max()+0.2
    gx0, gx1 = np.meshgrid(np.linspace(x0_min, x0_max, 200),
                           np.linspace(x1_min, x1_max, 200))
    grid = np.c_[gx0.ravel(), gx1.ravel()]
    probs = model.predict_proba(grid)[:,1].reshape(gx0.shape)

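    # 4) Plot: fitted boundary (solid) and noise-free ideal boundary f = 0 (dashed)
    plt.figure()
    plt.contour(gx0, gx1, probs, levels=[0.5], linewidths=2)
    plt.contour(gx0, gx1, f_ideal(gx0, gx1), levels=[0.0], colors="gray", linestyles="--")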
    plt.scatter(X_train[:,0], X_train[:,1], c=y_train, marker='o', label='train', alpha=0.85)
    plt.scatter(X_test[:,0],  X_test[:,1],  c=y_test,  marker='s', label='test',  alpha=0.85)
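    # Report log loss alongside accuracy (both were computed above).
    plt.title(f"Classification: degree={degree_c}, λ={lam_c:.2f}\n"
              f"acc train={acc_train:.2f}, test={acc_test:.2f} | log loss train={ll_train:.2f}, test={ll_test:.2f}")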
    plt.xlabel("x0"); plt.ylabel("x1"); plt.legend()
    plt.show()

interactive_output_cls = interactive_output(
    plot_classification,
    {"degree_c": degree_c, "n_samples_c": n_samples_c, "noise_c": noise_c,
     "lam_c": lam_c, "test_frac_c": test_frac_c}
)

display(VBox([HBox([degree_c, n_samples_c]), HBox([noise_c, lam_c, test_frac_c]), interactive_output_cls]))


## Practical Guidance (as in the course)

Try the following experiments:
- **Underfit**: Regression with degree = 1 (line) on quadratic data → high bias.
- **Overfit**: Raise degree to 8–12 with few points → wiggly curve; low train error, higher test error.
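- **Outliers**: Increase `Noise`, or edit the data to inject a few extreme `y` values (the playground has no outlier control), and note how the fit changes. Regularization limits how sharply a flexible curve bends toward them, but it does not remove their influence.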
- **Regularization**: Increase λ to smooth the curve or boundary.
- **Data size**: Increase `Samples` to reduce variance; overfitting often decreases.
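
The outlier experiment needs a way to inject extreme values. The sketch below is one possible approach, not part of the original lab: `add_outliers` and `curve_error` are hypothetical helpers, and the shift of 300, the degree of 8, and the seed are arbitrary choices. It compares how far the fitted curve lands from the noise-free quadratic with and without outliers, at a weak and a stronger λ.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def add_outliers(y, n_outliers, shift, rng):
    """Hypothetical helper: copy y and add `shift` to n_outliers randomly chosen entries."""
    y_out = np.array(y, dtype=float, copy=True)
    idx = rng.choice(len(y_out), size=n_outliers, replace=False)
    y_out[idx] += shift
    return y_out, idx

# Assumed data: 30 noisy samples from the lab's quadratic (seed is arbitrary).
demo_rng = np.random.default_rng(2)
x = demo_rng.uniform(0.0, 30.0, size=30)
y_clean = 0.5 * x**2 - 3.0 * x + 10.0 + demo_rng.normal(0.0, 10.0, size=30)
y_dirty, outlier_idx = add_outliers(y_clean, n_outliers=3, shift=300.0, rng=demo_rng)

X = x.reshape(-1, 1)
xx = np.linspace(0.0, 30.0, 200).reshape(-1, 1)
y_ideal = 0.5 * xx[:, 0]**2 - 3.0 * xx[:, 0] + 10.0

def curve_error(y_train, lam, degree=8):
    """RMS distance between the fitted curve and the noise-free quadratic on a grid."""
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        Ridge(alpha=lam),
    )
    model.fit(X, y_train)
    return float(np.sqrt(np.mean((model.predict(xx) - y_ideal) ** 2)))

for name, targets in (("clean", y_clean), ("with outliers", y_dirty)):
    for lam in (0.01, 10.0):
        err = curve_error(targets, lam)
        print(f"{name:14s} lambda={lam:5.2f}  RMS error vs ideal={err:8.2f}")
```

Comparing against the noise-free curve, rather than against the corrupted training targets, keeps the outliers themselves from flattering or penalizing the score.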

**Remember**: choose hyperparameters (degree, λ) using a **validation** set or cross-validation, not the test set.
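
One way to follow this advice is a cross-validated grid search that never touches the test set. The following is a sketch under assumptions: the grid values, the 5-fold split, the 25% hold-out, and the seed are illustrative, and `GridSearchCV` is simply one convenient scikit-learn tool for the job, not something the original lab prescribes.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumed data: 80 noisy samples from the lab's quadratic (seed is arbitrary).
demo_rng = np.random.default_rng(3)
x = demo_rng.uniform(0.0, 30.0, size=80)
y = 0.5 * x**2 - 3.0 * x + 10.0 + demo_rng.normal(0.0, 10.0, size=80)
X = x.reshape(-1, 1)

# The test set is split off first and is never seen by the search.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])
param_grid = {
    "poly__degree": [1, 2, 3, 5, 8],
    "ridge__alpha": [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(
    pipe,
    param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_dev, y_dev)  # refits the best configuration on all of X_dev

best = search.best_params_
test_mse = float(np.mean((search.predict(X_test) - y_test) ** 2))
print(f"selected degree={best['poly__degree']}, lambda={best['ridge__alpha']}")
print(f"cross-validated MSE={-search.best_score_:.1f}, held-out test MSE={test_mse:.1f}")
```

The test MSE is computed once, after the hyperparameters are fixed, so it remains an honest estimate of performance on new data.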