# Ridge Regression on the Diabetes Dataset

This notebook walks through a simple Ridge Regression workflow using scikit-learn's built-in diabetes dataset.
It highlights why L2 regularization is useful and how changing the `alpha` parameter affects model behavior.

## Why Ridge Regression?

* **Regularization**: Adds an L2 penalty to discourage large coefficients and reduce overfitting.
* **Numerical stability**: The penalty keeps the fit well-conditioned when features are correlated, and it reduces variance when the dataset is small or noisy.
* **`alpha` controls strength**: Larger values increase shrinkage; smaller values behave closer to ordinary least squares.
* **Scaling matters**: Because the penalty depends on coefficient magnitudes, features should be on comparable scales (hence `StandardScaler`).
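
To make the shrinkage effect concrete, here is a minimal sketch that fits scikit-learn's `Ridge` for a few increasing `alpha` values and prints the L2 norm of the coefficients. `Ridge` minimizes `||y - Xw||^2 + alpha * ||w||^2`, so the norm should fall as `alpha` grows. The variable names (`X_demo`, `demo_alphas`, `coef_norms`) and the chosen `alpha` grid are illustrative assumptions, not part of the notebook's `app/` code.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Load features and target as NumPy arrays, then standardize the features
# so the penalty treats every coefficient on the same scale.
X_demo, y_demo = load_diabetes(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

# Illustrative alpha grid (an assumption, chosen to span several decades).
demo_alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for a in demo_alphas:
    ridge = Ridge(alpha=a).fit(X_demo, y_demo)
    # Record the Euclidean norm of the fitted coefficient vector.
    coef_norms.append(float(np.linalg.norm(ridge.coef_)))

for a, norm in zip(demo_alphas, coef_norms):
    print(f"alpha={a:>8}: ||coef||_2 = {norm:.2f}")
```

The intercept is not penalized, so only the feature coefficients shrink toward zero.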

In [None]:
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_diabetes

# Make the repository root importable so that `app/` resolves.
# This assumes the notebook is run from a directory one level below the root.
repo_root = Path.cwd().resolve().parent
if str(repo_root) not in sys.path:
    sys.path.append(str(repo_root))

from app.data import load_diabetes_dataset
from app.evaluate import regression_metrics
from app.model import build_ridge_model

## Load and inspect the data

In [None]:
X, y = load_diabetes_dataset()
X.head(), y.head()
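
The body of `load_diabetes_dataset` lives in `app/data.py` and is not shown here. As a rough sketch of what such a helper might do, the hypothetical function below (`load_diabetes_dataset_sketch`, a name invented for illustration) wraps scikit-learn's `load_diabetes` and returns a pandas `DataFrame` of features and a `Series` target, which is the shape the `.head()` calls above expect.

```python
from sklearn.datasets import load_diabetes


def load_diabetes_dataset_sketch():
    """Hypothetical stand-in for app.data.load_diabetes_dataset."""
    # as_frame=True returns a pandas DataFrame (features) and Series (target).
    X, y = load_diabetes(return_X_y=True, as_frame=True)
    return X, y


X_sketch, y_sketch = load_diabetes_dataset_sketch()
print(X_sketch.shape, y_sketch.shape)
print(list(X_sketch.columns))
```

According to the scikit-learn documentation, the bundled features are already mean-centered and scaled by default, so `StandardScaler` mainly rescales them to unit variance here. It still matters for raw data.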

## Train/test split
We reserve 20% of the data for testing and fix `random_state` so the split is reproducible.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape

## Train Ridge models with different `alpha` values
We use a Pipeline so that scaling is part of the model: the scaler is fit on the training data only, and the same transformation is reapplied at prediction time.

In [None]:
alphas = [0.1, 1.0, 10.0]
results = {}

for alpha in alphas:
    model = build_ridge_model(alpha=alpha)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[alpha] = regression_metrics(y_test, pd.Series(preds, index=y_test.index))

results
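
Because `build_ridge_model` and `regression_metrics` are imported from `app/`, their bodies are not visible in this notebook. The sketch below shows one plausible shape for them, assuming the model is a `StandardScaler` followed by `Ridge` and the metrics are MSE, MAE, and R². The `_sketch` function names, the step names, and the dictionary keys are all hypothetical.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_ridge_model_sketch(alpha=1.0):
    """Hypothetical stand-in for app.model.build_ridge_model."""
    # The scaler is fit inside the pipeline, so it only sees training rows.
    return Pipeline([("scaler", StandardScaler()), ("ridge", Ridge(alpha=alpha))])


def regression_metrics_sketch(y_true, y_pred):
    """Hypothetical stand-in for app.evaluate.regression_metrics."""
    return {
        "mse": float(mean_squared_error(y_true, y_pred)),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }


# Same split settings as the notebook: 20% test, random_state=42.
X_s, y_s = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_s, y_s, test_size=0.2, random_state=42)

sketch_results = {}
for a in [0.1, 1.0, 10.0]:
    model_s = build_ridge_model_sketch(alpha=a).fit(X_tr, y_tr)
    sketch_results[a] = regression_metrics_sketch(y_te, model_s.predict(X_te))

for a, metrics in sketch_results.items():
    print(a, metrics)
```

If `regression_metrics` does return a dictionary per `alpha`, `pd.DataFrame(results).T` is a convenient way to view the comparison as a table.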

## Visualize predicted vs actual for `alpha=1.0`
The closer the points lie to the dashed diagonal (where predicted equals actual), the better the predictions.

In [None]:
alpha = 1.0
model = build_ridge_model(alpha=alpha)
model.fit(X_train, y_train)
preds = pd.Series(model.predict(X_test), index=y_test.index)

plt.figure(figsize=(6, 6))
sns.scatterplot(x=y_test, y=preds, alpha=0.7, edgecolor="white")
line_min, line_max = min(y_test.min(), preds.min()), max(y_test.max(), preds.max())
plt.plot([line_min, line_max], [line_min, line_max], linestyle="--", color="red")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title(f"Ridge Regression (alpha={alpha})")
plt.tight_layout()
plt.show()

## Next steps

* Try more `alpha` values and plot how metrics change.
* Use cross-validation (`RidgeCV`) to pick an optimal `alpha`.
* Add feature importance analysis or coefficient inspection.
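
As a starting point for the last two items, the sketch below uses `RidgeCV` inside a pipeline to choose `alpha` from a log-spaced grid, then lists the fitted coefficients by absolute size. It loads the data directly with `load_diabetes` rather than through `app/`, and the grid, step names, and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_diabetes(return_X_y=True, as_frame=True)
X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(
    X_cv, y_cv, test_size=0.2, random_state=42
)

# Illustrative grid: 13 values from 1e-3 to 1e3 (an assumption).
candidate_alphas = np.logspace(-3, 3, 13)
cv_model = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", RidgeCV(alphas=candidate_alphas)),
])
cv_model.fit(X_cv_train, y_cv_train)

ridge_cv = cv_model.named_steps["ridge"]
print("selected alpha:", ridge_cv.alpha_)
print("test R^2:", cv_model.score(X_cv_test, y_cv_test))

# Coefficients are on the standardized scale, so magnitudes are comparable.
coef_by_feature = sorted(
    zip(X_cv.columns, ridge_cv.coef_), key=lambda pair: abs(pair[1]), reverse=True
)
for name, coef in coef_by_feature:
    print(f"{name:>4}: {coef:+.2f}")
```

Here the scaler is fit once on all training rows before `RidgeCV` runs its internal cross-validation. For stricter isolation between folds, wrapping the whole pipeline in `GridSearchCV` is an alternative.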