# Week 9 — Gradient Boost (Scikit‑learn)
*Generated: 2025-11-02 04:58:46*

This notebook applies **Gradient Boosting** concepts to your Integrated Capstone dataset. It covers:
- Gradient boosting for **regression or classification** (auto‑detected).
- Key hyperparameters: **learning rate**, **number of estimators**, **tree depth/leaves**, **subsample**, **max_features**.
- Regularization: **min_samples_leaf**, **max_depth / max_leaf_nodes**, **subsample**, **max_features**, **L2** (via Histogram‑based GBDT).
- **Cross‑validation**, **early stopping** (for `HistGradientBoosting*`), and **hyperparameter tuning**.
- Diagnostics: performance metrics, curves, and **permutation importance**.
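
To make the roles of **learning rate** and **number of estimators** concrete, here is a minimal from-scratch sketch of boosting for squared-error loss, where the negative gradient is simply the residual. It is an illustration only, not scikit-learn's implementation; the synthetic data and the names `X_demo`, `y_demo`, and `mse_trace` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption: stands in for your capstone features/target)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=42)

learning_rate, n_estimators = 0.1, 50

# Start from a constant prediction (the target mean)
pred = np.full_like(y_demo, y_demo.mean())
mse_trace = [np.mean((y_demo - pred) ** 2)]

for _ in range(n_estimators):
    residuals = y_demo - pred  # negative gradient of 1/2 * squared error
    tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_demo, residuals)
    pred += learning_rate * tree.predict(X_demo)  # shrink each tree's contribution
    mse_trace.append(np.mean((y_demo - pred) ** 2))

print(f"Training MSE: start={mse_trace[0]:.1f}, end={mse_trace[-1]:.1f}")
```

A smaller `learning_rate` makes each tree contribute less, so more trees are typically needed to reach the same training error.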

> **Tip:** If your capstone CSV is not ready, enable the demo toggle below to run with a built‑in dataset so the notebook executes end‑to‑end.


In [None]:

# === Setup ===
# Toggle to use a demo dataset if your CSV isn't ready yet.
USE_DEMO_DATA = True  # set to False when you have your own CSV

# Filepath to your project CSV (only used if USE_DEMO_DATA = False)
CSV_PATH = "path/to/your_capstone.csv"

# Name of your target column in the CSV (used if not using demo data)
TARGET_COL = "YourTargetColumnName"

# Random seed for reproducibility
RANDOM_STATE = 42

# CV folds
CV_FOLDS = 5


In [None]:

# === Imports ===
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, KFold, StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    accuracy_score, f1_score, roc_auc_score, precision_score, recall_score
)
from sklearn.inspection import permutation_importance
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
# HistGradientBoosting* has been stable since scikit-learn 1.0, so no experimental import is needed.
from sklearn.ensemble import HistGradientBoostingRegressor, HistGradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint, uniform

import warnings
warnings.filterwarnings("ignore")
np.random.seed(RANDOM_STATE)


## 1) Load Data

In [None]:

if USE_DEMO_DATA:
    # DEMO_TASK selects which demo dataset to load; the task type itself is detected from the target in Section 2
    from sklearn.datasets import load_breast_cancer, fetch_california_housing

    DEMO_TASK = "classification"  # change to "regression" to see regression flow

    if DEMO_TASK == "classification":
        ds = load_breast_cancer(as_frame=True)
        df = ds.frame.copy()
        TARGET_COL = "target"
    else:
        ds = fetch_california_housing(as_frame=True)
        df = ds.frame.copy()
        TARGET_COL = "MedHouseVal"
else:
    df = pd.read_csv(CSV_PATH)

print("Shape:", df.shape)
df.head()


## 2) Quick EDA & Target Type Detection

In [None]:

display(df.describe(include="all").T)

y = df[TARGET_COL]
X = df.drop(columns=[TARGET_COL])

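# Heuristic to detect classification vs regression
if pd.api.types.is_bool_dtype(y) or not pd.api.types.is_numeric_dtype(y):
    # Booleans, strings, and categoricals can only be class labels
    TASK = "classification"
elif y.nunique() <= 10 and set(y.dropna().unique()).issubset({0, 1}):
    TASK = "classification"
elif y.dtype.kind in {"i","u"} and y.nunique() <= 10:
    TASK = "classification"
else:
    TASK = "regression"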

print("Detected task:", TASK, "| Target:", TARGET_COL, "| Unique target values:", y.nunique())
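
The detection heuristic is easier to sanity-check when wrapped in a function and run on a few toy targets. The sketch below is one possible packaging; `detect_task` and its `max_classes` parameter are hypothetical names introduced here, and the 10-class cutoff is a rule of thumb rather than a guarantee.

```python
import pandas as pd

def detect_task(y: pd.Series, max_classes: int = 10) -> str:
    """Guess 'classification' or 'regression' from a target column (heuristic)."""
    # Booleans, strings, and categoricals can only be class labels
    if pd.api.types.is_bool_dtype(y) or not pd.api.types.is_numeric_dtype(y):
        return "classification"
    # 0/1 targets, even when stored as floats
    if set(y.dropna().unique()).issubset({0, 1}):
        return "classification"
    # Integer-coded labels with only a few distinct values
    if pd.api.types.is_integer_dtype(y) and y.nunique() <= max_classes:
        return "classification"
    return "regression"

print(detect_task(pd.Series([0, 1, 1, 0])))        # integer 0/1 labels
print(detect_task(pd.Series(["yes", "no", "no"])))  # string labels
print(detect_task(pd.Series([0.5, 1.2, 3.3])))      # continuous values
```

If the heuristic guesses wrong for your data (for example, an integer count target with few distinct values), set `TASK` by hand.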


## 3) Train/Validation Split

In [None]:

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE,
    stratify=y if TASK == "classification" else None
)
print(X_train.shape, X_valid.shape)


## 4) Preprocessing Pipeline

In [None]:

# Column splits
num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = X_train.select_dtypes(include=["object","category","bool"]).columns.tolist()
print(f"Numeric cols: {len(num_cols)} | Categorical cols: {len(cat_cols)}")

numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler(with_mean=False))  # tree models don't need scaling; kept to be safe for mixed pipelines
])

categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

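preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, num_cols),
        ("cat", categorical_pipe, cat_cols)
    ],
    # Force dense output: one-hot encoding can make the result sparse, and the
    # HistGradientBoosting* estimators used later expect dense arrays.
    sparse_threshold=0.0
)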


## 5) Baseline Gradient Boosting

In [None]:

if TASK == "regression":
    base_est = GradientBoostingRegressor(random_state=RANDOM_STATE)
else:
    base_est = GradientBoostingClassifier(random_state=RANDOM_STATE)

pipe_base = Pipeline([("prep", preprocess), ("gb", base_est)])
pipe_base.fit(X_train, y_train)

def evaluate(pipe, X_tr, y_tr, X_te, y_te, task):
    pred_tr = pipe.predict(X_tr)
    pred_te = pipe.predict(X_te)
    if task == "classification":
        metrics = {
            "accuracy_tr": accuracy_score(y_tr, pred_tr),
            "accuracy_te": accuracy_score(y_te, pred_te),
            "f1_te": f1_score(y_te, pred_te, average="binary" if y_te.nunique()==2 else "macro"),
        }
        # AUC if possible
        try:
            proba = pipe.predict_proba(X_te)[:,1]
            metrics["roc_auc_te"] = roc_auc_score(y_te, proba)
        except Exception:
            pass
    else:
        metrics = {
            # np.sqrt(MSE) works across versions; the `squared=False` argument is deprecated in newer scikit-learn
            "rmse_tr": np.sqrt(mean_squared_error(y_tr, pred_tr)),
            "rmse_te": np.sqrt(mean_squared_error(y_te, pred_te)),
            "mae_te": mean_absolute_error(y_te, pred_te),
            "r2_te": r2_score(y_te, pred_te),
        }
    return metrics

metrics_base = evaluate(pipe_base, X_train, y_train, X_valid, y_valid, TASK)
metrics_base
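
A single train/validation split can be noisy, so a cross-validated estimate is a useful companion to the baseline metrics. The sketch below uses `cross_val_score` on synthetic data; `X_demo`, `y_demo`, and `cv_demo` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

# Stratified folds keep the class balance similar in every fold
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    GradientBoostingClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=cv_demo, scoring="accuracy",
)
print(f"CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```

In this notebook, the analogous call would pass `pipe_base` with `X_train` and `y_train`, using `KFold` and a regression scorer when `TASK == "regression"`.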


## 6) Learning Rate × Number of Estimators (Bias–Variance Trade‑off)

In [None]:

lrs = [0.02, 0.05, 0.1, 0.2]
nest_opts = [100, 300, 600]

results = []
for lr in lrs:
    for ne in nest_opts:
        if TASK == "regression":
            est = GradientBoostingRegressor(learning_rate=lr, n_estimators=ne, random_state=RANDOM_STATE)
        else:
            est = GradientBoostingClassifier(learning_rate=lr, n_estimators=ne, random_state=RANDOM_STATE)
        pipe = Pipeline([("prep", preprocess), ("gb", est)])
        pipe.fit(X_train, y_train)
        m = evaluate(pipe, X_train, y_train, X_valid, y_valid, TASK)
        m["learning_rate"] = lr
        m["n_estimators"] = ne
        results.append(m)

df_lr = pd.DataFrame(results)
display(df_lr.sort_values(by=["learning_rate", "n_estimators"]).reset_index(drop=True))

# Simple visualization: plot validation score vs n_estimators for each learning rate
metric_col, metric_label = ("accuracy_te", "Accuracy (valid)") if TASK == "classification" else ("rmse_te", "RMSE (valid)")
plt.figure(figsize=(8,5))
for lr in lrs:
    sub = df_lr[df_lr["learning_rate"] == lr]
    # Plot directly from `sub` so the target Series `y` from Section 2 is not overwritten
    plt.plot(sub["n_estimators"].values, sub[metric_col].values, marker="o", label=f"lr={lr}")
plt.ylabel(metric_label)
plt.xlabel("n_estimators")
plt.title("Learning Rate vs n_estimators (Validation)")
plt.legend()
plt.show()
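
Refitting for every `n_estimators` value is not strictly necessary: `staged_predict` yields predictions after each boosting stage of a single fitted model. The sketch below traces validation RMSE per stage on synthetic data; the data and the names `gbr`, `staged_rmse`, and `best_n` are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption)
X_demo, y_demo = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

gbr = GradientBoostingRegressor(learning_rate=0.1, n_estimators=200, random_state=42)
gbr.fit(X_tr, y_tr)

# One validation RMSE per stage: after 1 tree, 2 trees, ..., 200 trees
staged_rmse = np.array(
    [np.sqrt(mean_squared_error(y_va, p)) for p in gbr.staged_predict(X_va)]
)
best_n = int(staged_rmse.argmin()) + 1
print(f"Lowest validation RMSE {staged_rmse.min():.2f} at n_estimators={best_n}")
```

`Pipeline` objects do not expose `staged_predict`, so with the pipelines in this notebook you would first transform the validation features with the fitted `"prep"` step and then call `staged_predict` on the fitted `"gb"` step.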


## 7) Tree Depth & Leaves (Regularization)

In [None]:

depths = [2, 3, 4, 6]
min_leaves = [1, 5, 10]

rows = []
for d in depths:
    for ml in min_leaves:
        if TASK == "regression":
            est = GradientBoostingRegressor(max_depth=d, min_samples_leaf=ml, random_state=RANDOM_STATE)
        else:
            est = GradientBoostingClassifier(max_depth=d, min_samples_leaf=ml, random_state=RANDOM_STATE)
        pipe = Pipeline([("prep", preprocess), ("gb", est)])
        pipe.fit(X_train, y_train)
        m = evaluate(pipe, X_train, y_train, X_valid, y_valid, TASK)
        m.update({"max_depth": d, "min_samples_leaf": ml})
        rows.append(m)

df_depth = pd.DataFrame(rows)
display(df_depth)

# Plot depth vs metric at best min_samples_leaf
if TASK == "classification":
    best_ml = df_depth.sort_values("accuracy_te", ascending=False).iloc[0]["min_samples_leaf"]
    sub = df_depth[df_depth["min_samples_leaf"] == best_ml]
    plt.figure(figsize=(8,5))
    plt.plot(sub["max_depth"], sub["accuracy_te"], marker="o")
    plt.xlabel("max_depth")
    plt.ylabel("Accuracy (valid)")
    plt.title(f"Depth vs Accuracy (min_samples_leaf={int(best_ml)})")
    plt.show()
else:
    best_ml = df_depth.sort_values("rmse_te", ascending=True).iloc[0]["min_samples_leaf"]
    sub = df_depth[df_depth["min_samples_leaf"] == best_ml]
    plt.figure(figsize=(8,5))
    plt.plot(sub["max_depth"], sub["rmse_te"], marker="o")
    plt.xlabel("max_depth")
    plt.ylabel("RMSE (valid)")
    plt.title(f"Depth vs RMSE (min_samples_leaf={int(best_ml)})")
    plt.show()


## 8) Stochastic Gradient Boosting (subsample, max_features)

In [None]:

subsamps = [0.6, 0.8, 1.0]
max_feats = [None, 0.5, 0.8]

rows = []
for ss in subsamps:
    for mf in max_feats:
        if TASK == "regression":
            est = GradientBoostingRegressor(subsample=ss, max_features=mf, random_state=RANDOM_STATE)
        else:
            est = GradientBoostingClassifier(subsample=ss, max_features=mf, random_state=RANDOM_STATE)
        pipe = Pipeline([("prep", preprocess), ("gb", est)])
        pipe.fit(X_train, y_train)
        m = evaluate(pipe, X_train, y_train, X_valid, y_valid, TASK)
        m.update({"subsample": ss, "max_features": mf})
        rows.append(m)

df_stoch = pd.DataFrame(rows)
display(df_stoch)
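
When `subsample < 1.0`, each tree is fit on a random subset of rows, and scikit-learn records the loss improvement on the left-out rows in `oob_improvement_`. The sketch below accumulates that signal as a rough guide to how many trees are useful; the synthetic data and the names `gbc`, `cum_oob`, and `oob_best_n` are assumptions. Out-of-bag estimates tend to be pessimistic, so treat the result as a hint rather than a replacement for validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# subsample < 1.0 turns on stochastic gradient boosting and OOB tracking
gbc = GradientBoostingClassifier(subsample=0.8, n_estimators=100, random_state=42)
gbc.fit(X_demo, y_demo)

# Cumulative OOB improvement; its peak suggests a reasonable number of trees
cum_oob = np.cumsum(gbc.oob_improvement_)
oob_best_n = int(cum_oob.argmax()) + 1
print(f"OOB-suggested n_estimators: {oob_best_n}")
```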


## 9) Histogram-based GBDT with Early Stopping & L2

In [None]:

if TASK == "regression":
    est = HistGradientBoostingRegressor(
        learning_rate=0.1,
        max_depth=None,            # use max_leaf_nodes instead for HGBDT
        max_leaf_nodes=31,
        min_samples_leaf=20,
        l2_regularization=0.0,     # try >0 for more regularization
        early_stopping=True,
        random_state=RANDOM_STATE
    )
else:
    est = HistGradientBoostingClassifier(
        learning_rate=0.1,
        max_depth=None,
        max_leaf_nodes=31,
        min_samples_leaf=20,
        l2_regularization=0.0,
        early_stopping=True,
        random_state=RANDOM_STATE
    )

pipe_hgb = Pipeline([("prep", preprocess), ("hgb", est)])
pipe_hgb.fit(X_train, y_train)
metrics_hgb = evaluate(pipe_hgb, X_train, y_train, X_valid, y_valid, TASK)
metrics_hgb


## 10) Hyperparameter Tuning (RandomizedSearchCV)

In [None]:

# The search space is identical for both tasks; only the estimator class differs
HGB = HistGradientBoostingRegressor if TASK == "regression" else HistGradientBoostingClassifier
model = HGB(random_state=RANDOM_STATE, early_stopping=True)
param_dist = {
    "hgb__learning_rate": loguniform(1e-3, 3e-1),
    "hgb__max_leaf_nodes": randint(15, 63),
    "hgb__min_samples_leaf": randint(5, 60),
    "hgb__l2_regularization": loguniform(1e-4, 1e-1)
}

pipe = Pipeline([("prep", preprocess), ("hgb", model)])

if TASK == "classification":
    cv = StratifiedKFold(n_splits=CV_FOLDS, shuffle=True, random_state=RANDOM_STATE)
    # "roc_auc" only supports binary targets; fall back to one-vs-rest AUC for multiclass
    scoring = "roc_auc" if y_train.nunique() == 2 else "roc_auc_ovr"
else:
    cv = KFold(n_splits=CV_FOLDS, shuffle=True, random_state=RANDOM_STATE)
    scoring = "neg_root_mean_squared_error"

search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_dist,
    n_iter=30,
    scoring=scoring,
    cv=cv,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    verbose=0
)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV score:", search.best_score_)

best_pipe = search.best_estimator_
tuned_metrics = evaluate(best_pipe, X_train, y_train, X_valid, y_valid, TASK)
tuned_metrics


## 11) Permutation Importance (Validation Set)

In [None]:

perm = permutation_importance(best_pipe, X_valid, y_valid, n_repeats=10, random_state=RANDOM_STATE, n_jobs=-1)
imp_idx = perm.importances_mean.argsort()[::-1]
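# permutation_importance shuffles the raw input columns of X_valid (the pipeline is treated as a black box),
# so the importances align with X_valid.columns, not with the transformed (e.g., one-hot) feature names.
feature_names = list(X_valid.columns)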
imp_df = pd.DataFrame({
    "feature": np.array(feature_names)[imp_idx],
    "importance_mean": perm.importances_mean[imp_idx],
    "importance_std": perm.importances_std[imp_idx]
})
display(imp_df.head(25))

plt.figure(figsize=(8,6))
topn = min(20, len(imp_df))
plt.barh(imp_df["feature"][:topn][::-1], imp_df["importance_mean"][:topn][::-1])
plt.xlabel("Permutation Importance (mean decrease in score)")
plt.title("Top Features (Validation)")
plt.tight_layout()
plt.show()


## 12) Diagnostics

In [None]:

if TASK == "regression":
    y_pred = best_pipe.predict(X_valid)
    residuals = y_valid - y_pred
    plt.figure(figsize=(7,5))
    plt.scatter(y_pred, residuals, alpha=0.6)
    plt.axhline(0, linestyle="--")
    plt.xlabel("Predicted")
    plt.ylabel("Residual (y - ŷ)")
    plt.title("Residuals vs Predicted (Validation)")
    plt.show()
else:
    # Rough calibration check: histogram of predicted probabilities for class index 1 (most meaningful for binary targets)
    try:
        proba = best_pipe.predict_proba(X_valid)[:,1]
        plt.figure(figsize=(7,5))
        plt.hist(proba, bins=20)
        plt.xlabel("Predicted probability (positive class)")
        plt.ylabel("Count")
        plt.title("Predicted Probabilities — Validation")
        plt.show()
    except Exception as e:
        print("Could not compute probabilities:", e)



## 13) Results Summary (Fill These In)
- **Task:** (classification or regression, as reported by `TASK`) on **`TARGET_COL`** (your target column).
- **Baseline GradientBoosting:** (report metrics from `metrics_base`).
- **HistGradientBoosting (early stopping):** (report `metrics_hgb`).
- **Tuned model (RandomizedSearchCV):** (report `tuned_metrics` and best hyperparameters).

**What helped avoid overfitting?**  
- Appropriate **tree depth / min_samples_leaf** limited complexity.  
- **Subsample** and **max_features** (stochastic GB) added randomness → less variance.  
- **Early stopping** (HGBDT) halted boosting once the score on the internal validation split stopped improving.  
- **Cross‑validation** guided robust hyperparameter selection.  
- **L2 regularization** (HGBDT) penalized large leaf values, shrinking each tree's contribution.

**What metrics did you use and why?**  
- **Classification:** Accuracy / F1; **ROC AUC** to capture ranking quality under class imbalance.  
- **Regression:** **RMSE** (penalizes large errors more heavily), **MAE** (less sensitive to outliers than RMSE), **R²** (share of target variance explained).
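
The difference between RMSE and MAE is easiest to see on a small worked example. The numbers below are made up for illustration: five predictions that are each off by 1, and then the same predictions with one error enlarged to 10.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])

# Every prediction is off by exactly 1
y_pred_clean = y_true + np.array([1.0, -1.0, 1.0, -1.0, 1.0])

# Same predictions, but the last one is off by 10
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = y_true[-1] + 10.0

print(f"clean:   RMSE={rmse(y_true, y_pred_clean):.2f}, MAE={mae(y_true, y_pred_clean):.2f}")
print(f"outlier: RMSE={rmse(y_true, y_pred_outlier):.2f}, MAE={mae(y_true, y_pred_outlier):.2f}")
# clean:   RMSE=1.00, MAE=1.00
# outlier: RMSE=4.56, MAE=2.80
```

One large error multiplies RMSE by about 4.6 but MAE by only 2.8, which is why reporting both helps reveal whether a few large misses dominate the error.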

**Expected vs. unexpected findings:**  
- Note any surprising feature importances, diminishing returns from very small learning rates, or overfitting when trees are too deep.

**How did EDA help?**  
- Informed which variables to treat as categorical, guided imputation strategies, and highlighted outliers, to which gradient boosting with a squared-error loss can be sensitive.

**Citations you used this week (add APA references in your write‑up):**  
- Scikit‑learn user guide on Gradient Boosting and Histogram‑based Gradient Boosting.
- Any additional articles, docs, or tutorials you consulted.

> _Reminder_: In your Week 12 summary, highlight where you went **deep** (e.g., Week 9: Gradient Boost) per the rubric.
