# Model Training Walkthrough (RandomForestRegressor)

This notebook trains a **RandomForestRegressor** on the processed datasets created by the *prepare* step,
evaluates it, and saves artifacts (model + reports). It mirrors the logic in `src/train.py` but with detailed
explanations and inline outputs for readability.

**What you'll see:**
- Load **params.yaml** (for paths and model hyperparameters)
- Load **processed CSVs** (`X_train.csv`, `X_test.csv`, `y_train.csv`, `y_test.csv`)
- Train **RandomForestRegressor**
- Compute **metrics** (RMSE, R², MSE on train/test)
- Save **artifacts** (`models/model.joblib`, `models/feature_names.json`, `reports/metrics.json`, and `reports/feature_importance.csv`)
- Visualize **feature importances**


In [None]:
from __future__ import annotations
from pathlib import Path
import json

import pandas as pd
import yaml

# Modeling
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from joblib import dump

# Plotting
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 160)

## 1. Locate the project root and `params.yaml`

We resolve the **project root** as the nearest directory, starting from the current working directory and walking upward, that contains `params.yaml`.  
This makes the notebook location-agnostic (it works whether you place it under `notebooks/` or elsewhere inside the project).

In [None]:
def find_project_root(start: Path) -> Path:
    """Walk upward from `start` to find directory containing params.yaml."""
    cur = start.resolve()
    for p in [cur, *cur.parents]:
        if (p / "params.yaml").exists():
            return p
    raise FileNotFoundError("Could not locate 'params.yaml' in current or parent directories.")

# Assume the notebook is run from its own directory; adjust as needed.
NOTEBOOK_DIR = Path.cwd()
ROOT = find_project_root(NOTEBOOK_DIR)
PARAMS_PATH = ROOT / "params.yaml"

ROOT, PARAMS_PATH

## 2. Load parameters

We read `params.yaml` to get paths, the target column, split proportions, and (optionally) model hyperparameters.

In [None]:
with open(PARAMS_PATH, "r", encoding="utf-8") as f:
    # An empty file parses to None, so fall back to an empty dict
    params = yaml.safe_load(f) or {}

# Display the full parsed configuration
params

## 3. Load processed datasets

We expect the following files from the **prepare** stage:

- `data/processed/X_train.csv`
- `data/processed/X_test.csv`
- `data/processed/y_train.csv`
- `data/processed/y_test.csv`

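The helper below reads `y_*` whether it was saved with or without a header row: it tries the default (header) read first, falls back to `header=None`, and accepts whichever result has as many rows as the matching `X_*` file.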
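
To see why the row count is a usable signal, here is a small sketch with made-up in-memory CSV text (the column name `target` and the values are illustrative, not taken from the project data). A headerless file read with the default settings loses its first value to the header row, so it comes back one row short.

```python
import io

import pandas as pd

# Hypothetical file contents, for illustration only.
with_header = "target\n1.5\n2.5\n3.5\n"
without_header = "1.5\n2.5\n3.5\n"

# Default read: the first line is treated as the header.
n_with = len(pd.read_csv(io.StringIO(with_header)))
n_without_default = len(pd.read_csv(io.StringIO(without_header)))

# header=None: every line is treated as data.
n_without_fixed = len(pd.read_csv(io.StringIO(without_header), header=None))

print(n_with, n_without_default, n_without_fixed)  # 3 2 3
```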


In [None]:
def read_processed(root: Path):
    processed = root / "data" / "processed"
    X_train = pd.read_csv(processed / "X_train.csv")
    X_test  = pd.read_csv(processed / "X_test.csv")

    def read_y(path: Path, n_expected: int) -> pd.Series:
        # Try with header (default)
        y = pd.read_csv(path).iloc[:, 0]
        if len(y) == n_expected:
            return y
        # Fallback: no header saved
        y2 = pd.read_csv(path, header=None).iloc[:, 0]
        if len(y2) == n_expected:
            return y2
        raise ValueError(
            f"Inconsistent length when reading {path}. "
            f"Got {len(y)} (header) and {len(y2)} (no header). "
            f"Expected {n_expected}."
        )

    y_train = read_y(processed / "y_train.csv", len(X_train))
    y_test  = read_y(processed / "y_test.csv", len(X_test))
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = read_processed(ROOT)

print("Shapes:")
print("X_train:", X_train.shape, "| X_test:", X_test.shape)
print("y_train:", y_train.shape, "| y_test:", y_test.shape)

X_train.head(5)

## 4. Train the RandomForest model

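We read `n_estimators`, `max_depth`, `n_jobs`, and `random_state` from the `model` section of `params.yaml`. Any missing key falls back to a default: 200 trees, unlimited depth, all cores (`n_jobs=-1`), and seed 42.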
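
The `max_depth` value needs a little care because of how YAML spells "no value": `null` (or `~`) is parsed to Python `None`, whereas a literal `None` in the file arrives as the string `"None"`. The sketch below isolates that normalization in a hypothetical helper, `parse_max_depth` (not part of `src/train.py`), using plain dicts to stand in for the parsed `model` section.

```python
from typing import Optional


def parse_max_depth(value) -> Optional[int]:
    """Map None or the string "None" to None; otherwise coerce to int."""
    if value is None or value == "None":
        return None
    return int(value)


# Plain dicts standing in for the parsed `model` section (illustrative values).
examples = [
    {},                     # key absent
    {"max_depth": None},    # YAML: max_depth: null
    {"max_depth": "None"},  # YAML: max_depth: None
    {"max_depth": 12},      # YAML: max_depth: 12
    {"max_depth": "8"},     # YAML: max_depth: "8"
]
depths = [parse_max_depth(cfg.get("max_depth")) for cfg in examples]
print(depths)  # [None, None, None, 12, 8]
```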


In [None]:
model_cfg = params.get("model", {}) or {}

def build_rf(cfg: dict) -> RandomForestRegressor:
    md = cfg.get("max_depth")
    max_depth = None if md in (None, "None") else int(md)
    rf = RandomForestRegressor(
        n_estimators=int(cfg.get("n_estimators", 200)),
        max_depth=max_depth,
        n_jobs=int(cfg.get("n_jobs", -1)),
        random_state=int(cfg.get("random_state", 42)),
    )
    return rf

rf = build_rf(model_cfg)
rf.fit(X_train, y_train)
rf

## 5. Evaluate the model

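We compute **MSE** and derive **RMSE** (the square root of MSE), and we report **R²**, each for both the train and test sets. In the returned dictionary, unsuffixed keys (`rmse`, `r2`, `mse`) hold training-set values and `*_test` keys hold test-set values.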
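
For reference, the three metrics relate as follows on a tiny hand-checkable example. This is a standard-library sketch with made-up numbers; the notebook itself uses scikit-learn's `mean_squared_error` and `r2_score`.

```python
from math import sqrt

# Made-up values, chosen so the arithmetic is easy to follow.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

n = len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # 1 + 0 + 4 = 5
mean_true = sum(y_true) / n                                 # 5.0
ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # 4 + 0 + 4 = 8

mse = ss_res / n          # mean squared error
rmse = sqrt(mse)          # same units as the target
r2 = 1 - ss_res / ss_tot  # 1 minus residual sum of squares over total sum of squares

print(round(mse, 4), round(rmse, 4), round(r2, 4))  # 1.6667 1.291 0.375
```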

In [None]:
from math import sqrt

def evaluate(model, X_tr, y_tr, X_te, y_te):
    preds_tr = model.predict(X_tr)
    preds_te = model.predict(X_te)

    mse_tr = mean_squared_error(y_tr, preds_tr)
    mse_te = mean_squared_error(y_te, preds_te)
    rmse_tr = sqrt(mse_tr)
    rmse_te = sqrt(mse_te)
    r2_tr = r2_score(y_tr, preds_tr)
    r2_te = r2_score(y_te, preds_te)

    return {
        "rmse": float(rmse_tr),
        "rmse_test": float(rmse_te),
        "r2": float(r2_tr),
        "r2_test": float(r2_te),
        "mse": float(mse_tr),
        "mse_test": float(mse_te),
    }

metrics = evaluate(rf, X_train, y_train, X_test, y_test)
metrics

## 6. Save artifacts (model + reports)

We save:
- `models/model.joblib` (the trained estimator)
- `models/feature_names.json` (for inference-time alignment)
- `reports/metrics.json` (for auditing and CI checks)
- `reports/feature_importance.csv` (optional helper for interpretation)
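
As a sketch of what "inference-time alignment" could look like, the hypothetical helper below (`align_features` is not part of the project) reorders an incoming frame to the saved column order and fails loudly when a training column is missing. The column names are invented for the example.

```python
from __future__ import annotations

import pandas as pd


def align_features(df: pd.DataFrame, feature_names: list[str]) -> pd.DataFrame:
    """Return `df` restricted to `feature_names`, in that order."""
    missing = [name for name in feature_names if name not in df.columns]
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")
    return df.loc[:, feature_names]


# In practice feature_names would be loaded from models/feature_names.json.
feature_names = ["rooms", "area"]
incoming = pd.DataFrame({"area": [50.0], "extra": [1], "rooms": [2]})

aligned = align_features(incoming, feature_names)
print(list(aligned.columns))  # ['rooms', 'area']
```

Extra columns are dropped silently here; whether that should instead be an error is a design choice this sketch does not settle.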


In [None]:
def save_artifacts(root: Path, model, X_train, metrics: dict):
    models_dir = root / "models"
    reports_dir = root / "reports"
    models_dir.mkdir(parents=True, exist_ok=True)
    reports_dir.mkdir(parents=True, exist_ok=True)

    model_path = models_dir / "model.joblib"
    dump(model, model_path)

    feature_names = list(X_train.columns)
    (models_dir / "feature_names.json").write_text(json.dumps(feature_names, indent=2))
    (reports_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))

    # Optional feature importances (only for estimators that expose them)
    fi_path = None
    importances = getattr(model, "feature_importances_", None)
    if importances is not None:
        fi = pd.DataFrame({
            "feature": feature_names,
            "importance": importances,
        }).sort_values("importance", ascending=False)
        fi_path = reports_dir / "feature_importance.csv"
        fi.to_csv(fi_path, index=False)

    return {
        "model": str(model_path),
        "feature_names": str(models_dir / "feature_names.json"),
        "metrics": str(reports_dir / "metrics.json"),
        "feature_importance": str(fi_path) if fi_path else None,
    }

artifacts = save_artifacts(ROOT, rf, X_train, metrics)
artifacts
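
Because `reports/metrics.json` is plain JSON, a CI job can gate on it with a few lines. The sketch below is one possible shape for such a check; the function names `check_metrics` and `check_metrics_file` and the threshold values are assumptions for illustration, and only the `rmse_test` and `r2_test` keys come from the `evaluate` function above.

```python
from __future__ import annotations

import json
from pathlib import Path


def check_metrics(metrics: dict, max_rmse_test: float, min_r2_test: float) -> list[str]:
    """Return human-readable failures; an empty list means the check passed."""
    failures = []
    if metrics["rmse_test"] > max_rmse_test:
        failures.append(f"rmse_test {metrics['rmse_test']:.4f} > {max_rmse_test}")
    if metrics["r2_test"] < min_r2_test:
        failures.append(f"r2_test {metrics['r2_test']:.4f} < {min_r2_test}")
    return failures


def check_metrics_file(path: Path, **thresholds) -> list[str]:
    """Load a metrics JSON file and apply the same thresholds."""
    return check_metrics(json.loads(path.read_text()), **thresholds)


# Illustrative values only, not real results.
sample = {"rmse_test": 4.2, "r2_test": 0.81}
print(check_metrics(sample, max_rmse_test=5.0, min_r2_test=0.75))  # []
print(check_metrics(sample, max_rmse_test=4.0, min_r2_test=0.90))
```

In a CI job, a non-empty result would typically be turned into a non-zero exit code.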

## 7. Visualize feature importances

We plot the top features by importance to get a sense of what the RandomForest found informative.


In [None]:
# Only plot if the attribute exists and has values
if hasattr(rf, "feature_importances_") and len(getattr(rf, "feature_importances_", [])):
    fi = pd.DataFrame({
        "feature": list(X_train.columns),
        "importance": rf.feature_importances_
    }).sort_values("importance", ascending=False)

    top_n = min(15, len(fi))
    top_fi = fi.head(top_n).iloc[::-1]  # reverse for a nicer horizontal plot

    plt.figure(figsize=(8, max(4, top_n * 0.4)))
    plt.barh(top_fi["feature"], top_fi["importance"])
    plt.title("Top Feature Importances")
    plt.xlabel("Importance")
    plt.ylabel("Feature")
    plt.tight_layout()
    plt.show()

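    # A bare expression inside an `if` block is not displayed by the notebook,
    # so print the table explicitly (flipped back to descending order).
    print(top_fi.iloc[::-1].to_string(index=False))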
else:
    print("Model has no feature_importances_ attribute; skipping plot.")

## 8. Recap

- We loaded configuration from `params.yaml` (paths + model hyperparameters).
- We read the processed train/test splits from `data/processed/`.
- We trained a `RandomForestRegressor`, computed metrics (RMSE, R², MSE), and saved artifacts.
- We visualized **feature importances** for quick interpretation.

This notebook is presentation-friendly and mirrors the logic of `src/train.py`, without its MLflow integration.