{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Exploratory Analysis (Support Notebook for C++ MOF Pipeline)\n",
        "\n",
        "> **Scope guard (important):** This notebook is **not** the main pipeline.  \n",
        "> Use it only for:\n",
        "> - data inspection\n",
        "> - plotting\n",
        "> - quick experiments\n",
        "> - validating outputs produced by the **C++ pipeline**\n",
        "\n",
        "The production workflow remains:\n",
        "\n",
        "`preprocessing -> feature engineering -> modeling -> evaluation -> reports` (implemented in C++).\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 1) Configuration\n",
        "\n",
        "Set paths to your raw data and C++ output folders.  \n",
        "This notebook is intentionally lightweight and non-destructive.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from pathlib import Path\n",
        "import re\n",
        "import math\n",
        "import json\n",
        "import warnings\n",
        "\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "from IPython.display import display\n",
        "\n",
        "# ----------------------------\n",
        "# User-adjustable paths\n",
        "# ----------------------------\n",
        "PROJECT_ROOT = Path(\"..\").resolve()   # notebook lives in notebooks/\n",
        "RAW_DATA_CSV = PROJECT_ROOT / \"data\" / \"raw\" / \"mof_descriptors.csv\"\n",
        "\n",
        "# Matches the src/main.cpp output layout from the C++ pipeline\n",
        "CPP_OUTPUT_ROOT = PROJECT_ROOT / \"output\"\n",
        "CPP_PROCESSED_CSV = CPP_OUTPUT_ROOT / \"data\" / \"processed\" / \"cleaned.csv\"\n",
        "CPP_FEATURES_ALL_CSV = CPP_OUTPUT_ROOT / \"data\" / \"features\" / \"engineered_all_unscaled.csv\"\n",
        "CPP_TRAIN_SCALED_CSV = CPP_OUTPUT_ROOT / \"data\" / \"features\" / \"train_scaled.csv\"\n",
        "CPP_TEST_SCALED_CSV = CPP_OUTPUT_ROOT / \"data\" / \"features\" / \"test_scaled.csv\"\n",
        "CPP_PREDICTIONS_CSV = CPP_OUTPUT_ROOT / \"reports\" / \"predictions.csv\"\n",
        "CPP_METRICS_TXT = CPP_OUTPUT_ROOT / \"reports\" / \"metrics.txt\"\n",
        "\n",
        "# Notebook display preferences\n",
        "pd.set_option(\"display.max_columns\", 120)\n",
        "pd.set_option(\"display.width\", 180)\n",
        "\n",
        "print(\"PROJECT_ROOT:\", PROJECT_ROOT)\n",
        "print(\"RAW_DATA_CSV:\", RAW_DATA_CSV)\n",
        "print(\"CPP_OUTPUT_ROOT:\", CPP_OUTPUT_ROOT)\n"
      ]
    },
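    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before loading anything, it can help to see which of the configured files actually exist. The cell below is a minimal, read-only sketch that reuses the path variables defined above. `EXPECTED_FILES` and `list_expected_files` are illustrative names introduced here, not part of the C++ project.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from pathlib import Path\n",
        "\n",
        "import pandas as pd\n",
        "from IPython.display import display\n",
        "\n",
        "# Hypothetical mapping: label -> path configured in the cell above\n",
        "EXPECTED_FILES = {\n",
        "    \"raw input\": RAW_DATA_CSV,\n",
        "    \"cleaned\": CPP_PROCESSED_CSV,\n",
        "    \"features (all, unscaled)\": CPP_FEATURES_ALL_CSV,\n",
        "    \"train (scaled)\": CPP_TRAIN_SCALED_CSV,\n",
        "    \"test (scaled)\": CPP_TEST_SCALED_CSV,\n",
        "    \"predictions\": CPP_PREDICTIONS_CSV,\n",
        "    \"metrics report\": CPP_METRICS_TXT,\n",
        "}\n",
        "\n",
        "\n",
        "def list_expected_files(files: dict[str, Path]) -> pd.DataFrame:\n",
        "    \"\"\"Return one row per expected file with its existence flag and size in bytes.\"\"\"\n",
        "    rows = []\n",
        "    for label, path in files.items():\n",
        "        exists = path.is_file()\n",
        "        rows.append({\n",
        "            \"file\": label,\n",
        "            \"path\": str(path),\n",
        "            \"exists\": exists,\n",
        "            \"size_bytes\": path.stat().st_size if exists else None,\n",
        "        })\n",
        "    return pd.DataFrame(rows)\n",
        "\n",
        "\n",
        "display(list_expected_files(EXPECTED_FILES))\n"
      ]
    },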
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 2) Helper Functions\n",
        "\n",
        "Utilities for safe loading, quick summaries, plotting, and metric validation.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def safe_read_csv(path: Path, **kwargs) -> pd.DataFrame | None:\n",
        "    \"\"\"Read CSV safely; return None if missing/unreadable.\"\"\"\n",
        "    try:\n",
        "        if not path.exists():\n",
        "            print(f\"[WARN] File not found: {path}\")\n",
        "            return None\n",
        "        df = pd.read_csv(path, **kwargs)\n",
        "        print(f\"[OK] Loaded: {path} | shape={df.shape}\")\n",
        "        return df\n",
        "    except Exception as e:\n",
        "        print(f\"[ERROR] Failed to read {path}: {e}\")\n",
        "        return None\n",
        "\n",
        "\n",
        "def summarize_df(df: pd.DataFrame, name: str = \"DataFrame\", max_rows: int = 5) -> None:\n",
        "    if df is None:\n",
        "        print(f\"[WARN] {name}: None\")\n",
        "        return\n",
        "    print(f\"\\n=== {name} Summary ===\")\n",
        "    print(\"Shape:\", df.shape)\n",
        "    print(\"\\nDtypes:\")\n",
        "    print(df.dtypes)\n",
        "    print(\"\\nMissing values (top 20):\")\n",
        "    mv = df.isna().sum().sort_values(ascending=False)\n",
        "    print(mv.head(20))\n",
        "    print(\"\\nHead:\")\n",
        "    display(df.head(max_rows))\n",
        "    \n",
        "\n",
        "def numeric_profile(df: pd.DataFrame, name: str = \"numeric_df\") -> pd.DataFrame:\n",
        "    if df is None:\n",
        "        raise ValueError(f\"{name} is None\")\n",
        "    num = df.select_dtypes(include=[np.number])\n",
        "    if num.empty:\n",
        "        print(f\"[WARN] {name}: no numeric columns\")\n",
        "        return pd.DataFrame()\n",
        "    profile = num.describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]).T\n",
        "    profile[\"missing\"] = num.isna().sum()\n",
        "    profile[\"missing_pct\"] = profile[\"missing\"] / len(num) * 100.0\n",
        "    return profile.sort_values(\"missing\", ascending=False)\n",
        "\n",
        "\n",
        "def plot_numeric_histograms(df: pd.DataFrame, columns=None, bins: int = 40, max_cols: int = 6):\n",
        "    \"\"\"Quick histogram plots for selected numeric columns (single-plot style, one column per figure).\"\"\"\n",
        "    if df is None:\n",
        "        print(\"[WARN] DataFrame is None\")\n",
        "        return\n",
        "    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()\n",
        "    if columns is None:\n",
        "        columns = num_cols[:max_cols]\n",
        "    else:\n",
        "        columns = [c for c in columns if c in df.columns]\n",
        "    if not columns:\n",
        "        print(\"[WARN] No numeric columns selected.\")\n",
        "        return\n",
        "\n",
        "    for col in columns:\n",
        "        plt.figure(figsize=(6, 4))\n",
        "        series = df[col].dropna()\n",
        "        plt.hist(series, bins=bins)\n",
        "        plt.title(f\"Histogram: {col}\")\n",
        "        plt.xlabel(col)\n",
        "        plt.ylabel(\"Count\")\n",
        "        plt.tight_layout()\n",
        "        plt.show()\n",
        "\n",
        "\n",
        "def plot_scatter_true_vs_pred(df_pred: pd.DataFrame):\n",
        "    required = {\"y_true\", \"y_pred\"}\n",
        "    if df_pred is None or not required.issubset(df_pred.columns):\n",
        "        print(\"[WARN] predictions DataFrame missing required columns:\", required)\n",
        "        return\n",
        "\n",
        "    x = pd.to_numeric(df_pred[\"y_true\"], errors=\"coerce\")\n",
        "    y = pd.to_numeric(df_pred[\"y_pred\"], errors=\"coerce\")\n",
        "    mask = x.notna() & y.notna()\n",
        "    x = x[mask]\n",
        "    y = y[mask]\n",
        "    if x.empty:\n",
        "        print(\"[WARN] No valid numeric y_true/y_pred rows.\")\n",
        "        return\n",
        "\n",
        "    plt.figure(figsize=(6, 6))\n",
        "    plt.scatter(x, y, alpha=0.8)\n",
        "    xy_min = float(min(x.min(), y.min()))\n",
        "    xy_max = float(max(x.max(), y.max()))\n",
        "    plt.plot([xy_min, xy_max], [xy_min, xy_max])  # y=x reference line\n",
        "    plt.xlabel(\"y_true\")\n",
        "    plt.ylabel(\"y_pred\")\n",
        "    plt.title(\"Predicted vs True (C++ output)\")\n",
        "    plt.tight_layout()\n",
        "    plt.show()\n",
        "\n",
        "\n",
        "def plot_residuals(df_pred: pd.DataFrame):\n",
        "    if df_pred is None:\n",
        "        print(\"[WARN] predictions DataFrame is None\")\n",
        "        return\n",
        "    if \"error\" in df_pred.columns:\n",
        "        residual = pd.to_numeric(df_pred[\"error\"], errors=\"coerce\")\n",
        "    elif {\"y_true\", \"y_pred\"}.issubset(df_pred.columns):\n",
        "        residual = pd.to_numeric(df_pred[\"y_pred\"], errors=\"coerce\") - pd.to_numeric(df_pred[\"y_true\"], errors=\"coerce\")\n",
        "    else:\n",
        "        print(\"[WARN] predictions DataFrame lacks residual/error columns.\")\n",
        "        return\n",
        "\n",
        "    residual = residual.dropna()\n",
        "    if residual.empty:\n",
        "        print(\"[WARN] No numeric residuals found.\")\n",
        "        return\n",
        "\n",
        "    plt.figure(figsize=(6, 4))\n",
        "    plt.hist(residual, bins=40)\n",
        "    plt.xlabel(\"Residual (y_pred - y_true)\")\n",
        "    plt.ylabel(\"Count\")\n",
        "    plt.title(\"Residual Distribution (C++ output)\")\n",
        "    plt.tight_layout()\n",
        "    plt.show()\n",
        "\n",
        "\n",
        "def rmse_np(y_true: np.ndarray, y_pred: np.ndarray) -> float:\n",
        "    e = y_true - y_pred\n",
        "    return float(np.sqrt(np.mean(e * e)))\n",
        "\n",
        "\n",
        "def mae_np(y_true: np.ndarray, y_pred: np.ndarray) -> float:\n",
        "    return float(np.mean(np.abs(y_true - y_pred)))\n",
        "\n",
        "\n",
        "def r2_np(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> float:\n",
        "    y_bar = float(np.mean(y_true))\n",
        "    ss_res = float(np.sum((y_true - y_pred) ** 2))\n",
        "    ss_tot = float(np.sum((y_true - y_bar) ** 2))\n",
        "    if ss_tot <= eps:\n",
        "        return 0.0  # mirror fallback style from C++ implementation\n",
        "    return float(1.0 - ss_res / ss_tot)\n",
        "\n",
        "\n",
        "def mape_np(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> tuple[float | None, int, int]:\n",
        "    denom = np.abs(y_true)\n",
        "    usable = denom > eps\n",
        "    used_count = int(np.sum(usable))\n",
        "    skipped_count = int(np.sum(~usable))\n",
        "    if used_count == 0:\n",
        "        return None, used_count, skipped_count\n",
        "    val = float(np.mean(np.abs((y_true[usable] - y_pred[usable]) / denom[usable])) * 100.0)\n",
        "    return val, used_count, skipped_count\n",
        "\n",
        "\n",
        "def recompute_metrics_from_predictions(df_pred: pd.DataFrame) -> dict:\n",
        "    required = {\"y_true\", \"y_pred\"}\n",
        "    if df_pred is None or not required.issubset(df_pred.columns):\n",
        "        raise ValueError(\"Predictions CSV must contain y_true and y_pred columns\")\n",
        "\n",
        "    y_true = pd.to_numeric(df_pred[\"y_true\"], errors=\"coerce\").to_numpy(dtype=float)\n",
        "    y_pred = pd.to_numeric(df_pred[\"y_pred\"], errors=\"coerce\").to_numpy(dtype=float)\n",
        "    mask = np.isfinite(y_true) & np.isfinite(y_pred)\n",
        "    y_true = y_true[mask]\n",
        "    y_pred = y_pred[mask]\n",
        "    if y_true.size == 0:\n",
        "        raise ValueError(\"No valid numeric rows in predictions CSV\")\n",
        "\n",
        "    mape_val, mape_used, mape_skipped = mape_np(y_true, y_pred)\n",
        "    return {\n",
        "        \"n\": int(y_true.size),\n",
        "        \"RMSE\": rmse_np(y_true, y_pred),\n",
        "        \"MAE\": mae_np(y_true, y_pred),\n",
        "        \"R2\": r2_np(y_true, y_pred),\n",
        "        \"MAPE\": mape_val,\n",
        "        \"MAPE_used\": mape_used,\n",
        "        \"MAPE_skipped\": mape_skipped,\n",
        "    }\n",
        "\n",
        "\n",
        "def parse_cpp_metrics_txt(path: Path) -> dict:\n",
        "    \"\"\"\n",
        "    Parse the C++ metrics report text (metrics.txt).\n",
        "    Looks for lines like:\n",
        "      [Test Metrics]\n",
        "      n=..., RMSE=..., MAE=..., R2=..., MAPE=...%\n",
        "    \"\"\"\n",
        "    if not path.exists():\n",
        "        print(f\"[WARN] Metrics report not found: {path}\")\n",
        "        return {}\n",
        "\n",
        "    text = path.read_text(encoding=\"utf-8\", errors=\"replace\")\n",
        "    # Capture the metrics line after [Test Metrics]\n",
        "    m_section = re.search(r\"\\[Test Metrics\\]\\s*(.+)\", text)\n",
        "    line = m_section.group(1).strip() if m_section else \"\"\n",
        "\n",
        "    parsed = {}\n",
        "    if line:\n",
        "        # Flexible parse: key=value pairs separated by commas\n",
        "        for part in [p.strip() for p in line.split(\",\")]:\n",
        "            if \"=\" not in part:\n",
        "                continue\n",
        "            k, v = part.split(\"=\", 1)\n",
        "            k = k.strip()\n",
        "            v = v.strip()\n",
        "\n",
        "            # normalize MAPE like \"12.34% (used=..., skipped=...)\"\n",
        "            if k.upper() == \"MAPE\":\n",
        "                # numeric prefix before %\n",
        "                m_num = re.match(r\"([-+]?\\d*\\.?\\d+(?:[eE][-+]?\\d+)?)%\", v)\n",
        "                if m_num:\n",
        "                    parsed[\"MAPE\"] = float(m_num.group(1))\n",
        "                else:\n",
        "                    parsed[\"MAPE\"] = None\n",
        "                m_used = re.search(r\"used=(\\d+)\", v)\n",
        "                m_skipped = re.search(r\"skipped=(\\d+)\", v)\n",
        "                if m_used:\n",
        "                    parsed[\"MAPE_used\"] = int(m_used.group(1))\n",
        "                if m_skipped:\n",
        "                    parsed[\"MAPE_skipped\"] = int(m_skipped.group(1))\n",
        "                continue\n",
        "\n",
        "            # parse plain float/int for other metrics\n",
        "            try:\n",
        "                if k == \"n\":\n",
        "                    parsed[k] = int(float(v))\n",
        "                else:\n",
        "                    parsed[k] = float(v.split()[0])\n",
        "            except Exception:\n",
        "                parsed[k] = v\n",
        "\n",
        "    return parsed\n",
        "\n",
        "\n",
        "def compare_metrics(cpp_metrics: dict, py_metrics: dict, tol: float = 1e-6) -> pd.DataFrame:\n",
        "    keys = [\"n\", \"RMSE\", \"MAE\", \"R2\", \"MAPE\", \"MAPE_used\", \"MAPE_skipped\"]\n",
        "    rows = []\n",
        "    for k in keys:\n",
        "        cv = cpp_metrics.get(k, None)\n",
        "        pv = py_metrics.get(k, None)\n",
        "        if isinstance(cv, (int, float)) and isinstance(pv, (int, float)):\n",
        "            diff = float(abs(float(cv) - float(pv)))\n",
        "            match = diff <= tol\n",
        "        else:\n",
        "            diff = None if (cv is None and pv is None) else np.nan\n",
        "            match = (cv == pv)\n",
        "        rows.append({\"metric\": k, \"cpp\": cv, \"python_recomputed\": pv, \"abs_diff\": diff, \"match\": match})\n",
        "    return pd.DataFrame(rows)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 3) Inspect Raw Input Data (optional)\n",
        "\n",
        "Use this to understand the incoming CSV before/while building the C++ pipeline.  \n",
        "This notebook should help you inspect data quality â€” not replace preprocessing logic.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "raw_df = safe_read_csv(RAW_DATA_CSV)\n",
        "if raw_df is not None:\n",
        "    summarize_df(raw_df, \"Raw Input CSV\")\n",
        "    raw_profile = numeric_profile(raw_df, \"raw_df\")\n",
        "    if not raw_profile.empty:\n",
        "        display(raw_profile.head(20))\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 4) Inspect C++ Preprocessed Output\n",
        "\n",
        "Validate what your C++ preprocessing wrote to `output/data/processed/cleaned.csv`.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "clean_df = safe_read_csv(CPP_PROCESSED_CSV)\n",
        "if clean_df is not None:\n",
        "    summarize_df(clean_df, \"C++ cleaned.csv\")\n",
        "    clean_profile = numeric_profile(clean_df, \"clean_df\")\n",
        "    if not clean_profile.empty:\n",
        "        display(clean_profile.head(20))\n",
        "        # Quick visual inspection of a few numeric columns\n",
        "        cols = clean_df.select_dtypes(include=[np.number]).columns.tolist()\n",
        "        plot_numeric_histograms(clean_df, columns=cols[:6], bins=40)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 5) Inspect C++ Feature Engineering Outputs\n",
        "\n",
        "Useful for checking feature counts, constant columns, scaling ranges, and basic distributions.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "features_all_df = safe_read_csv(CPP_FEATURES_ALL_CSV)\n",
        "train_scaled_df = safe_read_csv(CPP_TRAIN_SCALED_CSV)\n",
        "test_scaled_df = safe_read_csv(CPP_TEST_SCALED_CSV)\n",
        "\n",
        "if features_all_df is not None:\n",
        "    summarize_df(features_all_df, \"C++ engineered_all_unscaled.csv\", max_rows=3)\n",
        "\n",
        "if train_scaled_df is not None:\n",
        "    summarize_df(train_scaled_df, \"C++ train_scaled.csv\", max_rows=3)\n",
        "\n",
        "if test_scaled_df is not None:\n",
        "    summarize_df(test_scaled_df, \"C++ test_scaled.csv\", max_rows=3)\n",
        "\n",
        "# Check for NaN/Inf in scaled outputs (important sanity check)\n",
        "for name, df in [(\"train_scaled\", train_scaled_df), (\"test_scaled\", test_scaled_df)]:\n",
        "    if df is None:\n",
        "        continue\n",
        "    num = df.select_dtypes(include=[np.number])\n",
        "    if num.empty:\n",
        "        print(f\"[WARN] {name}: no numeric columns found.\")\n",
        "        continue\n",
        "    finite_mask = np.isfinite(num.to_numpy(dtype=float))\n",
        "    print(f\"[CHECK] {name}: all finite =\", bool(finite_mask.all()))\n",
        "    print(f\"[CHECK] {name}: total NaN =\", int(np.isnan(num.to_numpy(dtype=float)).sum()))\n"
      ]
    },
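    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The cell above only checks for NaN/Inf. The sketch below covers the other two checks named in this section: constant columns and scaling ranges. It makes no assumption about which scaler the C++ code uses; it only reports per-column min, max, mean, and standard deviation so you can judge the result yourself. `scaled_column_report` and its `eps` threshold are illustrative choices, not part of the pipeline.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "from IPython.display import display\n",
        "\n",
        "\n",
        "def scaled_column_report(df: pd.DataFrame, eps: float = 1e-12) -> pd.DataFrame:\n",
        "    \"\"\"Per-column min/max/mean/std plus a flag for (near-)constant columns.\"\"\"\n",
        "    num = df.select_dtypes(include=[np.number])\n",
        "    report = pd.DataFrame({\n",
        "        \"min\": num.min(),\n",
        "        \"max\": num.max(),\n",
        "        \"mean\": num.mean(),\n",
        "        \"std\": num.std(ddof=0),  # population std; pandas skips NaN by default\n",
        "    })\n",
        "    # A column is flagged constant when its observed range is within eps\n",
        "    report[\"is_constant\"] = (report[\"max\"] - report[\"min\"]) <= eps\n",
        "    return report\n",
        "\n",
        "\n",
        "for name, df in [(\"train_scaled\", train_scaled_df), (\"test_scaled\", test_scaled_df)]:\n",
        "    if df is None:\n",
        "        continue\n",
        "    report = scaled_column_report(df)\n",
        "    n_const = int(report[\"is_constant\"].sum())\n",
        "    print(f\"[CHECK] {name}: {n_const} constant column(s) out of {len(report)} numeric column(s)\")\n",
        "    display(report.head(20))\n"
      ]
    },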
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 6) Validate C++ Predictions and Metrics (Main Notebook Purpose)\n",
        "\n",
        "This is the key notebook workflow for your C++ goal:\n",
        "- read `output/reports/predictions.csv`\n",
        "- recompute metrics in Python\n",
        "- compare against C++ `metrics.txt`\n",
        "\n",
        "If these match (within tolerance), your C++ evaluation pipeline is likely correct.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "pred_df = safe_read_csv(CPP_PREDICTIONS_CSV)\n",
        "cpp_metrics = parse_cpp_metrics_txt(CPP_METRICS_TXT)\n",
        "\n",
        "if pred_df is not None:\n",
        "    summarize_df(pred_df, \"C++ predictions.csv\")\n",
        "\n",
        "    py_metrics = recompute_metrics_from_predictions(pred_df)\n",
        "    print(\"\\nPython recomputed metrics:\")\n",
        "    print(json.dumps(py_metrics, indent=2, default=str))\n",
        "\n",
        "    print(\"\\nParsed C++ metrics.txt (test metrics):\")\n",
        "    print(json.dumps(cpp_metrics, indent=2, default=str))\n",
        "\n",
        "    cmp_df = compare_metrics(cpp_metrics, py_metrics, tol=1e-6)\n",
        "    display(cmp_df)\n",
        "\n",
        "    if not cmp_df.empty and \"match\" in cmp_df.columns:\n",
        "        mismatches = cmp_df[cmp_df[\"match\"] == False]\n",
        "        if len(mismatches) == 0:\n",
        "            print(\"[OK] C++ metrics and Python recomputed metrics match within tolerance.\")\n",
        "        else:\n",
        "            print(\"[WARN] Metric mismatches found. Inspect rows above.\")\n",
        "else:\n",
        "    print(\"[INFO] Run the C++ pipeline first to generate predictions.csv and metrics.txt.\")\n"
      ]
    },
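    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "One caveat about the fixed `tol=1e-6`: the values in `metrics.txt` are text, so they carry only as many digits as the C++ code printed. If the report is written with limited precision (an assumption; check how your C++ code formats numbers), an absolute tolerance of `1e-6` can flag differences that are only rounding. The sketch below re-checks the numeric rows with a relative tolerance via `math.isclose`. `recheck_with_rel_tol` and the default `rel_tol=1e-4` are illustrative choices.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import math\n",
        "import numbers\n",
        "\n",
        "import pandas as pd\n",
        "from IPython.display import display\n",
        "\n",
        "\n",
        "def _is_finite_number(x) -> bool:\n",
        "    \"\"\"True for finite ints/floats (NumPy scalars included); False for None, NaN, bool, str.\"\"\"\n",
        "    return isinstance(x, numbers.Real) and not isinstance(x, bool) and math.isfinite(x)\n",
        "\n",
        "\n",
        "def recheck_with_rel_tol(cmp: pd.DataFrame, rel_tol: float = 1e-4, abs_tol: float = 1e-9) -> pd.DataFrame:\n",
        "    \"\"\"Return a copy of the comparison table with an extra 'match_rel' column.\"\"\"\n",
        "    out = cmp.copy()\n",
        "    verdicts = []\n",
        "    for cv, pv, old in zip(out[\"cpp\"].tolist(), out[\"python_recomputed\"].tolist(), out[\"match\"].tolist()):\n",
        "        if _is_finite_number(cv) and _is_finite_number(pv):\n",
        "            verdicts.append(math.isclose(float(cv), float(pv), rel_tol=rel_tol, abs_tol=abs_tol))\n",
        "        else:\n",
        "            verdicts.append(bool(old))  # missing or non-numeric rows keep the original verdict\n",
        "    out[\"match_rel\"] = verdicts\n",
        "    return out\n",
        "\n",
        "\n",
        "if pred_df is not None:\n",
        "    display(recheck_with_rel_tol(cmp_df))\n"
      ]
    },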
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 7) Plot Model Behavior from C++ Outputs\n",
        "\n",
        "Simple visual diagnostics (scatter + residuals) to support quick analysis.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "if pred_df is not None:\n",
        "    plot_scatter_true_vs_pred(pred_df)\n",
        "    plot_residuals(pred_df)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 8) Quick Experiments (Optional, Notebook-only)\n",
        "\n",
        "This section is intentionally small and optional.  \n",
        "It can help you **understand the data** or sanity-check trends, but it must **not replace** the C++ pipeline.\n",
        "\n",
        "Example below: correlation scan with target (if the target exists in `cleaned.csv`).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "if clean_df is not None:\n",
        "    numeric_cols = clean_df.select_dtypes(include=[np.number]).columns.tolist()\n",
        "    # Try to infer target column from predictions/metrics or report; fallback to last numeric column\n",
        "    guessed_target = None\n",
        "    if \"y_true\" in (pred_df.columns if pred_df is not None else []):\n",
        "        # predictions.csv doesn't include target name; just a reminder\n",
        "        pass\n",
        "\n",
        "    # If your C++ cleaned.csv includes target column, set it manually here for quick experiments:\n",
        "    # guessed_target = \"adsorption_capacity\"\n",
        "    # guessed_target = \"surface_area\"\n",
        "\n",
        "    if guessed_target is None:\n",
        "        print(\"[INFO] Set `guessed_target` manually to compute correlations with target.\")\n",
        "    elif guessed_target not in clean_df.columns:\n",
        "        print(f\"[WARN] guessed_target '{guessed_target}' not found in clean_df\")\n",
        "    else:\n",
        "        corr = clean_df[numeric_cols].corr(numeric_only=True)[guessed_target].sort_values(key=np.abs, ascending=False)\n",
        "        display(corr.to_frame(\"corr_with_target\").head(20))\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 9) C++-First Workflow Checklist\n",
        "\n",
        "Use this notebook as a **validation companion** after running the C++ executable.\n",
        "\n",
        "### Recommended loop\n",
        "1. Run C++ pipeline (`src/main.cpp`) on a dataset.\n",
        "2. Confirm outputs exist:\n",
        "   - `output/data/processed/cleaned.csv`\n",
        "   - `output/data/features/*.csv`\n",
        "   - `output/reports/predictions.csv`\n",
        "   - `output/reports/metrics.txt`\n",
        "3. Open this notebook and:\n",
        "   - inspect data quality\n",
        "   - inspect feature outputs\n",
        "   - recompute metrics from `predictions.csv`\n",
        "   - compare against C++ report\n",
        "4. Fix C++ code if anything looks wrong.\n",
        "5. Re-run C++ pipeline.\n",
        "\n",
        "### Anti-pattern to avoid\n",
        "- Adding full training logic in Python and treating the notebook as the real pipeline.\n",
        "\n",
        "Your goal is **C++**, so this notebook stays for analysis and validation only.\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.11"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}