<a id="top"></a>
# 01A_PREP_BALANCED: Stratified splits + **offline image-based balancing**

**What this notebook does:**  
Runs the **full preparation** for each *run* (for example, `circuito1`, `circuito2`), supporting lap **subfolders** (`vuelta1/`, `vuelta2/`, …), and adds an **offline balancing** step that produces a training set (`train_balanced.csv`) balanced across `steering` *bins*.

Generates/updates:

- `data/processed/prep_manifest.json` (preparation traceability),
- `data/processed/<run>/{canonical,train,val,test}.csv` per circuit,
- `data/processed/tasks.json` (original splits per circuit), and
- `data/processed/tasks_balanced.json` (splits that use `train_balanced.csv` when balancing is enabled).

**Key features:**

- Reads parameters from `configs/presets.yaml` (`prep` section, if present).
- Autodetects `RUNS` under `data/raw/udacity/*` when they are not defined in the preset.
- Can **merge several laps** per circuit (`merge_subruns`).
- Can **expand the L/R cameras** into center samples with a steering-angle correction (`use_left_right` + `steer_shift`).
- Builds **stratified splits** over `steering` *bins* (`train/val/test`).
- Can **balance `train` offline** by generating augmented images to fill under-represented *bins* (`train_balanced.csv`).
- Writes a `prep_manifest.json` with the **full traceability** of the preparation.
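
The L/R expansion can be pictured with a small sketch. It assumes the usual Udacity log columns (`center`, `left`, `right`, `steering`) and the common convention that left-camera frames get `+steer_shift` and right-camera frames get `-steer_shift`. Both the sign convention and the helper name `expand_left_right` are assumptions for illustration; the actual logic lives in `src.prep.data_prep` and may differ.

```python
def expand_left_right(rows, steer_shift=0.2):
    """Turn each (center, left, right, steering) row into up to three samples.

    Assumption: left frames are corrected by +steer_shift, right frames by
    -steer_shift, and the result is clipped to [-1, 1].
    """
    out = []
    for row in rows:
        s = float(row["steering"])
        for cam, delta in (("center", 0.0), ("left", steer_shift), ("right", -steer_shift)):
            path = row.get(cam)
            if not path:  # skip cameras with no frame recorded
                continue
            out.append({"center": path, "steering": max(-1.0, min(1.0, s + delta))})
    return out


rows = [{"center": "IMG/c_001.jpg", "left": "IMG/l_001.jpg",
         "right": "IMG/r_001.jpg", "steering": 0.10}]
expanded = expand_left_right(rows, steer_shift=0.2)
for sample in expanded:
    print(sample["center"], round(sample["steering"], 2))
```

Each side-camera frame is stored as if it were a center frame, which is why the number of rows grows by roughly a factor of three when no frames are missing.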

**Difference from `01_DATA_QC_PREP.ipynb`:**  
`01_DATA_QC_PREP` does **QC + splits** without offline balancing (and, by default, without L/R expansion).  
This notebook does **QC + splits** plus **offline image-based balancing** (and usually enables L/R expansion).

---

<a id="toc"></a>
## 🧭 Index
1. [Configuration and offline balancing parameters](#sec-01)
2. [Run preparation + verification and write `tasks_balanced.json`](#sec-02)
3. [Quick EDA and per-circuit summary (for the report)](#sec-03)
4. [Figures and tables for the report (dataset)](#sec-04)


<a id="sec-01"></a>
## 1) Configuration and offline balancing parameters

**Goal of this section**  
Configure data preparation reproducibly so that, for each *run* (for example, `circuito1`, `circuito2`), it generates:

- a canonical set `canonical.csv` per circuit,
- stratified splits `train/val/test.csv`, and
- optionally, a `train_balanced.csv` balanced across `steering` *bins*.
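
As an illustration of what "stratified by `steering` bins" means, here is a minimal sketch. It assumes equal-width bins over `[-1, 1]`, a per-bin shuffle with a fixed seed, and rounding of the per-bin proportions. The helper `stratified_split` is hypothetical and is not the implementation used by `run_prep`.

```python
import random
from collections import defaultdict


def stratified_split(steering, bins=50, train=0.70, val=0.15, seed=42):
    """Return (train_idx, val_idx, test_idx), stratified by steering bin."""
    by_bin = defaultdict(list)
    for i, s in enumerate(steering):
        s = max(-1.0, min(1.0, float(s)))
        b = min(int((s + 1.0) / 2.0 * bins), bins - 1)  # bin index in [0, bins - 1]
        by_bin[b].append(i)

    rng = random.Random(seed)
    tr, va, te = [], [], []
    for b in sorted(by_bin):
        idx = by_bin[b]
        rng.shuffle(idx)
        n_tr = int(round(train * len(idx)))
        n_va = int(round(val * len(idx)))
        tr += idx[:n_tr]
        va += idx[n_tr:n_tr + n_va]
        te += idx[n_tr + n_va:]  # whatever is left goes to test
    return tr, va, te


# Toy data: many near-zero angles, few sharp turns
steering = [0.0] * 100 + [0.5] * 20 + [-0.5] * 20
tr, va, te = stratified_split(steering, bins=50)
print(len(tr), len(va), len(te))
```

Because the proportions are applied inside each bin, rare steering angles appear in all three splits instead of landing in only one of them by chance.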

This cell:

- Defines `ROOT` and sets up the imports from `src.prep.data_prep` and `src.prep.augment_offline`.
- Loads (if present) the `prep` section of `configs/presets.yaml` for the chosen `PRESET`.
- Sets the base paths `RAW` (`data/raw/udacity`) and `PROC` (`data/processed`).
- Determines `RUNS`:
  - uses those defined in the preset (`prep.runs`) if any;
  - otherwise **autodetects** circuits containing at least one `driving_log.csv` under `data/raw/udacity/*` (ignoring `aug/` directories).
- Declares the preparation hyperparameters:
  - `merge_subruns`: merges lap subfolders (`vuelta1/`, `vuelta2/`, …) into a single `canonical.csv` per circuit.
  - `use_left_right` + `steer_shift`: control the expansion of the L/R cameras into additional samples with a steering-angle correction.
  - `bins`, `train`, `val`, `seed`: control the stratification over `steering` bins and the split proportions.
- Declares the **offline balancing** parameters:
  - `balance_offline.mode` (normally `"images"`),
  - `target_per_bin`, `cap_per_bin`,
  - and the augmentation settings `aug`.

A `PrepConfig` is built that **does not duplicate any rows yet** (`target_per_bin=None`, `cap_per_bin=None`): the image-based expansion (offline balancing) is done in the following step by `balance_train_with_augmented_images`.
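
The autodetection of `RUNS` can be tried in isolation on a throwaway directory tree. The sketch below assumes that a circuit is the first-level folder under the raw root and that the log may sit either directly in the circuit folder or inside a lap subfolder. `detect_runs` is a hypothetical helper written for this example.

```python
import tempfile
from pathlib import Path


def detect_runs(raw: Path) -> list[str]:
    """Circuits are the first-level folders under `raw` that hold a driving_log.csv."""
    runs = set()
    for log in raw.rglob("driving_log.csv"):
        parts = log.relative_to(raw).parts
        if "aug" in parts or len(parts) < 2:
            continue  # skip augmentation folders and logs placed directly in raw
        runs.add(parts[0])
    return sorted(runs)


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    for rel in ("circuito1/vuelta1", "circuito1/vuelta2", "circuito2", "circuito2/aug"):
        (raw / rel).mkdir(parents=True, exist_ok=True)
        (raw / rel / "driving_log.csv").touch()
    runs = detect_runs(raw)
    print(runs)
```

Taking the first component of the path relative to the raw root gives the same circuit name whether or not lap subfolders are present.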

[↑ Back to index](#toc)



In [None]:
# %% [code]
%load_ext autoreload
%autoreload 2

from pathlib import Path
import sys, json
import pandas as pd

from pprint import pprint

ROOT = Path.cwd().parents[0] if (Path.cwd().name == "notebooks") else Path.cwd()
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))

from src.prep.data_prep import PrepConfig, run_prep, verify_processed_splits
from src.prep.augment_offline import balance_train_with_augmented_images

# Try to read the PREP parameters from presets.yaml; fall back to defaults on failure
try:
    from src.config import load_preset
    PRESET = "accurate"  # change this to use another preset
    _cfg = load_preset(ROOT / "configs" / "presets.yaml", PRESET)
    PREP = _cfg.get("prep") or {}
except Exception as exc:
    print(f"[WARN] Could not load preset; using default PREP values: {exc}")
    PREP = {}

RAW  = ROOT / "data" / "raw" / "udacity"
PROC = ROOT / "data" / "processed"

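# RUNS: preset or robust autodetection
if "runs" in PREP and PREP["runs"]:
    RUNS = list(PREP["runs"])
else:
    # The first path component under RAW is the circuit, whether the log sits in
    # <run>/driving_log.csv or in <run>/<lap>/driving_log.csv.
    RUNS = sorted({
        p.relative_to(RAW).parts[0]
        for p in RAW.rglob("driving_log.csv")
        if "aug" not in p.parts and len(p.relative_to(RAW).parts) > 1
    })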

# Preparation hyperparameters
merge_subruns   = bool(PREP.get("merge_subruns", True))
use_left_right  = bool(PREP.get("use_left_right", True))
steer_shift     = float(PREP.get("steer_shift", 0.2))
bins            = int(PREP.get("bins", 50))
train           = float(PREP.get("train", 0.70))
val             = float(PREP.get("val", 0.15))
seed            = int(PREP.get("seed", 42))

# Offline balancing
BAL             = dict(PREP.get("balance_offline", {}))
bal_mode        = str(BAL.get("mode", "images")).lower()
target_per_bin  = BAL.get("target_per_bin", "auto")
cap_per_bin     = BAL.get("cap_per_bin", 12000)
AUG             = BAL.get("aug", {})

# PrepConfig WITHOUT row duplication (the balancing step handles that)
CFG = PrepConfig(
    root=ROOT,
    runs=RUNS,
    merge_subruns=merge_subruns,
    use_left_right=use_left_right,
    steer_shift=steer_shift,
    bins=bins,
    train=train,
    val=val,
    seed=seed,
    target_per_bin=None,
    cap_per_bin=None,
)

print("ROOT:", ROOT)
print("RAW :", RAW)
print("PROC:", PROC)
print("RUNS:", RUNS)
print("BAL mode:", bal_mode, "| target_per_bin:", target_per_bin, "| cap_per_bin:", cap_per_bin)

# --- 1) CANONICAL PREP (canonical/train/val/test) -----------------------------
manifest = run_prep(CFG)
print("prep_manifest.json:", PROC / "prep_manifest.json")
print("tasks.json:", manifest["outputs"].get("tasks_json", "(unknown)"))

# Basic check (consistent splits)
verify_processed_splits(PROC, RUNS)

# --- 2) OFFLINE BALANCING (train → train_balanced) + SUMMARY ------------------
if bal_mode == "images":
    stats_rows = []

    for run in RUNS:
        base_dir = RAW  / run
        out_dir  = PROC / run
        train_csv = out_dir / "train.csv"

        # Original train size
        df_tr_orig = pd.read_csv(train_csv)
        n_train_orig = len(df_tr_orig)

        out_csv, stats = balance_train_with_augmented_images(
            train_csv=train_csv,
            raw_run_dir=base_dir,
            out_run_dir=out_dir,
            bins=CFG.bins,
            target_per_bin=target_per_bin,
            cap_per_bin=cap_per_bin,
            seed=CFG.seed,
            aug=AUG,
            idempotent=True,
            overwrite=False,
        )

        # Size after balancing
        df_tr_bal = pd.read_csv(out_csv)
        n_train_bal = len(df_tr_bal)
        generated = max(0, n_train_bal - n_train_orig)

        print(f"[{run}] train_orig={n_train_orig} → train_balanced={n_train_bal} "
              f"(+{generated} new) → {out_csv.name}")

        stats_rows.append({
            "run": run,
            "n_train_orig": n_train_orig,
            "n_train_balanced": n_train_bal,
            "generated": generated,
        })

    # Per-circuit summary
    df_bal_stats = pd.DataFrame(stats_rows).sort_values("run")
    print("\n=== OFFLINE BALANCING SUMMARY PER CIRCUIT ===")
    display(df_bal_stats)

    # Also save to disk for traceability
    eda_all = PROC / "eda_all"
    eda_all.mkdir(parents=True, exist_ok=True)
    bal_stats_csv = eda_all / "balance_stats.csv"
    df_bal_stats.to_csv(bal_stats_csv, index=False)
    print("Saved balance_stats.csv to:", bal_stats_csv)

    # Write tasks_balanced.json
    tb = {"tasks_order": RUNS, "splits": {}}
    for run in RUNS:
        d = str((PROC / run).resolve())
        tb["splits"][run] = {
            "train": f"{d}/train_balanced.csv",
            "val":   f"{d}/val.csv",
            "test":  f"{d}/test.csv",
        }
    tasks_balanced_path = PROC / PREP.get("tasks_balanced_file_name", "tasks_balanced.json")
    tasks_balanced_path.write_text(json.dumps(tb, indent=2), encoding="utf-8")
    print("OK BALANCED:", tasks_balanced_path)
else:
    print("Offline balancing disabled (prep.balance_offline.mode != 'images').")


<a id="sec-02"></a>
## 2) Run preparation + verification and write `tasks_balanced.json`

**Sequence of the preparation cell (the code cell above)**

1. Runs `manifest = run_prep(CFG)`:
   - generates `canonical.csv` per *run* (normalized paths, merging laps if `merge_subruns=True`);
   - generates `train.csv`, `val.csv`, `test.csv` with **splits stratified over `steering` bins**;
   - writes `data/processed/tasks.json` with the **task order** and the paths to the original splits.

2. Checks that `train/val/test.csv` exist for each *run* (`verify_processed_splits(PROC, RUNS)`).

3. If `bal_mode == "images"`:
   - For each *run*, generates a `train_balanced.csv` balanced across bins via
     `balance_train_with_augmented_images(...)`, using photometric augmentation (`AUG`) to fill sparsely populated bins.
   - Writes `data/processed/tasks_balanced.json` with the paths to:
     - `train_balanced.csv`,
     - `val.csv`,
     - `test.csv`  
     for each circuit.
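
To make the roles of `target_per_bin` and `cap_per_bin` concrete, the sketch below computes how many augmented samples each non-empty bin would need. It is a guess at the semantics, not the code in `src.prep.augment_offline`: in particular, treating `"auto"` as "match the most populated bin" and applying the cap to the target are assumptions, and `bin_deficits` is a hypothetical helper.

```python
from collections import Counter


def bin_deficits(steering, bins=50, target_per_bin="auto", cap_per_bin=12000):
    """Number of augmented samples to generate per non-empty steering bin."""
    counts = Counter(
        min(int((max(-1.0, min(1.0, s)) + 1.0) / 2.0 * bins), bins - 1)
        for s in steering
    )
    # Assumption: "auto" means "match the most populated bin".
    target = max(counts.values()) if target_per_bin == "auto" else int(target_per_bin)
    if cap_per_bin is not None:
        target = min(target, int(cap_per_bin))  # assumption: the cap limits the target
    return {b: max(0, target - n) for b, n in sorted(counts.items())}


steering = [0.0] * 100 + [0.5] * 20 + [-0.5] * 5
deficits = bin_deficits(steering, bins=50, cap_per_bin=60)
print(deficits)
```

Bins that already meet the target get a deficit of zero, so only the under-represented bins receive generated images.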

**Idempotence**

- `idempotent=True`: if a balanced set built with the same configuration already exists, it is not regenerated.
- `overwrite=False`: prevents accidental overwrites of `train_balanced.csv`.
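
One way such flags could interact is sketched below, using a configuration fingerprint stored next to the output. This is only one plausible interpretation: the sidecar file `train_balanced.meta.json`, the helper names, and the `"refuse"` outcome are all hypothetical and are not taken from `balance_train_with_augmented_images`.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_fingerprint(cfg: dict) -> str:
    """Stable hash of the balancing configuration."""
    return hashlib.sha256(json.dumps(cfg, sort_keys=True).encode("utf-8")).hexdigest()


def plan_action(out_dir: Path, cfg: dict, idempotent=True, overwrite=False) -> str:
    """Decide whether to build, skip, or refuse to touch train_balanced.csv."""
    out_csv = out_dir / "train_balanced.csv"
    marker = out_dir / "train_balanced.meta.json"  # hypothetical sidecar file
    if not out_csv.exists():
        return "build"
    if idempotent and marker.exists():
        saved = json.loads(marker.read_text(encoding="utf-8"))
        if saved.get("fingerprint") == config_fingerprint(cfg):
            return "skip"  # same configuration already materialized
    return "build" if overwrite else "refuse"


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    cfg = {"bins": 50, "target_per_bin": "auto", "cap_per_bin": 12000, "seed": 42}
    first = plan_action(out_dir, cfg)
    # Simulate a finished balancing run
    (out_dir / "train_balanced.csv").write_text("center,steering\n", encoding="utf-8")
    (out_dir / "train_balanced.meta.json").write_text(
        json.dumps({"fingerprint": config_fingerprint(cfg)}), encoding="utf-8")
    second = plan_action(out_dir, cfg)
    third = plan_action(out_dir, {**cfg, "bins": 25})
    print(first, second, third)
```

Serializing the configuration with `sort_keys=True` keeps the fingerprint independent of dictionary ordering.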

[↑ Back to index](#toc)


In [None]:
# %% [code]
# =============================================================================
# 3) SUMMARY EDA FOR THE REPORT
#    - Histograms per circuit
#    - Global summary table (including the effect of balancing)
# =============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import json

bins = CFG.bins
edges = np.linspace(-1.0, 1.0, bins + 1)  # `bins` intervals need `bins + 1` edges

def _plot_hist(series, title, save_path, edges):
    s = pd.to_numeric(series, errors="coerce").dropna().clip(-1, 1)
    plt.figure(figsize=(6, 3))
    plt.hist(s, bins=edges, edgecolor="black")
    plt.title(title)
    plt.xlabel("steering")
    plt.ylabel("freq")
    Path(save_path).parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(save_path, dpi=140)
    plt.close()

rows_summary = []

for RUN in RUNS:
    base_out = PROC / RUN
    if not base_out.exists():
        print(f"[WARN] {RUN}: {base_out} does not exist, skipping.")
        continue

    # Load the per-circuit CSVs
    df_c  = pd.read_csv(base_out / "canonical.csv")
    df_tr = pd.read_csv(base_out / "train.csv")
    df_va = pd.read_csv(base_out / "val.csv")
    df_te = pd.read_csv(base_out / "test.csv")

    bal_path = base_out / "train_balanced.csv"
    df_bal = pd.read_csv(bal_path) if bal_path.exists() else None

    # Sizes and factors
    n_c        = len(df_c)
    n_tr_orig  = len(df_tr)
    n_va       = len(df_va)
    n_te       = len(df_te)
    n_tr_bal   = len(df_bal) if df_bal is not None else n_tr_orig

    n_all_before = n_tr_orig + n_va + n_te
    n_all_after  = n_tr_bal  + n_va + n_te

    expansion_before = (n_all_before / n_c) if n_c else float("nan")
    expansion_after  = (n_all_after  / n_c) if n_c else float("nan")

    generated = max(0, n_tr_bal - n_tr_orig)
    has_bal   = df_bal is not None

    # EDA directory
    eda_dir = base_out / "eda"
    eda_dir.mkdir(parents=True, exist_ok=True)

    # Key histograms for the report
    _plot_hist(df_c["steering"],
               f"{RUN} – steering (canonical)",
               eda_dir / "hist_canonical.png",
               edges)

    _plot_hist(df_tr["steering"],
               f"{RUN} – steering (train)",
               eda_dir / "hist_train.png",
               edges)

    if df_bal is not None:
        _plot_hist(df_bal["steering"],
                   f"{RUN} – steering (train balanced)",
                   eda_dir / "hist_train_balanced.png",
                   edges)

    # Per-circuit JSON summary (for traceability)
    summary = {
        "run": RUN,
        "n_canonical": int(n_c),
        "n_train_orig": int(n_tr_orig),
        "n_train_balanced": int(n_tr_bal),
        "n_val": int(n_va),
        "n_test": int(n_te),
        "n_total_before_expand": int(n_all_before),
        "n_total_after_expand": int(n_all_after),
        "expansion_factor_before": float(expansion_before),
        "expansion_factor_after": float(expansion_after),
        "generated_train": int(generated),
        "has_train_balanced": bool(has_bal),
    }
    (eda_dir / "summary.json").write_text(json.dumps(summary, indent=2),
                                          encoding="utf-8")
    rows_summary.append(summary)

# Global summary across all circuits
eda_all = PROC / "eda_all"
eda_all.mkdir(parents=True, exist_ok=True)

df_sum = pd.DataFrame(rows_summary)
display(df_sum.sort_values("run") if not df_sum.empty else df_sum)

out_csv = eda_all / "summary_runs.csv"
df_sum.to_csv(out_csv, index=False)
print("Saved global summary:", out_csv)


<a id="sec-03"></a>
## 3) Quick EDA and per-circuit summary (for the report)

**What this section provides**

For each circuit (*run*):

- Sizes per split (`canonical`, `train`, `val`, `test`) and the **expansion factor**  
  (≈ 3 when the L/R cameras are enabled and no rows are lost).
- Key `steering` histograms:
  - `hist_canonical.png`: original distribution of the steering angle per circuit.
  - `hist_train.png`: distribution of `train` after the expansion (L/R, splits).
  - `hist_train_balanced.png`: final distribution of `train` after offline balancing (if present).
- A `summary.json` per circuit with the main counters (useful for the report).
- A **global summary** `data/processed/eda_all/summary_runs.csv` with one row per circuit.
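
Two small calculations sit behind these outputs: the histogram bin edges and the expansion factor. The sketch below reproduces both in plain Python. The counts passed to `expansion_factor` are made-up numbers chosen only to show the arithmetic, and both helper names are hypothetical.

```python
def bin_edges(bins, lo=-1.0, hi=1.0):
    """`bins` equal-width intervals over [lo, hi] need `bins + 1` edges."""
    step = (hi - lo) / bins
    return [lo + i * step for i in range(bins + 1)]


def expansion_factor(n_train, n_val, n_test, n_canonical):
    """Rows across the three splits divided by the rows in canonical.csv."""
    total = n_train + n_val + n_test
    return total / n_canonical if n_canonical else float("nan")


edges = bin_edges(50)
print(len(edges))
print(expansion_factor(n_train=2100, n_val=450, n_test=450, n_canonical=1000))
```

With NumPy, the same edges come from `np.linspace(-1.0, 1.0, bins + 1)`; passing only `bins` points would yield one interval fewer than intended.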

This information is used for:

- The **Table** (dataset summary per circuit).
- The **Figure** (canonical histograms per circuit).
- The **Figure** (effect of balancing on `train`).

[↑ Back to index](#toc)


In [None]:
# %% [code]
# =============================================================================
# 4) Figures and tables for the report (dataset)
#    - Summary table (for the Table)
#    - Figure: example images per circuit
#    - Figure: canonical histograms per circuit
#    - Figure: train vs train_balanced per circuit
# =============================================================================
from pathlib import Path
import sys, json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image

# -------------------------------------------------------------------------
# 0) Base paths and runs (in case they are not already defined)
# -------------------------------------------------------------------------
if "ROOT" not in globals():
    ROOT = Path.cwd().parents[0] if (Path.cwd().name == "notebooks") else Path.cwd()

RAW  = ROOT / "data" / "raw" / "udacity"
PROC = ROOT / "data" / "processed"

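if "RUNS" in globals():
    RUNS_LOCAL = list(RUNS)
else:
    # The first path component under RAW is the circuit, with or without lap subfolders
    RUNS_LOCAL = sorted({
        p.relative_to(RAW).parts[0]
        for p in RAW.rglob("driving_log.csv")
        if "aug" not in p.parts and len(p.relative_to(RAW).parts) > 1
    })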

print("ROOT:", ROOT)
print("RAW :", RAW)
print("PROC:", PROC)
print("RUNS:", RUNS_LOCAL)

# Directory where the figures and table for the report are saved
FIG_DIR = ROOT / "figs_memoria" / "dataset"
FIG_DIR.mkdir(parents=True, exist_ok=True)
print("FIG_DIR:", FIG_DIR)

# Bins for the histograms (consistent with the preparation)
bins = None
if "CFG" in globals() and hasattr(CFG, "bins"):
    bins = int(getattr(CFG, "bins"))
elif "PREP" in globals() and isinstance(PREP, dict):
    bins = int(PREP.get("bins", 50))
if bins is None:
    bins = 50
print("Histogram bins:", bins)

# -------------------------------------------------------------------------
# 1) Per-circuit summary table (for Table 3.1)
#    - Relies on data/processed/eda_all/summary_runs.csv
# -------------------------------------------------------------------------
eda_all = PROC / "eda_all"
eda_all.mkdir(parents=True, exist_ok=True)
sum_csv = eda_all / "summary_runs.csv"

if not sum_csv.exists():
    raise FileNotFoundError(f"{sum_csv} does not exist. Run the summary EDA cell first.")

df_sum = pd.read_csv(sum_csv)
print("Read existing global summary:", sum_csv)

# Presentation-ready table for the report
df_tab = df_sum.copy()
df_tab["Circuito"] = df_tab["run"].str.replace("circuito", "Circuito ", regex=False)

# Use n_train_balanced if present; otherwise fall back to n_train_orig
if "n_train_balanced" in df_tab.columns:
    train_col = "n_train_balanced"
else:
    train_col = "n_train_orig"

df_tab = df_tab[
    [
        "Circuito",
        "n_canonical",
        train_col,
        "n_val",
        "n_test",
        "expansion_factor_after",
        "has_train_balanced",
    ]
].rename(
    columns={
        "n_canonical": "Muestras_canonical",
        train_col:      "Train",
        "n_val":        "Val",
        "n_test":       "Test",
        "expansion_factor_after": "Factor_expansion",
        "has_train_balanced":     "Train_balanceado",
    }
)

tab_csv_out = FIG_DIR / "tabla_3_1_resumen_dataset.csv"
df_tab.to_csv(tab_csv_out, index=False)
print("Summary table for the report saved to:", tab_csv_out)
display(df_tab)

# -------------------------------------------------------------------------
# 2) Figure – Example images per circuit
# -------------------------------------------------------------------------
def _pick_examples_for_run(run_name: str, seed: int = 42):
    """Return two rows of canonical.csv: (straight, curve) when possible."""
    base = PROC / run_name
    df_c = pd.read_csv(base / "canonical.csv")
    df_c["steering"] = pd.to_numeric(df_c["steering"], errors="coerce")
    df_c = df_c.dropna(subset=["steering"])

    rng = np.random.default_rng(seed)

    # straight: |steering| < 0.05
    df_straight = df_c[np.abs(df_c["steering"]) < 0.05]
    if len(df_straight) == 0:
        df_straight = df_c  # fallback

    # curve: |steering| > 0.30
    df_curve = df_c[np.abs(df_c["steering"]) > 0.30]
    if len(df_curve) == 0:
        df_curve = df_c  # fallback

    row_straight = df_straight.iloc[rng.integers(0, len(df_straight))]
    row_curve    = df_curve.iloc[rng.integers(0, len(df_curve))]
    return row_straight, row_curve

def _resolve_image_path(run_name: str, rel_path: str) -> Path:
    """Turn the relative path stored in canonical.csv into an absolute path to the image file."""
    rel = str(rel_path).replace("\\", "/").lstrip("/")
    return (RAW / run_name / rel).resolve()

# Build the figure: one row per run, columns straight/curve
fig, axes = plt.subplots(len(RUNS_LOCAL), 2, figsize=(8, 3 * len(RUNS_LOCAL)))
if len(RUNS_LOCAL) == 1:
    axes = np.array([axes])  # normalize to 2D

for row_idx, run in enumerate(RUNS_LOCAL):
    row_straight, row_curve = _pick_examples_for_run(run, seed=42 + row_idx)
    for col_idx, row in enumerate([row_straight, row_curve]):
        img_path = _resolve_image_path(run, row["center"])
        try:
            img = Image.open(img_path)
        except Exception as e:
            print(f"[WARN] Could not open {img_path}: {e}")
            axes[row_idx, col_idx].axis("off")
            axes[row_idx, col_idx].set_title(f"{run} (image not available)")
            continue
        axes[row_idx, col_idx].imshow(img)
        axes[row_idx, col_idx].axis("off")
        kind = "straight" if col_idx == 0 else "curve"
        steering = float(row["steering"])
        axes[row_idx, col_idx].set_title(f"{run} – {kind}, steering={steering:.2f}")

fig.suptitle("Example images per circuit", fontsize=12)
plt.tight_layout()
fig_31_path = FIG_DIR / "fig_3_1_ejemplos_imagenes.png"
fig.savefig(fig_31_path, dpi=200)
plt.close(fig)
print("Figure 3.1 saved to:", fig_31_path)

# -------------------------------------------------------------------------
# 3) Figure – Canonical histograms per circuit
# -------------------------------------------------------------------------
edges = np.linspace(-1.0, 1.0, bins + 1)  # `bins` intervals need `bins + 1` edges

fig, axes = plt.subplots(1, len(RUNS_LOCAL), figsize=(8, 3))
if len(RUNS_LOCAL) == 1:
    axes = [axes]

for ax, run in zip(axes, RUNS_LOCAL):
    base = PROC / run
    df_c = pd.read_csv(base / "canonical.csv")
    s = pd.to_numeric(df_c["steering"], errors="coerce").dropna().clip(-1, 1)
    ax.hist(s, bins=edges, edgecolor="black")
    ax.set_title(f"{run} – canonical")
    ax.set_xlabel("steering")
    ax.set_ylabel("freq")

fig.suptitle("Steering-angle histograms (canonical)", fontsize=12)
plt.tight_layout()
fig_32_path = FIG_DIR / "fig_3_2_hist_canonical.png"
fig.savefig(fig_32_path, dpi=200)
plt.close(fig)
print("Figure 3.2 saved to:", fig_32_path)

# -------------------------------------------------------------------------
# 4) Figure – Train vs train_balanced per circuit
# -------------------------------------------------------------------------
fig, axes = plt.subplots(len(RUNS_LOCAL), 2, figsize=(8, 3 * len(RUNS_LOCAL)))
if len(RUNS_LOCAL) == 1:
    axes = np.array([axes])  # normalize to 2D

for row_idx, run in enumerate(RUNS_LOCAL):
    base = PROC / run
    df_tr = pd.read_csv(base / "train.csv")
    s_tr = pd.to_numeric(df_tr["steering"], errors="coerce").dropna().clip(-1, 1)

    ax_tr = axes[row_idx, 0]
    ax_tr.hist(s_tr, bins=edges, edgecolor="black")
    ax_tr.set_title(f"{run} – TRAIN (original)")
    ax_tr.set_xlabel("steering")
    ax_tr.set_ylabel("freq")

    ax_bal = axes[row_idx, 1]
    bal_path = base / "train_balanced.csv"
    if bal_path.exists():
        df_bal = pd.read_csv(bal_path)
        s_bal = pd.to_numeric(df_bal["steering"], errors="coerce").dropna().clip(-1, 1)
        ax_bal.hist(s_bal, bins=edges, edgecolor="black")
        ax_bal.set_title(f"{run} – TRAIN (balanced)")
        ax_bal.set_xlabel("steering")
        ax_bal.set_ylabel("freq")
    else:
        ax_bal.axis("off")
        ax_bal.set_title(f"{run} – no train_balanced.csv")

fig.suptitle("Effect of balancing on TRAIN", fontsize=12)
plt.tight_layout()
fig_33_path = FIG_DIR / "fig_3_3_hist_train_vs_bal.png"
fig.savefig(fig_33_path, dpi=200)
plt.close(fig)
print("Figure 3.3 saved to:", fig_33_path)

print("\n=== SUMMARY ===")
print("Table 3.1 →", tab_csv_out)
print("Figure 3.1 →", fig_31_path)
print("Figure 3.2 →", fig_32_path)
print("Figure 3.3 →", fig_33_path)


<a id="sec-04"></a>
## 4) Figures and tables for the report (dataset)

The code cell above generates in `figs_memoria/dataset/`:

- **Table** → `tabla_3_1_resumen_dataset.csv`  
  (basis for the per-circuit dataset summary table).
- **Figure** → `fig_3_1_ejemplos_imagenes.png`  
  (example straight/curve images per circuit).
- **Figure** → `fig_3_2_hist_canonical.png`  
  (canonical `steering` histograms per circuit).
- **Figure** → `fig_3_3_hist_train_vs_bal.png`  
  (original `train` versus `train_balanced` per circuit).

These are the outputs referenced in section **3.0 Conjunto de Datos** (Dataset) of the report.

[↑ Back to index](#toc)


In [None]:
# %% [code]
# =============================================================================
# 5) Additional EDA: number of rows per circuit / lap (raw)
# =============================================================================
from pathlib import Path
import pandas as pd

ROOT = Path.cwd().parents[0] if (Path.cwd().name == "notebooks") else Path.cwd()
RAW  = ROOT / "data" / "raw" / "udacity"

rows = []
for csv_path in RAW.rglob("driving_log.csv"):
    # Skip synthetic augmentation folders that may live under raw
    if "aug" in csv_path.parts:
        continue

    rel_parts = csv_path.relative_to(RAW).parts
    circuito = rel_parts[0]                                  # e.g. "circuito1"
    subvuelta = rel_parts[-2] if len(rel_parts) > 2 else ""  # e.g. "vuelta2"; empty if the log sits directly under the circuit

    df = pd.read_csv(csv_path, header=None)
    n_rows = len(df)

    rows.append({
        "circuito": circuito,
        "subvuelta": subvuelta,
        "n_filas_log": n_rows,
    })

df_sub = pd.DataFrame(rows).sort_values(["circuito", "subvuelta"])
display(df_sub)

eda_all = ROOT / "data" / "processed" / "eda_all"
eda_all.mkdir(parents=True, exist_ok=True)
subruns_csv = eda_all / "subruns_counts.csv"
df_sub.to_csv(subruns_csv, index=False)
print("Saved lap summary to:", subruns_csv)


In [None]:
# =============================================================================
# 5) Visual check: original vs crop+resize (actual pipeline)
# =============================================================================
import numpy as np
import cv2
import matplotlib.pyplot as plt
import pandas as pd

from src.config import load_preset
from src.datasets import ImageTransform

# --- 5.1) Choose the circuit to inspect --------------------------------------
# Change it to "circuito1" to look at the other one
RUN_EXAMPLE = "circuito2"

raw_dir = RAW / RUN_EXAMPLE
proc_dir = PROC / RUN_EXAMPLE

canonical_csv = proc_dir / "canonical.csv"
if not canonical_csv.exists():
    raise FileNotFoundError(f"{canonical_csv} does not exist; run the prep first.")

df = pd.read_csv(canonical_csv)

print(f"[INFO] {RUN_EXAMPLE} → canonical.csv with {len(df)} rows")
display(df.head())

# --- 5.2) Build the same transform used in training --------------------------
PRESET = "std"  # or "fast"/"accurate" to look at another config
cfg = load_preset(ROOT / "configs" / "presets.yaml", PRESET)
MODEL = cfg["model"]

W = int(MODEL["img_w"])
H = int(MODEL["img_h"])
to_gray = bool(MODEL["to_gray"])
crop_top = int(MODEL.get("crop_top", 0) or 0)
crop_bottom = int(MODEL.get("crop_bottom", 0) or 0)

print(f"[PRESET={PRESET}] img={W}x{H} to_gray={to_gray} "
      f"| crop_top={crop_top} crop_bottom={crop_bottom}")

tfm = ImageTransform(
    W, H,
    to_gray=to_gray,
    crop_top=crop_top or None,
    crop_bottom=crop_bottom or None,
)

# --- 5.3) Random sampling of examples ----------------------------------------
N = 10  # change this number for more/fewer examples

idxs = sorted(np.random.choice(len(df), size=min(N, len(df)), replace=False))
print("Showing indices:", idxs)

fig, axes = plt.subplots(len(idxs), 2, figsize=(8, 3 * len(idxs)))
if len(idxs) == 1:
    axes = [axes]  # normalize to a list of rows

for row_i, ax_row in zip(idxs, axes):
    # Relative image path ('center' column of canonical.csv)
    rel = str(df.loc[row_i, "center"]).replace("\\", "/")
    img_path = (raw_dir / rel).resolve()

    img_bgr = cv2.imread(str(img_path), cv2.IMREAD_COLOR)
    if img_bgr is None:
        print("Could not read:", img_path)
        continue

    # Original converted to RGB (so it displays correctly)
    orig_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

    # Pipeline transform (crop + resize + gray/color)
    x = tfm(img_bgr).numpy()
    if to_gray:
        crop_img = x[0]               # (H, W)
    else:
        crop_img = x.transpose(1, 2, 0)  # (H, W, C)

    # --- Column 1: original ---
    ax_row[0].imshow(orig_rgb)
    ax_row[0].set_title(f"Original\nidx={row_i}")
    ax_row[0].axis("off")

    # --- Column 2: crop + resize ---
    if to_gray:
        ax_row[1].imshow(crop_img, cmap="gray", vmin=0, vmax=1)
    else:
        ax_row[1].imshow(crop_img)
    ax_row[1].set_title(f"Cropped + resized\n{W}x{H}")
    ax_row[1].axis("off")
    ax_row[1].axis("off")

plt.tight_layout()
plt.show()


**Done.** You can now go to `03_TRAIN_EVAL.ipynb` and set `USE_OFFLINE_BALANCED = True`
to consume `tasks_balanced.json` (or leave it as `False` to use `tasks.json`).
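
A sketch of how a downstream notebook might pick between the two task files follows. The JSON layout (`tasks_order` plus per-run `splits`) matches what this notebook writes to `tasks_balanced.json`, but the helper `load_tasks` is hypothetical, and how `03_TRAIN_EVAL.ipynb` actually uses the flag is not shown here.

```python
import json
import tempfile
from pathlib import Path


def load_tasks(proc: Path, use_offline_balanced: bool) -> dict:
    """Load tasks_balanced.json or tasks.json depending on the flag."""
    name = "tasks_balanced.json" if use_offline_balanced else "tasks.json"
    return json.loads((proc / name).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    proc = Path(tmp)
    # Same layout this notebook writes: task order plus per-run split paths
    tb = {"tasks_order": ["circuito1"],
          "splits": {"circuito1": {"train": "circuito1/train_balanced.csv",
                                   "val": "circuito1/val.csv",
                                   "test": "circuito1/test.csv"}}}
    (proc / "tasks_balanced.json").write_text(json.dumps(tb, indent=2), encoding="utf-8")
    tasks = load_tasks(proc, use_offline_balanced=True)
    train_csv = tasks["splits"]["circuito1"]["train"]
    print(train_csv)
```

Only the `train` entry differs between the two files; `val` and `test` point at the same unbalanced splits in both cases.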
