# FraSoHome - Notebook 6: Final preprocessing (scaling and encoding) - ML/BI

## Learning objectives

In this notebook we prepare the **feature** datasets (generated in Notebook 5) so that they are ready for:
- **Machine Learning**: scaled numeric data, encoded categorical variables, no nulls.
- **BI**: consistent, versioned datasets, with a variable "dictionary" and basic checks.

> Teaching note: in this case, **the sources contain intentional errors** (mixed formats, odd values, etc.).  
> Notebook 5 already turned most of those problems into aggregated features, but here we will still see:
> - numbers stored as text with `€`, a decimal comma, a thousands separator…
> - heterogeneous boolean flags (`S`, `sí`, `YES`, `0/1`, etc.)
> - categories with variants (`Bronce` vs `BRONCE`, etc.)

## Expected inputs

The notebook tries to load each of the following files and skips any that is missing:
- `output_features/features_clientes.csv`
- `output_features/features_productos.csv`
- `output_features/dataset_propension_snapshots.csv` (optional)

`features_clientes.csv` is required from section 5 onwards. If the files do not exist, run Notebook 5 first.


In [None]:
# ============================================
# 0) Imports and configuration
# ============================================
from __future__ import annotations

import os
from pathlib import Path
import json
import re
import warnings

import numpy as np
import pandas as pd
from IPython.display import display  # explicit: `display` is only a builtin inside IPython/Jupyter

# Teaching shortcut: this silences ALL warnings; narrow or remove the filter in real projects.
warnings.filterwarnings("ignore")

# scikit-learn (for ML pipelines)
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 160)

DATA_DIR = Path(".")  # folder that contains the notebook


## 1) Utilities: robust loading and light normalization

Strategy:
1. Read each CSV as **text** (`dtype=str`) to preserve unusual formats.
2. Standardize column names.
3. Parse numeric and boolean values *tolerantly*: values that cannot be parsed become `NaN` instead of breaking the flow.
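
The three steps can be illustrated on a tiny in-memory CSV. This is only a sketch: the column names and values are made up, and `pd.to_numeric(..., errors="coerce")` stands in for the more tolerant parsers defined in the next cell.

```python
import io
import re

import pandas as pd

# Hypothetical raw file: mixed-case headers, a padded ID, a value that is not a number.
raw = io.StringIO("Customer ID,Total Spend\n  C001 ,120.50\nC002,unknown\n")

# 1) Read everything as text so that no value is silently reinterpreted.
df = pd.read_csv(raw, dtype=str)

# 2) Standardize column names to snake_case.
df.columns = [re.sub(r"[^a-z0-9]+", "_", c.strip().lower()).strip("_") for c in df.columns]

# 3) Tolerant parsing: unparseable values become NaN instead of raising.
df["customer_id"] = df["customer_id"].str.strip()
df["total_spend"] = pd.to_numeric(df["total_spend"], errors="coerce")

print(df)
```

Reading as text first means that the decision about what counts as a number is taken explicitly in step 3, not implicitly by the CSV reader.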


In [None]:
# ============================================
# 1) Ingestion / normalization utilities
# ============================================

def load_csv_robust(path: Path, encoding: str = "utf-8", dtype=str) -> pd.DataFrame:
    """Load a CSV robustly (meant for data-quality work on files with unusual formats)."""
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path.resolve()}")
    # engine='python' is more tolerant of some unusual quoting cases
    df = pd.read_csv(path, encoding=encoding, dtype=dtype, sep=",", engine="python")
    return df


def standardize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize column names to simple snake_case."""
    df = df.copy()
    def _clean(col: str) -> str:
        col = col.strip().lower()
        col = re.sub(r"\s+", "_", col)
        col = re.sub(r"[^a-z0-9_]+", "_", col)
        col = re.sub(r"_+", "_", col).strip("_")
        return col
    df.columns = [_clean(c) for c in df.columns]
    return df


def strip_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Trim whitespace in string columns (e.g. '  P1001 ') and turn empty/'None'/'nan' strings into NaN."""
    df = df.copy()
    for c in df.columns:
        if df[c].dtype == object:
            df[c] = df[c].astype(str).str.strip()
            df[c] = df[c].replace({"": np.nan, "None": np.nan, "nan": np.nan, "NaN": np.nan})
    return df


_NUM_CLEAN_RE = re.compile(r"[^0-9,\.\-]+")  # keeps digits, comma, dot and minus sign


def parse_numeric_series(s: pd.Series) -> pd.Series:
    """Convert a series of 'dirty' numbers (€, commas, thousands separators) to float.

    Heuristic:
    - Strip symbols (€, EUR, spaces, etc.)
    - If both comma and dot are present: the one that appears last is the decimal
      separator (e.g. 1,234.56 or 1.234,56)
    - If only a comma is present: treat it as the decimal separator (e.g. 123,45)

    Caveat: letters are stripped too, so a code such as 'P1001' parses as 1001.0.
    """
    if s is None:
        return s
    s0 = s.astype(str)

    # Keep NaNs
    s0 = s0.replace({"None": np.nan, "nan": np.nan, "NaN": np.nan})

    # Strip symbols
    s1 = s0.str.replace(_NUM_CLEAN_RE, "", regex=True)

    def _to_float(x: str):
        if x is None or (isinstance(x, float) and np.isnan(x)):
            return np.nan
        x = str(x)
        if x.strip() == "":
            return np.nan

        # Both separators present: the last one is the decimal separator
        if "," in x and "." in x:
            if x.rfind(",") > x.rfind("."):
                # e.g. 1.234,56 => dots are thousands separators, comma is decimal
                x = x.replace(".", "").replace(",", ".")
            else:
                # e.g. 1,234.56 => commas are thousands separators
                x = x.replace(",", "")
        elif "," in x:
            # e.g. 123,45 => decimal comma
            x = x.replace(",", ".")
        # else: only a dot, or only digits

        try:
            return float(x)
        except Exception:
            return np.nan

    return s1.map(_to_float)


_TRUE_SET = {"1", "true", "t", "yes", "y", "si", "sí", "s"}
_FALSE_SET = {"0", "false", "f", "no", "n"}

def parse_bool_series(s: pd.Series) -> pd.Series:
    """Parse heterogeneous booleans to 0/1 (float, so that NaN is allowed)."""
    s0 = s.astype(str).str.strip().str.lower()
    s0 = s0.replace({"": np.nan, "none": np.nan, "nan": np.nan})
    def _map(x):
        if x is None or (isinstance(x, float) and np.isnan(x)):
            return np.nan
        if x in _TRUE_SET:
            return 1.0
        if x in _FALSE_SET:
            return 0.0
        return np.nan
    return s0.map(_map)


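def guess_column_roles(df: pd.DataFrame, target_col: str | None = None) -> dict:
    """Return a proposed role for each column: id / categorical / numeric / datetime.

    Name-based only: 'numeric' is left empty here; numeric columns are detected
    from their values later, by detect_numeric_columns.
    """
    cols = list(df.columns)
    roles = {"id": [], "categorical": [], "numeric": [], "datetime": []}

    # Basic name-based heuristics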
    for c in cols:
        if target_col and c == target_col:
            continue
        if c.endswith("_id") or c in {"customer_id", "product_id", "ticket_id", "order_id"}:
            roles["id"].append(c)
        elif "date" in c or "fecha" in c or c.endswith("_dt") or c.endswith("_datetime") or c.endswith("_ts"):
            roles["datetime"].append(c)
        else:
            roles["categorical"].append(c)

    return roles


## 2) Loading the feature datasets

We look for the files in `output_features/`.  
If your earlier notebooks exported to a different folder, adjust `FEATURES_DIR`.


In [None]:
# ============================================
# 2) Load the feature datasets
# ============================================
FEATURES_DIR = Path("output_features")
OUTPUT_DIR = Path("output_ml")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

paths = {
    "clientes": FEATURES_DIR / "features_clientes.csv",
    "productos": FEATURES_DIR / "features_productos.csv",
    "propension": FEATURES_DIR / "dataset_propension_snapshots.csv",
}

loaded = {}
for k, p in paths.items():
    if p.exists():
        df = load_csv_robust(p)
        df = standardize_column_names(strip_strings(df))
        loaded[k] = df
        print(f"✅ Loaded: {k} -> {p} | shape={df.shape}")
    else:
        print(f"⚠️ Not found: {k} -> {p}")

loaded.keys()


## 3) Quick profiling (for teaching purposes)

Before preprocessing, we run a quick check of:
- sizes, nulls, exact duplicates
- high-cardinality columns (possible IDs)
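
One way to flag likely ID columns is to compare each column's number of distinct values with the row count. The following is a minimal sketch on a made-up frame; the 0.9 threshold is an arbitrary assumption, not a value used elsewhere in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): one ID-like column, one low-cardinality column with a null.
df = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4"],
    "tier": ["bronze", "bronze", None, "gold"],
})

profile = pd.DataFrame({
    "null_pct": df.isna().mean() * 100,   # share of missing values per column
    "nunique": df.nunique(dropna=True),   # distinct non-null values per column
})
profile["unique_ratio"] = profile["nunique"] / len(df)
profile["likely_id"] = profile["unique_ratio"] >= 0.9  # assumed threshold

print(profile)
print("exact duplicate rows:", int(df.duplicated().sum()))
```

Columns flagged this way are candidates to exclude from the feature matrix, since one-hot encoding them would create roughly one column per row.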


In [None]:
# ============================================
# 3) Quick profiling
# ============================================

def quick_profile(df: pd.DataFrame, name: str, max_unique_show: int = 10) -> pd.DataFrame:
    rows = []
    n = len(df)
    dup_exact = int(df.duplicated().sum())
    for c in df.columns:
        nulls = int(df[c].isna().sum())
        nunique = int(df[c].nunique(dropna=True))
        sample = df[c].dropna().astype(str).head(max_unique_show).tolist()
        rows.append({
            "dataset": name,
            "column": c,
            "dtype": str(df[c].dtype),
            "rows": n,
            "nulls": nulls,
            "null_pct": round(100 * nulls / n, 2) if n else np.nan,
            "nunique": nunique,
            "sample_values": sample
        })
    prof = pd.DataFrame(rows).sort_values(["null_pct", "nunique"], ascending=[False, False])
    print(f"\n📌 {name}: shape={df.shape} | duplicated_rows_exact={dup_exact}")
    display(prof.head(20))
    return prof

profiles = {}
for name, df in loaded.items():
    profiles[name] = quick_profile(df, name)


## 4) Building an ML preprocessor (scaling + one-hot)

We create reusable functions to:
- identify numeric and categorical columns
- **impute** nulls
- **scale** numeric columns (StandardScaler or MinMaxScaler)
- **one-hot encode** categorical columns (OneHotEncoder)

💡 Teaching note: we also keep the resulting list of *feature names*.


In [None]:
# ============================================
# 4) Preprocessing (sklearn): build + transform + export
# ============================================

def detect_numeric_columns(df: pd.DataFrame, candidate_cols: list[str]) -> list[str]:
    """Try to detect numeric columns among a set of candidates."""
    numeric_cols = []
    for c in candidate_cols:
        # deliberately tolerant: a column counts as numeric if enough of its values parse
        parsed = parse_numeric_series(df[c])
        ok_ratio = parsed.notna().mean() if len(parsed) else 0
        # teaching threshold: at least 60% of the values must parse
        if ok_ratio >= 0.60:
            numeric_cols.append(c)
    return numeric_cols


def split_numeric_categorical(df: pd.DataFrame, target_col: str | None, id_cols: list[str]) -> tuple[list[str], list[str]]:
    cols = [c for c in df.columns if c != target_col]
    cols = [c for c in cols if c not in id_cols]
    # first pass: detect numeric columns; everything else is treated as categorical
    numeric_cols = detect_numeric_columns(df, cols)
    cat_cols = [c for c in cols if c not in numeric_cols]
    return numeric_cols, cat_cols


def build_preprocessor(numeric_cols: list[str], categorical_cols: list[str], scaler: str = "standard") -> ColumnTransformer:
    # scaler
    if scaler == "minmax":
        num_scaler = MinMaxScaler()
    else:
        num_scaler = StandardScaler()

    # OneHotEncoder: cross-version compatibility (sparse vs sparse_output)
    try:
        ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    except TypeError:
        ohe = OneHotEncoder(handle_unknown="ignore", sparse=False)

    numeric_pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", num_scaler),
    ])

    categorical_pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", ohe),
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_pipe, numeric_cols),
            ("cat", categorical_pipe, categorical_cols),
        ],
        remainder="drop"
    )
    return preprocessor


def get_feature_names(preprocessor: ColumnTransformer) -> list[str]:
    """Recover the feature names produced by the fitted ColumnTransformer."""
    feature_names = []

    # num
    try:
        num_features = list(preprocessor.transformers_[0][2])
    except Exception:
        num_features = []
    feature_names.extend([f"num__{c}" for c in num_features])

    # cat (onehot)
    cat_cols = list(preprocessor.transformers_[1][2]) if len(preprocessor.transformers_) > 1 else []
    if cat_cols:
        ohe = preprocessor.named_transformers_["cat"].named_steps["onehot"]
        try:
            cat_names = ohe.get_feature_names_out(cat_cols).tolist()
        except Exception:
            # fallback for older scikit-learn: build "<column>_<category>" names by hand
            cat_names = [f"{c}_{v}" for c, cats in zip(cat_cols, ohe.categories_) for v in cats]
        feature_names.extend([f"cat__{n}" for n in cat_names])

    return feature_names


def coerce_numeric_columns(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Convert the detected numeric columns to float, tolerantly."""
    df = df.copy()
    for c in numeric_cols:
        df[c] = parse_numeric_series(df[c])
    return df


def preprocess_dataframe_for_ml(
    df_raw: pd.DataFrame,
    target_col: str | None,
    id_cols: list[str],
    scaler: str = "standard",
    test_size: float = 0.2,
    random_state: int = 42,
) -> dict:
    """Return the transformed X, the fitted preprocessor and feature names (plus train/test splits if there is a target)."""
    df = df_raw.copy()

    # Detect numeric / categorical columns
    numeric_cols, categorical_cols = split_numeric_categorical(df, target_col=target_col, id_cols=id_cols)
    print(f"Numeric cols ({len(numeric_cols)}): {numeric_cols[:10]}{'...' if len(numeric_cols)>10 else ''}")
    print(f"Categorical cols ({len(categorical_cols)}): {categorical_cols[:10]}{'...' if len(categorical_cols)>10 else ''}")

    # Coerce numeric columns (so that the imputer/scaler can work)
    df = coerce_numeric_columns(df, numeric_cols)

    # Target
    y = None
    if target_col and target_col in df.columns:
        # the target may arrive as text -> try numeric first, then booleans
        y_num = parse_numeric_series(df[target_col])
        if y_num.notna().mean() >= 0.8:
            y = y_num.astype(float)
        else:
            y = parse_bool_series(df[target_col])
        # last resort: if nothing parses, keep it as text
        if y is None or y.isna().all():
            y = df[target_col]

    # X (without id_cols, without target)
    drop_cols = set(id_cols)
    if target_col:
        drop_cols.add(target_col)
    X = df.drop(columns=[c for c in drop_cols if c in df.columns], errors="ignore")

    preprocessor = build_preprocessor(numeric_cols=[c for c in numeric_cols if c in X.columns],
                                      categorical_cols=[c for c in categorical_cols if c in X.columns],
                                      scaler=scaler)

    # Note: the preprocessor is fitted on ALL rows, before the train/test split below.
    # That suits a BI export, but it leaks test-set statistics into the scaling;
    # for model evaluation, fit the preprocessor on the training split only.
    X_trans = preprocessor.fit_transform(X)
    feature_names = get_feature_names(preprocessor)

    # Wrap in a DataFrame, keeping the original index so that the target mask below aligns
    X_trans_df = pd.DataFrame(X_trans, columns=feature_names, index=X.index)

    # Split when a target is available
    if y is not None:
        # drop rows whose target is NaN (for ML this is usually mandatory)
        mask = ~pd.isna(y)
        X_trans_df2 = X_trans_df.loc[mask].reset_index(drop=True)
        y2 = y.loc[mask].reset_index(drop=True)

        X_train, X_test, y_train, y_test = train_test_split(
            X_trans_df2, y2, test_size=test_size, random_state=random_state, stratify=None
        )
        return {
            "X_all": X_trans_df2,
            "y_all": y2,
            "X_train": X_train,
            "X_test": X_test,
            "y_train": y_train,
            "y_test": y_test,
            "preprocessor": preprocessor,
            "feature_names": feature_names,
            "numeric_cols": numeric_cols,
            "categorical_cols": categorical_cols,
        }

    return {
        "X_all": X_trans_df,
        "preprocessor": preprocessor,
        "feature_names": feature_names,
        "numeric_cols": numeric_cols,
        "categorical_cols": categorical_cols,
    }


## 5) ML-ready customer dataset (churn/RFM)

Notebook 5 generated a typical teaching label:
- `label_churn_180d` (1 if the customer has not purchased in the last 180 days, 0 otherwise)

In this notebook we:
- scale numeric columns and one-hot encode categorical ones
- export a **100% numeric** dataset (apart from the optional ID column) ready for ML/BI
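
For reference, a label with the definition given above could be derived roughly as follows. This is a sketch, not the Notebook 5 code: the `last_purchase_date` column, the reference date and the treatment of customers without purchases are all assumptions.

```python
import pandas as pd

# Hypothetical inputs: last purchase per customer and a fixed reference ("snapshot") date.
customers = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "last_purchase_date": pd.to_datetime(["2024-06-01", "2023-10-15", None]),
})
reference_date = pd.Timestamp("2024-06-30")

# Days elapsed since the last purchase (NaN when there is no purchase at all).
days_since = (reference_date - customers["last_purchase_date"]).dt.days

# 1 = no purchase in the last 180 days.
# Assumption: customers with no recorded purchase are labelled as churned.
customers["label_churn_180d"] = ((days_since > 180) | days_since.isna()).astype(int)

print(customers)
```

Whatever the exact rule, the reference date must be fixed explicitly; computing it from "today" would make the label change every time the notebook is rerun.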


In [None]:
# ============================================
# 5) Customers: ML-ready (with the churn label, if present)
# ============================================
if "clientes" not in loaded:
    raise RuntimeError("features_clientes.csv was not found in output_features/. Run Notebook 5 first.")

df_cli = loaded["clientes"].copy()

# Columns we usually treat as IDs (adjust if your dataset differs)
id_cols_cli = [c for c in df_cli.columns if c in {"customer_id", "cliente_id", "id_cliente"}]
target_cli = "label_churn_180d" if "label_churn_180d" in df_cli.columns else None

print("Target:", target_cli)
print("ID cols:", id_cols_cli)

prep_cli = preprocess_dataframe_for_ml(
    df_raw=df_cli,
    target_col=target_cli,
    id_cols=id_cols_cli,
    scaler="standard",  # switch to 'minmax' for a 0..1 range
)

X_cli = prep_cli["X_all"]
y_cli = prep_cli.get("y_all", None)

display(X_cli.head())
if y_cli is not None:
    display(y_cli.value_counts(dropna=False).head(10))


In [None]:
# Export ML-ready customers
clients_ml_path = OUTPUT_DIR / "FraSoHome_clientes_ML_ready.csv"
X_cli_export = X_cli.copy()

# if there is a target, append it
if y_cli is not None:
    X_cli_export[target_cli] = y_cli.values

# (optional) add the ID, if present, for BI traceability
if id_cols_cli:
    ids = df_cli[id_cols_cli[0]]
    if target_cli is not None:
        # y_cli is already filtered (rows with a NaN target were dropped), so it cannot
        # mask df_cli; rebuild the row mask from the raw target with the same parsing rules.
        y_raw = parse_numeric_series(df_cli[target_cli])
        if y_raw.notna().mean() < 0.8:
            y_raw = parse_bool_series(df_cli[target_cli])
        if y_raw.isna().all():
            y_raw = df_cli[target_cli]
        ids = ids[y_raw.notna()]
    X_cli_export.insert(0, id_cols_cli[0], ids.values)

X_cli_export.to_csv(clients_ml_path, index=False, encoding="utf-8")
print(f"✅ Exported: {clients_ml_path} | shape={X_cli_export.shape}")

# Metadata (variable dictionary)
meta_cli = {
    "dataset": "clientes",
    "target": target_cli,
    "id_cols": id_cols_cli,
    "numeric_cols_detected": prep_cli["numeric_cols"],
    "categorical_cols_detected": prep_cli["categorical_cols"],
    "n_features_after_encoding": len(prep_cli["feature_names"]),
    "feature_names": prep_cli["feature_names"][:200],  # teaching limit; adjust if you want them all
    "scaler": "standard",
}
meta_path = OUTPUT_DIR / "FraSoHome_clientes_ML_ready_metadata.json"
meta_path.write_text(json.dumps(meta_cli, ensure_ascii=False, indent=2), encoding="utf-8")
print(f"✅ Metadata: {meta_path}")


## 6) ML-ready propensity dataset (if present)

If you generated `dataset_propension_snapshots.csv` in Notebook 5, we preprocess it here as well.
It usually includes a target such as `label_buy_horizon` (the name may vary).  
This block picks the first column whose name starts with `label_` or is one of `target`, `y`, `will_buy`, if there is one.


In [None]:
# ============================================
# 6) Propensity (if present)
# ============================================
if "propension" in loaded:
    df_prop = loaded["propension"].copy()

    # Heuristic: look for common target names
    possible_targets = [c for c in df_prop.columns if c.startswith("label_") or c in {"target", "y", "will_buy"}]
    target_prop = possible_targets[0] if possible_targets else None

    id_cols_prop = [c for c in df_prop.columns if c.endswith("_id") or c in {"customer_id", "product_id", "snapshot_date"}]
    print("Detected propensity target:", target_prop)
    print("ID cols (propensity):", id_cols_prop)

    prep_prop = preprocess_dataframe_for_ml(
        df_raw=df_prop,
        target_col=target_prop,
        id_cols=id_cols_prop,
        scaler="minmax"  # a 0..1 range is often convenient for propensity
    )

    X_prop = prep_prop["X_all"]
    y_prop = prep_prop.get("y_all", None)

    prop_path = OUTPUT_DIR / "FraSoHome_propension_ML_ready.csv"
    X_prop_export = X_prop.copy()
    if y_prop is not None and target_prop:
        X_prop_export[target_prop] = y_prop.values

    X_prop_export.to_csv(prop_path, index=False, encoding="utf-8")
    print(f"✅ Exported: {prop_path} | shape={X_prop_export.shape}")
else:
    print("ℹ️ No propensity dataset. (output_features/dataset_propension_snapshots.csv does not exist)")


## 7) ML-ready product dataset (clustering / regression / BI)

Products normally have no target. We prepare the dataset for:
- clustering (e.g. KMeans)  
- regression (e.g. predicting returns, margin, etc.)  
- BI (dashboards with normalized variables)

The teaching idea: standardize numeric columns and one-hot encode category, brand, channel, etc.
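
To illustrate the clustering use case, KMeans can be run directly on a standardized matrix. The sketch below uses synthetic data with two made-up features; the choice of two clusters is arbitrary and only works here because the groups are obvious.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic product features (hypothetical): [avg_price, return_rate], two obvious groups.
X = np.array([
    [10.0, 0.02], [12.0, 0.03], [11.0, 0.01],     # cheap, low return rate
    [200.0, 0.20], [210.0, 0.25], [190.0, 0.22],  # expensive, high return rate
])

# Standardize first so that price does not dominate the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(labels)
```

Without scaling, the price column (tens to hundreds) would outweigh the return rate (hundredths) in every distance computation, which is why the notebook standardizes before exporting.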


In [None]:
# ============================================
# 7) Products (if present)
# ============================================
if "productos" in loaded:
    df_prod = loaded["productos"].copy()
    id_cols_prod = [c for c in df_prod.columns if c in {"product_id", "sku", "id_producto"}]

    prep_prod = preprocess_dataframe_for_ml(
        df_raw=df_prod,
        target_col=None,
        id_cols=id_cols_prod,
        scaler="standard"
    )

    X_prod = prep_prod["X_all"]
    prod_path = OUTPUT_DIR / "FraSoHome_productos_ML_ready.csv"

    X_prod_export = X_prod.copy()
    # BI: add the product ID for traceability
    if id_cols_prod:
        X_prod_export.insert(0, id_cols_prod[0], df_prod[id_cols_prod[0]].values)

    X_prod_export.to_csv(prod_path, index=False, encoding="utf-8")
    print(f"✅ Exported: {prod_path} | shape={X_prod_export.shape}")
else:
    print("ℹ️ features_productos.csv was not found in output_features/")


## 8) Final checks (ML-ready quality)

- Are there any nulls left in the exported datasets?
- How large are they (n_features)?
- What share of the columns is one-hot encoded (high dimensionality)?

This block helps to discuss trade-offs (for example, too many one-hot columns).
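
A sketch of how those three questions could be answered for one exported file. It assumes the `num__` / `cat__` prefixes produced by `get_feature_names` in section 4 and uses a tiny made-up frame instead of reading from disk.

```python
import pandas as pd

# Stand-in for an exported ML-ready dataset (hypothetical columns and values).
df = pd.DataFrame({
    "customer_id": ["C1", "C2"],
    "num__total_spend": [0.5, -0.5],
    "cat__channel_web": [1.0, 0.0],
    "cat__channel_store": [0.0, 1.0],
    "label_churn_180d": [0.0, 1.0],
})

# Feature columns are the ones carrying a transformer prefix; ID and target are excluded.
feature_cols = [c for c in df.columns if c.startswith(("num__", "cat__"))]
onehot_cols = [c for c in feature_cols if c.startswith("cat__")]

total_nulls = int(df[feature_cols].isna().sum().sum())
onehot_share = len(onehot_cols) / len(feature_cols)

print(f"n_features={len(feature_cols)} | nulls={total_nulls} | one-hot share={onehot_share:.0%}")
```

A very high one-hot share usually points to a high-cardinality categorical column that might be better grouped, dropped or encoded differently.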


In [None]:
# ============================================
# 8) Final checks
# ============================================

def check_ml_dataset(path: Path, name: str, max_cols_show: int = 30):
    df = pd.read_csv(path, dtype=str, encoding="utf-8")
    # null percentage per column
    null_pct = (df.isna().mean() * 100).sort_values(ascending=False)
    # share of one-hot columns (they carry the "cat__" prefix added by get_feature_names)
    onehot_share = df.columns.str.startswith("cat__").mean() * 100
    print(f"\n✅ {name} -> {path.name} | shape={df.shape} | one-hot columns={onehot_share:.1f}%")
    print("Top columns by null percentage:")
    display(null_pct.head(10))
    print("Column sample:", df.columns[:max_cols_show].tolist())

for p in OUTPUT_DIR.glob("*.csv"):
    check_ml_dataset(p, p.stem)


## 9) Suggested next steps (for the course)

- Train a first baseline model (LogReg / RandomForest) with `FraSoHome_clientes_ML_ready.csv`.
- Evaluate with AUC/Accuracy and review:
  - the impact of imputation,
  - sensitivity to outliers,
  - the need for class balancing,
  - additional feature engineering (time trends, rolling windows).
- For basket analysis: transform `basket_long.csv` into a one-hot matrix and apply Apriori / FP-Growth in an additional notebook.
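
A possible shape for that first baseline, shown on synthetic data and not on the exported file. Unlike section 4, which fits the preprocessor on all rows before splitting, this sketch wraps the scaler and the model in one `Pipeline` that is fitted on the training split only, so no test-set statistics reach the scaler.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a churn dataset: the label depends on the first feature plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Stratify so that both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)  # the scaler only ever sees the training rows

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```

With the real data, the same pattern would apply to the raw (untransformed) features, with the `ColumnTransformer` from section 4 as the first pipeline step.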

