# FraSoHome – Notebook 5: Feature Computation (Customer and Product)

This notebook continues the **FraSoHome** case study with a **teaching** goal: starting from an **integrated fact table** of omnichannel transactions (online + store + returns), we will compute **features** for:

- **RFM** segmentation (Recency, Frequency, Monetary)
- **Churn** use cases (label definition and variables)
- **Purchase propensity** (a *snapshot*-based dataset)
- **Market basket analysis** (preparing the transaction data)
- KPIs and features at the **product/category/channel** level

### Teaching approach
- We load the data in **raw** mode (`dtype=str`) so that data errors are preserved instead of being silently coerced.
- We apply reusable functions that take **dataframes as parameters**.
- We write the outputs as CSV files to `output_features/`.
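
To see what raw mode preserves, here is a minimal sketch on an in-memory CSV. The sample values are invented for illustration and do not come from the FraSoHome files; the `na_values` list is shortened on purpose.

```python
import io
import pandas as pd

# Hypothetical sample with zero-padded IDs and a mixed-format amount column
csv_text = 'customer_id,importe\n00123,"1.234,56"\n00045,12.5\n00007,NULL\n'

# Default loading: pandas infers types, so the zero padding is lost
inferred = pd.read_csv(io.StringIO(csv_text))

# Raw loading: every value stays exactly as written in the file
raw = pd.read_csv(
    io.StringIO(csv_text),
    dtype=str,
    keep_default_na=False,
    na_values=["", "NULL"],
)

print(inferred["customer_id"].tolist())
print(raw["customer_id"].tolist())
print(raw["importe"].tolist())
```

With type inference the IDs become integers, which would break joins against IDs stored as text. In raw mode the IDs and the European-style amount survive untouched, and only the explicit null markers become NaN.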

> Note: The data contains **intentional errors** (orphan IDs, mixed formats, etc.).  
> Here we focus on **feature engineering** after minimal cleaning.  
> If you have run Notebook 3/4 and produced cleaned/integrated datasets, we will use them when they exist.


In [None]:

import os
import re
import math
import json
from datetime import datetime, timedelta
from typing import Optional, Dict, List

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)


In [None]:

# =========================
# Path configuration
# =========================
DATA_DIR = "."  # folder containing the CSV files (adjust if needed)
OUTPUT_DIR = "output_features"
os.makedirs(OUTPUT_DIR, exist_ok=True)

PATHS = {
    "crm": os.path.join(DATA_DIR, "crm.csv"),
    "productos": os.path.join(DATA_DIR, "productos.csv"),
    "tiendas": os.path.join(DATA_DIR, "tiendas.csv"),
    "stock_diario": os.path.join(DATA_DIR, "stock_diario.csv"),
    # Prefer the integrated fact table from Notebook 4 if it exists
    "fact_integrada": os.path.join(DATA_DIR, "output_integrated", "fact_transacciones_integrada.csv"),
    # Fallback: prebuilt fact table
    "fact_fallback": os.path.join(DATA_DIR, "fact_transacciones.csv"),
}

for k, p in PATHS.items():
    print(("OK  " if os.path.exists(p) else "MISS") + f"{k:>18}: {p}")


In [None]:

# ==========================================================
# Reusable utilities (formatting, dates, numbers, IDs)
# ==========================================================

def load_csv_raw(path: str, encoding: str = "utf-8") -> pd.DataFrame:
    """Load a CSV in raw mode (dtype=str) so intentional data errors are NOT lost."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"File not found: {path}")
    return pd.read_csv(
        path,
        dtype=str,
        encoding=encoding,
        keep_default_na=False,
        na_values=["", "NULL", "null", "None", "NA", "N/A"]
    )

def standardize_id(series: pd.Series) -> pd.Series:
    """Normalize IDs (strip, uppercase) and map guest/null-like values (GUEST, INVITADO, 'nan') to NaN."""
    s = series.astype(str).str.strip()
    s = s.replace({"": np.nan, "nan": np.nan, "None": np.nan, "NULL": np.nan, "N/A": np.nan})
    s = s.str.replace(r"\s+", "", regex=True)   # remove internal whitespace
    s = s.str.upper()
    s = s.replace({"GUEST": np.nan, "INVITADO": np.nan})
    return s

_MONTHS_ES = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "setiembre": 9, "octubre": 10,
    "noviembre": 11, "diciembre": 12
}

def _parse_spanish_text_date(value: str):
    """Parse Spanish text dates such as '10 de Enero de 2023' (with or without a time)."""
    if value is None or (isinstance(value, float) and np.isnan(value)):
        return None
    txt = str(value).strip().lower()
    m = re.search(r"(\d{1,2})\s+de\s+([a-záéíóúñ]+)\s+de\s+(\d{4})(?:\s+(\d{1,2}):(\d{2}))?", txt)
    if not m:
        return None

    day = int(m.group(1))
    month_name = (m.group(2)
                  .replace("á","a").replace("é","e").replace("í","i")
                  .replace("ó","o").replace("ú","u"))
    month = _MONTHS_ES.get(month_name)
    if month is None:
        return None

    year = int(m.group(3))
    hh = int(m.group(4)) if m.group(4) else 0
    mm = int(m.group(5)) if m.group(5) else 0

    try:
        return datetime(year, month, day, hh, mm)
    except ValueError:
        return None

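def parse_datetime_es(series: pd.Series, dayfirst: bool = True) -> pd.Series:
    """
    Parse dates in any of these formats:
    - ISO (YYYY-MM-DD, YYYY-MM-DD HH:MM:SS)
    - DD/MM/YYYY or DD/MM/YY
    - Spanish text: '10 de Enero de 2023'
    """
    s = series.copy()
    dt1 = pd.to_datetime(s, errors="coerce", dayfirst=dayfirst, utc=False)

    # pandas >= 2.0 infers a single format from the first value and coerces
    # non-matching rows to NaT, so retry the failures one value at a time.
    mask = dt1.isna() & s.notna()
    if mask.any():
        retry = [pd.to_datetime(v, errors="coerce", dayfirst=dayfirst) for v in s[mask].astype(str)]
        dt1.loc[mask] = pd.to_datetime(retry, errors="coerce")

    # Whatever is still unparsed may be Spanish text ('10 de Enero de 2023')
    mask = dt1.isna() & s.notna()
    if mask.any():
        parsed = [_parse_spanish_text_date(v) for v in s[mask].astype(str).tolist()]
        dt1.loc[mask] = pd.to_datetime(parsed, errors="coerce")

    return dt1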

def parse_numeric_mixed(series: pd.Series) -> pd.Series:
    """
    Convert numeric strings in mixed formats to float:
    - '€1.234,56' / '1.234,56' / '1234.56' / '1,234.56'
    """
    s = series.astype(str).str.strip()
    s = s.replace({"": np.nan, "nan": np.nan, "None": np.nan, "NULL": np.nan, "N/A": np.nan})

    s = (s.str.replace("€", "", regex=False)
           .str.replace("EUR", "", regex=False)
           .str.replace("eur", "", regex=False))

    # keep only digits, comma, period and sign
    s = s.str.replace(r"[^\d,.\-]", "", regex=True)

    def _to_float(x):
        if x is None or (isinstance(x, float) and np.isnan(x)):
            return np.nan
        x = str(x)
        if x == "":
            return np.nan

        if "," in x and "." in x:
            # the separator that appears last is taken as the decimal separator
            if x.rfind(",") > x.rfind("."):
                x2 = x.replace(".", "").replace(",", ".")
            else:
                x2 = x.replace(",", "")
        elif "," in x and "." not in x:
            x2 = x.replace(",", ".")
        else:
            x2 = x

        try:
            return float(x2)
        except ValueError:
            return np.nan

    return s.apply(_to_float)

def safe_ratio(num: pd.Series, den: pd.Series) -> pd.Series:
    """Divide num by den, returning NaN where den is zero or missing (no division by zero)."""
    den_ok = den.where(den.fillna(0) != 0)
    return num / den_ok

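def describe_quick(df: Optional[pd.DataFrame], name: str, n: int = 5) -> None:
    """Print shape, first rows and the columns with the highest share of nulls."""
    print(f"\n=== {name} ===")
    if df is None:
        # Optional inputs (crm, productos, stock) may be missing; skip instead of crashing
        print("not available (source file missing)")
        return
    print("shape:", df.shape)
    display(df.head(n))
    na = df.isna().mean().sort_values(ascending=False)
    print("\n% nulls (top 10):")
    display((na.head(10) * 100).round(2))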
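
The separator rule used for mixed numeric formats is easier to judge on concrete inputs. The sketch below restates that rule as a standalone scalar function (`to_float_mixed` is a hypothetical name used only here, and it returns `None` instead of NaN for brevity) and runs it on a few invented strings.

```python
import re
from typing import Optional

def to_float_mixed(text: str) -> Optional[float]:
    """Standalone restatement of the separator rule (illustrative helper)."""
    x = re.sub(r"[^\d,.\-]", "", text)   # drop currency symbols, letters and spaces
    if not x:
        return None
    if "," in x and "." in x:
        # Both separators present: the one that appears last is the decimal one
        if x.rfind(",") > x.rfind("."):
            x = x.replace(".", "").replace(",", ".")
        else:
            x = x.replace(",", "")
    elif "," in x:
        # Comma only: treated as a decimal comma
        x = x.replace(",", ".")
    try:
        return float(x)
    except ValueError:
        return None

for sample in ["€1.234,56", "1,234.56", "12,5", "1.234", "1,234,567"]:
    print(f"{sample!r:>12} -> {to_float_mixed(sample)}")
```

Two cases deserve attention. `"1.234"` is read as 1.234 even if the author meant one thousand two hundred thirty-four, and `"1,234,567"` cannot be parsed at all. Both are limits of the heuristic, not of pandas, so it is worth checking the parsed ranges against what the business expects.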


In [None]:

# =========================
# Dataset loading
# =========================
crm = load_csv_raw(PATHS["crm"]) if os.path.exists(PATHS["crm"]) else None
productos = load_csv_raw(PATHS["productos"]) if os.path.exists(PATHS["productos"]) else None
tiendas = load_csv_raw(PATHS["tiendas"]) if os.path.exists(PATHS["tiendas"]) else None
stock = load_csv_raw(PATHS["stock_diario"]) if os.path.exists(PATHS["stock_diario"]) else None

if os.path.exists(PATHS["fact_integrada"]):
    fact = load_csv_raw(PATHS["fact_integrada"])
    FACT_SOURCE = "fact_integrada"
else:
    fact = load_csv_raw(PATHS["fact_fallback"])
    FACT_SOURCE = "fact_fallback"

print(f"\nUsing fact: {FACT_SOURCE} -> rows={len(fact):,} cols={fact.shape[1]}")
describe_quick(fact, "fact (raw)", n=3)


In [None]:

# ==========================================================
# Minimal typing for feature engineering (fact + dimensions)
# ==========================================================

def ensure_fact_types(fact_raw: pd.DataFrame) -> pd.DataFrame:
    """
    Ensure the fact table has:
    - standardized customer_id_std / product_id_std / store_id_std
    - event_dt (datetime)
    - amount_signed (float) and quantity_signed (float)
    Original columns are kept for traceability.
    """
    df = fact_raw.copy()

    # IDs
    if "customer_id_std" in df.columns:
        df["customer_id_std"] = standardize_id(df["customer_id_std"])
    elif "customer_id" in df.columns:
        df["customer_id_std"] = standardize_id(df["customer_id"])
    elif "customer_id_raw" in df.columns:
        df["customer_id_std"] = standardize_id(df["customer_id_raw"])
    else:
        df["customer_id_std"] = np.nan

    if "product_id_std" in df.columns:
        df["product_id_std"] = standardize_id(df["product_id_std"])
    elif "product_id" in df.columns:
        df["product_id_std"] = standardize_id(df["product_id"])
    elif "product_id_raw" in df.columns:
        df["product_id_std"] = standardize_id(df["product_id_raw"])
    else:
        df["product_id_std"] = np.nan

    if "store_id_std" in df.columns:
        df["store_id_std"] = standardize_id(df["store_id_std"])
    elif "store_id" in df.columns:
        df["store_id_std"] = standardize_id(df["store_id"])
    elif "store_id_raw" in df.columns:
        df["store_id_std"] = standardize_id(df["store_id_raw"])
    else:
        df["store_id_std"] = np.nan

    # Date
    if "fecha_movimiento" in df.columns:
        df["event_dt"] = parse_datetime_es(df["fecha_movimiento"])
    elif "fecha_hora" in df.columns:
        df["event_dt"] = parse_datetime_es(df["fecha_hora"])
    elif "fecha_movimiento_raw" in df.columns:
        df["event_dt"] = parse_datetime_es(df["fecha_movimiento_raw"])
    else:
        df["event_dt"] = pd.NaT

    # Quantity
    if "cantidad_signed" in df.columns:
        df["quantity_signed"] = parse_numeric_mixed(df["cantidad_signed"])
    elif "cantidad" in df.columns:
        df["quantity_signed"] = parse_numeric_mixed(df["cantidad"])
    elif "cantidad_raw" in df.columns:
        df["quantity_signed"] = parse_numeric_mixed(df["cantidad_raw"])
    else:
        df["quantity_signed"] = np.nan

    # Amount
    if "importe_signed_num" in df.columns:
        df["amount_signed"] = parse_numeric_mixed(df["importe_signed_num"])
    elif "importe_signed" in df.columns:
        df["amount_signed"] = parse_numeric_mixed(df["importe_signed"])
    elif "importe_linea_raw" in df.columns:
        df["amount_signed"] = parse_numeric_mixed(df["importe_linea_raw"])
    else:
        df["amount_signed"] = np.nan

    # Discount (%)
    if "descuento_pct_num" in df.columns:
        df["discount_pct"] = parse_numeric_mixed(df["descuento_pct_num"])
    elif "descuento_pct_raw" in df.columns:
        df["discount_pct"] = parse_numeric_mixed(df["descuento_pct_raw"])
    else:
        df["discount_pct"] = np.nan

    # Channel
    if "canal_origen" not in df.columns:
        df["canal_origen"] = df["canal"] if "canal" in df.columns else np.nan

    # Movement type
    if "tipo_movimiento" not in df.columns:
        if "es_devolucion" in df.columns:
            df["tipo_movimiento"] = np.where(df["es_devolucion"].astype(str).str.lower().isin(["1","true","yes","si","sí"]), "DEVOLUCION", "VENTA")
        else:
            df["tipo_movimiento"] = np.nan

    # Document (order/ticket)
    if "doc_id_std" not in df.columns:
        if "doc_id" in df.columns:
            df["doc_id_std"] = standardize_id(df["doc_id"])
        elif "ticket_id" in df.columns:
            df["doc_id_std"] = standardize_id(df["ticket_id"])
        elif "order_id" in df.columns:
            df["doc_id_std"] = standardize_id(df["order_id"])
        else:
            df["doc_id_std"] = np.nan

    df["is_return"] = df["tipo_movimiento"].astype(str).str.upper().str.contains("DEV")
    df["is_sale"] = ~df["is_return"]
    df["event_dt"] = pd.to_datetime(df["event_dt"], errors="coerce")

    return df

fact_t = ensure_fact_types(fact)

describe_quick(
    fact_t[["customer_id_std","product_id_std","store_id_std","event_dt","quantity_signed","amount_signed","discount_pct","canal_origen","tipo_movimiento","doc_id_std"]].copy(),
    "fact (tipada - columnas clave)",
    n=5
)

print("\nDate range (event_dt):", fact_t["event_dt"].min(), "->", fact_t["event_dt"].max())
print("Rows with unparsed date:", fact_t["event_dt"].isna().sum())


In [None]:

# ==========================================================
# Customer features (RFM, returns, channel, discounts)
# ==========================================================

def compute_customer_features(fact_df: pd.DataFrame, as_of: Optional[pd.Timestamp] = None) -> pd.DataFrame:
    df = fact_df.copy()
    df = df[df["customer_id_std"].notna()].copy()

    sales = df[df["is_sale"]].copy()
    returns = df[df["is_return"]].copy()

    if as_of is None:
        as_of = sales["event_dt"].max()
    as_of = pd.to_datetime(as_of)

    first_last = sales.groupby("customer_id_std")["event_dt"].agg(first_purchase_dt="min", last_purchase_dt="max").reset_index()
    first_last["tenure_days"] = (as_of - first_last["first_purchase_dt"]).dt.days
    first_last["recency_days"] = (as_of - first_last["last_purchase_dt"]).dt.days

    freq = sales.groupby("customer_id_std")["doc_id_std"].nunique().reset_index(name="frequency_docs")

    mon_gross = sales.groupby("customer_id_std")["amount_signed"].sum(min_count=1).reset_index(name="monetary_gross")
    mon_net = df.groupby("customer_id_std")["amount_signed"].sum(min_count=1).reset_index(name="monetary_net")

    qty_gross = sales.groupby("customer_id_std")["quantity_signed"].sum(min_count=1).reset_index(name="units_sold")
    qty_net = df.groupby("customer_id_std")["quantity_signed"].sum(min_count=1).reset_index(name="units_net")

    ret_docs = returns.groupby("customer_id_std")["doc_id_std"].nunique().reset_index(name="return_docs")
    ret_units = returns.groupby("customer_id_std")["quantity_signed"].sum(min_count=1).abs().reset_index(name="units_returned")
    ret_amount = returns.groupby("customer_id_std")["amount_signed"].sum(min_count=1).abs().reset_index(name="amount_returned")

    # channel split (sales)
    ch = sales.pivot_table(index="customer_id_std", columns="canal_origen", values="amount_signed", aggfunc="sum", fill_value=0)
    ch.columns = [f"amt_{c.lower()}" for c in ch.columns]
    ch = ch.reset_index()

    sales["has_discount"] = sales["discount_pct"].fillna(0) > 0
    disc = sales.groupby("customer_id_std").agg(
        pct_lines_discount=("has_discount", "mean"),
        avg_discount_pct=("discount_pct", "mean"),
    ).reset_index()
    disc["pct_lines_discount"] = (disc["pct_lines_discount"] * 100).round(2)

    last_channel = (sales.sort_values("event_dt")
                         .groupby("customer_id_std")
                         .tail(1)[["customer_id_std","canal_origen"]]
                         .rename(columns={"canal_origen":"last_channel"}))

    out = (first_last.merge(freq, on="customer_id_std", how="left")
                    .merge(mon_gross, on="customer_id_std", how="left")
                    .merge(mon_net, on="customer_id_std", how="left")
                    .merge(qty_gross, on="customer_id_std", how="left")
                    .merge(qty_net, on="customer_id_std", how="left")
                    .merge(ret_docs, on="customer_id_std", how="left")
                    .merge(ret_units, on="customer_id_std", how="left")
                    .merge(ret_amount, on="customer_id_std", how="left")
                    .merge(ch, on="customer_id_std", how="left")
                    .merge(disc, on="customer_id_std", how="left")
                    .merge(last_channel, on="customer_id_std", how="left"))

    for c in ["return_docs","units_returned","amount_returned"]:
        out[c] = out[c].fillna(0)

    out["return_rate_units"] = safe_ratio(out["units_returned"], out["units_sold"])
    out["return_rate_amount"] = safe_ratio(out["amount_returned"], out["monetary_gross"])
    out["avg_ticket_gross"] = safe_ratio(out["monetary_gross"], out["frequency_docs"])

    # favorite channel by amount
    amt_cols = [c for c in out.columns if c.startswith("amt_")]
    def _fav_channel(row):
        if not amt_cols:
            return np.nan
        m = row[amt_cols].max()
        if pd.isna(m) or m == 0:
            return np.nan
        for c in amt_cols:
            if row[c] == m:
                return c.replace("amt_", "")
        return np.nan
    out["fav_channel"] = out.apply(_fav_channel, axis=1)

    # didactic churn label: no purchase in the last 180 days
    out["label_churn_180d"] = (out["recency_days"] > 180).astype(int)

    return out

customer_features = compute_customer_features(fact_t)
describe_quick(customer_features, "customer_features (base)", n=10)
customer_features.sort_values("monetary_gross", ascending=False).head(10)


In [None]:

# ==========================================================
# RFM scoring (didactic)
# ==========================================================

def rfm_score(df: pd.DataFrame,
              recency_col: str = "recency_days",
              frequency_col: str = "frequency_docs",
              monetary_col: str = "monetary_gross",
              q: int = 5) -> pd.DataFrame:
    out = df.copy()

    out["R_score"] = pd.qcut(out[recency_col].rank(method="first"), q, labels=list(range(q,0,-1))).astype(int)
    out["F_score"] = pd.qcut(out[frequency_col].rank(method="first"), q, labels=list(range(1,q+1))).astype(int)
    out["M_score"] = pd.qcut(out[monetary_col].rank(method="first"), q, labels=list(range(1,q+1))).astype(int)

    out["RFM_score"] = out["R_score"].astype(str) + out["F_score"].astype(str) + out["M_score"].astype(str)
    out["RFM_sum"] = out[["R_score","F_score","M_score"]].sum(axis=1)
    return out

customer_features_rfm = rfm_score(customer_features)
display(customer_features_rfm[["customer_id_std","recency_days","frequency_docs","monetary_gross","R_score","F_score","M_score","RFM_score","RFM_sum"]].head(10))

print("\nRFM_sum distribution:")
display(customer_features_rfm["RFM_sum"].value_counts().sort_index())
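
The scoring above ranks each metric with `rank(method="first")` before calling `pd.qcut`. The sketch below, on invented purchase counts, illustrates one reason for doing so: with heavily tied values the quantile edges collide and `qcut` raises, while ranking first always yields equal-sized bins. The trade-off is that tied customers are split across scores by row order.

```python
import pandas as pd

# Hypothetical purchase counts: heavily tied, as is typical for frequency
freq = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 3, 7])

# Direct quantile binning fails because several bin edges are identical
try:
    pd.qcut(freq, 5, labels=[1, 2, 3, 4, 5])
    direct_ok = True
except ValueError as exc:
    direct_ok = False
    print("qcut on raw values failed:", exc)

# Ranking first breaks ties by position, so the ranks 1..10 split into 5 equal bins
ranked = freq.rank(method="first")
f_score = pd.qcut(ranked, 5, labels=[1, 2, 3, 4, 5]).astype(int)
print(pd.DataFrame({"frequency": freq, "F_score": f_score}))
```

In this toy data, customers with a single purchase end up with scores 1, 2 and 3. When that arbitrariness matters, fixed business thresholds or `duplicates="drop"` with fewer bins are alternatives worth comparing.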


In [None]:

# ==========================================================
# Enrichment with CRM (tier, points, status, etc.)
# ==========================================================

def prepare_crm_dim(crm_raw: pd.DataFrame) -> pd.DataFrame:
    df = crm_raw.copy()

    df["customer_id_std"] = standardize_id(df["customer_id"]) if "customer_id" in df.columns else np.nan

    if "puntos_acumulados" in df.columns:
        df["puntos_num"] = parse_numeric_mixed(df["puntos_acumulados"])
    else:
        df["puntos_num"] = np.nan

    if "fecha_alta_programa" in df.columns:
        df["fecha_alta_dt"] = parse_datetime_es(df["fecha_alta_programa"])
    else:
        df["fecha_alta_dt"] = pd.NaT

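    if "tier_fidelizacion" in df.columns:
        # Strip only real values; astype(str) alone would turn missing tiers into the literal "nan"
        tier = df["tier_fidelizacion"]
        df["tier_fidelizacion"] = tier.where(tier.isna(), tier.astype(str).str.strip())
    else:
        df["tier_fidelizacion"] = np.nan

    # Deduplicate by customer_id_std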
    if "ultima_actualizacion" in df.columns:
        df["ultima_actualizacion_dt"] = parse_datetime_es(df["ultima_actualizacion"])
        df = df.sort_values("ultima_actualizacion_dt").drop_duplicates("customer_id_std", keep="last")
    else:
        df = df.drop_duplicates("customer_id_std", keep="last")

    keep = ["customer_id_std","tier_fidelizacion","puntos_num","estado_cliente","consentimiento_marketing","fecha_alta_dt","ciudad","provincia","codigo_postal","pais"]
    keep = [c for c in keep if c in df.columns]
    return df[keep].copy()

crm_dim = prepare_crm_dim(crm) if crm is not None else None
describe_quick(crm_dim, "crm_dim (preparada)", n=5)

customer_features_final = customer_features_rfm.merge(crm_dim, on="customer_id_std", how="left") if crm_dim is not None else customer_features_rfm.copy()
describe_quick(customer_features_final, "customer_features_final (con CRM)", n=10)


In [None]:

# ==========================================================
# Product / channel features
# ==========================================================

def compute_product_features(fact_df: pd.DataFrame) -> pd.DataFrame:
    df = fact_df.copy()
    df = df[df["product_id_std"].notna()].copy()

    sales = df[df["is_sale"]].copy()
    returns = df[df["is_return"]].copy()

    agg_sales = sales.groupby("product_id_std").agg(
        units_sold=("quantity_signed", "sum"),
        revenue_gross=("amount_signed", "sum"),
        avg_discount_pct=("discount_pct", "mean"),
        lines_sold=("product_id_std", "size"),
        docs_sold=("doc_id_std", "nunique"),
    ).reset_index()

    agg_net = df.groupby("product_id_std").agg(
        units_net=("quantity_signed", "sum"),
        revenue_net=("amount_signed", "sum"),
    ).reset_index()

    agg_returns = returns.groupby("product_id_std").agg(
        units_returned=("quantity_signed", lambda s: s.abs().sum()),
        amount_returned=("amount_signed", lambda s: s.abs().sum()),
        return_docs=("doc_id_std", "nunique"),
    ).reset_index()

    out = agg_sales.merge(agg_net, on="product_id_std", how="left").merge(agg_returns, on="product_id_std", how="left")
    out[["units_returned","amount_returned","return_docs"]] = out[["units_returned","amount_returned","return_docs"]].fillna(0)

    out["return_rate_units"] = safe_ratio(out["units_returned"], out["units_sold"])
    out["return_rate_amount"] = safe_ratio(out["amount_returned"], out["revenue_gross"])

    # channel split (sales)
    ch = sales.pivot_table(index="product_id_std", columns="canal_origen", values="amount_signed", aggfunc="sum", fill_value=0)
    ch.columns = [f"amt_{c.lower()}" for c in ch.columns]
    ch = ch.reset_index()

    out = out.merge(ch, on="product_id_std", how="left")
    return out

product_features = compute_product_features(fact_t)
describe_quick(product_features, "product_features (base)", n=10)


In [None]:

# ==========================================================
# Enrich products with the product master + stock metrics
# ==========================================================

def prepare_product_dim(prod_raw: pd.DataFrame) -> pd.DataFrame:
    df = prod_raw.copy()
    df["product_id_std"] = standardize_id(df["product_id"]) if "product_id" in df.columns else np.nan

    if "precio_venta" in df.columns:
        df["precio_venta_num"] = parse_numeric_mixed(df["precio_venta"])
    if "coste_unitario" in df.columns:
        df["coste_unitario_num"] = parse_numeric_mixed(df["coste_unitario"])

    for c in ["categoria","subcategoria","marca","estado_producto","nombre_producto"]:
        if c in df.columns:
            # Keep missing values as NaN instead of the literal string "nan"
            df[c] = df[c].where(df[c].isna(), df[c].astype(str).str.strip())

    df = df.drop_duplicates("product_id_std", keep="last")
    keep = ["product_id_std","nombre_producto","categoria","subcategoria","marca","proveedor","material","color",
            "precio_venta_num","coste_unitario_num","estado_producto"]
    keep = [c for c in keep if c in df.columns]
    return df[keep].copy()

prod_dim = prepare_product_dim(productos) if productos is not None else None
describe_quick(prod_dim, "prod_dim (preparada)", n=5)

def compute_stock_features(stock_raw: pd.DataFrame) -> pd.DataFrame:
    df = stock_raw.copy()
    df["product_id_std"] = standardize_id(df["product_id"]) if "product_id" in df.columns else np.nan
    df["store_id_std"] = standardize_id(df["store_id"]) if "store_id" in df.columns else np.nan
    df["date_dt"] = parse_datetime_es(df["fecha"]) if "fecha" in df.columns else pd.NaT
    df["stock_cierre_num"] = parse_numeric_mixed(df["stock_cierre"]) if "stock_cierre" in df.columns else np.nan

    g = df.groupby("product_id_std").agg(
        avg_stock=("stock_cierre_num","mean"),
        min_stock=("stock_cierre_num","min"),
        max_stock=("stock_cierre_num","max"),
        days_measured=("date_dt","count"),
        # Missing stock readings are not counted as stock-outs (NaN compares as False)
        days_stockout=("stock_cierre_num", lambda s: (s <= 0).sum()),
        days_negative_stock=("stock_cierre_num", lambda s: (s < 0).sum()),
    ).reset_index()

    g["pct_days_stockout"] = safe_ratio(g["days_stockout"], g["days_measured"])
    return g

stock_feat = compute_stock_features(stock) if stock is not None else None
describe_quick(stock_feat, "stock_feat (por producto)", n=5)

product_features_final = product_features.copy()
if prod_dim is not None:
    product_features_final = product_features_final.merge(prod_dim, on="product_id_std", how="left")
if stock_feat is not None:
    product_features_final = product_features_final.merge(stock_feat, on="product_id_std", how="left")

describe_quick(product_features_final, "product_features_final (enriquecido)", n=10)

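# --- KPIs by category and channel (sales)
def compute_category_channel_kpis(fact_df: pd.DataFrame, prod_dim: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    df = fact_df[fact_df["is_sale"]].copy()

    if "categoria" not in df.columns and prod_dim is not None:
        cols = [c for c in ["product_id_std","categoria","subcategoria"] if c in prod_dim.columns]
        df = df.merge(prod_dim[cols], on="product_id_std", how="left")
    if "categoria" not in df.columns:
        # No category available (e.g. product master missing): use a placeholder so the groupby still runs
        df["categoria"] = "UNKNOWN"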

    g = df.groupby(["categoria","canal_origen"]).agg(
        revenue=("amount_signed","sum"),
        units=("quantity_signed","sum"),
        docs=("doc_id_std","nunique"),
        lines=("product_id_std","size"),
        avg_discount=("discount_pct","mean"),
    ).reset_index()

    return g

category_channel = compute_category_channel_kpis(fact_t, prod_dim=prod_dim)
describe_quick(category_channel, "category_channel KPIs", n=10)


In [None]:

# ==========================================================
# Preparation for market basket analysis
# ==========================================================

def build_basket_long(fact_df: pd.DataFrame,
                      min_qty: float = 0.0,
                      drop_duplicates_in_tx: bool = True) -> pd.DataFrame:
    df = fact_df.copy()
    df = df[df["is_sale"]].copy()
    df = df[df["product_id_std"].notna() & df["doc_id_std"].notna()].copy()
    df = df[df["quantity_signed"].fillna(0) > min_qty].copy()

    basket = df[["doc_id_std","product_id_std","canal_origen","event_dt","quantity_signed","amount_signed"]].copy()
    basket = basket.rename(columns={"doc_id_std":"transaction_id"})

    if drop_duplicates_in_tx:
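        # dropna=False keeps lines whose canal_origen is missing (groupby drops NaN keys by default)
        basket = basket.groupby(["transaction_id","product_id_std","canal_origen"], as_index=False, dropna=False).agg(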
            event_dt=("event_dt","max"),
            quantity=("quantity_signed","sum"),
            amount=("amount_signed","sum"),
        )
    else:
        basket = basket.rename(columns={"quantity_signed":"quantity","amount_signed":"amount"})

    return basket

basket_long = build_basket_long(fact_t)
describe_quick(basket_long, "basket_long", n=10)

print("Transacciones únicas:", basket_long["transaction_id"].nunique())
print("Productos únicos:", basket_long["product_id_std"].nunique())

# Didactic example: one-hot matrix (it can grow very large on real data)
basket_onehot = (basket_long.assign(present=1)
                          .pivot_table(index="transaction_id", columns="product_id_std", values="present", aggfunc="max", fill_value=0))
print("Matriz one-hot:", basket_onehot.shape)
display(basket_onehot.head(5))
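
A one-hot matrix like the one above already contains what is needed for pairwise co-occurrence: multiplying its transpose by itself counts, for every pair of products, the transactions that contain both. The sketch below shows this on four invented transactions; it is a first look at support, not a replacement for Apriori or FP-Growth.

```python
import pandas as pd

# Hypothetical long-format basket (same two key columns as basket_long)
toy = pd.DataFrame({
    "transaction_id": ["T1", "T1", "T2", "T2", "T2", "T3", "T4", "T4"],
    "product_id_std": ["P1", "P2", "P1", "P2", "P3", "P1", "P2", "P3"],
})

onehot = (toy.assign(present=1)
             .pivot_table(index="transaction_id", columns="product_id_std",
                          values="present", aggfunc="max", fill_value=0))

# Entry (i, j) = number of transactions containing both product i and product j;
# the diagonal holds each product's own transaction count
cooc = onehot.T.dot(onehot)

# Support = share of all transactions that contain the pair
support = cooc / len(onehot)

print(cooc)
print("support(P1, P2) =", support.loc["P1", "P2"])
```

Here P1 and P2 appear together in 2 of 4 transactions, so their support is 0.5. The product-by-product matrix grows quadratically with the catalog, so on real data it is usually restricted to the best-selling products first.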


In [None]:

# ==========================================================
# Didactic propensity dataset (snapshot-based)
# ==========================================================

def compute_customer_features_in_window(fact_df: pd.DataFrame,
                                        start: pd.Timestamp,
                                        end: pd.Timestamp) -> pd.DataFrame:
    """Compute customer features restricted to the interval [start, end)."""
    dfw = fact_df[(fact_df["event_dt"] >= start) & (fact_df["event_dt"] < end)].copy()
    return compute_customer_features(dfw, as_of=end)

def build_propensity_dataset(fact_df: pd.DataFrame,
                             snapshot_dates: List[pd.Timestamp],
                             lookback_days: int = 180,
                             horizon_days: int = 60) -> pd.DataFrame:
    fact_df = fact_df.copy()
    fact_df = fact_df[fact_df["customer_id_std"].notna()].copy()

    rows = []
    for snap in snapshot_dates:
        snap = pd.to_datetime(snap)

        start = snap - pd.Timedelta(days=lookback_days)
        end = snap

        feats = compute_customer_features_in_window(fact_df, start=start, end=end)

        future_start = snap
        future_end = snap + pd.Timedelta(days=horizon_days)

        future_sales = (fact_df[(fact_df["is_sale"]) &
                                (fact_df["event_dt"] >= future_start) &
                                (fact_df["event_dt"] < future_end)][["customer_id_std","doc_id_std"]]
                        .drop_duplicates())
        future_sales["label_buy_next_horizon"] = 1

        feats = feats.merge(future_sales[["customer_id_std","label_buy_next_horizon"]].drop_duplicates(),
                            on="customer_id_std", how="left")
        feats["label_buy_next_horizon"] = feats["label_buy_next_horizon"].fillna(0).astype(int)
        feats["snapshot_date"] = snap

        rows.append(feats)

    return pd.concat(rows, ignore_index=True) if rows else pd.DataFrame()

max_dt = fact_t["event_dt"].max()
min_dt = fact_t["event_dt"].min()

if pd.notna(max_dt) and pd.notna(min_dt):
    snapshots = pd.date_range(end=max_dt.normalize(), periods=6, freq="MS")
    # Keep snapshots with enough history behind them and a complete 60-day label
    # window ahead; otherwise recent snapshots get false 0 labels because the data ends.
    snapshots = [d for d in snapshots
                 if d > (min_dt + pd.Timedelta(days=240))
                 and d + pd.Timedelta(days=60) <= max_dt]
else:
    snapshots = []
print("Snapshots:", snapshots)

if snapshots:
    propensity_ds = build_propensity_dataset(fact_t, snapshot_dates=snapshots, lookback_days=180, horizon_days=60)
    describe_quick(propensity_ds, "propensity_ds", n=10)

    print("Balance label_buy_next_horizon:")
    display(propensity_ds["label_buy_next_horizon"].value_counts(normalize=True).rename("pct"))
else:
    propensity_ds = pd.DataFrame()
    print("Not enough date range to build snapshots.")
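
The key property of a snapshot dataset is that features only look backwards from the snapshot date and the label only looks forwards. The sketch below reproduces that split for a single snapshot on invented events; `past`, `future` and `n_events` are illustrative names, and only two toy features are computed.

```python
import pandas as pd

# Hypothetical sales events for three customers
events = pd.DataFrame({
    "customer_id_std": ["C1", "C1", "C2", "C3", "C3"],
    "event_dt": pd.to_datetime(["2023-01-10", "2023-04-05", "2023-02-20",
                                "2022-06-01", "2023-04-01"]),
})

snap = pd.Timestamp("2023-04-01")
lookback = pd.Timedelta(days=180)
horizon = pd.Timedelta(days=60)

# Features use [snap - lookback, snap); the label uses [snap, snap + horizon)
past = events[(events["event_dt"] >= snap - lookback) & (events["event_dt"] < snap)]
future = events[(events["event_dt"] >= snap) & (events["event_dt"] < snap + horizon)]

dataset = (past.groupby("customer_id_std")["event_dt"]
               .agg(n_events="count", last_dt="max")
               .reset_index())
dataset["recency_days"] = (snap - dataset["last_dt"]).dt.days
dataset["label_buy_next_horizon"] = (
    dataset["customer_id_std"].isin(future["customer_id_std"]).astype(int)
)
print(dataset)
```

C3 buys on the snapshot date itself, which falls in the label window, but has no activity inside the lookback window and therefore does not appear in the dataset at all. The same happens in the notebook's dataset: only customers active during the lookback period are scored, which is worth keeping in mind when interpreting the label balance.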


In [None]:

# ==========================================================
# Export feature datasets (CSV)
# ==========================================================

def export_csv(df: pd.DataFrame, filename: str) -> str:
    path = os.path.join(OUTPUT_DIR, filename)
    df.to_csv(path, index=False, encoding="utf-8")
    print("Exported:", path, "| rows:", len(df))
    return path

export_csv(customer_features_final, "features_clientes.csv")
export_csv(product_features_final, "features_productos.csv")
export_csv(category_channel, "features_categoria_canal.csv")
export_csv(basket_long, "basket_long.csv")

if len(propensity_ds) > 0:
    export_csv(propensity_ds, "dataset_propension_snapshots.csv")

summary = pd.DataFrame([
    {"dataset":"features_clientes.csv","rows":len(customer_features_final)},
    {"dataset":"features_productos.csv","rows":len(product_features_final)},
    {"dataset":"features_categoria_canal.csv","rows":len(category_channel)},
    {"dataset":"basket_long.csv","rows":len(basket_long)},
    {"dataset":"dataset_propension_snapshots.csv","rows":len(propensity_ds)},
])
display(summary)


## Suggested next steps (for the course)

- **Churn**
  - Review the definition of the `label_churn_180d` label (it is didactic).
  - Try different thresholds (90/120/180 days) or a definition based on time windows.

- **Propensity**
  - Tune `lookback_days` and `horizon_days`.
  - You can create per-**category** labels: "will buy Muebles in the next 60 days".

- **Market basket**
  - Use `basket_long.csv` to build association rules (Apriori, FP-Growth).
  - Try it per channel (ONLINE vs POS) by filtering on `canal_origen`.

- **Notebook 6 – Final preprocessing (scaling and encoding)**
  - Scaling (MinMax / StandardScaler)
  - One-hot encoding of `tier_fidelizacion`, `fav_channel`, etc.
  - Final numeric dataset ready for ML/BI
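
As a starting point for the churn suggestion above, the sketch below derives one label per threshold from a `recency_days` column and compares the resulting churn rates. The six recency values are invented; with the real data the same loop could be run on the customer features table.

```python
import pandas as pd

# Hypothetical recency values (days since last purchase) for six customers
toy = pd.DataFrame({
    "customer_id_std": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "recency_days": [12, 95, 130, 181, 240, 400],
})

thresholds = [90, 120, 180]
for days in thresholds:
    # A customer is labeled as churned when the last purchase is older than the threshold
    toy[f"label_churn_{days}d"] = (toy["recency_days"] > days).astype(int)

# Share of customers labeled as churned under each definition
churn_rate = toy[[f"label_churn_{d}d" for d in thresholds]].mean()
print(churn_rate)
```

The shorter the threshold, the more customers are labeled as churned, so the choice should follow the typical repurchase cycle of the business and not a default value.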
