# Churn: Modeling (Classification)

In this notebook I train classification models to predict `churn` using the already preprocessed dataset (`churn_processed.csv`).

## 1) Loading the dataset

I read the processed CSV and quickly check its format (first rows).

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix
)


df = pd.read_csv("../data/churn_processed.csv")
df.head()
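
Beyond looking at the first rows, a short summary can confirm the size of the data, whether missing values remain, and how imbalanced the target is. The sketch below is only an illustration: `summarize` is a hypothetical helper and the toy frame stands in for the real `churn_processed.csv`, whose columns are not shown here.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str) -> dict:
    """Collect a few basic facts about a dataset before modeling."""
    return {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        "n_missing": int(df.isna().sum().sum()),
        # Share of each class in the target column.
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }


# Toy stand-in for the processed churn data (hypothetical values).
toy = pd.DataFrame({
    "tenure": [1, 24, 36, 2, 60],
    "contract": ["monthly", "yearly", "yearly", "monthly", "yearly"],
    "churn": [1, 0, 0, 1, 0],
})

summary = summarize(toy, "churn")
print(summary)
```

On the real data the same call would be `summarize(df, "churn")`.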



## 2) Define features (X) and target (y)

- `y` = the `churn` column.
- `X` = all remaining columns (features).

In [None]:
# Name of the target column in the processed dataset.
target = "churn"

X = df.drop(columns=target)  # every column except the target
y = df[target]


## 3) Train/test split

I split the data into training and test sets (80/20). I use `stratify=y` so that both sets keep the same class proportions as the full dataset.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
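
To see what `stratify` does, the following sketch splits a synthetic, imbalanced target and compares the positive rate in each part. The data is made up for illustration (80 negatives, 20 positives); only the `train_test_split` arguments mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 80 negatives, 20 positives (hypothetical data).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Positive rate in the full data, the training part, and the test part.
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the positive rate of a small test set can drift away from the overall rate purely by chance.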


## 4) Preprocessing

- Numeric columns: scaling with `StandardScaler` (with `with_mean=False`, so sparse output stays sparse).
- Categorical columns: `OneHotEncoder` with `handle_unknown='ignore'`.

> Note: `class_weight='balanced'` (set on the models in section 5) helps when the classes are imbalanced.

In [None]:
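# Split columns by dtype. Columns of any other dtype (e.g. bool) are dropped,
# because ColumnTransformer defaults to remainder="drop".
cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
num_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        # with_mean=False scales by the standard deviation without centering.
        ("num", StandardScaler(with_mean=False), num_cols),
        ("cat", OneHotEncoder(
            handle_unknown="ignore",  # unseen categories encode as all zeros
            sparse_output=True,       # parameter name in scikit-learn >= 1.2
            dtype=np.float32,
            min_frequency=20          # rarer categories share one "infrequent" column
        ), cat_cols)
    ],
    # Stack the outputs as a sparse matrix when overall density is below this value.
    sparse_threshold=1.0
)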


## 5) Models to compare

I try 3 fast, solid linear models as baselines:
- Logistic regression
- Linear SVM
- SGDClassifier

In [None]:
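models = {
    "LogReg (saga)": LogisticRegression(
        solver="saga",
        max_iter=3000,
        n_jobs=-1,
        class_weight="balanced",
        random_state=42  # saga shuffles the data, so fix the seed
    ),
    "LinearSVC": LinearSVC(class_weight="balanced", random_state=42),
    # Default loss is "hinge", i.e. a linear SVM fitted with SGD.
    "SGDClassifier": SGDClassifier(class_weight="balanced", random_state=42)
}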
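
For reference, `class_weight="balanced"` weights each class by `n_samples / (n_classes * count_of_that_class)`, so the rarer class gets the larger weight. A minimal sketch on a made-up 80/20 target, compared against scikit-learn's `compute_class_weight`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 80 non-churners (0), 20 churners (1).
y_toy = np.array([0] * 80 + [1] * 20)
classes = np.unique(y_toy)

# "balanced" heuristic: n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print(dict(zip(classes.tolist(), manual.tolist())))
print(np.allclose(manual, from_sklearn))
```

Here the minority class receives a weight of 2.5 against 0.625 for the majority class, so each misclassified churner counts four times as much in the loss.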


## 6) Training and comparison

I train each model with the same `Pipeline` and compare metrics (Accuracy, Precision, Recall, F1). I sort by **F1** to pick the model with the best balance between precision and recall.

In [None]:
results = []

for name, model in models.items():
    # Same preprocessing for every model, so only the estimator changes.
    pipeline = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("model", model)
    ])

    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    # The binary metrics use the default pos_label=1,
    # which assumes `churn` is encoded as 0/1.
    results.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, zero_division=0),
        "Recall": recall_score(y_test, y_pred, zero_division=0),
        "F1": f1_score(y_test, y_pred, zero_division=0),
    })

# Best F1 first.
results_df = pd.DataFrame(results).sort_values(by="F1", ascending=False)
results_df
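
As a reminder of what the table columns measure, precision, recall, and F1 can be computed by hand from the counts of true positives, false positives, and false negatives. The labels below are invented for illustration (1 = churn); the sketch only checks that the manual formulas agree with scikit-learn.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels and predictions (1 = churn).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # churners correctly flagged
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # non-churners flagged by mistake
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # churners that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a reasonable sorting key on imbalanced data.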


## 7) Best model + final report

I retrain the best model and show:
- `classification_report`
- confusion matrix

In [None]:
# results_df is sorted by F1 (descending), so the first row is the best model.
best_name = results_df.iloc[0]["Model"]
best_model = models[best_name]

best_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", best_model)
])

# Refit on the training set and evaluate on the held-out test set.
best_pipeline.fit(X_train, y_train)
y_pred = best_pipeline.predict(X_test)

print("Best model:", best_name)
print(classification_report(y_test, y_pred, zero_division=0))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
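
When reading the printed matrix, it helps to remember scikit-learn's layout: rows are true classes, columns are predicted classes, both in sorted label order. For a binary 0/1 target the four cells can be unpacked with `ravel()`. A small sketch on invented labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = churn).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows = true class, columns = predicted class, in the order given by `labels`.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

For churn, the `fn` cell (churners predicted as staying) is often the most costly error, so it is worth checking alongside the aggregate metrics.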
