# Churn: Modeling (Classification)

In this notebook I train classification models to predict `churn` from the already preprocessed dataset (`churn_processed.csv`).

## 1) Loading the dataset

I read the processed CSV and do a quick format check by inspecting the first rows.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix
)


df = pd.read_csv("../data/churn_processed.csv")
df.head()



| | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state | last_purchase_date | days_since_last_purchase | churn | purchase_count | total_spent | avg_ticket | avg_review_score | main_payment_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP | 2017-05-16 15:05:35 | 519 | 1 | 1 | 124.99 | 124.99 | 4.0 | credit_card |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 9790 | sao bernardo do campo | SP | 2018-01-12 20:48:24 | 277 | 1 | 1 | 289.0 | 289.0 | 5.0 | credit_card |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 1151 | sao paulo | SP | 2018-05-19 16:07:45 | 151 | 0 | 1 | 139.94 | 139.94 | 5.0 | credit_card |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 8775 | mogi das cruzes | SP | 2018-03-13 16:06:38 | 218 | 1 | 1 | 149.94 | 149.94 | 5.0 | credit_card |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP | 2018-07-29 09:51:30 | 80 | 0 | 1 | 230.0 | 230.0 | 5.0 | credit_card |


## 2) Define features (X) and target (y)

- `y` = the `churn` column
- `X` = all remaining columns (features).

In [2]:
target = "churn"   

X = df.drop(columns=target)
y = df[target]
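One thing that may be worth checking at this step: `X` as defined above still contains identifier columns (`customer_id`, `customer_unique_id`) and the recency columns (`last_purchase_date`, `days_since_last_purchase`). In the five sample rows, the customers with 80 and 151 days since their last purchase have `churn = 0` and those with 218 days or more have `churn = 1`, so the label could be derived from recency. That is an assumption, not something the notebook states. A minimal sketch of excluding such columns, using a hypothetical helper `split_features_target` and toy data:

```python
import pandas as pd


def split_features_target(df, target, drop_cols=()):
    """Return (X, y), leaving out the target and any listed columns.

    `drop_cols` is a hypothetical list of identifier or possibly leaky
    columns; names that are not present in `df` are ignored.
    """
    to_drop = [target] + [c for c in drop_cols if c in df.columns]
    return df.drop(columns=to_drop), df[target]


# Toy frame that mimics a few of the notebook's columns (values are made up).
toy = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "days_since_last_purchase": [519, 151, 80],
    "total_spent": [124.99, 139.94, 230.0],
    "churn": [1, 0, 0],
})

X_toy, y_toy = split_features_target(
    toy,
    target="churn",
    drop_cols=["customer_id", "days_since_last_purchase", "last_purchase_date"],
)
print(list(X_toy.columns))
```

Whether these columns should really be dropped depends on how `churn` was built during preprocessing, which this notebook does not show.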


## 3) Train/Test split

I split the data into training and test sets (80/20). I use `stratify=y` so both sets keep the same class proportions.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
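To illustrate what `stratify=y` does, here is a small sketch on made-up labels (30 negatives, 70 positives; these counts are assumptions for the example, not the notebook's data). With stratification, the positive rate is the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 70% positive class.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 30 + [1] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fraction of positives in each split.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.2f}")
print(f"test positive rate:  {test_rate:.2f}")
```

Without `stratify`, the two rates would only match approximately and could drift apart on small or heavily imbalanced datasets.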


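## 4) Preprocessing

- Numeric: scaling with `StandardScaler` (`with_mean=False`, so values are scaled but not centered and the output can stay sparse).
- Categorical: `OneHotEncoder` with `handle_unknown='ignore'`; categories seen fewer than 20 times are grouped together (`min_frequency=20`).

> Note: `class_weight='balanced'`, set on the models in the next section, helps when the classes are imbalanced.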

In [4]:
cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
num_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(with_mean=False), num_cols),
        ("cat", OneHotEncoder(
            handle_unknown="ignore",
            sparse_output=True,
            dtype=np.float32,
            min_frequency=20
        ), cat_cols)
    ],
    sparse_threshold=1.0
)
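As a rough illustration of `handle_unknown="ignore"`, the sketch below fits a simplified version of the transformer on a toy frame and then transforms a row whose category was never seen during fitting. The column values are made up, and the encoder uses default settings rather than the notebook's `min_frequency` and `dtype` options.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data with one numeric and one categorical column.
train_demo = pd.DataFrame({
    "total_spent": [100.0, 200.0, 300.0, 400.0],
    "main_payment_type": ["credit_card", "boleto", "credit_card", "voucher"],
})
# New row with a category that did not appear in the training data.
new_demo = pd.DataFrame({
    "total_spent": [250.0],
    "main_payment_type": ["debit_card"],
})

pre_demo = ColumnTransformer([
    ("num", StandardScaler(with_mean=False), ["total_spent"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["main_payment_type"]),
])
pre_demo.fit(train_demo)

out = pre_demo.transform(new_demo)
# The result may be sparse or dense depending on its density.
row = out.toarray()[0] if sparse.issparse(out) else np.asarray(out)[0]

# One scaled numeric value followed by three one-hot columns, all zero
# because "debit_card" is unknown to the encoder.
print(row)
```

The unknown category does not raise an error; it simply produces an all-zero one-hot block, so the model sees only the numeric feature for that row.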


## 5) Models to compare

I try 3 fast, solid linear models as a baseline:
- Logistic regression
- Linear SVM
- SGDClassifier (default hinge loss, i.e. a linear SVM trained with stochastic gradient descent)

In [5]:
models = {
    "LogReg (saga)": LogisticRegression(
        solver="saga",
        max_iter=3000,
        n_jobs=-1,
        class_weight="balanced"
    ),
    "LinearSVC": LinearSVC(class_weight="balanced"),
    "SGDClassifier": SGDClassifier(class_weight="balanced", random_state=42)
}
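All three models above use `class_weight="balanced"`. In scikit-learn this mode weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below checks that formula against `compute_class_weight` on made-up labels (30 negatives, 70 positives, an assumption for illustration only).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 30 samples of class 0, 70 samples of class 1.
y_demo = np.array([0] * 30 + [1] * 70)
classes = np.array([0, 1])

# Manual version of the "balanced" heuristic.
manual_weights = len(y_demo) / (len(classes) * np.bincount(y_demo))

# The same weights as computed by scikit-learn.
sk_weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=y_demo
)

print(dict(zip(classes.tolist(), manual_weights.round(4).tolist())))
```

The minority class gets the larger weight, so its errors cost more during training.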


## 6) Training and comparison

I train each model with the same `Pipeline` and compare metrics (Accuracy, Precision, Recall, F1) on the test set. I sort by **F1** to pick the best balance between precision and recall.

In [6]:
results = []

for name, model in models.items():
    pipeline = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("model", model)
    ])

    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    results.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, zero_division=0),
        "Recall": recall_score(y_test, y_pred, zero_division=0),
        "F1": f1_score(y_test, y_pred, zero_division=0),
    })

results_df = pd.DataFrame(results).sort_values(by="F1", ascending=False)
results_df


| | Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | LinearSVC | 0.795867 | 0.777705 | 1.0 | 0.874954 |
| 2 | SGDClassifier | 0.782593 | 0.766623 | 1.0 | 0.867897 |
| 0 | LogReg (saga) | 0.751823 | 0.742111 | 1.0 | 0.851967 |
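Since the ranking relies on F1, it may help to recall that F1 is the harmonic mean of precision and recall. The short sketch below recomputes the F1 column from the precision and recall values in the table above; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table.
table = {
    "LinearSVC": (0.777705, 1.0),
    "SGDClassifier": (0.766623, 1.0),
    "LogReg (saga)": (0.742111, 1.0),
}

f1_scores = {name: f1_from(p, r) for name, (p, r) in table.items()}
for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.4f}")
```

Because recall is 1.0 for every model here, the F1 ordering is driven entirely by precision.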


## 7) Best model + final report

I refit the best model and show:
- `classification_report`
- confusion matrix

In [7]:
best_name = results_df.iloc[0]["Model"]
best_model = models[best_name]

best_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", best_model)
])

best_pipeline.fit(X_train, y_train)
y_pred = best_pipeline.predict(X_test)

print("Best model:", best_name)
print(classification_report(y_test, y_pred, zero_division=0))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))


Best model: LinearSVC
              precision    recall  f1-score   support

           0       1.00      0.29      0.44      5685
           1       0.78      1.00      0.87     14204

    accuracy                           0.80     19889
   macro avg       0.89      0.64      0.66     19889
weighted avg       0.84      0.80      0.75     19889

Confusion matrix:
 [[ 1625  4060]
 [    0 14204]]
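The reported metrics can be read directly off the confusion matrix, where rows are true classes and columns are predicted classes (scikit-learn's convention). The sketch below recomputes them and, for context, compares against a trivial rule that predicts churn for every customer. That baseline is an addition for illustration and is not part of the original analysis.

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]].
tn, fp = 1625, 4060
fn, tp = 0, 14204
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)   # precision for class 1 (churn)
recall_1 = tp / (tp + fn)      # recall for class 1 (churn)
recall_0 = tn / (tn + fp)      # recall for class 0 (no churn)

# Trivial baseline: always predict class 1.
baseline_accuracy = (tp + fn) / total
baseline_precision = baseline_accuracy  # every sample is predicted positive
baseline_f1 = 2 * baseline_precision * 1.0 / (baseline_precision + 1.0)

print(f"accuracy={accuracy:.4f} precision_1={precision_1:.4f}")
print(f"recall_1={recall_1:.4f} recall_0={recall_0:.4f}")
print(f"always-churn baseline: accuracy={baseline_accuracy:.4f} "
      f"F1={baseline_f1:.4f}")
```

The model catches every churner but correctly identifies only about 29% of the non-churners (1625 of 5685), which is why accuracy and F1 are only modestly above the always-churn baseline.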
