# 📌 Telco Churn - Advanced Modeling (Notebook 4)

Objectives:
- train a more advanced model than the baseline (LogReg)
- compare performance (ROC-AUC and churn recall)
- keep a reproducible pipeline (preprocessing + model)
- save the best model

In [17]:
# Imports

import numpy as np
import pandas as pd
import joblib

from joblib import load, dump

from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    roc_auc_score,
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support
)

from sklearn.ensemble import RandomForestClassifier


In [5]:
# Load data & split X/y

train_df = pd.read_csv("../../data/processed/telco_train.csv")
test_df  = pd.read_csv("../../data/processed/telco_test.csv")

X_train = train_df.drop(columns=["Churn"])
y_train = train_df["Churn"]

X_test  = test_df.drop(columns=["Churn"])
y_test  = test_df["Churn"]

print("Train:", X_train.shape, "| churn rate:", round(y_train.mean(), 3))
print("Test :", X_test.shape,  "| churn rate:", round(y_test.mean(), 3))


Train: (5634, 21) | churn rate: 0.265
Test : (1409, 21) | churn rate: 0.265


In [20]:
# Save train/test sets for future notebooks

joblib.dump(X_train, "../../data/processed/X_train.joblib")
joblib.dump(X_test, "../../data/processed/X_test.joblib")
joblib.dump(y_train, "../../data/processed/y_train.joblib")
joblib.dump(y_test, "../../data/processed/y_test.joblib")


['../data/processed/y_test.joblib']

In [8]:
# Load preprocessor

preprocessor = load("../../models/preprocessor.joblib")
preprocessor


In [9]:
# Build pipeline with RandomForest

rf_pipeline = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        (
            "model",
            RandomForestClassifier(
                n_estimators=300,
                max_depth=None,
                min_samples_split=5,
                min_samples_leaf=2,
                class_weight="balanced",
                random_state=42,
                n_jobs=-1
            )
        )
    ]
)

rf_pipeline


In [10]:
# Train RandomForest model

rf_pipeline.fit(X_train, y_train)


In [11]:
# Evaluate on test set

# Class predictions (default decision threshold of 0.5)
y_pred_rf = rf_pipeline.predict(X_test)

# Predicted probability of the churn class (label 1)
y_proba_rf = rf_pipeline.predict_proba(X_test)[:, 1]

print("ROC-AUC RandomForest :", round(roc_auc_score(y_test, y_proba_rf), 3))


ROC-AUC RandomForest : 0.835
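The ROC-AUC above is a point estimate on 1,409 test rows, so it carries sampling noise. A percentile bootstrap is one common way to gauge that noise before reading much into small gaps between models. The sketch below is illustrative only: `bootstrap_auc_ci` is a hypothetical helper, and `y_demo` / `score_demo` are synthetic stand-ins for `y_test` / `y_proba_rf`, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_ci(y_true, y_score, n_boot=500, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for ROC-AUC (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        # Resample rows with replacement
        idx = rng.integers(0, n, size=n)
        # ROC-AUC is undefined if the resample contains a single class
        if y_true[idx].min() == y_true[idx].max():
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lo, hi


# Synthetic stand-in data (assumption): same size and churn rate as the test set
rng = np.random.default_rng(0)
y_demo = rng.binomial(1, 0.265, size=1409)
score_demo = np.clip(0.3 * y_demo + rng.normal(0.3, 0.2, size=1409), 0, 1)

point = roc_auc_score(y_demo, score_demo)
lo, hi = bootstrap_auc_ci(y_demo, score_demo)
print(f"AUC = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

In the notebook, the same helper could be called as `bootstrap_auc_ci(y_test, y_proba_rf)`.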


In [12]:
# Classification report

print("Classification report - RandomForest\n")
print(classification_report(y_test, y_pred_rf))


Classification report - RandomForest

              precision    recall  f1-score   support

           0       0.87      0.83      0.85      1035
           1       0.58      0.65      0.61       374

    accuracy                           0.78      1409
   macro avg       0.73      0.74      0.73      1409
weighted avg       0.79      0.78      0.79      1409



In [13]:
# Confusion matrix

confusion_matrix(y_test, y_pred_rf)


array([[862, 173],
       [132, 242]], dtype=int64)
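The churn-class figures in the classification report can be recovered directly from this matrix. The sketch below hard-codes the four counts shown above and uses scikit-learn's convention (rows are true labels, columns are predicted labels); the variable names are illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix above.
# scikit-learn convention: rows = true class, columns = predicted class.
cm = np.array([[862, 173],
               [132, 242]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

recall_churn = tp / (tp + fn)       # share of actual churners that were flagged
precision_churn = tp / (tp + fp)    # share of churn flags that were correct
accuracy = (tp + tn) / cm.sum()

print(f"recall (churn)   : {recall_churn:.3f}")
print(f"precision (churn): {precision_churn:.3f}")
print(f"accuracy         : {accuracy:.3f}")
print(f"missed churners  : {fn} of {tp + fn}")
```

Reading the matrix this way makes the business cost explicit: the bottom-left cell counts the churners the model failed to flag.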

In [14]:
# Model comparison

results = pd.DataFrame({
    "model": ["Logistic Regression", "Random Forest"],
    "roc_auc": [
        0.841,  # value taken from Notebook 3
        roc_auc_score(y_test, y_proba_rf)
    ],
    "recall_churn": [
        0.786,  # LogReg recall at threshold 0.5
        classification_report(y_test, y_pred_rf, output_dict=True)["1"]["recall"]
    ]
})

results


                 model   roc_auc  recall_churn
0  Logistic Regression  0.841000      0.786000
1        Random Forest  0.835392      0.647059
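Both recall values in this table are measured at the default 0.5 threshold, and the two models need not produce comparably scaled probabilities. One way to make the comparison fairer is to find the threshold at which a model reaches a target recall and then look at its precision there. The sketch below is a minimal illustration: `threshold_for_recall` is a hypothetical helper, and the data are synthetic stand-ins rather than the notebook's test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_recall(y_true, y_score, target_recall):
    """Return (threshold, precision, recall) for the highest threshold
    whose recall is at least target_recall (hypothetical helper)."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall carry one extra trailing element with no threshold
    precision, recall = precision[:-1], recall[:-1]
    # Recall is non-increasing as the threshold grows: take the last match
    i = np.where(recall >= target_recall)[0][-1]
    return thresholds[i], precision[i], recall[i]


# Synthetic stand-in for (y_test, y_proba_rf); NOT the notebook's data
rng = np.random.default_rng(0)
y_demo = rng.binomial(1, 0.265, size=1409)
logit = 1.5 * y_demo - 1.0 + rng.normal(0.0, 1.0, size=1409)
score_demo = 1.0 / (1.0 + np.exp(-logit))

target = 0.786  # LogReg churn recall reported in the table above
thr, prec, rec = threshold_for_recall(y_demo, score_demo, target)

# Cross-check the recall by applying the threshold directly
pred = score_demo >= thr
manual_recall = (pred & (y_demo == 1)).sum() / (y_demo == 1).sum()

print(f"threshold={thr:.3f}  precision={prec:.3f}  recall={rec:.3f}")
```

In the notebook, calling `threshold_for_recall(y_test, y_proba_rf, 0.786)` would show what precision the RandomForest offers at the baseline's recall level.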


## Conclusion: Advanced Modeling

A RandomForest model was trained and compared with the baseline
(Logistic Regression) using the same preprocessing pipeline.

Results (test set, decision threshold 0.5):
- Logistic Regression: ROC-AUC = 0.841, churn recall ≈ 0.79
- RandomForest: ROC-AUC = 0.835, churn recall ≈ 0.65

Although the RandomForest slightly improves precision,
it detects significantly fewer at-risk customers: it misses 132 of the
374 churners in the test set.

For a customer-retention objective, Logistic Regression is retained
as the final model because it offers the better trade-off
between performance and business impact.
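The objectives list saving the best model. A minimal sketch of that step with `joblib` follows. Everything here is an assumption for illustration: the pipeline is a small stand-in fitted on synthetic data (the real one comes from Notebook 3), and the file name `churn_logreg_pipeline.joblib` is hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a churn rate close to 26.5% (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.735, 0.265], random_state=42
)

# Stand-in for the retained pipeline; the real preprocessing step differs
final_pipeline = Pipeline(steps=[
    ("preprocess", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
final_pipeline.fit(X_demo, y_demo)

# Persist the whole pipeline so preprocessing and model stay in sync
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "churn_logreg_pipeline.joblib"  # hypothetical name
    dump(final_pipeline, model_path)
    reloaded = load(model_path)

# The reloaded pipeline should reproduce the original predictions exactly
same_predictions = np.array_equal(
    final_pipeline.predict(X_demo), reloaded.predict(X_demo)
)
print("Reloaded pipeline reproduces predictions:", same_predictions)
```

Saving the full pipeline rather than the bare estimator avoids applying a model to data that was preprocessed differently from the training set.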