# 📌 Telco Churn: Baseline Modeling (Notebook no. 3)

🎯 Objective: train a baseline model (Logistic Regression) on the Telco Churn dataset
by reusing the **sklearn preprocessor** (without data leakage), then evaluate:
- Churn recall (class 1)
- ROC-AUC
- Confusion matrix
- Threshold tuning
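To make the "without data leakage" requirement concrete, here is a minimal sketch on synthetic data. The arrays and the `StandardScaler` step are illustrative assumptions, not this project's preprocessor. It shows that a preprocessing step placed inside a `Pipeline` learns its statistics from the training rows only, because `fit` never sees the test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the test set is deliberately shifted so that
# any leakage of test statistics into the scaler would be easy to spot.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
X_test = rng.normal(loc=5.0, scale=1.0, size=(50, 3))
y_train = (X_train[:, 0] > 0).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)  # the scaler is fitted on X_train only

learned_mean = pipe.named_steps["scale"].mean_
train_mean = X_train.mean(axis=0)
pooled_mean = np.vstack([X_train, X_test]).mean(axis=0)

print("scaler mean :", learned_mean.round(3))
print("train mean  :", train_mean.round(3))
print("pooled mean :", pooled_mean.round(3))
```

The scaler mean matches the training mean rather than the pooled mean, which is the property the notebook relies on when it wraps the preprocessor and the classifier in a single pipeline.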


In [5]:
# Imports & settings

import numpy as np
import pandas as pd

from joblib import load, dump

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    roc_auc_score,
    roc_curve,
    precision_recall_fscore_support
)

from pathlib import Path

RANDOM_STATE = 42


In [7]:
# Load data

# 1) Check where we are
cwd = Path.cwd()
print("📍 Current working directory:", cwd)

# 2) List the candidate directories
#    (covers running from the project root or from inside notebooks/)
candidates = [
    cwd / "notebooks" / "data" / "processed",
    cwd / "data" / "processed",
    cwd / "data" / "raw",  # for debugging only
    cwd.parent / "notebooks" / "data" / "processed",  # if cwd == .../notebooks
    cwd.parent / "data" / "processed",
]

# 3) Keep the directories that exist (dict.fromkeys drops duplicates, keeps order)
existing_dirs = [p for p in dict.fromkeys(candidates) if p.exists()]
print("📂 Existing candidate dirs:")
for p in existing_dirs:
    print(" -", p)

# 4) Look for the files inside them
train_path = None
test_path = None

for d in existing_dirs:
    tp = d / "telco_train.csv"
    te = d / "telco_test.csv"
    if tp.exists() and te.exists():
        train_path, test_path = tp, te
        break

if train_path is None:
    raise FileNotFoundError("Could not find telco_train.csv and telco_test.csv in the candidate directories.")

print("✅ Found train:", train_path)
print("✅ Found test :", test_path)

train_df = pd.read_csv(train_path)
test_df  = pd.read_csv(test_path)

print("Train:", train_df.shape)
print("Test :", test_df.shape)
train_df.head()



📍 Current working directory: C:\Users\Anna\PycharmProjects\churn-mlops-telco\notebooks
📂 Existing candidate dirs:
 - C:\Users\Anna\PycharmProjects\churn-mlops-telco\notebooks\data\processed
 - C:\Users\Anna\PycharmProjects\churn-mlops-telco\data\processed
✅ Found train: C:\Users\Anna\PycharmProjects\churn-mlops-telco\notebooks\data\processed\telco_train.csv
✅ Found test : C:\Users\Anna\PycharmProjects\churn-mlops-telco\notebooks\data\processed\telco_test.csv
Train: (5634, 22)
Test : (1409, 22)


| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | ... | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | customer_value | high_monthly_charges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 0 | No | No | 35 | No | No phone service | DSL | No | No | ... | Yes | Yes | Month-to-month | No | Electronic check | 49.2 | 1701.65 | 1722.0 | 0 | 0 |
| 1 | Male | 0 | Yes | Yes | 15 | Yes | No | Fiber optic | Yes | No | ... | No | No | Month-to-month | No | Mailed check | 75.1 | 1151.55 | 1126.5 | 1 | 0 |
| 2 | Male | 0 | Yes | Yes | 13 | No | No phone service | DSL | Yes | Yes | ... | No | No | Two year | No | Mailed check | 40.55 | 590.35 | 527.15 | 0 | 0 |
| 3 | Female | 0 | Yes | No | 26 | Yes | No | DSL | No | Yes | ... | Yes | Yes | Two year | Yes | Credit card (automatic) | 73.5 | 1905.7 | 1911.0 | 1 | 0 |
| 4 | Male | 0 | Yes | Yes | 1 | Yes | No | DSL | No | No | ... | No | No | Month-to-month | No | Electronic check | 44.55 | 44.55 | 44.55 | 0 | 0 |


In [8]:
# Separate features & target

X_train = train_df.drop(columns=["Churn"])
y_train = train_df["Churn"]

X_test  = test_df.drop(columns=["Churn"])
y_test  = test_df["Churn"]

print("X_train:", X_train.shape)
print("X_test :", X_test.shape)
print("Churn rate train:", round(float(y_train.mean()), 3))
print("Churn rate test :", round(float(y_test.mean()), 3))


X_train: (5634, 21)
X_test : (1409, 21)
Churn rate train: 0.265
Churn rate test : 0.265


In [9]:
# Load preprocessor

preprocessor = load("../../models/preprocessor.joblib")
preprocessor
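The contents of `preprocessor.joblib` are not shown in this notebook. As a purely hypothetical sketch of what such a reusable preprocessor could look like, the block below builds a `ColumnTransformer` that scales two numeric columns and one-hot encodes one categorical column. The name `hypothetical_preprocessor`, the choice of columns, and the choice of transformers are all assumptions for illustration; the toy values reuse a few cells from the `head()` output above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with three Telco column names (illustrative subset only).
toy = pd.DataFrame({
    "tenure": [35, 13, 26, 1],
    "MonthlyCharges": [49.2, 40.55, 73.5, 44.55],
    "Contract": ["Month-to-month", "Two year", "Two year", "Month-to-month"],
})

# Hypothetical preprocessor: NOT the object stored in preprocessor.joblib.
hypothetical_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["tenure", "MonthlyCharges"]),
        # handle_unknown="ignore" avoids errors on categories unseen during fit
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ]
)

encoded = hypothetical_preprocessor.fit_transform(toy)
print(encoded.shape)  # 2 scaled columns + one column per Contract category
```

When an object like this is passed as a pipeline step, `Pipeline.fit` refits it on the data given to `fit`, so loading a previously fitted preprocessor and then fitting the pipeline on `X_train` does not carry over statistics from any other data.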


In [10]:
# Baseline model pipeline

logreg_pipeline = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("model", LogisticRegression(
            class_weight="balanced",
            max_iter=1000,
            solver="liblinear",
            random_state=RANDOM_STATE
        ))
    ]
)

logreg_pipeline
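For context on `class_weight="balanced"`: scikit-learn weights each class by `n_samples / (n_classes * count_of_class)`, so the minority class counts more in the loss. The sketch below checks that formula on a synthetic label vector built to match the 26.5% churn rate printed above; the vector itself (1000 labels) is an assumption, not the real training target.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels (assumption): 1000 customers, 26.5% churners.
y_example = np.array([0] * 735 + [1] * 265)

weights = compute_class_weight(
    class_weight="balanced",
    classes=np.array([0, 1]),
    y=y_example,
)

# Same quantity computed by hand: n_samples / (n_classes * class_count)
manual = len(y_example) / (2 * np.bincount(y_example))

print("class 0 weight:", round(float(weights[0]), 3))
print("class 1 weight:", round(float(weights[1]), 3))
```

With this churn rate, an error on a churner weighs roughly 2.8 times as much as an error on a non-churner, which pushes the model toward higher churn recall at the default threshold.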


In [11]:
# Train the model (fit)

logreg_pipeline.fit(X_train, y_train)


In [12]:
# Baseline evaluation

# Predictions
y_pred = logreg_pipeline.predict(X_test)

# Churn probabilities (class 1)
y_proba = logreg_pipeline.predict_proba(X_test)[:, 1]

print("ROC-AUC :", round(roc_auc_score(y_test, y_proba), 3))


ROC-AUC : 0.841
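One way to read this number: ROC-AUC is the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below verifies that interpretation on four made-up scores (toy values, not this notebook's predictions) by counting correctly ranked pairs and comparing with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (illustrative only).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score;
# a tie counts as half a correctly ranked pair.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print("pairwise estimate:", pairwise_auc)
print("roc_auc_score    :", roc_auc_score(y_true, scores))
```

Under this reading, 0.841 means the model ranks a churner above a non-churner in about 84% of such pairs, independently of any decision threshold.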


In [13]:
# Classification report

print("Classification report :\n")
print(classification_report(y_test, y_pred))


Classification report :

              precision    recall  f1-score   support

           0       0.90      0.72      0.80      1035
           1       0.51      0.79      0.62       374

    accuracy                           0.74      1409
   macro avg       0.70      0.75      0.71      1409
weighted avg       0.80      0.74      0.75      1409



In [14]:
# Confusion matrix

cm = confusion_matrix(y_test, y_pred)
cm


array([[747, 288],
       [ 80, 294]], dtype=int64)
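The churn metrics in the classification report can be recomputed directly from this matrix. The sketch below hard-codes the four counts shown above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix at the default 0.50 threshold, copied from the output above.
cm = np.array([[747, 288],
               [80, 294]])

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)        # share of actual churners that were flagged
precision = tp / (tp + fp)     # share of flagged customers that really churned
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print("recall   :", round(float(recall), 3))
print("precision:", round(float(precision), 3))
print("f1       :", round(float(f1), 3))
print("accuracy :", round(float(accuracy), 3))
```

In plain terms, 80 of the 374 churners in the test set are missed at this threshold, and 288 non-churners are flagged by mistake.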

In [15]:
# Threshold tuning

thresholds = np.arange(0.2, 0.81, 0.05)

rows = []

for t in thresholds:
    y_pred_t = (y_proba >= t).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred_t, average="binary"
    )
    rows.append({
        "threshold": round(t, 2),
        "precision_churn": round(precision, 3),
        "recall_churn": round(recall, 3),
        "f1_churn": round(f1, 3)
    })

threshold_df = pd.DataFrame(rows)
threshold_df

| | threshold | precision_churn | recall_churn | f1_churn |
|---|---|---|---|---|
| 0 | 0.2 | 0.388 | 0.963 | 0.553 |
| 1 | 0.25 | 0.407 | 0.939 | 0.568 |
| 2 | 0.3 | 0.429 | 0.928 | 0.587 |
| 3 | 0.35 | 0.447 | 0.909 | 0.599 |
| 4 | 0.4 | 0.465 | 0.864 | 0.604 |
| 5 | 0.45 | 0.485 | 0.84 | 0.614 |
| 6 | 0.5 | 0.505 | 0.786 | 0.615 |
| 7 | 0.55 | 0.526 | 0.754 | 0.62 |
| 8 | 0.6 | 0.542 | 0.706 | 0.613 |
| 9 | 0.65 | 0.567 | 0.666 | 0.613 |
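A threshold can also be picked from this table with an explicit rule rather than by eye. The sketch below assumes one possible rule, which the notebook does not state: require a churn recall of at least 0.85 (the hypothetical `MIN_RECALL`), then keep the eligible threshold with the best precision. The values are copied from the table above.

```python
import pandas as pd

# Values copied from the threshold table above.
threshold_df = pd.DataFrame({
    "threshold": [0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65],
    "precision_churn": [0.388, 0.407, 0.429, 0.447, 0.465, 0.485, 0.505, 0.526, 0.542, 0.567],
    "recall_churn": [0.963, 0.939, 0.928, 0.909, 0.864, 0.840, 0.786, 0.754, 0.706, 0.666],
    "f1_churn": [0.553, 0.568, 0.587, 0.599, 0.604, 0.614, 0.615, 0.620, 0.613, 0.613],
})

MIN_RECALL = 0.85  # assumption: a business-defined minimum churn recall

# Keep thresholds that meet the recall target, then maximise precision.
eligible = threshold_df[threshold_df["recall_churn"] >= MIN_RECALL]
best = eligible.sort_values("precision_churn", ascending=False).iloc[0]

print("selected threshold:", best["threshold"])
print("precision:", best["precision_churn"], "recall:", best["recall_churn"])
```

Under that assumed rule the selection lands on 0.40. A different recall target, or a rule that maximises F1 instead, would select a different row. Ideally the threshold is chosen on a validation split rather than on the test set, so that the reported test metrics stay unbiased.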


In [16]:
# Apply the chosen threshold

THRESHOLD = 0.40

y_pred_40 = (y_proba >= THRESHOLD).astype(int)

print("Threshold =", THRESHOLD)
print("\nClassification report:\n")
print(classification_report(y_test, y_pred_40))

confusion_matrix(y_test, y_pred_40)


Threshold = 0.4

Classification report:

              precision    recall  f1-score   support

           0       0.93      0.64      0.76      1035
           1       0.46      0.86      0.60       374

    accuracy                           0.70      1409
   macro avg       0.70      0.75      0.68      1409
weighted avg       0.81      0.70      0.72      1409



array([[663, 372],
       [ 51, 323]], dtype=int64)
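Comparing this matrix with the one obtained at the default 0.50 threshold quantifies the trade-off. The sketch below simply subtracts the two matrices shown in this notebook; whether the exchange is worthwhile depends on the cost of a retention offer versus the cost of a lost customer, which the notebook does not specify.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: true class, columns: predicted class).
cm_050 = np.array([[747, 288],
                   [80, 294]])
cm_040 = np.array([[663, 372],
                   [51, 323]])

delta = cm_040 - cm_050
extra_churners_caught = int(delta[1, 1])  # additional true positives
extra_false_alarms = int(delta[0, 1])     # additional false positives

print("extra churners caught:", extra_churners_caught)
print("extra false alarms   :", extra_false_alarms)
print("false alarms per extra churner:",
      round(extra_false_alarms / extra_churners_caught, 2))
```

Lowering the threshold from 0.50 to 0.40 catches 29 more churners at the price of 84 more non-churners being flagged, or about 2.9 false alarms per additional churner detected.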

In [3]:
# Save model (requires `logreg_pipeline` from the training cell in the current kernel session)

dump(logreg_pipeline, "../models/logreg_baseline_pipeline.joblib")


NameError: name 'logreg_pipeline' is not defined

## ✅ Conclusion: Baseline modeling

A baseline Logistic Regression model was trained with a pipeline that
integrates the previously defined preprocessing.

Results:
- ROC-AUC: 0.84
- Churn recall (threshold 0.40): ~86%, up from ~79% at the default 0.50 threshold, while churn precision drops from 0.51 to 0.46

The decision threshold was lowered to prioritize detecting at-risk customers,
as part of a proactive retention strategy.

The final pipeline is saved and ready for:
- a comparison with more advanced models
- deployment via an API


In [2]:
from pathlib import Path
import json

import joblib

# Requires `logreg_pipeline` from the training cell in the current kernel session.
PROJECT_ROOT = Path(r"C:\Users\Anna\PycharmProjects\churn-mlops-telco")
MODELS_DIR = PROJECT_ROOT / "models"
MODELS_DIR.mkdir(exist_ok=True)

# Persist the fitted pipeline together with the chosen decision threshold
joblib.dump(logreg_pipeline, MODELS_DIR / "churn_pipeline.joblib")
(MODELS_DIR / "threshold.json").write_text(json.dumps({"threshold": 0.40}, indent=2))

print("✅ Model saved to:", MODELS_DIR)
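As a rough sketch of how a serving process might consume these two artifacts, the block below saves a stand-in pipeline and a `threshold.json` to a temporary directory, reloads both, and applies the threshold to predicted probabilities. The stand-in model, the synthetic data, and the temporary directory are assumptions made to keep the example self-contained; only the file names and the 0.40 threshold come from the cell above.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the trained churn pipeline (synthetic data, illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
toy_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    models_dir = Path(tmp)

    # Same artifact names as in the save cell above.
    joblib.dump(toy_pipeline, models_dir / "churn_pipeline.joblib")
    (models_dir / "threshold.json").write_text(json.dumps({"threshold": 0.40}, indent=2))

    # What a scoring service would do at start-up.
    pipeline = joblib.load(models_dir / "churn_pipeline.joblib")
    threshold = json.loads((models_dir / "threshold.json").read_text())["threshold"]

# Score a few rows and apply the stored threshold instead of predict()'s 0.50.
proba = pipeline.predict_proba(X[:5])[:, 1]
decisions = (proba >= threshold).astype(int)

print("threshold:", threshold)
print("decisions:", decisions.tolist())
```

Storing the threshold next to the model keeps the decision rule versioned with the pipeline, since calling `predict()` alone would silently fall back to 0.50.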