# Customer Churn Prediction with Interpretable Models

This notebook focuses on building baseline machine learning models for customer churn prediction.
The emphasis is not only on predictive performance, but also on **interpretability**, which is crucial
for understanding customer behavior and supporting business decisions.

In [48]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

## Data Loading and Preparation

We load the dataset, convert `TotalCharges` to a numeric column, encode the target as 0/1,
and drop rows with missing values. The identifier column is removed and categorical variables
are one-hot encoded in the steps that follow.

In [49]:
df = pd.read_csv("../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Convert TotalCharges to numeric
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Target variable
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})
df = df.dropna()

df.head()


| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | 0 |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | 0 |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | 1 |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | 0 |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | 1 |
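
The coercion step above can be illustrated on a toy column. This is a minimal sketch, assuming a text column in which one entry is blank; the values and the blank entry are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy column mimicking a numeric field that was read as text.
# The blank entry is an illustrative assumption.
raw = pd.Series(["29.85", " ", "1889.5"], name="TotalCharges")

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising.
parsed = pd.to_numeric(raw, errors="coerce")

n_missing = int(parsed.isna().sum())
cleaned = parsed.dropna()

print(f"missing after coercion: {n_missing}, rows kept: {len(cleaned)}")
```

Counting the NaN values before calling `dropna()` makes it visible how many rows the cleaning step discards.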


## Feature Selection and Train-Test Split

We separate the target variable (`Churn`) from input features and perform a standard train-test split.

In [50]:
X = df.drop(columns=["Churn", "customerID"])
y = df["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
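
To see what `stratify=y` buys, the sketch below compares the positive-class rate in the two partitions. It runs on synthetic labels (the 27% positive rate and the names `X_demo`, `y_demo` are illustrative assumptions), not on the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for an imbalanced binary target (rate chosen arbitrarily).
y_demo = pd.Series((rng.random(1000) < 0.27).astype(int))
X_demo = pd.DataFrame({"x": rng.normal(size=1000)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification the class rate is nearly identical in both partitions.
rate_gap = abs(y_tr.mean() - y_te.mean())
print(f"train rate={y_tr.mean():.3f}, test rate={y_te.mean():.3f}, gap={rate_gap:.4f}")
```

The same check can be run on `y_train` and `y_test` from the cell above.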


In [51]:
# Select columns by "number" rather than by "object": selecting string columns via
# include=["object"] emits a pandas string-migration warning. See
# https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#string-migration-select-dtypes
# For this dataset every non-numeric feature is a categorical string column.
numeric_features = X.select_dtypes(include="number").columns
categorical_features = X.select_dtypes(exclude="number").columns

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)


## Logistic Regression (Interpretable Baseline)

Logistic regression serves as a strong and interpretable baseline model for churn prediction.

In [52]:
log_reg = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("logisticregression", LogisticRegression(max_iter=1000))
    ]
)

log_reg.fit(X_train, y_train)

y_pred_lr = log_reg.predict_proba(X_test)[:, 1]
roc_lr = roc_auc_score(y_test, y_pred_lr)

roc_lr


0.8359290473207676

## Decision Tree (Non-linear Baseline)

A decision tree is trained as a comparison model to capture non-linear patterns.



In [53]:
tree = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("decisiontree", DecisionTreeClassifier(
            max_depth=5,
            random_state=42
        ))
    ]
)

tree.fit(X_train, y_train)

y_pred_tree = tree.predict_proba(X_test)[:, 1]
roc_tree = roc_auc_score(y_test, y_pred_tree)

roc_tree


0.8296250472379394
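
A shallow tree can also be inspected directly as a set of rules. The sketch below uses `sklearn.tree.export_text` on synthetic data; the data, the feature names `f0` to `f3`, and the depth of 2 are illustrative assumptions chosen to keep the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data as a stand-in for the preprocessed churn features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
demo_names = ["f0", "f1", "f2", "f3"]

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# export_text renders the fitted splits as indented if/else rules.
rules = export_text(tree_demo, feature_names=demo_names)
print(rules)
```

For the pipeline above, the fitted estimator would be `tree.named_steps["decisiontree"]`, with feature names taken from `tree.named_steps["preprocessor"].get_feature_names_out()`.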

## Model Comparison

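Logistic regression slightly outperforms the decision tree in ROC-AUC on the held-out test set
(0.836 vs. 0.830). The gap is small and comes from a single split, so it is not a decisive win.
Given its more direct interpretability and competitive performance, logistic regression
is selected for deeper analysis.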
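
One way to judge whether a gap of this size is meaningful is to compare the models across several folds rather than one split. The sketch below does this with `cross_val_score` on synthetic data; the data and the names `models` and `cv_summary` are illustrative assumptions, and the scores it prints say nothing about the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the real X and y (and the pipelines) to reuse.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

cv_summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: ROC-AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold standard deviation is larger than the difference in means, the ranking of the two models is not well established.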


In [54]:
feature_names = log_reg.named_steps["preprocessor"].get_feature_names_out()
classifier = log_reg.named_steps["logisticregression"]

coef_df = pd.DataFrame({
    "feature": feature_names,
    "coefficient": classifier.coef_[0]
}).sort_values(by="coefficient", ascending=False)

coef_df.head(10), coef_df.tail(10)

(                                feature  coefficient
 3                     num__TotalCharges     0.644014
 36         cat__Contract_Month-to-month     0.613846
 16     cat__InternetService_Fiber optic     0.590184
 32                 cat__StreamingTV_Yes     0.191113
 43  cat__PaymentMethod_Electronic check     0.180671
 35             cat__StreamingMovies_Yes     0.177076
 18               cat__OnlineSecurity_No     0.164459
 27                  cat__TechSupport_No     0.142949
 14               cat__MultipleLines_Yes     0.088164
 0                    num__SeniorCitizen     0.071011,
                                       feature  coefficient
 19    cat__OnlineSecurity_No internet service    -0.283724
 17                    cat__InternetService_No    -0.283724
 31       cat__StreamingTV_No internet service    -0.283724
 25  cat__DeviceProtection_No internet service    -0.283724
 12                      cat__MultipleLines_No    -0.293470
 39                   cat__PaperlessBilling_N

In [55]:
# Identify most influential features (by absolute coefficient value)
coef_df["abs_coefficient"] = coef_df["coefficient"].abs()

coef_df.sort_values("abs_coefficient", ascending=False).head(10)

| | feature | coefficient | abs_coefficient |
|---|---|---|---|
| 1 | num__tenure | -1.352313 | 1.352313 |
| 38 | cat__Contract_Two year | -0.77903 | 0.77903 |
| 3 | num__TotalCharges | 0.644014 | 0.644014 |
| 15 | cat__InternetService_DSL | -0.616132 | 0.616132 |
| 36 | cat__Contract_Month-to-month | 0.613846 | 0.613846 |
| 16 | cat__InternetService_Fiber optic | 0.590184 | 0.590184 |
| 2 | num__MonthlyCharges | -0.541006 | 0.541006 |
| 39 | cat__PaperlessBilling_No | -0.300387 | 0.300387 |
| 12 | cat__MultipleLines_No | -0.29347 | 0.29347 |
| 22 | cat__OnlineBackup_No internet service | -0.283724 | 0.283724 |


## Interpretation of Logistic Regression Coefficients

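- Customer tenure has the largest coefficient in absolute value (-1.35 on the standardized feature): longer-tenured customers are much less likely to churn.
- Contract type is the strongest categorical driver: month-to-month contracts are associated with higher churn (+0.61) and two-year contracts with lower churn (-0.78).
- Fiber optic internet service is associated with higher churn (+0.59), DSL with lower churn (-0.62).
- The charge variables need care: `MonthlyCharges` has a negative coefficient (-0.54) while `TotalCharges` has a positive one (+0.64). Total charges are roughly tenure times monthly charges, so these three features are strongly correlated and their individual signs should not be read in isolation.
- Electronic check payment is associated with higher churn (+0.18).
- Lacking value-added services such as online security (+0.16) or tech support (+0.14) is associated with higher churn.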

These findings largely align with domain intuition. The coefficients describe associations
conditional on the other features in the model, not causal effects, and correlated features
can distort individual coefficients.
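
Coefficients are on the log-odds scale, which is hard to read directly. A minimal sketch of converting them to odds ratios follows, using a few values copied from the absolute-coefficient table above; the names `coefs` and `odds_ratios` are illustrative.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the absolute-coefficient table in this notebook.
coefs = pd.Series({
    "num__tenure": -1.352313,
    "cat__Contract_Two year": -0.77903,
    "num__TotalCharges": 0.644014,
    "cat__Contract_Month-to-month": 0.613846,
    "num__MonthlyCharges": -0.541006,
})

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-unit increase in the (transformed) feature, others held fixed.
odds_ratios = np.exp(coefs)

print(odds_ratios.round(3).sort_values())
```

Because the numeric features are standardized, a one-unit increase means one standard deviation: the tenure coefficient corresponds to an odds ratio of about 0.26. For a one-hot column the unit is the indicator switching from 0 to 1. The encoder keeps every level, so these ratios are not expressed relative to a single reference category.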