# Customer Churn Prediction with Interpretable Models

This notebook focuses on building baseline machine learning models for customer churn prediction.
The emphasis is not only on predictive performance, but also on **interpretability**, which is crucial
for understanding customer behavior and supporting business decisions.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

## Data Loading and Preparation

We load the dataset, convert `TotalCharges` to a numeric type, map the target to 0/1,
and drop rows with missing values. The identifier column (`customerID`) is removed and
categorical variables are one-hot encoded in the steps that follow.

In [None]:
df = pd.read_csv("../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")

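# Convert TotalCharges to numeric; entries that cannot be parsed become NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Encode the target variable as 1 (churned) / 0 (retained)
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})

# Drop rows with missing values, including any entries coerced to NaN above
df = df.dropna()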

df.head()
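
To illustrate why the `errors="coerce"` conversion is followed by `dropna()`, here is a minimal sketch on a made-up three-row frame (not the Telco data). The values, including the blank `TotalCharges` entry, are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "108.15"],
    "Churn": ["No", "Yes", "No"],
})

# A string that cannot be parsed (here a single space) becomes NaN instead of raising
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

toy["Churn"] = toy["Churn"].map({"Yes": 1, "No": 0})
toy_clean = toy.dropna()

print(n_missing)       # 1
print(len(toy_clean))  # 2
```

Counting the coerced entries before dropping them is a cheap way to confirm that only a small number of rows is lost.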


## Feature Selection and Train-Test Split

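We separate the target variable (`Churn`) from the input features, drop the `customerID` identifier,
and perform an 80/20 train-test split stratified on the target so that both sets keep the same churn rate.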

In [None]:
X = df.drop(columns=["Churn", "customerID"])
y = df["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
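
The effect of `stratify=y` can be checked on synthetic labels. This is a small sketch with a hypothetical 25% positive rate (made up, not the churn rate of the Telco data); the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 50 positives, 150 negatives
y_toy = np.array([1] * 50 + [0] * 150)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the overall positive rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)  # 0.25 0.25
```

Without stratification the two rates can drift apart by chance, which makes test metrics harder to compare across runs.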


In [None]:
# Numeric columns are standardized; text columns are one-hot encoded
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object", "string"]).columns

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)


## Logistic Regression (Interpretable Baseline)

Logistic regression serves as a strong and interpretable baseline model for churn prediction.

In [None]:
log_reg = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("logisticregression", LogisticRegression(max_iter=1000))
    ]
)

log_reg.fit(X_train, y_train)

y_pred_lr = log_reg.predict_proba(X_test)[:, 1]
roc_lr = roc_auc_score(y_test, y_pred_lr)

roc_lr
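
ROC-AUC is computed here from predicted probabilities rather than hard class labels, because the metric evaluates how well the scores rank positives above negatives. A minimal sketch with made-up labels and scores shows the difference; the 0.5 threshold is an assumption matching the default behaviour of `predict`.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, made up for illustration
y_true = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4]

# The scores rank every positive above every negative
auc_from_scores = roc_auc_score(y_true, scores)

# Thresholding at 0.5 labels everything as negative and discards the ranking
hard_labels = [int(s >= 0.5) for s in scores]
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_scores)  # 1.0
print(auc_from_labels)  # 0.5
```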


## Decision Tree (Non-linear Baseline)

A decision tree is trained as a comparison model to capture non-linear patterns. Its depth is limited
(`max_depth=5`) to keep the tree readable and to reduce overfitting.



In [None]:
tree = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("decisiontree", DecisionTreeClassifier(
            max_depth=5,
            random_state=42
        ))
    ]
)

tree.fit(X_train, y_train)

y_pred_tree = tree.predict_proba(X_test)[:, 1]
roc_tree = roc_auc_score(y_test, y_pred_tree)

roc_tree
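
A shallow tree can also be read directly as a set of rules. The sketch below uses scikit-learn's `export_text` on a made-up one-feature dataset; the data and the `tenure` feature name are assumptions for illustration, not output from the model above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: churn (1) occurs only at short tenure
X_toy = np.array([[1], [2], [3], [30], [40], [50]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

toy_tree = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_toy, y_toy)

# Render the fitted splits as indented if/else rules
rules = export_text(toy_tree, feature_names=["tenure"])
print(rules)
```

For the pipeline in this notebook, the same call could be applied to the fitted tree step together with the preprocessor's output feature names.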


## Model Comparison

On the held-out test set, logistic regression achieves a slightly higher ROC-AUC than the
decision tree (compare `roc_lr` and `roc_tree`). Because its coefficients are also straightforward
to interpret, logistic regression is selected for deeper analysis.
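
A small gap on a single test split can be due to chance. One way to check how stable such a comparison is, which this notebook does not do, is cross-validation. The sketch below uses synthetic data from `make_classification` rather than the Telco dataset, so its scores say nothing about the models above; it only illustrates the procedure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the churn dataset)
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

# Five-fold ROC-AUC for each model
cv_auc = {
    name: cross_val_score(model, X_syn, y_syn, cv=5, scoring="roc_auc")
    for name, model in models.items()
}

for name, fold_scores in cv_auc.items():
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

If the difference in mean ROC-AUC is small relative to the fold-to-fold standard deviation, the two models are effectively tied on that data.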


In [None]:
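# Feature names after preprocessing (prefixed with "num__" or "cat__")
feature_names = log_reg.named_steps["preprocessor"].get_feature_names_out()
# Retrieve the fitted classifier by its pipeline step name
classifier = log_reg.named_steps["logisticregression"]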

coef_df = pd.DataFrame({
    "feature": feature_names,
    "coefficient": classifier.coef_[0]
}).sort_values(by="coefficient", ascending=False)

coef_df.head(10), coef_df.tail(10)

In [None]:
# Identify most influential features (by absolute coefficient value)
coef_df["abs_coefficient"] = coef_df["coefficient"].abs()

coef_df.sort_values("abs_coefficient", ascending=False).head(10)

## Interpretation of Logistic Regression Coefficients

- Contract type is the strongest predictor of churn; month-to-month contracts in particular are associated with higher churn.
- Higher monthly charges are associated with increased churn probability.
- Longer customer tenure is associated with markedly lower churn risk, consistent with accumulated loyalty.
- Payment by electronic check is associated with higher churn.
- Value-added services such as online security or tech support are associated with lower churn likelihood.

These findings align with domain intuition and suggest that the model captures plausible
customer behavior patterns. The coefficients describe associations in the data, not causal effects,
so they cannot by themselves rule out spurious correlations.
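
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to communicate. The sketch below uses made-up coefficient values (not output from the fitted model) to show the conversion. Because numeric features were standardized, a numeric odds ratio refers to a change of one standard deviation.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, made up for illustration only
toy_coef = pd.DataFrame({
    "feature": ["num__tenure", "cat__Contract_Month-to-month"],
    "coefficient": [-0.7, 0.6],
})

# exp(coefficient) = multiplicative change in the odds of churn
toy_coef["odds_ratio"] = np.exp(toy_coef["coefficient"])

print(toy_coef)
```

An odds ratio below 1 corresponds to lower churn odds and a value above 1 to higher churn odds. The same `np.exp` call could be applied to the `coefficient` column of `coef_df`.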