# Customer Churn Prediction – Modeling & Evaluation

## Objective
- Build a churn prediction model
- Compare baseline and advanced models
- Evaluate performance using industry-standard metrics

In [1]:
# Import Packages
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

import joblib

In [2]:
# Load Data
df = pd.read_csv("../data/processed/feature_engineered_churn_jupyter.csv")

In [3]:
# Check Columns
print(df.columns)

Index(['customer_id', 'gender', 'senior_citizen', 'tenure_months',
       'contract_type', 'monthly_charges', 'total_charges', 'payment_method',
       'avg_monthly_usage', 'usage_trend', 'support_tickets_last_3m', 'churn',
       'high_value_customer', 'tenure_bucket_Mid', 'tenure_bucket_Long',
       'support_intensity_Medium', 'support_intensity_High'],
      dtype='object')


In [4]:
# Display Data Shape
df.shape

(10000, 17)

In [5]:
# Display Dataset Info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               10000 non-null  object 
 1   gender                    10000 non-null  int64  
 2   senior_citizen            10000 non-null  int64  
 3   tenure_months             10000 non-null  int64  
 4   contract_type             10000 non-null  int64  
 5   monthly_charges           10000 non-null  float64
 6   total_charges             10000 non-null  float64
 7   payment_method            10000 non-null  int64  
 8   avg_monthly_usage         10000 non-null  float64
 9   usage_trend               10000 non-null  int64  
 10  support_tickets_last_3m   10000 non-null  int64  
 11  churn                     10000 non-null  int64  
 12  high_value_customer       10000 non-null  int64  
 13  tenure_bucket_Mid         10000 non-null  bool   
 14  tenure_

In [6]:
# Display Summary
df.describe()

| | gender | senior_citizen | tenure_months | contract_type | monthly_charges | total_charges | payment_method | avg_monthly_usage | usage_trend | support_tickets_last_3m | churn | high_value_customer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 0.5013 | 0.1511 | 23.4746 | 0.6523 | 70.380092 | 1649.044038 | 1.4882 | 300.02333 | 1.1395 | 1.1919 | 0.3172 | 0.219 |
| std | 0.500023 | 0.358164 | 16.187037 | 0.79465 | 24.575732 | 1341.289226 | 1.11552 | 100.354051 | 0.852826 | 1.081384 | 0.465409 | 0.413589 |
| min | 0.0 | 0.0 | 1.0 | 0.0 | 20.0 | 20.0 | 0.0 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 11.0 | 0.0 | 52.8975 | 684.9825 | 0.0 | 231.8 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | 0.0 | 20.0 | 0.0 | 70.015 | 1290.14 | 1.0 | 301.3 | 1.0 | 1.0 | 0.0 | 0.0 |
| 75% | 1.0 | 0.0 | 32.0 | 1.0 | 87.2525 | 2225.5225 | 2.0 | 367.9 | 2.0 | 2.0 | 1.0 | 0.0 |
| max | 1.0 | 1.0 | 72.0 | 2.0 | 150.0 | 10050.0 | 3.0 | 600.0 | 2.0 | 7.0 | 1.0 | 1.0 |


In [7]:
# Display First Five Rows
df.head()

| | customer_id | gender | senior_citizen | tenure_months | contract_type | monthly_charges | total_charges | payment_method | avg_monthly_usage | usage_trend | support_tickets_last_3m | churn | high_value_customer | tenure_bucket_Mid | tenure_bucket_Long | support_intensity_Medium | support_intensity_High |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CUST_1 | 1 | 0 | 58 | 0 | 40.63 | 2356.54 | 0 | 335.3 | 2 | 1 | 0 | 0 | False | True | False | False |
| 1 | CUST_2 | 0 | 0 | 19 | 0 | 88.05 | 1672.95 | 0 | 271.9 | 2 | 2 | 0 | 0 | True | False | True | False |
| 2 | CUST_3 | 1 | 1 | 12 | 0 | 44.73 | 536.76 | 1 | 227.8 | 0 | 2 | 0 | 0 | False | False | True | False |
| 3 | CUST_4 | 1 | 0 | 11 | 0 | 84.89 | 933.79 | 2 | 124.0 | 2 | 3 | 1 | 0 | False | False | True | False |
| 4 | CUST_5 | 1 | 1 | 4 | 0 | 82.63 | 330.52 | 2 | 425.4 | 0 | 1 | 0 | 0 | False | False | False | False |


## Model Preparation

In [8]:
# Split Features and Target (customer_id is an identifier, not a predictor)
X = df.drop(columns=["customer_id", "churn"])
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

The data is split 80/20 into training and test sets to evaluate
model generalization on unseen customers. The split is stratified on
`churn`, so both sets keep the overall churn rate of about 32%.
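
A quick way to confirm that a stratified split keeps the class balance is to compare the positive rate in each part. The sketch below uses synthetic labels with an assumed 32% positive rate; `X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn target: roughly 32% positives (assumption).
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.32).astype(int)
X_demo = rng.normal(size=(10_000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratify=y_demo, the positive rate should be nearly identical in both parts.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.4f}")
print(f"test positive rate:  {test_rate:.4f}")
```

The same two-line check could be run on `y_train` and `y_test` from the cell above.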

In [9]:
# Using Baseline Model - Logistic Regression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)

lr_probs = lr_model.predict_proba(X_test)[:, 1]
lr_auc = roc_auc_score(y_test, lr_probs)

lr_auc

0.7377719838714891

Logistic Regression serves as an interpretable baseline model
to benchmark performance.
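
One common way to read a logistic regression is to exponentiate its coefficients into odds ratios. The sketch below is illustrative only: it uses synthetic data with a hypothetical generating rule, and it standardizes the features first (a step the notebook cell above does not take) so that the coefficients are comparable across features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2_000
demo = pd.DataFrame({
    "tenure_months": rng.integers(1, 73, size=n),
    "monthly_charges": rng.uniform(20, 150, size=n),
})

# Hypothetical rule: shorter tenure and higher charges raise the odds of churn.
logit = (-0.05 * demo["tenure_months"] + 0.02 * demo["monthly_charges"] - 0.5).to_numpy()
churn_demo = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize, then fit; each coefficient is the log-odds change per one standard deviation.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(demo, churn_demo)

coefs = pipe.named_steps["logisticregression"].coef_[0]
odds_ratios = pd.Series(np.exp(coefs), index=demo.columns)
print(odds_ratios)
```

An odds ratio below 1 means the feature lowers the odds of churn as it increases; above 1 means it raises them.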

In [10]:
# Saving the Logistic Regression Model
joblib.dump(lr_model, "../models/lr_churn_model_jupyter.pkl")

['../models/lr_churn_model_jupyter.pkl']

In [11]:
# Final Model – Random Forest
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    random_state=42
)

rf_model.fit(X_train, y_train)

rf_probs = rf_model.predict_proba(X_test)[:, 1]
rf_auc = roc_auc_score(y_test, rf_probs)

rf_auc

0.750353330777651

Random Forest can capture non-linear relationships and interactions between
customer behavior features. On this test set it reaches a ROC-AUC of 0.750,
a modest gain over the Logistic Regression baseline (0.738).
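
Because the two models differ by only about 0.01 ROC-AUC on a single split, cross-validation can show whether the gap is stable. The following is a minimal sketch on synthetic data from `make_classification`; the dataset, fold count, and reduced `n_estimators` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem (about 32% positives) as a stand-in.
X_demo, y_demo = make_classification(
    n_samples=1_000, n_features=10, n_informative=5,
    weights=[0.68, 0.32], random_state=42,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_depth=10, random_state=42
    ),
}

# Mean and spread of ROC-AUC across folds for each model.
cv_auc = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_auc[name] = (scores.mean(), scores.std())
    print(f"{name}: ROC-AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold spread is larger than the difference between the means, the ranking of the two models is not well established by one split.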

In [12]:
# Classification Report - Random Forest
rf_preds = rf_model.predict(X_test)

print(classification_report(y_test, rf_preds))

              precision    recall  f1-score   support

           0       0.75      0.89      0.82      1366
           1       0.62      0.37      0.46       634

    accuracy                           0.73      2000
   macro avg       0.69      0.63      0.64      2000
weighted avg       0.71      0.73      0.71      2000



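Precision and recall for the churn class (class 1) matter most, because the
business needs to identify customers who are likely to leave. At the default
decision threshold the Random Forest reaches a precision of 0.62 but a recall
of only 0.37 on churners, so it misses roughly 63% of the customers who
actually churn.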
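
If recall on the churn class matters more than precision, the decision threshold can be lowered instead of using the default. The sketch below picks the highest threshold that still meets a target recall, using `precision_recall_curve`. The labels, scores, and the 0.70 target are hypothetical; in the notebook the inputs would be `y_test` and `rf_probs`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted probabilities (positives score higher on average).
rng = np.random.default_rng(7)
y_true = (rng.random(2_000) < 0.32).astype(int)
scores = np.clip(0.30 + 0.20 * y_true + rng.normal(0, 0.15, size=2_000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, scores)

target_recall = 0.70  # assumed business target, not taken from the notebook

# precision and recall have one more entry than thresholds; drop the final point.
meets_target = recall[:-1] >= target_recall
best_idx = np.flatnonzero(meets_target)[-1]  # thresholds are sorted ascending
chosen_threshold = thresholds[best_idx]

# Re-compute the metrics directly at the chosen threshold.
preds = (scores >= chosen_threshold).astype(int)
achieved_recall = preds[y_true == 1].mean()
achieved_precision = y_true[preds == 1].mean()
print(f"threshold={chosen_threshold:.3f} "
      f"recall={achieved_recall:.3f} precision={achieved_precision:.3f}")
```

Lowering the threshold trades precision for recall, so the target should reflect the relative cost of a missed churner versus an unnecessary retention offer.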

In [13]:
# Feature Importance (Business Explainability)
feature_importance = pd.Series(
    rf_model.feature_importances_,
    index=X.columns
).sort_values(ascending=False)

feature_importance.head(10)

tenure_months              0.208209
total_charges              0.151763
contract_type              0.150168
monthly_charges            0.129025
avg_monthly_usage          0.105531
usage_trend                0.059319
tenure_bucket_Mid          0.042405
support_tickets_last_3m    0.037409
payment_method             0.035094
high_value_customer        0.018528
dtype: float64

The top features align with business intuition: tenure, charges, contract type, and usage behavior rank highest, followed by support interactions.
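
The `feature_importances_` attribute of a scikit-learn random forest is impurity-based and computed from the training data, which can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below runs it on synthetic data; the feature names and model settings are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=1_000, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
rf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42)
rf.fit(X_tr, y_tr)

# Mean drop in held-out ROC-AUC when each feature is shuffled.
result = permutation_importance(
    rf, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=42
)
perm_importance = pd.Series(
    result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(perm_importance)
```

Features that rank high under both measures are the safest ones to present as churn drivers.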

In [14]:
# Saving the Random Forest Classifier Model
joblib.dump(rf_model, "../models/rf_churn_model_jupyter.pkl")

['../models/rf_churn_model_jupyter.pkl']
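
A saved model is typically reloaded with `joblib.load` before scoring. The sketch below checks that a dump and load round trip reproduces the same probabilities. It uses a small synthetic model and a temporary file (`rf_demo_model.pkl` is a hypothetical name) rather than the notebook's model path.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model as a stand-in for rf_model.
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "rf_demo_model.pkl"  # hypothetical file name
    joblib.dump(model, model_path)
    loaded = joblib.load(model_path)

# The reloaded model should produce the same probabilities as the original.
same_scores = np.allclose(model.predict_proba(X_demo), loaded.predict_proba(X_demo))
print("round-trip predictions match:", same_scores)
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, so it helps to record that version alongside the file.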

## Model Selection Summary

- Logistic Regression provided a simple and interpretable baseline (test ROC-AUC 0.738)
- Random Forest achieved a slightly higher test ROC-AUC (0.750) and can capture non-linear patterns
- Feature importances keep the final model explainable while it delivers the better ROC-AUC

Random Forest was selected as the final model for deployment.

## Conclusion

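The churn prediction model ranks customers by their risk of leaving based on
behavioral and subscription features, reaching a test ROC-AUC of 0.75. At the
default decision threshold it flags about 37% of the customers who churn.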

This model forms the foundation for:
- Customer risk segmentation
- Revenue at risk estimation
- Retention strategy optimization
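
As an illustration of the first two uses, predicted probabilities can be bucketed into risk segments and weighted by monthly charges to estimate revenue at risk. This is a minimal sketch: the customers, probabilities, and segment cut-offs (0.3 and 0.6) are all hypothetical.

```python
import pandas as pd

# Hypothetical scored customers; in practice churn_probability would come
# from the model's predict_proba output.
scored = pd.DataFrame({
    "customer_id": ["CUST_A", "CUST_B", "CUST_C", "CUST_D"],
    "monthly_charges": [40.0, 90.0, 60.0, 120.0],
    "churn_probability": [0.10, 0.45, 0.80, 0.25],
})

# Assumed cut-offs for the risk segments.
scored["risk_segment"] = pd.cut(
    scored["churn_probability"],
    bins=[0.0, 0.3, 0.6, 1.0],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

# Expected monthly revenue loss = churn probability x monthly charges.
scored["expected_monthly_loss"] = (
    scored["churn_probability"] * scored["monthly_charges"]
)
revenue_at_risk = scored.groupby("risk_segment", observed=False)[
    "expected_monthly_loss"
].sum()
print(revenue_at_risk)
```

Multiplying by probability assumes the scores are reasonably well calibrated; if they are not, the totals are better read as a relative ranking of segments than as currency amounts.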