# Customer Churn Prediction – Modeling & Risk Analysis

## Modeling Overview

This notebook focuses on:
- Building and comparing machine learning models for customer churn prediction
- Selecting the final model based on recall and ROC-AUC performance
- Interpreting model coefficients
- Performing customer risk segmentation based on predicted churn probabilities

In [9]:
import pandas as pd
import numpy as np

df = pd.read_csv("churn.csv")
df_model = df.copy()

In [10]:
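# Convert TotalCharges to numeric; values that cannot be parsed become NaN
df_model['TotalCharges'] = pd.to_numeric(df_model['TotalCharges'], errors='coerce')
# Drop rows with missing values, then the identifier column (not a predictor)
df_model = df_model.dropna()
df_model = df_model.drop(columns=['customerID'])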

In [11]:
df_model.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7032 non-null   object 
 1   SeniorCitizen     7032 non-null   int64  
 2   Partner           7032 non-null   object 
 3   Dependents        7032 non-null   object 
 4   tenure            7032 non-null   int64  
 5   PhoneService      7032 non-null   object 
 6   MultipleLines     7032 non-null   object 
 7   InternetService   7032 non-null   object 
 8   OnlineSecurity    7032 non-null   object 
 9   OnlineBackup      7032 non-null   object 
 10  DeviceProtection  7032 non-null   object 
 11  TechSupport       7032 non-null   object 
 12  StreamingTV       7032 non-null   object 
 13  StreamingMovies   7032 non-null   object 
 14  Contract          7032 non-null   object 
 15  PaperlessBilling  7032 non-null   object 
 16  PaymentMethod     7032 non-null   object 
 17  

In [16]:
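from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Separate the features from the target and encode the target as 0/1
X = df_model.drop(columns=['Churn'])
y = df_model['Churn'].map({'Yes': 1, 'No': 0})

# Every column that is not continuous (including the 0/1 SeniorCitizen flag)
# is treated as categorical and one-hot encoded
numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_features = [col for col in X.columns if col not in numeric_features]

# Stratify on y so the churn rate is the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Fit the preprocessing on the training set only, then apply it to the test set
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)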

In [17]:
X_train_processed.shape, X_test_processed.shape

((5274, 30), (1758, 30))

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

log_reg_bal = LogisticRegression(
    max_iter=2000,
    random_state=42,
    class_weight='balanced'
)

log_reg_bal.fit(X_train_processed, y_train)

y_pred_bal = log_reg_bal.predict(X_test_processed)
y_proba_bal = log_reg_bal.predict_proba(X_test_processed)[:, 1]

confusion_matrix(y_test, y_pred_bal)

array([[917, 374],
       [ 95, 372]])

In [31]:
print(classification_report(y_test, y_pred_bal))
roc_auc_score(y_test, y_proba_bal)

              precision    recall  f1-score   support

           0       0.91      0.71      0.80      1291
           1       0.50      0.80      0.61       467

    accuracy                           0.73      1758
   macro avg       0.70      0.75      0.70      1758
weighted avg       0.80      0.73      0.75      1758



np.float64(0.8400896007112326)

In [26]:
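# NOTE: y_proba is not defined in any cell shown here; it presumably comes from
# an earlier logistic regression fit without class weighting.
roc_auc_score(y_test, y_proba)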

np.float64(0.8402985916333967)

In [27]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train_processed, y_train)

y_pred_rf = rf.predict(X_test_processed)
y_proba_rf = rf.predict_proba(X_test_processed)[:, 1]

confusion_matrix(y_test, y_pred_rf)

array([[1163,  128],
       [ 234,  233]])

In [28]:
print(classification_report(y_test, y_pred_rf))

              precision    recall  f1-score   support

           0       0.83      0.90      0.87      1291
           1       0.65      0.50      0.56       467

    accuracy                           0.79      1758
   macro avg       0.74      0.70      0.71      1758
weighted avg       0.78      0.79      0.78      1758
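
To make the recall gap between the two models concrete, the sketch below recomputes churn-class precision and recall directly from the two confusion matrices printed above. The helper name `churn_metrics` is introduced here for illustration only; the matrices follow scikit-learn's layout, with rows as actual classes and columns as predicted classes.

```python
import numpy as np

# Confusion matrices copied from the outputs above.
# Layout: [[TN, FP], [FN, TP]] (rows = actual, columns = predicted).
cm_log_reg = np.array([[917, 374], [95, 372]])
cm_rf = np.array([[1163, 128], [234, 233]])


def churn_metrics(cm):
    """Return (precision, recall) for the positive (churn) class."""
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


for name, cm in [("Logistic Regression (balanced)", cm_log_reg),
                 ("Random Forest", cm_rf)]:
    precision, recall = churn_metrics(cm)
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

The logistic regression catches 372 of the 467 churners in the test set, against 233 for the random forest, at the cost of 374 false alarms versus 128.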



In [30]:
feature_names_num = numeric_features
feature_names_cat = preprocessor.named_transformers_['cat']['encoder'].get_feature_names_out(categorical_features)
feature_names = np.concatenate([feature_names_num, feature_names_cat])

coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': log_reg_bal.coef_[0]
})

coef_df['abs_coef'] = coef_df['coefficient'].abs()
coef_df_sorted = coef_df.sort_values('abs_coef', ascending=False)

coef_df_sorted.head(10)

                           feature  coefficient  abs_coef
25               Contract_Two year    -1.411861  1.411861
0                           tenure    -1.250652  1.250652
10     InternetService_Fiber optic     1.200439  1.200439
24               Contract_One year    -0.713853  0.713853
2                     TotalCharges     0.607081  0.607081
1                   MonthlyCharges    -0.540088  0.540088
21                 StreamingTV_Yes     0.413756  0.413756
23             StreamingMovies_Yes     0.389913  0.389913
28  PaymentMethod_Electronic check     0.379165  0.379165
7                 PhoneService_Yes    -0.372617  0.372617


**Model Selection and Interpretation:**

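Of the models compared, class-weighted Logistic Regression was selected as the final model. On the held-out test set it reaches a churn-class recall of 0.80 and a ROC-AUC of 0.84, whereas the Random Forest reaches a churn-class recall of only 0.50. Using class-weight balancing further improved the model's ability to identify churned customers, at the cost of lower churn-class precision (0.50 versus 0.65 for the Random Forest). Coefficient analysis indicates that contract type, tenure, fiber-optic internet service, and total and monthly charges are among the features most strongly associated with churn risk. Because total charges accumulate with tenure and monthly charges, these three features are likely correlated, so the signs of their individual coefficients should be interpreted with caution.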
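
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to communicate. The sketch below applies this to a few coefficients copied from the table above; it is an illustration rather than part of the original analysis, and the `top_coefs` frame is a hypothetical name. Because the numeric features were standardized, their odds ratios correspond to a one-standard-deviation increase, while the one-hot features compare against the dropped reference category.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the top rows of coef_df_sorted above.
top_coefs = pd.DataFrame({
    "feature": ["Contract_Two year", "tenure",
                "InternetService_Fiber optic", "Contract_One year"],
    "coefficient": [-1.411861, -1.250652, 1.200439, -0.713853],
})

# exp(coefficient) is the multiplicative change in the odds of churn.
top_coefs["odds_ratio"] = np.exp(top_coefs["coefficient"])

print(top_coefs.round(3))
```

An odds ratio below 1 (for example, a two-year contract) means lower odds of churn relative to the dropped reference contract type, holding the other features fixed; a value above 1 (fiber-optic internet) means higher odds.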


In [32]:
risk_df = X_test.copy()
risk_df['churn_probability'] = y_proba_bal
risk_df['actual_churn'] = y_test.values

bins = [0.0, 0.3, 0.6, 1.0]
labels = ['Low Risk', 'Medium Risk', 'High Risk']

risk_df['risk_segment'] = pd.cut(
    risk_df['churn_probability'],
    bins=bins,
    labels=labels,
    include_lowest=True
)

risk_df['risk_segment'].value_counts(normalize=True)

risk_segment
Low Risk       0.408987
High Risk      0.350967
Medium Risk    0.240046
Name: proportion, dtype: float64

In [33]:
pd.crosstab(
    risk_df['risk_segment'],
    risk_df['actual_churn'],
    normalize='index'
)

actual_churn         0         1
risk_segment                    
Low Risk      0.952712  0.047288
Medium Risk   0.779621  0.220379
High Risk     0.448947  0.551053


**Risk Segmentation:**

Based on predicted churn probabilities, test-set customers were segmented into low-risk (probability up to 0.3), medium-risk (above 0.3 and up to 0.6), and high-risk (above 0.6) groups. The observed churn rate rises from about 5% in the low-risk segment to 22% in the medium-risk segment and 55% in the high-risk segment, which indicates that the model ranks customers by churn risk effectively. The business can use this segmentation to prioritize retention efforts and allocate resources more efficiently, for example by targeting high-risk customers with personalized retention offers or proactive customer support.
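
To judge how much of the churn a retention campaign could reach, it helps to combine the segment sizes with the per-segment churn rates. The sketch below does this arithmetic with the proportions printed above; the column names `share_of_customers`, `churn_rate`, and `share_of_churners` are made up for this illustration.

```python
import pandas as pd

# Segment shares and observed churn rates copied from the outputs above.
segments = pd.DataFrame(
    {
        "share_of_customers": [0.408987, 0.240046, 0.350967],
        "churn_rate": [0.047288, 0.220379, 0.551053],
    },
    index=["Low Risk", "Medium Risk", "High Risk"],
)

# Fraction of all test customers who are churners in each segment.
churners = segments["share_of_customers"] * segments["churn_rate"]

# Overall churn rate, and each segment's share of all churners.
overall_churn_rate = churners.sum()
segments["share_of_churners"] = churners / overall_churn_rate

print(f"Overall churn rate: {overall_churn_rate:.3f}")
print(segments.round(3))
```

Under these numbers the high-risk segment holds roughly 35% of test customers but about 73% of the churners, which is the practical argument for focusing retention spend there.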