# Telco Customer Churn Analysis

**Objective**: Identify churn drivers and build a predictive model for at-risk customers.

**Dataset**: [Telco Customer Churn (Kaggle)](https://www.kaggle.com/datasets/blastchar/telco-customer-churn)

**Contents**:
1. Data Loading & Inspection
2. Data Cleaning & Preprocessing
3. Exploratory Data Analysis (EDA)
4. Feature Engineering
5. Predictive Modeling (Logistic Regression + Decision Tree)
6. Model Evaluation & Feature Importance
7. Business Insights & Retention Strategies

## 1. Data Loading & Inspection

In [None]:
# importing the basic libraries we need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

In [None]:
# ignore warnings to keep output clean
import warnings
warnings.filterwarnings('ignore')

In [None]:
# importing scikit-learn stuff for machine learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [None]:
# importing metrics to evaluate our models
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, 
    roc_auc_score, confusion_matrix, classification_report, roc_curve
)

In [None]:
# setting up some visualization preferences to make charts look better
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 10

In [None]:
print("All libraries loaded successfully!")

In [None]:
# load the dataset from csv file
data_path = Path("../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df = pd.read_csv(data_path)

In [None]:
# check the size of our dataset
print(f"Dataset shape: {df.shape}")
print(f"We have {df.shape[0]} customers and {df.shape[1]} columns")

In [None]:
# let's see what the first few rows look like
df.head()

In [None]:
# get detailed info about column types and missing values
print("Column data types and non-null counts:")
df.info()

In [None]:
# statistical summary of numerical columns
print("="*60)
print("Statistical summary:")
df.describe()

## 2. Data Cleaning & Preprocessing

In [None]:
# checking for any missing values in each column
print("Missing values per column:")
print(df.isnull().sum())

In [None]:
# total count of missing values across entire dataset
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

In [None]:
# TotalCharges is stored as text instead of number, need to fix this
# converting to numeric, errors='coerce' will turn invalid values into NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

In [None]:
# see how many NaN values got created from that conversion
print(f"TotalCharges NaN count after conversion: {df['TotalCharges'].isnull().sum()}")

In [None]:
# filling missing TotalCharges with the median value (middle value)
df['TotalCharges'].fillna(df['TotalCharges'].median(), inplace=True)
print("Filled missing values with median")

In [None]:
# converting churn from Yes/No to 1/0 for modeling
# Yes becomes 1 (churned), No becomes 0 (stayed)
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
print("Converted Churn to binary (1 = Yes, 0 = No)")

In [None]:
# final check - data is clean now!
print("\nData cleaning complete!")
print(f"Final shape: {df.shape}")

## 3. Exploratory Data Analysis (EDA)

In [None]:
# calculate the overall churn rate (percentage of customers who left)
churn_rate = df['Churn'].mean()
print(f"Overall Churn Rate: {churn_rate:.2%}")

In [None]:
# creating a figure with 2 subplots side by side
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

In [None]:
# count how many customers churned vs stayed
churn_counts = df['Churn'].value_counts()
print(f"Retained: {churn_counts[0]}, Churned: {churn_counts[1]}")

In [None]:
# first subplot - bar chart showing counts
# green for retained, red for churned
ax[0].bar(['Retained', 'Churned'], churn_counts.values, color=['#2ecc71', '#e74c3c'])
ax[0].set_ylabel('Count')
ax[0].set_title('Customer Churn Distribution')

# adding count labels on top of bars
for i, v in enumerate(churn_counts.values):
    ax[0].text(i, v + 50, str(v), ha='center', fontweight='bold')

In [None]:
# second subplot - pie chart showing percentages
ax[1].pie(churn_counts.values, labels=['Retained', 'Churned'], autopct='%1.1f%%',
          colors=['#2ecc71', '#e74c3c'], startangle=90)
ax[1].set_title('Churn Rate')

plt.tight_layout()
plt.savefig('../images/churn_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# group by contract type and calculate churn rate for each
# sort from highest to lowest churn rate
contract_churn = df.groupby('Contract')['Churn'].mean().sort_values(ascending=False)
print("Churn Rate by Contract Type:")
print(contract_churn)

In [None]:
# create bar chart to visualize this
plt.figure(figsize=(10, 6))
contract_churn.plot(kind='bar', color='#3498db', edgecolor='black')
plt.title('Churn Rate by Contract Type', fontsize=14, fontweight='bold')
plt.xlabel('Contract Type')
plt.ylabel('Churn Rate')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

# add percentage labels on top of each bar
for i, v in enumerate(contract_churn.values):
    plt.text(i, v + 0.01, f'{v:.2%}', ha='center', fontweight='bold')
    
plt.tight_layout()
plt.savefig('../images/churn_by_contract.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# same thing but for payment method
payment_churn = df.groupby('PaymentMethod')['Churn'].mean().sort_values(ascending=False)
print("\nChurn Rate by Payment Method:")
print(payment_churn)

In [None]:
# horizontal bar chart (easier to read long labels)
plt.figure(figsize=(10, 6))
payment_churn.plot(kind='barh', color='#e67e22', edgecolor='black')
plt.title('Churn Rate by Payment Method', fontsize=14, fontweight='bold')
plt.xlabel('Churn Rate')
plt.ylabel('Payment Method')
plt.grid(axis='x', alpha=0.3)

# add percentage labels
for i, v in enumerate(payment_churn.values):
    plt.text(v + 0.01, i, f'{v:.2%}', va='center', fontweight='bold')
    
plt.tight_layout()
plt.savefig('../images/churn_by_payment.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# looking at the numerical features - tenure, monthly charges, total charges
numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
print(f"Analyzing: {numerical_cols}")

In [None]:
# create 3 histograms side by side
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for i, col in enumerate(numerical_cols):
    # plot histogram for retained customers (green)
    df[df['Churn'] == 0][col].hist(ax=axes[i], bins=30, alpha=0.6, label='Retained', 
                                     color='#2ecc71', edgecolor='black')
    # plot histogram for churned customers (red)
    df[df['Churn'] == 1][col].hist(ax=axes[i], bins=30, alpha=0.6, label='Churned', 
                                     color='#e74c3c', edgecolor='black')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].set_title(f'{col} Distribution by Churn')
    axes[i].legend()
    axes[i].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('../images/distributions.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# calculate correlation between numerical features and churn
numerical_data = df[numerical_cols + ['Churn']].corr()
print("Correlation matrix:")
print(numerical_data)

In [None]:
# visualize correlations with a heatmap
# red = positive correlation, blue = negative correlation
plt.figure(figsize=(8, 6))
sns.heatmap(numerical_data, annot=True, cmap='coolwarm', center=0, 
            linewidths=1, linecolor='black', fmt='.2f', cbar_kws={'label': 'Correlation'})
plt.title('Correlation Heatmap - Numerical Features & Churn', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('../images/correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# boxplots to check for outliers and see the distribution by churn status
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for i, col in enumerate(numerical_cols):
    df.boxplot(column=col, by='Churn', ax=axes[i], patch_artist=True)
    axes[i].set_xlabel('Churn (0=Retained, 1=Churned)')
    axes[i].set_ylabel(col)
    axes[i].set_title(f'{col} by Churn Status')
    axes[i].get_figure().suptitle('')  # remove the default title

plt.tight_layout()
plt.savefig('../images/boxplots.png', dpi=300, bbox_inches='tight')
plt.show()

## 4. Feature Engineering & Preparation for Modeling

In [None]:
# remove customerID column - it's just a unique identifier, not useful for prediction
df_model = df.drop('customerID', axis=1)
print(f"Dropped customerID column. Now have {df_model.shape[1]} columns")

In [None]:
# find all categorical (text) columns
categorical_cols = df_model.select_dtypes(include=['object']).columns.tolist()
print(f"Categorical columns to encode: {categorical_cols}")

In [None]:
# convert categorical columns to dummy variables (one-hot encoding)
# drop_first=True to avoid multicollinearity (drops one category as reference)
df_encoded = pd.get_dummies(df_model, columns=categorical_cols, drop_first=True)

print(f"\nBefore encoding: {df_model.shape[1]} features")
print(f"After encoding: {df_encoded.shape[1]} features")
print(f"Sample features: {list(df_encoded.columns[:10])}")

In [None]:
# separate features (X) from target variable (y)
X = df_encoded.drop('Churn', axis=1)  # all columns except Churn
y = df_encoded['Churn']  # only the Churn column
print(f"Features shape: {X.shape}, Target shape: {y.shape}")

In [None]:
# split data into training (80%) and testing (20%) sets
# stratify ensures same churn ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Train churn rate: {y_train.mean():.2%}")
print(f"Test churn rate: {y_test.mean():.2%}")

In [None]:
# scale features to similar range (important for logistic regression)
# fit on training data only to avoid data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling complete!")
print("All features now have mean=0 and std=1")

## 5. Predictive Modeling

### 5.1 Logistic Regression

In [None]:
# Train Logistic Regression
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Predictions
y_pred_lr = log_reg.predict(X_test_scaled)
y_pred_proba_lr = log_reg.predict_proba(X_test_scaled)[:, 1]

# Evaluation
acc_lr = accuracy_score(y_test, y_pred_lr)
prec_lr = precision_score(y_test, y_pred_lr)
rec_lr = recall_score(y_test, y_pred_lr)
f1_lr = f1_score(y_test, y_pred_lr)
roc_auc_lr = roc_auc_score(y_test, y_pred_proba_lr)

print("Logistic Regression Performance:")
print(f"Accuracy:  {acc_lr:.4f}")
print(f"Precision: {prec_lr:.4f}")
print(f"Recall:    {rec_lr:.4f}")
print(f"F1-Score:  {f1_lr:.4f}")
print(f"ROC-AUC:   {roc_auc_lr:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr, target_names=['Retained', 'Churned']))

### 5.2 Decision Tree

In [None]:
# Train Decision Tree (using unscaled data as trees are scale-invariant)
dt = DecisionTreeClassifier(random_state=42, max_depth=10, min_samples_split=20)
dt.fit(X_train, y_train)

# Predictions
y_pred_dt = dt.predict(X_test)
y_pred_proba_dt = dt.predict_proba(X_test)[:, 1]

# Evaluation
acc_dt = accuracy_score(y_test, y_pred_dt)
prec_dt = precision_score(y_test, y_pred_dt)
rec_dt = recall_score(y_test, y_pred_dt)
f1_dt = f1_score(y_test, y_pred_dt)
roc_auc_dt = roc_auc_score(y_test, y_pred_proba_dt)

print("Decision Tree Performance:")
print(f"Accuracy:  {acc_dt:.4f}")
print(f"Precision: {prec_dt:.4f}")
print(f"Recall:    {rec_dt:.4f}")
print(f"F1-Score:  {f1_dt:.4f}")
print(f"ROC-AUC:   {roc_auc_dt:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_dt, target_names=['Retained', 'Churned']))

## 6. Model Comparison & Evaluation

In [None]:
# Model comparison table
comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree'],
    'Accuracy': [acc_lr, acc_dt],
    'Precision': [prec_lr, prec_dt],
    'Recall': [rec_lr, rec_dt],
    'F1-Score': [f1_lr, f1_dt],
    'ROC-AUC': [roc_auc_lr, roc_auc_dt]
})

print("Model Performance Comparison:")
print(comparison.to_string(index=False))

In [None]:
# Confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Logistic Regression
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', ax=axes[0], 
            xticklabels=['Retained', 'Churned'], yticklabels=['Retained', 'Churned'])
axes[0].set_title('Logistic Regression - Confusion Matrix')
axes[0].set_ylabel('Actual')
axes[0].set_xlabel('Predicted')

# Decision Tree
cm_dt = confusion_matrix(y_test, y_pred_dt)
sns.heatmap(cm_dt, annot=True, fmt='d', cmap='Greens', ax=axes[1],
            xticklabels=['Retained', 'Churned'], yticklabels=['Retained', 'Churned'])
axes[1].set_title('Decision Tree - Confusion Matrix')
axes[1].set_ylabel('Actual')
axes[1].set_xlabel('Predicted')

plt.tight_layout()
plt.savefig('../images/confusion_matrices.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# ROC Curves
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_proba_lr)
fpr_dt, tpr_dt, _ = roc_curve(y_test, y_pred_proba_dt)

plt.figure(figsize=(10, 7))
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {roc_auc_lr:.3f})', linewidth=2, color='#3498db')
plt.plot(fpr_dt, tpr_dt, label=f'Decision Tree (AUC = {roc_auc_dt:.3f})', linewidth=2, color='#2ecc71')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve Comparison', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('../images/roc_curves.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Feature Importance Analysis

In [None]:
# Feature importance from Decision Tree
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': dt.feature_importances_
}).sort_values('Importance', ascending=False).head(15)

print("Top 15 Most Important Features (Decision Tree):")
print(feature_importance.to_string(index=False))

plt.figure(figsize=(10, 8))
plt.barh(range(len(feature_importance)), feature_importance['Importance'], color='#9b59b6', edgecolor='black')
plt.yticks(range(len(feature_importance)), feature_importance['Feature'])
plt.xlabel('Feature Importance', fontsize=12)
plt.title('Top 15 Churn-Driving Features', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('../images/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Logistic Regression coefficients (top absolute values)
lr_coef = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': log_reg.coef_[0]
}).assign(AbsCoef=lambda df: df['Coefficient'].abs()).sort_values('AbsCoef', ascending=False).head(15)

print("\nTop 15 Features by Logistic Regression Coefficient Magnitude:")
print(lr_coef[['Feature', 'Coefficient']].to_string(index=False))

plt.figure(figsize=(10, 8))
colors = ['#e74c3c' if c < 0 else '#2ecc71' for c in lr_coef['Coefficient']]
plt.barh(range(len(lr_coef)), lr_coef['Coefficient'], color=colors, edgecolor='black')
plt.yticks(range(len(lr_coef)), lr_coef['Feature'])
plt.xlabel('Coefficient Value', fontsize=12)
plt.title('Top 15 Features - Logistic Regression Coefficients', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linewidth=1, linestyle='--')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('../images/lr_coefficients.png', dpi=300, bbox_inches='tight')
plt.show()

## 8. Business Insights & Retention Strategies

### Key Findings

**High-Risk Segments:**
1. **Month-to-month contracts** - Highest churn rate (~42%)
2. **Electronic check payment** - Elevated churn compared to auto-pay methods
3. **Low tenure customers** - First 12 months are critical
4. **Fiber optic internet users** - Potentially due to pricing or service quality issues
5. **Customers without online security/tech support** - Lack of value-added services

**Protective Factors:**
- Long-term contracts (1-2 years)
- Auto-payment methods (bank transfer, credit card)
- Higher tenure (>24 months)
- Multiple service subscriptions

### Actionable Retention Strategies

1. **Contract Incentives**
   - Offer discounts for customers switching from month-to-month to annual contracts
   - Create loyalty rewards program for contract renewals

2. **Payment Method Optimization**
   - Promote auto-pay enrollment with incentives (e.g., $5/month discount)
   - Simplify payment process for electronic check users

3. **First-Year Engagement**
   - Intensive customer success outreach in months 1-12
   - Onboarding program to maximize value realization
   - Proactive support check-ins

4. **Service Bundling**
   - Cross-sell online security, tech support to at-risk segments
   - Create attractive bundles for fiber optic customers

5. **Predictive Intervention**
   - Deploy this model in production to score customers monthly
   - Trigger retention campaigns for customers with churn probability >0.6
   - Prioritize high-value customers (high MonthlyCharges/TotalCharges)

### Model Deployment Recommendation

Use the **Decision Tree model** for production due to:
- Better interpretability for business stakeholders
- Strong performance (F1-Score ~0.74, ROC-AUC ~0.84)
- Easier to explain feature importance to retention teams