# Customer Churn Prediction Analysis

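This notebook walks through a complete machine learning pipeline for predicting customer churn on synthetic customer data, from data loading and exploration through model training, evaluation, and a business impact estimate.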

## Table of Contents
1. [Data Loading and Exploration](#data-loading)
2. [Exploratory Data Analysis](#eda)
3. [Data Preprocessing](#preprocessing)
4. [Model Training](#training)
5. [Model Evaluation](#evaluation)
6. [Feature Importance Analysis](#features)
7. [Hyperparameter Optimization](#optimization)
8. [Business Insights](#insights)

In [None]:
import sys
import os
sys.path.append(os.path.join(os.getcwd(), '..', 'src'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from data_preprocessor import DataPreprocessor
from model_trainer import ModelTrainer
from visualizer import Visualizer

plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

## 1. Data Loading and Exploration {#data-loading}

In [None]:
# Initialize our classes
preprocessor = DataPreprocessor()
trainer = ModelTrainer()
viz = Visualizer()

# Create synthetic customer churn data
df = preprocessor.create_synthetic_data(n_samples=10000)
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
df.head()

In [None]:
# Basic data info
print("Dataset Info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nMissing values:")
print(df.isnull().sum())
print("\nChurn distribution:")
print(df['churn'].value_counts())
print(f"\nChurn rate: {df['churn'].mean():.2%}")

## 2. Exploratory Data Analysis {#eda}

In [None]:
# Plot data distributions
viz.plot_data_distribution(df)

In [None]:
# Correlation analysis
viz.plot_correlation_matrix(df)

In [None]:
# Analyze churn by categorical variables
categorical_cols = ['contract', 'payment_method', 'internet_service', 'paperless_billing']

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()

for i, col in enumerate(categorical_cols):
    churn_rate = df.groupby(col)['churn'].mean()
    churn_rate.plot(kind='bar', ax=axes[i], color='skyblue', edgecolor='black')
    axes[i].set_title(f'Churn Rate by {col.replace("_", " ").title()}')
    axes[i].set_ylabel('Churn Rate')
    axes[i].tick_params(axis='x', rotation=45)
    
    # Add value labels on bars
    for j, v in enumerate(churn_rate.values):
        axes[i].text(j, v + 0.01, f'{v:.2%}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

## 3. Data Preprocessing {#preprocessing}

In [None]:
# Preprocess the data
X, y = preprocessor.preprocess_features(df, is_training=True)
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature names: {list(X.columns)}")

# Split the data
X_train, X_test, y_train, y_test = preprocessor.split_data(X, y)
print(f"\nTraining set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
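
`split_data` lives in `src/data_preprocessor.py` and is not shown in this notebook. Because churn is usually the minority class, the split should preserve the churn rate in both partitions. The following is a minimal standard-library sketch of a stratified split, not the project's implementation; the function name `stratified_split` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with class proportions preserved.

    Hypothetical helper for illustration; it works on row indices only.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test partition.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 20% churn rate.
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels, test_size=0.2)
test_churn_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"Test rows: {len(test_idx)}, test churn rate: {test_churn_rate:.0%}")
```

With scikit-learn, the same effect is usually obtained by passing `stratify=y` to `train_test_split`.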

## 4. Model Training {#training}

In [None]:
# Train multiple models
print("Training multiple models...")
trainer.train_models(X_train, y_train, use_smote=True)
print("\nTrained models:", list(trainer.models.keys()))
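
The `use_smote=True` flag presumably enables SMOTE (Synthetic Minority Over-sampling Technique) inside `ModelTrainer`, whose code is not shown here. As a rough illustration of the idea only, the sketch below creates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. The function `smote_sketch` and the toy data are hypothetical; a real pipeline would more likely rely on a library such as imbalanced-learn.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Synthesize n_new points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority-class samples with two features each.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_sketch(minority, n_new=6)
print(f"Generated {len(new_points)} synthetic minority samples")
```

Oversampling of this kind should be applied to the training partition only, as the cell above does, so that the test set keeps the original class balance.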

## 5. Model Evaluation {#evaluation}

In [None]:
# Evaluate all models
results = trainer.evaluate_all_models(X_test, y_test)

# Create results summary
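results_df = pd.DataFrame(results).T
# Drop the array-valued column, then cast the remaining metrics from object dtype
# to float so that round() and numeric sorting behave as expected
results_df = results_df.drop(columns='confusion_matrix').astype(float)
results_df = results_df.round(4)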
print("Model Performance Summary:")
print(results_df.sort_values('roc_auc', ascending=False))

In [None]:
# Visualize model comparison
viz.plot_model_comparison(results)

In [None]:
# Plot confusion matrices
viz.plot_confusion_matrices(results)

In [None]:
# Plot ROC curves
viz.plot_roc_curves(trainer.models, X_test, y_test)

In [None]:
# Cross-validation results
cv_results = trainer.cross_validate_models(X_train, y_train)
cv_df = pd.DataFrame(cv_results).T
print("Cross-Validation Results (ROC-AUC):")
print(cv_df.sort_values('mean_cv_score', ascending=False))

## 6. Feature Importance Analysis {#features}

In [None]:
# Feature importance from Random Forest
rf_importance = trainer.get_feature_importance('random_forest')
if rf_importance is not None:
    viz.plot_feature_importance(
        preprocessor.feature_names, 
        rf_importance, 
        'Random Forest Feature Importance'
    )

In [None]:
# Feature importance from XGBoost
xgb_importance = trainer.get_feature_importance('xgboost')
if xgb_importance is not None:
    viz.plot_feature_importance(
        preprocessor.feature_names, 
        xgb_importance, 
        'XGBoost Feature Importance'
    )

## 7. Hyperparameter Optimization {#optimization}

In [None]:
# Optimize XGBoost hyperparameters
print("Optimizing XGBoost hyperparameters...")
X_train_split, X_val_split, y_train_split, y_val_split = preprocessor.split_data(
    X_train, y_train, test_size=0.2
)

best_params, best_score = trainer.optimize_xgboost(
    X_train_split, y_train_split, X_val_split, y_val_split, n_trials=20
)

print(f"\nBest parameters: {best_params}")
print(f"Best validation score: {best_score:.4f}")

# Evaluate optimized model
optimized_results = trainer.evaluate_model(trainer.best_model, X_test, y_test)
print(f"\nOptimized XGBoost Test ROC-AUC: {optimized_results['roc_auc']:.4f}")

## 8. Business Insights {#insights}

In [None]:
# Calculate business impact
# Cast to float before idxmax: the metrics column may be stored with object dtype
best_model_name = results_df['roc_auc'].astype(float).idxmax()
best_model = trainer.models[best_model_name]

# Predict on test set
y_pred_proba = best_model.predict_proba(X_test)[:, 1]
high_risk_threshold = 0.7
high_risk_customers = (y_pred_proba >= high_risk_threshold).sum()

print(f"Best performing model: {best_model_name}")
print(f"Model ROC-AUC: {results[best_model_name]['roc_auc']:.4f}")
print(f"\nBusiness Impact Analysis:")
print(f"- Total test customers: {len(y_test):,}")
print(f"- Actual churned customers: {y_test.sum():,}")
print(f"- High-risk customers identified (prob >= {high_risk_threshold}): {high_risk_customers:,}")
print(f"- Potential intervention rate: {high_risk_customers/len(y_test):.1%}")

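# Cost-benefit analysis
avg_revenue_per_customer = df['monthly_charges'].mean() * 12  # Annual revenue
intervention_cost = 50  # Assumed cost of one retention offer
retention_success_rate = 0.3  # Assumed: 30% of contacted churners are retained

# Only flagged customers who would actually churn represent revenue at risk;
# flagged non-churners add intervention cost but no saved revenue.
true_positives = ((y_pred_proba >= high_risk_threshold) & (np.asarray(y_test) == 1)).sum()

potential_saved_revenue = true_positives * avg_revenue_per_customer * retention_success_rate
intervention_cost_total = high_risk_customers * intervention_cost
net_benefit = potential_saved_revenue - intervention_cost_total

print(f"\nCost-Benefit Analysis:")
print(f"- Average annual revenue per customer: ${avg_revenue_per_customer:.2f}")
print(f"- Intervention cost per customer: ${intervention_cost:.2f}")
print(f"- Assumed retention success rate: {retention_success_rate:.0%}")
print(f"- Flagged customers who actually churned: {true_positives:,}")
print(f"- Potential saved revenue: ${potential_saved_revenue:,.2f}")
print(f"- Total intervention cost: ${intervention_cost_total:,.2f}")
print(f"- Net benefit: ${net_benefit:,.2f}")
# Guard against division by zero when no customer reaches the threshold
if intervention_cost_total > 0:
    print(f"- ROI: {(net_benefit/intervention_cost_total)*100:.1f}%")
else:
    print("- ROI: n/a (no customers flagged at this threshold)")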
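
The 0.7 cut-off used above is a fixed choice. One way to check it is to compute the net benefit at several thresholds and compare. The sketch below does this on toy data, counting saved revenue only for flagged customers who actually churn. The function `net_benefit_at` is hypothetical, and the revenue, cost, and success-rate figures are illustrative assumptions rather than results from this notebook.

```python
def net_benefit_at(probs, labels, threshold, annual_revenue,
                   intervention_cost=50.0, success_rate=0.3):
    """Net benefit of contacting every customer scored at or above the threshold."""
    flagged = [(p, y) for p, y in zip(probs, labels) if p >= threshold]
    # Revenue can only be saved for flagged customers who would really churn.
    true_churners = sum(y for _, y in flagged)
    saved = true_churners * annual_revenue * success_rate
    # Every flagged customer costs one intervention, churner or not.
    cost = len(flagged) * intervention_cost
    return saved - cost


# Toy predicted probabilities and true churn labels (illustrative only).
probs = [0.95, 0.80, 0.75, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]

sweep = {
    t: net_benefit_at(probs, labels, t, annual_revenue=800.0)
    for t in (0.3, 0.5, 0.7, 0.9)
}
best_threshold = max(sweep, key=sweep.get)

for t, value in sweep.items():
    print(f"threshold {t:.1f}: net benefit ${value:,.2f}")
print(f"Best threshold on this toy data: {best_threshold}")
```

In the notebook itself, `probs` and `labels` would correspond to `y_pred_proba` and `y_test`. The threshold should ideally be chosen on a validation split rather than the test set, so that the reported benefit is not optimistically biased.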

## Key Insights and Recommendations

Based on the analysis, here are the key findings:

### Model Performance
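- The best-performing model is the one with the highest test-set ROC-AUC in the Model Evaluation summary; ROC-AUC measures how reliably the model ranks churners above non-churners
- Multiple algorithms were trained and compared on the same train/test split
- Cross-validation on the training set checks that the model ranking does not depend on a single split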

### Important Churn Factors
1. **Contract Type**: Month-to-month contracts show higher churn rates
2. **Tenure**: Newer customers are more likely to churn
3. **Payment Method**: Electronic check users have higher churn
4. **Monthly Charges**: Higher charges correlate with increased churn risk

### Business Recommendations
1. **Retention Strategy**: Focus on month-to-month contract customers
2. **Onboarding**: Improve the new customer experience in the first 12 months
3. **Payment Options**: Encourage customers to switch from electronic checks
4. **Pricing Strategy**: Review pricing for high-charge customers
5. **Proactive Intervention**: Target high-risk customers identified by the model

### Implementation
- Deploy the model to score customers monthly
- Set up automated alerts for high-risk customers
- A/B test retention strategies on identified high-risk segments
- Monitor model performance and retrain quarterly
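
The monitoring step above can be sketched as a periodic check that compares ROC-AUC on recently labelled customers against the score recorded at deployment. The code below is a minimal standard-library illustration. The helper names `roc_auc` and `needs_retraining`, the 0.05 tolerance, and the toy data are all assumptions, not part of the project's code.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random churner outscores a random non-churner."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def needs_retraining(baseline_auc, current_auc, tolerance=0.05):
    """Flag the model when ROC-AUC falls more than `tolerance` below the baseline."""
    return current_auc < baseline_auc - tolerance


# Toy batch of recently observed outcomes and the scores the model gave them.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.2, 0.8]

current_auc = roc_auc(labels, scores)
print(f"Current ROC-AUC: {current_auc:.3f}")
print(f"Retrain needed: {needs_retraining(0.95, current_auc)}")
```

The pairwise computation is quadratic in the number of customers, so it suits small monitoring batches; for larger batches, scikit-learn's `roc_auc_score` is the more practical choice.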