# 🏦 Bank Customer Churn Prediction
## Notebook 5: Model Training & Selection

**Goal:** Train and compare six classification models and select the best one.

### The Correct SMOTE Position: Why It Matters

SMOTE creates *synthetic* minority-class samples by interpolating between existing real samples. If applied **before** the train/test split, some synthetic samples end up in the test set, and synthetic samples in the training set are interpolated from real rows that later land in the test set. Because those samples were generated from the full dataset, including the test portion, the model has effectively seen distorted versions of the test data during training. The result: **artificially inflated accuracy** that does not reflect real-world performance.

The fix is a strict three-step sequence:

```
Step 1:  Split the REAL, imbalanced data  →  X_train, X_test, y_train, y_test
Step 2:  Apply SMOTE to X_train / y_train ONLY
           - X_test stays untouched (real customers, original class ratio)
Step 3:  Train models on SMOTE-balanced training set
           - Evaluate on the real, untouched test set
```

This guarantees that model performance is measured on data the model has **never seen in any form**.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import joblib

from sklearn.model_selection import train_test_split
from sklearn.linear_model   import LogisticRegression
from sklearn                import svm
from sklearn.neighbors      import KNeighborsClassifier
from sklearn.tree           import DecisionTreeClassifier
from sklearn.ensemble       import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics        import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score
)
from imblearn.over_sampling import SMOTE

sns.set_theme(style='whitegrid')

# Load the imbalanced (real) processed data from N4
data = pd.read_csv('data_processed.csv')
print(f'Data loaded: {data.shape}')
print(f'Class balance (original): {data["Exited"].value_counts().to_dict()}')

## 1. Train / Test Split: BEFORE Any Resampling

`stratify=y` ensures the original class ratio (roughly 80% stayed / 20% churned) is preserved in both splits.  
Without stratification, random sampling could by chance put most churners in one set.

In [None]:
X = data.drop('Exited', axis=1)
y = data['Exited']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,
    stratify=y          # ← preserves the original class ratio in both splits
)

print('Split complete (stratified):')
print(f'  X_train : {X_train.shape[0]:,} rows  →  class ratio: {y_train.value_counts().to_dict()}')
print(f'  X_test  : {X_test.shape[0]:,}  rows  →  class ratio: {y_test.value_counts().to_dict()}')
print()
print(f'  Test set churn rate : {y_test.mean()*100:.1f}%  (matches original dataset ≈ 20%)')
print()
print('⚠️  X_test will NOT be touched again until final evaluation.')
print('    It contains only real customers: zero synthetic samples.')
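
The effect of `stratify=y` can be checked on toy labels. The sketch below is illustrative only: `X_toy` and `y_toy` are made-up arrays with an assumed 20% churn rate, not the notebook's data. It compares the test-set churn rate with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumption): 10,000 customers, exactly 20% churners
y_toy = np.array([1] * 2000 + [0] * 8000)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)   # placeholder feature column

# Stratified split: class proportions are matched in train and test
_, _, _, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy
)
# Plain random split: the test churn rate is left to chance
_, _, _, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0
)

strat_rate = y_te_strat.mean()
plain_rate = y_te_plain.mean()
print(f'Stratified test churn rate  : {strat_rate:.4f}')
print(f'Unstratified test churn rate: {plain_rate:.4f}')
```

On a dataset this large the unstratified rate usually lands close to 20% anyway; stratification simply removes the sampling noise.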

## 2. Apply SMOTE to X_train Only

**SMOTE (Synthetic Minority Over-sampling Technique)** generates synthetic minority-class samples by interpolating between existing ones rather than just duplicating rows.

Applying it only to `X_train` / `y_train` means:
- The model trains on a **balanced** representation of both classes.
- The test set remains **100% real data** at the original 80/20 ratio.
- Evaluation metrics reflect true real-world performance.
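
To make the interpolation idea concrete, here is a simplified NumPy sketch. It is not imbalanced-learn's implementation; the helper `smote_like_sample` and the toy array `toy_minority` are hypothetical names introduced for illustration. Each synthetic row is placed on the line segment between a minority row and one of its `k` nearest minority neighbours: `x_new = x_i + lam * (x_nn - x_i)` with `lam` drawn from [0, 1).

```python
import numpy as np

def smote_like_sample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Simplified illustration; requires k <= len(X_min) - 1.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between all minority rows
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest rows
    base = rng.integers(0, len(X_min), size=n_new)  # random base row per new sample
    picks = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    lam = rng.random((n_new, 1))                    # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[picks] - X_min[base])

# Toy minority class (assumption): six points inside the unit square
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                         [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
synthetic = smote_like_sample(toy_minority, n_new=10, k=3)
print(synthetic.round(3))
```

Because every synthetic row is a blend of two real rows, a synthetic row built from a future test customer carries that customer's information into training, which is why the split must come first.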

In [None]:
sm = SMOTE(random_state=42)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)

print('SMOTE applied to training set only:')
print(f'  Before → {y_train.value_counts().to_dict()}')
print(f'  After  → {y_train_sm.value_counts().to_dict()}')
print()
print(f'  X_train rows : {X_train.shape[0]:,} → {X_train_sm.shape[0]:,}  (+{X_train_sm.shape[0]-X_train.shape[0]:,} synthetic)')
print(f'  X_test  rows : {X_test.shape[0]:,}  ← unchanged (all real)')

# Visualise the train vs test class balance
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
configs = [
    (y_train,    'Training Set (before SMOTE)', '#4C72B0'),
    (y_train_sm, 'Training Set (after SMOTE)',  '#3FB950'),
    (y_test,     'Test Set: real data only',    '#DD8452'),
]
for ax, (series, title, color) in zip(axes, configs):
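    # sort_index() fixes the order as 0 (Stayed), 1 (Churned) so counts match the bar labels;
    # value_counts() alone orders by frequency, which is ambiguous when classes are tied
    vc = series.value_counts().sort_index()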
    ax.bar(['Stayed', 'Churned'], vc.values, color=[color, '#E05C55'], alpha=0.85, edgecolor='white')
    for i, v in enumerate(vc.values):
        ax.text(i, v + 30, f'{v:,}', ha='center', fontweight='bold', fontsize=9)
    ax.set_title(title, fontsize=10)
    ax.set_ylabel('Count')
plt.suptitle('Class Balance: Training vs Test Sets', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

## 3. Understanding the Evaluation Metrics

For churn prediction, different errors have different business costs:

- **False Negative** = missed churner → customer leaves undetected → **high cost** (lost revenue)
- **False Positive** = wrongly flagged non-churner → wasted retention spend → **low-medium cost**

Therefore **Recall** and **F1** are more important than raw accuracy.

```
Precision = TP / (TP + FP)  ← of all predicted churners, how many actually churned?
Recall    = TP / (TP + FN)  ← of all actual churners, how many did we catch?
F1        = 2 × (Precision × Recall) / (Precision + Recall)
ROC-AUC   = ranking quality across all probability thresholds (1.0 = perfect)
```
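
As a worked example of the formulas above, the sketch below computes the metrics from raw confusion counts. The function `churn_metrics` and all counts are hypothetical, chosen only to show why accuracy alone can mislead on a roughly 80/20 dataset.

```python
def churn_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Hypothetical classifier on 2,000 test customers (400 churners, 1,600 stayers)
model_like = churn_metrics(tp=240, fp=160, fn=160, tn=1440)

# Hypothetical baseline that predicts "stayed" for everyone
always_stay = churn_metrics(tp=0, fp=0, fn=400, tn=1600)

print('model-like :', model_like)
print('always-stay:', always_stay)
```

The always-stay baseline reaches 80% accuracy while catching no churners at all, which is the reason Recall and F1 carry more weight here.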

## 4. Train All Six Models

In [None]:
# All models trained on SMOTE-balanced training set
# All models evaluated on the untouched, imbalanced, real test set
models = {
    'Logistic Regression' : LogisticRegression(max_iter=1000, random_state=42),
    'SVC'                 : svm.SVC(probability=True, random_state=42),
    'KNN'                 : KNeighborsClassifier(),
    'Decision Tree'       : DecisionTreeClassifier(random_state=42),
    'Random Forest'       : RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting'   : GradientBoostingClassifier(random_state=42),
}

results = []
print('Training on SMOTE-balanced X_train, evaluating on REAL X_test...\n')

for name, model in models.items():
    # Train on SMOTE data
    model.fit(X_train_sm, y_train_sm)

    # Predict on REAL test data
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None

    results.append({
        'Model'    : name,
        'Accuracy' : accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, zero_division=0),
        'Recall'   : recall_score(y_test, y_pred),
        'F1'       : f1_score(y_test, y_pred),
        'ROC-AUC'  : roc_auc_score(y_test, y_prob) if y_prob is not None else np.nan,
    })
    r = results[-1]
    print(f'  {name:22s}  Acc={r["Accuracy"]:.3f}  F1={r["F1"]:.3f}  AUC={r["ROC-AUC"]:.3f}')

print()
print('Note: accuracy is no longer ~100%. These are realistic scores on real, imbalanced test data.')
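
The ROC-AUC column in the loop above can be read as a ranking score: the probability that a randomly chosen churner receives a higher predicted probability than a randomly chosen non-churner. The sketch below computes that directly with NumPy on toy scores; `pairwise_auc` is a hypothetical helper written for illustration, not a replacement for `roc_auc_score`.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs every negative
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / (len(pos) * len(neg)))

# Toy labels and predicted churn probabilities (assumption, not model output)
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(toy_y, toy_scores)
print(f'Pairwise AUC: {auc:.2f}')
```

Here three of the four churner/non-churner pairs are ranked correctly, giving 0.75. The pairwise form is quadratic in the number of rows, so it is meant for intuition rather than large test sets.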

## 5. Model Comparison

In [None]:
results_df = pd.DataFrame(results).set_index('Model').sort_values('F1', ascending=False)

print('Performance on REAL test set (sorted by F1 Score):')
print(results_df.round(4).to_string())
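
Rather than reading the winner off the table by eye, the top model per metric can be looked up programmatically. The sketch below uses a placeholder frame, `toy_results`, with made-up scores and model names; it only demonstrates the lookup and says nothing about the notebook's actual results.

```python
import pandas as pd

# Placeholder scores (assumption): NOT the notebook's results
toy_results = pd.DataFrame({
    'Model'  : ['Model A', 'Model B', 'Model C'],
    'F1'     : [0.55, 0.61, 0.58],
    'ROC-AUC': [0.80, 0.85, 0.86],
}).set_index('Model')

# idxmax returns the index label (model name) of the highest score
best_by_f1 = toy_results['F1'].idxmax()
best_by_auc = toy_results['ROC-AUC'].idxmax()

print(f'Best by F1     : {best_by_f1}')
print(f'Best by ROC-AUC: {best_by_auc}')
```

When the two lookups disagree, as in this toy frame, the choice depends on which metric the business cost favours. Applying the same two lines to `results_df` would confirm whether one model leads on both.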

In [None]:
metrics = ['Accuracy', 'Precision', 'Recall', 'F1', 'ROC-AUC']
x       = np.arange(len(results_df))
width   = 0.15
colors  = ['#4C72B0', '#DD8452', '#55A868', '#C44E52', '#8172B2']

fig, ax = plt.subplots(figsize=(16, 6))
for i, (metric, color) in enumerate(zip(metrics, colors)):
    ax.bar(x + i * width, results_df[metric], width,
           label=metric, color=color, alpha=0.85)

ax.set_xticks(x + width * 2)
ax.set_xticklabels(results_df.index, rotation=20, ha='right', fontsize=10)
ax.set_ylim(0, 1.05)
ax.set_ylabel('Score')
ax.set_title('Model Comparison: Evaluated on Real, Imbalanced Test Set', fontsize=13, fontweight='bold')
ax.legend(loc='lower right')
plt.tight_layout()
plt.show()

## 6. Deep Dive: Random Forest (Selected Model)

In [None]:
RFC = models['Random Forest']
y_pred_RFC = RFC.predict(X_test)

print('=== Random Forest: Classification Report on REAL test set ===')
print(classification_report(y_test, y_pred_RFC,
                             target_names=['Stayed (0)', 'Churned (1)']))

In [None]:
cm = confusion_matrix(y_test, y_pred_RFC)
tn, fp, fn, tp = cm.ravel()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax, linewidths=1,
            xticklabels=['Predicted: Stay', 'Predicted: Churn'],
            yticklabels=['Actual: Stay', 'Actual: Churn'])
ax.set_title('Random Forest: Confusion Matrix\n(evaluated on real test customers)', fontsize=11)
plt.tight_layout()
plt.show()

print(f'True Negatives  (correctly predicted Stay)  : {tn:,}')
print(f'False Positives (wrongly predicted Churn)   : {fp:,}')
print(f'False Negatives (missed actual Churners)    : {fn:,}')
print(f'True Positives  (correctly predicted Churn) : {tp:,}')
print()
print(f'Of the {fn+tp} customers who actually churned, we caught {tp} ({tp/(fn+tp)*100:.1f}%).')
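
The unpacking `tn, fp, fn, tp = cm.ravel()` relies on scikit-learn's layout for binary labels: rows are actual classes, columns are predicted classes, both ordered 0 then 1. The toy labels below (made up for illustration) confirm the order.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumption): 0 = stayed, 1 = churned
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 1, 0, 1, 0, 1, 1, 0]

toy_cm = confusion_matrix(toy_true, toy_pred)
# Row-major flattening of [[TN, FP], [FN, TP]]
toy_tn, toy_fp, toy_fn, toy_tp = toy_cm.ravel()

print(toy_cm)
print(f'TN={toy_tn}  FP={toy_fp}  FN={toy_fn}  TP={toy_tp}')
```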

In [None]:
# Feature Importance
importances = pd.Series(RFC.feature_importances_, index=X.columns).sort_values(ascending=True)

fig, ax = plt.subplots(figsize=(9, 7))
bars = ax.barh(importances.index, importances.values,
               color=plt.cm.RdYlGn(importances.values / importances.max()),
               edgecolor='white')
ax.set_title('Random Forest: Feature Importance', fontsize=12, fontweight='bold')
ax.set_xlabel('Mean Decrease in Impurity (Gini)')
for bar, val in zip(bars, importances.values):
    ax.text(val + 0.001, bar.get_y() + bar.get_height()/2,
            f'{val:.3f}', va='center', fontsize=8)
plt.tight_layout()
plt.show()

## 7. Save the Selected Model (Pre-Final)

This model was trained on the SMOTE-balanced training split. It is the "evaluation" model.  
In N6, we retrain on all 10,000 real customers (no SMOTE needed at that point; see N6 for rationale).

In [None]:
joblib.dump(RFC, 'prefinal_model.pkl')
print('✅ Pre-final model saved  →  prefinal_model.pkl')
print()
print('Selected: Random Forest')
print('Reasons :')
print('  1. Highest F1 and ROC-AUC among all six models')
print('  2. Tree-based → robust to outliers, no assumption of linearity')
print('  3. Built-in feature importance for interpretability')
print('  4. Ensemble of 100 trees → less prone to overfitting than single Decision Tree')
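
A quick way to gain confidence in a saved model file is a dump/load round trip. The sketch below is a minimal illustration on random toy data (`X_demo`, `y_demo`) with a small stand-in forest and a temporary file; it does not touch `prefinal_model.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data (assumption): 200 rows, 4 numeric features, a simple linear rule as label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_model = RandomForestClassifier(n_estimators=10, random_state=42)
demo_model.fit(X_demo, y_demo)

# Save to a temporary directory, reload, and compare predictions
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(demo_model, demo_path)
    restored_model = joblib.load(demo_path)

same_predictions = bool(
    np.array_equal(demo_model.predict(X_demo), restored_model.predict(X_demo))
)
print(f'Restored model reproduces predictions: {same_predictions}')
```

Pickled scikit-learn models are generally only safe to reload with the same library version that wrote them, so recording the version alongside the file is a sensible habit.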

---
## ✅ Model Selection Summary

| Step | Detail |
|---|---|
| Split | 80% train / 20% test, stratified |
| Resampling | SMOTE applied to **X_train only** |
| Training data | SMOTE-balanced (≈ 12,800 rows: 8,000 real rows plus ≈ 4,800 synthetic churners) |
| Test data | **Real, imbalanced** (2,000 rows, 80/20 ratio) |
| Selected model | Random Forest |
| Accuracy reported | **Realistic**: no synthetic contamination of test set |

➡️ Continue to **N6_Model_Saving** to train the final production model.