<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/labs/lab03_churn_neural_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 3 - Churn Prediction: Full Pipeline
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Points:** 20 | **Format:** Individual | **Due:** End of Week 4

| Part | Skills (Chapter) | Points |
|------|-----------------|--------|
| A: EDA | Cramér's V, Mann-Whitney U, business cost (Ch. 4) | 4 |
| B: Logistic Regression | Baseline model + coefficient interpretation (Ch. 4) | 3 |
| C: Neural Network | Keras ANN + dropout + early stopping (Ch. 5) | 4 |
| D: Model Comparison | ROC curves + metrics table (Ch. 5) | 3 |
| E: Written Analysis | Business recommendation (300+ words) | 4 |
| F: Preprocessing | Pipeline runs correctly | 2 |
| Bonus | Third model variant | +3 |

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 GRADING PHILOSOPHY</strong><br>
  This lab rewards <strong>process over perfection</strong>. If your ANN performs <em>worse</em> than logistic regression, that's a valid result; your written analysis should explain why.
</div>

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #7D6608;">⚠️ IMPORTANT</strong><br>
  Do NOT use the Telco dataset from class. You must use one of the two options below. Using the Telco dataset incurs a <strong>5-point deduction</strong>.
</div>

### Student Information
- **Name:**
- **Date:**
- **Dataset Chosen:** (A or B)

---
## Setup

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run this cell. Do not modify.
</div>

In [None]:
# ============================================================
# Setup - Run this cell. Do not modify.
# ============================================================
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from scipy.stats import chi2_contingency, mannwhitneyu
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, confusion_matrix,
                             ConfusionMatrixDisplay, roc_curve, roc_auc_score,
                             accuracy_score, precision_score, recall_score, f1_score)

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

np.random.seed(42)
tf.random.set_seed(42)

plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

# Helper functions (pre-built; use these in your EDA)
def cramers_v(x, y):
    """Cramér's V: association between two categorical variables (0–1)."""
    ct = pd.crosstab(x, y)
    chi2 = chi2_contingency(ct)[0]
    n = ct.sum().sum()
    r, k = ct.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

def cohens_d(group1, group2):
    """Cohen's d: effect size between two groups."""
    n1, n2 = len(group1), len(group2)
    pooled = np.sqrt(((n1-1)*group1.std()**2 + (n2-1)*group2.std()**2) / (n1+n2-2))
    return (group1.mean() - group2.mean()) / pooled if pooled > 0 else 0

print(f"TensorFlow: {tf.__version__}")
print("✅ Setup complete. Helper functions loaded: cramers_v(), cohens_d()")

---
## Choose Your Dataset + Run Preprocessing

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
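  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Use <strong>ONE</strong> option below and run the cell. Option A is active by default; to use Option B, comment out the Option A lines and uncomment the Option B lines. This handles all preprocessing and gives you clean train/test splits.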
</div>

In [None]:
# ============================================================
# OPTION A - Bank Customer Churn (~10,000 rows)
# Active by default. Comment out the lines below if choosing Option B
# ============================================================
url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/Churn_Modelling.csv"
df_raw = pd.read_csv(url)
TARGET = "Exited"
DOMAIN = "Banking"

# Preprocessing
df = df_raw.drop(columns=["RowNumber", "CustomerId", "Surname"])
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})
df = pd.get_dummies(df, columns=["Geography"], drop_first=True, dtype=int)

# Feature lists for EDA
cat_features = ["Gender", "HasCrCard", "IsActiveMember", "NumOfProducts",
                "Geography_Germany", "Geography_Spain"]
num_features = ["CreditScore", "Age", "Tenure", "Balance", "EstimatedSalary"]

# ============================================================
# OPTION B - Credit Card Customer Attrition (~10,000 rows)
# Uncomment the lines below if choosing Option B
# ============================================================
# url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/BankChurners.csv"
# df_raw = pd.read_csv(url)
# TARGET = "Attrition_Flag"
# DOMAIN = "Credit Card Services"
#
# # Preprocessing
# # Drop ID and the two Naive Bayes leakage columns
# leak_cols = [c for c in df_raw.columns if c.startswith("Naive_Bayes")]
# df = df_raw.drop(columns=["CLIENTNUM"] + leak_cols)
#
# # Encode target: Attrited Customer = 1, Existing Customer = 0
# df[TARGET] = df[TARGET].map({"Attrited Customer": 1, "Existing Customer": 0})
#
# # Encode categoricals
# df["Gender"] = df["Gender"].map({"M": 1, "F": 0})
# df = pd.get_dummies(df, columns=["Education_Level", "Marital_Status",
#                                    "Income_Category", "Card_Category"],
#                      drop_first=True, dtype=int)
#
# # Feature lists for EDA
# cat_features = ["Gender"] + [c for c in df.columns if any(
#     c.startswith(p) for p in ["Education_Level_", "Marital_Status_",
#                                "Income_Category_", "Card_Category_"])]
# num_features = ["Customer_Age", "Dependent_count", "Months_on_book",
#                 "Total_Relationship_Count", "Months_Inactive_12_mon",
#                 "Contacts_Count_12_mon", "Credit_Limit", "Total_Revolving_Bal",
#                 "Avg_Open_To_Buy", "Total_Amt_Chng_Q4_Q1", "Total_Trans_Amt",
#                 "Total_Trans_Ct", "Total_Ct_Chng_Q4_Q1", "Avg_Utilization_Ratio"]

# ============================================================
# Common pipeline (runs for whichever option you chose)
# ============================================================
X = df.drop(columns=[TARGET])
y = df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

feature_names = X_train.columns.tolist()
n_features = len(feature_names)

print(f"Dataset: {DOMAIN}")
print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns → {n_features} features")
print(f"Train: {X_train.shape[0]:,} | Test: {X_test.shape[0]:,}")
print(f"Churn rate: {y.mean():.1%}")
print("\n✅ Preprocessing complete: X_train_scaled, X_test_scaled, y_train, y_test ready")

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHAT THE PREPROCESSING DID</strong><br>
  <ul>
    <li>Dropped non-predictive ID columns</li>
    <li>Encoded the target as binary (1 = churned, 0 = stayed)</li>
    <li>Converted categorical features to dummy variables with <code>drop_first=True</code></li>
    <li>Scaled all features with <code>StandardScaler</code> (fit on train, transform on test)</li>
    <li><strong>Option B only:</strong> Removed two columns that contained pre-computed model outputs; using them would be <strong>data leakage</strong> (the model would "cheat" by seeing answers derived from the target)</li>
  </ul>
  <code>cat_features</code> and <code>num_features</code> lists are ready for your EDA.
</div>
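
The "fit on train, transform on test" step can be illustrated with a minimal sketch. The toy arrays and the `toy_` names below are assumptions made up for illustration; they are not part of either lab dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not from either lab dataset)
toy_train = np.array([[10.0], [20.0], [30.0], [40.0]])
toy_test = np.array([[25.0], [50.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean/std from train only
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train mean/std

print("Mean learned from train:", toy_scaler.mean_[0])
print("Scaled train mean:", toy_train_scaled.mean())
print("Scaled test values:", toy_test_scaled.ravel())
```

The scaled test values are generally not centered at zero, which is expected: the test set is measured against statistics learned from the training set, so no information from the test rows influences the scaling.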

---
# Part A - Exploratory Data Analysis (4 points)

### Task 1 - Data Inspection (1 pt)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Print the shape, <code>.info()</code>, churn rate, and first 5 rows. Describe the dataset in 2–3 sentences.
</div>

In [None]:
# Task 1: Data inspection
# YOUR CODE HERE


**Dataset description (2–3 sentences):**

*(Write here)*

### Task 2 - Cramér's V Analysis (1 pt)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Compute Cramér's V between each feature in <code>cat_features</code> and the target. Display as a sorted bar chart.
</div>
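
As a rough intuition for the scale of Cramér's V, the sketch below evaluates it on two made-up columns: one perfectly associated with a toy feature and one unrelated to it. The function `cramers_v_demo` repeats the formula from the setup cell so the sketch runs on its own; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v_demo(x, y):
    """Same formula as the setup cell's cramers_v(), repeated so this sketch runs alone."""
    ct = pd.crosstab(x, y)
    chi2 = chi2_contingency(ct)[0]
    n = ct.sum().sum()
    r, k = ct.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Toy feature (hypothetical): three plan types, 10 customers each
plan = pd.Series(["basic", "plus", "premium"] * 10)

# A column that copies the feature exactly: strongest possible association
copy_of_plan = plan.copy()

# A column spread evenly across every plan type: no association
unrelated = pd.Series((["x"] * 3 + ["y"] * 3) * 5)

v_perfect = cramers_v_demo(plan, copy_of_plan)
v_none = cramers_v_demo(plan, unrelated)

print(f"Perfect association: V = {v_perfect:.2f}")
print(f"No association:      V = {v_none:.2f}")
```

Real features will land somewhere between these two extremes, which is why a sorted chart is useful for ranking them.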

In [None]:
# Task 2: Cramér's V
# Hint: use the pre-built cramers_v() function and cat_features list
# YOUR CODE HERE


**Interpretation (2–3 sentences):** Which categorical features have the strongest association with churn?

*(Write here)*

### Task 3 - Mann-Whitney U + Cohen's d (1 pt)
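
The two statistics answer different questions: Mann-Whitney U tests whether the two groups differ at all, while Cohen's d measures how large the difference is. A minimal sketch on made-up ages follows; the values and variable names are hypothetical, and the pooled standard deviation repeats the formula from the setup cell's `cohens_d()`.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Toy groups (hypothetical ages, not lab data)
churned_age = pd.Series([48, 52, 45, 50, 55, 47])
stayed_age = pd.Series([35, 38, 40, 33, 37, 36])

# Mann-Whitney U: rank-based test, no normality assumption
u_stat, p_value = mannwhitneyu(churned_age, stayed_age, alternative="two-sided")

# Cohen's d using a pooled standard deviation
n1, n2 = len(churned_age), len(stayed_age)
pooled_sd = np.sqrt(
    ((n1 - 1) * churned_age.std() ** 2 + (n2 - 1) * stayed_age.std() ** 2)
    / (n1 + n2 - 2)
)
d = (churned_age.mean() - stayed_age.mean()) / pooled_sd

print(f"U = {u_stat:.1f}, p = {p_value:.4f}, Cohen's d = {d:.2f}")
```

With large samples, a tiny difference can still produce a small p-value, so reporting the effect size alongside the test helps separate statistically detectable differences from practically meaningful ones.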

In [None]:
# Task 3: Mann-Whitney U + Cohen's d
# Hint: use the pre-built cohens_d() function and num_features list
# Split data: churned = df[df[TARGET]==1], stayed = df[df[TARGET]==0]
# YOUR CODE HERE


**Interpretation (2–3 sentences):** Which numerical features show the largest effect sizes?

*(Write here)*

### Task 4 - Business Cost Estimate (1 pt)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Estimate the annual cost of churn. State your assumptions clearly in comments.<br>
  Use reasonable estimates for your domain (banking or credit card services).
</div>
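
One possible shape for such an estimate is sketched below. Every figure is a hypothetical placeholder chosen only to make the arithmetic visible; none comes from the lab datasets, and the cost model (lost revenue plus replacement cost) is itself an assumption you may want to change.

```python
# All figures are hypothetical placeholders; replace them with
# estimates you can justify for your chosen domain.
n_customers = 10_000            # assumed customer base
annual_churn_rate = 0.20        # assumed share of customers lost per year
revenue_per_customer = 500.0    # assumed annual revenue per customer (USD)
acquisition_cost = 200.0        # assumed cost to acquire one replacement (USD)

churned_customers = n_customers * annual_churn_rate
lost_revenue = churned_customers * revenue_per_customer
replacement_cost = churned_customers * acquisition_cost
annual_churn_cost = lost_revenue + replacement_cost

print(f"Churned customers per year: {churned_customers:,.0f}")
print(f"Lost revenue:               ${lost_revenue:,.0f}")
print(f"Replacement cost:           ${replacement_cost:,.0f}")
print(f"Estimated annual cost:      ${annual_churn_cost:,.0f}")
```

In your own cell, the churn rate can come from the data (`y.mean()`), while the dollar figures need a stated source or rationale.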

In [None]:
# Task 4: Business cost estimate
# State your assumptions in comments
# YOUR CODE HERE


---
# Part B - Logistic Regression (3 points)

### Task 5 - Build and Evaluate (1.5 pts)

In [None]:
# Task 5: Logistic regression
# Use X_train_scaled, X_test_scaled, y_train, y_test
# Store: lr_predictions, lr_probabilities
# Print classification report + AUC
# YOUR CODE HERE


### Task 6 - Coefficient Interpretation (1.5 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Display top 5 positive and top 5 negative coefficients. Explain the top 3 churn drivers in business terms.
</div>

In [None]:
# Task 6: Coefficient interpretation
# YOUR CODE HERE


**Interpretation (3–4 sentences):** What does the model say drives churn in this business? Would these findings surprise company leadership?

*(Write here)*

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT</strong><br>
  LR should show reasonable accuracy (70–85%) and an AUC above 0.70. If accuracy equals the majority class rate exactly, the model may be predicting all one class.
</div>
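
The majority-class check described in the checkpoint can be sketched as follows. The toy labels and the `_demo` names are hypothetical; in the lab you would substitute your own test labels and predictions.

```python
import numpy as np

# Toy labels and predictions (hypothetical): 8 stayed (0), 2 churned (1)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred_demo = np.zeros_like(y_true_demo)  # a model that always predicts "stayed"

accuracy = (y_pred_demo == y_true_demo).mean()
churn_rate = y_true_demo.mean()
majority_rate = max(churn_rate, 1 - churn_rate)
predicted_classes = np.unique(y_pred_demo)

print(f"Accuracy:            {accuracy:.1%}")
print(f"Majority class rate: {majority_rate:.1%}")
print(f"Classes predicted:   {predicted_classes.tolist()}")

if len(predicted_classes) == 1:
    print("Warning: the model predicts a single class for every row.")
```

An 80% accuracy looks respectable here, yet the model never identifies a single churner, which is why accuracy alone is not enough for imbalanced targets.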

---
# Part C - Neural Network (4 points)

### Task 7 - Build and Train a Keras ANN (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Build a Sequential model with at least 2 hidden layers, dropout, and early stopping. Train and capture history.
</div>

In [None]:
# Task 7: Build and train ANN
# YOUR CODE HERE


### Task 8 - Training Curves (1 pt)

In [None]:
# Task 8: Plot training vs validation loss and accuracy
# YOUR CODE HERE


**Interpretation (2–3 sentences):** What epoch did early stopping trigger? Is there evidence of overfitting?

*(Write here)*

### Task 9 - Evaluate the ANN (1 pt)

In [None]:
# Task 9: Generate predictions + probabilities, print classification report
# Store: ann_predictions, ann_probabilities
# YOUR CODE HERE


---
# Part D - Model Comparison (3 points)

### Task 10 - ROC + Metrics Table (3 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  <ol>
    <li>Plot ROC curves for both models on a single figure (LR = navy, ANN = coral)</li>
    <li>Build a comparison table: accuracy, precision, recall, F1, AUC for both</li>
    <li>Count customers flagged by ANN but missed by LR</li>
  </ol>
</div>
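
For the third item, one reasonable reading (an assumption, since other readings are possible) is to count actual churners that the ANN flagged and LR did not. The sketch below uses made-up arrays with `_demo` names; in the lab you would use your own stored predictions and test labels.

```python
import numpy as np

# Toy arrays (hypothetical); 1 = churned / flagged as churn
y_true_demo = np.array([1, 1, 1, 0, 0, 1, 0, 1])
lr_pred_demo = np.array([1, 0, 0, 0, 1, 1, 0, 0])
ann_pred_demo = np.array([1, 1, 0, 0, 1, 1, 0, 1])

is_churner = y_true_demo == 1

# Actual churners flagged by the ANN but missed by LR
ann_only = is_churner & (ann_pred_demo == 1) & (lr_pred_demo == 0)

# The reverse direction, for a fair comparison
lr_only = is_churner & (lr_pred_demo == 1) & (ann_pred_demo == 0)

print(f"Churners caught by ANN only: {ann_only.sum()}")
print(f"Churners caught by LR only:  {lr_only.sum()}")
```

Reporting both directions guards against overstating one model's advantage. If your predictions are stored as a 2-D array or a pandas Series, flatten or convert them to 1-D NumPy arrays first so the element-wise comparison lines up.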

In [None]:
# Task 10: ROC curves + comparison table + additional catches
# YOUR CODE HERE


---
# Part E - Written Analysis (4 points)

### Task 11 - Model Recommendation (minimum 300 words)

Write a recommendation addressed to the business leadership of your chosen domain. Address ALL five points:

1. Which model should they deploy for their retention campaign, and why?
2. What are the top 3 features driving churn, and what can the business do about each one?
3. How many high-risk customers did your models identify? What's the estimated value of retaining them?
4. What are the tradeoffs between the two models (accuracy vs interpretability)?
5. Is there a scenario where deploying both models makes sense?

*(Write here)*

---
# Bonus Challenge (+3 points)

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 OPTIONAL</strong><br>
  Train a <strong>third model</strong> with a meaningfully different architecture. Change at least TWO of: number of layers, neurons per layer, dropout rate, optimizer. Add it to your ROC plot and comparison table.
</div>

In [None]:
# Bonus: Third model variant
# YOUR CODE HERE


**Bonus interpretation (3–4 sentences):**

*(Write here if attempting bonus)*

---
## Troubleshooting

| Problem | Fix |
|---------|-----|
| ANN predicts all one class (accuracy = majority class rate) | Check architecture; it may need more neurons or a different learning rate |
| `ValueError: shapes not aligned` | Verify `input_shape=(n_features,)` matches your feature count |
| Option B accuracy is suspiciously high (>95%) | Check that Naive Bayes columns were dropped |
| ROC curve is a straight line | Using predictions (0/1) instead of probabilities |
| Training runs all 200 epochs | EarlyStopping not in callbacks list |

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Lab 3 - Churn Prediction: Full Pipeline | 20 Points (+3 Bonus)
</p>