<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/solutions/exercises/week04_group_exercise_session2_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 Group Exercise - SOLUTION KEY 🔑 - Churn: Logistic Regression vs Neural Network
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Points:** 10 | **Duration:** ~45 minutes | **Deliverable:** Completed notebook + 2–3 minute presentation

**Objective:** Build and compare a logistic regression model and a Keras neural network on the Telco churn dataset. Present your confusion matrices, ROC curves, and model recommendation.

### Group Members & Roles

| Role | Name | Responsibility |
|------|------|----------------|
| 🖥️ **Lead Coder** | | Drives the notebook |
| 📊 **Data Interpreter** | | Reads outputs, explains metrics |
| 🎤 **Presenter** | | Delivers the 2–3 minute share-out |
| ✅ **QA Reviewer** | | Checks outputs against checkpoints |

*If 3 members, Lead Coder also handles QA.*

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 GROUP DISCUSSION (before coding, 3 minutes)</strong><br>
  A telecom company has a $500/year budget per customer for retention efforts. They can only afford to target 200 customers this quarter, but the churn model flags 350 as high-risk.
  <ol>
    <li>What happens if they target the wrong 200?</li>
    <li>Would you rather have a model with high <strong>precision</strong> (fewer false alarms) or high <strong>recall</strong> (catches more churners)? Why?</li>
    <li>Is there a scenario where the "worse" model on paper is the better business choice?</li>
  </ol>
</div>

**Our group's answers (minimum 3 sentences):**

**Sample:** If they target the wrong 200, they waste $100K in retention budget (200 × $500) on loyal customers while 150 actual churners leave uncontacted. A retention team should prefer high recall: it is better to contact some loyal customers unnecessarily than to miss churners who represent lost lifetime revenue. A model with lower AUC but higher recall on the churned class could be the better business choice if the cost of a missed churner ($500+ lifetime value) far exceeds the cost of a wasted retention call ($50).
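
The arithmetic behind this answer can be sketched in a few lines. The budget, target count, and flagged count come from the discussion prompt, and the $500 and $50 error costs come from the sample answer. The false-negative and false-positive counts for the two models are hypothetical, chosen only to illustrate how a higher-recall model can come out cheaper.

```python
# Figures from the discussion prompt and the sample answer.
BUDGET_PER_CUSTOMER = 500   # retention budget per targeted customer ($/year)
TARGETED = 200              # customers the company can afford to contact
FLAGGED = 350               # customers the model flags as high-risk


def wasted_budget(targeted, true_churners_targeted, cost_per_customer):
    """Budget spent on targeted customers who were never going to churn."""
    return (targeted - true_churners_targeted) * cost_per_customer


def error_cost(false_negatives, false_positives, cost_fn=500, cost_fp=50):
    """Total cost of a model's mistakes under asymmetric error costs."""
    return false_negatives * cost_fn + false_positives * cost_fp


worst_case = wasted_budget(TARGETED, 0, BUDGET_PER_CUSTOMER)
untargeted = FLAGGED - TARGETED

# Hypothetical error counts for two models (illustrative only).
high_auc_model = error_cost(false_negatives=90, false_positives=40)
high_recall_model = error_cost(false_negatives=60, false_positives=120)

print(f"Worst-case wasted budget: ${worst_case:,}")
print(f"Flagged customers left uncontacted: {untargeted}")
print(f"Error cost, higher-AUC model:    ${high_auc_model:,}")
print(f"Error cost, higher-recall model: ${high_recall_model:,}")
```

With these made-up counts, the higher-recall model costs less overall even though it raises three times as many false alarms, because each missed churner is ten times as expensive as a wasted call.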

---

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run the setup cell below. It loads the dataset, runs the full preprocessing pipeline from the demo, and creates the train/test split. <strong>Do not modify.</strong>
</div>

In [None]:
# ============================================================
# Setup: run this cell. Do not modify.
# Full preprocessing pipeline from the demo.
# ============================================================
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, confusion_matrix,
                             ConfusionMatrixDisplay, roc_curve, roc_auc_score,
                             accuracy_score, precision_score, recall_score, f1_score)

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

np.random.seed(42)
tf.random.set_seed(42)

plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

# Load + preprocess
url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/WA_Fn-UseC_-Telco-Customer-Churn.csv"
df = pd.read_csv(url)
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"]).drop(columns=["customerID"])

replace_cols = ["OnlineSecurity", "OnlineBackup", "DeviceProtection",
                "TechSupport", "StreamingTV", "StreamingMovies", "MultipleLines"]
for col in replace_cols:
    df[col] = df[col].replace({"No internet service": "No", "No phone service": "No"})

binary_cols = ["Partner", "Dependents", "PhoneService", "PaperlessBilling", "Churn"]
for col in binary_cols:
    df[col] = df[col].map({"Yes": 1, "No": 0})
df["gender"] = df["gender"].map({"Male": 1, "Female": 0})
for col in replace_cols:
    df[col] = df[col].map({"Yes": 1, "No": 0})

df = pd.get_dummies(df, columns=["InternetService", "Contract", "PaymentMethod"],
                     drop_first=True, dtype=int)

X = df.drop(columns=["Churn"])
y = df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
continuous = ["tenure", "MonthlyCharges", "TotalCharges"]
X_train[continuous] = scaler.fit_transform(X_train[continuous])
X_test[continuous] = scaler.transform(X_test[continuous])

feature_names = X_train.columns.tolist()
n_features = len(feature_names)

print("✅ Preprocessing complete")
print(f"   Features: {n_features} | Train: {X_train.shape[0]:,} | Test: {X_test.shape[0]:,}")
print(f"   Churn rate | Train: {y_train.mean():.1%} | Test: {y_test.mean():.1%}")

# Note: Results may vary slightly across runs even with seeds set.
print(f"   TensorFlow: {tf.__version__}")

---
## Task 1 - Build the Logistic Regression Model (1 pt)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Build and train a <code>LogisticRegression</code> on the scaled training data. Use <code>max_iter=1000, random_state=42</code>.<br>
  Store predictions in <code>lr_predictions</code> and probabilities in <code>lr_probabilities</code>.
</div>

In [None]:
# Task 1: Logistic Regression
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)
lr_probabilities = lr_model.predict_proba(X_test)[:, 1]
print(f"LR Accuracy: {accuracy_score(y_test, lr_predictions):.4f}")

---
## Task 2 - Classification Report + Interpretation (1 pt)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Print the classification report using <code>target_names=['Stayed', 'Churned']</code>.
</div>

In [None]:
# Task 2: Classification report
print(classification_report(y_test, lr_predictions, target_names=["Stayed", "Churned"]))

**Interpretation (2–3 sentences):** What do precision and recall mean *specifically for the "Churned" class*? Which metric matters more for a retention team, and why?

**Sample:** Precision for Churned means 'of all customers we flagged as likely to churn, what percentage actually did?' It measures how trustworthy our alerts are. Recall for Churned means 'of all customers who actually churned, what percentage did we catch?' It measures how many churners slip through. For a retention team, recall matters more because a missed churner (false negative) costs the company a customer's lifetime value, while a false alarm only costs a retention call.
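
To make the two definitions concrete, here is a minimal sketch that computes precision and recall for the Churned class by hand. The labels and probabilities are hypothetical (ten made-up customers, not the Telco data), and the second threshold of 0.3 is an illustrative assumption showing how a lower cutoff trades precision for recall.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (Churned = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: 4 churners (1) and 6 stayers (0) with model probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.2, 0.7, 0.35, 0.3, 0.2, 0.1, 0.05]

results = {}
for threshold in (0.5, 0.3):
    y_pred = [1 if p > threshold else 0 for p in probs]
    results[threshold] = precision_recall(y_true, y_pred)
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, lowering the threshold catches one more churner at the price of one more false alarm, which is the trade a recall-focused retention team is usually willing to make.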

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT 1</strong><br>
  LR should show ≈80% accuracy and 50–55% recall on Churned. If recall is below 40% or above 70%, check preprocessing.
</div>

---
## Task 3 - Build the Keras Neural Network (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Build a Keras Sequential model:
  <ul>
    <li>Hidden layer 1: <code>n_features</code> neurons, ReLU</li>
    <li>Dropout: 0.3</li>
    <li>Hidden layer 2: 15 neurons, ReLU</li>
    <li>Dropout: 0.2</li>
    <li>Output: 1 neuron, sigmoid</li>
  </ul>
  Compile with Adam + binary crossentropy. Print <code>model.summary()</code>.
</div>
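
As a sanity check on `model.summary()`, the trainable parameter count of this architecture can be worked out by hand: each Dense layer has `inputs × units` weights plus `units` biases, and Dropout layers add none. The sketch below assumes 23 input features purely for illustration; the setup cell prints the actual value of `n_features`, which should be substituted.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units


def ann_param_counts(n_features, hidden2=15):
    """Per-layer parameter counts for the three Dense layers in this task."""
    layers = [
        (n_features, n_features),  # hidden layer 1: n_features neurons
        (n_features, hidden2),     # hidden layer 2: 15 neurons
        (hidden2, 1),              # output layer: 1 sigmoid neuron
    ]
    return [dense_params(n_in, n_out) for n_in, n_out in layers]


n_features_assumed = 23  # assumption for illustration; use the printed n_features
per_layer = ann_param_counts(n_features_assumed)
total_params = sum(per_layer)
print(f"Per-layer parameters: {per_layer}")
print(f"Total trainable parameters: {total_params}")
```

If the total reported by `model.summary()` differs from the hand count for your `n_features`, a layer is probably missing or has the wrong width.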

In [None]:
# Task 3: Build ANN
model = Sequential([
    Dense(n_features, activation="relu", input_shape=(n_features,)),
    Dropout(0.3),
    Dense(15, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

---
## Task 4 - Train with Early Stopping (1 pt)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Train with <code>epochs=200, batch_size=32, validation_split=0.2</code>.<br>
  Use <code>EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)</code>.
</div>

In [None]:
# Task 4: Train with early stopping
early_stop = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True, verbose=1)

history = model.fit(
    X_train, y_train,
    epochs=200, batch_size=32,
    validation_split=0.2,
    callbacks=[early_stop],
    verbose=0
)
print(f"Training stopped at epoch {len(history.history['loss'])}")

---
## Task 5 - Plot Training Curves (1 pt)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Side-by-side plot (1 row, 2 cols): training vs validation loss (left) and accuracy (right).
</div>

In [None]:
# Task 5: Training curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(history.history["loss"], label="Training Loss", color="steelblue")
axes[0].plot(history.history["val_loss"], label="Validation Loss", color="salmon")
axes[0].set_title("Loss Curves")
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Loss")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(history.history["accuracy"], label="Training Accuracy", color="steelblue")
axes[1].plot(history.history["val_accuracy"], label="Validation Accuracy", color="salmon")
axes[1].set_title("Accuracy Curves")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("Accuracy")
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

**Interpretation (2–3 sentences):** Is there evidence of overfitting? How can you tell from the curves?

**Sample:** The training and validation loss curves track closely, with only a small gap, which suggests dropout and early stopping are limiting overfitting. If the validation loss were rising while training loss continued falling, that would indicate overfitting. Early stopping triggered well before 200 epochs, meaning validation loss had stopped improving for 10 consecutive epochs and the best weights were restored.

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT 2</strong><br>
  Training should stop between epochs 30–60. Validation loss should track close to training loss. If it ran all 200 epochs, check EarlyStopping config.
</div>
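
The patience rule behind this checkpoint can be sketched in plain Python. This is a simplified model of what `EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)` does, not Keras's actual implementation: it ignores options such as `min_delta`, and the validation losses below are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without a new
    best validation loss; the best epoch is the one whose weights a
    restore-best-weights policy would keep.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch  # never triggered: ran every epoch


# Hypothetical validation losses; patience shortened to 3 for readability.
toy_val_losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=3)
print(f"Stopped at epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

Under this rule the stopping epoch is the best epoch plus the patience, so a run that stops at epoch 45 with `patience=10` found its best validation loss around epoch 35.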

---
## Task 6 - Evaluate the ANN (1 pt)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Generate predictions (threshold=0.5) and probabilities. Print classification report.<br>
  Store in <code>ann_predictions</code> and <code>ann_probabilities</code>.
</div>

In [None]:
# Task 6: Evaluate ANN
ann_probabilities = model.predict(X_test, verbose=0).ravel()
ann_predictions = (ann_probabilities > 0.5).astype(int)

print(f"ANN Accuracy: {accuracy_score(y_test, ann_predictions):.4f}")
print()
print(classification_report(y_test, ann_predictions, target_names=["Stayed", "Churned"]))

---
## Task 7 - ROC Curve Comparison (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Plot both ROC curves on a single figure. LR = navy <code>#0f3460</code>, ANN = coral <code>#e94560</code>. Show AUC in legend.
</div>

In [None]:
# Task 7: ROC curve comparison
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probabilities)
ann_fpr, ann_tpr, _ = roc_curve(y_test, ann_probabilities)

lr_auc = roc_auc_score(y_test, lr_probabilities)
ann_auc = roc_auc_score(y_test, ann_probabilities)

plt.figure(figsize=(8, 6))
plt.plot(lr_fpr, lr_tpr, color="#0f3460", linewidth=2, label=f"Logistic Regression (AUC={lr_auc:.3f})")
plt.plot(ann_fpr, ann_tpr, color="#e94560", linewidth=2, label=f"Neural Network (AUC={ann_auc:.3f})")
plt.plot([0, 1], [0, 1], "k--", alpha=0.5, label="Random Guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison: LR vs ANN")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT 3</strong><br>
  ANN should show slightly better recall and AUC than LR (1–5 percentage points). If dramatically better or worse, check architecture.
</div>
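
One way to read the AUC values in the ROC legend: AUC equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as half. The sketch below computes that directly by comparing every positive/negative pair. The labels and scores are hypothetical, and this brute-force approach is for intuition only, not a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (churner, non-churner) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical example: two stayers, two churners.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"Pairwise AUC: {toy_auc:.2f}")
```

Three of the four churner/stayer pairs are ordered correctly here, so the AUC is 0.75. Because only the ordering of scores matters, AUC does not depend on the 0.5 threshold used for the classification reports.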

---
## Task 8 - Model Recommendation (1 pt)

**Answer all of the following (minimum 4 sentences):**

1. Which model has better recall on churners?
2. Which model has better AUC?
3. Which model can explain *why* a customer is flagged?
4. If the company can only pick one model, which one and why?
5. Is there a scenario where deploying both makes sense?

**Sample:** The ANN shows slightly higher recall on churners (catching 2–5% more actual churners), and its AUC is marginally better, meaning it ranks customers by risk more effectively across all thresholds. However, logistic regression can explain *why* a customer is flagged (month-to-month contract, fiber optic service, low tenure), which the ANN cannot. If the company can only pick one, we recommend logistic regression for the initial deployment because the retention team needs to know *what to say* when they call a customer, not just *who to call*. A strong use case for both: use the ANN to generate the target list, then use LR coefficients to script the retention conversation for each customer segment.
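
The interpretability argument rests on the fact that each logistic regression coefficient converts to an odds ratio via `exp(coefficient)`. The sketch below shows that conversion. The coefficient values are hypothetical placeholders, not results from this notebook, and the feature names assume the dummy columns produced by the setup cell.

```python
import math

# Hypothetical coefficients (illustrative values only, not fitted results).
# Month-to-month is the dropped baseline for Contract, so longer contracts
# appear as negative coefficients relative to it.
hypothetical_coefs = {
    "tenure": -0.9,                      # per one standard deviation (scaled)
    "Contract_Two year": -1.4,
    "Contract_One year": -0.7,
    "InternetService_Fiber optic": 0.8,
}


def odds_ratios(coefs):
    """Convert log-odds coefficients to multiplicative odds ratios."""
    return {name: math.exp(beta) for name, beta in coefs.items()}


# Rank features by strength of effect, regardless of direction.
ranked = sorted(
    odds_ratios(hypothetical_coefs).items(),
    key=lambda item: abs(math.log(item[1])),
    reverse=True,
)
for name, ratio in ranked:
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name:30s} odds ratio {ratio:.2f} ({direction} churn odds)")
```

In the notebook, the real values would come from pairing the fitted model with the feature list, for example `dict(zip(feature_names, lr_model.coef_[0]))`. An odds ratio above 1 gives the retention team a concrete talking point; the ANN offers no equivalent per-feature number.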

---

## Troubleshooting

| Problem | Fix |
|---------|-----|
| ANN accuracy = 0.734 and doesn't change | Model is predicting all "Stayed"; check architecture and compilation |
| `ValueError: shapes not aligned` | Check that `input_shape=(n_features,)` matches your data |
| Training runs all 200 epochs | EarlyStopping not in `callbacks` list; check `model.fit(callbacks=[early_stop])` |
| ROC curve is just two straight segments meeting at one corner | You're passing hard predictions (0/1) instead of probabilities; use the output of `predict_proba` or `model.predict` |
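
The 0.734 figure in the first row is a majority-class baseline: a model that predicts "Stayed" for everyone scores an accuracy equal to the share of customers who stayed. A minimal sketch, assuming a hypothetical test set of 1,000 customers with a 26.6% churn rate (the rate implied by 0.734):

```python
def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most common class."""
    n = len(y_true)
    positives = sum(y_true)
    return max(positives, n - positives) / n


# Hypothetical labels: 266 churners (1) and 734 stayers (0).
y_hypothetical = [1] * 266 + [0] * 734
baseline = majority_baseline_accuracy(y_hypothetical)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

Any model worth deploying has to beat this number, and recall on the Churned class for the all-"Stayed" predictor is zero, which is why accuracy alone is a poor yardstick here.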

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Week 4 Group Exercise - Churn: LR vs ANN | 10 Points
</p>