<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/solutions/exercises/week04_group_exercise_session1_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 Group Exercise - Session 1 - SOLUTION KEY 🔑: Churn EDA + Logistic Regression
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Points:** 10 | **Duration:** ~45 minutes | **Deliverable:** Completed notebook + 2–3 minute presentation

**Objective:** Perform statistical EDA on a churn dataset, estimate the business cost of churn, build a logistic regression baseline, and interpret coefficients as actionable business insights.

**Chapter 4 skills practiced:**
- Cramér's V (categorical association)
- Mann-Whitney U + Cohen's d (continuous feature analysis)
- Business cost estimation
- Logistic regression + coefficient interpretation

### Group Members & Roles

| Role | Name | Responsibility |
|------|------|----------------|
| 🖥️ **Lead Coder** | | Drives the notebook |
| 📊 **Data Interpreter** | | Reads outputs, explains metrics |
| 🎤 **Presenter** | | Delivers the 2–3 minute share-out |
| ✅ **QA Reviewer** | | Checks outputs against checkpoints |

*If 3 members, Lead Coder also handles QA.*

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 GROUP DISCUSSION (before coding, 3 minutes)</strong><br>
  You're a data team at a telecom company. Before looking at any data, discuss:
  <ol>
    <li>What features do you <em>think</em> predict customer churn? List your top 3 guesses.</li>
    <li>How would you measure the cost of losing a customer? What numbers would you need?</li>
    <li>Why would a business prefer a model that can <em>explain</em> churn drivers over one that just predicts who will leave?</li>
  </ol>
</div>

**Our group's predictions and reasoning (minimum 3 sentences):**

**Sample:** We predicted that contract type, tenure, and monthly charges would be the top churn predictors. Customers with no contract commitment can leave at any time, shorter-tenure customers haven't built loyalty yet, and higher monthly charges create price sensitivity. For cost estimation, we'd need: average revenue per customer, average customer lifetime, and the cost of acquiring a replacement (marketing, onboarding, and the revenue gap while ramping a new customer).

---

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run the setup cell below. It loads the Telco dataset and creates helper functions for your EDA. <strong>Do not modify.</strong>
</div>

In [None]:
# ============================================================
# Setup: Run this cell. Do not modify.
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

from scipy.stats import chi2_contingency, mannwhitneyu, pointbiserialr
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, confusion_matrix,
                             accuracy_score, roc_auc_score)

np.random.seed(42)
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["figure.dpi"] = 100
sns.set_style("whitegrid")

# Helper functions
def cramers_v(x, y):
    """Cramér's V: association between two categorical variables (0–1)."""
    ct = pd.crosstab(x, y)
    chi2 = chi2_contingency(ct)[0]
    n = ct.sum().sum()
    r, k = ct.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

def cohens_d(group1, group2):
    """Cohen's d: effect size between two groups."""
    n1, n2 = len(group1), len(group2)
    pooled = np.sqrt(((n1-1)*group1.std()**2 + (n2-1)*group2.std()**2) / (n1+n2-2))
    return (group1.mean() - group2.mean()) / pooled if pooled > 0 else 0

# Load and clean
url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/WA_Fn-UseC_-Telco-Customer-Churn.csv"
df = pd.read_csv(url)
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"]).drop(columns=["customerID"])

# Feature lists for your EDA
cat_features = [c for c in df.select_dtypes(include="object").columns if c != "Churn"]
num_features = ["tenure", "MonthlyCharges", "TotalCharges"]

print(f"✅ Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"   Churn rate: {(df['Churn']=='Yes').mean():.1%}")
print(f"   Categorical features ({len(cat_features)}): {cat_features}")
print(f"   Numerical features ({len(num_features)}): {num_features}")
print(f"   Helper functions ready: cramers_v(), cohens_d()")

---
## Task 1: Cramér's V Analysis (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">‚úÖ DO THIS</strong><br>
  Compute Cramér's V between each feature in <code>cat_features</code> and <code>Churn</code>. Create a sorted horizontal bar chart.
</div>

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  In Week 3, we used Pearson correlation for numeric features. Churn is categorical (Yes/No), so we need Cramér's V instead. It works like correlation but for categories: 0 = no association, 1 = perfect association.
</div>
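
To make the 0-to-1 scale concrete, here is a minimal sketch that computes Cramér's V by hand from two made-up contingency tables. The function name `cramers_v_from_table` and all counts are illustrative assumptions, not part of the notebook. Unlike the `cramers_v()` helper in the setup cell, this version applies no continuity correction (SciPy's `chi2_contingency` applies Yates' correction to 2×2 tables by default), so the two can differ slightly on 2×2 tables.

```python
import math

def cramers_v_from_table(table):
    """Cramér's V from a contingency table given as a list of rows of counts.

    Hypothetical helper for illustration; no continuity correction is applied.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-square: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    r, k = len(table), len(table[0])
    return math.sqrt(chi2 / (n * (min(r, k) - 1)))

# Made-up counts: rows = feature category, columns = (stayed, churned)
independent = [[40, 10], [40, 10]]   # same churn rate in both categories
perfect = [[50, 0], [0, 50]]         # category fully determines churn

print(cramers_v_from_table(independent))  # 0.0
print(cramers_v_from_table(perfect))      # 1.0
```

Real features fall between these two extremes, which is why the chart in this task marks 0.1 and 0.3 as rough reference lines.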

In [None]:
# Task 1: Cramér's V for all categorical features vs Churn
cv_results = pd.DataFrame({
    "Feature": cat_features,
    "Cramér's V": [cramers_v(df[col], df["Churn"]) for col in cat_features]
}).sort_values("Cramér's V", ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(cv_results["Feature"], cv_results["Cramér's V"], color="steelblue")
plt.xlabel("Cramér's V (0 = no association, 1 = perfect)")
plt.title("Categorical Features vs Churn: Cramér's V")
plt.axvline(x=0.1, color="orange", linestyle="--", alpha=0.7, label="Weak (0.1)")
plt.axvline(x=0.3, color="red", linestyle="--", alpha=0.7, label="Moderate (0.3)")
plt.legend()
plt.tight_layout()
plt.show()

print(cv_results.to_string(index=False))

**Interpretation (2–3 sentences):** Which 2 features have the strongest association with churn? Were they in your pre-coding predictions?

**Sample:** Contract and InternetService have the strongest Cramér's V values, both above 0.2. Contract was in our predictions: month-to-month customers can leave anytime. InternetService being high was slightly surprising; it turns out fiber optic customers churn more than DSL, possibly due to higher prices or service quality expectations.

---
## Task 2: Mann-Whitney U + Cohen's d (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  For each feature in <code>num_features</code>, run Mann-Whitney U and compute Cohen's d between churners and non-churners. Display as a table.
</div>
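
As a rough intuition for what the Mann-Whitney U statistic measures, the sketch below counts, over every (churner, stayer) pair, how often the churner's value is larger, with ties counting one half. The function `mann_whitney_u` and the tenure values are made up for illustration; the notebook itself uses `scipy.stats.mannwhitneyu`, which also supplies the p-value that this sketch does not compute.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: number of (x_i, y_j) pairs with x_i > y_j.

    Ties count as 0.5. Hypothetical helper for illustration only.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Made-up tenure values in months
churner_tenure = [1, 2, 5, 8]
stayer_tenure = [6, 24, 40, 60, 72]

u_churn = mann_whitney_u(churner_tenure, stayer_tenure)
u_stay = mann_whitney_u(stayer_tenure, churner_tenure)

print(u_churn, u_stay)  # 1.0 19.0
# The two U values always sum to the number of pairs (4 * 5 = 20 here)
print(u_churn + u_stay == len(churner_tenure) * len(stayer_tenure))  # True
```

A U value far from half the number of pairs (10 here) suggests that one group tends to have larger values than the other.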

In [None]:
# Task 2: Mann-Whitney U + Cohen's d
churn_yes = df[df["Churn"] == "Yes"]
churn_no = df[df["Churn"] == "No"]

mw_results = []
for col in num_features:
    u_stat, p_val = mannwhitneyu(churn_yes[col], churn_no[col], alternative="two-sided")
    d = cohens_d(churn_yes[col], churn_no[col])
    mw_results.append({
        "Feature": col,
        "U Statistic": f"{u_stat:,.0f}",
        "p-value": f"{p_val:.2e}",
        "Cohen's d": f"{d:.3f}",
        "Effect": "Large" if abs(d) > 0.8 else "Medium" if abs(d) > 0.5 else "Small"
    })

mw_df = pd.DataFrame(mw_results)
print("Mann-Whitney U + Cohen's d (Churned vs Stayed):")
print(mw_df.to_string(index=False))

**Interpretation (2–3 sentences):** Which numerical feature has the largest effect size? What does the sign of Cohen's d tell you about churners vs stayers?

**Sample:** Tenure shows the largest Cohen's d with a large negative effect: churners have significantly shorter tenure than stayers. The negative sign means the churner group has a lower mean, confirming that newer customers are the most at risk. MonthlyCharges shows a positive effect, meaning churners tend to pay more per month, which could indicate price sensitivity among higher-tier subscribers.
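
The sign convention can be checked with a small standard-library sketch. It mirrors the pooled-standard-deviation formula used by the notebook's `cohens_d()` helper, but the function name `cohens_d_sketch` and the tenure values are invented for illustration. The sign depends only on argument order, since the formula is (mean of first group − mean of second group) / pooled SD.

```python
import math
import statistics

def cohens_d_sketch(group1, group2):
    """Cohen's d with a pooled sample standard deviation (illustrative)."""
    n1, n2 = len(group1), len(group2)
    v1 = statistics.variance(group1)  # sample variance (n - 1 denominator)
    v2 = statistics.variance(group2)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled

# Made-up tenure values in months
churner_tenure = [2, 4, 6, 8, 10]
stayer_tenure = [30, 32, 34, 36, 38]

d = cohens_d_sketch(churner_tenure, stayer_tenure)
d_flipped = cohens_d_sketch(stayer_tenure, churner_tenure)

print(d < 0)            # True: churners (first argument) have the lower mean
print(d == -d_flipped)  # True: swapping the groups only flips the sign
```

Because churners are passed first in Task 2, a negative d there means churners have the lower mean for that feature.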

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT 1</strong><br>
  <ul>
    <li><strong>Cramér's V:</strong> Contract and InternetService should be the top 2 (both above 0.2)</li>
    <li><strong>Cohen's d:</strong> tenure should show a large negative effect (churners have shorter tenure)</li>
    <li>If Gender is in your top 3 for either test, double-check ‚Äî gender is typically negligible here</li>
  </ul>
</div>

---
## Task 3: Business Cost of Churn (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Calculate the estimated annual cost of churn. Use these data points:
  <ul>
    <li><code>df["MonthlyCharges"].mean()</code>: average monthly revenue per customer</li>
    <li><code>(df["Churn"]=="Yes").sum()</code>: number of churned customers</li>
    <li>$300 acquisition cost per replacement customer (industry benchmark)</li>
  </ul>
  Compute: (1) annual revenue at risk, (2) replacement cost, (3) total impact.
</div>

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  Models don't convince leadership; dollars do. Before building any model, we establish what churn <em>costs</em>. This number frames every decision from here: "Our model missed 150 churners × $X per customer = $Y in preventable losses."
</div>
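
The arithmetic behind the three requested numbers can be sketched with round, hypothetical inputs. Only the $300 acquisition cost comes from the task description; the 1,000 churned customers and $65 average monthly charge below are placeholders, not values from the Telco dataset, and `churn_cost` is an illustrative name.

```python
def churn_cost(churned_count, avg_monthly_charge, acquisition_cost):
    """Return (annual revenue at risk, replacement cost, total impact)."""
    annual_revenue_at_risk = churned_count * avg_monthly_charge * 12
    replacement_cost = churned_count * acquisition_cost
    total_impact = annual_revenue_at_risk + replacement_cost
    return annual_revenue_at_risk, replacement_cost, total_impact

# Hypothetical inputs: 1,000 churners paying $65/month on average
revenue, replacement, total = churn_cost(1000, 65.0, 300)

print(f"Annual revenue at risk: ${revenue:,.0f}")      # $780,000
print(f"Replacement cost:       ${replacement:,.0f}")  # $300,000
print(f"Total annual impact:    ${total:,.0f}")        # $1,080,000
```

This treats every churned customer as a full year of lost revenue, which is a simplifying assumption; customers who leave mid-year would cost less in the first year.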

In [None]:
# Task 3: Business cost of churn
avg_monthly = df["MonthlyCharges"].mean()
churned_count = (df["Churn"] == "Yes").sum()
acquisition_cost = 300

annual_revenue_at_risk = churned_count * avg_monthly * 12
replacement_cost = churned_count * acquisition_cost
total_impact = annual_revenue_at_risk + replacement_cost

print(f"{'='*50}")
print(f"  BUSINESS COST OF CHURN")
print(f"{'='*50}")
print(f"  Churned customers:      {churned_count:,}")
print(f"  Avg monthly charge:     ${avg_monthly:,.2f}")
print(f"  Acquisition cost each:  ${acquisition_cost}")
print(f"{'='*50}")
print(f"  Annual revenue at risk: ${annual_revenue_at_risk:,.0f}")
print(f"  Replacement cost:       ${replacement_cost:,.0f}")
print(f"  TOTAL ANNUAL IMPACT:    ${total_impact:,.0f}")
print(f"{'='*50}")

---
## Task 4: Logistic Regression Baseline (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  The preprocessing cell below is pre-filled. Run it, then build a <code>LogisticRegression</code> and print the classification report.
</div>

In [None]:
# Pre-filled preprocessing: Run this cell. Do not modify.
df_model = df.copy()

replace_cols = ["OnlineSecurity", "OnlineBackup", "DeviceProtection",
                "TechSupport", "StreamingTV", "StreamingMovies", "MultipleLines"]
for col in replace_cols:
    df_model[col] = df_model[col].replace({"No internet service": "No", "No phone service": "No"})

binary_cols = ["Partner", "Dependents", "PhoneService", "PaperlessBilling", "Churn"]
for col in binary_cols:
    df_model[col] = df_model[col].map({"Yes": 1, "No": 0})
df_model["gender"] = df_model["gender"].map({"Male": 1, "Female": 0})
for col in replace_cols:
    df_model[col] = df_model[col].map({"Yes": 1, "No": 0})

df_model = pd.get_dummies(df_model, columns=["InternetService", "Contract", "PaymentMethod"],
                           drop_first=True, dtype=int)

X = df_model.drop(columns=["Churn"])
y = df_model["Churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
continuous = ["tenure", "MonthlyCharges", "TotalCharges"]
X_train[continuous] = scaler.fit_transform(X_train[continuous])
X_test[continuous] = scaler.transform(X_test[continuous])

feature_names = X_train.columns.tolist()
print(f"✅ Preprocessing complete: {len(feature_names)} features, {X_train.shape[0]:,} train / {X_test.shape[0]:,} test")

In [None]:
# Task 4: Build logistic regression
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)

predictions = lr_model.predict(X_test)
probabilities = lr_model.predict_proba(X_test)[:, 1]

print("Logistic Regression - Classification Report:")
print(classification_report(y_test, predictions, target_names=["Stayed", "Churned"]))
print(f"Accuracy: {accuracy_score(y_test, predictions):.4f}")
print(f"AUC:      {roc_auc_score(y_test, probabilities):.4f}")

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 CHECKPOINT 2</strong><br>
  <ul>
    <li>Accuracy ≈ 80%</li>
    <li>Recall on "Churned" ≈ 50–55%: the model catches about half of actual churners</li>
    <li>AUC ≈ 0.84</li>
  </ul>
</div>
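
An accuracy near 80% alongside a churn recall near 50% can look contradictory. The sketch below shows how both arise from the same confusion-matrix counts when most customers stay. The counts are invented to land near the checkpoint values and are not the notebook's actual output; `summarize` is an illustrative name.

```python
def summarize(tn, fp, fn, tp):
    """Accuracy, recall, and precision for the positive (churned) class."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total   # share of all predictions that are right
    recall = tp / (tp + fn)        # share of actual churners that were caught
    precision = tp / (tp + fp)     # share of churn alerts that were correct
    return accuracy, recall, precision

# Hypothetical test set: 1,030 stayers and 370 churners
accuracy, recall, precision = summarize(tn=930, fp=100, fn=180, tp=190)

print(f"Accuracy:  {accuracy:.1%}")   # 80.0%
print(f"Recall:    {recall:.1%}")     # 51.4%
print(f"Precision: {precision:.1%}")  # 65.5%
```

Because stayers dominate the test set, correctly labeling them keeps accuracy high even though nearly half of the churners are missed.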

---
## Task 5: Coefficient Interpretation + Confusion Matrix (2 pts)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  <ol>
    <li>Plot the top 5 positive and top 5 negative coefficients as a horizontal bar chart</li>
    <li>Generate and display the confusion matrix</li>
    <li>Calculate the dollar cost of the False Negatives (churners the model missed)</li>
  </ol>
</div>
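
For step 3, it helps to see where the false-negative count comes from. With binary labels 0/1, scikit-learn's `confusion_matrix` lays the result out as `[[tn, fp], [fn, tp]]`, which is why `cm.ravel()` unpacks in the order `tn, fp, fn, tp`. The sketch below rebuilds those four counts by hand on eight made-up customers and prices the misses with a hypothetical $65 average monthly charge.

```python
# Made-up labels: 1 = churned, 0 = stayed
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 0, 0, 0, 1, 0, 1]

pairs = list(zip(actual, predicted))
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # churners the model missed
tp = sum(1 for a, p in pairs if a == 1 and p == 1)

cm = [[tn, fp], [fn, tp]]  # same layout as sklearn's binary confusion matrix
print(cm)  # [[3, 1], [2, 2]]

avg_monthly_charge = 65.0                  # hypothetical, not from the dataset
cost_per_missed = avg_monthly_charge * 12  # one year of lost revenue
print(f"Cost of missed churners: ${fn * cost_per_missed:,.0f}")  # $1,560
```

The same pattern scales to the real test set: multiply the false-negative cell by the annual revenue per customer.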

In [None]:
# Task 5a: Coefficient bar chart
coef_df = pd.DataFrame({
    "Feature": feature_names,
    "Coefficient": lr_model.coef_[0]
}).sort_values("Coefficient", ascending=False)

display_df = pd.concat([coef_df.head(5), coef_df.tail(5)])
colors = ["salmon" if c > 0 else "steelblue" for c in display_df["Coefficient"]]

plt.figure(figsize=(10, 7))
plt.barh(display_df["Feature"], display_df["Coefficient"], color=colors)
plt.xlabel("Coefficient (positive = increases churn)")
plt.title("Top 5 Positive & Negative Churn Drivers")
plt.axvline(x=0, color="black", linewidth=0.5)
plt.tight_layout()
plt.show()

In [None]:
# Task 5b: Confusion matrix + dollar cost of false negatives
cm = confusion_matrix(y_test, predictions)
tn, fp, fn, tp = cm.ravel()

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Stayed", "Churned"],
            yticklabels=["Stayed", "Churned"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix: Logistic Regression")
plt.tight_layout()
plt.show()

cost_per_missed = df["MonthlyCharges"].mean() * 12
print(f"False Negatives (missed churners): {fn}")
print(f"Cost per missed churner:           ${cost_per_missed:,.0f}/year")
print(f"Total cost of missed churners:     ${fn * cost_per_missed:,.0f}")
print(f"\nTrue Positives (caught): {tp} | False Positives (false alarms): {fp}")

**Interpretation (minimum 4 sentences):** Explain the top 3 churn drivers to a non-technical executive. For each one, suggest a specific business action the company could take.

**Sample:** The biggest churn driver is month-to-month contracts: customers with no commitment are 3x more likely to leave. Recommendation: offer incentive pricing for customers who switch to annual contracts (even a 10% discount pays for itself if it reduces churn). Second, fiber optic customers churn more than DSL customers despite paying more, which may indicate service quality issues or unmet expectations. Recommendation: audit customer satisfaction scores for fiber customers and prioritize service reliability. Third, electronic check payers churn more than auto-pay customers, since there's less friction to cancel when you're not on automatic billing. Recommendation: offer a small monthly discount ($3–5) for customers who enroll in autopay; the reduced churn will far exceed the discount cost.
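
When translating coefficients for an executive, it can help to convert them to odds ratios. A logistic regression coefficient is a change in log-odds, so `exp(coefficient)` is the factor by which the odds of churn are multiplied. The coefficient values below are invented for illustration and are not the fitted model's output. Two caveats apply to this notebook's preprocessing: dummy-variable coefficients are relative to the category dropped by `drop_first=True`, and the coefficients for `tenure`, `MonthlyCharges`, and `TotalCharges` are per standard deviation because those columns were standardized.

```python
import math

# Hypothetical coefficients; substitute values from lr_model.coef_[0]
hypothetical_coefs = {
    "Contract_Two year": -1.4,
    "InternetService_Fiber optic": 0.7,
    "tenure": -0.8,
}

# exp(coefficient) = multiplicative change in the odds of churn
odds_ratios = {name: math.exp(coef) for name, coef in hypothetical_coefs.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds of churn)")
```

An odds ratio of about 2 means the odds of churn roughly double, holding the other features fixed. This is a statement about odds rather than probabilities, so it should not be read as "twice as likely" without checking the underlying churn rates.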

---

## Share-Out (2–3 minutes)

Present to the class:
1. **Top 2 churn drivers** from Cramér's V and Cohen's d: did they match your predictions?
2. **Business cost**: what does churn cost this company annually?
3. **Model performance** ‚Äî how many churners does the model catch? How many slip through?
4. **One actionable recommendation** based on the coefficients

## Troubleshooting

| Problem | Fix |
|---------|-----|
| `cramers_v()` gives error | Make sure you're passing two Series, not strings: `cramers_v(df["Contract"], df["Churn"])` |
| Cohen's d is 0 for all features | Check your group split: `churn_yes = df[df["Churn"]=="Yes"]` (note the string "Yes", not the integer 1) |
| Classification report shows all zeros for one class | Model may be predicting all one class; check `max_iter=1000` |
| Confusion matrix shape error | Make sure you ran the preprocessing cell first |

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Week 4 Group Exercise - Session 1: Churn EDA + Logistic Regression | 10 Points
</p>