# Midterm - Business-Case Predictive Strategy Practicum + Project Baseline Submission

<hr>

<center>
<div>
<img src="https://raw.githubusercontent.com/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/main/notebooks/figures/mgmt_474_ai_logo_02-modified.png" width="200"/>
</div>
</center>

# <center><a class="tocSkip"></center>
# <center>MGMT47400 Predictive Analytics</center>
# <center>Professor: Davi Moreira </center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davi-moreira/2026Summer_predictive_analytics_purdue_MGMT474/blob/main/notebooks/10_midterm_casebook_student.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Translate business cases into predictive tasks (target, unit, horizon, KPI)
2. Select split strategy and metrics aligned to case and cost structure
3. Identify leakage risks and data availability constraints
4. Propose a modeling shortlist and an evaluation plan
5. Deliver a baseline model + evaluation plan for the course project

---

> **📋 Participation Reminder:** This notebook contains **2 PAUSE-AND-DO exercises**. You are expected to complete all exercises before submitting your notebook.

---

## Midterm Instructions

### Academic Integrity

**Allowed resources:**
- Course notebooks (Days 1-9)
- scikit-learn documentation
- Course textbooks (ISLP, ESL, Provost & Fawcett)
- Gemini for code generation (with explain + verify workflow)

**NOT allowed:**
- Communication with other students during the exam
- Pre-written code from external sources without attribution
- Asking Gemini for complete answers without your own analysis

**Gemini usage boundaries:**
- ✓ Ask Gemini to generate code scaffolds
- ✓ Ask Gemini to explain concepts
- ✓ Use Gemini to debug errors
- ✗ Ask Gemini to solve entire case problems
- ✗ Copy Gemini's analysis without your own critical thinking

---

```python
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.precision', 4)
RANDOM_SEED = 474
np.random.seed(RANDOM_SEED)
print("✓ Setup complete!")
```

**Reading the output:**

The setup cell imports the core libraries you may need during the midterm: `pandas` and `numpy` for data manipulation, `matplotlib` and `seaborn` for visualization, and `train_test_split`, `StratifiedKFold`, `Pipeline`, `StandardScaler`, `LogisticRegression`, `classification_report`, and `roc_auc_score` from scikit-learn. The random seed is set to **RANDOM_SEED = 474** for reproducibility.

The **"Setup complete!"** confirmation means the environment is ready. You will not need to install any additional packages -- all libraries used in the previous notebooks are available here.

**Why this matters:** The midterm is a *design* exercise, not a coding exercise. The setup cell is provided so that if you choose to write code to support your analysis (for example, computing expected costs or simulating a threshold sweep), you have the tools at hand. However, most of your work will be written analysis in the markdown cells below.
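If you do decide to back a written answer with a quick calculation, a threshold sweep could look like the minimal sketch below. Everything in it is an assumption for illustration: `total_cost` is a hypothetical helper, the labels and scores are synthetic, and the costs (8.0 per false negative, 1.0 per false positive) are placeholders rather than values from any case in this notebook.

```python
import numpy as np

def total_cost(y_true, y_prob, threshold, cost_fn, cost_fp):
    """Total misclassification cost when flagging scores >= threshold (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    n_fn = np.sum((y_true == 1) & (y_pred == 0))  # positives we missed
    n_fp = np.sum((y_true == 0) & (y_pred == 1))  # negatives we flagged
    return cost_fn * n_fn + cost_fp * n_fp

# Synthetic labels and scores (assumption: positives tend to score higher)
rng = np.random.default_rng(474)
y_true = rng.binomial(1, 0.10, size=2000)
y_prob = np.clip(rng.normal(0.3 + 0.3 * y_true, 0.15), 0, 1)

# Sweep a grid of thresholds and keep the one with the lowest total cost
thresholds = np.linspace(0.05, 0.95, 19)
costs = [total_cost(y_true, y_prob, t, cost_fn=8.0, cost_fp=1.0) for t in thresholds]
best_threshold = thresholds[int(np.argmin(costs))]
print(f"Lowest-cost threshold on this synthetic data: {best_threshold:.2f}")
```

In a real analysis the sweep would be run on validation-set predictions, never on the test set.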

---

## Case 1: Customer Churn Prediction for Subscription Service

### Business Context

**Company:** StreamFlix, a video streaming service

**Problem:** Monthly churn rate is 5%. Acquiring a new customer costs \$50, while a retention offer (a discount) costs \$10.

**Data available:**
- Customer demographics (age, location, device type)
- Usage patterns (watch hours, genres, last login)
- Account history (tenure, payment method, support tickets)
- Churn status (binary: churned in next month or not)

**Business goal:** Identify customers likely to churn in the next 30 days to target with retention offers.

### Your Task (Case 1)

Design a predictive modeling plan that addresses:

1. **Prediction Target**: What exactly are we predicting?
2. **Prediction Unit**: What is one row in the dataset?
3. **Prediction Horizon**: How far ahead are we predicting?
4. **Primary Metric**: Which metric aligns with business cost structure?
5. **Split Strategy**: Train/val/test? Time-based? Why?
6. **Leakage Risks**: List 3 potential leakage sources
7. **Model Shortlist**: Which 2-3 models would you try first?
8. **Threshold Selection**: How would you choose the decision threshold?

---

## 📝 PAUSE-AND-DO: Case 1 Response (5 minutes)

Write your structured response below.

---

### CASE 1: YOUR RESPONSE

#### 1. Prediction Target
[What exactly are we predicting? Be specific.]

**Your answer:**

---

#### 2. Prediction Unit
[What is one row? Customer-month? Customer? Subscription?]

**Your answer:**

---

#### 3. Prediction Horizon
[How far ahead? Why this timeframe?]

**Your answer:**

---

#### 4. Primary Metric
[Which metric and why? Consider business costs.]

**Your answer:**

**Justification based on costs:**
- Cost of False Negative (missing a churner): \$\_\_\_\_
- Cost of False Positive (unnecessary retention offer): \$\_\_\_\_
- Therefore, metric should be: \_\_\_\_\_

---

#### 5. Split Strategy
[Random? Time-based? Stratified? Why?]

**Your answer:**

---

#### 6. Leakage Risks
[List 3 specific features or patterns that could cause leakage]

**Risk 1:**

**Risk 2:**

**Risk 3:**

---

#### 7. Model Shortlist
[Which 2-3 models would you try? Why?]

**Model 1:**

**Model 2:**

**Model 3 (optional):**

---

#### 8. Threshold Selection
[How would you choose the threshold? What factors matter?]

**Your answer:**

---

## Case 2: Loan Default Prediction

### Business Context

**Company:** FinTech lending platform

**Problem:** Approve loans for creditworthy applicants, reject risky ones.

**Data available:**
- Applicant demographics (age, income, employment)
- Credit history (credit score, previous loans, payment history)
- Loan details (amount, purpose, term)
- Default status (binary: defaulted within 12 months or not)

**Business constraints:**
- Average loan: \$10,000
- Average profit per successful loan: \$500
- Average loss per default: \$7,000
- Must maintain default rate < 3% for regulatory compliance

### Your Task (Case 2)

Design an evaluation plan:

1. **Primary Metric**: What metric aligns with profit/loss?
2. **Constraint Handling**: How to enforce default rate < 3%?
3. **Class Imbalance**: If defaults are rare (1-2%), how do you handle it?
4. **Threshold Logic**: Write the decision rule for loan approval
5. **Monitoring**: What would you track in production?

---

## 📝 PAUSE-AND-DO: Case 2 Response (5 minutes)

---

### CASE 2: YOUR RESPONSE

#### 1. Primary Metric
[What metric captures profit/loss tradeoff?]

**Your answer:**

**Expected value calculation:**
- E[value | approve good applicant] = \$\_\_\_\_
- E[value | approve bad applicant] = \$\_\_\_\_
- E[value | reject good applicant] = \$\_\_\_\_
- E[value | reject bad applicant] = \$\_\_\_\_

---

#### 2. Constraint Handling
[How to enforce default rate < 3%?]

**Your answer:**

---

#### 3. Class Imbalance Handling
[Defaults are 1-2%. What strategies would you use?]

**Strategy 1:**

**Strategy 2:**

**Strategy 3:**

---

#### 4. Decision Rule
[Write the logic for loan approval]

**Your answer:**

```
if probability_default < threshold:
    approve loan
else:
    reject loan

threshold = ???? (explain how you'd set this)
```

---

#### 5. Production Monitoring
[What metrics would you track over time?]

**Metric 1:**

**Metric 2:**

**Metric 3:**

**Metric 4:**

---

## Mini-Case 3: Medical Diagnosis (Optional Bonus)

### Quick Scenario

**Task:** Predict rare disease (0.1% prevalence) from lab tests

**Constraints:**
- False negative (missing disease) = life-threatening
- False positive = unnecessary expensive test ($5,000)
- Dataset: 100,000 patients, 100 with disease

**Questions:**
1. Which metric: Precision, Recall, F1, ROC-AUC, or PR-AUC? Why?
2. What threshold bias (high/low)? Why?
3. How to handle 100:99,900 class imbalance?

---

### MINI-CASE 3: YOUR RESPONSE (OPTIONAL)

**Metric choice:**

**Threshold bias:**

**Imbalance handling:**

---

## Project Milestone 2: Baseline Model + Evaluation Plan

### Deliverable Requirements

Your project baseline submission must include:

#### 1. Baseline Pipeline (Code)
- Train/val/test splits
- Simple preprocessing
- At least one baseline model
- Evaluation on validation set

#### 2. Baseline Report (Table)
Comparison table with:
- Baseline model
- At least one improved model
- Multiple metrics
- Clear winner identified

#### 3. Evaluation Plan (Documentation)
- Primary metric + justification
- Split/CV design
- Leakage prevention measures
- Next steps identified
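As a rough illustration of how the three deliverables fit together, the sketch below builds a 60/20/20 split, fits a trivial baseline and one improved model, and collects validation metrics in a comparison table. The data is synthetic (`make_classification`), and the model names, split proportions, and metric choices are assumptions for the example; your project should use its own dataset and the metric you justify in your evaluation plan.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_SEED = 474

# Synthetic stand-in for a project dataset (assumption: ~10% positive class)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=RANDOM_SEED)

# 60/20/20 train/val/test: hold out the test set first, then split the remainder
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=RANDOM_SEED)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval,
    random_state=RANDOM_SEED)

models = {
    # Baseline: always predicts the class prior
    "baseline_prior": DummyClassifier(strategy="prior"),
    # Improved model: scaling and classifier live in one pipeline
    "logistic_regression": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)                   # fit on train only
    val_prob = model.predict_proba(X_val)[:, 1]   # evaluate on validation only
    rows.append({
        "model": name,
        "val_roc_auc": roc_auc_score(y_val, val_prob),
        "val_pr_auc": average_precision_score(y_val, val_prob),
    })

report = pd.DataFrame(rows).set_index("model")
print(report)
```

The test split is created but never scored here; it stays locked until the final model is chosen.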

---

## Grading Rubric

### Midterm Cases (70 points)

**Case 1 (35 points)**
- Problem framing (10 pts): Target, unit, horizon clearly defined
- Metric selection (10 pts): Justified by business costs
- Leakage awareness (5 pts): Specific, realistic risks identified
- Modeling plan (10 pts): Reasonable shortlist + threshold strategy

**Case 2 (35 points)**
- Expected value logic (10 pts): Correct cost calculations
- Constraint handling (10 pts): Concrete approach to 3% limit
- Imbalance handling (10 pts): Multiple actionable strategies
- Monitoring plan (5 pts): Relevant production metrics

### Project Baseline (30 points)
- Code quality (10 pts): Runs without errors, proper pipeline
- Evaluation rigor (10 pts): Appropriate metrics, no leakage
- Documentation (10 pts): Clear plan, justified choices

---

## Common Mistakes to Avoid

### Case Analysis Mistakes
- ✗ Generic answers ("use cross-validation")
- ✓ Specific answers ("use time-based split because...")

- ✗ Ignoring business costs
- ✓ Justify metrics with cost structure

- ✗ Vague leakage risks ("data leakage might happen")
- ✓ Specific risks ("including 'account_status_after_churn' would leak")

### Code Mistakes
- ✗ Fitting on full dataset before split
- ✓ Split first, then fit only on train

- ✗ Looking at test set during development
- ✓ Lock test set, use only validation

- ✗ No baseline for comparison
- ✓ Always include simple baseline
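One way to avoid the first two code mistakes at once is to hold out the test set before anything is fitted and to keep preprocessing inside a `Pipeline`, so that cross-validation refits the scaler on each training fold only. The sketch below assumes synthetic data and a logistic regression; it is one possible pattern, not the only acceptable one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_SEED = 474

# Synthetic data for illustration only
X, y = make_classification(n_samples=1000, n_features=8, random_state=RANDOM_SEED)

# Lock the test set away before any fitting or exploration
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_SEED)

# The scaler is part of the pipeline, so it is re-fit inside every CV training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)
cv_auc = cross_val_score(pipe, X_dev, y_dev, cv=cv, scoring="roc_auc")
print(f"CV ROC-AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
```

Calling `StandardScaler().fit(X)` on the full dataset before splitting would let test-set statistics influence training, which is exactly the leakage the list above warns about.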

---

## How to Earn Full Credit

### Excellent Response Characteristics

1. **Specificity**: Answers are tailored to the specific case
2. **Justification**: Every choice is explained with reasoning
3. **Business alignment**: Technical choices driven by business needs
4. **Completeness**: All parts of each question addressed
5. **Realistic**: Acknowledges tradeoffs and limitations

### Example of Excellent vs Poor Answer

**Question:** Which metric for churn prediction?

**Poor answer:**  
"I would use accuracy because it's a good metric."

**Excellent answer:**  
"I would use PR-AUC as the primary metric, with a focus on recall at 20% precision. Reasoning: (1) Churn is imbalanced (~5% monthly rate), which makes accuracy misleading. (2) Missing a churner (FN) costs \$50 (the cost of acquiring a replacement customer), while an unnecessary retention offer (FP) costs only \$10. (3) We can afford a high FP rate if it means catching most churners. (4) I will use the precision-recall curve to find the operating point that minimizes total cost, which likely favors high recall even at moderate precision."

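Point (1) of the excellent answer can be checked numerically. The sketch below uses synthetic labels with roughly a 5% positive rate (an assumption chosen to mirror the churn example) and a "model" that never predicts the positive class. Accuracy looks strong, yet recall is zero and the average precision (PR-AUC) of an uninformative score collapses to the prevalence.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

# Synthetic labels with about 5% positives (assumption for illustration)
rng = np.random.default_rng(474)
y_true = (rng.random(10_000) < 0.05).astype(int)

# A useless "model": never predicts the positive class
y_pred_never = np.zeros_like(y_true)
acc = accuracy_score(y_true, y_pred_never)
rec = recall_score(y_true, y_pred_never)

# An uninformative constant score: average precision equals the positive rate
constant_scores = np.full(y_true.shape, 0.5)
ap = average_precision_score(y_true, constant_scores)

print(f"Accuracy: {acc:.3f}  Recall: {rec:.3f}  PR-AUC: {ap:.3f}")
```

A high accuracy here says nothing about whether any positive case was found, which is why the excellent answer argues from costs and the precision-recall curve instead.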
---

## Participation Assignment Submission Instructions

### To Submit This Notebook:

1. **Complete all exercises**: Fill in both PAUSE-AND-DO exercise cells with your findings
2. **Run All Cells**: Execute `Runtime → Run all` to ensure everything works
3. **Save a Copy**: Use `File → Save a copy in Drive`, or download the notebook as an `.ipynb` file
4. **Submit**: Upload your `.ipynb` file to the participation assignment on the course Brightspace page.

### Before Submitting, Check:

- [ ] All cells execute without errors
- [ ] All outputs are visible
- [ ] Both exercise responses are complete
- [ ] Notebook is shared with correct permissions
- [ ] You can explain every line of code you wrote

### Next Step:

Complete the **Quiz** in Brightspace (auto-graded)

---

## Bibliography

- Provost, F., & Fawcett, T. (2013). *Data Science for Business* - End-to-end predictive modeling process and business framing
- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). *An Introduction to Statistical Learning with Applications in Python* - Assessment/selection + classification/regression chapters
- scikit-learn User Guide: [Common pitfalls](https://scikit-learn.org/stable/common_pitfalls.html) - Especially leakage and improper evaluation

---



<center>

Thank you!

</center>