# Week 8 Lab: ML Project Planning Workshop

<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/labs/08_wk8_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to your first hands-on machine learning lab! Today you'll work as data science consultants, helping organizations plan their ML projects before building any models. You'll practice the critical thinking and planning skills that separate successful ML projects from costly failures.

## 🎯 Learning Objectives
By the end of this lab, you will be able to:
- Transform vague business requests into clear, actionable ML problem statements
- Design appropriate train/test split strategies based on data characteristics
- Identify potential data leakage risks and mitigation strategies
- Assess fairness, privacy, and interpretability concerns for ML applications

## 📚 This Lab Reinforces
- **Chapter 19: Introduction to Machine Learning and AI**
- **Chapter 20: Before We Build - Key Considerations**
- **Tuesday Lecture: ML Fundamentals and Planning**

## 🕐 Estimated Time & Structure
**Total Time:** 75 minutes  
**Mode:** Group work (teams of 4)

- **[0–10 min]** Lab introduction and team formation
- **[10–60 min]** Group planning session (50 minutes)
- **[60–75 min]** Team presentations and wrap-up

## 💡 Why This Matters
Before you can build effective machine learning models, you need to plan the project well. A commonly cited industry estimate is that roughly 85% of ML projects fail to deliver, and the usual culprits are weak problem definition, poor data quality, and overlooked ethical considerations rather than poor algorithms. Today's skills will help you avoid building technically impressive models that solve the wrong problem or fail in real-world deployment.

## Lab Instructions

### Your Role Today
You are **data science consultants** working for "Analytics Solutions Inc." Five different companies have approached your firm with machine learning requests. Your job is to help them plan their projects properly **before** they invest in model development.

### Deliverable
Your team will create a **4-slide PowerPoint presentation** for your "technical boss" (the instructor). It should demonstrate professional-level planning and critical thinking.

### Team Formation
- **Groups of 4 students** (instructor will facilitate formation)
- **Each group gets assigned one business scenario**
- **50 minutes for planning and presentation creation**
- **4 minutes per team for presentations**

### Presentation Opportunity
- **Volunteers who present get 5 extra credit points**
- If no one volunteers, teams will be selected at random (no extra credit)
- **Save your presentation for homework submission**

### Evaluation Criteria
Your presentation will be evaluated on:
- **Problem framing clarity and specificity**
- **Technical accuracy of data considerations**
- **Thoughtful identification of risks and mitigation strategies**
- **Professional presentation quality and business focus**

## Required Presentation Structure

Your 4-slide presentation must follow this exact structure:

### Slide 1: Problem Definition & Success Metrics
- **Original business question** (quote the stakeholder's request)
- **Reframed analytic problem** (specific, measurable, actionable)
- **Decision context** (who uses the model, what actions they take)
- **Success measures** (both technical metrics and business KPIs)
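
To make the pairing of a technical metric with a business KPI concrete, here is a minimal sketch in plain Python. The labels, predictions, and dollar costs are invented for illustration and do not come from any scenario; your own slide should use figures that fit your assigned business context.

```python
# Hypothetical labels and predictions (1 = positive outcome, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]


def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Technical metrics
precision = tp / (tp + fp)  # share of flagged cases that were truly positive
recall = tp / (tp + fn)     # share of true positives that were flagged

# Business KPI: assumed dollar cost of each error type (illustrative values).
COST_FALSE_POSITIVE = 500
COST_FALSE_NEGATIVE = 5_000
error_cost = fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

print(f"precision={precision:.2f} recall={recall:.2f} error_cost=${error_cost:,}")
```

The point of the sketch is that two models with the same precision and recall can have very different business value once the cost of each error type is stated.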

### Slide 2: Data Strategy & Train/Test Approach
- **Data quality concerns** you anticipate
- **Train/test split strategy** (random vs. strategic splitting)
- **Justification** for your splitting approach
- **Timeline considerations** for data availability
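
The difference between a random split and one kind of strategic split (a time-based split) can be sketched as follows. This is a plain-Python illustration under stated assumptions: the records and dates are made up, and the helper names `random_split` and `time_based_split` are hypothetical rather than functions from any library.

```python
import random
from datetime import date

# Hypothetical records: (record_id, event_date). Not taken from any scenario.
records = [(f"R{i:03d}", date(2023, 1 + i % 12, 1 + i % 28)) for i in range(20)]


def random_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows, then hold out `test_fraction` of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def time_based_split(rows, test_fraction=0.2):
    """Train on the earliest rows and test on the most recent ones."""
    ordered = sorted(rows, key=lambda row: row[1])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


train_r, test_r = random_split(records)
train_t, test_t = time_based_split(records)

# In the time-based split, no training row is dated after any test row.
latest_train = max(row[1] for row in train_t)
earliest_test = min(row[1] for row in test_t)
print(latest_train, "<=", earliest_test)
```

A random split is usually reasonable when rows are independent of each other; when the model will be used on future data, a time-based split mimics deployment more closely because the model never trains on rows dated after the ones it is tested on.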

### Slide 3: Data Leakage Prevention
- **Potential leakage sources** specific to this problem
- **Red flag features** to avoid or investigate carefully
- **Mitigation strategies** to prevent leakage
- **Validation approaches** to catch leakage if it occurs
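
One simple way to screen for red flag features is to ask when each value becomes known relative to the moment the prediction is made. The sketch below assumes you can assemble a small catalog of feature timestamps; the feature names, dates, and the helper `flag_leaky_features` are all hypothetical.

```python
from datetime import datetime

# Moment at which the model would have to make its prediction (assumed).
prediction_time = datetime(2024, 3, 1, 9, 0)

# Hypothetical catalog: when each feature's value was recorded for one record.
feature_recorded_at = {
    "application_submitted": datetime(2024, 2, 28, 14, 30),
    "prior_account_age": datetime(2024, 2, 28, 14, 30),
    "outcome_review_score": datetime(2024, 9, 1, 0, 0),   # known only after the outcome
    "followup_call_logged": datetime(2024, 3, 15, 10, 0),  # logged after the decision
}


def flag_leaky_features(recorded_at, cutoff):
    """Return the features whose values were recorded after the cutoff."""
    return sorted(name for name, ts in recorded_at.items() if ts > cutoff)


leaky = flag_leaky_features(feature_recorded_at, prediction_time)
print("Investigate before using:", leaky)
```

A timestamp check like this catches only one kind of leakage. Features that are proxies for the outcome, or preprocessing fitted on the full dataset before splitting, need separate review.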

### Slide 4: Ethics & Implementation Recommendations
- **Fairness concerns** and potential bias sources
- **Privacy implications** and protection strategies
- **Interpretability requirements** and approaches
- **Next steps** and realistic timeline for responsible development
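
As a starting point for the fairness bullet, one common first check is to compare selection rates across groups. The sketch below uses invented decisions for two placeholder groups, "A" and "B"; the 0.8 threshold follows the "four-fifths" rule of thumb from US employment selection guidelines and is used here only as an illustrative screening value, not as a legal test for any scenario.

```python
from collections import defaultdict

# Hypothetical model decisions: (group, selected) where selected is 1 or 0.
decisions = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1), ("A", 0),
    ("B", 1), ("B", 0), ("B", 0), ("B", 0), ("B", 1),
]


def selection_rates(rows):
    """Return the share of positive decisions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, selected in rows:
        totals[group] += 1
        positives[group] += selected
    return {group: positives[group] / totals[group] for group in totals}


rates = selection_rates(decisions)

# Ratio of the lowest group rate to the highest group rate.
impact_ratio = min(rates.values()) / max(rates.values())
needs_review = impact_ratio < 0.8  # illustrative screening threshold

print(rates, round(impact_ratio, 2), needs_review)
```

A low ratio does not prove the model is unfair, and a high ratio does not prove it is fair; it is a prompt to investigate where the gap comes from.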

# Business Scenarios

## Scenario A: TechTalent Recruiting Platform

**Company:** TechTalent Solutions (AI-powered recruiting platform)

**Stakeholder Request:** 
*"We want to build an AI system that can automatically screen resumes and predict which candidates will be successful in tech roles. This will help our clients save time and find the best talent faster. We have 50,000 historical resumes with data on whether candidates were hired and their performance ratings after 6 months."*

**Business Context:**
- B2B platform serving 200+ tech companies
- Currently takes human recruiters 15-20 minutes per resume
- Clients want to reduce time-to-hire from 45 days to 20 days
- High competition in tech recruiting market

**Available Data:**
- **Resume text** (education, experience, skills, job titles)
- **Candidate demographics** (age, location, years of experience)
- **Application metadata** (source channel, application date, salary expectations)
- **Hiring decisions** (hired/rejected, time-to-decision)
- **Performance ratings** (6-month reviews, manager ratings 1-5 scale)
- **Retention data** (still employed after 1 year)

**Sample Data Preview:**
```
candidate_id | years_exp | education_level | previous_company | hired | 6mo_rating | retained_1yr
TECH_001     | 3         | Bachelor's      | Google           | Yes   | 4.2        | Yes
TECH_002     | 1         | Master's        | Startup          | No    | N/A        | N/A
TECH_003     | 7         | PhD             | Amazon           | Yes   | 3.8        | No
```

**Business Concerns:**
- Need to process 1,000+ resumes per week
- Legal compliance with employment discrimination laws
- Client satisfaction depends on quality hires, not just speed
- Bad hires cost clients $50,000+ in training and turnover

## Scenario B: FinanceFlow Credit Decisions

**Company:** FinanceFlow (Digital lending platform)

**Stakeholder Request:**
*"We need an AI model to instantly approve or deny personal loan applications. Currently our underwriters take 3-5 days to review each application, but our competitors offer instant decisions. We want to approve low-risk applicants immediately and only send complex cases to human review."*

**Business Context:**
- Online personal loans from $5,000 to $50,000
- Competing against instant-approval fintech companies
- Current default rate: 8% (industry average: 10%)
- Manual underwriting costs $200 per application

**Available Data:**
- **Credit bureau data** (credit score, payment history, debt-to-income ratio)
- **Application information** (loan amount, purpose, income, employment)
- **Bank account data** (account age, average balance, transaction patterns)
- **Social media profiles** (LinkedIn, Facebook - if provided)
- **Device/session data** (IP address, device type, time spent on application)
- **Historical outcomes** (approved/denied, default status, payment history)

**Sample Data Preview:**
```
application_id | credit_score | income | debt_ratio | loan_amount | approved | defaulted
LOAN_001       | 720          | 65000  | 0.35       | 15000       | Yes      | No
LOAN_002       | 580          | 35000  | 0.65       | 25000       | No       | N/A
LOAN_003       | 680          | 45000  | 0.42       | 10000       | Yes      | Yes
```
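
One detail in the preview is worth checking before planning a split: `defaulted` is `N/A` for the denied application, so the outcome label may exist only for loans that were approved. A minimal sketch of that check, using the three preview rows (the list-of-dictionaries layout is an assumption about how the data might be loaded):

```python
# The three rows from the sample preview, reduced to the relevant columns.
rows = [
    {"application_id": "LOAN_001", "approved": "Yes", "defaulted": "No"},
    {"application_id": "LOAN_002", "approved": "No", "defaulted": "N/A"},
    {"application_id": "LOAN_003", "approved": "Yes", "defaulted": "Yes"},
]

# Rows with an observed outcome vs. rows where the outcome was never observed.
labeled = [r for r in rows if r["defaulted"] != "N/A"]
unlabeled = [r for r in rows if r["defaulted"] == "N/A"]

# Does every labeled row come from an approved application?
labels_only_for_approved = all(r["approved"] == "Yes" for r in labeled)

# Default rate can only be computed among approved (labeled) applications.
default_rate_among_approved = sum(r["defaulted"] == "Yes" for r in labeled) / len(labeled)

print(len(labeled), len(unlabeled), labels_only_for_approved, default_rate_among_approved)
```

If this pattern holds in the full dataset, a model trained on the labeled rows learns only from applicants that past underwriters chose to approve, which is worth raising under data quality concerns.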

**Business Concerns:**
- Must comply with Fair Credit Reporting Act (FCRA)
- Need to explain loan denials to applicants
- Balance between approval rates and default risk
- Regulatory scrutiny on algorithmic lending decisions

## Scenario C: HealthStream Patient Risk Assessment

**Company:** HealthStream Medical Center (Regional hospital system)

**Stakeholder Request:**
*"We want to predict which patients are at high risk for readmission within 30 days of discharge. This will help us provide better care coordination and reduce Medicare penalties. We have 5 years of patient data and want to implement the model in our electronic health record system."*

**Business Context:**
- 400-bed hospital system with 25,000 annual admissions
- Medicare penalties for high readmission rates cost $2M annually
- Care coordination team can only handle 50 high-risk patients per month
- Joint Commission accreditation requires quality improvement initiatives
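
The capacity constraint above changes what a useful model output looks like: the care team needs a short ranked list rather than a yes/no label for every patient. A rough worked calculation, assuming admissions are spread evenly across months and that each admission leads to one discharge (both simplifications), with invented risk scores and a hypothetical `top_k` helper:

```python
ANNUAL_ADMISSIONS = 25_000  # from the business context
MONTHLY_CAPACITY = 50       # patients the care coordination team can handle

# Assumes admissions are spread evenly over 12 months.
monthly_admissions = ANNUAL_ADMISSIONS / 12            # about 2,083
flag_fraction = MONTHLY_CAPACITY / monthly_admissions  # about 2.4%


def top_k(scores, k):
    """Return the k patient IDs with the highest predicted risk."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical risk scores for five patients (not real data).
risk_scores = {"P1": 0.82, "P2": 0.15, "P3": 0.64, "P4": 0.91, "P5": 0.33}
flagged = top_k(risk_scores, k=2)

print(f"{flag_fraction:.1%} of monthly discharges can be flagged; example: {flagged}")
```

Under these assumptions only about 1 in 40 discharged patients can receive care coordination, so a metric that measures how many of the top-ranked patients are truly readmitted may be more relevant than overall accuracy.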

**Available Data:**
- **Patient demographics** (age, gender, zip code, insurance type)
- **Medical history** (diagnoses, procedures, medications, allergies)
- **Current admission** (length of stay, diagnosis codes, treatment complexity)
- **Social factors** (living situation, transportation, family support)
- **Discharge planning** (discharge destination, home health services, follow-up appointments)
- **Outcomes** (readmitted within 30 days, emergency department visits)

**Sample Data Preview:**
```
patient_id | age | primary_dx    | length_stay | discharge_dest  | readmitted_30d
HEALTH_001 | 78  | Heart Failure | 4           | Home            | Yes
HEALTH_002 | 45  | Pneumonia     | 2           | Home            | No
HEALTH_003 | 82  | Hip Fracture  | 6           | Skilled Nursing | No
```

**Business Concerns:**
- HIPAA privacy compliance requirements
- Clinical staff must trust and understand model recommendations
- Risk stratification affects resource allocation decisions
- Model errors could impact patient safety and care quality

## Scenario D: RetailMax Dynamic Pricing

**Company:** RetailMax (Large e-commerce retailer)

**Stakeholder Request:**
*"We want to use machine learning to set optimal prices for our products in real-time. Our competitors change prices multiple times per day, and we're losing market share. We need a system that can automatically adjust prices based on demand, inventory, and competitor pricing to maximize revenue."*

**Business Context:**
- 50,000+ SKUs across electronics, home goods, and apparel
- $2B annual revenue with 3% profit margins
- Competitors change prices 2-5 times daily
- Inventory carrying costs are significant

**Available Data:**
- **Product catalog** (SKU, category, brand, cost, specifications)
- **Sales history** (price, quantity sold, revenue, time stamps)
- **Inventory levels** (current stock, lead times, storage costs)
- **Competitor pricing** (scraped daily from major competitors)
- **Customer behavior** (views, cart additions, purchase patterns)
- **External factors** (seasonality, holidays, economic indicators)

**Sample Data Preview:**
```
sku        | current_price | inventory | competitor_avg | daily_sales | margin
RETAIL_001 | 299.99        | 45        | 289.99         | 12          | 0.15
RETAIL_002 | 89.99         | 200       | 94.99          | 35          | 0.22
RETAIL_003 | 1299.99       | 8         | 1199.99        | 1           | 0.08
```

**Business Concerns:**
- Price changes must happen within minutes of market changes
- Cannot price below cost or violate minimum advertised pricing
- Customer trust affected by perceived price fairness
- Revenue optimization vs. inventory management trade-offs
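
The pricing constraints above are business rules that sit outside any model: whatever price a model proposes, a guardrail can be applied afterward. A minimal sketch, in which the function `apply_price_floor`, its parameters `unit_cost` and `map_price`, and all dollar values are hypothetical, and minimum advertised pricing is simplified to a single floor value:

```python
def apply_price_floor(proposed_price, unit_cost, map_price=None):
    """Raise a proposed price to the floor if it falls below cost or the MAP.

    `map_price` is the minimum advertised price, if one applies to the SKU.
    """
    floor = unit_cost if map_price is None else max(unit_cost, map_price)
    return max(proposed_price, floor)


# Proposed price is below cost, so it is raised to cost.
print(apply_price_floor(79.99, unit_cost=85.00))

# Proposed price is above cost but below the MAP, so it is raised to the MAP.
print(apply_price_floor(79.99, unit_cost=70.00, map_price=89.99))

# Proposed price already satisfies both constraints and is left unchanged.
print(apply_price_floor(120.00, unit_cost=70.00, map_price=89.99))
```

Keeping rules like this separate from the model makes them easy to audit and means a model error cannot produce a price that violates them.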

## Scenario E: DriveRight Insurance Telematics

**Company:** DriveRight Auto Insurance

**Stakeholder Request:**
*"We want to use driving behavior data from smartphones and connected cars to personalize insurance premiums. Good drivers should pay less, and risky drivers should pay more. This will help us compete with usage-based insurance companies and reduce claims costs."*

**Business Context:**
- Traditional auto insurer with 2M policyholders
- Losing customers to app-based insurers offering behavior-based pricing
- Average claim costs increasing 5% annually
- Regulatory pressure to justify rate changes with data

**Available Data:**
- **Traditional risk factors** (age, gender, location, vehicle type, driving record)
- **Telematics data** (speed, acceleration, braking, cornering, time of day)
- **Trip characteristics** (distance, duration, road types, weather conditions)
- **Phone usage** (calls, texts, app usage while driving)
- **Geographic patterns** (frequent routes, parking locations, travel patterns)
- **Claims history** (accidents, claims amounts, fault determination)
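
If the telematics and trip data are stored with many rows per policyholder (an assumption; the preview below shows one aggregated row per policy), a random row-level split could place the same driver in both the training and test sets. A group-aware split keeps every row for a given policy on one side. The trip records and the `group_split` helper here are hypothetical:

```python
import random

# Hypothetical trip-level data: 4 policies with 3 trips each.
trips = [
    {"policy_id": f"DRIVE_{p:03d}", "trip_id": t}
    for p in range(1, 5)
    for t in range(3)
]


def group_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows so that each group appears in train or test, never both."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


train, test = group_split(trips, "policy_id")

train_policies = {row["policy_id"] for row in train}
test_policies = {row["policy_id"] for row in test}
print(len(train), len(test), train_policies & test_policies)
```

The same idea applies in any scenario where one person or entity contributes multiple rows.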

**Sample Data Preview:**
```
policy_id | age | vehicle_year | avg_speed | hard_brakes_pm | phone_usage | claims_3yr
DRIVE_001 | 35  | 2020         | 67.2      | 2.1            | 0.05        | 0
DRIVE_002 | 22  | 2015         | 73.8      | 8.7            | 0.23        | 2
DRIVE_003 | 58  | 2018         | 61.4      | 1.2            | 0.01        | 1
```

**Business Concerns:**
- Privacy concerns about location and behavior tracking
- State insurance regulations vary on allowable rating factors
- Need actuarial justification for rate changes
- Customer opt-in required for data collection programs

## Tips for Success

- **Be specific:** Avoid vague statements like "consider data quality"
- **Show business thinking:** Connect technical concepts to business impact
- **Anticipate questions:** Your instructor will ask follow-up questions
- **Stay focused:** Don't try to solve everything - focus on planning
- **Use available resources:** Refer to the textbook chapters for guidance, and don't be afraid to use AI tools (ChatGPT, etc.) to brainstorm ideas.
- **Be explicit about assumptions:** Make reasonable assumptions about data and context, and state them clearly.

## Presentation Guidelines

### Timing
- **4 minutes maximum per team**
- **1 minute per slide recommended**
- **Instructor will provide time warnings**

### Content Focus
- **Demonstrate critical thinking** about ML planning
- **Show understanding** of concepts from readings and lecture
- **Make specific recommendations** rather than general observations
- **Connect technical concepts** to business outcomes

### Presentation Style
- **Professional tone** - you're presenting to your "technical boss"
- **Clear, confident delivery**
- **Engage the audience** with eye contact
- **Be prepared** for follow-up questions

### Common Mistakes to Avoid
- **Don't jump to algorithms** - this lab is about planning, not modeling
- **Don't ignore ethical considerations** - they're required, not optional
- **Don't be vague** - "data quality issues" is not specific enough
- **Don't exceed time limit** - practice your timing

## Wrap-Up & Reflection

### Key Learning Outcomes
After completing this lab, you should understand:

1. **Problem framing is critical** - vague business requests must be transformed into specific, measurable ML problems
2. **Data strategy matters** - the way you split and handle data can make or break your model's real-world performance
3. **Leakage is sneaky** - future information can slip into training data in subtle ways that destroy model validity
4. **Ethics are essential** - fairness, privacy, and interpretability must be considered from the start, not as an afterthought

### Reflection Questions
Consider these questions individually:
- Which aspect of ML planning was most challenging for your team?
- What surprised you about the complexity of "simple" business requests?
- How would you approach your scenario differently after hearing other teams?
- What connections do you see between today's planning concepts and previous data analysis work?

### Next Steps
- **Homework:** Submit your team's PowerPoint presentation
- **Next Tuesday:** Introduction to specific ML algorithms and implementation
- **Upcoming Labs:** Hands-on model building using Python

### Real-World Application
The planning skills you practiced today are exactly what data scientists use in industry. Companies pay consultants thousands of dollars for the type of analysis you just completed. These foundational planning skills will serve you throughout your data science career, regardless of which specific algorithms or tools you eventually use.

---

**💾 Remember to save your presentation** for homework submission. Your planning work today forms the foundation for all successful machine learning projects!