# Week 8 Lab: TA Guidance & Solutions
## ML Project Planning Workshop

This notebook provides comprehensive guidance for facilitating the Week 8 lab, including complete solutions for all business scenarios, common student difficulties, and facilitation strategies.

## Lab Overview
- **Format:** Group project planning workshop
- **Duration:** 75 minutes
- **Groups:** Teams of 4 students
- **Deliverable:** 4-slide presentation on ML project planning
- **Assessment:** Presentations with extra credit for volunteers

## Pre-Lab Preparation (20-30 minutes)

### Essential Preparation Steps
1. **Review Tuesday's slides** - especially the StreamFlix case study and HealthFirst insurance example
2. **Read Chapters 19-20** to understand key concepts students should apply
3. **Study all 5 scenarios** and practice explaining the solutions below
4. **Prepare group formation strategy** - aim for diverse skill levels in each group
5. **Set up presentation logistics** - screen sharing, timing, volunteer tracking

### Materials Needed
- Scenario assignments (A-E) ready to distribute
- Timer for 4-minute presentations
- Extra credit point tracking sheet
- Backup plans for technical issues

### Common Technical Issues
- **PowerPoint access:** Some students may need Google Slides alternative
- **Screen sharing:** Test beforehand, have backup presentation method
- **Group dynamics:** Be prepared to reassign groups if needed

## Facilitation Timeline (75 minutes)

### Opening & Setup (10 minutes)
- **[0-3 min]** Welcome and lab overview
- **[3-7 min]** Group formation (aim for balanced teams)
- **[7-10 min]** Scenario distribution and initial questions

### Group Work Session (50 minutes)
- **[10-15 min]** Individual reading and note-taking
- **[15-30 min]** Group discussion and planning
- **[30-50 min]** Slide development and practice
- **[50-60 min]** Final preparation and volunteer solicitation

### Presentations & Wrap-up (15 minutes)
- **[60-72 min]** Team presentations (3 teams at 4 minutes each; fit a 4th only by trimming each to 3 minutes)
- **[72-75 min]** Quick debrief and homework reminder

### Timing Management Tips
- **Be firm about transitions** - this lab depends on good pacing
- **Circulate during group work** - don't just sit at the front
- **Give time warnings** - "5 minutes left for this section"
- **Have backup plans** if groups finish early or run late

## Group Formation Strategy

### Optimal Team Composition
- **Mixed experience levels** - avoid all strong or all struggling students in one group
- **Diverse perspectives** - different majors/backgrounds when possible
- **4 students per group** - fewer than 4 reduces collaboration, more than 4 reduces participation

### Assignment Methods
1. **Random assignment** (fastest, usually works well)
2. **Strategic assignment** (if you know students well)
3. **Self-selection with constraints** ("no more than 2 from same major")

### Role Assignment
Each team should designate:
- **Problem Framer** (Slide 1) - often good for detail-oriented students
- **Data Strategist** (Slide 2) - good for technically strong students
- **Risk Analyst** (Slide 3) - good for critical thinkers
- **Ethics Officer** (Slide 4) - good for students interested in business/policy

### Managing Group Dynamics
- **Watch for dominating members** - encourage equal participation
- **Support quiet students** - private encouragement often helps
- **Address conflicts quickly** - usually stem from misunderstanding requirements
- **Be willing to redistribute** if group chemistry is poor

# Complete Solutions by Scenario

## Scenario A: TechTalent Recruiting Platform

### Slide 1: Problem Definition & Success Metrics

**Original Request:** *"We want to build an AI system that can automatically screen resumes and predict which candidates will be successful in tech roles."*

**Reframed Problem:** 
"Build a classification model to predict whether a candidate will receive a performance rating ≥4.0 at their 6-month review, using only information available at application time."

**Decision Context:**
- **Users:** Recruiting coordinators and hiring managers
- **Actions:** Prioritize candidates for human review, flag for fast-track interviews
- **Timing:** Real-time scoring as applications are submitted

**Success Metrics:**
- **Technical:** 75%+ precision on "high performer" predictions, 90%+ recall to avoid missing good candidates
- **Business:** Reduce time-to-hire by 40%, increase 6-month performance ratings by 15%, maintain/improve diversity hiring
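If a group is unsure what the precision and recall targets above mean in practice, a small worked example can help. The sketch below uses made-up labels (not from the scenario) and a hypothetical helper `precision_recall` to check the 75% precision and 90% recall targets.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 = high performer."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data, invented for illustration: 1 = rated >= 4.0 at the 6-month review
actual    = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall(actual, predicted)

# Targets from the slide: 75%+ precision and 90%+ recall
meets_targets = precision >= 0.75 and recall >= 0.90
print(f"precision={precision:.2f} recall={recall:.2f} meets_targets={meets_targets}")
```

In this toy case the model clears the precision target but misses one of five true high performers, so it falls short on recall. That is a useful prompt for discussing which error is more costly to the business.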

### Slide 2: Data Strategy & Train/Test Approach

**Data Quality Concerns:**
- Resume text inconsistency (formats, missing info, embellishments)
- Performance rating subjectivity (different managers, different standards)
- Incomplete outcome data (6-month reviews might be missing)
- Self-reported information accuracy (salary expectations, experience)

**Train/Test Strategy:** Time-based split
- **Training:** Applications from 2019-2022
- **Testing:** Applications from 2023
- **Rationale:** Hiring criteria, job market, and candidate pools change over time

**Timeline Considerations:**
- Need 6+ months after hire date to get performance data
- Most recent labeled data would be from early 2023 hires, so the 2023 test set can only include applicants whose 6-month reviews are complete
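The split and the label delay can be shown together in a few lines. This is a minimal sketch, assuming each record carries an application date and that the review lands roughly 183 days later; the field names, the `data_pull` date, and the use of application date in place of hire date are all simplifying assumptions.

```python
from datetime import date, timedelta

REVIEW_LAG = timedelta(days=183)  # assumption: ~6 months until the first review


def time_split(applications, test_start, data_pull):
    """Split by application date; drop rows whose label cannot exist yet."""
    train, test = [], []
    for app in applications:
        if app["applied"] + REVIEW_LAG > data_pull:
            continue  # review has not happened yet, so there is no label
        (test if app["applied"] >= test_start else train).append(app)
    return train, test


# Hypothetical records for illustration only
apps = [
    {"id": 1, "applied": date(2021, 5, 10)},
    {"id": 2, "applied": date(2022, 11, 2)},
    {"id": 3, "applied": date(2023, 2, 14)},
    {"id": 4, "applied": date(2023, 11, 20)},  # too recent to have a review
]

train, test = time_split(apps, test_start=date(2023, 1, 1), data_pull=date(2024, 1, 15))
print([a["id"] for a in train], [a["id"] for a in test])
```

Every training row predates every test row, which mirrors how the model would be used: trained on the past, scored on the future.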

### Slide 3: Data Leakage Prevention

**Potential Leakage Sources:**
- **Temporal leakage:** Using 6-month review data or retention info as input features (both are known only after hire)
- **Target leakage:** Including final hiring decision as a feature
- **Feature leakage:** Using post-application information (interview feedback, reference checks)

**Red Flag Features:**
- Interview scores or hiring manager notes
- Anything related to job performance or retention
- Post-application contact information or updates

**Mitigation Strategies:**
- Strict feature timeline: only data available at application submission
- Separate feature engineering from outcome definition
- Regular validation with domain experts (recruiters)
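The "strict feature timeline" idea can be made concrete by attaching a timestamp to every feature and filtering on it. The sketch below assumes a hypothetical `feature_log` mapping each feature name to a `(value, recorded_at)` pair; real pipelines would need equivalent metadata.

```python
from datetime import datetime


def features_available_at(feature_log, cutoff):
    """Keep only features recorded at or before the cutoff timestamp."""
    return {
        name: value
        for name, (value, recorded_at) in feature_log.items()
        if recorded_at <= cutoff
    }


submitted = datetime(2023, 3, 1, 9, 30)

# Hypothetical features for one candidate
feature_log = {
    "years_experience": (6, datetime(2023, 3, 1, 9, 30)),
    "resume_keywords": (14, datetime(2023, 3, 1, 9, 30)),
    "interview_score": (4.5, datetime(2023, 3, 20, 14, 0)),  # post-application
    "reference_check": ("ok", datetime(2023, 4, 2, 11, 0)),  # post-application
}

safe = features_available_at(feature_log, submitted)
print(sorted(safe))
```

Interview scores and reference checks drop out automatically because they were recorded after submission, which matches the red flag list above.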

### Slide 4: Ethics & Implementation

**Fairness Concerns:**
- Educational institution bias (prestigious schools favored)
- Experience bias (discriminating against career changers)
- Demographic bias through proxy variables (location, previous companies)
- Historical hiring bias perpetuation

**Privacy Implications:**
- Resume text contains personal information
- Social media data raises consent questions
- Performance data confidentiality

**Interpretability Requirements:**
- Recruiters need to understand why candidates are flagged
- Legal compliance may require explanation of decisions
- Continuous model auditing for bias

**Recommendations:**
- Start with human-in-the-loop approach (model assists, doesn't decide)
- Regular bias auditing across demographic groups
- 6-month development timeline with extensive testing
- Legal review before deployment
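One simple way to illustrate a bias audit is to compare how often the model flags candidates from each group. The sketch below is illustrative only: the groups and counts are invented, and the 0.8 threshold borrows the "four-fifths" rule of thumb from US employment guidance as an assumed starting point, not a legal test.

```python
from collections import defaultdict


def selection_rates(records):
    """Share of candidates flagged for fast-track, per group."""
    flagged, total = defaultdict(int), defaultdict(int)
    for group, was_flagged in records:
        total[group] += 1
        flagged[group] += int(was_flagged)
    return {g: flagged[g] / total[g] for g in total}


def impact_ratio(rates):
    """Lowest group selection rate divided by the highest.

    Assumes at least one group has a nonzero selection rate.
    """
    return min(rates.values()) / max(rates.values())


# Invented audit sample: (group label, flagged by model?)
records = (
    [("group_a", True)] * 30 + [("group_a", False)] * 70
    + [("group_b", True)] * 18 + [("group_b", False)] * 82
)

rates = selection_rates(records)
ratio = impact_ratio(rates)
needs_review = ratio < 0.8  # assumed threshold for escalating to a human audit
print(rates, round(ratio, 2), needs_review)
```

A low ratio does not prove discrimination, but it is a concrete trigger for the human and legal review the recommendations call for.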

## Scenario B: FinanceFlow Credit Decisions

### Slide 1: Problem Definition & Success Metrics

**Original Request:** *"We need an AI model to instantly approve or deny personal loan applications."*

**Reframed Problem:**
"Build a classification model to predict loan default risk within 12 months, enabling instant approval for low-risk applicants while flagging medium/high-risk applications for human review."

**Decision Context:**
- **Users:** Automated approval system + human underwriters
- **Actions:** Auto-approve (low risk), human review (medium risk), auto-deny (high risk)
- **Timing:** <30 seconds for competitive advantage

**Success Metrics:**
- **Technical:** <5% false negative rate (actual defaulters scored as low risk), portfolio default rate held at or below the current 8%
- **Business:** Approve 60% of applications instantly, reduce processing costs by 70%, maintain profitability
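The three-way action (auto-approve, human review, auto-deny) amounts to two thresholds on a predicted default probability. The sketch below uses hypothetical cutoffs and invented scores; choosing the real cutoffs is exactly the business decision groups should discuss.

```python
LOW_RISK_MAX = 0.05   # hypothetical cutoff: auto-approve below this
HIGH_RISK_MIN = 0.30  # hypothetical cutoff: auto-deny at or above this


def route_application(default_probability):
    """Map a predicted default probability to one of three actions."""
    if default_probability < LOW_RISK_MAX:
        return "auto_approve"
    if default_probability >= HIGH_RISK_MIN:
        return "auto_deny"
    return "human_review"


# Invented model scores for ten applications
scores = [0.01, 0.04, 0.12, 0.08, 0.45, 0.02, 0.03, 0.27, 0.015, 0.035]
decisions = [route_application(s) for s in scores]

instant_share = decisions.count("auto_approve") / len(decisions)
print(decisions)
print(f"instantly approved: {instant_share:.0%}")
```

Moving `LOW_RISK_MAX` up raises the instant-approval share but lets more eventual defaulters through, which ties the business metric directly to the technical one.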

### Slide 2: Data Strategy & Train/Test Approach

**Data Quality Concerns:**
- Income verification accuracy (self-reported vs. verified)
- Credit bureau data freshness and completeness
- Bank account data availability (not all applicants bank with them)
- Social media data reliability and consent issues

**Train/Test Strategy:** Time-based split with business cycle considerations
- **Training:** 2019-2021 applications (includes recession period)
- **Testing:** 2022-2023 applications
- **Rationale:** Economic conditions affect default rates; need recent economic context

**Timeline Considerations:**
- Need 12+ months of outcome data for default measurement
- Economic cycle effects on both application and default patterns

### Slide 3: Data Leakage Prevention

**Potential Leakage Sources:**
- **Temporal leakage:** Post-application credit bureau updates
- **Target leakage:** Including human underwriter decisions or notes
- **Feature leakage:** Using post-approval account behavior

**Red Flag Features:**
- Underwriter ratings or decision rationale
- Post-application employment verification
- Payment history on the loan being predicted

**Mitigation Strategies:**
- Snapshot credit data at application time
- Separate pre-approval and post-approval datasets
- Regular temporal validation checks
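"Snapshot credit data at application time" is an as-of lookup: take the latest value reported on or before the application date and ignore anything later. A minimal sketch, assuming a date-sorted list of hypothetical `(report_date, score)` pairs:

```python
from bisect import bisect_right
from datetime import date


def score_as_of(history, as_of):
    """Return the latest credit score reported on or before as_of, else None."""
    dates = [d for d, _ in history]  # history must be sorted by date
    idx = bisect_right(dates, as_of)
    return history[idx - 1][1] if idx else None


# Invented bureau history for one applicant
history = [
    (date(2022, 1, 15), 690),
    (date(2022, 6, 15), 705),
    (date(2022, 9, 15), 640),  # reported after the application
]

snapshot = score_as_of(history, date(2022, 7, 1))
print(snapshot)
```

The September score, which may already reflect trouble with the new loan, never reaches the model.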

### Slide 4: Ethics & Implementation

**Fairness Concerns:**
- Historical lending discrimination in training data
- Geographic bias (zip code correlations with protected classes)
- Income verification disparities across employment types
- Credit score bias reflecting historical discrimination

**Privacy Implications:**
- Financial data sensitivity
- Social media data usage consent
- Credit bureau data retention policies

**Interpretability Requirements:**
- FCRA requires explanation of adverse actions
- Regulatory examination requirements
- Internal risk management needs

**Recommendations:**
- Phased rollout starting with clear-cut cases
- Regular fairness auditing across demographic groups
- Legal and compliance review at each stage
- 4-6 month development with extensive regulatory preparation

## Scenario C: HealthStream Patient Risk Assessment

### Slide 1: Problem Definition & Success Metrics

**Original Request:** *"We want to predict which patients are at high risk for readmission within 30 days of discharge."*

**Reframed Problem:**
"Build a classification model to identify patients at high risk for 30-day readmission at the time of discharge, prioritizing the top 50 highest-risk patients monthly for intensive care coordination."

**Decision Context:**
- **Users:** Care coordinators, discharge planners, clinical staff
- **Actions:** Intensive discharge planning, home health services, follow-up calls
- **Timing:** 24 hours before planned discharge

**Success Metrics:**
- **Technical:** High recall (capture 80%+ of readmissions), precision tuned for resource constraints
- **Business:** Reduce 30-day readmission rate by 20%, save $400K annually in penalties
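Because care coordinators can only reach a fixed number of patients, the useful question is how many readmissions fall inside that capacity. The sketch below uses invented patient IDs and scores with `k=3` for readability; the scenario's capacity is 50 per month.

```python
def top_k_patients(risk_scores, k):
    """Return the k patient IDs with the highest predicted risk."""
    ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)
    return ranked[:k]


def recall_at_k(selected, readmitted):
    """Fraction of actual readmissions that were on the outreach list."""
    if not readmitted:
        return 0.0
    return len(set(selected) & set(readmitted)) / len(readmitted)


# Invented risk scores and outcomes
risk_scores = {"p01": 0.82, "p02": 0.15, "p03": 0.64, "p04": 0.09, "p05": 0.71, "p06": 0.33}
readmitted = {"p01", "p03", "p06"}

outreach = top_k_patients(risk_scores, k=3)
coverage = recall_at_k(outreach, readmitted)
print(outreach, f"{coverage:.0%} of readmissions covered")
```

Here two of three readmissions are covered, short of the 80% recall target, which shows how the capacity limit and the recall goal can pull against each other.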

### Slide 2: Data Strategy & Train/Test Approach

**Data Quality Concerns:**
- Diagnosis coding accuracy and completeness
- Social factors documentation inconsistency
- Discharge planning data entry timing
- Patient transfer vs. readmission classification

**Train/Test Strategy:** Time-based split with seasonal consideration
- **Training:** 2019-2022 discharges
- **Testing:** 2023 discharges
- **Rationale:** Medical practices evolve; seasonal readmission patterns exist

**Timeline Considerations:**
- 30-day outcome window for each discharge
- Seasonal variations in readmission patterns
- COVID impact on historical patterns

### Slide 3: Data Leakage Prevention

**Potential Leakage Sources:**
- **Temporal leakage:** Post-discharge information in discharge records
- **Target leakage:** Readmission planning notes in discharge summary
- **Feature leakage:** Using actual length of stay when only the planned length of stay is known at scoring time

**Red Flag Features:**
- Discharge disposition fields updated after the readmission occurred
- Follow-up appointment outcomes
- Home health service effectiveness measures

**Mitigation Strategies:**
- Use only data available at time of discharge decision
- Careful review of discharge summary timing
- Validate with clinical staff on realistic data availability

### Slide 4: Ethics & Implementation

**Fairness Concerns:**
- Socioeconomic bias in social support factors
- Insurance type correlation with care access
- Geographic disparities in follow-up care availability
- Age and comorbidity bias

**Privacy Implications:**
- HIPAA compliance requirements
- Patient consent for data usage
- Staff access controls and audit trails

**Interpretability Requirements:**
- Clinical staff need to understand and trust recommendations
- Quality committee review requirements
- Joint Commission accreditation documentation

**Recommendations:**
- Extensive clinical validation before deployment
- Start with decision support, not decision automation
- Regular clinical outcome monitoring
- 8-month development timeline with clinical integration

## Scenario D: RetailMax Dynamic Pricing

### Slide 1: Problem Definition & Success Metrics

**Original Request:** *"We want to use machine learning to set optimal prices for our products in real-time."*

**Reframed Problem:**
"Build a regression model to predict optimal price points that maximize revenue while maintaining competitive positioning, updating prices multiple times daily based on demand, inventory, and competitor changes."

**Decision Context:**
- **Users:** Automated pricing system with human oversight
- **Actions:** Adjust prices within defined guardrails, flag unusual situations
- **Timing:** Real-time updates (minutes after market changes)

**Success Metrics:**
- **Technical:** Estimate price elasticity to within 5% of observed values, forecast inventory turnover accurately
- **Business:** Increase revenue by 8%, maintain inventory turnover rates, stay competitive

### Slide 2: Data Strategy & Train/Test Approach

**Data Quality Concerns:**
- Competitor price scraping accuracy and timeliness
- Inventory data real-time synchronization
- Seasonal demand pattern representation
- Customer behavior tracking across channels

**Train/Test Strategy:** Time-based split with seasonal validation
- **Training:** 18 months of historical data including full seasonal cycles
- **Testing:** Most recent 3 months
- **Rationale:** Consumer behavior and competitive landscape evolve rapidly

**Timeline Considerations:**
- Need at least one full annual cycle in training (18 months covers one and a half); more cycles are better when the data exists
- Rapid market changes require frequent model updates
- Holiday and promotional period special handling

### Slide 3: Data Leakage Prevention

**Potential Leakage Sources:**
- **Temporal leakage:** Using future competitor prices or demand data
- **Target leakage:** Including final sales results in price optimization
- **Feature leakage:** Using post-price-change customer behavior

**Red Flag Features:**
- Next-day competitor prices
- Customer price sensitivity measured after price changes
- Inventory levels affected by pricing decisions

**Mitigation Strategies:**
- Strict temporal boundaries on all input data
- Separate price setting from outcome measurement
- Real-time data validation pipelines

### Slide 4: Ethics & Implementation

**Fairness Concerns:**
- Geographic price discrimination
- Customer segment targeting (premium vs. budget)
- Access to competitive prices for all customers
- Potential price manipulation during supply shortages

**Privacy Implications:**
- Customer behavior tracking and profiling
- Purchase history usage for personalization
- Competitive intelligence gathering methods

**Interpretability Requirements:**
- Business stakeholders need to understand pricing logic
- Customer service needs to explain price changes
- Regulatory compliance for pricing practices

**Recommendations:**
- Start with limited product categories
- Implement pricing guardrails and human oversight
- Regular competitive and customer impact assessment
- 6-month phased rollout with extensive monitoring
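Pricing guardrails are easy to show concretely: clamp whatever the model proposes to an allowed band and flag anything that had to be clamped. The sketch below is one possible design, with a hypothetical 10% per-update limit and made-up floor and ceiling values.

```python
def apply_guardrails(proposed, current, floor, ceiling, max_change=0.10):
    """Clamp a proposed price; return (price, needs_review).

    floor/ceiling are absolute limits; max_change caps the move per update.
    All limits here are illustrative assumptions.
    """
    lo = max(floor, current * (1 - max_change))
    hi = min(ceiling, current * (1 + max_change))
    clamped = min(max(proposed, lo), hi)
    return round(clamped, 2), clamped != proposed


# The model proposes a large jump; the guardrail limits it and flags it
price, flagged = apply_guardrails(proposed=27.50, current=20.00, floor=15.00, ceiling=30.00)
print(price, flagged)
```

Flagged prices go to the human overseer, which keeps the system automated for routine changes while surfacing unusual situations.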

## Scenario E: DriveRight Insurance Telematics

### Slide 1: Problem Definition & Success Metrics

**Original Request:** *"We want to use driving behavior data to personalize insurance premiums."*

**Reframed Problem:**
"Build a regression model to predict claim risk based on telematics data, enabling personalized premium adjustments for voluntary participants while maintaining actuarial soundness and regulatory compliance."

**Decision Context:**
- **Users:** Underwriters, actuaries, customer service representatives
- **Actions:** Adjust renewal premiums, offer discounts, recommend driving improvements
- **Timing:** Quarterly rate adjustments based on 3-month driving patterns

**Success Metrics:**
- **Technical:** Improve loss ratio prediction accuracy by 15%, calibrate with traditional risk factors
- **Business:** Reduce claims costs by 10%, increase customer retention by 8%, grow telematics program participation

### Slide 2: Data Strategy & Train/Test Approach

**Data Quality Concerns:**
- Smartphone vs. connected car data consistency
- Self-selection bias in data collection (risky drivers are less likely to opt in)
- GPS accuracy and privacy masking effects
- Phone usage detection false positives/negatives

**Train/Test Strategy:** Time-based split with geographic stratification
- **Training:** 2019-2022 telematics participants with 12+ months of data
- **Testing:** 2023 participants
- **Rationale:** Driving patterns change over time; geographic differences in risk

**Timeline Considerations:**
- Need 12+ months of driving data for reliable patterns
- Claims development period (some claims reported late)
- Seasonal driving pattern variations

### Slide 3: Data Leakage Prevention

**Potential Leakage Sources:**
- **Temporal leakage:** Post-claim driving behavior changes
- **Target leakage:** Including claim-related trip data
- **Feature leakage:** Using driving patterns after policy changes

**Red Flag Features:**
- Driving behavior immediately before or after claims
- Trip data from accident dates
- Behavioral changes following premium adjustments

**Mitigation Strategies:**
- Exclude data from claim periods and aftermath
- Use historical driving patterns for rate setting
- Validate temporal consistency in training data
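Excluding claim periods can be sketched as a date-window filter on trip records. The 30-day buffer, the field names, and the sample trips below are assumptions for illustration; an actuary would set the real window.

```python
from datetime import date, timedelta


def trips_outside_claim_windows(trips, claim_dates, buffer_days=30):
    """Drop trips within buffer_days before or after any claim date."""
    buffer = timedelta(days=buffer_days)
    return [
        trip for trip in trips
        if all(abs(trip["date"] - claim) > buffer for claim in claim_dates)
    ]


# Invented trips for one driver with a single claim on 2023-03-05
trips = [
    {"id": 1, "date": date(2023, 1, 10)},
    {"id": 2, "date": date(2023, 3, 1)},   # days before the claim
    {"id": 3, "date": date(2023, 3, 20)},  # aftermath of the claim
    {"id": 4, "date": date(2023, 6, 1)},
]

kept = trips_outside_claim_windows(trips, claim_dates=[date(2023, 3, 5)])
print([t["id"] for t in kept])
```

Trips close to the claim are removed in both directions, matching the red flag about behavior immediately before or after claims.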

### Slide 4: Ethics & Implementation

**Fairness Concerns:**
- Economic discrimination (smartphone quality affects data)
- Age bias in technology adoption
- Geographic bias in mobile coverage
- Disability accommodations for driving patterns

**Privacy Implications:**
- Location tracking and movement patterns
- Family member driving detection
- Data sharing with third parties
- Consent withdrawal and data deletion

**Interpretability Requirements:**
- Customers need to understand premium changes
- Regulatory filing requirements for rate factors
- Actuarial justification for risk differentials

**Recommendations:**
- Voluntary participation only with clear opt-out
- Regular fairness auditing across demographic groups
- Strong privacy controls and data minimization
- 12-month pilot program before full deployment

## Common Student Difficulties & Responses

### Problem Framing Issues

**Student Difficulty:** "We'll use machine learning to solve the business problem"
**TA Response:** "What specific prediction will the model make? What exact action will be taken based on that prediction?"

**Student Difficulty:** Vague success metrics like "improve business outcomes"
**TA Response:** "Give me a number. What exactly will you measure, and how much improvement counts as success?"

**Student Difficulty:** Jumping to algorithms ("we'll use random forest")
**TA Response:** "We're not building models today - we're planning. What question are you trying to answer?"

### Data Strategy Confusion

**Student Difficulty:** "We'll randomly split the data"
**TA Response:** "Think about how this model will be used in the real world. Are you predicting the past or the future?"

**Student Difficulty:** Ignoring data quality entirely
**TA Response:** "What could go wrong with this data? What assumptions are you making?"

**Student Difficulty:** Not considering business context for splitting
**TA Response:** "If you were deploying this model next month, what data would actually be available?"

### Data Leakage Blindness

**Student Difficulty:** "We'll use all available data"
**TA Response:** "When exactly would each piece of data be available? Draw me a timeline."

**Student Difficulty:** Missing obvious leakage
**TA Response:** "If your model is 99% accurate on this problem, what might be wrong?"

**Student Difficulty:** Confusing correlation with causation
**TA Response:** "Does X cause Y, or does Y cause X? Or do they both happen at the same time?"

### Ethics Afterthoughts

**Student Difficulty:** "We'll consider ethics later"
**TA Response:** "Who could be harmed by this model? What groups might be unfairly affected?"

**Student Difficulty:** Generic fairness concerns
**TA Response:** "Be specific to your scenario. What historical biases might exist in this industry?"

**Student Difficulty:** Ignoring interpretability
**TA Response:** "If the model denies someone's loan application, what explanation would you give them?"

### Time Management Issues

- **Groups finishing too early:** Have them critique other scenarios or develop contingency plans
- **Groups running behind:** Help prioritize key points for each slide
- **Uneven participation:** Privately encourage quiet members, redirect dominant ones

## Presentation Facilitation Guide

### Setting Up for Success
- **Volunteer solicitation:** Start asking for volunteers at 50-minute mark
- **Backup plan:** Have random selection method ready
- **Technical setup:** Test screen sharing and presentation tools
- **Time management:** Use visible timer, give 30-second warnings

### During Presentations
- **Encourage the audience:** "What questions do you have for this team?"
- **Ask follow-up questions:** Test understanding beyond their slides
- **Point out good insights:** "That's a great observation about..."
- **Connect to course concepts:** "How does this relate to what we learned on Tuesday?"

### Good Follow-up Questions
- "What would happen if your model was wrong in this case?"
- "How would you convince a skeptical stakeholder this approach is right?"
- "What would you do differently if you had unlimited time and budget?"
- "Which assumption are you most worried about?"

### Managing Difficult Situations
- **Unprepared groups:** Ask easier questions, focus on learning points
- **Overly confident groups:** Push on edge cases and assumptions
- **Technical difficulties:** Have backup presentation methods ready
- **Time overruns:** Politely interrupt and summarize key points

### Wrap-up Synthesis
After presentations, briefly highlight:
- **Common themes** across scenarios
- **Different approaches** teams took
- **Real-world relevance** of their planning work
- **Connection to upcoming content**

## Assessment & Extra Credit Tracking

### Extra Credit Guidelines
- **5 points** for volunteering to present
- **Applied to:** This lab grade or overall participation grade
- **Criteria:** Volunteering is sufficient; presentation quality doesn't affect points
- **Record keeping:** Track which students presented for grade book

### Presentation Evaluation
While extra credit is just for volunteering, use these criteria for feedback:

**Excellent (A-level):**
- Specific, actionable problem framing
- Thoughtful train/test strategy with clear justification
- Multiple concrete leakage risks identified
- Specific fairness/privacy concerns with mitigation strategies

**Good (B-level):**
- Clear problem statement with some specificity
- Reasonable data strategy with basic justification
- Some leakage risks identified
- General ethical considerations addressed

**Needs Improvement (C-level):**
- Vague problem framing
- Generic data strategy without justification
- Minimal attention to leakage or ethics
- Focus on algorithms rather than planning

### Homework Connection
Remind students that their presentations will be submitted as homework:
- **Format:** PowerPoint or Google Slides
- **Due date:** Standard homework deadline
- **Grading:** More detailed rubric for homework submission
- **Improvement opportunity:** Can revise based on class feedback

## Post-Lab Reflection & Improvement

### For TAs: What to Note
- **Timing accuracy:** Did sections run long/short?
- **Student struggles:** Which concepts were most confusing?
- **Group dynamics:** What worked well/poorly in team formation?
- **Presentation quality:** Were students prepared enough?

### Common Adjustments for Future
- **More time for group work** if teams struggled with presentations
- **Clearer examples** if students missed key concepts
- **Better group formation** if team dynamics were poor
- **Different scenario complexity** if too easy/hard

### Success Indicators
- Students can articulate specific ML planning considerations
- Groups demonstrate understanding of train/test strategy
- Teams identify realistic business constraints and ethical issues
- Presentations show critical thinking beyond memorized concepts

### Connection to Next Week
This lab sets up students for:
- Understanding why algorithm choice matters
- Appreciating the complexity of "simple" ML projects
- Recognizing that planning prevents problems
- Developing a professional consulting mindset

## Emergency Backup Plans

### Technology Failures
- **No internet:** Have offline versions of scenarios printed
- **Presentation issues:** Use flip chart paper for slide content
- **Small groups:** Combine scenarios or do fishbowl discussions
- **No projector:** Have teams present to each other in corners

### Time Constraints
- **Behind schedule:** Focus on slides 1-2, discuss 3-4 in groups
- **Ahead of schedule:** Add scenario-swapping exercise between groups
- **No presentations:** Do gallery walk with poster presentations

### Student Issues
- **Attendance problems:** Adjust group sizes, have floating TA support
- **Unprepared students:** Pair with stronger groups as observers
- **Resistant students:** Emphasize professional skill development

### Last Resort Options
If the full lab format doesn't work:
1. **Modified workshop:** Work through one scenario as full class
2. **Case study analysis:** Deep dive on StreamFlix example from Tuesday
3. **Concept practice:** Structured Q&A on readings with examples
4. **Planning templates:** Have students fill out structured planning worksheets