# TA Guidance: Week 9 Lab - Correlation & Regression Analysis

## 🎯 Lab Overview and Teaching Philosophy

**Critical Understanding:** This lab serves as **BOTH** the in-class Thursday lab **AND** the weekly homework assignment. Students will use their numerical results from the three challenges to complete the Canvas quiz due Sunday.

### Learning Objectives
- Students will calculate and interpret correlation coefficients for business relationships
- Students will build and evaluate simple and multiple linear regression models using scikit-learn
- Students will apply proper train/test split methodology for honest model evaluation
- Students will interpret regression coefficients and evaluation metrics in business contexts

### Time Allocation & Teaching Strategy
- **Part 1 & 2 (30-40 minutes)**: TA-guided instruction and demonstrations
- **Part 3 (35-40 minutes)**: **Independent group work** - Students complete challenges as homework
- **Wrap-up (5 minutes)**: Brief summary and Canvas quiz reminders

### Content Alignment
- Directly reinforces Tuesday's slides on correlation analysis and regression modeling
- Provides hands-on practice with all key concepts from Chapters 21-22
- Uses the same datasets as textbook end-of-chapter exercises

## 🛠️ Pre-Lab Setup Instructions

**Technical Setup:**
- Ensure all students can access Google Colab and load the lab notebook
- Test that ISLP package loads correctly (if issues arise, provide CSV fallback files)
- Have backup plan for internet connectivity issues
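
The CSV fallback mentioned above could be wired up roughly as follows. This is a minimal sketch, not part of the lab notebook: the helper name `load_with_fallback`, the `csv_dir` folder, and the `<name>.csv` file layout are all assumptions made for illustration, and the tiny `DemoCredit` file exists only to exercise the fallback path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_with_fallback(name, csv_dir="data"):
    """Try ISLP first; otherwise read <csv_dir>/<name>.csv (hypothetical layout)."""
    try:
        from ISLP import load_data  # may fail if the package is not installed
        return load_data(name)
    except Exception as exc:  # ImportError, or ISLP could not find the dataset
        csv_path = Path(csv_dir) / f"{name}.csv"
        print(f"ISLP load failed ({type(exc).__name__}); reading {csv_path}")
        return pd.read_csv(csv_path)


# Demonstration with a made-up dataset name that ISLP does not ship,
# so the CSV branch is the one that runs.
with tempfile.TemporaryDirectory() as tmp:
    sample = pd.DataFrame({"Income": [50, 80], "Balance": [300, 700]})
    sample.to_csv(Path(tmp) / "DemoCredit.csv", index=False)
    demo = load_with_fallback("DemoCredit", csv_dir=tmp)

print(demo)
```

Catching a broad `Exception` is deliberate here so that any ISLP failure during lab falls through to the CSV; a stricter version might catch only `ImportError`.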

**Grouping Strategy:**
- Form groups of 2-4 students before starting
- Mix students with different programming experience levels
- Encourage collaboration during challenges

**Materials Needed:**
- Lab notebook: `09_wk9_lab.ipynb`
- This TA guidance notebook
- Access to course datasets

## 📚 Key Concepts to Emphasize

1. **Correlation vs Causation**: Constantly remind students that correlation measures association, not causation
2. **Business Interpretation**: Every coefficient and metric should be explained in business terms
3. **Train/Test Split**: This is the most important concept - models must be evaluated on unseen data
4. **Homework Integration**: Challenge results directly feed into Canvas quiz answers

## Part 1 & 2 Teaching Guide: Guided Instruction (30-40 minutes)

### Teaching Approach for Parts 1 & 2
- **Walk through systematically**: Students should follow along and execute code
- **Explain business context**: Connect every concept to real-world applications
- **Encourage questions**: This is the time for clarification before independent work
- **Demonstrate best practices**: Model the workflow students will use in challenges

### Key Teaching Points:
- **Start with business context**: "Marketing managers need to know which advertising channels work best"
- **Emphasize interpretation**: What does correlation of 0.78 vs 0.23 mean for strategy?
- **Visual reinforcement**: Scatterplots help students internalize correlation strength
- **Model building workflow**: Always start with feature matrix X and target vector y
- **Train/test importance**: "Would you trust a model that was only tested on the data it learned from?"
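
The workflow in these teaching points can be sketched end to end on synthetic data. Everything below is illustrative: the column names (`tv_spend`, `newspaper_spend`, `sales`) and the simulated relationship are assumptions, not the Advertising dataset used in the lab.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Simulate one channel that drives sales and one that does not
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "tv_spend": rng.uniform(0, 300, n),
    "newspaper_spend": rng.uniform(0, 100, n),
})
demo["sales"] = 7 + 0.05 * demo["tv_spend"] + rng.normal(0, 2, n)

# Step 1: correlation of each channel with sales
corr_with_sales = demo.corr()["sales"].drop("sales")
print(corr_with_sales.round(2))

# Step 2: feature matrix X (2D, double brackets) and target vector y (1D)
X = demo[["tv_spend"]]
y = demo["sales"]

# Step 3: hold out 30% of rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 4: fit on training rows only, then score on both sets
model = LinearRegression().fit(X_train, y_train)
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"Train R²: {train_r2:.3f} | Test R²: {test_r2:.3f}")
```

The test-set score is the one to quote when asking whether the model can be trusted on new data.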

### Common Student Questions & Responses:
- **Q: "Why is newspaper correlation so low?"**
  - A: "This suggests newspaper advertising might be less effective, but remember - correlation doesn't prove causation."
- **Q: "What's a 'good' correlation?"**
  - A: "It depends on the business context. In marketing, 0.7+ is strong, but in financial markets, even 0.3 can be valuable."
- **Q: Double brackets confusion**
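  - A: "scikit-learn expects the feature matrix X to be two-dimensional. `df[['column']]` returns a one-column DataFrame (2D), while `df['column']` returns a Series (1D), so use double brackets even for a single feature."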
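
If several groups hit the double-bracket problem, a short demonstration like the one below may help. It is a sketch on a made-up four-row DataFrame (the values are invented for illustration) that compares the two selections and shows that fitting on the 1D Series raises an error.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny invented dataset, only for showing shapes
toy = pd.DataFrame({
    "Income": [20, 40, 60, 80],
    "Balance": [150, 310, 450, 610],
})

as_series = toy["Income"]    # single brackets -> 1D Series
as_frame = toy[["Income"]]   # double brackets -> 2D DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# The 2D version fits without complaint
model = LinearRegression().fit(as_frame, toy["Balance"])

# The 1D version is rejected by scikit-learn's input validation
try:
    LinearRegression().fit(as_series, toy["Balance"])
    series_error = None
except ValueError as exc:
    series_error = str(exc)

print("Error when X is a Series:", series_error is not None)
```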

### Transition to Independent Work
After Parts 1 & 2, clearly communicate:
- "Now you'll work independently on challenges that serve as your homework"
- "Record your numerical results carefully - you'll need them for the Canvas quiz"
- "I'll circulate to help, but you need to work through the problems yourselves"

## Part 3 Teaching Strategy: Independent Challenge Work (35-40 minutes)

### 🚨 Critical Teaching Philosophy for Challenges

**DO NOT walk through the challenges step-by-step.** This is their homework, and they need to work through it independently.

### Your Role During Challenges:
1. **Circulate and support**: Walk around the room, check in with groups
2. **Answer specific questions**: Help with syntax errors, conceptual clarifications
3. **Provide strategic hints**: Guide thinking without giving away answers
4. **Demonstrate techniques**: If many groups struggle with the same concept, show it briefly
5. **Encourage collaboration**: Groups should work together and learn from each other

### What NOT to Do:
- ❌ **Don't solve entire challenges for them**
- ❌ **Don't provide complete code solutions**
- ❌ **Don't walk through every step systematically**
- ❌ **Don't give away numerical answers they need for the quiz**

### Strategic Support Approaches:
- **For struggling groups**: "What specific part are you stuck on?" then provide targeted help
- **For syntax issues**: Show the correct syntax pattern, let them apply it
- **For conceptual confusion**: Ask guiding questions to help them think through the problem
- **For interpretation questions**: "What do you think this number means in business terms?"

### Time Management:
- **Challenge 1**: Allow 12 minutes, then briefly check progress
- **Challenge 2**: Allow 12 minutes, provide hints if many groups are stuck
- **Challenge 3**: Allow 11 minutes, may extend if needed
- **Final 5-10 minutes**: Address common issues, Canvas quiz reminders

## Challenge Solutions - FOR TA REFERENCE ONLY

**⚠️ IMPORTANT**: These solutions are for your reference to help students when they're stuck. Do NOT walk through these solutions step by step with the class. Use them to:
- Verify your own understanding
- Help debug student issues
- Provide targeted hints
- Check if student answers are on the right track

### Setup Code (Run This First)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Note: root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, root_mean_squared_error
from ISLP import load_data
import warnings

# Suppress numerical warnings that can occur with large-scale differences in features
warnings.filterwarnings('ignore', category=RuntimeWarning)

# Load datasets
advertising = pd.read_csv("https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/Advertising.csv")
credit = load_data('Credit')
hitters = load_data('Hitters')
college = load_data('College')

print("Datasets loaded successfully!")

## Challenge 1 Solutions: Credit Card Balance Analysis

**Expected Difficulty**: Easy to moderate  
**Critical Parameters**: Must use random_state=123 for reproducible quiz answers  
**Common Issues**: Forgetting to round coefficient to 2 decimal places, using wrong random_state

In [None]:
# Challenge 1, Task 1: Correlation Analysis
numeric_cols = ['Balance', 'Income', 'Limit', 'Age']
correlation_matrix = credit[numeric_cols].corr()

print("=== Task 1: Correlation Analysis ===")
print("Correlations with Balance:")
for var in ['Income', 'Limit', 'Age']:
    corr_val = correlation_matrix.loc['Balance', var]
    print(f"Balance vs {var}: {corr_val:.3f}")

# Answer: Limit has strongest correlation (around 0.862)

In [None]:
# Challenge 1, Task 2: Data Splitting
X = credit[['Income']]
y = credit['Balance']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

print("=== Task 2: Data Splitting ===")
print(f"Training set observations: {len(X_train)}")
print(f"Test set observations: {len(X_test)}")
print(f"Total observations: {len(X_train) + len(X_test)}")

In [None]:
# Challenge 1, Task 3: Model Building
model = LinearRegression()
model.fit(X_train, y_train)

print("=== Task 3: Model Building ===")
print(f"Model successfully fitted on training data")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Income coefficient (raw): {model.coef_[0]:.6f}")

In [None]:
# Challenge 1, Task 4: Coefficient Interpretation
income_coef = round(model.coef_[0], 2)

print("=== Task 4: Coefficient Interpretation ===")
print(f"Income coefficient (rounded to 2 decimal places): {income_coef}")
# Income is recorded in $1,000s (Task 5 passes 115 for $115,000),
# so the coefficient is already the change in balance per $1,000 of income
print(f"\nBusiness interpretation:")
print(f"For every $1,000 increase in income, credit card balance is higher by ${income_coef:.2f} on average")
print(f"\nTo the CEO: 'Higher income customers tend to carry higher balances.'")
print(f"'Each additional $1K in income is associated with ${income_coef:.2f} more in credit card debt.'")

In [None]:
# Challenge 1, Task 5: Prediction
prediction_input = pd.DataFrame({'Income': [115]})
predicted_balance = model.predict(prediction_input)[0]

print("=== Task 5: Prediction ===")
print(f"Predicted balance for income=$115,000: ${predicted_balance:.2f}")
print(f"\nBusiness interpretation:")
print(f"A customer earning $115,000 annually is expected to carry ${predicted_balance:.2f} in credit card debt")

In [None]:
# Challenge 1, Task 6: Model Evaluation
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

train_rmse = root_mean_squared_error(y_train, train_predictions)
test_rmse = root_mean_squared_error(y_test, test_predictions)

print("=== Task 6: Model Evaluation ===")
print(f"Training RMSE: ${train_rmse:.2f}")
print(f"Test RMSE: ${test_rmse:.2f}")
print(f"Difference: ${test_rmse - train_rmse:.2f}")

if abs(test_rmse - train_rmse) < 50:  # Arbitrary threshold
    print("\n✅ Model generalizes well (small difference between train/test RMSE)")
else:
    print("\n⚠️ Potential overfitting concern (large difference between train/test RMSE)")

print(f"\nBusiness interpretation of test RMSE:")
print(f"Our predictions typically miss the actual balance by about ${test_rmse:.2f}")
print(f"RMSE weights large misses more heavily than small ones, so a few big errors can inflate it")

## Challenge 2 Solutions: Baseball Salary Analysis

**Expected Difficulty**: Moderate  
**Critical Parameters**: Must use random_state=456 and include RBI as third predictor  
**Common Issues**: Forgetting to include RBI, using wrong random_state, not rounding coefficients correctly

In [None]:
# Challenge 2, Task 1: Data Cleaning
print("=== Task 1: Data Cleaning ===")
print(f"Original dataset shape: {hitters.shape}")
print(f"Missing salary values: {hitters['Salary'].isna().sum()}")

hitters_clean = hitters.dropna(subset=['Salary'])
players_removed = len(hitters) - len(hitters_clean)

print(f"Clean dataset shape: {hitters_clean.shape}")
print(f"Players removed due to missing salary: {players_removed}")

In [None]:
# Challenge 2, Task 2: Data Splitting
X = hitters_clean[['Years', 'Hits', 'RBI']]
y = hitters_clean['Salary']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=456
)

print("=== Task 2: Data Splitting ===")
print(f"Training set: {len(X_train)} players")
print(f"Test set: {len(X_test)} players")
print(f"Total players: {len(X_train) + len(X_test)}")

In [None]:
# Challenge 2, Task 3: Multiple Regression
model = LinearRegression()
model.fit(X_train, y_train)

print("=== Task 3: Multiple Regression Model ===")
print(f"Model fitted with predictors: {list(X.columns)}")
print(f"\nWhy these variables matter for salary:")
print(f"- Years: Experience and proven track record")
print(f"- Hits: Offensive performance and fan appeal")
print(f"- RBI: Clutch performance and team contribution")

In [None]:
# Challenge 2, Task 4: Coefficient Analysis
years_coef = round(model.coef_[0], 2)
hits_coef = round(model.coef_[1], 2)
rbi_coef = round(model.coef_[2], 2)

print("=== Task 4: Coefficient Analysis ===")
print(f"Years coefficient: {years_coef}")
print(f"Hits coefficient: {hits_coef}")
print(f"RBI coefficient: {rbi_coef}")

# Determine strongest impact
coeffs = {'Years': abs(years_coef), 'Hits': abs(hits_coef), 'RBI': abs(rbi_coef)}
strongest = max(coeffs, key=coeffs.get)
print(f"\nStrongest impact on salary: {strongest}")
print(f"A one-unit change in {strongest.lower()} is associated with the largest change in predicted salary (note: predictors are on different scales)")

In [None]:
# Challenge 2, Task 5: Business Interpretation
print("=== Task 5: Business Interpretation ===")
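print(f"\nTo the team owner:")
# Hitters 'Salary' is recorded in thousands of dollars, so multiply by 1,000 to report dollars
print(f"- Each additional year of experience is associated with about ${years_coef * 1000:,.0f} more in salary")
print(f"- Each additional hit is associated with about ${hits_coef * 1000:,.0f} more in salary")
print(f"- Each additional RBI is associated with about ${rbi_coef * 1000:,.0f} more in salary")
print(f"  (each holding the other two predictors constant)")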

print(f"\nFor evaluating potential signings:")
if abs(years_coef) > abs(hits_coef) and abs(years_coef) > abs(rbi_coef):
    print(f"Focus on experienced players - years of service have the biggest salary impact")
elif abs(hits_coef) > abs(rbi_coef):
    print(f"Focus on consistent hitters - hits have bigger impact than RBIs")
else:
    print(f"Focus on clutch performers - RBIs drive salary more than total hits")

In [None]:
# Challenge 2, Task 6: Model Performance
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)
r2_diff = train_r2 - test_r2

print("=== Task 6: Model Performance ===")
print(f"Training R²: {train_r2:.3f}")
print(f"Test R²: {test_r2:.3f}")
print(f"R² difference: {r2_diff:.3f}")

if r2_diff > 0.1:  # 10% threshold
    print("\n⚠️ Model shows signs of overfitting")
    print("The model may not predict well for new players")
else:
    print("\n✅ Model generalizes reasonably well")
    print("The model should be reliable for predicting salaries of new players")

In [None]:
# Challenge 2, Task 7: Prediction
prediction_input = pd.DataFrame({
    'Years': [10],
    'Hits': [150], 
    'RBI': [75]
})

predicted_salary = model.predict(prediction_input)[0]

print("=== Task 7: Salary Prediction ===")
print(f"Player stats: 10 years experience, 150 hits, 75 RBIs")
print(f"Predicted salary (in $1,000s, as stored in the dataset): {predicted_salary:,.2f}")
print(f"\nTo the GM: This player should expect a salary of approximately ${predicted_salary * 1000:,.0f}")

## Challenge 3 Solutions: College Applications Analysis

**Expected Difficulty**: Moderate to challenging  
**Critical Parameters**: Must use random_state=789 and 75/25 split  
**Common Issues**: Forgetting `drop_first=True`, using wrong random_state, not rounding to 1 decimal

### Challenge 3 Teaching Notes:
- **Quiz focus**: Students need coefficients rounded to 1 decimal place, performance metrics, prediction
- **Common issues**: 
  - Forgetting `drop_first=True` in dummy encoding
  - Using wrong random_state or split ratio
  - Not rounding to 1 decimal place
  - Confusion about interpreting Private_Yes coefficient
  - **RuntimeWarning about matmul**: Students may apply `get_dummies()` to numeric columns - guide them to only dummy encode categorical variables
  - **Numerical warnings**: Large scale differences between features may cause harmless overflow warnings - these can be ignored
- **Key insight**: Private coefficient shows difference between private and public colleges
- **Teaching moment**: Emphasize "holding other factors constant" interpretation for categorical variables
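
The two points above (`drop_first=True` and the "holding other factors constant" reading) can be illustrated on simulated data. This is a sketch under stated assumptions: the data are synthetic, the coefficients 40 and -500 are made up, and `dtype=int` is passed to `get_dummies` only to keep the demo columns numeric; the reference solution does not use it.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic colleges: applications depend on Top10perc and private status
rng = np.random.default_rng(1)
n = 120
toy = pd.DataFrame({
    "Top10perc": rng.uniform(5, 90, n),
    "Private": rng.choice(["Yes", "No"], size=n),
})
toy["Apps"] = (
    1000
    + 40 * toy["Top10perc"]
    - 500 * (toy["Private"] == "Yes")
    + rng.normal(0, 100, n)
)

# Without drop_first there is one column per level (redundant information)
all_levels = pd.get_dummies(toy[["Private"]], dtype=int)
# With drop_first the baseline level ("No") is dropped, leaving Private_Yes
encoded = pd.get_dummies(toy[["Private"]], drop_first=True, dtype=int)
print(list(all_levels.columns), "->", list(encoded.columns))

X = pd.concat([toy[["Top10perc"]], encoded], axis=1)
model = LinearRegression().fit(X, toy["Apps"])

# Two colleges identical except for private status
pair = pd.DataFrame({"Top10perc": [50, 50], "Private_Yes": [1, 0]})
pred_private, pred_public = model.predict(pair)
gap = pred_private - pred_public

print(f"Private_Yes coefficient: {model.coef_[1]:.1f}")
print(f"Prediction gap (private minus public): {gap:.1f}")
```

The prediction gap equals the `Private_Yes` coefficient, which is what "holding other factors constant" means in practice.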

In [None]:
# Challenge 3, Task 1: Data Preparation
# Combine numeric columns with dummy-encoded categorical variable
college_encoded = pd.concat([
    college[['Top10perc', 'Outstate']], 
    pd.get_dummies(college[['Private']], drop_first=True)
], axis=1)

print("=== Task 1: Data Preparation ===")
print(f"Column names after dummy encoding: {list(college_encoded.columns)}")
print(f"\nWhat Private_Yes = 1 represents:")
print(f"Private_Yes = 1 means the college is a private institution")
print(f"Private_Yes = 0 means the college is a public institution")
print(f"\nFirst few rows:")
print(college_encoded.head())

In [None]:
# Challenge 3, Task 2: Data Splitting
X = college_encoded
y = college['Apps']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=789
)

print("=== Task 2: Data Splitting ===")
print(f"Training set: {len(X_train)} colleges")
print(f"Test set: {len(X_test)} colleges")
print(f"Total colleges: {len(X_train) + len(X_test)}")
print(f"Split ratio: {len(X_train)/(len(X_train) + len(X_test)):.2f} train, {len(X_test)/(len(X_train) + len(X_test)):.2f} test")

In [None]:
# Challenge 3, Task 3: Multiple Regression
model = LinearRegression()
model.fit(X_train, y_train)

print("=== Task 3: Multiple Regression Model ===")
print(f"Model fitted with predictors: {list(X.columns)}")
print(f"\nWhy these factors influence applications:")
print(f"- Top10perc: Academic reputation attracts high-achieving students")
print(f"- Outstate: Higher tuition may signal quality but also create barriers")
print(f"- Private_Yes: Private vs public affects cost, size, and prestige")

In [None]:
# Challenge 3, Task 4: Coefficient Analysis
top10_coef = round(model.coef_[0], 1)
outstate_coef = round(model.coef_[1], 1)
private_coef = round(model.coef_[2], 1)

print("=== Task 4: Coefficient Analysis ===")
print(f"Top10perc coefficient: {top10_coef}")
print(f"Outstate coefficient: {outstate_coef}")
print(f"Private_Yes coefficient: {private_coef}")

# Determine the largest positive and the most negative coefficients
coeffs = {'Top10perc': top10_coef, 'Outstate': outstate_coef, 'Private_Yes': private_coef}
positive = {name: value for name, value in coeffs.items() if value > 0}
most_negative = min(coeffs, key=coeffs.get)

if positive:
    largest_positive = max(positive, key=positive.get)
    print(f"\nLargest positive impact: {largest_positive} ({coeffs[largest_positive]})")
else:
    print(f"\nNo factor has a positive coefficient")
if coeffs[most_negative] < 0:
    print(f"Factor that might discourage applications: {most_negative} ({coeffs[most_negative]})")
else:
    print(f"All factors have positive impact on applications")

In [None]:
# Challenge 3, Task 5: Categorical Interpretation
print("=== Task 5: Categorical Interpretation ===")
print(f"Private_Yes coefficient: {private_coef}")

print(f"\nTo a college president:")
if private_coef > 0:
    print(f"Private colleges receive about {private_coef:.1f} MORE applications than public colleges,")
    print(f"holding academic quality (Top10perc) and tuition (Outstate) constant.")
else:
    print(f"Private colleges receive about {abs(private_coef):.1f} FEWER applications than public colleges,")
    print(f"holding academic quality (Top10perc) and tuition (Outstate) constant.")

print(f"\nThis suggests that private/public status is associated with application volume")
print(f"beyond what academic reputation and cost explain (an association, not proof of causation).")

In [None]:
# Challenge 3, Task 6: Model Validation
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)
test_rmse = root_mean_squared_error(y_test, test_predictions)

print("=== Task 6: Model Validation ===")
print(f"Training R²: {train_r2:.3f}")
print(f"Test R²: {test_r2:.3f}")
print(f"Test RMSE: {test_rmse:.1f} applications")

print(f"\nModel explains {test_r2*100:.1f}% of the variation in college applications")
if test_r2 > 0.5:
    print(f"This is a useful level of predictive power for strategic planning")
elif test_r2 > 0.3:
    print(f"This provides moderate predictive power - useful but not highly precise")
else:
    print(f"This provides limited predictive power - other factors are likely important")

In [None]:
# Challenge 3, Task 7: Business Prediction
prediction_input = pd.DataFrame({
    'Top10perc': [75],
    'Outstate': [25000],
    'Private_Yes': [1]  # 1 for private college
})

predicted_apps = model.predict(prediction_input)[0]

print("=== Task 7: Business Prediction ===")
print(f"College profile: Private, 75% from top 10%, $25,000 out-of-state tuition")
print(f"Predicted applications: {predicted_apps:.1f}")

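avg_apps = y_train.mean()  # benchmark: average applications in the training data
print(f"\nConsulting advice:")
print(f"This college should expect approximately {predicted_apps:,.0f} applications.")
print(f"This is {'above' if predicted_apps > avg_apps else 'below'} the training-set average of {avg_apps:,.0f} applications.")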
print(f"\nStrategic recommendations based on the model:")
if top10_coef > 0:
    print(f"- Focus on attracting high-achieving students (increases applications)")
if outstate_coef < 0:
    print(f"- Consider tuition strategy (high tuition may reduce applications)")
if private_coef > 0:
    print(f"- Leverage private institution advantages in marketing")

## 🎯 Wrap-Up Guidance (5 minutes)

### Key Points to Emphasize:
1. **Homework Connection**: "The challenges you just worked on ARE your homework this week"
2. **Canvas Quiz**: "Use your numerical results for the Canvas quiz due Sunday"
3. **Real-world relevance**: "These techniques are used daily in business analytics"
4. **Process importance**: "The train/test split workflow is crucial for any ML project"

### Canvas Quiz Reminders:
- Students need their specific numerical results from each challenge
- Emphasize the importance of using exact parameters specified (random_state values)
- Quiz will include matching, multiple choice, and fill-in-the-blank questions
- Due Sunday before class

### Preview Next Week:
- "Next Tuesday we'll explore classification models - predicting categorical outcomes"
- "We'll learn to predict things like spam vs not spam, click vs not click"
- "The evaluation principles you learned today apply to all machine learning models"

## 🚨 Common Issues & Solutions

### Technical Issues:
1. **ISLP package not loading**: Provide CSV backup files
2. **Random state confusion**: Emphasize that different random states give different results
3. **Dummy encoding errors**: Remind about `drop_first=True`
4. **Coefficient rounding**: Each challenge has different rounding requirements
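
For the random state confusion in particular, a quick demonstration may be more convincing than an explanation. The sketch below splits a toy array of 20 row IDs (an assumption for illustration, not lab data) with repeated and different seeds.

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(20)  # stand-in for the rows of a dataset

# Same seed twice, then a different seed
train_a, test_a = train_test_split(row_ids, test_size=0.3, random_state=123)
train_b, test_b = train_test_split(row_ids, test_size=0.3, random_state=123)
train_c, test_c = train_test_split(row_ids, test_size=0.3, random_state=456)

same_seed_matches = np.array_equal(test_a, test_b)
different_seed_matches = np.array_equal(test_a, test_c)

print("Test rows, seed 123:      ", test_a)
print("Test rows, seed 123 again:", test_b)
print("Test rows, seed 456:      ", test_c)
print("Same seed gives same split:", same_seed_matches)
print("Different seed gives same split:", different_seed_matches)
```

Because the rows in the training set change with the seed, the fitted coefficients and metrics change too, which is why quiz answers only match when the specified `random_state` is used.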

### Conceptual Issues:
1. **Correlation vs causation**: Keep reinforcing this distinction
2. **Overfitting confusion**: Use simple analogies (memorizing vs understanding)
3. **Coefficient interpretation**: Always bring back to business context
4. **Train/test split importance**: "Never trust a model evaluated on its training data"
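
The "memorizing vs understanding" analogy can be made concrete with a small simulation. This is a sketch on synthetic data with assumed sizes (60 rows, 1 real predictor, 35 pure-noise predictors): a model given the noise columns fits the training rows more closely but does worse on held-out rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_rows, n_noise = 60, 35

signal = rng.normal(size=n_rows)                  # the one predictor that matters
noise_features = rng.normal(size=(n_rows, n_noise))  # unrelated to y
y = 3 * signal + rng.normal(size=n_rows)
X = np.column_stack([signal, noise_features])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


def train_test_r2(cols):
    """Fit on the chosen columns and return (train R², test R²)."""
    fitted = LinearRegression().fit(X_train[:, cols], y_train)
    return (
        r2_score(y_train, fitted.predict(X_train[:, cols])),
        r2_score(y_test, fitted.predict(X_test[:, cols])),
    )


simple_train, simple_test = train_test_r2([0])
flexible_train, flexible_test = train_test_r2(list(range(X.shape[1])))

print(f"1 predictor:   train R² = {simple_train:.3f}, test R² = {simple_test:.3f}")
print(f"36 predictors: train R² = {flexible_train:.3f}, test R² = {flexible_test:.3f}")
```

The training score alone would favor the 36-predictor model; only the test score reveals that it has memorized noise.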

### Time Management:
- If behind schedule: Focus on Challenge 1 and 2, Challenge 3 can be completed outside class
- If ahead of schedule: Encourage deeper business interpretation discussions
- Always preserve time for Canvas quiz reminders

## 📋 Post-Lab Checklist

**For TAs:**
- [ ] All groups attempted at least Challenges 1-2
- [ ] Students understand this is their homework
- [ ] Canvas quiz parameters clearly communicated
- [ ] Students know results are due Sunday
- [ ] Note any concepts that need reinforcement next week

**For Students:**
- [ ] Have attempted all three challenges
- [ ] Understand business interpretation of regression coefficients
- [ ] Can explain why train/test splits matter
- [ ] Know their numerical results will be used for Canvas quiz
- [ ] Understand homework is due Sunday

---

**Lab Success Metrics:**
- Students can work independently on regression problems
- Students can interpret results in business terms
- Students understand the homework-lab integration
- Students are prepared for the Canvas quiz on Sunday