# Week 9 Lab: Correlation & Regression Analysis

<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/labs/09_wk9_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this lab, you'll apply correlation analysis and linear regression to real business datasets, following the complete machine learning workflow from data exploration through model evaluation. You'll work with multiple datasets to understand how businesses use these techniques to drive strategic decisions, predict outcomes, and measure model performance.

## 🎯 Learning Objectives
By the end of this lab, you will be able to:
- Calculate and interpret correlation coefficients for business relationships
- Build and evaluate simple and multiple linear regression models using scikit-learn
- Apply proper train/test split methodology for honest model evaluation
- Interpret regression coefficients and evaluation metrics in business contexts

## 📚 This Lab Reinforces
- **Chapter 21: Correlation and Linear Regression Foundations**
- **Chapter 22: Evaluating Regression Models**
- **Week 9 Tuesday Slides: Correlation & Regression Foundations**

## 🕐 Estimated Time & Structure
**Total Time:** 75 minutes  
**Mode:** Group (2-4 students)

- **[0–30 min]** Guided practice with correlation analysis and regression modeling
- **[30–35 min]** Class discussion and Q&A
- **[35–70 min]** Independent group challenges with real business datasets
- **[70–75 min]** Lab wrap-up and homework quiz preparation

You are encouraged to work in small groups of **2–4 students** and complete the lab together.

## 💡 Why This Matters
Regression analysis is one of the most widely used techniques in business analytics. From predicting sales based on advertising spend to understanding factors that drive customer satisfaction, regression provides the foundation for data-driven decision making. This lab will prepare you to build, evaluate, and interpret regression models that businesses actually use to guide strategy and operations.

## Setup
We'll work with multiple datasets today: the Advertising dataset for guided practice, and three ISLP datasets (Credit, Hitters, College) for the independent challenges. These are the same datasets used in your textbook's end-of-chapter exercises. You can read more about the ISLP datasets at [https://islp.readthedocs.io/en/latest/data.html](https://islp.readthedocs.io/en/latest/data.html).

In [None]:
# Required imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Note: root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import (
    r2_score,
    mean_squared_error,
    mean_absolute_error,
    root_mean_squared_error,
    mean_absolute_percentage_error,
)

# Load the Advertising dataset for guided practice
advertising = pd.read_csv("https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/Advertising.csv")

# Load ISLP datasets for challenges
from ISLP import load_data
credit = load_data('Credit')
hitters = load_data('Hitters')
college = load_data('College')

# Quick preview
print("Advertising dataset shape:", advertising.shape)
print("Credit dataset shape:", credit.shape)
print("Hitters dataset shape:", hitters.shape)
print("College dataset shape:", college.shape)

## Part 1 — Correlation Analysis & Interpretation (10 minutes)

We'll start by reinforcing correlation analysis using the Advertising dataset - the same data from Tuesday's lecture. This will help you practice measuring relationships between business variables and interpreting correlation coefficients.

### Understanding Business Relationships

Correlation helps us answer questions like: "Which advertising channels have the strongest relationship with sales?"

**📋 Step-by-step instructions:**
1. Explore the Advertising dataset structure
2. Calculate correlation coefficients between all variables
3. Interpret the relationships in business terms
4. Create visualizations to support your findings

In [None]:
# Explore the advertising dataset
print("Advertising Dataset Overview:")
print(advertising.head())
print("\nDataset Info:")
print(advertising.info())
print("\nSummary Statistics:")
print(advertising.describe())

In [None]:
# Calculate correlation matrix
correlation_matrix = advertising.corr()
print("Correlation Matrix:")
print(correlation_matrix)

# Focus on sales correlations
print("\nCorrelations with Sales:")
sales_correlations = correlation_matrix['sales'].drop('sales').sort_values(ascending=False)
print(sales_correlations)
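
To make the numbers in the correlation matrix less of a black box, here is a minimal sketch of what `.corr()` computes by default (Pearson's r). It uses synthetic data, so the column names `spend` and `revenue` and all values are assumptions for illustration and will not match the Advertising results.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical values, not the Advertising dataset)
rng = np.random.default_rng(0)
spend = rng.uniform(0, 300, size=200)
revenue = 7 + 0.05 * spend + rng.normal(0, 3, size=200)
toy = pd.DataFrame({"spend": spend, "revenue": revenue})

# Pearson's r: co-variation of the two variables divided by
# the product of their individual variations
dx = toy["spend"] - toy["spend"].mean()
dy = toy["revenue"] - toy["revenue"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method by default
r_pandas = toy.corr().loc["spend", "revenue"]

print(f"manual r: {r_manual:.4f}")
print(f"pandas r: {r_pandas:.4f}")
```

The two values should agree, which confirms that the correlation matrix is just this calculation repeated for every pair of numeric columns.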

### 🧠 Your Turn — Visualizing Relationships

Create scatterplots to visualize the relationships between each advertising channel and sales.

**Tasks:**
- Create a subplot with 3 scatterplots (TV vs sales, radio vs sales, newspaper vs sales)
- Add appropriate titles and labels
- Based on the visualizations, rank the advertising channels by relationship strength

💡 **Hint:** Use `plt.subplots(1, 3, figsize=(15, 4))` to create side-by-side plots

In [None]:
# Your code here


### ✅ Check Your Understanding

Based on your correlation analysis and visualizations:

**Questions to consider:**
- Which advertising channel has the strongest linear relationship with sales?
- What does a correlation of 0.78 vs 0.23 tell you about business strategy?

**Expected Result:** TV should show the strongest correlation (~0.78), followed by radio (~0.58), with newspaper showing a very weak relationship (~0.23).

## Part 2 — Linear Regression Modeling (20 minutes)

Now we'll move from measuring relationships to making predictions using linear regression. You'll build both simple and multiple regression models and learn to interpret the results for business decision-making.

### Simple Linear Regression

Let's start with a simple regression model predicting sales from TV advertising spend.

**Example:** A marketing manager wants to know: "If I spend $50,000 on TV advertising, what sales can I expect?"

In [None]:
# Simple regression: Sales predicted by TV advertising
X_simple = advertising[['TV']]  # Feature matrix (note double brackets)
y = advertising['sales']  # Target variable

# Fit the model
simple_model = LinearRegression()
simple_model.fit(X_simple, y)

Now that we've fit the model, let's interpret the coefficients (printed in the next cell). In this dataset, advertising budgets are recorded in thousands of dollars and sales in thousands of units. The TV coefficient of roughly 0.048 therefore means that each additional $1,000 spent on TV advertising is associated with about 48 more units sold. The intercept of roughly 7.03 means that with no TV spend, the model predicts baseline sales of about 7,030 units.

In [None]:
# Extract results
print("Simple Regression Results:")
print(f"Intercept: {simple_model.intercept_:.3f}")
print(f"TV Coefficient: {simple_model.coef_[0]:.3f}")
print(f"\nEquation: Sales = {simple_model.intercept_:.3f} + {simple_model.coef_[0]:.3f} × TV")

We can also make predictions. For example, if the marketing manager spends $50,000 on TV advertising, we can predict sales using the regression equation.

In [None]:
# Make a prediction
tv_spend = pd.DataFrame({'TV': [50]})
predicted_sales = simple_model.predict(tv_spend)
print(f"\nPrediction: $50k TV spend → {predicted_sales[0]:.2f} thousand units in sales")
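
The prediction above is simply the fitted equation evaluated at a new input. The following sketch checks that by hand on synthetic stand-in data; the columns `x` and `y` and the numbers used to generate them are assumptions for illustration, not the Advertising data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical), in the spirit of TV vs. sales
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": rng.uniform(0, 300, size=100)})
toy["y"] = 7 + 0.05 * toy["x"] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(toy[["x"]], toy["y"])

# Evaluate the fitted line by hand at x = 50 ...
x_new = 50
by_hand = model.intercept_ + model.coef_[0] * x_new

# ... and compare with .predict(), which expects a 2-D input
# whose column name matches the one used in .fit()
by_predict = model.predict(pd.DataFrame({"x": [x_new]}))[0]

print(f"by hand:   {by_hand:.4f}")
print(f"predict(): {by_predict:.4f}")
```

Both lines should print the same value, so `.predict()` can be read as "intercept plus coefficient times input."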

However, simple models can miss important factors. Let's expand to multiple regression and train a model that includes all three advertising channels.

In [None]:
# Multiple regression: Sales predicted by all advertising channels
X_multiple = advertising[['TV', 'radio', 'newspaper']]

# Fit the model
multiple_model = LinearRegression()
multiple_model.fit(X_multiple, y)

If we examine the coefficients of the multiple regression model (printed in the next cell), we find that TV and radio advertising are both positively associated with sales, while the newspaper coefficient is close to zero once the other two channels are accounted for. This suggests that the marketing manager should prioritize TV and radio when allocating advertising spend.

In [None]:
# Extract results
print("Multiple Regression Results:")
print(f"Intercept: {multiple_model.intercept_:.3f}")
for feature, coef in zip(X_multiple.columns, multiple_model.coef_):
    print(f"{feature} Coefficient: {coef:.3f}")

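Comparing the R² values shows that the multiple regression model explains substantially more of the variance in sales than the simple model. Note that both values are computed on the same data used to fit the models, so they describe in-sample fit rather than performance on new data.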

In [None]:
# Compare R² values
simple_r2 = simple_model.score(X_simple, y)
multiple_r2 = multiple_model.score(X_multiple, y)
print(f"\nSimple Model R²: {simple_r2:.3f}")
print(f"Multiple Model R²: {multiple_r2:.3f}")
print(f"Improvement: {multiple_r2 - simple_r2:.3f}")
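
One caution when comparing R² this way: for ordinary least squares, R² measured on the fitting data cannot go down when a predictor is added, even a useless one. The sketch below illustrates this with synthetic data; the `signal` and `noise` columns are hypothetical and constructed so that `noise` has no real relationship with the target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends on `signal` only
rng = np.random.default_rng(2)
n = 200
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),  # unrelated to y by construction
})
y_toy = 3 + 2 * toy["signal"] + rng.normal(size=n)

# In-sample R² with one predictor
X_one = toy[["signal"]]
r2_one = LinearRegression().fit(X_one, y_toy).score(X_one, y_toy)

# In-sample R² after adding the irrelevant predictor
X_two = toy[["signal", "noise"]]
r2_two = LinearRegression().fit(X_two, y_toy).score(X_two, y_toy)

print(f"R² with signal only:    {r2_one:.4f}")
print(f"R² with signal + noise: {r2_two:.4f}")
```

Because in-sample R² rewards any added column, a higher value alone does not prove the larger model is better, which is one reason to evaluate on held-out data.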

### 🧪 Practice Exercise — Model Evaluation with Train/Test Split

**Business Scenario:** The marketing team wants to know how well their advertising prediction model will perform on future data. They need honest performance estimates before making budget decisions.

**Your Task:** Implement proper train/test evaluation for the multiple regression model.

**Step-by-step approach:**
1. Split the data into training (70%) and test (30%) sets using `random_state=42`
2. Train the multiple regression model on training data only
3. Calculate R², RMSE, and MAE for both training and test sets
4. Interpret the results: Is the model overfitting, underfitting, or generalizing well?

In [None]:
# Your solution here


## Class Discussion/Q&A (5 minutes)

**Discussion prompts:**
- How do you interpret the difference between training and test performance?
- Which evaluation metric (R², RMSE, MAE) would be most useful for marketing budget planning?
- What business questions can regression analysis help answer?

**Common blockers and clarifications:**
- Remember that correlation ≠ causation - regression shows associations, not causal relationships
- Train/test splits simulate real-world deployment where models make predictions on unseen data
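
To ground the discussion, here is a rough sketch of comparing training and test metrics. It deliberately uses synthetic data and a different split ratio and seed than the lab exercises, so the features `a` and `b`, the helper `report`, and every number here are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical features and coefficients)
rng = np.random.default_rng(3)
n = 300
X_toy = pd.DataFrame({"a": rng.uniform(0, 100, n), "b": rng.uniform(0, 50, n)})
y_toy = 5 + 0.3 * X_toy["a"] + 0.8 * X_toy["b"] + rng.normal(0, 4, n)

# Hold out 20% of rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

def report(label, X, y):
    """Print and return R², RMSE, and MAE for one data split."""
    pred = model.predict(X)
    r2 = r2_score(y, pred)
    rmse = np.sqrt(mean_squared_error(y, pred))  # also works on older scikit-learn
    mae = mean_absolute_error(y, pred)
    print(f"{label:>5}: R²={r2:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
    return r2, rmse, mae

train_metrics = report("train", X_train, y_train)
test_metrics = report("test", X_test, y_test)
```

Similar training and test values suggest the model generalizes; a test error far above the training error is the usual sign of overfitting. RMSE and MAE are both in the units of the target, which often makes them easier to explain to a business audience than R².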

## Part 3 — Independent Group Challenges (35 minutes)

For the next three challenges, you'll work through complete regression workflows that directly prepare you for this week's homework quiz. The specific numerical results you obtain from these challenges will be used to answer questions on the Canvas quiz due Sunday.

* Apart from a short encoding snippet in Challenge 3, you will not be given starter code to work with; you need to start from a blank cell.
* **DO NOT USE AI** to generate code for you. This is a group exercise, and you should be writing the code together.
* Work with your group to write the code.
* Feel free to ask questions or seek help from the instructor.
* We'll stop and walk through each challenge together after each time block.
* **Important**: Pay close attention to the specific parameters (random_state values, train/test splits) as these will generate the exact numerical answers needed for your homework quiz.

### Challenge 1 — Credit Card Balance Analysis (12 minutes)

**Business Question:** A regional bank wants to understand the factors that drive customers' credit card balances to improve their risk assessment models.

**Your Task:** Using the Credit dataset, complete the following specific steps:

1. **Correlation Analysis**: Calculate correlations between `Balance` and these variables: `Income`, `Limit`, `Age`. Based on the correlation values, which variable seems to be most strongly (and positively) correlated with the customer's credit card balance?

2. **Data Splitting**: Split data into train/test sets (70/30 split, random_state=123). How many observations are in your train vs. test sets?

3. **Model Building**: Build a simple regression model predicting `Balance` from `Income` using training data. Be sure to fit the model on the training data!

4. **Coefficient Interpretation**: Extract the Income coefficient (round to 2 decimal places). Based on this coefficient, how would you explain the relationship between a customer's income and their credit card balance to the bank CEO?

5. **Prediction**: Use your model to predict the credit card balance for someone with income = 115 (representing $115,000). What is the expected credit card balance for customers with this income level?

6. **Model Evaluation**: Calculate the RMSE on the training and test set. Based on these values, does it appear the model generalizes well or does under/overfitting appear to be an issue? Based on the test set RMSE, how would you interpret this metric in business context?

**Context:** These specific results will be used in your homework quiz, so record your numerical answers carefully.

In [None]:
# Your turn: write code here to analyze credit card balance relationships


### Challenge 2 — Baseball Salary Analysis (12 minutes)

**Business Question:** A baseball team's general manager wants to understand what drives player salaries to make better contract decisions.

**Your Task:** Complete these specific steps:

1. **Data Cleaning**: Remove rows with missing salary data using `hitters.dropna(subset=['Salary'])`. How many players were removed due to missing salary information?

2. **Data Splitting**: Split into train/test sets (70/30, random_state=456). How many players are in your training vs. test sets?

3. **Multiple Regression**: Build a model predicting `Salary` from `Years`, `Hits`, and `RBI`. Why do you think these three variables might be important for determining a player's salary?

4. **Coefficient Analysis**: Extract coefficients for all three predictors (round to 2 decimal places). Which factor appears to have the strongest impact on salary? Is this what you expected?

5. **Business Interpretation**: Based on your coefficients, how would you explain to the team's owner what drives player salaries? Which statistic should they focus on when evaluating potential signings?

6. **Model Performance**: Calculate train and test R² values. Based on the difference between these values, does the model show signs of overfitting? What does this mean for using the model to predict salaries for new players?

7. **Prediction**: Predict salary for a player with: Years=10, Hits=150, RBI=75. How much should the team expect to pay a player with these statistics?

**Context:** Record your specific numerical results for the homework quiz.

In [None]:
# Your turn: write code here to predict baseball salaries


### Challenge 3 — College Applications Analysis (11 minutes)

**Business Question:** An education consultant needs to advise colleges on factors that drive application numbers.

**Your Task:** Complete these specific steps:

1. **Data Preparation**: Create dummy variables for `Private` using `pd.get_dummies()` with `drop_first=True`. After encoding, what are the column names in your feature matrix? What does `Private_Yes = 1` represent?

2. **Data Splitting**: Split into train/test sets (75/25, random_state=789). How many colleges are in your training vs. test sets?

3. **Multiple Regression**: Build a model predicting `Apps` using `Top10perc`, `Outstate`, and `Private_Yes`. Why might these three factors influence the number of applications a college receives?

4. **Coefficient Analysis**: Extract all coefficients (round to 1 decimal place). Which factor has the largest positive impact on applications? Which factor might discourage applications?

5. **Categorical Interpretation**: Focus on the `Private_Yes` coefficient. How would you explain to a college president the difference in applications between private and public institutions? Do private colleges receive more or fewer applications than public colleges, holding other factors constant?

6. **Model Validation**: Calculate test set RMSE and R² values. Based on the R² value, how much of the variation in college applications does your model explain? Is this a useful level of predictive power?

7. **Business Prediction**: Predict applications for a private college with Top10perc=75 and Outstate=25000. If you were consulting for this college, what would you tell them about their expected application volume compared to similar institutions?

**Context:** These specific numerical results will appear on your homework quiz.

In [None]:
# Your turn: write code here to analyze college applications
# 💡 Tip: When working with mixed data types (numeric + categorical), it's best to handle them separately 
# and then combine them with `pd.concat()`. Here's starter code for Challenge 3, Task 1:
college_encoded = pd.concat([
    college[['Top10perc', 'Outstate']], 
    pd.get_dummies(college[['Private']], drop_first=True)
], axis=1)

## 🎓 Lab Wrap-Up & Reflection

### ✅ What You Accomplished
In this lab, you practiced:
- Calculating and interpreting correlation coefficients for business relationships
- Building simple and multiple linear regression models with scikit-learn
- Implementing proper train/test evaluation methodology
- Interpreting regression coefficients and evaluation metrics in business contexts

### 🤔 Reflection Questions
Take 2-3 minutes to consider:
- What was the most surprising relationship you discovered in the data?
- How would you explain the importance of train/test splits to a business manager?
- Which evaluation metric (R², RMSE, MAE) do you think is most useful for business decisions?

### 🔗 Connection to Course Goals
This lab bridges exploratory data analysis and predictive modeling - core skills for any business analyst. You've learned to move from "what happened?" (correlation) to "what will happen?" (regression) to "how confident should we be?" (evaluation).

### 📋 Next Steps
- **Homework:** Complete the Canvas quiz using your numerical results from today's challenges (due Sunday)
- **Next Tuesday:** Classification models - predicting categorical outcomes (spam vs not spam, click vs not click)
- **Additional Practice:** Explore additional ISLP datasets (https://islp.readthedocs.io/en/latest/data.html) and practice applying linear regression

---
**💾 Save your work** and be ready to share your findings and business recommendations. Your Canvas quiz will ask specific questions about the numerical results from today's three challenges.

## 🚨 Troubleshooting & Common Issues

**Issue 1:** "ValueError: Input contains NaN values"
- **Solution:** Use `.dropna()` or `.fillna()` to handle missing values before fitting models

**Issue 2:** "X has 1 features, but LinearRegression is expecting N features as input"
- **Solution:** Select single features with double brackets (`df[['column']]`) so the input is 2-D, and pass the same columns, in the same order, to `.predict()` that you used in `.fit()`

**Issue 3:** Dummy encoding not working properly
- **Solution:** Use `pd.get_dummies(data, drop_first=True)` to avoid multicollinearity

**Issue 4:** "RuntimeWarning: invalid value encountered in matmul" in Challenge 3
- **Problem:** Applying `pd.get_dummies()` to both numeric and categorical columns together
- **Solution:** Only apply dummy encoding to categorical variables. For Challenge 3, use:
  ```python
  # Correct approach: separate numeric and categorical processing
  college_encoded = pd.concat([
      college[['Top10perc', 'Outstate']], 
      pd.get_dummies(college[['Private']], drop_first=True)
  ], axis=1)
  ```
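
If it is unclear what `drop_first=True` does to the column names, a small toy example may help. The `price` and `plan` columns below are hypothetical and unrelated to the College dataset.

```python
import pandas as pd

# Toy data with one numeric and one categorical column (hypothetical)
toy = pd.DataFrame({
    "price": [10.0, 25.0, 12.0, 30.0],
    "plan": ["Basic", "Premium", "Basic", "Premium"],
})

# Encode only the categorical column; the first category (alphabetically,
# "Basic") is dropped and becomes the baseline
dummies = pd.get_dummies(toy[["plan"]], drop_first=True)
print(list(dummies.columns))

# Recombine with the untouched numeric column
features = pd.concat([toy[["price"]], dummies], axis=1)
print(list(features.columns))
```

The remaining dummy column is named `<column>_<category>`, and a value of 1 (or `True`) means the row belongs to that category rather than the dropped baseline.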

**General Debugging Tips:**
- Always check data shapes with `.shape` before fitting models
- Print intermediate results to verify your data transformations
- Use the same random_state value specified in challenges for reproducible results
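
The shape and missing-value checks suggested above can be sketched as follows. The toy DataFrame and its `feature` and `target` columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values (hypothetical)
toy = pd.DataFrame({
    "feature": [1.0, 2.0, np.nan, 4.0],
    "target": [10.0, np.nan, 30.0, 40.0],
})

# Single brackets give a 1-D Series; double brackets give a 2-D DataFrame,
# which is what scikit-learn expects for the feature matrix
print(toy["feature"].shape)
print(toy[["feature"]].shape)

# Count missing values per column before fitting
print(toy.isna().sum())

# Keep only rows that are complete in the columns the model will use
clean = toy.dropna(subset=["feature", "target"])
print(clean.shape)
```

Running checks like these right before `.fit()` tends to surface both the NaN error and the feature-count error early.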