<a href="https://colab.research.google.com/github/bradleyboehmke/uc-bana-4080/blob/main/example-notebooks/21_correlation_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Correlation and Linear Regression Foundations

This notebook contains code examples from the **Correlation and Linear Regression Foundations** chapter (Chapter 21) of the BANA 4080 textbook. Follow along to practice measuring relationships between variables and building predictive models using correlation and linear regression.

## 📚 Chapter Overview

In business, we rarely care about individual variables in isolation. Instead, we ask questions like: *Does increasing marketing spend actually increase sales?* This chapter introduces tools for analyzing relationships between variables through correlation and linear regression.

## 🎯 What You'll Practice

- Calculate and interpret correlation coefficients between variables
- Build and interpret simple linear regression models with one predictor
- Extend regression models to include multiple numeric predictors
- Handle categorical predictors using dummy encoding
- Make predictions and extract business insights from regression models

## 💡 How to Use This Notebook

1. **Read the chapter first** - This notebook supplements the textbook; it does not replace it
2. **Run cells sequentially** - Code builds on previous examples
3. **Experiment freely** - Modify code to test your understanding
4. **Practice variations** - Try different approaches to reinforce learning

**Note:** Some code cells are collapsed to keep the notebook focused on key concepts. If any code seems overwhelming, don't worry about the implementation details - focus on understanding the results and business insights!

## Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# You may need to install the ISLP package if you haven't already:
# !pip install ISLP
# Don't worry about the details of this import if it seems complex!
try:
    from ISLP import load_data
    print("ISLP package loaded successfully!")
except ImportError:
    print("ISLP package not found. You may need to install it using: !pip install ISLP")

# Load the Advertising dataset directly from GitHub
advertising_url = "https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/Advertising.csv"
advertising = pd.read_csv(advertising_url)
print("\nAdvertising dataset loaded from GitHub:")
advertising.head()

## Understanding Correlation

Correlation measures how strongly two variables move together in a linear relationship. Let's start by exploring relationships in our data visually, then quantify them with correlation coefficients.
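
To make the idea of "moving together" concrete, here is a minimal sketch that works out Pearson's correlation coefficient by hand and compares it with pandas. The `toy` DataFrame and its values are made up for this illustration and are not part of the chapter's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 5.0, 4.0, 5.0],
})

# Deviation of each value from its column mean
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()

# Pearson's r: shared variation scaled by the spread of each variable
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Series.corr() uses the Pearson method by default
r_pandas = toy["x"].corr(toy["y"])

print(f"Manual r: {r_manual:.3f}")
print(f"pandas r: {r_pandas:.3f}")
```

Both values should agree. The result always falls between -1 and 1, which is what makes correlations comparable across variables measured in different units.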

In [None]:
# Create example dataset: advertising spend vs. weekly sales
data = pd.DataFrame({
    "ad_spend": [400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300],
    "weekly_sales": [4200, 4400, 4100, 4800, 5600, 5200, 4900, 5500, 5300, 5900, 5700, 6300, 6900, 6200, 5800, 6600, 7100, 6800, 7300, 7800]
})

# First look at the relationship visually
plt.figure(figsize=(8, 6))
plt.scatter(data["ad_spend"], data["weekly_sales"])
plt.xlabel("Advertising Spend ($)")
plt.ylabel("Weekly Sales")
plt.title("Ad Spend vs. Weekly Sales")
plt.grid(True, alpha=0.3)
plt.show()

print("What pattern do you see in the relationship?")

In [None]:
# Now compute the correlation to quantify the relationship
correlation_matrix = data.corr()
print("Correlation Matrix:")
print(correlation_matrix)

# Extract just the correlation between ad_spend and weekly_sales
correlation_value = data["ad_spend"].corr(data["weekly_sales"])
print(f"\nCorrelation between ad_spend and weekly_sales: {correlation_value:.3f}")
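# Label strength and direction (the 0.3 / 0.7 cutoffs are rules of thumb)
strength = "strong" if abs(correlation_value) > 0.7 else "moderate" if abs(correlation_value) > 0.3 else "weak"
direction = "positive" if correlation_value > 0 else "negative"
print(f"This indicates a {strength} {direction} linear relationship.")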

### 🏃‍♂️ Try It Yourself

Using the Advertising dataset, create scatterplots between `sales` and each advertising channel (`TV`, `radio`, `newspaper`). Then compute correlations and identify which channel has the strongest relationship with sales.

In [None]:
# Your code here - create scatterplots and compute correlations
# Hint: Use plt.subplot() to create multiple plots, or create separate plots
# Use advertising.corr() to compute all correlations at once


## Simple Linear Regression

While correlation tells us variables move together, regression provides an equation to predict one variable from another. Let's build our first regression model.
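
As a rough sketch of where that equation comes from, the block below computes the least-squares slope and intercept for a single predictor directly from their textbook formulas, then checks them against `np.polyfit`. The arrays `x` and `y` are hypothetical values chosen for the illustration. It also shows how the slope relates to the correlation coefficient.

```python
import numpy as np

# Hypothetical toy data, used only for this illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the means
dx = x - x.mean()
dy = y - y.mean()

# Least-squares estimates for y = intercept + slope * x
slope = (dx * dy).sum() / (dx ** 2).sum()
intercept = y.mean() - slope * x.mean()

# np.polyfit with deg=1 returns the coefficients as [slope, intercept]
check_slope, check_intercept = np.polyfit(x, y, deg=1)

# The slope is the correlation rescaled by the two standard deviations
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

print(f"Formula: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"polyfit: slope={check_slope:.3f}, intercept={check_intercept:.3f}")
print(f"Slope from correlation: {slope_from_r:.3f}")
```

The last line is the link between the two tools: correlation is unitless, while the regression slope carries the units of `y` per unit of `x`.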

In [None]:
# Prepare the data for regression
X = data[["ad_spend"]]  # Feature matrix (note the double brackets)
y = data["weekly_sales"]  # Target variable

# Fit the regression model
model = LinearRegression()
model.fit(X, y)

# Extract key model components
intercept = model.intercept_
coefficient = model.coef_[0]

print(f"Regression Results:")
print(f"Intercept: {intercept:.2f}")
print(f"Ad Spend Coefficient: {coefficient:.2f}")
print(f"\nEquation: Weekly Sales = {intercept:.2f} + {coefficient:.2f} × Ad Spend")

print(f"\nInterpretation:")
print(f"- When ad spend is $0, expected weekly sales = ${intercept:.0f}")
print(f"- For every $1 increase in ad spend, weekly sales increase by ${coefficient:.2f}")

In [None]:
# Visualize the regression line
plt.figure(figsize=(10, 6))
plt.scatter(data["ad_spend"], data["weekly_sales"], alpha=0.7, label="Data points")
plt.plot(data["ad_spend"], model.predict(X), color="red", linewidth=2, label="Regression line")

# Add a prediction example
# (pass a DataFrame so the column name matches the one used in fit)
prediction_x = 1500
prediction_y = model.predict(pd.DataFrame({"ad_spend": [prediction_x]}))[0]
plt.scatter(prediction_x, prediction_y, color="orange", s=100, zorder=5, 
           label=f"Prediction: ${prediction_x} → ${prediction_y:.0f}")

plt.xlabel("Advertising Spend ($)")
plt.ylabel("Weekly Sales")
plt.title("Simple Linear Regression: Ad Spend vs Weekly Sales")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Make predictions with our model
# Scenario 1: What if we spend $1,500 on advertising?
new_ad_spend = pd.DataFrame({"ad_spend": [1500]})
predicted_sales = model.predict(new_ad_spend)
print(f"Prediction: If we spend $1,500 on advertising, we expect ${predicted_sales[0]:.0f} in weekly sales")

# Scenario 2: What about $2,500?
new_ad_spend2 = pd.DataFrame({"ad_spend": [2500]})
predicted_sales2 = model.predict(new_ad_spend2)
print(f"Prediction: If we spend $2,500 on advertising, we expect ${predicted_sales2[0]:.0f} in weekly sales")

print(f"\nNote: The second prediction extends beyond our data range (${data['ad_spend'].max()}), so be cautious about its reliability!")

### 🏃‍♂️ Try It Yourself

Using the Advertising dataset, fit a simple regression model to predict `sales` from `TV` advertising spend. Extract the intercept and coefficient, interpret them in business terms, and make a prediction for $50k in TV advertising.

In [None]:
# Your code here - fit a model using TV to predict sales
# Remember: X should be advertising[["TV"]] and y should be advertising["sales"]


## Multiple Linear Regression

Business outcomes are rarely driven by just one factor. Multiple regression lets us include several predictors at once and estimate each one's effect while holding the others constant.
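
The sketch below illustrates why this matters when predictors are correlated with each other. It uses simulated, noise-free data with known coefficients and fits the models with plain NumPy least squares instead of scikit-learn. The helper `fit_ols`, the variables `x1` and `x2`, and the "true" coefficients are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors: x2 is deliberately correlated with x1
x1 = rng.uniform(0, 10, n)
x2 = 0.5 * x1 + rng.normal(0, 1, n)

# Outcome built from known coefficients, with no noise added
y = 3.0 + 2.0 * x1 + 4.0 * x2


def fit_ols(columns, target):
    """Return [intercept, coef_1, ...] from ordinary least squares."""
    design = np.column_stack([np.ones(len(target))] + list(columns))
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs


simple = fit_ols([x1], y)        # x1 only
multiple = fit_ols([x1, x2], y)  # x1 and x2 together

print(f"Simple regression slope for x1:   {simple[1]:.2f}")
print(f"Multiple regression slope for x1: {multiple[1]:.2f}")
print(f"Multiple regression slope for x2: {multiple[2]:.2f}")
```

With only `x1` in the model, its slope also absorbs part of the effect of `x2`, so it lands well above 2. Once both predictors are included, the fit recovers the coefficients used to build `y`.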

In [None]:
# Create data with multiple predictors
data2 = pd.DataFrame({
    "ad_spend": [400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300],
    "num_stores": [3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9],
    "weekly_sales": [4200, 4400, 4100, 4800, 5600, 5200, 4900, 5500, 5300, 5900, 5700, 6300, 6900, 6200, 5800, 6600, 7100, 6800, 7300, 7800]
})

print("Data with multiple predictors:")
print(data2.head(10))

In [None]:
# Fit multiple regression model
X2 = data2[["ad_spend", "num_stores"]]  # Multiple features
y2 = data2["weekly_sales"]

model2 = LinearRegression()
model2.fit(X2, y2)

# Extract model components
intercept = model2.intercept_
ad_spend_coef = model2.coef_[0]
num_stores_coef = model2.coef_[1]

print(f"Multiple Regression Results:")
print(f"Intercept: {intercept:.2f}")
print(f"Ad Spend Coefficient: {ad_spend_coef:.2f}")
print(f"Num Stores Coefficient: {num_stores_coef:.2f}")

print(f"\nEquation: Weekly Sales = {intercept:.2f} + {ad_spend_coef:.2f} × Ad Spend + {num_stores_coef:.2f} × Num Stores")

print(f"\nInterpretation:")
print(f"- Holding num_stores constant, each $1 increase in ad_spend increases sales by ${ad_spend_coef:.2f}")
print(f"- Holding ad_spend constant, each additional store {'increases' if num_stores_coef > 0 else 'decreases'} sales by ${abs(num_stores_coef):.2f}")

In [None]:
# Make predictions with multiple features
print("Multiple Regression Predictions:")
print("\nScenario 1: $1500 ad spend, 5 stores")
scenario1 = pd.DataFrame({"ad_spend": [1500], "num_stores": [5]})
pred1 = model2.predict(scenario1)
print(f"Predicted sales: ${pred1[0]:.0f}")

print("\nScenario 2: $1500 ad spend, 7 stores") 
scenario2 = pd.DataFrame({"ad_spend": [1500], "num_stores": [7]})
pred2 = model2.predict(scenario2)
print(f"Predicted sales: ${pred2[0]:.0f}")

print(f"\nEffect of 2 additional stores: ${pred2[0] - pred1[0]:.0f} change in weekly sales")
print(f"This equals 2 × {num_stores_coef:.2f} = {2 * num_stores_coef:.0f}")

### 🏃‍♂️ Try It Yourself

Use the Advertising dataset to predict `sales` using all three advertising channels: `TV`, `radio`, and `newspaper`. Extract and interpret each coefficient, then write out the complete prediction equation.

In [None]:
# Your code here - fit a model using TV, radio, and newspaper to predict sales
# X should be advertising[["TV", "radio", "newspaper"]]


## Categorical Predictors with Dummy Encoding

Many business factors are categorical (like region, customer type, or product category). We need to convert these to numbers using dummy encoding before including them in regression models.
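
Here is a small sketch of what dummy encoding produces, using a hypothetical `stores` DataFrame with three regions (the data and the "North" category are made up for illustration). It compares the full set of dummy columns with the reduced set from `drop_first=True`.

```python
import pandas as pd

# Hypothetical data with a three-level categorical variable
stores = pd.DataFrame({
    "ad_spend": [500, 800, 1200, 1500, 900, 1100],
    "region": ["East", "West", "North", "East", "North", "West"],
})

# One 0/1 column per category
full = pd.get_dummies(stores, columns=["region"], dtype=int)

# Drop the first category (alphabetically "East") to use it as the baseline
reduced = pd.get_dummies(stores, columns=["region"], drop_first=True, dtype=int)

print(full)
print(reduced)

# The full set of dummy columns sums to 1 in every row
dummy_cols = ["region_East", "region_North", "region_West"]
print(full[dummy_cols].sum(axis=1).tolist())
```

Because the full set of dummies always sums to 1, it carries the same information as the model's intercept. Dropping one category removes that redundancy, and each remaining coefficient is then read as a difference from the dropped baseline.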

In [None]:
# Don't worry about the details of this data generation code!
# Focus on understanding the results and the dummy encoding process
np.random.seed(123)  # For reproducible results

# Create data with regional differences
# East region: lower baseline sales
east_ad_spend = np.linspace(500, 2000, 25)
east_base_sales = 3000 + 1.8 * east_ad_spend
east_noise = np.random.normal(0, 200, 25)
east_sales = east_base_sales + east_noise

# West region: higher baseline sales
west_ad_spend = np.linspace(500, 2000, 25)
west_base_sales = 3800 + 1.8 * west_ad_spend  # Same slope, higher baseline
west_noise = np.random.normal(0, 200, 25)
west_sales = west_base_sales + west_noise

# Combine into single DataFrame
data3 = pd.DataFrame({
    "ad_spend": np.concatenate([east_ad_spend, west_ad_spend]),
    "region": ["East"] * 25 + ["West"] * 25,
    "weekly_sales": np.concatenate([east_sales, west_sales])
})

print("Data with categorical predictor (region):")
print(data3.head(10))
print(f"\nRegion counts:")
print(data3["region"].value_counts())

In [None]:
# Create dummy variables using pandas
# drop_first=True means we drop one category to avoid the "dummy variable trap"
X3_encoded = pd.get_dummies(data3[["ad_spend", "region"]], drop_first=True)
print("Data after dummy encoding:")
print(X3_encoded.head(10))

print("\nNotice how:")
print("- 'region_West' = True (1) for West region stores")
print("- 'region_West' = False (0) for East region stores")
print("- East becomes the 'baseline' category")

In [None]:
# Fit regression model with categorical predictor
y3 = data3["weekly_sales"]

model3 = LinearRegression()
model3.fit(X3_encoded, y3)

# Extract model components
intercept = model3.intercept_
ad_spend_coef = model3.coef_[0]
region_west_coef = model3.coef_[1]

print(f"Regression with Categorical Predictor Results:")
print(f"Intercept: {intercept:.2f}")
print(f"Ad Spend Coefficient: {ad_spend_coef:.2f}")
print(f"Region West Coefficient: {region_west_coef:.2f}")

print(f"\nEquation: Weekly Sales = {intercept:.2f} + {ad_spend_coef:.2f} × Ad Spend + {region_west_coef:.2f} × Region_West")

print(f"\nInterpretation:")
print(f"- Baseline (East region): When ad_spend = 0, expected sales = ${intercept:.0f}")
print(f"- Each $1 increase in ad_spend increases sales by ${ad_spend_coef:.2f} (same for both regions)")
print(f"- West region stores have ${region_west_coef:.0f} higher baseline sales than East region stores")

In [None]:
# Visualize the categorical predictor model (creates parallel lines)
plt.figure(figsize=(10, 6))

# Separate data by region for plotting
east_data = data3[data3["region"] == "East"]
west_data = data3[data3["region"] == "West"]

# Plot data points
plt.scatter(east_data["ad_spend"], east_data["weekly_sales"], 
           color='blue', alpha=0.7, label='East Region', s=60)
plt.scatter(west_data["ad_spend"], west_data["weekly_sales"], 
           color='red', alpha=0.7, label='West Region', s=60)

# Create prediction lines for each region
ad_range = np.linspace(data3["ad_spend"].min(), data3["ad_spend"].max(), 100)

# East region line (region_West = 0)
east_predictions = model3.intercept_ + model3.coef_[0] * ad_range
plt.plot(ad_range, east_predictions, color='blue', linewidth=3, 
         label='East Region Line')

# West region line (region_West = 1)
west_predictions = model3.intercept_ + model3.coef_[0] * ad_range + model3.coef_[1]
plt.plot(ad_range, west_predictions, color='red', linewidth=3, 
         label='West Region Line')

plt.xlabel('Advertising Spend ($)')
plt.ylabel('Weekly Sales')
plt.title('Regression with Categorical Predictor: Parallel Lines by Region')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Notice how the lines are parallel - same slope, different intercepts!")

In [None]:
# Make predictions for both regions
print("Categorical Predictor Predictions:")

print("\nEast region store with $1500 ad spend:")
east_scenario = pd.DataFrame({
    "ad_spend": [1500], 
    "region_West": [0]  # East region
})
east_pred = model3.predict(east_scenario)
print(f"Predicted sales: ${east_pred[0]:.0f}")

print("\nWest region store with $1500 ad spend:")
west_scenario = pd.DataFrame({
    "ad_spend": [1500], 
    "region_West": [1]  # West region
})
west_pred = model3.predict(west_scenario)
print(f"Predicted sales: ${west_pred[0]:.0f}")

print(f"\nRegional difference: ${west_pred[0] - east_pred[0]:.0f}")
print(f"(This equals the Region_West coefficient: ${region_west_coef:.0f})")

### 🏃‍♂️ Try It Yourself

Using the Credit dataset from ISLP (if available), build a regression model with both numeric predictors and a categorical variable. If ISLP is not available, you can create your own sample dataset with a categorical variable.

In [None]:
# Try using the Credit dataset if ISLP is available
try:
    Credit = load_data('Credit')
    print("Credit dataset loaded successfully!")
    print(Credit.head())
    print("\nYour task: Build a regression model to predict 'Balance' using:")
    print("- Numeric predictors: Income, Limit, Age")
    print("- Categorical predictor: Gender (you'll need to dummy encode this)")
    
    # Your code here
    
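except Exception as err:  # e.g. NameError if the ISLP import above failed
    print(f"Could not load the Credit dataset ({type(err).__name__}).")
    print("You can practice with the advertising dataset instead.")
    print("Try adding a categorical variable like 'season' to the advertising data!")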

## Integration and Practice Challenges

Now let's combine everything we've learned in more complex scenarios that mirror real business decision-making.

In [None]:
# Comprehensive analysis of the Advertising dataset
print("Comprehensive Advertising Analysis")
print("=" * 40)

# Step 1: Explore correlations
print("\n1. Correlation Analysis:")
correlation_with_sales = advertising[['TV', 'radio', 'newspaper', 'sales']].corr()['sales'].sort_values(ascending=False)
print(correlation_with_sales)

# Step 2: Build comprehensive model
print("\n2. Multiple Regression Model:")
X_final = advertising[['TV', 'radio', 'newspaper']]
y_final = advertising['sales']

final_model = LinearRegression()
final_model.fit(X_final, y_final)

print(f"Intercept: {final_model.intercept_:.3f}")
for i, feature in enumerate(['TV', 'radio', 'newspaper']):
    print(f"{feature} coefficient: {final_model.coef_[i]:.3f}")

# Step 3: Business insights
print("\n3. Business Insights:")
coefficients = dict(zip(['TV', 'radio', 'newspaper'], final_model.coef_))
best_channel = max(coefficients, key=coefficients.get)
worst_channel = min(coefficients, key=coefficients.get)

print(f"- Most effective channel: {best_channel} (coefficient: {coefficients[best_channel]:.3f})")
print(f"- Least effective channel: {worst_channel} (coefficient: {coefficients[worst_channel]:.3f})")
print(f"- For every $1k spent on {best_channel}, sales increase by {coefficients[best_channel]:.3f} thousand units")

## 🚀 Practice Challenges

Test your understanding with these business scenarios that combine multiple concepts from the chapter.

### Challenge 1: Marketing Budget Optimization

You're a marketing manager with a $100k total budget to allocate across TV, radio, and newspaper advertising. Based on your regression model, which allocation would maximize expected sales?

In [None]:
# Your solution here
# Hint: Consider the coefficients from your model above
# Which channel gives the biggest "bang for the buck"?



### Challenge 2: Regional Strategy Development

Create a dataset with both advertising spend and a categorical variable (like "season" or "market_size"). Build a model and interpret how the categorical factor affects the relationship between advertising and sales.

In [None]:
# Your solution here
# Create your own dataset with both continuous and categorical predictors
# Remember to use dummy encoding for categorical variables



### Challenge 3: Model Comparison and Business Recommendations

Compare a simple regression model (using only the best predictor) with a multiple regression model (using all predictors). Which would you recommend for business decision-making and why?

In [None]:
# Your solution here
# Build both models and compare their predictions
# Think about simplicity vs. comprehensiveness trade-offs



## 📝 Chapter Summary

In this notebook, you practiced:

- ✅ **Measuring relationships** using correlation coefficients to quantify linear associations
- ✅ **Building simple regression models** to predict outcomes from single predictors
- ✅ **Extending to multiple regression** to understand how several factors jointly influence outcomes
- ✅ **Handling categorical predictors** using dummy encoding to include qualitative factors
- ✅ **Making business predictions** and translating statistical results into actionable insights

## 🔗 Connections to Other Chapters

- **Previous chapters**: Your data manipulation and visualization skills were essential for preparing data and understanding relationships
- **Upcoming chapters**: These regression foundations will be crucial for advanced machine learning techniques and model evaluation

## 📚 Additional Resources

- [Advertising Dataset on GitHub](https://raw.githubusercontent.com/bradleyboehmke/uc-bana-4080/main/data/Advertising.csv)
- [Guess the Correlation Game](https://www.guessthecorrelation.com/) - Practice interpreting correlation values
- [Scikit-learn Linear Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

## 🎯 Next Steps

1. **Review the chapter** to reinforce the theoretical foundations
2. **Complete the end-of-chapter exercises** using the ISLP datasets
3. **Practice with your own datasets** to build confidence with real-world data
4. **Move on to model evaluation techniques** to learn how to assess regression performance