<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Objectives" data-toc-modified-id="Objectives-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#Ames-Housing-Data" data-toc-modified-id="Ames-Housing-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Ames Housing Data</a></span></li><li><span><a href="#Build-a-Baseline-Model" data-toc-modified-id="Build-a-Baseline-Model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Build a Baseline Model</a></span><ul class="toc-item"><li><span><a href="#Initial-Data-Preparation" data-toc-modified-id="Initial-Data-Preparation-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Initial Data Preparation</a></span></li><li><span><a href="#Build-a-Model-with-No-Interaction-Terms" data-toc-modified-id="Build-a-Model-with-No-Interaction-Terms-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Build a Model with No Interaction Terms</a></span></li><li><span><a href="#Evaluate-the-Model-without-Interaction-Terms" data-toc-modified-id="Evaluate-the-Model-without-Interaction-Terms-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Evaluate the Model without Interaction Terms</a></span></li></ul></li><li><span><a href="#Identify-Good-Candidates-for-Interaction-Terms" data-toc-modified-id="Identify-Good-Candidates-for-Interaction-Terms-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Identify Good Candidates for Interaction Terms</a></span><ul class="toc-item"><li><span><a href="#Numeric-x-Categorical-Term" data-toc-modified-id="Numeric-x-Categorical-Term-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Numeric x Categorical Term</a></span></li><li><span><a href="#Numeric-x-Numeric-Term" data-toc-modified-id="Numeric-x-Numeric-Term-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Numeric x Numeric 
Term</a></span></li></ul></li><li><span><a href="#Build-and-Interpret-a-Model-with-Interactions" data-toc-modified-id="Build-and-Interpret-a-Model-with-Interactions-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Build and Interpret a Model with Interactions</a></span><ul class="toc-item"><li><span><a href="#Build-a-Second-Model" data-toc-modified-id="Build-a-Second-Model-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Build a Second Model</a></span></li><li><span><a href="#Evaluate-the-Model-with-Interactions" data-toc-modified-id="Evaluate-the-Model-with-Interactions-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Evaluate the Model with Interactions</a></span></li><li><span><a href="#Interpret-the-Model-Results" data-toc-modified-id="Interpret-the-Model-Results-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Interpret the Model Results</a></span></li></ul></li><li><span><a href="#Summary" data-toc-modified-id="Summary-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Summary</a></span></li></ul></div>

# Interactions - Lab

## Introduction

In this lab, you'll explore interactions in the Ames Housing dataset.

## Objectives

You will be able to:

- Determine if an interaction term would be useful for a specific model or set of data
- Create interaction terms out of independent variables in linear regression
- Interpret coefficients of linear regression models that contain interaction terms

## Ames Housing Data

Once again we will be using the Ames Housing dataset, where each record represents a home sale:

In [1]:
# Run this cell without changes
import pandas as pd

ames = pd.read_csv('ames.csv', index_col=0)

# Remove some outliers to make the analysis more intuitive
ames = ames[ames["GrLivArea"] < 3000]
ames = ames[ames["LotArea"] < 20_000]
ames

In particular, we'll use these numeric and categorical features:

In [None]:
# Run this cell without changes
numeric = ['LotArea', '1stFlrSF', 'GrLivArea']
categorical = ['KitchenQual', 'Neighborhood']

## Build a Baseline Model

### Initial Data Preparation

Use all of the numeric and categorical features described above. (We will call this the "baseline" model because we are making a comparison with and without an interaction term. In a complete modeling process you would start with a simpler baseline.)

One-hot encode the categorical features (dropping the first category of each), and center the numeric features by subtracting each feature's mean.
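
As a minimal sketch of these two steps, here they are applied to a made-up three-row frame. The values are invented for illustration and are not from the Ames data; the name `toy` is hypothetical.

```python
import pandas as pd

# Toy data: values are invented purely for illustration
toy = pd.DataFrame({
    "GrLivArea": [1000.0, 1500.0, 2000.0],
    "KitchenQual": ["Ex", "Gd", "TA"],
})

# Center the numeric column: subtract its mean so that 0 means "average"
toy["GrLivArea"] = toy["GrLivArea"] - toy["GrLivArea"].mean()

# One-hot encode, dropping the first (alphabetical) category as the reference
toy_encoded = pd.get_dummies(toy, columns=["KitchenQual"], drop_first=True, dtype=int)

print(toy_encoded)
```

After centering, a value of 0 corresponds to an average home, which is what makes the intercept of the later models interpretable. The dropped category (`Ex` here) becomes the reference that the remaining dummy coefficients are compared against.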

In [None]:
# Your code here - prepare data for modeling
# numeric = ['LotArea', '1stFlrSF', 'GrLivArea']
# categorical = ['KitchenQual', 'Neighborhood']

# define X and y
y = ames["SalePrice"]
# .copy() avoids pandas' SettingWithCopyWarning when the columns are modified below
X = ames[numeric + categorical].copy()

# center numeric features
X[numeric] = X[numeric] - X[numeric].mean()

# one hot encoding 
X = pd.get_dummies(X, columns=['KitchenQual', 'Neighborhood'], drop_first=True, dtype=int)

X

### Build a Model with No Interaction Terms

Using the numeric and categorical features that you have prepared, as well as `SalePrice` as the target, build a StatsModels OLS model.

In [None]:
# Your code here - import relevant libraries and build model
# baseline model creation 
import statsmodels.api as sm

baseline_model = sm.OLS(endog=y,exog=sm.add_constant(X))
baseline_results = baseline_model.fit()
print(baseline_results.summary())

### Evaluate the Model without Interaction Terms

Describe the adjusted R-Squared as well as which coefficients are statistically significant. For now you can skip interpreting all of the coefficients.

In [None]:
# Your code here - evaluate the baseline model
print(f"Model adjusted R-squared: {baseline_results.rsquared_adj}")
print()

#check whether the coefficients are statistically significant
pvalues_df = pd.DataFrame(baseline_results.pvalues, columns=["p-value"])
pvalues_df["p < 0.05"] = pvalues_df["p-value"] < 0.05
pvalues_df[pvalues_df["p < 0.05"]]


In [None]:
# Your written answer here
"""
1. **The model's adjusted R-squared is about 0.83**, which means it explains about 83% of the variation in house prices, a strong fit.

2. **Every coefficient in the filtered table above has a p-value below 0.05**, because the table only keeps the statistically significant ones.

3. **This includes the intercept, all numeric variables** (LotArea, 1stFlrSF, GrLivArea), **the KitchenQual dummies, and most of the Neighborhood dummies**.

4. A p-value below 0.05 means we reject the null hypothesis that the coefficient is zero at alpha = 0.05. For a dummy variable, it means that category differs significantly from the reference category.

5. Neighborhood dummies missing from the table are not statistically significant, so those neighborhoods do not differ significantly from the reference neighborhood.

"""

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>

The model overall explains about 83% of the variance in sale price.

We'll use the standard alpha of 0.05 to evaluate statistical significance:
    
* Coefficients for the intercept as well as all continuous variables are statistically significant
* Coefficients for `KitchenQual` are statistically significant
* Coefficients for most values of `Neighborhood` are statistically significant, while some are not. In this context the reference category was `Blmngtn`, which means that neighborhoods with statistically significant coefficients differ significantly from `Blmngtn` whereas neighborhoods with coefficients that are not statistically significant do not differ significantly from `Blmngtn`

</details>
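
As a reminder of what the adjusted R-squared reported by StatsModels measures, here is a small sketch of the standard formula, which penalizes plain R-squared for the number of predictors. The function name and the example numbers are hypothetical and are not taken from this lab's output.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical values, not taken from the lab's output
example = adjusted_r_squared(r_squared=0.80, n_obs=101, n_predictors=20)
print(round(example, 4))
```

Because the penalty grows with the number of predictors, adding a term (such as an interaction) only raises adjusted R-squared if it improves the fit by more than would be expected from chance alone.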

## Identify Good Candidates for Interaction Terms

### Numeric x Categorical Term

Square footage of a home is often worth different amounts depending on the neighborhood. So let's see if we can improve the model by building an interaction term between `GrLivArea` and one of the `Neighborhood` categories.

Because there are so many neighborhoods to consider, we'll narrow it down to 2 options: `Neighborhood_OldTown` or `Neighborhood_NoRidge`.

First, create a plot that has:

* `GrLivArea` on the x-axis
* `SalePrice` on the y-axis
* A scatter plot of homes in the `OldTown` and `NoRidge` neighborhoods, identified by color
  * Hint: you will want to call `.scatter` twice, once for each neighborhood
* A line showing the fit of `GrLivArea` vs. `SalePrice` for the reference neighborhood

In [None]:
# Your code here - import plotting library and create visualization
import matplotlib.pyplot as plt

# Filter to houses in specific neighborhoods
oldtown = ames[ames["Neighborhood"] == "OldTown"]
noridge = ames[ames["Neighborhood"] == "NoRidge"]

fig, ax = plt.subplots(figsize=(10,5))

# Create scatter plots with 2 different colors
oldtown.plot.scatter(x="GrLivArea", y="SalePrice", alpha=0.7, label="OldTown", ax=ax)
noridge.plot.scatter(x="GrLivArea", y="SalePrice", alpha=0.7, color="orange", label="NoRidge", ax=ax)

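# Plot best fit line for the reference categories
# The model was fit on centered GrLivArea, so center the x-values before predicting
intercept = baseline_results.params["const"]
slope = baseline_results.params["GrLivArea"]
grlivarea_centered = ames["GrLivArea"] - ames["GrLivArea"].mean()
ax.plot(ames["GrLivArea"], intercept + grlivarea_centered * slope, color="gray", label="fit line (Blmngtn)")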

ax.legend();


Looking at this plot, does either of these neighborhoods seem to have a **slope** that differs notably from the best fit line? If so, this is an indicator that an interaction term might be useful.

Identify what, if any, interaction terms you would create based on this information.

In [None]:
# Your written answer here
"""
The plot suggests that NoRidge has a noticeably steeper slope compared to the reference line, indicating a stronger effect of GrLivArea on SalePrice. This implies that an interaction term between GrLivArea and Neighborhood_NoRidge could improve the model.
"""

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>

Your plot should look something like this:

![scatter plot solution](https://curriculum-content.s3.amazonaws.com/data-science/images/OldTown_vs_NoRidge.png)

If we drew the expected slopes based on the scatter plots, they would look something like this:

![scatter plot solution annotated](https://curriculum-content.s3.amazonaws.com/data-science/images/OldTown_vs_NoRidge_Annotated.png)

The slope of the orange line looks fairly different from the slope of the gray line, indicating that an interaction term for `NoRidge` might be useful.

</details>
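
One way to back up a visual impression like this is to estimate a separate slope for each group. The sketch below does that with `np.polyfit` on synthetic data. The two groups, their slopes (40 and 120), and the noise level are all invented stand-ins, not values from the Ames data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two neighborhoods; the true slopes (40 and 120) are invented
area = rng.uniform(800, 2800, size=200)
price_a = 50_000 + 40 * area + rng.normal(0, 5_000, size=200)
price_b = 50_000 + 120 * area + rng.normal(0, 5_000, size=200)

# np.polyfit with deg=1 returns [slope, intercept]
slope_a, _ = np.polyfit(area, price_a, deg=1)
slope_b, _ = np.polyfit(area, price_b, deg=1)

print(f"group A slope: {slope_a:.1f}")
print(f"group B slope: {slope_b:.1f}")
```

When the per-group slopes differ as clearly as they do here, a numeric x categorical interaction term is a reasonable candidate. With real data you would filter `ames` by neighborhood, as in the plotting cell, and pass those columns to the same function.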

### Numeric x Numeric Term

Let's also investigate to see whether adding an interaction term between two of the numeric features would be helpful.

We'll specifically focus on interactions with `LotArea`. Does the value of an extra square foot of lot area change depending on the square footage of the home? Both `1stFlrSF` and `GrLivArea` are related to home square footage, so we'll use those in our comparisons.

Create two side-by-side plots:

1. One scatter plot of `LotArea` vs. `SalePrice` where the color of the points is based on `1stFlrSF`
2. One scatter plot of `LotArea` vs. `SalePrice` where the color of the points is based on `GrLivArea`

In [None]:
# Your code here - create two visualizations
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15,5))
ames.plot.scatter(x="LotArea", y="SalePrice", c="1stFlrSF", cmap="Greens", ax=ax1)
ames.plot.scatter(x="LotArea", y="SalePrice", c="GrLivArea", cmap="Blues", ax=ax2)
fig.tight_layout();

Looking at these plots, does the slope between `LotArea` and `SalePrice` seem to differ based on the color of the point? If it does, that is an indicator that an interaction term might be helpful.

Describe your interpretation below:

In [None]:
# Your written answer here
'''
For both 1stFlrSF and GrLivArea, it seems like a larger lot area doesn't matter very much for homes with less square footage. (In other words, the slope is closer to a flat line when the dots are lighter colored.) Then for homes with more square footage, a larger lot area seems to matter more for the sale price. (In other words, the slope is steeper when the dots are darker colored.)

This difference in slope based on color indicates that an interaction term for either/both of 1stFlrSF and GrLivArea with LotArea might be helpful.

For ease of model interpretation, it probably makes the most sense to create an interaction term between LotArea and 1stFlrSF, since we already have an interaction that uses GrLivArea.
'''

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>

Your plots should look something like this:

![side by side plots solution](https://curriculum-content.s3.amazonaws.com/data-science/images/LotArea_vs_SalePrice.png)

For both `1stFlrSF` and `GrLivArea`, it seems like a larger lot area doesn't matter very much for homes with less square footage. (In other words, the slope is closer to a flat line when the dots are lighter colored.) Then for homes with more square footage, a larger lot area seems to matter more for the sale price. (In other words, the slope is steeper when the dots are darker colored.)

This difference in slope based on color indicates that an interaction term for either/both of `1stFlrSF` and `GrLivArea` with `LotArea` might be helpful.

For ease of model interpretation, it probably makes the most sense to create an interaction term between `LotArea` and `1stFlrSF`, since we already have an interaction that uses `GrLivArea`.

</details>
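
The "slope depends on color" idea can also be checked numerically: split the data on the second feature and compare the slope of the first feature in each half. This is a rough sketch on synthetic, already-centered data. The variable names and every coefficient in the data-generating line are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Synthetic, centered features; all coefficients below are invented
lot = rng.normal(0, 3_000, size=n)
floor = rng.normal(0, 400, size=n)
price = (200_000 + 2 * lot + 30 * floor
         + 0.003 * lot * floor + rng.normal(0, 5_000, size=n))

# Split on the median of the second feature and fit price ~ lot in each half
small = floor < np.median(floor)
slope_small, _ = np.polyfit(lot[small], price[small], deg=1)
slope_large, _ = np.polyfit(lot[~small], price[~small], deg=1)

print(f"slope for smaller homes: {slope_small:.2f}")
print(f"slope for larger homes:  {slope_large:.2f}")
```

A noticeably steeper slope in the "larger" half is the numeric counterpart of the darker points rising faster in the scatter plots.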

## Build and Interpret a Model with Interactions

### Build a Second Model

Based on your analysis above, build a model based on the baseline model with one or more interaction terms added.

In [None]:
# Your code here - build a model with one or more interaction terms
X_interaction = X.copy()
X_interaction["GrLivArea x Neighborhood_NoRidge"] = X_interaction["GrLivArea"] * \
                        X_interaction["Neighborhood_NoRidge"]
X_interaction["LotArea x 1stFlrSF"] = X_interaction["LotArea"] * X_interaction["1stFlrSF"]

interaction_model = sm.OLS(y, sm.add_constant(X_interaction))
interaction_results = interaction_model.fit()
print(interaction_results.summary())

### Evaluate the Model with Interactions

Same as with the baseline model, describe the adjusted R-Squared and statistical significance of the coefficients.

In [None]:
# Your code here - evaluate the model with interactions
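print(f"Baseline adjusted R-squared:    {baseline_results.rsquared_adj}")
print(f"Interaction adjusted R-squared: {interaction_results.rsquared_adj}")
interaction_results.pvalues[interaction_results.pvalues >= 0.05]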

In [None]:
# Your written answer here
"""
The model overall still explains about 83% of the variance in sale price. The baseline explained 82.7% whereas this model explains 82.9%, so it's a marginal improvement.

* Coefficients for the intercept as well as all continuous variables are still statistically significant
* Coefficients for KitchenQual are still statistically significant
* Neighborhood_NoRidge used to be statistically significant but now it is not
* GrLivArea x Neighborhood_NoRidge is not statistically significant
* LotArea x 1stFlrSF is statistically significant
"""

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>

The model overall still explains about 83% of the variance in sale price. The baseline explained 82.7% whereas this model explains 82.9%, so it's a marginal improvement.
    
* Coefficients for the intercept as well as all continuous variables are still statistically significant
* Coefficients for `KitchenQual` are still statistically significant
* `Neighborhood_NoRidge` used to be statistically significant but now it is not
* `GrLivArea x Neighborhood_NoRidge` is not statistically significant
* `LotArea x 1stFlrSF` is statistically significant

</details>
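
To see how much an interaction term can change the fit when the effect really exists, the sketch below compares R-squared with and without the product column. It uses NumPy least squares rather than StatsModels so that it runs on its own. The helper `r_squared` and the data-generating coefficients are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """Fit OLS by least squares (intercept added here) and return R-squared."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    return 1 - residuals.var() / y.var()

rng = np.random.default_rng(2)
n = 1_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Invented data-generating process with a genuine interaction effect
y = 1 + 2 * x1 + 3 * x2 + 1.5 * x1 * x2 + rng.normal(size=n)

r2_base = r_squared(np.column_stack([x1, x2]), y)
r2_inter = r_squared(np.column_stack([x1, x2, x1 * x2]), y)

print(f"without interaction: {r2_base:.3f}")
print(f"with interaction:    {r2_inter:.3f}")
```

In this synthetic case the gain is large because the interaction was built into the data. In the lab the gain is marginal, which suggests the interactions explain only a small share of the variance that the main effects miss.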

### Interpret the Model Results

Interpret the coefficients for the intercept as well as the interactions and all variables used in the interactions. Make sure you only interpret the coefficients that were statistically significant!

In [None]:
# Your written answer here
"""
The intercept is about 258k. This means that a home with average continuous attributes and reference categorical attributes (excellent kitchen quality, Bloomington Heights neighborhood) would cost about $258k.

The coefficient for LotArea is about 2.58. This means that for a home with average first floor square footage, each additional square foot of lot area is associated with an increase of about $2.58 in sale price.

The coefficient for 1stFlrSF is about 30.5. This means that for a home with average lot area, each additional square foot of first floor area is associated with an increase of about $30.50 in sale price.

The coefficient for LotArea x 1stFlrSF is about 0.003. This means that:

1. For each additional square foot of lot area, there is an increase of about $2.58 + (0.003 x first floor square footage, measured relative to its mean) in sale price
2. For each additional square foot of first floor square footage, there is an increase of about $30.50 + (0.003 x lot area square footage, measured relative to its mean) in sale price

Neighborhood_NoRidge and GrLivArea x Neighborhood_NoRidge were not statistically significant so we won't be interpreting their coefficients.
"""

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>

The intercept is about 258k. This means that a home with average continuous attributes and reference categorical attributes (excellent kitchen quality, Bloomington Heights neighborhood) would cost about \\$258k.

The coefficient for `LotArea` is about 2.58. This means that for a home with average first floor square footage, each additional square foot of lot area is associated with an increase of about \\$2.58 in sale price.

The coefficient for `1stFlrSF` is about 30.5. This means that for a home with average lot area, each additional square foot of first floor area is associated with an increase of about \\$30.50 in sale price.

The coefficient for `LotArea x 1stFlrSF` is about 0.003. This means that:

1. For each additional square foot of lot area, there is an increase of about \\$2.58 + (0.003 x first floor square footage, measured relative to its mean) in sale price
2. For each additional square foot of first floor square footage, there is an increase of about \\$30.50 + (0.003 x lot area square footage, measured relative to its mean) in sale price

`Neighborhood_NoRidge` and `GrLivArea x Neighborhood_NoRidge` were not statistically significant so we won't be interpreting their coefficients.



</details>
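
The interaction interpretation above is simple arithmetic, sketched here using the approximate coefficients quoted in the answer (2.58 and 0.003). Because the features were centered, the input is first floor square footage minus its mean. The function name and the example offsets are illustrative assumptions.

```python
# Approximate coefficients quoted in the answer above
LOT_AREA_COEF = 2.58
INTERACTION_COEF = 0.003

def lot_area_effect(first_floor_centered):
    """Dollar change in predicted price per extra square foot of lot area."""
    return LOT_AREA_COEF + INTERACTION_COEF * first_floor_centered

# first_floor_centered is 1stFlrSF minus its mean; the offsets are illustrative
for offset in (-500, 0, 500):
    print(offset, round(lot_area_effect(offset), 2))
```

For a home with an average first floor the effect is just the `LotArea` coefficient. It shrinks for homes with smaller first floors and grows for homes with larger ones.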

## Summary

You should now understand how to include interaction effects in your model! As you can see, interactions that seem promising may or may not end up being statistically significant. This is why exploration and iteration are important!