# MATH 3375 Examples Notebook #6
# Variable Interactions

Using the cars data set, we will look at possible _interactions_ between predictors and how they can be accounted for in a regression model.


In [None]:

car_data <- read.csv("cars2004.csv", stringsAsFactors=TRUE)
head(car_data)
# Length and Width are read in as factors; convert via character to recover the numeric values
car_data$Length <- as.integer(as.character(car_data$Length))
car_data$Width <- as.integer(as.character(car_data$Width))
tail(car_data)


## Preliminaries: Add Two Columns

To keep the first couple of models simple, we create two binary variables: 1) Sport, which indicates whether a vehicle is a sports car (from the 'Body' feature); and 2) RWD, which indicates whether a vehicle has rear-wheel drive (from the 'WheelDrive' feature). We will use these variables to visualize how one predictor can be related to another.

In [None]:
car_data$Sport <- as.integer(car_data$Body == "Sport")
car_data$RWD <- as.integer(car_data$WheelDrive == "Rear")

head(car_data,3)
tail(car_data,3)
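
The 0/1 coding used above can be checked on a small example. The sketch below uses a made-up data frame (`toy` and its values are hypothetical, not taken from cars2004.csv) to show how a logical comparison becomes an indicator variable.

```r
# Toy data frame with hypothetical body types (not from cars2004.csv)
toy <- data.frame(Body = factor(c("Sedan", "Sport", "SUV", "Sport")))

# The comparison returns TRUE/FALSE; as.integer() converts these to 1/0
toy$Sport <- as.integer(toy$Body == "Sport")

# Cross-tabulate to confirm that "Sport" rows, and only those rows, are coded 1
check <- table(toy$Body, toy$Sport)
print(check)
```

Running the same kind of cross-tabulation on `car_data$Body` and `car_data$Sport` is a quick way to confirm the new column before modeling.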

## Example 1: Horsepower predicted by Engine Size

Below we create a plot of Horsepower by Engine Size, color coding the points to reflect whether the vehicle is a sports car. The plot includes a simple linear regression line of HP by Engine Size, which does NOT take the sports car designation into account.

In [None]:
model_HP <- lm(HP ~ EngineSize, data=car_data)

In [None]:
shapes = c(16,17)
colors = c("red", "blue")
plot(HP ~ EngineSize, data=car_data, main="Horsepower by Engine Size",
    col=colors[factor(car_data$Sport)], pch=shapes[factor(car_data$Sport)])
abline(model_HP, lwd=2)

legend("topleft",
       legend = c("Sport: NO", "Sport: YES"),
       pch = shapes,
       col = colors)

### Observations

* The regression line doesn't look like a great fit for the sports cars in particular
* We might get a better prediction by creating a separate model for sports and non-sports cars. We do this below, by following these steps:
     1. Create 2 separate data sets, one with all the sports cars and one with all the others
     2. Create a model for each one.

In [None]:
car_sport <- car_data[car_data$Sport == 1,]
car_nonsport <- car_data[car_data$Sport == 0,]
head(car_sport,3)
head(car_nonsport,3)

In [None]:
model_HP_sport <- lm(HP ~ EngineSize, data=car_sport)
model_HP_nonsport <- lm(HP ~ EngineSize, data=car_nonsport)


### Visualize the Two Separate Models

In [None]:
shapes = c(16,17)
colors = c("red", "blue")
plot(HP ~ EngineSize, data=car_data, main="Horsepower by Engine Size",
    col=colors[factor(car_data$Sport)], pch=shapes[factor(car_data$Sport)])

abline(model_HP_sport,col="blue",lwd=2)
abline(model_HP_nonsport,col="red",lwd=2)

legend("topleft",
       legend = c("Sport: NO", "Sport: YES"),
       pch = shapes,
       col = colors)

### Comparing the Models

Visually comparing the models, we see that:

* The sports car model has a higher intercept.
* The sports car model has a similar, but slightly steeper, slope.

We can verify our observations by looking at the model summary.

In [None]:
summary(model_HP_nonsport)

In [None]:
summary(model_HP_sport)

### Using 'Sport' as a second predictor

The next model adds 'Sport' as a predictor. Note that we can't easily visualize this model, but we can compare the intercept and EngineSize coefficients with the individual models above.

In [None]:
model_HP_both <- lm(HP ~ EngineSize + Sport, data=car_data)
summary(model_HP_both)

### Adding an Interaction Term

Because the slopes are different in the sport and non-sport models, we should look at a model with an interaction term. This model is created below.

In [None]:
model_HP_both_int <- lm(HP ~ EngineSize + Sport + EngineSize*Sport, data=car_data)
summary(model_HP_both_int)
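
A note on formula syntax: in R, `a*b` in a model formula expands to both main effects plus the `a:b` interaction, so the formula above can be written more than one way. The sketch below illustrates this on synthetic data (the data frame `toy` and the variables `x`, `g`, `y` are made up for the illustration).

```r
# Synthetic data: x is a numeric predictor, g is a 0/1 indicator (both hypothetical)
set.seed(1)
toy <- data.frame(x = runif(40, 1, 6), g = rep(c(0, 1), each = 20))
toy$y <- 50 + 50 * toy$x + 30 * toy$g + 10 * toy$x * toy$g + rnorm(40, sd = 5)

# Three ways of writing the same model
fit_long  <- lm(y ~ x + g + x*g, data = toy)  # style used in this notebook
fit_star  <- lm(y ~ x * g, data = toy)        # x*g already includes x and g
fit_colon <- lm(y ~ x + g + x:g, data = toy)  # x:g is the interaction alone

# The interaction coefficient is labeled "x:g" in every version
print(names(coef(fit_long)))

# All three fits give the same coefficients
same_star  <- isTRUE(all.equal(coef(fit_long), coef(fit_star)))
same_colon <- isTRUE(all.equal(coef(fit_long), coef(fit_colon)))
print(c(same_star, same_colon))
```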

### Comparing All Four Models

| Coefficient | HP_nonsport | HP_sport | HP_both | HP_both_int |
|:------|------------:|------:|------:|------:|
|Intercept| 52.604 | 82.820  |   48.773|  52.604 |
|EngineSize| 49.003  | 58.660  | 50.214  | 49.003  |
|Sport|  |   | 62.592  | 30.216  |
|EngineSize $\times$ Sport|  |   |   | 9.656  |

Observe that the model with the interaction term produces exactly the same predictions as the two separate models we created for sport and non-sport vehicles. Specifically:

* The intercept and **EngineSize** coefficient for the non-sport model are identical to the ones in the model with an interaction term.
* The sport model intercept is $82.82$, which is equivalent to the sum of the intercept and **Sport** coefficient in the model with interaction term ($52.604 + 30.216$).
* The sport model EngineSize coefficient is $58.660$, which matches (up to rounding) the sum of the **EngineSize** coefficient and the **EngineSize $\times$ Sport** coefficient in the interaction model ($49.003 + 9.656 = 58.659$).

Based on the above observations, our original plot of the 2 separate models **IS** an accurate visual representation of the model with the interaction term included.
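
The algebra behind this equivalence can be written out as follows. The symbols $b_0, b_1, b_2, b_3$ are generic labels introduced here (they do not appear in the R output) for the intercept, **EngineSize**, **Sport**, and **EngineSize $\times$ Sport** coefficients of the interaction model:

$$\widehat{HP} = b_0 + b_1\,\text{EngineSize} + b_2\,\text{Sport} + b_3\,(\text{EngineSize} \times \text{Sport})$$

Setting $\text{Sport} = 0$ removes the last two terms and leaves the non-sport line:

$$\widehat{HP} = b_0 + b_1\,\text{EngineSize} = 52.604 + 49.003\,\text{EngineSize}$$

Setting $\text{Sport} = 1$ and collecting terms gives the sport line:

$$\widehat{HP} = (b_0 + b_2) + (b_1 + b_3)\,\text{EngineSize} = (52.604 + 30.216) + (49.003 + 9.656)\,\text{EngineSize} = 82.820 + 58.659\,\text{EngineSize}$$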

To visualize the combined model WITHOUT the interaction term, we can plot lines using the values from the 'HP_both' model. Here, note that:

* Both lines have an identical slope, indicating the assumption of NO interaction.
* The intercept of the model is the intercept for NON-sport vehicles.
* The **Sport** coefficient is effectively an ADJUSTMENT to the intercept, raising the entire line up. (This makes the intercept for sport vehicles $48.773 + 62.592 = 111.365$.)

In [None]:
shapes = c(16,17)
colors = c("red", "blue")
plot(HP ~ EngineSize, data=car_data, main="Horsepower by Engine Size",
    col=colors[factor(car_data$Sport)], pch=shapes[factor(car_data$Sport)])

# a = intercept, b = slope (values from the 'HP_both' model)
abline(a=48.773, b=50.214, col="red",lwd=2)
abline(a=(48.773+62.592), b=50.214, col="blue",lwd=2)

legend("topleft",
       legend = c("Sport: NO", "Sport: YES"),
       pch = shapes,
       col = colors)
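
As an alternative to typing the coefficients by hand, the intercepts and slope of the two parallel lines can be read from the fitted model with `coef()`. The sketch below is one possible approach, shown on synthetic data (the data frame `toy` and its values are made up; only the variable names mirror this notebook).

```r
# Synthetic stand-in for the cars data (hypothetical values)
set.seed(2)
toy <- data.frame(EngineSize = runif(60, 1.5, 6),
                  Sport = rep(c(0, 1), times = c(45, 15)))
toy$HP <- 50 + 50 * toy$EngineSize + 60 * toy$Sport + rnorm(60, sd = 20)

# Additive model: one shared slope, with Sport shifting the intercept
fit <- lm(HP ~ EngineSize + Sport, data = toy)
b <- coef(fit)

# Intercept (a) and slope (b) for each group's line
nonsport_line <- c(a = unname(b["(Intercept)"]),
                   b = unname(b["EngineSize"]))
sport_line    <- c(a = unname(b["(Intercept)"] + b["Sport"]),
                   b = unname(b["EngineSize"]))

print(rbind(nonsport_line, sport_line))
# These values can be passed to abline(a = ..., b = ...) after a plot() call
```

With the real data, replacing `fit` with `model_HP_both` should give the values used in the plot above without retyping them.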

### Which Model is Better?

Notice that the interaction term is significant only at the $\alpha = 0.1$ level, so the evidence for an interaction is weak. This is consistent with the separate models having _similar_, though not identical, slopes. Adding the interaction term accounts for the slight difference in slope, but it also adds complexity to the model. This is a _**trade-off**_ between fit and simplicity.
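
One way to weigh this trade-off is to compare the two nested models directly. The sketch below uses `anova()` and adjusted $R^2$ on synthetic data (the data frame `toy` and the variables `x`, `g`, `y` are hypothetical); it is an illustration of the idea, not a result from the cars data.

```r
# Synthetic data with a modest interaction and noisy response (hypothetical)
set.seed(3)
toy <- data.frame(x = runif(80, 1.5, 6), g = rep(c(0, 1), each = 40))
toy$y <- 50 + 50 * toy$x + 30 * toy$g + 10 * toy$x * toy$g + rnorm(80, sd = 25)

fit_add <- lm(y ~ x + g, data = toy)   # no interaction
fit_int <- lm(y ~ x * g, data = toy)   # with interaction

# F-test for whether the extra term improves the fit
cmp <- anova(fit_add, fit_int)
print(cmp)

# With a single added term, the F-test p-value equals the t-test p-value
# reported for the interaction coefficient in summary()
p_F <- cmp[["Pr(>F)"]][2]
p_t <- summary(fit_int)$coefficients["x:g", "Pr(>|t|)"]
print(c(p_F = p_F, p_t = p_t))

# Adjusted R-squared penalizes the extra parameter
adj <- c(additive = summary(fit_add)$adj.r.squared,
         interaction = summary(fit_int)$adj.r.squared)
print(adj)
```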

### How to Create Predictions with Each Model

The model coefficients for the combined models (with and without interaction terms) are given here as a reference:

| Coefficient | HP_both | HP_both_int |
|:------|------:|------:|
|Intercept|    48.773|  52.604 |
|EngineSize|  50.214  | 49.003  |
|Sport|62.592  | 30.216  |
|EngineSize $\times$ Sport|   | 9.656  |

#### Model without interaction term

A _**non-sport**_ car with engine size 3.5 would be predicted to have:

$\widehat{HP} = 48.773 + 50.214(3.5) + 62.592(0) = 224.522$

A **_sport car_** with engine size 3.5 would be predicted to have:

$\widehat{HP} = 48.773 + 50.214(3.5) + 62.592(1) = 287.114$

#### Model WITH interaction term

A _**non-sport**_ car with engine size 3.5 would be predicted to have:

$\widehat{HP} = 52.604 + 49.003(3.5) + 30.216(0) + 9.656(3.5)(0) = 224.1145$

A **_sport car_** with engine size 3.5 would be predicted to have:

$\widehat{HP} = 52.604 + 49.003(3.5) + 30.216(1) + 9.656(3.5)(1) = 288.1265$
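
The four hand calculations above can be reproduced in R. The sketch below hard-codes the rounded coefficients from the reference table; the helper functions `predict_add` and `predict_int` are hypothetical names introduced for this illustration.

```r
# Rounded coefficients copied from the reference table above
b_add <- c(intercept = 48.773, engine = 50.214, sport = 62.592)
b_int <- c(intercept = 52.604, engine = 49.003, sport = 30.216,
           engine_sport = 9.656)

# Prediction from the model without an interaction term
predict_add <- function(engine_size, sport) {
  unname(b_add["intercept"] + b_add["engine"] * engine_size +
           b_add["sport"] * sport)
}

# Prediction from the model with an interaction term
predict_int <- function(engine_size, sport) {
  unname(b_int["intercept"] + b_int["engine"] * engine_size +
           b_int["sport"] * sport +
           b_int["engine_sport"] * engine_size * sport)
}

print(predict_add(3.5, 0))  # non-sport, no interaction: 224.522
print(predict_add(3.5, 1))  # sport, no interaction:     287.114
print(predict_int(3.5, 0))  # non-sport, interaction:    224.1145
print(predict_int(3.5, 1))  # sport, interaction:        288.1265
```

With the fitted model objects available, `predict()` with a `newdata` data frame (for example, `data.frame(EngineSize = 3.5, Sport = 1)`) should give the same answers without rounding the coefficients first.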

## Example 2: Highway MPG predicted by Engine Size

This time, we create a plot of Highway MPG by Engine Size, color coding the points to reflect whether the vehicle has rear wheel drive.

In [None]:

shapes = c(16,17)
colors = c("forestgreen", "purple")
plot(Hwy.MPG ~ EngineSize, data=car_data, main="Highway MPG by Engine Size",
    col=colors[factor(car_data$RWD)], pch=shapes[factor(car_data$RWD)])

legend("topleft",
       legend = c("RWD: NO", "RWD: YES"),
       pch = shapes,
       col = colors)

### Create 2 Data Sets and Visualize Separate Models

In [None]:
car_RWD <- car_data[car_data$RWD == 1,]
car_nonRWD <- car_data[car_data$RWD == 0,]
head(car_RWD,3)
head(car_nonRWD,3)

In [None]:
model_mpg_RWD <- lm(Hwy.MPG ~ EngineSize, data=car_RWD)
model_mpg_nonRWD <- lm(Hwy.MPG ~ EngineSize, data=car_nonRWD)


In [None]:
shapes = c(16,17)
colors = c("forestgreen", "purple")
plot(Hwy.MPG ~ EngineSize, data=car_data, main="Highway MPG by Engine Size",
    col=colors[factor(car_data$RWD)], pch=shapes[factor(car_data$RWD)])

abline(model_mpg_nonRWD,col="forestgreen", lwd=2)
abline(model_mpg_RWD,col="purple", lwd=2)

legend("topleft",
       legend = c("RWD: NO", "RWD: YES"),
       pch = shapes,
       col = colors)

### Observations

These models show a much stronger interaction: the lines have very different slopes and cross each other near the middle of the plot.

As before, we will create models with and without interaction terms.

In [None]:
model_mpg_both <- lm(Hwy.MPG ~ EngineSize + RWD, data=car_data)
model_mpg_both_int <- lm(Hwy.MPG ~ EngineSize + RWD + EngineSize*RWD, data=car_data)

### View All Four Model Summaries

In [None]:
summary(model_mpg_nonRWD)
summary(model_mpg_RWD)
summary(model_mpg_both)
summary(model_mpg_both_int)

### Compare All Four Models

Again, notice how the coefficients in the interaction model relate to those in the separate models.

| Coefficient | mpg_nonRWD | mpg_RWD | mpg_both | mpg_both_int |
|:------|------------:|------:|------:|------:|
|Intercept| 40.3318 | 32.1783  |   38.6129|  40.3318 |
|EngineSize| -4.3375  | -1.9105  | -3.7673  | -4.3375  |
|RWD|  |   | 0.3945  | -8.1535  |
|EngineSize $\times$ RWD|  |   |   | 2.4270  |

Also notice that the coefficient for the interaction term is highly significant this time. This is consistent with our observation of the very different slopes in the separate models.
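
Because the two lines have different slopes, they cross at a specific engine size. The sketch below computes that crossing point from the rounded interaction-model coefficients in the table above; the names `b_int` and `x_cross` are introduced here for illustration.

```r
# Rounded coefficients from the mpg_both_int column of the table above
b_int <- c(intercept = 40.3318, engine = -4.3375, rwd = -8.1535,
           engine_rwd = 2.4270)

# The lines meet where the RWD shift in intercept is cancelled by the
# RWD shift in slope: rwd + engine_rwd * x = 0
x_cross <- unname(-b_int["rwd"] / b_int["engine_rwd"])

# Predicted highway MPG on each line at the crossing point
mpg_nonRWD <- unname(b_int["intercept"] + b_int["engine"] * x_cross)
mpg_RWD <- unname((b_int["intercept"] + b_int["rwd"]) +
                    (b_int["engine"] + b_int["engine_rwd"]) * x_cross)

print(round(x_cross, 2))        # about 3.36
print(c(mpg_nonRWD, mpg_RWD))   # the two predictions agree at the crossing
```

Under these coefficients, the models predict higher highway MPG for non-RWD vehicles at engine sizes below the crossing point and higher highway MPG for RWD vehicles above it.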

## More Extensive Models

Recall that 'Sport' was just one of several body types in the original data set. The model in Example 1 above was focused only on the binary (0/1) value for the 'sport' body type.  

Below we show a model where potential interactions are considered using **_all_** possible body types.  First we show a summary of all body types in the data set, for reference.

In [None]:
summary(car_data$Body)

model_HP_body <- lm(HP ~ EngineSize + Body + EngineSize*Body, data=car_data)
summary(model_HP_body)

#### Questions to Consider

* Which body type is the 'baseline' to which all others are compared?
* Which body type coefficients are significant at the $\alpha = 0.05$ level?
* Which body types have a significant interaction with engine size at the $\alpha = 0.05$ level?
* What is the predicted horsepower of an SUV with an engine size of 4.0?
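
As a hint for the first question: with R's default treatment contrasts, the first level of a factor serves as the baseline, and `relevel()` can change it. The sketch below uses a made-up factor (the level names are hypothetical and may not match the body types in cars2004.csv).

```r
# Hypothetical factor; by default, levels are sorted alphabetically
body <- factor(c("Sedan", "Wagon", "Minivan", "Sedan"))
print(levels(body))  # "Minivan" "Sedan" "Wagon": "Minivan" is the baseline

# Choose a different baseline level explicitly
body_sedan <- relevel(body, ref = "Sedan")
print(levels(body_sedan))  # "Sedan" "Minivan" "Wagon"

# The baseline level gets no column of its own in the design matrix
design_cols <- colnames(model.matrix(~ body_sedan))
print(design_cols)
```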

### Suggestion 

Use one or more code cells below to practice the steps above by exploring other possible regression models.