In [None]:
library(tidyverse)
library(magrittr) # for special pipe operators like  %<>% 
library(olsrr) # for post-hoc plots and functions
library(interactions) # for interaction plots
library(jtools) # dependency for interactions plus has summ() function for different format of printing lm output
library(ggpubr) # for ggline plot
library(gridExtra) # for side-by-side plots

# Interactions
We briefly touched on interactions when we looked at two-way ANOVA.  An interaction effect accounts for a third variable that changes the effect of an IV on the DV.  An example would be the effect of a treatment differing between genders.  We would include the type of treatment, the gender variable, and the interaction of the two in our model.

## Numerical by Categorical
The easiest type of interaction to understand and interpret is an interaction between a numerical IV and a categorical IV.  Conceptually, we are looking at how the relationship between the numerical IV and the DV differs depending on the level of the categorical variable.

We often talk about interactions as the X1 by X2 interaction (treatment by gender).  We can also talk about it as the interaction effect between X1 and X2.  The individual coefficients are the "main effects" and the coefficients that represent the interaction are the interaction effects.
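In R's formula syntax, main effects and interaction effects map onto specific terms. The sketch below uses the built-in `iris` data and base R only; the object names `fit_star` and `fit_colon` are illustrative. It shows that `a * b` is shorthand for `a + b + a:b`.

```r
# `Petal.Width * Species` expands to both main effects plus their interaction
fit_star  <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
fit_colon <- lm(Petal.Length ~ Petal.Width + Species + Petal.Width:Species,
                data = iris)

# the terms R actually fits: two main effects and one interaction
attr(terms(fit_star), "term.labels")

# both ways of writing the formula give the same coefficients
all.equal(coef(fit_star), coef(fit_colon))
```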

For the first example we'll use the built-in R dataset `iris`.  This data includes information about iris flowers including lengths of various parts of the flowers (petals, etc.) and the species of iris.  

<img src="images/iris-species.png" width="600" height="400">

In [None]:
data(iris)
glimpse(iris)

In [None]:
## first fit a model without interaction
iris_nointeract <- lm(Petal.Length ~ Petal.Width + Species, data = iris)
summ(iris_nointeract)

In [None]:
## fit a model to predict Petal.Length with Petal.Width and Species, including an interaction between width and species
## note we use the * to indicate we want both the main effects and interaction between the two IVs
fitiris <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
summ(fitiris)


In [None]:
options(repr.plot.width=6, repr.plot.height=4) ## plot size options for Jupyter notebook ONLY

## use interactions package to create interaction plot. 

## modx is used to indicate the variable used to determine how the lines are drawn
## in this case since our moderating variable is a factor, we get one line for each level of the factor.

interact_plot(fitiris, pred = Petal.Width, modx = Species)

Notice how the slopes of setosa and virginica appear to be very similar, with different y-intercepts.  Versicolor, however, has a different slope.  This means that the interaction is important to include in the model, because the relationship between Petal.Width and Petal.Length differs by species.  A tell-tale sign of an interaction effect is when the lines cross, although it is not a requirement.  To better visualize the relationship between the lines and the actual data, we can add the observations as points on the plot.

In [None]:
options(repr.plot.width=6, repr.plot.height=4) ## plot size options for Jupyter notebook ONLY

## add plot.points = TRUE

interact_plot(fitiris, pred = Petal.Width, modx = Species, plot.points = TRUE)

Let's return to the model summary and talk about interpretation.

In [None]:
summ(fitiris)

The first three coefficients (after the intercept) are the main effects of Petal.Width, Speciesversicolor, and Speciesvirginica.  Speciessetosa ends up being the "reference group" for the species categorical variable.  

We can no longer interpret our main effects in the same way.  Previously, we would say - "Holding all else constant, an increase of one unit of Petal.Width is associated with an average increase of 0.55 units of Petal.Length."  We **CANNOT** say that any longer with an interaction effect in the model.  Why not?

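Because the slope of Petal.Width now depends on Species.  The Petal.Width coefficient is the slope for the reference group (setosa) only; for versicolor and virginica, the slope is that coefficient plus the corresponding interaction coefficient.  As Petal.Width increases, both the main effect and the interaction effect impact the prediction of Petal.Length.  Commonly, when we want to interpret these models, we look at the graph or use example data to demonstrate how the fitted values differ at different levels of the predictor variables.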
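One way to make this concrete is to compute the slope of Petal.Width within each species from the coefficients, and then look at fitted values for a few example flowers. This is a minimal sketch in base R; the names `fit`, `slopes`, and `new_obs`, and the example widths of 1 and 2, are illustrative choices.

```r
fit <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
b <- coef(fit)

# slope of Petal.Width within each species:
# the reference group (setosa) uses the main effect alone,
# the other species add their interaction coefficient
slopes <- c(
  setosa     = unname(b["Petal.Width"]),
  versicolor = unname(b["Petal.Width"] + b["Petal.Width:Speciesversicolor"]),
  virginica  = unname(b["Petal.Width"] + b["Petal.Width:Speciesvirginica"])
)
slopes

# example data: fitted Petal.Length at two widths for each species
new_obs <- expand.grid(Petal.Width = c(1, 2), Species = levels(iris$Species))
new_obs$fitted <- predict(fit, newdata = new_obs)
new_obs
```

Within each species, the difference between the two fitted values is that species' slope, which is why a single "one unit increase" statement no longer covers the whole model.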

We should also review the significance of the various predictors.  In the first model, without the interaction effect, all the main effects were significant.  Once we add the interaction into the model, Petal.Width and Speciesversicolor are no longer significant.  This does not mean those variables stopped mattering: with the interaction in the model, these coefficients now describe the reference group only (the Petal.Width slope within setosa, and the versicolor vs. setosa difference when Petal.Width is 0), so they answer narrower questions than before.  The main effect of Speciesvirginica, however, remains significant.  If we look at the plot, this appears to be because, compared to the reference group, setosa, the virginica observations are much larger, although the slopes of the lines for those two groups (the effect of Petal.Width) remain similar.

What happens when we add another variable, Sepal.Length?

In [None]:
fitiris_sepal <- lm(Petal.Length ~ Petal.Width * Species + Sepal.Length, data = iris)
summ(fitiris_sepal)

In [None]:
interact_plot(fitiris_sepal, pred = Petal.Width, modx = Species, plot.points = TRUE)

We can still interpret the coefficient of Sepal.Length in a similar way - 

"Holding all else constant, a one unit increase of Sepal.Length is associated with, on average, a 0.54 unit increase in Petal.Length."  

Why can we still do this?  Because none of the other coefficients depend on Sepal.Length, therefore we can hold all else equal.
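A small check of this reasoning: take two flowers that are identical except for a one-unit difference in Sepal.Length and compare their fitted values. The flower values below are hypothetical and chosen only for illustration; the names `fit`, `flowers`, and `pred` are likewise illustrative.

```r
fit <- lm(Petal.Length ~ Petal.Width * Species + Sepal.Length, data = iris)

# two hypothetical flowers that differ only in Sepal.Length (by one unit)
flowers <- data.frame(
  Petal.Width  = c(1.3, 1.3),
  Species      = factor(c("versicolor", "versicolor"),
                        levels = levels(iris$Species)),
  Sepal.Length = c(5.5, 6.5)
)
pred <- predict(fit, newdata = flowers)

diff(pred)                 # difference in fitted Petal.Length
coef(fit)["Sepal.Length"]  # the Sepal.Length coefficient
```

The two numbers agree no matter which species or petal width is chosen, because Sepal.Length is not part of any interaction in this model.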

## Continuous x Continuous Interaction
We can also add interactions between two continuous variables, although they are much harder to interpret - at least the coefficients themselves.  We can say that:

1. There is a significant interaction between Petal.Width and Sepal.Length.
2. The relationship between Petal.Width and Petal.Length depends on Sepal.Length (or vice versa).

In [None]:
fitiris_numint <- lm(Petal.Length ~ Petal.Width * Sepal.Length, data = iris)
summ(fitiris_numint)

In [None]:
interact_plot(fitiris_numint, pred = Petal.Width, modx = Sepal.Length, plot.points = TRUE)

What this plot does is take three different values of Sepal.Length (the modx variable we specify) and plot the line of Petal.Width against Petal.Length at each one.  These three values are the mean of Sepal.Length and the mean +/- one SD.  We can reverse it if we wish and look at the slopes of Sepal.Length at different levels of Petal.Width.
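The slopes of those three lines can also be computed by hand, which helps with interpreting the interaction coefficient. The sketch below uses base R only; `mod_values` and `simple_slopes` are illustrative names.

```r
fit <- lm(Petal.Length ~ Petal.Width * Sepal.Length, data = iris)
b <- coef(fit)

# moderator values: mean - 1 SD, mean, mean + 1 SD of Sepal.Length
m <- mean(iris$Sepal.Length)
s <- sd(iris$Sepal.Length)
mod_values <- c("- 1 SD" = m - s, "Mean" = m, "+ 1 SD" = m + s)

# slope of Petal.Width at a given Sepal.Length:
# (Petal.Width coefficient) + (interaction coefficient) * Sepal.Length
simple_slopes <- unname(b["Petal.Width"]) +
  unname(b["Petal.Width:Sepal.Length"]) * mod_values
simple_slopes
```

Each value is the slope of Petal.Width at that level of Sepal.Length; the interaction coefficient is how much that slope changes per one-unit increase in Sepal.Length.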

In [None]:
## swap pred and modx; the jitter and point.shape arguments can also be used to differentiate the observations

interact_plot(fitiris_numint, modx = Petal.Width, pred = Sepal.Length, plot.points = TRUE)

## Categorical x Categorical Interactions
These are the types of interactions we looked at in Two-Way ANOVAs, but instead of using aov() we can fit the model with lm().  It's essentially the same model in a different format.

For that example in the ANOVA notebook we used the warpbreaks data.

In [None]:
data(warpbreaks)
summary(warpbreaks)

In [None]:
## remind ourselves of the interaction on the mean plot

warpbreaks %>% ggline(x = "tension", y = "breaks", color = "wool",
       add = c("mean_se", "jitter"),
       palette = c("#00BF7D", "#FF61C9"), size = 1.5)

In [None]:
## the ANOVA we fit previously
summary(aov(breaks ~ wool * tension, data=warpbreaks)) 

In [None]:
## now the lm model
wool_int <- lm(breaks ~ wool * tension, data=warpbreaks)
summ(wool_int)

In [None]:
cat_plot(wool_int, modx = wool, pred = tension, plot.points = TRUE)

`interactions` will create a plot, but the ggline plot displays the interaction in a more readable way.
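To confirm that aov() and lm() are fitting the same model here, we can compare their output directly. This is a short sketch in base R; `aov_fit` and `lm_fit` are illustrative names.

```r
aov_fit <- aov(breaks ~ wool * tension, data = warpbreaks)
lm_fit  <- lm(breaks ~ wool * tension, data = warpbreaks)

# anova() on the lm object reproduces the ANOVA table from aov()
anova(lm_fit)
summary(aov_fit)

# the estimated coefficients are the same as well
all.equal(coef(aov_fit), coef(lm_fit))
```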

### Final note:
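We don't have to include both of the main effects with the interaction.  Here we fit a model with wool and only the interaction of wool:tension without including the main effect of tension.  Note that with two categorical variables this does not reduce the number of parameters: R estimates a separate tension effect within each level of wool, so the result is a different parameterization of the same model.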
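A quick check on how this specification relates to the full model. The sketch below fits both versions on `warpbreaks` and compares the number of coefficients and the fitted values; the names `full` and `part` are illustrative.

```r
full <- lm(breaks ~ wool * tension, data = warpbreaks)
part <- lm(breaks ~ wool + wool:tension, data = warpbreaks)

# number of estimated coefficients in each model
length(coef(full))
length(coef(part))

# the coefficient names show tension estimated within each wool type
names(coef(part))

# the fitted values are the same in both models
all.equal(fitted(full), fitted(part))
```

For two factors, then, dropping the tension main effect changes how the coefficients are labeled and interpreted rather than how well the model fits.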

In [None]:
wool_part <- lm(breaks ~ wool + wool:tension, data=warpbreaks)
summ(wool_part)

# Quadratic Terms
Sometimes the relationship between a variable and the outcome differs depending on the value of that same IV.  The most commonly seen example is age: often the very oldest and the very youngest people will have a different relationship with the outcome than those in the middle.  It's essentially an interaction between that variable and itself.  Often on the scatterplot we will see a parabola that indicates the quadratic relationship between the IV and DV.
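The "interaction with itself" idea can be seen in the slope of a quadratic fit, which changes with the value of the predictor. The sketch below uses the built-in `cars` data as a stand-in (it is not the data used in this notebook); `slope_at` is a hypothetical helper.

```r
# quadratic model on a built-in dataset, used here only as an illustration
fit <- lm(dist ~ speed + I(speed^2), data = cars)
b <- coef(fit)

# slope of the fitted curve at a given speed:
# (linear coefficient) + 2 * (quadratic coefficient) * speed
slope_at <- function(speed) {
  unname(b["speed"]) + 2 * unname(b["I(speed^2)"]) * speed
}

# the effect of a small increase in speed differs across the range of speed
slope_at(c(5, 15, 25))
```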

When we were looking at the Big Brother data, we saw a possible curvilinear relationship in the errors of the model.  Let's see if adding a quadratic term will improve the model.

In [None]:
bb <- readRDS("bbdata.rds") ## load the data

In [None]:
options(repr.plot.width=4, repr.plot.height=3) ## plot size options for Jupyter notebook ONLY

bb %>% ggplot(aes(x=total_hoh, y=tenure)) + ## indicate df, x and y variables.
  geom_point()+ ## scatterplot
  geom_smooth(method=lm, se=TRUE) ## method is lm, show CI

In [None]:
## draw the lm line including the quadratic term

bb %>% ggplot(aes(x=total_hoh, y=tenure)) + ## indicate df, x and y variables.
  geom_point()+ ## scatterplot
  stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1) ## lm fit including the quadratic term

In [None]:
## Fit the original model

mod1 <- lm(tenure ~ total_hoh, data=bb) ## here I've saved the resulting model to "mod1"
summ(mod1)

In [None]:
## Fit the model with the quadratic term 

## We need to surround the quadratic term with I() or add the precalculated value to the dataset
mod2 <- lm(tenure ~ total_hoh + I(total_hoh^2), data=bb)
summ(mod2)

What we see is promising: the quadratic term is significant, and our adjusted r-squared has increased.  But is this model significantly better than the first model?  We can test that.

In [None]:
anova(mod1, mod2)

The model with the quadratic term is a significant improvement over the model without, justifying the addition of that term into the model.  
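Because the second model adds exactly one term to the first, the F-test from anova() and the t-test on the quadratic coefficient are two views of the same test. The sketch below illustrates this with the built-in `cars` data as a stand-in for the notebook's data; `m1`, `m2`, `cmp`, and `t_quad` are illustrative names.

```r
m1 <- lm(dist ~ speed, data = cars)
m2 <- lm(dist ~ speed + I(speed^2), data = cars)

# nested model comparison
cmp <- anova(m1, m2)
cmp

# with a single added term, F equals the squared t statistic of that term
t_quad <- summary(m2)$coefficients["I(speed^2)", "t value"]
c(F = cmp$F[2], t_squared = t_quad^2)
```

The anova() comparison becomes more useful than the single t-test once the larger model adds several terms at once.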

Remember the shape of the error/residuals we saw with the first model?  The benefit of adding this term is that we improve the model in a way that improves our adherence to the assumptions.  Let's compare plots!

In [None]:
## compare QQ plots
options(repr.plot.width=3, repr.plot.height=3) ## plot size options for Jupyter notebook ONLY

ols_plot_resid_qq(mod1)
ols_plot_resid_qq(mod2)

Adding the quadratic term improved the normality of our residuals - they are now very close to a normal distribution - the deviation in the tails has been reduced.

Let's look also at the shape of our errors to see if we still see a curvilinear relationship.

In [None]:
## compare resid vs fit plots
options(repr.plot.width=5, repr.plot.height=4) ## plot size options for Jupyter notebook ONLY

plot(mod1, which = 1) 
plot(mod2, which = 1) 

We still have heteroscedasticity, but there is no longer a curvilinear relationship between the residuals and fitted values.  We have accounted for the remaining curvilinear relationship that caused our errors to not be independent.

# Transformations

Transformations are another way we can adjust our variables to improve the fit of our models and reduce the violation of model assumptions.

Both the predictor (IV) and outcome (DV) variables can be transformed.  The unfortunate result of transforming the variables is the complexity of the interpretation of the model.

## Transformations to the y variable
The most common transformation to the y variable is a log transformation.  This is typically done to mitigate the skewness in the distribution of the y variable.  This skewness is commonly seen in income variables, as we've seen previously.

For these transformations we'll look at the Boston housing dataset you used for your homework.

In [None]:
boston <- readRDS("boston.rds") ## load the data

In [None]:
summary(boston$medv)

In [None]:
## quick histogram of medv
hist(boston$medv)

In [None]:
hist(log(boston$medv))

Notice how log transforming medv makes the distribution look more normal and removes the skewness.  We will create the log transformation of medv as a new variable, then use it in fitting our model.  We will compare the basic model where medv is predicted by lstat, to a model where lstat predicts log_medv.

In [None]:
boston$log_medv <- log(boston$medv)

In [None]:
normmod <- lm(medv ~ lstat, data = boston)
logmod <- lm(log_medv ~ lstat, data = boston)
summ(normmod)
summ(logmod)

The first thing I notice is that lstat is still a significant predictor of medv, and it's still a negative relationship.  

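Comparing the r-squared values, lstat explains about 54% of the variance in medv, but about 65% of the variance in log_medv.  Keep in mind that these r-squared values describe different outcome variables, so the comparison is only a rough guide.  We also cannot use an F-test to compare the fit of the models because they have two different y variables.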

But it becomes tricky when we get to the interpretation of the coefficient.  We cannot interpret the coefficient in the second model in direct regard to medv. 

We would have to say:  

"A one unit increase in lstat yields a 0.05 decrease in log(medv)."  Given that log median value is hard to understand, we need to do something else.

**Only the dependent/response variable is log-transformed:**
Exponentiate the coefficient, subtract one from this number, and multiply by 100. This gives the percent increase (or decrease) in the response for every one-unit increase in the independent variable. Example: the coefficient is 0.198. (exp(0.198) – 1) * 100 = 21.9. For every one-unit increase in the independent variable, our dependent variable increases by about 22%.
(https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/)

In [None]:
## interpret the coefficient of lstat

(exp(-0.05) - 1) * 100

So for every one-unit increase in lstat (the percentage of lower-status residents in the census tract), median home value decreases by about 5%.
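To see why exponentiating the coefficient gives a percent change, it can help to compare it with back-transformed predictions. The sketch below uses simulated data (not the Boston data); the variable names and the simulation settings are assumptions made for illustration.

```r
set.seed(1)

# simulated data: a strictly positive outcome that declines with x
x <- runif(200, min = 0, max = 30)
y <- exp(3.5 - 0.05 * x + rnorm(200, sd = 0.2))

fit <- lm(log(y) ~ x)
b <- unname(coef(fit)["x"])

# percent change in y for a one-unit increase in x, from the coefficient
pct_from_coef <- (exp(b) - 1) * 100

# the same quantity from back-transformed predictions at x = 10 and x = 11
p <- exp(predict(fit, newdata = data.frame(x = c(10, 11))))
pct_from_pred <- unname((p[2] / p[1] - 1) * 100)

c(from_coefficient = pct_from_coef, from_predictions = pct_from_pred)
```

The two values match, and the same percent change applies at any starting value of x, which is what makes the interpretation convenient.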

Let's see if this transformation improves our residuals/error.

In [None]:
## compare QQ plots
options(repr.plot.width=3, repr.plot.height=3) ## plot size options for Jupyter notebook ONLY

ols_plot_resid_qq(normmod)
ols_plot_resid_qq(logmod)

The log transformation improves the normality of our residuals.  Let's also check the plot of residuals vs. fitted.

In [None]:
## compare resid vs fit plots
options(repr.plot.width=5, repr.plot.height=4) ## plot size options for Jupyter notebook ONLY

plot(normmod, which = 1) 
plot(logmod, which = 1) 

The "hook" relationship evident in the plot from the "normal" model is less pronounced in the second plot, indicating that the log transformation has captured more of the nonlinear relationship between lstat and medv.  

## Transformations to the x variable
Sometimes a log transformation is beneficial for an x variable where we see "heaping" at the lower values (skewness).  The log transformation smooths out the distribution and makes it more even.  This time we will log transform the crim variable, and use that to predict medv.

In [None]:
options(repr.plot.width=4, repr.plot.height=3) ## plot size options for Jupyter notebook ONLY

pairs(boston[c("crim", "medv")]) ## select columns by name rather than position

In [None]:
boston$log_crim <- log(boston$crim)
pairs(boston[c("medv", "log_crim")]) ## select columns by name rather than position

You can see the improvement in the scatterplot: there is more of a linear relationship between the variables vs. the "heaping" seen previously.  Let's fit two models, one with crim and one with log_crim as the predictor.  We will use the untransformed version of medv as the outcome variable.

In [None]:
normmod_crim <- lm(medv ~ crim, data = boston)
logmod_crim <- lm(medv ~ log_crim, data = boston)
summ(normmod_crim)
summ(logmod_crim)

Crime rate remains a significant predictor of medv, but again the r-squared value of the model is higher when using the log-transformed predictor.  Again we cannot compare the models using an F-test, because they are not nested.

**Only independent/predictor variable(s) is log-transformed.** Divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by (coefficient/100) units. Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002. For x percent increase, multiply the coefficient by log(1.x). Example: For every 10% increase in the independent variable, our dependent variable increases by about 0.198 * log(1.10) = 0.02.

So, in our case, for every 1% increase in crime rate, medv changes by about -0.0193 units (1000s of dollars).  In other words, medv decreases by about $19.
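The arithmetic in this rule can be wrapped in a small helper. This is a sketch only: `effect_of_pct_increase` is a hypothetical function name, and the coefficient of -1.93 is inferred from the -0.0193 figure above rather than read from the model output.

```r
# change in y (in y's own units) for a `pct` percent increase in a
# log-transformed predictor with coefficient `b`
effect_of_pct_increase <- function(b, pct) {
  b * log(1 + pct / 100)
}

effect_of_pct_increase(0.198, 1)    # close to the shortcut 0.198 / 100
effect_of_pct_increase(0.198, 10)   # the 10% example from the quoted rule
effect_of_pct_increase(-1.93, 1)    # inferred crim coefficient, in 1000s of dollars
```

Dividing by 100 is an approximation of `log(1.01)`, which is why it only works well for small percent changes.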

Again, let's look at our errors


In [None]:
## compare QQ plots
options(repr.plot.width=3, repr.plot.height=3) ## plot size options for Jupyter notebook ONLY
ols_plot_resid_qq(normmod_crim)
ols_plot_resid_qq(logmod_crim)

The log transformation of the predictor doesn't seem to have improved the normality of the residuals.  However...

In [None]:
## compare resid vs fit plots
options(repr.plot.width=5, repr.plot.height=4) ## plot size options for Jupyter notebook ONLY

plot(normmod_crim, which = 1) 
plot(logmod_crim, which = 1) 

There is no longer a clear pattern in the plot of residuals vs. fitted, which suggests that the relationship left over in the errors of the first model has been accounted for.  In addition, it appears that the variance is fairly constant in this model.  

## Transforming both x and y
We can log transform both x and y.  Let's try that using medv and crim.

In [None]:
pairs(boston[c("log_medv", "log_crim")]) ## select columns by name rather than position

In [None]:
normmod_both <- lm(medv ~ crim, data = boston)
logmod_both <- lm(log_medv ~ log_crim, data = boston)
summ(normmod_both)
summ(logmod_both)

The r-squared doubles between the first and second model.  This model is a considerable improvement.

**Both dependent/response variable and independent/predictor variable(s) are log-transformed.** Interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For x percent increase, calculate 1.x to the power of the coefficient, subtract 1, and multiply by 100. Example: For every 20% increase in the independent variable, our dependent variable increases by about (1.20^0.198 - 1) * 100 = 3.7 percent.

So, for every 1% increase in crime rate, medv decreases by about 0.11%.
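The log-log rule can be written as a one-line helper to reproduce the quoted examples. This is a sketch: `pct_change_loglog` is a hypothetical function name, and the coefficient of -0.11 is inferred from the 0.11% figure above rather than read from the model output.

```r
# percent change in y for a `pct` percent increase in x,
# when both y and x are log-transformed and the coefficient is `b`
pct_change_loglog <- function(b, pct) {
  ((1 + pct / 100)^b - 1) * 100
}

pct_change_loglog(0.198, 1)    # about 0.20 percent
pct_change_loglog(0.198, 20)   # about 3.7 percent
pct_change_loglog(-0.11, 1)    # inferred crim coefficient: about -0.11 percent
```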

Again, we'll check our post-hoc plots.

In [None]:
## compare QQ plots
options(repr.plot.width=3, repr.plot.height=3) ## plot size options for Jupyter notebook ONLY
ols_plot_resid_qq(normmod_both)
ols_plot_resid_qq(logmod_both)

## compare resid vs fit plots
options(repr.plot.width=5, repr.plot.height=4) ## plot size options for Jupyter notebook ONLY

plot(normmod_both, which = 1) 
plot(logmod_both, which = 1) 

These look fairly similar to the previous set of models where only crim was log transformed, however, there may be a small improvement to the normality of the residuals.  

These examples highlight the importance of working with a model to improve the fit.  When you have a theory about the relationship between your predictors and your outcome, but your model doesn't fit well and violates assumptions, don't give up on the model or swap variables until something works (p-hacking).  Instead, work to refine the model and improve its fit.  Log transformation is not the only transformation available.

For IVs/predictors:

<img src="images/IVtrans.jpg" width="600" height="400">

See the following for more information about possible transformations:
- https://newonlinecourses.science.psu.edu/stat462/node/155/
- Brief tutorial on Box-Cox Transformations: https://rpubs.com/bskc/288328
- Tukey's Ladder of Powers: http://onlinestatbook.com/2/transformations/tukey.html