## Interactions and dummy variables in R

Let's continue with the `fake.csv` data from previous labs. First, load the necessary packages. We'll use one package that may be unfamiliar, `marginaleffects`, to help us understand the results from our interaction model.

In [None]:
# Install the required packages if not already installed 
#install.packages('pacman')
pacman::p_load(tidyverse, modelsummary, marginaleffects)

Now, load the data. We also recode `z`, as shown in the code below, so that our original ordinal measure behaves more like a nominal/factor variable.

In [None]:

fake <- read_csv("https://raw.githubusercontent.com/bowendc/510_labs/main/fake.csv") |>
            mutate(z = case_when(
                z <= 3 ~ 0,
                z == 4 | z == 6 ~ 1, 
                z == 5 | z == 7 | z == 8 ~ 2,
                z > 8 ~ 3   
            ))

# examine the data to make sure it looks as expected
table(fake$z)

### Regression with nominal/factor data

We can use nominal data in `lm()` by wrapping the variable in `factor()` if R doesn't already read it as a factor. The alternative is to code the dummy variables ourselves. In the models below, let's first treat `z` as ordinal (entered as a single numeric term), and then, in the second model, treat it as nominal. When `z` enters as a factor, its first category is the reference group.

In [None]:

m1 <- lm(y ~ x + w + z, data = fake)
m2 <- lm(y ~ x + w + factor(z), data = fake)

# we set modelsummary() to present p-values as stars in the table and the 
# standard errors below the coefficients (slopes)

modelsummary(list(m1,m2), 
             estimate = "{estimate}{stars}", 
             statistic = "({std.error})")

Notice that in Model 2 above, the row labeled `factor(z)1` refers to category "1" of `z`: respondents coded 1 are, on average, 0.638 points lower on `y` than observations coded 0, holding `x` and `w` constant. Is the difference statistically significant?
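
As a rough illustration of the two approaches to nominal predictors mentioned earlier, the sketch below fits the same model with `factor()` and with hand-coded dummy variables. It uses simulated stand-in data rather than `fake.csv`; the data frame `sim`, the dummies `z1` to `z3`, and the model names are hypothetical and exist only for this example. It also shows how `relevel()` can change the reference group.

```r
# Simulated stand-in data (NOT fake.csv); names mirror the lab, values are made up
set.seed(510)
n <- 200
sim <- data.frame(
  x = rnorm(n),
  w = rnorm(n),
  z = sample(0:3, n, replace = TRUE)
)
# Each category of z shifts y by a different amount
sim$y <- 1 + 0.5 * sim$x - 0.3 * sim$w +
  c(0, -0.6, 0.4, 1)[sim$z + 1] + rnorm(n)

# Hand-coded dummies: one indicator per category, omitting the reference (z == 0)
sim$z1 <- as.numeric(sim$z == 1)
sim$z2 <- as.numeric(sim$z == 2)
sim$z3 <- as.numeric(sim$z == 3)

m_factor <- lm(y ~ x + w + factor(z), data = sim)
m_dummy  <- lm(y ~ x + w + z1 + z2 + z3, data = sim)

# The estimates agree; only the coefficient names differ
all.equal(unname(coef(m_factor)), unname(coef(m_dummy)))

# relevel() picks a different reference category, here z == 2
sim$z_f <- relevel(factor(sim$z), ref = "2")
m_ref2 <- lm(y ~ x + w + z_f, data = sim)
names(coef(m_ref2))
```

With a different reference group, each dummy coefficient is a comparison against that group instead of category 0, so the estimates change even though the fitted values do not.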

### Regression with interaction terms 

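Now, let's let the effect of `x` vary by `w`. To do this, we need to fit a model with an interaction term. Mathematically, we are just including a new variable that equals the product of `x` and `w`. The coefficient on this term, $\hat{\beta}_{xw}$, shows how the effect of one constituent variable *changes* as the other increases by one unit: the slope on `x` is $\hat{\beta}_{x} + \hat{\beta}_{xw} w$, and the slope on `w` is $\hat{\beta}_{w} + \hat{\beta}_{xw} x$. The coefficients on the constituent terms ($\hat{\beta}_{x}$ and $\hat{\beta}_{w}$) are therefore the effects of each variable when the other equals 0.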
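
To make the "product of `x` and `w`" idea concrete, here is a minimal sketch on simulated data (not `fake.csv`; the data frame `sim`, the column `xw`, and the model names are made up for illustration). It checks that the `x * w` formula shorthand gives the same estimates as adding the product by hand, and then computes the slope of `x` at a few values of `w` directly from the coefficients.

```r
# Simulated stand-in data with a true interaction between x and w
set.seed(510)
n <- 200
sim <- data.frame(x = rnorm(n), w = rnorm(n))
sim$y <- 1 + 0.5 * sim$x + 0.2 * sim$w - 0.4 * sim$x * sim$w + rnorm(n)

# In a formula, x * w expands to x + w + x:w
m_star <- lm(y ~ x * w, data = sim)

# The same model with the product variable built by hand
sim$xw <- sim$x * sim$w
m_prod <- lm(y ~ x + w + xw, data = sim)

# Same estimates; only the name of the last coefficient differs
all.equal(unname(coef(m_star)), unname(coef(m_prod)))

# Slope of x at chosen values of w: b_x + b_xw * w
b <- coef(m_star)
w_vals <- c(-1, 0, 1)
slope_x <- unname(b["x"] + b["x:w"] * w_vals)
data.frame(w = w_vals, slope_of_x = slope_x)
```

At `w = 0` the slope of `x` is just the coefficient on `x`, which is why the constituent terms should not be read as unconditional effects.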

First, we fit the model and examine the estimates.

In [None]:
m3 <- lm(y ~ x*w + factor(z), data = fake)
summary(m3)


Interpreting regression models is difficult! What can we say from this output? 

It helps to graph the *marginal effects* of a key predictor variable of interest. These marginal effects show the estimated relationship between the predictor and $\hat{Y}$ across the values of the other predictor variable interacted with it. Let's plot the marginal effect of `w` by `x` using the `marginaleffects` package.

In [None]:
# first, calculate the marginal effects using avg_slopes()
# we want the effect of w
# and x will be on the x axis

mfx <- avg_slopes(m3, variables="w", by="x")

# check the stored data frame
mfx

# now plot the data
# geom_ribbon plots the confidence interval as a shaded region
ggplot(data = mfx, aes(x = x)) +
    geom_line(aes(y = estimate)) + 
    geom_ribbon(aes(ymin = conf.low, ymax = conf.high), fill = "grey", alpha = .3) + 
    theme_minimal()

Using the graph, we can see that the marginal effect of `w` is positive when `x` is low but negative when `x` is high. It is significantly different from 0 whenever the shaded confidence interval does not include 0. 
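
For intuition about where the line and shaded band come from, the marginal effect of `w` at a given `x` and its standard error can be computed by hand from the coefficients and their variance-covariance matrix. The sketch below does this on simulated data (not `fake.csv`), so the numbers will not match the lab output; `sim`, `m_sim`, and `by_hand` are hypothetical names. It uses a t critical value, which is an assumption: `marginaleffects` may use a different reference distribution, so its intervals need not match exactly.

```r
# Simulated stand-in data with a true interaction between x and w
set.seed(510)
n <- 200
sim <- data.frame(x = rnorm(n), w = rnorm(n))
sim$y <- 1 + 0.5 * sim$x + 0.2 * sim$w - 0.4 * sim$x * sim$w + rnorm(n)
m_sim <- lm(y ~ x * w, data = sim)

b <- coef(m_sim)
V <- vcov(m_sim)
x_vals <- seq(-2, 2, by = 1)

# Effect of w at each value of x: b_w + b_xw * x
effect_w <- unname(b["w"] + b["x:w"] * x_vals)

# Var(b_w + x * b_xw) = Var(b_w) + x^2 * Var(b_xw) + 2 * x * Cov(b_w, b_xw)
se_w <- sqrt(V["w", "w"] + x_vals^2 * V["x:w", "x:w"] +
               2 * x_vals * V["w", "x:w"])

# 95% interval using a t critical value (an assumption of this sketch)
crit <- qt(0.975, df = df.residual(m_sim))
by_hand <- data.frame(
  x = x_vals,
  estimate = effect_w,
  conf.low = effect_w - crit * se_w,
  conf.high = effect_w + crit * se_w
)

# TRUE where the interval excludes 0
by_hand$excludes_zero <- by_hand$conf.low > 0 | by_hand$conf.high < 0
by_hand
```

Rows where `excludes_zero` is `TRUE` correspond to values of `x` at which the shaded band in a plot like the one above would sit entirely above or below 0.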