## Interactions and dummy variables in R

Let's continue using our ANES data from labs 3 and 5. This time, we'll improve upon our models by incorporating interaction terms and nominal/factor data. 

As always, let's start by installing and loading the necessary packages.

In [None]:
# Install the required packages if not already installed 

install.packages(c('tidyverse', 'haven', 'marginaleffects', 'devtools', 'anesr', 'tidymodels', 'modelsummary'))

# let's load your packages in the R session

library(tidyverse)
library(haven)
library(marginaleffects)
library(devtools)
library(anesr)
library(tidymodels)
library(modelsummary)


Now, load the data and run the recode script from GitHub. This is the same script we used in previous labs, so if you already have it saved locally you can probably skip the `download.file()` step.

In [None]:
data("timeseries_2020")
myurl <- paste0("https://raw.githubusercontent.com/bowendc/510_labs/main/", "lab3_recodes.R")
download.file(url = myurl, "lab3recodes.R")
source("lab3recodes.R")

anes20 <- anes20 |> mutate(sex_fct = factor(sex, labels = sex_lbl))

## Regression with nominal/factor data

We can use nominal data in `lm()` by wrapping a variable in `factor()` if R doesn't already treat it as a factor. Let's look at two examples. In both models below, `sex_fct` is already stored with class factor; in the second model, we also wrap `pid7` in `factor()` inside the formula:

In [None]:

m1 <- lm(welfare ~ age + income + sex_fct + pid7, data = anes20)
m2 <- lm(welfare ~ age + income + sex_fct + factor(pid7), data = anes20)

# The output = "jupyter" argument below is used for presentation purposes in
# the Jupyter notebook. You would not use that output elsewhere; consider
# "data.frame" instead. 

# we also set modelsummary() to present p-values as stars in the table and the 
# standard errors below the coefficients (slopes)

modelsummary(list(m1,m2), estimate = "{estimate}{stars}", 
             statistic = "({std.error})", output = "jupyter")

Notice that in the models above, the coefficient label tells us that the category shown for `sex_fct` is "women", meaning that respondents who are women are, on average, 0.032 points lower than men on the welfare spending scale. (This is a five-point ordinal scale from "Increase a lot" to "Decrease a lot".) Women are therefore *more supportive* than men of increasing welfare spending, controlling for the other predictors, but that difference is not statistically distinguishable from 0.

Model 2 replaces party identification, which Model 1 treats as a single ordinal scale, with a series of dummy variables created by `factor()`. Each party dummy shows the difference between that category and the omitted category (Strong Democrat). For example, look at the coefficient for `factor(pid7)7`. That number shows the expected change in $\hat{Y}$ as one moves from a Strong Democrat respondent to a Strong Republican respondent, holding the other factors constant. Is it statistically significant?
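
To see what `factor()` does inside a formula, here is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, not ANES). It uses base R's `model.matrix()` to show the dummy columns that `lm()` builds, and `relevel()` to change which category is omitted.

```r
# Toy data (hypothetical values): a 3-category party variable coded 1-3
toy <- data.frame(
  y     = c(1, 2, 2, 3, 4, 5),
  party = c(1, 1, 2, 2, 3, 3)
)

# Treated as numeric: a single slope for a one-unit change in party
coef(lm(y ~ party, data = toy))

# Treated as a factor: one dummy per category except the first (the reference)
X <- model.matrix(~ factor(party), data = toy)
colnames(X)

# Each dummy coefficient is that group's mean minus the reference group's mean
fit_fct <- lm(y ~ factor(party), data = toy)
coef(fit_fct)
tapply(toy$y, toy$party, mean)

# relevel() changes the omitted category, here to category 3
toy$party_fct <- relevel(factor(toy$party), ref = "3")
fit_ref3 <- lm(y ~ party_fct, data = toy)
coef(fit_ref3)
```

The fitted values are identical whichever category is omitted; only the comparison that each coefficient expresses changes.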

In [None]:
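# sex_fct*pid7 expands to sex_fct + pid7 + sex_fct:pid7 (main effects plus interaction)
m3 <- lm(welfare ~ age + income + sex_fct*pid7, data = anes20)
tidy(m3)    # coefficient table
glance(m3)  # model-level fit statistics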

In [None]:
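# Average difference in predicted welfare between sex categories at each pid7 value
mfx <- avg_slopes(m3, variables = "sex_fct", by = "pid7")
# Plot the estimates with their confidence band
ggplot(data = mfx, aes(x = pid7)) +
    geom_line(aes(y = estimate)) + 
    geom_ribbon(aes(ymin = conf.low, ymax = conf.high), fill = "grey", alpha = .3)
mfx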
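
In a linear model with a `sex_fct*pid7` interaction, the gap between the two sex categories at a given value of `pid7` is the `sex_fct` coefficient plus the interaction coefficient times that `pid7` value. The sketch below checks this by hand on simulated data. The data frame `toy_int`, the factor labels "Men" and "Women", and the coefficients used to simulate it are all assumptions for illustration, so the coefficient names may differ from those in `m3`.

```r
set.seed(510)
n <- 200

# Simulated data (hypothetical, not ANES)
toy_int <- data.frame(
  sex_fct = factor(sample(c("Men", "Women"), n, replace = TRUE),
                   levels = c("Men", "Women")),
  pid7    = sample(1:7, n, replace = TRUE)
)
is_woman <- as.numeric(toy_int$sex_fct == "Women")
toy_int$welfare <- 2 + 0.3 * toy_int$pid7 + 0.4 * is_woman -
  0.1 * toy_int$pid7 * is_woman + rnorm(n, sd = 0.5)

fit_int <- lm(welfare ~ sex_fct * pid7, data = toy_int)
b <- coef(fit_int)

# Women-minus-men gap at each pid7 value, built from the coefficients
gap_by_pid <- unname(b["sex_fctWomen"] + b["sex_fctWomen:pid7"] * 1:7)

# The same gap from predicted values for women and men at pid7 = 1, ..., 7
new_w <- data.frame(sex_fct = factor(rep("Women", 7), levels = c("Men", "Women")),
                    pid7 = 1:7)
new_m <- data.frame(sex_fct = factor(rep("Men", 7), levels = c("Men", "Women")),
                    pid7 = 1:7)
gap_from_pred <- unname(predict(fit_int, new_w) - predict(fit_int, new_m))

all.equal(gap_by_pid, gap_from_pred)
```

If the gap changes sign or its confidence band crosses zero over the range of `pid7`, the interaction is telling you that the sex difference depends on party identification.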