## Generating Predictions and Analyzing Residuals

Regression is useful for more than estimating coefficients and testing hypotheses. In this lab, we will use a regression model to generate predicted values of Y, examine the residuals for insights, and then create out-of-sample predictions.

In [None]:
# Install the required packages if not already installed 

#install.packages('pacman')

# let's load your packages in the R session
pacman::p_load(tidyverse, tidymodels, modelsummary, marginaleffects)


Now, load the data from GitHub. We will be using a dataset I've pulled together from official election results, the Bureau of Economic Analysis, and various public opinion sources (Gallup, the now-defunct 538, and others). The dependent variable is the change in seats for the president's party in the House of Representatives during each midterm election.

In [None]:
df <- read_csv("https://raw.githubusercontent.com/bowendc/pol200_labs/main/midterm_loss.csv")

Let's create our regression model. We expect that midterm elections are a referendum on the president (Tufte 1975) and a function of the size of the presidential coattails in the previous election (Campbell 1964). Let's start with the simple model of seat change regressed on presidential approval.   

In [None]:
m1 <- lm(seat_change ~ approval, data = df)

# generate predicted values of y based on the regression.
# lm() drops rows with missing values, so without newdata the
# predictions would not line up with the rows of df. With
# newdata = df we get one prediction per row (NA where approval
# is missing) and can store it directly in df as a new variable.
df$yhat <- predict(m1, 
                   newdata = df) 

# residuals are the diff between the observed and predicted
#   values of y.
df$resid <- df$seat_change - df$yhat

# alternatively, you can use the augment() function from broom, which is loaded with `tidymodels`
df2 <- augment(m1)

# predicted values stored as ".fitted" and residuals as ".resid"
df2
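
To see why `newdata` matters when the data contain missing values, here is a minimal sketch. It uses R's built-in `airquality` data rather than the midterm data, purely so the example runs on its own; the object names `aq` and `m_aq` are made up for the illustration.

```r
# illustration only: airquality ships with R and has missing Ozone values
aq <- airquality
m_aq <- lm(Ozone ~ Temp, data = aq)

# rows in the data vs. rows actually used to fit the model
n_rows <- nrow(aq)
n_used <- nobs(m_aq)

# without newdata, predict() returns one value per row used in the fit,
# so this vector is shorter than the data frame
n_pred <- length(predict(m_aq))

# with newdata, predict() returns one value per row of the data frame
aq$yhat <- predict(m_aq, newdata = aq)
aq$resid <- aq$Ozone - aq$yhat

# the residual is NA wherever the outcome itself is missing
n_missing_resid <- sum(is.na(aq$resid))
c(n_rows = n_rows, n_used = n_used, n_pred = n_pred,
  n_missing_resid = n_missing_resid)
```

The same logic applies to `augment(m1)`: called without `newdata`, it returns only the rows that were used to fit the model.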

Now let's create a better model. Below, we add controls for the number of seats held by the president's party going into the election, and we allow that to interact with approval ratings. If the president's party has won big, they have more seats to lose in the following election. We also control for the one-year change in real disposable income per capita. 

In [None]:
m2 <- lm(seat_change ~ approval * prev_seats + rdipc_1yeardiff, data = df)

# create regression table 
modelsummary(list(m1, m2),
              estimate = "{estimate}{stars} ({std.error})", 
              statistic = NULL)

We no longer have statistically significant results, but that is likely a function of our small number of observations. We're down to 16 elections! The $R^2$ has jumped to 0.526, suggesting we're doing a better job accounting for the variation in seat change.

## Interpreting Interaction Models

Interaction models are difficult to interpret from the coefficients alone. I recommend graphing the *marginal effects*: show the estimated effect of one variable on your dependent variable across the values of the variable it is interacted with. The easiest way to do this is with the `marginaleffects` package.

In [None]:
# this code will display the marginal effect of the variable approval from model m2
#   across the values of prev_seats. You could store this as a data frame and analyze
#   it with ggplot() 
avg_slopes(m2, variables = "approval", by = "prev_seats")

# you could also just plot directly using plot_slopes()
plot_slopes(m2, variables = "approval", by = "prev_seats")
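
To see what these functions estimate, it can help to write the model out. In the sketch below, $Y$ is seat change, $A$ is approval, $S$ is prev_seats, and $I$ is rdipc_1yeardiff; the symbols and the $\beta$ labels are notation chosen here for illustration, not output from the model.

$$
Y = \beta_0 + \beta_1 A + \beta_2 S + \beta_3 (A \times S) + \beta_4 I + \varepsilon
$$

Differentiating with respect to approval gives its marginal effect:

$$
\frac{\partial Y}{\partial A} = \beta_1 + \beta_3 S
$$

So the effect of approval is not a single number. It equals $\beta_1$ only when prev_seats is zero, and it shifts by $\beta_3$ for each additional seat the president's party holds, which is why it is plotted across prev_seats.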


## Analyzing Residuals

Let's get predictions and residuals from our updated model. 

In [None]:
df$yhat2 <- predict.lm(m2, 
                    newdata = df) 

# residuals are the diff between the observed and predicted
#   values of y.
df$resid2 <- df$seat_change - df$yhat2

# notice that we're using the label argument that will be passed to geom_text
ggplot(df, aes(x = seat_change, y = resid2, label = midterm)) + 
        geom_point() + 
        geom_text(hjust = 0,   # 0 left-aligns the label, so it sits to the right of the point
                  vjust = 0)   # 0 bottom-aligns the label, so it sits above the point
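
Besides plotting, you can rank observations by how badly the model misses them. The sketch below is one way to do that. It uses the built-in `mtcars` data as a stand-in (car names play the role of election years), and the names `mt`, `m_mt`, and `worst` are made up for the illustration.

```r
# illustration only: mtcars ships with R
mt <- mtcars
m_mt <- lm(mpg ~ wt + hp, data = mt)

# predictions and residuals, computed the same way as above
mt$yhat <- predict(m_mt, newdata = mt)
mt$resid <- mt$mpg - mt$yhat

# sort from the largest miss to the smallest, ignoring the sign
worst <- mt[order(abs(mt$resid), decreasing = TRUE),
            c("mpg", "yhat", "resid")]

# the three observations the model predicts worst
head(worst, 3)
```

A positive residual means the model under-predicted the outcome for that observation; a negative one means it over-predicted.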

## Predicting the 2026 Midterm Elections

We can simply plug values into our model and calculate the predicted value. If we assume Trump's approval rating and economic performance don't change from their current values, then:

In [None]:
# you can access coefficient values using the coefficients() function. 
b <- coefficients(m2)
b["(Intercept)"] + 
  b["approval"] * 40 + 
  b["prev_seats"] * 220 + 
  b["rdipc_1yeardiff"] * 1.5 + 
  b["approval:prev_seats"] * (40 * 220)
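
If you want to check a hand calculation like this, `predict()` should give the same number when handed a one-row data frame. The sketch below shows the check on a model with the same structure (an interaction plus a control). It is fit to the built-in `mtcars` data, and the plugged-in values are hypothetical.

```r
# illustration only: same model structure, different data
m_int <- lm(mpg ~ wt * hp + qsec, data = mtcars)
b_int <- coef(m_int)

# hypothetical values to plug in
wt0 <- 3
hp0 <- 150
qsec0 <- 18

# by hand: each coefficient times its value, including the interaction term
by_hand <- b_int["(Intercept)"] +
  b_int["wt"] * wt0 +
  b_int["hp"] * hp0 +
  b_int["qsec"] * qsec0 +
  b_int["wt:hp"] * (wt0 * hp0)

# with predict(): R builds the interaction term for us
new_obs <- data.frame(wt = wt0, hp = hp0, qsec = qsec0)
by_predict <- predict(m_int, newdata = new_obs)

# TRUE if the two approaches agree
all.equal(unname(by_hand), unname(by_predict))
```

Letting `predict()` build the interaction term avoids the easy mistake of forgetting to multiply the two values together.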

We can use `augment` to calculate prediction intervals, which take the uncertainty of the prediction into account (the bounds are returned as `.lower` and `.upper`):

In [None]:
df2 <- df |> filter(midterm==2026) |> 
                    mutate(approval = 40,
                            rdipc_1yeardiff = 1.5)

augment(m2, newdata = df2, interval = "prediction") # prediction intervals are larger than 
                                                    # standard confidence intervals because 
                                                    # you're predicting a specific observation's 
                                                    # values rather than those of a mean
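
To make the difference between the two kinds of interval concrete, here is a minimal sketch using base R's `predict()` on the built-in `mtcars` data. The data, the model, and the plugged-in value are stand-ins chosen so the example runs on its own.

```r
# illustration only: a simple model on built-in data
m_pi <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3)   # hypothetical new observation

# interval for the mean outcome among observations with wt = 3
conf_int <- predict(m_pi, newdata = new_car, interval = "confidence")

# interval for a single new observation with wt = 3
pred_int <- predict(m_pi, newdata = new_car, interval = "prediction")

# same point estimate, different widths
conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
c(confidence = conf_width, prediction = pred_width)
```

Both intervals are centered on the same predicted value; the prediction interval is wider because it adds the residual variation of an individual observation on top of the uncertainty in the estimated coefficients.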

At the time of writing, our best estimate is that Republicans will lose 31 seats in the House in the 2026 midterm elections. There is plenty of uncertainty about the size of those losses, but Trump and the GOP likely need the fundamentals to improve dramatically to avoid losing the House.