# Lab 7: RDD

In this lab, we will learn how to visualize and estimate a policy effect using a regression discontinuity design (RDD). We will use replication data from [de Benedictis-Kessner and Warshaw (2016)](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WSJX0X&version=1.0), who estimate the effect of Democratic mayors on local government finances, most notably total expenditures per capita. The outcome variable is total municipal expenditures per capita two years after the election (*Total.Expenditure.D2*), and the running variable is the Democratic candidate's vote share relative to the Republican candidate, centered at the cutoff of zero (*demshare*).

## Install (if needed) and load some specialized packages 

In [1]:
install.packages(c("tidyverse", "haven","dataverse", "rddensity", "rdrobust", "rdd", "stargazer"))

Installing packages into 'C:/Users/bowen/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)



package 'dataverse' successfully unpacked and MD5 sums checked
package 'rddensity' successfully unpacked and MD5 sums checked
package 'rdrobust' successfully unpacked and MD5 sums checked
package 'rdd' successfully unpacked and MD5 sums checked
package 'stargazer' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\bowen\AppData\Local\Temp\RtmpYz3tYg\downloaded_packages


In [None]:
library(tidyverse)
library(haven)
library(dataverse)
library(rddensity)
library(rdrobust)
library(rdd)
library(stargazer)


## Getting the data

We will use the **{dataverse}** package to access the replication data for the paper. One little note: the authors store their data as an RData file, a file type that can store an entire R workspace. RData files cannot be read directly as a data frame in R, because such files can contain functions, lists, objects, multiple data frames, etc. So instead of the function we have used in previous labs, **get_dataframe_by_doi()**, we will download the file as raw binary data with **get_file_by_doi()**, save it to disk, and then load the file into R.

In [5]:
# get file as raw binary data from replication dataverse
raw_data <- get_file_by_doi(filedoi = "doi:10.7910/DVN/WSJX0X/PQKBMK", # uses dataverse package
                            server = "dataverse.harvard.edu")

# save binary file
writeBin(raw_data, "mayors.RData")
# open workspace in R
load("mayors.RData") # file name must match the one written above (case matters on macOS/Linux)

# our data is called "data2". See?
ls()
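
Because an RData file can hold several objects, it can help to load it into its own environment and inspect the contents before anything lands in the global workspace. The sketch below uses a throwaway file with two made-up objects (`toy_df` and `toy_fun` are hypothetical, not part of the replication data) and only base R functions.

```r
# create a small workspace file containing two hypothetical objects
toy_df  <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.7))
toy_fun <- function(x) x^2
tmp <- tempfile(fileext = ".RData")
save(toy_df, toy_fun, file = tmp)

# load into a fresh environment so nothing in the global workspace is overwritten
ws <- new.env()
loaded <- load(tmp, envir = ws)   # load() returns the names of the restored objects

loaded
# check which of the restored objects are data frames
is_df <- sapply(mget(loaded, envir = ws), is.data.frame)
is_df

# pull out the data frame by name
df <- get("toy_df", envir = ws)
str(df)
```

The same pattern applied to `mayors.RData` would show which object holds the data before it is used.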

## Basic visualization and estimation

Before we turn to some user-generated programs, let's first visualize and model de Benedictis-Kessner and Warshaw's data using the tools we already have, **{ggplot2}** and **lm()**.

In [None]:
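# provide the scatterplot framework for the rest of the graph and store it
# ylim() limits the y axis range; note that it also drops points outside that range,
# so fitted lines added later are estimated without those points
# geom_vline() places a vertical line on the plot at the cutoff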

p1 <- ggplot(data = data2, aes(x = demshare, y = Total.Expenditure.D2)) +
        geom_point(color="gray20", alpha = .1) +
        theme_minimal() + ylim(-1000,1000) + 
  geom_vline(xintercept = 0, color = "black", linetype = "dashed") 

p1

# now let's add linear fitted lines before and after the cutpoint
p1 + geom_smooth(method = "lm", data = subset(data2, demshare<0), color = "navy") +
     geom_smooth(method = "lm", data = subset(data2, demshare>0), color = "navy")


Not too bad. Now let's plot with polynomials of the running variable, which lets the fitted curves pick up non-linearities before and after the cutoff.

In [None]:
# quadratic polynomial
p1 + geom_smooth(method = "lm", formula = y ~ poly(x, 2), data = subset(data2, demshare<0), color = "navy") +
     geom_smooth(method = "lm", formula = y ~ poly(x, 2), data = subset(data2, demshare>0), color = "navy")

# cubic polynomial
p1 + geom_smooth(method = "lm", formula = y ~ poly(x, 3), data = subset(data2, demshare<0), color = "navy") +
     geom_smooth(method = "lm", formula = y ~ poly(x, 3), data = subset(data2, demshare>0), color = "navy")

Visualization is nice, but we really want to estimate the model to see the precise estimate of the local average treatment effect (LATE) at the cutoff. First, let's create a treatment variable to use in a regression model.

In [None]:
data2 <- data2 |> mutate(treatment = case_when(demshare>0 ~ 1, demshare<0 ~ 0, TRUE ~ NA_real_))

m1 <- lm(Total.Expenditure.D2 ~ demshare + treatment, data = data2)
summary(m1)

Hmmm. Not seeing much of an effect here. The estimate is only about $17 more per capita, which is nearly half the size of its standard error. How about we add some additional flexibility into the estimation?

In [18]:
# first, allow linear trend to vary before and after the cutoff using an interaction
m2 <- lm(Total.Expenditure.D2 ~ demshare*treatment, data = data2)

# now, let's restrict the bandwidth
m3 <- lm(Total.Expenditure.D2 ~ demshare*treatment, data = data2 |> filter(demshare > -.1 & demshare < .1))
# and how about a quadratic polynomial
# raw = TRUE keeps the polynomial terms equal to zero at the cutoff, so the
# coefficient on treatment is still the jump at demshare = 0
m4 <- lm(Total.Expenditure.D2 ~ poly(demshare, degree = 2, raw = TRUE)*treatment, data = data2)
# poly + restricted bandwidth
m5 <- lm(Total.Expenditure.D2 ~ poly(demshare, degree = 2, raw = TRUE)*treatment, data = data2 |> filter(demshare > -.1 & demshare < .1))
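
A detail worth checking whenever `poly()` is interacted with a treatment dummy: by default `poly()` builds orthogonal polynomials, which are not zero at the cutoff, so the coefficient on the dummy is no longer the jump at the cutoff. The sketch below illustrates this on simulated data. Everything in it is an assumption made for illustration (the variable names, a true jump of 50, and a slope that changes at the cutoff); it is not the paper's data.

```r
set.seed(2024)
n <- 5000
x <- runif(n, -0.3, 0.7)                 # running variable, not symmetric around 0
d <- as.numeric(x > 0)                   # treatment indicator
# assumed truth: jump of 50 at x = 0, and the slope changes by 200 after the cutoff
y <- 100 + 60 * x + 90 * x^2 + d * (50 + 200 * x) + rnorm(n, sd = 5)
sim <- data.frame(y, x, d)

fit_orth <- lm(y ~ poly(x, degree = 2) * d, data = sim)              # default basis
fit_raw  <- lm(y ~ poly(x, degree = 2, raw = TRUE) * d, data = sim)  # raw x and x^2

# both models produce the same fitted values ...
same_fit <- all.equal(fitted(fit_orth), fitted(fit_raw))
same_fit

# ... but only the raw version reports the jump at x = 0 as the coefficient on d
at_cutoff <- data.frame(x = c(0, 0), d = c(0, 1))
jump_from_predictions <- unname(diff(predict(fit_orth, newdata = at_cutoff)))

c(coef_d_orthogonal = coef(fit_orth)[["d"]],
  coef_d_raw        = coef(fit_raw)[["d"]],
  jump_at_cutoff    = jump_from_predictions)
```

The two fits describe the same curves, so either can be used for plotting. The difference only matters when the treatment coefficient is read directly off a regression table.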


Let's use the **{stargazer}** package to make the regression table.

In [None]:
stargazer(m1, m2, m3, m4, m5, type = "text")



Notice how much the LATE varies by modelling decisions!
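
One way to see why the estimates move around is to re-estimate the same local linear model over a range of bandwidths. The following is a minimal sketch on simulated data with a curved relationship and an assumed true jump of 50; the helper `late_at()` and all the numbers are hypothetical and chosen only to make the pattern visible.

```r
set.seed(42)
n <- 4000
x <- runif(n, -0.5, 0.5)
d <- as.numeric(x > 0)
# curved relationship plus an assumed true jump of 50 at the cutoff
y <- 100 + 400 * x^3 + 50 * d + rnorm(n, sd = 20)
sim <- data.frame(y, x, d)

# local linear estimate using only observations within h of the cutoff
late_at <- function(h, data) {
  fit <- lm(y ~ x * d, data = subset(data, abs(x) < h))
  co  <- summary(fit)$coefficients
  c(bandwidth = h, n = nobs(fit),
    estimate = co["d", "Estimate"], std_error = co["d", "Std. Error"])
}

bandwidths <- c(0.5, 0.25, 0.1, 0.05)
results <- as.data.frame(t(sapply(bandwidths, late_at, data = sim)))
results
```

In this setup the widest window gives a precise but biased estimate, because a straight line is a poor fit to the curve far from the cutoff, while the narrow windows are closer to the truth on average but much noisier. That bias and variance trade-off is what a bandwidth selector tries to balance.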

The table is nice enough, but we can clean it up even more. Remember, the only information of interest is the coefficient on the treatment variable. 

In [None]:
stargazer(m1, m2, m3, m4, m5, type = "text", omit = c("poly", "demshare", "demshare:treatment"))


## Using RDD packages to choose optimal bandwidths

The **{rdrobust}** package automates the choice of bandwidth, although you can certainly still show your audience results for multiple bandwidths if you want. The package also lets you make other choices, such as which kernel to use for the local regression estimates. As this literature discusses, you can choose among various kernels. de Benedictis-Kessner and Warshaw use a uniform kernel (observations inside the bandwidth on either side of the cutoff are weighted equally) or a triangular kernel (observations closer to the cutoff are weighted more). As before, we can model the running variable linearly or with higher-order polynomials.

In [None]:
# first, the linear versions
# c = 0 is the argument for the location of the cutoff. 0 is the default.
# uniform kernel first, followed by the triangular kernel
rd1 <- rdrobust(data2$Total.Expenditure.D2, data2$demshare, c = 0, kernel = "uni", all = TRUE)
rd2 <- rdrobust(data2$Total.Expenditure.D2, data2$demshare, c = 0, kernel = "tri", all = TRUE)
summary(rd1)
summary(rd2)
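
To make the kernel choice concrete, a local linear regression with a kernel is just weighted least squares inside the bandwidth. The sketch below does this by hand with `lm()` on simulated data. The bandwidth `h = 0.1` is hand-picked for illustration, whereas `rdrobust()` selects the bandwidth from the data and also applies bias correction and robust standard errors, none of which this sketch attempts.

```r
set.seed(7)
n <- 4000
x <- runif(n, -0.5, 0.5)
d <- as.numeric(x > 0)
y <- 100 + 80 * x + 50 * d + rnorm(n, sd = 20)   # assumed true jump: 50
sim <- data.frame(y, x, d)

h <- 0.1   # hand-picked bandwidth, for illustration only

# uniform kernel: weight 1 inside the bandwidth, 0 outside
sim$w_uni <- as.numeric(abs(sim$x) <= h)
# triangular kernel: weight falls linearly from 1 at the cutoff to 0 at |x| = h
sim$w_tri <- pmax(0, 1 - abs(sim$x) / h)

# local linear regression = weighted least squares with kernel weights
inside  <- subset(sim, w_uni > 0)
fit_uni <- lm(y ~ x * d, data = inside, weights = w_uni)
fit_tri <- lm(y ~ x * d, data = inside, weights = w_tri)

c(uniform = coef(fit_uni)[["d"]], triangular = coef(fit_tri)[["d"]])
```

With the uniform kernel this is the same as the restricted-bandwidth regression estimated earlier; the triangular kernel simply leans harder on the observations nearest the cutoff.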

In [None]:
# the p argument allows you to specify higher-order polynomials. 2 would be quadratic, 3 would be cubic.

rd3 <- rdrobust(data2$Total.Expenditure.D2, data2$demshare, c = 0, p=2, kernel = "uni", all = TRUE)
rd4 <- rdrobust(data2$Total.Expenditure.D2, data2$demshare, c = 0, p=2, kernel = "tri", all = TRUE)
summary(rd3)
summary(rd4)

The **{rdrobust}** package comes with a nice default way to plot the discontinuity without all the work we did earlier. By default, the **rdplot()** function will bin the data to make a cleaner graph.

In [None]:
rdplot(data2$Total.Expenditure.D2, data2$demshare, c = 0)
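
The binning idea can be approximated by hand: cut the running variable into intervals that never straddle the cutoff and average the outcome within each one. The sketch below uses simulated data and fixed-width bins of 0.05, which is a simplifying assumption; `rdplot()` picks the number of bins with a data-driven rule instead.

```r
set.seed(99)
n <- 3000
x <- runif(n, -0.5, 0.5)
y <- 100 + 80 * x + 50 * (x > 0) + rnorm(n, sd = 40)   # assumed true jump: 50

# fixed-width bins; 0 is one of the break points, so no bin straddles the cutoff
breaks <- seq(-10, 10) / 20
bin    <- cut(x, breaks = breaks, include.lowest = TRUE)

binned <- data.frame(
  mid    = head(breaks, -1) + 0.025,          # bin midpoints
  mean_y = as.numeric(tapply(y, bin, mean)),  # average outcome per bin
  n      = as.numeric(table(bin))             # observations per bin
)
binned
```

Calling `plot(binned$mid, binned$mean_y)` followed by `abline(v = 0, lty = 2)` then gives a rough, base R version of the binned scatterplot.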


## Checking the continuity of the running variable

McCrary (2008) recommends a statistical test to see whether the density of the running variable itself is discontinuous at the cutoff. Remember, it shouldn't be. A discontinuity could be a signal that units are manipulating their position relative to the assignment threshold in some fashion. The **{rdd}** package includes a function, **DCdensity()**, to run and plot the McCrary density test. Uh oh - looks like de Benedictis-Kessner and Warshaw's data fail the test (significant discontinuity of the running variable's density at the cutpoint).

In [None]:
DCdensity(data2$demshare, 0)
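
The intuition behind the density test is that, absent sorting, observations just to the left and just to the right of the cutoff should be about equally common. The sketch below is a crude stand-in for that idea, not McCrary's procedure (which smooths a finely binned histogram with local linear regressions on each side of the cutoff). It compares counts in a narrow window with an exact binomial test; the helper `count_check()`, the window of 0.05, and both data sets are made up for illustration.

```r
# crude density check: are observations within `window` of the cutoff
# split roughly evenly between the two sides?
count_check <- function(runvar, cutoff = 0, window = 0.05) {
  left  <- sum(runvar >= cutoff - window & runvar < cutoff)
  right <- sum(runvar >  cutoff & runvar <= cutoff + window)
  test  <- binom.test(right, left + right, p = 0.5)
  c(left = left, right = right, p_value = test$p.value)
}

# constructed example 1: evenly spread running variable
even <- seq(-0.4995, 0.4995, by = 0.001)
balanced <- count_check(even)

# constructed example 2: extra mass piled up just above the cutoff
bunched <- count_check(c(even, seq(0.001, 0.02, length.out = 100)))

rbind(balanced, bunched)
```

The evenly spread data give identical counts on both sides, while the bunched data give a very small p-value, which is the kind of pattern the formal test is designed to detect.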