# Synthetic Control and Regression Discontinuities

While SC and RD designs are useful in quite different data situations, both methods rely on careful comparisons to create counterfactuals, and both lean heavily on graphs to illustrate those comparisons. Each also has a suite of specialized R packages that make the method easier to use.

## Synthetic Control

Again, the synthetic control method is useful for case studies or other situations in which you have few treated observations and are particularly interested in the effect of a policy on a particular treated unit. SC uses matching and weighting to build a "synthetic" unit that is as similar as possible to the treated unit during the pre-treatment period. We can then check whether the synthetic and actual units diverge after treatment, which would be evidence of an effect.

In [None]:
# install packages as needed
# the key Synthetic Control packages are Synth and SCtools
# we need to install (and load) devtools in order to install
# SCtools from Github

install.packages(c('tidyverse', 'Synth', 'devtools', 'cspp'))

In [None]:

library(tidyverse)
library(Synth)
library(devtools)
install_github("bcastanho/SCtools")
library(SCtools)
library(cspp)

Let's examine how limiting teacher unions' bargaining power through restrictions on fee collection might affect state finances. Oklahoma's agency fee provision law went into effect in 2001. What impact might such a limitation have on state finances? Perhaps it reduced state debt by limiting the power of teacher unions to bargain effectively for benefits like pensions and salary increases. We can use *synthetic control* to test for the effect of the policy. Synthetic control creates, well, a synthetic control Oklahoma as a weighted average of other states. We can then compare synthetic OK to actual OK.
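To make the "weighted average of other states" idea concrete, here is a minimal base R sketch. The outcome paths and the weights are made up for illustration and are not taken from the CSPP data; in the real analysis **Synth** chooses the weights for us.

```r
# Toy illustration only: made-up outcome paths for one treated unit and
# three donor units over six periods (treatment begins in period 4).
treated <- c(10, 11, 12, 15, 16, 17)
donors <- cbind(
  A = c(9, 10, 11, 12, 13, 14),
  B = c(11, 12, 13, 14, 15, 16),
  C = c(20, 20, 20, 20, 20, 20)
)

# Hypothetical donor weights: non-negative and summing to one.
w <- c(A = 0.5, B = 0.5, C = 0)
stopifnot(all(w >= 0), isTRUE(all.equal(sum(w), 1)))

# The synthetic unit is the weighted average of the donor paths.
synthetic <- as.numeric(donors %*% w)

# The gap (actual minus synthetic) is the estimated effect in each period.
gap <- treated - synthetic
data.frame(period = 1:6, treated, synthetic, gap)
```

In this toy case the gap is zero in periods 1 through 3 and 2 afterwards, which is the pattern we hope to see if a policy had an effect: a close pre-treatment match followed by divergence.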

In [None]:

# get the data from the Correlates of State Policy Project
data <- get_cspp_data(years = seq(1990,2015))

df <- as.data.frame(data) # Synth requires data.frame format, not tibble! 
                          # Be really careful with this one. If you pass 
                          # a tibble, Synth will throw a confusing error 
                          # message that does not point to the real cause. 

In [35]:

# most of the work is done in this dataprep() call. Basically, we need to tell 
# R what predictors of state debt to match on, what the outcome or dependent variable
# is, the time frame, and the potential units to be used for the control population.

dataprep.out <- 
    dataprep(
      foo = df,
      predictors = c("disposable_personal_income1000s_annual", "agovempr", "taxes_gsp"),
      predictors.op = "mean",                 # could use diff stat
      dependent = "total_debt_outstanding_gsp", # outcome variable here
      unit.variable = "state_fips", # must be numeric
      time.variable = "year",   # your time variable goes here 
      special.predictors = list( # this argument lets you match on variables in specific time periods
        list("total_debt_outstanding_gsp", 1990, "mean"),
        list("total_debt_outstanding_gsp", 2000, "mean"),
        list("hou_chamber", 1998, "mean"), # estimate of legislative ideology
        list("hou_chamber", 2000, "mean")), # in two different elections
      treatment.identifier = "OK",  # treated unit goes here; a name works because unit.names.variable is set
      unit.names.variable = "st", # we need this if we want to specify control units below with abbreviations, otherwise we could use the state_fips codes from above
      controls.identifier = c("CA","CO","NM", "PA", "OH", "NY", "MO", "MA", "MT", "KY", "WA", "WV"), # we list our control units here
      time.predictors.prior = c(1990:2000), # pre-implementation period
      time.optimize.ssr = c(1990:2000),
      time.plot = c(1990:2010)              # entire time period
    )


Now that we have prepared our data and described how we would like our synthetic unit to be built, we can estimate the weights and evaluate the result.

In [None]:
# use our prepared data to create synthetic control unit
# notice our data is now dataprep.out 
synth_out <- synth(data.prep.obj = dataprep.out)

# plot synthetic and actual units 
path.plot(synth_out, dataprep.out)

# plot difference between synthetic and actual
gaps.plot(synth_out, dataprep.out)

We can view the weights created by **Synth** using `synth.tab`. This is crucial: it shows us how well our synthetic unit matches the actual OK on each predictor, and it reports the relative weight given to each predictor variable and each control unit. How did we do?

In [None]:
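# synth.tab() takes the dataprep object first, so name the arguments to be safe
tables <- synth.tab(dataprep.res = dataprep.out, synth.res = synth_out)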
print(tables)

Placebo analysis plays a big role here. We see that there is a difference between OK and synthetic OK, but is it enough to matter? Well, we can re-run this analysis, replacing OK with each of our control states in turn. If we see similar movement from those states, then we probably don't have a causal effect for OK, because the other states didn't adopt the policy!

In [None]:

# the generate.placebos function will do it for us.
placebos <- generate.placebos(dataprep.out, synth_out)

plot_placebos(placebos)

# mspe.plot shows the ratio of post- to pre-treatment mean squared
# prediction error (MSPE) for the actual analysis and the placebos.
# Again, we're looking for evidence that OK is unique, with a higher
# ratio (more divergence) after treatment. Do we see that here?
mspe.plot(placebos, discard.extreme = TRUE, mspe.limit = 2, plot.hist = TRUE)
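
The ratio that the MSPE plot summarizes can be sketched by hand. The gaps and placebo ratios below are invented for illustration, and the period split is hypothetical; this is not the **SCtools** implementation, just the arithmetic behind it.

```r
# Toy gaps (actual minus synthetic) for one unit; the values are made up.
gap  <- c(0.5, -0.5, 0.5, -0.5, 2, 2, 2)
pre  <- 1:4   # hypothetical pre-treatment periods
post <- 5:7   # hypothetical post-treatment periods

# Mean squared prediction error before and after treatment.
mspe_pre  <- mean(gap[pre]^2)
mspe_post <- mean(gap[post]^2)
ratio <- mspe_post / mspe_pre

# Compare the treated unit's ratio with (made-up) placebo ratios.
# The share of units with a ratio at least as large as the treated
# unit's works like a permutation p-value.
ratios <- c(treated = ratio, placebo1 = 1.2, placebo2 = 0.8,
            placebo3 = 3.5, placebo4 = 2.1)
p_value <- mean(ratios >= ratios[["treated"]])
c(ratio = ratio, p_value = p_value)
```

With only a handful of donor states the smallest attainable p-value is one over the number of units, so a small donor pool limits how strong the placebo evidence can be.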

## Regression Discontinuity (RD)



We will learn how to visualize and estimate a policy effect using a regression discontinuity design (RDD). We will use replication data from [de Benedictis-Kessner and Warshaw (2016)](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WSJX0X&version=1.0), which estimates the effect of Democratic mayors on local government finances, most notably total expenditures per capita. The outcome variable is total municipal expenditures per capita two years after the election (*Total.Expenditure.D2*), and the running variable is the Democratic candidate's vote share relative to the Republican candidate, centered at zero (*demshare*).

In [None]:
# Install RDD packages if not already installed
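# dataverse and modelsummary are loaded below, so install them too if needed
install.packages(c("rddensity", "rdrobust", "rdd", "dataverse", "modelsummary"))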


In [None]:
# load packages
library(rddensity)
library(rdrobust)
library(rdd)
library(dataverse)
library(modelsummary)

### Getting the data

We will use the **{dataverse}** package to access the replication data for the paper. One little note: the authors store their data as an .RData file, a format that can hold an entire R workspace. Such a file cannot be read directly as a single data frame, because it can contain functions, lists, other objects, multiple data frames, and so on. The function we have used in previous labs, **get_dataframe_by_doi()**, won't work here. Instead, we can download the file and then load it into R.

In [None]:
# get file as raw binary data from replication dataverse
as_binary <- get_file_by_doi(
          filedoi = "doi:10.7910/DVN/WSJX0X/PQKBMK",
          server = "dataverse.harvard.edu"
)

# save the file in our working directory
writeBin(as_binary, "mayors.RData") 

# open the data
load("mayors.RData")

# our data is stored in a dataframe called "data2". See?
ls()
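
Because `load()` drops every stored object into the current workspace, it can silently overwrite objects with the same name. One defensive pattern, sketched below with a throwaway file rather than the mayors data, is to load into a separate environment first. The names `toy_df`, `toy_path`, and `loaded` are hypothetical.

```r
# Build a throwaway .RData file so the sketch runs on its own.
toy_df <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
toy_path <- tempfile(fileext = ".RData")
save(toy_df, file = toy_path)
rm(toy_df)

# Load into a fresh environment instead of the global workspace.
loaded <- new.env()
loaded_names <- load(toy_path, envir = loaded)

# load() returns the names of the objects the file contained.
loaded_names

# Pull out just the object we want.
recovered <- loaded[["toy_df"]]
str(recovered)
```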

### Basic visualization and estimation

Before we turn to some user-written packages, let's first visualize and model de Benedictis-Kessner and Warshaw's data using the tools we already have, **{ggplot2}** and **lm**.

In [None]:
# provide the scatterplot framework for the rest of the graph and store it
# ylim() will let us limit the y axis range
# geom_vline() will place a vertical line on the plot

p1 <- ggplot(data = data2, aes(x = demshare, y = Total.Expenditure.D2)) +
        geom_point(color="gray20", alpha = .1) +
        theme_minimal() + ylim(-1000,1000) + 
  geom_vline(xintercept = 0, color = "black", linetype = "dashed") 

p1

# now let's add linear fitted lines before and after the cutpoint
p1 + geom_smooth(method = "lm", data = subset(data2, demshare<0), color = "navy") +
     geom_smooth(method = "lm", data = subset(data2, demshare>0), color = "navy")


Not too bad. Now let's plot with polynomials of the running variable, which will graph non-linearities before and after the cutoff. 

In [None]:
# quadratic polynomial
p1 + geom_smooth(method = "lm", formula = y ~ poly(x, 2), data = subset(data2, demshare<0), color = "navy") +
     geom_smooth(method = "lm", formula = y ~ poly(x, 2), data = subset(data2, demshare>0), color = "navy")

# cubic polynomial
p1 + geom_smooth(method = "lm", formula = y ~ poly(x, 3), data = subset(data2, demshare<0), color = "navy") +
     geom_smooth(method = "lm", formula = y ~ poly(x, 3), data = subset(data2, demshare>0), color = "navy")

Visualization is nice, but we really want to estimate the model to see the precise estimate of the local average treatment effect (LATE) at the cutoff. First, let's create a treatment variable to use in a regression model.

In [None]:
data2 <- data2 |> mutate(treatment = case_when(demshare>0 ~ 1, demshare<0 ~ 0, TRUE ~ NA_real_))

m1 <- lm(Total.Expenditure.D2 ~ demshare + treatment, data = data2)
summary(m1)
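
For reference, the model just fitted can be written out as follows. The notation is introduced here for illustration and is not taken from the paper: $Y_i$ is the outcome, $X_i$ is the centered running variable (*demshare*), and $D_i$ is the treatment indicator.

```latex
% Target of an RD design: the jump in the conditional mean at the cutoff
\tau = \lim_{x \downarrow 0} \mathbb{E}[Y_i \mid X_i = x]
     - \lim_{x \uparrow 0} \mathbb{E}[Y_i \mid X_i = x]

% Simplest parametric version (model m1): a common slope on both sides
Y_i = \alpha + \tau D_i + \beta X_i + \varepsilon_i,
\qquad D_i = \mathbf{1}[X_i > 0]
```

Because $X_i$ is centered at the cutoff, the coefficient on $D_i$ is the estimated jump exactly at $X_i = 0$.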

Hmmm. Not seeing much of an effect here. The estimate is an increase of only about $17 per capita, which is roughly half the size of its standard error. How about we add some additional flexibility to the estimation?

In [12]:
# first, allow linear trend to vary before and after the cutoff using an interaction
m2 <- lm(Total.Expenditure.D2 ~ demshare*treatment, data = data2)

# now, let's restrict the bandwidth
m3 <- lm(Total.Expenditure.D2 ~ demshare*treatment, data = data2 |> filter(demshare>-.1 & demshare <.1))

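# and how about a quadratic polynomial
# raw = TRUE keeps the polynomial terms equal to zero at the cutoff, so the
# coefficient on treatment is still the jump at demshare = 0
m4 <- lm(Total.Expenditure.D2 ~ poly(demshare, degree = 2, raw = TRUE)*treatment, data = data2)

# poly + restricted bandwidth
m5 <- lm(Total.Expenditure.D2 ~ poly(demshare, degree = 2, raw = TRUE)*treatment, data = data2 |> filter(demshare>-.1 & demshare <.1))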


Let's use the **{modelsummary}** package to make the regression table.

In [None]:
modelsummary(list(m1, m2, m3, m4, m5))



Notice how much the LATE varies by modelling decisions!
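
One way to build intuition for this sensitivity is to run the same specifications on simulated data where the true jump is known. The sketch below is entirely hypothetical: the data-generating process, the sample size, and the jump of 3 are assumptions chosen for illustration and have nothing to do with the mayors data.

```r
set.seed(42)
n <- 2000
x <- runif(n, -0.5, 0.5)   # simulated running variable, cutoff at 0
d <- as.numeric(x > 0)     # treatment indicator
true_jump <- 3             # assumed size of the discontinuity

# Curvature appears only to the right of the cutoff, so a global
# linear fit is misspecified on that side.
y <- 2 + true_jump * d + 1.5 * x + 6 * x^2 * d + rnorm(n)

fit_common   <- lm(y ~ x + d)                     # common slope
fit_interact <- lm(y ~ x * d)                     # separate slopes
fit_narrow   <- lm(y ~ x * d, subset = abs(x) < 0.1)  # narrow bandwidth
fit_quad     <- lm(y ~ poly(x, degree = 2, raw = TRUE) * d)  # quadratic

estimates <- c(
  common   = coef(fit_common)[["d"]],
  interact = coef(fit_interact)[["d"]],
  narrow   = coef(fit_narrow)[["d"]],
  quad     = coef(fit_quad)[["d"]]
)
round(estimates, 2)
```

The specifications that can accommodate the curvature (the quadratic, or the linear fit in a narrow window) should land closer to the assumed jump on average, but the narrow window pays for its lower bias with a noisier estimate. That bias-variance trade-off is what bandwidth selection tries to manage.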

The table is nice enough, but we can clean it up even more. Remember, the only information of interest is the coefficient on the treatment variable. 

In [None]:

modelsummary(list(m1, m2, m3, m4, m5),
            coef_map = c('treatment' = 'Treatment'),
            stars = TRUE,                   
            estimate = "{estimate}{stars}", 
            statistic = "({std.error})",
            gof_map = c("nobs", "r.squared", "rmse"))

### Using RDD packages to choose optimal bandwidths

The **{rdrobust}** package will automate the process of identifying bandwidths, although you can certainly still show your audience multiple bandwidths if you want. The package will also let you make some other choices, such as which kernel to use when weighting observations in the local regression estimates. As this literature discusses, you can choose among various kernels. de Benedictis-Kessner and Warshaw use a uniform kernel (observations inside the bandwidth on either side of the cutoff are weighted equally) or the triangular kernel (observations closer to the cutoff are weighted more). We can again model with linear or polynomial versions of the running variable.
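
The difference between the two kernels is easy to see by computing the weights directly. This base R sketch uses made-up values of the running variable and a hypothetical bandwidth `h`; it illustrates the standard kernel formulas only and leaves out what **{rdrobust}** adds on top, such as bandwidth selection and bias correction.

```r
# Made-up running-variable values and a hypothetical bandwidth.
x <- c(-0.15, -0.08, -0.02, 0.01, 0.05, 0.12)
h <- 0.10

# Uniform kernel: every observation inside the bandwidth gets weight 1.
w_uniform <- as.numeric(abs(x) <= h)

# Triangular kernel: weight falls linearly from 1 at the cutoff to 0 at h.
w_triangular <- pmax(0, 1 - abs(x) / h)

data.frame(x, w_uniform, w_triangular)
```

Weights like these could be passed to the `weights` argument of `lm()` to fit a kernel-weighted local regression by hand.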

In [None]:
# first for the linear versions
# c = 0 is the argument for the location of the cutoff. 0 is the default.
# uniform kernel first, and then followed by the triangular kernel next
rd1 <- rdrobust(data2$Total.Expenditure.D2, data2$demshare, 
                c = 0, 
                kernel = "uni", 
                all = TRUE)

rd2 <- rdrobust(data2$Total.Expenditure.D2, data2$demshare, 
                c = 0, 
                kernel = "tri", 
                all = TRUE)
                
summary(rd1)
summary(rd2)

In [None]:
# the p argument allows you to specify higher-order polynomials. 2 would be quadratic, 3 would be cubic.

rd3 <- rdrobust(data2$Total.Expenditure.D2, data2$demshare, 
                c = 0, 
                p=2, 
                kernel = "uni", 
                all = TRUE)

rd4 <- rdrobust(data2$Total.Expenditure.D2, data2$demshare, 
                c = 0, 
                p=2, 
                kernel = "tri", 
                all = TRUE)
                
summary(rd3)
summary(rd4)

The **{rdrobust}** package comes with a nice default way to plot the discontinuity without all the work we did earlier. By default, the *rdplot* function will bin the data to make a cleaner graph.

In [None]:
rdplot(data2$Total.Expenditure.D2, data2$demshare, c = 0)


### Checking the continuity of the running variable

McCrary (2008) recommends a statistical test to see if the density of the running variable itself is discontinuous at the cutoff. Remember, it shouldn't be. A discontinuity could be a signal that units are sorting around the assignment threshold in some fashion. The **{rdd}** package includes functions to run and plot the McCrary density test. Uh oh - looks like de Benedictis-Kessner and Warshaw's data fail the test (significant discontinuity of the running variable's density at the cutpoint).

In [None]:
DCdensity(data2$demshare, 0)
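
A cruder but more transparent check on the same idea is to count observations just to the left and right of the cutoff and ask whether the split is plausibly 50/50. The sketch below uses simulated data with no sorting and a hypothetical window `h`; it is a simple binomial test in base R, not the McCrary test itself.

```r
set.seed(7)
x <- runif(1000, -0.5, 0.5)   # simulated running variable, no manipulation
h <- 0.05                     # hypothetical window around the cutoff

# Count observations just right and just left of the cutoff.
n_right <- sum(x > 0 & x <= h)
n_left  <- sum(x < 0 & x >= -h)

# Under no sorting, an observation in the window is about equally
# likely to fall on either side.
check <- binom.test(n_right, n_left + n_right, p = 0.5)
c(left = n_left, right = n_right, p_value = check$p.value)
```

A small p-value would suggest an imbalance worth investigating further with a formal density test.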