# Introduction to Regression Discontinuity Design

Code and data taken from: https://github.com/natematias/research_in_python/blob/master/regression_discontinuity/Regression%20Discontinuity%20Analysis.ipynb

**Regression Discontinuity** is a method that exploits a treatment assigned at a cutoff point. It relies on the assumption that subjects just on either side of the cutoff are unlikely to differ substantially on confounders. Any difference in outcomes near the cutoff can therefore be attributed to the intervention itself.
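To make the intuition concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption made for illustration: the cutoff of 50, the true effect of 5, the linear trend, and the helper `mean_gap` are hypothetical and have nothing to do with the Angrist and Lavy data.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the sketch is reproducible
CUTOFF = 50.0            # hypothetical cutoff on the running variable
TRUE_EFFECT = 5.0        # hypothetical jump caused by the treatment

# Simulate units: the outcome rises smoothly with the running variable
# (a stand-in for confounding) and jumps by TRUE_EFFECT at the cutoff.
units = []
for _ in range(5000):
    x = rng.uniform(0, 100)
    treated = x >= CUTOFF
    y = 0.5 * x + TRUE_EFFECT * treated + rng.gauss(0, 1)
    units.append((x, treated, y))

def mean_gap(rows):
    """Difference in mean outcome between treated and untreated rows."""
    treated_y = [y for _, t, y in rows if t]
    control_y = [y for _, t, y in rows if not t]
    return statistics.mean(treated_y) - statistics.mean(control_y)

# Compare everyone vs. only the units within 2 points of the cutoff.
naive_gap = mean_gap(units)
local_gap = mean_gap([u for u in units if abs(u[0] - CUTOFF) <= 2])

print(f"all units:          {naive_gap:.2f}")
print(f"within 2 of cutoff: {local_gap:.2f}")
```

The comparison over all units mixes the treatment jump with the smooth trend, while the narrow window recovers something much closer to the true jump. A small trend-driven gap remains even inside the window, which is one reason RD analyses usually also control for the running variable.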

In this notebook, we will learn how to implement and visualize a regression discontinuity design (RDD). We are using data from Joshua Angrist and Victor Lavy's "Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement." A common question in education and social science research is whether class size affects student performance. At both the K-12 and university levels, proponents of smaller classes argue that large classes force teachers to spread their time and attention too thinly, so students suffer from a lack of individual attention. Large classes might also be more distracting and impede student comprehension and achievement in other ways.

In this study, Angrist and Lavy exploit an Israeli rule that automatically split a class once it had more than 40 students. They compared outcomes for students in classes just under the 40-student cap with outcomes for students whose classes had just barely been split up.
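The splitting rule can be sketched as a small function. This assumes the rule is applied mechanically, so that a cohort is divided into the fewest equal-sized classes that each hold at most 40 students. The function name `predicted_class_size` and the `cap` parameter are hypothetical, not part of the dataset or the original code.

```python
def predicted_class_size(enrollment, cap=40):
    """Average class size when a cohort is split into the fewest classes
    that each hold at most `cap` students (assumed mechanical rule)."""
    n_classes = (enrollment - 1) // cap + 1
    return enrollment / n_classes

# Class size drops sharply each time enrollment crosses a multiple of 40.
for enrollment in (39, 40, 41, 80, 81):
    print(enrollment, round(predicted_class_size(enrollment), 1))
```

Under this assumption, a cohort of 40 stays in one class of 40, while a cohort of 41 is split into two classes averaging 20.5 students. That sharp drop is the discontinuity the design relies on.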

In this analysis, we examine a dataset that includes school-level data for:

- size: fifth-grade cohort size
- intended_classsize: average intended class size for each school
- observed_classize: observed average class size for each school
- read: average reading achievement in cohort

In [None]:
# Import modules
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.formula.api as smf 

In [None]:
# Load data
class_df = pd.read_sas('angrist.sas7bdat')

In [None]:
# Explore the data


### Question 1: What would be the problem with comparing all students in small classes vs. all students in large classes? How might students far away from the cutoff differ in important ways?

## Prepare data for analysis

The key to a regression discontinuity is distinguishing between units that received the treatment and units that served as controls. The usual approach is to create a dummy indicator for whether an observation falls below or above the cutoff.
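As a minimal sketch of this step on made-up data (not the class-size dataset), the snippet below builds a treatment dummy and a centered version of the variable that determines treatment. The column names `score`, `treated`, and `centered`, and the cutoff of 70, are all hypothetical.

```python
import pandas as pd

CUTOFF = 70  # hypothetical cutoff: scores of 70 or more receive treatment

toy = pd.DataFrame({"score": [62, 68, 69, 70, 71, 75]})

# Dummy indicator: 1 on the treated side of the cutoff, 0 otherwise.
toy["treated"] = (toy["score"] >= CUTOFF).astype(int)

# Distance from the cutoff, so that 0 marks the cutoff itself.
toy["centered"] = toy["score"] - CUTOFF

print(toy)
```

Centering is a design choice rather than a requirement: with the variable measured as distance from the cutoff, a coefficient on the dummy can be read as the jump at the cutoff itself.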

### Question 2: What is a running variable? How does it relate to the treatment cutoff?

In [None]:
# Write a function that returns 1 if the cohort is large enough to be split into small classes (size > 40), and 0 otherwise (size <= 40)
def small(size):
    ...
    ...
    ...

# Create a dummy for whether a school's classes are small (the cohort was split),
# and csize, which measures the difference between cohort size (size) and the cutoff (41)
class_df['small'] = 
class_df['csize'] = 

# Summarize the read variable by each class size group


## Regression Discontinuity

Now we're ready to fit a model! Let's narrow the window to cohort sizes between 29 and 53 (12 students on either side of the cutoff at 41), and then estimate the regression.
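Before working with the real data, it may help to see the same two steps on simulated data where the true jump is known. The sketch below is one common specification (outcome regressed on a treatment dummy plus the centered running variable), not necessarily the model the original analysis used. The data frame `sim`, its column names, the window of 12, and the true effect of 4 are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)  # fixed seed for reproducibility
n = 2000

# Hypothetical data: a smooth linear trend plus a jump of 4 at the cutoff (0).
centered = rng.uniform(-30, 30, n)
treated = (centered >= 0).astype(int)
outcome = 60 + 0.3 * centered + 4.0 * treated + rng.normal(0, 1, n)
sim = pd.DataFrame({"outcome": outcome, "treated": treated, "centered": centered})

# Step 1: keep only observations within 12 units of the cutoff.
window = sim[sim["centered"].between(-12, 12)]

# Step 2: fit OLS; the coefficient on `treated` estimates the jump at the cutoff.
fit = smf.ols("outcome ~ treated + centered", data=window).fit()
print(fit.params)
```

The coefficient on `treated` should land close to the simulated jump of 4, while the coefficient on `centered` absorbs the smooth trend. Choosing the window is a trade-off: a narrower window makes units more comparable but leaves fewer observations.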

In [None]:
# Subset the data to size between 29 and 53

# Fit OLS model with smf

result = smf.ols(formula = ,
                data = ).fit()
# Print results


In [None]:
# Plot results with cutoff
