# The Tennessee STAR Experiment

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import statsmodels
import statsmodels.formula.api as smf

import urllib

In [5]:
!pip3 install -U -e git+https://bitbucket.org/fomcl/savreaderwriter.git#egg=savreaderwriter

Obtaining savreaderwriter from git+https://bitbucket.org/fomcl/savreaderwriter.git#egg=savreaderwriter
  Cloning https://bitbucket.org/fomcl/savreaderwriter.git to ./src/savreaderwriter
Installing collected packages: savreaderwriter
  Running setup.py develop for savreaderwriter
Successfully installed savreaderwriter


In [2]:
d = pd.read_csv('../../data/star_students.csv')
d.head()

FileNotFoundError: File b'../../data/star_students.csv' does not exist

## Data cleaning 
We begin with some basic cleaning of the data set. We could have done this before sharing the dataset with you, but thought instead that this might be instructive for you to work through.

What we have is a worst-case scenario where one receives a package of data with little information about the data, how its values are stored, etc. So, we will define these things for ourselves. In fact, there are four (or more) tests in the data, and Angrist and Pischke report that the `percentile` variable they use is built from a student's performance on three of these tests. Specifically which three is not detailed in the book. As a result, our answers deviate slightly from those reported in *Mostly Harmless Econometrics*, but they're close.


Create the following fields in a way that makes sense:

- A restructured race indicator that encodes a binary indicator for whether the individual is "White" or "Asian" or another racial/ethnic category
- The student's age in 1985
- Whether they receive a free lunch (a rescaling of the variable `gkfreelunch`)
- The class size they're in (`gkclasssize`)
- Their reading score (`gktreadss`)
- Their math score (`gktmathss`) 
- Their listening score (`gktlistss`)
- Their school id (`gkschid`) 
- And, finally, their *treatment* id (`gkclasstype`)
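
The recoding above might look something like the following sketch. Because the real file did not load, it runs on a four-row toy frame. The `gk*` column names come from the list above, but the `race` and `birthyear` column names, every value label (for example `"FREE LUNCH"`), and all of the new field names are assumptions for illustration, so check them against the actual data before reusing any of this.

```python
import pandas as pd

# Toy stand-in for the raw STAR file. The value labels and the `race` and
# `birthyear` column names are assumed, not taken from the real codebook.
raw = pd.DataFrame({
    "race": ["WHITE", "BLACK", "ASIAN", "BLACK"],
    "birthyear": [1980, 1979, 1980, 1981],
    "gkfreelunch": ["FREE LUNCH", "NON-FREE LUNCH", "FREE LUNCH", "NON-FREE LUNCH"],
    "gkclasssize": [15, 22, 24, 14],
    "gktreadss": [440, 430, 455, 420],
    "gktmathss": [480, 470, 500, 465],
    "gktlistss": [540, 530, 550, 525],
    "gkschid": [101, 101, 102, 102],
    "gkclasstype": ["SMALL CLASS", "REGULAR CLASS", "REGULAR + AIDE CLASS", "SMALL CLASS"],
})

# Build the cleaned frame with shorter, self-describing names (hypothetical).
clean = pd.DataFrame({
    "white_asian": raw["race"].isin(["WHITE", "ASIAN"]).astype(int),  # 1 = White or Asian
    "age_1985": 1985 - raw["birthyear"],                               # age in years in 1985
    "free_lunch": (raw["gkfreelunch"] == "FREE LUNCH").astype(int),    # 1 = receives free lunch
    "class_size": raw["gkclasssize"],
    "read": raw["gktreadss"],
    "math": raw["gktmathss"],
    "listen": raw["gktlistss"],
    "school_id": raw["gkschid"],
    "treatment": raw["gkclasstype"],
})
print(clean)
```

One caution with this style of recoding: a missing `race` or `gkfreelunch` value would silently become 0 here, so real data would need the missing values handled before the comparison.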

## Create Outcome Variables
With the data mostly cleaned up and renamed, you can now create the outcome variable. Create this variable in the following way:

1. We are summing each student's performance on the reading, math, and listening tests to create a single, composite score. 
2. With this score, we are calculating the empirical cumulative distribution function (CDF) for each individual student compared to the entire set of students. The empirical CDF evaluated at a student's score is really just that student's percentile. 
3. We should note too that we aren't education experts and aren't familiar with the Stanford Tests. (*Go Berkeley*). It might be more appropriate to find a student's empirical CDF within each test and then average these. Or it might not. 
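
As a rough sketch of the steps above, here is one way to build the composite and its percentile with `pandas`. The five students and the column names `read`, `math`, and `listen` are made up for illustration; `rank(pct=True, method="max")` gives the share of students at or below each score, which is the empirical CDF.

```python
import pandas as pd

# Made-up scores for five students (column names are hypothetical).
scores = pd.DataFrame({
    "read": [440, 430, 455, 420, 445],
    "math": [480, 470, 500, 465, 490],
    "listen": [540, 530, 550, 525, 535],
})
tests = ["read", "math", "listen"]

# Step 1: composite score. skipna=False keeps the sum missing if any test is missing.
scores["composite"] = scores[tests].sum(axis=1, skipna=False)

# Step 2: empirical CDF of the composite, expressed on a 0-100 scale.
scores["percentile"] = scores["composite"].rank(pct=True, method="max") * 100

# Alternative from step 3: percentile within each test, then averaged.
scores["percentile_avg"] = scores[tests].rank(pct=True, method="max").mean(axis=1) * 100

print(scores)
```

The two versions need not agree: a student who is middling on every test can land at a different percentile than the average of their per-test percentiles suggests.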


## Produce Summary Statistics 
- What are the average rates of free lunch, "White/Asian", age, class size? 
- Should these averages be the same in the different treatment groups? 
- Are they? 

Conduct a covariate balance test (probably using `statsmodels`) that lets you assess whether treatment and control were *actually* assigned at random. 

- What would be the consequence of failing to randomize? Would this be a problem? Why or why not? 
- Does it look as though the randomization "worked"? How can you know? 
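
One possible shape for the balance check, sketched on simulated data: tabulate covariate means by arm, then regress the treatment indicator on the covariates and look at the joint F-test. Everything here is an assumption for illustration, including the binary `small` indicator and the covariate names; the real experiment has three class types and assigned them within schools, so the model would need adapting.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data in which treatment really is independent of the covariates.
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "small": rng.integers(0, 2, n),        # hypothetical 0/1 treatment indicator
    "free_lunch": rng.integers(0, 2, n),
    "white_asian": rng.integers(0, 2, n),
    "age_1985": rng.normal(5.4, 0.4, n),
})
covariates = ["free_lunch", "white_asian", "age_1985"]

# Covariate means in each arm.
balance_table = toy.groupby("small")[covariates].mean()
print(balance_table)

# Joint test: the regression F-statistic tests that all covariate slopes are zero.
balance_fit = smf.ols("small ~ free_lunch + white_asian + age_1985", data=toy).fit()
print("joint F-test p-value:", balance_fit.f_pvalue)
```

A large p-value on the joint test is what we would expect to see most of the time when assignment is random; it does not prove randomization, it only fails to contradict it.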



# Estimate Effects

## 1. Difference in Means 
As we have made a point of saying in the course, there are a number of mechanical ways to estimate an average treatment effect. One simple way is to use the difference in means between the two groups. 

Compute this difference, then use *Field Experiments* equation 3.6 to estimate the standard error of the difference in means between the treatment groups. Does it seem that there is a treatment effect? 
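
As a small worked sketch, assume equation 3.6 is the usual estimator $\widehat{SE} = \sqrt{\widehat{Var}(Y_i(0))/(N-m) + \widehat{Var}(Y_i(1))/m}$, with $m$ treated units out of $N$ (check this against the book). The outcomes below are invented so that the arithmetic is easy to follow by hand.

```python
import numpy as np

# Invented outcomes: m = 4 treated students and N - m = 5 control students.
y_treat = np.array([60.0, 70.0, 80.0, 90.0])
y_ctrl = np.array([50.0, 55.0, 60.0, 65.0, 70.0])

# Difference in means: 75 - 60.
ate_hat = y_treat.mean() - y_ctrl.mean()

# Sample variances (ddof=1), each divided by its own group size, then summed.
se_hat = np.sqrt(
    y_treat.var(ddof=1) / y_treat.size + y_ctrl.var(ddof=1) / y_ctrl.size
)

print("estimated ATE:", ate_hat)
print("estimated SE:", se_hat)
```

Here the treated variance is 500/3 and the control variance is 62.5, so the squared standard error is 500/12 + 62.5/5 = 325/6.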


## 2. Regression Estimates 

You can also use a linear model, which will provide us with the same point estimate and very nearly the same p-value, and whose treatment coefficient can still be interpreted as an estimate of the causal effect. Estimate this difference via OLS regression, using `statsmodels`, and draw a conclusion about the magnitude of the treatment effect and whether an effect of this size was likely or unlikely to occur by chance. 

Does the conclusion that you reach differ any/much from the conclusion that you reached from the difference in means estimates? 
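
To see why the two approaches line up, here is a minimal sketch on simulated data. The binary `small` indicator, the `percentile` outcome, and the effect size of 5 are all assumptions made for the illustration; with the real data the treatment variable has more than two categories.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated experiment with a true effect of 5 percentile points.
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame({"small": rng.integers(0, 2, n)})
toy["percentile"] = 50 + 5 * toy["small"] + rng.normal(0, 20, n)

# OLS of the outcome on the treatment indicator.
ols_fit = smf.ols("percentile ~ small", data=toy).fit()

# The same quantity computed directly as a difference in group means.
group_means = toy.groupby("small")["percentile"].mean()
diff_in_means = group_means.loc[1] - group_means.loc[0]

print("OLS coefficient:    ", ols_fit.params["small"])
print("difference in means:", diff_in_means)
print("OLS p-value:        ", ols_fit.pvalues["small"])
```

With a single binary regressor, the OLS slope is exactly the difference in group means; only the standard error depends on which variance formula is used.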

## Correct the Standard Errors 
There are a lot of reasons to suspect that students' performances might not satisfy the assumptions behind the default OLS standard errors, namely that the residuals are normally distributed with the same variance for everyone. 

+ One that immediately comes to mind is that perhaps students who are assigned to treatment have greater variance in their outcomes than students who are assigned to control. This could happen if there are some students who really excel in the small classroom. 
+ Another, which we aren't going to correct for in this document but **certainly** should, is that the outcomes within a school might also be correlated in a way that our model is not accounting for. Including a fixed effect for each school effectively *de-means* the school effects so that we have a more precise estimate of the treatment effect within each school, but it doesn't address any of the *within*-school correlation that might exist. 
+ As we talk about in the course, if we have relatively high intra-cluster correlation (ICC), then our standard errors are inappropriately enthusiastic about rejecting the null hypothesis (they're too small). This is because our standard errors are behaving as though we have `N` independent observations, when functionally we might have (many) fewer useful observations. 

### Correct standard errors to be robust SEs 
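
One way to do this in `statsmodels` is to pass a `cov_type` to `fit()`. The sketch below uses simulated data in which the treated group is deliberately noisier than the control group; the variable names and numbers are assumptions for illustration. We use `"HC2"` here because, with a single binary regressor, it reproduces the unequal-variance difference-in-means standard error, though `"HC1"` or `"HC3"` would also be reasonable choices.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data where treated outcomes have twice the spread of control outcomes.
rng = np.random.default_rng(2)
n = 400
toy = pd.DataFrame({"small": rng.integers(0, 2, n)})
noise_sd = np.where(toy["small"] == 1, 30.0, 15.0)
toy["percentile"] = 50 + 5 * toy["small"] + rng.normal(0, 1, n) * noise_sd

# Same model, two covariance estimators: classical and heteroskedasticity-robust.
classical = smf.ols("percentile ~ small", data=toy).fit()
robust = smf.ols("percentile ~ small", data=toy).fit(cov_type="HC2")

print("coefficient (identical in both):", robust.params["small"])
print("classical SE:", classical.bse["small"])
print("robust SE:   ", robust.bse["small"])
```

The point estimate does not move at all; only the standard error, and therefore the p-value and confidence interval, changes.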

### Clustered SEs 
The last thing here is acknowledging that we've very likely got correlated potential outcomes within schools. Including a fixed effect term for each school removes the possibility of inducing bias in our estimate. However, since we can't plausibly assume that the variance is the same within each of the individual school clusters, failing to appropriately account for the empirical variance might lead us to estimate inappropriately small standard errors. Why would this be a problem? Well, if we want to falsely reject the null in only 5% of cases ($\alpha = 0.05$), then if we make the wrong assumptions we might falsely reject the null hypothesis at higher (or even lower) rates. Typically, we aren't kept up at night if we're a little conservative in our estimates, but being too *gung-ho* is a problem. 

As noted in both *Field Experiments* and *Mostly Harmless Econometrics*, the appropriate way to assess this is to calculate the estimated variance within each of the clusters and appropriately combine these estimates. Conceptually this is pretty simple, but getting the maths just right can be a little particular. 
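
Rather than working through the maths by hand, one option is to let `statsmodels` do the combining via `cov_type="cluster"`. The following is only a sketch on simulated data: the 40 schools, the school-level shock, and the variable names `school_id`, `small`, and `percentile` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: 40 schools of 25 students, with a shared shock inside each school.
rng = np.random.default_rng(3)
n_schools, per_school = 40, 25
school_id = np.repeat(np.arange(n_schools), per_school)
school_shock = rng.normal(0, 10, n_schools)[school_id]

toy = pd.DataFrame({
    "school_id": school_id,
    "small": rng.integers(0, 2, school_id.size),
})
toy["percentile"] = (
    50 + 5 * toy["small"] + school_shock + rng.normal(0, 15, school_id.size)
)

# School fixed effects in the model, standard errors clustered by school.
model = smf.ols("percentile ~ small + C(school_id)", data=toy)
unclustered = model.fit()
clustered = model.fit(cov_type="cluster", cov_kwds={"groups": toy["school_id"]})

print("coefficient: ", clustered.params["small"])
print("default SE:  ", unclustered.bse["small"])
print("clustered SE:", clustered.bse["small"])
```

As with the robust standard errors, the coefficient is unchanged and only its estimated uncertainty differs. With few clusters the clustered estimate can itself be unreliable, so it is worth treating the number of schools as the effective sample size when judging it.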

