## Lab 5
### UGBA 88: Data and Decisions, Fall 2019

<br>

This lab is designed to be completed in class. However, in case you need additional time, this assignment is due **Tuesday, November 19th at 11:59pm**.

The lab will be graded for **completion**. Lab office hours are held by Connector Assistants on Tuesdays after labs from 2-4pm in the DS Nexus in Moffitt.

## Vote 2002 Revisited: Using Regression to Estimate Causal Effects

In this lab we will go over using regression for causal inference and re-analyze data from the Vote 2002 voter mobilization campaign.

---

### Table of Contents

[Background Review](#background)<br>
1 - [Controlling for a Single Covariate](#covariate)<br>
2 - [Controlling for a Set of Covariates](#covariates)<br>
3 - [The Vote 2002 Experiment](#experiment)<br>

**Dependencies:**

In [None]:
from datascience import *
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

from client.api.notebook import Notebook
ok = Notebook('lab5.ok')
_ = ok.auth(inline=True)

----
## Background Review <a id='background'></a>

We will study the same Vote 2002 voter mobilization campaign from last lab. Facing the 2002 midterm election and fearing another low turnout, civic groups in Iowa and Michigan launched the Vote 2002 campaign to boost voter turnout. In the week prior to the election, Vote 2002 volunteers placed phone calls to 60,000 voters and gave them the following message:

*"Hello, may I speak with [name of person] please? Hi. This is [caller's name] calling from Vote 2002, a nonpartisan effort working to encourage citizens to vote. We just wanted to remind you that elections are being held this Tuesday. The success of our democracy depends on whether we exercise our right to vote or not, so we hope you'll come out and vote this Tuesday. Can I count on you to vote next Tuesday?"*

Once again, our causal question is: did the Vote 2002 campaign work? Did it increase voter turnout in the Congressional elections?

Today our method of choice will be **regression.** Voters that were successfully contacted by the Vote 2002 campaign differ systematically from voters that were not successfully contacted. We will try to control for those differences via regression, with the hope that any remaining differences in turnout can be attributed to the campaign itself.

----
### The Dataset

The dataset we'll use was compiled by the Vote 2002 campaign staff. They recorded whether each voter was successfully contacted and combined these data with voter registration records on demographics and voter turnout.

Here is a description of each column in the dataset:

* `contact`: indicator for whether voter was successfully contacted by volunteer
* `vote02`: whether the voter voted in the 2002 election (*this is the outcome of interest*)
* `vote98`: whether the voter voted in the 1998 election
* `newreg`: indicator for newly registered voter
* `age`:  age of voter
* `female`: indicator for female
* `county`: county code
* `treatment`: we'll discuss this one later

This time we'll work with a much larger data set, which includes *all competitive districts in Michigan*. We can do this because controlling for covariates via regression is generally much faster than controlling via matching.

In [None]:
#run this cell to load the data

#read in data
data = Table.read_table("mi_voter_clean.csv")

data.show(10)

## Section 1: Controlling for a Single Covariate <a id='covariate'></a>

We'll begin by computing voter turnout in the 2002 election for contacted and non-contacted voters.

In [None]:
data.select('vote02', 'contact').group('contact', collect = np.mean)

We can actually use regression to get these two averages. To do that, we can estimate a regression model of the form:

$$\text{Vote02}_{i} = \alpha + \beta \times \text{Contact}_{i} + \epsilon_{i}$$

The intercept, $\alpha$, should be equal to the average of `vote02` for noncontacted voters (where `contact` = 0), while the coefficient $\beta$ should equal the difference in means between contacted and noncontacted voters. As a consequence, $\alpha + \beta$ should be equal to the average of `vote02` for contacted voters.

Let's confirm this.

**Q1.1:** Estimate the regression model specified above.

As in Data 8, we'll first build a function that produces the Mean Squared Error ('MSE', aka the mean squared residual) for a given slope and intercept. Then we'll use the `minimize()` function to find the parameters that minimize MSE. These are our least squares regression coefficients.

In [None]:
def vote02_short_mse(treatment_slope, intercept):
    t = data.column('contact')
    y = data.column('vote02')
    fitted = intercept + treatment_slope*t
    return np.mean((y - fitted) ** 2)

In [None]:
coefficients_short = minimize(vote02_short_mse)
coefficients_short

In [None]:
_ = ok.grade("q1_1")

The first number is the least squares slope, and the second number is the least squares intercept. This order matches the order of the arguments of the function `vote02_short_mse` defined above.

The second number should be equal to the mean for `vote02` among the non-contacted. The two numbers added together should be equal to the mean for `vote02` among the contacted.
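To see why a regression on a single 0/1 variable reproduces the two group means, here is a minimal sketch on a tiny made-up data set (not the lab data). It uses NumPy's closed-form least squares solver instead of `minimize`, and the arrays `contact` and `vote` are invented for illustration.

```python
import numpy as np

# Made-up data for illustration only (not the lab data).
contact = np.array([0, 0, 0, 0, 1, 1, 1, 1])
vote = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Design matrix: a column of ones (intercept) and the binary regressor.
X = np.column_stack([np.ones(len(contact)), contact])
(intercept, slope), *_ = np.linalg.lstsq(X, vote, rcond=None)

mean_noncontacted = vote[contact == 0].mean()
mean_contacted = vote[contact == 1].mean()

# The intercept matches the mean of the contact == 0 group, and
# intercept + slope matches the mean of the contact == 1 group.
print(intercept, mean_noncontacted)
print(intercept + slope, mean_contacted)
```

With a binary regressor the fitted line only has to pass through two points, one per group, so the least squares fit lands exactly on the two group means.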

OK, now let's return to measuring the causal effect of receiving a Vote 2002 campaign call on voter turnout. One method is to use exact **matching** to control for covariates that potentially differed between contacted and noncontacted voters. However, in this lab, we'll use **regression** and see how the results compare.

Let's first try controlling for `age`. We'll begin by comparing the age distributions for contacted and noncontacted voters in this larger data set.

In [None]:
data.hist('age', group = 'contact', bins = 20)

As before, contacted voters are generally older. This may introduce **selection bias** (and, in turn, an **omitted variable bias**) if older voters are also more likely to vote, for example.

For reference, recall what we did in last week's lab to control for age. For each contacted voter, we found an exact age match among the noncontacted voters, and then compared the turnout rates for contacted and matched noncontacted voters. 
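As a reminder of the idea, here is a simplified sketch of exact matching on age using a tiny made-up data set. It is not the code from last week's lab: instead of picking a single match, it compares each contacted voter to the average turnout of all noncontacted voters of exactly the same age, and all arrays are invented for illustration.

```python
import numpy as np

# Made-up data for illustration only (not the lab data).
age     = np.array([30, 30, 30, 30, 50, 50, 50, 50, 50, 50])
contact = np.array([ 1,  0,  0,  0,  1,  1,  1,  1,  0,  0])
vote    = np.array([ 1,  0,  0,  1,  1,  1,  1,  0,  1,  0])

treated = contact == 1
control = contact == 0

# Raw difference in turnout, ignoring age.
raw_diff = vote[treated].mean() - vote[control].mean()

# For each contacted voter, use the average turnout of noncontacted
# voters with exactly the same age as the matched comparison.
matched_control = np.array([
    vote[control & (age == a)].mean() for a in age[treated]
])
matched_diff = vote[treated].mean() - matched_control.mean()

print(raw_diff, matched_diff)
```

In this toy example the contacted group is older and older voters turn out more, so the matched difference is smaller than the raw difference.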

We will not run our matching code from last lab here because, given the substantially larger data set we're using today, it would take an excessively long time to run. Instead, we've done the matching for you, and found that matching on age moves the difference in turnout rates from **6.6 to about 4.7 percentage points**.


Now let's try regression.

**Q1.2:** Estimate the following regression model:

$$\text{Vote02}_{i} = \alpha + \beta \times \text{Contact}_{i} + \gamma \times \text{Age}_{i} + \epsilon_{i}$$

In [None]:
#complete function below
def vote02_long_mse(treatment_slope, slope1, intercept):
    t = data.column('contact')
    x1 = ...
    y = ...
    fitted = intercept + treatment_slope*t + slope1*x1
    return np.mean((y - fitted) ** 2)

In [None]:
coefficients_long = minimize(vote02_long_mse)
coefficients_long

In [None]:
_ = ok.grade("q1_2")

**Q1.3:** What is the difference in voter turnout between contacted and noncontacted voters, controlling for `age`? How does this compare to the raw difference (without controls)?

*#write your answer here*

As with matching, controlling for age reduces the difference in turnout rates. In fact, matching produces nearly the same result.

Regression also allows us to examine how age relates to voter turnout.

**Q1.4:** In a complete sentence, describe the *interpretation* of your estimate for $\gamma$, the regression coefficient on `age`.

*#write your answer here*

In lecture, we learned how to measure **omitted variable bias** for a given covariate. Let's apply this formula for the covariate `age`, and calculate the omitted variable bias associated with leaving `age` out of the regression model. The omitted variable bias formula from lecture applied here is:
$$\beta^{\text{short}} - \beta^{\text{long}} = \gamma \times \pi_{\text{Age}}$$

where $\beta^{\text{short}}$ is the coefficient for `contact` from **Q1.1**, $\beta^{\text{long}}$ is the coefficient for `contact` from **Q1.2**, $\gamma$ is the coefficient for `age` from **Q1.2**, and $\pi_{\text{Age}}$ is the slope coefficient for a regression model where the dependent variable is `age` and the explanatory variable is `contact`:

$$\text{Age}_{i} = \alpha_{\text{Age}} + \pi_{\text{Age}} \times \text{Contact}_{i} + \epsilon_{i}$$
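The omitted variable bias formula is an algebraic identity for least squares, so it holds exactly in any sample. Here is a minimal sketch on simulated data (not the lab data) that checks it. The variables `d`, `x`, `y` and the helper `ols` are made up for illustration, and the closed-form solver stands in for `minimize`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated data for illustration only: d is a binary treatment and
# x is a covariate that is correlated with d and also affects y.
d = rng.integers(0, 2, size=n)
x = 40 + 10 * d + rng.normal(0, 5, size=n)
y = 0.2 + 0.05 * d + 0.01 * x + rng.normal(0, 0.1, size=n)

def ols(outcome, *regressors):
    """Least squares coefficients, intercept first (hypothetical helper)."""
    X = np.column_stack([np.ones(len(outcome)), *regressors])
    return np.linalg.lstsq(X, outcome, rcond=None)[0]

beta_short = ols(y, d)[1]             # short regression: y on d
_, beta_long, gamma = ols(y, d, x)    # long regression: y on d and x
pi = ols(x, d)[1]                     # auxiliary regression: x on d

left_side = beta_short - beta_long
right_side = gamma * pi
print(left_side, right_side)
```

The two printed numbers agree up to floating-point error. With `minimize`, which searches numerically, expect agreement only to a few decimal places.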

In [None]:
def age_mse(slope, intercept):
    x = data.column('contact')
    y = data.column('age')
    fitted = intercept + slope*x
    return np.mean((y - fitted) ** 2)

In [None]:
age_coefficients = minimize(age_mse)
age_coefficients

Let's confirm this formula. First, we'll compute the left side of the formula.

In [None]:
coefficients_short[0] - coefficients_long[0]

**Q1.5:** Compute the right-hand side of the omitted variable bias formula given above. (The result should be very close to the number above.)

In [None]:
#write code here

----

## Section 2: Controlling for a Set of Covariates <a id='covariates'></a>

Controlling for age in the regression balances the contacted and noncontacted voters on age, but we know that other differences remain between the two groups. These differences may introduce **selection bias**. We summarize those differences below for the covariates we have in our data.

In [None]:
#run this cell
data.group('contact', collect = np.mean)

Contacted voters are more likely to have voted in the 1998 election, less likely to be newly registered voters, and slightly more likely to be female. Just as we matched on a set of covariates, we can also control for a set of covariates in our regression model.
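To illustrate what controlling for several covariates at once does, here is a sketch on simulated data (not the lab data). The variables `t`, `x1`, `x2` and the coefficient values are invented, and the outcome is generated without noise so that the long regression recovers the coefficient on `t` exactly. It uses NumPy's closed-form least squares rather than `minimize`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Simulated data for illustration only. Both covariates are correlated
# with the treatment t and also raise the outcome y.
t = rng.integers(0, 2, size=n).astype(float)
x1 = 40 + 10 * t + rng.normal(0, 5, size=n)
x2 = (rng.random(n) < 0.3 + 0.3 * t).astype(float)

true_effect = 0.05
y = 0.1 + true_effect * t + 0.002 * x1 + 0.3 * x2   # no noise, by design

# Short regression: y on t only. Long regression: y on t, x1 and x2.
X_short = np.column_stack([np.ones(n), t])
X_long = np.column_stack([np.ones(n), t, x1, x2])
coef_short = np.linalg.lstsq(X_short, y, rcond=None)[0]
coef_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

print("short:", coef_short[1])   # overstates the effect of t
print("long: ", coef_long[1])    # recovers true_effect
```

The short regression attributes the influence of `x1` and `x2` to `t`. Including them as controls removes that bias here because, in this simulation, they are the only confounders.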

**Q2.1:** Estimate a regression model that, in addition to `contact`, includes the following covariates as controls: `age`, `newreg`, `female`, and `vote98`.

In [None]:
#complete function below
def vote02_longer_mse(treatment_slope, slope1, slope2, slope3, slope4, intercept):
    t = ...
    x1 = ...
    x2 = ...
    x3 = ...
    x4 = ...
    y = ...
    fitted = ...
    return np.mean((y - fitted) ** 2)

In [None]:
coefficients_longer = minimize(vote02_longer_mse)
coefficients_longer

In [None]:
_ = ok.grade("q2_1")

**Q2.2:** What is the difference in voter turnout between contacted and noncontacted voters, controlling for `age`, `newreg`, `female`, and `vote98`?

*#write your answer here*

Does this difference reflect the *causal effect* of `contact` on voter turnout? Recall that, for that to be the case, the Selection on Observables assumption must hold.

**Q2.3:** What would the Selection on Observables assumption mean in this context? What would we be assuming exactly?

*#write your answer here*

**Q2.4:** Does that assumption seem reasonable here? Why or why not?

*#write your answer here*

**Q2.5:** One covariate we do not have in our data that may be a confounder is how *busy* a voter is with work, taking care of family members, or other obligations (outside of voting). Imagine a covariate `busy` that measured how busy a voter is on a scale from 1 (not busy) to 5 (very busy). What is the *sign* you would expect for the omitted variable bias associated with `busy` (where, as above, the treatment is `contact` and the outcome is `vote02`)? Why?

*#write your answer here*

----
## Section 3: The Vote 2002 Experiment <a id='experiment'></a>

We've been trying to estimate the causal effect of the Vote 2002 mobilization campaign on voter turnout. We have tried using observational methods for measuring this causal effect, including matching and regression. We typically try these approaches when we have not run a randomized experiment but still want to answer a causal question.

But we're in luck. In fact, the Vote 2002 campaign ran a randomized experiment! The campaign made 60,000 calls in total, and those calls were made to a **randomly selected** set of households. That mysterious column `treatment` indicates whether a voter was randomly selected to receive a call.

We can use this experiment to check whether our observational methods gave us the right answer for the causal effect of receiving a Vote 2002 campaign call. That's what we'll do for the remainder of the lab.

Note that, since we're interested in the causal effect of a voter actually receiving a call, we have a *noncompliance* problem here. The Vote 2002 campaign tried to reach every randomly selected household, but some were not successfully contacted. This may happen if a voter never answers the phone or their listed phone number is no longer in service. However, only voters that were randomly selected can receive a Vote 2002 call.

**Q3.1:** Recall from our section on Noncompliance in experiments that there are 4 types of units/subjects in an experiment: compliers, always-takers, never-takers, and defiers. Which types exist in this experiment?

*#write your answer here*

Recall that in this type of scenario, we can use **treatment assignment** as an **instrument** to measure the causal effect of receiving a Vote 2002 call. With no always-takers in our context, we can measure the average **treatment effect on the treated** using this approach.

For this to work, we also need to make the following assumptions:

* *Relevance*: The instrument predicts the treatment of interest.

**We will check this directly below.**

* *Independence*: The instrument is randomly assigned across units.

**The Vote 2002 campaign claims they randomized. We can check for balance below.**

* *Exclusion restriction*: The instrument *only affects the outcome through the treatment*.

**Untestable, but this seems reasonable. We would not expect voters whom the Vote 2002 campaign tried but failed to contact to be affected.**

* *Monotonicity*: No defiers.

**See your answer to Q3.1 above.**

**Q3.2:** Households were reportedly randomly assigned to receive calls. Check for balance between voters that were assigned to receive a call and those that were not on the covariates we have.

In [None]:
#check for balance: calculate average covariates for treated and untreated

Looks good--on every dimension, those assigned to treatment are similar to those who are not.
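For intuition about what balance looks like under true randomization, here is a small sketch on simulated data (not the lab data). The arrays `assigned`, `age`, and `female` are invented, and the covariates are drawn independently of assignment.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40_000

# Simulated data for illustration only (not the lab data).
assigned = rng.integers(0, 2, size=n)          # random assignment, 0 or 1
age = rng.normal(50, 15, size=n)               # independent of assignment
female = (rng.random(n) < 0.5).astype(float)   # independent of assignment

# Difference in covariate means: assigned minus not assigned.
covariates = {"age": age, "female": female}
balance = {
    name: x[assigned == 1].mean() - x[assigned == 0].mean()
    for name, x in covariates.items()
}
print(balance)
```

The differences are not exactly zero, but they are small relative to the spread of each covariate, which is what random assignment should produce in a large sample.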

Now let's estimate the treatment effect on the treated in three steps. We'll use the LATE formula:

$$\text{LATE} \underbrace{=}_{\text{no always-takers}} \text{TOT} = \frac{\text{reduced form}}{\text{first stage}} = \frac{E[Y_{i} | Z_{i} = 1] - E[Y_{i} | Z_{i} = 0]}{E[D_{i} | Z_{i} = 1] - E[D_{i} | Z_{i} = 0]}$$
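Here is a sketch of the ratio above on a simulated experiment with one-sided noncompliance (not the lab data). Everything in it is an assumption made for illustration: 40% of units are compliers, compliers have higher baseline turnout, and the true effect of treatment is 3 percentage points.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# Simulated experiment for illustration only (not the lab data).
z = rng.integers(0, 2, size=n)       # random assignment (the instrument)
complier = rng.random(n) < 0.4       # takes the treatment if assigned
d = z * complier                     # nobody is treated when z == 0

# Compliers vote more often even without treatment, so comparing
# treated and untreated units directly is biased upward.
true_effect = 0.03
p = 0.4 + 0.2 * complier + true_effect * d
y = (rng.random(n) < p).astype(float)

first_stage = d[z == 1].mean() - d[z == 0].mean()
reduced_form = y[z == 1].mean() - y[z == 0].mean()
wald = reduced_form / first_stage

naive = y[d == 1].mean() - y[d == 0].mean()
print("first stage: ", first_stage)
print("reduced form:", reduced_form)
print("ratio:       ", wald)
print("naive:       ", naive)
```

In this simulation the ratio lands near the true effect, while the naive treated-versus-untreated comparison is much larger because it also picks up the baseline difference between compliers and never-takers.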

**Q3.3:** First, estimate the first stage. [Hint: this is the difference in our 'treatment of interest', `contact`, between those assigned to the treatment (`treatment` = 1) and those not assigned to the treatment (`treatment` = 0).]

In [None]:
#calculate the first stage, which should be a single value
first_stage = ...
first_stage

In [None]:
_ = ok.grade("q3_3")

Let's interpret this number. It tells us how often the Vote 2002 campaign was able to successfully contact the voters it was attempting to contact. It was able to do so only 38.8% of the time.

**Q3.4:** Next, estimate the reduced form.

In [None]:
#calculate the reduced form, which should be a single value
reduced_form = ...
reduced_form

In [None]:
_ = ok.grade("q3_4")

This tells us the causal effect of being assigned to the treatment group--that is, *the causal effect of Vote 2002 **attempting** to contact a voter*--on 2002 voter turnout. The effect is basically zero.

**Q3.5:** Finally, estimate LATE (which in this case is equal to the average treatment effect on the treated).

In [None]:
#write your code here
LATE = ...
LATE

In [None]:
_ = ok.grade("q3_5")

**Q3.6:** Interpret your estimate in plain English: what is the average causal effect that you found? How does this estimate compare to the estimates we got with regression in **Q2.1**? What's going on here?

*#write your answer here*

----
Congratulations, you've finished Lab 5! To submit the lab, run the two cells below:

In [None]:
# For your convenience, you can run this cell to run all the tests at once
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
_ = ok.submit()