## Homework 4 - Randomized Experiments versus Observational Studies

**Complete the coding questions in this notebook, and use your results to answer *Lab Homework 4* on bCourses**

For this homework we will be using the data: "Voting.csv"



**Bold text is the actual Homework Quiz question.**

----

## Background <a id='bground'></a>

Compared to other developed countries, the United States has low voter turnout rates. About 55.5% of the U.S. voting-age population cast ballots in the 2016 presidential election. In the 2000 presidential election, the turnout rate was 50.3%.

Fearing another low turnout in the 2002 midterm elections, civic groups in Iowa and Michigan launched the Vote 2002 Campaign to boost voter turnout. In the week prior to the election, Vote 2002 volunteers called 60,000 voters on the phone and gave them the following message:

*"Hello, may I speak with [name of person] please? Hi. This is [caller's name] calling from Vote 2002, a nonpartisan effort working to encourage citizens to vote. We just wanted to remind you that elections are being held this Tuesday. The success of our democracy depends on whether we exercise our right to vote or not, so we hope you'll come out and vote this Tuesday. Can I count on you to vote next Tuesday?"*

Our causal question of interest is: **Did the Vote 2002 campaign work? Did it increase voter turnout in the 2002 Congressional elections?** 

To estimate the causal effect of receiving a Vote 2002 phone call, we'll need to compare the outcomes of voters who received a call--we'll call them **contacted** voters--to the outcomes for some comparison group. For our causal effect estimate to be accurate, the comparison group we use will need to reflect the *counterfactual* outcomes for contacted voters: what those voters *would have* done if they had not received the Vote 2002 call. If the comparison group poorly represents the counterfactual, our estimate will be biased.

For the comparison group to represent the counterfactual, we need the only relevant difference between contacted voters and comparison voters to be that comparison voters did not receive a phone call while contacted voters did. We will first use regression to adjust for observed differences between the contacted and comparison groups. We will then compare results from this regression-based approach to what we get from a randomized experiment.
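
One way to make this bias concrete is the standard potential-outcomes decomposition. The notation is an assumption introduced here, not part of the assignment: $D_i = 1$ if voter $i$ was contacted, $Y_i$ is observed turnout, and $Y_{1i}$ and $Y_{0i}$ are voter $i$'s turnout with and without a call.

$$E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \underbrace{E[Y_{1i} - Y_{0i} \mid D_i = 1]}_{\text{effect on contacted voters}} + \underbrace{E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]}_{\text{selection bias}}$$

The second term is zero only when noncontacted voters are a valid stand-in for what contacted voters would have done without the call.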

----

## Section 1: The Dataset <a id='data'></a>

The dataset we'll use was compiled by the Vote 2002 campaign staff. The staff had demographic data for each voter they attempted to contact, and marked whether each call was completed successfully or not. Later, in order to measure the impact of their campaign, they merged these data with official voting records to see if voters did go out and vote.

Here is a description of each column in the dataset:

* `contact` - indicator for whether the voter was successfully contacted by a volunteer
* `vote02` - indicator for whether the voter voted in the 2002 election (*this is the outcome of interest*)
* `vote98` - indicator for whether the voter voted in the 1998 election
* `newreg` - indicator for whether the voter is newly registered
* `age` - age of the voter
* `female` - indicator for female
* `county` - county code
* `treatment` - we'll discuss this one later

The data we will work with cover over 300,000 voters across all competitive districts in Michigan.

In [2]:
# Load packages
library(dplyr)
library(tidyr)
library(estimatr)
library(ggplot2)


Load the data "Voting.csv", which is posted on bCourses.

In [9]:
# Load data
Voting <- read.csv("Voting.csv", header = TRUE)
head(Voting)

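| | treatment | contact | vote02 | vote98 | newreg | age | female | county |
|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 0 | 0 | 0 | 0 | 0 | 39 | 1 | 32 |
| 2 | 0 | 0 | 0 | 0 | 0 | 29 | 0 | 32 |
| 3 | 0 | 0 | 0 | 0 | 0 | 38 | 1 | 32 |
| 4 | 0 | 0 | 1 | 0 | 0 | 41 | 0 | 32 |
| 5 | 0 | 0 | 1 | 0 | 0 | 77 | 1 | 32 |
| 6 | 0 | 0 | 0 | 0 | 0 | 33 | 0 | 32 |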


**Q1: What percentage of voters in our data were *contacted*?** (Be sure to report your answer as a *percentage (%)*. Round to nearest whole percentage point.)

In [6]:
# Calculate the mean of the "contact" variable
mean(Voting$contact)*100

**Q2: Between contacted and noncontacted voters, which group has the higher turnout rate?**

(a) Contacted voters, but the difference is *not* statistically significant.

(b) Contacted voters, and the difference *is* statistically significant.

(c) Noncontacted voters, but the difference is *not* statistically significant.

(d) Noncontacted voters, and the difference *is* statistically significant.

In [10]:
# Run a regression with "vote02" as the outcome variable, and "contact" as the regressor. 
model1 <- lm(Voting$vote02 ~ Voting$contact)
summary(model1)

print("b")



Call:
lm(formula = Voting$vote02 ~ Voting$contact)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6238 -0.5579  0.4421  0.4421  0.4421 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.5579343  0.0008691  641.97   <2e-16 ***
Voting$contact 0.0658195  0.0065660   10.02   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4964 on 332082 degrees of freedom
Multiple R-squared:  0.0003025,	Adjusted R-squared:  0.0002995 
F-statistic: 100.5 on 1 and 332082 DF,  p-value: < 2.2e-16


[1] "b"


Does the difference in turnout rates between contacted and noncontacted voters reflect the causal effect of receiving a call? It might. But it might not if contacted and noncontacted voters differ in ways other than whether they received a call.
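
The regressions in this notebook compare two groups by regressing a variable on a 0/1 indicator. As a minimal sketch on simulated data (the data frame `sim` and its parameters are made up for illustration and are not the `Voting.csv` values), the slope from such a regression is the difference in group means:

```r
# Simulated data only; these are NOT the Voting.csv values.
set.seed(1)
n <- 1000
sim <- data.frame(contact = rbinom(n, 1, 0.3))
sim$vote02 <- rbinom(n, 1, 0.5 + 0.1 * sim$contact)

# Difference in mean turnout between the two groups
diff_means <- mean(sim$vote02[sim$contact == 1]) -
  mean(sim$vote02[sim$contact == 0])

# Slope from a regression of the outcome on the binary indicator
slope <- unname(coef(lm(vote02 ~ contact, data = sim))["contact"])

# The two numbers agree up to floating-point error
all.equal(diff_means, slope)
```

The regression form is convenient because `summary()` also reports a standard error and t-statistic for the difference.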

As a starting point, compare the average ages for the two groups of voters.

In [12]:
# Run a regression with "age" as the outcome variable, and "contact" as the regressor. 
model1 <- lm(Voting$age ~ Voting$contact)
summary(model1)



Call:
lm(formula = Voting$age ~ Voting$contact)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.218 -13.768  -3.768  12.232  61.232 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    50.76792    0.03036 1672.29   <2e-16 ***
Voting$contact  2.45054    0.22936   10.68   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.34 on 332082 degrees of freedom
Multiple R-squared:  0.0003436,	Adjusted R-squared:  0.0003406 
F-statistic: 114.2 on 1 and 332082 DF,  p-value: < 2.2e-16


**Q3: Given the average ages that you found for contacted and noncontacted voters, what should we infer about the potential for selection bias if we were to estimate the causal effect of the campaign by simply comparing contacted and noncontacted voters? (Keep in mind that older people are generally more likely to vote.)**

(a) Selection bias will be positive--make the causal effect look more positive than it is.

(b) Selection bias will be negative--make the causal effect look more negative than it is.

(c) Selection bias is negligible.

In [25]:
print("a")

[1] "a"




**Q4:** Estimate the following regression model:

$$\text{Vote02}_{i} = \alpha + \beta \times \text{Contact}_{i} + \gamma \times \text{Age}_{i} + \epsilon_{i}$$

**What is the difference in voter turnout between contacted and noncontacted voters, controlling for `age`? (Express in percentage points. Round to nearest whole percentage point.)**

In [24]:
# Run a regression with "vote02" as the outcome variable, and "contact" + "age" as regressors. 


model1 <- lm(Voting$vote02 ~ Voting$contact + Voting$age)
summary(model1)

print("The difference in voter turnout is ~5 percentage points")



Call:
lm(formula = Voting$vote02 ~ Voting$contact + Voting$age)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9926 -0.4815  0.2487  0.4475  0.6747 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.975e-01  2.584e-03  76.431  < 2e-16 ***
Voting$contact 4.842e-02  6.362e-03   7.611 2.72e-14 ***
Voting$age     7.099e-03  4.813e-05 147.512  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4809 on 332081 degrees of freedom
Multiple R-squared:  0.06178,	Adjusted R-squared:  0.06177 
F-statistic: 1.093e+04 on 2 and 332081 DF,  p-value: < 2.2e-16


[1] "The difference in voter turnout is ~5 percentage points"
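
The drop from the unadjusted estimate to the age-adjusted one follows the omitted-variable-bias identity for OLS. As a check that uses only coefficients printed in this notebook, let $\hat{\delta}$ (a label introduced here) be the coefficient on `contact` from the regression of `age` on `contact`, and let $\hat{\beta}_{\text{short}}$ (also a label introduced here) be the coefficient on `contact` when `age` is left out:

$$\hat{\beta}_{\text{short}} = \hat{\beta} + \hat{\gamma}\,\hat{\delta} \approx 0.04842 + 0.007099 \times 2.45054 \approx 0.0658$$

This matches the coefficient of 0.0658 from the Q2 regression up to rounding: contacted voters are about 2.5 years older, and each year of age is associated with about 0.7 percentage points more turnout.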


**Q5: Compare contacted voters to noncontacted voters for a broader set of baseline characteristics: `age`, `newreg`, `female`, and `vote98`. For which variable is the t-statistic for the difference in means largest in magnitude?**

(a) `age` (age of voter)

(b) `newreg` (indicator for whether voter is newly registered voter)

(c) `female` (indicator for female)

(d) `vote98` (indicator for whether the voter voted in the 1998 election)


In [14]:
# Run a regression with "age" as the outcome variable, and "contact" as the regressor.
model1 <- lm(Voting$age ~ Voting$contact)
summary(model1)

# Run a regression with "newreg" as the outcome variable, and "contact" as the regressor.
model1 <- lm(Voting$newreg ~ Voting$contact)
summary(model1)

# Run a regression with "female" as the outcome variable, and "contact" as the regressor.
model1 <- lm(Voting$female ~ Voting$contact)
summary(model1)

# Run a regression with "vote98" as the outcome variable, and "contact" as the regressor.
model1 <- lm(Voting$vote98 ~ Voting$contact)
summary(model1)



Call:
lm(formula = Voting$age ~ Voting$contact)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.218 -13.768  -3.768  12.232  61.232 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    50.76792    0.03036 1672.29   <2e-16 ***
Voting$contact  2.45054    0.22936   10.68   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.34 on 332082 degrees of freedom
Multiple R-squared:  0.0003436,	Adjusted R-squared:  0.0003406 
F-statistic: 114.2 on 1 and 332082 DF,  p-value: < 2.2e-16



Call:
lm(formula = Voting$newreg ~ Voting$contact)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.1330 -0.1330 -0.1330 -0.1330  0.8831 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.1329467  0.0005939 223.871  < 2e-16 ***
Voting$contact -0.0160681  0.0044866  -3.581 0.000342 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3392 on 332082 degrees of freedom
Multiple R-squared:  3.862e-05,	Adjusted R-squared:  3.561e-05 
F-statistic: 12.83 on 1 and 332082 DF,  p-value: 0.0003419



Call:
lm(formula = Voting$female ~ Voting$contact)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5488 -0.5409  0.4591  0.4591  0.4591 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.5408808  0.0008724 619.988   <2e-16 ***
Voting$contact 0.0079333  0.0065911   1.204    0.229    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4983 on 332082 degrees of freedom
Multiple R-squared:  4.363e-06,	Adjusted R-squared:  1.351e-06 
F-statistic: 1.449 on 1 and 332082 DF,  p-value: 0.2287



Call:
lm(formula = Voting$vote98 ~ Voting$contact)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.2894 -0.2578 -0.2578  0.7422  0.7422 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.2578142  0.0007663 336.432  < 2e-16 ***
Voting$contact 0.0316324  0.0057896   5.464 4.67e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4377 on 332082 degrees of freedom
Multiple R-squared:  8.988e-05,	Adjusted R-squared:  8.687e-05 
F-statistic: 29.85 on 1 and 332082 DF,  p-value: 4.667e-08


In [26]:
print("Largest t-statistic is age, 10.68")

[1] "Largest t-statistic is age, 10.68"
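
The four regressions above differ only in the outcome variable, so they can be condensed into a loop. This is a minimal sketch on a simulated stand-in for the data (the data frame `sim` and every parameter in it are assumptions for illustration); with the real data, `sim` would be replaced by `Voting`.

```r
# Simulated stand-in for the Voting data; values are made up for illustration.
set.seed(2)
n <- 5000
sim <- data.frame(
  contact = rbinom(n, 1, 0.2),
  newreg  = rbinom(n, 1, 0.13),
  female  = rbinom(n, 1, 0.54),
  vote98  = rbinom(n, 1, 0.26)
)
sim$age <- round(45 + 5 * sim$contact + rnorm(n, sd = 15))

covariates <- c("age", "newreg", "female", "vote98")

# For each covariate, regress it on contact and keep the t-statistic on contact
t_stats <- sapply(covariates, function(v) {
  fit <- lm(reformulate("contact", response = v), data = sim)
  coef(summary(fit))["contact", "t value"]
})

# Covariate whose t-statistic is largest in absolute value
names(which.max(abs(t_stats)))
```

Collecting the t-statistics in one named vector makes it easier to compare magnitudes than scrolling through four separate summaries.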



**Q6:** Estimate a regression model that has `vote02` as the outcome variable, and `contact` as the main regressor of interest. Include the following covariates as controls: `age`, `newreg`, `female`, and `vote98`.

**What is the difference in voter turnout between contacted and noncontacted voters, controlling for `age`, `newreg`, `female`, and `vote98`?**

In [16]:
# Run a regression with "vote02" as the outcome variable, and "contact" + "age" + "newreg" + "female" + "vote98" as regressors.
model1 <- lm(Voting$vote02 ~ Voting$contact + Voting$age + Voting$newreg + Voting$female + Voting$vote98)
summary(model1)


Call:
lm(formula = Voting$vote02 ~ Voting$contact + Voting$age + Voting$newreg + 
    Voting$female + Voting$vote98)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0909 -0.4427  0.1412  0.4438  0.7519 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     2.529e-01  2.776e-03  91.084  < 2e-16 ***
Voting$contact  4.402e-02  6.169e-03   7.135 9.68e-13 ***
Voting$age      5.158e-03  4.947e-05 104.274  < 2e-16 ***
Voting$newreg  -7.086e-02  2.511e-03 -28.224  < 2e-16 ***
Voting$female  -2.678e-02  1.628e-03 -16.444  < 2e-16 ***
Voting$vote98   2.603e-01  1.940e-03 134.192  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4663 on 332078 degrees of freedom
Multiple R-squared:  0.1178,	Adjusted R-squared:  0.1178 
F-statistic:  8868 on 5 and 332078 DF,  p-value: < 2.2e-16


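Did we control for enough covariates to eliminate selection bias? We cannot tell from these data alone: contacted and noncontacted voters may still differ in ways we did not observe.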
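
A small simulation can illustrate why adding controls does not settle the question. In this sketch every number is an assumption chosen for illustration, including a hypothetical unobserved trait called `motivation`; the call is constructed to have no effect on turnout at all.

```r
# Simulation with made-up parameters; nothing here is estimated from Voting.csv.
set.seed(3)
n <- 100000

age        <- round(runif(n, 18, 90))
motivation <- rnorm(n)            # unobserved civic engagement (hypothetical)
treatment  <- rbinom(n, 1, 0.5)   # random assignment to the call list

# Only assigned voters can be reached; older and more motivated voters
# are more likely to pick up the phone.
contact <- treatment * rbinom(n, 1, plogis(-1 + motivation + 0.02 * (age - 54)))

# Turnout depends on age and motivation, but NOT on the call.
vote <- rbinom(n, 1, plogis(-1 + 0.02 * age + motivation))

naive    <- unname(coef(lm(vote ~ contact))["contact"])
adjusted <- unname(coef(lm(vote ~ contact + age))["contact"])
itt      <- unname(coef(lm(vote ~ treatment))["treatment"])

round(c(naive = naive, adjusted = adjusted, itt = itt), 3)
```

Controlling for `age` moves the estimate toward zero but cannot remove the part of the bias that runs through `motivation`, whereas the comparison based on random assignment is not affected by it.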

----
## Section 3: The Vote 2002 Experiment <a id='experiment'></a>

We've been trying to estimate the causal effect of the Vote 2002 mobilization campaign on voter turnout. We have tried using observational methods for measuring this causal effect. We typically try these approaches when we have not run a randomized experiment but still want to answer a causal question.

But we're in luck. In fact, the Vote 2002 campaign ran a randomized experiment! The campaign made 60,000 calls in total, and those calls were made to a **randomly selected** set of households. That mysterious column `treatment` indicates whether a voter was randomly selected to receive a call.

We can use this experiment to check whether our observational methods gave us the right answer for the causal effect of receiving a Vote 2002 campaign call.

Households were reportedly randomly assigned to receive calls. Check for balance between voters who were assigned to receive a call and those who were not for one characteristic: `age`. (A better practice would be to check all characteristics, but let's keep it simple for now.)

**Q7: Comparing treated and nontreated voters, which group is older?**

(a) treatment group

(b) control group

In [19]:
# Run a regression with "age" as the outcome variable, and "treatment" as the regressor.
model1 <- lm(Voting$age ~ Voting$treatment)
summary(model1)

print("treatment group is older")


Call:
lm(formula = Voting$age ~ Voting$treatment)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.859 -13.809  -3.809  12.191  61.191 

Coefficients:
                 Estimate Std. Error  t value Pr(>|t|)    
(Intercept)      50.80859    0.03080 1649.649   <2e-16 ***
Voting$treatment  0.05014    0.14495    0.346    0.729    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.34 on 332082 degrees of freedom
Multiple R-squared:  3.603e-07,	Adjusted R-squared:  -2.651e-06 
F-statistic: 0.1197 on 1 and 332082 DF,  p-value: 0.7294


[1] "treatment group is older"


These two groups are not exactly the same, but the differences should be small given random assignment. Test whether the difference in `age` between the two groups is statistically significant. (Hint: run a regression with `age` as the outcome variable, and `treatment` as the regressor.)

**Q8: What p-value (Pr(>|t|)) is associated with the treatment coefficient in your regression? (round off to two decimal places)** 

In [22]:
print("The p-value is 0.73")

[1] "The p-value is 0.73"
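
The balance check above looks at `age` only. One common way to check several baseline characteristics at once is to regress the assignment indicator on all of them and read the overall F-test. This is a minimal sketch on a simulated stand-in for the data (the data frame `sim` and its parameters are assumptions for illustration), not the assignment's required method.

```r
# Simulated stand-in for the Voting data; values are made up for illustration.
set.seed(4)
n <- 5000
sim <- data.frame(
  treatment = rbinom(n, 1, 0.2),        # random assignment
  age       = round(runif(n, 18, 90)),
  newreg    = rbinom(n, 1, 0.13),
  female    = rbinom(n, 1, 0.54),
  vote98    = rbinom(n, 1, 0.26)
)

# Regress assignment on all baseline characteristics at once
fit <- lm(treatment ~ age + newreg + female + vote98, data = sim)

# Overall F-test: do the covariates jointly predict assignment?
f <- summary(fit)$fstatistic
p_joint <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
p_joint
```

Under successful randomization the covariates should not predict assignment, so a small joint p-value would be a warning sign.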




**Q9: Estimate the causal effect of being assigned to the treatment group--that is, *the causal effect of Vote 2002 outreach*--on 2002 voter turnout. (Do not use additional controls. Express in percentage points. Round to nearest whole percentage point.)**

In [20]:
# Run a regression with "vote02" as the outcome variable, and "treatment" as the regressor.
model1 <- lm(Voting$vote02 ~ Voting$treatment)
summary(model1)

print("The causal effect of being assigned to the treatment group on 2002 voter turnout is about 0.3 percentage points, which rounds to 0")


Call:
lm(formula = Voting$vote02 ~ Voting$treatment)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5621 -0.5589  0.4411  0.4411  0.4411 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.5589436  0.0008817 633.932   <2e-16 ***
Voting$treatment 0.0031853  0.0041496   0.768    0.443    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4965 on 332082 degrees of freedom
Multiple R-squared:  1.774e-06,	Adjusted R-squared:  -1.237e-06 
F-statistic: 0.5893 on 1 and 332082 DF,  p-value: 0.4427


[1] "The causal effect of being assigned to the treatment group on 2002 voter turnout is about 0.3 percentage points, which rounds to 0"


**Q10: True or false: the estimated causal effect is statistically different from zero.**

(a) True

(b) False

In [21]:
print("False, the p-value (0.443) is greater than alpha of .05, so the effect is not statistically different from zero")

[1] "False, the p-value (0.443) is greater than alpha of .05, so the effect is not statistically different from zero"
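
The same conclusion can be read off an approximate 95% confidence interval built from the estimate and standard error in the Q9 output (using the normal critical value 1.96, which is a reasonable approximation at this sample size):

$$0.0031853 \pm 1.96 \times 0.0041496 \approx [-0.0049,\ 0.0113]$$

In percentage points that is roughly -0.5 to 1.1, an interval that contains zero.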


**Q11:** Based on the discrepancy between [the difference in turnout rates of contacted and noncontacted voters] and [the estimate of the causal effect of the Vote 2002 campaign (from the RCT)], what can we conclude?

(a) There are important unobserved differences between contacted and noncontacted voters.

(b) The experiment was not needed to estimate the true causal effect.

In [23]:
print("a")

[1] "a"
