# PS 88 Lab 6 - Potential Outcomes and Causal Inference

In [97]:
import numpy as np
from datascience import Table
%matplotlib inline

## Part 1. Reading real causal claims

**Question 1.1. Read <a href="https://www.bbc.com/news/business-64708832">this article</a> from the BBC this week. Identify two causal claims. For each, answer the following: (1) What are the independent and dependent variables?
(2) Is it a causal claim about a specific case or a general causal claim?
(3) What does the causal claim mean about a counterfactual world with a different value of the independent/treatment variable?**

*Answer to 1.1*

Consider this quote from Senator Lindsey Graham: "Russia's actions had no impact at all on the outcome of the (2016) election." Let's think about translating this into potential outcomes notation. To simplify, we will interpret the "outcome" of the election as the number of electoral college votes won by Trump. We will think of the "treatment" here as a binary variable $D_{2016}$, where $D_{2016}=1$ means "interference by Russia" and $D_{2016}=0$ means "no interference by Russia." We will take it as given that in reality $D_{2016}=1$.
 
 **Question 1.2. Trump won 304 electoral college votes in 2016. How can we express this outcome in potential outcomes notation? (Hint: it should be of the form that Y with some subscript(s) is equal to something. Look at the markdown cell above to see how to make subscripts.)**

*Answer to 1.2*

**Question 1.3. How can we express Graham's claim in potential outcomes notation? (Again, it should be in the form of Y with some subscript(s) being equal to something.)**

*Answer to 1.3*

**Question 1.4. The closest state in the election was my home state of Michigan, which has 16 electoral votes. Suppose someone thinks that Russian interference had a large enough impact to cause Trump to win Michigan, but not any other state. Express this claim in potential outcomes notation.**

*Answer to 1.4*

Finally, here is an example of our
$$
\text{estimate} = \text{target} + \text{bias} + \text{noise}
$$
formula in this context.
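
To make the three pieces of the formula concrete, here is a minimal simulation sketch. Everything in it is a hypothetical setup rather than part of the lab: 50 units, a true effect of zero, and a made-up confounder that raises both the chance of treatment and the outcome. The bias is approximated by averaging the estimate over many simulated samples.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Hypothetical setup (not from the lab): 50 units, a true effect of 0,
# and a confounder that raises both the chance of treatment and the outcome.
n_units, true_effect, n_sims = 50, 0.0, 2000

def one_estimate():
    """Draw one sample and return the naive difference of means."""
    confounder = rng.random(n_units)
    treated = confounder + rng.random(n_units) > 1.0  # more likely when confounder is high
    outcome = confounder + true_effect * treated + rng.normal(0, 0.1, n_units)
    return outcome[treated].mean() - outcome[~treated].mean()

estimates = np.array([one_estimate() for _ in range(n_sims)])

target = true_effect
bias = estimates.mean() - target    # systematic part of the error (approximate)
noise = estimates - target - bias   # what is left over in each simulated sample

print("target:", target)
print("bias (approx.):", round(bias, 3))
print("average noise:", round(noise.mean(), 3))
```

By construction, each simulated estimate equals `target + bias + noise`. The point of the sketch is that the bias does not shrink when you average over many samples, while the noise does.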

**Question 1.5. Suppose an enterprising researcher managed to collect data on whether Russian trolls tried to influence the vote by state during the 2016 election. Just for the sake of argument, suppose 20 states were trolled and 30 were not. To try and learn whether the trolls affected voting behavior, the researcher then compares the average Trump vote share among the states which were trolled vs not. In this situation, what is the estimate? What is the target? What might be a potential source of bias? (Don't worry about noise.)**

*Answer to 1.5*

## Part 2. Lobbying and Corruption in All Seeing Mode

A common concern in many democratic (and less-than-democratic) countries is that those with resources can use lobbying, campaign contributions, or outright bribes to get politicians to do their bidding. Often, the evidence used to make this argument is that politicians who receive lots of money from a particular interest group tend to vote in a way that the interest group wants (a friend of mine has expressed this in <a href="https://www.hrothstein.com/#/the-cost-of-denial/">art form</a>).

Let's think about what causal theories are consistent with this evidence. We will simulate a legislature with 500 members. We assume they all have an ideology which ranges from 0 to 1, where we will interpret this as their predisposition to vote in a "pro-business" fashion.


In [98]:
n_leg = 500
leg_ideol = np.random.rand(n_leg)
leg_data = Table().with_column("Ideology", leg_ideol)
leg_data

Ideology
0.364649
0.104616
0.488303
0.40909
0.0620965
0.522891
0.169154
0.474163
0.751836
0.617603


We may be interested in how legislators vote on particular bills, or their overall voting behavior. Since it will make some calculations a bit more natural, we will do the latter.

An outcome we might care about is what proportion of "pro-business" bills the legislator votes for, which will range from 0 to 1. If our ideology measure has any meaning, then those with a higher ideology should be more likely to vote for these bills. There are probably other factors that matter as well. To capture these ideas, we are going to assume that the proportion of pro-business bills they vote for can be written:
$$
\text{pro} = b_{leg} \times \text{ideology} + (1-b_{leg}) \times e
$$
where $e$ is an *error term* which is a uniform random number between 0 and 1. The $b_{leg}$ variable measures how much ideology is important relative to other considerations. Here is code for this *data generating process*:

In [99]:
b_leg = 1/2
pro = b_leg * leg_ideol + (1-b_leg)*np.random.rand(n_leg)
leg_data = leg_data.with_column("Pro B Votes", pro)
leg_data

Ideology,Pro B Votes
0.364649,0.316457
0.104616,0.290102
0.488303,0.352061
0.40909,0.464841
0.0620965,0.434734
0.522891,0.718556
0.169154,0.364137
0.474163,0.683617
0.751836,0.595216
0.617603,0.534581
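
As a rough illustration of what $b_{leg}$ controls, the sketch below reruns the same data generating process for a few values of $b_{leg}$ and reports how strongly voting tracks ideology. It uses plain NumPy arrays and a seeded `np.random.default_rng` generator instead of the `Table` and `np.random.rand` calls above. Those choices, and the helper name `simulate_pro`, are assumptions made only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded generator (an assumption, for reproducibility)
n_leg = 500
ideology = rng.random(n_leg)

def simulate_pro(b_leg):
    """Proportion of pro-business votes under the formula above."""
    e = rng.random(n_leg)  # error term, uniform on [0, 1]
    return b_leg * ideology + (1 - b_leg) * e

for b_leg in (0.1, 0.5, 0.9):
    pro = simulate_pro(b_leg)
    corr = np.corrcoef(ideology, pro)[0, 1]
    print(f"b_leg = {b_leg}: correlation with ideology = {corr:.2f}")
```

The larger $b_{leg}$ is, the more voting is driven by ideology and the less by the error term.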


Note we haven't said anything about the donor behavior yet; so we have implicitly assumed that this doesn't affect the vote! One theory about donor behavior is that they will give money to those with an aligned ideology in order to help them get re-elected. 

We can model this with a simple utility framework. Suppose the "cost" to donating is $c$, and the benefit is equal to:
$$
\text{benefit} = b_{don} \times \text{ideology} + (1-b_{don}) \times e
$$
where $e$ is a uniform random number between 0 and 1 (drawn separately from the error term in the voting equation). So, when $b_{don}$ is high, the donor puts more weight on ideology, and when $b_{don}$ is low they put more weight on other factors. The donor utility is:
$$
u_{don} = \text{benefit} - c
$$
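
The donor's decision rule can be sketched as follows, assuming the donor gives whenever the benefit covers the cost (that is, whenever $u_{don} \geq 0$). The parameter values, the seeded generator, and the lowercase array names are illustrative assumptions, and the sketch uses plain NumPy arrays rather than a `Table`.

```python
import numpy as np

rng = np.random.default_rng(2)  # seeded generator (an assumption, for reproducibility)
n_leg = 500
ideology = rng.random(n_leg)

b_don, c = 0.5, 0.5                      # illustrative values
benefit = b_don * ideology + (1 - b_don) * rng.random(n_leg)
u_don = benefit - c                      # donor utility
donate = np.where(u_don >= 0, 1, 0)      # 1 if the benefit covers the cost, else 0

print("share receiving a donation:", donate.mean())
print("mean ideology, donation:   ", ideology[donate == 1].mean())
print("mean ideology, no donation:", ideology[donate == 0].mean())
```

Because the benefit rises with ideology, legislators who receive a donation are on average more pro-business than those who do not, even before anyone casts a vote.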

**Question 2.1. Write code to set $b_{don}=1/2$ and $c=1/2$, compute the benefit of donating to each legislator, and make a variable called `Donate` which is equal to 1 when the benefit is greater than or equal to the cost (equivalently, when $u_{don} \geq 0$) and 0 otherwise. (Hint: use the `np.where` function.)**

In [130]:
# Code for 2.1

**Question 2.2. What is the average of the `Pro B Votes` variable among those who receive a donation? Among those who do not? What is the difference of means?**

In [131]:
#Code for 2.2

**Question 2.3. You should get that there is a positive difference of means. But we set this up in a way that there is no real causal effect: the legislator's behavior was unaffected by what the donor did. If someone were to say to you "this just goes to show that politicians do whatever lobbyists want them to do!" what would be a good response based on what we learned this week?**

*Answer to 2.3*

**Question 2.4. One thing we might want to study is how the parameters of this data generating process affect the observed difference of means. Write a function called `getdm(b_leg,b_don,c)` which replicates the analysis above, but with these variables as arguments. (That is, create a Table with the legislator ideology as a variable, then add variables for the legislator voting behavior and the donor choice, then compute the difference of means in voting behavior among those who received donations vs those who did not). Check that `getdm(.5, .5, .5)` gives a similar answer to what you got for 2.2 (it won't be exactly the same due to randomness).**

In [132]:
# Code for 2.4
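
One possible shape for such a function is sketched below. It is not the only way to write it: it works on plain NumPy arrays instead of building a `Table`, and the extra `n_leg` and `rng` arguments are additions made here so that the example is self-contained and reproducible.

```python
import numpy as np

def getdm(b_leg, b_don, c, n_leg=500, rng=None):
    """Difference in mean pro-business voting: donation minus no donation."""
    rng = np.random.default_rng() if rng is None else rng
    ideology = rng.random(n_leg)
    # Voting depends only on ideology and noise, so donations have no causal effect here.
    pro = b_leg * ideology + (1 - b_leg) * rng.random(n_leg)
    # Donor gives when the benefit is at least the cost.
    benefit = b_don * ideology + (1 - b_don) * rng.random(n_leg)
    donate = benefit >= c
    return pro[donate].mean() - pro[~donate].mean()

print(getdm(0.5, 0.5, 0.5, rng=np.random.default_rng(5)))
```

If the parameters are chosen so that every legislator (or no legislator) receives a donation, one of the two group means is undefined and the function returns `nan`.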

**Question 2.5. See what happens if you increase or decrease the `b_leg` parameter. What does this mean in words? Does this lead to more or less selection bias, and why?**

In [133]:
#Code for 2.5

*Answer to 2.5*

**Question 2.6. Say we want to simulate a donor who is *anti-business*, that is, one who prefers to donate to those with a less pro-business ideology. Write code to simulate whether legislators who get contributions from such a donor are more or less likely to vote yes on pro-business legislation.**

In [None]:
# Code for 2.6

*Words for 2.6*

## Part 3. Donations with causation

Now let's do a variant of the analysis above, but where there is a real causal effect of donations. To do that, we will first create a table called `leg_data2` with the legislator ideology and the donation choice, which we treat the same as above.

In [114]:
b_leg=.5
b_don=.5
c=.5
leg_data2 = Table().with_column("Ideology", leg_ideol)
# Note: u_don here holds the donor's benefit; the donor gives when it exceeds the cost c
u_don = b_don*leg_ideol + (1-b_don)*np.random.rand(n_leg)
leg_data2=leg_data2.with_column("Donate", 1*(u_don > c))

Let's suppose that if a legislator does not receive a donation, they vote as we assumed above. If they do receive a donation, their proportion of pro-business votes increases by $0.2$. We will create two separate variables for the potential voting behavior without a donation (`Pro B Votes 0`) and with a donation (`Pro B Votes 1`).

In [115]:
b_bribe = .2
pro0 = b_leg * leg_ideol + (1-b_leg)*np.random.rand(n_leg)
pro1 = pro0 + b_bribe
leg_data2 = leg_data2.with_column("Pro B Votes 0", pro0)
leg_data2 = leg_data2.with_column("Pro B Votes 1", pro1)
leg_data2

Ideology,Donate,Pro B Votes 0,Pro B Votes 1
0.364649,0,0.662683,0.862683
0.104616,0,0.270537,0.470537
0.488303,0,0.630846,0.830846
0.40909,0,0.256343,0.456343
0.0620965,0,0.272727,0.472727
0.522891,0,0.367734,0.567734
0.169154,0,0.336469,0.536469
0.474163,1,0.24887,0.44887
0.751836,0,0.392424,0.592424
0.617603,1,0.690552,0.890552
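
The sketch below illustrates what "all seeing mode" buys us. It rebuilds the same setup with plain NumPy arrays and a seeded generator (both assumptions made for self-containment) and then masks whichever potential outcome a real researcher would not observe. The names `seen0` and `seen1` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)  # seeded generator (an assumption, for reproducibility)
n_leg = 500
b_leg, b_don, c, b_bribe = 0.5, 0.5, 0.5, 0.2
ideology = rng.random(n_leg)
donate = 1 * (b_don * ideology + (1 - b_don) * rng.random(n_leg) > c)
pro0 = b_leg * ideology + (1 - b_leg) * rng.random(n_leg)
pro1 = pro0 + b_bribe

# In all seeing mode both potential outcomes are visible, so each individual effect is known.
print("individual effects all equal b_bribe:", np.allclose(pro1 - pro0, b_bribe))

# A real researcher sees only one potential outcome per legislator.
seen0 = np.where(donate == 0, pro0, np.nan)
seen1 = np.where(donate == 1, pro1, np.nan)
both_seen = ~np.isnan(seen0) & ~np.isnan(seen1)
print("legislators with both outcomes observed:", int(both_seen.sum()))
```

No legislator has both outcomes observed, which is why the causal effect has to be inferred from comparisons across legislators rather than read off directly.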


**Question 3.1. Create a variable which corresponds to the realized voting behavior; that is, the potential outcome with no donation for those not receiving a donation, and the potential outcome with a donation for those who do receive one. Add this variable to the `leg_data2` table with the name "Pro B Votes". (Hint: use the `np.where` function.)**

In [134]:
# Code for 3.1

**Question 3.2. Compute the difference of means in realized voting behavior among those who received a donation versus not.**

In [135]:
# Code for 3.2

**Question 3.3. Compute the selection bias in this estimate by taking the difference in the average of "Pro B Votes 0" among those who received a donation versus not.**

In [137]:
#Code for 3.3

**Question 3.4. Do a calculation which illustrates the Difference of Means = Causal Effect + Selection Bias formula for this case.**

In [138]:
# Code for 3.4
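
For reference, the decomposition can be checked numerically along the following lines. This is a sketch under the same assumptions as the setup above, rebuilt with plain NumPy arrays and a seeded generator so that it runs on its own. The variable names are illustrative, and "causal effect" is taken here to mean the average effect among those who received a donation.

```python
import numpy as np

rng = np.random.default_rng(4)  # seeded generator (an assumption, for reproducibility)
n_leg = 500
b_leg, b_don, c, b_bribe = 0.5, 0.5, 0.5, 0.2
ideology = rng.random(n_leg)
donate = 1 * (b_don * ideology + (1 - b_don) * rng.random(n_leg) > c)
pro0 = b_leg * ideology + (1 - b_leg) * rng.random(n_leg)
pro1 = pro0 + b_bribe
pro = np.where(donate == 1, pro1, pro0)   # realized voting behavior

got, not_got = donate == 1, donate == 0
diff_means = pro[got].mean() - pro[not_got].mean()
effect_on_treated = pro1[got].mean() - pro0[got].mean()
selection_bias = pro0[got].mean() - pro0[not_got].mean()

print("difference of means:           ", diff_means)
print("causal effect + selection bias:", effect_on_treated + selection_bias)
```

The two printed numbers agree up to floating-point rounding, because the average of `Pro B Votes 0` among donation recipients is added and subtracted on the right-hand side.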

Suppose researchers studying this question also have data on legislator ideology, measured in a way that is independent of their voting behavior and of who donates to them (which are, incidentally, the two main sources of data we use to estimate ideology!).

Even without knowing how the data was generated, we can get a sense of whether this might be driving selection bias by looking at the relationship between ideology and donations and the relationship between ideology and voting.

**Question 3.5. Create a scatter plot with "Ideology" on the x axis and "Pro B Votes" on the y axis, using `leg_data2`.**

In [139]:
#Code for 3.5

**Question 3.6. Now compute the difference in the mean of the "Ideology" variable among those who received donations vs not (in `leg_data2`). Interpret this difference.**

In [124]:
#Code for 3.6

0.3463941783873179

*Words for 3.6*

**Question 3.7. We can see all three of these variables together by making a scatterplot with the "Ideology" variable on the x axis and the "Pro B Votes" variable on the y axis, using a `group=` option to plot those receiving donations in a different color. Make this plot.**

In [141]:
#Code for 3.7

**Question 3.8. Use what you found in the last three questions to argue that comparing the voting behavior of those who received donations vs not isn't a *ceteris paribus* comparison.**

*Answer to 3.8*

**Question 3.9 [OPTIONAL]. Create the same graph as you did in 3.7, but using the `leg_data` table from Part 2, where there was no causal effect. Compare the two.**

In [140]:
# Code for 3.9