# Lab 6 - Potential Outcomes and Causal Inference

In [5]:
import numpy as np
from datascience import Table
%matplotlib inline

## Part 1. Reading real causal claims

**Question 1.1. Read <a href="https://www.bbc.com/news/world-asia-58710194">this article</a> from the BBC this week. Identify two causal claims. For each, answer the following: (1) What are the independent and dependent variables?
(2) Is it a causal claim about a specific case or a general causal claim?
(3) What does the causal claim mean about a counterfactual world with a different value of the independent/treatment variable?**

*Answer to 1.1*

Consider this quote from Senator Lindsey Graham: "Russia's actions had no impact at all on the outcome of the (2016) election." Let's think about translating this into potential outcomes notation. To simplify, we will interpret the "outcome" of the election as the number of electoral college votes won by Trump. We will think of the "treatment" here as a binary variable, where $D_{2016}=1$ means "interference by Russia" and $D_{2016}=0$ means "no interference by Russia." We will take it as given that in reality $D_{2016}=1$.
 
 **Question 1.2. Trump won 304 electoral college votes in 2016. How can we express this outcome in potential outcomes notation? (Hint: it should be of the form that Y with some subscript(s) is equal to something.)**

*Answer to 1.2*

**Question 1.3. How can we express Graham's claim in potential outcomes notation? (Again, it should be in the form of Y with some subscript(s) being equal to something.)**

*Answer to 1.3*

**Question 1.4. The closest state in the election was my home state of Michigan, which has 16 electoral votes. Suppose someone thinks that Russian interference had a large enough impact to cause Trump to win Michigan, but not any other state. Express this claim in potential outcomes notation.**

*Answer to 1.4*

## Part 2. Lobbying and Corruption in All Seeing Mode

A common concern in many democratic (and less-than-democratic) countries is that those with resources can use lobbying, campaign contributions, or outright bribes to get politicians to do their bidding. Often, the evidence used to make this argument is that those who receive lots of money from a particular interest group tend to vote the way that interest group wants (a friend of mine has expressed this in <a href="https://www.hrothstein.com/#/the-cost-of-denial/">art form</a>).

Let's think about what causal theories are consistent with this evidence. We will simulate a legislature with 500 members. We assume they all have an ideology which ranges from 0 to 1, where we will interpret this as their predisposition to vote in a "pro-business" fashion.


In [6]:
n_leg = 500
leg_ideol = np.random.rand(n_leg)
leg_data = Table().with_column("Ideology", leg_ideol)
leg_data

Ideology
0.811579
0.482587
0.919606
0.859304
0.401466
0.767753
0.210777
0.965212
0.205694
0.44573


We may be interested in how legislators vote on particular bills, or their overall voting behavior. Since it will make some calculations a bit more natural, we will do the latter.

An outcome we might care about is what proportion of "pro-business" bills the legislator votes for, which will range from 0 to 1. If our ideology measure has any meaning, then those with a higher ideology should be more likely to vote for these bills. There are probably other factors that matter as well. To capture these ideas, we are going to assume that the proportion of pro-business bills they vote for can be written:
$$
\text{pro} = b_{leg} \times \text{ideology} + (1-b_{leg}) \times e
$$
where $e$ is an *error term* which is a uniform random number between 0 and 1. The $b_{leg}$ variable measures how much ideology is important relative to other considerations. Here is code for this *data generating process*:

In [7]:
b_leg = 1/2
pro = b_leg * leg_ideol + (1-b_leg)*np.random.rand(n_leg)
leg_data = leg_data.with_column("Pro B Votes", pro)
leg_data

Ideology,Pro B Votes
0.811579,0.823187
0.482587,0.482455
0.919606,0.95545
0.859304,0.689785
0.401466,0.431324
0.767753,0.647607
0.210777,0.14804
0.965212,0.502682
0.205694,0.256566
0.44573,0.379958


A quick side note: we often define our error term to have an average of 0. We could have also written this as:
$$
\text{pro} = b_{leg} \times \text{ideology} + \frac {1-b_{leg}}{2} + e
$$
where $e$ is uniformly distributed between $-\frac {1-b_{leg}}{2}$ and $\frac {1-b_{leg}}{2}$. Think through why this would produce an equivalent result.
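As a quick numerical check of this side note, here is a minimal sketch in plain NumPy (the seed, the five-element arrays, and the names `u` and `e_centered` are illustrative assumptions, not part of the lab code). It builds both versions of the formula from the same uniform draws and confirms they coincide:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
b_leg = 0.5
ideology = rng.random(5)        # hypothetical ideologies
u = rng.random(5)               # shared uniform(0, 1) draws

# Original form: error term scaled to lie between 0 and 1 - b_leg
pro_original = b_leg * ideology + (1 - b_leg) * u

# Re-centered form: constant (1 - b_leg) / 2 plus a mean-zero error
e_centered = (1 - b_leg) * (u - 0.5)
pro_centered = b_leg * ideology + (1 - b_leg) / 2 + e_centered

print(np.allclose(pro_original, pro_centered))
```

The two expressions differ only in whether the average of the noise is folded into the error term or written out as a separate constant.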

Note we haven't said anything about the donor behavior yet; so we have implicitly assumed that this doesn't affect the vote! One theory about donor behavior is that they will give money to those with an aligned ideology in order to help them get re-elected. 

We can model this with a simple utility framework. Suppose the "cost" to donating is $c$, and the benefit is equal to:
$$
\text{benefit} = b_{don} \times \text{ideology} + (1-b_{don}) \times e
$$
where $e$ is a uniform random number between 0 and 1. So, when $b_{don}$ is high, the donor puts more weight on ideology, and when $b_{don}$ is low they put more weight on other factors. The donor utility is:
$$
u_{don} = \text{benefit} - c
$$
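To make the utility framework concrete, here is a small worked example for a single hypothetical legislator (the ideology of 0.8 and the error draw of 0.4 are made-up values for illustration only):

```python
b_don = 0.5
c = 0.5
ideology = 0.8   # hypothetical legislator
e = 0.4          # hypothetical draw of the error term

benefit = b_don * ideology + (1 - b_don) * e   # 0.5 * 0.8 + 0.5 * 0.4
u_don = benefit - c                            # utility net of the cost
donates = u_don >= 0                           # donate when benefit covers cost

print(round(benefit, 2), round(u_don, 2), donates)  # prints: 0.6 0.1 True
```

Here the benefit exceeds the cost, so this donor would give to this legislator.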

**Question 2.1. Write code to set $b_{don}=1/2$ and $c=1/2$, compute the benefit of donating to each legislator, and make a variable called `Donate` which is equal to 1 when the benefit is greater than or equal to the cost (that is, when $u_{don} \geq 0$). (Hint: if you take a variable that is a boolean (True or False) and multiply it by 1, Python will turn this into an integer equal to 1 for True and 0 for False.)**
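The hint about booleans can be seen on a tiny array. This is only a sketch with made-up numbers (the array `benefit_example` is hypothetical):

```python
import numpy as np

benefit_example = np.array([0.2, 0.5, 0.9])  # hypothetical benefit values
c = 0.5

flags = benefit_example >= c   # element-wise comparison gives booleans
as_int = 1 * flags             # multiplying by 1 converts True/False to 1/0

print(flags)
print(as_int)
```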

In [8]:
# Code for 2.1

Ideology,Pro B Votes,Donate
0.811579,0.823187,1
0.482587,0.482455,0
0.919606,0.95545,1
0.859304,0.689785,0
0.401466,0.431324,1
0.767753,0.647607,1
0.210777,0.14804,1
0.965212,0.502682,1
0.205694,0.256566,1
0.44573,0.379958,1


**Question 2.2. What is the average of the `Pro B Votes` variable among those who receive a donation? Among those who do not? What is the difference of means?**

**Question 2.3. You should get that there is a positive difference of means. But we set this up in a way that there is no real causal effect: the legislator behavior was unaffected by what the donor did. If someone were to say to you "this just goes to show that politicians do whatever lobbyists want them to do!" what would be a good response based on what we learned this week?**

*Answer to 2.3*

**Question 2.4. One thing we might want to study is how the parameters of this data generating process affect the observed difference of means (which we know here is all selection bias). Write a function called `getdom(b_leg,b_don,c)` which replicates the analysis above, but with these variables as arguments. (That is, create a Table with the legislator ideology as a variable, then add variables for the legislator voting behavior and the donor choice, then compute the difference of means in voting behavior among those who received donations vs those who did not). Check that `getdom(.5, .5, .5)` gives a similar answer to what you got for 2.2 (it won't be exactly the same due to randomness).**

**Question 2.5. See what happens if you increase or decrease each of the three parameters. Make sure to keep them all between 0 and 1 (if you put c outside of this range you might get an error message; think about why!). Does this lead to more or less selection bias, and why?**

In [None]:
#Code for 2.5

*Words for 2.5*

Now let's do a variant of the analysis above, but where there is a real causal effect of donations. To do that, we will first create a table called `leg_data2` with the legislator ideology and the donation choice, which we treat the same as above.

In [21]:
b_leg=.5
b_don=.5
c=.5

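leg_data2 = Table().with_column("Ideology", leg_ideol)
# Benefit of donating to each legislator (the cost c is compared below)
u_don = b_don*leg_ideol + (1-b_don)*np.random.rand(n_leg)
# Donate = 1 when the benefit exceeds the cost, 0 otherwise
leg_data2=leg_data2.with_column("Donate", 1*(u_don > c))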

Let's suppose that if a legislator does not receive a donation, they vote as we assumed above. If they do receive a donation, their proportion of pro-business votes increases by $0.2$. We will create two separate variables for the potential voting behavior without a donation (`Pro B Votes 0`) and with a donation (`Pro B Votes 1`).

In [22]:
b_bribe = .2
pro0 = b_leg * leg_ideol + (1-b_leg)*np.random.rand(n_leg)
pro1 = pro0 + b_bribe
leg_data2 = leg_data2.with_column("Pro B Votes 0", pro0)
leg_data2 = leg_data2.with_column("Pro B Votes 1", pro1)
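Only one of the two potential outcomes is ever observed for each legislator. A toy sketch of how treatment status picks out the realized outcome, using four made-up legislators (the arrays `donate_toy`, `y0_toy`, and `y1_toy` are hypothetical and separate from `leg_data2`):

```python
import numpy as np

donate_toy = np.array([0, 1, 1, 0])           # hypothetical treatment status
y0_toy = np.array([0.30, 0.50, 0.70, 0.40])   # potential outcome without a donation
y1_toy = y0_toy + 0.2                         # potential outcome with a donation

# "Switching" rule: observe y1 when treated, y0 when untreated
realized_toy = donate_toy * y1_toy + (1 - donate_toy) * y0_toy

print(realized_toy)
```

The second and third legislators reveal their with-donation outcome, while the first and fourth reveal their without-donation outcome.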


**Question 2.6. Create a variable which corresponds to the realized voting behavior; that is, the potential outcome with no donation for those not receiving a donation, and the potential outcome with a donation for those who do receive one. Add this variable to the `leg_data2` table with the name "Pro B Votes".**

**Question 2.7. Compute the difference of means in realized voting behavior among those who received a donation versus not.**

**Question 2.8. Compute the selection bias in this estimate by comparing the difference in the average of "Pro B Votes 0" among those who received a donation versus not.**

**Question 2.9. Do a calculation which illustrates the Difference of Means = Causal Effect + Selection Bias formula for this case.**
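For reference, the decomposition can be written in potential outcomes notation. The symbols here are assumptions for illustration: $Y_1$ and $Y_0$ stand for the potential voting behavior with and without a donation, $Y$ for the realized behavior, and $D$ for the donation indicator.

$$
\underbrace{E[Y \mid D=1] - E[Y \mid D=0]}_{\text{difference of means}}
= \underbrace{E[Y_1 - Y_0 \mid D=1]}_{\text{causal effect (among the treated)}}
+ \underbrace{E[Y_0 \mid D=1] - E[Y_0 \mid D=0]}_{\text{selection bias}}
$$

This follows from adding and subtracting $E[Y_0 \mid D=1]$ on the left-hand side. Because the simulated effect of a donation is the same constant for every legislator, the first term on the right does not depend on who is treated.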

Suppose researchers studying this question also have data on legislator ideology, measured in a way that is independent of their voting behavior and of who donates to them (which are, incidentally, the two main sources of data we use to estimate ideology!).

Even without knowing how the data was generated, we can get a sense of whether this might be driving selection bias by looking at the relationship between ideology and donations and the relationship between ideology and voting.
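One rough way to summarize these two relationships numerically is with correlations. The sketch below re-simulates the data in plain NumPy so it runs on its own (the seed and the variable names are illustrative assumptions; it does not use the `leg_data` table, and the questions below ask for plots and differences of means rather than correlations):

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen only for reproducibility
n_leg, b_leg, b_don, c = 500, 0.5, 0.5, 0.5

ideology = rng.random(n_leg)
benefit = b_don * ideology + (1 - b_don) * rng.random(n_leg)
donate = 1 * (benefit >= c)                                # donor choice
pro = b_leg * ideology + (1 - b_leg) * rng.random(n_leg)   # no effect of donations

# Correlation of ideology with each of the other two variables
corr_donate = np.corrcoef(ideology, donate)[0, 1]
corr_votes = np.corrcoef(ideology, pro)[0, 1]

print(corr_donate, corr_votes)
```

If ideology is positively related to both receiving a donation and voting pro-business, it is a natural candidate for the source of selection bias.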

**Question 2.10. Create a scatter plot with "Ideology" on the x axis and "Pro B Votes" on the y axis, using `leg_data`**

**Question 2.11. Now compare the difference in the mean of the "Ideology" variable among those who received donations vs not.**

**Question 2.12. We can see all three of these variables together by making a scatter plot with the "Ideology" variable on the x axis and the "Pro B Votes" variable on the y axis, using `group="Donate"` to plot those receiving donations in a different color. Do this for `leg_data`.**

**Question 2.13. Use what you found in the last three questions to argue that comparing the voting behavior of those who received donations vs not isn't a *ceteris paribus* comparison.**

*Answer to 2.13*

**Question 2.14. Now make the same graph as in 2.12 but for `leg_data2`. Compare these two graphs.**

*Words for 2.14*