## PS 3 Fall 2020 - Lecture Notebook for Week 6 - Potential Outcomes and Arrays

Here is how we can do some potential outcomes calculations in Python, which will also give us practice with arrays.

We'll be using "made up" data, but to make things more interesting let's use a real example which I've written about: international monitoring of elections.

The outcome we care about is how fraudulent elections are, which we'll suppose is measured on a scale from 0 (perfectly clean) to 10 (completely fraudulent). 

Our independent or treatment variable will be whether international monitors are present. 

An interesting methodological challenge when studying this question in the real world is that we often measure how fraudulent elections are using reports from international monitors, so the presence of our independent variable may be required to measure the dependent variable! For our exercise here we will sweep this under the rug and suppose that we get a reliable measurement of how fraudulent elections are from other sources.

First, let's create an array of potential outcomes without the treatment. That is, how fraudulent would each election be in the (sometimes hypothetical) scenario with no monitors?

We are going to use the "numpy" library, which provides some nice functions for dealing with arrays.

To keep things simple, we are going to imagine a data set with 8 elections.

In [None]:
import numpy as np
y0 = np.array([8, 2, 5, 8, 2, 3, 4, 3])
y0

Let's assume that the causal effect is equal to -1 for every election. In words, monitors decrease the amount of fraud by 1 point on our 0 to 10 scale.

To do this, we will define a variable called k (think kappa from the slides), and add that to y0.

In [None]:
k=-1
y1 = y0 + k
y1

Note that we have done something kind of cool here: we added a single number to an array, which is a list of numbers. Numpy deals with this the way that we would like: it adds k to every entry, and since k is -1, that subtracts 1 from each of them.
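To see what numpy is doing when a number is added to an array, here is a small sketch that compares it with an explicit element-by-element loop. The names `y1_broadcast` and `y1_loop` are made up for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np

# Same potential outcomes without monitors as in the notebook
y0 = np.array([8, 2, 5, 8, 2, 3, 4, 3])
k = -1

# Adding a single number to an array applies it to every entry
y1_broadcast = y0 + k

# The same calculation written out one entry at a time
y1_loop = np.array([value + k for value in y0])

print(y1_broadcast)                           # [7 1 4 7 1 2 3 2]
print(np.array_equal(y1_broadcast, y1_loop))  # True
```

Both versions give the same array; the first is just shorter to write.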

Now let's suppose that 4 of the countries have election monitors while 4 do not. To capture this, we create an array of 0s and 1s, where 0 means not monitored and 1 means monitored.

 

In [None]:
d = np.array([1,0,1,1,0,0,1,0])
d

We are going to compute the realized fraud outcome for each election with a clever array trick. Our goal is to get the value from y0 when d=0 and from y1 when d=1. To do this, we will multiply y1 times d, which gives us the realized outcome for the monitored elections and 0 otherwise, and then y0 times 1-d, which gives us the realized outcome for the non-monitored elections and 0 otherwise. So, by adding $y1*d$ and $y0*(1-d)$, we always get the realized outcome plus 0, which is just the realized outcome.

In [None]:
y1*d

In [None]:
y0*(1-d)

In [None]:
y = y1*d + y0*(1-d)
y
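The multiply-and-add trick is one way to pick between y1 and y0. As an alternative sketch, numpy's `np.where` function does the same selection directly: it takes a condition and returns entries from one array where the condition is true and from another where it is false. The names `y_arith` and `y_where` are only for this comparison.

```python
import numpy as np

y0 = np.array([8, 2, 5, 8, 2, 3, 4, 3])
y1 = y0 - 1                      # constant causal effect of -1
d = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# The arithmetic trick from the notebook
y_arith = y1 * d + y0 * (1 - d)

# The same selection with np.where: take y1 where d == 1, otherwise y0
y_where = np.where(d == 1, y1, y0)

print(y_where)                           # [7 2 4 7 2 3 3 3]
print(np.array_equal(y_arith, y_where))  # True
```

The arithmetic version only works because d contains nothing but 0s and 1s; `np.where` states the intent more explicitly.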

What could we actually observe in reality? The monitor status, and the observed amount of fraud. Here is one way to print them side by side, with d in the first column and y in the second.

In [None]:
print(np.column_stack((d,y)))

Now let's think about how we can compute a difference of means for those with and without monitors. We will do this in a few steps. First, we want to compute the average level of fraud for countries with monitors. There is a nice trick for this: we will take the "subset" of observed outcomes which are monitored (d==1). The syntax for this is to add the condition we want in square brackets:

In [None]:
y[d==1]

This returned an array with four entries, which makes sense because four of our countries have monitors. 

If you return to the printed version above, you can check that it pulled the outcome for the four countries with monitors.
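It can help to look at what the condition inside the square brackets is on its own. The sketch below assumes the same d as above and types in the realized outcomes by hand; the name `monitored` is made up for illustration.

```python
import numpy as np

d = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y = np.array([7, 2, 4, 7, 2, 3, 3, 3])  # realized outcomes computed earlier

# d == 1 is itself an array, with True for monitored and False otherwise
monitored = (d == 1)
print(monitored)

# True counts as 1 when summed, so this is the number of monitored countries
print(monitored.sum())   # 4

# Indexing with the True/False array keeps only the entries marked True
print(y[monitored])      # [7 4 7 3]
```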

We can do the same for the non-monitored countries:

In [None]:
y[d==0]

Now we can compute the average fraud level in the monitored countries with the np.mean function:

In [None]:
np.mean(y[d==1])

Note that if we did the same thing but looked at our y1 vector, we get the same result, since the "average potential outcome with monitoring among the monitored" is just the "average outcome among the monitored".

In [None]:
np.mean(y1[d==1])

Now let's do the non-monitored elections:

In [None]:
np.mean(y[d==0])

Finally, let's put this together and compute our difference of means, which we will save as a variable called dom:

In [None]:
dom = np.mean(y[d==1]) - np.mean(y[d==0])
dom
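Since we will compute differences of means more than once, one option is to wrap the calculation in a function. This is only a sketch: the function name `diff_of_means` and its argument names are hypothetical, and it assumes the treatment array contains only 0s and 1s.

```python
import numpy as np

def diff_of_means(outcome, treated):
    """Mean outcome where treated == 1 minus mean outcome where treated == 0."""
    outcome = np.asarray(outcome)
    treated = np.asarray(treated)
    return outcome[treated == 1].mean() - outcome[treated == 0].mean()

d = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y = np.array([7, 2, 4, 7, 2, 3, 3, 3])  # realized outcomes computed earlier

print(diff_of_means(y, d))  # 2.75
```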

What does this mean in words? The elections with monitors were almost 3 points more fraudulent than those with no monitors! Maybe the monitors should have stayed home?

But wait, as we learned in the lecture, this might not really capture the causal effect (which we assumed was -1). 

In particular, we can use our selection bias formula from the slides to calculate how wrong our difference of means is.

In [None]:
sb = np.mean(y0[d==1]) - np.mean(y0[d==0])
sb

However, notice that this requires knowing how fraudulent the monitored elections would have been without monitoring: it's an unobserved counterfactual!

Still, in this hypothetical mode, we can check that the real causal effect plus our selection bias is equal to the difference of means:

In [None]:
print(k + sb,dom)
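To see why these two numbers have to match, here is the decomposition written out. The notation is an assumption (the slides may use different symbols): $E[\cdot \mid d = 1]$ is an average over the monitored elections and $E[\cdot \mid d = 0]$ is an average over the non-monitored ones.

```latex
\begin{aligned}
\text{difference of means}
  &= E[y \mid d = 1] - E[y \mid d = 0] \\
  &= E[y_1 \mid d = 1] - E[y_0 \mid d = 0] \\
  &= \underbrace{E[y_1 - y_0 \mid d = 1]}_{\text{causal effect among the monitored}}
   + \underbrace{E[y_0 \mid d = 1] - E[y_0 \mid d = 0]}_{\text{selection bias}}
\end{aligned}
```

The last step adds and subtracts $E[y_0 \mid d = 1]$. In our example the causal effect is $k = -1$ for every election and the selection bias is $3.75$, so the difference of means is $-1 + 3.75 = 2.75$.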

What if we flipped which elections the monitors went to check? We can do this by defining an alternative treatment variable d2, which is equal to 1 when d is 0 and equal to 0 when d is 1:

In [None]:
d2 = 1-d
y2 = y1*d2 + y0*(1-d2)

In [None]:
dom2 = np.mean(y2[d2==1]) - np.mean(y2[d2==0])
dom2

Now the difference of means is very negative! We can also compute the selection bias with this new monitoring regime:

In [None]:
sb2 = np.mean(y0[d2==1]) - np.mean(y0[d2==0])
sb2

Notice this is the exact opposite of the selection bias with our initial treatment. Think through why!
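As a numerical check (not an explanation of why), the sketch below rebuilds both monitoring regimes from scratch and confirms two things: the two selection biases are negatives of each other, and the causal effect plus the selection bias still equals the difference of means under the flipped regime.

```python
import numpy as np

y0 = np.array([8, 2, 5, 8, 2, 3, 4, 3])
k = -1
y1 = y0 + k

d = np.array([1, 0, 1, 1, 0, 0, 1, 0])
d2 = 1 - d  # flipped monitoring assignment

# Selection bias under each regime
sb = np.mean(y0[d == 1]) - np.mean(y0[d == 0])
sb2 = np.mean(y0[d2 == 1]) - np.mean(y0[d2 == 0])

# Realized outcomes and difference of means under the flipped regime
y2 = y1 * d2 + y0 * (1 - d2)
dom2 = np.mean(y2[d2 == 1]) - np.mean(y2[d2 == 0])

print(sb, sb2)        # 3.75 -3.75
print(k + sb2, dom2)  # -4.75 -4.75
```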