In [None]:
#Hidden
from datascience import Table
%matplotlib inline
import matplotlib.pyplot as plots
import numpy as np
plots.style.use('fivethirtyeight')

import pandas as pd

## Just for reference, since datascience is evolving rapidly
import datascience as ds
ds.__version__

# From Hypothesis Testing to Causality

How do we know that a change in one variable actually causes a change in another? Natural experiments, like the one we saw at the start of class with the case of John Snow and cholera, are viewed as among the most convincing ways to argue for a causal relationship. We will look today at what is called a quasi-natural experiment, namely what happens to commercial activity when some banks are bailed out while others are not. We start with a recap of hypothesis testing.

/I/ AB Testing

/II/ Interpretation and Linear Regression

## /I/ AB Testing, recap
We will return to the slavery data from the start of the semester and see how much we have learned. The codebook is at: http://www.icpsr.umich.edu/icpsrweb/RCMD/studies/7423 , and you also have a copy in bCourses. 


In [None]:
data = Table.read_table("https://github.com/data-8/history-connector/raw/gh-pages/Data1.csv")

Recall that some records contain multiple entries and, to simplify, we examined only a subgroup, where prices are known and refer to individual records: 

In [None]:
data = data.where(data["V14"] != 99999)
data = data.where(data["V12"] == 1)
data = data.where(data["V40"]!=99)
data

Also, we will restrict the data to a single year so that we can easily check the real value (or the labor value or income value) of the prices; these figures can be checked at https://www.measuringworth.com/uscompare/

In [None]:
#keep#
data1850 = data.where(data["V4"] == 1850)
data1851 = data.where(data["V4"] == 1851)
data1852 = data.where(data["V4"] == 1852)
#data1850.show()#

Prices of cotton tended to rise after the 1840s, so we can look at 1850. You can look at some other year, and recall that prices were very high after the War of 1812 and in the 1830s due to a boom in land prices (“internal improvements”); in contrast, the “Panic of 1837” had the opposite effect on prices.
Keep in mind the following values:

“If you want to compare the value of a $1.00 Commodity in 1850 there are three choices. 
In 2014 the relative:
real price of that commodity is $31.30
labor value of that commodity is $223.00 (using the unskilled wage) or $482.00 (using production worker compensation)
income value of that commodity is $490.00”
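
The conversion quoted above is a simple multiplication. Here is a minimal sketch, assuming a hypothetical price of $500 in 1850 (this figure is made up for illustration and is not taken from the data; the names `MULTIPLIERS_1850_TO_2014` and `to_2014_dollars` are also just illustrative):

```python
# Multipliers quoted above for expressing 1850 dollars in 2014 dollars.
MULTIPLIERS_1850_TO_2014 = {
    "real price": 31.30,
    "labor value (unskilled wage)": 223.00,
    "labor value (production worker compensation)": 482.00,
    "income value": 490.00,
}


def to_2014_dollars(amount_1850):
    """Return the 2014 equivalent of an 1850 dollar amount under each measure."""
    return {
        measure: amount_1850 * factor
        for measure, factor in MULTIPLIERS_1850_TO_2014.items()
    }


# Hypothetical 1850 price, chosen only to illustrate the arithmetic.
example_price = 500.0
converted = to_2014_dollars(example_price)
for measure, value in converted.items():
    print(f"{measure}: ${value:,.2f}")
```

The same multiplication applies to the mean price once you have computed it; which multiplier is appropriate depends on the question you are asking.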


## Your turn

Now calculate the mean price for 1850, and express it in 2014 dollars.

In [None]:
mean...

In [None]:
#Recode gender so that male = 0 and female = 1 for the A/B test below
data.append_column("V15_new", data["V15"] - 1)

#Recode skill so that 0 = Unlisted and 1 = Listed, dropping all rows coded 99
data = data.where(data["V40"]!=99)
data.append_column("V40_new", data["V40"] - 1)

In [None]:
"""Bootstrap A/B test for the difference in the mean response
Assumes A=0, B=1"""

def bootstrap_AB_test(samp_table, response_label, ab_label, repetitions):
    
    # Sort the sample table according to the A/B column; 
    # then select only the column of effects.
    response = samp_table.sort(ab_label).select(response_label)
    
    # Find the number of entries in Category A.
    n_A = samp_table.where(ab_label, 0).num_rows
      
    # Calculate the observed value of the test statistic.
    meanA = np.mean(response[response_label][:n_A])
    meanB = np.mean(response[response_label][n_A:])
    obs_diff = meanA - meanB
    
    # Run the bootstrap procedure and get a list of resampled differences in means
    diffs = []
    for i in range(repetitions):
        resample = response.sample(with_replacement=True)
        d = np.mean(resample[response_label][:n_A]) - np.mean(resample[response_label][n_A:])
        diffs.append(d)
    
    # Compute the bootstrap empirical P-value
    diff_array = np.array(diffs)
    p_value = np.count_nonzero(abs(diff_array) >= abs(obs_diff))/repetitions
    
    # Display results
    diffs = Table([diffs],['diff_in_means'])
    diffs.hist(bins=20,normed=True)
    plots.xlabel('Approx null distribution of difference in means')
    plots.title('Bootstrap A-B Test')
    print("Observed difference in means: ", obs_diff)
    print("Bootstrap empirical P-value: ", p_value)
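
To see the logic of the function above on a small scale, here is a minimal sketch that uses only the Python standard library and made-up numbers. The two groups, the helper name `bootstrap_ab_pvalue`, and the seed are assumptions for illustration; none of them come from the course data or the datascience library.

```python
import random
from statistics import mean


def bootstrap_ab_pvalue(values_a, values_b, repetitions=1000, seed=0):
    """Approximate a two-sided P-value for the difference in group means.

    The two groups are pooled, the pool is resampled with replacement,
    and each resample is split into pseudo-groups of the original sizes.
    """
    rng = random.Random(seed)
    pooled = list(values_a) + list(values_b)
    n_a = len(values_a)
    observed = mean(values_a) - mean(values_b)

    extreme = 0
    for _ in range(repetitions):
        resample = rng.choices(pooled, k=len(pooled))
        diff = mean(resample[:n_a]) - mean(resample[n_a:])
        # Count resampled differences at least as large as the observed one.
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / repetitions


# Made-up responses for two groups (A = 0, B = 1).
group_a = [100, 110, 120, 130, 140]
group_b = [300, 310, 320, 330, 340]

observed, p_value = bootstrap_ab_pvalue(group_a, group_b)
print("Observed difference in means:", observed)
print("Bootstrap empirical P-value:", p_value)
```

Because the two made-up groups are far apart, very few resampled differences should be as extreme as the observed one, so the empirical P-value should come out small.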

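In [None]:
#keep#
#Rebuild the yearly subsets so that they include the recoded columns V15_new and V40_new
data1850 = data.where(data["V4"] == 1850)
data1851 = data.where(data["V4"] == 1851)
data1852 = data.where(data["V4"] == 1852)
#data1850.show()#

In [None]:
#Response: V14 (the price column filtered above); A/B label: V15_new (recoded gender)
bootstrap_AB_test(data1850, "V14", "V15_new", 1000)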

In [None]:
#Your turn: repeat the test for 1851
#bootstrap_AB_test(..)

## /II/ Interpretation and Linear Regression

Let’s look again at Kent, and at unemployment and relief. Recall that just before the midterm, we noted that there seemed to be a positive association between the two: in counties with more unemployment, poor relief payments tended to be higher.


In [None]:
data2 = Table.read_table("https://github.com/data-8/history-connector/raw/gh-pages/Data4.csv")
data2

In [None]:
#To jog your memory, see the scatter plot:
data2.scatter("UNEMP", "RELIEF", s=10)

In [None]:
#Let’s get rid of the outlier, creating a new table called URT, UnemploymentReliefTruncated:
URT = data2.select(["COUNTY", "UNEMP", "RELIEF"]).where(data2.column("UNEMP")<0.40)
URT

Last time, we considered health and wealth, noting a strong positive relationship between increasing GDP and increasing life expectancy in the US during the twentieth century. Let’s perform a similar analysis for Relief and Unemployment.

In [None]:
kent=URT.select(["COUNTY", "UNEMP", "RELIEF"]).where(URT.column("COUNTY")==1)
kent

In [None]:
#First, to get some sense of the relationship:
kent.scatter("UNEMP", "RELIEF", fit_line=True)

In [None]:
#Next, we move to pandas and statsmodels

import statsmodels.api as sm
import pandas as pd 

In [None]:
#You should see this after running the command: pandas.core.frame.DataFrame#
reg = kent.to_df()
type(reg)

In [None]:
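#Next, given the approximately linear association, we can proceed with the OLS procedure.
#COUNTY equals 1 for every row of kent, so it serves as the constant (intercept) term;
#UNEMP is the explanatory variable.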

x = reg[['COUNTY','UNEMP']]
y = reg['RELIEF']
multiple_regress = sm.OLS(y, x).fit()
multiple_regress.summary()

How do we interpret the results? Let's recall how we interpreted Table 9 from Richardson and Troost:

/1/ What is the *eXplanatory variable*?

/2/ What is the *sign* (positive or negative)?

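/3/ What is the *'significance'*? Recall the "tyranny of two": divide the coefficient by its standard error (Coeff/SE); a ratio larger than about 2 in absolute value indicates significance at roughly the 5% level.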

Can you write out a sentence, stating the results as an argument?
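
The Coeff/SE ratio can also be computed by hand. Here is a minimal sketch for a regression with a single explanatory variable, using only the standard library. The unemployment and relief figures are made up for illustration and are not the Kent data, and `simple_ols` is a hypothetical helper rather than a statsmodels function.

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns the slope, its standard error, and their ratio (Coeff/SE).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    return slope, se_slope, slope / se_slope


# Made-up data: unemployment rate and relief payments.
unemp = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
relief = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

slope, se_slope, t_ratio = simple_ols(unemp, relief)
print(f"Coefficient on unemployment: {slope:.3f}")
print(f"Standard error: {se_slope:.3f}")
print(f"Coeff/SE: {t_ratio:.2f}")
print("Passes the 'tyranny of two':", abs(t_ratio) > 2)
```

The sign of the coefficient answers question /2/, and the size of the ratio answers question /3/.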

## Your Turn

What happens if we add another variable? You would need to go back to the original table and bring in a second eXplanatory variable.

/1/ What is the sign?

/2/ What is the significance?