# Lab 5a, 2/21/19
# Correlation and Causation
## the Good, the Bad, and the Ugly

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook

## Poverty and its causes

### What causes poverty to change over time? 
### How does it change over time? 
### How do we measure it in different places and times?


The debate concerned the role of differing amounts of "out-relief" in causing greater pauperism. What was "out-relief"? The [Poor Laws of 1834](http://www.workhouses.org.uk/poorlaws/newpoorlaw.shtml) were meant to discourage the population from "wanting" to be poor, by forcing anyone deemed capable of working into workhouses with deliberately harsh conditions. "Out-relief," the granting of funds for survival, was to be forbidden to all able-bodied adults and their families, in favor of "in-relief" for those working in such workhouses.

But how did these forms of relief affect poverty? Did the tough-love approach work to curtail it?

### In the late 19th century, statisticians in England jumped into the breach to use data to answer these questions.

In the words of the historian Desrosières, "a political problem" was translated "into an instrument of measurement that allowed arbitration of a controversy." (Des 139)

Figures such as Charles Booth saw themselves as advocating a scientific approach to major questions of policy, an approach untainted by traditional political divisions and moral views. 

In 1894, Charles Booth released *The Aged Poor in England and Wales*, chock full of data and tables. Based on the data, he ended the book with a series of politically significant claims. One was a highly revisionist take on the belief that being too generous with "out-relief" went along with higher poverty: 

>The proportion of relief given out of doors bears no general relation to the total percentage of pauperism (Booth 423).

Booth's procedures were, well, a bit dicey.

The mathematician Udny Yule undertook to re-investigate the issues to look at the "various causes that one may conceive to effect changes in the rate of pauperism."

According to Yule, possible causes included:

1. Changes in the method, or strictness, of administration of the law.
2. Changes in economic conditions, e.g., fluctuations in trade, wages, prices, and employment.
3. Changes of a general social character, e.g., in density of population, overcrowding, or in the character of industry in a given district.
4. Changes more of a moral character, illustrated, for example, by the statistics of crime, illegitimacy, education, or possibly death-rates from certain causes.
5. Changes in the age distribution of the population.

#### The first category is of particular interest, Yule said, because, then, change "may be comparatively rapidly effected by the direct action of the responsible authorities."

### In other words, moving from *description* to *prediction* may allow policy *prescription* if the right kind of cause is found.


Yule's original paper: Yule, G. Udny. 1899. ["An Investigation into the Causes of Changes in Pauperism in England, Chiefly during the Last Two Intercensal Decades."](http://www.jstor.org.ezproxy.cul.columbia.edu/stable/2979889) *Journal of the Royal Statistical Society* 62 (Part II):249-295. 

He doesn't give all his data, but gives one important example. 

![example](https://i.imgur.com/7DTIfZK.png). 



In [None]:
pauper_data=pd.read_csv("https://raw.githubusercontent.com/data-ppf/data-ppf.github.io/master/labs/dat/Yule_tableXIX.csv", index_col=0)

In [None]:
pauper_data.head()

Let's make a scatter matrix, something not readily available to Mr. Yule.

In [None]:
%matplotlib notebook
from pandas.plotting import scatter_matrix
scatter_matrix(pauper_data)

## Least Squares and Regression

Thanks to high level packages, it's very, very easy to do simple linear regression using least squares.  

Linear Regression IN GENERAL:

$y = \beta_n x_n + ... + \beta_1 x_1 + \mu_0$

Linear Regression FOR JUST ONE VARIABLE:

$y = \beta_1 x_1 + \mu$ 

where $\beta_1$ is the slope, $x_1$ to $x_n$ are the predictor variables (the $\beta_i$ are their coefficients), and $\mu$ (written $\mu_0$ in the general form) is the y-intercept.
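
What "least squares" means for the one-variable case can be written out. Given $m$ paired observations $(x^{(i)}, y^{(i)})$, the fit picks the slope and intercept that make the sum of squared vertical distances between the data and the line as small as possible. The symbols $m$, $x^{(i)}$, $\bar{x}$, and the hats are introduced here for illustration only; they are not Yule's notation.

$$
(\hat{\beta}_1, \hat{\mu}) = \arg\min_{\beta_1,\, \mu} \sum_{i=1}^{m} \left( y^{(i)} - \beta_1 x^{(i)} - \mu \right)^2
$$

Setting the two partial derivatives to zero gives the familiar closed form, where $\bar{x}$ and $\bar{y}$ are the sample means:

$$
\hat{\beta}_1 = \frac{\sum_{i=1}^{m} (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^{m} (x^{(i)} - \bar{x})^2}, \qquad \hat{\mu} = \bar{y} - \hat{\beta}_1 \bar{x}
$$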

So, friends, let's regress for a minute. Let's try connecting just two of the variables in the scatter plot above.
What do you choose?


So we'll follow the same process we followed before:
1. choose our type of model 
2. identify what we want to do the predicting (our $x$s) and what we want to predict (our $y$)
3. prepare the data for this software implementation
4. run the model
5. inspect the results of the model
6. overinterpret and claim discovery of truth of universe

In [None]:
from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()

To use this, we must identify the value to be predicted (called y) and the data to be used to do the predicting (called X). 

In [None]:
X_out=pauper_data["out"] # this will be subsequently known as our training data or training set

In [None]:
X_out[1:10]

In [None]:
X=X_out.values.reshape(-1,1)

In [None]:
y=pauper_data["paup"] 

Given these choices, we ask `sklearn` to use `LinearRegression()` to fit the model to the data. Poor Mr. Yule would have had to have someone do all this by hand.

In [None]:
regression_model.fit(X, y)

Now, let's plot. We'll put a scatterplot of the data first and then add the regression line we just computed.

In [None]:
%matplotlib notebook
plt.scatter(X, y,  color='black')
plt.plot(X, regression_model.predict(X), color='blue', linewidth=1)
plt.show()

In [None]:
regression_model.predict(X)

In [None]:
regression_model.score(X,y)

.35 is, let's say, not a strong vote in favor of our model. This is $R^2$, the share of the variance in `paup` that the fitted line accounts for.
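
To see what the score is made of, here is a minimal sketch that computes $R^2$ by hand as one minus the ratio of the residual sum of squares to the total sum of squares, which is the coefficient of determination that `score` reports for a `LinearRegression`. The five data points are made up for illustration and are not taken from Yule's table; the variable names are ours.

```python
# Made-up illustrative values, NOT Yule's data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares line for a single predictor
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
predictions = [intercept + slope * xi for xi in x]

# Residual sum of squares: what the line fails to capture
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
# Total sum of squares: what "always guess the mean" fails to capture
ss_tot = sum((yi - mean_y) ** 2 for yi in y)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

An $R^2$ of 0 means the line does no better than always guessing the mean of `y`; an $R^2$ of 1 means the line passes through every point.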

### Your turn: regress a bit

Do the same process, but with some of the other variables in question.


## From Correlation to Cause

Let's get back to Yule and his account of pauperism. 

Yule took on Booth as a bad statistician: “it is extremely regrettable that a statist of Mr. Booth’s standing should have given so many examples of the fundamental mistake of founding general conclusions on particular instances.” (Yule 1895c, 606, quoted in Mills)

Looking at the data Yule discerned many correlations--and counseled against imputing causality to them.
>“the rate of total pauperism is positively correlated with the proportion of out-relief given, i.e. high average values of the former correspond to high average values of the latter. The method used seems to leave no room for doubt.” (Yule, 1895c, page 605, from Mills, 43)

He discerned that out-relief and pauperism were, in fact, strongly correlated.

But he insisted on care in thinking about cause:
>“(t)his statement does not say either that the low mean proportion of out-relief is the cause of the lesser mean pauperism or vice versa. Such terms seem best avoided where one is not dealing with a catena of causation at all. ... To be quite clear, I do not mean simply that out-relief determines pauperism in one union, and pauperism out-relief in another, so that you cannot say which is which on the average: but I mean that out-relief and pauperism mutually react in one and the same union.”

He went further to note that "detailed knowledge" might give some causal understanding:

>“Detailed knowledge may occasionally enable one to say ‘The pauperism is low here, since the proportion of out-relief is very small,’ or perhaps, ‘The proportion of out-relief given is large on account of the high pauperism and other industrial conditions of the union’: but such cases will be exceptional and will as a rule only refer to large deviations from the mean.” (ibid., page 605, footnote 2, quoted in Mills 46)

Yule's concerns echoed the philosophy and practice of Karl Pearson, who believed causal knowledge *impossible*.

But the desire, the temptation for causality led him to push harder, and to create some new math and new technologies. 

### What to do about the proliferation of causes?


Yule was deeply concerned that something else might be the underlying cause of the changes in outrelief. 

![Yule_multiple](https://i.imgur.com/go8cXCd.png)


We can undertake something like his analysis with ease, using almost identical syntax.


First, let's make `X` all the columns EXCEPT the `paup` data. `X` can be one column or multiple ones.

In [None]:
X=pauper_data.drop("paup", axis=1)   

# axis=1 means a column in this context. Almost no one ever remembers which is column and which rows, so just check yer data.

And now, for `y` just that first column.

In [None]:
y=pauper_data["paup"]

And fit our model, using the same syntax as before. 

In [None]:
regression_model.fit(X, y)

This is finding all the $\beta_i$ in the general linear regression

$y = \beta_n x_n + ... + \beta_1 x_1 + \mu_0$


Within `sklearn`, we can find those $\beta_i$ coefficients using the `.coef_` attribute of the fitted model.



In [None]:
regression_model.coef_

In [None]:
regression_model.coef_[0]

In [None]:
regression_model.score(X,y)

.70? 'Zwounds, we've discovered rock-bottom truth!


Note for the reader: Yule gives slightly different values (.755, -.022 and -.322) than this rederivation produces.

Here's how Yule presents his coefficients:

![Yule coefficients](https://i.imgur.com/3IiDti3.png)


Yule concludes his paper with a series of claims quoted by Desrosières:

![Yule conclusions](https://i.imgur.com/3TwxM3G.png) 



What sort of problems might there be with this analysis?


## Causal problems

Our Freedman reading challenges Yule's conclusions.
>At best, Yule has established association [i.e. a correlation]. Conditional on the covariates, there is a positive association between  $\Delta$Paup and $\Delta$Out. Is this association causal? If so, which way do the causal arrows point? 


Freedman gave some examples of these sorts of concerns:

>For instance, a parish may choose not to build poor-houses in response to a short-term increase in the number of paupers. Then pauperism is the cause and outrelief the effect.

in that case, still the same correlation, but $\Delta$Paup --> $\Delta$Out

>Likewise, the number of paupers in one area may well be affected by relief policy in neighboring areas. 

in other words, still the same correlation, but $\Delta$other-relief-policy --> $\Delta$Paup 
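
The point that the fit cannot tell us which way the arrow points can be checked directly. In the sketch below, the same made-up numbers are regressed in both directions and the $R^2$ comes out identical, because for a single predictor $R^2$ is just the squared correlation coefficient, which is symmetric in its two arguments. The helper `r_squared` and the variable names are ours, and the values are illustrative stand-ins, not Yule's data.

```python
def r_squared(predictor, outcome):
    """R^2 of the least-squares line predicting `outcome` from `predictor`."""
    n = len(predictor)
    mean_p = sum(predictor) / n
    mean_o = sum(outcome) / n
    slope = (sum((p - mean_p) * (o - mean_o) for p, o in zip(predictor, outcome))
             / sum((p - mean_p) ** 2 for p in predictor))
    intercept = mean_o - slope * mean_p
    ss_res = sum((o - (intercept + slope * p)) ** 2
                 for p, o in zip(predictor, outcome))
    ss_tot = sum((o - mean_o) ** 2 for o in outcome)
    return 1 - ss_res / ss_tot

# Made-up stand-ins for changes in out-relief and in pauperism; NOT Yule's data.
delta_out = [1.0, 2.0, 3.0, 4.0, 5.0]
delta_paup = [2.0, 4.0, 5.0, 4.0, 5.0]

forward = r_squared(delta_out, delta_paup)  # "out-relief explains pauperism"
reverse = r_squared(delta_paup, delta_out)  # "pauperism explains out-relief"
print(round(forward, 3), round(reverse, 3))
```

Since both stories fit equally well, choosing between them takes knowledge from outside the regression.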

>Such issues are not resolved by the data analysis. Instead, answers are assumed *a priori*. Although he was busily parceling out changes in pauperism – so much is due to changes in out-relief ratios, so much to changes in other variables, so much to random effects – Yule was aware of the difficulties. With one deft footnote (number 25), he withdrew all causal claims: ‘Strictly speaking, for “due to” read “associated with”.’

![Yule weasel](https://i.imgur.com/r7ZPsKH.png)

The causal situation is deeply problematic, as Freedman explains:
>To make causal inferences, it must be assumed that equations are stable under proposed interventions. Verifying such assumptions – without making the interventions – is problematic. On the other hand, if the coefficients and error terms change when variables are manipulated, the equation has only a limited utility for predicting the results of interventions.



## Category problems

The debate is fundamentally about *poverty*. Like prosperity, poverty can't be measured directly.

We need a *proxy*. For this English debate, the key proxy is *pauperism*.


### *Pauperism* is an *administrative* category
An administrative category is a way of classifying people with set definitions that state bureaucracies administer at great scale.

Administrative categories are necessary conventions, not truths of nature. 

Desrosières explains that an object like pauperism "exists by virtue of its social codification, through the reification of the results of an administrative process with fluctuating modalities." (140)

As with Spearman and intelligence, *reification* involves the thought crime of claiming existence for that which is a convention useful to us.

"It is this slippage from the process to the thing," Desrosières writes, "that made Yule’s conclusion so ticklish to interpret." 


#### Not just us who noticed

Yule and Pearson fought over categories in statistical analysis for some time. Here is Pearson on Yule:

>It’s the old controversy of nominalism against realism. Mr Yule juggles with the names of categories as if they represented real entities, and his statistics are merely a form of symbolic logic. No practical knowledge ever resulted from these logical theories. They may hold some pedagogical interest as exercises for students of logic, but modern statistics will suffer great harm if Mr Yule’s methods become widespread, consisting as they do of treating as identical all the individuals ranged under the same class index. (in Desrosières, 144)



## Current example: poverty measures in USA

### Current definition

>Following the Office of Management and Budget’s (OMB’s) Directive 14, the Census Bureau uses a set of money income thresholds that vary by family size and composition to detect who is poor. If a family’s total income is less than that family’s threshold, then that family, and every individual in it, is considered poor. The poverty thresholds do not vary geographically, but they are updated annually for inflation with the Consumer Price Index (CPI-U). The official poverty definition counts money income before taxes and excludes capital gains and noncash benefits (such as public housing, medicaid, and food stamps). [link](https://www.census.gov/programs-surveys/cps/technical-documentation/subject-definitions.html#povertydefinition)


![poverty](https://www.census.gov/content/census/en/library/visualizations/2014/demo/poverty_measure-history/jcr:content/map.detailitem.950.high.png/1449892859257.png)

## Exercise: 

### think of an "administrative category" and take 3 minutes to look into why it's contested
#### post to slack

## Back to Galton

Classifications matter. You may recall Galton's graph connecting genetic worth with social class.

![galton](https://www.sjdr.se/articles/10.1080/15017410600608491/sjdr_a_160832_o_f0002g.gif)

Galton took his demographic classifications from Yule's rival Booth.

Booth was a reformer, in that he sought to move away from the idea that, barring a few hardened criminals, the poor were *morally* responsible for being poor: "the poverty of the poor is mainly the result of the competition of the very poor."

The solution? "The entire removal of this very poor class out of the daily struggle for existence I believe to be the only solution to the problem." He wasn't advocating violence, he claimed, rather: 

>My suggestion is that these people should be given the opportunity to live as families in industrial groups, planted wherever land and building materials were cheap; being well-housed, well-fed and well-warmed; taught, trained and employed from morning to night on work, indoors and out, for themselves or on government account.


Galton, of course, thought there was something inherent, and more than that, hereditary about the abilities of the classes that explained their places in society. Other measures would be needed.