# Comparing several variables - to each other!

In [None]:
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import numpy as np
from scipy import stats

Does giving aid to the unemployed have the perverse effect of increasing unemployment? This is a longstanding debate, and we will look at some data from an 1831 investigation of relief payments in England. 

One question without an immediately obvious answer is what causes what. One argument is that as unemployment increases, so do relief payments. The counterargument is that the more generous the relief payments, the more people will opt out of work (that is, high benefits “induce” unemployment). We will not settle this today, but we will look at two interesting relationships: the relationship between unemployment and relief payments, and the relationship between relief payments and workers in the grain sector.

/1/ Starting with scatter plots

/2/ From standard units to the correlation coefficient

/3/ Questions


## Starting with scatter plots

Scatter plots are an excellent way to visualize these relationships, although we must consider that there are limitations to their usefulness.

We'll begin by looking at UNEMP and RELIEF in our dataset. 
--UNEMP refers to unemployment, represented as the ratio of unemployed people to wage laborers in each parish.
--RELIEF refers to welfare relief payments, represented by expenditure per person, in shillings, in each parish.


In [None]:
data = Table.read_table("https://github.com/data-8/history-connector/raw/gh-pages/Data4.csv")
data

In [None]:
plots.scatter(data.column("UNEMP"), data.column("RELIEF")) #you could also use data.scatter("UNEMP", "RELIEF")

What looks unusual? Two things to wonder about are:
--How different do the distributions look? 
--Are there any especially unusual values?

Note also that there is an outlier: one parish has 60+% UNEMP. How different is the analysis with and without the outlier, and should we include the parish in our analysis? We will come back to the question of outliers.
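To get a feel for how much a single extreme point can matter, here is a minimal sketch on made-up numbers (not the parish data). The arrays `x_toy` and `y_toy` and the cutoff of 10 are assumptions chosen only for illustration. The sketch uses NumPy's built-in `np.corrcoef` to compare the correlation with and without the extreme point.

```python
import numpy as np

# Made-up values for illustration only; these are NOT taken from the parish data.
# The last point (30, 1) plays the role of an outlier.
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 30.0])
y_toy = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 1.0])

# Correlation using every point, including the outlier.
r_with = np.corrcoef(x_toy, y_toy)[0, 1]

# Correlation after dropping points whose x value is at or above a hypothetical cutoff.
keep = x_toy < 10
r_without = np.corrcoef(x_toy[keep], y_toy[keep])[0, 1]

print("with outlier:   ", r_with)
print("without outlier:", r_without)
```

In this toy example, one point is enough to flip the sign of the correlation. Whether dropping a real parish is justified is a separate, historical question.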

In [None]:
#Question: Draw histograms of UNEMP and RELIEF. What would you comment on about the distributions?

data.hist...

### Scatter plot of two variables with (more) similar distributions

Let's look at two variables with what appear to be similar distributions: RELIEF and GRAIN.
--GRAIN is the percentage of adult males employed in grain production in each parish.

In [None]:
data.hist("RELIEF", bins=np.arange(0, 101, 1))
data.hist("GRAIN", bins=np.arange(0, 101, 1))

In [None]:
#Question: how would you make the scatter plot that looks like the ones with hybrid cars that we saw during lecture? 

data.scatter(....)

What is the interpretation? Is it surprising that relief payments appear to be higher in parishes where more of the population works in grain production? Consider the nature of grain production, or what we might call seasonality in agricultural labor.

## From Standard Units to the Correlation Coefficient

To get a sense of the association between RELIEF and GRAIN, we can convert both variables to what we described last time as standard units. 

Question: what are standard units, and why bother calculating them? Hint: look at the previous lab!

We follow the three-step process we saw in lecture: first we transform each variable to standard units, next we multiply each pair of standard units, and finally we calculate the correlation coefficient as the mean of those products.

Let’s first focus on GRAIN and RELIEF, and note that to plot in standard units, we make a table with the variables we're interested in, converted to standard units. Here's a handy function to convert an array of numbers into standard units:

In [None]:
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)
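As a quick sanity check, the sketch below applies the same function to a small made-up array (the values in `toy` are assumptions, not parish data). After conversion, the values should have a mean of about 0 and a standard deviation of about 1, which is what makes variables measured in different units comparable.

```python
import numpy as np

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)

# Made-up values for illustration; mean is 5 and SD is 2.
toy = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
toy_su = standard_units(toy)

print(toy_su)           # each value is (value - 5) / 2
print(np.mean(toy_su))  # approximately 0
print(np.std(toy_su))   # approximately 1
```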

In [None]:
partial_data = data.select(['INCOME', 'WEALTH'])
partial_data

In [None]:
#Step 1: Calculating standard units for GRAIN and RELIEF

partial = data.select(['GRAIN', 'RELIEF'])
partial

su_partial= partial.with_columns([
     'GRAIN (standard units)', standard_units(partial.column('GRAIN')),
     'RELIEF (standard units)', standard_units(partial.column('RELIEF'))   
    ])
su_partial

In [None]:
#Step 2: Calculating the product of pairs of standard units

su_product = su_partial.with_column('product of standard units', su_partial.column(2) * su_partial.column(3))
su_product

In [None]:
#Step 3: Calculating r, the mean of the products of standard units

r = np.mean(su_product.column(4))
r
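The three steps can also be folded into a single helper. The sketch below is one possible way to do that. The function name `correlation` and the arrays `x` and `y` are hypothetical and the values are made up. It compares the result with NumPy's built-in `np.corrcoef` as a check.

```python
import numpy as np

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)

def correlation(x, y):
    "Hypothetical helper: mean of the products of x and y in standard units."
    return np.mean(standard_units(x) * standard_units(y))

# Made-up values for illustration only; not taken from the parish data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

r_toy = correlation(x, y)
print(r_toy)
print(np.corrcoef(x, y)[0, 1])  # NumPy's value, for comparison
```

Swapping the arguments, as in `correlation(y, x)`, gives the same value, because the products of standard units do not depend on which variable is called x.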

Question: So, is r small or large, and does its sign matter? Are we any closer to answering why it seems that the more grain labor in a parish, the more relief?

Let’s look at scatter plots of standard units, and see how they affect our discussion.

In [None]:
Table().with_columns([
    'GRAIN (standard units)',  standard_units(data.column('GRAIN')), 
    'RELIEF (standard units)', standard_units(data.column('RELIEF'))
]).scatter(0,1)
plots.xlim([-4, 4])
plots.ylim([-4, 4])

#Note another approach:

#su_partial.scatter("GRAIN (standard units)", "RELIEF (standard units)", s=20)
#plots.xlim([-5, 5])
#plots.ylim([-5, 5])

In [None]:
#Question: How would you add a trend-line? Would that help?

Now, back to examining UNEMP and RELIEF. You could draw a scatter plot of UNEMP and RELIEF in standard units, but keep the outlier in mind. Check that it actually appears on the scatter plot. You may have to adjust the plot several times.

In [None]:
# After running the code, do you see the outlier? What could you do to check?

Table().with_columns([
    'UNEMP (standard units)',  standard_units(data.column('UNEMP')), 
    'RELIEF (standard units)', standard_units(data.column('RELIEF'))
]).scatter(-1,0, fit_line=True, s=10)
plots.xlim([-4, 4])
plots.ylim([-4, 4])
None

## From the Correlation Coefficient to Questions

/1/ Derive the r for UNEMP and RELIEF. How does it compare with the r we derived for GRAIN and RELIEF, and what does it mean?

/2/ Would the r be useful when considering the relationship between POP, the population of each parish, and WEALTH, the value of real property, such as land and buildings, in the parish per capita?