In [None]:
from datascience import Table
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')

import pandas as pd

## Just for reference, since datascience is evolving rapidly
import datascience as ds
ds.__version__

# Inequality and Poverty: How Confident are We about the Trends?

We will first look at the famous findings by Thomas Piketty and Emmanuel Saez, and then proceed to a discussion of confidence intervals. We will look at how confidence intervals have been calculated in more traditional classes, and then you will use what you have learned in DS8 to calculate confidence intervals for historical poverty rates among the elderly in the US.

/I/ Inequality in historical perspective 

/II/ Confidence intervals – the traditional approach

/III/ Confidence intervals – the newer approach



## /I/ Inequality

We will use the Piketty-Saez data set:

http://eml.berkeley.edu/~saez/ 

Note that this is a real ‘teaching moment’: they went to the archives for data to answer an empirical question. The Internal Revenue Service keeps the records, yet it took until the late 1990s for someone to systematically analyze the tax returns of top income earners. Note also that the analysis relies heavily on what we termed descriptive statistics and on visualization.

Let’s make what has become known as the Golden Gate graph:

In [None]:
ustopincome= Table.read_table("https://github.com/data-8/history-connector/raw/gh-pages/ustop10y.csv")
ustopincome

One question that suggests itself is what happened to poverty rates during this period. We will turn to poverty rates among the elderly population in the US and cover the topic more broadly in discussion.

## /II/ Confidence Intervals

BIG CAVEAT: ASSUME THAT THE DATA ARE A RANDOM DRAW, which they are not; we will see this next week in class with permutation tests.

Below is an outline of what we will cover in discussion:
A confidence interval (CI) is a range of values that is likely to contain the population parameter. Each CI is associated with a confidence level, usually written as 1 − α, where the Greek letter α (alpha) denotes the significance level; a 95% CI corresponds to α = 0.05.

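The traditional procedure consists of the following steps:
- Calculate the mean and the standard deviation of the sample
- Choose the confidence level of interest, usually 90%, 95%, or 99%, and look up the critical value associated with that level: 1.645, 1.96, or 2.576, respectively
- Calculate the margin of error: the critical value times the standard error, where the standard error is the standard deviation divided by the square root of the sample size
- Form the interval as the sample mean plus and minus the margin of error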
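
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming the data are a random sample and that normal critical values apply. The helper name `normal_ci` and the toy numbers are hypothetical and are not part of the poverty data. The sketch uses the sample standard deviation (dividing by n − 1), which is an assumption and can differ slightly from NumPy's default `std()`.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def normal_ci(values, level=0.95):
    """Return (lower, upper) for the mean of `values` using normal critical values."""
    n = len(values)
    center = mean(values)
    # Standard error of the mean: sample standard deviation over sqrt(n)
    std_err = stdev(values) / sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for level=0.95
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * std_err
    return center - margin, center + margin

# Made-up numbers for illustration only (not the poverty data)
toy = [10.0, 12.0, 9.0, 11.0, 13.0]
lower, upper = normal_ci(toy, level=0.95)
print(lower, upper)
```

Passing `level=0.90` or `level=0.99` changes only the critical value, so the interval narrows or widens around the same mean.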


In [None]:
pov= Table.read_table("https://github.com/data-8/history-connector/raw/gh-pages/us65pov.csv")
pov

In [None]:
#Let's see the trends overall:

(
    pov
    .select(["year", "w65pc", "b65pc", "h65pc", "65pc", "a65pc"])
    .plot("year", ["w65pc", "b65pc", "h65pc", "65pc", "a65pc"])
)

In [None]:
#Comparing the African American and white "sample"
diffpov = pov.select(["w65pc", "b65pc"])
diffpov

In [None]:
#A visualization of the "sample" -- note again, this is Census data

viz_uspov = diffpov.select(["w65pc"])
viz_uspov.hist(bins=np.arange(0, 60, 1))

In [None]:
viz_bpov = diffpov.select(["b65pc"])
viz_bpov.hist(bins=np.arange(0, 60, 1))

In [None]:
#The means
bmean = np.mean(diffpov["b65pc"])
bmean

In [None]:
wmean = np.mean(diffpov["w65pc"])
wmean

In [None]:
# The STDs for "samples"

stdb = diffpov["b65pc"].std()
stdb

In [None]:
stdw = diffpov["w65pc"].std()
stdw

In [None]:
# The margins of error (half-widths of the 95% CIs) for the two "samples"; we cover the steps in discussion
# Note: as in class, the 95% level is used

conf95b = (stdb/len(diffpov.rows)**0.5)*1.96
conf95b

In [None]:
# How would you change this to a 99% or a 90% confidence level?

In [None]:
conf95w = (stdw/len(diffpov.rows)**0.5)*1.96
conf95w

In [None]:
#If the two CIs do not overlap, then we can reject the hypothesis of equal poverty rates

wmean + conf95w

In [None]:
bmean - conf95b

In [None]:
#What kind of conclusion can you draw? 
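
The overlap rule used in the cells above can also be written as a small check. This is a minimal sketch; the helper name `intervals_overlap` and the example intervals are hypothetical and are not taken from the poverty data.

```python
def intervals_overlap(a, b):
    """Return True if the closed intervals a=(low, high) and b=(low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Made-up intervals for illustration only
print(intervals_overlap((8.0, 10.0), (9.5, 12.0)))   # True
print(intervals_overlap((8.0, 10.0), (10.5, 12.0)))  # False
```

With the names defined in this notebook, the call would look like `intervals_overlap((wmean - conf95w, wmean + conf95w), (bmean - conf95b, bmean + conf95b))`. Keep in mind that non-overlap is a conservative criterion: overlapping intervals do not by themselves show that the two means are equal.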

## /III/ Your Turn with CIs 

Here's the code from class:

def average_ci(sample, label):
    deviations = Table(['Resample #', 'Deviation'])
    n = sample.num_rows
    average = np.average(sample.column(label))
    for i in np.arange(1000):
        resample = sample.sample(n, with_replacement=True)
        dev = np.average(resample.column(label)) - average
        deviations.append([i, dev])
    # percentile is provided by the datascience package (imported above as ds)
    return (average - ds.percentile(97.5, deviations.column(1)),
            average - ds.percentile(2.5, deviations.column(1)))

average_ci(sample, 'Duration')

Could you adapt it to construct a CI based on our "sample"?
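
As a starting point, here is a minimal sketch of the same resampling idea in plain Python, without the `Table` class. The names `nearest_rank_percentile` and `bootstrap_average_ci` and the toy numbers are hypothetical. The nearest-rank percentile is an assumption and may differ slightly from `percentile` in datascience.

```python
import math
import random
from statistics import mean

def nearest_rank_percentile(p, values):
    """Smallest value that is at least as large as p% of the values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def bootstrap_average_ci(values, repetitions=1000, seed=0):
    """Approximate 95% CI for the mean, following the deviations approach above."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    average = mean(values)
    deviations = []
    for _ in range(repetitions):
        # Resample with replacement, same size as the original "sample"
        resample = rng.choices(values, k=len(values))
        deviations.append(mean(resample) - average)
    return (average - nearest_rank_percentile(97.5, deviations),
            average - nearest_rank_percentile(2.5, deviations))

# Made-up numbers for illustration only (not the poverty data)
toy = [10.0, 12.0, 9.0, 11.0, 13.0, 8.0, 14.0, 10.5]
lower, upper = bootstrap_average_ci(toy)
print(lower, upper)
```

With the table from this notebook, the class function can be tried directly as `average_ci(diffpov, 'b65pc')` and `average_ci(diffpov, 'w65pc')`, provided `percentile` is in scope.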