# APMTH 207: Advanced Scientific Computing: 
## Stochastic Methods for Data Analysis, Inference and Optimization
## Homework #10
**Harvard University**<br>
**Spring 2017**<br>
**Instructors: Rahul Dave**<br>
**Due Date: ** Friday, April 14th, 2017 at 11:59pm

**Instructions:**

- Upload your final answers as well as your iPython notebook containing all work to Canvas.

- Structure your notebook and your work to maximize readability.

During the course of the semester you have worked with a considerable number of sophisticated models. In previous homeworks, you have focused on developing algorithms for performing inference on or prediction using these models. In this assignment, much like in Long Homework #3, you will explore how to recast data, tasks and research questions from a variety of different contexts so that an existing model can be applied for analysis.

Example 10.1.3 in "Statistical Rethinking", the excerpt of which is included in the data directory for this assignment, illustrates a study of the effect of an applicant's gender on graduate school admissions to six U.C. Berkeley departments through a comparison of four models. 

## Problem #1: A Study of Census Data

In this problem, you are given the data for the 1994 U.S. Census. The data has been processed so that only a subset of the features are present (for full dataset as well as the description see the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Census+Income)). 

You will be investigate the effect of gender on a person's yearly income in the dataset. In particular, we want to know how a person's gender effect the likelihood of their yearly salary being above or below \$50k. 

First we want to aggregate the dataset by seven different categories. The categories we wish to consider are: 
- 4 year college degree
- Some-college or two year academic college degree
- High school
- Professional, vocational school
- Masters 
- Doctorate
- Some or no high school
Note that you might have to combine some of the existing education categories in your dataframe. For each category, we suggest that you only keep track of a count of the number of males and females who make above (and resp. below) $50k (see the dataset in Example 10.1.3). 

Following Example 10.1.3, build two models for the classification of an individual's yearly income (1 being above \$50k and 0 being below), one of these models should include the effect of gender while the other should not. 

Replicate the analysis in 10.1.3 using your models; specifically, make a plot like Figure 10.5 (posterior check) to see how well your models fits the data. Using your analysis, discuss the effect of gender on income.

Following Example 10.1.3, build two models for the classification of an individual's yearly income taking into account education. 

Replicate the analysis in 10.1.3 using your models; specifically, make a plot like Figure 10.6 (posterior check) to see how well your model fits the data. Using your analysis, discuss the effect of gender on income, taking into account an individual's education.

In [19]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
%matplotlib inline


df = pd.read_csv('data/census_data.csv')
df.head()


Unnamed: 0.1,Unnamed: 0,age,sector,edu,marital,race,sex,earning
0,0,39,State-gov,Bachelors,Never-married,White,Male,<=50K
1,1,50,Self-emp-not-inc,Bachelors,Married-civ-spouse,White,Male,<=50K
2,2,38,Private,HS-grad,Divorced,White,Male,<=50K
3,3,53,Private,11th,Married-civ-spouse,Black,Male,<=50K
4,4,28,Private,Bachelors,Married-civ-spouse,Black,Female,<=50K


In [20]:
df['edu'].unique()

array(['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
       'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
       '5th-6th', '10th', '1st-4th', 'Preschool', '12th'], dtype=object)

In [37]:
cat_1 = ['Bachelors']
cat_2 = ['Some-college', 'Assoc-acdm']
cat_3 = ['Assoc-voc', 'Prof-school']
cat_4 = ['HS-grad']
cat_5 = ['11th', '7th-8th', '9th', '5th-6th', '10th', '1st-4th', 'Preschool', '12th']
cat_6 = ['Masters']
cat_7 = ['Doctorate']

categories = [cat_1, cat_2, cat_3, cat_4, cat_5, cat_6, cat_7]
dataset = []
i = 1
for cat in categories:
    filtered = df[df['edu'].isin(cat)]
    dataset.append((i, 0, len(filtered[filtered['sex'] == 'Female']), len(filtered[(filtered['sex'] == 'Female') & (filtered['earning'] == '<=50K')]), len(filtered[(filtered['sex'] == 'Female') & (filtered['earning'] == '>50K')])))
    dataset.append((i, 1, len(filtered[filtered['sex'] == 'Male']), len(filtered[(filtered['sex'] == 'Male') & (filtered['earning'] == '<=50K')]), len(filtered[(filtered['sex'] == 'Male') & (filtered['earning'] == '>50K')])))
    i += 1
dataset = np.array(dataset)
dataset = pd.DataFrame(dataset, columns=['category', 'sex', 'total', '<=50', '>50'])
dataset

Unnamed: 0,category,sex,total,<=50,>50
0,1,0,1619,1280,339
1,1,1,3736,1854,1882
2,2,0,3227,2974,253
3,2,1,5131,3732,1399
4,3,0,592,483,109
5,3,1,1366,691,675
6,4,0,3390,3164,226
7,4,1,7111,5662,1449
8,5,0,1321,1298,23
9,5,1,2932,2711,221


In [38]:
# WITH GENDER

# define invlogit function
def invlogit(x):
    return pymc.exp(x) / (1 + pymc.exp(x))

# observed data
n = dataset['total'].values
x = dataset['sex'].values
y_obs = dataset['>50'].values

# define priors
alpha = pymc.Normal('alpha', mu=0, tau=10)
beta = pymc.Normal('beta', mu=0, tau=10)

# define likelihood
p = pymc.InvLogit('p', alpha + beta*x)
y = pymc.Binomial('y_obs', n=n, p=p, value=y_obs, observed=True)

# inference
m = pymc.Model([alpha, beta, y])


In [39]:
mc = pymc.MCMC(m)
mc.sample(iter=11000, burn=10000)

 [-----------------100%-----------------] 11000 of 11000 complete in 1.1 sec

In [40]:
beta.stats()

{'95% HPD interval': array([ 1.18557867,  1.30994645]),
 'mc error': 0.0031599400426224173,
 'mean': 1.240725960299812,
 'n': 1000,
 'quantiles': {2.5: 1.1838268410138473,
  25: 1.2161407605377883,
  50: 1.2398927570312375,
  75: 1.2655260303409308,
  97.5: 1.3099464537622341},
 'standard deviation': 0.033682142798189769}

In [42]:
print 'average male advantage', np.exp(beta.stats()['mean'])

3.4581230137558441

In [58]:
logistic = lambda x: 1. / (1 + np.exp(-x))
males = logistic(alpha.trace() + beta.trace())
females = logistic(alpha.trace())
diff = males - females
print 'male advantage percentiles', np.percentile(diff, 2.5), np.percentile(diff, 50), np.percentile(diff, 97.5)

male advantage percentiles 0.184690137403 0.192069313762 0.200015899732


In [59]:
MAP = pymc.MAP(m)
MAP.fit()
print 'model aic, bic', MAP.AIC, MAP.BIC

model aic, bic 3969.60420286 3970.88231752


In [56]:
# NO GENDER

# define priors
alpha2 = pymc.Normal('alpha', mu=0, tau=10)

# define likelihood
p2 = pymc.InvLogit('p', alpha)
y2 = pymc.Binomial('y_obs', n=n, p=p2, value=y_obs, observed=True)

# inference
m2 = pymc.Model([alpha2, y2])

In [60]:
MAP = pymc.MAP(m2)
MAP.fit()
print 'model aic, bic', MAP.AIC, MAP.BIC

model aic, bic 9898.15871995 9898.79777728


In [72]:
# WITH GENDER


# observed data
cat = dataset['category'].values


# define priors
alphas = pymc.Normal('alpha', mu=0, tau=10, size=7)
beta3 = pymc.Normal('beta', mu=0, tau=10)

# define likelihood
def match_dept(alphas, cat):
    return alphas[cat - 1]
p3 = pymc.InvLogit('p', match_dept(alphas, cat) + beta3*x)
y3 = pymc.Binomial('y_obs', n=n, p=p3, value=y_obs, observed=True)

# inference
m3 = pymc.Model([alphas, beta3, y3])



In [73]:
mc = pymc.MCMC(m3)
mc.sample(iter=11000, burn=10000)


 [-----------------100%-----------------] 11000 of 11000 complete in 1.3 sec

In [75]:
beta3.stats()

{'95% HPD interval': array([ 1.16397331,  1.25959332]),
 'mc error': 0.0021678823937301707,
 'mean': 1.2190834180277406,
 'n': 1000,
 'quantiles': {2.5: 1.1680985637425561,
  25: 1.2055042618498635,
  50: 1.2181412310010695,
  75: 1.2344974204240882,
  97.5: 1.2775460895965354},
 'standard deviation': 0.024358136464205225}

In [77]:
print 'average male advantage', np.exp(beta3.stats()['mean'])

average male advantage 3.3840845208


In [76]:
alphas.stats()

{'95% HPD interval': array([[-1.27467757, -2.31399983, -1.25496924, -2.6248527 , -3.75351406,
         -0.69661536, -0.06452168],
        [-1.15172624, -2.16807808, -1.21318883, -2.50752291, -3.50911107,
         -0.50188815,  0.31635917]]),
 'mc error': array([ 0.00294713,  0.00339148,  0.00114192,  0.00259892,  0.00535699,
         0.00436833,  0.0093698 ]),
 'mean': array([-1.22165214, -2.23763333, -1.23443438, -2.56935554, -3.61688324,
        -0.59980522,  0.11894709]),
 'n': 1000,
 'quantiles': {2.5: array([-1.27467757, -2.31102181, -1.25490578, -2.62952736, -3.75268712,
         -0.69661536, -0.06395842]),
  25: array([-1.24598484, -2.26255586, -1.24325831, -2.59030117, -3.66423978,
         -0.63363324,  0.05108358]),
  50: array([-1.22648815, -2.24245793, -1.23433091, -2.56831644, -3.61252993,
         -0.59980435,  0.10887859]),
  75: array([-1.19743301, -2.21401341, -1.22704143, -2.54802457, -3.57167357,
         -0.57377007,  0.17433633]),
  97.5: array([-1.15172624, -2.154

In [78]:
MAP = pymc.MAP(m3)
MAP.fit()
print 'model aic, bic', MAP.AIC, MAP.BIC

model aic, bic 144.093109267 149.205567903


In [79]:
# NO GENDER

# define priors
alphas2 = pymc.Normal('alpha', mu=0, tau=10, size=7)

# define likelihood
def match_dept(alphas, cat):
    return alphas[cat - 1]
p4 = pymc.InvLogit('p', match_dept(alphas2, cat))
y4 = pymc.Binomial('y_obs', n=n, p=p4, value=y_obs, observed=True)

# inference
m3 = pymc.Model([alphas2, y4])

In [80]:
MAP = pymc.MAP(m3)
MAP.fit()
print 'model aic, bic', MAP.AIC, MAP.BIC

model aic, bic 1850.85306305 1855.32646436
