
# Day 31 - Exercises Solutions 

## 1)

Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool.

In [34]:
import first
live, firsts, others = first.MakeFrames()
live = live[live.prglngth>30]

The following are the only variables I found that have a statistically significant effect on pregnancy length.

In [35]:
import statsmodels.formula.api as smf
model = smf.ols('prglngth ~ birthord==1 + race==2 + nbrnaliv>1', data=live)
results = model.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | prglngth | R-squared: | 0.011 |
| Model: | OLS | Adj. R-squared: | 0.011 |
| Method: | Least Squares | F-statistic: | 34.28 |
| Date: | Tue, 03 Jan 2017 | Prob (F-statistic): | 5.09e-22 |
| Time: | 13:37:55 | Log-Likelihood: | -18247.0 |
| No. Observations: | 8884 | AIC: | 36500.0 |
| Df Residuals: | 8880 | BIC: | 36530.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 38.7617 | 0.039 | 1006.410 | 0.000 | 38.686 38.837 |
| birthord == 1[T.True] | 0.1015 | 0.040 | 2.528 | 0.011 | 0.023 0.180 |
| race == 2[T.True] | 0.1390 | 0.042 | 3.311 | 0.001 | 0.057 0.221 |
| nbrnaliv > 1[T.True] | -1.4944 | 0.164 | -9.086 | 0.000 | -1.817 -1.172 |

| | | | |
|---|---|---|---|
| Omnibus: | 1587.47 | Durbin-Watson: | 1.619 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 6160.751 |
| Skew: | -0.852 | Prob(JB): | 0.0 |
| Kurtosis: | 6.707 | Cond. No. | 10.9 |
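
To turn the fitted coefficients into an office-pool guess, add the intercept to the coefficient of each indicator that applies. The sketch below hard-codes the rounded coefficients from the summary above; the helper name `predict_prglngth` is made up for illustration and is not part of `first` or StatsModels.

```python
# Rounded coefficients copied from the OLS summary above.
INTERCEPT = 38.7617
COEF_FIRST_BABY = 0.1015   # birthord == 1
COEF_RACE_2 = 0.1390       # race == 2
COEF_MULTIPLE = -1.4944    # nbrnaliv > 1


def predict_prglngth(first_baby, race_is_2, multiple_birth):
    """Hypothetical helper: predicted pregnancy length in weeks.

    Each argument is a bool saying whether the indicator applies.
    """
    return (INTERCEPT
            + COEF_FIRST_BABY * first_baby
            + COEF_RACE_2 * race_is_2
            + COEF_MULTIPLE * multiple_birth)


baseline = predict_prglngth(False, False, False)
first_single = predict_prglngth(True, False, False)
twins = predict_prglngth(False, False, True)
print(round(baseline, 2), round(first_single, 2), round(twins, 2))
```

Because R-squared is only about 0.011, these adjustments explain very little of the variation, and every prediction stays within about a week and a half of the intercept.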


## 2)

The Trivers-Willard hypothesis suggests that for many mammals the sex ratio depends on “maternal condition”; that is, factors like the mother’s age, size, health, and social status. See https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis

Some studies have shown this effect among humans, but results are mixed. In this chapter we tested some variables related to these factors, but didn’t find any with a statistically significant effect on sex ratio.

As an exercise, use a data mining approach to test the other variables in the pregnancy and respondent files. Can you find any factors with a substantial effect?

In [36]:
import regression
join = regression.JoinFemResp(live)

In [37]:
# Solution

def GoMining(df):
    """Searches for variables that predict the sex of the baby.

    df: DataFrame of pregnancy records

    returns: list of (pseudo-R^2, variable name) pairs
    """
    df['boy'] = (df.babysex==1).astype(int)
    variables = []
    for name in df.columns:
        try:
            if df[name].var() < 1e-7:
                continue

            formula='boy ~ agepreg + ' + name
            model = smf.logit(formula, data=df)
            nobs = len(model.endog)
            if nobs < len(df)/2:
                continue

            results = model.fit()
        except Exception:
            # Skip variables that cannot be used in a formula or fail to fit.
            continue

        variables.append((results.prsquared, name))

    return variables

variables = GoMining(join)

Optimization terminated successfully.
         Current function value: 0.686305
         Iterations 4


In [38]:
# Solution

# Here are the 30 variables that yield the highest pseudo-R^2 values.

regression.MiningReport(variables)

totalwgt_lb 0.00969685519034
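
The ranking step amounts to sorting the `(pseudo-R^2, name)` pairs in descending order and keeping the first few. The following is a minimal sketch of that idea, not the actual implementation of `regression.MiningReport`; the helper `top_variables` and the example pairs are invented for illustration.

```python
def top_variables(variables, n=30):
    """Hypothetical helper: return the n pairs with the highest pseudo-R^2.

    variables: list of (pseudo-R^2, variable name) pairs
    """
    # Tuples sort by their first element, so this ranks by pseudo-R^2.
    return sorted(variables, reverse=True)[:n]


# Made-up pairs for illustration only; real values come from GoMining(join).
example = [(0.0004, 'var_a'), (0.0097, 'var_b'), (0.0012, 'var_c')]
for prsquared, name in top_variables(example, n=2):
    print(name, prsquared)
```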


In [39]:
# Solution

# Eliminating variables that are not known during pregnancy and 
# others that are fishy for various reasons, here's the best model I could find:

formula='boy ~ agepreg + fmarout5==5 + infever==1'
model = smf.logit(formula, data=join)
results = model.fit()
results.summary() 

Optimization terminated successfully.
         Current function value: 0.691874
         Iterations 4


| | | | |
|---|---|---|---|
| Dep. Variable: | boy | No. Observations: | 8884 |
| Model: | Logit | Df Residuals: | 8880 |
| Method: | MLE | Df Model: | 3 |
| Date: | Tue, 03 Jan 2017 | Pseudo R-squ.: | 0.001653 |
| Time: | 13:38:09 | Log-Likelihood: | -6146.6 |
| converged: | True | LL-Null: | -6156.8 |
| | | LLR p-value: | 0.0001432 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -0.1805 | 0.118 | -1.534 | 0.125 | -0.411 0.050 |
| fmarout5 == 5[T.True] | 0.1582 | 0.049 | 3.217 | 0.001 | 0.062 0.255 |
| infever == 1[T.True] | 0.2194 | 0.065 | 3.374 | 0.001 | 0.092 0.347 |
| agepreg | 0.0050 | 0.004 | 1.172 | 0.241 | -0.003 0.013 |
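
Logit coefficients are on the log-odds scale, so it helps to convert them to probabilities to see how small the effect is. The sketch below uses the rounded coefficients from the summary above; `prob_boy` is a made-up helper and the age of 30 is an arbitrary example value.

```python
import math

# Rounded coefficients copied from the logit summary above.
INTERCEPT = -0.1805
COEF_FMAROUT5_5 = 0.1582   # fmarout5 == 5
COEF_INFEVER_1 = 0.2194    # infever == 1
COEF_AGEPREG = 0.0050


def prob_boy(agepreg, fmarout5_is_5, infever_is_1):
    """Hypothetical helper: predicted probability that the baby is a boy."""
    log_odds = (INTERCEPT
                + COEF_AGEPREG * agepreg
                + COEF_FMAROUT5_5 * fmarout5_is_5
                + COEF_INFEVER_1 * infever_is_1)
    # The logistic function maps log-odds to a probability.
    return 1 / (1 + math.exp(-log_odds))


p_neither = prob_boy(30, False, False)
p_both = prob_boy(30, True, True)
print(round(p_neither, 3), round(p_both, 3))
```

Even with both indicators set, the predicted probability moves by less than ten percentage points, which is consistent with the very small pseudo-R-squared.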


## 3)

If the quantity you want to predict is a count, you can use Poisson regression, which is implemented in StatsModels with a function called `poisson`. It works the same way as `ols` and `logit`. As an exercise, let’s use it to predict how many children a woman has borne; in the NSFG dataset, this variable is called `numbabes`.

Suppose you meet a woman who is 35 years old, black, and a college graduate whose annual household income exceeds $75,000. How many children would you predict she has borne?

In [40]:
# Solution

# I used a nonlinear model of age.  The age3 term is probably overkill, 
# but I included it to show that it might not be a crazy choice.
# It doesn't have much effect on the predictions.

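import numpy as np

# Treat the special code 97 as missing; assigning back avoids relying on an in-place update.
join['numbabes'] = join.numbabes.replace([97], np.nan)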
join['age2'] = join.age_r**2
join['age3'] = join.age_r**3

In [41]:
# Solution

formula='numbabes ~ age_r + age2 + age3 + C(race) + totincr + educat'
model = smf.poisson(formula, data=join)
results = model.fit()
results.summary() 

Optimization terminated successfully.
         Current function value: 1.676477
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | numbabes | No. Observations: | 8884 |
| Model: | Poisson | Df Residuals: | 8876 |
| Method: | MLE | Df Model: | 7 |
| Date: | Tue, 03 Jan 2017 | Pseudo R-squ.: | 0.03716 |
| Time: | 13:38:09 | Log-Likelihood: | -14894.0 |
| converged: | True | LL-Null: | -15469.0 |
| | | LLR p-value: | 5.086e-244 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -3.1875 | 0.733 | -4.351 | 0.000 | -4.623 -1.752 |
| C(race)[T.2] | -0.1381 | 0.015 | -9.339 | 0.000 | -0.167 -0.109 |
| C(race)[T.3] | -0.0970 | 0.025 | -3.943 | 0.000 | -0.145 -0.049 |
| age_r | 0.3659 | 0.070 | 5.215 | 0.000 | 0.228 0.503 |
| age2 | -0.0087 | 0.002 | -3.952 | 0.000 | -0.013 -0.004 |
| age3 | 6.8e-05 | 2.24e-05 | 3.039 | 0.002 | 2.41e-05 0.000 |
| totincr | -0.0186 | 0.002 | -9.778 | 0.000 | -0.022 -0.015 |
| educat | -0.0471 | 0.003 | -16.074 | 0.000 | -0.053 -0.041 |


Now we can predict the number of children for a woman who is 35 years old, black, and a college
graduate whose annual household income exceeds $75,000.

In [42]:
# Solution

import pandas
columns = ['age_r', 'age2', 'age3', 'race', 'totincr', 'educat']
new = pandas.DataFrame([[35, 35**2, 35**3, 1, 14, 16]], columns=columns)
results.predict(new)

array([ 2.46483498])
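
A Poisson model predicts the exponential of the linear combination of the inputs, so the result can be roughly checked by hand. This sketch plugs the rounded coefficients from the summary into that formula; because the coefficients are rounded (and the age terms are multiplied by large numbers), expect a value close to, but not identical to, the output of `results.predict`. Race code 1 is the baseline category here, so it contributes no term.

```python
import math

# Rounded coefficients copied from the Poisson summary above.
coefs = {
    'Intercept': -3.1875,
    'age_r': 0.3659,
    'age2': -0.0087,
    'age3': 6.8e-05,
    'totincr': -0.0186,
    'educat': -0.0471,
}

# Same inputs as the prediction cell above.
values = {
    'Intercept': 1,
    'age_r': 35,
    'age2': 35**2,
    'age3': 35**3,
    'totincr': 14,
    'educat': 16,
}

# Linear predictor on the log scale, then exponentiate to get the expected count.
linear = sum(coefs[name] * values[name] for name in coefs)
approx_prediction = math.exp(linear)
print(round(approx_prediction, 2))
```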

## 4)

If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called `mnlogit`. As an exercise, let’s use it to guess whether a woman is married, cohabitating, widowed, divorced, separated, or never married; in the NSFG dataset, marital status is encoded in a variable called `rmarital`.

Suppose you meet a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000. What is the probability that she is married, cohabitating, and so on?

In [43]:
# Solution

# Here's the best model I could find.

formula='rmarital ~ age_r + age2 + C(race) + totincr + educat'
model = smf.mnlogit(formula, data=join)
results = model.fit()
results.summary() 


Optimization terminated successfully.
         Current function value: 1.084053
         Iterations 8


| | | | |
|---|---|---|---|
| Dep. Variable: | rmarital | No. Observations: | 8884 |
| Model: | MNLogit | Df Residuals: | 8849 |
| Method: | MLE | Df Model: | 30 |
| Date: | Tue, 03 Jan 2017 | Pseudo R-squ.: | 0.1682 |
| Time: | 13:38:10 | Log-Likelihood: | -9630.7 |
| converged: | True | LL-Null: | -11579.0 |
| | | LLR p-value: | 0.0 |

| rmarital=2 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 9.0156 | 0.805 | 11.199 | 0.000 | 7.438 10.593 |
| C(race)[T.2] | -0.9237 | 0.089 | -10.418 | 0.000 | -1.097 -0.750 |
| C(race)[T.3] | -0.6179 | 0.136 | -4.536 | 0.000 | -0.885 -0.351 |
| age_r | -0.3635 | 0.051 | -7.150 | 0.000 | -0.463 -0.264 |
| age2 | 0.0048 | 0.001 | 6.103 | 0.000 | 0.003 0.006 |
| totincr | -0.1310 | 0.012 | -11.337 | 0.000 | -0.154 -0.108 |
| educat | -0.1953 | 0.019 | -10.424 | 0.000 | -0.232 -0.159 |

| rmarital=3 | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 2.9570 | 3.020 | 0.979 | 0.328 | -2.963 8.877 |
| C(race)[T.2] | -0.4411 | 0.237 | -1.863 | 0.062 | -0.905 0.023 |


Make a prediction for a woman who is 25 years old, white, and a high
school graduate whose annual household income is about $45,000.

In [44]:
# Solution

# This person has a 75% chance of being currently married, 
# a 13% chance of being "not married but living with opposite 
# sex partner", etc.

columns = ['age_r', 'age2', 'race', 'totincr', 'educat']
new = pandas.DataFrame([[25, 25**2, 2, 11, 12]], columns=columns)
results.predict(new)

array([[ 0.75002831,  0.12639705,  0.00156367,  0.03340309,  0.02148542,
         0.06712247]])
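
To read the prediction, pair each probability with its category and check that they sum to 1. The sketch below also checks one ratio by hand: in a multinomial logit, the `rmarital=2` coefficients give the log-odds of category 2 relative to the reference category. It assumes the categories are ordered as listed in the exercise text and that category 1 (married) is the reference, which is consistent with the summary starting at `rmarital=2`. The coefficients are the rounded values from the summary, so the two ratios should agree only approximately.

```python
import math

# Assumed category order, following the exercise text (rmarital codes 1 to 6).
labels = ['married', 'cohabitating', 'widowed',
          'divorced', 'separated', 'never married']

# Probabilities copied from the prediction output above.
probs = [0.75002831, 0.12639705, 0.00156367,
         0.03340309, 0.02148542, 0.06712247]

for label, p in zip(labels, probs):
    print('%-14s %.3f' % (label, p))

total = sum(probs)

# Log-odds of cohabitating versus married for age 25, race 2,
# totincr 11, educat 12, using the rounded rmarital=2 coefficients.
log_odds = (9.0156            # Intercept
            - 0.9237          # C(race)[T.2]
            - 0.3635 * 25     # age_r
            + 0.0048 * 25**2  # age2
            - 0.1310 * 11     # totincr
            - 0.1953 * 12)    # educat

ratio_from_coefs = math.exp(log_odds)
ratio_from_probs = probs[1] / probs[0]
print(round(ratio_from_coefs, 3), round(ratio_from_probs, 3))
```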