# NB 2

Given that the linear probability models gave rather weak results, I want to slowly build up to running an instrumental variable (2SLS) regression by first creating some control variables and assessing the assumptions of an IV/2SLS Model.

In [1]:
# Set Up
import pandas as pd
import numpy as np

# Suppress FutureWarning messages so the cell output stays readable
import warnings
warnings.simplefilter('ignore', FutureWarning)

# Regression
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from IPython.core.display import HTML
from statsmodels.sandbox.regression.gmm import IV2SLS

# Graphing
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (10,10)
import seaborn as sns

# Specialty Imports

For the IV regression I'm going to use the statsmodels implementation of IV/two-stage least squares (2SLS). Documentation can be found here: https://www.statsmodels.org/stable/generated/statsmodels.sandbox.regression.gmm.IV2SLS.html

In [2]:
sample = pd.read_csv('sample.csv')

# Part II: Using instrumental variables

Using the story from the Notion page, we established that a couple's desire to have both a boy and a girl is a strong incentive for them to have more kids. For "samesex" to be a valid instrument, we need to take seriously the claim that a child's sex is assigned completely at random, because only then do we satisfy the exogeneity assumption of instrumental variables.

Read more here: https://en.wikipedia.org/wiki/Instrumental_variables_estimation
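
To make the relevance and exogeneity requirements concrete, here is a minimal sketch of two-stage least squares computed by hand with NumPy on simulated data. Everything in it is an assumption for illustration: the names `z`, `x`, `u`, the coefficients, and the data-generating process are made up and are not taken from `sample.csv`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical data-generating process (illustration only)
z = rng.integers(0, 2, size=n).astype(float)   # instrument: random, like a coin flip
u = rng.normal(size=n)                         # unobserved confounder
x = 0.5 * z + 0.8 * u + rng.normal(size=n)     # endogenous regressor (depends on u)
y = 2.0 * x + 1.5 * u + rng.normal(size=n)     # outcome; the true effect of x is 2.0

def ols(X, y):
    """Return least-squares coefficients for a regression of y on X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)

# Naive OLS is biased because x is correlated with the confounder u
beta_ols = ols(np.column_stack([const, x]), y)[1]

# Stage 1: project the endogenous regressor on the instrument
Z = np.column_stack([const, z])
x_hat = Z @ ols(Z, x)

# Stage 2: regress the outcome on the fitted values from stage 1
beta_2sls = ols(np.column_stack([const, x_hat]), y)[1]

print(f"OLS estimate:  {beta_ols:.3f}")
print(f"2SLS estimate: {beta_2sls:.3f}")
```

In this simulation the OLS slope drifts away from 2.0 while the 2SLS slope stays close to it, because `z` moves `x` but is unrelated to `u`. The standard errors from a hand-rolled second stage are not valid, which is one reason to use a dedicated routine such as the `IV2SLS` class imported above for the real analysis.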


Let's begin by controlling for the father's influence on having more children by adding the father's age and education, along with the mother's race:

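$$ y_i = \beta_0 + \beta_1 \langle \text{ageDummies} \rangle + \beta_2 \langle \text{educDummies} \rangle + \beta_3 \langle \text{agefstmDummies} \rangle + \beta_4\, \text{aged} + \beta_5\, \text{educd} + \beta_6\, \text{blackm} + \beta_7\, \text{hispm} + \beta_8\, \text{othracem} + \epsilon_i $$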





In [3]:
educDummies = pd.get_dummies(sample['educm'], prefix = "ed")
ageDummies = pd.get_dummies(sample['agem'], prefix = "age")
age_fstmDummies = pd.get_dummies(sample['agefstm'], prefix = "age_ftsm")


m3_features = pd.concat([ageDummies, educDummies, age_fstmDummies], axis = 1)

Now we can test whether the father's age and education, together with race, confound the relationship. Notice that the excluded group for the race variables is white, non-Hispanic mothers.
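
The idea of an excluded group applies to every set of dummies once a constant is in the model: if all levels are kept, the dummy columns sum to the constant column and the design matrix loses rank. The sketch below shows this on a made-up toy column (the `toy` frame and its values are hypothetical), using the `drop_first` option of `pd.get_dummies` as one possible way to leave a reference level out.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for a categorical variable such as educm
toy = pd.DataFrame({"educ": [10, 12, 12, 16, 10, 16, 12, 10]})

# Keep every level (as the cell above does) versus dropping a reference level
full = pd.get_dummies(toy["educ"], prefix="ed", dtype=float)
dropped = pd.get_dummies(toy["educ"], prefix="ed", drop_first=True, dtype=float)

# Add an intercept column by hand so the rank can be inspected directly
X_full = np.column_stack([np.ones(len(toy)), full.to_numpy()])
X_drop = np.column_stack([np.ones(len(toy)), dropped.to_numpy()])

print(X_full.shape[1], np.linalg.matrix_rank(X_full))   # 4 columns, rank 3
print(X_drop.shape[1], np.linalg.matrix_rank(X_drop))   # 3 columns, rank 3
```

With the reference level dropped, each remaining dummy coefficient reads as a difference from the excluded group, which is the interpretation used for the race variables here.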

In [4]:
new_features = sample[['aged', 'educd', 'blackm', 'hispm', 'othracem']]
noDad = sample[['blackm', 'hispm', 'othracem']]



In [5]:
m4_features = pd.concat([m3_features, new_features], axis = 1)
m4_features_noDad = pd.concat([m3_features, noDad], axis = 1)
            

m4_outcome = sample[['morekids']]
np.mean(m4_outcome)

morekids    0.373101
dtype: float64

Let's conduct a joint significance test for Model 4a, which includes the father variables and the race variables.


$$H_0: \beta_{blackm} = 0,\ \beta_{hispm} = 0,\ \beta_{educd} = 0$$

$$H_1: \text{at least one of } \beta_{blackm},\ \beta_{hispm},\ \beta_{educd} \neq 0$$
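
A joint hypothesis like this is usually checked with an F-test that compares the model with and without the restricted variables. Here is a minimal sketch of that arithmetic in NumPy on simulated data; the regressors, coefficients, and sample size are assumptions for illustration and have nothing to do with `sample.csv`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical regressors: the first one matters, the other two do not
X = rng.normal(size=(n, 3))
y = 1.0 + 0.5 * X[:, 0] + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return float(resid @ resid)

const = np.ones((n, 1))
X_unres = np.hstack([const, X])         # unrestricted: all regressors
X_res = np.hstack([const, X[:, :1]])    # restricted: last two coefficients forced to 0

q = X_unres.shape[1] - X_res.shape[1]   # number of restrictions being tested
df_resid = n - X_unres.shape[1]         # residual degrees of freedom (unrestricted)

rss_u, rss_r = rss(X_unres, y), rss(X_res, y)
f_stat = ((rss_r - rss_u) / q) / (rss_u / df_resid)

print(f"F({q}, {df_resid}) = {f_stat:.3f}")
```

A large F statistic relative to the F(q, df_resid) distribution is evidence against $H_0$. Fitted statsmodels results also expose an `f_test` method that takes restrictions of this kind, which avoids fitting the restricted model by hand.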

In [8]:
m4_a = sm.OLS(m4_outcome ,sm.add_constant(m4_features))
m4_a = m4_a.fit()
m4_a.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | morekids | R-squared: | 0.078 |
| Model: | OLS | Adj. R-squared: | 0.078 |
| Method: | Least Squares | F-statistic: | 350.6 |
| Date: | Sun, 10 May 2020 | Prob (F-statistic): | 0.0 |
| Time: | 14:05:18 | Log-Likelihood: | -154150.0 |
| No. Observations: | 236459 | AIC: | 308400.0 |
| Df Residuals: | 236401 | BIC: | 309000.0 |
| Df Model: | 57 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.2315 | 0.024 | 9.695 | 0.000 | 0.185 | 0.278 |
| age_21 | -0.2321 | 0.013 | -18.467 | 0.000 | -0.257 | -0.207 |
| age_22 | -0.1983 | 0.009 | -22.365 | 0.000 | -0.216 | -0.181 |
| age_23 | -0.1655 | 0.007 | -23.846 | 0.000 | -0.179 | -0.152 |
| age_24 | -0.1104 | 0.006 | -19.169 | 0.000 | -0.122 | -0.099 |
| age_25 | -0.0622 | 0.005 | -12.364 | 0.000 | -0.072 | -0.052 |
| age_26 | -0.0382 | 0.005 | -8.460 | 0.000 | -0.047 | -0.029 |
| age_27 | 0.0043 | 0.004 | 1.041 | 0.298 | -0.004 | 0.013 |
| age_28 | 0.0257 | 0.004 | 6.441 | 0.000 | 0.018 | 0.033 |

| | | | |
|---|---|---|---|
| Omnibus: | 1669238.513 | Durbin-Watson: | 1.967 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 29577.479 |
| Skew: | 0.473 | Prob(JB): | 0.0 |
| Kurtosis: | 1.549 | Cond. No. | 1.01e+16 |


The p-values for $aged$ and $educd$ are $p = 0.164$ and $p = 0.661$ respectively, which indicates that the father variables are not significant in determining the probability of having more kids. However, the race variables have p-values of 0.000, which marks them as significant determinants.

Let's run another regression taking those out.


In [9]:
m4_b = sm.OLS(m4_outcome ,sm.add_constant(m4_features_noDad))
m4_b = m4_b.fit()
m4_b.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | morekids | R-squared: | 0.078 |
| Model: | OLS | Adj. R-squared: | 0.078 |
| Method: | Least Squares | F-statistic: | 363.3 |
| Date: | Sun, 10 May 2020 | Prob (F-statistic): | 0.0 |
| Time: | 14:05:19 | Log-Likelihood: | -154150.0 |
| No. Observations: | 236459 | AIC: | 308400.0 |
| Df Residuals: | 236403 | BIC: | 309000.0 |
| Df Model: | 55 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -7.022e+10 | 1.78e+11 | -0.394 | 0.694 | -4.2e+11 | 2.79e+11 |
| age_21 | 1.067e+08 | 1.89e+10 | 0.006 | 0.995 | -3.69e+10 | 3.72e+10 |
| age_22 | 1.067e+08 | 1.89e+10 | 0.006 | 0.995 | -3.69e+10 | 3.72e+10 |
| age_23 | 1.067e+08 | 1.89e+10 | 0.006 | 0.995 | -3.69e+10 | 3.72e+10 |
| age_24 | 1.067e+08 | 1.89e+10 | 0.006 | 0.995 | -3.69e+10 | 3.72e+10 |
| age_25 | 1.067e+08 | 1.89e+10 | 0.006 | 0.995 | -3.69e+10 | 3.72e+10 |
| age_26 | 1.067e+08 | 1.89e+10 | 0.006 | 0.995 | -3.69e+10 | 3.72e+10 |
| age_27 | 1.067e+08 | 1.89e+10 | 0.006 | 0.995 | -3.69e+10 | 3.72e+10 |
| age_28 | 1.067e+08 | 1.89e+10 | 0.006 | 0.995 | -3.69e+10 | 3.72e+10 |

| | | | |
|---|---|---|---|
| Omnibus: | 1673201.867 | Durbin-Watson: | 1.967 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 29556.85 |
| Skew: | 0.473 | Prob(JB): | 0.0 |
| Kurtosis: | 1.549 | Cond. No. | 1.34e+15 |


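Notice that the Adj. R-squared values for both regressions are 0.078, which indicates that dropping the dad variables costs essentially no explanatory power when modeling the likelihood of having more kids. Be careful with the individual coefficients in Model 4b, though: a constant of about -7e10 and identical age-dummy coefficients of about 1e8 point to numerical instability, most likely from perfect collinearity between the full dummy sets and the constant (the condition numbers are around 1e15 to 1e16). Let's continue testing our larger hypothesis that having same-sex children is a decent _exogenous_ determinant of family size.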
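
With this many observations, the adjusted R-squared barely differs from the plain R-squared, which is why both summaries show the same value in each column. The short sketch below plugs the rounded figures reported in the summaries (R-squared 0.078, 236459 observations, 57 and 55 regressors) into the standard adjustment formula; because the inputs are rounded, the output is only approximate.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R-squared for a model with an intercept and n_regressors slopes."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Inputs are the rounded values shown in the two summaries
adj_with_dad = adjusted_r2(0.078, 236459, 57)
adj_without_dad = adjusted_r2(0.078, 236459, 55)

print(round(adj_with_dad, 3), round(adj_without_dad, 3))
```

Matching adjusted R-squared values are a rough guide at best; a formal comparison of the two nested models would use an F-test on the dropped variables.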

## Adding samesex to Model 4, which has the dummies, aged, educd, and race

In [10]:
m4c_features = pd.concat([m4_features, sample['samesex']], axis = 1)

In [11]:
m4_c = sm.OLS(m4_outcome ,sm.add_constant(m4c_features))
m4_c = m4_c.fit()
m4_c.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | morekids | R-squared: | 0.083 |
| Model: | OLS | Adj. R-squared: | 0.083 |
| Method: | Least Squares | F-statistic: | 369.0 |
| Date: | Mon, 25 May 2020 | Prob (F-statistic): | 0.0 |
| Time: | 17:33:56 | Log-Likelihood: | -153500.0 |
| No. Observations: | 236459 | AIC: | 307100.0 |
| Df Residuals: | 236400 | BIC: | 307700.0 |
| Df Model: | 58 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.1994 | 0.024 | 8.371 | 0.000 | 0.153 | 0.246 |
| age_21 | -0.2358 | 0.013 | -18.814 | 0.000 | -0.260 | -0.211 |
| age_22 | -0.2011 | 0.009 | -22.752 | 0.000 | -0.218 | -0.184 |
| age_23 | -0.1678 | 0.007 | -24.249 | 0.000 | -0.181 | -0.154 |
| age_24 | -0.1127 | 0.006 | -19.614 | 0.000 | -0.124 | -0.101 |
| age_25 | -0.0642 | 0.005 | -12.790 | 0.000 | -0.074 | -0.054 |
| age_26 | -0.0406 | 0.005 | -9.019 | 0.000 | -0.049 | -0.032 |
| age_27 | 0.0028 | 0.004 | 0.671 | 0.502 | -0.005 | 0.011 |
| age_28 | 0.0236 | 0.004 | 5.926 | 0.000 | 0.016 | 0.031 |

| | | | |
|---|---|---|---|
| Omnibus: | 1792846.281 | Durbin-Watson: | 1.966 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 28969.171 |
| Skew: | 0.47 | Prob(JB): | 0.0 |
| Kurtosis: | 1.566 | Cond. No. | 1.01e+16 |


### The average effect of having the first two children be the same sex is the beta coefficient 0.0688; that is to say, if a couple has two kids of the same sex, on average they are about 6.9 percentage points more likely to have a 3rd child.
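
In a linear probability model, the coefficient on a binary regressor is a change in probability, so it reads in percentage points rather than percent. The sketch below illustrates this on simulated data with no controls, where the OLS slope equals the difference in group means; the probabilities 0.35 and 0.07 are assumptions chosen for illustration, not estimates from `sample.csv`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical binary regressor and binary outcome (not the notebook's data)
samesex = rng.integers(0, 2, size=n)
p = 0.35 + 0.07 * samesex                      # assumed true probabilities
morekids = (rng.random(n) < p).astype(float)

# Linear probability model: OLS of the 0/1 outcome on a constant and the dummy
X = np.column_stack([np.ones(n), samesex])
beta = np.linalg.lstsq(X, morekids, rcond=None)[0]

# With a single dummy and no controls, the slope is the difference in means
diff_means = morekids[samesex == 1].mean() - morekids[samesex == 0].mean()

print(f"slope: {beta[1]:.4f}, difference in means: {diff_means:.4f}")
print(f"effect in percentage points: {100 * beta[1]:.1f}")
```

Once controls are added, as in Model 4c, the slope is no longer a raw difference in means, but the percentage-point reading of the coefficient stays the same.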


Let's evaluate the claim that families only care about having one son. If that were true, then after including the variable $boys2$ in our model, $\beta_{boys2}$ would be statistically insignificant because the couple already has a son.

$$ H_0: \beta_{boys2} = 0$$
$$ H_1: \beta_{boys2} \neq 0$$

Then we will test whether having two girls increases the probability of having more kids. I expect $\beta_{girls2}$ to be not only significant but also greater than 0.06 (6 percentage points) because of the (outdated) cultural norm of _"needing to have a male heir"_.

$$ H_0: \beta_{girls2} = 0.06$$
$$ H_1: \beta_{girls2} > 0.06$$
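
The default p-values in a regression summary test a coefficient against zero, so a one-sided test against 0.06 has to be computed separately. Here is a minimal sketch of that calculation using a normal approximation, which is reasonable with a sample this large. The function name `one_sided_p_value` and the coefficient and standard error passed to it are hypothetical, chosen only to show the arithmetic.

```python
import math

def one_sided_p_value(coef, std_err, null_value):
    """Test H0: beta = null_value against H1: beta > null_value (normal approx.)."""
    z = (coef - null_value) / std_err
    # Upper-tail probability of a standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical inputs; replace with the fitted coefficient and its standard error
z, p = one_sided_p_value(coef=0.075, std_err=0.005, null_value=0.06)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

A small p-value here would support $H_1$; a coefficient below 0.06 gives a negative z and a p-value above 0.5, so $H_0$ would not be rejected in that direction.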

In [12]:
m4c_2boys_features = pd.concat([m4c_features, sample['boys2']], axis = 1)

m4c_2boys = sm.OLS(m4_outcome ,sm.add_constant(m4c_2boys_features))
m4c_2boys = m4c_2boys.fit()
m4c_2boys.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | morekids | R-squared: | 0.083 |
| Model: | OLS | Adj. R-squared: | 0.083 |
| Method: | Least Squares | F-statistic: | 363.9 |
| Date: | Mon, 25 May 2020 | Prob (F-statistic): | 0.0 |
| Time: | 17:33:57 | Log-Likelihood: | -153470.0 |
| No. Observations: | 236459 | AIC: | 307100.0 |
| Df Residuals: | 236399 | BIC: | 307700.0 |
| Df Model: | 59 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.1987 | 0.024 | 8.341 | 0.000 | 0.152 | 0.245 |
| age_21 | -0.2359 | 0.013 | -18.820 | 0.000 | -0.260 | -0.211 |
| age_22 | -0.2012 | 0.009 | -22.756 | 0.000 | -0.218 | -0.184 |
| age_23 | -0.1679 | 0.007 | -24.264 | 0.000 | -0.181 | -0.154 |
| age_24 | -0.1127 | 0.006 | -19.629 | 0.000 | -0.124 | -0.101 |
| age_25 | -0.0643 | 0.005 | -12.810 | 0.000 | -0.074 | -0.054 |
| age_26 | -0.0407 | 0.005 | -9.034 | 0.000 | -0.050 | -0.032 |
| age_27 | 0.0028 | 0.004 | 0.669 | 0.504 | -0.005 | 0.011 |
| age_28 | 0.0235 | 0.004 | 5.916 | 0.000 | 0.016 | 0.031 |

| | | | |
|---|---|---|---|
| Omnibus: | 1800697.047 | Durbin-Watson: | 1.966 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 28941.96 |
| Skew: | 0.47 | Prob(JB): | 0.0 |
| Kurtosis: | 1.567 | Cond. No. | 1.01e+16 |


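The p-value for $\beta_{boys2}$ is reported as 0.000 and $\beta_{boys2} = -0.0213$, so we reject $H_0$. Because $samesex$ is already in the model, this means that families with two boys are about 2 percentage points less likely to have more children than families with two girls, which is consistent with couples feeling less need for another child once they have a son!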

In [14]:
m4c_2girls_features = pd.concat([m4c_features, sample['girls2']], axis = 1)

m4c_2girls = sm.OLS(m4_outcome ,sm.add_constant(m4c_2girls_features))
m4c_2girls = m4c_2girls.fit()
m4c_2girls.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | morekids | R-squared: | 0.083 |
| Model: | OLS | Adj. R-squared: | 0.083 |
| Method: | Least Squares | F-statistic: | 363.9 |
| Date: | Mon, 25 May 2020 | Prob (F-statistic): | 0.0 |
| Time: | 17:36:13 | Log-Likelihood: | -153470.0 |
| No. Observations: | 236459 | AIC: | 307100.0 |
| Df Residuals: | 236399 | BIC: | 307700.0 |
| Df Model: | 59 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.1987 | 0.024 | 8.341 | 0.000 | 0.152 | 0.245 |
| age_21 | -0.2359 | 0.013 | -18.820 | 0.000 | -0.260 | -0.211 |
| age_22 | -0.2012 | 0.009 | -22.756 | 0.000 | -0.218 | -0.184 |
| age_23 | -0.1679 | 0.007 | -24.264 | 0.000 | -0.181 | -0.154 |
| age_24 | -0.1127 | 0.006 | -19.629 | 0.000 | -0.124 | -0.101 |
| age_25 | -0.0643 | 0.005 | -12.810 | 0.000 | -0.074 | -0.054 |
| age_26 | -0.0407 | 0.005 | -9.034 | 0.000 | -0.050 | -0.032 |
| age_27 | 0.0028 | 0.004 | 0.669 | 0.504 | -0.005 | 0.011 |
| age_28 | 0.0235 | 0.004 | 5.916 | 0.000 | 0.016 | 0.031 |

| | | | |
|---|---|---|---|
| Omnibus: | 1800697.047 | Durbin-Watson: | 1.966 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 28941.96 |
| Skew: | 0.47 | Prob(JB): | 0.0 |
| Kurtosis: | 1.567 | Cond. No. | 1.01e+16 |


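The p-value for $\beta_{girls2}$ is reported as 0.000 and $\beta_{girls2} = 0.0213$, which means that families with two girls are about 2 percentage points more likely to have more children than families with two boys, consistent with our hypothesis based on cultural norms. Note that this is exactly the $boys2$ coefficient with the sign flipped, and that the R-squared, adjusted R-squared (0.083), and F-statistic are the same as in the two-boys regression. The two specifications therefore have the same predictive power, which is what we would expect if $samesex = boys2 + girls2$. Since 0.0213 is a difference relative to two-boys families rather than the full effect of having two girls, this coefficient alone does not settle the one-sided test against 0.06.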
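
The mirror-image coefficients can be reproduced on simulated data. The sketch below builds hypothetical `boys2`, `girls2`, and `samesex` indicators from two coin flips per family (so `samesex = boys2 + girls2` holds by construction) and fits both specifications with NumPy; the outcome probabilities are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical sibling-sex indicators: two independent coin flips per family
first_boy = rng.integers(0, 2, size=n)
second_boy = rng.integers(0, 2, size=n)
boys2 = (first_boy & second_boy).astype(float)
girls2 = ((1 - first_boy) & (1 - second_boy)).astype(float)
samesex = boys2 + girls2

# Assumed outcome model, for illustration only
p = 0.35 + 0.08 * girls2 + 0.06 * boys2
y = (rng.random(n) < p).astype(float)

def fit(X, y):
    """Return OLS coefficients and the residual sum of squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return beta, float(resid @ resid)

const = np.ones(n)
b_boys, rss_boys = fit(np.column_stack([const, samesex, boys2]), y)
b_girls, rss_girls = fit(np.column_stack([const, samesex, girls2]), y)

print(f"boys2 coef:  {b_boys[2]:+.4f}")
print(f"girls2 coef: {b_girls[2]:+.4f}")
print(f"same residual sum of squares: {np.isclose(rss_boys, rss_girls)}")
```

Both design matrices span the same column space, so the fits are identical and only the labeling of the coefficients changes. In each specification the `samesex` coefficient is the effect for the same-sex group whose dummy is left out.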