In [1]:
# HIDDEN
# This useful nonsense should just go at the top of your notebook.
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
import numpy as np
plots.style.use('fivethirtyeight')
# datascience version number of last run of this notebook
version.__version__

'0.5.19'

<h1>Class 11: Quarter of birth and maternal characteristics</h1>

A tradition in the applied economics literature of the past several decades has been to compare individuals' outcomes across quarters of birth. The basic idea is that most state laws require schooling until a particular age in years, whether it's 16, 17, or 18, so teens who were born during the winter typically are of dropout age while children born during the summer or early fall are not yet. If children born at different times of the year were otherwise indistinguishable from one another, then compulsory schooling laws would force some children to get more education than others, and social scientists might be able to learn something by comparing those two groups.

Buckles and Hungerman (2013) looked at maternal characteristics and found that they vary over the year as well. This casts some doubt on the canonical finding that children born during the winter turn out differently because their own educational attainment can be lower than that of children born during the summer.

Let's look at a subset of the data that Buckles and Hungerman mustered in support of their argument. In particular, let's look at Census records from the 1960, 1970, and 1980 Censuses.  In each case, these subsamples are of mothers with coresident children 17 or under, and each record contains the mother's characteristics alongside the birth quarter and year of the child.

Let's run this model repeatedly with ordinary least squares (OLS):

$$ Y_i = \alpha 
+ \beta_2 bq2_i
+ \beta_3 bq3_i
+ \beta_4 bq4_i
+ \gamma \tilde{by}_i
+\epsilon_i 
$$

where $Y_i$ is a characteristic of the mother, the $bq$ variables are 0/1 indicators of the child's birth quarter, and the $by$ variable is a linear measure of the child's birth year.

I have subtracted the sample's average birth year from $by$ to produce the measure $\tilde{by}$ that appears in the equation. I did this so that the constant term $\alpha$ reports a recognizable average $Y$ rather than that average minus $\gamma$ times the average birth year.
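To see what centering does, here is a minimal sketch on synthetic data (not the Census extract) with birth year as the only regressor. The helper `ols` and all the numbers are made up for illustration. In this simplified one-regressor case the centered constant equals the sample mean of the outcome, and the slope is the same either way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): birth years 1943-1960 and a made-up outcome
birthyr = rng.integers(1943, 1961, size=5000).astype(float)
y = 10.0 + 0.05 * (birthyr - 1943) + rng.normal(0, 1, size=5000)

def ols(X, y):
    """Hypothetical helper: least-squares coefficients for y = X b."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

ones = np.ones_like(birthyr)
raw = ols(np.column_stack([ones, birthyr]), y)
centered = ols(np.column_stack([ones, birthyr - birthyr.mean()]), y)

print("raw constant:     ", raw[0])       # predicted y at birth year 0, far outside the data
print("centered constant:", centered[0])  # matches the sample mean of y
print("sample mean of y: ", y.mean())
print("slopes:", raw[1], centered[1])     # centering leaves the slope unchanged
```

In the full model, which also has quarter indicators, the constant is the predicted $Y$ for a first-quarter birth at the average birth year rather than the overall sample mean.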

Notice also that I have <b>omitted</b> *bq1* from the equation. When you have indicator variables that together cover 100% of the sample, you either must drop one and thus designate it as the "default" category that receives just the constant term, or you must omit the constant term. Buckles and Hungerman choose to omit *bq1*, so let's do the same here.

(Why is this? Imagine if it weren't the case. Then to whom is the constant term $\alpha$ applicable? Everyone? Then everyone gets $\alpha$ plus their $\beta$. But what would prevent us from subtracting a tiny number from each $\beta$ and adding it to $\alpha$? Or doing that, and then doing it again? Nothing would, and that produces an indeterminacy that isn't good. We must drop one of the indicators or the constant term, which pins down the estimates and gets us out of indeterminacy.)
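The indeterminacy can be checked numerically. The sketch below uses synthetic birth quarters (an assumption, not the Census data) and NumPy only: with a constant plus all four indicators, the columns are linearly dependent, and shifting a number from each $\beta$ into $\alpha$ leaves every fitted value unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
quarter = rng.integers(1, 5, size=1000)            # synthetic birth quarters 1..4

ones = np.ones(1000)
bq = [(quarter == q).astype(float) for q in (1, 2, 3, 4)]

X_all = np.column_stack([ones] + bq)               # constant + all four indicators
X_drop = np.column_stack([ones] + bq[1:])          # constant + bq2, bq3, bq4

# Five columns but only rank 4: the indicators sum to the constant column
print(X_all.shape[1], np.linalg.matrix_rank(X_all))
# Four columns, rank 4: dropping bq1 removes the redundancy
print(X_drop.shape[1], np.linalg.matrix_rank(X_drop))

# Add c to alpha and subtract c from every beta: identical fitted values
b = np.array([1.0, 0.2, 0.3, 0.4, 0.5])            # made-up coefficients
c = 0.7
b_shift = b + np.array([c, -c, -c, -c, -c])
print(np.allclose(X_all @ b, X_all @ b_shift))     # True
```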

Omitting *bq1* gives us a very convenient set of hypothesis tests:  the $\beta$'s are the additional bits that mothers of children NOT born in the first quarter get. Our hypothesis is that these are all zero:  $\beta_2 = \beta_3 = \beta_4 = 0$. We will find in many cases that we can reject the null; in fact, there are differences in mother's characteristics by quarter of birth.
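One way to test $\beta_2 = \beta_3 = \beta_4 = 0$ jointly is to compare the sum of squared residuals with and without the quarter indicators. Here is a rough sketch on synthetic data with made-up coefficients; the helper `ssr` is hypothetical, and NumPy's least squares stands in for Statsmodels so that the block has few dependencies. Keep in mind that the F-statistic printed in a Statsmodels summary tests all the slope coefficients at once, including the birth-year term, so it is not quite this hypothesis.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Synthetic data (assumption): quarters, centered birth year, made-up effects
quarter = rng.integers(1, 5, size=n)
byr = rng.integers(-9, 9, size=n).astype(float)
byr = byr - byr.mean()
bq2, bq3, bq4 = [(quarter == q).astype(float) for q in (2, 3, 4)]
y = 10.5 + 0.09 * bq2 + 0.03 * bq3 + 0.06 * bq4 + 0.06 * byr + rng.normal(0, 3, size=n)

def ssr(X, y):
    """Hypothetical helper: sum of squared residuals from a least-squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

ones = np.ones(n)
X_full = np.column_stack([ones, bq2, bq3, bq4, byr])   # unrestricted model
X_restricted = np.column_stack([ones, byr])            # imposes beta2 = beta3 = beta4 = 0

q = 3                      # number of restrictions being tested
k = X_full.shape[1]        # number of parameters in the unrestricted model
ssr_u = ssr(X_full, y)
ssr_r = ssr(X_restricted, y)

F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
print("F statistic for beta2 = beta3 = beta4 = 0:", F)
```

With samples this large, the 5% critical value for three restrictions is roughly 2.6, so an F statistic well above that rejects the null of no seasonality.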

As we have done recently, let's use the very helpful <a href="http://statsmodels.sourceforge.net/">Statsmodels</a> 
module and some <a href="http://pandas.pydata.org/">Pandas</a> functions to run a multivariate regression. 

In [2]:
import statsmodels.api as sm
import pandas as pd

Here is an extract of the 1960 Census:

In [30]:
Tablec1960 = Table.read_table('http://demog.berkeley.edu/~redwards/Courses/LS88/c11_b1960.csv')
Tablec1960

sex,birthyr,birthyr0,birthqtr,birthq1,birthq2,birthq3,birthq4,ones,white,momed,momhs,momage,mommarried,poor
Male,1943,-9,2,0,1,0,0,1,1,14,1,21,1,0
Male,1943,-9,2,0,1,0,0,1,1,9,0,25,1,1
Female,1943,-9,2,0,1,0,0,1,1,12,1,23,1,0
Female,1943,-9,2,0,1,0,0,1,1,12,1,26,1,0
Male,1943,-9,2,0,1,0,0,1,1,6,0,25,1,1
Female,1943,-9,2,0,1,0,0,1,1,12,1,27,1,0
Male,1943,-9,2,0,1,0,0,1,1,12,1,23,1,0
Male,1943,-9,2,0,1,0,0,1,1,10,0,22,1,0
Male,1943,-9,2,0,1,0,0,1,1,12,1,32,1,0
Male,1943,-9,2,0,1,0,0,1,1,9,0,18,1,0


Now let's convert the Table to a Pandas DataFrame so that we can run OLS:

In [31]:
c1960 = Tablec1960.to_df()
type(c1960)

pandas.core.frame.DataFrame

First let's model the probability that the mother is white.
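Because `white` is a 0/1 variable, OLS here is a linear probability model, and the coefficients read as differences in shares. The sketch below uses synthetic data with made-up probabilities (not the Census extract) and leaves out the birth-year term to keep things simple. In that stripped-down case the constant is exactly the share among first-quarter births, and the constant plus each $\beta$ is the share in that quarter.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8000

# Synthetic 0/1 outcome with made-up probabilities by quarter (assumption)
quarter = rng.integers(1, 5, size=n)
p = np.array([0.87, 0.88, 0.875, 0.872])[quarter - 1]
white = (rng.random(n) < p).astype(float)

# Constant plus indicators for quarters 2, 3, 4 (quarter 1 is the omitted category)
X = np.column_stack([np.ones(n)] + [(quarter == q).astype(float) for q in (2, 3, 4)])
coef, *_ = np.linalg.lstsq(X, white, rcond=None)

# Shares implied by the regression versus shares computed directly
implied = coef[0] + np.concatenate([[0.0], coef[1:]])
direct = np.array([white[quarter == q].mean() for q in (1, 2, 3, 4)])

print("implied shares:", implied)
print("direct shares: ", direct)
```

Once the birth-year term is included, as in the model we actually run, the match is no longer exact, but the reading of the coefficients is the same: differences relative to first-quarter births, holding birth year fixed.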

In [34]:
x60 = c1960[['ones','birthq2','birthq3','birthq4','birthyr0']]
y60 = c1960['white']
multiple_regress = sm.OLS(y60, x60).fit()
multiple_regress.summary()

Dep. Variable:,white,R-squared:,0.001
Model:,OLS,Adj. R-squared:,0.001
Method:,Least Squares,F-statistic:,114.6
Date:,"Mon, 18 Apr 2016",Prob (F-statistic):,7.77e-98
Time:,16:35:58,Log-Likelihood:,-178060.0
No. Observations:,578733,AIC:,356100.0
Df Residuals:,578728,BIC:,356200.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
ones,0.8737,0.001,994.518,0.000,0.872 0.875
birthq2,0.0055,0.001,4.423,0.000,0.003 0.008
birthq3,0.0020,0.001,1.613,0.107,-0.000 0.004
birthq4,0.0022,0.001,1.800,0.072,-0.000 0.005
birthyr0,-0.0019,9.04e-05,-20.614,0.000,-0.002 -0.002

Omnibus:,240867.14,Durbin-Watson:,2.0
Prob(Omnibus):,0.0,Jarque-Bera (JB):,752322.792
Skew:,-2.283,Prob(JB):,0.0
Kurtosis:,6.218,Cond. No.,21.3


<font color="blue">What is the average percent white among all the mothers in the sample?</font>

<font color="blue">Do you see any *seasonality* in percent white?</font>

Now let's run OLS with the same x-variables but a different y-variable: `momed`, which is mother's education in years.

In [23]:
# x60 (constant, quarter indicators, centered birth year) is reused from above
y60 = c1960['momed']
multiple_regress = sm.OLS(y60, x60).fit()
multiple_regress.summary()

Dep. Variable:,momed,R-squared:,0.011
Model:,OLS,Adj. R-squared:,0.011
Method:,Least Squares,F-statistic:,1635.0
Date:,"Mon, 18 Apr 2016",Prob (F-statistic):,0.0
Time:,16:29:58,Log-Likelihood:,-1444000.0
No. Observations:,578733,AIC:,2888000.0
Df Residuals:,578728,BIC:,2888000.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
ones,10.5172,0.008,1343.271,0.000,10.502 10.533
birthq2,0.0862,0.011,7.725,0.000,0.064 0.108
birthq3,0.0347,0.011,3.202,0.001,0.013 0.056
birthq4,0.0588,0.011,5.372,0.000,0.037 0.080
birthyr0,0.0650,0.001,80.717,0.000,0.063 0.067

Omnibus:,46566.886,Durbin-Watson:,1.998
Prob(Omnibus):,0.0,Jarque-Bera (JB):,69673.017
Skew:,-0.644,Prob(JB):,0.0
Kurtosis:,4.11,Cond. No.,21.3


<font color="blue">What is the average level of education among all the mothers in the sample?</font>

<font color="blue">Do you see any seasonality in the number of years of education?</font>

Let's also have a look at mother's age.  

In [24]:
# x60 (constant, quarter indicators, centered birth year) is reused from above
y60 = c1960['momage']
multiple_regress = sm.OLS(y60, x60).fit()
multiple_regress.summary()

Dep. Variable:,momage,R-squared:,0.0
Model:,OLS,Adj. R-squared:,0.0
Method:,Least Squares,F-statistic:,18.8
Date:,"Mon, 18 Apr 2016",Prob (F-statistic):,1.8e-15
Time:,16:30:05,Log-Likelihood:,-1874400.0
No. Observations:,578733,AIC:,3749000.0
Df Residuals:,578728,BIC:,3749000.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
ones,26.7267,0.016,1622.551,0.000,26.694 26.759
birthq2,-0.0370,0.023,-1.575,0.115,-0.083 0.009
birthq3,-0.0599,0.023,-2.624,0.009,-0.105 -0.015
birthq4,0.0999,0.023,4.337,0.000,0.055 0.145
birthyr0,-0.0070,0.002,-4.146,0.000,-0.010 -0.004

Omnibus:,29431.738,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,34180.121
Skew:,0.594,Prob(JB):,0.0
Kurtosis:,3.062,Cond. No.,21.3


<font color="blue">What is the average age among all the mothers in the sample?</font>

<font color="blue">Is there any seasonality in mother's age? In which quarter are moms the youngest?</font>

Finally, let's look at living in an impoverished household, using the variable `poor`.

In [35]:
# x60 (constant, quarter indicators, centered birth year) is reused from above
y60 = c1960['poor']
multiple_regress = sm.OLS(y60, x60).fit()
multiple_regress.summary()

Dep. Variable:,poor,R-squared:,0.0
Model:,OLS,Adj. R-squared:,0.0
Method:,Least Squares,F-statistic:,45.81
Date:,"Mon, 18 Apr 2016",Prob (F-statistic):,1.51e-38
Time:,16:45:29,Log-Likelihood:,-339470.0
No. Observations:,578733,AIC:,679000.0
Df Residuals:,578728,BIC:,679000.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
ones,0.2587,0.001,222.820,0.000,0.256 0.261
birthq2,-0.0101,0.002,-6.110,0.000,-0.013 -0.007
birthq3,-0.0042,0.002,-2.591,0.010,-0.007 -0.001
birthq4,-0.0057,0.002,-3.535,0.000,-0.009 -0.003
birthyr0,0.0014,0.000,11.611,0.000,0.001 0.002

Omnibus:,115312.079,Durbin-Watson:,2.002
Prob(Omnibus):,0.0,Jarque-Bera (JB):,135986.688
Skew:,1.132,Prob(JB):,0.0
Kurtosis:,2.283,Cond. No.,21.3


<font color="blue">What is the average poverty rate among all the mothers in the sample?</font>

<font color="blue">Is there any seasonality in births into poverty?</font>

<h2>Patterns in 1980</h2>

For kicks, let's now look at the "same" data from the 1980 Census. Here's the dataset:

In [8]:
Tablec1980 = Table.read_table('http://demog.berkeley.edu/~redwards/Courses/LS88/c11_b1980.csv')
Tablec1980

sex,birthyr,birthyr0,birthqtr,birthq1,birthq2,birthq3,birthq4,ones,white,momed,momhs,momage,mommarried,poor
Female,1963,-8,2,0,1,0,0,1,0,14,1,20,1,0
Female,1963,-8,2,0,1,0,0,1,1,8,0,39,1,0
Male,1963,-8,2,0,1,0,0,1,1,12,1,29,1,0
Male,1963,-8,2,0,1,0,0,1,0,10,0,19,1,0
Female,1963,-8,2,0,1,0,0,1,1,12,1,19,1,0
Male,1963,-8,2,0,1,0,0,1,1,10,0,34,0,0
Female,1963,-8,2,0,1,0,0,1,1,12,1,22,1,0
Female,1963,-8,2,0,1,0,0,1,1,13,1,19,1,1
Male,1963,-8,2,0,1,0,0,1,1,12,1,36,1,0
Female,1963,-8,2,0,1,0,0,1,0,12,1,28,1,0


In [9]:
c1980 = Tablec1980.to_df()
type(c1980)

pandas.core.frame.DataFrame

Let's look at the same Y-variables and models that we examined using 1960 data, in order to see how the relationships have changed, if at all.
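Since we keep fitting the same right-hand side with different outcomes, one option is to wrap the fit in a small function and line the coefficients up side by side. This is only a sketch: `quarter_coefs` and `fake_census` are hypothetical helpers, the data are synthetic, and NumPy's least squares is swapped in for `sm.OLS`, so we get point estimates only, with no standard errors or p-values. The column names follow the ones in our extracts.

```python
import numpy as np
import pandas as pd

REGRESSORS = ['ones', 'birthq2', 'birthq3', 'birthq4', 'birthyr0']

def quarter_coefs(df, outcomes):
    """Fit each outcome on the same regressors; return one row of coefficients per outcome."""
    X = df[REGRESSORS].to_numpy(dtype=float)
    rows = {}
    for name in outcomes:
        coef, *_ = np.linalg.lstsq(X, df[name].to_numpy(dtype=float), rcond=None)
        rows[name] = pd.Series(coef, index=REGRESSORS)
    return pd.DataFrame(rows).T

def fake_census(n, seed):
    """Synthetic stand-in for a Census extract (made-up values, for illustration only)."""
    rng = np.random.default_rng(seed)
    quarter = rng.integers(1, 5, size=n)
    return pd.DataFrame({
        'ones': 1,
        'birthq2': (quarter == 2).astype(int),
        'birthq3': (quarter == 3).astype(int),
        'birthq4': (quarter == 4).astype(int),
        'birthyr0': rng.integers(-9, 9, size=n),
        'momed': rng.normal(11, 3, size=n),
        'momage': rng.normal(26, 6, size=n),
    })

# Two fake "census years", same model, coefficients compared directly
table_a = quarter_coefs(fake_census(5000, 10), ['momed', 'momage'])
table_b = quarter_coefs(fake_census(5000, 11), ['momed', 'momage'])
print(table_b - table_a)
```

With the real DataFrames, the call would look like `quarter_coefs(c1960, ['white', 'momed', 'momage', 'poor'])`, and likewise for `c1980`.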

In [25]:
x80 = c1980[['ones','birthq2','birthq3','birthq4','birthyr0']]
y80 = c1980['white']
multiple_regress = sm.OLS(y80, x80).fit()
multiple_regress.summary()

Dep. Variable:,white,R-squared:,0.0
Model:,OLS,Adj. R-squared:,0.0
Method:,Least Squares,F-statistic:,126.3
Date:,"Mon, 18 Apr 2016",Prob (F-statistic):,4.95e-108
Time:,16:30:17,Log-Likelihood:,-1237200.0
No. Observations:,2766122,AIC:,2474000.0
Df Residuals:,2766117,BIC:,2475000.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
ones,0.8242,0.000,1776.382,0.000,0.823 0.825
birthq2,0.0090,0.001,13.649,0.000,0.008 0.010
birthq3,0.0003,0.001,0.406,0.685,-0.001 0.002
birthq4,0.0013,0.001,1.935,0.053,-1.6e-05 0.003
birthyr0,-0.0007,4.59e-05,-15.123,0.000,-0.001 -0.001

Omnibus:,737423.551,Durbin-Watson:,1.998
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1484094.0
Skew:,-1.726,Prob(JB):,0.0
Kurtosis:,3.98,Cond. No.,22.1


<font color="blue">What is the average percent white among all the mothers in the sample? Do you see any seasonality in percent white? Compare and contrast with the 1960 data.</font>

As before, let's examine patterns in mother's education.

In [27]:
# x80 (constant, quarter indicators, centered birth year) is reused from above
y80 = c1980['momed']
multiple_regress = sm.OLS(y80, x80).fit()
multiple_regress.summary()

Dep. Variable:,momed,R-squared:,0.006
Model:,OLS,Adj. R-squared:,0.006
Method:,Least Squares,F-statistic:,3959.0
Date:,"Mon, 18 Apr 2016",Prob (F-statistic):,0.0
Time:,16:30:57,Log-Likelihood:,-6715500.0
No. Observations:,2766122,AIC:,13430000.0
Df Residuals:,2766117,BIC:,13430000.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
ones,11.8824,0.003,3534.285,0.000,11.876 11.889
birthq2,0.0825,0.005,17.359,0.000,0.073 0.092
birthq3,0.0347,0.005,7.457,0.000,0.026 0.044
birthq4,0.0411,0.005,8.734,0.000,0.032 0.050
birthyr0,0.0417,0.000,125.338,0.000,0.041 0.042

Omnibus:,351744.4,Durbin-Watson:,1.999
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1060523.455
Skew:,-0.677,Prob(JB):,0.0
Kurtosis:,5.715,Cond. No.,22.1


<font color="blue">What is the average level of education among all the mothers in the sample?  Do you see seasonality here?  Compare/contrast with 1960.</font>

And let's look at mother's age again too.

In [28]:
# x80 (constant, quarter indicators, centered birth year) is reused from above
y80 = c1980['momage']
multiple_regress = sm.OLS(y80, x80).fit()
multiple_regress.summary()

Dep. Variable:,momage,R-squared:,0.001
Model:,OLS,Adj. R-squared:,0.001
Method:,Least Squares,F-statistic:,850.4
Date:,"Mon, 18 Apr 2016",Prob (F-statistic):,0.0
Time:,16:31:08,Log-Likelihood:,-8772100.0
No. Observations:,2766122,AIC:,17540000.0
Df Residuals:,2766117,BIC:,17540000.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
ones,25.4189,0.007,3594.662,0.000,25.405 25.433
birthq2,-0.0388,0.010,-3.882,0.000,-0.058 -0.019
birthq3,-0.0022,0.010,-0.226,0.822,-0.021 0.017
birthq4,0.1059,0.010,10.700,0.000,0.086 0.125
birthyr0,-0.0390,0.001,-55.719,0.000,-0.040 -0.038

Omnibus:,240322.208,Durbin-Watson:,1.994
Prob(Omnibus):,0.0,Jarque-Bera (JB):,313183.109
Skew:,0.762,Prob(JB):,0.0
Kurtosis:,3.629,Cond. No.,22.1


<font color="blue">What is the average age among all the mothers in the sample? Is there seasonality here? Compare/contrast to 1960.</font>

Finally, a second look at poverty:

In [36]:
# x80 (constant, quarter indicators, centered birth year) is reused from above
y80 = c1980['poor']
multiple_regress = sm.OLS(y80, x80).fit()
multiple_regress.summary()

Dep. Variable:,poor,R-squared:,0.002
Model:,OLS,Adj. R-squared:,0.002
Method:,Least Squares,F-statistic:,1457.0
Date:,"Mon, 18 Apr 2016",Prob (F-statistic):,0.0
Time:,16:48:19,Log-Likelihood:,-1152600.0
No. Observations:,2766122,AIC:,2305000.0
Df Residuals:,2766117,BIC:,2305000.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[95.0% Conf. Int.]
ones,0.1618,0.000,359.548,0.000,0.161 0.163
birthq2,-0.0049,0.001,-7.740,0.000,-0.006 -0.004
birthq3,0.0012,0.001,1.868,0.062,-5.74e-05 0.002
birthq4,-0.0002,0.001,-0.325,0.745,-0.001 0.001
birthyr0,0.0033,4.46e-05,75.076,0.000,0.003 0.003

Omnibus:,823414.701,Durbin-Watson:,1.998
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1786680.831
Skew:,1.84,Prob(JB):,0.0
Kurtosis:,4.401,Cond. No.,22.1


<font color="blue">What is the average poverty rate among all the mothers in the sample? Is there seasonality here? Compare/contrast to 1960.</font>