In [1]:
# HIDDEN
# This useful nonsense should just go at the top of your notebook.
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
import numpy as np
plots.style.use('fivethirtyeight')
# datascience version number of last run of this notebook
version.__version__

'0.5.19'

<h1>Class 11: Quarter of birth and maternal characteristics</h1>

A tradition in the applied economics literature of the past several decades has been to compare individuals' outcomes across quarters of birth. The basic idea is that most state laws make schooling compulsory until a particular age in years, whether it's 16, 17, or 18, so teens who were born during the winter typically are of dropout age while children born during the summer or early fall are not yet. If children born at different times of the year were otherwise indistinguishable from one another, then compulsory schooling laws would force some children to get more education than others, and social scientists might be able to learn something by comparing those two groups.

Buckles and Hungerman (2013) looked at maternal characteristics and found that they vary over the year as well. This calls into question the canonical interpretation that children born during the winter have different outcomes because their own educational attainment can be less than that of children born during the summer.

Let's look at a subset of the data that Buckles and Hungerman mustered in support of their argument. In particular, let's look at Census records from the 1960, 1970, and 1980 Censuses.  In each case, these subsamples are of mothers with coresident children 17 or under, and each record contains the mother's characteristics alongside the birth quarter and year of the child.

Let's run this model repeatedly with ordinary least squares (OLS):

$$ Y_i = \alpha 
+ \beta_2 bq2_i
+ \beta_3 bq3_i
+ \beta_4 bq4_i
+ \gamma \tilde{by}_i
+\epsilon_i 
$$

where $Y_i$ is a characteristic of the mother, the $bq$ variables are 0/1 indicators of the child's birth quarter, and the $by$ variable is a linear measure of the child's birth year.

I have subtracted the average birth year in the sample from $by$ to produce the measure $\tilde{by}$ that appears in the equation. I did this so that the constant term $\alpha$ reports a recognizable average of $Y$, evaluated at the average birth year, rather than that average minus $\gamma$ times the average birth year.
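Here is a minimal sketch of what centering does, using synthetic data (the numbers and variable names are made up for illustration and are not the Census extract). To keep it short it leaves out the quarter indicators and regresses $Y$ on birth year alone. The slope is the same either way, but only the centered version has a constant you can read directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): child's birth year and a mother's characteristic
birthyr = rng.integers(1943, 1961, size=500).astype(float)
y = 10.0 + 0.05 * (birthyr - 1943) + rng.normal(0, 1, size=500)

def ols(X, y):
    """Least-squares coefficients for y = X @ b."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

ones = np.ones_like(birthyr)

# Raw birth year: the constant is the fitted value at birth year 0, which is hard to read
a_raw, g_raw = ols(np.column_stack([ones, birthyr]), y)

# Centered birth year: the constant is the fitted value at the average birth year
birthyr0 = birthyr - birthyr.mean()
a_ctr, g_ctr = ols(np.column_stack([ones, birthyr0]), y)

print("raw:      constant = %.2f, slope = %.4f" % (a_raw, g_raw))
print("centered: constant = %.2f, slope = %.4f" % (a_ctr, g_ctr))
print("mean of y = %.2f" % y.mean())
```

In this one-regressor sketch the centered constant equals the sample mean of `y`. Once the quarter indicators are added, the constant is instead the fitted value for the omitted quarter at the average birth year.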

Notice also that I have <b>omitted</b> *bq1* from the equation. When you have indicator variables that together cover 100% of the sample, you either must drop one and thus designate it as the "default" category that receives just the constant term, or you must omit the constant term. Buckles and Hungerman choose to omit *bq1*, so let's do the same here.

(Why is this? Imagine if it weren't the case. Then to whom is the constant term $\alpha$ applicable? Everyone? Then everyone gets $\alpha$ plus their $\beta$. But what would prevent us from subtracting a tiny number from each $\beta$ and adding it to $\alpha$? Or doing that, and then doing it again? Nothing would, and that produces an indeterminacy that isn't good. We must drop one of the indicators or the constant term, which pins down the estimates and gets us out of indeterminacy.)
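The indeterminacy can be seen numerically. The sketch below uses synthetic birth quarters (made up for illustration, not the Census extract) and NumPy only. It checks that a constant plus all four indicators is rank deficient, and that shifting a small amount from each $\beta$ into $\alpha$ leaves the fitted values unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic birth quarters 1-4 (illustrative only)
quarter = rng.integers(1, 5, size=200)
ones = np.ones(200)
bq = np.column_stack([(quarter == q).astype(float) for q in (1, 2, 3, 4)])

# Constant plus all four indicators: the indicator columns sum to the constant column
X_all = np.column_stack([ones, bq])
# Constant plus indicators for quarters 2-4 only (bq1 omitted)
X_drop = np.column_stack([ones, bq[:, 1:]])

rank_all = np.linalg.matrix_rank(X_all)    # 5 columns, but only 4 are independent
rank_drop = np.linalg.matrix_rank(X_drop)  # 4 columns, all independent

# Two different coefficient vectors that give identical fitted values:
# add 0.5 to alpha and subtract 0.5 from every beta
b1 = np.array([10.0, 0.0, 1.0, 2.0, 3.0])
b2 = b1 + np.array([0.5, -0.5, -0.5, -0.5, -0.5])
same_fit = np.allclose(X_all @ b1, X_all @ b2)

print(rank_all, rank_drop, same_fit)
```

Because `b1` and `b2` fit the data equally well, least squares has no way to choose between them until one column is dropped.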

Omitting *bq1* gives us a very convenient set of hypothesis tests: each $\beta$ is the additional amount of $Y$ that mothers of children NOT born in the first quarter have, relative to mothers of first-quarter births. Our null hypothesis is that these are all zero: $\beta_2 = \beta_3 = \beta_4 = 0$. We will find in many cases that we can reject the null; that is, there are differences in mothers' characteristics by quarter of birth.
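To make the interpretation concrete, here is a rough sketch on synthetic data (the quarter effects are made up, and the birth-year term is left out for simplicity, so this is not the full model above). It shows that each $\beta$ equals that quarter's mean minus the first-quarter mean, and it computes the F statistic for the joint null by comparing the model with the indicators to a constant-only model.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Synthetic data (illustrative only): quarter of birth and a mother's characteristic
quarter = rng.integers(1, 5, size=n)
true_shift = np.array([0.0, 0.3, 0.1, 0.2])  # made-up quarter effects
y = 11.0 + true_shift[quarter - 1] + rng.normal(0, 2, size=n)

ones = np.ones(n)
dummies = np.column_stack([(quarter == q).astype(float) for q in (2, 3, 4)])

def fit(X, y):
    """Return OLS coefficients and the sum of squared residuals."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, resid @ resid

# Unrestricted model: constant plus bq2, bq3, bq4
coef_u, ssr_u = fit(np.column_stack([ones, dummies]), y)
# Restricted model under the null beta2 = beta3 = beta4 = 0: constant only
coef_r, ssr_r = fit(ones.reshape(-1, 1), y)

# Each beta equals that quarter's mean minus the first-quarter mean
mean_q = np.array([y[quarter == q].mean() for q in (1, 2, 3, 4)])
print("betas:           ", coef_u[1:])
print("mean differences:", mean_q[1:] - mean_q[0])

# F statistic for the joint null: 3 restrictions, n - 4 residual degrees of freedom
num_restrictions, num_params = 3, 4
F = ((ssr_r - ssr_u) / num_restrictions) / (ssr_u / (n - num_params))
print("F statistic:", F)
```

Under the null and the usual OLS assumptions, `F` is compared against an F distribution with 3 and `n - 4` degrees of freedom. A large value is evidence against all three $\beta$'s being zero.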

As we have done recently, let's use the very helpful <a href="http://statsmodels.sourceforge.net/">Statsmodels</a>
module and some <a href="http://pandas.pydata.org/">Pandas</a> functions to run a multiple regression.

In [2]:
import statsmodels.api as sm
import pandas as pd

Here is an extract of the 1960 Census:

In [3]:
Tablec1960 = Table.read_table('http://demog.berkeley.edu/~redwards/Courses/LS88/c11_b1960.csv')
Tablec1960

sex,birthyr,birthyr0,birthqtr,birthq1,birthq2,birthq3,birthq4,ones,white,momed,momhs,momage,mommarried,poor
Male,1943,-9,2,0,1,0,0,1,1,14,1,21,1,0
Male,1943,-9,2,0,1,0,0,1,1,9,0,25,1,1
Female,1943,-9,2,0,1,0,0,1,1,12,1,23,1,0
Female,1943,-9,2,0,1,0,0,1,1,12,1,26,1,0
Male,1943,-9,2,0,1,0,0,1,1,6,0,25,1,1
Female,1943,-9,2,0,1,0,0,1,1,12,1,27,1,0
Male,1943,-9,2,0,1,0,0,1,1,12,1,23,1,0
Male,1943,-9,2,0,1,0,0,1,1,10,0,22,1,0
Male,1943,-9,2,0,1,0,0,1,1,12,1,32,1,0
Male,1943,-9,2,0,1,0,0,1,1,9,0,18,1,0


Now let's run OLS after we switch data types:

In [5]:
c1960 = Tablec1960.to_df()
type(c1960)

pandas.core.frame.DataFrame

First let's model the probability that the mother is white.

In [6]:
x60 = c1960[['ones','birthq2','birthq3','birthq4','birthyr0']]
y60 = ...
#multiple_regress = sm.OLS(y60, x60).fit()
#multiple_regress.summary()

<font color="blue">What is the average percent white among all the mothers in the sample?</font>

<font color="blue">Do you see any *seasonality* in percent white?</font>

Now let's run OLS with the same x-variables but a different y-variable: `momed`, which is the mother's education in years.

In [7]:
#x60 = c1960[['ones','birthq2','birthq3','birthq4','birthyr0']]
y60 = ...
#multiple_regress = sm.OLS(y60, x60).fit()
#multiple_regress.summary()

<font color="blue">What is the average level of education among all the mothers in the sample?</font>

<font color="blue">Do you see any seasonality in the number of years of education?</font>

Let's also have a look at mother's age.  

In [8]:
#x60 = c1960[['ones','birthq2','birthq3','birthq4','birthyr0']]
y60 = ...
#multiple_regress = sm.OLS(y60, x60).fit()
#multiple_regress.summary()

<font color="blue">What is the average age among all the mothers in the sample?</font>

<font color="blue">Is there any seasonality in mother's age? In which quarter are moms the youngest?</font>

Finally, let's look at whether the mother lives in an impoverished household, the variable `poor`.

In [9]:
#x60 = c1960[['ones','birthq2','birthq3','birthq4','birthyr0']]
y60 = ...
#multiple_regress = sm.OLS(y60, x60).fit()
#multiple_regress.summary()

<font color="blue">What is the average poverty rate among all the mothers in the sample?</font>

<font color="blue">Is there any seasonality in births into poverty?</font>

<h2>Patterns in 1980</h2>

For kicks, let's now look at the "same" data from the 1980 Census. Here's the dataset:

In [None]:
Tablec1980 = Table.read_table('http://demog.berkeley.edu/~redwards/Courses/LS88/c11_b1980.csv')
Tablec1980

In [None]:
c1980 = Tablec1980.to_df()
type(c1980)

Let's look at the same Y-variables and models that we examined using 1960 data, in order to see how the relationships have changed, if at all.

In [14]:
x80 = c1980[['ones','birthq2','birthq3','birthq4','birthyr0']]
y80 = ...
#multiple_regress = sm.OLS(y80, x80).fit()
#multiple_regress.summary()

<font color="blue">What is the average percent white among all the mothers in the sample? Do you see any seasonality in percent white? Compare and contrast with the 1960 data.</font>

Like before, let's examine patterns in mother's education.

In [13]:
#x80 = c1980[['ones','birthq2','birthq3','birthq4','birthyr0']]
y80 = ...
#multiple_regress = sm.OLS(y80, x80).fit()
#multiple_regress.summary()

<font color="blue">What is the average level of education among all the mothers in the sample?  Do you see seasonality here?  Compare/contrast with 1960.</font>

And let's look at mother's age again too.

In [12]:
#x80 = c1980[['ones','birthq2','birthq3','birthq4','birthyr0']]
y80 = ...
#multiple_regress = sm.OLS(y80, x80).fit()
#multiple_regress.summary()

<font color="blue">What is the average age among all the mothers in the sample? Is there seasonality here? Compare/contrast to 1960.</font>

Finally, a second look at poverty:

In [11]:
#x80 = c1980[['ones','birthq2','birthq3','birthq4','birthyr0']]
y80 = ...
#multiple_regress = sm.OLS(y80, x80).fit()
#multiple_regress.summary()

<font color="blue">What is the average poverty rate among all the mothers in the sample? Is there seasonality here? Compare/contrast to 1960.</font>