<h1>ECON 140R Class 06</h1>

This analysis draws extensively from [Edwards and Roff (2010)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0012157).

Consider the dataset `cpp_sample.dta`, drawn from the multi-site U.S. Collaborative Perinatal Project (CPP), a panel study that began with pregnant women recruited at university hospitals between 1959 and 1965. The women and their children were followed for 7 years of the children’s lives, with several waves of questions and neurocognitive tests administered along the way.

There is potentially a lot more to be said here, about how the panel was reinterviewed, how much attrition there was, and so on. Here, we are going to take a very simple approach. We will examine data from each wave separately, ignoring the panel structure but in a way that avoids big econometric problems.

<h2>Learning objectives</h2>

<h3>General</h3>
How can we draw inferences from observational data about effects of "treatments" on outcomes we care about? Multivariate regression methods &mdash; using `lm()` in R, for example &mdash; can help. But a critical issue is whether we can control sufficiently for <b>omitted variable bias (OVB)</b>. In this dataset, which is publicly available, we can walk through several examples of omitted variables that biased published results.

<h3>Class 06</h3>
Let's start by looking at the dataset, running a regression, and talking and writing about what we see.

<h2>Variables</h2>

For <b>outcome variables</b> $y$, we can look at the same six CPP measures of children’s <i>neurocognitive development</i> that were examined by [Saha et al. (2009)](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1000040), two measures at each follow-up age: 8 mo, 4 y, and 7 y. In order, these include 
* the Bayley Mental Scale `bayleymental` and Bayley Motor Scale `bayleymotor` for Infant Development
* the Stanford Binet Intelligence Scale Form L-M `stanfordbinet` and the Graham-Ernhart Block Sort Test `grahamernhart` 
* the Wechsler Intelligence Scale for Children (WISC) Full Scale IQ `wiscfulliq` and the Wide Range Achievement Test (WRAT) of Reading `wratreading`

The CPP data, drawn from the enhanced electronic datasets distributed by the Johns Hopkins School of Public Health, include WRAT scores that are raw rather than normed, with a mean around 35 rather than 100. [Edwards and Roff (2010)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0012157) found no qualitative differences between results using these raw scores and the normed scores used by [Saha et al. (2009)](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1000040). (In theory there should be no qualitative differences.)

The <b>treatment variable</b> of interest is <b>paternal age</b>, labeled `fathage` in the dataset. [Saha et al. (2009)](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1000040) are interested in whether there are developmental risks to children of older fathers. As they discuss, previous research suggests that advanced paternal age, which is connected to a higher probability of copy error mutations in sperm, appears to be associated with fetal death, rare congenital conditions, neurological and neuropsychiatric conditions, schizophrenia, and autism spectrum disorder. Advanced <i>maternal age</i> carries known risks, and this and other studies aimed to explore what advanced paternal age may bring with it, during a period when childbearing is generally being postponed to later in life, especially among the college educated.

The regression equation takes this general form:

$$
y_i = \alpha + \beta \cdot fathage_i + \sum \gamma^j \cdot z^j_i + \epsilon_i
$$

where $y_i$ is a neurocognitive test score at a particular age; $\beta$ is the treatment effect of paternal age `fathage` on $y_i$; and each $z^j_i$ is a "control variable" or background characteristic, with an effect $\gamma^j$. 
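
To see why the choice of controls matters, here is a sketch of the standard omitted variable bias formula written in the notation above. The symbols $\tilde{\beta}$ and $\delta^k$ are introduced here for illustration and do not appear in the papers cited. Suppose one control $z^k$ is left out of the regression. In large samples, the coefficient on `fathage` from that shorter regression, call it $\tilde{\beta}$, satisfies

$$
\tilde{\beta} = \beta + \gamma^k \cdot \delta^k
$$

where $\delta^k$ is the coefficient on `fathage` in an auxiliary regression of $z^k_i$ on `fathage` and the controls that remain included. The bias term $\gamma^k \cdot \delta^k$ is zero only if the omitted variable has no effect on the outcome ($\gamma^k = 0$) or is unrelated to paternal age once the other controls are held fixed ($\delta^k = 0$).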

Many other variables measure things that are likely to affect children's neurocognitive scores, of course. [Saha et al. (2009)](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1000040) considered the following set:

* `mothage` 
    * Mother's age at birth
* `fathage` 
    * Father's age at birth
* `childage8mo` `childage4yr` `childage7yr`
    * Child's age at reinterview wave
* `gestationwks`    
    * Child's gestation in weeks
* `male`
    * 0/1 Child is male
* `mothblack`
    * 0/1 Mother is Black (or African American)
* `mothasian`
    * 0/1 Mother is Asian American
* `mothpuert` 
    * 0/1 Mother is Puerto Rican
* `mothother` 
    * 0/1 Mother is of other race, not White non-Hispanic
* `mothsingle` 
    * 0/1 Mother's marital status: single
* `mothcommonl` 
    * 0/1 Mother's marital status: common-law married 
* `mothwidowed` 
    * 0/1 Mother's marital status: widowed 
* `mothdivorce`   
    * 0/1 Mother's marital status: divorced 
* `mothseparat`  
    * 0/1 Mother's marital status: separated 
* `socioeconindex`
    * Family's [Duncan socioeconomic index](https://usa.ipums.org/usa-action/variables/SEI#description_section) 
* `mothmentill`    
    * 0/1 Mother has any history of mental illness 
* `fathmentill`  
    * 0/1 Father has any history of mental illness 

Wow, what a list! But ... is it enough? <i>Are there other critical variables that are still omitted?</i>

<b>SPOILER ALERT:</b>  <i>Yes. Critical variables are still omitted.</i>
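
To make the idea of an omitted variable concrete before touching the CPP data, here is a minimal simulation sketch. Everything in it is made up for illustration: the variables `x`, `z`, and `y`, the seed, and the coefficients are assumptions, not CPP quantities. The true effect of `x` on `y` is set to zero, yet a regression that leaves out `z` suggests otherwise.

```r
# Simulated illustration of omitted variable bias (NOT the CPP data).
set.seed(140)                          # arbitrary seed, for reproducibility
n <- 5000
z <- rnorm(n)                          # hypothetical confounder
x <- 0.8 * z + rnorm(n)                # "treatment", correlated with z
y <- 100 + 0 * x + 2 * z + rnorm(n)    # true effect of x on y is zero

short_fit <- lm(y ~ x)                 # omits z
long_fit  <- lm(y ~ x + z)             # controls for z

beta_short <- unname(coef(short_fit)["x"])
beta_long  <- unname(coef(long_fit)["x"])

# Pieces of the OVB formula: effect of z on y, and slope of z on x
gamma_hat <- unname(coef(long_fit)["z"])
delta_hat <- unname(coef(lm(z ~ x))["x"])

print(c(short = beta_short,
        long = beta_long,
        long_plus_bias = beta_long + gamma_hat * delta_hat))
```

The short coefficient equals the long coefficient plus the product `gamma_hat * delta_hat`, which is the bias from leaving out `z`. Real data rarely tell us what `z` is, which is the hard part.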

<hr>


Let's load up <b>haven</b> and <b>tidyverse</b>:

In [None]:
library(haven)
library(tidyverse)

And let's load in the dataset:

In [None]:
cpp_sample <- read_dta("cpp_sample.dta")

In [None]:
head(cpp_sample)

[Saha et al. (2009)](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1000040) examined six different neurocognitive test scores. Let's choose one to focus on. Let's choose the Wechsler Intelligence Scale for Children (WISC), measured at age 7. It's useful to look at the summary of the variable, which reveals the units and average value.

In [None]:
summary(cpp_sample$wiscfulliq)

A visualization might be helpful too. Let's look at a scatterplot of `wiscfulliq` ($y$) vs. `fathage` ($x$).

`ggplot(data = cpp_sample, aes(x = fathage, y = wiscfulliq)) +
    geom_point() +
    labs(x = "Paternal age",
        y = "WISC IQ score at 7y")`        

<font color = "blue">What do you see here?</font>

Following what [Saha et al. (2009)](https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1000040) did, let's start with a "short regression" that includes just a few basic characteristics: mother's age, father's age, child's age at test, gestation, child's gender, and mother's race/ethnicity.

`wisc_reg1 <- lm(wiscfulliq ~ fathage + mothage + 
                  childage7yr + gestationwks + male +
                  mothblack + mothasian + mothpuert + mothother, data = cpp_sample)`
                  
`summary(wisc_reg1)`
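
When reading a regression summary, it can help to pull out one coefficient with its confidence interval and p-value, and to scale it against the average outcome. Below is a small sketch of one way to do that. The helper name `coef_in_context` is hypothetical, and the demonstration uses made-up toy data (`toy`, `score`) rather than the CPP sample; the same call should work on a fitted `lm` object such as `wisc_reg1`.

```r
# Hypothetical helper: one coefficient, its 95% CI, p-value,
# and its size as a percentage of the mean outcome.
coef_in_context <- function(fit, term) {
  est    <- unname(coef(fit)[term])
  ci     <- confint(fit, parm = term, level = 0.95)
  y_mean <- mean(model.response(model.frame(fit)))  # mean of y in the estimation sample
  data.frame(
    term          = term,
    estimate      = est,
    ci_low        = ci[1, 1],
    ci_high       = ci[1, 2],
    p_value       = coef(summary(fit))[term, "Pr(>|t|)"],
    pct_of_mean_y = 100 * est / y_mean
  )
}

# Toy data, invented for illustration only
set.seed(6)
toy <- data.frame(fathage = runif(500, 18, 55))
toy$score <- 95 + 0.1 * toy$fathage + rnorm(500, sd = 15)

toy_fit <- lm(score ~ fathage, data = toy)
res <- coef_in_context(toy_fit, "fathage")
print(res)
```

The last column answers the "is it large?" question in relative terms: a one-year change in the regressor moves the outcome by that percentage of its mean.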

<font color="blue">Write about what you see here. Describe and discuss the "treatment effect" of `fathage`. What about the effect associated with `mothage`? How do these two coefficients compare? Are they statistically significant? What are their signs? Are they large relative to the average level of `wiscfulliq`?</font>

Type some answers here

<font color="blue">
Describe the nature of this study. How are "treatment groups" &mdash; children with older dads &mdash; and "control groups" &mdash; children with younger dads &mdash; assigned? Randomly? In some other way? If the assignment is not random, why do you think children might have older dads?
</font>

Type some answers here

Now let's run a second regression, a "longer" one, with more control variables on the right-hand side. Like Saha et al., let's also control for mother's marital status, the family socioeconomic index (a single-valued function of income, education, and occupation &mdash; these are commonly seen in sociology), and 0/1 indicators of mother's and father's having any history of mental illness.

`wisc_reg2 <- lm(wiscfulliq ~ fathage + mothage + 
                  childage7yr + gestationwks + male +
                  mothblack + mothasian + mothpuert + mothother + 
                  mothsingle + mothcommonl + mothwidowed + mothdivorce + mothseparat +
                  socioeconindex + mothmentill + fathmentill, data = cpp_sample)`

`summary(wisc_reg2)`
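
When comparing a short and a long regression, one thing worth checking is whether both were estimated on the same observations, since `lm()` silently drops rows with missing values in any included variable. The sketch below shows one way to line up a coefficient across two models along with their sample sizes. The helper name `compare_term` is hypothetical and the data are invented for illustration (the `sei` column is a made-up stand-in for a control with some missing values); the helper should work the same way on `wisc_reg1` and `wisc_reg2`.

```r
# Hypothetical helper: estimate, standard error, and sample size
# for one term in two fitted lm models.
compare_term <- function(short, long, term) {
  pick <- function(fit) coef(summary(fit))[term, c("Estimate", "Std. Error")]
  out  <- rbind(short = pick(short), long = pick(long))
  data.frame(model     = rownames(out),
             estimate  = out[, 1],
             std_error = out[, 2],
             n_obs     = c(nobs(short), nobs(long)),
             row.names = NULL)
}

# Toy data, invented for illustration only
set.seed(7)
toy <- data.frame(fathage = runif(400, 18, 55))
toy$sei   <- 20 + 0.8 * toy$fathage + rnorm(400, sd = 10)  # made-up control
toy$score <- 90 + 0.2 * toy$sei + rnorm(400, sd = 15)
toy$sei[1:25] <- NA                                        # pretend 25 values are missing

short <- lm(score ~ fathage, data = toy)
long  <- lm(score ~ fathage + sei, data = toy)
cmp   <- compare_term(short, long, "fathage")
print(cmp)

# Refit the short model on the rows the long model actually used
short_same <- lm(score ~ fathage, data = toy[!is.na(toy$sei), ])
print(nobs(short_same))
```

If `n_obs` differs between the two rows, part of any change in the coefficient could reflect the change in sample rather than the added controls, so refitting the short model on the common sample gives a cleaner comparison.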

<font color="blue">Write about what you see here. Describe and discuss the "treatment effect" of `fathage`. What about the effect associated with `mothage`? How do these two coefficients compare? Are they statistically significant? What are their signs? Are they large relative to the average level of `wiscfulliq`?
How have things changed since the first regression above?
</font>

Write about what you see here!

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>