<h1>ECON 140R Class 07</h1>

Let's examine the simple example of 9 college applicants in Table 2.1 inside Chapter 2 of <i>Mastering Metrics</i>. The specific research question is: "What is the effect of attending a private college (when you can but don't have to)?" The impediment is <b>omitted variable bias (OVB)</b>.

<h2>Learning Objectives</h2>

* Run some ordinary least squares (OLS) regressions using `lm()` in __R__; the math follows below
* Explore how groups C and D are unfortunately useless for the task at hand
* Jump ahead and examine the <i>OVB formula</i> for this example, from pp 70-72


Here is a clean PNG of Table 2.1:

<img src="MMtbl21.png" width="600" />

In [None]:
library(tidyverse)

First, let's code up all of Table 2.1. <i>But let's also do it with two separate data frames: one for groups A and B, and one for groups C and D.</i> The reason why will become clear shortly. This __R__ code looks just like the code in class06. Typically, you won't have to code up a dataset like this; you'll instead receive a dataset in CSV or similar format, and your job will be to recode things to make them useful.

You'll also typically model the <b>natural logarithm</b> of financial variables like earnings, wages, income, wealth, and so on. Here, we'll follow MM and stick with the level of earnings, because it makes everything much clearer.

Here is the code for groups A and B:

In [None]:
group   <- c("A", "A", "A", "B", "B")

groupa  <- c(1, 1, 1, 0, 0)

groupb  <- c(0, 0, 0, 1, 1)

groupc  <- c(0, 0, 0, 0, 0)

groupd  <- c(0, 0, 0, 0, 0)

private <- c(1, 1, 0, 1, 0)

earnings <- c(110000, 100000, 110000, 60000, 30000)

(table2_1dataAB <- data.frame(group, groupa, groupb, groupc, groupd, private, earnings))


And here's the code for groups C and D:

In [None]:
group   <- c("C", "C", "D", "D")

groupa  <- c(0, 0, 0, 0)

groupb  <- c(0, 0, 0, 0)

groupc  <- c(1, 1, 0, 0)

groupd  <- c(0, 0, 1, 1)

private <- c(1, 1, 0, 0)

earnings <- c(115000, 75000, 90000, 60000)

# data.frame() constructs the data frame and labels the columns with the variable names
# Parentheses around the command also ask R to show it to us

(table2_1dataCD <- data.frame(group, groupa, groupb, groupc, groupd, private, earnings))

As in class06, let's call `rbind()` to append the datasets and create the entire Table 2.1 with 9 individuals shown.

In [None]:
(table2_1data <- rbind(table2_1dataAB, table2_1dataCD))

Now we are ready to run some regressions.

Why do Angrist and Pischke drop groups C and D from their analysis? As they describe on p. 54, "Groups C and D are uninformative, because, from the perspective of our effort to estimate a private school treatment effect, each is composed of either all-treated or all-control individuals." On p. 57, they state "In our matching matrix, the five students in groups A and B (Table 2.1) contribute useful data, while students in groups C and D can be discarded."

Let's explore these statements.
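Before running any regressions, a quick cross-tabulation can make the "all-treated or all-control" point concrete. This is only a sketch: it re-enters the `group` and `private` columns of Table 2.1 so the cell runs on its own, and the names `group_by_private` and `has_variation` are mine, not from MM.

```r
# Re-enter the group and treatment columns of Table 2.1 (all 9 applicants)
group   <- c("A", "A", "A", "B", "B", "C", "C", "D", "D")
private <- c(1, 1, 0, 1, 0, 1, 1, 0, 0)

# Count applicants in each group by private (1) versus public (0) attendance
(group_by_private <- table(group, private))

# A group can inform the treatment effect only if it contains both values of private
(has_variation <- tapply(private, group, function(p) length(unique(p)) > 1))
```

Groups A and B each contain both private and public attendees, while group C is all private and group D is all public.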

First, suppose we were <i>U.S. News & World Report</i>, and all we did was observe, rather than search for causal effects. Let's run regular OLS on this equation:

$$
earnings_i = \alpha + \beta \ private_i + \epsilon_i
$$

and let's use the entire fictional dataset of 9 applicants:

In [None]:
table2_1_full_reg <- lm(earnings ~ private, data = table2_1data)
summary(table2_1_full_reg)

When we look across all 9 applicants, those who attend public schools earn an average of $\alpha = 72,500$ while applicants who attend private schools earn $\beta = 19,500$ more.
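With a single 0/1 regressor, OLS simply reproduces a comparison of means: the intercept is the mean of the untreated group and the slope is the difference in means. The sketch below checks this on the 9 applicants; the data are re-entered so the cell is self-contained, and `mean_public`, `mean_private`, and `simple_fit` are illustrative names.

```r
# Treatment and outcome columns of Table 2.1 (all 9 applicants)
private  <- c(1, 1, 0, 1, 0, 1, 1, 0, 0)
earnings <- c(110000, 100000, 110000, 60000, 30000, 115000, 75000, 90000, 60000)

# Average earnings among public and private attendees
mean_public  <- mean(earnings[private == 0])
mean_private <- mean(earnings[private == 1])

# The simple regression and the comparison of means, side by side
simple_fit <- lm(earnings ~ private)
coef(simple_fit)
c(alpha = mean_public, beta = mean_private - mean_public)
```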

Now suppose we run the same simple model but look at just groups A and B:

In [None]:
table2_1_AB_reg <- lm(earnings ~ private, data = table2_1dataAB)
summary(table2_1_AB_reg)

Hmmm. Now, when we look across the 5 applicants in groups A and B, those who attend public schools earn an average of $\alpha = 70,000$ while applicants who attend private schools earn $\beta = 20,000$ more. In this simple observational model, it clearly does matter whether we use all 9 data points or just the 5 in groups A and B.

But can this simple observational model:
$$
earnings_i = \alpha + \beta \ private_i + \epsilon_i
$$

reveal a good estimate of the causal effect of $private$ on $earnings$?

On p. 55, Angrist and Pischke argue the answer is NO. The simplest way to describe why is that the simple model compares every applicant to every other applicant, that is, compares apples to oranges. We want to compare apples to apples, and the best way to do that is to compare individuals within groups.
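To make the apples-to-apples idea concrete, we can compute the private-minus-public earnings gap separately inside each group. This is a rough sketch using base R's `tapply()`; the data are re-entered so the cell runs alone, and `cell_means` and `within_gap` are names I made up for illustration.

```r
# Group, treatment, and outcome columns of Table 2.1 (all 9 applicants)
group    <- c("A", "A", "A", "B", "B", "C", "C", "D", "D")
private  <- c(1, 1, 0, 1, 0, 1, 1, 0, 0)
earnings <- c(110000, 100000, 110000, 60000, 30000, 115000, 75000, 90000, 60000)

# Mean earnings by group (rows) and private attendance (columns "0" and "1")
# Cells with no applicants come back as NA
(cell_means <- tapply(earnings, list(group, private), mean))

# Private-minus-public gap within each group
(within_gap <- cell_means[, "1"] - cell_means[, "0"])
```

The gap is -5,000 in group A and 30,000 in group B. For groups C and D it is `NA`, because each of those groups has nobody on one side of the comparison.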

Introducing "group fixed effects" is one way to do this. Let's first consider this model, which is a little different from equation (2.1) on p. 57:

$$
earnings_i = \alpha + \beta \ private_i + \gamma^A \ A_i + \gamma^C \ C_i + \gamma^D \ D_i+ \epsilon_i
$$

Here, group B is the omitted "baseline" category, and indicator variables $A_i$, $C_i$, and $D_i$ take on values of 1 when applicant $i$ is in that group, and 0 otherwise. This is how we have coded the data already, so let's run this regression:

In [None]:
table2_1_fullfe_reg <- lm(earnings ~ private + groupa + groupc + groupd, data = table2_1data)
summary(table2_1_fullfe_reg)

These estimates are quite different from our earlier simple observational estimates. With group fixed effects, the parameter of interest is $\beta = 10,000$, about half of what it was.
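We hand-coded the indicator variables here, but in practice you can let __R__ build them from a factor. The sketch below is one way to do that, assuming we want group B as the omitted baseline; `table2_1`, `group_f`, and `fe_fit` are illustrative names, and the data are re-entered so the cell runs on its own.

```r
# Table 2.1 with a single group label instead of hand-coded indicators
table2_1 <- data.frame(
  group    = c("A", "A", "A", "B", "B", "C", "C", "D", "D"),
  private  = c(1, 1, 0, 1, 0, 1, 1, 0, 0),
  earnings = c(110000, 100000, 110000, 60000, 30000, 115000, 75000, 90000, 60000)
)

# Convert group to a factor and make B the omitted baseline category
table2_1$group_f <- relevel(factor(table2_1$group), ref = "B")

# lm() expands the factor into indicators for A, C, and D automatically
fe_fit <- lm(earnings ~ private + group_f, data = table2_1)
coef(fe_fit)
```

The coefficient on `private` should match the hand-coded version; only the labels on the group coefficients differ (`group_fA` instead of `groupa`, and so on).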

Now humor me. Let's do what Angrist and Pischke advise, which is to <i>drop groups C and D</i>. We'll also drop indicator variables for those groups too, and we'll end up with this:
$$
earnings_i = \alpha + \beta \ private_i + \gamma \ A_i + \epsilon_i
$$

This is equation (2.1) from p. 57 of <i>Mastering Metrics</i>. Let's run it:

In [None]:
table2_1_ABfe_reg <- lm(earnings ~ private + groupa, data = table2_1dataAB)
summary(table2_1_ABfe_reg)

Look at these estimates, and compare them to the previous regression with group fixed effects. <i>They are exactly the same</i>, except here we don't have estimates of the fixed effects for groups C and D.
    
Why? 

If we look at groups C and D, notice that the variable of interest, $private_i$, does not vary within either of those groups. In a setting like this, if you specify group fixed effects and there is no variation in the treatment variable $private_i$ within a group, then the group fixed effect (here, both $\gamma^C$ and $\gamma^D$) will capture or absorb all of the action within those groups, and those groups will not contribute in any way to the estimate of $\beta$, the effect of $private_i$.
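One way to see this absorption directly is to subtract each group's mean from both variables by hand and then regress the demeaned earnings on the demeaned treatment. This is a sketch of the idea, not something MM does in this chapter; it relies on the standard result that a regression with a full set of group indicators gives the same slope as a regression on group-demeaned data. The names `private_within`, `earnings_within`, and `within_fit` are mine.

```r
# Group, treatment, and outcome columns of Table 2.1 (all 9 applicants)
group    <- c("A", "A", "A", "B", "B", "C", "C", "D", "D")
private  <- c(1, 1, 0, 1, 0, 1, 1, 0, 0)
earnings <- c(110000, 100000, 110000, 60000, 30000, 115000, 75000, 90000, 60000)

# Subtract each group's own mean; ave() returns the group mean for every row
private_within  <- private  - ave(private,  group)
earnings_within <- earnings - ave(earnings, group)

# In groups C and D the demeaned treatment is zero for every applicant
data.frame(group, private_within, earnings_within)

# Slope of demeaned earnings on demeaned private (no intercept after demeaning)
within_fit <- lm(earnings_within ~ 0 + private_within)
coef(within_fit)
```

Because every row from groups C and D has `private_within` equal to zero, those rows add nothing to the slope, which comes out at 10,000 whether or not they are in the data.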

<h2>RECAP</h2>

* Comparing apples to apples reveals the plausibly causal effect of private school attendance on earnings. 
* We accomplish that with <b>good controls</b> in an ordinary least squares (OLS) regression. 
* In particular, we specified <b>group fixed effects</b>, or group indicator variables, where the groups were constructed by combining students who applied to similar colleges. 
* The subset of these students whose choices reveal the causal effect are those in groups where there was variation in the treatment variable of interest: private school <i>attendance</i>.

It's also true that the simple comparison of earnings across all students is interesting. But it clearly reflects both the causal impact of private school attendance and selection.

To the extent we might care about total inequality, however, the simple comparison is still illuminating. But suppose we wanted to identify a <b>policy</b> to reduce total inequality; for example, sending disadvantaged applicants to private colleges. Then the causal component is probably the most relevant one. (In this simple example, there is a positive causal effect of private college attendance on earnings.)

<h2>OVB formula</h2>

Let's stick with this simple example and introduce the <i>omitted variable bias (OVB) formula</i>. So far, we have seen how there is omitted variable bias (a.k.a. selection) in the simple observational model that we ran. Let's revisit this and introduce some new terminology.

Call the simple observational model the <b>short regression</b>, which appears on the top of p. 70:

$$
earnings_i = \alpha^s + \beta^s \ P_i + e^s_i
$$

while the model with a group fixed effect $\gamma$ (when $A_i = 1$) will be the <b>long regression</b> because it includes one more right-hand side variable. This is equation (2.3) on page 69:

$$
earnings_i = \alpha^l + \beta^l \ P_i + \gamma \ A_i + e^l_i
$$

We can formally define omitted variable bias as the difference between these two estimates of the "effect" of $private_i$ on $earnings_i$: 

$$
OVB = \beta^s - \beta^l
$$

For example, if $\beta^s$ from the short regression is larger than $\beta^l$ from the long regression, then $\beta^s > \beta^l$ and the omitted variable bias is positive. In words, we can describe the situation this way: in the simple observational model, our $\beta$ was too large, because of positive omitted variable bias.

Omitted variable bias can also be negative, in which case the short regression coefficient is smaller than the long regression coefficient. (Here, the treatment effects are positive. Things can get complicated when the treatment effects themselves are negative; in a situation like that, take a deep breath, try to reason through it, and keep careful track of the signs.)

<h3>OVB formula</h3>

To decompose $OVB = \beta^s - \beta^l$ into parts that are conceptually easier to think about, we need a third regression, often called an <b>auxiliary regression</b>, which models the relationship between the omitted variable, $A_i$, and the treatment variable, $private_i$. The auxiliary regression here is very simple:

$$
A_i = \pi_0 + \pi_1 \ P_i + u_i
$$

Given the short, long, and auxiliary regressions, the omitted variable bias in the short regression must be given by:

$$
OVB = \beta^s - \beta^l = \pi_1 \times \gamma
$$

<h3>Intuition</h3>

If both of these are true:
* The long regression is right, meaning that $\gamma$ isn't zero and thus $earnings_i$ varies with $A_i$ 
* The auxiliary regression is right, meaning that $\pi_1$ isn't zero and thus $A_i$ varies with the treatment variable $P_i$

then part of $\beta^s$ is the effect of the omitted variable $A_i$. The omitted variable bias equals the product of the long-regression coefficient on $A_i$ and the auxiliary-regression coefficient on the treatment variable $P_i$. In general, the signs of $\pi_1$ and $\gamma$ can vary; here, they are both positive.

To see the result mathematically, rewrite the long regression using the auxiliary regression, substituting for $A_i$:

$$
earnings_i = \alpha^l + \beta^l \ P_i + \gamma \left(\pi_0 + \pi_1 \ P_i + u_i \right) + e^l_i
$$
$$
earnings_i = \alpha^l + \beta^l \ P_i + \gamma \ \pi_0 + \gamma \ \pi_1 \ P_i + \gamma \ u_i + e^l_i
$$

Now combine similar terms:

$$
earnings_i = \left(\alpha^l + \gamma \ \pi_0 \right) + \left(\beta^l + \gamma \ \pi_1 \right) \ P_i + \left( \gamma \ u_i + e^l_i \right)
$$

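This now has the form of the short regression. Because OLS residuals are uncorrelated with the regressors by construction, both $u_i$ and $e^l_i$ are uncorrelated with $P_i$, and so is the combined error term. The full coefficient on $P_i$ here is therefore the short regression coefficient: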

$$
\beta^s = \beta^l + \gamma \ \pi_1
$$

$$
\beta^s - \beta^l =  \pi_1 \times \gamma
$$

<h3>Motivation</h3>

The concept of omitted variable bias and its formula are important for at least two reasons:
* Quantifying its actual effect, when the omitted variables can be measured
* Thinking about its likely effect, when the omitted variables CAN'T be measured

<i>Mastering Metrics</i> introduces the second reason on p. 72, when Angrist and Pischke work through a thought experiment regarding <i>family size</i>, which is another omitted variable. They ultimately argue that omitting family size probably also introduces positive bias into the estimate of the effect of private college on earnings. We'll return to this later, after seeing the 

Let's now run the decomposition. Here, when we can measure $A_i$ and reveal that it matters a lot, this is just a mechanical exercise. When we can't measure an omitted variable, like family size, then we must instead make some educated guesses about the signs of $\pi_1$ and $\gamma$.

First, let's run the short regression:

In [None]:
table2_1_reg_short <- lm(earnings ~ private, data = table2_1dataAB)
summary(table2_1_reg_short)

Now let's run the long regression:

In [None]:
table2_1_reg_long <- lm(earnings ~ private + groupa, data = table2_1dataAB)
summary(table2_1_reg_long)

And finally, let's run the auxiliary regression of the omitted variable $A_i$ on the included variables, which is only the treatment variable $P_i$ here. Again, this is shown in MM on pp 71-72. 
$$
A_i = \pi_0 + \pi_1 \ P_i + u_i
$$

In [None]:
table2_1_reg_omitted <- lm(groupa ~ private, data = table2_1dataAB)
summary(table2_1_reg_omitted)

Now let's calculate the omitted variable bias formula from the bottom of page 71: 

$$
OVB = \beta^s - \beta^l = \pi_1 \times \gamma
$$

In [None]:
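# Coefficients copied by hand from the three regression summaries above
beta_short <- 20000   # coefficient on private, short regression
beta_long  <- 10000   # coefficient on private, long regression
pi_1       <- 0.1667  # coefficient on private, auxiliary regression (rounded)
gamma      <- 60000   # coefficient on groupa, long regression

(beta_short - beta_long)   # OVB measured directly
(ovb <- pi_1 * gamma)      # OVB from the formula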

Checks out, with some rounding error. Just for kicks, we can also look at the decomposition of the constant term if we want to:

In [None]:
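# Intercepts copied by hand from the same summaries; gamma comes from the previous cell
alpha_short <- 70000  # intercept, short regression
alpha_long  <- 40000  # intercept, long regression
pi_0        <- 0.5    # intercept, auxiliary regression

(alpha_short)
(alpha_long + gamma * pi_0)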
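Typing coefficients in by hand invites rounding and transcription slips. As an alternative sketch, we can pull the estimates straight out of the fitted models with `coef()` and check both decompositions at full precision. The data for groups A and B are re-entered so the cell is self-contained, and the object names (`short_fit`, `long_fit`, `aux_fit`, `b_short`, and so on) are illustrative rather than anything from MM.

```r
# Groups A and B from Table 2.1
groupa   <- c(1, 1, 1, 0, 0)
private  <- c(1, 1, 0, 1, 0)
earnings <- c(110000, 100000, 110000, 60000, 30000)

# Short, long, and auxiliary regressions
short_fit <- lm(earnings ~ private)
long_fit  <- lm(earnings ~ private + groupa)
aux_fit   <- lm(groupa ~ private)

# Extract the estimates by name instead of retyping them
b_short <- coef(short_fit)[["private"]]
b_long  <- coef(long_fit)[["private"]]
g_long  <- coef(long_fit)[["groupa"]]
p_1     <- coef(aux_fit)[["private"]]
p_0     <- coef(aux_fit)[["(Intercept)"]]

# Slope decomposition: beta^s - beta^l versus pi_1 * gamma
c(direct = b_short - b_long, formula = p_1 * g_long)

# Intercept decomposition: alpha^s versus alpha^l + gamma * pi_0
c(direct  = coef(short_fit)[["(Intercept)"]],
  formula = coef(long_fit)[["(Intercept)"]] + g_long * p_0)
```

Each pair should agree up to floating-point precision, with no rounding error from a truncated $\pi_1$.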

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>