<h1>ECON 140R Class 16</h1>

In this notebook, we return to the Fulton Fish Market example that we saw in `econ140_project_wooldridge.ipynb` during Class 15. The data ship with the `wooldridge` R package and were collected and provided by Prof. Kathryn Graddy of Brandeis University. Here is the [documentation PDF](https://cran.r-project.org/web/packages/wooldridge/wooldridge.pdf) for the package. You can also call `?dataset_name` to see the help page for just the data in `dataset_name`.

Learning objectives:

1. As always, more experience with OLS
2. Instrumental variables with `ivreg()`
3. Reduced form $\rho$, first stage $\phi$, and the LATE estimate $\lambda = \frac{\rho}{\phi}$

In [None]:
install.packages("wooldridge")

In [None]:
library("wooldridge")
library(tidyverse)

In [None]:
data("fish")

In [None]:
head(fish)

In [None]:
?fish

Below is a simple visualization of the log average price of fish in an observation ($y$-axis) as a function of the log quantity ($x$-axis). The tricky thing is that in our infinite wisdom, we economists put quantity along the $x$-axis and then prefer to measure the price elasticity of demand (and supply). The price elasticity of demand is usually written

$$
\eta_f = \frac{\% \Delta Q}{\% \Delta P} = \frac{\partial \ln Q}{\partial \ln P}
$$

where the middle part shows the percentage change in quantity for a percentage change in price. This turns out to be the coefficient on log price ($\ln P$) in a linear regression of log quantity ($\ln Q$):

$$
\ln Q_i = \alpha + \eta_f \ \ln P_i + B \ X_i + \epsilon_i
$$

where $X_i$ are other controls, like day of the week in this example.
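
As a quick check on this interpretation, here is a minimal sketch with simulated data (not the fish data). The names `true_eta`, `log_p`, `log_q`, and `sim_fit` are made up for illustration, and price is exogenous by construction, which is exactly what we cannot assume in a real market.

```r
# Simulated data only: none of these numbers come from the fish dataset
set.seed(140)
n <- 500
true_eta <- -0.5                        # assumed constant price elasticity
log_p <- rnorm(n, mean = 0, sd = 0.4)   # hypothetical log prices
log_q <- 8 + true_eta * log_p + rnorm(n, sd = 0.1)

# Log-log regression: the slope on log price estimates the elasticity
sim_fit <- lm(log_q ~ log_p)
eta_hat <- unname(coef(sim_fit)["log_p"])
eta_hat
```

The estimate `eta_hat` should land close to `true_eta`, and it reads as "a 1 percent increase in price goes with roughly an `eta_hat` percent change in quantity."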

In [None]:
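# Quantity on the x-axis and price on the y-axis, as economists usually draw it
plot(fish$ltotqty, fish$lavgprc,
     xlab = "Log total quantity (ltotqty)", ylab = "Log average price (lavgprc)")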

In [None]:
#install.packages("vembedr")
#install.packages("Hmisc")
#install.packages("ggplot2")

#library("ggplot2")
#library("Hmisc")
#library("vembedr")
#library("dplyr")

Here is a regression analysis shown by Wooldridge on p. 572 of the 4th edition of his textbook, and also shown in Table 2 of Graddy's [2006 J Econ Perspect paper](https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.20.2.207). Consider this approach to modeling the quantity of fish demanded: 

$$
ltotqty_i = \alpha + \eta_f \ lavgprc_i + \beta_m \ mon_i + \beta_t \ tues_i 
+ \beta_w \ wed_i + \beta_{th} \ thurs_i + \epsilon_i
$$

where `ltotqty` is the log total quantity sold (to all customers); `lavgprc` is the log average price; and the remaining variables are indicators for days of the week. Friday is the omitted category, and apparently the market was closed on weekends. The variable of interest is the demand elasticity $\eta_f$, which is the percentage change in quantity demanded for a one percent change in price.

In [None]:
demand_reg <- lm(ltotqty ~ lavgprc + mon + tues + wed + thurs, data = fish)
summary(demand_reg)

Describe what you see. Which variables are statistically significant? Describe your numerical estimate of the coefficient on log average fish price, $\eta_f$, in a sentence. 

The demand elasticity here is $-0.52$, indicating somewhat inelastic demand. In a handy chart you can find in <i>Health Economics</i> by Bhattacharya, Hyde, and Tu, this elasticity is between those for coffee ($-0.25$) and movies ($-0.9$).

You might be concerned about whether shifts in supply or demand are actually creating these effects, however. That is the fundamental problem in examining market data; either supply or demand could be moving, or both could. Graddy was able to shed some light on that question by examining <b>weather data</b>, which she motivates on page 215 of her JEP piece. She argues that theory and evidence imply that weather only affects the market through supply.

This reasoning also appears in the "Masters of 'Metrics" section at the end of Chapter 3 of <i>Mastering Metrics</i>. [Philip G. Wright](https://en.wikipedia.org/wiki/Philip_Green_Wright), an American economist born during the Civil War, proposed instrumental variables as a solution to the simultaneity problem that arises when estimating supply and demand curves for commodities.

### Instrumental variables with 2 IVs per Wooldridge

Hold on to your hats. Let us explore a form of instrumental variables estimation called "two stage least squares" using `ivreg` in __R__. <font color="red">You do NOT have to run instrumental variables in your term project.</font> We are doing it here because Graddy did it, Wooldridge suggests this exercise in an advanced chapter, and because we are currently studying instrumental variables in <i>Mastering Metrics</i>.

In [None]:
install.packages("ivreg", dependencies = TRUE)

In [None]:
library("ivreg")

Let us follow Wooldridge's advice and estimate the same model from above, but this time with two-stage least squares (2SLS), a very common form of IV estimation:

$$
ltotqty_i = \alpha + \eta_f \ lavgprc_i + \beta_m \ mon_i + \beta_t \ tues_i 
+ \beta_w \ wed_i + \beta_{th} \ thurs_i + \epsilon_i
$$

We will use `wave2` and `wave3`, both of which are measures of recent wave height around NYC, as instrumental variables. The argument is that the log average price is endogenously determined by supply and demand, but that running IV with weather variables will leverage the causality running from weather to supply and then to the price, and thus will reveal the true slope of the demand curve.

(Intuition: the height of the waves around NYC probably doesn't much affect the typical New Yorker's appetite for fish. But it does affect how hard it is to retrieve fish from the ocean.)
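
To make the simultaneity problem concrete, here is a small simulated market. It is a sketch under made-up assumptions, not a model of the Fulton market: the slopes `eta` and `gamma`, the wave effect `delta`, and all variable names are hypothetical. Waves shift only the supply curve, and price is solved from demand = supply.

```r
# Simulated supply and demand system (all parameter values are assumptions)
set.seed(16)
n <- 5000
eta   <- -1.0   # true demand slope
gamma <-  1.0   # true supply slope
delta <-  1.0   # how strongly waves cut supply

z   <- rnorm(n)   # hypothetical wave height (the instrument)
u_d <- rnorm(n)   # demand shocks
u_s <- rnorm(n)   # supply shocks

# Demand: q = eta * p + u_d;  Supply: q = gamma * p - delta * z + u_s
# Setting demand equal to supply gives the equilibrium price
p <- (delta * z + u_d - u_s) / (gamma - eta)
q <- eta * p + u_d

# OLS mixes up supply and demand movements
ols_slope <- unname(coef(lm(q ~ p))["p"])

# IV with one instrument and no controls: cov(z, q) / cov(z, p)
iv_slope <- cov(z, q) / cov(z, p)

c(ols = ols_slope, iv = iv_slope, truth = eta)
```

In this setup, OLS is pulled toward zero because price is correlated with the demand shock, while the IV ratio uses only the price variation driven by waves and lands near the true demand slope.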

The syntax for `ivreg()` is very similar to that of `lm()`, except that you need a "pipe" symbol: "|". The pipe goes after your original equation, and the part after the pipe must list all of the exogenous variables plus the instrumental variables. The endogenous regressor (here `lavgprc`) appears only before the pipe.

In [None]:
demand_ivreg <- ivreg(ltotqty ~ lavgprc + mon + tues + wed + thurs 
                      | mon + tues + wed + thurs + wave2 + wave3, data = fish)
summary(demand_ivreg)

What happened to the price elasticity of demand when we ran IV versus ordinary least squares above? Curious? Read Graddy's JEP piece for more.

<hr>

### IV with one IV

Let us dig a little more. The equation above uses 2 instrumental variables, which are two different time lags of wave height. As Angrist and Pischke explain in <i>Mastering Metrics</i> on pp. 131-138, and specifically on p. 135, when we have 2 instrumental variables, the two-stage least squares (2SLS) routine produces a local average treatment effect (LATE) that is a weighted average of the LATEs that we would get if we ran 2SLS twice, once with each of the instruments.
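
For the curious, the weighted-average algebra can be checked numerically in the simplest case, with no control variables. This is a sketch with simulated data; `z1`, `z2`, `d`, `y`, and the weight `psi` are hypothetical names, and the weight formula is the one that falls out of the algebra in this no-controls case, not something taken from the fish example.

```r
# Simulated data with two instruments (all values are assumptions)
set.seed(135)
n  <- 2000
z1 <- rnorm(n)                             # hypothetical instrument 1
z2 <- 0.5 * z1 + rnorm(n)                  # hypothetical instrument 2
u  <- rnorm(n)                             # unobserved confounder
d  <- 0.8 * z1 + 0.4 * z2 + u + rnorm(n)   # endogenous regressor
y  <- -1 * d + 2 * u + rnorm(n)            # outcome; true effect is -1

# One-instrument IV estimates: cov(y, z) / cov(d, z)
iv1 <- cov(y, z1) / cov(d, z1)
iv2 <- cov(y, z2) / cov(d, z2)

# First stage with both instruments
fs  <- lm(d ~ z1 + z2)
pi1 <- unname(coef(fs)["z1"])
pi2 <- unname(coef(fs)["z2"])

# 2SLS point estimate: regress y on the first-stage fitted values
tsls <- unname(coef(lm(y ~ fitted(fs)))[2])

# Weight on instrument 1, built from first-stage pieces
psi <- pi1 * cov(d, z1) / (pi1 * cov(d, z1) + pi2 * cov(d, z2))
weighted <- psi * iv1 + (1 - psi) * iv2

c(tsls = tsls, weighted = weighted, iv1 = iv1, iv2 = iv2, psi = psi)
```

The `tsls` and `weighted` entries agree up to rounding, which is the weighted-average statement in miniature. With control variables the same logic applies after the controls are partialled out.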
    
Rather than pick through that more, let us instead just choose one of the instruments and reestimate the demand equation, so that it is easier to deconstruct everything.
    
A friendly econometrics leprechaun whispers in your ear: "Use wave2, it's good!"

In [None]:
d1_ivreg <- ivreg(ltotqty ~ lavgprc + mon + tues + wed + thurs 
                      | mon + tues + wed + thurs + wave2, data = fish)
summary(d1_ivreg)

What do you think? Are these results similar to the two-IV model we started with?

<hr>

Below we estimate the <b>reduced form</b> for this one-IV model. Here, it is a regression of the outcome on:
* the instrument
* NOT the endogenous treatment, we drop that
* the other exogenous controls

$$
ltotqty_i = \alpha_0 + \rho \ wave2_i + \beta_{0m} \ mon_i + \beta_{0t} \ tues_i 
+ \beta_{0w} \ wed_i + \beta_{0th} \ thurs_i + \epsilon_{0i}
$$

In [None]:
# The reduced form has no endogenous regressor, so ordinary least squares via lm() suffices
d1_rf <- lm(ltotqty ~ wave2 + mon + tues + wed + thurs, data = fish)
summary(d1_rf)

What can we say about the reduced form?

As Angrist and Pischke write on p. 146, "We always tell our students: <i>If you can't see it in the reduced form, it ain't there.</i>"
    
Can you see it in the reduced form?

<hr>

Finally, let us run the first stage. Here, it is a regression of <i>the endogenous treatment</i> on:
* the instrument
* the other exogenous controls

$$
lavgprc_i = \alpha_1 + \phi \ wave2_i + \beta_{1m} \ mon_i + \beta_{1t} \ tues_i 
+ \beta_{1w} \ wed_i + \beta_{1th} \ thurs_i + \epsilon_{1i}
$$

In [None]:
# The first stage is also plain OLS: the endogenous price regressed on the instrument and controls
d1_1st <- lm(lavgprc ~ wave2 + mon + tues + wed + thurs, data = fish)
summary(d1_1st)

What do you see here? How does lagged wave height $wave2_i$ affect the log average price, if at all?

<hr>

And now for the grand finale. Because we used just 1 instrumental variable, we can decompose the LATE $\eta_f$, the demand elasticity that we obtained from 2SLS, as the ratio:

$$
\eta_f = \frac{\rho}{\phi}
$$

(In the textbook, the LATE usually shows up as $\lambda$. Here, we are using $\eta_f$ because the estimate is a demand elasticity, and economists love $\eta$ for that because of its "long-E" sound.)
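
To see where the ratio comes from, substitute the first-stage equation for $lavgprc_i$ into the demand equation. In the sketch below, $day_{di}$ is shorthand (not notation used elsewhere in this notebook) for the four weekday indicators, with $d \in \{m, t, w, th\}$:

$$
\begin{aligned}
ltotqty_i &= \alpha + \eta_f \ lavgprc_i + \sum_{d} \beta_d \ day_{di} + \epsilon_i \\
&= \alpha + \eta_f \left( \alpha_1 + \phi \ wave2_i + \sum_{d} \beta_{1d} \ day_{di} + \epsilon_{1i} \right) + \sum_{d} \beta_d \ day_{di} + \epsilon_i \\
&= \underbrace{(\alpha + \eta_f \alpha_1)}_{\alpha_0} + \underbrace{\eta_f \phi}_{\rho} \ wave2_i + \sum_{d} \underbrace{(\beta_d + \eta_f \beta_{1d})}_{\beta_{0d}} \ day_{di} + \underbrace{(\epsilon_i + \eta_f \epsilon_{1i})}_{\epsilon_{0i}}
\end{aligned}
$$

The last line is the reduced form, so $\rho = \eta_f \phi$, and dividing by $\phi$ gives $\eta_f = \rho / \phi$. This only works when the first stage is not zero ($\phi \neq 0$).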

In [None]:
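# Reduced-form coefficient on the instrument
rho <- d1_rf$coefficients["wave2"]
rho

# First-stage coefficient on the instrument
phi <- d1_1st$coefficients["wave2"]
phi

# With one instrument, the 2SLS estimate is the ratio of the two
LATE <- rho/phi
LATE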

In [None]:
# This command appears to take a long time
# 2SLS coefficient on log price, for comparison with rho/phi
etaf <- d1_ivreg$coefficients["lavgprc"]
etaf

Another look at it, by hand:

In [None]:
# Coefficients on wave2, typed in by hand from the first-stage and reduced-form summaries
phi <- 0.112565
rho <- -0.09467
(LATE <- rho / phi)

<hr>

Finally, here is something YOU SHOULD NEVER DO because it gets the standard errors wrong. It is what I did back when I was 20 and did not know any better. The coefficients in 2SLS can be obtained by running the second stage on fitted values from the first stage. <i>But the standard errors will be incorrect. So do not do this.</i>

In [None]:
fish$lavgprc_hat1 <- d1_1st$fitted.values
head(fish)

In [None]:
# Second stage run by hand on first-stage fitted values (without overwriting demand_reg)
d1_2sls_badse <- lm(ltotqty ~ lavgprc_hat1 + mon + tues + wed + thurs, 
                    data = fish)
summary(d1_2sls_badse)
summary(d1_ivreg)
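
Why are the hand-rolled standard errors wrong? The manual second stage builds its residuals from the fitted values of the endogenous variable, while 2SLS residuals should use the actual endogenous variable together with the 2SLS coefficients. Here is a sketch with simulated data and one instrument, assuming homoskedastic errors; all names (`z`, `x`, `y`, `stage2`, `correct_se`) are hypothetical.

```r
# Simulated data (all values are assumptions, not the fish data)
set.seed(197)
n <- 1000
z <- rnorm(n)                     # hypothetical instrument
u <- rnorm(n)                     # unobserved confounder
x <- 0.5 * z + u + rnorm(n)       # endogenous regressor
y <- -1 * x + 2 * u + rnorm(n)    # outcome; true effect is -1

x_hat  <- fitted(lm(x ~ z))       # first-stage fitted values
stage2 <- lm(y ~ x_hat)           # manual second stage
b      <- coef(stage2)            # same point estimates as 2SLS

# Naive SE: lm() computes residuals using x_hat
naive_se <- summary(stage2)$coefficients["x_hat", "Std. Error"]

# Correct residuals plug the ACTUAL x into the estimated equation
resid_2sls <- y - b[1] - b[2] * x
sigma2     <- sum(resid_2sls^2) / (n - 2)
X_hat      <- cbind(1, x_hat)
correct_se <- sqrt(sigma2 * solve(crossprod(X_hat))[2, 2])

c(estimate = unname(b[2]), naive_se = naive_se, correct_se = correct_se)
```

The point estimate is fine either way, but the two standard errors differ. A dedicated routine such as `ivreg()` handles the residual correction for you, which is the reason to use it instead of running the two stages by hand.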

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>