<img src="images/econ140R_logo.png" width="200" />

In the following cell, please type your name and SID:

In the cell below, please write out the [Honor Code](https://teaching.berkeley.edu/berkeley-honor-code) to reaffirm you are abiding by it.

Did you work with other students? List them below. Please write your answers in your own words, not in theirs.

<h1>ECON 140R - Problem Set 4</h1>

<h2>INSTRUCTIONS</h2>

Please step through this problem set, copying and pasting code as needed, and run the code to produce output. Answer the questions asked, which appear in <font color="blue">blue font</font>. You will earn 100% of the credit on this problem set for <b>completing</b> it with working code and coherent answers. Answers do not need to be correct for full credit.

Throughout, <b>you may consult outside sources like the paper below or other commentary</b>, but your analysis should be in your own words. We will not run TurnItIn software on this Problem Set, but as usual, you should not borrow phrases without attribution or commit "mosaic plagiarism."

<h2>Background</h2>

In this problem set we will examine the `injury` dataset within the Wooldridge repository, which includes measures of workers' compensation utilization and worker characteristics in Kentucky and Michigan around 1980. The data were collected by the [National Council on Compensation Insurance (NCCI)](https://www.ncci.com/pages/default.aspx) and analyzed by [Meyer, Viscusi, and Durbin (1995)](https://www-jstor-org.libproxy.berkeley.edu/stable/2118177).

The scenarios in Kentucky (KY) and Michigan (MI) were similar, although of course the two states differed in terms of industries, incomes, and other characteristics. Both states <i>raised the maximum weekly benefit</i> that workers could receive, by a large amount: from \\$131 to \\$217 or by 66% in KY, and from \\$181 to \\$307 or by 70% in MI. Although high inflation during this period eroded some of the real value of these increases, they remained large even after adjusting for inflation.

<i>What effect did this policy change have, if any, on benefit claiming behavior?</i>

<b>Workers' compensation</b> plans differ in implementation across states, but the general structure replaces wages when a workplace injury prevents working. An injured worker's weekly workers' comp payment is a function of the worker's wage, via a "replacement rate," above a floor and up to a ceiling. The floor grants low-paid workers a minimum weekly insurance payout, and the ceiling places a cap on the total amount paid. In between, workers are paid perhaps two-thirds or 66.67% of their weekly wages, the replacement rate for Kentucky cited in the study.

As Meyer, Viscusi, and Durbin discuss, the most common type of payment is for temporary total disabilities, where a person is unable to work but is expected to recover fully and return to work. As they describe on p. 324, "temporary total claims have no fixed duration; their length is determined by the injured worker, his or her doctor, the employer, and its insurer."
    
Shown below, Figure 1 from [Meyer, Viscusi, and Durbin (1995)](https://www-jstor-org.libproxy.berkeley.edu/stable/2118177) illustrates this point and also the change associated with the increases in the maximum in Kentucky and Michigan in this time period.

<img src = "images/meyer-viscusi-durbin-aer95-fig1.png" width = 400/>
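
The benefit schedule described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the two-thirds replacement rate and the Kentucky caps of \\$131 and \\$217 come from the text, but the \\$40 floor and the example wages are hypothetical, and `weekly_benefit()` is a made-up helper rather than anything from the paper or the `wooldridge` package.

```r
# Hypothetical helper: weekly benefit = replacement rate x wage,
# raised to a floor and capped at a ceiling.
weekly_benefit <- function(wage, max_benefit, min_benefit = 40, rate = 2 / 3) {
  # min_benefit = 40 is an assumed floor, used only for illustration
  pmin(pmax(rate * wage, min_benefit), max_benefit)
}

wages <- c(50, 150, 300, 450)  # hypothetical weekly wages

before <- weekly_benefit(wages, max_benefit = 131)  # old Kentucky cap
after  <- weekly_benefit(wages, max_benefit = 217)  # new Kentucky cap

data.frame(wage = wages, before, after, change = after - before)
```

In this sketch only the two higher-wage workers see any change in their benefit, which is why the cap increase is a natural "treatment" for high earners only.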

In [None]:
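# Install the wooldridge data package only if it is not already available
if (!requireNamespace("wooldridge", quietly = TRUE)) install.packages("wooldridge")
library(tidyverse)  # also attaches dplyr
library(haven)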

The handy command below stops __R__ from defaulting to scientific notation.

In [None]:
options(scipen=999)
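
To see what `scipen` changes, here is a minimal sketch. The number 100000 is just an illustrative value, and the first line resets the option to R's default of 0 so that the comparison does not depend on earlier cells.

```r
# Restore R's default penalty so the comparison is reproducible
options(scipen = 0)
default_print <- format(100000)  # R picks the shorter scientific form

# A large penalty against scientific notation makes fixed notation win
options(scipen = 999)
fixed_print <- format(100000)

c(default = default_print, with_scipen = fixed_print)
```

The option only affects how numbers are displayed; the stored values are unchanged.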

<hr>

Let's load up and look at the data.

In [None]:
library("wooldridge")
data(injury, package = "wooldridge")
head(injury)

As usual, you can call the code below to pull up the documentation for the dataset, which reveals the variable descriptions.

In [None]:
?injury

As is customary inside the helpful `wooldridge` repository, a lot of variables have already been constructed for us. Variables with a leading lowercase "L" are natural logs of things. It is very common to log variables like prices or incomes, because they typically have long right tails. Let us have a look at a key variable: `durat`, the duration in weeks of the workers' compensation claim.

In [None]:
hist(injury$durat)
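
The point about long right tails can be illustrated with simulated data. The sketch below does not use the `injury` data at all: it draws hypothetical log-normal "durations" and compares a simple sample skewness measure before and after taking logs. The `skewness()` helper is defined here for illustration and is not part of base R.

```r
set.seed(140)

# Simulated right-skewed "durations" (log-normal draws), not the injury data
sim_durat <- rlnorm(5000, meanlog = 1, sdlog = 1)

# Sample skewness: roughly 0 for a symmetric, bell-shaped variable
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(sim_durat)
skew_log <- skewness(log(sim_durat))

c(raw = skew_raw, logged = skew_log)
```

A large positive value signals a long right tail, while a value near zero is what one would expect from a roughly symmetric distribution.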

<h4>
<font color="blue">Complete the code below and run it.</font>
    </h4>

In [None]:
# Here, look at the histogram of the natural logarithm of claim duration
hist(injury$ldurat)

<font color="blue">
    <h3>
    Question 1</h3>
    One of these histograms looks like a bell curve, or somewhat normally distributed, and one does not. Which one looks reasonably like a bell curve?
</font>

<hr>

<h2>Ordinary Least Squares</h2>

Let us therefore use a regression model to estimate the somewhat-normally-distributed variable as a function of things that we think matter for the duration of workers' comp benefits. Obvious choices for the determinants of benefit duration are the nature of the injury, the worker's industry, and the worker's age. Less obvious choices are the earnings class of the worker, gender, and marital status. 

If the nature of the injury were the only thing that drove what is ostensibly medical decisionmaking, one would expect that things like marital status should not matter for benefits duration unless they somehow proxied for health status. (As a parent of small children, I can attest that injuries appear to hurt a lot more than they used to now that I have additional care responsibilities, but that could also just be an age effect!)

First things first. Let us remove one Kentuckian who is listed as 98 years old. The next oldest is 81 and then two workers are listed at age 80. This suggests to me that the 98yo is actually an unrecoded "missing data" problem. Missings are often coded as 98 or 99, or 998 or 999.

In [None]:
# Drop an apparent miscoding using subset()
injury1 <- subset(injury, age < 98)
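
A quick sketch of how one might spot and drop such a code, using a small made-up data frame (`toy` and its ages are hypothetical, not the `injury` data). One detail worth knowing: `subset()` also drops rows where the condition evaluates to `NA`.

```r
# Hypothetical ages, including a suspicious 98 and one true missing value
toy <- data.frame(id = 1:7, age = c(25, 40, 80, 80, 81, 98, NA))

# Look at the largest values to spot a gap that hints at a missing-data code
head(sort(toy$age, decreasing = TRUE), 3)

# Keep rows with age below 98; the NA row is dropped as well
toy_clean <- subset(toy, age < 98)

nrow(toy) - nrow(toy_clean)  # number of rows removed
```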

Here is our first regression model:

$$
ldurat_{i} = \alpha + \beta^a \ age_{i} + \beta^m \ male_{i}
+ \sum_k \beta^k \ injury^k_{i} 
+ \sum_\eta \beta^\eta \ industry^\eta_{i} 
+ \epsilon_{i}
$$

The right-hand-side variables are <i>age</i> in years; an indicator variable for <i>male</i> identity; and sets of indicator variables for the injury, both by body location (<i>head</i>, <i>neck</i>, <i>upextr</i> for upper extremities, <i>trunk</i>, <i>lowback</i> for lower back, and <i>lowextr</i> for lower extremities) and by type (<i>occdis</i> for occupational disease). We also have indicators for industry: <i>manuf</i> for manufacturing, and <i>construc</i> for construction.

Let us estimate the model on a subset of the data: for Kentucky in the period before the policy change around 1980. The indicator variable <i>afchnge</i> equals 1 after the policy change and 0 before it.

<h4>
<font color="blue">Complete the code below and run it.</font>
    </h4>

In [None]:
# Alter the code below to use subset() to grab the observations 
# for which afchnge == 0 & ky == 1
injury1_ky_before <- subset(injury1, afchnge == 0 & ky == 1)

Now run the code below using that subset of the data we created.

In [None]:
ldurat_reg1 <- lm(ldurat ~ age + male + 
                  head + neck + upextr + trunk +
                  lowback + lowextr + occdis + 
                  manuf + construc, 
                  data = injury1_ky_before)
summary(ldurat_reg1)

<font color="blue">
    <h3>
    Question 2</h3>
    Look at the regression output above and describe what you see. 
    Do older workers spend more time on workers' comp, conditional on gender, injury location, and industry? Does that make sense to you?
    Of the coefficients on the various body parts of the injury, do any of the results seem strange? (Of these listed, what body part is arguably the most important? <i>This question probably does not have a single correct answer, but one really stood out to me.</i>)
</font>

<hr>

Now let us examine a model where we add an indicator for being married, $married_i$, and an indicator of having high earnings, $highearn_{i}$. The precise threshold of earnings is a little unclear; it is in the neighborhood of \\$350 per week, and [Meyer, Viscusi, and Durbin (1995)](https://www-jstor-org.libproxy.berkeley.edu/stable/2118177) discuss it in footnote 17. The regression model becomes:

$$
ldurat_{i} = \alpha + \beta^{h} \ highearn_{i} + \beta^{marr} \ married_{i} 
+ \beta^a \ age_{i} + \beta^m \ male_{i}
+ \sum_k \beta^k \ injury^k_{i} 
+ \sum_\eta \beta^\eta \ industry^\eta_{i} 
+ \epsilon_{i}
$$
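
Because the left-hand-side variable is in logs, the coefficient on an indicator variable is approximately a proportional difference in duration, and the exact proportional difference is $e^{\beta} - 1$. A minimal sketch of the conversion, using a purely hypothetical coefficient of 0.20 (not an estimate from these data):

```r
# Hypothetical coefficient on an indicator in a log-outcome regression
b_hypothetical <- 0.20

# Common approximation: 100 * b percent
approx_pct <- 100 * b_hypothetical

# Exact percentage difference implied by the log specification
exact_pct <- 100 * (exp(b_hypothetical) - 1)

c(approximate = approx_pct, exact = exact_pct)
```

The two numbers are close when the coefficient is small and drift apart as it grows.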


<h4>
<font color="blue">Complete the code below and run it.</font>
    </h4>

In [None]:
ldurat_reg2 <- lm(ldurat ~ highearn + married + 
                  age + male +
                  head + neck + upextr + trunk +
                  lowback + lowextr + occdis + 
                  manuf + construc, 
                  data = injury1_ky_before)
summary(ldurat_reg2)

<font color="blue">
    <h3>
    Question 3</h3>
    Look at the regression output above and describe what you see. 
    What is the coefficient on $highearn$? Is it statistically significant? Recalling the units of the left-hand-side variable, state what the coefficient on $highearn$ means in terms of the duration of workers' comp between two injured Kentuckians who are observationally identical except that one is high-earning.
    </font>

<hr>

<font color="blue">
    <h3>
    Question 4</h3>
    Does this result surprise you, as an economist? Do you think it would surprise a typical person? 
    </font>

<hr>

<h2>Difference in differences</h2>

The raising of the benefit cap around 1980 may have had no effect on benefits claiming if claiming behavior were unrelated to earnings. But claiming behavior clearly was related to earnings, as we have just seen. Because the cap was only a constraint for high earners and not for low earners, a natural approach to explore is <b>difference-in-differences (DID)</b> estimation.

<font color="blue">
    <h3>
    Question 5</h3>
    According to <i>Mastering Metrics</i> and class discussion, what is the assumption underlying DID? Describe and discuss in this context, where unfortunately it is not directly testable in these data. Take a stand about whether you think DID is appropriate or not for this problem. 
    </font>

<font color="blue">
    <h3>
    Question 6</h3>
    Just to put a fine point on this: what do your second regression results above reveal about the level of benefits duration among high earners versus low earners? Is this a problem for the DID approach? 
    </font>

<hr>

When $highearn_i$ defines the treatment group, and $afchnge_t$ defines the post-treatment period, the simplest regression DID approach is to estimate this equation using stacked or pooled data:

$$
ldurat_{it} = \alpha + \beta^h \ highearn_i + \gamma \cdot {afchnge}_t 
+ \delta_{rDID} 
\left( 
highearn_i \times {afchnge}_t
\right)
+ \epsilon_{it}
$$

where I have inserted a dot multiplier ("$\cdot$") after $\gamma$ to make the notation clearer.  The setup is that the treatment group gets a fixed effect equal to $\beta^h$, and both control and treatment groups get a time fixed effect $\gamma$. Then $\delta_{rDID}$ is what remains, the difference in differences experienced by the treatment group.
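
With only two indicators and their interaction on the right-hand side, the interaction coefficient is exactly the difference in differences of the four group means. The sketch below checks this on simulated data; the names `treat`, `post`, `treatxpost`, and all of the numbers are hypothetical stand-ins, not the `injury` variables or results.

```r
set.seed(140)

# Balanced toy design: 2 groups x 2 periods, 100 observations per cell
toy <- data.frame(
  treat = rep(c(0, 1), each = 200),
  post  = rep(c(0, 1), times = 200)
)
toy$treatxpost <- toy$treat * toy$post

# Simulated outcome with a built-in interaction effect of 0.2
toy$y <- 1 + 0.3 * toy$treat + 0.1 * toy$post +
  0.2 * toy$treatxpost + rnorm(nrow(toy))

# Difference in differences computed from the four cell means
m <- tapply(toy$y, list(toy$treat, toy$post), mean)
did_means <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# The same quantity from the regression
fit <- lm(y ~ treat + post + treatxpost, data = toy)
did_reg <- coef(fit)[["treatxpost"]]

c(from_means = unname(did_means), from_regression = did_reg)
```

The regression version is convenient because it delivers a standard error along with the estimate and extends easily to models with extra covariates.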

First things first, let us select observations pre and post but only from Kentucky.

In [None]:
injury1_ky <- subset(injury1, ky == 1)

Now let us create the interaction term by multiplying the treatment group indicator variable by the post-treatment period indicator:

In [None]:
injury1_ky <- mutate(injury1_ky, highearnxafchnge = highearn * afchnge)
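
As an aside, R's formula syntax can build the interaction for us: `a * b` in a formula expands to `a + b + a:b`. The sketch below, on simulated data with hypothetical names `treat` and `post`, suggests that the two routes give the same interaction coefficient.

```r
set.seed(1)

# Toy data: group and period indicators plus a simulated outcome
toy <- data.frame(
  treat = rep(c(0, 1), each = 50),
  post  = rep(c(0, 1), times = 50)
)
toy$y <- rnorm(nrow(toy)) + 0.5 * toy$treat * toy$post

# Route 1: build the interaction column by hand
toy$treatxpost <- toy$treat * toy$post
fit_manual <- lm(y ~ treat + post + treatxpost, data = toy)

# Route 2: let the formula expand treat * post into treat + post + treat:post
fit_formula <- lm(y ~ treat * post, data = toy)

c(manual  = coef(fit_manual)[["treatxpost"]],
  formula = coef(fit_formula)[["treat:post"]])
```

Creating the column explicitly, as we do here, keeps the variable name readable in the regression output.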

Finally, let us run the simple regression DID:

In [None]:
ldurat_reg_did1 <- lm(ldurat ~ highearn + afchnge + highearnxafchnge, 
                  data = injury1_ky)
summary(ldurat_reg_did1)

<font color="blue">
    <h3>
    Question 7</h3>
    Describe what you see in the results here. Which group uses benefits for longer? Is that consistent with your second OLS regression above? 
    Is the common time trend statistically significant?
    What is the DID estimate of the effect of raising the benefit cap on the duration of workers' comp benefits? Is it statistically significant?
    </font>

<hr>

A reasonable approach is to enrich this simple DID model with the other variables that we included earlier. That model is written as:

$$
\begin{aligned}
ldurat_{it} & = \alpha + \beta^h \ highearn_i + \gamma \cdot {afchnge}_t 
+ \delta_{rDID} 
\left( 
highearn_i \times {afchnge}_t
\right)
\\
& \quad
+ \beta^{marr} \ married_{i} 
+ \beta^a \ age_{i} + \beta^m \ male_{i}
+ \sum_k \beta^k \ injury^k_{i} 
+ \sum_\eta \beta^\eta \ industry^\eta_{i} 
+ \epsilon_{it}
\end{aligned}
$$

In [None]:
ldurat_reg_did2 <- lm(ldurat ~ highearn + afchnge + highearnxafchnge +
                      married + age + male +
                      head + neck + upextr + trunk +
                      lowback + lowextr + occdis + 
                      manuf + construc, 
                      data = injury1_ky)
summary(ldurat_reg_did2)

<font color="blue">
    <h3>
    Question 8</h3>
    Repeat your analysis in Question 7 for these results: Which group uses benefits for longer? Is that consistent with your second OLS regression above? 
    Is the common time trend statistically significant?
    What is the DID estimate of the effect of raising the benefit cap on the duration of workers' comp benefits? Is it statistically significant?
    </font>

<hr>

Suppose we ignored the DID estimator structure and just ran the panel data regression using $highearn$ as a treatment group fixed effect and $afchnge$ as a time fixed effect. What kind of inference would emerge?

In [None]:
ldurat_reg_panelfe <- lm(ldurat ~ highearn + afchnge +
                      married + age + male +
                      head + neck + upextr + trunk +
                      lowback + lowextr + occdis + 
                      manuf + construc, 
                      data = injury1_ky)
summary(ldurat_reg_panelfe)

<font color="blue">
    <h3>
    Question 9</h3>
    Please answer familiar questions based on these results: 
    Which group uses benefits for longer? Is that consistent with your second OLS regression above? 
    Is the common time trend statistically significant?
    </font>

<hr>

<h2>Wrapping up</h2>

Stepping back, let us articulate the conceptual power and usefulness of the DID estimator.   

<font color="blue">
    <h3>
    Question 10</h3> 
    Look at the results of <i>ldurat_reg_panelfe</i> above, the last regression, which is a panel regression with fixed effects, and assess this remark ("true or false and discuss"): 
    </font>
    <p>
        <font color="blue">
    <i>Taken in isolation, that regression tells us nothing about causal influences on benefits duration. High earners take benefits longer, but we cannot really say why. Benefit duration is longer later in time, but we cannot really say why.</i>
        </font>
    </p>

<hr>

<font color="blue">
    <h3>
    Question 11</h3> 
    Look at the results of <i>ldurat_reg_did2</i> above, the second-to-last regression, which is a panel regression with fixed effects and a time/group interaction, and assess this remark ("true or false and discuss"): 
    </font>
    <p>
        <font color="blue">
    <i>Taken in isolation, that regression tells us nothing about causal influences on benefits duration. High earners take benefits longer, but we cannot really say why. Benefit duration stays constant over time, and we cannot really say why.</i>
        </font>
    </p>

<hr>

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>