# EEP/IAS 118 - Introductory Applied Econometrics
## Problem Set 5, Spring 2022, Villas-Boas
#### <span style="text-decoration: underline">Due on Gradescope – April 21, 2022 at 11:59pm</span>


Submit materials (all handwritten/typed answers, Excel workbooks, and R reports) as one combined PDF on [Gradescope](https://www.gradescope.com/courses/353120). All students currently on the EEP118 bCourses have been added using the bCourses email. If you do not have access to the Gradescope course, please reach out to the GSIs.

For help combining PDFs, see the "Tips for Saving/Combining PDFs" announcement on bCourses.

For the purposes of this class, we will primarily be using Berkeley's _Datahub_ to conduct our analysis remotely using these notebooks.

On _Datahub_, the data files can be accessed directly and you do not need to install anything on your computer. If instead you already have an installation of R/RStudio on your personal computer and prefer to work offline, you can download the data for this assignment from bCourses. (Make sure to install or update all packages mentioned in the problem sets to prevent issues with deprecated or outdated packages.)

### Preamble
When writing R code, it's a good habit to start your notebooks or R scripts with a preamble, a section where you load all necessary packages, set paths or change the working directory, or declare other options.

Use the code cell below to load the packages you will use throughout the problem set (at least `haven`, `tidyverse`, and `ggplot2` this week). 

*Note:* **never** try to install packages on Datahub. All packages that you need are already installed and can be loaded immediately using the `library()` function. Attempting to install packages will create conflicts with the package versions on the server and potentially corrupt your notebook.
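
As a minimal sketch of such a preamble (one possible layout, not the required one), the cell below loads the packages named above without ever installing anything. The defensive `requireNamespace()` check is an addition for illustration; on Datahub, plain `library(haven)` and `library(tidyverse)` calls should be enough, and `ggplot2` is attached as part of `tidyverse`.

```r
# Packages named in this problem set; ggplot2 is attached as part of tidyverse.
pkgs <- c("haven", "tidyverse")

# Load each package only if it is already installed.
# Nothing here calls install.packages(), in line with the Datahub note above.
for (p in pkgs) {
  if (requireNamespace(p, quietly = TRUE)) {
    library(p, character.only = TRUE)
  } else {
    message("Package not available in this environment: ", p)
  }
}
```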

## Exercise 1: Using Hedonics to measure the economic cost of landfills

In this exercise you will use data on housing prices in two years for a sample of houses, together with information on the announcement of the construction of a landfill. Characteristics of the houses in the sample are also available.

Note that several of the problems require you to produce custom summary statistics and regression tables. For more information on how to produce these types of tables, see the Coding Bootcamps posted on Datahub.

### Data description


<center><b> Readme for data variables </b></center>

|Variable name 	|	Definition	|
|-----------|---------------|
|year	| 1978 or 1981 |
|age	| Age of the house, in years |
|nbh	| Neighborhood identification number, from 1 to 6 |
|price	| Selling price of the house  |	
|rooms	| Number of rooms in the house |
|area	| Square footage of house	| 
|land	| Square footage of lot	| 
|baths | Number of bathrooms	| 
|dist  | Distance from house to landfill, in feet |
|rprice | Price, in 1978 dollars| 

The first dependent variable of interest is whether there will be a landfill near a certain house in the sample. Let the variable **landfill** = 0 or 1, where 1 means the house ever had a landfill installed near it and zero otherwise. You will use a linear probability model to explain the probability of a region having a landfill as a function of observables before any landfill was installed. You will also estimate a logit specification, interpret marginal effects, and perform hypothesis testing.  
 
*Note: in economics, log always refers to the natural log, ln().*

_Note: The dataset is in Stata format. It is available on bCourses and is called **pset5_2022.dta**._

1. What is the number of observations in the whole dataset? Define a new variable called **treated** equal to one if a house is located 15000 ft or less from an upcoming landfill (to be installed in 1981) and equal to zero otherwise. Then filter the data to keep only observations for 1978. How many observations do we lose by keeping only the year 1978? 

_Hint: Use `ifelse()`. Use `?ifelse()` to see the syntax for this command._ 
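
To illustrate the `ifelse()` syntax, here is a small sketch on made-up data. The data frame `toy` and the variable `near` are hypothetical names invented for this example; only `year` and `dist` mirror column names from the readme, and the values are not from the problem set data.

```r
# Toy data (invented values), with the same column names as the readme.
toy <- data.frame(
  year = c(1978, 1978, 1981, 1981),
  dist = c(9000, 20000, 15000, 30000)
)

# ifelse(test, value_if_true, value_if_false) works element by element.
toy$near <- ifelse(toy$dist <= 15000, 1, 0)

# Keep only the rows for one year and count how many rows were dropped.
toy_1978 <- toy[toy$year == 1978, ]
rows_dropped <- nrow(toy) - nrow(toy_1978)
```

The same pattern works inside `dplyr::mutate()` and `dplyr::filter()` if you prefer tidyverse syntax.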

In [None]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 1_ here (replacing this text).

2. Using the filtered data, how many observations do we have for treated houses, and how many for control (untreated) houses in 1978, respectively? How many observations for treated and untreated houses in 1981?

In [None]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 2_ here (replacing this text).

3. Summarize the averages and standard deviations of the four variables: **rooms**, **land**, **area**, and **age**, by treatment status in the year 1978. This gives you the summary stats of observable housing variables before any landfill was constructed. 

First, create a table, Table 1, with two columns: the first for the **treated=0** group and the second for the **treated=1** group. For **rooms**, report the average in row 1 and the standard deviation in row 2. Continue with the averages and standard deviations of the remaining house characteristics in the following rows (row 3: average of **land**; row 4: standard deviation of **land**; etc.). In the last two rows, report the average and standard deviation of **price**. Make sure to include the table number and a title at the top of the table.

_Hint: see the "Summary Statistics Tables in Stargazer" section of Coding Bootcamp Part 5. If this section isn't visible, make sure to re-follow the Datahub link posted on bCourses._
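
Independently of how the final table is formatted, the underlying numbers are group means and standard deviations. A base-R sketch on invented data follows; `toy` and `group` are hypothetical names, and the bootcamp's Stargazer workflow may organize this differently.

```r
# Toy data (invented values): one characteristic and a 0/1 group indicator.
toy <- data.frame(
  group = c(0, 0, 1, 1),
  rooms = c(5, 7, 6, 8)
)

# Mean and standard deviation of rooms within each group.
means <- aggregate(rooms ~ group, data = toy, FUN = mean)
sds   <- aggregate(rooms ~ group, data = toy, FUN = sd)

# Stack into a small table: one column per group, mean above SD.
summary_table <- rbind(mean_rooms = means$rooms, sd_rooms = sds$rooms)
colnames(summary_table) <- paste0("group=", means$group)
```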

In [None]:
# insert your code for 3 here

➡️ Type your written answer for _Exercise 1 Question 3_ here (replacing this text).

4. Repeat the same exercise using only data for 1981 (the year the landfill was constructed). Call this Table 2. 

In [None]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 4_ here (replacing this text).

5. Compute the difference in average house prices in treated and not treated houses in 1978 and compute the same difference in average price in 1981. How does that difference in average house price, between Treated and Not Treated homes, change from 1978 to 1981?  


In [None]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 5_ here (replacing this text).

6. This part uses **data for 1978 only.** We want to see what factors, in 1978, are correlated with the probability of a house being treated (treated=1/0, where treated houses are those that will be near the landfill once it is constructed in 1981). 

Estimate three (3) linear probability model regressions (as specified below) and present the estimates in a three-column table called Table 3. Make sure you use robust standard errors in all regressions. The dependent variable in all columns is the indicator variable for being treated. In the table, mark coefficients that are significant at the 10% level with one star (`*`), those significant at the 5% level with two stars (`**`), and those significant at the 1% level with three stars (`***`). 

   i) For the regression in column 1, specify a constant (i.e. intercept) and rooms as regressors.

   ii) In column 2, add the land variable to the constant and rooms.

   iii) In column 3, present the estimates and standard errors from the regression of the treatment indicator on a constant, rooms, land, area, and age. 

_Hint: As in lecture 19, I am asking you to run separate regressions and present the estimates in a table with 3 columns. See Coding Bootcamp Part 5 for help producing these tables._
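
To make concrete what "robust standard errors" means in a linear probability model, the sketch below computes heteroskedasticity-robust (HC1) standard errors by hand on simulated data. The data, the variable names, and the choice of the HC1 correction are assumptions for illustration; the bootcamp may use a helper function or a different small-sample correction.

```r
set.seed(1)

# Simulated data (not the problem set data): binary outcome, one regressor.
n <- 200
x <- rnorm(n)
y <- as.numeric(runif(n) < plogis(0.5 * x))

# Linear probability model: OLS with a 0/1 dependent variable.
fit <- lm(y ~ x)

# Sandwich estimator: (X'X)^-1 [sum_i e_i^2 x_i x_i'] (X'X)^-1, scaled by n/(n-k).
X     <- model.matrix(fit)
e     <- resid(fit)
k     <- ncol(X)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)   # each row of X is multiplied by its own residual
vcov_hc1  <- (n / (n - k)) * bread %*% meat %*% bread
robust_se <- sqrt(diag(vcov_hc1))

# Conventional (homoskedasticity-only) standard errors for comparison.
classic_se <- sqrt(diag(vcov(fit)))
```

The coefficient estimates are identical under both approaches; only the standard errors, and therefore the significance stars, can differ.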

In [None]:
# insert your code here

a) Looking at the whole table, which coefficient measures the estimated change in treatment probability when the number of rooms of the house changes by one, **controlling for no other covariates**? What is its value and standard error? Is it significantly different from 0?

b) What does the estimated constant mean in column 1? Is it significantly different from zero?

c) Which coefficient measures the estimated change in treatment probability if lot size increases by one unit holding rooms, age, and area constant? Is it statistically significant at the 5 percent level?

➡️ Type your answer to _Exercise 1 Question 6a_ here (replacing this text)

➡️ Type your answer to _Exercise 1 Question 6b_ here (replacing this text)

➡️ Type your answer to _Exercise 1 Question 6c_ here (replacing this text)

7.	Run the linear probability model with the variables in column (3) of Table 3 __without robust standard errors__. What happened to the coefficient estimates and to the significance of the estimates? Which problem does this highlight?

In [4]:
# insert code here

➡️ Type your answer to _Exercise 1 Question 7_ here (replacing this text)

8.	Estimate the same right-hand side specification as in column 3 of Table 3 above but now using a Logit model. After you estimate the model, type the marginal effects command in R as discussed in lecture. What do you conclude in terms of the marginal effect of the number of rooms on the treatment probability at 1% significance? What about lot size at 5% significance? And area at 10% significance?
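
As a rough illustration of what a marginal effect in a logit model is, the sketch below fits a logit on simulated data and computes the average marginal effect of one regressor by hand. This is not the marginal effects command from lecture; the data and names are invented, and a lecture command may report effects at the mean of the regressors rather than averaged over observations.

```r
set.seed(2)

# Simulated data (not the problem set data).
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.3 + 0.8 * x))

# Logit model estimated by maximum likelihood.
logit_fit <- glm(y ~ x, family = binomial(link = "logit"))

# For a logit, dP/dx = f(x'b) * b, where f is the logistic density.
# Averaging f(x'b) over the sample gives the average marginal effect.
xb    <- predict(logit_fit, type = "link")
ame_x <- mean(dlogis(xb)) * coef(logit_fit)["x"]
```

Because the logistic density never exceeds 0.25, the marginal effect is always smaller in magnitude than the raw logit coefficient, which is why the two should not be compared directly.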

In [None]:
# insert your code here

➡️ Type your answer to _Exercise 1 Question 8_ here (replacing this text)

9.	Estimate a Logit model that, together with the estimation output from Question 8, allows you to test whether the additional covariates added in column 3 of Table 3 relative to column 1 matter for the landfill treatment probability or not. What do you conclude? 

_Hint: Do the five steps of Hypothesis testing by hand, do not use R’s built-in test command._
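
One common way to test several logit coefficients jointly is a likelihood ratio test, sketched below on simulated data. That this is the test intended here is an assumption; the data, variable names, and the 5% level are chosen only for illustration.

```r
set.seed(3)

# Simulated data (not the problem set data): x2 and x3 truly have no effect.
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.2 + 0.5 * x1))

# Unrestricted model includes the extra covariates; restricted model drops them.
unrestricted <- glm(y ~ x1 + x2 + x3, family = binomial)
restricted   <- glm(y ~ x1, family = binomial)

# LR statistic = 2 * (logL_unrestricted - logL_restricted), chi-squared with q df.
q       <- 2   # number of restrictions under the null
lr_stat <- 2 * (as.numeric(logLik(unrestricted)) - as.numeric(logLik(restricted)))
crit    <- qchisq(0.95, df = q)

# Reject the null if the statistic exceeds the critical value.
reject <- lr_stat > crit
```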

In [None]:
# insert your code here

➡️ Type your answer to _Exercise 1 Question 9_ here (replacing this text)

10.	Create the necessary variables and then estimate a Logit model that allows you to test whether the impact of the number of rooms is different depending on the size of the house.  What do you conclude? 

_Hint: Do the five steps of Hypothesis testing by hand, do not use R’s built-in test command._
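
When the effect of one regressor is allowed to depend on another, the usual device is an interaction term, and a single-coefficient test on that term can be done with a z statistic. The sketch below uses simulated data with hypothetical names (`x1`, `x2`, `x1_x2`) and a two-sided 5% test as an example.

```r
set.seed(10)

# Simulated data (not the problem set data) with a true interaction effect.
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.1 + 0.4 * x1 + 0.3 * x2 + 0.5 * x1 * x2))

# Create the interaction variable explicitly, then include it in the logit.
x1_x2 <- x1 * x2
fit   <- glm(y ~ x1 + x2 + x1_x2, family = binomial)

# z statistic for H0: coefficient on the interaction equals zero.
est    <- coef(summary(fit))["x1_x2", "Estimate"]
se     <- coef(summary(fit))["x1_x2", "Std. Error"]
z_stat <- est / se
crit   <- qnorm(0.975)   # two-sided test at the 5% level
reject <- abs(z_stat) > crit
```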

In [None]:
# insert your code here

➡️ Type your answer to _Exercise 1 Question 10_ here (replacing this text)

## Exercise 2: Effects of the landfill treatment

This question focuses on the effects of the landfill treatment on housing price (`price`). If being near a landfill lowers the value of a house, prices of treated houses should fall relative to prices of untreated houses after the landfill is installed, compared with before its installation. Let `treated*after` be the interaction of **treated** and an indicator variable called **after** (**after** is equal to one if the year is 1981, 0 otherwise).
Let the two models (2.1) and (2.2) be given by the following regressions: 

(2.1) $ price_{it} = \beta_0 + \beta_1 treated_{i} + \beta_2 after_t + \beta_3 treated_i * after_t + u_{it} $

(2.2) $ price_{it} = \beta_0 + \beta_1 treated_{i} + \beta_2 after_t + \beta_3 treated_i * after_t + \beta_4 area_{it} + \beta_5 land_{it} + \beta_6 age_{it} + v_{it} $
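
The structure of a regression like (2.1) can be explored on a tiny invented dataset. The sketch below (hypothetical data frame `toy` with outcome `y`) fits the interacted model and also computes the four cell means, so that the regression coefficients can be compared with differences in group averages.

```r
# Toy data (invented values): two observations in each treated-by-after cell.
toy <- data.frame(
  treated = c(0, 0, 0, 0, 1, 1, 1, 1),
  after   = c(0, 0, 1, 1, 0, 0, 1, 1),
  y       = c(10, 12, 15, 17, 20, 22, 21, 23)
)

# treated * after expands to treated + after + treated:after.
fit <- lm(y ~ treated * after, data = toy)

# Cell means by treatment status (rows) and period (columns).
cell_means <- tapply(toy$y, list(treated = toy$treated, after = toy$after), mean)

# Difference in differences computed directly from the cell means.
did <- (cell_means["1", "1"] - cell_means["1", "0"]) -
       (cell_means["0", "1"] - cell_means["0", "0"])
```

In this saturated model with no other covariates, the coefficient on `treated:after` reproduces `did` exactly; once covariates are added, as in (2.2), that equality with raw cell means no longer has to hold.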

1. In equation (2.1), what does the coefficient $\beta_0$ measure? What does $\beta_1$ measure?

➡️ Type your answer to _Exercise 2 Question 1_ here (replacing this text)

2. In equation (2.1), what does the coefficient $\beta_2$ measure? What about $\beta_3$?

➡️ Type your answer to _Exercise 2 Question 2_ here (replacing this text)

3. Under what conditions can we interpret the coefficient $\beta_3$ in equation (2.1) (the difference-in-differences estimate) as the causal impact on house prices of having a landfill sited near a house? 
What is a simple set of tests you could run to support this? Do not run these tests, but explain what data you would collect and use and what tests you would run. Does any of the analysis in Exercise 1 also help address this concern?

In [None]:
# insert your code here

➡️ Type your answer to _Exercise 2 Question 3_ here (replacing this text)

4. Estimate both models and display the estimates in a table with two columns. Place the estimates from model (2.1) in column 1 and the results from model (2.2) with additional covariates in column 2. 
Does adding the covariates affect the estimated coefficient of the landfill treatment (the coefficient $\beta_3$ on the interaction `treated*after`)? How do you interpret the point estimate of $\beta_3$ in column (2)? (Discuss Size, Sign, and Significance.)

In [None]:
# insert your code here

➡️ Type your answer to _Exercise 2 Question 4_ here (replacing this text)

5. Estimate the model in column (1) again, adding **land** as a regressor. Add the results as column (3) in the table above. 
Looking at the change in the $\beta_1$ estimate (the row for **treated**) from column (1) to column (3), what does it imply about the correlation between being “treated” and the lot size (**land**) of a house?

In [None]:
# insert your code here

➡️ Type your answer to _Exercise 2 Question 5_ here (replacing this text)

## Exercise 3: Gasoline and environmental regulation

Does environmental content regulation of gasoline decrease emissions? Areas with pollution greater than or equal to a threshold $X$ in 2021 get regulated in 2022, and areas with pollution below $X$ in 2021 do not. 
Cities in the US with pollution levels in 2021 greater than or equal to $X$ were treated with environmental content regulation of gasoline (which makes gasoline cleaner but more costly to produce). Cities with pollution in 2021 below the threshold were not subject to this content regulation. You have data over time $t$ and for a random sample of cities $j$.

1.	How would you estimate the causal effect of environmental regulation on pollution? Write down the exact regression and define each variable. Also say which coefficient is interpreted as the causal effect of the gasoline content regulation.

➡️ Type your answer to _Exercise 3 Question 1_ here (replacing this text)

2. What assumption is key for you to interpret the coefficient as a causal effect of the environmental regulation aimed at reducing air pollution?

➡️ Type your answer to _Exercise 3 Question 2_ here (replacing this text)

3. If given city-level data over time on emergency room admissions due to asthma, how would you estimate the causal effect of regulation on the occurrence of asthma attacks?

➡️ Type your answer to _Exercise 3 Question 3_ here (replacing this text)

4. If given data on gasoline prices over time and at the city level, how would you estimate the causal effects of regulation on gasoline prices consumers face?

➡️ Type your answer to _Exercise 3 Question 4_ here (replacing this text)

## Exercise 4: Diff-in-diff

Equation (4.1), where $i$ denotes the missing intercept:
$$ \text{Predicted Outcome}_{jt} = i - 80 \, After_{jt} + 35 \, Treated_{jt} + 25 \, After_{jt} \times Treated_{jt} $$

1. Using the above estimated equation and the numbers below, complete equation (4.1) and fill out the table below:

|Average outcome 	|	Control, **Treated** = 0	| Treatment, **Treated** = 1	| Difference: T-C (in each row)
|-----------|---------------|---------------|---------------|
|Before, **After** = 0	| 100 | $a$ | $b$ |
|After, **After** = 1 | $e$ | $d$ | $c$ |
|Difference: After-Before (in each column)	| $f$ | $g$ | D-in-D = $h$|
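
The mapping from an estimated diff-in-diff equation to a two-by-two table of averages can be practiced with made-up coefficients. The numbers in the sketch below are hypothetical and deliberately different from equation (4.1); `predict_cell()` is a helper invented for this illustration.

```r
# Hypothetical coefficients (NOT those of equation 4.1).
b <- c(intercept = 50, after = 10, treated = 5, interaction = -3)

# Predicted average outcome for a given (after, treated) cell.
predict_cell <- function(after, treated) {
  unname(b["intercept"] + b["after"] * after + b["treated"] * treated +
           b["interaction"] * after * treated)
}

control_before <- predict_cell(after = 0, treated = 0)  # intercept only
treated_before <- predict_cell(after = 0, treated = 1)
control_after  <- predict_cell(after = 1, treated = 0)
treated_after  <- predict_cell(after = 1, treated = 1)  # all four terms

# After-minus-before change for each group, then the difference of the two.
did <- (treated_after - treated_before) - (control_after - control_before)
```

With these made-up numbers the four cells are 50, 55, 60, and 62, and `did` equals the interaction coefficient, -3.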

In [5]:
# insert your code for letter a in Exercise 4 here

➡️ Type your answer to letter $a$ in Exercise 4

In [None]:
# insert your code for letter b in Exercise 4 here

➡️ Type your answer to letter $b$ in Exercise 4

In [None]:
# insert your code for letter c in Exercise 4 here

➡️ Type your answer to letter $c$ in Exercise 4

In [6]:
# insert your code for letter d in Exercise 4 here

➡️ Type your answer to letter $d$ in Exercise 4

In [None]:
# insert your code for letter e in Exercise 4 here

➡️ Type your answer to letter $e$ in Exercise 4

In [None]:
# insert your code for letter f in Exercise 4 here

➡️ Type your answer to letter $f$ in Exercise 4

In [7]:
# insert your code for letter g in Exercise 4 here

➡️ Type your answer to letter $g$ in Exercise 4

In [8]:
# insert your code for letter h in Exercise 4 here

➡️ Type your answer to letter $h$ in Exercise 4

In [9]:
# insert your code for letter i in Exercise 4 (from the equation) here

➡️ Type your answer to letter $i$ in Exercise 4 (from the equation)

## Exercise 5: Wages and union effects

The table below shows the results from estimating models of wage of individual $j$ and year $t$, with the explanatory variables in rows, and a different specification in each column.

Column 1 has no controls other than a dummy for the individual being in a union in year $t$. Column 2 adds characteristics of the individual. Column 3 adds year fixed effects to the specification in column 2 (we can do this because we see each individual for many years in the data). Finally, column 4 has, in addition to everything in column 3, individual fixed effects (an indicator variable for each of the individuals, which we can do because we see each individual over many years in the data). 

![ex5%20ps4.png](attachment:ex5%20ps4.png)


1. How many individual fixed effects for the 545 individuals are estimated in column (4) given that we have a constant?

➡️ Type your answer to _Exercise 5 Question 1_ here (replacing this text)

2. Why do the variables **black** and **hisp** not get estimated in column (4)?

➡️ Type your answer to _Exercise 5 Question 2_ here (replacing this text)

3. If we have 8 years in the dataset, how many year fixed effects can we have in column (3)? Which year was dropped in column (3)? And given that, how do you interpret the estimate of the constant in column (3)? 

_(Note that the middle years are omitted from the table printout in columns (3) and (4) to save space, but they are estimated.)_

➡️ Type your answer to _Exercise 5 Question 3_ here (replacing this text)

4. In column (3), the estimated fixed effect for year 81 is positive. This means that log wage is larger in year 81 relative to year 80 (and the difference is significant). How does log wage change in year 87 relative to year 81? What is the estimated average change in wage between the same years?
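
When the dependent variable is in logs, a difference between two year effects is a difference in log points, which can be converted to a percentage change in the level. The sketch below uses hypothetical year coefficients (`b_81` and `b_87` are invented values, not taken from the table).

```r
# Hypothetical year fixed effects, both relative to the same omitted base year.
b_81 <- 0.12
b_87 <- 0.47

# Change in log wage between the two years.
log_diff <- b_87 - b_81

# Approximate percentage change (reasonable when log_diff is small).
approx_pct <- 100 * log_diff

# Exact percentage change implied by the log difference.
exact_pct <- 100 * (exp(log_diff) - 1)
```

With these invented numbers the log difference is 0.35, so the approximation gives 35% while the exact conversion gives about 41.9%; the gap grows with the size of the log difference.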

In [10]:
# insert your code here

➡️ Type your answer to _Exercise 5 Question 4_ here (replacing this text)

5. Comparing the output from column (3) to (2) and then (4) to (3), which fixed effects improve the fit ($R^2$) the most? Time or individual fixed effects?

➡️ Type your answer to _Exercise 5 Question 5_ here (replacing this text)

**You should use exercise5.dta for the following questions**


6. a) Estimate specifications 5.1, 5.2, 5.3, and 5.4 (replicate all columns). Add a new specification, in column 5, by adding **experience**, **experience squared**, and **married** to the regressors already included in column 4. 

    b) Test the null hypothesis that the coefficients on experience, experience squared, and married are all jointly zero. Conduct the test by hand using the 5 steps of hypothesis testing. Do you reject the null at the 10% level? 

_Hint: See Coding Bootcamp Part 5, which explains fixed effects regressions. See also the R code from the lecture on panel data and fixed effects, specifically the death penalty analysis example._
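
A joint test in a fixed effects regression can be built from the sums of squared residuals of a restricted and an unrestricted model. The sketch below uses a simulated panel and plain `lm()` with `factor()` dummies; the bootcamp may use a dedicated fixed effects function instead, and all names and values here are invented for illustration.

```r
set.seed(4)

# Simulated balanced panel (not exercise5.dta): 30 individuals, 5 years.
n_id  <- 30
n_t   <- 5
panel <- expand.grid(id = 1:n_id, year = 1:n_t)
alpha <- rnorm(n_id)                      # individual-specific effects
panel$x1 <- rnorm(nrow(panel))
panel$x2 <- rnorm(nrow(panel))
panel$y  <- alpha[panel$id] + 0.1 * panel$year + 0.5 * panel$x1 + rnorm(nrow(panel))

# Unrestricted: includes x1 and x2. Restricted: imposes both coefficients = 0.
unrestricted <- lm(y ~ x1 + x2 + factor(year) + factor(id), data = panel)
restricted   <- lm(y ~ factor(year) + factor(id), data = panel)

# F = [(SSR_r - SSR_u) / q] / [SSR_u / (n - k - 1)]
ssr_u  <- sum(resid(unrestricted)^2)
ssr_r  <- sum(resid(restricted)^2)
q      <- 2
df_u   <- df.residual(unrestricted)
f_stat <- ((ssr_r - ssr_u) / q) / (ssr_u / df_u)

# Critical value at the 10% level and the resulting decision.
crit   <- qf(0.90, df1 = q, df2 = df_u)
reject <- f_stat > crit
```

Note that the denominator degrees of freedom must account for every estimated dummy, which is why `df.residual()` is used rather than counting only the named regressors.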

In [None]:
# read in the new dataset

In [None]:
# insert your code here for Question 6a

In [None]:
# insert your code here for Question 6b

➡️ Type your answer to _Exercise 5 Question 6.b)_ here (replacing this text)

7. Do you reject, in the specification (5.5), that **union** has a coefficient of 0.09 at the ten percent significance level? 

_Hint: As before, use the 5 steps of hypothesis testing by hand; do not use any canned test functions in R._
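
Testing whether a coefficient equals a specific non-zero value only changes the numerator of the t statistic. A sketch on simulated data follows; the null value of 0.3 and all names are hypothetical and chosen so as not to mirror the question.

```r
set.seed(5)

# Simulated data (not exercise5.dta).
n <- 100
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

# t = (estimate - hypothesized value) / standard error.
b          <- coef(summary(fit))["x", "Estimate"]
se         <- coef(summary(fit))["x", "Std. Error"]
null_value <- 0.3
t_stat     <- (b - null_value) / se

# Two-sided test at the 10% level: 5% in each tail.
crit   <- qt(0.95, df = df.residual(fit))
reject <- abs(t_stat) > crit
```

The t statistic printed by `summary()` tests a null of zero, so it cannot be read off directly for a non-zero null.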

In [12]:
# insert your code here

➡️ Type your answer to _Exercise 5 Question 7_ here (replacing this text)