# EEP/IAS 118 - Introductory Applied Econometrics
## Problem Set 5, Spring 2023, Villas-Boas
#### <span style="text-decoration: underline">Due in Gradescope – April 25, 2023 at 11:59pm</span>


Submit materials (all handwritten/typed answers, Excel workbooks, and R reports) as one combined pdf on [Gradescope](https://www.gradescope.com/courses/492989).

This problem set is to be done using R. To receive full credit, answers must be correct, must show all steps (code and corresponding output) used to obtain them, and must be uploaded with the pages for each question correctly indicated. In addition, all confidence intervals and hypothesis tests must be conducted by hand: you can use functions like `sd()` or `mean()` to get values to plug into the formulas, but no credit will be given for using canned interval/test functions (e.g., `linearHypothesis()`) with no steps/calculations provided.

This problem set uses two data sets: **pset5_2023.dta & question5_2023.dta**.

### Preamble
When writing R code, it's a good habit to start your notebooks or R scripts with a preamble, a section where you load all necessary packages, set paths or change the working directory, or declare other options.

Use the below code cell to load in packages you will use throughout the problem set (at least `haven`, `tidyverse` and `ggplot2` this week). You will need more as you make tables, etc.

*Note:* **never** try to install packages on Datahub. All packages that you need are already installed and can be loaded immediately using the `library()` function. Attempting to install packages will create conflicts with the package versions on the server and potentially corrupt your notebook.
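A preamble along these lines is one possibility. This is only a minimal sketch: the object name `housing` and the assumption that the data file sits in the working directory are illustrative and not part of the assignment.

```r
# Preamble sketch: load packages (already installed on Datahub; do not install)
library(haven)      # read_dta() for Stata .dta files
library(tidyverse)  # dplyr, ggplot2, and related packages

# Assumed location: the file sits in the current working directory
data_path <- "pset5_2023.dta"

# Read the data only if the file is found, so the cell does not error otherwise
if (file.exists(data_path)) {
  housing <- read_dta(data_path)  # `housing` is a hypothetical object name
}
```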

## Exercise 1: 

### Data description

In this exercise, you will use data on housing prices in two years for a sample of houses, together with information on the announcement of the construction of a landfill. Characteristics of the houses in the sample are also available. For this problem, you will use the Stata file pset5_2023.dta provided on Datahub and bCourses.

Note that several problems require you to produce custom summary statistics and regression tables. For more information on producing these types of tables, see the Coding Bootcamps posted on Datahub.

The first dependent variable of interest is whether a landfill will be near a certain house in the sample. Let the variable *landfill* = 0 or 1, where 1 means the house ever had a landfill installed near it and zero otherwise. You will use a linear probability model to explain the probability of a region having a  landfill as a function of observables before any landfill was installed. You will also estimate a logit specification, interpret marginal effects, and perform hypothesis testing.

*Please perform all the calculations for this exercise using real 1978 prices (rprice).*


<center><b> Readme for data variables </b></center>

|Variable name 	|	Definition	|
|-----------|---------------|
|year	| 1978 or 1981 |
|age	| Age of the house, in years |
|nbh	| Neighborhood identification number, from 1 to 6 |
|price	| Selling price of the house  |	
|rooms	| Number of rooms in the house |
|area	| Square footage of house	| 
|land	| Square footage of lot	| 
|baths | Number of bathrooms	| 
|dist  | Distance from house to landfill, in feet |
|rprice | (Real) Price, in 1978 dollars| 


1. (i) What is the number of observations in the whole dataset? (ii) Define a new variable called **treated** equal to one if a house is located 20,000 ft or less from an upcoming landfill (to be installed in 1981) and equal to zero otherwise. (iii) Filter the data to keep only observations for 1978. (iv) How many observations do we lose by not considering the year 1981?  _Hint: Use `ifelse()`. Type `?ifelse` to see the syntax for this command._ 
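As a hedged illustration of the `ifelse()` syntax mentioned in the hint, the sketch below builds an indicator and filters by year on a made-up data frame. The data frame `toy`, its values, and the variable `near` are invented for this example and are not the assignment data.

```r
# Made-up data: four houses observed in two years
toy <- data.frame(
  year = c(1978, 1978, 1981, 1981),
  dist = c(5000, 30000, 20000, 45000)
)

# ifelse(test, yes, no): returns 1 where the test is TRUE and 0 elsewhere
toy$near <- ifelse(toy$dist <= 20000, 1, 0)

# Keep only the 1978 rows and count how many rows were dropped
toy_1978 <- toy[toy$year == 1978, ]
rows_dropped <- nrow(toy) - nrow(toy_1978)
rows_dropped
```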

In [2]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 1_ here (replacing this text).

2. Using the filtered data (only 1978), how many observations do we have for treated houses, and how many for control (untreated) houses in 1978, respectively? Looking instead at the 1981 data only, how many observations for treated and untreated houses in 1981?

In [4]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 2_ here (replacing this text).

3. Using the 1978 data, summarize the mean and SD of the following five variables by treatment status in 1978: **rooms**, **land**, **area**, **age**, and **rprice**. This gives you the summary statistics of observable housing variables before any landfill was constructed. 

To do so, create Table 1 with two columns, where the first column is for the **treated = 0** group and the second column is for the **treated = 1** group. The first two rows should correspond to the mean and SD of `land`; then continue by placing the mean and SD of the remaining variables (indicated above) in order. Make sure to include the table number and a title at the top of the table.

*Hint: see the "Summary Statistics Tables in Stargazer" section of Coding Bootcamp Part 5. If this section isn't visible, make sure to re-follow the Datahub link posted on bCourses.*
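Before formatting a table, it can help to compute the group statistics directly. The sketch below does this in base R on a made-up data frame with a single variable; it is not the Stargazer approach from the bootcamp, and the names `toy` and `summary_table` are hypothetical.

```r
# Made-up data: two untreated and two treated houses
toy <- data.frame(
  treated = c(0, 0, 1, 1),
  rooms   = c(4, 6, 5, 7)
)

# Split the variable by group, then apply mean() and sd() to each group
mean_by_group <- sapply(split(toy$rooms, toy$treated), mean)
sd_by_group   <- sapply(split(toy$rooms, toy$treated), sd)

# Stack into a two-column layout: one column per treatment status
summary_table <- rbind(mean = mean_by_group, sd = sd_by_group)
colnames(summary_table) <- c("treated = 0", "treated = 1")
summary_table
```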

In [6]:
# insert your code for 3 here

➡️ Type your written answer for _Exercise 1 Question 3_ here (replacing this text).

4. Now repeat the same summary using data for 1981 only (the year the landfill was constructed). Call this Table 2. 

In [8]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 4_ here (replacing this text).

5. Compute the difference in average house prices in treated and not treated houses in 1978 and compute the same difference in average price in 1981. How does the difference in average house price, between Treated and Not Treated homes, change from 1978 to 1981?  


In [10]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 5_ here (replacing this text).

6. This question uses **data for 1978 only.** We want to see what factors, in 1978, are correlated with the probability of a house being treated (treated=1/0; where houses will be then treated in 1981 when the landfill is constructed). 

Estimate three (3) linear probability model regressions (as specified below), present the estimates in a three-column table, and call it Table 3. Make sure you **use robust standard errors** in all regressions. Let the dependent variable in all columns be the indicator variable for being treated. In the table, mark with one star (\*) the coefficients that are significant at the 10% level, with two stars (\*\*) those significant at the 5% level, and with three stars (\*\*\*) those significant at the 1% level. 

   i) For the regression in column 1, specify a constant (i.e. intercept) and *age* as regressors. ii) In column 2, add to the constant and *age*, the *rooms* variable. iii) In column 3 present the estimates and standard errors from the regression of the treatment indicator on a constant, *age, rooms, and land*. 

_Hint: As in lecture 19, I ask you to run separate regressions and present the estimates in a table with 3 columns. See Coding Bootcamp Part 5 for help producing these tables._
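To make the idea of robust standard errors concrete, the sketch below computes heteroskedasticity-robust (HC1) standard errors by hand for a linear probability model on simulated data. The data, the variable names, and the choice of the HC1 variant are assumptions for illustration; the course bootcamp may use a different function or variant.

```r
set.seed(118)
n <- 200
x <- rnorm(n)
y <- as.numeric(runif(n) < 0.5 + 0.1 * x)  # simulated binary outcome

fit <- lm(y ~ x)            # linear probability model
X <- model.matrix(fit)      # regressor matrix, including the constant
u <- resid(fit)             # OLS residuals
k <- ncol(X)

# Sandwich formula: (X'X)^-1 [sum of u_i^2 x_i x_i'] (X'X)^-1, scaled by n/(n-k)
bread <- solve(crossprod(X))
meat  <- crossprod(X * u)   # each row of X is multiplied by its residual
V_hc1 <- (n / (n - k)) * bread %*% meat %*% bread
robust_se <- sqrt(diag(V_hc1))

cbind(estimate = coef(fit), robust_se = robust_se,
      t_stat = coef(fit) / robust_se)
```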

In [12]:
# insert your code here

a) When looking at the whole table, which coefficient measures the estimated change in treatment probability when the age of the house changes by one year, **controlling for no other covariates**? What is its value and standard error? Is it significantly different from 0 at the 5 percent level?

b) What does the estimated constant mean in column 1? Is it significantly different from zero?

c) Which coefficient measures the estimated change in treatment probability if the house land square footage increases by one square foot holding rooms and age constant? Is it statistically significant at the 5 percent level?

➡️ Type your answer to _Exercise 1 Question 6a_ here (replacing this text)

➡️ Type your answer to _Exercise 1 Question 6b_ here (replacing this text)

➡️ Type your answer to _Exercise 1 Question 6c_ here (replacing this text)

7.	Run the linear probability model with the variables in column (3) of Table 3 __without robust standard errors__. Create a variable equal to the predicted probabilities. What is the number of predicted probabilities that are less than zero? Which problem does this highlight, if any?
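A sketch of how fitted values from a linear probability model can be checked against the unit interval, using simulated data. All names and the data-generating process are assumptions made for this example only.

```r
set.seed(118)
n <- 200
x <- rnorm(n)
y <- as.numeric(runif(n) < plogis(2 * x))  # simulated binary outcome

lpm <- lm(y ~ x)
p_hat <- fitted(lpm)                # predicted probabilities from the LPM

# Count fitted values outside the [0, 1] range
n_below_zero <- sum(p_hat < 0)
n_above_one  <- sum(p_hat > 1)
c(below_zero = n_below_zero, above_one = n_above_one)
```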

In [14]:
# insert code here

➡️ Type your answer to _Exercise 1 Question 7_ here (replacing this text)

8.	Estimate the same right-hand side specification as in column 3 of Table 3 above but now using a Logit model. After you estimate the model, type the marginal effects command in R as discussed in lecture. What do you conclude in terms of the marginal effect of the number of rooms on the treatment probability at 5% significance? What about lot size at 1% significance? And rooms at 1% significance?
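The marginal effects command from lecture is not reproduced here. As a complementary sketch, the block below computes an average marginal effect from a Logit by hand on simulated data: for a continuous regressor, the marginal effect is the logistic density evaluated at the fitted index times the coefficient. The data and names are assumptions, and the lecture command may instead report effects at the mean of the covariates.

```r
set.seed(118)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1 * x))  # simulated outcome

logit_fit <- glm(y ~ x, family = binomial(link = "logit"))

# Fitted index x'b for each observation
xb <- predict(logit_fit, type = "link")

# dlogis() is the logistic density, Lambda(xb) * (1 - Lambda(xb))
ame_x <- mean(dlogis(xb)) * coef(logit_fit)["x"]
ame_x
```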

In [16]:
# insert your code here

➡️ Type your answer to _Exercise 1 Question 8_ here (replacing this text)

9.	Estimate a Logit model that, together with the estimation output from Question 8, allows you to test whether the additional covariates added in column 3 of Table 3 relative to column 1 matter for the landfill treatment probability or not. What do you conclude at the 1 percent significance level? 

_Hint: Do the five steps of Hypothesis testing by hand, do not use R’s built-in test command._
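One way to carry out such a test by hand is a likelihood ratio test comparing a restricted and an unrestricted Logit. The sketch below shows the mechanics on simulated data; the variable names, the data, and the number of restrictions (`q = 2`) are assumptions for this example.

```r
set.seed(118)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.2 + 0.5 * x1))  # simulated outcome

logit_r  <- glm(y ~ x1,           family = binomial)  # restricted model
logit_ur <- glm(y ~ x1 + x2 + x3, family = binomial)  # unrestricted model

# LR = 2 * (logL_unrestricted - logL_restricted), chi-squared with q df under H0
lr_stat <- 2 * (as.numeric(logLik(logit_ur)) - as.numeric(logLik(logit_r)))
q <- 2                                # number of restrictions being tested
crit_1pct <- qchisq(0.99, df = q)     # critical value at the 1 percent level

reject_null <- lr_stat > crit_1pct
c(lr_stat = lr_stat, critical_value = crit_1pct)
```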

In [18]:
# insert your code here

➡️ Type your answer to _Exercise 1 Question 9_ here (replacing this text)

10.	Create the necessary variables and then estimate a Logit model that allows you to test whether the impact (marginal effect) of the number of rooms is different depending on land, the land area of the house lot.  What do you conclude at the 1 percent significance level? 

_Hint: Do the five steps of Hypothesis testing by hand, do not use R’s built-in test command._

In [20]:
# insert your code here

➡️ Type your answer to _Exercise 1 Question 10_ here (replacing this text)

## Exercise 2:

This question focuses on the effect of the landfill treatment on housing prices (`price`). If living near a landfill is valued negatively, prices of treated houses should fall relative to prices of untreated houses after the landfill is installed, compared with before its installation. Let `treated*after` be the interaction of **treated** and an indicator variable called **after** (**after** is equal to one if the year is 1981, and zero otherwise).
<u>**Please perform all the calculations for this exercise using real prices.**</u>

Let the two models (2.1) and (2.2) be given by the following regressions: 

(2.1) $ price_{it} = \beta_0 + \beta_1 treated_{i} + \beta_2 after_t + \beta_3 treated_i * after_t + u_{it} $

(2.2) $ price_{it} = \beta_0 + \beta_1 treated_{i} + \beta_2 after_t + \beta_3 treated_i * after_t + \beta_4 rooms_{it} + \beta_5 land_{it} + \beta_6 age_{it} + v_{it} $
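The structure of model (2.1) can be illustrated on a tiny made-up dataset where the cell means are known. In this sketch the outcome values are invented so that the difference-in-differences equals -3; none of it comes from the assignment data.

```r
# Made-up data: two observations in each of the four treated-by-after cells
toy <- data.frame(
  treated = c(0, 0, 0, 0, 1, 1, 1, 1),
  after   = c(0, 0, 1, 1, 0, 0, 1, 1),
  y       = c(9, 11, 11, 13, 7, 9, 6, 8)
)

# y ~ treated * after expands to treated + after + treated:after
did_fit <- lm(y ~ treated * after, data = toy)
coef(did_fit)

# The same quantity from the four cell means
cell_mean <- function(tr, af) mean(toy$y[toy$treated == tr & toy$after == af])
did_by_hand <- (cell_mean(1, 1) - cell_mean(1, 0)) -
               (cell_mean(0, 1) - cell_mean(0, 0))
did_by_hand
```

Because the model is saturated in the two indicators, the interaction coefficient reproduces the difference of cell-mean differences exactly.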

1. In equation (2.1), what does the coefficient $\beta_0$ measure? What does $\beta_2$ measure?

➡️ Type your answer to _Exercise 2 Question 1_ here (replacing this text)

2. In equation (2.1), what does the coefficient $\beta_1$ measure? What about $\beta_3$?

➡️ Type your answer to _Exercise 2 Question 2_ here (replacing this text)

3. What conditions are needed so that we can interpret the coefficient $\beta_3$ in equation (2.1) (the difference-in-differences estimate) as the causal impact on house prices of having a landfill installed near a house? 
What would be a simple set of tests you could run to support this? Do not run these tests but explain what data you would use and collect and what tests you would run. Does any of the analysis in Exercise 1 also help answer this concern?

In [22]:
# insert your code here

➡️ Type your answer to _Exercise 2 Question 3_ here (replacing this text)

4. Estimate both models and display the estimates in a table with two columns. Place the estimates from model (2.1) in column 1 and the results from model (2.2) with additional covariates in column 2. 
Does adding the covariates affect the estimated effect of the landfill treatment (the coefficient $\beta_3$ on the interaction **treated*after**)? How do you interpret the point estimate of $\beta_3$ in column (2)? (Use Size, Sign, and Significance.)

In [23]:
# insert your code here

➡️ Type your answer to _Exercise 2 Question 4_ here (replacing this text)

5. Estimate the model as in column (1), adding **area** as a regressor. Add these estimates as column (3) in the table above. 
Looking at the change in the $\beta_1$ estimate (on the row for **treated**) from column (1) to column (3), what does it imply for the correlation between being “treated” and the size of a house in square footage (**area**)?

In [25]:
# insert your code here

➡️ Type your answer to _Exercise 2 Question 5_ here (replacing this text)

## Exercise 3: 

Does a sugar-sweetened beverage (SSB) tax decrease obesity among middle school students? In a school district, the superintendent announced that if a school's student population had an average Body Mass Index (BMI) greater than X in 2022 (where X is the 85th percentile for that age in 2022), the school's soda vending machines would be subject to the SSB tax in 2023. Schools with average student BMI less than X in 2022 would not be subject to the SSB tax in 2023. Suppose that you have data over time $t$ for a random sample of schools $j$.

1.	How would you estimate the causal effect of the SSB tax on the outcome $Y$ ($Y$=the average BMI in a school)? Write down the exact regression and define each variable. Also say which coefficient is interpreted as the causal effect of the SSB tax.

➡️ Type your answer to _Exercise 3 Question 1_ here (replacing this text)

2. What assumption is key for you to interpret the coefficient as a causal effect of the SSB tax?

➡️ Type your answer to _Exercise 3 Question 2_ here (replacing this text)

3. Given data on teen diabetes diagnoses over time and by school, how would you estimate the causal effect of the tax on diabetes diagnoses among young students?

➡️ Type your answer to _Exercise 3 Question 3_ here (replacing this text)

4. Given data on beverage prices over time and by school for soda vending machines, how would you estimate the causal effects of the tax mandate on the soda prices students face -- to see if the tax was actually implemented and reflected higher prices?

➡️ Type your answer to _Exercise 3 Question 4_ here (replacing this text)

## Exercise 4: 

Equation (4.1):
$$ \text{Predicted Outcome}_{jt} = i + 50 \, After_{jt} + 5 \, Treated_{jt} - 25 \, After_{jt} \times Treated_{jt} $$

1. Using the above estimated equation and the numbers below, complete equation (4.1) and fill out the table below:

|Average outcome 	|	Control, **Treated** = 0	| Treatment, **Treated** = 1	| Difference: T-C (in each row) |
|-----------|---------------|---------------|---------------|
|Before, **After** = 0	| $a$ | 125 | $b$ |
|After, **After** = 1 | $e$ | $d$ | $c$ |
|Difference: After-Before (in each column)	| $f$ | $g$ | D-in-D = $h$|
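As a reminder of how the cells of such a table relate to regression coefficients, consider a generic model $Y_{jt} = \gamma_0 + \gamma_1 After_{jt} + \gamma_2 Treated_{jt} + \gamma_3 After_{jt} \times Treated_{jt}$. The symbols $\gamma_0, \dots, \gamma_3$ are introduced here for illustration only; in equation (4.1) they correspond to $i$, 50, 5, and $-25$ respectively. The four predicted cell averages are:

$$
\begin{aligned}
\text{Control, Before:} \quad & \gamma_0 \\
\text{Control, After:} \quad & \gamma_0 + \gamma_1 \\
\text{Treatment, Before:} \quad & \gamma_0 + \gamma_2 \\
\text{Treatment, After:} \quad & \gamma_0 + \gamma_1 + \gamma_2 + \gamma_3
\end{aligned}
$$

Taking the After-Before difference within each column and then subtracting the Control difference from the Treatment difference leaves $(\gamma_1 + \gamma_3) - \gamma_1 = \gamma_3$.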

In [27]:
# insert the code here letter a in Exercise 4

➡️ Type your answer to letter $a$ in Exercise 4

In [29]:
# insert the code here letter b in Exercise 4

➡️ Type your answer to letter $b$ in Exercise 4

In [31]:
# insert the code here letter c in Exercise 4

➡️ Type your answer to letter $c$ in Exercise 4

In [33]:
# insert the code here letter d in Exercise 4

➡️ Type your answer to letter $d$ in Exercise 4

In [35]:
# insert the code here letter e in Exercise 4 Question 1

➡️ Type your answer to letter $e$ in Exercise 4

In [37]:
# insert the code here letter f in Exercise 4 Question 1

➡️ Type your answer to letter $f$ in Exercise 4

In [39]:
# insert the code here letter g in Exercise 4 Question 1

➡️ Type your answer to letter $g$ in Exercise 4

In [41]:
# insert the code here letter h in Exercise 4 Question 1

➡️ Type your answer to letter $h$ in Exercise 4

In [43]:
# insert the code here letter i in Exercise 4 (from the equation)

➡️ Type your answer to letter $i$ in Exercise 4 (from the equation)

## Exercise 5: 

Open the dataset for Exercise 5 (**question5_2023.dta**), which contains three years of data (1987-1989) on how much scrap firms produce, along with characteristics of firm $j$ such as union status, sales, and number of employees. Some firms got a grant in 1988 to reduce scrap production. The table below shows the results from estimating models of scrap by firm $j$ in year $t$ on the variables in rows, for 4 different specifications in columns.

![Screenshot 2023-03-27 at 4.19.14 PM.png](attachment:3e4361ee-0ebe-4687-8187-8cbd6ab09dfe.png)

Column 1 has no controls other than if firm $j$ got the grant in year $t$, column 2 adds union, sales and employment numbers of a firm $j$ as controls, and then column 3 adds to the specification in column 2 the year fixed effects (we can do this because each year we see many firms). Finally, column 4 has, in addition to everything in column 3, firm fixed effects (an indicator variable for each firm -- we can do that because we see the firm in three years).
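To see how R turns year and firm identifiers into indicator variables, the sketch below builds the regressor matrix for a made-up balanced panel of three firms over three years. The panel and the firm labels are invented for illustration; `factor()` and `model.matrix()` are base R.

```r
# Made-up balanced panel: 3 firms observed in each of 3 years
panel <- expand.grid(firm = c("A", "B", "C"), year = c(1987, 1988, 1989))

# factor() creates one indicator per category; with a constant in the model,
# R omits the first category of each factor as the reference group
X <- model.matrix(~ factor(year) + factor(firm), data = panel)

colnames(X)
ncol(X)
```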

1. If we have N firms, how many firm fixed effects are estimated in column (4) when there is a constant?

➡️ Type your answer to _Exercise 5 Question 1_ here (replacing this text)

2. Why does the union variable not get estimated in column (4)? [**fcode** is the firm code]

➡️ Type your answer to _Exercise 5 Question 2_ here (replacing this text)

3. We have three years: 1987, 1988, and 1989. Why does the year 1987 indicator not get estimated in column (3)?

➡️ Type your answer to _Exercise 5 Question 3_ here (replacing this text)

4. In column (3) you see that the estimated fixed effects for year 1988 and for 1989 are negative. This means that scrap is lower in 1988 relative to 1987 (albeit not significant). How much less scrap on average is there in 1989 relative to 1987? Is that significantly different from zero?

In [45]:
# insert your code here

➡️ Type your answer to _Exercise 5 Question 4_ here (replacing this text)

5. Given that scrap is decreasing over the years and the grant was implemented in 1988 and 1989, explain why the estimated grant coefficient changes in column (3) relative to the one in column (2). 

*Hint: In column (2) you would say that the grant lowered scrap (albeit not significantly), whereas in column (3), what would you say after controlling for time-varying effects?*

➡️ Type your answer to _Exercise 5 Question 5_ here (replacing this text)

**You should use question5_2023.dta for the following questions**


6. Estimate in R specifications 3.3 and 3.4 (replicate them both). Add a specification 3.5 by including the quadratic of employment and the quadratic of sales as additional variables in regression specification 3.4. Using the joint parameter test in R, test the null that the coefficients on the quadratic term of employment and the quadratic term of sales are jointly zero. What is the p-value of the F statistic you get?
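The mechanics of a joint F test on two quadratic terms can be sketched with simulated data. This example uses base R's `anova()` on nested `lm()` fits, which gives the classical (homoskedastic) F test; it is not necessarily the joint test command intended by the course, and all variable names and data are hypothetical.

```r
set.seed(118)
n <- 150
emp   <- runif(n, 10, 200)   # simulated employment
sales <- runif(n, 1, 50)     # simulated sales
scrap <- 5 + 0.01 * emp + 0.05 * sales + rnorm(n)

# Restricted model omits the quadratic terms; unrestricted model includes them
m_r  <- lm(scrap ~ emp + sales)
m_ur <- lm(scrap ~ emp + sales + I(emp^2) + I(sales^2))

# Nested-model F test of H0: both quadratic coefficients are zero
f_test  <- anova(m_r, m_ur)
p_value <- f_test$`Pr(>F)`[2]
f_test
```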

In [46]:
# read in the new dataset

In [47]:
# insert your code here for Question 6

➡️ Type your answer to _Exercise 5 Question 6_ here (replacing this text)

7. What is the p-value for the grant coefficient in specification 3.5? In specification 3.5, do you reject at the 10 percent significance level the null that the grant coefficient is zero?

In [51]:
# insert your code here

➡️ Type your answer to _Exercise 5 Question 7_ here (replacing this text)