# EEP/IAS 118 - Introductory Applied Econometrics

## Problem Set 4, Spring 2024, Villas-Boas

<span style="color:red;"> Please read the updated instructions: </span> 

**In order to receive full credit**

**(1)** Please **assign all and only the appropriate pages to each question** on Gradescope.

**(2)** Codes and outputs not properly displayed will be marked as incorrect.

**(3)** All confidence intervals/hypothesis tests **must be conducted by hand** - use functions like sd() or mean() to get values to plug into the formulas. No credit will be given for the use of canned interval/test functions (e.g. linearHypothesis()), though checking whether your answer aligns with the output of canned functions is advisable.

**(4)** Please use (i) coding cells for calculations / demonstrating work, and (ii) markdown cells for all written responses and calculated values. Your written response in the markdown cell should contain everything you'd like the grader to read. (For example, in the 5 steps of hypothesis testing, if you calculate the test statistic and critical value (or p-value), the calculated values should be explicitly mentioned in the markdown cell.)

### Preamble
When writing R code, it's a good habit to start your notebooks or R scripts with a preamble, a section where you load all necessary packages, set paths or change the working directory, or declare other options.

Use the code cell below to load the packages you will use throughout the problem set (at least `haven`, `tidyverse` and `ggplot2` this week).

In [None]:
# Preamble

### Exercise 1. 
The data description is available on bCourses, and the data file is called pset4_2024.dta.

We use a dataset describing prices and characteristics of apartments sold in 2015 in a country, in addition to neighborhood characteristics and indicators on what season of the year each sale took place.

*Note: In economics and in EEP 118, log always refers to the natural log, ln(.)*

| Variable | Definition |
|:-:|:-:|
| spring | A dummy variable = 1 if apartment sold in Spring 2015 and = 0 otherwise |
| fall | A dummy variable = 1 if apartment sold in Fall 2015 and = 0 otherwise |
| winter | A dummy variable = 1 if apartment sold in Winter 2015 and = 0 otherwise |
| sizeofunit | Size of unit in square meters | 
| price | Apartment price in Patacas |
| floor | Apartment floor (values range from -1 (basement) to 79 (79th floor)) |
| constructionyear | Year of construction |
| numberofhighschools | Number of high schools near apartment |
| distancetonearestgreenest | Network distance to greenest green space in meters |
| mediumage | Median age of residents |
| populationdensity | Population density (population/area) |
| distanceSubway | Distance to a subway station in meters |


#### Q1-1 ####
Read in the data and create a new column for the log of price, calling it *lprice*.


In [None]:
# Code for Q1-1

#### Q1-2 ####
For variables *price*, *lprice* and *distanceSubway* get the sample mean, standard deviation, minimum, and maximum. (No need to convert units or to write them out.)

In [None]:
# Code for Q1-2

#### Q1-3 ####
a) Create an *overlapped* histogram of the variable *price* for (group 1) apartments located close to the subway (close defined as less than 250 meters from the subway station entrance) and (group 2) apartments 250 or more meters from the subway entrance. Remember to add titles, axis labels, and legends for the two groups.

(Hint: See PS3 Q6 | Histograms section of Coding Bootcamp Part 4)

b) Report the average apartment price for each group.
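As a rough illustration of the plotting pattern only, the sketch below overlays two semi-transparent histograms with `ggplot2` and then computes group means. It runs on simulated data, not on pset4_2024.dta; the data frame `sim`, the column `near_subway`, and all numbers are assumptions made up for this example.

```r
library(ggplot2)

set.seed(118)
# Simulated stand-in data (hypothetical values, NOT the problem set data)
sim <- data.frame(distanceSubway = runif(500, min = 0, max = 2000))
sim$price <- 500 - 0.03 * sim$distanceSubway + rnorm(500, mean = 0, sd = 40)

# Group label: less than 250 m from the subway vs. 250 m or more
sim$near_subway <- ifelse(sim$distanceSubway < 250,
                          "Less than 250 m", "250 m or more")

# position = "identity" draws the two histograms on top of each other;
# alpha < 1 keeps both groups visible where they overlap
p <- ggplot(sim, aes(x = price, fill = near_subway)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
  labs(title = "Simulated prices by distance to subway",
       x = "Price", y = "Count", fill = "Distance to subway")
print(p)

# Average price within each group
group_means <- tapply(sim$price, sim$near_subway, mean)
print(group_means)
```

The same structure should carry over to the real data once the grouping variable is built from *distanceSubway*.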

In [None]:
# Code for Q1-3(a)

In [None]:
# Code for Q1-3(b)

#### Q1-4 ####
Estimate the model of apartment price as a linear function of a constant, distance to subway, and distance to the closest green space. Interpret the estimated parameter for distance to greenspace using Sign, Size, and Significance (SSS).

In [None]:
# Code for Q1-4

➡️ Written Response for Q1-4

**Sign:**


**Size:**


**Significance:**


#### Q1-5 ####
In absolute terms, does a change in distance to the subway or a change in distance to the closest green space have a larger impact on the expected apartment sales price? (Use Standardized Regression - `lm.beta()` function - Lecture 13/Section 6)
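To see what a standardized coefficient measures, here is a minimal base-R sketch on simulated data (the variables `x1`, `x2`, `y` and all numbers are hypothetical). It rescales each slope by sd(x)/sd(y) and confirms that this matches a regression of z-scores on z-scores; it does not call `lm.beta()` itself.

```r
set.seed(118)
n <- 500
# Simulated stand-in data (hypothetical, NOT the problem set data)
sim <- data.frame(x1 = rnorm(n, mean = 800, sd = 300),
                  x2 = rnorm(n, mean = 400, sd = 150))
sim$y <- 500 - 0.05 * sim$x1 - 0.02 * sim$x2 + rnorm(n, mean = 0, sd = 20)

# Ordinary regression in the original units
fit <- lm(y ~ x1 + x2, data = sim)
b <- coef(fit)[c("x1", "x2")]

# Standardized coefficient: slope times sd(x) divided by sd(y)
beta_std <- b * c(sd(sim$x1), sd(sim$x2)) / sd(sim$y)

# Equivalent route: regress z-scores on z-scores
sim$zy  <- as.numeric(scale(sim$y))
sim$zx1 <- as.numeric(scale(sim$x1))
sim$zx2 <- as.numeric(scale(sim$x2))
fit_z <- lm(zy ~ zx1 + zx2, data = sim)

print(beta_std)
print(coef(fit_z)[c("zx1", "zx2")])
```

Because both regressors are expressed in standard-deviation units, the absolute values of the standardized coefficients can be compared directly.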

In [None]:
# Code for Q1-5

➡️ Written Response for Q1-5

#### Q1-6 ####

Estimate the (unrestricted) model of apartment prices as a linear function of a constant, distance to subway, distance to the closest green space, number of high schools, median age of residents, and population density. Test the joint significance of the variables median age of residents, number of high schools, and population density at the 5% significance level by estimating the relevant restricted model.

*Note*: (i) Conduct HT by hand to receive credit, demonstrating your work. (ii) The calculated test statistic, critical value / p-value should be explicitly mentioned in the markdown cell in your write-up of the five steps. Please do not assume the reader will correctly pick out all the relevant values from the outputs of the coding cell.
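The mechanics of a by-hand F test from restricted and unrestricted sums of squared residuals can be sketched as follows. This uses simulated data with hypothetical regressors `x1` to `x4`, not the apartment data, and tests two exclusion restrictions rather than the three in this question.

```r
set.seed(118)
n <- 400
# Simulated stand-in data (hypothetical, NOT the problem set data)
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + 0.2 * sim$x3 + rnorm(n)

# Unrestricted model, and restricted model imposing H0: coefficients on x3 and x4 are zero
unrestricted <- lm(y ~ x1 + x2 + x3 + x4, data = sim)
restricted   <- lm(y ~ x1 + x2, data = sim)

ssr_ur <- sum(resid(unrestricted)^2)   # SSR of the unrestricted model
ssr_r  <- sum(resid(restricted)^2)     # SSR of the restricted model
q      <- 2                            # number of restrictions under H0
df_ur  <- df.residual(unrestricted)    # n - k - 1 in the unrestricted model

# F = [(SSR_r - SSR_ur) / q] / [SSR_ur / (n - k - 1)]
f_stat <- ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)

# 5% critical value and p-value from the F(q, n - k - 1) distribution
f_crit <- qf(0.95, df1 = q, df2 = df_ur)
p_val  <- pf(f_stat, df1 = q, df2 = df_ur, lower.tail = FALSE)

print(c(F = f_stat, critical = f_crit, p_value = p_val))
```

The rejection rule compares `f_stat` with `f_crit` (or `p_val` with 0.05); both should lead to the same conclusion.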

In [None]:
# Code for Q1-6

➡️ Written Response for Q1-6

**Step 1**: State the hypotheses. 

**Step 2**: Test statistic.

**Step 3**: Critical value or p-value.

**Step 4**: Rejection Rule.

**Step 5**: Interpret.

#### Q1-7 ####

Now estimate the model of apartment prices as a linear function of the same covariates from Q1-6 plus additional characteristics of the apartment (size of unit, floor, and construction year). Test the joint significance of these three additional variables at the 1% significance level. Again, to receive full credit, you must conduct this hypothesis test by hand and show your steps and calculations.

In [None]:
# Code for Q1-7

➡️ Written Response for Q1-7

**Step 1**: State the hypotheses. 

**Step 2**: Test statistic.

**Step 3**: Critical value or p-value.

**Step 4**: Rejection Rule.

**Step 5**: Interpret.

#### Q1-8 ####
I am still worried that the regression in Q1-7 above ignores the seasonality of the housing market: more people may be seeking to buy in some seasons of the year than in others, so we could have omitted variable bias. What regression do you propose to control for the season in which each apartment was sold? Estimate that model. Looking at the coefficient on distance to the subway, what do you conclude about whether seasonality in the housing market biased our previous estimate of the impact of distance to the subway in the unrestricted model in Q1-7? (Be explicit about whether we likely have upward or downward bias.)

In [None]:
# Code for Q1-8

➡️ Written Response for Q1-8

#### Q1-9 ####

Consider now a model where you regress the price on only the size of the unit, the floor of the unit, and the distance to the subway. 

(a) Construct a 90% confidence interval for the average predicted price for an apartment that is located 200 meters from the subway, has a size of 250 square meters, and is on the 26th floor of a building.

(b) What would be the 90% confidence interval for the predicted price of a **specific** apartment (not the average predicted price) with those characteristics?

(c) Are you surprised that the intervals in (a) and (b) differ, and that the latter is wider?

(see Lecture 14 / Section 8)
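
As a reminder of the general shape of the two intervals, here is a sketch in standard textbook notation, which may differ slightly from the notation used in lecture. The symbols are introduced here for illustration: $\hat{y}_0$ is the predicted price at the chosen characteristics, $c$ is the 95th percentile of the $t_{n-k-1}$ distribution (for a 90% interval), and $\hat{\sigma}$ is the standard error of the regression.

$$ \text{(a)} \;\; \hat{y}_0 \pm c \cdot se(\hat{y}_0) \qquad\qquad \text{(b)} \;\; \hat{y}_0 \pm c \cdot \sqrt{se(\hat{y}_0)^2 + \hat{\sigma}^2} $$

One common way to obtain $se(\hat{y}_0)$ is to re-estimate the model with each regressor recentered at the chosen value (for example, distanceSubway minus 200), so that the estimated intercept equals $\hat{y}_0$ and its standard error equals $se(\hat{y}_0)$.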

In [None]:
# Code for Q1-9(a)

➡️ Final Answer for Q1-9(a)

In [None]:
# Code for Q1-9(b)

➡️ Final Answer for Q1-9(b)

In [None]:
➡️ Written Response for Q1-9(c)

#### Q1-10 ####

We want to select the best model to use for a future analysis of the benefits of public transportation. We are selecting between model (10.1) and model (10.2) below. Estimate both models, creating any variables you need. Based on your estimated output using linear regression analysis, which model would you choose to use in the remaining analysis by policymakers considering the economic benefits of expanding subway station locations?

$ (10.1) \;\; Price_i = \beta_0 + \beta_1 distanceSubway_i + \beta_2 DistanceGreen_i + \beta_3 SizeOfUnit_i + u_i $

$ (10.2) \;\; Price_i = \gamma_0 + \gamma_1 \log(distanceSubway_i) + \gamma_2 DistanceGreen_i + \gamma_3 SizeOfUnit_i + \gamma_4 (SizeOfUnit_i \times distanceSubway_i) + w_i $

In [None]:
# Code for Q1-10

➡️ Written Response for Q1-10

### Exercise 2 ###

We want to select the best model to use for a future analysis of the benefits of public transportation. We are selecting between model (2.1) and model (2.2) below.

$ (2.1) \;\; price_i = \beta_0 + \beta_1 distanceSubway_i + \beta_2 \log(SizeOfUnit_i) + u_i $

$ (2.2) \;\; \log(price_i) = \alpha_0 + \alpha_1 distanceSubway_i + \alpha_2 \log(SizeOfUnit_i) + w_i $

(a) Estimate both models above, creating any variables you need. Using model (2.2), predict the average price of an apartment that is 150 meters from the subway and 170 square meters in size. *(Hint: See Lecture 14 / Section 8)*
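
One standard way to turn a prediction of $\log(price)$ into a prediction of $price$ is sketched below. It assumes normally distributed errors, and the correction used in lecture may differ, so treat it as a reminder rather than the required method. Here $\hat{\sigma}^2$ is the estimated error variance from model (2.2), and the subscript 0 (introduced for this sketch) marks the chosen apartment.

$$ \widehat{\log(price)}_0 = \hat{\alpha}_0 + \hat{\alpha}_1 \cdot 150 + \hat{\alpha}_2 \log(170), \qquad \widehat{price}_0 = \exp\!\left(\hat{\sigma}^2 / 2\right) \cdot \exp\!\left(\widehat{\log(price)}_0\right) $$

Simply exponentiating $\widehat{\log(price)}_0$ without the scaling factor tends to underestimate the expected price.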

In [None]:
# Code for Q2-a

➡️ Written Response for Q2-a

(b) Which of the two models would you suggest policymakers should use when considering the economic benefits of expanding subway station locations? *(Hint: See Lecture 15 / Section 8)*
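Because the two models have different dependent variables, their reported R-squared values are not directly comparable. One common approach, sketched here on simulated data with hypothetical variables `dist`, `size`, and `price`, is to convert the log model's fitted values back to price levels and compute the squared correlation with actual prices. This assumes the same normal-errors correction as a standard log prediction and may not match the lecture's exact procedure.

```r
set.seed(118)
n <- 500
# Simulated stand-in data (hypothetical, NOT the problem set data)
sim <- data.frame(dist = runif(n, min = 50, max = 2000),
                  size = runif(n, min = 30, max = 250))
sim$price <- exp(4 - 0.0003 * sim$dist + 0.8 * log(sim$size) +
                 rnorm(n, mean = 0, sd = 0.2))

# Level model and log model with the same regressors
level_fit <- lm(price ~ dist + log(size), data = sim)
log_fit   <- lm(log(price) ~ dist + log(size), data = sim)

# Predicted price levels from the log model, with the exp(sigma^2 / 2) adjustment
sigma2_hat <- summary(log_fit)$sigma^2
price_hat  <- exp(sigma2_hat / 2) * exp(fitted(log_fit))

# Goodness of fit measured in the same units (price levels) for both models
r2_level         <- summary(level_fit)$r.squared
r2_log_in_levels <- cor(sim$price, price_hat)^2

print(c(level_model = r2_level, log_model_in_levels = r2_log_in_levels))
```

The scaling factor does not change the correlation, so the comparison depends only on how well the exponentiated fitted values track actual prices.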

In [None]:
# Code for Q2-b

➡️ Written Response for Q2-b

### Exercise 3

(a) Estimate model (2.2) from above separately for apartments sold in the spring and for those not sold in the spring (that is, those sold in any other season). Formally test at the 10% significance level whether the regressions should be estimated separately or whether we can pool the data as we have been doing. *(Hint: See Lecture 17)*
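
For reference, the usual Chow-type F statistic for this kind of pooling question has the form below. The notation is introduced here and may differ from Lecture 17: $SSR_P$ is the sum of squared residuals from the pooled regression, $SSR_1$ and $SSR_2$ come from the two separate regressions, $k$ is the number of slope coefficients (here $k = 2$ in model (2.2)), and $n$ is the total number of observations.

$$ F = \frac{\left[ SSR_P - (SSR_1 + SSR_2) \right] / (k+1)}{(SSR_1 + SSR_2) / \left[ n - 2(k+1) \right]} \;\sim\; F_{k+1,\; n-2(k+1)} \text{ under } H_0 $$

The null hypothesis is that the intercept and both slopes are the same in the two groups, which gives $k + 1 = 3$ restrictions.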

In [None]:
# Code for Ex3-a

➡️ Written Response for Ex3-a

**Step 1**: State the hypotheses. 

**Step 2**: Test statistic.

**Step 3**: Critical value or p-value.

**Step 4**: Rejection Rule.

**Step 5**: Interpret.


(b) I would like to know whether the effect of the log of apartment size on the log of apartment price in model (2.2) differs depending on whether the median age of residents is more than 55 years. Estimate a model that enables you to test this and interpret your findings. Compare the p-value for the estimated coefficient of interest to the 10 percent significance level to conclude whether you reject the null hypothesis that the effect of log apartment size on log apartment price is the same in areas where the median age is more than 55 (older population) as elsewhere, against a two-sided alternative, holding all else equal. *(Hint: Generate an over55 dummy variable, then the interaction term you need, and add both to the regression)*
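The dummy-plus-interaction pattern from the hint can be sketched as follows on simulated data. All variable names other than `over55` (here `age`, `dist`, `lsize`, `lprice`, `over55_lsize`) and all numbers are hypothetical, and the cutoff is assumed to be strictly greater than 55.

```r
set.seed(118)
n <- 600
# Simulated stand-in data (hypothetical, NOT the problem set data)
sim <- data.frame(lsize = log(runif(n, min = 30, max = 250)),
                  dist  = runif(n, min = 50, max = 2000),
                  age   = runif(n, min = 25, max = 75))

# Dummy for areas with (simulated) median age above 55, and its interaction with log size
sim$over55       <- as.numeric(sim$age > 55)
sim$over55_lsize <- sim$over55 * sim$lsize

sim$lprice <- 4 - 0.0003 * sim$dist + 0.8 * sim$lsize +
  0.1 * sim$over55 + 0.15 * sim$over55_lsize + rnorm(n, mean = 0, sd = 0.2)

# The coefficient on over55_lsize measures how the log-size effect differs for over55 areas
fit <- lm(lprice ~ dist + lsize + over55 + over55_lsize, data = sim)
coef_table <- summary(fit)$coefficients

# Two-sided p-value for H0: coefficient on the interaction term equals zero
p_interaction <- coef_table["over55_lsize", "Pr(>|t|)"]
print(coef_table)
print(p_interaction)
```

Comparing `p_interaction` with 0.10 gives the decision at the 10 percent level.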

In [None]:
# Code for Ex3-b

➡️ Written Response for Ex3-b

**Step 1**: State the hypotheses. 

**Step 2**: Test statistic.

**Step 3**: Critical value or p-value.

**Step 4**: Rejection Rule.

**Step 5**: Interpret.