# EEP/IAS 118 - Introductory Applied Econometrics Problem Set 2

## Problem Set 2, Spring 2024, Villas-Boas

### <span style="text-decoration: underline">Problem Set 2: See bCourses / Gradescope for deadlines  </span> 

Refer to PS1 (pdf) for submission instructions.

Unless specified otherwise for the problem, please round your intermediate steps and final solutions to ***four (or more) decimal digits***.

## Exercise 1: (R) Relationship between Car Price and Covariates

**Guidelines:**

- Exercise 1 should be completed using R.
- When you want an output to show, you need to explicitly call the object. For example, if you want to compute and then also show the average price, you type
    - `MeanPrice<- mean(data$price)`
    - `MeanPrice`
    
    The first command only saves the output to a new object in R called `MeanPrice`. To display the output, you then need to type `MeanPrice` on its own line. Answers that do not display the desired output for the grader to see will be graded as incorrect.
- To write comments in your script or in code cells (text that will appear but will not be read as R commands), type a # at the beginning of the line. Use these as notes to keep track of which question you are trying to answer, the purpose of each command, etc.

**To get started:**

- For those working with RStudio on a local installation, install the following two packages (as seen in the header image below). For those working on Datahub, DO NOT install packages.
    - `install.packages("tidyverse")`
    - `install.packages("haven")`
- Then, for both local and Datahub users, load these packages.
    - `library(tidyverse)`
    - `library(haven)`
- This time you are opening a Stata-formatted “.dta” file, so you will use the `read_dta()` command provided by the haven package. Load the data into R as:
    - `YourDataName <- read_dta("file_name")`

**R Tips:**
- See the Bootcamp Parts 1 and 2 and Sec 1 and 2 R Demo for basics. See also all the Lecture R script files, where we did most of what you are asked to do here: e.g., Lecture5.R, Lecture6.R, and Lecture7.R.
- The command `table()` lists all values a variable takes in the sample and the number of times it takes each value. 
- To summarize data for a specified subset of the observations, you can use `filter()` to subset the data, and then either `summary()` for simple summary statistics or `summarise()` in tidyverse to generate more detailed summary statistics. (See section 2 R demo)
- If you are using RStudio, your header should be organized with notes such that the purpose of the script file is clear, the packages needed to run the script are installed and loaded at the top, and so on. See the image below for an example:

<img src="attachment:369465a9-1a76-4f59-a250-10d0653b29d9.png" width="600" />
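
As a minimal sketch of the R tips above, the following uses a small made-up data frame (`toy`, with hypothetical values that are not from `pset2_2024.dta`) to show `table()`, `filter()`, `summary()`, and `summarise()`. It assumes the `dplyr` package (part of tidyverse) is available; the object names are illustrative only.

```r
library(dplyr)

# Hypothetical toy data (NOT the problem set data)
toy <- data.frame(
  luxury = c(0, 1, 0, 1, 0),
  price  = c(10, 30, 12, 28, 14)
)

# table(): lists each value of luxury and how many times it appears
luxury_counts <- table(toy$luxury)
luxury_counts

# filter(): keep only the rows where luxury equals 1
toy_luxury <- filter(toy, luxury == 1)

# summary(): quick summary statistics for the subset
summary(toy_luxury$price)

# summarise(): choose exactly which statistics to report
toy_stats <- summarise(toy_luxury,
                       mean_price = mean(price),
                       max_price  = max(price),
                       n_obs      = n())
toy_stats
```

Note that each result is saved to an object and then called on its own line so that the output is displayed.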


**Data Description:**

The data set for this exercise comes from a sample of car sales in 1999. In this exercise, we delve into that data set to provide insights into the determinants of car price. The `pset2_2024.dta` file includes the following variables to be used in PS2.

<center><b> Table 1: Description of Dataset for PS2 </b></center>

| Variable Name | Description |
| :-----------: | :----------: |
| domestic | =1 if domestically produced <br> =0 otherwise |
| qu | sales quantity (number of new car registrations) |
| price | price (in 1000 Euro, 1999 purchasing power) |
| horsepower | horsepower (in kW) |
| fuel | fuel consumption (liter per km at 90km/hr) |
| weight | weight (in kg) |
| luxury | =1 if luxury <br> =0 if not luxury |



In [1]:
# Preamble (Load relevant packages)
# insert code here

### Question 1:

First, we would like you to become familiar with your data:

**(a)** Read in the data: `ps2_df <- read_dta("pset2_2024.dta")`

In [3]:
# insert code here (all you need to do is copy & paste correctly)

**(b)** How many observations are in the data set?

In [5]:
# insert code here

**(c)** How many observations have fuel consumption of more than 7 liters per km (at 90km/hr)?

Hint: Use the tidyverse (dplyr) method (after loading in the appropriate package in the preamble):
- `filter()` trims the dataframe to observations with fuel more than X: `ps2_df_fuelover7 <- filter(ps2_df, fuel > X)`
- Then, look at the number of rows of the filtered dataframe `ps2_df_fuelover7`.

In [7]:
# insert code here

**(d)** Using the (full) data, find the _mean_ and _range_ of price as well as the _range_ of weight of the cars recorded in the dataset.

Hint 1: Reference PS1 Ex3 Part (d)

Hint 2: Range is defined as the difference between max and min.
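
One thing to watch for: base R's `range()` returns the pair (min, max) rather than a single number. A minimal sketch with a made-up vector `x` (hypothetical values, not from the problem set data) shows two equivalent ways to get the range as defined in Hint 2:

```r
# Hypothetical values (NOT from the problem set data)
x <- c(4.2, 9.5, 6.1, 7.8)

# range() returns the pair (min, max), not a single number
range(x)

# Range as defined in Hint 2: max minus min
x_range <- max(x) - min(x)
x_range

# Equivalent one-liner: difference between the two elements of range()
diff(range(x))
```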

In [9]:
# insert code here

**(e)** In the (full) data set, construct a new variable `pricePerkg` equal to the price divided by weight, using the `mutate()` function. 

You can create a new variable: `ps2_df <- mutate(ps2_df, pricePerkg = price/weight)`

This new variable is not meant to be interpreted with any economic meaning, but rather created for the purpose of practicing R.

In [11]:
# insert code here

**(f)** Plot a histogram of this constructed variable `pricePerkg`.  

Hint: `hist(ps2_df$pricePerkg)`

In [13]:
# insert code here

**(g)** What is the range, mean, and median of prices of cars with fuel consumption of over 7 liters per km?

Hint: Use the filtered (subsetted) dataframe object from part (c).

In [15]:
# insert code here

**(h)** From the (full) data frame, using the `group_by()` and `summarise()` functions, output the mean price for domestically and non-domestically produced cars separately in a single table. 

Hint: Use `group_by()` and `summarise()` in the tidyverse package. See Section 2 R demo. Remember to `ungroup()` your dataframe before proceeding to the next exercise.
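
The grouping pattern in this hint can be sketched on made-up data. The data frame `toy` and its values are hypothetical, and the example groups by `luxury` and averages `weight` purely for illustration; it assumes the `dplyr` package is available.

```r
library(dplyr)

# Hypothetical toy data (NOT the problem set data)
toy <- data.frame(
  luxury = c(0, 1, 0, 1, 0),
  weight = c(900, 1400, 1000, 1600, 1100)
)

# One row per group, with the group mean in a new column
toy_by_luxury <- toy %>%
  group_by(luxury) %>%
  summarise(mean_weight = mean(weight)) %>%
  ungroup()   # drop any remaining grouping before further work

toy_by_luxury
```

With a single grouping variable, `summarise()` already returns an ungrouped table, so `ungroup()` is harmless here; it matters when a grouped data frame is saved and reused, for example after `group_by()` followed by `mutate()`.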

In [17]:
# insert code here

**(i)** Based on (h), do cars produced domestically have a higher average price than non-domestically produced cars?

➡️ Type your answer here (replacing this text)

### Question 2: 

Consider the following model, _utilizing the full dataset_, 

$$price_i = \beta_0 + \beta_1 fuel_i + u_i \;\;\;\;\;\; i = \text{car 1, 2, … N in the data} \;\;\;\;\;\; (eq1) $$

**(a)** Estimate the model using the `lm()` command. Output the regression table. Then, _interpret_ your estimated coefficients $\beta_0$ and $\beta_1$, using the triplet S(ign), S(ignificance), S(ize). (For PS2, you may skip Significance, as it has yet to be covered in our class.)
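
As a rough sketch of the `lm()` workflow on made-up data (the data frame `toy` and the variables `x` and `y` are hypothetical, not the problem set variables), a simple regression can be estimated and its table, coefficients, and R-squared displayed like this:

```r
# Hypothetical toy data (NOT the problem set data)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Estimate y = b0 + b1 * x + u by OLS
toy_fit <- lm(y ~ x, data = toy)

# Regression table (coefficients, standard errors, R-squared, ...)
summary(toy_fit)

# Estimated coefficients on their own
coef(toy_fit)

# R-squared extracted from the summary object
toy_r2 <- summary(toy_fit)$r.squared
toy_r2
```

In a regression with an intercept and a single regressor, the R-squared equals the squared sample correlation between `x` and `y`, which can serve as a quick check.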

In [19]:
# insert code here

➡️ Type your answer here (replacing this text)

**(b)** Find and interpret the R-Squared from your regression.

➡️ Type your answer here (replacing this text)

**(c)** What is the predicted price for a car with the average fuel consumption in the dataset? 
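
Two common ways to compute a prediction at the sample mean of a regressor are sketched below on made-up data (the data frame `toy` and the variables `x` and `y` are hypothetical). One plugs the mean into the fitted line by hand; the other uses `predict()` with a `newdata` data frame whose column name matches the regressor.

```r
# Hypothetical toy data (NOT the problem set data)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)
toy_fit <- lm(y ~ x, data = toy)

# Option 1: plug the mean of x into the fitted line by hand
b <- coef(toy_fit)
pred_by_hand <- b[["(Intercept)"]] + b[["x"]] * mean(toy$x)
pred_by_hand

# Option 2: predict() with newdata; the column name must match the regressor
pred_predict <- predict(toy_fit, newdata = data.frame(x = mean(toy$x)))
pred_predict

# For comparison: the sample mean of the dependent variable
mean(toy$y)
```

Because an OLS line estimated with an intercept passes through the point of sample means, both predictions coincide with `mean(toy$y)` here.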

In [21]:
# insert code here

➡️ Type your answer here (replacing this text)

### Question 3: 

Consider the following model, _utilizing only the observations of domestically produced cars_:

$$log(price_i) = \beta_2 + \beta_3 log(fuel_i) + v_i \;\;\;\;\;\; i = \text{car 1, 2, … N in the data} \;\;\;\;\;\; (eq2) $$

**(a)** Estimate the model using the `lm()` command. Output the regression table. Then, _interpret_ your estimated coefficient $\beta_3$, using the triplet S(ign), S(ignificance), S(ize). 

(For PS2, you may skip Significance, as it has yet to be covered in our class.)

Note 1: Unlike Q2, Q3 asks you to utilize a smaller subset of the data with only domestically produced cars (`domestic` = 1). Use the `filter()` function. Reference Ex1 Q1(c).

Note 2: As an initial step, you need to first generate the logarithm of the variables of interest for this question's model. 

In [23]:
# insert code here

➡️ Type your answer here (replacing this text)

**(b)** Using the results from estimating (eq.2), how would you expect the price to change if the fuel consumption increases by 5%?
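
In a log-log model the slope is an elasticity, so a small percentage change in the regressor translates into roughly (slope × percentage change) percent in the dependent variable. The sketch below uses a made-up slope `b_hyp` (a hypothetical number, not an estimate from the problem set data) and compares this approximation with the exact change implied by the log-log form.

```r
# Hypothetical elasticity (NOT an estimate from the problem set data)
b_hyp <- 0.8

# Percentage change in the regressor
pct_change_x <- 5

# Approximation: %change in y is about slope * %change in x
approx_pct_change_y <- b_hyp * pct_change_x
approx_pct_change_y

# Exact change implied by log(y) = a + b * log(x):
# y is multiplied by (1 + pct_change_x / 100)^b
exact_pct_change_y <- ((1 + pct_change_x / 100)^b_hyp - 1) * 100
exact_pct_change_y
```

For small percentage changes such as 5%, the two numbers are close, which is why the approximation is commonly used for interpretation.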

➡️ Type your answer here (replacing this text)

### Question 4: 

We will now explore the role of weight and fuel consumption in explaining the price of a car, _utilizing the full dataset_:

$$price_i = \beta_4 + \beta_5 fuel_i + \beta_6 weight_i + \gamma_i \;\;\;\;\;\; i = \text{car 1, 2, … N in the data} \;\;\;\;\;\; (eq3) $$

**(a)** Estimate equation (eq.3). How did your estimate of the coefficient on $fuel_i$ change between equation (eq.1) and equation (eq.3)?

Note: As mentioned above, in Q4 we are back to utilizing all observations, both domestically produced and foreign-produced cars.

In [25]:
# insert code here

➡️ Type your answer here (replacing this text)

**(b)** Without performing any calculations, what information does this give you about the correlation between `fuel` and `weight`? (Explain your reasoning in no more than 4 sentences. Hint: OVB)
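
For reference, the OVB relationship between a "short" regression (omitting a variable) and a "long" regression (including it) can be written as $\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1$, where $\tilde{\delta}_1$ is the slope from regressing the omitted variable on the included one. The sketch below checks this identity on simulated data; the variables `x1`, `x2`, `y` and all parameter values are hypothetical and unrelated to the problem set data.

```r
set.seed(118)

# Simulated data (hypothetical, NOT the problem set data)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)              # x2 is correlated with x1 by construction
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # hypothetical "true" model

long_fit  <- lm(y ~ x1 + x2)   # includes both regressors
short_fit <- lm(y ~ x1)        # omits x2
aux_fit   <- lm(x2 ~ x1)       # omitted variable regressed on the included one

b1_long  <- coef(long_fit)[["x1"]]
b2_long  <- coef(long_fit)[["x2"]]
b1_short <- coef(short_fit)[["x1"]]
delta1   <- coef(aux_fit)[["x1"]]

# Short-regression slope ...
b1_short

# ... equals long-regression slope plus (coefficient on omitted variable) * delta
b1_long + b2_long * delta1
```

The direction of the gap between the short and long slopes depends on the signs of the omitted variable's coefficient and of its relationship with the included regressor.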

➡️ Type your answer here (replacing this text)

**(c)** First, compute the correlation between the horsepower and the weight of the cars in the dataset. 

Second, suppose we were to add the variable `horsepower` to the model in (eq.3) as an additional explanatory variable, and the coefficient on horsepower is known to be positive. What do you think would happen to the estimated coefficient on weight compared to the estimate you got in equation (eq.3)? Briefly explain your reasoning, again using OVB. 

In [27]:
# insert code here

➡️ Type your answer here (replacing this text)

**(d)** Predict the expected price of a car produced domestically, with fuel measure equal to 5 liters per km and weight equal to 530 kg, using estimates from a model that allows you to do so.

➡️ Type your answer here (replacing this text)

## Exercise 2: (Use Intuition Only; No R, Dataset, or Calculation Involved)

Policy makers are interested in understanding important determinants of the number of cars sold (_Per_Capita_Number_of_Cars_Sold_) in US cities. They run the following regression model:

$$Per\_Capita\_Number\_of\_Cars\_Sold_j = \beta_0  + \beta_1 GP_j + \beta_2 CP_j + \beta_3 FC_j + u_j \;\;\;\;\;\;\;\;\;\; (eq4) $$

where $GP_j$ is the average retail gasoline price in US city _j_, $CP_j$ is the price of a car in city _j_, and $FC_j$ is an indicator for whether city _j_ has free charging stations for electric cars. Finally, $u_j$ is the disturbance term related to per-capita car sales in city _j_.  

**(a)** What do you expect the sign of $\beta_3$ to be in equation (eq.4)? Why?

(Even if you are not sure, take a specific stance, as the sign you choose will be used in your solution to part (b). Note, there is no correct answer for (a).)

➡️ Type your answer here (replacing this text)

**(b)** What would likely happen to the estimate of $\beta_1$ (the coefficient on $GP_j$) if you omit $FC_j$ from the estimation, assuming that cities with free charging ($FC_j = 1$) also have higher gasoline prices than cities without free charging ($FC_j = 0$)? Explain (very briefly) why. (Be clear on whether we would get an upward or downward bias.)

Hint: Use the OVB formula and the sign you expect from part (a).

➡️ Type your answer here (replacing this text)