# EEP/IAS 118 - Introductory Applied Econometrics

## Problem Set 4, Spring 2023, Villas-Boas

#### <span style="text-decoration: underline">Due on Gradescope April 4, 2023 (see Gradescope for the exact deadline time)</span>

Submit materials as **one pdf** on [Gradescope](https://www.gradescope.com/courses/492989). After uploading the pdf to Gradescope, please **assign all and only the appropriate pages to each question**. Questions that do not have properly assigned pages on Gradescope may not be graded. Code and output that are not properly displayed will be marked as incorrect.

**For full credit**, all confidence intervals/hypothesis tests must be conducted by hand - you can use functions like sd() or mean() to get values to plug into the formulas, but no credit will be given for the use of canned interval/test functions (e.g., linearHypothesis()) with no steps/calculations provided. Do not round any intermediate steps or final answers to fewer than four decimal places.

### Preamble
When writing R code, it's a good habit to start your notebooks or R scripts with a preamble, a section where you load all necessary packages, set paths or change the working directory, or declare other options.

Use the code cell below to load the packages you will use throughout the problem set (at least `haven`, `tidyverse` and `ggplot2` this week; note that `ggplot2` is loaded as part of `tidyverse`).

*Note:* **never** try to install packages on Datahub. All packages that you need are already installed and can be loaded immediately using the `library()` function. Attempting to install packages will create conflicts with the package versions on the server and potentially corrupt your notebook.

In [1]:
# insert your code here

### Exercise 1. 

In this problem set, we use a dataset on the annual salaries of executives, the characteristics of their firms, and the firms' outcomes. This exercise is to be completed using R. If the labor market does not value a characteristic of the employer, such as a firm outcome that the executive is responsible for (e.g., the value of sales or the change in the rate of return) or the executive's years of tenure (a proxy for experience), then the demand for those executives and their salaries go down, and vice versa.

*Note: log refers to the natural log, ln().*

| VARIABLE | Definition |
|:-:|:-:|
| SALARY | annual CEO salary (including bonuses) in 1990 (in thousands USD) |
| SALES | firm sales in 1990 (in millions USD) |
| ROE | average return on equity, 1988–1990 (in percent) |
| FINANCE | = 1 if a financial company, 0 otherwise |


#### Q1-1 ####
Read the data and create the log of salary as an additional column in the data. Call this variable `lsalary`. Note: The dataset is in Stata format. It is available on bCourses and is called `pset4_2023.DTA`.
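As a rough illustration of the two steps (read, then add a log column), here is a minimal sketch. The data frame `ceo_toy` and its values are made up so the cell runs on its own, and the column name `salary` is an assumption about how the variable is named in the real file. In the notebook, the data would instead come from `haven::read_dta()`.

```r
# Toy stand-in for the real file (made-up values; column name is an assumption).
# In the notebook, the data would instead be read with something like:
#   ceo <- haven::read_dta("pset4_2023.DTA")
ceo_toy <- data.frame(salary = c(900, 1200, 750, 1500, 1100))

# log() in R is the natural log; store it as a new column named lsalary
ceo_toy$lsalary <- log(ceo_toy$salary)

head(ceo_toy)
```

The same assignment works on a tibble returned by `read_dta()`; `dplyr::mutate()` is an equivalent tidyverse route.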

In [3]:
# insert your code here

#### Q1-2 ####
For *only* the following variables, report the sample mean, standard deviation, minimum, and maximum: salary, log salary, average return on equity, sales, and whether the company is a financial one. (Hint: `summarise()`)

Output from `summarise()` is sufficient, as long as it is clear which variable and summary statistic each number pertains to. There is no need to worry about units or unit conversion here.
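For reference, a base-R sketch of a labeled summary table is given here. It uses a made-up data frame (`ceo_toy`, hypothetical values and column names) rather than the course data, and it is only one possible layout; `dplyr::summarise()` as suggested in the hint would produce the same numbers.

```r
# Toy stand-in for the real data (made-up values; column names are assumptions)
ceo_toy <- data.frame(
  salary  = c(900, 1200, 750, 1500, 1100),
  sales   = c(12000, 8000, 3000, 20000, 6500),
  roe     = c(14.0, 11.5, 22.0, 6.0, 13.5),
  finance = c(0, 0, 0, 1, 1)
)
ceo_toy$lsalary <- log(ceo_toy$salary)

# One row per variable, one column per statistic, so every number is labeled
vars <- c("salary", "lsalary", "roe", "sales", "finance")
summ <- t(sapply(ceo_toy[vars], function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}))
summ
```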

In [5]:
# insert your code here

#### Q1-3 ####
(a) Create a histogram for the CEO salary. Label everything: add axis titles and a main title. (Hint: see the Histograms section of Coding Bootcamp Part 4) 

(b) Then, plot another histogram for CEO salary in the financial sector. 

(c) Finally, plot the same graph for CEO salary not in the finance sector. 

(d) What is the average CEO salary for each group? (Remember units and unit conversion.)

(e) Overlap both histograms into the same figure.

In [7]:
# insert your code here (a)

In [8]:
# insert your code here (b)

In [9]:
# insert your code here (c)

In [10]:
# insert your code here (d)

➡️ Type your written answer for _Exercise 1 Question 3d_ here (replacing this text).

In [11]:
# insert your code here (e)

#### Q1-4 ####
Estimate the model of salary as a linear function of a constant, firm’s sales, and average ROE. Comment on the estimated intercept and each of the right-hand side variables’ estimated parameters in terms of Sign, Size, and Significance (SSS).
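The significance part of SSS can be checked by hand from the regression output. The sketch below runs on simulated data (`toy`, with hypothetical column names `salary`, `sales`, `roe`), so its numbers say nothing about the real dataset; it only shows the mechanics of forming a t statistic and comparing it to a critical value.

```r
# Simulated toy data (hypothetical; not the course dataset)
set.seed(118)
n <- 60
toy <- data.frame(roe = runif(n, 5, 30), sales = runif(n, 500, 20000))
toy$salary <- 800 + 10 * toy$roe + 0.02 * toy$sales + rnorm(n, sd = 300)

fit <- lm(salary ~ sales + roe, data = toy)
est <- coef(summary(fit))  # columns: Estimate, Std. Error, t value, Pr(>|t|)

# t statistic for H0: coefficient = 0, computed by hand
t_by_hand <- est[, "Estimate"] / est[, "Std. Error"]

# Two-sided 5% critical value with n - k - 1 degrees of freedom (k = 2 here)
t_crit <- qt(0.975, df = n - 2 - 1)

# reject_5pct is 1 when |t| exceeds the critical value, 0 otherwise
cbind(est[, 1:2], t = t_by_hand, reject_5pct = abs(t_by_hand) > t_crit)
```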

In [17]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 4 Intercept_ here (replacing this text).

➡️ Type your written answer for _Exercise 1 Question 4 Coefficient 1_ here (replacing this text).

➡️ Type your written answer for _Exercise 1 Question 4 Coefficient 2_ here (replacing this text).

#### Q1-5 ####
In absolute terms, does average ROE or sales have a larger impact on expected salary? (Use Standardized Regression - Lec 13/Sec6) 
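A standardized regression replaces every variable with its z-score, so each slope is measured in standard deviations of salary per standard deviation of the regressor. The following is a sketch on simulated data (`toy` and the helper `z()` are hypothetical names), and the exact procedure shown in Lec 13/Sec6 may differ in presentation.

```r
# Simulated toy data (hypothetical; not the course dataset)
set.seed(118)
n <- 60
toy <- data.frame(roe = runif(n, 5, 30), sales = runif(n, 500, 20000))
toy$salary <- 800 + 10 * toy$roe + 0.02 * toy$sales + rnorm(n, sd = 300)

# z-score helper: subtract the sample mean, divide by the sample sd
z <- function(x) (x - mean(x)) / sd(x)

toy_z <- data.frame(salary = z(toy$salary), sales = z(toy$sales), roe = z(toy$roe))
fit_z <- lm(salary ~ sales + roe, data = toy_z)

# Standardized ("beta") coefficients; compare their absolute values
beta_std <- coef(fit_z)[c("sales", "roe")]
abs(beta_std)
```

Each standardized coefficient equals the raw coefficient multiplied by sd(x)/sd(y), which is a quick way to cross-check the result against the unstandardized regression.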

In [19]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 5_ here (replacing this text).

#### Q1-6 ####
(a) Estimate the model of salary as a linear function of a constant, firm’s sales, average ROE, and an indicator for being in the financial sector. 

(b) Test the joint significance of the ROE and sales variables at the 1% significance level.
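A joint test of this kind can be computed by hand from the sums of squared residuals of the unrestricted and restricted models. The sketch below uses simulated data (`toy`, hypothetical column names) and the usual F statistic with q = 2 restrictions; it illustrates the calculation only and is not the course solution.

```r
# Simulated toy data (hypothetical; not the course dataset)
set.seed(118)
n <- 60
toy <- data.frame(
  roe     = runif(n, 5, 30),
  sales   = runif(n, 500, 20000),
  finance = rep(c(0, 0, 1), length.out = n)
)
toy$salary <- 800 + 10 * toy$roe + 0.02 * toy$sales + 150 * toy$finance +
  rnorm(n, sd = 300)

fit_ur <- lm(salary ~ sales + roe + finance, data = toy)  # unrestricted
fit_r  <- lm(salary ~ finance, data = toy)                # restricted: drops sales and roe

ssr_ur <- sum(resid(fit_ur)^2)
ssr_r  <- sum(resid(fit_r)^2)
q      <- 2          # number of restrictions
df_ur  <- n - 3 - 1  # n - k - 1 with k = 3 regressors

F_stat <- ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)
F_crit <- qf(0.99, df1 = q, df2 = df_ur)  # 1% critical value

c(F_stat = F_stat, F_crit = F_crit)
```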

In [21]:
# insert your code here

#### Q1-7 ####
Compare the coefficient on ROE across the two models in Q1-4 and Q1-6. What do you conclude about whether leaving out the finance-sector indicator biases the estimated impact of ROE on salary?

➡️ Type your written answer for _Exercise 1 Question 7_ here (replacing this text).

#### Q1-8 ####
(a) Using only data on nonfinance sector observations, specify a model to predict the average salary of an executive whose firm has an ROE of 10% with 4,000 (meaning 4 billion USD) in sales.

(b) Construct a 95% confidence interval for the predicted average salary for an executive with those characteristics.
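One common way to obtain the predicted average and its standard error without a canned interval function is to re-center each regressor at the value of interest, so that the intercept becomes the prediction. The sketch below assumes this re-centering approach and runs on simulated data (`toy`, hypothetical column names); whether the course expects this exact route is an assumption.

```r
# Simulated toy data (hypothetical; not the course dataset)
set.seed(118)
n <- 60
toy <- data.frame(
  roe     = runif(n, 5, 30),
  sales   = runif(n, 500, 20000),
  finance = rep(c(0, 0, 1), length.out = n)
)
toy$salary <- 800 + 10 * toy$roe + 0.02 * toy$sales + 150 * toy$finance +
  rnorm(n, sd = 300)

nonfin <- subset(toy, finance == 0)

# Re-center at ROE = 10 and sales = 4000: the intercept is then the predicted
# average salary at those values, and its standard error is se(y-hat)
fit0 <- lm(salary ~ I(roe - 10) + I(sales - 4000), data = nonfin)

pred    <- coef(fit0)[["(Intercept)"]]
se_pred <- coef(summary(fit0))["(Intercept)", "Std. Error"]
t_crit  <- qt(0.975, df = df.residual(fit0))

ci_mean <- c(lower = pred - t_crit * se_pred, upper = pred + t_crit * se_pred)
ci_mean
```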

In [25]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 8_ here (replacing this text).

#### Q1-9 ####
What would be the 95% confidence interval for the predicted salary (not the predicted average salary) of an executive with the characteristics in Q1-8(a) above? Are you surprised that the intervals in Q1-8(b) and Q1-9 differ? (Hint: Lec 14)
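For reference, the standard textbook form of the interval for an individual prediction is sketched here. The notation is an assumption and may differ from Lec 14: $\hat{y}^0$ is the predicted value, $se(\hat{y}^0)$ is the standard error of the predicted average, $\hat{\sigma}^2$ is the estimated error variance, and $k$ is the number of regressors.

$ se(\hat{e}^0) = \sqrt{\,[se(\hat{y}^0)]^2 + \hat{\sigma}^2\,}, \qquad \hat{y}^0 \pm t_{0.025,\;n-k-1} \cdot se(\hat{e}^0) $

The extra $\hat{\sigma}^2$ term reflects the unobserved error of the individual executive, which is not present when predicting an average.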

In [29]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 9_ here (replacing this text).

#### Q1-10 ####
We are selecting between Model 1A and Model 1B below. Estimate both models, creating any variables you need. Based on your estimated output using linear regression analysis, which model would you choose to use in labor market analysis?

$ (Model\;1A) \;\; salary_i = \beta_0 + \beta_1 ROE_i + \beta_2 sales_i + \beta_3 finance_i + u_i $

$ (Model\;1B) \;\; salary_i = \gamma_0 + \gamma_1 ROE_i + \gamma_2 \log(sales_i) + \gamma_3 finance_i + v_i $
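Because both models have the same dependent variable and the same number of regressors, their R-squared values can be compared directly. The sketch below shows the mechanics on simulated data (`toy`, hypothetical column names); the resulting numbers carry no information about the real dataset.

```r
# Simulated toy data (hypothetical; not the course dataset)
set.seed(118)
n <- 60
toy <- data.frame(
  roe     = runif(n, 5, 30),
  sales   = runif(n, 500, 20000),
  finance = rep(c(0, 0, 1), length.out = n)
)
toy$salary <- 800 + 10 * toy$roe + 0.02 * toy$sales + 150 * toy$finance +
  rnorm(n, sd = 300)

fit_1a <- lm(salary ~ roe + sales + finance, data = toy)       # sales in levels
fit_1b <- lm(salary ~ roe + log(sales) + finance, data = toy)  # sales in logs

r2 <- c(R2_1A = summary(fit_1a)$r.squared, R2_1B = summary(fit_1b)$r.squared)
r2
```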

In [31]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 10_ here (replacing this text).

#### Q1-11 ####
If an executive’s tenure is positively correlated with salary and also positively correlated with the sales of the company, are we overestimating the effect of sales on salary in Model 1A? Please explain briefly.
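The standard omitted-variable-bias expression is a useful starting point. In the notation below (an assumption, not necessarily the one used in lecture), $\tilde{\beta}_{sales}$ is the coefficient on sales when tenure is left out, $\beta_{tenure}$ is the effect of tenure on salary in the full model, and $\tilde{\delta}_{sales}$ is the coefficient on sales from a regression of tenure on the included regressors.

$ E[\tilde{\beta}_{sales}] = \beta_{sales} + \beta_{tenure}\,\tilde{\delta}_{sales} $

The sign of the bias is the sign of the product $\beta_{tenure}\,\tilde{\delta}_{sales}$.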

➡️ Type your written answer for _Exercise 1 Question 11_ here (replacing this text).

### Exercise 2 ###

We want to select the best model to use for future labor market analysis. We are selecting between model (2C) and model (2D) below.

$ (Model\;2C) \;\; salary_i = \theta_0 + \theta_1 ROE_i + \theta_2 sales_i + \theta_3 finance_i + u_i $

$ (Model\;2D) \;\; \log(salary_i) = \eta_0 + \eta_1 ROE_i + \eta_2 sales_i + \eta_3 finance_i + v_i $

#### Q2-1 ####
Estimate both models in Exercise 2, creating any variables you need. Which model would you choose to use henceforth? (Lec 15)
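The two models have different dependent variables, so their reported R-squared values are not directly comparable. One common textbook approach, assumed here and possibly different from the one in Lec 15, is to convert the log model's fitted values back to levels and compute the squared correlation with actual salary. The sketch runs on simulated data (`toy`, hypothetical column names).

```r
# Simulated toy data (hypothetical; not the course dataset)
set.seed(118)
n <- 60
toy <- data.frame(
  roe     = runif(n, 5, 30),
  sales   = runif(n, 500, 20000),
  finance = rep(c(0, 0, 1), length.out = n)
)
toy$salary <- exp(6.5 + 0.01 * toy$roe + 0.00002 * toy$sales +
                  0.1 * toy$finance + rnorm(n, sd = 0.4))

fit_2c <- lm(salary ~ roe + sales + finance, data = toy)       # level model
fit_2d <- lm(log(salary) ~ roe + sales + finance, data = toy)  # log model

# R-squared of the level model (share of variation in salary explained)
r2_level <- summary(fit_2c)$r.squared

# Level-scale counterpart for the log model: squared correlation between
# actual salary and exp(fitted log salary). A multiplicative scale factor on
# the fitted values would not change this correlation.
m_hat <- exp(fitted(fit_2d))
r2_log_comparable <- cor(toy$salary, m_hat)^2

c(level = r2_level, log_comparable = r2_log_comparable)
```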

In [34]:
# insert your code here

➡️ Type your written answer for _Exercise 2 Question 1_ here (replacing this text).

#### Q2-2 ####
Using Model 2D from this exercise, predict the average salary for an executive at a finance firm with ROE = 12.6% and sales = 4,500 (4.5 billion USD).
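Exponentiating the predicted log salary alone tends to understate the predicted level, so a scale adjustment is usually applied. A sketch of one standard version follows; the choice of adjustment factor is an assumption and may differ from the one taught in lecture. Here $\hat{v}_i$ are the residuals from Model 2D and $n$ is the sample size.

$ \widehat{\log(salary)} = \hat{\eta}_0 + \hat{\eta}_1 (12.6) + \hat{\eta}_2 (4500) + \hat{\eta}_3 (1) $

$ \widehat{salary} = \hat{\alpha}_0 \exp\big(\widehat{\log(salary)}\big), \qquad \hat{\alpha}_0 = \dfrac{1}{n} \sum_{i=1}^{n} \exp(\hat{v}_i) $

If the errors are assumed to be normal, $\hat{\alpha}_0 = \exp(\hat{\sigma}^2/2)$ is an alternative adjustment factor.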

In [37]:
# insert your code here

### Exercise 3: Finance or Not Finance ###

#### Q3-1 ####
Estimate model 2C separately for executives in the finance sector and for those not in the finance sector. Formally test at the 5% significance level whether the regressions should be estimated separately or whether we can pool the data like we have been doing so far. (Hint: Lec 17) 
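For reference, the usual Chow-type F statistic for comparing pooled and separate regressions is sketched below. The notation is an assumption and may differ from Lec 17: $SSR_P$ comes from the pooled regression, $SSR_{fin}$ and $SSR_{non}$ come from the two subsample regressions, $k$ is the number of slope regressors in each subsample regression, and $n$ is the total sample size.

$ F = \dfrac{\big[SSR_P - (SSR_{fin} + SSR_{non})\big] / (k+1)}{(SSR_{fin} + SSR_{non}) / \big[n - 2(k+1)\big]} \;\sim\; F_{k+1,\; n-2(k+1)} $

Note that the finance indicator is constant within each subsample, so it drops out of the separate regressions. This form tests equality of all coefficients including the intercept, with the pooled model estimated without the finance indicator. If the pooled model keeps the finance indicator, only the $k$ slopes are restricted and the numerator degrees of freedom become $k$.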

In [39]:
# insert your code here

➡️ Type your written answer for _Exercise 3 Question 1_ here (replacing this text).

#### Q3-2 ####
I would like to know whether the effect of the ROE on salary differs depending on whether the company is in the finance sector.

$ (Model\;1A) \;\; salary_i = \beta_0 + \beta_1 ROE_i + \beta_2 sales_i + \beta_3 finance_i + u_i $

Adjust Model 1A so that it lets you test this, estimate the adjusted model, and interpret your findings. Compare the p-value for the estimated coefficient of interest to the 5 percent significance level to conclude whether you reject the null of no heterogeneity in the effect of ROE on salary between finance and non-finance firms, against a two-sided alternative, holding all else equal. (Hint: Generate the interaction you need and add it to the regression.)
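The mechanics of adding an interaction and computing its two-sided p-value by hand can be sketched as follows. The data are simulated (`toy`, hypothetical column names), so the output says nothing about the real dataset.

```r
# Simulated toy data (hypothetical; not the course dataset)
set.seed(118)
n <- 60
toy <- data.frame(
  roe     = runif(n, 5, 30),
  sales   = runif(n, 500, 20000),
  finance = rep(c(0, 0, 1), length.out = n)
)
toy$salary <- 800 + 10 * toy$roe + 0.02 * toy$sales + 150 * toy$finance +
  rnorm(n, sd = 300)

# Model 1A plus the ROE x finance interaction
fit_int <- lm(salary ~ roe + sales + finance + roe:finance, data = toy)
est <- coef(summary(fit_int))

# t statistic and two-sided p-value for H0: interaction coefficient = 0
t_int <- est["roe:finance", "Estimate"] / est["roe:finance", "Std. Error"]
p_int <- 2 * pt(-abs(t_int), df = df.residual(fit_int))

c(t = t_int, p_value = p_int, reject_5pct = p_int < 0.05)
```

The interaction coefficient measures how much the ROE slope differs for finance firms relative to non-finance firms.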

In [43]:
# insert your code here

➡️ Type your written answer for _Exercise 3 Question 2_ here (replacing this text).