# EEP/IAS 118 - Introductory Applied Econometrics Problem Set 2

## Problem Set 2, Answer Key, Spring 2023, Villas-Boas

#### <span style="text-decoration: underline">Due in Gradescope on Feb 23, 2023 (see Gradescope for the exact deadline time)</span>

Submit materials as **one pdf** on [Gradescope](https://www.gradescope.com/courses/492989). After uploading the pdf to Gradescope, please **assign appropriate pages to each question**. Furthermore, please double check that, **on the pdf**, (i) code is runnable and fully visible and (ii) associated output is fully visible. **Points will be deducted** if pages are not assigned to questions or if code/output is cut off and not fully visible.

## Exercise 1 (in R). Relationship between CEO Salary and Covariates

#### Guidelines:
* This exercise should be completed using R.

* When you want an output to show, you need to explicitly call the object. For example, if you want to show the mean salary, you can't just type `MeanSalary <- mean(data$salary)`, because that will just save the output to `MeanSalary`. Instead, you need to then type `MeanSalary` on its own line, so it displays the output. Answers that do not display the desired output will be graded as incorrect.
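
A minimal sketch of the difference between saving and displaying a result. The vector `toy_salary` is hypothetical and is not part of the problem-set data:

```r
# Hypothetical salaries in 1,000 USD; not taken from the problem-set data
toy_salary <- c(500, 750, 1200)

# Assignment alone stores the result without printing anything
MeanSalary <- mean(toy_salary)

# Calling the object on its own line displays the stored value
MeanSalary
```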

* To write comments in your script or in code cells (text that will appear but will not be read as R commands), type a `#` at the beginning of the line. Use these as notes to keep track of which question you are trying to answer, the purpose of each command, etc.

In [1]:
# a commented out line of code - nothing happens if you run this cell

#### To get started:

* For those working with RStudio on a local installation, install the following two packages (as seen in header image below). For those working on Datahub, DO NOT install packages.
    * `install.packages("tidyverse")`
    * `install.packages("haven")`

* Then, for both local and Datahub users, load these packages.
    * `library(tidyverse)`
    * `library(haven)`

* This time you are opening a STATA-formatted “.dta” file, so you will use the `read_dta()` command, provided by the haven package. Load the data into R as follows:
    * `YourDataName <- read_dta("file_name")`


In [2]:
# insert code here (call in the necessary packages here)

#### R Tips:

* See the Bootcamp Parts 1 and 2 for some basics. See also all the Lecture R script files, where we did most of what you are asked to do here: e.g., Lecture5.R, Lecture6.R, and Lecture7.R.

* The command `table()` lists all values a variable takes in the sample and the number of times it takes each value. 

* To summarize data for a specified subset of the observations, you can use `filter()` to subset the data, and then either `summary()` for simple summary statistics or `summarise()` in tidyverse to generate more detailed summary statistics.

* If you are using RStudio, your header should be organized with notes such that the purpose of the script file is clear, the packages needed to run the script are installed and loaded at the top, and so on. See image below for example: ![image.png](attachment:4ce1131d-b239-4344-bcc4-8b67303c0081.png)
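
The commands in the tips above can be sketched on a small made-up data frame. The object `toy` and its values are hypothetical; only the variable names mirror the problem-set data:

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  salary = c(500, 750, 1200, 1500, 900),
  grad   = c(0, 1, 1, 0, 1)
)

# Count how many times each value of grad appears
table(toy$grad)

# Keep only the rows where grad equals 1
toy_grad <- filter(toy, grad == 1)

# Quick summary statistics for the subset
summary(toy_grad$salary)

# Custom summary statistics for the same subset
summarise(toy_grad,
          mean_salary = mean(salary),
          sd_salary   = sd(salary))
```

Note that `summary()` returns a fixed set of statistics, while `summarise()` lets you name and choose the statistics yourself.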

#### Data Description: 

The data set for this exercise comes from a 1990 sample of salaries for several firms’ chief executive officers (CEOs). In this problem set, we delve into that data set to provide insights into CEO salaries and then relate the variation in salaries in the sample to firm and individual CEO characteristics.

The **pset2_2023data.dta** file includes the following variables, which are the only ones we will use in this pset:


| Variable Name | Description and Units |
| :-: | :-----------------------: |
| salary | 1990 compensation, 1000 USD |
| age | in years |
| college | = 1 if attended college |
| grad | = 1 if attended graduate school |
| comten | years with company |
| ceoten | years as ceo with company |
| sales | 1990 firm sales, million USD |
| profits | 1990 profits, million USD | 


### Question 1: First, we would like you to become familiar with your data:

<span style="color:red"> **Note of Caution:**</span> The unit for salary is 1,000 USD. This means that when a coefficient or summary statistic for the variable salary reads 800, the correct answer or interpretation we are looking for is 800,000 USD.

**(a)** Read in the data: use `my_data <- read_dta("pset2_2023data.dta")`

In [4]:
# insert code here

**(b)** How many individual CEOs are in the data set?

In [6]:
# insert code here

➡️ Type your answer to _Exercise 1 Q1(b)_ here (replacing this text)

**(c)** How many CEOs have a salary greater than 1,000,000 dollars?

Hint: Using the tidyverse method, the command `filter()` trims the data frame to observations with salary above 1,000,000 USD: `over1000 <- filter(my_data, salary > 1000)`. Then look at the number of rows in `over1000`.

In [8]:
# insert code here

➡️ Type your answer to _Exercise 1 Q1(c)_ here (replacing this text)

**(d)** What is the average salary in the data set? (Reminder: Convert and indicate appropriate units.)

In [10]:
# insert code here

➡️ Type your answer to _Exercise 1 Q1(d)_ here (replacing this text)

**(e)** What is the range for the variable salary in the data? (Reminder: Convert and indicate appropriate units.)

In [12]:
# insert code here

➡️ Type your answer to _Exercise 1 Q1(e)_ here (replacing this text)

**(f)** What is the range of CEO tenure in the data?

Hint: `summarise(my_data, "Range CEO Tenure (Yrs)" = max(ceoten) - min(ceoten))`

In [14]:
# insert code here

➡️ Type your answer to _Exercise 1 Q1(f)_ here (replacing this text)

**(g)** Construct a variable TenureBeforeCEO equal to tenure at the company minus the tenure as CEO. 

You can create a new variable: `my_data <- mutate(my_data, TenureBeforeCEO = comten-ceoten)`

Note: you do not need to interpret this variable, this is just to practice making new variables in R.

In [16]:
# insert code here

**(h)** Plot a histogram of this constructed variable.  

Hint: `hist(my_data$TenureBeforeCEO)`

In [18]:
# insert code here

**(i)** What is the range of tenure for CEOs whose salaries are over 1,000,000 USD?

In [20]:
# insert code here

➡️ Type your answer to _Exercise 1 Q1(i)_ here (replacing this text)

**(j)** What are the mean and the median of tenure for CEOs whose salaries are over 1,000,000 USD?

Hint 1: `summary(over1000$ceoten)`

Hint 2: Using tidyverse method,

`summarise(over1000,`

`         "Mean Tenure Years for Salaries > 1000" = mean(ceoten),`

`         "Median Tenure Years for Salaries > 1000" = median(ceoten))`

In [22]:
# insert code here

➡️ Type your answer to _Exercise 1 Q1(j)_ here (replacing this text)

**(k)** Filter the college graduate CEOs into a separate R data object. Then, within that subset, compute the mean salary separately for CEOs with and without graduate school attendance.

Hint: Use `group_by()` and `summarise()` in the tidyverse package. See Section 2 slides or notes for introduction to `group_by` function.
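
A minimal sketch of the `group_by()` and `summarise()` pattern on made-up data. The data frame `toy` and its values are hypothetical; only the variable names mirror the problem-set data:

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  salary  = c(500, 750, 1200, 1500, 900, 650),
  college = c(1, 1, 1, 1, 0, 1),
  grad    = c(0, 1, 1, 0, 0, 1)
)

# Keep only the rows with college equal to 1
toy_college <- filter(toy, college == 1)

# One row per value of grad, holding the group mean of salary
toy_means <- toy_college %>%
  group_by(grad) %>%
  summarise(mean_salary = mean(salary))

toy_means
```

The result has one row per group, so the two means can be compared side by side.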

In [24]:
# insert code here

**(l)** Do CEOs with graduate school attendance have higher average salaries than CEOs without graduate school attendance?

➡️ Type your written answer for EX1 Q1(l) here (replacing this text).

### Question 2

Consider the following model, where $x_i$ is the tenure in years of individual $i$ as CEO:

$$Salary_i= \beta_1 + \beta_2 x_i + u_i~~~~~~~~ i = \text{individuals 1,2,...N in the data} ~~~~~~~~~~~~~~~~~~~~(eq.1)$$

**(a)** Estimate the model in R with the `lm()` command. Interpret your `β_1` and `β_2` coefficients, remembering the triplet S(ign), S(ignificance), S(ize), though you don't need to comment on significance in this problem-set. Make sure your uploaded pdf includes the R output.
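
A minimal sketch of a simple regression with `lm()`, assuming a made-up data frame. The object `toy` and its values are hypothetical, so the estimates say nothing about the problem-set data:

```r
# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  salary = c(600, 700, 650, 900, 1000, 950),
  ceoten = c(1, 3, 5, 8, 12, 15)
)

# Regress salary on CEO tenure; lm() includes the intercept automatically
toy_fit <- lm(salary ~ ceoten, data = toy)

# Coefficient table, R-squared, and other fit statistics
summary(toy_fit)

# Extract the estimated intercept and slope by name
coef(toy_fit)["(Intercept)"]
coef(toy_fit)["ceoten"]
```

Because salary is measured in 1,000 USD, both coefficients are read in those units: the slope is the change in salary (in 1,000 USD) associated with one more year of tenure.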

In [26]:
# insert code here

➡️ Type your written answer for EX1 Q2a here (replacing this text).

**(b)** How well does the number of years of tenure as CEO predict the salary?

➡️ Type your written answer for EX1 Q2b here (replacing this text).

**(c)** What is the predicted salary for a CEO with an average tenure as CEO of 4 years? 
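
One way to obtain a predicted value from a fitted model is sketched here, assuming made-up data. The objects `toy`, `toy_fit`, and `new_ceo` are hypothetical:

```r
# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  salary = c(600, 700, 650, 900, 1000, 950),
  ceoten = c(1, 3, 5, 8, 12, 15)
)
toy_fit <- lm(salary ~ ceoten, data = toy)

# New observation; the column name must match the regressor in the model
new_ceo <- data.frame(ceoten = 4)

# Fitted value from predict()
pred_auto <- predict(toy_fit, newdata = new_ceo)

# Same prediction by hand: intercept + slope * 4
pred_manual <- coef(toy_fit)["(Intercept)"] + coef(toy_fit)["ceoten"] * 4

pred_auto
pred_manual
```

Both approaches give the same number; the manual version makes the role of each coefficient explicit.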

In [28]:
# insert code here

➡️ Type your written answer for EX1 Q2c here (replacing this text).

### Question 3

Consider the following model, estimated only on observations of CEOs with **more than 0 years of tenure**, where $x_i$ is the tenure in years of individual $i$ as CEO:

$$log(Salary_i)= \beta_3 + \beta_4 log(x_i) + v_i~~~~~~~~ i = \text{individuals 1,2,...N in the data} ~~~~~~~~~~~~~~~~~~~~(eq.2)$$

**(a)** Estimate the model in R with the `lm()` command. Interpret your `β_4` coefficient, remembering the triplet S(ign), S(ignificance), S(ize), though you don't need to comment on significance in this problem set. You need to generate the logarithms of the variables of interest for this question's model: use `log()` to create the log of salary and the log of tenure as CEO. Run the model only for CEOs with non-zero tenure, that is, on a filtered subset of the data with `ceoten > 0`.

In [30]:
# insert code here

➡️ Type your answer to _Exercise 1 Q3(a)_ here (replacing this text)

**(b)** Using the results from estimating (eq. 2), how would you expect the salary to change if the years of CEO tenure increase by 25%?
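
As a reminder of how a log-log slope is usually read (a sketch of the general rule, not a numerical answer): in (eq. 2) the coefficient $\beta_4$ is an elasticity, so for a small percentage change in $x$,

$$\%\Delta \widehat{Salary} \approx \hat{\beta}_4 \times \%\Delta x$$

This is an approximation that works best for small changes. For a larger change such as 25%, the exact implied change is $100 \times (1.25^{\hat{\beta}_4} - 1)$ percent, because $\Delta \log(Salary) = \hat{\beta}_4 \log(1.25)$.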

➡️ Type your written answer for EX1 Q3b here (replacing this text).

### Question 4

We will now explore the role of CEO and firm characteristics in explaining salary, where $x_i$ is the CEO tenure in years as before and $grad_i$ is an indicator equal to one if the CEO attended graduate school and equal to zero otherwise.

$$Salary_i= \beta_5 + \beta_2 x_i + \beta_3 grad_i + \gamma_i~~~~~~~~ i = \text{individuals 1,2,...N in the data} ~~~~~~~~~~~~~~~~~~~~(eq.3)$$

**(a)** Estimate equation (eq.3). How did your estimate `β_2` for x (that is, CEO tenure) change between equation (eq.1) and equation (eq.3)? 

In [33]:
# insert code here

➡️ Type your written answer for EX1 Q4a here (replacing this text).

**(b)** Without performing any calculations, what information does this give you about the correlation between the CEO tenure years and whether a CEO attended graduate school? (Explain your reasoning in no more than 4 sentences. Hint: OVB)
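
The omitted variable bias (OVB) relationship behind the hint can be written as follows. The notation $\tilde{\beta}_2$ and $\hat{\delta}$ is introduced here for illustration and is an assumption, not the course's notation. Let $\tilde{\beta}_2$ be the slope on $x_i$ from (eq.1), $\hat{\beta}_2$ and $\hat{\beta}_3$ the estimates from (eq.3), and $\hat{\delta}$ the slope from regressing $grad_i$ on $x_i$. Then

$$\tilde{\beta}_2 = \hat{\beta}_2 + \hat{\beta}_3 \, \hat{\delta}$$

Since $\hat{\delta}$ has the same sign as the sample correlation between $x_i$ and $grad_i$, the direction in which the tenure coefficient moves, together with the sign of $\hat{\beta}_3$, reveals the sign of that correlation.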

➡️ Type your written answer for EX1 Q4b here (replacing this text).

**(c)** Compute the correlation between the years of CEO tenure and the profits of the CEO's firm. If you include the variable profits in the model in equation (eq.3) and the estimated slope coefficient on profits is 0.58, what do you think will happen to the estimated coefficient on CEO tenure years compared to the estimate you got in equation (eq.3)? Briefly explain your reasoning, again using OVB.
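
A minimal sketch of computing a sample correlation with base R's `cor()`, assuming made-up data. The object `toy` and its values are hypothetical:

```r
# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  ceoten  = c(1, 3, 5, 8, 12, 15),
  profits = c(40, 10, 120, 60, 200, 150)
)

# Pearson correlation between the two variables
toy_cor <- cor(toy$ceoten, toy$profits)

toy_cor
```

If a variable contained missing values, `cor()` would return `NA` unless an option such as `use = "complete.obs"` were supplied.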

In [35]:
# insert code here

➡️ Type your written answer for EX1 Q4c here (replacing this text).

**(d)** Predict the expected salary for a CEO with no graduate education and with 20 years of tenure as CEO using your estimates from equation (eq.3).

In [37]:
# insert code here

➡️ Type your answer to _Exercise 1 Q4(d)_ here (replacing this text)

## Exercise 2 (Intuition Only; No R or Calculation Involved)

Policy makers are interested in understanding important determinants of the number of electric cars sold in US cities and run the following model:

$$Per~Capita~Number~of~Electric~Cars~Sold_j= \beta_1 + \beta_2 GP_j + \beta_3 ER_j + u_j ~~~~~~~~~~~~~~~~~~~~ (eq.4)$$

where $GP_j$ is the average retail gasoline price in US city $j$. It is assumed to be related to the per capita number of electric cars sold in city $j$, and also to whether city $j$ has environmental regulation for gasoline ($ER_j = 1$) or not ($ER_j = 0$). Finally, $u_j$ is the disturbance term related to per capita electric vehicle sales in city $j$.

**(a)** What do you expect the sign of `β_3` to be in equation (eq.4)? Why? 

➡️ Type your written answer for EX2(a) here (replacing this text).

**(b)** What would probably happen to `β_2` if you omit ER=EnvRegulation from the estimation, assuming that cities with environmental regulation can only sell gasoline that is more costly? Explain (very briefly) why.

➡️ Type your written answer for EX2(b) here (replacing this text).

**Please remember to submit your Jupyter Notebook displaying all code and output.**