# EEP/IAS 118 - Introductory Applied Econometrics
## Problem Set 2, Spring 2021, Villas-Boas
#### <span style="text-decoration: underline"> Due as announced on bcourses </span> 


Submit materials (pdf of your Jupyter notebook with all code cells run) as one combined pdf on [Gradescope](https://www.gradescope.com/courses/226571).

## Exercise 1: Food expenditures, food at home, away from home, in US, by region

### Guidelines

This exercise should be completed using R. Remember, when you want an output to show in the notebook, you need to *explicitly* call the object in your R code. For example, if you want to show the mean of number of cars per consumer unit, you can't just type `MeanNcars <- mean(ncars)`, because that will just save the output to `MeanNcars`. Instead, you need to then type `MeanNcars` on its own so it displays the output. **Answers that do not display the desired output will be graded as incorrect.**
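A minimal sketch of the difference between saving and displaying a result. The vector `toy_ncars` is a made-up stand-in (an assumption for illustration), not the problem set data:

```r
# Hypothetical toy values, not taken from the problem set data
toy_ncars <- c(1.5, 2.0, 2.5)

# Assignment alone stores the result but displays nothing in the notebook
MeanNcars <- mean(toy_ncars)

# Calling the object on its own line displays the stored value
MeanNcars
```

Only the last line produces visible output in the notebook.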


To write comments in your script (text that will appear but will not be read as commands), type a \# at the beginning of the line. Use these as notes to keep track of which question you are trying to answer, the purpose of each command, etc.
 

In [None]:
# a commented out line of code - nothing happens if you run this cell

To get started:

This time you are opening a Stata-formatted `.dta` file, so you will use the `read_dta()` command instead of the import command that was applied to the `.csv` spreadsheet in the first problem set. You will need to load the `haven` package to do this. Your command should look like this:

`DataName <- read_dta("file_name")`
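The pattern above can be sketched end to end as follows. To keep the example self-contained, it first writes a small made-up data frame to a temporary `.dta` file; the names `toy`, `toy_path`, and `ToyData` are hypothetical and not part of the problem set:

```r
library(haven)

# Hypothetical toy data, written to a temporary .dta file
# so that the example does not depend on any external file
toy <- data.frame(region = c("A", "B"), ncars = c(1.5, 2.0))
toy_path <- tempfile(fileext = ".dta")
write_dta(toy, toy_path)

# Read the Stata file into R, then call the object to display it
ToyData <- read_dta(toy_path)
ToyData
```

For the problem set itself, the same call would presumably take the name of the provided file, for instance `read_dta("dataPset2.dta")`, assuming the file sits in the notebook's working directory.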

Exercises require both R work and written answers.

### R Tips
* See the Coding Bootcamp Parts 1 and 2 for some basic operations (e.g., counting observations, scatterplots, generating variables). See also the Lecture R command files, which replicate the lectures where we did most of what you are asked to do here, e.g., Lecture5.R, Lecture6.R, and Lecture7.R.


* The command `table()` lists all values a variable takes in the sample and the number of times it takes each value. 


* To summarize data for a specified subset of the observations, you can use `filter()` to subset the data, and then either `summary()` for simple summary statistics or `summarise()` in `tidyverse` to generate more detailed summary statistics.
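The tips above can be sketched on a small made-up data frame. The object `toy` and its columns `group` and `x` are hypothetical and chosen only for illustration:

```r
library(dplyr)

# Hypothetical toy data; the column names are illustrative only
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  x     = c(1, 2, 3, 4, 5)
)

# List each value of `group` and how many times it appears
table(toy$group)

# Keep only the rows where group equals "b"
toy_b <- filter(toy, group == "b")

# Simple summary statistics for the subset
summary(toy_b$x)

# More detailed, user-chosen statistics for the same subset
stats_b <- summarise(toy_b, n_obs = n(), mean_x = mean(x), sd_x = sd(x))
stats_b
```

`summary()` gives a fixed set of statistics, while `summarise()` lets you choose and name each statistic yourself.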

### Data Description
The data for this exercise come from the Bureau of Labor Statistics (BLS) Consumer Expenditure Survey (CES). With special permission, researchers can access highly disaggregated data, but anyone can access the representative consumer spending data by U.S. Census region, of which there are nine. In this problem set we use these data to examine the number of cars per consumer unit (household) in the nine Census regions, and then relate the number of cars per consumer unit to the consumer unit's gross income and to the number of people in the consumer unit.

The **dataPset2.dta** file includes the following variables:

| Variable Name | Description                             |
|---------------|-----------------------------------------|
| region        | U.S. Census Region                      |
| ncu           | Number of consumer units (in thousands) |
| incgross      | Income before taxes in 2018 (USD)       |
| incnet        | Income after taxes in 2018 (USD)        |
| n             | Number of people per consumer unit      |
| ncars         | Number of vehicles per consumer unit in 2018 |
| exp           | Average annual expenditures of consumer unit in 2018 (USD) |
| fahe          | Food at home expenditures in 2018 |
| fawaye        | Food away from home expenditures in 2018 |


### Preamble
When writing R code, it's a good habit to start your notebooks or R scripts with a preamble, a section where you load all necessary packages, set paths or change the working directory, or declare other options.

Use the code cell below to load the packages you will use throughout the problem set (at least `haven` and `tidyverse` this week):

*Note:* **never** try to install packages on Datahub. All packages that you need are already installed and can be loaded immediately using the `library()` function. Attempting to install packages will create conflicts with the package versions on the server and potentially corrupt your notebook.

In [None]:
# Use this cell to load desired packages

### Question 1
First we would like you to become familiar with your data:

(a) How many US regions are in the data set? How many regions have more than 1.9 cars per consumer unit?

In [None]:
# insert code here

(b) What is the average number of cars per consumer unit in the data set?   

In [None]:
# insert code here

(c) What is the range for the variable **incgross**, gross income, in the data? 

In [None]:
# insert code here

(d) Construct a variable *cpc* equal to the number of cars **per capita**, that is, the number of cars per consumer unit (`ncars`) divided by the number of people in the consumer unit (`n`). You will need to create this new variable in R.

In [None]:
# insert code here

(e) Plot a histogram of this constructed variable. What is the range of numbers of cars per capita in the US in 2018?

In [None]:
# insert code here

(f) What is the mean of *cpc*? What is the median *cpc*?  

In [None]:
# insert code here

(g) Compute the mean number of cars per capita separately for coastal and non-coastal regions (hint: use `group_by()` and `summarise()` in the `tidyverse` package).
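A minimal sketch of the `group_by()` and `summarise()` pattern mentioned in the hint, on made-up data. The object `toy`, the grouping column `flag`, and the values are all hypothetical; how to define the grouping for the actual data is left to the exercise:

```r
library(dplyr)

# Hypothetical toy data with a made-up grouping flag
toy <- data.frame(
  flag = c("yes", "yes", "no", "no"),
  y    = c(1, 3, 10, 20)
)

# Mean of y computed separately within each value of flag
group_means <- toy %>%
  group_by(flag) %>%
  summarise(mean_y = mean(y))

# Call the object to display one row per group
group_means
```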

In [None]:
# insert code here

### Question 2

Consider the following model of the number of cars per capita and gross income: 

$$cpc = \beta_0 + \beta_1 incgross + u$$

(a) Estimate the model in R with the `lm()` command. Interpret your $\hat \beta_1$ coefficient, remembering the triplet S(ign), S(ignificance), S(ize), though you do not need to comment on significance in this problem set. **Make sure your uploaded pdf includes the R output.**
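As a hedged sketch of the general `lm()` workflow, the example below uses `mtcars`, a data set that ships with base R, as a stand-in for the problem set data. The variables `mpg` and `wt` and the chosen value `wt = 3` are for illustration only:

```r
# mtcars ships with base R; it stands in for the problem set data here
fit <- lm(mpg ~ wt, data = mtcars)

# Full regression output: coefficients, standard errors, R-squared
summary(fit)

# Extract the estimated intercept and slope
coef(fit)

# Predicted value of mpg at a chosen value of the regressor (wt = 3)
pred <- predict(fit, newdata = data.frame(wt = 3))
pred
```

The column name in `newdata` must match the regressor name used in the formula.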


In [None]:
# insert code here

➡️ Type your answer to _Exercise 1.2.a. here (replacing this text)

(b) How well does total gross income predict per capita number of cars?


In [None]:
# insert code here

➡️ Type your answer to _Exercise 1.2.b. here (replacing this text)

(c) What is the predicted number of cars per capita for a consumer unit with total gross annual income of 100,000 dollars?


In [None]:
# insert code here

➡️ Type your answer to _Exercise 1.2.c. here (replacing this text)

### Question 3

Consider the following model:

$$ log(ncars) = \beta_2 + \beta_3 log(incgross) + u $$

You need to generate the logarithm of the variables of interest for this question's model, namely the log of number of cars and the log of gross income. Use the `log()` function.
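A minimal sketch of generating logged variables and using them in a regression, again with the base R data set `mtcars` as a stand-in. The names `cars_log`, `log_mpg`, `log_wt`, and `fit_log` are hypothetical. Note that `log()` in R computes the natural logarithm by default:

```r
# mtcars ships with base R; it stands in for the problem set data here
# transform() adds the logged variables as new columns
cars_log <- transform(mtcars, log_mpg = log(mpg), log_wt = log(wt))

# Regress the logged outcome on the logged regressor
fit_log <- lm(log_mpg ~ log_wt, data = cars_log)

# Display the estimated coefficients
coef(fit_log)
```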

(a) Estimate the model, and interpret your $\hat \beta_3$ coefficient.

In [None]:
# insert code here

➡️ Type your answer to _Exercise 1.3.a. here (replacing this text)

(b) Using the results from estimating the model in this question, how would you expect the number of cars per consumer unit to change if gross income increases by 1\%?


In [None]:
# insert code here

➡️ Type your answer to _Exercise 1.3.b. here (replacing this text)

(c) How about if gross income increases by 18\%?


In [None]:
# insert code here

➡️ Type your answer to _Exercise 1.3.c. here (replacing this text)

### Question 4
We will now explore the role of consumer unit size in predicting the log of number of cars.

$$ log(ncars) = \beta_4 + \beta_5 log(incgross) + \beta_6 log(n) + u $$

(a) How did your estimated coefficient on log gross income change between the model in Question 3 ($\hat \beta_3$) and the model in Question 4 ($\hat \beta_5$)?


In [None]:
# insert code here

➡️ Type your answer to _Exercise 1.4.a. here (replacing this text)

(b) Without performing any calculations, what information does this give you about the correlation between total gross income and household size? (Explain your reasoning in no more than 4 sentences.)

➡️ Type your answer to _Exercise 1.4.b. here (replacing this text)

(c) Using your estimates from the Question 4 model, predict the expected number of cars for a consumer unit with 2 members and a total gross income of 50,000 dollars.

In [None]:
# insert code here

## Exercise 2: Vaccination Rates in 2021

Many researchers have attempted to estimate important determinants of vaccination rates in the population. Suppose a researcher was interested in the effect of a public health ad and vaccine information campaign on the vaccination rate. One could estimate such a regression as follows:

$$Npc_j=\beta_0 + \beta_1  Hpc_j +\beta_2  Education_j + u_j \ \ \ \ \ \ \ \ \ (1) $$
where $Npc_j$ is the vaccination rate (number of vaccine doses given, divided by population) in region $j$, $Education_j$ is the average number of years of education in region $j$, and $Hpc_j$ is the per capita dollar amount (in 2021) spent on vaccine campaigns in region $j$.

(a) What do you expect the sign of $\beta_1$ to be in equation (1)? Why?

➡️ Type your answer to _Exercise 2.a. here (replacing this text)

(b) List three other factors that could influence whether the vaccination rate increases (or decreases) and that would be captured in $u_j$.

➡️ Type your answer to _Exercise 2.b. here (replacing this text)

(c) What would happen to your estimate of $\beta_1$ if you omitted education from the estimation? Explain (very briefly) why.

➡️ Type your answer to _Exercise 2.c. here (replacing this text)

(d) Give an example of one factor that would cause the estimate of $\beta_1$ to be biased. State the direction of the bias and how you determined that direction.

➡️ Type your answer to _Exercise 2.d. here (replacing this text)

(e) What are the four conditions that must be satisfied for the estimator of $\beta_1$ to be unbiased? Explain whether you believe each assumption is satisfied, and why or why not (suppose we used COVID-19 vaccination rates and the per capita amount spent on vaccine ads in major media across regions to estimate $\beta_1$).

➡️ Type your answer to _Exercise 2.e. here (replacing this text)