# EEP/IAS 118 - Introductory Applied Econometrics

## Problem Set 3, Spring 2023, Villas-Boas

#### <span style="text-decoration: underline">Due on Gradescope Mar 7, 2023 (see Gradescope for the exact deadline time)</span>

Submit materials as **one pdf** on [Gradescope](https://www.gradescope.com/courses/492989). After uploading the pdf to Gradescope, please **assign appropriate pages to each question**. Questions that do not have assigned pages on Gradescope may not be graded. Code and output that are not properly displayed will be marked as incorrect.

For the purposes of this class, we will be using Berkeley's _Datahub_ to conduct our analysis remotely using these notebooks.

If instead you already have an installation of R/RStudio on your personal computer and prefer to work offline, you can download the data for this assignment from bCourses (Make sure to install/update all packages mentioned in the problem sets in order to prevent issues regarding deprecated or outdated packages).

* The data files can be accessed directly through _Datahub_ and do not require you to install anything on your computer.
* Before submitting, make sure that all code cells are run with all output fully visible, and **do not print the entire dataset in your submission**. If you viewed the data earlier, remove that line of code and re-run the code cell (as datasets get bigger this adds many pages to pdf submissions and increases the likelihood we miss your answer).

*Note: Coding Bootcamp [Part 3](https://r.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fds-modules%2FENVECON-118-SP23&branch=main&urlpath=retro%2Ftree%2FENVECON-118-SP23%2F1_CodingBootcamp%2FCoding+Bootcamp+Part+3_2023.ipynb) and [Part 4](https://r.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fds-modules%2FENVECON-118-SP23&branch=main&urlpath=retro%2Ftree%2FENVECON-118-SP23%2F1_CodingBootcamp%2FCoding+Bootcamp+Part+4_2023.ipynb) cover all necessary R methods.*

### Preamble
When writing R code, it's a good habit to start your notebooks or R scripts with a preamble, a section where you load all necessary packages, set paths or change the working directory, or declare other options.

Use the code cell below to load the packages you will use throughout the problem set (at least `haven`, `tidyverse`, and `ggplot2` this week).

*Note:* **never** try to install packages on Datahub. All packages that you need are already installed and can be loaded immediately using the `library()` function. Attempting to install packages will create conflicts with the package versions on the server and potentially corrupt your notebook.

In [1]:
# include preamble code here

## Exercise 1: Relationship between Housing Prices (in USD) and Characteristics of US Cities.

This exercise is to be completed using R. We will establish a simple linear relationship between **housing prices and characteristics of cities** in a sample of cities. This is called a hedonic regression, relating price to characteristics. The idea is that if a characteristic is valued in a city, demand for housing increases as people move there, and housing prices then increase, all else constant. Conversely, if people dislike a characteristic, such as crime, demand for housing falls and housing prices decrease, all else constant.

*Note: in economics, log always refers to the natural log, ln().*

### Data description

We will use September 2021 data from Zumper on one-bedroom apartment prices, 2019 data from the FBI on crime in US cities, and other city characteristics: number of bars, air quality index, wealth of the city measured by GDP, population, whether the city has a major sports team with a winning record, and the number of teams in the major basketball, baseball, and American football leagues. The dataset covers 98 cities.

<center><b> Readme for data variables, several sources - collected by Villas-Boas, Fall 2021 </b></center>

| Variable name | Definition | Source |
|:-:|:-|:-:|
| city | City name | |
| state | State name | |
| pricesept2021 | One bedroom housing price, in USD | www.Zumper.com |
| successteams | Dummy variable =1 if at least one NBA, NFL, or MLB team in a city had a winning record last season (2020 season), =0 otherwise | Google search |
| violentcrime2019 | Violent crimes (in thousands) | [FBI](https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/tables/table-8/table-8.xls/view) |
| numberbars | Number of bars, count | [www.yellowpages.com](www.yellowpages.com) |
| aqi2020 | Annual 2020 air quality index (AQI) | [EPA](https://aqs.epa.gov/aqsweb/airdata/download_files.html#Annual) |
| gdp | Gross domestic product (billion $) | [BEA](https://apps.bea.gov/itable/iTable.cfm?ReqID=70&step=1&acrdn=5) |
| popul2019 | 2019 population (in thousands of people) | [FBI](https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/tables/table-8/table-8.xls/view) |
| nteams | Number of major professional sports teams | Google search |


**1.** The dataset is in Stata format (.dta) and was created for the purpose of this problem set only. It is available on bCourses and Datahub and is called **Villas-Boas_2023pset3.dta**.

Read the data into R using
`my_data <- read_dta("Villas-Boas_2023pset3.dta")` 
and create a new variable, *gdpPc*, as the GDP per capita, defined as `gdp/popul2019`
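As a minimal sketch of the general pattern (on a toy data frame with made-up values, not the problem set data), a new column defined as the ratio of two existing columns can be added with `mutate()` from `dplyr`, which is loaded as part of `tidyverse`. The object name `toy` and its values are assumptions for illustration only.

```r
library(dplyr)

# Toy data frame with hypothetical values (NOT the problem set data)
toy <- tibble(
  city      = c("A", "B", "C"),
  gdp       = c(100, 250, 40),    # billions of dollars
  popul2019 = c(500, 1000, 200)   # thousands of people
)

# Add a per-capita column as the ratio of two existing columns
toy <- toy %>% mutate(gdpPc = gdp / popul2019)

toy$gdpPc
```

The same line works on a data frame read in with `read_dta()`, since `mutate()` returns the original columns plus the new one.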

In [3]:
# insert your code here

**2.** For each of the variables (including the newly created *gdpPc* variable), get (a) the sample mean, (b) the median, (c) the minimum, and (d) the maximum.

_Hint: see the section on `summary()` in Coding Bootcamp Part 1_

In [5]:
# insert your code here

**3.** Create a histogram with $\color{magenta}{\text{pink}}$ bars for *pricesept2021*. Label everything: add axis titles and a main title.

_Hint: see the Histograms section of Coding Bootcamp Part 4_

In [7]:
# insert your code here

**4.** We will now compare housing prices across the two groups (group **successteams = 1** and **successteams = 0**) using the dataset. How many cities have at least one successful team and how many cities have no successful team?

In [9]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 4_ here (replacing this text).

**5.** Draw separate histograms of the housing price in each group of cities, with and without successful teams.

*Hint: see the “Histograms” section of Coding Bootcamp Part 4*

In [11]:
# insert your code here

**6.**	Overlay both histograms on the same graph and comment on the differences (be precise, and explain why the differences intuitively make sense).

*Hint: see the “Stacking/Multiple Histograms” section of Coding Bootcamp Part 4*
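One common way to overlay two histograms in `ggplot2` is sketched below on simulated toy data. The data frame `toy` and its columns `value` and `group` are hypothetical, and the Coding Bootcamp may use a different approach, so treat this as an illustration rather than the expected solution.

```r
library(ggplot2)

# Toy data: a numeric variable and a 0/1 group indicator (hypothetical)
set.seed(118)
toy <- data.frame(
  value = c(rnorm(100, mean = 10), rnorm(100, mean = 12)),
  group = rep(c(0, 1), each = 100)
)

# position = "identity" draws the groups on top of each other instead of stacking;
# alpha < 1 makes the bars semi-transparent so both groups stay visible
p <- ggplot(toy, aes(x = value, fill = factor(group))) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 20) +
  labs(title = "Overlaid histograms by group",
       x = "Value", y = "Count", fill = "Group")
p
```

Wrapping the grouping variable in `factor()` makes `ggplot2` treat it as two discrete categories rather than a continuous scale.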

In [13]:
# insert your code here

➡️ Type your written answer for _Exercise 1 Question 6_ here (replacing this text).

**7.**	Compute an estimate for the mean of the housing prices for the successful teams city group (**successteams=1**) in the data frame. Construct a 95% confidence interval for this mean. Give an interpretation of these results in a sentence.  

_(Hint: use `mean()` and `sd()` to get the necessary information to construct the CI)_
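As a reminder, a confidence interval for a population mean generally takes the following form, where $\bar{y}$ is the sample mean, $s$ the sample standard deviation, $n$ the number of observations in the group, and $c$ the critical value for the chosen confidence level. The notation is assumed here and may differ from lecture.

$$
\bar{y} \;\pm\; c \cdot \frac{s}{\sqrt{n}}
$$

Under the normal approximation, $c \approx 1.96$ for a 95% interval; with a small sample, the critical value from the $t$ distribution with $n - 1$ degrees of freedom is slightly larger. Which of the two the course expects is not stated here, so check the lecture notes.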

In [15]:
# insert code here

➡️ Type your answer to _Exercise 1 Question 7_ here (replacing this text)

**8.**	Let $D$ be the difference in mean housing prices between cities with successful sports teams (**successteams=1**) and cities without (**successteams=0**). State an estimator $\hat{D}$ for $D$ and use the estimator to compute an estimate of $D$. Compute a standard error for $\hat{D}$. Derive a 90% confidence interval for $D$ and interpret it in one sentence.
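For two independent groups, a standard estimator of a difference in means and its standard error can be written as below. The subscripts $1$ and $0$ (for the two groups) and the symbols $\bar{y}$, $s^2$, and $n$ (group sample mean, sample variance, and sample size) are assumed notation; this version does not assume equal variances across groups.

$$
\hat{D} = \bar{y}_1 - \bar{y}_0,
\qquad
se(\hat{D}) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}}
$$

A confidence interval then has the form $\hat{D} \pm c \cdot se(\hat{D})$, with $c \approx 1.645$ for a 90% interval under the normal approximation.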

In [17]:
# insert your code here

➡️ Type your answer to _Exercise 1 Question 8_ here (replacing this text)

**9.** Using the data frame, test whether the average of the housing values **pricesept2021** for the successful teams’ city group is statistically different at the 1% significance level ($\alpha$ = 0.01) from average housing values in the unsuccessful teams’ city group (that is, in terms of the hypothesis, the null is equal, and the alternative is not equal). (Recall the 5-step procedure for hypothesis testing.)

In [19]:
# insert your code here

➡️ Type your answer to _Exercise 1 Question 9_ here (replacing this text)

**10.**	Let's now look at air quality in the data. **The U.S. AQI is EPA’s index for reporting air quality**. Draw a histogram for **aqi2020** and add a vertical red line at the EPA standard for Spare the Air Day AQI = 100. (https://www.airnow.gov/aqi/aqi-basics/)

For example, in the Bay Area, a Spare the Air Alert is called when air quality is forecast to be unhealthy, or above 100 in the AQI, in any one of the reporting zones. An alert may span over two days if air quality is expected to remain unhealthy for prolonged periods. If air quality is unhealthy in the Bay Area, it is almost always because of two kinds of air [pollutants](https://www.sparetheair.org/understanding-air-quality/air-pollutants-and-health-effects/whos-at-risk): [Ozone](https://www.sparetheair.org/understanding-air-quality/air-pollutants-and-health-effects/ozone) and [fine particulate matter, or PM2.5](https://www.sparetheair.org/understanding-air-quality/air-pollutants-and-health-effects/particulate-matter).

_Hint: see the "Lines" section of Coding Bootcamp Part 4_
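A vertical reference line can be added to a `ggplot2` histogram with `geom_vline()`. The sketch below uses simulated toy data and an arbitrary threshold of 60; the data frame `toy`, the column `x`, and the threshold are hypothetical and unrelated to the AQI data.

```r
library(ggplot2)

# Simulated toy data (hypothetical, not the problem set data)
set.seed(118)
toy <- data.frame(x = rnorm(200, mean = 50, sd = 10))

# Histogram with a vertical red line at an arbitrary threshold
threshold <- 60
p <- ggplot(toy, aes(x = x)) +
  geom_histogram(bins = 20, fill = "grey70", color = "white") +
  geom_vline(xintercept = threshold, color = "red") +
  labs(title = "Histogram with a reference line", x = "x", y = "Count")
p
```

If the threshold lies outside the range of the data, the line is still drawn and the x axis expands to include it.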

In [21]:
# insert your code here

**11.**	(a) Regress **pricesept2021** on a constant, **successteams, violentcrime2019, aqi2020, numberbars, gdpPc**. (b) Generate a series of the predicted values of price and plot those against the price data series: What do you see in terms of fit? 
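The general workflow (fit a linear model, extract predicted values and residuals, then plot predicted against actual values) can be sketched on simulated toy data as follows. The variable names `y`, `x1`, and `x2` are hypothetical stand-ins, and the coefficients used to simulate the data are arbitrary.

```r
library(ggplot2)

# Simulated toy data with hypothetical variable names
set.seed(118)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 1 + 2 * toy$x1 - 0.5 * toy$x2 + rnorm(50)

# Fit a linear model; lm() includes a constant by default
fit <- lm(y ~ x1 + x2, data = toy)

# Store predicted values and residuals alongside the data
toy$y_hat <- fitted(fit)
toy$resid <- residuals(fit)

# Predicted vs. actual values, with a 45-degree reference line
p <- ggplot(toy, aes(x = y, y = y_hat)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  labs(x = "Actual y", y = "Predicted y")
p

# Share of the variation in y explained by the model
r2 <- summary(fit)$r.squared
r2
```

Points close to the 45-degree line indicate observations the model predicts well; `summary(fit)` also reports coefficient estimates, standard errors, and the $R^2$.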

In [23]:
# insert your code here

➡️ Type your answer to _Exercise 1 Question 11_ here (replacing this text)

**12.**	What percent of the variation in housing prices does the model explain, and what percent does the model **NOT** explain?

➡️ Type your answer to _Exercise 1 Question 12_ here (replacing this text)

**13.**	Compute the residuals series and plot the residuals on the vertical axis against **gdpPc** on the horizontal axis, using `ggplot()`. When plotting, exclude the outlier city with gdpPc > 6 by setting the ggplot scale limits as follows: `lims(x = c(0, 6), y = c(-1000,1500))`.

Looking at the scatter plot of the estimated residuals, does the constant variance assumption appear to hold across different levels of GDP per capita (**gdpPc**)?

In [51]:
# insert your code here

➡️ Type your answer to _Exercise 1 Question 13_ here (replacing this text)

**14.**	Using the triplet Sign Size Significance (SSS), let’s interpret two of the coefficients from the model in Question 11. 

**(a)** What can you say of the effect of **aqi2020** on housing prices holding other factors constant? 

➡️ Type your answer to _Exercise 1 Question 14 part (a)_ here (replacing this text)

**(b)** What about the coefficient on **successteams**? Use the (SSS) interpretation again.

➡️ Type your answer to _Exercise 1 Question 14 part (b)_ here (replacing this text)

**15.**	Given the estimated coefficients in Question 11's regression, and after you estimate the correlation between **gdpPc** and the air quality index across cities, what will happen to the estimated coefficient of **gdpPc** if you do not include the air quality index (**aqi2020**) in the estimated regression in Question 11? Go through the omitted variable bias formula and explain briefly.
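For reference, the textbook two-regressor version of the omitted variable bias formula is sketched below. The notation is assumed and may differ from lecture: $\hat{\beta}_1$ and $\hat{\beta}_2$ are the coefficients on $x_1$ and $x_2$ from the full regression, $\tilde{\beta}_1$ is the coefficient on $x_1$ when $x_2$ is omitted, and $\tilde{\delta}_1$ is the slope from a regression of $x_2$ on $x_1$.

$$
\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \, \tilde{\delta}_1
$$

The sign of $\tilde{\delta}_1$ matches the sign of the sample correlation between $x_1$ and $x_2$, so the direction of the change depends on the sign of $\hat{\beta}_2$ and the sign of that correlation. With additional controls in the regression, this two-regressor formula gives the intuition for the direction of the change rather than its exact size.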

In [28]:
# insert your code here

➡️ Type your answer to _Exercise 1 Question 15_ here (replacing this text)

**16.**	Now estimate the model in Question 11 but do not include **aqi2020**. What is the new estimate of the coefficient on **gdpPc**, and does it confirm your answer in Question 15?

In [30]:
# insert your code here

➡️ Type your answer to _Exercise 1 Question 16_ here (replacing this text)

**17.**	What happens to the R squared ($R^2$) when you do not include the **aqi2020** variable in the equation compared to the R squared in Question 11?

In [32]:
# insert your code

➡️ Type your answer to _Exercise 1 Question 17_ here (replacing this text)

## Exercise 2: Survey Evidence Towards Safe Working Conditions of Essential Workers during COVID

Two former EEP majors and I have co-authored a forthcoming paper in the journal _Applied Economic Perspectives and Policy_, using survey data to investigate consumer preferences towards sustainable food and products manufactured under safe working conditions of essential labor during the COVID pandemic[<sup>1</sup>](#fn1).

We ask survey respondents for some basic demographics, and also whether they or a loved one were exposed to or had COVID in early 2020. Then we ask them to make choices among food options that vary in price and in whether they are produced under safe essential labor conditions. Based on stated survey responses, we summarize the choices consistent with respondents preferring the product made under safe labor working conditions. We also break down the share choosing worker-safe products between respondents who were exposed to COVID in 2020, either directly or through a loved one, and those who were not exposed.

 | | Percent choosing Safe Worker Condition Product | Total number of respondents |
 |:--|:--:|:--:|
 | Overall | 64.42% | 860 |
 | Exposed to COVID | 67.89% | 601 |
 | Not Exposed to COVID | 56.37% | 259 |

 Let $p$ be the fraction of respondents that opt to choose the safe working condition products.

**1.**	Use the survey results to estimate $p$. Also estimate the standard error of your estimate. 
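As a reminder, for a sample proportion $\hat{p}$ computed from $n$ respondents, the usual standard error and the general form of a confidence interval are as follows. The symbol $c$ for the critical value is assumed notation.

$$
se(\hat{p}) = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}},
\qquad
\hat{p} \;\pm\; c \cdot se(\hat{p})
$$

Under the normal approximation, $c \approx 1.96$ for a 95% interval and $c \approx 2.576$ for a 99% interval.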

In [33]:
# insert any code here for using R as your calculator

➡️ Type your answer to _Exercise 2 Question 1_ here (replacing this text)

**2.**	Construct a 95% confidence interval for $p$. Interpret. 

In [35]:
# insert any code here for using R as your calculator

➡️ Type your answer to _Exercise 2 Question 2_ here (replacing this text)

**3.**	Construct a 99% confidence interval for $p$. Is it wider or narrower than the 95% confidence interval? Why?

In [37]:
# insert any code here for using R as your calculator

➡️ Type your answer to _Exercise 2 Question 3_ here (replacing this text)

**4.**	Is there statistical evidence that more than 60% of respondents chose the labor safe product? Use the 5 steps for hypothesis testing with a 5% significance level. 

In [39]:
# insert any code here for using R as your calculator

➡️ Type your answer to _Exercise 2 Question 4_ here (replacing this text)

**5.**	At the 1% significance level, is there statistical evidence that respondents who were exposed to COVID (directly or through a loved one) are more likely to choose the labor safe product than respondents who were not exposed? Explain, using the 5 steps for hypothesis testing.
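One common form of the test statistic for comparing two independent proportions is sketched below. The subscripts $1$ and $0$ for the two groups are assumed notation, and this version uses the unpooled standard error; some textbooks instead pool the two samples to estimate a common proportion under the null, so check which version the course uses.

$$
z = \frac{\hat{p}_1 - \hat{p}_0}{\sqrt{\dfrac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_0 (1 - \hat{p}_0)}{n_0}}}
$$

For a one-sided alternative, $z$ is compared with a single critical value from the standard normal distribution at the chosen significance level.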

In [41]:
# insert any code here for using R as your calculator

➡️ Type your answer to _Exercise 2 Question 5_ here (replacing this text)

**Please remember to submit your Jupyter Notebook displaying all code and output.**

<span id="fn1"> Link to the working paper with Nica and Jackie: https://escholarship.org/uc/item/0nv2n39w.</span>