# Problem Set 3, Spring 2021, Villas-Boas

Due <u>as announced on bCourses, end of day Pacific Time</u>

Submit materials (Jupyter notebook with all code cells run) as one PDF on [Gradescope](https://www.gradescope.com/courses/226571).

# Exercise 1: The Value of Environmental Services

## Background

This exercise is based on the paper [Does Hazardous Waste Matter? Evidence from the Housing Market and the Superfund Program by Greenstone and Gallagher 2008](https://academic.oup.com/qje/article-abstract/123/3/951/1928203?redirectedFrom=fulltext). This
paper explores individuals’ willingness to pay (WTP) for environmental quality by observing the
impact of increasing environmental quality on housing prices. This method is known as the hedonic
valuation method, whereby the price of a good is determined by each of its attributes. In the case of a house, its value is determined by all of its physical characteristics (e.g., number of bedrooms) as well as the characteristics of the neighborhood in which it is located (including environmental quality). 

For this paper, the change in environmental quality results from the cleanup of hazardous
waste sites. The study focuses on Superfund sites, areas designated by the US government as
contaminated by hazardous waste and that pose a hazard to environmental/human health. The
EPA placed certain Superfund sites on the National Priorities List (NPL), which meant that these
sites were legally required to undergo remediation. If individuals value environmental quality, then
housing prices should increase after nearby NPL sites are cleaned up. To determine the extent to
which individuals value the cleanup, we can compare housing close to a hazardous waste site that
was cleaned up to *comparable* homes near a similar waste site that was not cleaned up.


## Data

The data include observations on 447 census tracts that are within 2 miles of a hazardous waste site.$^1$  This includes census tracts where the site in question was on the NPL, and thus was legally required to be cleaned up, as well as those that were not on the NPL. 

The shared dataset contains the following variables for each census tract in the year 2000.


|Variable Name	|	Description |	
| :----------- | :----------------------- |
| _fips_   | Federal Information Processing Standards (FIPS) census tract identifier  |
| _npl_	|	binary indicator for whether the site in a given census tract was placed on the NPL|
| _lnmdvalhs_ | log median housing value $^2$ |
| *owner\_occupied*	| % of housing that is owner occupied |
| *pop\_den*	| Population density|
| *ba\_or\_better*	| % of the population that has a bachelor's degree or higher |
| *unemprt*	| Unemployment rate |
| *povrat*	| Poverty rate |
| *bedrms1-bedrms5*	| % of housing with the indicated number of bedrooms |
|  *bedrms\_3orless*  |  % of housing with 3 bedrooms or fewer |
|*blt0\_10yrs* |% of housing $< 10$ years old |





$^1$ : Census tracts are statistical subdivisions of a county that are defined by the Census Bureau to allow comparisons from census to census.

$^2$ : Median Housing Value, *before logging*, is recorded in USD.

## Tips

* Do **NOT** install packages on the server. All packages that you need to complete the assignment are already installed and can be loaded with the `library()` function. Trying to install packages will create conflicts and potentially require reloading a fresh copy of the notebook.
* When submitting the notebook, do **NOT** include output of the entire dataset. This is a large dataset, and printing out all rows will take multiple pages of output and make it much harder for the GSIs to find your answers.

* `xtable` and `stargazer` are two great packages for making tables. `xtable` is great for turning a data frame into a table, while `stargazer` is great for making a table of your regression results. Note that `stargazer` is designed for professional-looking LaTeX or postscript tables; to produce tables for your regression output in the notebook, you will need to use `type = "html"` and then copy/paste the HTML output into a Markdown cell. I recommend exploring these packages when you are making your summary statistics table and regression output table in this problem set. The first half of Coding Bootcamp Part 5 goes over these two packages and is accessible [here](https://r.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fds-modules%2FENVECON-118&urlpath=tree%2FENVECON-118%2FSpring2021-J%2FSections%2FCoding+Bootcamps).

* If you want to use a package in the Problem Set 3 notebook that is not already installed on Datahub, **do not try to install it on the Datahub server**. Instead, let any of the GSIs know so that they can request that the package be installed.

## Preamble

#### Use the code cell below to load all your packages (we will use `haven` and `tidyverse`).

In [None]:
# Add your preamble code here

**1.**  Load the data. This problem set will focus on census tracts in which more than 20% of homes were built within the last ten years; that is, tracts with $blt0\_10yrs > 0.2$.
First open the original data (`Greenstone_Gallagher_PS3_2021.dta`) and create a subset of the data to be used in this problem set. Call this data frame $my\_dataPset3$ in R. How many observations do you lose when you focus on this subset of the data?

#### *Hint:* use `filter(data, condition)`
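To illustrate the hint, here is a minimal sketch of `filter()` on a small made-up data frame. The object names `toy`, `toy_subset`, and `n_dropped` and all values are hypothetical and are not taken from the problem set data; the sketch assumes `dplyr` (part of `tidyverse`) is available.

```r
library(dplyr)

# Hypothetical toy data, NOT the Greenstone and Gallagher data
toy <- data.frame(
  fips       = c("A", "B", "C", "D"),
  blt0_10yrs = c(0.05, 0.25, 0.40, 0.10)
)

# Keep only the rows that satisfy the condition
toy_subset <- filter(toy, blt0_10yrs > 0.2)

# Number of rows dropped by the filter
n_dropped <- nrow(toy) - nrow(toy_subset)
n_dropped
```

The same pattern applies to any data frame and logical condition; comparing `nrow()` before and after filtering gives the number of observations lost.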

In [None]:
# Add any code for part 1 here.

➡️ Type your written answer for part 1 here.

**2.** Briefly describe your $my\_dataPset3$ data set. Since these data use the impact of the NPL to estimate WTP for environmental services, you should:

**(a)** Report the number of census tracts in the NPL group and in the non-NPL group (and the number in the sample overall),

**(b)** Create a variable called $housep$ that is the median housing price in dollars (in levels, not logs), and report its average in the npl=1 and npl=0 groups, and

**(c)** Compare the means of the following characteristics (covariates) of these census tracts across the two groups. By compare, we mean report them and discuss whether the sample averages are similar across groups, and why you might expect such a result (there is no need to test formally whether the means of these covariates are similar). The covariates you are to check are percent owner\_occupied, unemployment rate, and poverty rate.

(Hint: use `group_by()` and `summarise()` to obtain separate summary statistics according to the value of $npl$)
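The hint above can be sketched on made-up data as follows. The data frame `toy`, the result `toy_summary`, and the column names `n_tracts` and `mean_unemprt` are hypothetical names chosen for illustration, and the values are invented; the sketch assumes `dplyr` is available.

```r
library(dplyr)

# Hypothetical toy data, NOT the problem set data
toy <- data.frame(
  npl     = c(1, 1, 0, 0, 0),
  unemprt = c(0.04, 0.06, 0.05, 0.07, 0.09)
)

# One row of summary statistics per value of npl
toy_summary <- toy %>%
  group_by(npl) %>%
  summarise(
    n_tracts     = n(),            # number of rows in each group
    mean_unemprt = mean(unemprt)   # group-specific sample mean
  )
toy_summary
```

Additional covariates can be summarised by adding further named arguments inside `summarise()`.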

In [None]:
# Add any code for part 2 here.

➡️ Type your written answer for part 2 here.

**3.** We will now compare housing prices (in levels, not logs) across the two groups (*npl* = 1 and
*npl* = 0) using the $my\_dataPset3$ subset. Draw a histogram of the median housing price in
each group.

(Hint: see [Coding Bootcamp Part 4](https://r.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fds-modules%2FENVECON-118&urlpath=tree%2FENVECON-118%2FSpring2021-J%2FSections%2FCoding+Bootcamps) for how to do this using **ggplot2**. For base R use `hist(data$variable, main = "Title", xlab = "MedianHousingPrice", ylab = "Frequency")`)


In [None]:
# Add any code for part 3 here.

**4.** Overlay both histograms on the same graph and comment on the differences (be precise, and explain
why the differences make intuitive sense).

(Hint: see the Histograms section of [Coding Bootcamp Part 4](https://r.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fds-modules%2FENVECON-118&urlpath=tree%2FENVECON-118%2FSpring2021-J%2FSections%2FCoding+Bootcamps))

In [None]:
# Add any code for part 4 here.

➡️ Type your written answer for part 4 here.

**Show work for all steps for Questions 5-7. You can use R to calculate the sample mean,
variance, and number of observations, and to perform arithmetic operations (or do it by hand). No credit will
be given if you use a canned confidence interval/hypothesis test function in R. Credit will
be lost if you do not clearly show your steps.**

**5.** Compute an estimate for the mean of the variable housep for the NPL group (npl=1) in the
$my\_dataPset3$ data frame. Construct a 90% confidence interval for this mean. Give an interpretation of these results in a sentence.

(Hint: use `mean()` and `sd()` to get the necessary information to construct the CI)
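As a reminder of the general shape of the calculation, a two-sided confidence interval for a mean built from the output of `mean()` and `sd()` typically takes the form below. The symbols $\bar{x}$, $s$, $n$, and $c$ are notation assumed here for illustration and may differ from the notation used in lecture; use the critical value taught in lecture.

\begin{align*}
\left[\ \bar{x} - c \cdot \frac{s}{\sqrt{n}},\ \ \bar{x} + c \cdot \frac{s}{\sqrt{n}}\ \right]
\end{align*}

Here $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the number of observations, and $c$ is the critical value for the chosen confidence level.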

In [None]:
# Add any code for part 5 here.

➡️ Type your written work for and answer to part 5 here.

**6.** Let $D$ be the difference in mean $housep$ between the NPL (npl=1) and non-NPL (npl=0) groups. State
an estimator $\hat D$ for $D$ and use the estimator to compute an estimate of $D$. Compute a standard
error for $\hat D$. Derive a 95% confidence interval for $D$ and interpret it in one sentence.

In [None]:
# Add any code for part 6 here.

➡️ Type your written work for and answer to part 6 here.

**7.** Using the $my\_dataPset3$ data frame, test whether the average housing value ($housep$) for
the NPL group is statistically different, at the 10% significance level ($\alpha = 0.1$), from the average housing value in the non-NPL group. That is, the null hypothesis is that the two averages are equal, and the alternative is that they are not equal. (Recall
the 5-step procedure for hypothesis testing.)


In [None]:
# Add any code for part 7 here.

➡️ Type your written work for and answer to part 7 here.

In [None]:
# insert code here

**8.** Now we want to see how adding covariates on the right hand side of our equation affects the coefficient on the treatment indicator $npl$. We run the following regressions using the data $my\_dataPset3$
in R:


\begin{align}
housep&= \beta_0+\beta_1 npl+ u  & (1)\\
housep&= \beta_0+\beta_1 npl+ \beta_2  unemprt +u    & (2)\\
housep&= \beta_0+\beta_1 npl+ \beta_2 unemprt + \beta_3 owner\_occupied +u &(3)
\end{align} 

**(a)** Interpret (SSS) the estimated coefficient, $\hat \beta_1$, that you obtain from estimating equation (1). 


➡️ Type your written answer to 8 (a) here.

**(b)** Looking both at $R^2$ and the evolution of $\hat \beta_1$ as we add variables from equation (1) to (2) to (3), 

1. Comment on which variable matters in explaining the outcome, and which is likely correlated with the variable $npl$ (go through equation by equation). 
2. What does this tell you about how sites were selected to end up on the National Priorities List (that is, about the correlation between $npl$ and the additional variables you added)? (Hint: go through the OVB formula)
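For reference, the omitted variable bias (OVB) formula mentioned in the hint can be written in its standard textbook form as below. The notation ($x$ for a generic omitted covariate, $\tilde\beta_1$ for the coefficient from the short regression, and $\delta_1$) is assumed here for illustration and may differ from the notation used in lecture.

\begin{align*}
\text{Long regression:} \quad & housep = \beta_0 + \beta_1 npl + \beta_2 x + u \\
\text{Auxiliary regression:} \quad & x = \delta_0 + \delta_1 npl + v \\
\text{Short regression (omitting } x\text{):} \quad & E[\tilde\beta_1] = \beta_1 + \beta_2 \delta_1
\end{align*}

The sign of the bias term $\beta_2 \delta_1$ depends on the sign of the effect of $x$ on $housep$ and on the sign of the relationship between $x$ and $npl$.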


➡️ Type your written answer to 8 (b) here.

**9.** If you estimate

\begin{align*}housep= \beta_0+\beta_1 npl+ u\end{align*}

using the complete data set (not the subset $my\_dataPset3$), what happens to the standard error of the coefficient on $npl$? Briefly explain why.

➡️ Type your written answer to 9 here.

# Exercise 2: Attitudes toward the COVID Vaccines

*Note:* this question does not require R. If you do use R, you must show all steps used to calculate the relevant formulas from lecture. No credit will be given if a canned routine is used. Credit will be lost if values are given and work is not shown.	

The Institute of Global Health Innovation (Imperial College London) released a report about global attitudes towards a COVID-19 vaccine. The survey covers 15 countries and was implemented at different points in time, from November 2020 to mid-January 2021. The total number of responses was around 13,500.

In one of the questions, respondents were asked whether they strongly agree (5), agree (4), neither agree nor disagree (3), disagree (2), or strongly disagree (1) with the statement:$^3$

"To what extent do you agree or disagree that if a COVID-19 vaccine were made available to you this week, you would definitely get it?"
      

$^3$: *Aggregate view of latest week available for each country - see page 13 for exact survey dates:* [Link to Nature Article](https://www.nature.com/articles/d41586-021-00368-6)

Row 1 below reports the percentage of respondents that agree or strongly agree with the statement, along with the total number of respondents, over the whole sample for all countries. The rows below it report the same results for the United Kingdom (UK) for the two periods during which the survey was conducted in that country (Nov 5-15, 2020 and Jan 11-17, 2021).

| | % Responding "agree/strongly agree"| Total Number of Respondents|
|--|--|--|
|All Countries and Periods of Survey | 54% | 13,500 |
| UK: Nov 5-15, 2020 | 55% | 1,005|
| UK: Jan 11-17, 2021 | 80% | 1,000|


Consider first the overall result (all countries/periods). Let $p$ be the fraction of people who agree or strongly agree with the statement.

**1.** Use the survey results to estimate $p$. Also estimate the standard error of your estimate.

➡️ Type your written work for and answer to Part 1 here.

**2.** Construct a 95% confidence interval for $p$. Interpret.

➡️ Type your written work for and answer to Part 2 here.

**3.** Construct a 90% confidence interval for $p$. Is it wider or narrower than the 95% confidence interval?
Why?

➡️ Type your written work for and answer to Part 3 here.

**4.** Is there statistical evidence that more than 50% of the UK population in November 2020 agreed that they would
get the vaccine if it were made available to them? Use the 5 steps for hypothesis testing with a 1% significance level.

➡️ Type your written work for and answer to Part 4 here.

**5.** Is there statistical evidence, at the 5% significance level, that agreement with taking the vaccine in the UK increased in January 2021
relative to November 2020? Explain (to answer this question, use the 5
steps for hypothesis testing).

➡️ Type your written work for and answer to Part 5 here.