# Problem Set 3, Spring 2020, Villas-Boas

Due <u>Tuesday, March 17 at 9:30am</u>

Submit materials (Jupyter notebook with all code cells run) as one pdf on [Gradescope](https://www.gradescope.com/courses/85265).

# Exercise 1: The Value of Environmental Services

## Background

This exercise is based on the paper [Does Hazardous Waste Matter? Evidence from the Housing Market and the Superfund Program by Greenstone and Gallagher 2008](https://academic.oup.com/qje/article-abstract/123/3/951/1928203?redirectedFrom=fulltext). This
paper explores individuals’ willingness to pay (WTP) for environmental quality by observing the
impact of increasing environmental quality on housing prices. This method is known as the hedonic
valuation method, whereby the price of a good is determined by each of its attributes. In the case of a house, its value is determined by all of its physical characteristics (e.g., number of bedrooms) as well as the characteristics of the neighborhood in which it is located (including environmental quality). 

For this paper, the change in environmental quality results from the cleanup of hazardous
waste sites. The study focuses on Superfund sites, areas designated by the US government as
contaminated by hazardous waste and that pose a hazard to environmental/human health. The
EPA placed certain Superfund sites on the National Priorities List (NPL), which meant that these
sites were legally required to undergo remediation. If individuals value environmental quality, then
housing prices should increase after nearby NPL sites are cleaned up. To determine the extent to
which individuals value the cleanup, we can compare housing close to a hazardous waste site that
was cleaned up to *comparable* homes near a similar waste site that was not cleaned up.


## Data

The data include observations on 450 census tracts that are within 2 miles of a hazardous waste site.$^1$  This includes census tracts where the site in question was on the NPL, and thus was legally required to be cleaned up, as well as those that were not on the NPL. 

The shared dataset contains the following variables for each census tract in the year 2000.


|Variable Name	|	Description |	
| :----------- | :----------------------- |
| _fips_   | Federal Information Processing Standards (FIPS) census tract identifier  |
| _npl_	|	binary indicator for whether the site in a given census tract was placed on the NPL|
| _lnmdvalhs_ | log median housing value |
| _owner\_occupied_	| % of housing that is owner occupied |
| _pop\_den_	| Population density|
| _ba\_or\_better_	| % of the population that has a bachelor's degree or higher |
| _unemprt_	| Unemployment rate |
| _povrat_	| Poverty rate |
| _bedrms1-bedrms5_	| % of housing with the indicated number of bedrooms |
| _bedrms\_3orless_ | % of housing with 3 bedrooms or fewer |
| _blt0\_10yrs_ | % of housing $< 10$ years old |





$^1$ : Census tracts are statistical subdivisions of a county that are defined by the Census Bureau to allow comparisons from census to census.

## Tips

* Do **NOT** install packages on the server. All packages that you need to complete the assignment are already installed and can be loaded with the `library()` function. Trying to install packages will create conflicts and potentially require reloading a fresh copy of the notebook.
* When submitting the notebook, do **NOT** include output of the entire dataset. This is a large dataset, and printing out all rows will take multiple pages of output and make it much harder for the GSIs to find your answers.

* `xtable` and `stargazer` are two great packages for making tables. `xtable` is great for turning a dataframe into a table output, while `stargazer` is great for making a table of your regression results.$^2$ I recommend exploring these packages when you are making your summary statistics table and regression output table in this problem set.

* If you find other R packages online that you would like to use, make sure you install them in your locally installed RStudio. If you want to use new packages on DataHub inside the Problem Set 3 notebook, let any of the GSIs know so they can request that those packages be installed. As mentioned, all packages that you need for this problem set are already installed and ready for use on DataHub.

$^2$: `stargazer` is great for professional-looking LaTeX or postscript tables, but if you want to produce tables for your regression output in the notebook you will need to use `type = "html"` and then copy/paste the html output into a Markdown cell.
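As a minimal sketch of the workflow described in the footnote above, the example below fits a toy regression on R's built-in `mtcars` data (an illustrative stand-in, not the problem-set data) and captures the HTML that `stargazer` prints. The object names `fit` and `html_lines` are hypothetical.

```r
library(stargazer)

# Toy regression on a built-in dataset, purely for illustration
fit <- lm(mpg ~ wt, data = mtcars)

# type = "html" makes stargazer print HTML markup;
# capture.output() collects the printed lines as a character vector
html_lines <- capture.output(stargazer(fit, type = "html"))

# Print the markup so it can be copied into a Markdown cell
cat(html_lines, sep = "\n")
```

Pasting the printed markup into a Markdown cell should render it as a formatted table.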

## Preamble

Use the code cell below to load all your packages (we will use `haven` and `tidyverse`).

In [1]:
# Use this cell to load any packages or set options.

### 1.
Load the data. This problem set will focus on census tracts in which 50% or fewer of the homes were built within the last ten years; that is, tracts with $blt0\_10yrs \leq 0.5$. So first open the original data (`Greenstone_Gallagher_PS3.dta`) and create a subset of the data to be used in this problem set. Call this data frame in R $my\_dataPset3$. How many observations do you lose when you focus on this subset of the data?

*Hint:* use `filter(data, condition)`
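To illustrate the hint, here is a minimal sketch of `filter()` on a made-up data frame. The names `toy`, `share_new`, and `toy_subset` and all values are hypothetical and unrelated to the problem-set data.

```r
library(dplyr)

# Toy data (hypothetical values, not the problem-set data)
toy <- data.frame(
  tract     = c("A", "B", "C", "D", "E"),
  share_new = c(0.10, 0.50, 0.75, 0.30, 0.90)
)

# Keep only the rows where share_new is at most 0.5
toy_subset <- filter(toy, share_new <= 0.5)

# Number of rows dropped by the filter
n_dropped <- nrow(toy) - nrow(toy_subset)
n_dropped
```

Note that `filter()` also drops rows for which the condition evaluates to `NA`, which matters when counting lost observations in data with missing values.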

In [None]:
# Add any code for part 1 here.

Add any written answer for part 1 here.

### 2.
Briefly describe your $my\_dataPset3$ data set. Since these data use the impact of the NPL to estimate WTP for environmental services, you should:
1. Report the number of census tracts in the NPL group and in the non-NPL group (and the number in the sample overall).
2. Report the average home value in the npl=1 and npl=0 groups.
3. Compare the means of at least 4 characteristics (covariates such as unemployment) of these census tracts across the two groups. By compare, we mean report them and discuss whether they are similar across groups, and why you might expect such a result (no need to test formally whether the means of the 4 covariates are similar).

(Hint: use `group_by()` and `summarise()` to obtain separate summary statistics according to the value of $npl$)
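The hint can be sketched on a made-up data frame as follows. The column names `group` and `rate` and the values are hypothetical, chosen only to show the pattern of grouped counts and means.

```r
library(dplyr)

# Toy data (hypothetical values, not the problem-set data)
toy <- data.frame(
  group = c(1, 1, 0, 0, 0),
  rate  = c(0.04, 0.06, 0.10, 0.08, 0.06)
)

# One row per group, with the group size and the group mean
toy_summary <- toy %>%
  group_by(group) %>%
  summarise(
    n_obs     = n(),
    mean_rate = mean(rate)
  )
toy_summary
```

Additional statistics can be added as further named arguments inside `summarise()`.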

In [2]:
# Add your code for part 2 here.

Add any written answer for part 2 here.

### 3.
We will now compare the (log) housing prices across the two groups ($npl=1$ and $npl=0$) using the $my\_dataPset3$ data frame. Draw a histogram of the (log) median housing price for each group. Comment on the differences (be precise, and explain why the differences intuitively make sense).

(Hint: use `hist(data$variable, main="Title", xlab="Log(Y)", ylab="Frequency")` )
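As a sketch of the hint, the example below draws one histogram per group on simulated data. The data frame `toy`, its columns, and the simulated means are all hypothetical. Using a shared `xlim` is one possible design choice that makes the two panels easier to compare by eye.

```r
set.seed(1)

# Simulated toy data (hypothetical; not the problem-set data)
toy <- data.frame(
  group = rep(c(0, 1), each = 100),
  log_y = c(rnorm(100, mean = 11.5, sd = 0.4),
            rnorm(100, mean = 11.8, sd = 0.4))
)

# Shared x-axis range so both panels use the same scale
x_range <- range(toy$log_y)

# Two panels side by side; par() returns the old settings so they can be restored
old_par <- par(mfrow = c(1, 2))
h0 <- hist(toy$log_y[toy$group == 0], main = "Group 0",
           xlab = "Log(Y)", ylab = "Frequency", xlim = x_range)
h1 <- hist(toy$log_y[toy$group == 1], main = "Group 1",
           xlab = "Log(Y)", ylab = "Frequency", xlim = x_range)
par(old_par)
```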


In [4]:
# Add your code for part 3 here.

Add any written answer for part 3 here.

### Show work for all steps for Questions 3-6. You can use R to calculate the sample mean, variance, and \# obs, and perform arithmetic operations (or do it by hand). No credit will be given if you use a canned confidence interval/hypothesis test function in R. Credit will be lost if you do not clearly show your steps.
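As an illustration of what showing each step might look like, here is a sketch on a made-up vector `x` (hypothetical values). It assumes a $t$ critical value is appropriate and looks it up with `qt()`, which is equivalent to reading a $t$ table; whether your answer should use a $t$ or a normal critical value follows lecture, not this sketch.

```r
# Toy sample (hypothetical values, not the problem-set data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n     <- length(x)      # number of observations
x_bar <- mean(x)        # sample mean
s2    <- var(x)         # sample variance (n - 1 in the denominator)
se    <- sqrt(s2 / n)   # standard error of the sample mean

# Critical value for a 90% interval: 5% in each tail, n - 1 degrees of freedom
crit <- qt(0.95, df = n - 1)

# Interval built by hand as estimate +/- critical value * standard error
ci_lower <- x_bar - crit * se
ci_upper <- x_bar + crit * se
c(ci_lower, ci_upper)
```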

### 4.
Compute an estimate for the mean of the variable *lnmdvalhs* for the NPL group in the $my\_dataPset3$ data frame. Construct a 90% confidence interval for this mean. Give an interpretation of these results in a sentence.

In [5]:
# Add your code for part 4 here.

Add any written answer for part 4 here.

### 5.
Call the difference in  *lnmdvalhs* between the NPL and non-NPL groups $D$. State an estimator $\hat D$ for $D$ and use the estimator to compute an estimate of $D$. Compute a standard error for $\hat D$. Derive a 95% confidence interval for $D$ and interpret in a sentence.

In [6]:
# Add your code for part 5 here.

Add any written answer for part 5 here.

### 6.
Using the $my\_dataPset3$ data frame, test whether the log of housing values (*lnmdvalhs*) for the NPL group is statistically different at the 5\% level from the log of housing values in the non-NPL group. (Recall the 5-step procedure for hypothesis testing.)

In [7]:
# Add your code for part 6 here.

Add any written answer for part 6 here.

### 7.
Now consider the case where we use a one-sided test, such that the alternative hypothesis is now $H_a: D > 0$. Is there a scenario under which you would reject the null hypothesis in this one-sided test, but not the null hypothesis in part 6? If so, what is the probability that this occurs? Use a picture in the explanation of your answer.


Add any written answer for part 7 here.

### 8.

Now we want to see how adding covariates on the right hand side of our equation affects the coefficient on the treatment indicator *npl*.  We run the following regressions using the data $my\_dataPset3$ in R: 

\begin{align*}
lnmdvalhs &= \beta_0+\beta_1 npl+ u \tag{1} \\
lnmdvalhs &= \beta_0+\beta_1 npl+ \beta_2 unemprt + u \tag{2} \\
lnmdvalhs &= \beta_0+\beta_1 npl+ \beta_2 unemprt + \beta_3 owner\_occupied + u \tag{3}
\end{align*}
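The mechanics of estimating nested specifications like these can be sketched with `lm()` on simulated data. Everything below (the variables `y`, `d`, `z` and the coefficients used to simulate them) is hypothetical and only shows how to fit a short and a long regression and read off the coefficient on a binary regressor and the $R^2$.

```r
set.seed(42)
n <- 200

# Simulated toy data (hypothetical; not the problem-set data)
z <- rnorm(n)                           # a covariate
d <- as.numeric(z + rnorm(n) > 0)       # binary regressor, correlated with z
y <- 1 + 0.5 * d + 0.8 * z + rnorm(n)   # outcome
toy <- data.frame(y, d, z)

# Short regression (binary regressor only) and long regression (adds z)
fit_short <- lm(y ~ d, data = toy)
fit_long  <- lm(y ~ d + z, data = toy)

# Coefficient on d in each specification
b_short <- coef(fit_short)[["d"]]
b_long  <- coef(fit_long)[["d"]]

# R-squared in each specification
r2_short <- summary(fit_short)$r.squared
r2_long  <- summary(fit_long)$r.squared

c(b_short = b_short, b_long = b_long, r2_short = r2_short, r2_long = r2_long)
```

Calling `summary(fit_short)` prints the full coefficient table, including standard errors.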


In [11]:
# Add your code for part 8 here.

#### (a) 
Interpret (SSS) the estimated coefficient, $\hat \beta_1$, that you obtain from estimating equation (1). 

Add any written answer for part 8(a) here.

#### (b)
Looking both at $R^2$ and the evolution of $\hat \beta_1$ as we add variables,

i) Comment on which variable matters in explaining the outcome, and which is likely correlated with the variable *npl* (go through equation by equation). 
    
ii) What does this tell you about how sites were selected to end up on the National Priorities List (that is, about the correlation between $npl$ and the additional variables you added)? (Hint: go through the OVB formula)

Add any written answer for part 8(b) here.

### 9.
 If you estimate 

\begin{align*}
lnmdvalhs= \beta_0+\beta_1 npl+ u
\end{align*}

 using the complete data set (not the subset $my\_dataPset3$), what happens to the standard errors of the coefficient of $npl$? Explain briefly why that is. 

In [12]:
# Add your code for part 9 here.

Add any written answer for part 9 here.

# Exercise 2: Perception on Wildfires

#### Note: this question does not require R. If you do use R, you must show all steps used to calculate the relevant formulas from lecture. No credit will be given if a canned routine is used. Credit will be lost if values are given and work is not shown.

#### The Public Policy Institute of California (PPIC) conducts statewide surveys to collect information about a variety of topics (health, environment, political attitudes, education, wildfires, and power shutoffs). In November 2019, PPIC conducted a survey of more than 1600 adults. One of the many questions aimed to gather information about how California voters viewed the Governor's handling of the wildfire season and the power shutoffs. They asked the following question: *Do you approve or disapprove of the way that Governor Newsom is handling the issue of wildfires and power shutoffs in California?* Source: https://www.ppic.org/wp-content/uploads/crosstabs-all-adults-1119.pdf

Below is the percentage of respondents who approve. We also report results broken out for two regions of residence.

|               |% responding "approve"| Total Number of  Respondents|
|---------------|----------------------|-----------------------------|
|All            |            46%       |              1693           |
|Central Valley |            40%       |              318            |
|SF Bay Area    |            47%       |              346            |


Consider first the overall result (all voters).  Let $p$ be the fraction of Californian voters that approve.

#####  1. Use the survey results to estimate $p$.  Also estimate the standard error of your estimate.

Add any written answer for part 1 here.

##### 2. Construct a 99% confidence interval for $p$. Interpret.

Add any written answer for part 2 here.

##### 3. Construct a 95% confidence interval for $p$. Is it wider or narrower than the 99% confidence interval? Why?

Add any written answer for part 3 here.

##### 4. Is there statistical evidence that more than 40% of Bay Area adults believe that the governor did well? Use the 5 steps for hypothesis testing with a 5% significance level.

Add any written answer for part 4 here.

##### 5. Is there statistical evidence that the SF Bay Area adults are more likely to approve than Central Valley adults? Explain (to answer this question use the 5 steps for hypothesis testing).

Add any written answer for part 5 here.