## Problem Set 8
### UGBA 88: Data and Decisions, Fall 2019

In [None]:
#run this cell once, then *restart kernel*
%pip install gsExport

In [None]:
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import numpy as np
import gsExport
import pandas as pd

Deadline: This assignment is due *Wednesday*, December 4th at **11:59pm**. Late work will not be accepted.

You will submit your solutions using both OKpy and Gradescope. Detailed submission instructions are available [here](https://docs.google.com/document/d/1vrg66vGtBf93xt4-LUQPpacUAQAxIJEeJ10fRsb8oUc/). **Please do not remove or add cells, and please ignore the '#newpage' cells** (these are here to facilitate Gradescope submission).

You should start early so that you have time to get help if you're stuck. Post questions on [Piazza](https://piazza.com/class/jzw0f05ebpof0). Check the syllabus for the office hours schedule. Remember that Connector Assistant office hours are for *coding questions only*.

<img src="yelp.png" alt="Drawing" style="width: 300px;"/>

**(60 points)**

### Background

In this problem set you will replicate the main results from the paper *Learning from the Crowd: Regression Discontinuity Estimates of the Effects of an Online Review Database*, written by Michael Anderson and Jeremy Magruder. Both authors are on the faculty of UC Berkeley's Department of Agricultural and Resource Economics. You can find the paper itself on bCourses. **You will need to reference the paper to complete this problem set.**

The authors study the following research question: do online reviews drive consumer demand? More specifically, the authors estimate the causal effect of positive Yelp ratings on restaurant reservation availability. 

### Data

Prof. Anderson and Prof. Magruder have graciously provided us with the data they collected for their paper.

The authors collected two data sets and merged them. The first data set is the full history of reviews for each San Francisco restaurant on Yelp.com as of February 2011. Using this database, the authors reconstructed the average rating and total number of reviews for each restaurant at every point in time.

The authors combine these data with reservation availability data from a large online reservation website (e.g., OpenTable or Reserve). This website lists real-time reservation availability for a subset of the restaurants in the Yelp data. From July 21st to October 29th, 2010, the authors recorded reservation availability for a party of four on Thursday, Friday, and Saturday evenings. They checked availability at 7pm and 8pm, approximately 36 hours before the time of the desired reservation.

We will work with a subset of their data that will be sufficient for replicating the main results. Each row of data corresponds to one restaurant on one reservation night. Here is a description of each column in the dataset:

* `restaurant_id`: ID for restaurant
* `restaurant`: Name of restaurant
* `date`: Night of reservation
* `wk_id`: Week of reservation
* `total_reviews`: Total reviews on Yelp
* `avg_rating`: Average review rating of restaurant on Yelp
* `display_rating`: Restaurant rating *displayed* to Yelp visitors (in half star increments)
* `neighborhood`: San Francisco neighborhood of restaurant
* `yelp_category`: Cuisine type
* `delivers`: Indicator for whether restaurant delivers
* `takeout`: Indicator for whether restaurant sells take out food
* `price`: price category of restaurant ($ \$, \; \$\$, \; \$ \$ \$, \; \$ \$ \$ \$ $)
* `avail_7pm`: Indicator for whether the restaurant has a reservation available at 7pm on `date`
* `avail_8pm`: Indicator for whether the restaurant has a reservation available at 8pm on `date`


In [None]:
#run this cell to load the data

#read in data
yelp_data = Table.read_table("yelp_data.csv")

delivers = np.int32(yelp_data.column('delivers') == 'Yes')
takeout = np.int32(yelp_data.column('takeout') == 'Yes')

yelp_data = yelp_data.with_column('delivers', delivers).with_column('takeout', takeout)

yelp_data.show(5)

#newpage

### Part I: Descriptive Statistics

Below are some basic descriptive statistics from the paper that provide useful context. Note: the corresponding values in your data will vary somewhat from the numbers in this table, but not by much.

<img src="summary_stats.png" alt="Drawing" style="width: 600px;"/>

There are about 325 restaurants represented in our data. The average review rating for a restaurant is about 3.6 stars. Restaurants are available for a 7pm reservation about 59% of the time.

**a. (2 points)** Create a bar plot of `display_rating`. This should display the frequency of each value of `display_rating` in the data.

In [None]:
#Write code here

`display_rating` takes a discrete set of values, 1 to 5 in half star increments (though our sample takes a more limited range). This masks substantial variation in `avg_rating` across restaurants, which is a simple average of all past Yelp reviews for the restaurant.

**b. (2 points)** Plot a histogram of `avg_rating`.

In [None]:
#Write code here

Even among restaurants with the same `display_rating`, there is substantial variation in `avg_rating`.

#newpage

### Part II: Regression Approach

Our objective is to measure the average causal effect of Yelp star rating on reservation availability.

**c. (4 points)**  Describe a hypothetical experiment that would identify the causal effect of Yelp star ratings on reservation availability.

*Write your answer here*

In the absence of an experiment, a natural approach to use here is **regression**. In this section we will estimate the relationship between Yelp star ratings and availability.
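As a reminder of the mechanics, here is a minimal sketch of a least-squares fit on made-up data. The arrays `x` and `y` and the helper `mse` are hypothetical and unrelated to the Yelp data. `np.linalg.lstsq` is used as one way to find the coefficients that minimize the mean squared error; the method shown in lecture may differ.

```python
import numpy as np

# Made-up data (not the Yelp data): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)

def mse(intercept, slope):
    """Mean squared error of the line intercept + slope * x on the made-up data."""
    return np.mean((y - (intercept + slope * x)) ** 2)

# Least squares: the intercept and slope that minimize the MSE.
X = np.column_stack([np.ones_like(x), x])
(intercept_hat, slope_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

print("intercept:", intercept_hat, "slope:", slope_hat)
print("MSE at the fitted line:", mse(intercept_hat, slope_hat))
```

Moving either coefficient away from its fitted value, for example `mse(intercept_hat, slope_hat + 0.1)`, should give a larger error than the fitted line.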

**d. (2 points)** Create a scatter plot with `display_rating` on the horizontal axis and the average value of `avail_7pm` by display rating on the vertical axis. (Hint: your plot should have 5 points total.)

In [None]:
#Write code here

**e. (3 points)** Based on your plot from **part (d)**, describe the relationship between `display_rating` and `avail_7pm`.

*Write your answer here*

**f. (5 points)** Estimate a regression model with `avail_7pm` as the dependent variable and `display_rating` as the explanatory variable. Report the coefficients. Interpret the coefficient on `display_rating` in a full sentence. (Be sure to mention the *magnitude* of the estimate and not just the *sign*.)

In [None]:
#write your code here: define MSE function

In [None]:
#write any additional code here

*Interpret coefficient here*

This coefficient may not have a causal interpretation. Restaurants with higher display ratings may have less availability for reasons unrelated to the causal effect of the display rating. Next, we investigate `price` as a potential confounding variable. [We will assume here that restaurants do not change their pricing in response to their Yelp rating.]

**g. (2 points)** Create a scatter plot with `display_rating` on the horizontal axis and average `price` by display rating on the vertical axis.

In [None]:
#Write code here

**h. (3 points)** Describe the pattern you see. What issue does this present for estimating the causal effect of `display_rating` on reservation availability?

*Write your answer here*

Of course, we can always include `price` as a control in our regression model.

**i. (3 points)** Estimate the following regression model:

$$\text{avail}\_\text{7pm}_{i} = \alpha + \beta \text{display}\_\text{rating}_{i} + \gamma \text{price}_{i} + e_{i}$$

**Be sure to report the coefficients.**

In [None]:
#write your code here: define MSE function

In [None]:
#write any additional code here

**j. (3 points)** Interpret the estimated coefficient $\beta$ in a full sentence. (Be sure to mention the *magnitude* of the estimate and not just the *sign*.)

*Write your answer here*

However, even after controlling for price, there may be other confounding factors that differ between restaurants with high and low ratings that produce an omitted variable bias.
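To see how an omitted variable can distort a regression coefficient, consider the small simulation sketched below. All variables (`x`, `z`, `y`) and the helper `ols` are made up for illustration and have nothing to do with the Yelp data. The true coefficient on `x` is 1.0 by construction, and `z` plays the role of a confounder that affects `y` and is correlated with `x`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                        # hypothetical omitted confounder
x = 0.8 * z + rng.normal(size=n)              # regressor, correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)    # true coefficient on x is 1.0

def ols(columns, outcome):
    """Least-squares coefficients (intercept first) for the given regressors."""
    X = np.column_stack([np.ones(len(outcome))] + columns)
    coefs, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coefs

short = ols([x], y)       # omits z
long = ols([x, z], y)     # controls for z

print("coefficient on x, z omitted: ", short[1])
print("coefficient on x, z included:", long[1])
```

In this made-up setup the confounder raises both `x` and `y`, so leaving it out pushes the coefficient on `x` above its true value. Reversing the sign of either relationship would reverse the direction of the bias.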

**k. (5 points)** Describe a potential confounding factor that would generate an **omitted variable bias** in our regression of restaurant availability on `display_rating` (with `price` already included as a control). What sign would you anticipate for this omitted variable bias, and why?

*Write your answer here*

#newpage

### Part III: Regression Discontinuity Approach

Though we can control for *observable* differences across restaurants using regression, *unobservable* differences remain a major concern. Fortunately, Yelp's review system provides a natural regression discontinuity research design that may allow us to account for *unobservable* differences across restaurants. `avg_rating` will serve as the running variable.

Here is an explanation provided in the text of Anderson and Magruder (2012):

*"When leaving a review on Yelp, a user must assign a rating from 1 to 5 stars in whole-star
increments. Yelp aggregates all reviews for a given business and displays the average
rating prominently. However, when Yelp computes the average rating they round off to
the nearest half-star. Two restaurants that have similar average ratings can thus appear
to be of very different quality. For example, a restaurant with an average rating of 3.24
displays a 3-star average rating while a restaurant with an average rating of 3.26 displays
a 3.5-star average rating."*
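The rounding rule in this passage can be sketched in a few lines. This is an illustration only: the function name `to_display_rating` is hypothetical, and the choice to round exact ties (such as 3.25) upward is an assumption based on the cutoffs used in this problem set, not on Yelp's documentation.

```python
import math

def to_display_rating(avg_rating):
    """Round an average rating to the nearest half star (ties round up, by assumption)."""
    return math.floor(avg_rating * 2 + 0.5) / 2

# Ratings just below and just above two of the cutoffs.
for rating in (3.24, 3.26, 3.74, 3.76):
    print(rating, "->", to_display_rating(rating))
```

Ratings that differ by only 0.02 end up half a star apart when they fall on opposite sides of a cutoff, which is the feature the regression discontinuity design exploits.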

In this section we will implement this idea. First, we will confirm that `display_rating` changes sharply around specific cutoffs in `avg_rating`. We will do this by creating a *bin scatter* plot as we saw in lecture. We will bin the data by `avg_rating`, then plot the average `display_rating` for each bin around the cutoff. We will use a bandwidth of 0.5. That is, we will focus on restaurants whose `avg_rating` is within 0.5 of the cutoff.

In [None]:
#create grid for avg_rating bins
break_points = np.arange(2.75,4.80,0.05)

#create column of bins
bins = pd.cut(yelp_data.column('avg_rating'), break_points, right = False)

#add bins to data
yelp_data = yelp_data.with_column('bin', bins)

In [None]:
#create table of bin means around 4 star cutoff
#bandwidth = 0.5
disc4_means = yelp_data.where('avg_rating', are.between_or_equal_to(3.25, 4.25)).group('bin', collect = np.mean)

plt.scatter(disc4_means.column('avg_rating mean'), disc4_means.column('display_rating mean'))
plt.title('Display Rating by Average Review Rating')
plt.xlabel('Average Review Rating')
plt.ylabel('Average Display Rating')
plt.axvline(x=3.75, linewidth=1, color = 'red')

The display rating changes sharply once `avg_rating` reaches 3.75. [The paper exploits similar cutoffs at 3.25 (moving from 3 stars to 3.5 stars) and 4.25 (moving from 4 stars to 4.5 stars).]

**l. (3 points)** Compared to the regression approach taken in **Part II**, what are the advantages of this approach to estimating the causal effect of a Yelp star rating?

*Write your answer here*

**m. (3 points)** We can use this regression discontinuity approach to estimate the causal effect of a Yelp star rating on availability. If the standard regression discontinuity assumptions are met, we will recover this treatment effect *for what set of restaurants*?

*Write your answer here*

An important identifying assumption for the RD approach is that other restaurant characteristics vary *smoothly* or *continuously* in the running variable at the cutoff. In other words, we should not see a discontinuity in these other covariates.

Below we check whether `price` varies continuously at the cutoff. This is the same plot as above with `display_rating` replaced by `price`.

In [None]:
#run this cell to generate price plot

plt.scatter(disc4_means.column('avg_rating mean'), disc4_means.column('price mean'))
plt.title('Price by Average Review Rating')
plt.xlabel('Average Review Rating')
plt.ylabel('Average Price')
plt.axvline(x=3.75, linewidth=1, color = 'red')

In contrast to `display_rating`, there is no clear discontinuity in `price` at the cutoff of 3.75. This is consistent with the continuity assumption from lecture (and the text).

Next, you'll make the main set of RD plots: plots of the outcome against the running variable.

**n. (6 points)** Replicate Figure 2(b) from the paper. [Your figure should be close, but not quite identical.]

In [None]:
#Write code for figure here

**o. (5 points)** Estimate the regression discontinuity model for the figure above. Using a bandwidth of 0.5 (that is, restricting the data to observations whose `avg_rating` is within 0.5 of the cutoff, 3.75), estimate the following regression model:

$$\text{avail}\_\text{7pm}_{i} = \alpha + \beta \mathbb{1}_{\text{4 stars}} + \gamma (\text{avg}\_\text{rating}_{i} - c) + \delta (\text{avg}\_\text{rating}_{i} - c) \times \mathbb{1}_{\text{4 stars}}  + e_{i}$$

where $c = 3.75$, the cutoff for a `display_rating` of 4. Note that $\mathbb{1}_{\text{4 stars}}$ is an indicator for a `display_rating` of 4.

**Be sure to report the coefficients.**
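To make the structure of this model concrete, the sketch below builds the same four regressors (constant, indicator, centered running variable, and their interaction) on simulated data. Everything in it is made up: the cutoff of 0, the variable names, and the true jump of -0.2 are hypothetical choices for illustration, and `np.linalg.lstsq` stands in for whatever estimation method you use elsewhere in this problem set.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
cutoff = 0.0                                   # hypothetical cutoff
running = rng.uniform(-0.5, 0.5, size=n)       # simulated running variable
treated = (running >= cutoff).astype(int)      # indicator: at or above the cutoff

# Simulated outcome with a true jump of -0.2 at the cutoff.
outcome = (0.6 - 0.2 * treated + 0.3 * running
           + 0.1 * running * treated + rng.normal(0, 0.1, size=n))

# Columns: constant, indicator, centered running variable, interaction.
centered = running - cutoff
X = np.column_stack([np.ones(n), treated, centered, centered * treated])
(alpha_hat, beta_hat, gamma_hat, delta_hat), *_ = np.linalg.lstsq(X, outcome, rcond=None)

print("estimated jump at the cutoff:", beta_hat)
```

Because the running variable is centered at the cutoff, the coefficient on the indicator measures the gap between the two fitted lines exactly at the cutoff. The interaction term lets the slope differ on each side.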

In [None]:
#create indicator for a display_rating value of 4
disp_4 = np.int32(yelp_data.column('display_rating') == 4)

yelp_data = yelp_data.with_column('disp_4', disp_4)

In [None]:
#write your code here: define MSE function

In [None]:
#write any additional code here

**p. (3 points)** Based on this model, what do you estimate is the causal effect of a *half star* increase in Yelp rating on reservation availability?

*Write your answer here*

**q. (3 points)** Run the cell below to overlay your regression line on the bin scatter plot. Replace the `...` placeholders for `alpha` ($\alpha$), `beta` ($\beta$), `gamma` ($\gamma$), and `delta` ($\delta$) with your coefficient estimates.

In [None]:
alpha = ...
beta = ...
gamma = ...
delta = ...

#plot regression lines
X_plot_pre = np.linspace(-0.5, 0, 10)
plt.plot(X_plot_pre, alpha + X_plot_pre*gamma, color = 'black')

X_plot_post = np.linspace(0, 0.5, 10)
plt.plot(X_plot_post, alpha + beta + X_plot_post*(gamma + delta), color = 'black')

dist_4 = yelp_data.where('avg_rating', are.between_or_equal_to(3.25, 4.25)).column('avg_rating') - 3.75
disc4_means = yelp_data.where('avg_rating', are.between_or_equal_to(3.25, 4.25)).with_column('dist_4', dist_4).group('bin', collect = np.mean)


#generate bin scatter plot
plt.scatter(disc4_means.column('dist_4 mean'), disc4_means.column('avail_7pm mean'))
plt.title('7pm Availability by Average Review Rating')
plt.xlabel('Average Review Rating (distance from 3.75 cutoff)')
plt.ylabel('Average 7pm Availability')
plt.axvline(x=0, linewidth=1, color = 'red')

**r. (3 points)** Describe how this figure provides visual evidence for the average causal effect of a restaurant's `display_rating` on that restaurant's reservation availability.

*Write your answer here*

## Submission

Before submitting, please click "Kernel" above and click "Restart & Run All" to ensure all of your code is working as expected. This is important. Code that does not run cannot be graded. After confirming that all of your work looks and runs as you'd like it to, run **BOTH** of the below cells to submit your work.

Make sure that the following runs successfully for submission to OkPy.

In [None]:
from client.api.notebook import Notebook
ok = Notebook('pset8.ok')                
_ = ok.auth(inline=True)
_ = ok.submit()

Then, make sure that the following runs successfully to generate a PDF to upload to Gradescope. **Do not upload any PDF to Gradescope other than the one generated by the code below.** If you have difficulty downloading the PDF, please review the submission instructions ([here](https://docs.google.com/document/d/1vrg66vGtBf93xt4-LUQPpacUAQAxIJEeJ10fRsb8oUc/)) or see Piazza for troubleshooting steps.

In [None]:
gsExport.generateSubmission('pset8.ipynb')