## Problem Set 7
### UGBA 88: Data and Decisions, Fall 2019

In [None]:
#run this cell once, then *restart kernel*
%pip install gsExport

In [8]:
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import numpy as np
import gsExport

Deadline: This assignment is due Monday, November 25th at noon (12pm). Late work will not be accepted.

You will submit your solutions using both OkPy and Gradescope. Detailed submission instructions are available [here](https://docs.google.com/document/d/1vrg66vGtBf93xt4-LUQPpacUAQAxIJEeJ10fRsb8oUc/). **Please do not remove or add cells, and please ignore the '#newpage' cells** (these are here to facilitate Gradescope submission).

You should start early so that you have time to get help if you're stuck. Post questions on [Piazza](https://piazza.com/class/jzw0f05ebpof0). Check the syllabus for the office hours schedule. Remember that Connector Assistant office hours are for *coding questions only*.

## Question 1: Orange Juice Sales

**(35 points)** In this question you will analyze data on orange juice sales and prices in [Dominick's](https://en.wikipedia.org/wiki/Dominick%27s) supermarkets (you can read more about these data [here](https://www.chicagobooth.edu/research/kilts/datasets/dominicks)). The data come from 83 supermarkets in the Chicago area. Sales and prices vary by brand, supermarket, and week.

Each row contains the following information on a given brand of orange juice in a particular supermarket in a given week:

* `sales`: the number of units sold in the store
* `price`: price of the orange juice
* `brand`: brand of the orange juice (Tropicana or Dominick's)
* `feat`: whether the orange juice was 'featured' or promoted in the store
* `tropicana`: an indicator for whether the brand is Tropicana (= 0 if brand is Dominick's)

In [None]:
#run this cell to load the data

#read in data
oj_data = Table.read_table("oj.csv")
oj_data.show(5)

In this question, you will estimate how orange juice sales respond to prices. In economics, this relationship is summarized by the **price elasticity of demand**: the percentage change in the quantity demanded of a good or service in response to a one percent change in its price. Mathematically, it can be expressed as

$$e_{p} = \frac{\partial Q}{\partial p} \frac{p}{Q}$$

where $p$ is price and $Q$ is quantity. We can estimate the price elasticity of demand using the following regression:

$$\ln(\text{Sales})_{i} = \alpha + \beta \ln(\text{Price})_{i} + e_{i}$$

With both the dependent variable and the explanatory variable expressed in natural logs, $\beta$ has an **elasticity interpretation**: a 1% increase in price is associated with a $\beta$ percent change in sales (an increase if $\beta$ is positive, a decrease if it is negative). (For more detail on the use of the natural log function in regression, see *Mastering 'Metrics*.)
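To make the elasticity interpretation concrete, here is a minimal sketch on synthetic data (not the Dominick's data). It assumes a made-up true elasticity of -2 and uses NumPy's `np.polyfit` to fit the log-log line; the variable names and parameter values are hypothetical, and the course may expect a different estimation approach.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration: the true elasticity is -2
true_alpha, true_beta = 10.0, -2.0
price = rng.uniform(1.0, 4.0, size=5000)
noise = rng.normal(0.0, 0.3, size=5000)
sales = np.exp(true_alpha + true_beta * np.log(price) + noise)

# Regress ln(sales) on ln(price); for deg=1, polyfit returns [slope, intercept]
beta_hat, alpha_hat = np.polyfit(np.log(price), np.log(sales), deg=1)

print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
```

In this synthetic setup the estimated slope should land close to the assumed value of -2, which would be read as: a 1% higher price is associated with roughly 2% lower sales.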

In [None]:
#first we'll define log_sales and log_price and add them as columns
log_sales = np.log(oj_data.column('sales'))
log_price = np.log(oj_data.column('price'))

oj_data = oj_data.with_columns(['log_sales', log_sales, 'log_price', log_price])

**a. (3 points)** Create a scatterplot with `log_sales` on the vertical axis and `log_price` on the horizontal axis. Describe the relationship you see in a sentence.

In [3]:
#write code here

*Describe the relationship you see in a sentence*

**b. (5 points)** Estimate the price elasticity of demand (the regression given above). Report your coefficients. Interpret your estimate for the price elasticity ($\beta$) in a full sentence. (Be sure to mention the *magnitude* of the estimate and not just the *sign*.)

In [4]:
#write your code here: define MSE function

In [5]:
#write any additional code here

*Interpret estimate in sentence*

**c. (3 points)** Plot a histogram of `price` (*not* `log_price`), grouped by `feat`. How do prices for featured and non-featured orange juices compare?

In [6]:
#write your code here

*How do prices compare?*

**d. (3 points)** Calculate average `sales` (*not* `log_sales`) by `feat`. How do sales for featured and non-featured orange juices compare?

In [None]:
#write your code here

*How do sales compare?*

**e. (3 points)** How do you anticipate controlling for `feat` in the regression will affect your estimated price elasticity? Why? (Your argument should be based on statistical reasoning, not what you think drives consumer behavior.)

*Write answer here*

**f. (5 points)** Using regression, estimate the price elasticity of demand while *controlling for `feat`*. Report your coefficients. Interpret your estimate for the price elasticity in a sentence. (Be sure to mention the *magnitude* of the estimate and not just the *sign*.)
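As a reminder of the mechanics of adding a control, here is a minimal sketch on synthetic data with generic, hypothetical variable names (`x`, `d`, `y`). It defines an MSE function and then finds the coefficients that minimize it with `np.linalg.lstsq`; this is a stand-in for whatever minimizer the course uses, not necessarily the intended method.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic data, invented for illustration
x = rng.normal(size=n)            # regressor of interest
d = rng.integers(0, 2, size=n)    # binary control (0 or 1)
y = 1.0 + 0.5 * x + 2.0 * d + rng.normal(scale=0.5, size=n)

def mse(coefs, X, y):
    """Mean squared error of the linear prediction X @ coefs."""
    residuals = y - X @ coefs
    return np.mean(residuals ** 2)

# Design matrix: a column of ones (intercept), then x, then the control
X = np.column_stack([np.ones(n), x, d])

# Least squares solution, which minimizes the MSE defined above
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept_hat, slope_x_hat, slope_d_hat = coefs

print(f"intercept = {intercept_hat:.3f}, x = {slope_x_hat:.3f}, d = {slope_d_hat:.3f}")
print(f"MSE at the fitted coefficients = {mse(coefs, X, y):.4f}")
```

Each extra control is just another column in the design matrix, and its coefficient is read off in the same position of `coefs`.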

In [None]:
#write your code here: define MSE function

In [None]:
#write any additional code here

*Interpret estimate in sentence*

**g. (4 points)** Using regression, estimate the price elasticity of demand while *controlling for `tropicana`* (but not `feat`). Report your coefficients.

In [None]:
#write your code here: define MSE function

In [None]:
#write any additional code here

**h. (5 points)** Based on your regression results in **part (b)** and **part (g)**, does Tropicana or Dominick's have a larger average `log_price`? Explain your reasoning.

*Write answer here*

**i. (4 points)** Prices tend to be higher in supermarkets located in denser areas of Chicago. These supermarkets also tend to attract more customers per day. Explain why the fact that we are not controlling for the density of the supermarket's location may bias our estimate for the price elasticity of demand. What sign do you anticipate for that bias, and why?

*Write answer here*

#newpage

## Question 2: Regression and the Oregon Health Study

**(20 points)** In this question you will re-examine the Oregon Health Study from Chapter 1 of *Mastering 'Metrics* and earlier problem sets. In 2008, the state of Oregon held a (randomized) lottery where lottery winners were eligible to apply for enrollment in a Medicaid expansion program. Refer to the discussion in *Mastering 'Metrics* for more details on the experiment.

The data you will use this time have a slightly different set of columns from what you used in previous problem sets. Here we have restricted the data to individuals who signed up only themselves (and not their families) for the lottery, which gets around the balance issues discussed in Problem Set 6.

* `person_id`: identifier for participants
* `win_lottery`: indicator for whether participant won lottery
* `english`: participant requested English language materials
* `female`: indicator for female participant
* `zip_msa`: whether participant lives in a Metropolitan Statistical Area (MSA) (i.e., near a city)
* `age`: age of participant
* `cost_any_owe`: indicator for currently owing any money for medical expenses
* `any_medicaid`: indicator for whether participant has Medicaid coverage

In [None]:
#run this cell to load the data
ohs_data = Table.read_table("ohs.csv")
ohs_data.show(5)

In Problem Set 5 we estimated the causal effect of winning the lottery on `cost_any_owe` by comparing the mean of `cost_any_owe` for lottery winners to the same mean for lottery losers. Recall that we can make that same comparison using regression by estimating the following model:

$$\text{cost}\_\text{any}\_\text{owe}_{i} = \alpha + \beta \times \text{win}\_\text{lottery}_{i} + e_{i} $$
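The equivalence between this regression and a comparison of means can be checked on synthetic data. The sketch below uses hypothetical names (`treated`, `outcome`) and made-up probabilities, and fits the line with `np.polyfit`; it is an illustration of the idea, not the Oregon data or the course's required method.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Synthetic data, invented for illustration: binary treatment, binary outcome
treated = rng.integers(0, 2, size=n)
outcome = (rng.random(n) < 0.6 - 0.1 * treated).astype(float)

# Difference in group means
mean_treated = outcome[treated == 1].mean()
mean_control = outcome[treated == 0].mean()
diff_in_means = mean_treated - mean_control

# Least squares fit of outcome on the 0/1 indicator; returns [slope, intercept]
beta_hat, alpha_hat = np.polyfit(treated, outcome, deg=1)

print(f"difference in means = {diff_in_means:.4f}")
print(f"regression slope    = {beta_hat:.4f}")
print(f"regression intercept = {alpha_hat:.4f}, control mean = {mean_control:.4f}")
```

With a single 0/1 regressor, the fitted intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means, up to floating-point error.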

**a. (4 points)** Estimate the regression above. Report the coefficients.

In [3]:
#write your code here: define MSE function

In [4]:
#write any additional code here

**b. (3 points)** Confirm that you get the same treatment effect estimate by comparing means of `cost_any_owe` for lottery winners and lottery losers. Make sure to print your calculation of the difference in means.

In [None]:
#write code here

**c. (4 points)** Describe the interpretation for your $\beta$ estimate in a sentence. (Be sure to mention the *magnitude* of the estimate and not just the *sign*.) Is this a causal effect? Why or why not? [Note that a one unit change in `win_lottery` is moving from zero (lottery loser) to one (lottery winner).]

*Write your answer here*

**d. (4 points)** Now we'll try estimating a similar regression but using our other covariates, `female`, `age`, `english`, and `zip_msa`, as controls.

Estimate the following regression model:

$$ \text{cost}\_\text{any}\_\text{owe}_{i} = \alpha + \beta \times \text{win}\_\text{lottery}_{i} + \gamma_{1} \times \text{female}_{i} + \gamma_{2} \times \text{age}_{i} + \gamma_{3} \times \text{english}_{i} + \gamma_{4} \times \text{zip}\_\text{msa}_{i} + e_{i} $$


Be sure to report the coefficients.

In [None]:
#write your code here: define MSE function

In [None]:
#write any additional code here

**e. (5 points)** Your estimate for $\beta$ should not meaningfully change from **part (a)** to **part (d)**. What does this tell you about the relationship between `win_lottery` and the other covariates, and what can explain this relationship?

*Write your answer here*

## Submission

Before submitting, please click "Kernel" above and click "Restart & Run All" to ensure all of your code is working as expected. This is important. Code that does not run cannot be graded. After confirming that all of your work looks and runs as you'd like it to, run **BOTH** of the cells below to submit your work.

Make sure that the following runs successfully for submission to OkPy.

In [None]:
from client.api.notebook import Notebook
ok = Notebook('pset7.ok')
_ = ok.auth(inline=True)
_ = ok.submit()

Then, make sure that the following runs successfully to generate a PDF to upload to Gradescope. **Do not upload any PDF to Gradescope other than the one generated by the code below.** If you have difficulty downloading the PDF, please review the submission instructions ([here](https://docs.google.com/document/d/1vrg66vGtBf93xt4-LUQPpacUAQAxIJEeJ10fRsb8oUc/)) or see Piazza for troubleshooting steps.


In [None]:
gsExport.generateSubmission('pset7.ipynb')