## Problem Set 6
### UGBA 96: Data and Decisions, Fall 2018

In [None]:
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import numpy as np
import gsExport

Deadline: This assignment is due Monday, October 29th at noon (12pm). Late work will not be accepted.

You will submit your solutions using both OKpy and Gradescope. You will find detailed submission instructions at the bottom of this notebook and on bCourses ([here](https://bcourses.berkeley.edu/files/73630077/download?download_frd=1)). **Please do not remove or add cells and please ignore the '#newpage' cells** (these are here to facilitate Gradescope submission).

You should start early so that you have time to get help if you're stuck. Post questions on Piazza. Check the syllabus for the office hours schedule. Remember that Connector Assistant office hours are for *coding questions only*.

#newpage

## Question 1: Surviving the Titanic

<img src="titanic.jpg" alt="Drawing" style="width: 400px;"/>

**(20 points)** On the night of April 14th, 1912, four days after it set sail from Southampton, England, en route to New York City, the Titanic struck an iceberg in the North Atlantic; it sank early the next morning. At the time, the Titanic was the largest passenger liner ever made. Of the 2,224 passengers and crew aboard, an estimated 1,502 died. The ship included some of the wealthiest people in the world as well as hundreds of European emigrants to the United States.

It is well known that survival rates varied substantially by age, sex ([‘women and children first’](https://en.wikipedia.org/wiki/Women_and_children_first)), and passenger class. First and second class passengers had considerably higher survival rates than third class passengers and crew. (If you would like to investigate yourself, these data are available in the table below.)

In this question, you will compare survival rates for third class passengers and crew.

Run the cell below to read in the survival data on Titanic passengers and crew.

In [None]:
titanic = Table.read_table('titanic.csv')

#drop first and second class passengers, keeping third class and crew members
titanic = titanic.where('pclass', are.not_equal_to('first')).where('pclass', are.not_equal_to('second'))

titanic.show(5)

Each row represents a passenger or crew member on the Titanic. The data include the following columns:
 * `pclass`: indicates the passenger class for passengers; set to 'crew' for crew members
 * `survived`: an indicator for whether the person survived the Titanic
 * `sex`: the passenger or crew member's sex (takes values 'male' and 'female')
 * `age`: age of the passenger (value is missing for crew members)
 * `fare`: passenger's fare (value is missing for crew members)

**a. (4 points)** Calculate and print separate survival rates for third class passengers and crew. Are crew substantially more likely to survive than third class passengers? [You do not need to conduct a formal hypothesis test.]

In [None]:
#Write code here

*Write your answer here*

**b. (4 points)** Calculate survival rates separately by both `sex` and `pclass`. How do third class passengers and crew compare here?

In [None]:
#write code here

*Write answer here*

**c. (5 points)** How can your findings in **part (a)** and **part (b)** be reconciled?

*Write answer here*

**d. (7 points)** Using a matching strategy to control for `sex`, compute and print the difference in survival rates for third class passengers and crew. Conditional on `sex`, were crew substantially more likely to survive than third class passengers?

[To answer this, first match each third class passenger to a randomly selected crew member of the same sex. Then compute the survival rates for third class passengers and the matched crew members. Finally, calculate and print the difference in survival rates between the two groups.]
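The matching steps described above can be illustrated on a tiny made-up dataset. The sketch below deliberately uses plain Python rather than the `datascience` Table API, and every name in it (`treated`, `pool`, `match_on_sex`, `rate`) is hypothetical. It draws matches with replacement, so the same comparison record may be matched more than once; that is one reasonable choice, not the only one.

```python
import random

# Hypothetical toy records: (group, sex, survived). Values are made up.
treated = [("A", "female", 1), ("A", "male", 0), ("A", "male", 1)]
pool = [("B", "female", 1), ("B", "female", 0), ("B", "male", 0), ("B", "male", 0)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def match_on_sex(sex_value, candidates, rng):
    """Return one randomly chosen candidate whose sex equals sex_value."""
    same_sex = [row for row in candidates if row[1] == sex_value]
    return rng.choice(same_sex)

# One matched comparison record per treated record
matches = [match_on_sex(row[1], pool, rng) for row in treated]

def rate(rows):
    """Share of rows with survived == 1."""
    return sum(row[2] for row in rows) / len(rows)

# Difference in survival rates between treated and matched comparison records
difference = rate(treated) - rate(matches)
print('difference in survival rates =', difference)
```

Because each match is drawn at random, the difference changes with the seed; only the structure of the calculation carries over to the real data.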

In [None]:
#below we've created separate tables for third class passengers and crew

#generate third class table
third_class = titanic.where('pclass', 'third')

#generate crew table
crew = titanic.where('pclass', 'crew')
#add column ids
crew = crew.with_column('id', np.arange(crew.num_rows))

crew.show(5)

In [None]:
#define matching function here
def match_sex_id(sex_value):
    ...

In [None]:
#write remaining code here; remember to print differences in survival rates

*Write answer here*

#newpage

## Question 2: Revisiting the Oregon Health Study

**(25 points)** This question revisits the Oregon Health Study analyzed in Problem Set 5 and discussed in detail in *Mastering Metrics*.

In the Oregon Health Study, when individuals signed up to participate in the lottery they could also list additional individuals in their household to participate in the lottery (e.g. their spouse). Although the state randomly selected individuals from the list as lottery winners, the *entire household* of any selected individual was considered a lottery winner and eligible to apply for insurance. This adds a wrinkle to our analysis of the experiment: participants from larger households are now more likely to win the lottery than participants from smaller households. [For this reason, in PS5 the data provided were restricted to participants who did not sign up any household members.]

In this question, you will re-analyze the Oregon Health Study, accounting for this fact.
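To see how household size changes the chance of winning, consider a simplified version of the lottery: names are drawn without replacement from a single list, and a household wins if any of its members is drawn. The sketch below computes that probability exactly. The list size (100) and number of draws (10) are made-up numbers, and the function name `household_win_probability` is hypothetical; the actual lottery procedure may differ in its details.

```python
from math import comb

def household_win_probability(household_size, list_size, num_drawn):
    """P(at least one household member is among num_drawn names
    drawn without replacement from a list of list_size names)."""
    # Probability that none of the household's members is drawn
    none_drawn = comb(list_size - household_size, num_drawn) / comb(list_size, num_drawn)
    return 1 - none_drawn

for size in (1, 2, 3):
    print(size, round(household_win_probability(size, 100, 10), 4))
```

Under these assumptions the win probability is roughly 0.10, 0.19, and 0.27 for households of size 1, 2, and 3.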

**a. (3 points)** Why does the fact that the lottery winners are disproportionately drawn from larger households complicate the experiment analysis? Why can't we simply compare lottery winners and lottery losers to estimate the average causal effect of winning the lottery on, say, health?

*Write answer here*

Run the cell below to read in the Oregon Health Study data you will use for the remainder of this problem.

In [None]:
#run this cell to read in the data
ohs = Table.read_table("ohs_hh.csv")
ohs.show(5)

Here is a description of what each column represents:

* `win_lottery`: indicator for whether participant won lottery

* `any_medicaid`: indicator for whether participant has Medicaid coverage

* `household_size`: whether participant signed up 1 (just self), 2, or 3 household members for lottery

* `cost_any_owe`: indicator for whether participant owes any money for medical expenses 12 months after lottery

* `female`: indicator for whether participant is female

* `age`: age of participant that signed up for lottery

* `english`: indicator for whether the participant requested English-language materials for lottery application (proxy for English as preferred language)

Run the cell below to conduct a **balance test**: a comparison of baseline characteristics for lottery winners and lottery losers.

In [None]:
#balance check: compare baseline characteristics
ohs.select('win_lottery', 'household_size', 'english',  'female', 'age').group('win_lottery', np.mean)

We can see that lottery winners have signed up more household members, are less likely to speak English as their preferred language, and are less likely to be female. There are differences, though they may not seem large in magnitude. Run the cells below to conduct a permutation test for whether differences in English-preference rates are statistically significant. [In this case, the null hypothesis is that there is no difference in English-preference rates between groups, while the alternative hypothesis is that there is some difference.]

Below we have provided the useful function `permuted_sample_average_difference` defined in Data 8.

In [None]:
#Run this cell to define the function permuted_sample_average_difference 
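def permuted_sample_average_difference(table, label, group_label, repetitions):
    """Return an array of simulated differences in group means of `label`
    (second group minus first, in sorted order of `group_label`),
    one for each random permutation of the `label` column."""

    tbl = table.select(group_label, label)

    differences = make_array()
    for i in np.arange(repetitions):
        #shuffle the outcome column, breaking any link with group membership
        shuffled = tbl.sample(with_replacement = False).column(1)
        original_and_shuffled = tbl.with_column('Shuffled Data', shuffled)

        #average the shuffled outcome within each group
        shuffled_means = original_and_shuffled.group(group_label, np.average).column(2)
        simulated_difference = shuffled_means.item(1) - shuffled_means.item(0)

        differences = np.append(differences, simulated_difference)

    return differences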

In [None]:
#calculate observed difference in english as preferred language
english_diff = ohs.where('win_lottery', 1).column('english').mean() - ohs.where('win_lottery', 0).column('english').mean()
english_diff

#assign array of simulated test statistics under null
english_sim_diff = permuted_sample_average_difference(ohs, 'english', 'win_lottery', 1000)

#calculate and assign p-value
english_pvalue = np.count_nonzero(abs(english_sim_diff) >= abs(english_diff))/1000

#print results
print('estimated difference =', english_diff)
print('p_value =', english_pvalue)
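The logic of the permutation test above can also be written without any table library, which may make the individual steps easier to follow. This is a minimal sketch on made-up data, not the course's implementation: the two groups, the seed, and the number of repetitions are all assumptions chosen for illustration.

```python
import random

# Hypothetical toy data, made up for illustration: 1 = prefers English, 0 = does not.
group_one = [1] * 9 + [0]
group_zero = [0] * 9 + [1]

def mean(values):
    return sum(values) / len(values)

# Observed test statistic: difference in group means
observed = mean(group_one) - mean(group_zero)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
pooled = group_one + group_zero
repetitions = 500
extreme = 0
for _ in range(repetitions):
    rng.shuffle(pooled)  # reassign outcomes to groups at random
    simulated = mean(pooled[:len(group_one)]) - mean(pooled[len(group_one):])
    # two-sided test: count simulated differences at least as large in magnitude
    if abs(simulated) >= abs(observed):
        extreme += 1

p_value = extreme / repetitions
print('observed difference =', observed)
print('p_value =', p_value)
```

Comparing absolute values makes the test two-sided, which matches an alternative hypothesis of "some difference" in either direction.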

The unadjusted data are imbalanced. However, if we condition on `household_size`, each participant has the same chance of winning the lottery. This is an example where the **Selection on Observables** assumption is clearly satisfied.

Below, you will control for `household_size` using matching and examine whether this corrects the imbalance documented above.

**b. (5 points)** Match each lottery winner to a randomly selected lottery loser with the same value of `household_size`. [Your code may take a while to run. If your notebook appears to be hung up, try restarting your kernel.]

In [None]:
#below we've created separate tables for lottery winners and losers

#generate lottery winners table
winners = ohs.where('win_lottery', 1)

losers = ohs.where('win_lottery', 0)
losers = losers.with_column('id', np.arange(losers.num_rows))

losers.show(5)

In [None]:
#write your code here
def match_size_id(size_value):
    ...

In [None]:
#use .apply to get an array of ids for the matched lottery losers
match_indices = ...

#use .take and `match_indices` to get a table of matched lottery losers
losers_size_matches = ...
losers_size_matches.show(5)

**c. (4 points)** As above, check balance between lottery winners and matched lottery losers by comparing average values of `household_size`, `english`, `female`, and `age` for the two groups. Be sure to print the results.

In [None]:
#Write code for lottery losers here

In [None]:
#write code for lottery winners here

**d. (5 points)** Test whether the difference in English-preference rates between lottery winners and matched lottery losers is statistically significant. Has matching corrected the imbalance?

In [None]:
#use the code below to combine your two tables (winners and losers_size_matches) into one table

#create combined array for each column
win_lottery = np.concatenate((winners.column('win_lottery'), losers_size_matches.column('win_lottery')))
household_size = np.concatenate((winners.column('household_size'), losers_size_matches.column('household_size')))
english = np.concatenate((winners.column('english'), losers_size_matches.column('english')))
female = np.concatenate((winners.column('female'), losers_size_matches.column('female')))
age = np.concatenate((winners.column('age'), losers_size_matches.column('age')))
cost_any_owe = np.concatenate((winners.column('cost_any_owe'), losers_size_matches.column('cost_any_owe')))
any_medicaid = np.concatenate((winners.column('any_medicaid'), losers_size_matches.column('any_medicaid')))

#combine arrays into table
combined = Table().with_columns('win_lottery', win_lottery, 'household_size', household_size, 'english', english, 'female', female, 'age', age, 'cost_any_owe', cost_any_owe, 'any_medicaid', any_medicaid)
combined

In [None]:
#write your code here

*Write your answer here*

**e. (5 points)** Estimate the average causal effect of winning the lottery on `cost_any_owe` (i.e., the reduced form or Intent to Treat) and conduct a hypothesis test for whether your estimate is statistically significant. [The null hypothesis is that the treatment effect is zero for each participant.] Be sure to print the results.

In [None]:
#Write your code here

**f. (3 points)** Describe your findings in a complete sentence. Be sure to reference the *meaning* of the variables you're examining rather than just the column names (e.g. don't say '`cost_any_owe` decreases by ...'; instead say what that means in plain English).

*Write your answer here*

#newpage

## Submission

Before submitting, please click "Kernel" above and click "Restart & Run All" to ensure all of your code is working as expected. This is important. Code that does not run cannot be graded. After confirming that all of your work looks and runs as you'd like it to, run **BOTH** of the cells below to submit your work. Failure to submit correctly may result in a 10% point penalty.

First, make sure that the following runs successfully for submission to OKpy.

In [None]:
from client.api.notebook import Notebook
ok = Notebook('pset6.ok')                
_ = ok.auth(inline=True)
_ = ok.submit()

Then, make sure that the following runs successfully to generate a PDF to upload to Gradescope. **Do not upload any PDF to Gradescope other than the one generated by the code below.** If you have difficulty downloading the PDF, please see Piazza for troubleshooting steps.

In [None]:
gsExport.generateSubmission('pset6.ipynb')