# Homework 5: Maps, Hypothesis Testing, and Sampling
Welcome to the last homework assignment!

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests.

In [None]:
# Don't change this cell; just run it. 
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from client.api.notebook import Notebook
ok = Notebook('hw05.ok')
_ = ok.auth(inline=True)

Reading:
- Textbook section [7.3](https://data-8r.gitbooks.io/textbook/chapters/07/3/example-bike-sharing-in-the-bay-area.html) and chapter [8](https://data-8r.gitbooks.io/textbook/chapters/08/randomness.html)

Deadline:

This assignment is due **Tuesday, August 1 at 1PM**. You will receive an early submission bonus point if you turn in your final submission by **Monday, July 31 at 1PM**. Late work will not be accepted unless you have made special arrangements with your TA or the instructor.

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck. Drop-in office hours will be held at various times in the week; check the course calendar on the [course webpage](http://data8r.org) for the latest schedule.

Once you're finished, select "Save and Checkpoint" in the File menu and then execute the `submit` cell below. The result will contain a link that you can use to check that your assignment has been submitted successfully. If you submit more than once before the deadline, we will only grade your final submission.

In [None]:
_ = ok.submit()

## 1. Concepts in Hypothesis Testing


High blood pressure is thought to cause or be associated with several health risks, so medication is sometimes prescribed to lower blood pressure for people with that condition.  There are two common measurements of blood pressure: the pressure when your heart beats and pushes blood through your arteries (systolic), and pressure in between heartbeats (diastolic).  We'll focus on systolic pressure for simplicity.  Pressures are measured in units of millimeters of mercury, which is abbreviated "mmHg".

Suppose we conduct an observational study to determine the efficacy of a medication, PressureLow, for reducing blood pressure.  Surveying a group of American adults, we find that the average systolic blood pressure of people taking PressureLow is *higher* than the average blood pressure of people not taking PressureLow.

### Question 1
Describe the confounding factor you think would be most important in such a study.

*Write your answer here, replacing this text.*

Suppose we forge ahead with a hypothesis test.  We decide to test the following null hypothesis:

> *Null hypothesis:* The people in our study were randomly assigned to receive the medication, and the medication had no effect.
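As a rough illustration of what "randomly assigned, with no effect" implies, here is a minimal sketch in plain Python. It assumes a tiny made-up list of pressures and a made-up number of treated people; the names `simulate_treated_mean_under_null` and `all_pressures` are hypothetical and are not part of the assignment's data or provided code. Under this null hypothesis, any group of the same size is as likely to have been the "treated" group as the one we observed.

```python
import random

def simulate_treated_mean_under_null(pressures, num_treated, rng):
    """Return the mean pressure of one randomly chosen 'treated' group.

    Under the null hypothesis, treatment labels are unrelated to pressure,
    so every subset of size num_treated is equally likely to be treated.
    """
    treated = rng.sample(pressures, num_treated)  # sampled without replacement
    return sum(treated) / num_treated

# Hypothetical systolic pressures (mmHg) for everyone in a tiny study.
all_pressures = [118, 125, 131, 140, 122, 135, 128, 145, 119, 138]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
simulated = [simulate_treated_mean_under_null(all_pressures, 4, rng)
             for _ in range(1000)]
print(min(simulated), max(simulated))
```

Note that the sketch draws the treated group from *everyone* in the study and keeps the group size equal to the number of people who were actually treated.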

### Question 2
What is the appropriate alternative hypothesis, given that this is our null hypothesis?

*Write your answer here, replacing this text.*

Some (synthetic) data are loaded in the next cell.

In [13]:
d = Table.read_table("pressurelow.csv")
d

### Question 3
What does each row in the table `d` represent?

*Write your answer here, replacing this text.*

You decide to use the average blood pressure of the people taking PressureLow as your test statistic.

In [11]:
def bp_test_statistic(pressures):
    # Average of an array of systolic pressures (mmHg).
    return np.mean(pressures)

### Question 4
Use the test statistic function to compute the observed test statistic.

In [12]:
observed_test_statistic = ...
observed_test_statistic

### Question 5
**True** or **false** and **explain:** Calling the function `simulate_test_stat_under_null` (defined in the cell below) *with an argument of 1000* is a reasonable way to simulate a test statistic we could see if the null hypothesis were true.

In [14]:
def simulate_test_stat_under_null(sample_size):
    sample = d.sample(sample_size, with_replacement=False)
    pressures = sample.column("Systolic Pressure")
    return bp_test_statistic(pressures)

*Write your answer here, replacing this text.*

The following histogram displays the distribution of test statistics from datasets simulated correctly under the null hypothesis.

![Simulated test statistics histogram](simulated_stats_hist.png)
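One common way to compare an observed statistic against such a distribution is an empirical p-value. The sketch below is only an illustration with made-up numbers; the function name `empirical_p_value` is hypothetical, and it assumes that *large* values of the statistic count as evidence against the null. The right direction depends on your alternative hypothesis.

```python
def empirical_p_value(simulated_stats, observed_stat):
    """Fraction of simulated statistics at least as large as the observed one."""
    at_least_as_large = sum(1 for s in simulated_stats if s >= observed_stat)
    return at_least_as_large / len(simulated_stats)

# Hypothetical simulated statistics and observed value (mmHg).
simulated_stats = [128.0, 130.5, 129.2, 131.0, 127.8, 132.4, 129.9, 130.1]
observed_stat = 131.0

p_value = empirical_p_value(simulated_stats, observed_stat)
print(p_value)
```

A small p-value means the observed statistic would be unusual if the null hypothesis were true.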

### Question 6
Do you reject the null hypothesis, or not?

*Write your answer here, replacing this text.*

### Question 7
Does this mean that PressureLow indeed caused higher blood pressure?  **Explain.**

*Write your answer here, replacing this text.*

## 2. The Support for BCRA


If you've been following the news, you might know that Republicans in the US Congress are attempting to pass a bill about health care.  Various bills have been offered over the past few months.

Pollsters have found that many of these bills are unpopular among Americans.  For example, [this article](http://www.npr.org/2017/06/28/534612954/just-17-percent-of-americans-approve-of-republican-senate-health-care-bill) describes a pollster's findings about the popularity of the Better Care Reconciliation Act (BCRA).

When the pollsters tried to figure out the popularity of BCRA, they were trying to learn the *proportion of registered US voters* who would say that they approve of the bill.  They couldn't ask every registered voter, so instead they asked a random sample of 995.

In this exercise, we will simulate such random sampling to understand what happens.  How much random variability do we see in the proportion of sampled voters who approve of the bill?

We will imagine that we have actually asked all the registered voters about BCRA.  (Actually, the dataset only has 100,000 responses, because we found that the servers you're provided have a hard time handling all 200 million registered voters!)  Those data are loaded in the following cell:

In [41]:
response_codes = Table.read_table("bcra_population.csv")
response_codes.group_barh("Response Code")

We encoded the responses as 0, 1, 2, or 3 to save space.  0 refers to "Approve", 1 to "Disapprove", 2 to "Heard of it, just unsure", and 3 to "Have not heard enough about it to have an opinion".
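Because the codes are 0 through 3, each code can serve directly as an index into a sequence of labels. The following is a minimal sketch in plain Python, not the `datascience` solution; the names `labels`, `decode_response`, and `example_codes` are hypothetical.

```python
# Labels listed in code order: index 0 is "Approve", index 3 is the last label.
labels = [
    "Approve",
    "Disapprove",
    "Heard of it, just unsure",
    "Have not heard enough about it to have an opinion",
]

def decode_response(code):
    """Translate a numeric response code (0-3) into its label."""
    return labels[code]

example_codes = [0, 3, 1]  # hypothetical codes, not taken from the dataset
decoded = [decode_response(c) for c in example_codes]
print(decoded)
```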

### Question 1
Write code to produce a new table called `population`.  It should have a single column called `"Response"`, and it should have one row for each row in `response_codes`.  Instead of a response code like 0, it should have the corresponding string like `"Approve"`.

*Hint 1:* It should start like this:

|Response|
|-|
|Approve|
|Have not heard enough about it to have an opinion|
|Disapprove|

<p align="center">... (99997 rows omitted)</p>

*Hint 2:* Define a function and use `apply`.  Your function can reference the `responses` array we've given you.

In [None]:
# This array is provided for your convenience.
responses = make_array(
    "Approve",
    "Disapprove",
    "Heard of it, just unsure",
    "Have not heard enough about it to have an opinion")

...

population = ...
population

### Question 2
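Implement the function called `proportions`, according to the docstring given below.  When it is called on `population` with the column `"Response"`, the table it returns should have two columns:

* `"Response"`: The response.
* `"Proportion"`: The proportion of people in `population` with that response.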
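To make the idea of a frequency distribution concrete, here is a small sketch using only the Python standard library. It is not the `Table`-based implementation the question asks for; `value_proportions` and `sample_responses` are hypothetical names, and the data are made up.

```python
from collections import Counter

def value_proportions(values):
    """Map each distinct value to the fraction of entries equal to it."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Hypothetical responses, not taken from the dataset.
sample_responses = ["Approve", "Disapprove", "Disapprove", "Disapprove"]
props = value_proportions(sample_responses)
print(props)
```

The proportions always add up to 1, which is a quick sanity check for your own implementation.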

In [None]:
def proportions(tbl, col_name):
    """Computes the frequency distribution of column col_name in table tbl.
    
    Args:
      - tbl (Table): Any table.
      - col_name (str): The name of a column in tbl.
    
    Returns:
      Table: A table containing the frequency distribution of column col_name
        in table tbl.  It has two columns: col_name and "Proportion".  It has
        one row for each unique value in the col_name column."""
    ...
    return ...

# These lines use your function to compute the frequency distribution
# of the "Response" column in the population table, and make a bar
# chart of that distribution.
support_proportions = proportions(population, "Response")
support_proportions.barh("Response", "Proportion")

### Question 3
Sample **50** people from `population` without replacement, producing a table named `small_sample`.  **Then,** draw a bar chart of the frequency distribution of their responses.

*Hint:* You can do this in one line of code.

In [None]:
small_sample = population.sample(50, with_replacement=False)
proportions(small_sample, "Response").barh("Response", "Proportion")

### Question 4
Compare the distribution of responses in the sample and the distribution of responses in the population.  Are they the same?  If not, how do they differ?

*Write your answer here, replacing this text.*

### Question 5
If you sampled again, would you see the same pattern, or a different pattern?

*Write your answer here, replacing this text.*

### Question 6
Repeat questions 3 through 5, but with a sample of 995.  Compare the distribution of responses to those in the small sample and in the population.

In [None]:
# Use this cell to run any code you need.
...

*Write your answer here, replacing this text.*

### Question 7
The following cell defines some code.  Read it, run it, and describe:

* what each function does,
* how many people were sampled *in total* when you ran the cell,
* what is displayed in the chart it produces, and
* how many numbers are in the dataset whose distribution is represented in the histogram.

*Note:* It may take about a minute to run the cell.

In [None]:
def simulate_approval_proportion(sample_size):
    the_sample = population.sample(sample_size, with_replacement=False)
    return the_sample.where("Response", are.equal_to("Approve")).num_rows / the_sample.num_rows

def approval_distribution(sample_size, num_simulations):
    simulation_sizes = Table().with_column("Sample size", np.repeat(sample_size, num_simulations))
    results = simulation_sizes.apply(simulate_approval_proportion, "Sample size")
    simulation_sizes.with_column("Proportion approving", results).hist("Proportion approving", bins=np.arange(0, 1+.02, .02))

approval_distribution(50, 2000)

*Write your answer here, replacing this text.*

### Question 8
Run the next cell and compare the results with the previous histogram.  How many people were sampled in total when you ran the cell?  Describe how the *law of large numbers* can help explain the difference in the two histograms.
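If it helps to see the effect of sample size in numbers rather than in a picture, the sketch below repeats the same kind of simulation on a made-up population using only the standard library. Everything here is an assumption for illustration: the toy population of 10,000 people with 20% approving, the 500 simulations, and the names `approval_proportion` and `spread_of_proportions`.

```python
import random
import statistics

def approval_proportion(population, sample_size, rng):
    """Proportion approving in one random sample drawn without replacement."""
    sample = rng.sample(population, sample_size)
    return sample.count("Approve") / sample_size

def spread_of_proportions(population, sample_size, num_simulations, rng):
    """Standard deviation of the approval proportion across many samples."""
    props = [approval_proportion(population, sample_size, rng)
             for _ in range(num_simulations)]
    return statistics.pstdev(props)

# Hypothetical population: 10,000 people, 20% of whom approve.
toy_population = ["Approve"] * 2000 + ["Other"] * 8000

rng = random.Random(0)  # fixed seed so the sketch is repeatable
small_spread = spread_of_proportions(toy_population, 50, 500, rng)
large_spread = spread_of_proportions(toy_population, 995, 500, rng)
print(small_spread, large_spread)
```

The two printed numbers summarize how widely the simulated proportions vary for each sample size.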

In [None]:
approval_distribution(995, 2000)

*Write your answer here, replacing this text.*