In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab07.ipynb")

<table style="width: 100%;">
<tr style="background-color: transparent;">
<td width="100px"><img src="https://cs104williams.github.io/assets/cs104-logo.png" width="90px" style="text-align: center"/></td>
<td>
  <p style="margin-bottom: 0px; text-align: left; font-size: 18pt;"><strong>CSCI 104: Data Science and Computing for All</strong><br>
                Williams College<br>
                Fall 2023</p>
</td>
</tr>
</table>


# Lab 7: Hypothesis Testing

<hr style="margin: 0px; border: 3px solid #500082;"/>

<h2>Instructions</h2>

- Before you begin, execute the cell at the TOP of the notebook to load the provided tests, as well as the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute these cells again.
- Be sure to consult your [Python Reference](https://cs104williams.github.io/assets/python-library-ref.html)!
- Complete this notebook by filling in the cells provided. For problems asking you to write explanations, you **must** provide your answer in the designated space. 
- Please be sure to not re-assign variables throughout the notebook.  For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously.
- This lab has hidden tests. That means that even if the visible tests say 100% passed, your final grade may not be 100%. We will run more tests for correctness once everyone turns in the lab.
- To use one or more late days on this lab, please fill out our [late day form](https://forms.gle/4sD16h3hN1xRqQM27) **before** the due date.

<hr/>
<h2>Setup</h2>


In [None]:
# Run this cell to set up the notebook.
# These lines import the numpy, datascience, and cs104 libraries.

import numpy as np
from datascience import *
from cs104 import *
%matplotlib inline

In this lab, we give you three real-world datasets and you will run hypothesis tests for each. We will structure our questions around these six steps (although sometimes we break these steps down even further into smaller parts).

---

**Steps for hypothesis testing.**

A. Define a Null and Alternate Hypothesis.

B. Choose a test statistic (typically an absolute difference in means or proportions).

C. Compute the test statistic for the observed data.

D. Simulate many samples according to the Null Hypothesis's model, and compute the test statistic for each of those samples.

E. Calculate the p-value by comparing your observed statistic to the simulated test statistics.

F. Use the p-value and p-value cutoff to draw a conclusion about the Null Hypothesis.

---
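
The six steps can be sketched end to end on a toy problem that is unrelated to the lab's datasets. The sketch below is a minimal illustration, assuming a hypothetical six-sided die that showed a six in 30 of 120 rolls. It uses only Python's standard library rather than the course's `cs104` helpers, and every name in it is made up for this example.

```python
import random

rng = random.Random(0)  # fixed seed so reruns give the same numbers

# Step A. Null: the die is fair, so a six appears with chance 1/6.
#         Alternative: the chance of a six is something other than 1/6.
expected_proportion = 1 / 6

# Step B. Test statistic: absolute difference from the expected proportion.
def die_statistic(proportion_of_sixes):
    return abs(proportion_of_sixes - expected_proportion)

# Step C. Statistic for the (hypothetical) observed data: 30 sixes in 120 rolls.
num_rolls = 120
observed_statistic = die_statistic(30 / num_rolls)

# Step D. Simulate many samples under the null model.
def one_simulated_proportion(sample_size):
    sixes = sum(rng.randint(1, 6) == 6 for _ in range(sample_size))
    return sixes / sample_size

simulated_statistics = [
    die_statistic(one_simulated_proportion(num_rolls)) for _ in range(2000)
]

# Step E. p-value: share of simulated statistics at least as large as observed.
p_value = sum(s >= observed_statistic for s in simulated_statistics) / len(
    simulated_statistics
)

# Step F. Compare the p-value with the cutoff.
cutoff = 0.05
conclusion = "reject" if p_value < cutoff else "fail to reject"
print(f"statistic={observed_statistic:.3f}, p-value={p_value:.3f}: {conclusion} the null")
```

In the lab itself, the library functions introduced in each part take the place of the hand-written loops used here.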

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 1. Therapeutic Touch (50 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Use simulation and test statistics to reject or fail to reject a null hypothesis.
- Compute a p-value to assess the statistical significance.
</font>

#### What is the Therapeutic Touch?

The Therapeutic Touch (TT) is the idea that everyone can feel the Human Energy Field (HEF) around individuals. Certain practitioners claim they have the ability to feel the HEF and can massage it in order to promote health and relaxation in individuals. Those who practice TT have described different people's HEFs as "warm as Jell-O" and "tactile as taffy".

TT was a popular technique used throughout the 20th century that was recommended as a great way to bring balance to a person's health. 

##### Emily Rosa

Emily Rosa was a 4th grade student who had wide exposure to the world of TT due to her parents. Her parents were both medical practitioners and skeptics of the idea of TT. 

For her 4th grade science fair project, Emily decided to test whether or not TT practitioners could truly interact with a person's HEF.

##### Emily's Experiment

Emily's experiment was clean, simple, elegant, and effective. Due to her parents' occupations in the medical field, she had wide access to people who claimed to be TT practitioners.

Emily recruited 21 TT practitioners for her science experiment. She would ask a TT practitioner to extend their hands through a screen (which they couldn't see through). On the other side, Emily would flip a coin and, depending on the result, put out either her left hand or her right hand. The TT practitioner would then have to say which hand Emily had put out. Overall, across 210 trials, the practitioners picked the correct hand 44% of the time.

Emily's main goal here was to test whether or not the TT practitioners' guesses were random, like the flip of a coin.

<!-- BEGIN QUESTION -->

#### Part 1.1 Step A: Define a Null and Alt Hypothesis (5 pts)


State the Null and Alternative Hypothesis for Emily's experiment.  Keep in mind that you must be able to simulate the Null Hypothesis to generate random samples.  

<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

Type your answer here:

* Null Hypothesis:  ...

* Alternative Hypothesis:  ...


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

<!-- END QUESTION -->

#### Part 1.2 Expected Outcome Under Null Hypothesis (5 pts)


Remember that the practitioner got the correct answer 44% of the time. Under the Null Hypothesis, on average, what proportion of the time do we expect the practitioner to guess the correct hand? Make sure your answer is between 0 and 1.

In [None]:
tt_expected_correct_proportion = ...
tt_expected_correct_proportion

In [None]:
grader.check("p1.2")

The goal now is to see if our deviation from this expected proportion of correct answers is due to something other than chance. 

#### Part 1.3 Step B: Test statistic (5 pts)


We'll use the absolute difference between the expected proportion of correct answers under the null hypothesis (which you defined as a variable above) and the actual proportion of correct answers as our test statistic.


Define the function `tt_test_statistic` that takes in an actual proportion, and returns the value of this test statistic.  Your code should use the variable `tt_expected_correct_proportion` that you defined above.

In [None]:
def tt_test_statistic(actual_proportion):
    ...

In [None]:
grader.check("p1.3")

#### Part 1.4 Step C: Test Statistic for Observed Data (5 pts)


Use your newly defined function to calculate the observed test statistic from Emily's experiment, where the practitioner was correct 44% of the time.

In [None]:
observed_tt_test_statistic = ...
observed_tt_test_statistic

In [None]:
grader.check("p1.4")

#### Part 1.5 Proportions Correct/Incorrect for Null Hypothesis model (5 pts)


Was the observed test statistic you just calculated due to chance? To address this question, we'll assume the Null Hypothesis model's assumptions are true, simulate samples according to the Null Hypothesis model, and calculate the test statistic for each simulation.

As a first step, define an array `tt_null_model_proportions` containing two elements that reflect the proportions of correct and incorrect answers expected from a TT practitioner under the Null Hypothesis.

In [None]:
tt_null_model_proportions = ...

In [None]:
grader.check("p1.5")

#### Part 1.6 Make Sample: Generate Sample According to Null Hypothesis (5 pts)


Now, define a function `one_tt_sample` that creates a sample for one simulation round.  That function should use your `tt_null_model_proportions` to simulate what proportion of the time TT practitioners choose the correct hand.  In other words, your function should return the proportion of correct answers in the simulated sample.

Emily asked TT practitioners to pick a hand 210 times.  That will be the size of our samples during simulation, but to make our code more general, we'll define `one_tt_sample` to take the sample size as a parameter.

Run your function a number of times to gain confidence it is working properly.

In [None]:
def one_tt_sample(sample_size):
    proportion_correct = ...
    return proportion_correct

# Show the result of one simulation of size 210.
one_tt_sample(210)

In [None]:
grader.check("p1.6")

#### Part 1.7 Step D: Simulate A Few Samples (5 pts)


Now let's create a simulated distribution of test statistics under the null hypothesis.

We'll first call the function to create only 5 sample statistics to help you debug, and then we'll provide another cell below to conduct a large simulation.

Hints: 
- Examine the documentation for the function `simulate_sample_statistic`.
- What functions have you already implemented that you'll use as inputs to `simulate_sample_statistic`?
- How many samples did Emily Rosa have? 

In [None]:
small_tt_test = simulate_sample_statistic(...,
                                          ...,
                                          ...,
                                          5)


# Inspect the output!
small_tt_test

In [None]:
grader.check("p1.7")

#### Part 1.8 Step D: Simulate Many Samples (5 pts)


Now assign `simulated_tt_statistics` to an array of 1000 simulated test statistics, and visually compare the observed test statistic with the distribution of the simulated test statistics under the Null Hypothesis.

In [None]:
num_tt_trials = ...
simulated_tt_statistics = simulate_sample_statistic(...,
                                                    ...,
                                                    ...,
                                                    num_tt_trials)


In [None]:
# Visualize the results
plot = Table().with_column('abs(expected - actual proportion correct)', simulated_tt_statistics).hist(left_end=observed_tt_test_statistic)
plot.dot(observed_tt_test_statistic)

We can make a visual argument as to whether or not we believe the observed test statistic is likely to occur under the null, or we can use the definition of p-values to help us make a more formal argument.

In [None]:
grader.check("p1.8")

#### Part 1.9 Step E: Calculate p-value (5 pts)


Recall, the p-value is the chance a test statistic simulated under the Null Hypothesis is equal to the value of the observed (real-world) test statistic or is even further in the direction of the Alternative Hypothesis.
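
As a small worked example of that definition, the sketch below counts by hand. The five simulated statistics and the observed value are made-up numbers, and `toy_empirical_pvalue` is a hypothetical stand-in written for this illustration, not the library's `empirical_pvalue`. It assumes, as with an absolute difference, that larger statistics point toward the Alternative Hypothesis.

```python
def toy_empirical_pvalue(simulated_statistics, observed_statistic):
    """Fraction of simulated statistics that are >= the observed statistic."""
    at_least_as_extreme = sum(
        1 for s in simulated_statistics if s >= observed_statistic
    )
    return at_least_as_extreme / len(simulated_statistics)

# Made-up numbers, small enough to check by hand.
toy_simulated = [0.01, 0.03, 0.05, 0.05, 0.08]
toy_observed = 0.05

# Three of the five values (0.05, 0.05, 0.08) are >= 0.05, so the result is 3/5.
print(toy_empirical_pvalue(toy_simulated, toy_observed))  # 0.6
```

Note that the two simulated values equal to the observed statistic are counted, which matches the "equal to ... or even further" wording of the definition.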

Use `simulated_tt_statistics` and `observed_tt_test_statistic`, together with the [empirical_pvalue](https://www.cs.williams.edu/~cs104/auto/inference-library-ref.html#empirical_pvalue) function from our library, to calculate the p-value.  This is our library's implementation of the function you wrote for Prelab 7.

In [None]:
tt_p_val = ...
tt_p_val

In [None]:
grader.check("p1.9")

<!-- BEGIN QUESTION -->

#### Part 1.10 Step F: Draw Conclusion (5 pts)


Given the conventional p-value cutoff of 0.05, do we reject or fail to reject the Null Hypothesis?  What conclusions can you draw from Emily Rosa's experiment?

<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

_Type your answer here, replacing this text._


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

<!-- END QUESTION -->

Check out [this Wikipedia page](https://en.wikipedia.org/wiki/Emily_Rosa) to learn more about Emily Rosa, the youngest person to have a research paper published in a peer-reviewed medical journal.

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 2. Using TVD as a Test Statistic (40 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Use the TVD test statistic to compare distributions
- Use simulation and test statistics to reject or fail to reject a null hypothesis
</font>

Total variation distance (TVD) is the test statistic that we have used to compare two distributions of categorical data. Recall we define TVD as:

$$\frac{\text{sum}(\text{absolute difference in proportion for each category})}{2}$$

You may wish to refer to [Chapter 11.2](https://inferentialthinking.com/chapters/11/2/Multiple_Categories.html#a-new-statistic-the-distance-between-two-distributions) and our lecture on [Assessing Models](https://www.cs.williams.edu/~cs104/index.html) if you'd like a refresher on TVD.
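
To make the formula concrete, here is a minimal sketch on two made-up distributions over three categories. It uses plain Python lists, and `toy_tvd` is a hypothetical name for this illustration, not the function from lecture.

```python
def toy_tvd(distribution1, distribution2):
    """Half the sum of absolute differences between matching categories."""
    differences = [abs(p - q) for p, q in zip(distribution1, distribution2)]
    return sum(differences) / 2

# Made-up proportions for three categories; each list sums to 1.
first = [0.5, 0.3, 0.2]
second = [0.4, 0.4, 0.2]

# Absolute differences are 0.1, 0.1, and 0.0, so the TVD is 0.2 / 2 = 0.1.
print(round(toy_tvd(first, second), 3))  # 0.1
```

Identical distributions give a TVD of 0, and distributions that share no categories give a TVD of 1.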

**Run the cell below to load in the `covid` table.**  This table is based on data downloaded from the [CDC Data Catalog](https://data.cdc.gov/browse) in May, 2022, when COVID-related deaths in the U.S. reached one million.  The table shows, as proportions, the Race and Hispanic Origin of a random sample of 100 people who have passed away from COVID.  It also shows the distribution of the U.S. population across those categories.

In [None]:
covid = Table.read_table("covid-deaths.csv")
covid.show()

<!-- BEGIN QUESTION -->

#### Part 2.1 Explore the Data (5 pts)


Create an overlaid bar chart of the "COVID-19 deaths" and "Population" columns.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Part 2.2 What do you see? (5 pts)


What differences can you identify between the two distributions in the bar chart?  In particular, compare the death rates for specific groups in the 100-person sample to the proportions of these groups in the population.  

<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

_Type your answer here, replacing this text._


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Part 2.3 Step A: State Your Hypotheses (5 pts)


We wish to test whether the disproportionate death rates are due just to chance or to other factors.  Define the Null Hypothesis, Alternative Hypothesis, and test statistic in the cell below.

<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

Type your answer here:

* Null Hypothesis:

* Alternative Hypothesis:

* Test Statistic:


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

<!-- END QUESTION -->

#### Part 2.4 Steps B & C: Test Statistics (5 pts)


Write a function `total_variation_distance` that takes as input two distributions and computes the TVD statistic on these two distributions.  

**Hint:** We have written this function before in [lecture](https://www.cs.williams.edu/~cs104/lectures/19-hypothesis-testing.html#tvd-statistic-distance-between-two-distributions) -- do not start from scratch! 
 
Then use this function to set `observed_covid_test_statistic` to be equal to the observed test statistic for the data in `covid`. 

In [None]:
def total_variation_distance(distribution1, distribution2):
    ...
    
observed_covid_test_statistic = ...
observed_covid_test_statistic

In [None]:
grader.check("p2.4")

#### Part 2.5 Make Sample (5 pts)


Under the null hypothesis, we'll simulate a sample many times. 

For our COVID scenario, we take the following steps to obtain **one sample**: We take the population distribution in the `covid` table and draw from its categories a total of `sample_size` times, recording the proportion of draws that fall in each category.

Use the `sample_proportions` function from our library to sample from the population in the `covid` table. 
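
For intuition about what such a sampling function does, the following sketch draws from a made-up three-category distribution using only Python's standard library. `toy_sample_proportions` is a hypothetical stand-in, not the library's `sample_proportions`, whose exact signature should be checked in the course reference.

```python
import random
from collections import Counter

def toy_sample_proportions(sample_size, proportions, rng):
    """Draw sample_size items from the categories and return the
    proportion of draws that landed in each category, in order."""
    categories = range(len(proportions))
    draws = rng.choices(categories, weights=proportions, k=sample_size)
    counts = Counter(draws)
    return [counts[c] / sample_size for c in categories]

rng = random.Random(0)  # fixed seed so reruns give the same numbers
toy_population = [0.6, 0.3, 0.1]  # made-up proportions that sum to 1

small_sample = toy_sample_proportions(100, toy_population, rng)
large_sample = toy_sample_proportions(100_000, toy_population, rng)
print(small_sample)
print(large_sample)
```

With a large sample size the sampled proportions should land close to the population proportions, while a sample of 100 can differ noticeably.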

<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

_Type your answer here, replacing this text._


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

In [None]:
def sample_covid_proportions(sample_size): 
    population_array = ...
    ...

In [None]:
# One sample with a large sample size. 
sampled = sample_covid_proportions(100000)
sampled

In [None]:
grader.check("p2.5")

#### Part 2.6 Step D: Create Sample and Compute Test Statistic Many Times (5 pts)


Great! We've implemented TVD (the test statistic) and a way to sample from the population. 

However, we can simplify our test statistic function here. We know that for every sample (whether that sample is the observed sample or a sample simulated from the null hypothesis) we will calculate the TVD between the sample and the `"Population"` column from the `covid` table.

In [None]:
# Run this function
def covid_tvd_statistic(sample):
    """
    Returns the total variation distance between the sample and population
    """
    return total_variation_distance(sample, covid.column("Population"))

To check this function, we'll combine the following two facts: (1) when the sample size is large, the sample's distribution is close to the population's, and (2) when two distributions are very similar, their TVD is close to 0.

Run the following cell and confirm that the output of `covid_tvd_statistic` is close to 0. 

In [None]:
sample = sample_covid_proportions(100000)
covid_tvd_statistic(sample)

We're now ready to run the simulation under the Null Hypothesis. 

We'll use the `simulate_sample_statistic()` function from our inference library and combine this with the functions we implemented above.  

*Hint:* Go back to the beginning of the problem and read the description to get what the actual observed `sample_size` is. 

In [None]:
num_trials = 10000
# Read the description from the original problem to get the sample size
sample_size = ...

simulated_covid_test_statistcs = simulate_sample_statistic(...,
                                                           sample_size,
                                                           ...,
                                                           num_trials)


In [None]:
grader.check("p2.6")

Run the cell below to plot a histogram of your simulated test statistics, as well as the observed value of the test statistic.

In [None]:
# Plot the results
plot = Table().with_column("Simulated Covid TVDs", simulated_covid_test_statistcs).hist()
plot.dot(observed_covid_test_statistic)

#### Part 2.7 Step E: Compute p-value (5 pts)


Use the function `empirical_pvalue` and the array of simulated statistics to calculate the p-value of your hypothesis test. Make sure that this number is consistent with what you observed in the histogram above. 



In [None]:
covid_p_value = ...
covid_p_value

In [None]:
grader.check("p2.7")

<!-- BEGIN QUESTION -->

#### Part 2.8 Step F: Draw Conclusion (5 pts)


What can you conclude about COVID deaths among various ethnic groups? Explain your answer using the results of your hypothesis test. Assume a p-value cutoff of 0.05.

<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

_Type your answer here, replacing this text._


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

<!-- END QUESTION -->

<hr style="margin-bottom: 0px; padding:0; border: 2px solid #500082;"/>


## 3. John Snow's Natural Experiment (40 pts)



<font color='#B1008E'>
    
##### Learning objectives
- Practice permutation testing to answer a question about data.
</font>

We'll now revisit one of our first examples of causality: John Snow's natural experiment to identify the cause of cholera.

Recall that in 19th century London, there were two theories for how cholera spreads:

* **Miasma theory**: cholera is caused by particles in the air, or "miasmata", which arise from decomposing matter

* **Germ theory**: cholera is caused by a germ cell that had not yet been identified, but was transmitted through food or drink

John Snow believed in germ theory, contrary to the prevailing miasma theory of the time.
After his investigation of the cholera outbreak due to the contaminated Broad Street pump, John Snow realized that a natural experiment existed that would enable him to evaluate germ theory.  Below is a map of London indicating the water supplier servicing each part of the city:

<img src="https://inferentialthinking.com/_images/snow_map2.jpg" width=50%></img>

Snow observed that there was no systematic difference between the people who were supplied by S&V and those supplied by Lambeth:

> Each company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies … there is no difference whatever in the houses or the people receiving the supply of the two Water Companies, or in any of the physical conditions with which they are surrounded …

However, there was a key difference in the water supply: 

> one group being supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients, the other group having water quite free from impurity.

Specifically:

* the Southwark & Vauxhall (S&V) company drew water from the Thames **below** the city's sewage discharge.
* the Lambeth company drew water from the Thames **above** the city's sewage discharge.

Your job is to use the data Snow collected to test whether tainted water causes cholera.  We'll use the same hypothesis testing steps we've used for each of the previous questions.

#### Part 3.1 Exploring the Data (0 pts)



In this first part, explore the following data. We will not grade this part, but examine the contents of the table, transform/visualize it, etc. to help you understand the problem we're solving.

The following cell loads our data.  Each row in our table contains the water company for one individual, and whether or not that individual died of Cholera.  

In the `Cholera death` column, `0` indicates the individual *did not* die of Cholera during the time of Snow's natural experiment whereas `1` indicates the individual died of Cholera.

Our data is based on Snow's original data, with two simplifications: we assume only one person lives in each household in Snow's data, and we reduce the number of measurements from ~65,000 to 5,000 to ensure our simulations run in a reasonable amount of time.

In [None]:
# Load the small data set of 5,000 people
snow = Table.read_table("snow_data_small.csv")

In [None]:
# Explore the data here 
...

<!-- BEGIN QUESTION -->

#### Part 3.2 Step A: Null And Alt Hypothesis (5 pts)



State the Null and Alternative Hypothesis for Snow's experiment.  Keep in mind that you must be able to simulate the Null Hypothesis to generate random samples.  

*Hint:*  If you have trouble getting started, you may wish to review [Chapter 12](https://inferentialthinking.com/chapters/12/2/Causality.html) and our lectures on Randomized Controlled Experiments and Causality.


<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

Type your answer here:

* Null Hypothesis:  ...

* Alternative Hypothesis:  ...


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

<!-- END QUESTION -->

#### Part 3.3 Data Type for Hypothesis Test (5 pts)

We have looked at four types of hypothesis tests in our general framework, each tailored to a specific type of problem.  These are summarized in the following slide from [lecture](https://docs.google.com/presentation/d/1hYuDX5GI-veOwxOqDRnAVLGwPsg7f8ANc3gzbNmh4eM).

<img src="hypothesis-testing.png" width=80%></img>

In the cell below, set `data_type` to the number below that reflects the type of data we have in Snow's experiment:

1. Single Category
2. Multiple Categories
3. Numeric
4. Two groups

In [None]:
data_type = ...

In [None]:
grader.check("p3.3")

<!-- BEGIN QUESTION -->

#### Part 3.4 Step B & C: Test Statistic for Observed Data (5 pts)



We are interested in the difference in the proportion of deaths for the two water companies.  Deaths are indicated by 1's in our "Cholera death" column, and survival by 0's.  With that encoding, the proportion of deaths recorded in the "Cholera death" column for a water company is also the mean of the values in the "Cholera death" column for that water company.  As we did in lecture, we can thus use the `abs_difference_of_means` function from our library as the test statistic. 
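
The claim that a proportion of 1's equals the mean of a 0/1 column can be checked on a tiny example. The numbers and group names below are made up, and `toy_abs_difference_of_means` is a hypothetical helper for this illustration rather than the library's `abs_difference_of_means`.

```python
# Made-up 0/1 outcomes for two groups (1 = event happened, 0 = it did not).
group_a = [1, 0, 0, 1, 0]  # 2 of 5 are 1's
group_b = [0, 0, 1, 0, 0]  # 1 of 5 are 1's

def mean(values):
    return sum(values) / len(values)

def toy_abs_difference_of_means(values1, values2):
    return abs(mean(values1) - mean(values2))

# The mean of a 0/1 list is the same number as the proportion of 1's.
proportion_a = group_a.count(1) / len(group_a)
print(mean(group_a), proportion_a)  # 0.4 0.4

print(round(toy_abs_difference_of_means(group_a, group_b), 3))  # 0.2
```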

In the cell below, compute this test statistic for Snow's observed data.

In [None]:
snow_observed_statistic = ...
snow_observed_statistic

In [None]:
grader.check("p3.4")

<!-- END QUESTION -->

#### Part 3.5 Permutation samples (5 pts)


In order to generate samples, we'll use our `permutation_sample` function. Read the documentation for this function and choose the appropriate `group_column_label`. Inspect the table that is the output from `permutation_sample` and make sure it makes sense to you. 
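
The idea behind a permutation sample can be sketched without the course library: shuffle the group labels while leaving each outcome in place, so that any link between label and outcome is broken, as a null hypothesis of "no difference between groups" would assume. The rows below are made up, and `toy_permutation_sample` is a hypothetical stand-in; the library's `permutation_sample` operates on a table and may differ in its details.

```python
import random

def toy_permutation_sample(labels, outcomes, rng):
    """Return (shuffled_labels, outcomes): labels are randomly reassigned,
    outcomes stay in their original order."""
    shuffled_labels = list(labels)  # copy so the original list is untouched
    rng.shuffle(shuffled_labels)
    return shuffled_labels, list(outcomes)

rng = random.Random(0)  # fixed seed so reruns give the same numbers

# Made-up data: a group label and a 0/1 outcome for six individuals.
labels = ["A", "A", "A", "B", "B", "B"]
outcomes = [1, 1, 0, 0, 0, 1]

new_labels, same_outcomes = toy_permutation_sample(labels, outcomes, rng)
print(new_labels)
print(same_outcomes)
```

Each group keeps its original size after shuffling; only which individuals carry which label changes.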

In [None]:
group_column_label = ...
sample = permutation_sample(snow, group_column_label)
sample

In [None]:
grader.check("p3.5")

<!-- BEGIN QUESTION -->

#### Part 3.6 Step D: Simulate samples (5 pts)


Use `simulate_permutation_statistic` from our inference library to simulate many samples from the null hypothesis. 

Let's start with just a few trials so we can debug our function. Does the output look reasonable? 

In [None]:
small_snow_stats = simulate_permutation_statistic(..., ..., ..., 5)
small_snow_stats


<!-- END QUESTION -->

#### Part 3.7 Step D: Simulate Many Samples (5 pts)


Now, run the simulation for 1000 trials (this may take a few minutes to run).

In [None]:
snow_simulated_statistics = ...

In [None]:
# Run this cell to plot 
results = Table().with_columns('Test Statistic', snow_simulated_statistics)
plot = results.hist()
plot.dot(snow_observed_statistic)

In [None]:
grader.check("p3.7")

#### Part 3.8 Step E: Compute p-value (5 pts)


Compute the p-value for your experiment using `empirical_pvalue`. 

In [None]:
snow_pvalue = ...
snow_pvalue

<!-- BEGIN QUESTION -->

#### Part 3.9 Step F: Draw Conclusion (5 pts)


State the conclusions of your experiment.

<hr style="margin:0; border: 1px solid #FFBE0A;"/><font color='#FFBE0A'>Written Answer:</font>

_Type your answer here, replacing this text._


<hr style="margin:0; border: 1px solid #FFBE0A;"/>

<!-- END QUESTION -->

<hr class="m-0" style="border: 3px solid #500082;"/>

# You're Done!
Follow these steps to submit your work:
* Run the tests and verify that they pass as you expect. 
* Choose **Save Notebook** from the **File** menu.
* **Run the final cell** and click the link below to download the zip file. 

Once you have downloaded that file, go to [Gradescope](https://www.gradescope.com/) and submit the zip file to 
the corresponding assignment. For Lab N, the assignment will be called "Lab N Autograder".

Once you have submitted, your Gradescope assignment should show you passing all the tests you passed in your assignment notebook.


## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)