In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

# Module 3.2 Part 1: Introduction to Hypothesis Testing

In this lecture guide, you'll be introduced to hypothesis testing. Hypothesis testing is a cornerstone of statistical inference.

Four videos make up this notebook, for a total runtime of 36:49.

1. [Introduction to Hypothesis Testing](#section1) *1 video, total runtime 3:11*
2. [Chance Models](#section2) *1 video, total runtime 13:57*
3. [Model Assessment](#section3) *1 video, total runtime 15:43*
4. [Check for Understanding](#section4) *1 video, total runtime 3:58*

Textbook readings:
- [Chapter 11: Testing Hypotheses](https://www.inferentialthinking.com/chapters/11/Testing_Hypotheses.html)
- [Chapter 11.1: Assessing Models](https://www.inferentialthinking.com/chapters/11/1/Assessing_Models.html)

<a id='section1'></a>
## 1. Introduction to Hypothesis Testing

In the following lecture video, Professor Adhikari introduces hypothesis testing and statistical models. You'll learn how
to use these tools to assess assumptions about data.

In [None]:
YouTubeVideo('wJ9Eov9Mdf0')

<a id='section2'></a>
## 2. Chance Models

Next, we'll dive into chance models. A model of random jury panel selection serves as a motivating example,
with real data taken from the 1965 *Swain v. Alabama* Supreme Court case. It is a prime example of how statistics
can be used as a tool for social justice.

In [None]:
YouTubeVideo('OreWRDOb9fg')

<a id='section3'></a>
## 3. Model Assessment

In this video, an example from Mendelian genetics motivates the concept of model assessment. You'll learn how to assess whether data that you've collected are consistent with a preconceived chance model.

In [None]:
YouTubeVideo('OI4x1i_0kPU')

<a id='section4'></a>
## 4. Check for Understanding

**A. The Berkeley City Council claims that 5% of Berkeley residents aren't respecting social distancing guidelines. Generate a random sample of 500 Berkeley residents, assuming that the Berkeley City Council is correct. Count the number of individuals in this sample who aren't following these guidelines.**

In [None]:
population_props = ...
sample_props = ...
sample_nums = ...
num_not_social_distancing = ...
num_not_social_distancing

<details>
    <summary>Solution</summary>
    
    population_props = make_array(0.05, 0.95)
    sample_props = sample_proportions(500, population_props)
    sample_nums = 500 * sample_props
    num_not_social_distancing = sample_nums.item(0)
    num_not_social_distancing
</details>
<br>
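
For intuition about what the sampling step in part A does, here is a minimal sketch that uses only Python's standard `random` module instead of the `datascience` package. The helper name `count_not_distancing` and the seed are hypothetical, chosen for illustration; this is not how `sample_proportions` is implemented, only a simulation of the same chance model under the assumption that each sampled resident independently has a 5% chance of not following the guidelines.

```python
import random

def count_not_distancing(sample_size, claimed_proportion, rng):
    """Simulate one sample and count residents who are not distancing.

    Each resident is drawn independently and is a non-distancer with
    probability `claimed_proportion` (the council's claim).
    """
    return sum(rng.random() < claimed_proportion for _ in range(sample_size))

rng = random.Random(0)  # arbitrary seed so that reruns are reproducible
count = count_not_distancing(500, 0.05, rng)
proportion = count / 500  # the analogue of sample_props.item(0)
print(count, proportion)
```

Multiplying the sample proportion by 500 recovers the count, which is the same step the solution above performs with `500 * sample_props`.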

**B. Run your code from part A a few more times. Is the number of Berkeley residents that aren't respecting social distancing guidelines
identical from sample to sample? If not, why?**

<details>
    <summary>Solution</summary>
    No! Each run draws a new random sample from the population of Berkeley residents, so the number of people in the sample
    who aren't respecting the guidelines is a random quantity that varies from sample to sample.
</details>
<br>
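
To get a rough sense of how much the count varies, one can treat the number of non-distancing residents in a sample, $X$, as a binomial count with $n = 500$ draws and success probability $p = 0.05$. This is an assumption for illustration (it treats the draws as independent, which is reasonable when the population is much larger than the sample), not something derived in the videos:

$$
\mathbb{E}[X] = np = 500 \times 0.05 = 25, \qquad \mathrm{SD}(X) = \sqrt{np(1-p)} = \sqrt{500 \times 0.05 \times 0.95} \approx 4.87
$$

Under the council's claim, then, most samples should contain somewhere around 25 non-distancing residents, give or take a handful.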

**C. You suspect that Berkeley's municipal government is too optimistic about the number of residents respecting
public health guidelines. You randomly sample 500 Berkeley residents, and find that 107 individuals in this sample refuse
to wear a face mask in public.**

**Is there reason to believe that the Berkeley City Council's claim is incorrect?**

In [None]:
# create array to store statistics
distances = make_array()

# generate 1000 statistics
repetitions = 1000
for i in np.arange(repetitions):
    one_distance = ...
    distances = ...

# compute the distance in your sample
sample_distance = ...

# view the distribution of the distances
Table().with_column('Distance from 5%', distances).hist()
plt.scatter(sample_distance, 0, color='red', s=30); # plot the test statistic

<details>
    <summary>Solution</summary>
    <b>Code</b>: <br>
    
    # create array to store statistics
    distances = make_array()

    # generate 1000 statistics
    repetitions = 1000
    for i in np.arange(repetitions):
        one_distance = abs((sample_proportions(500, make_array(0.05, 0.95))).item(0) - 0.05)
        distances = np.append(distances, one_distance)

    # compute the distance in your sample
    sample_distance = abs(107/500 - 0.05)

    # view the distribution of the distances
    Table().with_column('Distance from 5%', distances).hist()
    plt.scatter(sample_distance, 0, color='red', s=30); # plot the test statistic
    
<b>Interpretation</b>: <br>
    If the Berkeley City Council's claim were true, we would expect roughly 5% of the residents in most random samples of 500
    (about 25 people) to be ignoring the public health guidelines. However, our sample contains 107 such residents, or 21.4% of
    the sample, which is a distance of 0.164 from the claimed 5%. The red dot lies far to the right of the bulk of the simulated
    distances, so we have reason to believe that the proportion of residents violating social distancing rules is much higher than 5%.
</details>
<br>
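
The visual comparison in part C can also be summarized with a number: the fraction of simulated distances that are at least as large as the observed one. The sketch below is one possible way to compute it, again using only the standard `random` module rather than the `datascience` package. The names `simulate_distance` and `tail_fraction`, and the seed, are hypothetical and introduced for illustration only.

```python
import random

rng = random.Random(42)  # arbitrary seed so that reruns are reproducible
sample_size = 500
claimed = 0.05  # the council's claimed proportion

# Distance between the observed sample proportion (107 of 500) and the claim
observed_distance = abs(107 / sample_size - claimed)

def simulate_distance():
    """Simulate one sample under the claim and return its distance from 5%."""
    count = sum(rng.random() < claimed for _ in range(sample_size))
    return abs(count / sample_size - claimed)

# 1000 repetitions, matching the exercise
simulated = [simulate_distance() for _ in range(1000)]

# Fraction of simulated distances at least as large as the observed distance
tail_fraction = sum(d >= observed_distance for d in simulated) / len(simulated)
print(round(observed_distance, 3), max(simulated), tail_fraction)
```

A tail fraction at or near zero means that samples like ours essentially never arise under the council's model, which matches the conclusion drawn from the histogram.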

**D. In the following video, you'll have to choose the appropriate statistics for evaluating different viewpoints.**

In [None]:
YouTubeVideo('ybDvLbRR4UA')