In [None]:
# Run this cell to set up packages for lecture.
from lec13_imports import *

# Lecture 13 – Hypothesis Testing, p-values

## DSC 10, Summer 2025

### Agenda

- Statistical models
- Example: Jury selection
- Example: Is our coin fair?
    - Hypothesis tests.
    - Null and alternative hypotheses.
    - Test statistics, and how to choose them.
    - p-values

## Statistical models

### Models

- A model is a set of assumptions about how data was generated.

- Our goal is to **assess the quality of a model**.

- Given a dataset, our goal is to determine whether a model "explains" the patterns in the dataset.

- Our approach involves simulation. We **assume the model is true**, simulate many samples based on this model, and see how frequently these simulated samples look like our observed data.

- Finally, we use the results of our simulation to draw a conclusion about the validity of the model.

## Example: Jury selection

### Swain vs. Alabama, 1965

- Robert Swain was a Black man convicted of a crime in Talladega County, Alabama.

- He appealed the jury's decision all the way to the Supreme Court, on the grounds that Talladega County systematically excluded Black people from juries.

<center>$\substack{\text{eligible} \\ \text{population}}
\xrightarrow{\substack{\text{representative} \\ \text{sample}}} 
\substack{\text{jury} \\ \text{panel}}
\xrightarrow{\substack{\text{selection by} \\ \text{judge/attorneys}}} 
\substack{\text{actual} \\ \text{jury}}$</center>

- At the time, only men 21 years or older were allowed to serve on juries. **26%** of this eligible population was Black.

- But of the 100 men on Robert Swain's jury panel, only **8** were Black.

### Supreme Court ruling

- About disparities between the percentages in the eligible population and the jury panel, the Supreme Court wrote:

> "... the overall percentage disparity has been small..."

- The Supreme Court denied Robert Swain’s appeal and he was sentenced to life in prison.

- We now have the tools to show **quantitatively** that the Supreme Court's claim was misguided.

- This "overall percentage disparity" turns out to be not so small, and is an example of racial bias.
    - Jury panels were often made up of people in the jury commissioner's professional and social circles.
    - Of the 8 Black men on the jury panel, **none** were selected to be part of the actual jury.

### Setup

- <span style="color:blue"><b>Model</b></span>: Jury panels consist of 100 men, **randomly** chosen from a population that is 26% Black.

- <span style="color:orange"><b>Observation</b></span>: On the actual jury panel, only 8 out of 100 men were Black.

- **Question**: Does the <span style="color:blue">model</span> explain the <span style="color:orange">observation</span>?

### Our approach: Simulation

- We'll start by assuming that the model is true.

- We'll generate many jury panels using this assumption.

- We'll count the number of Black men in each simulated jury panel to see how likely it is for a random panel to contain 8 or fewer Black men.
    - If we see 8 or fewer Black men often, then the model seems reasonable.
    - If we rarely see 8 or fewer Black men, then the model may not be reasonable. 

### Simulating statistics

Recall, a *statistic* is a number calculated from a sample.

Our plan:

1. Run an experiment once to generate one value of our chosen statistic.
    - In this case, sample 100 people randomly from a population that is 26% Black, and count **the number of Black men (statistic)**.

2. Run the experiment many times, generating many values of the statistic, and store these statistics in an array.

3. Visualize the resulting **empirical distribution of the statistic**.

### Step 1 – Running the experiment once

- How do we randomly sample a jury panel? 
    - `np.random.choice` won't help us, because we don't know how large the eligible population is.

- The function `np.random.multinomial` helps us sample at random from a **categorical distribution**.

```py
np.random.multinomial(sample_size, pop_distribution)
```

- `np.random.multinomial` samples at random from the population, **with replacement**, and returns a random array containing counts in each category.
    - `pop_distribution` needs to be an array containing the probabilities of each category.

**Aside: Example usage of `np.random.multinomial`**

On Halloween 👻, you trick-or-treated at 35 houses, each of which had an identical candy box, containing:
- 30% Starbursts.
- 30% Sour Patch Kids.
- 40% Twix.

At each house, you selected one candy blindly from the candy box.

To simulate the act of going to 35 houses, we can use `np.random.multinomial`:

In [None]:
np.random.multinomial(35, [0.3, 0.3, 0.4])
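
As a quick check of what the returned array means, here is a minimal sketch. The names `candy_probs` and `candy_counts` are introduced here for illustration and are not part of the lecture code. It confirms that there is one count per category and that the counts add up to the number of houses.

```python
import numpy as np

# Probabilities of Starbursts, Sour Patch Kids, and Twix, in that order.
candy_probs = [0.3, 0.3, 0.4]

# One simulated night of trick-or-treating at 35 houses.
candy_counts = np.random.multinomial(35, candy_probs)

print(candy_counts)         # three counts, one per candy type
print(candy_counts.sum())   # always 35, since each house gives exactly one candy
```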

### Step 1 – Running the experiment once

In our case, a randomly selected member of our population is Black with probability 0.26 and not Black with probability 1 - 0.26 = 0.74.

In [None]:
demographics = [0.26, 0.74]

Each time we run the following cell, we'll get a new random sample of 100 people from this population.
- The first element of the resulting array is the number of Black men in the sample.
- The second element is the number of non-Black men in the sample.

In [None]:
np.random.multinomial(100, demographics)

### Step 1 – Running the experiment once

We also need to calculate the statistic, which in this case is the number of Black men in the random sample of 100.

In [None]:
np.random.multinomial(100, demographics)[0]

### Step 2 – Repeat the experiment many times

* Let's run 10,000 simulations.
* We'll keep track of the number of Black men in each simulated jury panel in the array `counts`.

In [None]:
counts = np.array([])

for i in np.arange(10000):
    new_count = np.random.multinomial(100, demographics)[0]
    counts = np.append(counts, new_count)
    
counts
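
The loop above is the pattern used throughout this lecture. As an aside, `np.random.multinomial` also accepts a `size` argument, so the same 10,000 panels could be drawn in a single call. The sketch below uses the hypothetical names `panels` and `counts_vectorized` so that it doesn't overwrite `counts`.

```python
import numpy as np

demographics = [0.26, 0.74]

# Draw 10,000 panels of 100 people at once; the result has shape (10000, 2).
panels = np.random.multinomial(100, demographics, size=10000)

# Column 0 holds the number of Black men in each simulated panel.
counts_vectorized = panels[:, 0]
counts_vectorized
```

Both approaches simulate under the same model; the loop is easier to adapt when the statistic is more complicated.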

### Step 3 – Visualize the resulting distribution

Was a jury panel with 8 Black men suspiciously unusual?

In [None]:
(bpd.DataFrame().assign(count_black_men=counts)
                .plot(kind='hist', bins = np.arange(9.5, 45, 1), 
                      density=True, ec='w', figsize=(10, 5),
                      title='Empirical Distribution of the Number of Black Men in Simulated Jury Panels of Size 100'));
observed_count = 8
plt.axvline(observed_count, color='black', linewidth=4, label='Observed Number of Black Men in Actual Jury Panel')
plt.legend();

In [None]:
# In 10,000 random experiments, the panel with the fewest Black men had how many?
counts.min()

### Conclusion

- Our simulation shows that there's essentially no chance that a random sample of 100 men drawn from a population in which 26% of men are Black will contain 8 or fewer Black men.
- As a result, it seems that the model we proposed – that the jury panel was drawn at random from the eligible population – is flawed.
- There were likely factors **other than chance** that explain why there were only 8 Black men on the jury panel.
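
To put a number on "essentially no chance", we can compute the proportion of simulated panels with 8 or fewer Black men and compare it with the exact probability under the model. This is a sketch that re-runs the simulation so that it stands alone. The names `prop_at_most_8` and `exact_prob` are introduced here, and the exact calculation uses the binomial formula, which the lecture itself does not rely on.

```python
import math
import numpy as np

demographics = [0.26, 0.74]

# Re-run the simulation: number of Black men in each of 10,000 random panels.
counts = np.array([])
for i in np.arange(10000):
    counts = np.append(counts, np.random.multinomial(100, demographics)[0])

# Proportion of simulated panels with 8 or fewer Black men.
prop_at_most_8 = np.count_nonzero(counts <= 8) / len(counts)

# Exact probability of 8 or fewer under the model, from the binomial formula.
exact_prob = sum(math.comb(100, k) * 0.26**k * 0.74**(100 - k) for k in range(9))

print(prop_at_most_8, exact_prob)
```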

## Example: Is our coin fair?

### Example: Is our coin fair?

Let's suppose we find a coin on the ground and we aren't sure whether it's a fair coin.

Out of curiosity (and boredom), we flip it 400 times.

In [None]:
flips_400 = bpd.read_csv('data/flips-400.csv')
flips_400

In [None]:
flips_400.groupby('outcome').count()

- **Question**: Does our coin look like a fair coin, or not?<br>


- This question is posed similarly to the question "were jury panels selected at random from the eligible population?"

### Hypothesis tests

- In the examples we've seen so far, our goal has been to **choose between two views of the world, based on data in a sample**.<br>
<small markdown="1">

    - "This jury panel was selected at random" or "this jury panel was not selected at random, since there weren't enough Black men on it."
    - "This coin is fair" or "this coin is not fair."
    
</small>

- A **hypothesis test** chooses between two views of how data were generated, based on data in a sample.

- The views are called **hypotheses**.

- The test picks the hypothesis that is *better* supported by the observed data; it doesn't guarantee that either hypothesis is correct.

### Null and alternative hypotheses

- In our current example, our two hypotheses are "this coin is fair" and "this coin is not fair."

- In a hypothesis test:
    - One of the hypotheses needs to be a well-defined probability model about how the data was generated, so that we can use it for simulation. This hypothesis is called the **null hypothesis**.
    - The **alternative hypothesis**, then, is a different view about how the data was generated.

- We can simulate `n` flips of a fair coin using `np.random.multinomial(n, [0.5, 0.5])`, but we can't simulate `n` flips of an unfair coin.<br>
    <small>What does "unfair" mean? Does it flip heads with probability 0.51? 0.03?</small>

- As such, "this coin is fair" is our **null hypothesis**, and "this coin is not fair" is our **alternative hypothesis**.

### Test statistics

- Once we've established our null and alternative hypotheses, we'll start by assuming the null hypothesis is true.

- Then, repeatedly, we'll generate samples under the assumption the null hypothesis is true (i.e. "under the null").<br>
<small>In the jury panel example, we repeatedly drew samples from a population that was 26% Black. In our current example, we'll repeatedly flip a fair coin 400 times.</small>

- For each sample, we'll calculate a single number – that is, a statistic.<br>
<small>In the jury panel example, this was the number of Black men. In our current example, a reasonable choice is the <b>number of heads</b>.</small>

- This single number is called the **test statistic** since we use it when "testing" which viewpoint the data better supports.<br>
<small>Think of the test statistic as a number you write down each time you perform an experiment.</small>

- The test statistic evaluated on our observed data is called the **observed statistic**.<br>
<small>In our current example, the observed statistic is 188.</small>

- Our hypothesis test boils down to checking **whether our observed statistic is a "typical value" in the distribution of our test statistic.**

### Simulating under the null

- Since our null hypothesis is "this coin is fair", we'll repeatedly flip a fair coin 400 times.

- Since our test statistic is the **number of heads**, in each set of 400 flips, we'll count the number of heads.

- Doing this will give us an **empirical distribution of our test statistic**.

In [None]:
# Computes a single simulated test statistic.
np.random.multinomial(400, [0.5, 0.5])[0]

In [None]:
# Computes 10,000 simulated test statistics.

results = np.array([])
for i in np.arange(10000):
    result = np.random.multinomial(400, [0.5, 0.5])[0]
    results = np.append(results, result)
    
results

### Visualizing the empirical distribution of the test statistic

Let's visualize the empirical distribution of the test statistic $\text{number of heads}$.

In [None]:
bpd.DataFrame().assign(results=results).plot(kind='hist', bins=np.arange(160, 240, 4), 
                                             density=True, ec='w', figsize=(10, 5),
                                             title='Empirical Distribution of the Number of Heads in 400 Flips of a Fair Coin');
plt.legend();

If we observed close to 200 heads, we'd think our coin is fair. 

There are two cases in which we'd think our coin is unfair:

- If we observed lots of heads, e.g. 225.

- If we observed very few heads, e.g. 172.

This means that the histogram above is divided into three regions, where two of them mean the same thing (we think our coin is unfair).

It would be a bit simpler if we had a histogram that was divided into just two regions. How do we create such a histogram?

<center><img src="images/folding.png" width=1200></center>

### Choosing a test statistic

- We'd like our test statistic to be such that:
    - Large observed values side with one hypothesis.
    - Small observed values side with the other hypothesis.

- In this case, our statistic should capture how far our number of heads is from that of a fair coin.

- One idea: $| \text{number of heads} - 200 |$.
    - If we use this as our test statistic, the observed statistic is $| 188 - 200 | = 12$.
    - By simulating, we can quantify whether 12 is a reasonable value of the test statistic, or if it's larger than we'd expect from a fair coin.

### Simulating under the null, again

Let's define a function that computes a single value of our test statistic. We'll do this often moving forward.

In [None]:
def num_heads_from_200(arr):
    return abs(arr[0] - 200)

num_heads_from_200([188, 212])

Now, we'll again repeat the act of flipping a fair coin 400 times, 10,000 times over. The only difference is the test statistic we'll compute each time.

In [None]:
results = np.array([])
for i in np.arange(10000):
    result = num_heads_from_200(np.random.multinomial(400, [0.5, 0.5]))
    results = np.append(results, result)
    
results

### Visualizing the empirical distribution of the test statistic, again

Let's visualize the empirical distribution of our **new** test statistic, $| \text{number of heads} - 200 |$.

In [None]:
bpd.DataFrame().assign(results=results).plot(kind='hist', bins=np.arange(0, 60, 4), 
                                             density=True, ec='w', figsize=(10, 5),
                                             title=r'Empirical Distribution of | Num Heads - 200 | in 400 Flips of a Fair Coin');
plt.axvline(12, color='black', linewidth=4, label='observed statistic (12)')
plt.legend();

In black, we've drawn our observed statistic, 12. Does 12 seem like a reasonable value of the test statistic – that is, how rare was it to see a test statistic of 12 or larger in our simulations?

### Drawing a conclusion

- It's not uncommon for the test statistic to be 12, or even higher, when flipping a fair coin 400 times. So it looks like our observed coin flips could have come from a fair coin.

- We are *not* saying that the coin is definitely fair, just that it's reasonably plausible that the coin is fair.


- More generally, if we frequently see values in the empirical distribution of the test statistic that are as or more extreme than our observed statistic, the null hypothesis seems plausible. In this case, we **fail to reject** the null hypothesis.


- If we rarely see values as extreme as our observed statistic, we **reject** the null hypothesis.
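
To make "not uncommon" concrete, one option is to compute the proportion of simulated test statistics that are at least as large as the observed value of 12. The sketch below re-runs the simulation so that it stands alone; `prop_at_least_12` is a name introduced here.

```python
import numpy as np

def num_heads_from_200(arr):
    return abs(arr[0] - 200)

# Simulate |number of heads - 200| for 10,000 sets of 400 fair coin flips.
results = np.array([])
for i in np.arange(10000):
    result = num_heads_from_200(np.random.multinomial(400, [0.5, 0.5]))
    results = np.append(results, result)

# Proportion of simulations at least as extreme as the observed statistic.
prop_at_least_12 = np.count_nonzero(results >= 12) / len(results)
prop_at_least_12
```

A proportion that is far from 0 indicates that values like 12 show up regularly when the coin is fair.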

### Our choice of test statistic depends on our alternative hypothesis!

- Suppose that our alternative hypothesis is now "this coin is biased **towards tails**."

- Now, the test statistic $| \text{number of heads} - 200 |$ **won't work**. Why not?

- In our current example, the value of the statistic $| \text{number of heads} - 200 |$ is 12. However, given just this information, we can't tell whether we saw:
    - 188 heads (212 tails), which would be evidence that the coin is biased towards tails.
    - 212 heads (188 tails), which would not be evidence that the coin is biased towards tails.

- As such, with this alternative hypothesis, we need another test statistic.

- Idea: $\text{number of heads}$.
    - Small observed values side with the alternative hypothesis ("this coin is biased towards tails").
    - Large observed values side with the null hypothesis ("this coin is fair").
    - We could also use $\text{number of tails}$.

### Simulating under the null, one more time

Since we're using the number of heads as our test statistic again, our simulation code is the same as it was originally.

In [None]:
results = np.array([])
for i in np.arange(10000):
    result = np.random.multinomial(400, [0.5, 0.5])[0]
    results = np.append(results, result)
    
results

### Visualizing the empirical distribution of the test statistic, one more time

Let's visualize the empirical distribution of the test statistic $\text{number of heads}$, one more time.

In [None]:
bpd.DataFrame().assign(results=results).plot(kind='hist', bins=np.arange(160, 240, 4), 
                                             density=True, ec='w', figsize=(10, 5),
                                             title='Empirical Distribution of the Number of Heads in 400 Flips of a Fair Coin');
plt.axvline(188, color='black', linewidth=4, label='observed statistic (188)')
plt.legend();

- We frequently saw 188 or fewer heads in 400 flips of a fair coin; our observation doesn't seem to be that rare.

- As such, the null hypothesis seems plausible, so we **fail to reject** the null hypothesis.

### Questions to consider before choosing a test statistic

- **Key idea**: Our choice of test statistic depends on the pair of viewpoints we want to decide between.

- Our test statistic should be such that **high observed values lean towards one hypothesis and low observed values lean towards the other**.

- We will avoid test statistics where both high and low observed values lean towards one viewpoint and observed values in the middle lean towards the other.
    - In other words, we will avoid "two-sided" tests.
    - To do so, we can take an absolute value.

- In our recent exploration, the null hypothesis was "the coin is fair."
    - When the alternative hypothesis was "the coin is biased," the test statistic we chose was $$|\text{number of heads} - 200 |.$$
    - When the alternative hypothesis was "the coin is biased towards tails," the test statistic we chose was $$\text{number of heads}.$$

### Summary

- In assessing a model, we are choosing between one of two viewpoints, or **hypotheses**.
    - The **null hypothesis** must be a well-defined probability model about how the data was generated.
    - The **alternative hypothesis** can be any other viewpoint about how the data was generated.
- To test a pair of hypotheses, we:
    1. Simulate the experiment many times under the assumption that the null hypothesis is true.
    1. Compute a **test statistic** on each of the simulated samples, as well as on the observed sample.
    1. Look at the resulting empirical distribution of test statistics and see where the observed test statistic falls. If it seems like an atypical value (too large or too small), we reject the null hypothesis; otherwise, we fail to reject the null.
- When selecting a test statistic, we want to choose a quantity that helps us distinguish between the two hypotheses, in such a way that large observed values favor one hypothesis and small observed values favor the other.
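
The three steps above can be sketched as a reusable function. This is one possible arrangement, not the lecture's own code: `simulate_test_statistics` and its parameter names are hypothetical, and it assumes the null hypothesis is a categorical distribution we can sample from with `np.random.multinomial`.

```python
import numpy as np

def simulate_test_statistics(sample_size, null_distribution, test_statistic, repetitions=10000):
    """Return an array of test statistics simulated under the null hypothesis."""
    results = np.array([])
    for i in np.arange(repetitions):
        # Step 1: draw one sample of category counts under the null.
        sample = np.random.multinomial(sample_size, null_distribution)
        # Step 2: compute the test statistic on that sample.
        results = np.append(results, test_statistic(sample))
    return results

def num_heads_from_200(arr):
    return abs(arr[0] - 200)

simulated = simulate_test_statistics(400, [0.5, 0.5], num_heads_from_200)
observed = num_heads_from_200([188, 212])

# Step 3: see where the observed statistic falls among the simulated ones.
print(observed, simulated.mean())
```

Passing the test statistic in as a function means the same loop works for both the "biased" and the "biased towards tails" alternatives.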

## Question

- **Question:** Where do we draw the line between rejecting the null and failing to reject the null? How many heads, exactly, is considered a small number of heads?

In [None]:
# Computes a single simulated test statistic.
np.random.multinomial(400, [0.5, 0.5])[0]

In [None]:
# Computes 10,000 simulated test statistics.
np.random.seed(8)
results = np.array([])
for i in np.arange(10000):
    result = np.random.multinomial(400, [0.5, 0.5])[0]
    results = np.append(results, result)
    
results

### Visualizing the empirical distribution of the test statistic

Let's visualize the empirical distribution of the test statistic $\text{number of heads}$.

In [None]:
bpd.DataFrame().assign(results=results).plot(kind='hist', bins=np.arange(160, 240, 4), 
                                             density=True, ec='w', figsize=(10, 5),
                                             title='Empirical Distribution of the Number of Heads in 400 Flips of a Fair Coin');
plt.legend();

- Our hypothesis test boils down to checking **whether our observed statistic is a "typical value" in the distribution of our test statistic.**
    - If we see a small number of heads, we'll **reject the null hypothesis** in favor of the alternative hypothesis, "the coin is biased towards tails." 
    - Otherwise we'll **fail to reject the null hypothesis** because it's plausible that "the coin is fair."

- **Question:** Where do we draw the line between rejecting the null and failing to reject the null? How many heads, exactly, is considered a small number of heads?

### Determining a cutoff

In [None]:
bpd.DataFrame().assign(results=results).plot(kind='hist', bins=np.arange(160, 240, 4), 
                                             density=True, ec='w', figsize=(10, 5),
                                             title='Empirical Distribution of the Number of Heads in 400 Flips of a Fair Coin');
plt.legend();

- The lower the number of heads we observe, the stronger the evidence for the alternative hypothesis, "the coin is biased towards tails."

- If we were to observe, say, 170 heads, even though such a result *could* happen with a fair coin, seeing 170 heads from a fair coin is sufficiently rare that we'd think the coin was biased towards tails. 

- If we were to observe, say, 195 heads, such a result happens frequently enough from a fair coin that we would not have good reason to think the coin was unfair.

- Let's say an outcome is **sufficiently rare** if it falls in the lowest **five percent** of outcomes in our simulation under the assumptions of the null hypothesis.
    - Five percent could be something else - five is just a nice round number. 

- We can determine the boundary between the lowest five percent of outcomes and the remaining outcomes using the concept of percentile.

In [None]:
np.percentile(results, 5)

In [None]:
draw_cutoff(results)

- Now we have a clear cutoff that tells us which hypothesis to side with.
    - If we find a coin, flip it 400 times, and observe fewer than 184 heads, we'll side with "the coin is biased towards tails."
    - If we find a coin, flip it 400 times, and observe 184 heads or more, we'll side with "the coin is fair."
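
The cutoff rule can be written as a small function. The sketch below re-runs the simulation without a seed, so the cutoff it computes may differ slightly from the 184 quoted above; `cutoff` and `side_with` are hypothetical names introduced for illustration.

```python
import numpy as np

# Simulate the number of heads in 400 flips of a fair coin, 10,000 times.
results = np.array([])
for i in np.arange(10000):
    results = np.append(results, np.random.multinomial(400, [0.5, 0.5])[0])

# Boundary of the lowest five percent of simulated outcomes.
cutoff = np.percentile(results, 5)

def side_with(num_heads):
    # Hypothetical helper: applies the cutoff rule described above.
    if num_heads < cutoff:
        return 'the coin is biased towards tails'
    return 'the coin is fair'

print(cutoff, side_with(170), side_with(195))
```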

### p-values

- To quantify how rare our observed statistic is, under the null hypothesis, we can compute what’s called a **p-value**.

- The p-value is defined as the probability, under the null hypothesis, that the test statistic **is equal** to the value that was observed in the data **or is even further in the direction of the alternative**. <br>
<small>Its formal name is the observed significance level.</small>

- p-values correspond to the "tail areas" of a histogram, starting at the observed statistic.

In [None]:
bpd.DataFrame().assign(results=results).plot(kind='hist', bins=np.arange(160, 240, 4), 
                                             density=True, ec='w', figsize=(10, 5),
                                             title='Empirical Distribution of the Number of Heads in 400 Flips of a Fair Coin');
plt.legend();

- For example, it's extremely rare to flip a fair coin 400 times and see 170 heads or fewer.

In [None]:
np.count_nonzero(results <= 170) / len(results)

- But it's much less rare to flip a fair coin 400 times and see 195 heads or fewer.

In [None]:
np.count_nonzero(results <= 195) / len(results)

- The larger the p-value, the more likely our observation is, if the null hypothesis is true. Therefore, 
    - Larger p-values mean we should fail to reject the null.
    - Smaller p-values mean we should reject the null.

### Conventions

- If the p-value is sufficiently large, we say the data is **consistent** with the null hypothesis and so we "**fail to reject the null hypothesis**".
    - We never say that we "accept" the null hypothesis! The null hypothesis may be plausible, but there are many other possible explanations for our data.

- If the p-value is below some cutoff, we say the data is **inconsistent** with the null hypothesis, and we **"reject the null hypothesis"**.
    - If a p-value is less than 0.05, the result is said to be "statistically significant".
    - If a p-value is less than 0.01, the result is said to be "highly statistically significant".
    - These conventions are historical and completely arbitrary! (And controversial.)

### Drawing a conclusion

- Recall that we found a coin, flipped it 400 times, and saw 188 heads. This is the **observed statistic**.

In [None]:
bpd.DataFrame().assign(results=results).plot(kind='hist', bins=np.arange(160, 240, 4), 
                                             density=True, ec='w', figsize=(10, 5),
                                             title='Empirical Distribution of the Number of Heads in 400 Flips of a Fair Coin');
plt.axvline(188, color='black', linewidth=4, label='observed statistic (188)')
plt.legend();

- How common is it to see 188 heads or fewer when we flip a fair coin 400 times? Let's calculate the p-value.

In [None]:
np.count_nonzero(results <= 188) / len(results)

- It happens about 12% of the time. It's not so rare. Since the p-value is at least 0.05, we **fail to reject** the null hypothesis at the standard 0.05 **significance level** and conclude that it's plausible that our coin is fair.
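
As a sanity check on the simulated p-value, the same probability can be computed exactly, since every sequence of 400 fair flips is equally likely. This goes beyond the simulation-based approach of the lecture and is included only for comparison; `favorable` and `exact_p_value` are names introduced here.

```python
import math

# P(188 or fewer heads in 400 fair flips): count the favorable sequences
# and divide by the total number of equally likely sequences, 2**400.
favorable = sum(math.comb(400, k) for k in range(189))
exact_p_value = favorable / 2**400
exact_p_value
```

The simulated p-value should land close to this exact value, with small differences due to the randomness of the simulation.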

### Left or right? 👈👉

- In this example, we calculated the p-value as the area in the **left** tail of the histogram.

- But other times, when large values of the observed statistic indicate the alternative hypothesis, the p-value corresponds to the area in the **right** tail. 

- In order to calculate a p-value, we have to know whether larger or smaller values correspond to the alternative hypothesis. 

- We always calculate a p-value with `>=` or `<=` because it includes the probability that the test statistic exactly equals the observed statistic.
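
One way to keep the direction explicit is to wrap the calculation in a function. This is a sketch; `empirical_p_value` and its `direction` argument are hypothetical and not part of the lecture's imports.

```python
import numpy as np

def empirical_p_value(simulated, observed, direction):
    """Proportion of simulated statistics at least as extreme as observed.

    direction is 'left' when small values favor the alternative,
    and 'right' when large values favor the alternative.
    """
    simulated = np.array(simulated)
    if direction == 'left':
        return np.count_nonzero(simulated <= observed) / len(simulated)
    elif direction == 'right':
        return np.count_nonzero(simulated >= observed) / len(simulated)
    raise ValueError("direction must be 'left' or 'right'")

# Tiny hand-checkable example with five simulated statistics.
toy = [1, 2, 3, 4, 5]
print(empirical_p_value(toy, 2, 'left'))    # 2 of 5 values are <= 2, so 0.4
print(empirical_p_value(toy, 4, 'right'))   # 2 of 5 values are >= 4, so 0.4
```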

## Example: Midterm exam scores

### The problem

- In Fall 2022, there were four sections of DSC 10 – A, B, C, and D.

- One of the four sections – the one taught by Suraj – had a much lower average than the other three sections.
    - All midterms were graded by the same people, with the same rubrics.

In [None]:
# Midterm scores from DSC 10, Fall 2022, slightly perturbed for anonymity.
scores = bpd.read_csv('data/fa22-midterm-scores.csv')
scores

In [None]:
scores.plot(kind='hist', density=True, figsize=(10, 5), ec='w', bins=np.arange(10, 85, 5), title='Distribution of Midterm Exam Scores in DSC 10');

In [None]:
# Total number of students who took the exam.
scores.shape[0]

In [None]:
# Calculate the number of students in each section.
scores.groupby('Section').count()

In [None]:
# Calculate the average midterm score in each section.
scores.groupby('Section').mean()

### Thought experiment 💭🧪

- Suppose we _randomly_ place all 393 students into one of four sections and compute the average midterm score within each section. 

- One of the sections would _have to_ have a lower average score than the others (unless multiple are tied for the lowest).
    - In any set of four numbers, one of them has to be the minimum!

- But is Section C's average _lower_ than we'd expect due to chance? Let's perform a hypothesis test!

### Suraj's defense

- **Null Hypothesis:** Section C's scores are drawn randomly from the distribution of scores in the course overall. The observed difference between the average score in Section C and the average score in the course overall is due to random chance.

- **Alternative Hypothesis:** Section C's average score is too low to be explained by chance alone.




### Does this sample look like it was drawn from this population?

- This hypothesis test, like others we have seen, can be framed as a question of whether a certain sample (Suraj's section) appears to have been drawn randomly from a certain population (all students). 

- For the Robert Swain jury example, we asked if our sample of jury panelists looked like it was drawn from the population of eligible jurors.

- Even the fair coin example can be posed in this way. 
    - In this case, the population distribution is the uniform distribution (50% heads, 50% tails), and when we flip a coin, we are sampling from this distribution with replacement. 
    - We wanted to know if the sample of coin flips we observed looked like it came from this population, which represents a fair coin.

### What are the observed characteristics of Section C?

In [None]:
section_size = scores.groupby('Section').count().get('Score').loc['C']
observed_avg = scores.groupby('Section').mean().get('Score').loc['C']
print(f'Section C had {section_size} students and an average midterm score of {observed_avg}.')

### Simulating under the null hypothesis

- Model: There is no significant difference between the exam scores in different sections. Section C had a lower average purely due to chance.
    - To simulate: sample 108 students uniformly at random without replacement from the class. 

- Test statistic: the average midterm score of a section.
    - The observed statistic is the average midterm score of Section C (about 46.17).
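
To see what "without replacement" guarantees, the sketch below samples from hypothetical student ID numbers 0 through 392 rather than from the real `scores` table, using `np.random.choice` with `replace=False`. The names `student_ids` and `simulated_section` are introduced here for illustration.

```python
import numpy as np

# Hypothetical student IDs standing in for the 393 rows of the scores table.
student_ids = np.arange(393)

# Sample 108 distinct students, as in one simulated section.
simulated_section = np.random.choice(student_ids, 108, replace=False)

# Without replacement, no student can appear twice.
print(len(simulated_section), len(np.unique(simulated_section)))
```

Both printed numbers are 108, which mirrors how a real section can't contain the same student twice.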

In [None]:
# Sample 108 students from the class, independent of section, 
# and compute the average score.
scores.sample(int(section_size), replace=False).get('Score').mean()

In [None]:
averages = np.array([])
repetitions = 1000
for i in np.arange(repetitions):
    random_sample = scores.sample(int(section_size), replace=False)
    new_average = random_sample.get('Score').mean()
    averages = np.append(averages, new_average)
    
averages

### What's the verdict? 🤔

In [None]:
observed_avg

In [None]:
bpd.DataFrame().assign(RandomSampleAverage=averages).plot(kind='hist', bins=20, 
                                                          density=True, ec='w', figsize=(10, 5), 
                                                          title='Empirical Distribution of Midterm Averages for 108 Randomly Selected Students')
plt.axvline(observed_avg, color='black', linewidth=4, label='observed statistic')
plt.legend();

In [None]:
# p-value
np.count_nonzero(averages <= observed_avg) / repetitions

The p-value is below the standard cutoff of 0.05, and even below the 0.01 cutoff for being "highly statistically significant", so we **reject the null hypothesis.** It's not looking good for you, Suraj!