In [1]:
import numpy as np
import pandas as pd
import scipy.stats as st

from statsmodels.stats import proportion
from math import ceil
from IPython.display import display, Latex

# Reference

> [Unit: Significance tests (hypothesis testing)](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample)

---

# The idea of significance tests

> [Hypothesis Testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing)

A **statistical hypothesis test** is a method of [statistical inference](https://en.wikipedia.org/wiki/Statistical_inference "Statistical inference") used to determine a possible conclusion from two different, and likely conflicting, hypotheses.

In a statistical hypothesis test, a [null hypothesis](https://en.wikipedia.org/wiki/Null_hypothesis "Null hypothesis") and an [alternative hypothesis](https://en.wikipedia.org/wiki/Alternative_hypothesis "Alternative hypothesis") are proposed for the probability distribution of the data. If, given that the null hypothesis is true, the sample obtained has a probability of occurrence less than the pre-specified threshold probability (the [significance level](https://en.wikipedia.org/wiki/Significance_level "Significance level")), the difference between the sample and the null hypothesis is deemed [_statistically significant_](https://en.wikipedia.org/wiki/Statistically_significant "Statistically significant"). The hypothesis test may then lead to the rejection of the null hypothesis and acceptance of the alternative hypothesis.

The process of distinguishing between the null hypothesis and the alternative hypothesis is aided by considering [Type I error](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors "Type I and type II errors") and [Type II error](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors "Type I and type II errors") which are controlled by the pre-specified significance level.

Hypothesis tests based on statistical significance are another way of expressing [confidence intervals](https://en.wikipedia.org/wiki/Confidence_interval "Confidence interval") (more precisely, confidence sets). In other words, every hypothesis test based on significance can be obtained via a confidence interval, and every confidence interval can be obtained via a hypothesis test based on significance.[[1]](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#cite_note-1)

---

## Simple hypothesis testing

---

### Example 1

Every day Ahmet buys a scratch-off lottery ticket with a $40\%$ chance of winning some prize. He noticed that whenever he wears his red shirt he usually wins. He decided to keep track of his winnings while wearing the shirt and found that he won $3$ out of $3$ times.

Let's test the hypothesis that **Ahmet's chance of winning while wearing the shirt is $40\%$ as always** versus the alternative that the chance is somehow _greater_.

**Assuming the hypothesis is correct, what is the probability of Ahmet winning $3$ times out of $3$? Round your answer, if necessary, to the nearest tenth of a percent.**

In [2]:
display(Latex(f"$P(X=3 | H_{0}) = {round(st.binom.pmf(3, 3, 0.4), 3)}$"))

$P(X=3 | H_0) = 0.064$

Let's agree that if the observed outcome has a probability _less_ than $1\%$ under the tested hypothesis, we will reject the hypothesis.

**What should we conclude regarding the hypothesis?**

We cannot reject the hypothesis.

Explain:
    
Let’s find the probability of winning $3$ times out of $3$.

Assuming the hypothesis is true, the probability of Ahmet winning a single time is $40\%$ (meaning $0.40$). Since we are looking for the probability of this happening $3$ times, we need to multiply $0.40$ by itself $3$ times.

We can use a calculator to find that $0.4^{3}$ is $0.064$, which is $6.4\%$.

The probability we got is higher than $1\%$. Therefore, _we cannot reject the hypothesis_.

In other words, the probability of the observed outcome under the tested hypothesis is not small enough, according to the threshold we set, to support rejecting the hypothesis. However, if Ahmet keeps winning while wearing the red shirt, we might reconsider the hypothesis.

We were testing the hypothesis that Ahmet's chance of winning while wearing the shirt is $40\%$ as always.

Assuming the hypothesis is correct, the probability of him winning $3$ times out of $3$ is $6.4\%$.

Therefore, we _cannot_ reject the hypothesis.
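The remark that the hypothesis might be reconsidered if Ahmet keeps winning can be made concrete with a short sketch. It assumes he wins every single time he wears the shirt and keeps the $1\%$ threshold from above; the variable names are illustrative only.

```python
# Probability of k wins in k tries under H0 (each win has probability 0.4).
p_win = 0.40
threshold = 0.01

# Find the shortest perfect winning streak whose probability falls below the threshold.
wins_needed = 1
while p_win ** wins_needed >= threshold:
    wins_needed += 1

print(f"P(3 wins in 3 tries) = {p_win ** 3:.3f}")
print(f"Perfect streak needed to drop below {threshold:.0%}: {wins_needed}")
```

Under these assumptions, three straight wins are not enough, but a sufficiently long unbroken streak would eventually push the probability below the threshold.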

---

### Example 2

Roy’s Toys Company received a huge shipment of rubber duckies from a factory. The factory guaranteed Roy that the percentage of defective toys won’t exceed $1.5\%$, but Roy suspects it does. He took a random sample of $200$ duckies, and found that $3\%$ of them were defective.

Let's test the hypothesis that **the actual percentage of defective duckies is $1.5\%$** versus the alternative that the actual percentage is _higher_ than that.

The table below sums up the results of $1000$ simulations, each simulating a sample of $200$ duckies, assuming there are $1.5\%$ defective duckies.

|Measured % of defective duckies|Frequency|
|:-:|:-:|
|0|54|
|0.5|132|
|1|225|
|1.5|241|
|2|162|
|2.5|108|
|3|50|
|3.5|21|
|4|4|
|4.5|3|

**According to the simulations, what is the probability of getting a sample with $3\%$ defective duckies or more?**


In [3]:
# Raw f-string so that backslashes such as \geq and \% reach LaTeX unchanged.
display(Latex(rf"$P(X \geq 3\% | H_{0}) = {round((50 + 21 + 4 + 3) / 1000, 3)}$"))

$P(X \geq 3\% | H_0) = 0.078$

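Let's agree that if the observed outcome has a probability _less_ than $1\%$ under the tested hypothesis, we will reject the hypothesis.

**What should we conclude regarding the hypothesis?**

We cannot reject the hypothesis.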

Explain:
    
According to the table, out of $1000$ simulated samples:

- $50$ had $3\%$ defective duckies
- $21$ had $3.5\%$ defective duckies
- $4$ had $4\%$ defective duckies
- $3$ had $4.5\%$ defective duckies

In total, these sum up to $78$ simulations out of $1000$. Therefore, the simulations imply that the probability of having a sample with $3\%$ defective duckies or more is:

$\displaystyle \frac{78}{1000}=7.8\%$

The probability we got is higher than $1\%$. Therefore, _we cannot reject the hypothesis_.

In other words, the probability of the observed outcome under the tested hypothesis is not small enough, according to the threshold we set, to support rejecting the hypothesis. A larger sample would have given more decisive results.

We were testing the hypothesis that the actual percentage of defective duckies is $1.5\%$.

Assuming the hypothesis is correct, the probability of getting a sample with $3\%$ defective duckies or more is $7.8\%$.

Therefore, we _cannot_ reject the hypothesis.
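As a cross-check, the tail count from the simulation table can be compared with an exact binomial calculation. This is only a sketch: it assumes each of the $200$ duckies is defective independently with probability $1.5\%$, and it uses the standard library so the cell runs on its own.

```python
from math import comb

# Frequencies copied from the simulation table above (1000 simulated samples).
frequencies = {0: 54, 0.5: 132, 1: 225, 1.5: 241, 2: 162,
               2.5: 108, 3: 50, 3.5: 21, 4: 4, 4.5: 3}

total = sum(frequencies.values())
at_least_3 = sum(count for pct, count in frequencies.items() if pct >= 3)
simulated_p = at_least_3 / total

# Exact binomial tail: 3% of 200 duckies is 6 defective, so we need P(X >= 6).
n, p0 = 200, 0.015
exact_p = 1 - sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(6))

print(f"simulated: {simulated_p:.3f}, exact binomial: {exact_p:.3f}")
```

Both numbers sit well above the $1\%$ threshold, so the conclusion does not depend on which one is used.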





---

## Writing null and alternative hypotheses

---

### Example 1

A healthcare provider saw that $48\%$ of their members received their flu shot in a recent year. The healthcare provider tried a new advertising strategy in the following year, and they took a sample of members to test if the proportion who received their flu shot had changed.

**What are appropriate hypotheses for their significance test?**

$H_{0}: p = 48\%$

$H_{a}: p \neq 48\%$

(where $p$ is the proportion of members who received the flu shot)

---

### Example 2

A restaurant owner installed a new automated drink machine. The machine is designed to dispense $530\text{ mL}$ of liquid on the medium size setting. The owner suspects that the machine may be dispensing too much in medium drinks. They decide to take a sample of $30$ medium drinks to see if the average amount is significantly greater than $530\text{ mL}$.

**What are appropriate hypotheses for their significance test?**

$H_{0}: \mu = 530$

$H_{a}: \mu > 530$

(where $\mu$ is the average amount of liquid dispensed on this setting)

---

## Estimating P-values from simulations

> [p-value](https://en.wikipedia.org/wiki/P-value)

In [null-hypothesis significance testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing "Statistical hypothesis testing"), the **_p_-value**[[note 1]](https://en.wikipedia.org/wiki/P-value#cite_note-2) is the probability of obtaining test results at least as extreme as the [results actually observed](https://en.wikipedia.org/wiki/Realization_(probability) "Realization (probability)"), under the assumption that the [null hypothesis](https://en.wikipedia.org/wiki/Null_hypothesis "Null hypothesis") is correct.[[2]](https://en.wikipedia.org/wiki/P-value#cite_note-3)[[3]](https://en.wikipedia.org/wiki/P-value#cite_note-ASA-4) A very small _p_-value means that such an extreme observed [outcome](https://en.wikipedia.org/wiki/Outcome_(probability) "Outcome (probability)") would be very unlikely under the null hypothesis. Reporting _p_-values of statistical tests is common practice in [academic publications](https://en.wikipedia.org/wiki/Academic_publishing "Academic publishing") of many quantitative fields. Since the precise meaning of _p_-value is hard to grasp, [misuse is widespread](https://en.wikipedia.org/wiki/Misuse_of_p-values "Misuse of p-values") and has been a major topic in [Metascience](https://en.wikipedia.org/wiki/Metascience "Metascience").[[4]](https://en.wikipedia.org/wiki/P-value#cite_note-5)[[5]](https://en.wikipedia.org/wiki/P-value#cite_note-6)

---

### Example 1: One-tailed

An employee at an aquarium monitors how much their sea otters eat. The amount of food a particular otter eats daily is approximately normally distributed with a mean of $17$ pounds and a standard deviation of $1$ pound. They suspected this otter was not eating enough, so they took a random sample of $n=10$ days and observed a sample mean of $\bar x=16.5$ pounds of food per day.

To see how likely a sample like this was to occur by random chance alone, the employee performed a simulation. They simulated $40$ samples of $n=10$ values from a normal population with a mean of $17$ pounds and a standard deviation of $1$ pound. They recorded the mean of the values in each sample. Here are the sample means from their $40$ samples:

![](https://raw.githubusercontent.com/ZacksAmber/PicGo/master/img/20220426111403.png)

They want to test $H_0: \mu=17 \text{ lbs}$ vs. $H_\text{a}: \mu<17 \text{ lbs}$ where $\mu$ is the true mean amount of food per day.

**Based on these simulated results, what is the approximate $p$-value of the test?**  
_Note: The sample result was $\bar x=16.5 \text{ lbs}$._

$\displaystyle p\text{-value} \approx \frac{5}{40} \approx 0.125$

Explain:
    
The $n=10$ days in the sample had a mean of $\bar x=16.5\text{ lbs}$.

Since the alternative hypothesis is $H_\text{a}:\mu<17\text{ lbs}$, we can find the approximate $p$-value of this result by looking at how often a sample result _as low or lower than_ $16.5\text{ lbs}$ occurred in the simulation.

The simulation produced a sample mean at or below $16.5\text{ lbs}$ in $5$ out of $40$ samples:

![](https://raw.githubusercontent.com/ZacksAmber/PicGo/master/img/20220426112152.png)

$\displaystyle p\text{-value} \approx \frac{5}{40} \approx 0.125$

This $p$-value says that when we sample $10$ values from a normal population with a mean of $17\text{ lbs}$ and standard deviation of $1\text{ lb}$, there is about a $12.5\%$ chance that we see a sample mean as low or lower than $16.5\text{ lbs}$.
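The same kind of simulation can be sketched in a few lines. The seed and the number of simulated samples below are arbitrary choices, not values from the exercise, and the exact figure relies on the stated normal population.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

mu0, sigma, n, observed_mean = 17.0, 1.0, 10, 16.5

# Simulate many samples of n days under H0 and record each sample mean.
rng = random.Random(0)  # fixed seed so the run is repeatable
n_sims = 20_000
sample_means = [fmean([rng.gauss(mu0, sigma) for _ in range(n)])
                for _ in range(n_sims)]

# One-tailed (lower) p-value: share of simulated means at or below the observed one.
simulated_p = sum(m <= observed_mean for m in sample_means) / n_sims

# Exact value from the sampling distribution of the mean, N(mu0, sigma / sqrt(n)).
exact_p = NormalDist(mu0, sigma / sqrt(n)).cdf(observed_mean)

print(f"simulated p-value: {simulated_p:.3f}, exact p-value: {exact_p:.3f}")
```

The exact normal calculation gives roughly $0.057$, so an estimate based on only $40$ simulated samples, such as $0.125$, can be noticeably off; more simulated samples bring the estimate closer to the exact value.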

---

### Example 2: Two-tailed

A large school district knows that $75\%$ of students in previous years rode the bus to school. Administrators wondered if that figure was still accurate, so they took a random sample of $n=80$ students and found that $\hat p=65\%$ of those sampled rode the bus to school.

To see how likely a sample like this was to happen by random chance alone, the school district performed a simulation. They simulated $120$ samples of $n=80$ students from a large population where $75\%$ of the students rode the bus to school. They recorded the proportion of students who rode the bus in each sample. Here are the sample proportions from their $120$ samples:

![](https://raw.githubusercontent.com/ZacksAmber/PicGo/master/img/20220428115845.png)

They want to test $H_0: p=75\%$ vs. $H_\text{a}: p \neq 75\%$ where $p$ is the true proportion of students in this district that ride the bus to school.

**Based on these simulated results, what is the approximate $p$-value of the test?**  
_Note: The sample result was $\hat p=65\%$._

$\displaystyle p\text{-value} \approx \frac{7}{120} \approx 0.058$

Explain:

The district observed $\hat p=65\%$ of the sample of $n=80$ students rode the bus.

Since the alternative hypothesis is $H_\text{a}:p\neq 75\%$, we can find the approximate $p$-value of this result by looking at how often a sample proportion _as far or farther than_ $65\%$ occurred in the simulation. We need to look for sample proportions this far _above or below_ the hypothesized proportion.

The sample proportion $\hat p=65\%$ is $10\%$ below the hypothesized proportion of $75\%$. The simulation produced a sample proportion as far or farther than this distance in $7$ out of $120$ samples:

![](https://raw.githubusercontent.com/ZacksAmber/PicGo/master/img/20220428120239.png)

$\displaystyle p\text{-value} \approx \frac{7}{120} \approx 0.058$

This $p$-value says that when we take a sample of $80$ students from a large population where $75\%$ of the students ride the bus, there is about a $5.8\%$ chance that we see a sample proportion as far or farther away from $75\%$ as $65\%$.
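A two-tailed count like this one can be sketched as follows. The seed and the number of simulated samples are arbitrary, and the code works with rider counts ($52$ and $68$ out of $80$) rather than proportions so that "at least $10$ points away" is not affected by floating-point rounding.

```python
import random
from math import comb

n, p0 = 80, 0.75
low, high = 52, 68  # 65% and 85% of 80 students: 10 points below / above 75%

# Simulate the number of bus riders in each sample of n students under H0.
rng = random.Random(0)  # fixed seed so the run is repeatable
n_sims = 5_000
counts = [sum(rng.random() < p0 for _ in range(n)) for _ in range(n_sims)]

# Two-tailed p-value: share of samples at least as far from 75% as 65% is.
simulated_p = sum(c <= low or c >= high for c in counts) / n_sims

# Exact binomial probability of the same two tails.
def pmf(k):
    return comb(n, k) * p0**k * (1 - p0)**(n - k)

exact_p = sum(pmf(k) for k in range(n + 1) if k <= low or k >= high)

print(f"simulated p-value: {simulated_p:.3f}, exact p-value: {exact_p:.3f}")
```

Both tails are counted because the alternative hypothesis is $p \neq 75\%$; a one-tailed test would keep only the `c <= low` condition.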

---

## Using P-values to make conclusions

> [Using P-values to make conclusions](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/idea-of-significance-tests/a/p-value-conclusions)

We use $p$-values to make conclusions in significance testing. More specifically, we compare the $p$-value to a significance level $\alpha$ to make conclusions about our hypotheses.

If the $p$-value is lower than the significance level we chose, then we reject the null hypothesis $H_0$ in favor of the alternative hypothesis $H_\text{a}$. If the $p$-value is greater than or equal to the significance level, then we fail to reject the null hypothesis $H_0$, but this doesn't mean we accept $H_0$. To summarize:

$\displaystyle p \text{-value} < \alpha \Rightarrow \text{reject } H_0 \Rightarrow \text{accept }H_\text{a}$

$p \text{-value} \geq \alpha \Rightarrow \text{fail to reject } H_0$
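The decision rule is small enough to write as a function. `conclude` is a hypothetical helper name used only for this sketch, and the $p$-values passed to it are made-up inputs.

```python
def conclude(p_value: float, alpha: float) -> str:
    """Apply the decision rule: reject H0 only when the p-value is below alpha."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"

print(conclude(0.03, 0.05))  # p-value below alpha
print(conclude(0.03, 0.01))  # same p-value, stricter alpha
print(conclude(0.05, 0.05))  # a tie counts as "fail to reject"
```

Note that the comparison is strict: a $p$-value exactly equal to $\alpha$ does not lead to rejection.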

Let's try a few examples where we use $p$-values to make conclusions.

---

### Example 1

Alessandra designed an experiment where subjects tasted water from four different cups and attempted to identify which cup contained bottled water. Each subject was given three cups that contained regular tap water and one cup that contained bottled water (the order was randomized). She wanted to test if the subjects could do better than simply guessing when identifying the bottled water.

Her hypotheses were $H_0: p=0.25$ vs. $H_\text{a}: p>0.25$ (where $p$ is the true likelihood of these subjects identifying the bottled water).

The experiment showed that $20$ of the $60$ subjects correctly identified the bottled water. Alessandra calculated that the statistic $\hat p=\frac{20}{60}=0.\bar3$ had an associated $p$-value of approximately $0.068$.

**What conclusion should be made using a significance level of $\alpha=0.05$?**

Fail to reject $H_0$

This is because the $p$-value of $0.068$ is greater than $\alpha=0.05$.

**In context, what does this conclusion say?**

We don't have enough evidence to say that these subjects can do better than guessing when identifying the bottled water.

The null hypothesis $H_0: p=0.25$ says their likelihood is no better than guessing, and we failed to reject the null hypothesis.

**How would the conclusion have changed if Alessandra had instead used a significance level of $\alpha=0.10$?**

She would have rejected $H_0$.

Changing the significance level would not change the results of the experiment or the P-value. Since $0.068$ is less than $\alpha=0.10$, this significance level would have led Alessandra to reject $H_0$ and accept $H_\text{a}$.
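The example does not say how the $p$-value of $0.068$ was obtained. One common choice, assumed here, is a one-proportion $z$-test that uses the standard error implied by $H_0$; the sketch below reproduces the number under that assumption and then applies both significance levels.

```python
from math import sqrt
from statistics import NormalDist

n, successes, p0 = 60, 20, 0.25
p_hat = successes / n

# One-proportion z-test, with the standard error computed from the null value p0.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 1 - NormalDist().cdf(z)  # upper tail, since Ha: p > 0.25

for alpha in (0.05, 0.10):
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: p-value = {p_value:.3f} -> {decision}")
```

The data and the $p$-value are identical in both iterations of the loop; only the threshold changes, which is why the conclusion flips.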

---

### Example 2

A certain bag of fertilizer advertises that it contains $7.25\text{ kg}$, but the amounts these bags actually contain are normally distributed with a mean of $7.4\text{ kg}$ and a standard deviation of $0.15\text{ kg}$.

The company installed new filling machines, and they wanted to perform a test to see if the mean amount in these bags had changed. Their hypotheses were $H_0: \mu=7.4\text{ kg}$ vs. $H_\text{a}: \mu \neq 7.4\text{ kg}$ (where $\mu$ is the true mean weight of these bags filled by the new machines).

They took a random sample of $50$ bags and observed a sample mean and standard deviation of $\bar x=7.36\text{ kg}$ and $s_x=0.12\text{ kg}$. They calculated that these results had a P-value of approximately $0.02$.

**What conclusion should be made using a significance level of $\alpha=0.05$?**

Reject $H_0$ and accept $H_\text{a}$

Since the $p$-value of $0.02$ is less than $\alpha=0.05$, we should reject $H_0$ and accept $H_\text{a}$.

**In context, what does this conclusion say?**

The evidence suggests that these bags are being filled with a mean amount that is different than $7.4\text{ kg}$.

The P-value was low enough to reject $H_0: \mu=7.4\text{ kg}$, so we can accept $H_\text{a}: \mu \neq 7.4\text{ kg}$.

**How would the conclusion have changed if they had instead used a significance level of $\alpha=0.01$?**

They would have failed to reject $H_0$.

Changing the significance level would not change the results of the experiment or the P-value. Since $0.02$ is greater than $\alpha=0.01$, this significance level would have led them to fail to reject $H_0$.

---

### Ethics and the significance level $\alpha$

These examples demonstrate how we may arrive at different conclusions from the same data depending on what we choose as our significance level $\alpha$. In practice, we should make our hypotheses and set our significance level before we collect or see any data. Which specific significance level we choose depends on the consequences of various errors.

---

# Error probabilities and power

---

## Introduction to Type I and Type II errors

> [Type I and type II errors](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors)

In [statistical hypothesis testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing "Statistical hypothesis testing"), a **type I error** is the mistaken rejection of an actually true [null hypothesis](https://en.wikipedia.org/wiki/Null_hypothesis "Null hypothesis") (also known as a "false positive" finding or conclusion; example: "an innocent person is convicted"), while a **type II error** is the failure to reject a null hypothesis that is actually false (also known as a "false negative" finding or conclusion; example: "a guilty person is not convicted"). Much of statistical theory revolves around the minimization of one or both of these errors, though the complete elimination of either is a statistical impossibility if the outcome is not determined by a known, observable causal process. By selecting a low threshold (cut-off) value and modifying the alpha (α) level, the quality of the hypothesis test can be increased. The knowledge of Type I errors and Type II errors is widely used in [medical science](https://en.wikipedia.org/wiki/Medical_science "Medical science"), [biometrics](https://en.wikipedia.org/wiki/Biometrics "Biometrics") and [computer science](https://en.wikipedia.org/wiki/Computer_science "Computer science").

<table class="wikitable" style="border: 1px solid black">


<tbody><tr>
<th rowspan="2" colspan="2" style="border: 1px solid black">&nbsp;Table of error types
</th>
<th colspan="2" style="border: 1px solid black"><br>Null hypothesis (<i>H</i><sub>0</sub>) is<br>&nbsp;
</th></tr>
<tr>
<th style="border: 1px solid black">True
</th>
<th style="border: 1px solid black">False
</th></tr>
<tr style="border: 1px solid black">
<th rowspan="2" style="border: 1px solid black">Decision<br>about null<br>hypothesis (<i>H</i><sub>0</sub>)
</th>
<th style="border: 1px solid black">Don't<br>reject
</th>
<td style="text-align:center; border: 1px solid black"><br>Correct inference <br>(true negative)
<p>(probability = 1−<i>α</i>)<br>
</p>
</td>
<td style="text-align:center; border: 1px solid black">Type II error <br>(false negative)<br>(probability = <i>β</i>)&nbsp;
</td></tr>
<tr>
<th style="border: 1px solid black">Reject
</th>
<td style="text-align:center; border: 1px solid black">Type&nbsp;I error <br>(false positive)<br>(probability = <i>α</i>)&nbsp;
</td>
<td style="text-align:center; border: 1px solid black"><br>Correct inference<br>(true positive)
<p>(probability = 1−<i>β</i>)<br>&nbsp;
</p>
</td></tr></tbody></table>

---

## Type I vs Type II error

> [Type I vs Type II error](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/error-probabilities-and-power/e/type-i-error-type-ii-error-power)

---

### Example 1: Type II error

A quality control expert wants to test the null hypothesis that a new solar panel is no more effective than the older model.

**Under which of the following conditions would the expert commit a Type II error?**

The new panel is actually more effective, and they don't conclude that it is more effective.

This is a Type II error—$H_{\text{a}}$ is true, but we fail to reject $H_0$.

---

### Example 2: Type I error

According to a report from the United States Environmental Protection Agency, burning one gallon of gasoline typically emits about $8.9\text{ kg}$ of $\text{CO}_2$. A fuel company wants to test a new type of gasoline designed to have lower $\text{CO}_2$ emissions. Here are their hypotheses:

- $H_0: \mu =8.9 \text{ kg}$
- $H_{\text{a}}: \mu < 8.9 \text{ kg}$

(where $\mu$ is the mean amount of $\text{CO}_2$ emitted by burning one gallon of this new gasoline).

**Under which of the following conditions would the company commit a Type I error?**

The mean amount of $\text{CO}_2$ emitted by the new fuel is actually $8.9\text{ kg}$, and they conclude it is lower than $8.9\text{ kg}$.

This is a Type I error—$H_0$ is true, but they reject it.

---

### Example 3: Type II error

A large university is curious if they should build another cafeteria. They plan to survey a sample of their students to see if there is strong evidence that the proportion interested in a meal plan is higher than $40\%$, in which case they will consider building a new cafeteria.

Let $p$ represent the proportion of students interested in a meal plan. Here are the hypotheses they'll use:

- $H_0: p \leq 0.40$
- $H_{\text{a}}: p > 0.40$

**What would be the consequence of a Type II error in this context?**

They don't consider building a new cafeteria when they should.

Explain:

**In this setting**

**Type I error: Rejecting a true null hypothesis**

If $H_0$ is true, then at most $40\%$ of students are actually interested. So a Type I error would occur if the sample result is significantly higher than $40\%$, and they consider building the new cafeteria when they shouldn't.

**Type II error: Failing to reject a false null hypothesis**

If $H_0$ is false, then $H_{\text{a}}$ is true, and more than $40\%$ of students are actually interested. So a Type II error would occur if the sample result is not significantly higher than $40\%$, and they don't consider building the new cafeteria when they should.

---

## Introduction to power in significance tests

A perfect test would have zero false positives and zero false negatives. However, statistical methods are probabilistic, and it cannot be known for certain whether statistical conclusions are correct. Whenever there is uncertainty, there is the possibility of making an error. Because of this, every statistical hypothesis test has some probability of making Type I and Type II errors.

-   The type I error rate or significance level is the probability of rejecting the null hypothesis given that it is true. It is denoted by the Greek letter $\alpha$ (alpha) and is also called the alpha level. Usually, the significance level is set to 0.05 (5%), implying that it is acceptable to have a 5% probability of incorrectly rejecting the true null hypothesis.
-   The rate of the type II error is denoted by the Greek letter $\beta$ (beta) and is related to the [power of a test](https://en.wikipedia.org/wiki/Power_(statistics) "Power (statistics)"), which equals $1-\beta$.

Notation:
- $\beta$ = probability of a Type II error, known as a "false negative"
- $1 - \beta$ = probability of a "true positive", i.e., correctly rejecting the null hypothesis. **"$1 - \beta$" is also known as the power of the test.**
- $\alpha$ = probability of a Type I error, known as a "false positive"
- $1 - \alpha$ = probability of a "true negative", i.e., correctly not rejecting the null hypothesis

Impact:
- $\alpha \uparrow$, $\beta \downarrow$, $1 - \beta \uparrow$
- sample size $n \uparrow$, power $\uparrow$
- Variability $\downarrow$, power $\uparrow$
- True parameter far from $H_0$, power $\uparrow$
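The four effects listed above can be checked numerically with a small sketch. It assumes an upper-tailed one-sample $z$-test with a known population standard deviation; `power_upper_tail` is a hypothetical helper, and every number in the baseline scenario is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tail(mu0, mu_true, sigma, n, alpha):
    """Power of an upper-tailed one-sample z-test (Ha: mu > mu0) with known sigma."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)       # rejection cutoff in z units
    shift = (mu_true - mu0) / (sigma / sqrt(n))  # distance of the truth from H0, in standard errors
    return 1 - std_normal.cdf(z_crit - shift)

# Hypothetical baseline scenario; none of these numbers come from the text.
base = dict(mu0=100, mu_true=103, sigma=10, n=25, alpha=0.05)

variations = {
    "baseline": {},
    "larger alpha (0.10)": {"alpha": 0.10},
    "larger n (100)": {"n": 100},
    "less variability (sigma = 5)": {"sigma": 5},
    "truth farther from H0 (106)": {"mu_true": 106},
}

powers = {label: power_upper_tail(**{**base, **change})
          for label, change in variations.items()}
for label, power in powers.items():
    print(f"{label:<30} power = {power:.3f}, beta = {1 - power:.3f}")
```

Each variation changes one input at a time, so any increase in power relative to the baseline row can be attributed to that single factor.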

---

## Error probabilities and power

> [Error probabilities and power](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/error-probabilities-and-power/e/error-probabilities-power)

---

### Example 1

A significance test is going to be performed using a significance level of $\alpha=0.05$. Suppose that the null hypothesis is actually false.

**If the significance level was lowered to $\alpha=0.01$, which of the following would be true?**

The probability of a Type II error would increase and the power of the test would decrease.

Explain:

Since we are assuming that the null hypothesis is false, the correct conclusion would be to reject the null hypothesis.

Rejecting a false null hypothesis is more likely to happen with a higher significance level and less likely to happen with a lower significance level, since rejecting with lower significance requires our sample result to be farther away from the null hypothesis.

![](https://raw.githubusercontent.com/ZacksAmber/PicGo/master/img/20220428174836.png)

A Type II error would occur if we failed to reject the false null hypothesis. Lowering the significance level makes it harder to reject a false null hypothesis, so a lower significance level would increase the probability of a Type II error.

Power is the likelihood that the test rejects the false null hypothesis. Lowering the significance level makes it harder to reject a false null hypothesis, so a lower significance level would decrease the power of the test.

**If we use a lower significance level, the probability of a Type II error would increase and the power of the test would decrease.**

---

### Example 2

A manufacturer makes chocolate squares that have a target weight of $8\text{ g}$. Quality control engineers sample chocolate squares from a batch to test the hypotheses $H_0: \mu = 8\text{ g}$ vs. $H_\text{a}: \mu \neq 8\text{ g}$, where $\mu$ is the true mean weight of the chocolate squares in that batch.

Suppose that $H_0$ is actually true.

**Which situation below would have the lowest probability of a Type I error?**

- $n=20$ and $\alpha=0.10$
- $n=50$ and $\alpha=0.05$
- **$n=50$ and $\alpha=0.01$**

Explain:

**What is a Type I error?**

The engineers would make a Type I error if they rejected a true null hypothesis.

**Impact of significance level**

Lowering the significance level $\alpha$ reduces the likelihood of a Type I error, because a lower alpha makes it harder to reject a true null hypothesis by random chance alone.

![](https://raw.githubusercontent.com/ZacksAmber/PicGo/master/img/20220428193927.png)

**Impact of sample size**

**Sample size doesn't impact the likelihood of a Type I error.** Larger samples are still preferred since they produce less variable results, but we'll still reject a true $H_0$ at a rate equal to the significance level $\alpha$.
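A quick simulation can illustrate this claim. The sketch assumes a two-sided $z$-test with a known standard deviation; the value `sigma=0.2` grams, the seed, and the number of trials are hypothetical choices, since the example gives no population standard deviation.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def type_i_error_rate(n, alpha, mu0=8.0, sigma=0.2, n_trials=4000, seed=0):
    """Share of two-sided z-tests that reject H0 when H0 is actually true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]  # H0 is true here
        z = (fmean(sample) - mu0) / (sigma / sqrt(n))
        rejections += abs(z) > z_crit
    return rejections / n_trials

for n, alpha in [(20, 0.10), (20, 0.05), (50, 0.05), (50, 0.01)]:
    rate = type_i_error_rate(n, alpha)
    print(f"n = {n:2d}, alpha = {alpha:.2f}: rejection rate = {rate:.3f}")
```

In each row the rejection rate should land near the chosen $\alpha$, up to simulation noise, whether $n$ is $20$ or $50$.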


---

### Example 3

Ricky is testing soil for a contaminant at a building site. He'll take action to stop construction if there's strong evidence that the soil has more than $400$ parts per million (ppm) of the contaminant. He plans on using soil from $n=30$ randomly selected locations at the building site. His hypotheses are $H_0: \mu \leq 400 \text{ ppm}$ and $H_{\text{a}}: \mu > 400 \text{ ppm}$, where $\mu$ is the mean amount of the contaminant in the soil at this site.

Suppose that in reality, $H_\text{a}$ is actually true.

**Which situation below would result in the lowest probability of a Type II error?**

- The true mean is actually $405 \text{ ppm}$, and he uses a significance level of $\alpha =0.05$.
- The true mean is actually $405 \text{ ppm}$, and he uses a significance level of $\alpha =0.10$.
- The true mean is actually $420 \text{ ppm}$, and he uses a significance level of $\alpha =0.05$.
- **The true mean is actually $420 \text{ ppm}$, and he uses a significance level of $\alpha =0.10$.**

$420\text{ ppm}$ is relatively far above $400\text{ ppm}$, and $\alpha=0.10$ is the largest significance level given.

Explain:

**What is a Type II error?**

Ricky would make a Type II error if he failed to reject a false null hypothesis.

**Impact of significance level**

Rejecting a false null hypothesis is more likely to happen with a higher significance level and less likely to happen with a lower significance level, since rejecting with lower significance requires our sample result to be farther away from the null hypothesis.

![](https://raw.githubusercontent.com/ZacksAmber/PicGo/master/img/20220428174836.png)

**Impact of actual value**

The probability of a Type II error is lower when the actual value is farther away from the value in the null hypothesis and in favor of the alternative.

-   If the mean is actually $400\text{ ppm}$, then the null isn't false, and Ricky shouldn't reject the null.
-   If the mean is actually $405\text{ ppm}$, then the null is false, but the sample mean could plausibly come out to be close to $400\text{ ppm}$, which would lead him to fail to reject the null and make a Type II error.
-   If the mean is actually $420\text{ ppm}$, then the null is false, and it would be less likely for the sample result to be close to $400\text{ ppm}$. The sample mean will probably come out closer to $420\text{ ppm}$, and that would lead him to reject $400\text{ ppm}$.
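The four answer choices can be compared numerically under some extra assumptions. The sketch treats the test as an upper-tailed $z$-test and uses a hypothetical population standard deviation of $40\text{ ppm}$, which the example does not provide; the ranking of the scenarios, not the specific values of $\beta$, is the point.

```python
from math import sqrt
from statistics import NormalDist

mu0, n = 400, 30
sigma = 40  # hypothetical population SD in ppm; not given in the example

def type_ii_error(mu_true, alpha):
    """Beta for an upper-tailed z-test of H0: mu <= 400 when the true mean is mu_true."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    return NormalDist().cdf(z_crit - shift)  # probability of failing to reject

scenarios = [(405, 0.05), (405, 0.10), (420, 0.05), (420, 0.10)]
betas = {scenario: type_ii_error(*scenario) for scenario in scenarios}

for (mu_true, alpha), beta in betas.items():
    print(f"true mean = {mu_true} ppm, alpha = {alpha:.2f}: beta = {beta:.3f}")

best = min(betas, key=betas.get)
print(f"lowest Type II error probability: true mean = {best[0]} ppm, alpha = {best[1]:.2f}")
```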

---

## Consequences of errors and significance

### Introduction

Significance tests often use a significance level of $\alpha=0.05$, but in some cases it makes sense to use a different significance level. Changing $\alpha$ impacts the probabilities of Type I and Type II errors. In some tests, one kind of error has more serious consequences than the other. We may want to choose different values for $\alpha$ in those cases.

### Review: Error probabilities and $\alpha$

A Type I error is when we reject a true null hypothesis. Lower values of $\alpha$ make it harder to reject the null hypothesis, so choosing lower values for $\alpha$ can reduce the probability of a Type I error. The consequence here is that if the null hypothesis is false, it may be more difficult to reject using a low value for $\alpha$. So using lower values of $\alpha$ can increase the probability of a Type II error.

A Type II error is when we fail to reject a false null hypothesis. Higher values of $\alpha$ make it easier to reject the null hypothesis, so choosing higher values for $\alpha$ can reduce the probability of a Type II error. The consequence here is that if the null hypothesis is true, increasing $\alpha$ makes it more likely that we commit a Type I error (rejecting a true null hypothesis).

Let's look at a few examples to see why it might make sense to use a higher or lower significance level.

---

### Example 1: Type II error

Employees at a health club do a daily water quality test in the club's swimming pool. If the level of contaminants is too high, then they temporarily close the pool to perform a water treatment.

We can state the hypotheses for their test as $H_0:$ The water quality is acceptable vs. $H_\text{a}:$ The water quality is not acceptable.

**What would be the consequence of a Type I error in this setting?**

The club closes the pool when it doesn't need to be closed.

Explain:

A Type I error is when we reject a true $H_0$. In this setting, if $H_0$ is true, then the water quality is acceptable, and the pool doesn't need to be closed. A Type I error would occur if they close the pool when the water quality is actually acceptable.

**What would be the consequence of a Type II error in this setting?**

The club doesn't close the pool when it needs to be closed.

Explain:

A Type II error is when we fail to reject a false $H_0$. In this setting, if $H_0$ is false, then the water quality is not acceptable, and the pool should be closed. A Type II error would occur if they don't close the pool when the water quality is not actually acceptable.

**In terms of safety, which error has the more dangerous consequences in this setting?**

Type II

The consequence here is that people swim in contaminated water. This is more dangerous than a Type I error.

Explain:

The consequence of a Type I error is that the pool is closed for treatment that it doesn't necessarily need.

The consequence of a Type II error is that people swim in contaminated water.

In terms of safety, a Type II error is more dangerous in this setting.

Since one error involves greater safety concerns, the club is considering using a value for $\alpha$ other than $0.05$ for the water quality significance test.

**What significance level should they use to reduce the probability of the more dangerous error?**

- $\alpha = 0.01$
- $\alpha = 0.025$
- <mark>$\alpha = 0.10$</mark>

Using a higher significance level increases the probability of a Type I error, but decreases the probability of a Type II error (which is more dangerous in this setting).

---

### Example 2: Type I error

Seth is starting his own food truck business, and he's choosing cities where he'll run his business. He wants to survey residents and test whether or not the demand is high enough to support his business before he applies for the necessary permits to operate in a given city. He'll only choose a city if there's strong evidence that the demand there is high enough.

We can state the hypotheses for his test as $H_0:$ The demand is not high enough vs. $H_\text{a}:$ The demand is high enough.

**What would be the consequence of a Type I error in this setting?**

He chooses a city where demand isn't actually high enough.

Explain:

A Type I error is when we reject a true $H_0$. In this setting, if $H_0$ is true, then demand in the city is not high enough, and Seth shouldn't choose that city. A Type I error would occur if he chooses a city where the demand is not actually high enough.

**What would be the consequence of a Type II error in this setting?**

He doesn't choose a city where demand is actually high enough.

Explain:

A Type II error is when we fail to reject a false $H_0$. In this setting, if $H_0$ is false, then demand in the city is high enough, and Seth should choose that city. A Type II error would occur if he doesn't choose a city where the demand is actually high enough.

Seth has determined that a Type I error is more costly to his business than a Type II error. He wants to use a significance level other than $\alpha=0.05$ to reduce the likelihood of a Type I error.

**Which of these significance levels should Seth choose?**

- <mark>$\alpha=0.01$</mark>
- $\alpha=0.08$
- $\alpha=0.10$

Using a lower significance level decreases the probability of a Type I error, since it makes it more difficult to reject $H_0$ based on random chance alone.

---

# Tests about a population proportion

---

## Writing hypotheses for a test about a proportion

### Guidelines for hypotheses

-   We write hypotheses in terms of population parameters, not sample statistics.
-   The null hypothesis should have a statement of equality.
-   The direction of the alternative hypothesis $(<{,}>{,}\neq)$ depends on the context of the test.

### The parameter $p$ vs. the statistic $\hat p$

In a significance test about a proportion, we are looking to draw conclusions about the true population proportion, $p$. So, it would be incorrect to express our hypotheses in terms of a sample statistic such as $\hat p$.

### The null hypothesis

In a test of significance, we are looking for evidence against the null hypothesis $H_0$.

### The alternative hypothesis

The alternative hypothesis $H_\text{a}$ is the claim we are trying to find evidence in favor of.

---

### Example 1: One-tailed

A professor gives a multiple choice exam where each question has four choices. The professor decides not to count a question against students if the class as a whole does worse on the question than they would have done simply by guessing. That is, if significantly less than $25\%$ of the students answer a question correctly, then that question won't count against them.

Let $p$ represent the proportion of students that would correctly answer the question.

**Which of the following is an appropriate set of hypotheses for such a significance test?**

- $H_{0}: p = 0.25$
- $H_\text{a}: p < 0.25$

---

### Example 2: Two-tailed

A manufacturer knows that $2\%$ of its microchips are produced with a certain defect. They decide to change their process to make it more efficient, and they want to test if the new process has the same defect rate or not.

Let $p$ represent the proportion of microchips made with a defect under this new system.

**Which of the following is an appropriate set of hypotheses for their significance test?**

- $H_{0}: p = 0.02$
- $H_\text{a}: p \neq 0.02$

The null hypothesis has a statement of equality, and the direction of the alternative hypothesis $(\neq)$ matches their desire to test if the new process has the same defect rate or not (they're curious if the percent is higher or lower than $2\%$).

---

## Conditions for inference on a proportion

> [Conditions for inference on a proportion](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-proportion/a/conditions-inference-one-proportion)

When we want to carry out inference on one proportion (build a confidence interval or do a significance test), the accuracy of our methods depends on a few conditions. Before doing the actual computations of the interval or test, it's important to check whether or not these conditions have been met; otherwise, the calculations and conclusions that follow aren't actually valid.

The conditions we need for inference on one proportion are:

- **Random**: The data needs to come from a random sample or randomized experiment.
- **Normal**: The sampling distribution of $\hat p$ needs to be approximately normal — needs at least $10$ expected successes and $10$ expected failures.
- **Independent**: Individual observations need to be independent. If sampling without replacement, our sample size shouldn't be more than $10\%$ of the population.

---

### The random condition

Random samples give us unbiased data from a population. When samples aren't randomly selected, the data usually has some form of bias, so using data that wasn't randomly selected to make inferences about its population can be risky.

More specifically, sample proportions are unbiased estimators of their population proportion. For example, if we have a bag of candy where $50\%$ of the candies are orange and we take random samples from the bag, some will have more than $50\%$ orange and some will have less. But on average, the proportion of orange candies in each sample will equal $50\%$. We write this property as $\mu_{\hat p}=p$, which holds true as long as our sample is random.

This won't necessarily happen if our sample isn't randomly selected though. Biased samples lead to inaccurate results, so they shouldn't be used to create confidence intervals or carry out significance tests.

---

### The normal condition

The sampling distribution of $\hat p$ is approximately normal as long as the expected number of successes and failures are both at least $10$. This happens when our sample size $n$ is reasonably large. The proof of this is beyond the scope of AP statistics, but our tutorial on sampling distributions can provide some intuition and verification that this condition indeed works.

So we need:

- $\displaystyle \text{expected successes: } np \geq 10$
- $\displaystyle \text{expected failures: } n(1 - p) \geq 10$

If we are building a confidence interval, we don't have a value of $p$ to plug in, so we instead count the observed number of successes and failures in the sample data to make sure they are both at least $10$. If we are doing a significance test, we use our sample size $n$ and the hypothesized value of $p$ to calculate our expected numbers of successes and failures.

---

### The independence condition

To use the formula for standard deviation of $\hat p$, we need individual observations to be independent. When we are sampling without replacement, individual observations aren't technically independent since removing each item changes the population.

But the $10\%$ condition says that if we sample $10\%$ or less of the population, we can treat individual observations as independent since removing each observation doesn't significantly change the population as we sample. For instance, if our sample size is $n=150$, there should be at least $N=1500$ members in the population.

This allows us to use the formula for standard deviation of $\hat p$:

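$\displaystyle \sigma_{\hat p} = \sqrt{\frac{\sigma^{2}}{n}} = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{p(1 - p)}{n}}$, where $\sigma = \sqrt{p(1 - p)}$ is the standard deviation of a single success/failure observation.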

In a significance test, we use the sample size $n$ and the hypothesized value of $p$.

If we are building a confidence interval for $p$, we don't actually know what $p$ is, so we substitute $\hat p$ as an estimate for $p$. When we do this, we call it the **standard error** of $\hat p$ to distinguish it from the standard deviation.

So our formula for standard error of $\hat p$ is

$\displaystyle SE = \sigma_{\hat p} \approx \sqrt{\frac{\hat p(1 - \hat p)}{n}}$
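As a rough illustration of the difference between the two quantities, the sketch below computes both for the same sample. The sample size $n=150$ comes from the $10\%$ illustration above; the hypothesized proportion and the observed count are hypothetical values assumed only for this example.

```python
import numpy as np

n = 150          # sample size from the 10% condition illustration
p0 = 0.30        # hypothetical hypothesized proportion (assumption)
successes = 54   # hypothetical observed number of successes (assumption)
p_hat = successes / n

# Significance test: standard deviation of p_hat, using the hypothesized p
sd_p_hat = np.sqrt(p0 * (1 - p0) / n)
# Confidence interval: standard error, with p_hat standing in for the unknown p
se_p_hat = np.sqrt(p_hat * (1 - p_hat) / n)

print(f"standard deviation (uses p0):  {sd_p_hat:.4f}")
print(f"standard error (uses p_hat):   {se_p_hat:.4f}")
```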

---

### Example 1

Jules works on a small team of $40$ employees. Each employee receives an annual rating, the best of which is "exceeds expectations." Management claimed that $10\%$ of employees earn this rating, but Jules suspected it was actually less common. She obtained an anonymous random sample of $10$ ratings for employees on her team. She wants to use the sample data to test $H_0:p=0.1$ versus $H_\text{a}:p<0.1$, where $p$ is the proportion of all employees on her team who earned "exceeds expectations."

**Which conditions for performing this type of test did Jules' sample meet?**

The data is a random sample from the population of interest.

This condition is met; the problem says she obtained a random sample of $10$ ratings for employees on her team.
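The other two conditions can be checked numerically. This sketch uses only the numbers given in the problem ($n=10$, a team of $40$, and $p_0=0.1$); the variable names are illustrative.

```python
n, N, p0 = 10, 40, 0.1   # sample size, team size, hypothesized proportion

# Normal condition: at least 10 expected successes and 10 expected failures
expected_successes = n * p0
expected_failures = n * (1 - p0)
normal_ok = expected_successes >= 10 and expected_failures >= 10

# Independence condition: the sample is at most 10% of the population
independent_ok = n <= 0.10 * N

print(f"expected successes = {expected_successes}, expected failures = {expected_failures}")
print(f"normal condition met: {normal_ok}")
print(f"independence (10%) condition met: {independent_ok}")
```

Neither check passes: only $1$ success is expected, and the sample is $25\%$ of the team, so the random condition is the only one that is met.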

---

### Example 2

Here are two different samples drawn from two different populations:

![](https://raw.githubusercontent.com/ZacksAmber/PicGo/master/img/20220501235308.png)

**Which sample satisfies the normal condition for performing a $t$ test?**

Sample B only

Even though the sample is small, the sample data are roughly symmetric with no outliers, so it satisfies the normal condition.

---

## Calculating the test statistic in a z test for a proportion

> [Test statistic](https://en.wikipedia.org/wiki/Test_statistic#:~:text=A%20test%20statistic%20is%20a,to%20perform%20the%20hypothesis%20test.)

A **test statistic** is a [statistic](https://en.wikipedia.org/wiki/Statistic "Statistic") (a quantity derived from the [sample](https://en.wikipedia.org/wiki/Sample_(statistics) "Sample (statistics)")) used in [statistical hypothesis testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing "Statistical hypothesis testing").[[1]](https://en.wikipedia.org/wiki/Test_statistic#cite_note-CasellaBerger-1) A hypothesis test is typically specified in terms of a test statistic, considered as a numerical summary of a data-set that reduces the data to one value that can be used to perform the hypothesis test. In general, a test statistic is selected or defined in such a way as to quantify, within observed data, behaviours that would distinguish the [null](https://en.wikipedia.org/wiki/Null_hypothesis "Null hypothesis") from the [alternative hypothesis](https://en.wikipedia.org/wiki/Alternative_hypothesis "Alternative hypothesis"), where such an alternative is prescribed, or that would characterize the null hypothesis if there is no explicitly stated alternative hypothesis.

- Assumed population proportion (parameter): $\displaystyle p_{0}$
- Standard deviation of a single observation under $H_0$ (parameter): $\displaystyle \sigma = \sqrt{p_{0}(1 - p_{0})}$
- Sample proportion (statistic): $\displaystyle \hat p$
- Standard deviation of $\hat p$ under $H_0$, used as the standard error in the test: $\displaystyle SE = \sigma_{\hat p} = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{p_{0}(1 - p_{0})}{n}}$
- Z-score of a sample proportion: $\displaystyle z = \frac{\hat p - p}{\sigma_{\hat p}}$
- Test statistic (statistic): $\displaystyle z = \frac{\text{statistic} - \text{parameter}}{\text{standard deviation of statistic (standard error)}} = \frac{\hat p - p_{0}}{SE}$

---

### Example 1

A large poll showed that $42\%$ of adults approved of their nation's prime minister. Margot wanted to test if it had decreased, so she took a random sample of $500$ adults in that nation and found that $160$ of them approved of the prime minister.

She wants to test $H_0:p=0.42$ versus $H_\text{a}:p < 0.42$, where $p$ is the proportion of adults in this nation who approve of the prime minister.

**Assuming that the conditions for inference have been met, identify the correct test statistic for Margot's significance test.**

$\displaystyle z = \frac{0.32 - 0.42}{\sqrt{\frac{0.42(0.58)}{500}}}$
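The expression can be evaluated directly. This is a small sketch using the numbers from the problem; the variable names are illustrative.

```python
import numpy as np

k, n, p0 = 160, 500, 0.42   # approvals in sample, sample size, hypothesized proportion
p_hat = k / n               # sample proportion, 0.32

# Standard deviation of p_hat under H0 uses the hypothesized proportion p0
se = np.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se
print(round(z, 2))
```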

---

### Example 2

A professor gives a multiple choice exam where each question has five choices. The professor decides not to count a question against students if the class as a whole does significantly worse on the question than they would have done simply by guessing. In other words, the teacher tests $H_0: p=0.20$ versus $H_\text{a}: p<0.20$ for each question, where $p$ is the proportion of students who would correctly answer the question.

Suppose that $14$ of $100$ students correctly answer a particular question.

**Assuming that the conditions for inference have been met, calculate the test statistic for the professor's significance test.**  
_You may round to two decimal places._

In [4]:
k, n, p = 14, 100, 0.2
prop = k / n
SE = st.bernoulli.std(p) / np.sqrt(n)
zstat = (prop - p) / SE
zstat # test statistic

-1.5

In [5]:
# OR
k, n, p = 14, 100, 0.2

stat, pval = proportion.proportions_ztest(count=k, nobs=n, value=p, alternative='smaller', prop_var=p)
stat

-1.5

---

## Calculating the P-value in a z test for a proportion

---

### Example 1: Right-tailed

Fay read an article that said $26\%$ of Americans can speak more than one language. She was curious if this figure was higher in her city, so she tested $H_0:p=0.26$ vs. $H_\text{a}:p>0.26$, where $p$ represents the proportion of people in her city that can speak more than one language.

She found that $40$ of $120$ people sampled could speak more than one language. The test statistic for these results was $z\approx1.83$.

**Assuming that the necessary conditions are met, what is the approximate P-value for Fay's test?**  
_You may round to three decimal places._

In [6]:
k, n, p = 40, 120, 0.26
p_hat = k / n
precision = 3

SE = st.bernoulli.std(p) / np.sqrt(n)
zstat = (p_hat - p) / SE
pval = st.norm.sf(zstat)
# OR 
# pval = st.norm(loc=p, scale=SE).sf(x=p_hat)
display(Latex(f"$P(p > {p}) = P(z > {round(zstat, precision)}) = {round(pval, precision)}$"))

<IPython.core.display.Latex object>

In [7]:
# OR
k, n, p = 40, 120, 0.26
zstat = 1.83
precision = 3

stat, pval = proportion.proportions_ztest(count=k, nobs=n, value=p, alternative='larger', prop_var=p)
display(Latex(f"$P(p > {p}) = P(z > {round(zstat, precision)}) = {round(pval, precision)}$"))

<IPython.core.display.Latex object>

---

### Example 2: Two-tailed

Elliot read a report from a previous year saying that $6\%$ of adults in his city biked to work. He wanted to test whether this had changed, so he took a random sample of $240$ adults in his city to test $H_0:p=0.06$ versus $H_\text{a}:p \neq 0.06$, where $p$ is the proportion of adults in Elliot's city that bike to work.

The sample results showed $21$ adults who biked to work, and the corresponding test statistic was $z \approx 1.79$.

**Assuming that the necessary conditions are met, what is the approximate P-value for Elliot's significance test?**  
_You may round to three decimal places._

In [8]:
k, n, p = 21, 240, 0.06
z = 1.79
precision = 3

stat, pval = proportion.proportions_ztest(count=k, nobs=n, value=p, alternative='two-sided', prop_var=p)
display(Latex(rf"$P(|z| \geq {round(abs(z), precision)}) = {round(pval, precision)}$"))

<IPython.core.display.Latex object>

---

### Example 3: Left-tailed

The mayor of a town read an article that claimed the national unemployment rate was $8\%$. They suspected that the unemployment rate was lower in their town, so they took a sample of $128$ residents to test $H_0: p=0.08$ versus $H_\text{a}:p<0.08$, where $p$ is the proportion of residents that are unemployed.

They found that $6$ residents in the sample were unemployed, and the corresponding test statistic was $z \approx -1.38$.

**Assuming that the necessary conditions are met, what is the approximate P-value for this significance test?**  
_You may round to three decimal places._

In [9]:
k, n, p = 6, 128, 0.08
z = -1.38
precision = 3

stat, pval = proportion.proportions_ztest(count=k, nobs=n, value=p, alternative='smaller', prop_var=p)
display(Latex(f"$P(p < {p}) = P(z < {round(z, precision)}) = {round(pval, precision)}$"))

<IPython.core.display.Latex object>

---

## Making conclusions in a z test for a proportion

---

### Example 1

According to a large poll in a previous year, about $80\%$ of homes in a certain county had access to high-speed internet. The following year, researchers wanted to test $H_0: p=0.8$ versus $H_\text{a}:p<0.8$, where $p$ is the proportion of homes in this county with high-speed internet access.

They took a random sample of $250$ homes from that county and found that $191$ of them had access to high-speed internet. The test statistic for these results was $z\approx -1.42$, and the corresponding P-value was approximately $0.077$.

**Assuming the conditions for inference were met, which of these is an appropriate conclusion?**

At the $\alpha=0.10$ significance level, they should conclude that less than $80\%$ of homes in the county had access to high-speed internet.

Since the P-value $0.077$ is less than $\alpha=0.10$, they can reject $H_0$ and conclude $H_\text{a}$.

---

### Example 2

An online retailer had a satisfaction guarantee, where they accepted a return and gave a refund for any reason within 30 days of a purchase. About $4\%$ of orders were returned under this policy. The company changed the timeframe to within 15 days, and they were curious if that would also change what proportion of orders were returned. They took a sample of orders to test $H_0: p=0.04$ versus $H_\text{a}: p\neq0.04$, where $p$ is the proportion of all orders that would result in a return under the new policy.

The sample of $500$ orders showed that $15$ orders were returned. Using these results, they calculated a test statistic of $z=-1.14$ and a P-value of approximately $0.25$.

**Assuming the conditions for inference were met, what is an appropriate conclusion at the $\alpha=0.10$ significance level?**

Fail to reject $H_0$. This isn't enough evidence to conclude that the return rate is no longer $4\%$.
    
Since the P-value $0.25$ is greater than $\alpha=0.10$, they should fail to reject $H_0$ (they don't have enough evidence to conclude $H_\text{a}$).
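The decision rule in both examples (reject $H_0$ when the P-value is below $\alpha$) can be sketched as follows. The helper `one_prop_ztest` is a hypothetical function written for this illustration with SciPy only; it is not part of `statsmodels`.

```python
import numpy as np
import scipy.stats as st

def one_prop_ztest(k, n, p0, alternative):
    """Return (z, P-value) for a one-sample z test about a proportion."""
    se = np.sqrt(p0 * (1 - p0) / n)   # standard deviation of p_hat under H0
    z = (k / n - p0) / se
    if alternative == "smaller":
        pval = st.norm.cdf(z)
    elif alternative == "larger":
        pval = st.norm.sf(z)
    else:                             # two-sided
        pval = 2 * st.norm.sf(abs(z))
    return z, pval

alpha = 0.10
examples = {
    "Example 1": (191, 250, 0.80, "smaller"),
    "Example 2": (15, 500, 0.04, "two-sided"),
}
results = {}
for name, (k, n, p0, alt) in examples.items():
    z, pval = one_prop_ztest(k, n, p0, alt)
    decision = "reject H0" if pval < alpha else "fail to reject H0"
    results[name] = (z, pval, decision)
    print(f"{name}: z = {z:.2f}, P-value = {pval:.3f} -> {decision}")
```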

---

# Test about a population mean

---

## Writing hypotheses for a test about a mean

### Guidelines for hypotheses

-   We write hypotheses in terms of population parameters, not sample statistics.
-   The null hypothesis should have a statement of equality.
-   The direction of the alternative hypothesis $(<{,}>{,}\neq)$ depends on the context of the test.

### The null hypothesis

In a test of significance, we are looking for evidence against the null hypothesis $H_0$.

### The alternative hypothesis

The alternative hypothesis $H_\text{a}$ is the claim we are trying to find evidence in favor of.

---

### Example 1: One-tailed

A restaurant advertises that its burritos weigh $250\text{ g}$. A consumer advocacy group doubts this claim, and they obtain a random sample of these burritos to test if the mean weight is significantly lower than $250\text{ g}$.

Let $\mu$ be the mean weight of the burritos at this restaurant and $\bar x$ be the mean weight of the burritos in the sample.

**Which of the following is an appropriate set of hypotheses for their significance test?**

- $H_{0}: \mu = 250\text{ g}$
- $H_\text{a}: \mu < 250\text{ g}$

---

### Example 2: Two-tailed

A pharmaceutical company produces caffeine pills that are each supposed to contain $200\,\text{mg}$ of caffeine. A quality control expert took a random sample of pills from a batch and measured the amount of caffeine in each pill in the sample. They want to test if the mean amount is significantly different than $200\text{ mg}$.

Let $\mu$ be the mean amount in each pill in the entire batch and $\bar x$ be the mean amount in each pill in the sample.

**Which of the following is an appropriate set of hypotheses for their significance test?**

- $H_{0}: \mu = 200\text{ mg}$
- $H_\text{a}: \mu \neq 200\text{ mg}$

---

## Conditions for inference on a mean

> [Conditions for inference on a mean](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/a/reference-conditions-inference-one-mean)

When we want to carry out inference (build a confidence interval or do a significance test) on a mean, the accuracy of our methods depends on a few conditions. Before doing the actual computations of the interval or test, it's important to check whether or not these conditions have been met. Otherwise the calculations and conclusions that follow may not be correct.

The conditions we need for inference on a mean are:

-   **Random**: A random sample or randomized experiment should be used to obtain the data.
-   **Normal**: The sampling distribution of $\bar x$ (the sample mean) needs to be approximately normal. This is true if our parent population is normal or if our sample is reasonably large $(n \geq 30)$.
-   **Independent**: Individual observations need to be independent. If sampling without replacement, our sample size shouldn't be more than $10\%$ of the population.

Let's look at each of these conditions a little more in-depth.

---

### The random condition

Random samples give us unbiased data from a population. When we don't use random selection, the resulting data usually has some form of bias, so using it to infer something about the population can be risky.

More specifically, sample means are unbiased estimators of their population mean. For example, suppose we have a bag of ping pong balls individually numbered from $0$ to $30$, so the population mean of the bag is $15$. We could take random samples of balls from the bag and calculate the mean from each sample. Some samples would have a mean higher than $15$ and some would be lower. But on average, the mean of each sample will equal $15$. We write this property as $\mu_{\bar x}=\mu$, which holds true as long as we are taking random samples.
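A quick simulation can illustrate this property. The sketch below assumes samples of $5$ balls drawn without replacement and $10{,}000$ repetitions; both choices are arbitrary and not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
balls = np.arange(0, 31)         # balls numbered 0 through 30

# Draw many random samples (without replacement) and record each sample mean
sample_means = np.array([
    rng.choice(balls, size=5, replace=False).mean()
    for _ in range(10_000)
])

print(f"population mean:          {balls.mean()}")
print(f"average of sample means:  {sample_means.mean():.3f}")
```

Individual sample means vary, but their average should land close to the population mean of $15$.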

This won't necessarily happen if we use a non-random sample. Biased samples can lead to inaccurate results, so they shouldn't be used to create confidence intervals or carry out significance tests.

---

### The normal condition

The sampling distribution of $\bar x$ (a sample mean) is approximately normal in a few different cases. The shape of the sampling distribution of $\bar x$ mostly depends on the shape of the parent population and the sample size $n$.

#### Case 1: Parent population is normally distributed

If the parent population is normally distributed, then the sampling distribution of $\bar x$ is approximately normal regardless of sample size. So if we know that the parent population is normally distributed, we pass this condition even if the sample size is small. In practice, however, we usually don't know if the parent population is normally distributed.

#### Case 2: Not normal or unknown parent population; sample size is large $(n \geq 30)$

The sampling distribution of $\bar x$ is approximately normal as long as the sample size is reasonably large. Because of the central limit theorem, when $n \geq 30$, we can treat the sampling distribution of $\bar x$ as approximately normal regardless of the shape of the parent population.

There are a few rare cases where the parent population has such an unusual shape that the sampling distribution of the sample mean $\bar x$ isn't quite normal for sample sizes near $30$. These cases are rare, so in practice, we are usually safe to assume approximate normality in the sampling distribution when $n \geq 30$.
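The effect of the central limit theorem can be seen in a small simulation. This sketch assumes a right-skewed exponential parent population, chosen only as a convenient example of a non-normal shape, and compares the skewness of individual observations with the skewness of sample means for $n=30$.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Hypothetical right-skewed parent population: exponential with mean 1.
# Each row is one sample of n = 30 observations.
draws = rng.exponential(scale=1.0, size=(10_000, 30))

skew_parent = st.skew(draws.ravel())       # skewness of individual observations
skew_means = st.skew(draws.mean(axis=1))   # skewness of the 10,000 sample means

print(f"skewness of observations: {skew_parent:.2f}")
print(f"skewness of sample means: {skew_means:.2f}")
```

The sample means are far less skewed than the parent population, though not perfectly symmetric, which matches the "approximately normal" wording above.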

#### Case 3: Not normal or unknown parent population; sample size is small $(n<30)$

As long as the parent population doesn't have outliers or strong skew, even smaller samples will produce a sampling distribution of $\bar x$ that is approximately normal. In practice, we can't usually see the shape of the parent population, but we can try to infer shape based on the distribution of data in the sample. If the data in the sample shows skew or outliers, we should doubt that the parent is approximately normal, and so the sampling distribution of $\bar x$ may not be normal either. But if the sample data are roughly symmetric and don't show outliers or strong skew, we can assume that the sampling distribution of $\bar x$ will be approximately normal.

_The big idea is that we need to graph our sample data when $n<30$ and then make a decision about the normal condition based on the appearance of the sample data._

---

### The independence condition

To use the formula for standard deviation of $\bar x$, we need individual observations to be independent. In an experiment, good design usually takes care of independence between subjects (control, different treatments, randomization).

In an observational study that involves sampling without replacement, individual observations aren't technically independent since removing each observation changes the population. However, the $10\%$ condition says that if we sample $10\%$ or less of the population, we can treat individual observations as independent since removing each observation doesn't change the population all that much as we sample. For instance, if our sample size is $n=30$, there should be at least $N=300$ members in the population for the sample to meet the independence condition.

Assuming independence between observations allows us to use this formula for standard deviation of $\bar x$ when we're making confidence intervals or doing significance tests:

$\displaystyle \sigma_{\bar x} = \frac{\sigma}{\sqrt{n}}$

We usually don't know the population standard deviation $\sigma$, so we substitute the sample standard deviation $s_x$ as an estimate for $\sigma$. When we do this, we call it the **standard error** of $\bar x$ to distinguish it from the standard deviation.

So our formula for standard error of $\bar x$ is:

$\displaystyle \sigma_{\bar x} \approx \frac{s_x}{\sqrt{n}}$

---

### Summary


If all three of these conditions are met, then we can feel good about using $t$ distributions to make a confidence interval or do a significance test. Satisfying these conditions makes our calculations accurate and conclusions reliable.

The random condition is perhaps the most important. If we break the random condition, there is probably bias in the data. The only reliable way to correct for a biased sample is to recollect the data in an unbiased way.

The other two conditions are important, but if we don't meet the normal or independence conditions, we may not need to start over. For example, there is a way to correct for the lack of independence when we sample more than $10\%$ of a population, but it's beyond the scope of what we're learning right now.

The main idea is that it's important to verify certain conditions are met before we make these confidence intervals or do these significance tests.

---

### Example 1

A gym advertises that its $1{,}000$ members, on average, lost $5 \text{ kg}$ in their first month of membership. A skeptical employee suspects that the actual average is lower than this, so they take an SRS of $10$ customers and look at each of their weight losses in their first month of membership.

The sample data is skewed to the right with an average weight loss of $4.25 \text{ kg}$ and a standard deviation of about $3 \text{ kg}$. The employee wants to use these sample data to conduct a $t$ test about the mean.

**Which conditions for performing this type of significance test have been met?**

- The data is a random sample from the population of interest.
- Individual observations are independent.
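The independence and normal conditions for this example can be checked with a few lines. The numbers come from the problem; the flag `sample_is_skewed` simply encodes the statement that the sample data is skewed to the right, and the variable names are illustrative.

```python
import numpy as np

n, N = 10, 1000           # sample size and number of gym members
s = 3                     # sample standard deviation in kg
sample_is_skewed = True   # stated in the problem

# Independence: the sample is at most 10% of the population
independent_ok = n <= 0.10 * N
# Normal: either a large sample, or a small sample without skew or outliers
normal_ok = n >= 30 or not sample_is_skewed

# Standard error of the sample mean, for reference
se = s / np.sqrt(n)

print(f"independence condition met: {independent_ok}")
print(f"normal condition met:       {normal_ok}")
print(f"standard error of x-bar:    {se:.3f} kg")
```

The normal condition fails because the sample is small and skewed, which is why it is missing from the list above.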

---

### Example 2

Here are two different samples drawn from two different populations:

![](https://raw.githubusercontent.com/ZacksAmber/PicGo/master/img/20220501111921.png)

**Which sample satisfies the normal condition for performing a $t$ test?**

Sample A only

The sample size is large enough ($n=43 \geq 30$) for the central limit theorem to compensate for the skew.

---

### Example 3

Here are two different samples drawn from two different populations:

![](https://raw.githubusercontent.com/ZacksAmber/PicGo/master/img/20220501123656.png)

**Which sample satisfies the normal condition for constructing a $t$ interval?**

Sample A only

Even though the sample is small, the sample data are roughly symmetric with no outliers, so it satisfies the normal condition.

---

## Calculating the test statistic in a t test for a mean

> [Test statistic](https://en.wikipedia.org/wiki/Test_statistic#:~:text=A%20test%20statistic%20is%20a,to%20perform%20the%20hypothesis%20test.)

A **test statistic** is a [statistic](https://en.wikipedia.org/wiki/Statistic "Statistic") (a quantity derived from the [sample](https://en.wikipedia.org/wiki/Sample_(statistics) "Sample (statistics)")) used in [statistical hypothesis testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing "Statistical hypothesis testing").[[1]](https://en.wikipedia.org/wiki/Test_statistic#cite_note-CasellaBerger-1) A hypothesis test is typically specified in terms of a test statistic, considered as a numerical summary of a data-set that reduces the data to one value that can be used to perform the hypothesis test. In general, a test statistic is selected or defined in such a way as to quantify, within observed data, behaviours that would distinguish the [null](https://en.wikipedia.org/wiki/Null_hypothesis "Null hypothesis") from the [alternative hypothesis](https://en.wikipedia.org/wiki/Alternative_hypothesis "Alternative hypothesis"), where such an alternative is prescribed, or that would characterize the null hypothesis if there is no explicitly stated alternative hypothesis.

- Assumed population mean(parameter): $\displaystyle \mu_{0}$
- Standard Deviation(parameter): $\displaystyle \sigma$
- Sample mean(statistic): $\displaystyle \bar x$
- Sample standard deviation(statistic): $\displaystyle s$
- Standard error (statistic): $\displaystyle SE = s_{\bar x} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$
- Z-score of a sample mean (when $\sigma$ is known): $\displaystyle z = \frac{\bar x - \mu}{\sigma / \sqrt{n}}$
- Test statistic (statistic): $\displaystyle t = \frac{\text{statistic} - \text{parameter}}{\text{standard deviation of statistic (standard error)}} = \frac{\bar x - \mu_{0}}{SE}, \quad df = n - 1$

---

### Example 1

A company advertises that its cans of caviar each contain $100\text{ g}$ of their product. A consumer advocacy group doubts this claim, and they obtain a random sample of $8$ cans to test if the mean weight is significantly lower than $100\text{ g}$. They calculate a sample mean weight of $99\text{ g}$ and a sample standard deviation of $0.9\text{ g}$.

The advocacy group wants to use these sample data to conduct a $t$ test on the mean. Assume that all conditions for inference have been met.

**Identify the correct test statistic for their significance test.**

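$\displaystyle t = \frac{99 - 100}{\frac{0.9}{\sqrt{8}}}$, with $df = 8 - 1 = 7$

The value $0.9\text{ g}$ is a standard deviation, not a variance, so the standard error is $\frac{s}{\sqrt{n}} = \frac{0.9}{\sqrt{8}}$.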
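
As a numerical check, the statistic can be evaluated from the summary values in the problem, taking $SE = s / \sqrt{n}$ with $s = 0.9$ and $n = 8$. The variable names are illustrative.

```python
import numpy as np

n, mu_0, x_bar, s = 8, 100, 99, 0.9
se = s / np.sqrt(n)             # 0.9 is a standard deviation, so divide by sqrt(n)
tstat = (x_bar - mu_0) / se
print(round(tstat, 2), "with df =", n - 1)
```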

---

### Example 2

Rory suspects that teachers in his school district have less than $5$ years of experience on average. He decided to test $\text{H}_0: \mu=5$ versus $\text{H}_\text{a}:\mu<5$ using a sample of $25$ teachers. His sample mean was $4$ years and his sample standard deviation was $2$ years.

Rory wants to use these sample data to conduct a $t$ test on the mean. Assume that all conditions for inference have been met.

**Calculate the test statistic for Rory's test.**  
_You may round your answer to two decimal places._

In [10]:
n, mu_0, mu_1, sd_1 = 25, 5, 4, 2
SE = sd_1 / np.sqrt(n)
tstat = (mu_1 - mu_0) / SE
tstat

-2.5

---

## Calculating the P-value in a t test for a mean

---

### Example 1

Jamarion was testing $H_0: \mu=45$ versus $H_\text{a}: \mu<45$ with a sample of $5$ observations. His sample mean was $40$ and his sample standard deviation was $3$. Assume that the conditions for inference were met.

**Which of the following represents the P-value for Jamarion's test?**

$\displaystyle P(t < \frac{40 - 45}{\frac{3}{\sqrt{5}}}) \text{ with 4 degrees of freedom}$
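
The expression above can be evaluated directly. The sketch below assumes `scipy.stats.t.cdf` for the left-tail area; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as st

n, mu_0, x_bar, s = 5, 45, 40, 3
tstat = (x_bar - mu_0) / (s / np.sqrt(n))
pval = st.t.cdf(tstat, df=n - 1)   # left tail, because H_a is mu < 45
print(round(tstat, 3), round(pval, 3))
```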

---

### Example 2: Left-tailed

Evelynn was testing $H_0: \mu=52$ versus $H_\text{a}: \mu<52$ with a sample of $9$ observations. Her test statistic was $t=-2.83$. Assume that the conditions for inference were met.

**What is the approximate P-value for Evelynn's test?**

In [11]:
n, tstat = 9, -2.83
precision = 3
pval = st.t.cdf(x=tstat, df=n-1)

display(Latex(f"$\\text{{P-value}} = {round(pval, precision)}$"))

<IPython.core.display.Latex object>

---

### Example 3: Two-tailed

Caterina was testing $H_0: \mu=0$ versus $H_\text{a}: \mu\neq0$ with a sample of $6$ observations. Her test statistic was $t=2.75$. Assume that the conditions for inference were met.

**What is the approximate P-value for Caterina's test?**

In [12]:
n, tstat = 6, 2.75
precision = 3
pval = 2 * st.t.sf(x=abs(tstat), df=n-1)  # two-tailed: double the upper-tail area beyond |t|

display(Latex(f"$\\text{{P-value}} = {round(pval, precision)}$"))

<IPython.core.display.Latex object>

---

### Example 4: Right-tailed

Samuel was testing $H_0: \mu=12$ versus $H_\text{a}: \mu>12$ with a sample of $8$ observations. His test statistic was $t=2.411$. Assume that the conditions for inference were met.

**What is the approximate P-value for Samuel's test?**

In [13]:
n, tstat = 8, 2.411
precision = 3
pval = st.t.sf(x=tstat, df=n-1)

display(Latex(f"$\\text{{P-value}} = {round(pval, precision)}$"))

<IPython.core.display.Latex object>
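
Examples 2 to 4 differ only in which tail of the $t$ distribution is used. One way to capture that is a single helper that switches on the alternative hypothesis. This is a sketch: `t_pvalue` and its `alternative` labels are hypothetical names, not a SciPy API.

```python
import scipy.stats as st


def t_pvalue(tstat, df, alternative="two-sided"):
    """P-value of a t statistic for a 'less', 'greater', or 'two-sided' alternative."""
    if alternative == "less":
        return st.t.cdf(tstat, df=df)            # area to the left of t
    if alternative == "greater":
        return st.t.sf(tstat, df=df)             # area to the right of t
    if alternative == "two-sided":
        return 2 * st.t.sf(abs(tstat), df=df)    # both tails, by symmetry
    raise ValueError(f"unknown alternative: {alternative!r}")


p_left = t_pvalue(-2.83, df=8, alternative="less")        # Example 2
p_two = t_pvalue(2.75, df=5, alternative="two-sided")     # Example 3
p_right = t_pvalue(2.411, df=7, alternative="greater")    # Example 4
print(p_left, p_two, p_right)
```

Using `sf` (the survival function, $1 - \text{cdf}$) for upper-tail areas avoids the rounding loss of computing `1 - cdf` by hand.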

---

### Example 5: Calculate the P-value with or without tstat

$n = 12, \bar x = 127.2, s = 2.1$

- $H_{0}: \mu = 128$
- $H_\text{a}: \mu < 128$

In [14]:
# with tstat
n, mu_0, mu_1, sd_1 = 12, 128, 127.2, 2.1
SE = sd_1 / np.sqrt(n)
tstat = (mu_1 - mu_0) / SE
precision = 3
pval = st.t.cdf(x=tstat, df=n-1)
pval

0.10687758177880972

In [15]:
# OR without tstat
n, mu_0, mu_1, sd_1 = 12, 128, 127.2, 2.1
SE = sd_1 / np.sqrt(n)

pval = st.t(loc=mu_0, scale=SE, df=n-1).cdf(x=mu_1)
pval

0.10687758177880972

In [16]:
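# OR acceptance region.
# st.t.interval gives the central 95% of the sampling distribution of the sample mean under H0 (centered at mu_0 = 128).
# The sample mean mu_1 = 127.2 lies inside this interval, so we fail to reject H0.
# Note: the interval is two-sided, so it corresponds to a two-tailed test at the 5% level.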
CL = .95
st.t.interval(CL, df=n-1, loc=mu_0, scale=SE)

(126.66572365661092, 129.33427634338906)

In [17]:
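# OR critical value.
# st.t.interval(CL, df=n-1)[0] is the lower critical value of a two-tailed test at the 5% level (2.5% in each tail).
# tstat lies above it, so we cannot reject H0: mean = 128.
# For the left-tailed alternative, the 5% critical value is st.t.ppf(0.05, df=n-1), about -1.796; tstat is still above it, so the conclusion is the same.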
tstat, st.t.interval(CL, df=n-1)[0]

(-1.3196577581477114, -2.200985160082949)
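
Because $H_\text{a}: \mu < 128$ is left-tailed, the rejection region at $\alpha = 0.05$ places the whole 5% in the lower tail. A minimal sketch of that one-tailed comparison, with illustrative variable names:

```python
import numpy as np
import scipy.stats as st

n, mu_0, x_bar, s = 12, 128, 127.2, 2.1
alpha = 0.05
tstat = (x_bar - mu_0) / (s / np.sqrt(n))
t_crit = st.t.ppf(alpha, df=n - 1)   # all of alpha in the left tail
reject = tstat < t_crit              # reject H0 only if t falls below the critical value
print(tstat, t_crit, reject)
```

The comparison `tstat < t_crit` is equivalent to `pval < alpha`, so it gives the same decision as the P-value approach.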

---

## Making conclusions in a t test for a mean

---

### Example 1

A website streams movies and television shows to its subscribers. Employees know that the average time a user spends per session on their website is $2$ hours. The website changed its design, and they wanted to know if the average session length was longer than $2$ hours. They randomly sampled $50$ users to test $H_0: \mu=2$ versus $H_\text{a}: \mu > 2$, where $\mu$ is the mean session length.

Users in the sample had a mean session length of $2.49$ hours and a standard deviation of $1.55$ hours. These results produced a test statistic of $t\approx2.24$ and a P-value of approximately $0.015$.

**Assuming the conditions for inference were met, what is an appropriate conclusion at the $\alpha=0.05$ significance level?**

The evidence suggests that the mean session length is longer than $2$ hours.

Since the P-value $0.015$ is less than $\alpha=0.05$, they can reject $H_0$ and conclude $H_\text{a}$.

---

### Example 2

Jumbo eggs in Australia, on average, are supposed to weigh $68\text{ g}$. Tala is in charge of a quality control test that involves weighing a sample of eggs to test $H_0:\mu=68$ versus $H_\text{a}: \mu \neq 68 \text{ g}$, where $\mu$ is the mean weight of the eggs in a batch.

Tala sampled $12$ eggs from a batch and found a sample mean weight of $68.5\text{ g}$ and a standard deviation of $1\text{ g}$. She calculated a test statistic of $t \approx 1.73$ and an approximate P-value of $0.111$. Assume that the conditions for inference were met.

**Is there sufficient evidence at the $\alpha=0.10$ level to conclude that the mean weight of the eggs in this batch is not equal to $68\text{ g}$?**

No, because $0.111>0.10$.

Since the P-value is greater than the significance level $\alpha$, the correct conclusion is to fail to reject $H_0$ (there is not sufficient evidence to conclude $H_\text{a}: \mu \neq 68\text{ g}$).

---

### Example 3

A quality control engineer is testing the battery life of a new smartphone. The company is advertising that the battery lasts $24$ hours on a full charge, but the engineer suspects that the battery life is actually less than that. They take a random sample of $30$ of these phones to test $H_0: \mu=24$ hours versus $H_\text{a}: \mu < 24$, where $\mu$ is the mean battery life of these phones.

The sample data had a mean of $21$ hours and a standard deviation of $16$ hours. These results produced a test statistic of $t\approx-1.03$ and a P-value of approximately $0.156$.

**Assuming the conditions for inference were met, what is an appropriate conclusion at the $\alpha=0.10$ significance level?**

They cannot conclude the mean battery life is less than $24$ hours.

Since the P-value $0.156$ is greater than $\alpha=0.10$, they should fail to reject $H_0$ (they don't have enough evidence to conclude $H_\text{a}$).
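
The three conclusions above follow one rule: reject $H_0$ only when the P-value is below $\alpha$. The sketch below recomputes each test statistic and P-value from the summary values and applies that rule. The helper `decide` and the tuple layout are hypothetical, and the Example 3 standard deviation is taken to be $16$ hours, the value consistent with the reported $t \approx -1.03$.

```python
import numpy as np
import scipy.stats as st


def decide(pval, alpha):
    """Apply the decision rule: reject H0 only when the P-value is below alpha."""
    return "reject H0" if pval < alpha else "fail to reject H0"


# (label, n, mu_0, x_bar, s, alternative, alpha) taken from Examples 1-3;
# s = 16 in Example 3 is an assumption consistent with the reported t of -1.03
examples = [
    ("streaming", 50, 2, 2.49, 1.55, "greater", 0.05),
    ("eggs", 12, 68, 68.5, 1, "two-sided", 0.10),
    ("battery", 30, 24, 21, 16, "less", 0.10),
]

results = {}
for label, n, mu_0, x_bar, s, alternative, alpha in examples:
    tstat = (x_bar - mu_0) / (s / np.sqrt(n))
    if alternative == "less":
        pval = st.t.cdf(tstat, df=n - 1)
    elif alternative == "greater":
        pval = st.t.sf(tstat, df=n - 1)
    else:
        pval = 2 * st.t.sf(abs(tstat), df=n - 1)
    results[label] = (tstat, pval, decide(pval, alpha))
    print(label, round(tstat, 2), round(pval, 3), results[label][2])
```

"Fail to reject $H_0$" is deliberately not phrased as "accept $H_0$": a large P-value only means the sample does not provide enough evidence against the null hypothesis.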

---

# More significance testing

---

## Z-statistics vs. T-statistics

---

### Example 1: Small sample hypothesis test

Data: $15.6, 16.2, 22.5, 20.5, 16.4, 19.4, 16.6, 17.9, 12.7, 13.9$

- $H_{0}: \mu = 20$
- $H_\text{a}: \mu < 20$

In [18]:
mu_0 = 20
arr = np.array([15.6, 16.2, 22.5, 20.5, 16.4, 19.4, 16.6, 17.9, 12.7, 13.9])

st.ttest_1samp(a=arr, popmean=20, alternative='less')

Ttest_1sampResult(statistic=-3.001649525885985, pvalue=0.0074582071244487635)

In [19]:
# with tstat
n, mu_0, mu_1, sd_1 = len(arr), 20, arr.mean(), arr.std(ddof=1)
SE = sd_1 / np.sqrt(n)
tstat = (mu_1 - mu_0) / SE
precision = 3
pval = st.t.cdf(x=tstat, df=n-1)
pval

0.0074582071244487635

In [20]:
# OR without tstat
n, mu_0, mu_1, sd_1 = len(arr), 20, arr.mean(), arr.std(ddof=1)
SE = sd_1 / np.sqrt(n)

pval = st.t(loc=mu_0, scale=SE, df=n-1).cdf(x=mu_1)
pval

0.0074582071244487635
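
To see why the $t$ distribution matters for a sample this small, the same statistic can be referred to the standard normal distribution instead. This comparison is an illustrative sketch, not part of the original exercise.

```python
import numpy as np
import scipy.stats as st

arr = np.array([15.6, 16.2, 22.5, 20.5, 16.4, 19.4, 16.6, 17.9, 12.7, 13.9])
mu_0 = 20
n = len(arr)
tstat = (arr.mean() - mu_0) / (arr.std(ddof=1) / np.sqrt(n))

p_t = st.t.cdf(tstat, df=n - 1)   # t distribution with n - 1 = 9 degrees of freedom
p_z = st.norm.cdf(tstat)          # same statistic referred to the standard normal
print(p_t, p_z)
```

The normal P-value is smaller because the normal distribution has lighter tails. With only $10$ observations, $s$ is a noisy estimate of $\sigma$, and the $t$ distribution accounts for that extra uncertainty.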

---

### Example 2: Large sample proportion hypothesis testing

- $\alpha = 5\%, k = 57, n = 150$

- $H_{0}: p \leq 30\%$
- $H_\text{a}: p > 30\%$

In [21]:
k, n, p = 57, 150, 0.3

zstat, pval = proportion.proportions_ztest(count=k, nobs=n, value=p, alternative='larger', prop_var=p)
pval

0.016254722322859756
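
The same result can be reproduced by hand, which makes the role of `prop_var=p` visible: the standard error is computed from the hypothesized proportion $p_0$ rather than from the sample proportion. A minimal sketch with illustrative variable names:

```python
import numpy as np
import scipy.stats as st

k, n, p_0 = 57, 150, 0.3

# Large-sample condition: expected successes and failures under H0 are both at least 10
large_enough = n * p_0 >= 10 and n * (1 - p_0) >= 10

p_hat = k / n                           # sample proportion
se_0 = np.sqrt(p_0 * (1 - p_0) / n)     # standard error under H0
zstat = (p_hat - p_0) / se_0
pval = st.norm.sf(zstat)                # right tail, because H_a is p > 30%
print(large_enough, zstat, pval)
```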

In [22]:
# OR critical value
# H_a is right-tailed, so the whole 5% goes in the upper tail and the critical value is the 95th percentile.
# zstat is larger than the critical value, so we can reject H0: p <= 30% at the 5% significance level.
CL = .95
critical_value = st.norm.ppf(CL)
zstat, critical_value

(2.138089935299395, 1.6448536269514722)