# **Week 8: Large Sample Inference - Hypothesis Testing**

```
.------------------------------------.
|   __  ____  ______  _  ___ _____   |
|  |  \/  \ \/ / __ )/ |/ _ \___  |  |
|  | |\/| |\  /|  _ \| | | | | / /   |
|  | |  | |/  \| |_) | | |_| |/ /    |
|  |_|  |_/_/\_\____/|_|\___//_/     |
'------------------------------------'

```

Through the following examples, we will explore the concepts of hypothesis testing and examine their practical implications.


## **Pre-Configuring the Notebook**

### **Switching to the R Kernel on Colab**

By default, Google Colab uses Python as its programming language. To use R instead, you’ll need to manually switch the kernel by going to **Runtime > Change runtime type**, and selecting R as the kernel. This allows you to run R code in the Colab environment.

However, our notebook is already configured to use R by default. Unless something goes wrong, you shouldn’t need to manually change the runtime type.

### **Importing Required Packages**
**Run the following lines of code**:

In [21]:
#Do not modify

setwd("/content")

# Remove `MXB107-Notebooks` if it already exists
if (dir.exists("MXB107-Notebooks")) {
  system("rm -rf MXB107-Notebooks")
}

# Clone the repository
system("git clone https://github.com/edelweiss611428/MXB107-Notebooks.git")

# Change working directory to "MXB107-Notebooks"
setwd("MXB107-Notebooks")

# Run the setup script (loads the required packages)
invisible(source("R/preConfigurated.R"))

Loading required package: ggplot2

Loading required package: dplyr


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: tidyr

Loading required package: stringr

Loading required package: magrittr


Attaching package: ‘magrittr’


The following object is masked from ‘package:tidyr’:

    extract


Loading required package: IRdisplay

Loading required package: png

“there is no package called ‘png’”
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Loading required package: grid

Loading required package: knitr



**Do not modify the following**

In [22]:
if (!require("testthat")) install.packages("testthat"); library("testthat")

test_that("Test if all packages have been loaded", {

  expect_true(all(c("ggplot2", "tidyr", "dplyr", "stringr", "magrittr", "knitr") %in% loadedNamespaces()))

})

Loading required package: testthat


Attaching package: ‘testthat’


The following objects are masked from ‘package:magrittr’:

    equals, is_less_than, not


The following object is masked from ‘package:tidyr’:

    matches


The following object is masked from ‘package:dplyr’:

    matches




Test passed 🌈


## **Making Sense of Hypothesis Testing**


In [23]:
smpl_data = c(7.2, 8.53, 8.07, 7.99, 7.79, 7.77, 8.9, 7.64, 7.35, 8.45, 9.14, 7.93, 7.35, 7.52, 7.41, 8.27,
              7.55, 7.5, 8.53, 8.37, 8.17, 8.15, 8.02, 7.63, 7.64, 8.83, 8.17, 7.41, 7.7, 8.21)

### **Toy Example**

A phone company advertises that the average battery life of their phones (when continuously watching videos), denoted as $\mu$, is 8 hours.

To verify this claim, an independent random sample of 30 phones was tested. Battery life is assumed to follow a normal distribution, but the population standard deviation $\sigma$ is unknown and must be estimated from the sample.

**Hint**:
- Use the asymptotic properties of the sample mean
- Replace the unknown standard deviation $\sigma$ with its estimate

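**Write down the asymptotic sampling distribution of the sample mean.**

Given i.i.d. $x_1, \dots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ with $n = 30$, we have:

$$\bar{x} \sim \mathcal{N}\Big(\mu,\frac{\sigma^2}{30}\Big)$$

This is the exact sampling distribution, not merely an asymptotic one, because $x_1, \dots, x_n$ are i.i.d. Gaussian random variables. In practice $\sigma$ is unknown, so we replace it with the sample standard deviation $s$; the resulting normal distribution is then an approximation that works well for large $n$.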
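As a quick sanity check, the sampling distribution of the sample mean can be explored by simulation. The sketch below is illustrative only: the values `mu = 8` and `sigma = 0.5` are assumed for the simulation and are not taken from the sample data.

```r
# Simulate many samples of size n from N(mu, sigma^2) and inspect the
# distribution of their sample means (mu and sigma are assumed values).
set.seed(107)
mu = 8; sigma = 0.5; n = 30
n_sims = 10000

# Each replicate draws n observations and records the sample mean
xbars = replicate(n_sims, mean(rnorm(n, mean = mu, sd = sigma)))

mean(xbars)          # should be close to mu
sd(xbars)            # should be close to sigma / sqrt(n)
sigma / sqrt(n)      # theoretical standard error
```

The simulated standard deviation of the sample means should sit close to the theoretical standard error $\sigma/\sqrt{n}$.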

**Write down the null and alternative hypotheses for testing whether the company’s claim is correct.**


Since we are testing whether or not there is evidence *against* the company’s claim that the average battery life is 8 hours, the alternative hypothesis should challenge this claim.  

Because we are not concerned if the battery lasts longer than 8 hours (that would be favorable to consumers), we only test if it is **less** than 8 hours.  

$$
\begin{align}
H_0: \mu &= 8 \\
H_1: \mu &< 8
\end{align}
$$

This is a **left-tailed test** of the mean.

**Approximate the sampling distribution of the sample mean under the null hypothesis.**

In [24]:
var(smpl_data)

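Under $H_0$, $\mu = 8$, and $\sigma^2$ is estimated by the sample variance $s^2 \approx 0.253$, so approximately:

$$\bar{x} \sim \mathcal{N}\Big(8,\frac{0.253}{30}\Big)$$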

**Define the z-test statistic for testing the null hypothesis and derive the rejection region.**


$$
z = \frac{\bar{x} - \mu_{H_0}}{\sigma_{\bar{x}}} \approx \frac{\bar{x} - 8}{\sqrt{\frac{0.253}{30}}}
$$


At $\alpha = 0.05$, we reject the null hypothesis if $z < z_{0.05} = -1.645$. Thus, the rejection region is $(-\infty, -1.645)$.
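The critical value can be obtained in R with `qnorm()`, the standard normal quantile function. A minimal sketch for the left-tailed case follows; `reject_h0` is a hypothetical helper introduced here for illustration.

```r
alpha = 0.05
z_crit = qnorm(alpha)   # lower-tail 5% quantile of N(0, 1)
round(z_crit, 3)        # -1.645

# Decision rule for a left-tailed test: reject H0 when z < z_crit
reject_h0 = function(z) z < z_crit

reject_h0(-2)           # TRUE: -2 lies in the rejection region
```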

**Why is this approach valid?**

Under the null hypothesis (i.e., if $H_0$ is true), the test statistic follows (approximately, since $\sigma$ is estimated) a standard Gaussian distribution:  

$$
z \mid H_0 \sim \mathcal{N}(0,1)
$$  

Here, the probability of observing $z < -1.645$ under $H_0$ is 0.05, which is relatively unlikely. If we observe a test statistic less than -1.645, this provides evidence **against** the null hypothesis that $\mu = 8$.  

For example, if the true population mean is substantially smaller than 8, the sample mean is likely to be smaller, resulting in a more negative test statistic $z$.  
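The claim that the rejection rule is triggered only about 5% of the time when $H_0$ is true can be checked by simulation. This is a sketch under assumed values: `sigma = 0.5` is chosen for illustration, and `sim_z` is a hypothetical helper.

```r
set.seed(107)
mu0 = 8; sigma = 0.5; n = 30; alpha = 0.05   # sigma is an assumed value
n_sims = 10000

# z-statistic for one simulated sample drawn with the true mean set to mu0
sim_z = function() {
  x = rnorm(n, mean = mu0, sd = sigma)
  (mean(x) - mu0) / (sd(x) / sqrt(n))
}

z_sims = replicate(n_sims, sim_z())
type1_rate = mean(z_sims < qnorm(alpha))   # proportion of false rejections
type1_rate
```

The proportion should land near 0.05. It tends to sit slightly above that value because $\sigma$ is replaced by $s$, so the normal approximation is not exact for $n = 30$.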




**Given the sample data, compute the test statistic and state the Neyman-Pearson decision.**

In [25]:
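xbar = mean(smpl_data)        # sample mean
s = sd(smpl_data)             # sample standard deviation (estimate of sigma)
n = length(smpl_data)         # sample size (30)
z = (xbar - 8)/(s/sqrt(n))    # z-statistic under H0: mu = 8
z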

As $z_{\text{observed}} \approx -0.294 \notin (-\infty, -1.645)$, there is no evidence against the null hypothesis that $\mu = 8$. Do not reject the null hypothesis.
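The same decision can be reached with a p-value, that is, the probability under $H_0$ of a test statistic at least as extreme as the one observed. A minimal sketch for this left-tailed test, repeating the sample so that the block is self-contained:

```r
smpl_data = c(7.2, 8.53, 8.07, 7.99, 7.79, 7.77, 8.9, 7.64, 7.35, 8.45,
              9.14, 7.93, 7.35, 7.52, 7.41, 8.27, 7.55, 7.5, 8.53, 8.37,
              8.17, 8.15, 8.02, 7.63, 7.64, 8.83, 8.17, 7.41, 7.7, 8.21)

# z-statistic for H0: mu = 8, with sigma estimated by the sample sd
z = (mean(smpl_data) - 8) / (sd(smpl_data) / sqrt(length(smpl_data)))

p_value = pnorm(z)     # left-tailed: P(Z <= z) under H0
p_value
p_value < 0.05         # FALSE here, so H0 is not rejected
```

Comparing the p-value with $\alpha$ is equivalent to checking whether $z$ falls in the rejection region.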

### **Key Ideas About Hypothesis Testing**

#### **Key Idea #1**

Hypothesis testing is based on the philosophy that if an event is unlikely under scenario A, but we still observe it in reality, this serves as evidence against scenario A (calling into question the validity or existence of A).

By convention, the **null hypothesis** is set to represent the idea that "nothing special is happening," while the **alternative hypothesis** is the one that *challenges* this assumption.  

For example:  
- If you want to test whether the average battery life is 8 hours, the null hypothesis would be:  
  $$
  H_0: \mu = 8
  $$  
  This is the "nothing special" scenario.  

- If you suspect someone might have malicious intent, the null hypothesis would be:  
  $$
  H_0: \text{No malicious intent}
  $$
  The alternative would be:  
  $$
  H_1: \text{Malicious intent}
  $$  

Of course, if you start observing lots of *suspicious* actions, those observations serve as **evidence against the null**, which may lead you to favour the alternative.



#### **Key Idea #2**

Hypothesis testing cannot establish whether a hypothesis is true or false—it only assesses whether the data provide sufficient evidence to reject the null hypothesis.

Even if we reject the null hypothesis, this does **not** mean that $H_0$ is false. This is because we may still commit a **Type I error**, which occurs when we reject the null hypothesis even though it is actually true.  

Therefore, we should **never say**:  
- "The null hypothesis is wrong."  
- "The alternative hypothesis is correct."
- "We accept the alternative hypothesis."

Instead, say:

- "There is evidence against the null hypothesis."
- "We reject the null hypothesis in favour of the alternative."

The good news is that the probability of a Type I error is something we can control.  Most conventional hypothesis testing procedures (such as Neyman-Pearson or Fisher’s p-value approach) are based on pre-specifying a Type I error probability, often denoted by $\alpha$. A common choice is $\alpha = 0.05$, which serves as the threshold for deciding whether the observed data provide sufficient evidence against $H_0$.  

Back to this example:

$$
\begin{align}
H_0: &\text{ No malicious intent} \\
H_1: &\text{ Malicious intent}
\end{align}
$$

What if we observe no suspicious actions? Does that mean $H_0$ is `true`? Not necessarily: the person may simply be waiting for an opportunity. In hypothesis testing, we can also commit a **Type II error**, which occurs when we fail to reject the null hypothesis even though the alternative is `true`. As a result, failing to reject the null does not imply:

- "The null hypothesis is correct."
- "We accept the null hypothesis."

Instead, we should conclude that
- "There is no evidence against the null hypothesis."

Unfortunately, there is an inherent trade-off between Type I and Type II errors: for a fixed sample size, lowering one tends to raise the other. The Neyman–Pearson lemma shows how to construct the most powerful test for a given size (i.e., a fixed Type I error rate). This test minimises the Type II error rate among all tests of that size, so once the Type I error rate is fixed, the Type II error rate cannot be reduced any further. In practice, you first choose the Type I error rate you are willing to tolerate, and then apply Neyman–Pearson to obtain a test that achieves the best possible power against a given alternative.
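The trade-off can be made concrete for the left-tailed z-test of $H_0: \mu = 8$. The sketch below computes the Type II error rate $\beta$ for several values of $\alpha$. The true mean `mu1 = 7.8` and `sigma = 0.5` are assumed values chosen for illustration, and `type2_error` is a hypothetical helper.

```r
# Type II error rate of a left-tailed z-test of H0: mu = mu0
# when the true mean is mu1 (mu1 and sigma are assumed values)
type2_error = function(alpha, mu0 = 8, mu1 = 7.8, sigma = 0.5, n = 30) {
  se = sigma / sqrt(n)
  power = pnorm(qnorm(alpha) + (mu0 - mu1) / se)  # P(reject H0 | mu = mu1)
  1 - power
}

alphas = c(0.10, 0.05, 0.01)
data.frame(alpha = alphas, beta = sapply(alphas, type2_error))
```

In the resulting table, $\beta$ grows as $\alpha$ shrinks: a stricter threshold for rejecting $H_0$ makes it harder to detect a true deviation.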

How to actually construct such tests is beyond the scope of this unit. Instead, we will only state a corollary of the Neyman–Pearson lemma, which shows how to determine the rejection region in the simple case of testing hypotheses about population means.

#### **Key Idea #3**

We do hypothesis testing not to confirm whether $H_0$ is `true`, but to assess whether the data provide evidence of a substantial deviation from the null hypothesis.

For example, suppose the true mean battery life is 7.999 hours instead of 8. Such a tiny difference is practically indistinguishable, so it does not matter. What matters is whether the observed data show a meaningful departure from the claimed value of 8 hours. In this case, it is very likely that we will fail to reject the null hypothesis $\mu = 8$, because the deviation is too small to detect.
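To see how hard such a tiny deviation is to detect, the probability of rejecting $H_0$ (the power) can be computed for different true means. This is a sketch under assumptions: `sigma = 0.5` is an assumed value and `power_left` is a hypothetical helper for the left-tailed z-test with $n = 30$.

```r
# Power of the left-tailed z-test of H0: mu = mu0 when the true mean is mu1
power_left = function(mu1, mu0 = 8, sigma = 0.5, n = 30, alpha = 0.05) {
  pnorm(qnorm(alpha) + (mu0 - mu1) / (sigma / sqrt(n)))
}

power_left(7.999)   # barely above alpha: the deviation is almost undetectable
power_left(7.7)     # a large deviation is detected with high probability
```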

#### **Key Idea #4**

It is very bad practice to adjust the hypothesis after looking at the data.

Hypothesis testing assumes that the null and alternative hypotheses are specified before collecting or examining the data. Changing your hypothesis after observing the data (sometimes called “data snooping” or “p-hacking”) inflates the Type I error rate and makes your conclusions unreliable.

Connection to the battery-life example: suppose you originally want to test

$$
\begin{align}
H_0: \mu &= 8 \\
H_1: \mu &< 8
\end{align}
$$

If you peek at the data and see a mean around 7.95 hours, and then decide to only test a smaller deviation (say $\mu < 7.9$) to get “nicer” results, you are **post-hoc adjusting the hypothesis**.  This biases the test: your Type I error is no longer controlled.

The correct approach: decide in advance what deviation you want to detect (e.g., battery life shorter than 8 hours) and stick with it, regardless of what the observed sample mean turns out to be. If you want to change the hypotheses, you need to collect new data.
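A simple simulation illustrates how data-dependent choices inflate the Type I error rate. The sketch below uses a related form of post-hoc adjustment rather than the exact scenario above: the analyst looks at the sign of the test statistic and then picks whichever one-sided alternative the data favour. It assumes the z-statistic is exactly standard normal under $H_0$.

```r
set.seed(107)
n_sims = 10000; alpha = 0.05

# Under H0 the z-statistic is (approximately) standard normal
z_sims = rnorm(n_sims)

# Honest analysis: left-tailed alternative fixed before seeing the data
rate_fixed = mean(z_sims < qnorm(alpha))

# Data-dependent analysis: pick whichever one-sided alternative the data favour
rate_peek = mean(abs(z_sims) > qnorm(1 - alpha))

c(fixed = rate_fixed, peek = rate_peek)
```

The pre-specified test rejects about 5% of the time, while the data-dependent version rejects roughly twice as often even though $H_0$ is true in every simulated sample.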

## **Workshop Questions**
Throughout this section, we assume a Type I error rate of 0.05.




### **Question 1**

The following questions are based on the `episodes` dataset. While you are expected to use R to compute the answers, the underlying concepts are identical to those in pen-and-paper hypothesis test calculations.

In [27]:
episodes = read.csv("./datasets/episodes.csv")
episodes %>% str()

'data.frame':	704 obs. of  57 variables:
 $ Series                        : chr  "TOS" "TOS" "TOS" "TOS" ...
 $ Series.Name                   : chr  "The Original Series" "The Original Series" "The Original Series" "The Original Series" ...
 $ Season                        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Episode                       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ IMDB.Ranking                  : num  7.3 7.2 7.8 8 7.8 6.9 7.6 7.1 7.5 8.2 ...
 $ Title                         : chr  "The Man Trap" "Charlie X" "Where No Man Has Gone Before" "The Naked Time" ...
 $ Star.date                     : chr  "1513.1" "1533.6" "1312.4" "1704.2" ...
 $ Air.date                      : chr  "8/9/66" "15/9/66" "22/9/66" "29/9/66" ...
 $ Bechdel.Wallace.Test          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Director                      : chr  "Marc Daniels" "Lawrence Dobkin" "James Goldstone" "Marc Daniels" ...
 $ Writer.1                      : chr  "George Clayton Johnson" "Gene Rode

#### **Question 1.1**

Test whether the mean IMDB rating of episodes of Star Trek: The Original Series is greater than 7.7. Interpret the results for a non-statistician stakeholder.



<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!

</details>


#### **Question 1.2**

Test whether the mean IMDB rating of episodes of Star Trek: The Original Series differs from the mean rating of Star Trek: The Next Generation. Interpret the results for a non-statistician stakeholder.



<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!

</details>


#### **Question 1.3**

Test whether the proportion of Star Trek: The Next Generation episodes that pass the Bechdel-Wallace Test is equal to 0.4. Interpret the results for a non-statistician stakeholder.


**Note**: While these series have ended and you technically have the full “population” of some values (e.g., results of the Bechdel-Wallace test), we still ask you to test whether or not the proportion of episodes that pass the test is equal to 0.4. This may seem counter-intuitive, but you can think of it as follows:

- The episode test results are treated as realisations from an unknown probability distribution $f$ (here, Bernoulli(p)).
- Although the episodes are released, we are interested in the underlying process that generates these values. This includes not-yet-released episodes or hypothetical similar episodes. Simply examining the “complete” population of Bechdel test results is not sufficient; instead, we rely on a statistical model to quantify uncertainty.



<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!

</details>


#### **Question 1.4**

Test whether the proportion of episodes that pass the Bechdel-Wallace Test differs between Star Trek: The Next Generation and Star Trek: Voyager. Interpret the results for a non-statistician stakeholder.



<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!

</details>


### **Question 2**

The following questions are based on the `epa_data` dataset. While you are expected to use R to compute the answers, the underlying concepts are identical to those in pen-and-paper hypothesis test calculations.


In [29]:
epa_data = read.csv("./datasets/epa_data.csv")
epa_data %>% str()

'data.frame':	13569 obs. of  9 variables:
 $ city : int  16 15 16 19 19 19 19 19 19 19 ...
 $ hwy  : int  24 22 22 27 29 24 26 27 29 24 ...
 $ cyl  : int  8 8 8 4 4 4 4 4 4 4 ...
 $ disp : num  5 5 5 2 2 2.4 2.4 2 2 2.4 ...
 $ drive: chr  "Rear-Wheel Drive" "Rear-Wheel Drive" "Rear-Wheel Drive" "Rear-Wheel Drive" ...
 $ make : chr  "Jaguar" "Jaguar" "Jaguar" "Pontiac" ...
 $ model: chr  "XK" "XK" "XK Convertible" "Solstice" ...
 $ trans: chr  "Automatic" "Automatic" "Automatic" "Automatic" ...
 $ year : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...


#### **Question 2.1**

Test the hypothesis that there is no difference in mean city mileage between cars manufactured in 2015 and cars manufactured in 2020. Interpret the results for a non-statistician stakeholder.



<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!

</details>


#### **Question 2.2**

Test the hypothesis that the proportion of cars produced with manual transmissions in 2010 is less than 0.5. Interpret the results for a non-statistician stakeholder.



<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!

</details>


#### **Question 2.3**

Test the hypothesis that the proportion of cars produced with manual transmissions decreased between 1990 and 2010. Interpret the results for a non-statistician stakeholder.



<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!

</details>
