# **Week 9: Small Sample Inference - Hypothesis Testing**

```
.------------------------------------.
|   __  ____  ______  _  ___ _____   |
|  |  \/  \ \/ / __ )/ |/ _ \___  |  |
|  | |\/| |\  /|  _ \| | | | | / /   |
|  | |  | |/  \| |_) | | |_| |/ /    |
|  |_|  |_/_/\_\____/|_|\___//_/     |
'------------------------------------'

```

Through the following examples, we will explore the concepts of (small-sample) hypothesis testing (SSHT) and examine their practical implications.

**Why focus on small samples?**

- Less data means we cannot rely on the “law of large numbers” or asymptotic properties. Every conclusion depends heavily on the assumptions of our statistical model. For instance, if we assume data are normally distributed, that assumption has a bigger impact when the sample is small.

- In contrast, with large samples, many statistical methods rely on asymptotic properties, meaning they behave approximately correctly ***almost*** regardless of the underlying model. Essentially, large-sample inference is almost model-free.

## **Pre-Configuring the Notebook**

### **Switching to the R Kernel on Colab**

By default, Google Colab uses Python as its programming language. To use R instead, you’ll need to manually switch the kernel by going to **Runtime > Change runtime type**, and selecting R as the kernel. This allows you to run R code in the Colab environment.

However, our notebook is already configured to use R by default. Unless something goes wrong, you shouldn’t need to manually change runtime type.

### **Importing Required Packages**
**Run the following lines of code**:

In [None]:
#Do not modify

setwd("/content")

# Remove `MXB107-Notebooks` if exists,
if (dir.exists("MXB107-Notebooks")) {
  system("rm -rf MXB107-Notebooks")
}

# Clone the repository
system("git clone https://github.com/edelweiss611428/MXB107-Notebooks.git")

# Change working directory to "MXB107-Notebooks"
setwd("MXB107-Notebooks")

# Run the pre-configuration script
invisible(source("R/preConfigurated.R"))

**Do not modify the following**

In [None]:
if (!require("testthat")) install.packages("testthat"); library("testthat")

test_that("Test if all packages have been loaded", {

  expect_true(all(c("ggplot2", "tidyr", "dplyr", "stringr", "magrittr", "knitr") %in% loadedNamespaces()))

})

## **Reference Tables for SSHT for Sample Means**

| Scenario | Parameter | Null hypothesis | Test statistic (t) | Degrees of freedom ($\nu$) under $H_0$ |
|----------|-----------|-----------------|--------------------|--------------------------------|
| 1, One-sample mean | $\mu$ | $\mu = \mu_0$ | $t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$ | $n - 1$ |
| 2, Paired sample (dependent) | $\mu_D$ (mean difference) | $\mu_D = d_0$ | $t = \dfrac{\bar{d}-d_0}{s_d / \sqrt{n}}$ | $n - 1$ |
| 3, Two-sample mean, equal variances (pooled) | $\mu_1 - \mu_2$ | $\mu_1 - \mu_2 = d_0$ | $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$. <br> $t = \dfrac{\bar{x}_1 - \bar{x}_2 - d_0}{s_p \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}}$ | $n_1 + n_2 - 2$ |
| 4, Two-sample mean, unequal variances (Welch's) | $\mu_1 - \mu_2$ | $\mu_1 - \mu_2 = d_0$ | $t = \dfrac{\bar{x}_1 - \bar{x}_2 - d_0}{\sqrt{\tfrac{s_1^2}{n_1} + \tfrac{s_2^2}{n_2}}}$ | $\dfrac{\big(\tfrac{s_1^2}{n_1} + \tfrac{s_2^2}{n_2}\big)^2}{\tfrac{s_1^4}{n_1^2(n_1-1)} + \tfrac{s_2^4}{n_2^2(n_2-1)}}$ |

**NOTE THAT `R` BY DEFAULT USES FRACTIONAL NUMBER OF DEGREES OF FREEDOM FOR WELCH'S T-TEST, WHICH IS NOT AVAILABLE IN MOST STATISTICAL TABLES. IF NOT AVAILABLE, CONSIDER THE FLOOR VALUE**:

$$\left\lfloor\dfrac{\big(\tfrac{s_1^2}{n_1} + \tfrac{s_2^2}{n_2}\big)^2}{\tfrac{s_1^4}{n_1^2(n_1-1)} + \tfrac{s_2^4}{n_2^2(n_2-1)}}\right\rfloor,$$

**WHICH LEADS TO A MORE CONSERVATIVE TEST. IN THIS UNIT, WE FOLLOW THIS APPROACH FOR PEN-AND-PAPER QUESTIONS. `R` CAN HANDLE A FRACTIONAL NUMBER OF DEGREES OF FREEDOM.**
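As a rough illustration of the note above, the Welch degrees of freedom can be computed directly from the formula in the reference table. The sketch below uses R's built-in `sleep` dataset purely as convenient example data; the variable names (`v1`, `v2`, `nu`) are introduced here for illustration only.

```r
# Welch degrees of freedom, computed from the formula in the reference table
x <- sleep$extra[sleep$group == "1"]
y <- sleep$extra[sleep$group == "2"]

v1 <- var(x) / length(x)   # s_1^2 / n_1
v2 <- var(y) / length(y)   # s_2^2 / n_2

nu <- (v1 + v2)^2 / (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))

nu          # fractional degrees of freedom, as used by t.test()
floor(nu)   # value to look up in a printed table

# Flooring never decreases the critical value, so the test is more conservative
qt(0.975, df = floor(nu)) >= qt(0.975, df = nu)
```

The fractional value should agree with the `df` reported by `t.test(x, y)`, which uses Welch's test by default.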



Assuming that the data are i.i.d. draws from a Gaussian distribution (two dependent Gaussians for scenario 2, and two **independent** Gaussians for scenarios 3 and 4), under the null hypothesis the $t$ test statistics in the reference table follow a Student's $T$ distribution with the number of degrees of freedom given in the table (exactly for scenarios 1 to 3, and approximately for Welch's test in scenario 4). This implies that a substantial deviation from Gaussianity might weaken the tests (especially when the data are heavily skewed or contain outliers).


In fact, the validity of the $t$-test depends on how severely the data deviate from Gaussianity. We do not require the data to be truly normal to use the $t$-test, as it often remains useful when deviations from normality are mild. However, if the deviation is substantial (e.g., outliers, extreme values), the test can become unreliable.

These $t$ tests are particularly useful when sample sizes are small, since the heavier tails of the $t$ distribution provide more accurate critical values. For large samples, however, the $T$ distribution converges to the standard normal distribution, and the tests reduce to the $z$-tests considered earlier in LSHT. As a rule of thumb, $n > 30$ is usually considered large enough for this approximation.
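The convergence to the standard normal can be checked numerically. The following sketch compares two-sided 5% critical values of the $t$ distribution with the corresponding normal critical value; the chosen degrees of freedom are arbitrary.

```r
# 97.5th percentiles of the t-distribution for increasing degrees of freedom
dfs    <- c(2, 5, 10, 30, 100, 1000)
t_crit <- qt(0.975, df = dfs)
z_crit <- qnorm(0.975)          # standard normal critical value

data.frame(df     = dfs,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3),
           excess = round(t_crit - z_crit, 3))
```

The `excess` column shrinks towards zero as the degrees of freedom grow, which is why the $t$ and $z$ tests give nearly identical decisions for large samples.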

Any substantial deviation from the null hypothesis will tend to produce $t$ values that are unlikely under these $T$ distributions, which is why extreme values of $t$ provide evidence against $H_0$.

| Test Type | Alternative Hypothesis | Rejection Region |
|-----------|----------------------|----------------|
| One-sided (right) | $H_1: \theta > \theta_0$ | Reject $H_0$ if $t > t_{\nu, 1-\alpha}$ |
| One-sided (left) | $H_1: \theta < \theta_0$ | Reject $H_0$ if $t < t_{\nu, \alpha}$ |
| Two-sided | $H_1: \theta \neq \theta_0$ | Reject $H_0$ if $|t| > t_{\nu, 1-\alpha/2}$ |

Even though any deviation from $H_0$ can provide evidence against it, the choice between a one-sided and a two-sided test depends on our research goal and the direction of interest.

If we specifically care about deviations in one direction — for example, testing whether the average battery life is less than 8 hours — a one-sided test is appropriate. Allocating all of the Type I error $\alpha$ to that direction increases the test’s ability to detect deviations that matter in practice.

On the other hand, if deviations in either direction are meaningful — for instance, testing whether the average rating of a show differs from 7.7, whether higher or lower — a two-sided test is necessary. Splitting $\alpha$ between both tails ensures we properly account for evidence against $H_0$ in either direction.
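The rejection regions in the table can be sketched as a small helper. Note that `reject_h0` is a hypothetical function written for this illustration, not part of base R or of this unit's helper code.

```r
# Hypothetical helper: TRUE if the t-statistic falls in the rejection region
reject_h0 <- function(t_stat, df, alpha = 0.05,
                      alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         greater   = t_stat > qt(1 - alpha, df),           # right tail
         less      = t_stat < qt(alpha, df),               # left tail
         two.sided = abs(t_stat) > qt(1 - alpha / 2, df))  # both tails
}

# The same statistic can lead to different decisions
reject_h0(-2.0, df = 9, alternative = "less")
reject_h0(-2.0, df = 9, alternative = "two.sided")
```

With 9 degrees of freedom, the left-tailed critical value is about $-1.83$ while the two-sided critical value is about $2.26$, so $t = -2$ is rejected only by the one-sided test. This is the gain in power from allocating all of $\alpha$ to one tail.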




#### **Reference Table for Small Sample Confidence Intervals**


##### **Symmetric CI**

| Scenario | Parameter | Confidence Interval (CI) | Degrees of freedom ($\nu$) |
|----------|-----------|--------------------------|---------------------------|
| 1, One-sample mean | $\mu$ | $\bar{x} \pm t_{\nu, 1-\alpha/2} \frac{s}{\sqrt{n}}$ | $n - 1$ |
| 2, Paired sample (dependent) | $\mu_D$ (mean difference) | $\bar{d} \pm t_{\nu, 1-\alpha/2} \frac{s_d}{\sqrt{n}}$ | $n - 1$ |
| 3, Two-sample mean, equal variances (pooled) | $\mu_1 - \mu_2$ | $(\bar{x}_1 - \bar{x}_2) \pm t_{\nu, 1-\alpha/2} \cdot s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ <br> where $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$| $n_1 + n_2 - 2$ |
| 4, Two-sample mean, unequal variances (Welch's) | $\mu_1 - \mu_2$ | $(\bar{x}_1 - \bar{x}_2) \pm t_{\nu, 1-\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$ | $\dfrac{\big(\tfrac{s_1^2}{n_1} + \tfrac{s_2^2}{n_2}\big)^2}{\tfrac{s_1^4}{n_1^2(n_1-1)} + \tfrac{s_2^4}{n_2^2(n_2-1)}}$ |


##### **Right-tailed CI**


| Scenario | Parameter | Confidence Interval (CI) | Degrees of freedom ($\nu$) |
|----------|-----------|--------------------------|---------------------------|
| 1, One-sample mean | $\mu$ | $[\bar{x} - t_{\nu, 1-\alpha} \frac{s}{\sqrt{n}}, \infty)$ | $n - 1$ |
| 2, Paired sample (dependent) | $\mu_D$ (mean difference) | $[\bar{d} - t_{\nu, 1-\alpha} \frac{s_d}{\sqrt{n}}, \infty)$ | $n - 1$ |
| 3, Two-sample mean, equal variances (pooled) | $\mu_1 - \mu_2$ | $[(\bar{x}_1 - \bar{x}_2) - t_{\nu, 1-\alpha} \cdot s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \infty)$ <br> where $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$ | $n_1 + n_2 - 2$ |
| 4, Two-sample mean, unequal variances (Welch's) | $\mu_1 - \mu_2$ | $[(\bar{x}_1 - \bar{x}_2) - t_{\nu, 1-\alpha} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}, \infty)$ | $\dfrac{\big(\tfrac{s_1^2}{n_1} + \tfrac{s_2^2}{n_2}\big)^2}{\tfrac{s_1^4}{n_1^2(n_1-1)} + \tfrac{s_2^4}{n_2^2(n_2-1)}}$ |




##### **Left-tailed CI**


| Scenario | Parameter | Confidence Interval (CI) | Degrees of freedom ($\nu$) |
|----------|-----------|--------------------------|---------------------------|
| 1, One-sample mean | $\mu$ | $(-\infty, \bar{x} + t_{\nu, 1-\alpha} \frac{s}{\sqrt{n}}]$ | $n - 1$ |
| 2, Paired sample (dependent) | $\mu_D$ (mean difference) | $(-\infty, \bar{d} + t_{\nu, 1-\alpha} \frac{s_d}{\sqrt{n}}]$ | $n - 1$ |
| 3, Two-sample mean, equal variances (pooled) | $\mu_1 - \mu_2$ | $(-\infty, (\bar{x}_1 - \bar{x}_2) + t_{\nu, 1-\alpha} \cdot s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}]$ <br> where $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$ | $n_1 + n_2 - 2$ |
| 4, Two-sample mean, unequal variances (Welch's) | $\mu_1 - \mu_2$ | $(-\infty, (\bar{x}_1 - \bar{x}_2) + t_{\nu, 1-\alpha} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}]$ | $\dfrac{\big(\tfrac{s_1^2}{n_1} + \tfrac{s_2^2}{n_2}\big)^2}{\tfrac{s_1^4}{n_1^2(n_1-1)} + \tfrac{s_2^4}{n_2^2(n_2-1)}}$ |
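To make the three interval types concrete, the sketch below computes the one-sample versions by hand and compares them with the intervals returned by `t.test()`. It uses group `"1"` of the built-in `sleep` dataset as example data; the variable names are illustrative.

```r
x     <- sleep$extra[sleep$group == "1"]
n     <- length(x)
alpha <- 0.05
se    <- sd(x) / sqrt(n)   # estimated standard error of the mean

# Intervals from the reference tables (scenario 1)
two_sided <- mean(x) + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * se
right     <- c(mean(x) - qt(1 - alpha, df = n - 1) * se, Inf)
left      <- c(-Inf, mean(x) + qt(1 - alpha, df = n - 1) * se)

# Compare with the intervals reported by t.test()
rbind(manual = two_sided, t.test = t.test(x)$conf.int)
rbind(manual = right,     t.test = t.test(x, alternative = "greater")$conf.int)
rbind(manual = left,      t.test = t.test(x, alternative = "less")$conf.int)
```

In `t.test()`, `alternative = "greater"` produces the right-tailed interval and `alternative = "less"` the left-tailed one.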



### **When to Use Welch's $t$-test?**


Welch's $t$-test is more robust than the pooled-variance two-sample $t$-test because it does not assume equal population variances. It should be used when the two samples are independent and there is evidence that the variances differ.

- **Advantage:** Handles unequal variances and sample sizes reliably.  
- **Disadvantage:** Slightly less efficient (less power) than the pooled test if the population variances are actually equal.  

**Rule of Thumb:**  
If  
$$
\frac{\text{larger } s^2}{\text{smaller } s^2} > 3,
$$  
then  use  Welch's $t$-test; otherwise, the pooled-variance $t$-test is usually **acceptable**.

### **Connection Between Confidence Intervals and Hypothesis Testing**

Confidence intervals (CIs) provide a range of plausible values for a parameter. They are closely linked to hypothesis tests: whether a null hypothesis $H_0$ is rejected at significance level $\alpha$ can often be inferred from the corresponding CI.

**Right-tailed test**  
$$
H_0: \theta = \theta_0 \\
H_1: \theta > \theta_0
$$  

If $\theta_0$ lies **below** the $(1-\alpha) \times 100\%$ one-sided right-tailed CI $[\text{LB}, \infty)$, that is, $\theta_0 < \text{LB}$, we reject $H_0$ at level $\alpha$.  

**Left-tailed test**  
$$
H_0: \theta = \theta_0 \\
H_1: \theta < \theta_0
$$  

If $\theta_0$ lies **above** the $(1-\alpha) \times 100\%$ one-sided left-tailed CI $(-\infty, \text{UB}]$, that is, $\theta_0 > \text{UB}$, we reject $H_0$ at level $\alpha$.  

**Two-sided (symmetric) test**  
$$
H_0: \theta = \theta_0 \\
H_1: \theta \neq \theta_0
$$  

If $\theta_0$ lies **outside** the $(1-\alpha) \times 100\%$ two-sided CI, we reject $H_0$ at level $\alpha$.  
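The equivalence between the two decision rules can be checked numerically. The sketch below runs a right-tailed test on group `"2"` of the built-in `sleep` dataset with an arbitrarily chosen $\theta_0 = 1$, then makes the decision once via the confidence interval and once via the rejection region.

```r
x     <- sleep$extra[sleep$group == "2"]
mu0   <- 1       # hypothesised value, chosen for illustration
alpha <- 0.05

res <- t.test(x, mu = mu0, alternative = "greater", conf.level = 1 - alpha)

# Decision via the right-tailed CI [LB, Inf): reject if mu0 < LB
reject_by_ci <- mu0 < res$conf.int[1]

# Decision via the rejection region: reject if t > t_{nu, 1 - alpha}
reject_by_stat <- unname(res$statistic > qt(1 - alpha, df = res$parameter))

c(reject_by_ci = reject_by_ci, reject_by_stat = reject_by_stat)
```

Both entries always agree, because $\theta_0 < \bar{x} - t_{\nu, 1-\alpha}\, s/\sqrt{n}$ is just a rearrangement of $t > t_{\nu, 1-\alpha}$.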



## **Performing `t.test` in R**

R provides a built-in function `t.test()` to perform hypothesis testing on means using the Student’s t-distribution. It can handle:

- One-sample t-tests
- Two-sample t-tests (equal or unequal variance)
- Paired-sample t-tests

It also handles small-sample confidence intervals on means (Gaussianity assumed).




### **Vector Interface `t.test` - Default Usage**

**Usage:**

```r
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0,
       paired = FALSE,
       var.equal = FALSE,
       conf.level = 0.95)
```

**Arguments:**

- `x`: numeric vector of data (one-sample or first sample)
- `y`: numeric vector of second sample (for two-sample tests)
- `mu`: Hypothesised mean for one-sample tests, or hypothesised difference for paired/two-sample tests
- `alternative`: `"two.sided"` (default), `"less"`, `"greater"`
- `paired`: TRUE for paired-sample test, FALSE for independent samples
- `var.equal`: TRUE for pooled-variance two-sample test, FALSE for Welch’s test
- `conf.level`: confidence level for interval (default 0.95)

#### **Reference Table**
| Scenario              | x             | y             | paired | var.equal | mu / difference under H0 |
|------------------------|---------------|---------------|--------|-----------|--------------------------|
| One-sample             | data          | -             | FALSE  | -         | hypothesised mean        |
| Two-sample Pooled      | sample1       | sample2       | FALSE  | TRUE      | hypothesised difference  |
| Two-sample Welch's       | sample1       | sample2       | FALSE  | FALSE     | hypothesised difference  |
| Paired-sample          | before        | after         | TRUE   | -         | hypothesised mean difference   |
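The four rows of the table translate into calls like the following. The data here are simulated purely for illustration (the names `sample1`, `sample2`, `before` and `after` mirror the table and carry no real measurements).

```r
set.seed(1)  # simulated, purely illustrative data
sample1 <- rnorm(8,  mean = 10, sd = 2)
sample2 <- rnorm(12, mean = 11, sd = 2)
before  <- rnorm(8,  mean = 70, sd = 5)
after   <- before + rnorm(8, mean = 2, sd = 1)

one_sample <- t.test(sample1, mu = 10)
pooled     <- t.test(sample1, sample2, var.equal = TRUE)
welch      <- t.test(sample1, sample2)              # var.equal = FALSE by default
paired     <- t.test(before, after, paired = TRUE)

# Degrees of freedom used by each test
c(one_sample = unname(one_sample$parameter),
  pooled     = unname(pooled$parameter),
  welch      = unname(welch$parameter),
  paired     = unname(paired$parameter))
```

The degrees of freedom match the reference table: $n - 1$ for the one-sample and paired tests, $n_1 + n_2 - 2$ for the pooled test, and a fractional value for Welch's test.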



### **Formula Interface**

The `t.test` function also provides a **formula interface**, which is often more convenient when working with `data.frame`s.  

- It is still the **same function** as the vector interface, but behaves differently depending on the input type.  
- Internally, `t.test` is an **S3 generic** in R, meaning it has different methods depending on whether you pass a numeric vector or a formula.  
- This is why we saw the **vector interface** earlier, and now we have the **formula interface**.

**Usage:**

```r
t.test(formula, data,
       subset,
       na.action,
       alternative = c("two.sided", "less", "greater"),
       mu = 0,
       paired = FALSE,
       var.equal = FALSE,
       conf.level = 0.95)
```

**Arguments:**

- `formula`: A formula describing the model, e.g. `response ~ group` for two-sample tests, or `response ~ 1` for one-sample tests  
- `data`: The data frame containing the variables in the formula  
- `subset`: Optional logical vector to select a subset of the data  
- `na.action`: How to handle missing values (`na.omit` by default)  
- `alternative`: `"two.sided"` (default), `"less"`, or `"greater"`  
- `mu`: Hypothesised mean for one-sample tests, or hypothesised difference for paired/two-sample tests  
- `paired`: `TRUE` for paired-sample test, `FALSE` for independent samples  
- `var.equal`: `TRUE` for pooled-variance two-sample test, `FALSE` for Welch’s test  
- `conf.level`: Confidence level for the interval (default 0.95)

A **`formula`** specifies which column contains the response variable (numeric values) and which column contains the grouping variable (a factor with exactly two levels).  
  - Typical form: `response ~ group`  
  - For a one-sample test, use `response ~ 1` (no grouping variable needed).  

Data should ideally be in **long format**, where one column specifies the measured values, and another column specifies the group/label. This is especially useful for multiple groups.


#### **Reference Table**

| Scenario              | Formula                         | paired | var.equal | mu / difference under H0 |
|-----------------------|---------------------------------|--------|-----------|--------------------------|
| One-sample            | response ~ 1                     | FALSE  | -         | hypothesised mean        |
| Two-sample Pooled     | response ~ group, var.equal=TRUE | FALSE  | TRUE      | hypothesised difference  |
| Two-sample Welch's    | response ~ group                 | FALSE  | FALSE     | hypothesised difference  |
| Paired-sample         | (response_after - response_before) ~ 1 | -      | -         | hypothesised mean difference |
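The rows of this table can be sketched as follows. The two-sample and one-sample calls use the built-in `sleep` dataset; the paired example uses a small simulated wide-format data frame (`wide`, with hypothetical columns `before`, `after` and `diff`), where the differences are computed first and then passed to a one-sample formula.

```r
# Two-sample tests on long-format data
pooled <- t.test(extra ~ group, data = sleep, var.equal = TRUE)
welch  <- t.test(extra ~ group, data = sleep)

# One-sample test on a single group
one_sample <- t.test(extra ~ 1, data = sleep[sleep$group == "1", ], mu = 1)

# Paired test via differences (simulated, purely illustrative data)
set.seed(2)
wide       <- data.frame(before = rnorm(8, mean = 70, sd = 5))
wide$after <- wide$before + rnorm(8, mean = 2, sd = 1)
wide$diff  <- wide$after - wide$before

paired_formula <- t.test(diff ~ 1, data = wide, mu = 0)
paired_vector  <- t.test(wide$after, wide$before, paired = TRUE)

# Both approaches give the same test statistic
all.equal(unname(paired_formula$statistic), unname(paired_vector$statistic))
```

Testing the differences against a hypothesised mean is exactly the paired test of scenario 2, which is why no `paired` argument is needed in the formula version.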





### **Examples**

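The `sleep` dataset contains 20 measurements of extra hours of sleep gained under two different drugs (10 per drug). R's documentation describes these as 10 patients each measured under both drugs (see the `ID` column); for teaching purposes, the examples below treat the two groups as separate sets of patients.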

Throughout this section, we use a 5% significance level ($\alpha = 0.05$, i.e. a 95% confidence level).

In [None]:
sleep %>% str()

#### **Example 1: One-Sample t-test**

Is there any evidence that the average extra sleep (in hours) of the first group is different from 1?



Let $x_1, \dots, x_{n_1}$ be the extra sleep hours gained by patients in the group `"1"` in the `sleep` dataset. We assume that

$$
x_1, \dots, x_{n_1} \sim \text{i.i.d. } \mathcal{N}(\mu, \sigma^2)
$$

We want to test whether there is any evidence that the average extra sleep (in hours) of the first group is different from 1. Formally, the hypotheses are:
$$
\begin{align}
H_0: \mu &= 1 \\
H_1: \mu &\neq 1
\end{align}
$$

Since the sample size is small, a $t$-test might be appropriate here.

In [None]:
sleep %>% filter(group == "1") -> group1
group1 %>% str()

The following code cells give equivalent `t.test` results.

In [None]:
t.test(group1$extra, mu = 1, alternative = "two.sided")

In [None]:
t.test(extra~1, data = group1, mu = 1, alternative = "two.sided")

##### **What Does the Summary from `t.test` Tell Us?**

`t.test()` by default generates a **statistical summary**, which includes:

- $t$-statistic: the value of the test statistic for the t-test  
- Degrees of freedom (df): for the $t$-distribution under the null hypothesis  
- p-value: the probability of observing a value as extreme (or more extreme) than the observed test statistic, assuming the null hypothesis is true  
  - If the **p-value is smaller than the significance level** $\alpha$, it is equivalent to rejecting the null hypothesis in the Neyman-Pearson framework. Quite handy, isn't it?

- Confidence interval (CI): a plausible range for population mean (by default, a symmetric CI at 95% confidence level).
  - If the hypothesised value under `H0` lies **outside** this interval, it corresponds to rejecting `H0` at the same significance level. **This is a nice connection between confidence interval and hypothesis testing**.
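As a sketch of how the reported p-value relates to the test statistic, the two-sided p-value can be recomputed from the $t$ distribution's CDF (`pt`). The example below is self-contained and uses group `"1"` of the built-in `sleep` dataset; the variable names are illustrative.

```r
x   <- sleep$extra[sleep$group == "1"]
res <- t.test(x, mu = 1, alternative = "two.sided")

t_obs <- unname(res$statistic)   # observed t-statistic
nu    <- unname(res$parameter)   # degrees of freedom

# Two-sided p-value: probability that |T| is at least |t_obs| under H0
p_manual <- 2 * pt(-abs(t_obs), df = nu)

c(p_manual = p_manual, p_t.test = res$p.value)

# The p-value rule and the rejection-region rule give the same decision
(p_manual < 0.05) == (abs(t_obs) > qt(0.975, df = nu))
```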

**Note that we have not spent a great deal of time talking about p-values in this unit (yet).** To make a decision (whether or not to reject the null hypothesis), we may need to check whether or not the $t$ test statistic is in the rejection region. You can save the `t.test` output to a variable and then use that to extract quantities of interest.





In [None]:
t.test(group1$extra, mu = 1, alternative = "two.sided") -> output
output$statistic # test statistic
output$parameter # degrees of freedom

We need to compare the absolute value of the $t$-statistic (as it is a two-sided test) to a critical value.  In this case, the critical value is the 97.5th percentile of the Student’s t-distribution with 9 degrees of freedom.  

If the absolute value of the t-statistic exceeds this critical value, we reject the null hypothesis at the 5% significance level.


In [None]:
abs(output$statistic) > qt(0.975, df = 9)
#Do not reject the null hypothesis

#### **Is The Gaussian Assumption Valid?**

We can use a QQ-plot (`qqnorm` together with `qqline`) to check the **validity of the Gaussian assumption**. (There are formal tests for normality, but these are out of the scope of this unit.)  

However, this approach is not perfect, especially with small sample sizes (here, n = 10). Looking at the QQ-plot below, the points are not perfectly aligned with the theoretical line, but this could simply be due to random variation. In this example, we do **not see any substantially different values**, which would indicate extreme values or outliers (hence, non-Gaussianity).  

Reading a QQ-plot is more of an **art than a strict rule**. You need to be flexible enough to recognise differences, while considering that small deviations may be due to chance.  

To master QQ-plot reading, try different distributions, explore various sample sizes, and observe how point alignment and deviations change. This experience will help you interpret QQ-plots more effectively.


In [None]:
qqnorm(group1$extra)
qqline(group1$extra)
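One possible way to practise, in the spirit of the advice above, is to draw small samples from known distributions and compare their QQ-plots side by side. The distributions, sample size and seed below are arbitrary choices for illustration.

```r
set.seed(42)   # arbitrary seed, for reproducibility only
n <- 10        # try larger values to see the patterns stabilise

samples <- list(
  "Normal"      = rnorm(n),
  "Exponential" = rexp(n, rate = 1),   # right-skewed
  "t (df = 2)"  = rt(n, df = 2)        # heavy-tailed
)

old_par <- par(mfrow = c(1, 3))        # three panels side by side
for (name in names(samples)) {
  qqnorm(samples[[name]], main = paste0(name, ", n = ", n))
  qqline(samples[[name]])
}
par(old_par)                           # restore the plotting layout
```

Rerunning the block with a different seed shows how much a QQ-plot of truly Gaussian data can vary at $n = 10$.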

#### **Example 2: Two-Sample t-test**

Is there a difference in the mean extra sleep hours between patients taking Drug `"1"` and Drug `"2"`?


Note that the `sleep` dataset is already in the long format, where `extra` is the response variable and `group` shows the labels.

In [None]:
sleep %>% kable()

Let $x_1, \dots, x_{n_1}$ and $y_1, \dots, y_{n_2}$  be the extra sleep hours gained by patients in the groups `"1"` and `"2"` in the `sleep` dataset. We assume that

$$
x_1, \dots, x_{n_1} \sim \text{i.i.d. } \mathcal{N}(\mu_1, \sigma_1^2)
$$

$$
y_1, \dots, y_{n_2} \sim \text{i.i.d. } \mathcal{N}(\mu_2, \sigma_2^2)
$$


We want to test whether there is any evidence that the average extra sleep hours of the two groups are different. Formally, the hypotheses are:
$$
\begin{align}
H_0: \mu_1 - \mu_2 &= 0 \\
H_1: \mu_1 - \mu_2 &\neq 0
\end{align}
$$

We treat this as **not a paired sample**: patients are assumed to be allocated into two separate groups, with no patient taking both drugs. It is also reasonable to assume that the two populations are **independent**, at least in this example, where we assume the study is well-designed; this assumption may not always hold in practice.  

*A quick detour from what we are doing…* Various factors can compromise independence, for example:  

- If patients know which drug they are taking, their behavior or reporting may be affected (placebo effect, expectations, or bias).  
- If there is some systematic way in which patients are assigned to groups, the two samples may not be truly independent.  

Always think critically about whether the **independence assumption** is valid for your study.


***Back to the question***, given the sample sizes, we can use a two-sample t-test. Which one to use? Let's take a look at box plots of `extra` for the two groups.

In [None]:
sleep %>%
  ggplot(aes(x = group, y = extra)) +
  geom_boxplot()

Group 2 seems to have a higher average of extra hours of sleep, and the IQR appears to be roughly double that of group 1. There do not seem to be any outliers in either group.  The median in group 2 appears to be slightly shifted towards the 25% quantile, which could suggest non-Gaussianity.

However, these might also be due to random variation, especially since the sample size for each group is quite small. We can also check the **rule of thumb** discussed earlier to decide whether to use the pooled-variance t-test or Welch's test.


In [None]:
sleep %>%
  group_by(group) %>%
  summarise(smpl_var = var(extra)) -> group_vars

group_vars

# Ratio of the larger to the smaller sample variance
max(group_vars$smpl_var) / min(group_vars$smpl_var)

As

$$
\frac{\text{larger } s^2}{\text{smaller } s^2} = \frac{4.009000}{3.200556} \approx 1.25 < 3,
$$  

we can use the pooled-variance t-test in this example. The following code cells give the same outputs.

In [None]:
t.test(x = sleep %>% filter(group == "1") %>% pull(extra),
       y = sleep %>% filter(group == "2") %>% pull(extra),
       mu = 0, alternative = "two.sided", var.equal = TRUE)

Here, the p-value is larger than $\alpha = 0.05$, so there is not enough evidence to reject $H_0$ at the 5% significance level. We can also compare the absolute value of the test statistic (two-sided test) to the 97.5th percentile of the Student’s t-distribution with 18 degrees of freedom (10 + 10 - 2).

In [None]:
t.test(extra ~ group, data = sleep, mu = 0, alternative = "two.sided", var.equal = TRUE) -> output
abs(output$statistic) > qt(0.975, df = 18)
# Do not reject the null hypothesis
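To connect this output back to the reference table, the pooled test statistic can also be computed by hand. This is a self-contained sketch using the same `sleep` data; `sp2` and `t_manual` are names introduced for illustration.

```r
x  <- sleep$extra[sleep$group == "1"]
y  <- sleep$extra[sleep$group == "2"]
n1 <- length(x)
n2 <- length(y)

# Pooled variance s_p^2 from the reference table (scenario 3)
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)

# Test statistic with d_0 = 0
t_manual <- (mean(x) - mean(y) - 0) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))

res <- t.test(x, y, var.equal = TRUE)
c(t_manual = t_manual, t_t.test = unname(res$statistic))
```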

### **Exercise**

Is there a difference in the mean extra sleep hours between patients taking Drug "1" and Drug "2"?  

Perform a **Welch’s t-test** to test this hypothesis using `t.test` function and compare the results to the previous two-sample t-test obtained using the pooled variance.




<details>
<summary>▶️ Click to show the solution</summary>

```r
t.test(extra ~ group, data = sleep, mu = 0, alternative = "two.sided", var.equal = FALSE) -> output
output
abs(output$statistic) > qt(0.975, df = output$parameter)
#Do not reject the null hypothesis
```

</details>


## **The T Statistical Table**



This table shows the critical t-values for the Student’s t-distribution at various degrees of freedom (df) and significance levels ($\alpha$). Historically, such tables were used in textbooks and labs before computers to quickly determine whether a t-statistic was extreme enough to reject the null hypothesis.  

Mathematically, each value corresponds to a quantile of the cumulative distribution function (CDF) of the $t$-distribution. For example, the entry for df = 10 and $\alpha = 0.05$ is the 95th percentile of the t-distribution with 10 degrees of freedom, which is the critical value for a **one-sided right-tailed test at the 5% significance level**, or equivalently, for a **two-sided test at the 10% significance level**.

This allows you to compare your observed t-statistic directly to the table value to decide whether to reject the null hypothesis.

**VERY IMPORTANT:** Note that variations of t-statistic tables exist. Some show **two-sided critical values** ($1 - \alpha/2$ quantile), some show **one-sided critical values** ($1 - \alpha$ quantile), and some show both. Be careful when using these tables, know which convention is being used, and ask for clarification if it is unclear.

**Note**: This table only shows integer $\nu$. If a fractional number of degrees of freedom is encountered (as in Welch’s test), take the floor value.

**UPDATE: YOU WON'T NEED TO USE ANY STATISTICAL TABLES IN THE FINAL EXAM**

In [None]:
generateTTable()
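Since each table entry is a quantile of the $t$ distribution, individual entries can be reproduced with `qt`. The sketch below builds a small table of one-sided critical values; the chosen degrees of freedom and $\alpha$ values are arbitrary, and the layout is not meant to match the output of `generateTTable()`.

```r
# Single entries
qt(0.95,  df = 10)   # one-sided test at 5%, or two-sided test at 10%
qt(0.975, df = 10)   # two-sided test at 5%

# A small table of one-sided critical values t_{nu, 1 - alpha}
dfs    <- c(5, 10, 20)
alphas <- c(0.10, 0.05, 0.025, 0.01)

tab <- outer(dfs, alphas, function(df, a) qt(1 - a, df))
dimnames(tab) <- list(df = as.character(dfs), alpha = as.character(alphas))
round(tab, 3)
```

For a two-sided test at level $\alpha$, read the column for $\alpha/2$ in a one-sided table such as this one.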

## **Workshop Questions**




### **Question 1: Which t-test to Use?**

For each of the following scenarios, decide which type of t-test should be used (one-sample, two-sample pooled, Welch’s, or paired-sample) and explain your reasoning.

**Scenarios:**

1. Measure the heart rate of participants before and after swimming.

2. A factory claims that the average weight of cereal boxes is 500 grams. You collect a sample of boxes and measure their weights.

3. Compare math test scores between students taught with method A and students taught with method B.

4. Compare math test scores of students who take both specialist maths and regular maths.

5.  Measure the heart rate of participants after drinking Coffee 1. One week later, the same participants drink Coffee 2 and their heart rate is measured again. Compare the heart rates between the two types of coffee.






<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!



</details>


### **Question 2**

A geologist collected twenty ore samples and randomly divided them into two separate groups. They then used two different techniques to measure the amount of titanium present in the samples. The data are:

In [None]:
group1 = c(0.011,0.013,0.013,0.015,0.014,0.013,0.010,0.013,0.011,0.012)
group2 = c(0.008,0.018,0.015,0.017,0.017,0.012,0.012,0.015,0.016,0.016)

Perform hypothesis testing to evaluate whether these two methods of taking measurements are equivalent. Now multiply the measurements by 1000. Do you expect your test results to be any different? Repeat the test and compare the results. Use $\alpha = 0.05$.



<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!

</details>


### **Question 3: Oh no!!!**


Oh no!!! The geologist in **Question 2** realises they made a mistake: each ore sample was in fact measured using both instruments, so the $i$-th entries of `group1` and `group2` come from the same sample. Given this new information, perform hypothesis testing to evaluate whether these two methods of taking measurements are equivalent. Use $\alpha = 0.05$.

In [None]:
group1 = c(0.011,0.013,0.013,0.015,0.014,0.013,0.010,0.013,0.011,0.012)
group2 = c(0.008,0.018,0.015,0.017,0.017,0.012,0.012,0.015,0.016,0.016)



<details>
<summary>▶️ Click to show the solution</summary>


Solution will be released at the end of the week!
</details>


### **Question 4**


Consider the following data on the weights (in grams) of packages of mince. Is there any evidence that the average weight of a package of mince is less than 500g, assuming a Type I error rate of $0.01$?

In [None]:
smpl_data = c(490.32, 449.46, 440.38, 535.72, 640.14, 581.12, 376.82, 481.24, 517.56,
              626.52, 340.50, 435.84, 490.32, 394.98, 404.06, 404.06, 435.84, 508.48,
              508.48, 422.22, 562.96, 404.06, 517.56, 417.68, 535.72, 531.18)



<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!
</details>


### **Question 5**

The classical $t$-test assumes that the data are normally distributed. In practice, this assumption often does not hold, which can substantially affect the Type I error rate (the probability of incorrectly rejecting a true null hypothesis).

Suppose we generate 10 independent observations from an exponential distribution with rate parameter $\lambda = \frac{1}{2}$. The expected value of this exponential random variable is $\frac{1}{\lambda} = 2$.

We want to test:

$$
\begin{align}
H_0: \mu &= 2 \text{ (the true mean)}\\
H_1: \mu &\neq 2 \\
\end{align}
$$

At the 5% significance level, if the $t$-test worked perfectly, we would expect the Type I error rate to be exactly 0.05.

#### **Question 5.1**

At the 5% significance level, repeat this process many times (e.g., 10,000 repetitions) to estimate the empirical Type I error rate.

In [None]:
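# Draw 10 observations from an exponential distribution with rate 1/2 (true mean 2)
smpl_data = rexp(10, rate = 1/2)
conf.int = t.test(smpl_data, mu = 2)$conf.int
# TRUE if the 95% CI contains the true mean (H0 not rejected); FALSE is a Type I error
conf.int[1] < 2 & 2 < conf.int[2]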


<details>
<summary>▶️ Click to show the solution</summary>

Solution will be released at the end of the week!
</details>

#### **Question 5.2**

Repeat the simulation from **Question 5.1**, but now generate 10 observations from a Poisson distribution with $\lambda = 20$ and estimate the Type I error of the one-sample $t$-test at the 5% significance level. Compare the estimated Type I error to the one obtained before and explain the difference.

**Hint**: Examine the histograms of large samples from these two distributions to see their approximate shapes.


<details>
<summary>▶️ Click to show the solution</summary>
Solution will be released at the end of the week!

</details>