## What is Hypothesis Testing?

**Hypothesis Testing** is a statistical method to decide whether there’s enough evidence in your data to support a specific claim about a population.

* **Purpose:** It helps you decide between “nothing interesting is going on” vs. “there is a real effect.”
* **Process:**

  1. Make a claim (Null Hypothesis $H_0$ and Alternative Hypothesis $H_1$).
  2. Collect data.
  3. Use a statistical test (t-test, z-test, etc.).
  4. Compare the result to a pre-chosen threshold: the **p-value** against the significance level $\alpha$, or the test statistic against a **critical value**.
  5. Conclude whether to reject $H_0$ or not.

---

## Null Hypothesis ($H_0$)

The **Null Hypothesis** is the default assumption — there’s no difference, no effect, or nothing unusual happening.

In the drug comparison example used below:

$$
H_0: \mu_A = \mu_B
$$

Meaning: **Drug A and Drug B have the same average effect** on patients.

---

## Alternative Hypothesis ($H_1$ or $H_a$)

The **Alternative Hypothesis** is what you suspect might be true. It contradicts $H_0$, and its form determines whether the test is two-tailed or one-tailed.

For example:

* **Two-tailed test**:

  $$
  H_1: \mu_A \neq \mu_B
  $$

  (Drug A and Drug B have different effects, but we don’t specify which is better.)

* **One-tailed test**:

  $$
  H_1: \mu_A > \mu_B
  $$

  (Drug A is more effective than Drug B.)
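
To see how the choice of tail changes the p-value, here is a minimal sketch using SciPy's Student t distribution. The t statistic and degrees of freedom are hypothetical values chosen only for illustration.

```python
from scipy import stats

# Hypothetical values, for illustration only.
t_stat = 1.9   # observed t statistic
df = 20        # degrees of freedom

# One-tailed (H1: mu_A > mu_B): probability of a t at least this large.
p_one_tailed = stats.t.sf(t_stat, df)

# Two-tailed (H1: mu_A != mu_B): a t at least this extreme in either direction.
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)

print(f"one-tailed p = {p_one_tailed:.4f}")
print(f"two-tailed p = {p_two_tailed:.4f}")
```

Because the t distribution is symmetric, the two-tailed p-value is twice the one-tailed one when the effect lies in the hypothesised direction. That is why the choice of tail should be made before looking at the data.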

---

## Example — Comparing Drug A and Drug B

**Scenario:**
We have two groups of patients. One group takes **Drug A**, the other takes **Drug B**. The outcome measure is a reduction in blood pressure (in mmHg).

| Group  | Sample Size $n$ | Mean Reduction $\bar{x}$ | Standard Deviation $s$ |
| ------ | --------------- | ------------------------ | ---------------------- |
| Drug A | 10              | 8.2 mmHg                 | 2.5                    |
| Drug B | 10              | 6.5 mmHg                 | 2.0                    |

---

### Step 1: Set Hypotheses

Two-tailed test:

$$
H_0: \mu_A = \mu_B
$$

$$
H_1: \mu_A \neq \mu_B
$$

---

### Step 2: Choose Significance Level

We choose $\alpha = 0.05$, meaning we accept a 5% chance of rejecting $H_0$ when it is actually true.

---

### Step 3: Calculate the Test Statistic (Independent t-test)

Formula (two independent samples, unpooled standard error):

$$
t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}
$$

Plug in:

* Difference in means: $8.2 - 6.5 = 1.7$
* Variances: $s_A^2 = 6.25,\ s_B^2 = 4.00$
* Standard error:

$$
SE = \sqrt{\frac{6.25}{10} + \frac{4.00}{10}} = \sqrt{0.625 + 0.4} = \sqrt{1.025} \approx 1.012
$$

* $t$-value:

$$
t = \frac{1.7}{1.012} \approx 1.68
$$
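
The same arithmetic can be sketched in a few lines of Python using only the standard library. The variable names are illustrative, and the printed values should agree with the hand calculation above.

```python
import math

# Summary statistics from the table above.
mean_a, sd_a, n_a = 8.2, 2.5, 10   # Drug A
mean_b, sd_b, n_b = 6.5, 2.0, 10   # Drug B

# Standard error of the difference in means (unpooled variances).
se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)

# t statistic: observed difference divided by its standard error.
t_stat = (mean_a - mean_b) / se

print(f"SE = {se:.3f}, t = {t_stat:.2f}")
```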

---

### Step 4: Determine the Critical Value

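With equal group sizes, the pooled and unpooled $t$ statistics coincide, so we use the pooled degrees of freedom $df = n_A + n_B - 2 = 18$. At $\alpha = 0.05$ (two-tailed), the critical $t$ is about **±2.101**. (Welch's approximation gives $df \approx 17.2$ and a critical value of about 2.11, which leads to the same decision.)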
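
Instead of a printed table, the critical value can be looked up with SciPy. The sketch below assumes SciPy is installed. It computes the pooled degrees of freedom alongside the Welch-Satterthwaite approximation for comparison.

```python
from scipy import stats

alpha = 0.05

# Pooled-variance t-test: df = n_A + n_B - 2.
n_a, n_b = 10, 10
df_pooled = n_a + n_b - 2
t_crit_pooled = stats.t.ppf(1 - alpha / 2, df_pooled)

# Welch-Satterthwaite approximation, which goes with the unpooled standard error.
var_a, var_b = 2.5**2, 2.0**2
df_welch = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)
t_crit_welch = stats.t.ppf(1 - alpha / 2, df_welch)

print(f"pooled: df = {df_pooled}, critical t = {t_crit_pooled:.3f}")
print(f"Welch:  df = {df_welch:.1f}, critical t = {t_crit_welch:.3f}")
```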

---

### Step 5: Make the Decision

* Our $|t| = 1.68$ is **less** than the critical value 2.101.
* ✅ **We fail to reject $H_0$** → No statistically significant difference detected at the 5% level.
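
An equivalent way to reach the decision is to compare the p-value with $\alpha$. A minimal sketch, assuming SciPy's `ttest_ind_from_stats`, which works directly from summary statistics: `equal_var=True` selects the pooled test with $df = 18$, and with equal group sizes its $t$ statistic is identical to the unpooled formula in Step 3.

```python
from scipy import stats

alpha = 0.05

# Two-sample t-test computed from the summary statistics in the table.
result = stats.ttest_ind_from_stats(
    mean1=8.2, std1=2.5, nobs1=10,   # Drug A
    mean2=6.5, std2=2.0, nobs2=10,   # Drug B
    equal_var=True,                  # pooled variance, df = 18
)

reject_h0 = result.pvalue < alpha
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}, reject H0: {reject_h0}")
```

A p-value above $\alpha$ corresponds to $|t|$ falling below the critical value, so both routes give the same decision.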

---

### Step 6: Interpretation

* **Statistical view:** A difference as large as the observed 1.7 mmHg could plausibly arise from random variation alone when $H_0$ is true.
* **Practical view:** This doesn’t *prove* the drugs are identical, only that we don’t have enough evidence to declare a difference at the 5% significance level.
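
One way to make this interpretation concrete is a 95% confidence interval for $\mu_A - \mu_B$. This is an addition to the steps above, not part of them. The sketch assumes SciPy and reuses the standard error from Step 3 with $df = 18$.

```python
import math
from scipy import stats

diff = 8.2 - 6.5                             # observed difference in means (mmHg)
se = math.sqrt(2.5**2 / 10 + 2.0**2 / 10)    # standard error from Step 3
t_crit = stats.t.ppf(0.975, 18)              # two-tailed critical value, df = 18

ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
print(f"95% CI for mu_A - mu_B: ({ci_low:.2f}, {ci_high:.2f}) mmHg")
```

An interval that contains 0 is consistent with failing to reject $H_0$, while its width shows how large a difference the data cannot rule out.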

---

**Quick summary:**

* **Hypothesis testing** = a structured decision-making process using data.
* **Null hypothesis** = “No difference/effect.”
* **Alternative hypothesis** = “Difference/effect exists.”
* In our example, we compared Drug A vs Drug B using a t-test and concluded no significant difference at $\alpha=0.05$.

---



## **Practical Hypothesis Testing Example in a Deep Learning Workflow**


You’re building a **deep learning classifier** (e.g., ResNet50) for an image dataset.
You try two models:

* **Model A** – Your current baseline (ResNet50 with standard training)
* **Model B** – A new approach (ResNet50 with label smoothing + mixup)

After training each model **five times**, using a different random seed for each run and the same five seeds for both models, you get these test accuracies:

| Run | Model A Accuracy (%) | Model B Accuracy (%) |
| --- | -------------------- | -------------------- |
| 1   | 85.1                 | 87.0                 |
| 2   | 84.7                 | 86.5                 |
| 3   | 85.3                 | 87.4                 |
| 4   | 84.9                 | 86.9                 |
| 5   | 85.0                 | 87.1                 |

---

## **Step 1 – Define the Hypotheses**

* **H₀ (Null Hypothesis):** There is **no difference** in mean accuracy between Model A and Model B.
  $\mu_A = \mu_B$

* **H₁ (Alternative Hypothesis):** There **is** a difference in mean accuracy.
  $\mu_A \neq \mu_B$

---

## **Step 2 – Compute the differences**

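Because run $i$ of both models uses the **same seed**, the runs form natural pairs, so we can use a **paired t-test**. Pairing removes the seed-to-seed variation shared by both models, which reduces the variance of the estimated difference.

Differences (B - A), in percentage points: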

| Run | Difference (%) |
| --- | -------------- |
| 1   | 1.9            |
| 2   | 1.8            |
| 3   | 2.1            |
| 4   | 2.0            |
| 5   | 2.1            |
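
To illustrate why pairing helps, the sketch below (assuming NumPy) computes the per-run differences and compares the standard error of the mean difference under a paired and an unpaired treatment of the same data.

```python
import numpy as np

# Test accuracies (%) from the table above; run i of each model shares a seed.
acc_a = np.array([85.1, 84.7, 85.3, 84.9, 85.0])
acc_b = np.array([87.0, 86.5, 87.4, 86.9, 87.1])
n = len(acc_a)

diff = acc_b - acc_a   # per-run differences (B - A)

# Standard error of the mean difference when runs are treated as paired...
se_paired = diff.std(ddof=1) / np.sqrt(n)
# ...versus as two independent samples.
se_unpaired = np.sqrt(acc_a.var(ddof=1) / n + acc_b.var(ddof=1) / n)

print("differences:", np.round(diff, 1))
print(f"SE paired = {se_paired:.3f}, SE unpaired = {se_unpaired:.3f}")
```

The paired standard error is smaller here because the two models' accuracies move together from seed to seed.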

---

## **Step 3 – Compute Mean & Std of Differences**

Mean difference:

$$
\bar{d} = \frac{1.9 + 1.8 + 2.1 + 2.0 + 2.1}{5} = \frac{9.9}{5} = 1.98
$$

Sample variance of differences:

* Deviations from mean: $(-0.08, -0.18, 0.12, 0.02, 0.12)$
* Squared deviations: $(0.0064, 0.0324, 0.0144, 0.0004, 0.0144)$
* Variance:

$$
s_d^2 = \frac{\sum (d_i - \bar{d})^2}{n - 1} = \frac{0.068}{4} = 0.017
$$

Std deviation:

$$
s_d = \sqrt{0.017} \approx 0.130
$$
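
These summary values can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the same $n - 1$ denominator as the formula above.

```python
import statistics

# Per-run accuracy differences (B - A), in percentage points.
diffs = [1.9, 1.8, 2.1, 2.0, 2.1]

mean_d = statistics.mean(diffs)       # sample mean
var_d = statistics.variance(diffs)    # sample variance (divides by n - 1)
sd_d = statistics.stdev(diffs)        # sample standard deviation

print(f"mean = {mean_d:.2f}, variance = {var_d:.3f}, std = {sd_d:.3f}")
```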

---

## **Step 4 – Paired t-test statistic**

Formula:

$$
t = \frac{\bar{d}}{s_d / \sqrt{n}}
$$

Substituting the values from Step 3, and keeping one more digit of $s_d \approx 0.1304$ to limit rounding error:

$$
t = \frac{1.98}{0.1304 / \sqrt{5}} = \frac{1.98}{0.0583} \approx 34.0
$$

---

## **Step 5 – Critical value & decision**

Degrees of freedom: $df = n - 1 = 4$

At $\alpha = 0.05$ (two-tailed), $t_{\text{critical}} \approx 2.776$

Since **34.0 > 2.776**, we **reject H₀**.
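
In practice the whole test is usually run with a library call. A minimal sketch, assuming SciPy's `ttest_rel` for the paired test and `t.ppf` for the critical value. Because it works at full floating-point precision, its $t$ can differ in the last digit from a hand calculation that rounds intermediate values.

```python
import numpy as np
from scipy import stats

acc_a = np.array([85.1, 84.7, 85.3, 84.9, 85.0])   # Model A accuracies (%)
acc_b = np.array([87.0, 86.5, 87.4, 86.9, 87.1])   # Model B accuracies (%)
alpha = 0.05

# Paired t-test on the per-run differences (B - A), two-sided by default.
result = stats.ttest_rel(acc_b, acc_a)

# Two-tailed critical value for df = n - 1.
df = len(acc_a) - 1
t_crit = stats.t.ppf(1 - alpha / 2, df)

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.2e}")
print(f"critical t = {t_crit:.3f}, reject H0: {abs(result.statistic) > t_crit}")
```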

---

## **Step 6 – Conclusion**

We have **strong statistical evidence** that the mean accuracies differ, and the positive mean difference (+1.98 percentage points) shows that **Model B outperforms Model A**.

---

## Why this matters in deep learning

This kind of hypothesis testing helps you:

* Avoid claiming improvement from **random noise** (due to weight initialization, mini-batch order, etc.)
* Make **scientifically sound** comparisons between models
* Justify model changes for deployment or publication
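
The first point can be illustrated with a small simulation. This is a sketch under stated assumptions: both simulated "models" have the same true mean accuracy, and the mean (85.0), noise level (0.3), and number of simulated experiments are hypothetical choices rather than values from the example above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
n_experiments, n_runs, alpha = 2000, 5, 0.05

# Simulated accuracies for two models with the SAME true mean (H0 is true).
# The mean (85.0) and noise level (0.3) are hypothetical.
acc_a = rng.normal(85.0, 0.3, size=(n_experiments, n_runs))
acc_b = rng.normal(85.0, 0.3, size=(n_experiments, n_runs))

# Naive rule: declare an improvement whenever B's mean accuracy is higher.
naive_rate = np.mean(acc_b.mean(axis=1) > acc_a.mean(axis=1))

# Paired t-test in each simulated experiment.
p_values = stats.ttest_rel(acc_b, acc_a, axis=1).pvalue
test_rate = np.mean(p_values < alpha)

print(f"naive rule claims an improvement in {naive_rate:.1%} of experiments")
print(f"t-test claims a difference in {test_rate:.1%} of experiments")
```

With no real difference, the naive rule should report an "improvement" about half the time, while the t-test should flag a difference in roughly a fraction $\alpha$ of the experiments.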

---

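## Possible Variations

* If you compare **three or more models**, use **ANOVA** instead of running many pairwise t-tests (a repeated-measures ANOVA when the runs are paired by seed).
* If the paired differences aren’t approximately normally distributed, use a **Wilcoxon signed-rank test** instead. With only 5 pairs, its smallest possible exact two-sided p-value is $2/2^5 = 0.0625$, so more runs are needed to reach significance at $\alpha = 0.05$.
* If you run several tests (for example, across **different datasets**), adjust for multiple comparisons (Bonferroni, Holm-Bonferroni).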
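
As an illustration of the last point, here is a minimal pure-Python sketch of the Holm-Bonferroni step-down procedure. The function name `holm_bonferroni` and the p-values are hypothetical, made up for this example.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where H0 is rejected under Holm's method."""
    m = len(p_values)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared to alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break   # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values from comparing two models on three datasets.
p_values = [0.010, 0.040, 0.030]
print(holm_bonferroni(p_values))
```

Plain Bonferroni compares every p-value to $\alpha / m$. Holm's method relaxes the threshold step by step, so it rejects at least as many hypotheses while still controlling the family-wise error rate.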

---