###  **What is a p-value?**

A **p-value** is a number between **0 and 1** that tells us **how surprising a result would be if only random chance were at work**. The smaller it is, the harder the result is to explain by chance alone.

It comes into play when comparing two things—like **drug A vs. drug B**—and asking:

> “Is the difference in outcomes real, or could it just be due to random variation?”

---

### **Example from the video**

1. Initially, Drug A cures 1 person, and Drug B doesn’t cure another.

   * Can we say Drug A is better? **No.** Too small a sample.
2. Eventually, we try both drugs on **a lot of people**:

   * Drug A cures **1043/1046** (≈ 99.7%)
   * Drug B cures **2/1434** (≈ 0.1%)

Now, the difference is **too large** to plausibly be due to chance. We say Drug A is better.

But what if the numbers were:

* Drug A: **37% cured**
* Drug B: **31% cured**

Now it’s **less clear**. There is a difference, but **is it statistically significant**? That’s where the **p-value** helps.

---

###  **Interpreting the p-value**

* A **small p-value** (close to 0) means:

  * A difference this large would be **unlikely** if there were really no difference.
  * We can **reject the null hypothesis** (which says "no difference").

* A **large p-value** (close to 1) means:

  * The results are **consistent with chance alone**.
  * We **fail to reject the null hypothesis**.

---

###  **Common Threshold (α level)**

The most common threshold is **0.05**:

* If **p < 0.05**, we say the difference is **statistically significant**.
* This means a difference at least this large would show up less than **5% of the time** if only randomness were at work.

 **But!**

* This does **not** mean there is a 95% chance the result is true.
* It means: "If there were no real difference, there'd be at most a 5% chance of getting results at least this extreme by random chance."

---

###  **False Positives**

* A p-value **below 0.05** when there's **actually no difference** is called a **false positive**.
* Using a threshold of **0.05** means you'd expect about **5 false positives in every 100 experiments** where there's no actual difference.

You can move the threshold to trade off this risk:

* Use a threshold of **0.01** or **0.00001** for stricter confidence (e.g., in medicine).
* Use a threshold of **0.2** for more tolerance (e.g., guessing when an ice cream truck arrives).
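The false-positive rate can be checked with a small simulation. The sketch below assumes made-up settings (10,000 simulated experiments, 30 samples per group, standard normal data); both groups are drawn from the same distribution, so every "significant" result is a false positive by construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n_experiments = 10_000  # hypothetical number of repeated experiments
n_per_group = 30        # hypothetical sample size per group
alpha = 0.05            # significance threshold

# Both groups come from the SAME distribution, so the null hypothesis is true
group_a = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_per_group))
group_b = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_per_group))

# One two-sample t-test per row, i.e. per simulated experiment
_, p_values = stats.ttest_ind(group_a, group_b, axis=1)

# Fraction of experiments that look "significant" despite no real difference
false_positive_rate = np.mean(p_values < alpha)
print(f"Fraction of experiments with p < {alpha}: {false_positive_rate:.3f}")
```

The printed fraction should land close to the chosen threshold; lowering `alpha` lowers it accordingly.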

---

###  **Two final key points:**

1. **P-value ≠ Effect Size**

   * A small p-value doesn’t mean the difference is big.
   * With **large samples**, even tiny differences can be statistically significant.

2. **Hypothesis Testing**

   * The p-value is used to **test a hypothesis**.
   * The **null hypothesis** assumes "no difference."
   * A small p-value suggests we **reject the null hypothesis**.
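To see how sample size drives the p-value, here is a minimal sketch that applies a pooled two-proportion z-test to the 37% vs. 31% cure rates from the example above. The group sizes (100 and 10,000 per group) and the helper name `two_proportion_p_value` are assumptions chosen for illustration.

```python
import math
from scipy.stats import norm

def two_proportion_p_value(rate_a, rate_b, n_per_group):
    """Two-sided p-value of a pooled two-proportion z-test (equal group sizes)."""
    # With equal group sizes the pooled rate is the simple average
    pooled = (rate_a + rate_b) / 2
    standard_error = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (rate_a - rate_b) / standard_error
    # sf(z) = 1 - CDF(z); doubling it covers both tails
    return 2 * norm.sf(abs(z))

# Same effect size (37% vs. 31%), very different sample sizes
p_small = two_proportion_p_value(0.37, 0.31, 100)
p_large = two_proportion_p_value(0.37, 0.31, 10_000)

print(f"n = 100 per group:    p = {p_small:.3f}")
print(f"n = 10000 per group:  p = {p_large:.2e}")
```

The effect size is identical in both calls; only the amount of data changes, which is why the p-value alone says nothing about how big the difference is.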

---

### Summary:

| Concept                    | Meaning                                                              |
| -------------------------- | -------------------------------------------------------------------- |
| **P-value**                | Probability of observing your data (or more extreme) if null is true |
| **Small p-value (< 0.05)** | Evidence **against** null → suggests a **real difference**           |
| **Large p-value**          | Not enough evidence to say there's a difference                      |
| **Doesn’t measure**        | Size of the effect or how important it is                            |

---

## **How to Calculate p-value**


The way you calculate the p-value depends on:

1. The **test** you're using (e.g., t-test, z-test, chi-square test, etc.)
2. Your **null hypothesis** (H₀)
3. The **distribution** of the test statistic under the null

---

##  **Example: Two-sample t-test** (comparing means)

Let’s say:

* Group A (Drug A) has: $n_1 = 30$, mean $\bar{x}_1 = 70$, std dev $s_1 = 10$
* Group B (Drug B) has: $n_2 = 30$, mean $\bar{x}_2 = 65$, std dev $s_2 = 12$

You want to test whether the mean outcomes of **Drug A** and **Drug B** differ significantly (a two-sided test).

###  Step-by-step:

---

### **Step 1: Define hypotheses**

* H₀: $\mu_1 = \mu_2$ (no difference)
* H₁: $\mu_1 \neq \mu_2$ (two-sided)

---

### **Step 2: Compute the t-statistic**

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
= \frac{70 - 65}{\sqrt{\frac{10^2}{30} + \frac{12^2}{30}}}
= \frac{5}{\sqrt{3.33 + 4.8}} = \frac{5}{\sqrt{8.13}} \approx \frac{5}{2.85} \approx 1.75
$$

---

### **Step 3: Degrees of freedom** (Welch’s approximation)

$$
df \approx \frac{
\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2
}{
\frac{\left( \frac{s_1^2}{n_1} \right)^2}{n_1 - 1}
+ \frac{\left( \frac{s_2^2}{n_2} \right)^2}{n_2 - 1}
}
$$

Plugging in the numbers gives **df ≈ 56** (about 56.2).

---

### **Step 4: Calculate the p-value**

Now use the **t-distribution** with about 56 degrees of freedom:

* Two-tailed p-value:

$$
p = 2 \cdot P(T > |1.75|) \approx 2 \cdot 0.0425 = 0.085
$$

---

###  **Interpretation**

* Since **p ≈ 0.085 > 0.05**, you **fail to reject** the null.
* There's **not enough evidence** to say the drugs are significantly different.
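The four steps above can also be reproduced by hand in a few lines. This is a minimal sketch using the same summary statistics; the variable names are illustrative, and `scipy.stats.t.sf` (the survival function, i.e. 1 − CDF) supplies the tail probability.

```python
import math
from scipy import stats

# Summary statistics from the example
mean1, std1, n1 = 70, 10, 30
mean2, std2, n2 = 65, 12, 30

# Per-group variance of the sample mean: s^2 / n
v1 = std1**2 / n1
v2 = std2**2 / n2

# Step 2: Welch t-statistic
t_manual = (mean1 - mean2) / math.sqrt(v1 + v2)

# Step 3: Welch-Satterthwaite degrees of freedom
df_manual = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Step 4: two-tailed p-value from the t-distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

print(f"t = {t_manual:.2f}, df = {df_manual:.1f}, p = {p_manual:.3f}")
```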

---




```python
import scipy.stats as stats

# Sample data
mean1, std1, n1 = 70, 10, 30
mean2, std2, n2 = 65, 12, 30

# Two-sample t-test (unequal variance)
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=mean1, std1=std1, nobs1=n1,
    mean2=mean2, std2=std2, nobs2=n2,
    equal_var=False  # Welch's t-test
)

print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_value:.4f}")
```

Output:

```text
T-statistic: 1.75
P-value: 0.0850
```


---

##  Other Common Tests

| Scenario                            | Test Name             | How p-value is computed           |
| ----------------------------------- | --------------------- | --------------------------------- |
| Compare two means                   | t-test                | From t-distribution               |
| Compare two proportions             | z-test                | From standard normal distribution |
| Compare categorical distributions   | Chi-square test       | From chi-square distribution      |
| Compare survival curves             | Log-rank test         | From chi-square distribution      |
| Regression coefficient significance | t-test for regression | From t-distribution               |
| Model fit or independence           | Likelihood ratio test | From chi-square or F-distribution |
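As one illustration of the table, the large drug trial from the first example (Drug A cured 1043 of 1046, Drug B cured 2 of 1434) can be run through a chi-square test of independence. This is a sketch, not the only reasonable test for these counts; the variable names are illustrative.

```python
from scipy.stats import chi2_contingency

# Rows: Drug A, Drug B; columns: cured, not cured
table = [
    [1043, 1046 - 1043],
    [2, 1434 - 2],
]

# Returns the test statistic, p-value, degrees of freedom, and expected counts
chi2_stat, p_chi2, dof, expected = chi2_contingency(table)

print(f"chi-square = {chi2_stat:.1f}, dof = {dof}")
print(f"p-value = {p_chi2}")  # so small that it may print as 0.0
```

With a gap this large, the p-value is far below any threshold discussed here, which matches the intuition that the difference cannot reasonably be attributed to chance.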


## P values

The **$ p $-value** is used in statistics to measure how surprising or unlikely your data are under the null hypothesis.

### The intuition behind P values
Imagine you have a coin and you suspect it might be biased towards heads. To test this, you decide to flip the coin 100 times. Your null hypothesis (the assumption you're testing against) is that the coin is fair, meaning it has an equal chance of landing heads or tails.

After flipping the coin 100 times, suppose you get an unusually high number of heads, say 70 heads and 30 tails. You might start to think this result is pretty strange if the coin were truly fair.

Here's where the **$ p $-value** comes in. The **$ p $-value** is a number between 0 and 1 that tells you how likely it is to see a result as extreme as yours (or more extreme) if the null hypothesis were true. In our example, if the **$ p $-value** is very low (let's say 0.01), it means that getting 70 or more heads out of 100 flips would be very unlikely if the coin were fair. A low **$ p $-value** suggests that your assumption (the null hypothesis) that the coin is fair might not be right.


- A low **$ p $-value** (typically below a threshold such as 0.05, or 5%) suggests that the observed data are inconsistent with the null hypothesis, so you might reject the null hypothesis in favor of the alternative hypothesis (which is the hypothesis that there is an effect or a difference).
- A high **$ p $-value** means you don't have enough statistical evidence to reject the null hypothesis.

However, a low p-value doesn't prove the alternative hypothesis is true. It only suggests that the data you observed are unlikely under the assumption that the null hypothesis is true. Other factors, like the design of the experiment and assumptions of the statistical test, also play critical roles in the interpretation of **$ p $-value**.
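For the coin example above, the p-value can be computed exactly from the binomial distribution rather than assumed. A minimal sketch, using `scipy.stats.binom.sf` for the upper tail (the variable names are illustrative):

```python
from scipy.stats import binom

n_flips = 100
n_heads = 70

# One-sided p-value: P(X >= 70) for a fair coin.
# sf(k) returns P(X > k), so pass n_heads - 1 to include 70 itself.
p_one_sided = binom.sf(n_heads - 1, n_flips, 0.5)

# A fair coin's distribution is symmetric, so the two-sided p-value
# (70 or more heads, or 70 or more tails) is twice the one-sided value.
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"One-sided p-value: {p_one_sided:.2e}")
print(f"Two-sided p-value: {p_two_sided:.2e}")
```

The one-sided version matches the suspicion in the example (bias towards heads); the two-sided version would apply if bias in either direction were of interest.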

### Technical way of describing the p-value

The **$ p $-value** is the probability of observing a test statistic as extreme as, or more extreme than, the statistic computed from the data, assuming that the null hypothesis is true. 


**Test Statistic**: 
Test statistics are numerical values used in statistical testing to decide whether to reject the null hypothesis. The choice of test statistic depends on the type of data and the hypothesis being tested. Here's a list of common test statistics used in various statistical tests:

1. **Z-statistic**: Used in Z-tests when testing hypotheses concerning population proportions or means, particularly when the sample size is large and the population variance is known.

2. **T-statistic**: Used in t-tests when testing hypotheses about means, especially when the population variance is unknown and the sample size is small. There are different forms of t-tests, including one-sample, two-sample, and paired t-tests, each with its own t-statistic formula.

3. **Chi-square statistic $\chi^2$**: Employed in chi-square tests for independence, goodness of fit, or homogeneity. It tests hypotheses about frequency counts of categorical data to see if observed frequencies differ from expected frequencies.

4. **F-statistic**: Used in ANOVA (Analysis of Variance) tests, where it compares the variance between group means to the variance within groups to see if at least one group mean differs significantly from the others.

5. **U-statistic**: Utilized in Mann-Whitney U tests (a non-parametric test) to compare differences between two independent groups when the assumption of normality is not met.

For instance, in our coin flip example, the test statistic could be the number of heads observed.



**Observing a Test Statistic as Extreme as, or More Extreme Than, the Statistic Computed from the Data**: This means comparing what you actually observed in your experiment or study to what you would expect under the null hypothesis. "As extreme as, or more extreme than" refers to outcomes that are at least as unlikely as the actual outcome you got, given that the null hypothesis is true.

**Assuming That the Null Hypothesis is True**: The calculation of the p-value is done under the assumption that the null hypothesis is correct. This is crucial because the p-value is meant to test the strength of evidence against the null hypothesis.

**Probability**: The **$ p $-value** itself is a probability. It measures the likelihood of observing your actual test statistic (or one more extreme) purely by chance if the null hypothesis were true. 

## Example Drug A and Drug B

Imagine you have **Drug A** and **Drug B** and you test them on two patients. Can we say that **Drug A** works just because it worked on **patient 1** while **Drug B** did not work on **patient 2**? No: several other factors might have contributed to that result. So now let's try the drugs on $2000$ patients. Suppose **Drug A** cures $97\%$ of people while **Drug B** cures only $3\%$; it is now unrealistic that the result was random and that there is no difference between them. Now imagine instead that the success rate of **Drug A** is $37\%$ and that of **Drug B** is $31\%$ on $50$ patients.

So given that no study is perfect and there are always a few random things that change the result, how can we become confident that Drug A is superior?

That's where the **$ p $-value** comes in. **$ p $-values** are numbers between 0 and 1 that quantify how strong the evidence is that **Drug A** is different from **Drug B**.
The closer a **$ p $-value** is to 0, the more confident we are that **Drug A** and **Drug B** are different.


In practice, the commonly used threshold is $0.05$, meaning that if there is no difference between **Drug A** and **Drug B** and we repeated the exact same experiment many times, only $5\%$ of those experiments would result in the wrong decision. Suppose we repeat the experiment several times and get the following (**$ p $-values** calculated using Fisher's exact test):



| Drug A cured | Drug A not cured | Drug B cured | Drug B not cured | p-value |
|--------------|------------------|--------------|------------------|---------|
| 73           | 125              | 71           | 127              | 0.9     |
| 71           | 127              | 72           | 126              | 1.0     |
| 75           | 123              | 70           | 128              | 0.7     |
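The p-values in the table can be recomputed with `scipy.stats.fisher_exact`. The sketch below feeds each row in as a 2×2 table (rows: drug, columns: cured / not cured); the list and variable names are illustrative.

```python
from scipy.stats import fisher_exact

# Each entry: (Drug A cured, Drug A not cured), (Drug B cured, Drug B not cured)
experiments = [
    ((73, 125), (71, 127)),
    ((71, 127), (72, 126)),
    ((75, 123), (70, 128)),
]

fisher_p_values = []
for drug_a, drug_b in experiments:
    # Two-sided Fisher's exact test on the 2x2 contingency table
    _, p = fisher_exact([list(drug_a), list(drug_b)])
    fisher_p_values.append(p)
    print(f"A: {drug_a}, B: {drug_b} -> p = {p:.1f}")
```

All three p-values are large, as expected when the two drugs cure nearly the same share of patients.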


## Example Effects of a new Fertilizer on Plant Growth

Imagine you're a botanist studying the effects of a new fertilizer on plant growth. You have two groups of plants:

1. **Control Group**: Plants not given the fertilizer.
2. **Treatment Group**: Plants given the fertilizer.

You want to know if the fertilizer has a significant effect on plant growth. To do this, you measure the height of the plants after a fixed period.

**Hypotheses**:
- $ H_0 $: The fertilizer has no effect on plant growth. (Mean height of Control Group = Mean height of Treatment Group)
- $ H_a $: The fertilizer has an effect on plant growth. (Mean height of Control Group ≠ Mean height of Treatment Group)

**Data**:
Let's assume you measured the height (in cm) of 10 plants from each group:

- Control Group: [15, 17, 16, 14, 15, 16, 17, 15, 16, 17]
- Treatment Group: [18, 19, 20, 19, 18, 21, 19, 20, 19, 18]

We'll use a two-sample t-test to determine if there's a significant difference in the means of the two groups.

To calculate the p-value using a t-test for comparing the means of the control group and the treatment group, we'll go through the following steps:

1. **Calculate the mean** of each group.
2. **Calculate the standard deviation** of each group.
3. **Calculate the standard error of the mean (SEM)** for each group.
4. **Calculate the t-statistic** using the means, SEMs, and sample sizes of both groups.
5. **Calculate the degrees of freedom** needed to look up the p-value.
6. **Calculate the p-value** based on the t-statistic and degrees of freedom.



For a two-tailed test (which checks for any difference between the means, not specifying direction), the p-value can be conceptualized as:

$ \text{p-value} = 2 \times (1 - \text{CDF}(|t|, df)) $

Where:
- $ \text{CDF} $ refers to the cumulative distribution function for the t-distribution.
- $ |t| $ is the absolute value of the observed t-statistic calculated from your data. Using the absolute value keeps the formula valid when $ t $ is negative.
- $ df $ are the degrees of freedom, which, for an independent samples t-test, are usually calculated as $ n_1 + n_2 - 2 $ for equal variances, or using a more complex formula for unequal variances and sample sizes.

This formula is calculating the probability of observing a t-statistic as extreme as, or more extreme than, the observed t-statistic under the null hypothesis. The "2 ×" part accounts for both tails of the distribution since we're interested in differences in either direction (higher or lower).

In practice, this calculation is not done manually but through statistical software. Library functions use the properties of the t-distribution internally to find the p-value corresponding to the calculated t-statistic and the degrees of freedom for your specific test scenario.
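To make the six steps concrete, here is a minimal sketch that carries them out by hand on the plant data, assuming equal variances (so $ df = n_1 + n_2 - 2 $). The variable names are illustrative; `stats.t.sf` is the survival function, i.e. $ 1 - \text{CDF} $.

```python
import numpy as np
from scipy import stats

control = np.array([15, 17, 16, 14, 15, 16, 17, 15, 16, 17])
treatment = np.array([18, 19, 20, 19, 18, 21, 19, 20, 19, 18])
n1, n2 = len(control), len(treatment)

# Steps 1-2: means and sample variances (ddof=1)
mean1, mean2 = control.mean(), treatment.mean()
var1, var2 = control.var(ddof=1), treatment.var(ddof=1)

# Step 5 (needed early): degrees of freedom for the equal-variance test
dof = n1 + n2 - 2

# Steps 3-4: pooled variance, standard error of the difference, t-statistic
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / dof
standard_error = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_manual = (mean1 - mean2) / standard_error

# Step 6: two-tailed p-value, 2 * (1 - CDF(|t|, df))
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

print(f"t = {t_manual:.3f}, df = {dof}, p = {p_manual:.3e}")
```

Because both groups have the same size, the pooled standard error here equals the square root of the sum of the squared SEMs.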



### Control Group
- **Mean**: 15.8 cm
- **Standard Deviation**: 1.033 cm
- **Standard Error of the Mean (SEM)**: 0.327 cm

### Treatment Group
- **Mean**: 19.1 cm
- **Standard Deviation**: 0.994 cm
- **Standard Error of the Mean (SEM)**: 0.314 cm

### T-test Results
- **T-statistic**: -7.279
- **P-value**: approximately 0.0000009162

The t-statistic is -7.279 and the p-value is far below 0.001, so there is a statistically significant difference in plant height between the control group and the treatment group. The t-statistic is negative because the control mean (15.8 cm) is lower than the treatment mean (19.1 cm), so the data support the hypothesis that the new fertilizer has a positive effect on plant growth.



Refs [1](https://www.youtube.com/watch?v=vemZtEM63GY), [2](https://www.youtube.com/watch?v=udyAvvaMjfM), [3](https://www.youtube.com/watch?v=p0W1oKPP6eQ), [4](https://www.youtube.com/watch?v=0oc49DyA3hU), [5](https://www.youtube.com/watch?v=JQc3yx0-Q9E), [6](https://www.youtube.com/watch?v=5koKb5B_YWo)



```python
from scipy.stats import ttest_ind
import numpy as np

# Data
control_group = np.array([15, 17, 16, 14, 15, 16, 17, 15, 16, 17])
treatment_group = np.array([18, 19, 20, 19, 18, 21, 19, 20, 19, 18])

# Step 1 & 2: Calculate mean and standard deviation
mean_control = np.mean(control_group)
std_dev_control = np.std(control_group, ddof=1)  # Sample standard deviation
mean_treatment = np.mean(treatment_group)
std_dev_treatment = np.std(treatment_group, ddof=1)  # Sample standard deviation

# Step 3: Calculate the Standard Error of the Mean (SEM) for each group
n_control = len(control_group)
n_treatment = len(treatment_group)
sem_control = std_dev_control / np.sqrt(n_control)
sem_treatment = std_dev_treatment / np.sqrt(n_treatment)

# Steps 4-6: scipy computes the t-statistic and p-value directly
# (equal variances are assumed by default, so df = n1 + n2 - 2)
t_stat, p_value = ttest_ind(control_group, treatment_group)

# In a notebook, this final expression is displayed as the cell output
mean_control, std_dev_control, mean_treatment, std_dev_treatment, sem_control, sem_treatment, t_stat, p_value
```

Output:

```text
(15.8,
 1.0327955589886444,
 19.1,
 0.9944289260117533,
 0.3265986323710904,
 0.31446603773522014,
 -7.278624758728698,
 9.162003368633656e-07)
```

## Hypothesis testing in deep learning
Hypothesis testing in deep learning can be applied in several scenarios, especially when you're interested in comparing models, understanding the significance of model improvements, or analyzing the behavior of models under specific conditions. Here are some common use cases:

1. **Model Comparison**: When you have two or more different models or approaches and want to determine if one model is significantly better than the others. Hypothesis testing can help you assess if the differences in performance metrics (like accuracy, precision, recall) are statistically significant or just due to random chance.

2. **Feature Importance**: To evaluate the impact of certain features on the model's predictions. Hypothesis testing can be used to determine if removing or adding a specific feature significantly affects the model's performance, helping in feature selection and model simplification.

3. **Regularization and Hyperparameter Tuning**: When adjusting model hyperparameters (like learning rate, dropout rate, or regularization strength), hypothesis testing can help in determining if the changes in the hyperparameters lead to a statistically significant improvement or degradation in model performance.

4. **Transfer Learning and Domain Adaptation**: In scenarios involving transfer learning or domain adaptation, where a model trained on one domain is adapted for use in another, hypothesis testing can be used to assess if the adaptation leads to significant improvements in performance on the new domain.

5. **Fairness and Bias Assessment**: Hypothesis testing can be instrumental in identifying biases in model predictions across different groups or demographics. It can help in determining if disparities in model outcomes are statistically significant, which is crucial for developing fair and unbiased models.

6. **A/B Testing**: In the deployment phase, especially for models integrated into products or services, A/B testing with hypothesis testing can evaluate the real-world impact of using one model version over another on user behavior or other key performance indicators (KPIs).

7. **Robustness and Generalization**: To test the robustness of models against variations in input data, including adversarial examples, noise, or data from different distributions. Hypothesis testing can help determine if a model is significantly more robust or generalizes better to unseen data compared to others.

8. **Time Series and Sequence Models**: For models dealing with time series or sequential data, hypothesis testing can be used to assess the significance of temporal features or the impact of different sequence modeling techniques (like RNNs, GRUs, or LSTMs) on prediction accuracy.

In practice, implementing hypothesis testing in deep learning involves choosing the right statistical test based on the data distribution and experiment design, defining null and alternative hypotheses, calculating the test statistic, and interpreting the p-value to make decisions. It's a powerful tool to add rigor and confidence to the conclusions drawn from deep learning experiments.
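As a sketch of the model-comparison case, suppose two models are each trained with the same five random seeds and evaluated on the same test set. The accuracy values below are made up for illustration, and a paired t-test (`scipy.stats.ttest_rel`) is one reasonable choice here because the runs are matched by seed; other designs may call for a different test.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical test accuracies, one per random seed (same seeds for both models)
model_a_acc = np.array([0.912, 0.905, 0.918, 0.909, 0.915])
model_b_acc = np.array([0.901, 0.899, 0.910, 0.903, 0.906])

# H0: the mean accuracy difference between the two models is zero
t_paired, p_paired = ttest_rel(model_a_acc, model_b_acc)

mean_diff = np.mean(model_a_acc - model_b_acc)
print(f"Mean accuracy difference: {mean_diff:.4f}")
print(f"t = {t_paired:.2f}, p = {p_paired:.4f}")

alpha = 0.05
if p_paired < alpha:
    print("Reject H0: the difference is statistically significant.")
else:
    print("Fail to reject H0: not enough evidence of a difference.")
```

As with the drug examples, a significant result says nothing about whether a 0.8-point accuracy gain matters in practice; that is a question of effect size.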