# Hypothesis Testing Fundamentals

### What is Hypothesis Testing?
Imagine you’re a detective trying to figure out whether something is true or not. Hypothesis testing is like running an experiment to check if your guess (hypothesis) about something is actually correct. 

---

### Example 1: Video Game Experiment (A/B Testing)
In 2013, the company that made the video game *SimCity 5* wanted more people to pre-order their game (buy it before it was released). They had two ideas for their game’s website:

1. **Version 1 (Control):** The website had a banner saying, “Buy now and get money off your next purchase.”
2. **Version 2 (Treatment):** The website had no banner—just a simple page to pre-order the game.

They tested both versions by randomly showing half of their visitors Version 1 and the other half Version 2. Then, they counted the number of people who actually bought the game on each page. 

**Surprise!** More people (43% more!) bought the game from the version without the banner. Weird, right? The company thought the banner would help, but it actually hurt sales! 

The big question they had was: _“Is this 43% difference real or just a random fluke?”_ That's where hypothesis testing comes in.

---

### Example 2: Data Scientist Salaries
Now imagine someone wants to figure out how much data scientists get paid, on average. They guess (hypothesize) the average salary is **$110,000**. But then they look at a small sample of data scientists and find their average salary is **$120,000**.

The next question is: _“Is this difference in salary ($120,000 vs. $110,000) real, or did it happen just by chance?”_ 

---

### How Do They Check if the Difference is Real?
They use a few tools and steps:

1. **Bootstrap Distribution**: This is like shaking up a jar of data, pulling out random samples over and over (putting each value back after you draw it), and writing down each sample's average. This shows how much the average salary would typically wander from sample to sample just by chance.
   
   - Imagine it like baking cookies and tasting different batches to see how much their taste varies.

2. **Standard Error**: This is a fancy way of measuring how much the sample average itself tends to jump around from one resample to the next. It is the standard deviation of the bootstrap distribution. If the averages don’t jump much, even a small difference is meaningful.

3. **Z-Score**: This tells you how far your result (like $120,000) is from your guess ($110,000), compared with what you consider normal (the standard error). 

   - If the z-score is very big, it means your result is far from what you guessed, and you might need to rethink your guess. 
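
The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the `salaries` array is synthetic data invented for the sketch (drawn with a fixed seed), and only the guessed mean of $110,000 comes from the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample of 200 salaries (synthetic, for illustration only)
salaries = rng.normal(loc=120_000, scale=40_000, size=200)
hypothesized_mean = 110_000  # the guess from the salary example

# 1. Bootstrap distribution: resample WITH replacement, record each mean
boot_means = np.array([
    rng.choice(salaries, size=salaries.size, replace=True).mean()
    for _ in range(2000)
])

# 2. Standard error: the spread of the bootstrap means
std_error = boot_means.std(ddof=1)

# 3. z-score: distance between result and guess, in standard errors
z_score = (salaries.mean() - hypothesized_mean) / std_error
print(f"standard error: {std_error:.0f}, z-score: {z_score:.2f}")
```

With real data you would replace the synthetic `salaries` array with the observed sample; the rest of the steps stay the same.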

---

### Why Does This Matter?
Hypothesis testing helps answer two important questions:
1. Is something meaningful or just a coincidence? In the SimCity example, did the sales difference happen by chance, or was removing the banner really better?
2. Should we believe what we guessed? In the salary example, is $110,000 a good guess for the average salary, or does the data say otherwise?

---

Think of it like a fairness check for decisions:
- It helps companies make better choices (like which website version works better).
- It helps people understand whether numbers mean something or are just random.

## 📊 Calculating the Sample Mean

The `late_shipments` dataset contains supply chain data on the delivery of medical supplies. Each row represents one delivery of a part. The `late` column denotes whether or not the part was delivered late. A value of `"Yes"` means that the part was delivered late, and a value of `"No"` means the part was delivered on time.

You'll begin your analysis by calculating a point estimate (or sample statistic), namely the proportion of late shipments.

In pandas, a value's proportion in a categorical DataFrame column can be quickly calculated using the syntax:

```python
prop = (df['col'] == val).mean()
```

In [8]:
import pandas as pd
import numpy as np
late_shipments = pd.read_feather('late_shipments.feather')

In [9]:
late_shipments.head()

| Unnamed: 0 | id | country | managed_by | fulfill_via | vendor_inco_term | shipment_mode | late_delivery | late | product_group | sub_classification | ... | line_item_quantity | line_item_value | pack_price | unit_price | manufacturing_site | first_line_designation | weight_kilograms | freight_cost_usd | freight_cost_groups | line_item_insurance_usd |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 36203.0 | Nigeria | PMO - US | Direct Drop | EXW | Air | 1.0 | Yes | HRDT | HIV test | ... | 2996.0 | 266644.0 | 89.0 | 0.89 | Alere Medical Co., Ltd. | Yes | 1426.0 | 33279.83 | expensive | 373.83 |
| 1 | 30998.0 | Botswana | PMO - US | Direct Drop | EXW | Air | 0.0 | No | HRDT | HIV test | ... | 25.0 | 800.0 | 32.0 | 1.6 | Trinity Biotech, Plc | Yes | 10.0 | 559.89 | reasonable | 1.72 |
| 2 | 69871.0 | Vietnam | PMO - US | Direct Drop | EXW | Air | 0.0 | No | ARV | Adult | ... | 22925.0 | 110040.0 | 4.8 | 0.08 | Hetero Unit III Hyderabad IN | Yes | 3723.0 | 19056.13 | expensive | 181.57 |
| 3 | 17648.0 | South Africa | PMO - US | Direct Drop | DDP | Ocean | 0.0 | No | ARV | Adult | ... | 152535.0 | 361507.95 | 2.37 | 0.04 | Aurobindo Unit III, India | Yes | 7698.0 | 11372.23 | expensive | 779.41 |
| 4 | 5647.0 | Uganda | PMO - US | Direct Drop | EXW | Air | 0.0 | No | HRDT | HIV test - Ancillary | ... | 850.0 | 8.5 | 0.01 | 0.0 | Inverness Japan | Yes | 56.0 | 360.0 | reasonable | 0.01 |


In [10]:
# Calculate the proportion of late shipments
late_prop_samp = (late_shipments['late']=='Yes').mean()

# Print the results
print(late_prop_samp)

0.061


>The proportion of late shipments in the sample is 0.061, or 6.1%.

In [12]:
# Generate a bootstrap distribution of 5000 replicates
late_shipments_boot_distn = []

for i in range(5000):
    late_shipments_boot_distn.append(
         np.mean(late_shipments.sample(frac=1, replace=True)['late']=='Yes')
    )

late_shipments_boot_distn[0:10]

[0.058, 0.067, 0.063, 0.06, 0.065, 0.057, 0.055, 0.059, 0.059, 0.053]

In [13]:
len(late_shipments_boot_distn)

5000
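
Because `late_shipments.feather` is not reproduced here, the sketch below rebuilds the same idea on a stand-in array. It assumes a sample of 1,000 shipments of which 61 are late, which matches the 6.1% proportion above; the array itself is constructed for illustration. As a sanity check, it also compares the bootstrap standard error with the textbook approximation $\sqrt{p(1-p)/n}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the `late` column: 61 late and 939 on-time deliveries
is_late = np.array([True] * 61 + [False] * 939)

late_prop_samp = is_late.mean()  # proportion of late shipments

# Bootstrap: resample with replacement and recompute the proportion
boot_distn = np.array([
    rng.choice(is_late, size=is_late.size, replace=True).mean()
    for _ in range(5000)
])

# Standard error from the bootstrap vs. the closed-form approximation
se_boot = boot_distn.std(ddof=1)
se_formula = np.sqrt(late_prop_samp * (1 - late_prop_samp) / is_late.size)
print(se_boot, se_formula)
```

The two standard errors should land close to each other; a large gap would suggest a mistake in the resampling loop.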

## 📊 Calculating a z-score
Since variables have arbitrary ranges and units, we need to standardize them. For example, a hypothesis test that gave different answers if the variables were in Euros instead of US dollars would be of little value. Standardization avoids that.

One standardized value of interest in a hypothesis test is called a z-score. To calculate it, you need three numbers: the sample statistic (point estimate), the hypothesized statistic, and the standard error of the statistic (estimated from the bootstrap distribution).

### 🎯 What is a Z-score?

Think of a **z-score** as a way to measure **how far** something is from what you expected — using a special ruler that always uses the same size no matter what you're measuring.

---

### 👟 Example: Shoe Size Game

Let’s say you and your friends play a game where you try to guess the **average shoe size** of all the kids in your school.

1. **You guess** the average is **size 5** (this is your *hypothesized statistic* — what you expect).
2. Then you go ask **20 kids** and find the **real average from your sample** is **size 6** (this is your *sample statistic* — what you got).
3. But you know every time you ask different kids, the number changes a little. So you figure out how much it usually changes — that’s called the **standard error**. Let’s say it’s **0.5**.

---

### 📏 How to Use the Z-score

Now, the z-score tells you:

> “How many steps away is your result from what you expected?”

We use this formula:

$$
\text{z-score} = \frac{\text{sample statistic} - \text{hypothesized statistic}}{\text{standard error}}
$$

Let’s plug in the shoe size numbers:

$$
z = \frac{6 - 5}{0.5} = \frac{1}{0.5} = 2
$$

🔢 The **z-score is 2**.
That means your real result (average shoe size 6) is **2 steps away** from what you expected (average shoe size 5).
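
The same arithmetic as a tiny helper. The function name `z_score` and its parameter names are chosen here for illustration; the numbers are the ones from the shoe size game.

```python
def z_score(sample_stat, hypothesized_stat, std_error):
    """Number of standard errors between the sample statistic and the hypothesized value."""
    return (sample_stat - hypothesized_stat) / std_error

# Shoe size game: sample mean 6, hypothesized mean 5, standard error 0.5
print(z_score(6, 5, 0.5))  # 2.0
```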

---

### 🚦Why It’s Useful

Let’s say someone told you, “Hey, your result is 10 steps away!” You’d probably think, “Whoa, that’s really far off!”
But if it’s just 1 or 2 steps (like here), maybe it’s not that big a deal. That’s how scientists know if something surprising happened or if it’s just normal randomness.

---

### 💡 And What About Different Units?

If you used **Euros** instead of **US Dollars**, or **cm** instead of **inches**, the numbers would look different.

But the **z-score** always tells you the same thing: *how far* the result is from what you expected, in standard "steps" — no matter the unit.
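
A quick way to see this is to compute the z-score twice, once per unit. The sketch below is an assumption-laden illustration: the six salaries and the exchange rate are made up, and the standard error uses the formula $s/\sqrt{n}$ instead of a bootstrap so the result is exactly repeatable.

```python
import numpy as np

def z_from_sample(values, hypothesized_mean):
    """z-score of the sample mean, using the standard error s / sqrt(n)."""
    values = np.asarray(values, dtype=float)
    std_error = values.std(ddof=1) / np.sqrt(values.size)
    return (values.mean() - hypothesized_mean) / std_error

# Hypothetical salaries in US dollars, and an assumed exchange rate
salaries_usd = np.array([95_000, 120_000, 135_000, 110_000, 140_000, 120_000])
usd_to_eur = 0.9  # illustrative rate, not a real quote

z_usd = z_from_sample(salaries_usd, 110_000)
z_eur = z_from_sample(salaries_usd * usd_to_eur, 110_000 * usd_to_eur)
print(z_usd, z_eur)  # the two values agree up to floating-point rounding
```

Converting units multiplies both the numerator and the standard error by the same factor, so the factor cancels in the ratio.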

In [16]:
# Hypothesize that the proportion is 6%
late_prop_hyp = 0.06

# Calculate the standard error
std_error = np.std(late_shipments_boot_distn, ddof=1)

# Find z-score of late_prop_samp
z_score = (late_prop_samp - late_prop_hyp) / std_error

# Print z_score
print(z_score)

0.13189038692642163


>The z-score is a standardized measure of the difference between the sample statistic and the hypothesized statistic.

## 📊 Calculating p-values

### 🎓 Test Score Example: Is This Kid a Genius?

Let’s say the **average score** in your school on a math test is **70 out of 100**.

One student — let’s call her **Ada** — takes the same test and scores **95**. Whoa! 🔥

Now the question is:

> **“Is Ada just lucky, or is she really way smarter than average?”**

To answer that, scientists would ask:

> “If Ada were just an average student like everyone else, how likely is it that she would get a score as high as 95?”

This is where the **p-value** comes in.

---

### 🔍 What the P-value Tells Us:

Let’s pretend we run a bunch of simulations — like asking 1,000 average students to take the test again — and we see:

* Most average students score around 70.
* A few score 80, 85…
* But **only 1 out of 1,000** scores **95 or more**.

So the **p-value** is **0.001** (that means 1 in 1,000 chance).
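
Here is one way such a simulation might look. Two things are assumptions of the sketch, not part of the story: the scores of average students are taken to be roughly normal with a standard deviation of 8 (chosen so that a 95 is about as rare as described), and 100,000 students are simulated instead of 1,000 so the estimate is less noisy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption: scores of average students are roughly normal with
# mean 70 (from the example) and standard deviation 8 (made up here).
simulated_scores = rng.normal(loc=70, scale=8, size=100_000)

# p-value: share of simulated "average" students scoring 95 or more
ada_score = 95
p_value = (simulated_scores >= ada_score).mean()
print(p_value)
```

The printed value should come out on the order of 1 in 1,000; a different assumed spread would give a different p-value.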

---

### 🧠 What Does That Mean?

* A **p-value of 0.001** means:

  > "If Ada were just an average student, this kind of score almost never happens."
  > So, we say: “Hmm... maybe Ada really *is* above average. She might be a genius!”

But...

* If Ada had scored **73**, and the p-value was **0.4**, that would mean:

  > "Lots of average students get that score. Nothing special here."

---

### 🎯 Final Takeaway (Kid Version):

* **P-value = How rare or surprising a score is if the person is “normal.”**
* **Small p-value** (like 0.05 or smaller) = "Whoa! That’s not normal! Something’s up!"
* **Big p-value** = "Nah, this happens all the time. Nothing strange."

### 🧠 What’s a Null Hypothesis?

The **null hypothesis** is just a fancy way of saying:

> “We believe nothing unusual is going on.”

It’s like the “default” or “normal” idea — what we assume is true **unless we have strong evidence** to say otherwise.

---

### 🍎 In the Test Score Example

* Everyone usually scores around **70** on the math test.
* **Ada** scores **95**.

So, our **null hypothesis** would be:

> **“Ada is just a regular student like everyone else. Her true average score is 70.”**

That’s what we start by assuming — even if she scored 95, we pretend for a moment that she’s average and just got lucky.

---

### 🎯 Then What?

We ask:

> “If the null hypothesis is true (Ada is average), how likely is it that she’d score 95?”

That’s what the **p-value** helps us answer.

* **If it’s very unlikely**, we start to doubt the null hypothesis and say:

  > “Maybe Ada isn’t average after all.”

* **If it’s pretty common**, we say:

  > “Looks like Ada just got lucky. No reason to believe she’s different.”

---

### ✅ Final Answer:

In this case, the **null hypothesis** is:

> **“Ada is not special — her average score is 70, just like everyone else.”**


### 🧠 What’s the Alternative Hypothesis?

If the **null hypothesis** says:

> “Nothing special is going on,”

Then the **alternative hypothesis** says:

> **“Something special *is* going on!”**

It’s like the **opposite** of the null hypothesis.

---

### 🧪 Back to Ada’s Test Score

Let’s review:

* **Null hypothesis** (H₀):

  > “Ada is just an average student. Her true average score is 70.”

* **Alternative hypothesis** (H₁ or Hₐ):

  > **“Ada is better than average. Her true score is higher than 70.”**

That’s what we’re *trying to find evidence for* with our data.

---

### 🎯 Why We Need Both

Science always starts by **assuming the null** — that nothing special is happening.

Then we collect data and ask:

> “Is this data so surprising that we should reject the null and believe the alternative instead?”

---

### 🧁 Simple Example with Cake

Let’s say your friend claims they can bake the **best cake ever** — way better than average.

* **Null hypothesis**: Their cake tastes just like any other.
* **Alternative hypothesis**: Their cake tastes **better** than average.

You taste it. If it’s AMAZING and rare to find something that good, you might say:

> “Wow, this is so good, I reject the null — I believe the alternative!”

But if it’s okay or just a little better, you say:

> “Eh, could just be luck. I stick with the null.”

---

### ✅ Summary

* **Null hypothesis (H₀)**: Nothing special, no difference.
* **Alternative hypothesis (H₁)**: Something special is going on, there *is* a difference.
* You start by assuming H₀ is true.
* Your data (like a test score or cake rating) helps you decide whether to stick with H₀ or switch to H₁.
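
The decision step can be written as a one-line comparison. This is a minimal sketch: the helper name `decide` is invented here, and the cutoff of 0.05 is the conventional threshold mentioned earlier, not a rule fixed by the data.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the significance level alpha."""
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"

print(decide(0.001))  # reject H0 (Ada scoring 95)
print(decide(0.4))    # fail to reject H0 (Ada scoring 73)
```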


### 🎯 First, What’s a "Tail"?

Imagine the results of a lot of students' test scores are lined up in a big curve (called a **bell curve** or **normal distribution**). Most scores are in the **middle**, and fewer scores are at the **ends** — the ends are called the **tails**.

* The **left tail** is the very low scores.
* The **right tail** is the very high scores.

These tails are where **rare or surprising** results live.

---

### 🧪 Hypothesis Testing and Tails

When we test a hypothesis, we ask:

> “Is our result far enough into the tail that it’s super rare?”

Whether we care about the **left tail**, **right tail**, or **both tails** depends on how we phrase our **alternative hypothesis** (H₁).

---

### 🧭 Three Types of Tests (with Easy Examples)

#### 1. **Right-tailed test**

👉 We care only about **big numbers** (results **greater** than expected).

* **H₀**: The average score is 70.
* **H₁**: The average score is **more than** 70.

🧠 We’re only looking for scores in the **right tail** — the high side.

**Example**: “Ada is better than average.”

---

#### 2. **Left-tailed test**

👉 We care only about **small numbers** (results **less** than expected).

* **H₀**: The average score is 70.
* **H₁**: The average score is **less than** 70.

🧠 We’re only looking for scores in the **left tail** — the low side.

**Example**: “Ada is doing worse than expected.”

---

#### 3. **Two-tailed test**

👉 We care about **any extreme result**, whether too low or too high.

* **H₀**: The average score is 70.
* **H₁**: The average score is **not 70** (it could be higher *or* lower).

🧠 We’re checking **both tails** — we just want to know if something is **different**, not whether it's more or less.

**Example**: “Ada is not like the others — maybe she’s way better or way worse.”

---

### 🔍 Quick Visual

```
  Left Tail      Center     Right Tail
|------------|------------|------------|
     Low        Average        High
```

* **Left-tailed test** = We care about the left end
* **Right-tailed test** = We care about the right end
* **Two-tailed test** = We care about both ends

---

### ✅ Summary

* The **alternative hypothesis** tells us which direction to look.
* The **tails** are where surprising results live.
* We check the **p-value** in the tail(s) we care about.
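
To make the three cases concrete, the sketch below turns a z-score into a p-value for each kind of test. The helper `p_value_from_z` and its `alternative` labels are names chosen for this illustration; it assumes the z-score follows a standard normal distribution when the null hypothesis is true.

```python
from scipy.stats import norm

def p_value_from_z(z, alternative):
    """p-value for a z-score under a standard normal null distribution."""
    if alternative == "greater":      # right-tailed
        return norm.sf(z)             # same as 1 - norm.cdf(z)
    if alternative == "less":         # left-tailed
        return norm.cdf(z)
    if alternative == "two-sided":    # both tails
        return 2 * norm.sf(abs(z))
    raise ValueError(f"unknown alternative: {alternative!r}")

for alt in ("greater", "less", "two-sided"):
    print(alt, round(p_value_from_z(2.0, alt), 4))
```

For a z-score of 2, the right-tailed p-value is about 0.023 and the two-tailed p-value is twice that, about 0.046.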


## Calculating p-values

To decide whether to reject the null hypothesis in favor of the alternative hypothesis, you need to calculate a p-value from the z-score.

You'll now return to the late shipments dataset and the proportion of late shipments.

The null hypothesis, $H_0$, is that the proportion of late shipments is six percent.

The alternative hypothesis, $H_A$, is that the proportion of late shipments is greater than six percent.

The observed sample statistic, `late_prop_samp`, the hypothesized value, `late_prop_hyp` (6%), and the bootstrap standard error, `std_error` are available. `norm` from `scipy.stats` has also been loaded without an alias.

In [23]:
from scipy.stats import norm

# Calculate the z-score of late_prop_samp
z_score = (late_prop_samp - late_prop_hyp) / std_error

# Calculate the p-value
# norm.cdf(z_score) gives the probability of a result less than or equal to this z-score.
# This is a right-tailed test (is the proportion greater than hypothesized?),
# so we take the complementary area: 1 - norm.cdf(z_score).
p_value = 1 - norm.cdf(z_score)

# Print the p-value
print(p_value) 

0.4475354961625917


## Calculating a confidence interval
If you report only a single point estimate, you are bound to be wrong by some amount. For example, the hypothesized proportion of late shipments was 6%. Even if the evidence is consistent with the null hypothesis that the proportion of late shipments equals this value, the proportion in any new sample of shipments is likely to be a little different due to sampling variability. Consequently, it's a good idea to state a confidence interval. That is, you say, "we are 95% 'confident' that the proportion of late shipments is between A and B" (for some values of A and B).

Sampling in Python demonstrated two methods for calculating confidence intervals. Here, you'll use quantiles of the bootstrap distribution to calculate the confidence interval.

In [25]:
import numpy as np
# Calculate 95% confidence interval using quantile method
lower = np.quantile(late_shipments_boot_distn, 0.025)
upper = np.quantile(late_shipments_boot_distn, 0.975)

# Print the confidence interval
print((lower, upper))

(0.04697500000000001, 0.076)


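>When the confidence level equals one minus the significance level, a hypothesized population parameter that falls inside the confidence interval means you should fail to reject the null hypothesis. Here, the hypothesized 6% lies between about 4.7% and 7.6%, which agrees with the large p-value of about 0.45 found earlier.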
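
The check can be sketched end to end on stand-in data. The assumptions are labeled in the comments: a sample of 1,000 shipments with 61 late, and a bootstrap simulated directly from the binomial distribution instead of by resampling rows. The helper name `quantile_ci` is invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(7)

# Resampling 1,000 shipments (61 late) with replacement gives a late count
# that follows Binomial(1000, 0.061), so the bootstrap can be simulated directly.
boot_distn = rng.binomial(n=1000, p=0.061, size=5000) / 1000

def quantile_ci(distn, confidence=0.95):
    """Confidence interval from the quantiles of a bootstrap distribution."""
    alpha = 1 - confidence
    lower = np.quantile(distn, alpha / 2)
    upper = np.quantile(distn, 1 - alpha / 2)
    return lower, upper

lower, upper = quantile_ci(boot_distn)
late_prop_hyp = 0.06  # hypothesized proportion

inside = lower <= late_prop_hyp <= upper
print((lower, upper), "fail to reject H0" if inside else "reject H0")
```

Note that an interval with equal tails on both sides corresponds most directly to a two-tailed test, so treat the comparison with a one-tailed p-value as a rough consistency check.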

## t-value

Okay, imagine you want to know if something is different between two groups of people. This section explains how to do that using a statistical test called a "t-test." It's like a detective trying to figure out if two suspects are really different, or if the differences you see are just random chance.

Here's a breakdown:

*   **The Question:** We want to know if there's a real difference in something (like salary) between two groups (like people who started coding as kids versus adults).
*   **The Guess (Null Hypothesis):** We start by assuming there's NO real difference between the groups. This is our "innocent until proven guilty" assumption. In our example, it means assuming kids and adults make the same average salary.
*   **The Evidence (Sample Data):** We collect information from both groups (e.g., salaries from a sample of coders).
*   **Calculate Averages:** We find the average salary for each group in our sample.
*   **The T-Statistic:** This is a special number that tells us how different the groups are, taking into account how much the salaries vary within each group and how many people are in each group. It's like a "difference score" that is adjusted for how reliable the data is.
*   **Big T-Statistic = Potentially Important Difference:** A bigger t-statistic suggests the groups are more different than we'd expect if the "no difference" guess was true. A small t-statistic suggests the groups might not be very different.
*   **Standard Error:** This measures how much the average salary of each sample group might vary from the true average.
*   **Next Steps:** After calculating the t-statistic, you'd use it to figure out the "p-value." The p-value helps you decide whether you have enough evidence to reject your initial "no difference" guess.

**Analogy:**

Imagine you have two basketball teams, and you want to know if one is better than the other.

*   **Null Hypothesis:** Both teams are equally good on average.
*   **Evidence:** You watch a few games and record the scores for each team.
*   **Calculate Averages:** You find the average score for each team in the games you watched.
*   **The T-Statistic:** The t-statistic considers the difference in average scores, how much the scores varied from game to game for each team, and the number of games you watched. A big difference, with consistent scores, would give a bigger t-statistic.
*   **Next Steps:** You use the t-statistic and other information to figure out how likely it is that the difference you saw in the games was just luck, or if one team is truly better.

**In simple terms, the t-test helps you decide if the differences you see between two groups are real or just due to random chance.** It does this by calculating a t-statistic, which considers the difference between the group averages, how much the data varies, and the number of observations in each group.

## Two Sample Mean Test Statistic

The hypothesis test for determining whether there is a difference between the means of two populations uses a different type of test statistic from the z-scores you saw earlier. It's called "t", and it can be calculated from three values from each sample (the mean, the standard deviation, and the size) using this equation:

$$
t = \frac{(\bar{x}_{child} - \bar{x}_{adult})}{\sqrt{\frac{s_{child}^2}{n_{child}} + \frac{s_{adult}^2}{n_{adult}}}}
$$

Where:

*   $t$ is the t-statistic.
*   $\bar{x}_{child}$ is the sample mean of the "child" group.
*   $\bar{x}_{adult}$ is the sample mean of the "adult" group.
*   $s_{child}$ is the sample standard deviation of the "child" group.
*   $s_{adult}$ is the sample standard deviation of the "adult" group.
*   $n_{child}$ is the sample size of the "child" group.
*   $n_{adult}$ is the sample size of the "adult" group.
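
As a sketch of how the equation maps to code, the snippet below computes the statistic by hand and compares it with SciPy. The two arrays are synthetic stand-ins (the salary values are made up), and the helper name `two_sample_t` is chosen for illustration. `scipy.stats.ttest_ind` with `equal_var=False` uses the same unpooled standard error, so the two numbers should agree.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)

# Synthetic stand-ins for the two groups (values are made up)
child = rng.normal(loc=105_000, scale=30_000, size=80)
adult = rng.normal(loc=95_000, scale=25_000, size=120)

def two_sample_t(x, y):
    """t-statistic from the two sample means, standard deviations, and sizes."""
    numerator = x.mean() - y.mean()
    denominator = np.sqrt(x.var(ddof=1) / x.size + y.var(ddof=1) / y.size)
    return numerator / denominator

t_manual = two_sample_t(child, adult)
# equal_var=False makes SciPy use the same unpooled standard error
t_scipy = ttest_ind(child, adult, equal_var=False).statistic
print(t_manual, t_scipy)
```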

***

While trying to determine why some shipments are late, you may wonder if the weight of the shipments that were on time is less than the weight of the shipments that were late. The `late_shipments` dataset has been split into a "yes" group, where `late == "Yes"` and a "no" group where `late == "No"`. The weight of the shipment is given in the `weight_kilograms` variable.

The sample means for the two groups are available as `xbar_no` and `xbar_yes`. The sample standard deviations are `s_no` and `s_yes`. The sample sizes are `n_no` and `n_yes`.

In [71]:
# Late shipments sample (where late is "Yes")
late_shipments_yes = late_shipments.query("late=='Yes'")

late_shipments_yes.head()

| Unnamed: 0 | id | country | managed_by | fulfill_via | vendor_inco_term | shipment_mode | late_delivery | late | product_group | sub_classification | ... | line_item_quantity | line_item_value | pack_price | unit_price | manufacturing_site | first_line_designation | weight_kilograms | freight_cost_usd | freight_cost_groups | line_item_insurance_usd |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 36203.0 | Nigeria | PMO - US | Direct Drop | EXW | Air | 1.0 | Yes | HRDT | HIV test | ... | 2996.0 | 266644.0 | 89.0 | 0.89 | Alere Medical Co., Ltd. | Yes | 1426.0 | 33279.83 | expensive | 373.83 |
| 44 | 65081.0 | Vietnam | PMO - US | Direct Drop | EXW | Air | 1.0 | Yes | ARV | Adult | ... | 14016.0 | 147168.0 | 10.5 | 0.35 | Hetero Unit III Hyderabad IN | Yes | 1955.0 | 6194.69 | expensive | 151.29 |
| 65 | 13926.0 | South Africa | PMO - US | Direct Drop | DDP | Air | 1.0 | Yes | ARV | Adult | ... | 19992.0 | 156337.44 | 7.82 | 0.26 | Cipla, Goa, India | Yes | 1378.0 | 3646.1 | reasonable | 337.06 |
| 68 | 25830.0 | South Africa | PMO - US | Direct Drop | DDP | Air | 1.0 | Yes | ARV | Pediatric | ... | 11640.0 | 34920.0 | 3.0 | 0.01 | Aurobindo Unit III, India | Yes | 4386.0 | 12917.3 | expensive | 75.29 |
| 78 | 47625.0 | South Africa | PMO - US | Direct Drop | DDP | Air | 1.0 | Yes | ARV | Adult | ... | 54247.0 | 424211.54 | 7.82 | 0.26 | Cipla, Goa, India | Yes | 3728.0 | 3646.13 | reasonable | 914.6 |


In [73]:
# On-time shipments sample (where late is "No")
late_shipments_no = late_shipments.query("late=='No'")

late_shipments_no.head()


In [79]:
# The sample mean for on-time shipments
xbar_no = late_shipments_no['weight_kilograms'].mean()

xbar_no

1897.7912673056444

In [81]:
# The sample mean for late shipments
xbar_yes = late_shipments_yes['weight_kilograms'].mean()
xbar_yes

2715.6721311475408

In [83]:
# The sample standard deviation for on-time shipments
s_no = late_shipments_no['weight_kilograms'].std()

s_no

3154.039507084167

In [85]:
# The sample std for late shipments
s_yes = late_shipments_yes['weight_kilograms'].std()
s_yes

2544.688210903328

In [91]:
# The sample size of on-time shipments
n_no = len(late_shipments_no)

n_no

939

In [93]:
# The sample size of late shipments
n_yes = len(late_shipments_yes)

n_yes

61

In [98]:
# Calculate the numerator of the test statistic
numerator = xbar_no - xbar_yes

# Calculate the denominator of the test statistic
denominator = np.sqrt(s_no ** 2 / n_no + s_yes ** 2 / n_yes)

# Calculate the test statistic
t_stat = numerator / denominator

# Print the test statistic
print(t_stat)

-2.3936661778766433


>When testing for a difference between means, the test statistic is called 't' rather than 'z', and it is calculated from six numbers: the mean, standard deviation, and size of each sample. Here, the value is about `-2.39`; it would be `2.39` had the numerator been computed as `xbar_yes - xbar_no`.
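
As a possible next step, the t-statistic can be turned into a p-value using the t-distribution. The sketch below reuses the numbers printed above and makes one labeled assumption: the degrees of freedom are taken as `n_no + n_yes - 2`, a common simple convention (other approximations give a different value). Because the alternative hypothesis is that on-time shipments weigh less, this is a left-tailed test.

```python
from scipy.stats import t

# Values printed in the cells above
t_stat = -2.3936661778766433
n_no, n_yes = 939, 61

# Assumption: degrees of freedom taken as n_no + n_yes - 2
degrees_of_freedom = n_no + n_yes - 2

# Left-tailed test: H_A says on-time shipments weigh less than late ones
p_value = t.cdf(t_stat, df=degrees_of_freedom)
print(p_value)
```

Under these assumptions the p-value comes out well below 0.05, which would count as evidence against the "no difference" hypothesis at that significance level.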