# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 04 · Notebook 02 – Hypothesis Testing
**Instructor:** Amir Charkhi  |  **Goal:** Make statistical decisions with confidence.

> Format: short theory → quick practice → build understanding → mini-challenges.


---
## Learning Objectives
- Understand null and alternative hypotheses
- Master t-tests and p-values
- Learn Type I and Type II errors
- Apply hypothesis testing to business decisions

## 1. The Hypothesis Testing Framework
Converting business questions into statistical tests.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 5)

In [None]:
# Business Question: Has our new training program improved sales?
# Old average sales per rep: $5,000/month
# Sample of newly trained reps:

np.random.seed(42)
new_sales = np.random.normal(5200, 800, 30)  # 30 reps after training

print("Hypothesis Testing Steps:")
print("1. H₀ (Null): Training has no effect (μ = $5,000)")
print("2. H₁ (Alternative): Training improves sales (μ > $5,000)")
print("3. Significance level: α = 0.05")
print("4. Run the test...")

In [None]:
# One-sample t-test
old_average = 5000  # matches the $5,000/month baseline stated above
t_stat, p_value = stats.ttest_1samp(new_sales, old_average)

print(f"\nSample mean: ${new_sales.mean():,.2f}")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

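# One-tailed test (H₁: μ > old average). Halving the two-tailed p-value is
# only valid when t_stat > 0, so ask SciPy for the one-sided p-value directly.
_, p_value_one_tailed = stats.ttest_1samp(new_sales, old_average, alternative='greater')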
print(f"p-value (one-tailed): {p_value_one_tailed:.4f}")

if p_value_one_tailed < 0.05:
    print("\n✅ Reject H₀: Training appears to improve sales!")
else:
    print("\n❌ Fail to reject H₀: No significant improvement")
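
To see what `ttest_1samp` computes, the t-statistic can be rebuilt by hand as t = (x̄ − μ₀) / (s / √n). The sketch below assumes the same seed and simulation parameters as the cell above and the $5,000 baseline from the business question. The names `mu_0`, `t_manual` and `p_manual` are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

# Same simulated data as the cell above (seed and parameters copied from it)
np.random.seed(42)
new_sales = np.random.normal(5200, 800, 30)
mu_0 = 5000  # assumed baseline: the old average from the business question

n = len(new_sales)
sample_mean = new_sales.mean()
sample_sd = new_sales.std(ddof=1)        # sample standard deviation
standard_error = sample_sd / np.sqrt(n)  # standard error of the mean

t_manual = (sample_mean - mu_0) / standard_error
p_manual = stats.t.sf(t_manual, df=n - 1)  # upper-tail area for H₁: μ > μ₀

# Cross-check against SciPy's one-sided test
t_scipy, p_scipy = stats.ttest_1samp(new_sales, mu_0, alternative='greater')
print(f"manual: t = {t_manual:.3f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.3f}, p = {p_scipy:.4f}")
```

The two lines should agree, which confirms that the one-sided p-value is the area under the t distribution (n − 1 degrees of freedom) to the right of the observed statistic.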

**Exercise 1 – Website Speed Test (easy)**  
Test whether the new server reduced the average page load time below 2.5 seconds.


In [None]:
# Your turn
# old_load_time = 2.5  # seconds
# new_load_times = [2.1, 2.3, 2.0, 2.2, 1.9, 2.4, 2.1, 2.0, 2.2, 2.1]


<details>
<summary><b>Solution</b></summary>

```python
old_load_time = 2.5  # seconds
new_load_times = [2.1, 2.3, 2.0, 2.2, 1.9, 2.4, 2.1, 2.0, 2.2, 2.1]

# Test if new times are less than old (H₁: μ < 2.5)
t_stat, p_value = stats.ttest_1samp(new_load_times, old_load_time)
# One-sided p-value; halving the two-tailed value is only valid when t_stat < 0
_, p_one_tailed = stats.ttest_1samp(new_load_times, old_load_time, alternative='less')

print(f"Old average: {old_load_time} seconds")
print(f"New average: {np.mean(new_load_times):.2f} seconds")
print(f"Improvement: {old_load_time - np.mean(new_load_times):.2f} seconds")
print(f"\nt-statistic: {t_stat:.3f}")
print(f"p-value (two-tailed): {p_value:.4f}")
print(f"p-value (one-tailed): {p_one_tailed:.4f}")

if p_one_tailed < 0.05:
    print("\n✅ Significant improvement in load time!")
else:
    print("\n❌ No significant improvement")
```
</details>

## 2. Two-Sample t-Tests
Comparing two groups (the foundation of A/B testing).

In [None]:
# Compare two marketing campaigns
np.random.seed(42)
campaign_a = np.random.normal(125, 20, 50)  # clicks per day
campaign_b = np.random.normal(135, 22, 50)

# ddof=1 gives the sample standard deviation, the same estimate the t-test uses
print(f"Campaign A: {campaign_a.mean():.1f} ± {campaign_a.std(ddof=1):.1f} clicks/day")
print(f"Campaign B: {campaign_b.mean():.1f} ± {campaign_b.std(ddof=1):.1f} clicks/day")

In [None]:
# Two-sample t-test
t_stat, p_value = stats.ttest_ind(campaign_a, campaign_b)

print(f"\nDifference: {campaign_b.mean() - campaign_a.mean():.1f} clicks/day")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("\n✅ Significant difference between campaigns")
else:
    print("\n❌ No significant difference")
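
By default `stats.ttest_ind` pools the two variances, which assumes both groups have the same spread. When that is in doubt, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below reruns the comparison both ways, assuming the same seed and parameters as the campaign cell above.

```python
import numpy as np
from scipy import stats

# Same simulated campaigns as above (seed and parameters copied from that cell)
np.random.seed(42)
campaign_a = np.random.normal(125, 20, 50)
campaign_b = np.random.normal(135, 22, 50)

# Student's t-test: pooled variance (SciPy's default)
t_student, p_student = stats.ttest_ind(campaign_a, campaign_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(campaign_a, campaign_b, equal_var=False)

print(f"Student: t = {t_student:.3f}, p = {p_student:.4f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.4f}")
```

With equal group sizes the two t-statistics coincide; only the degrees of freedom, and therefore the p-value, differ. The gap grows when group sizes or variances are unbalanced.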

In [None]:
# Visualize the comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Histograms
ax1.hist(campaign_a, alpha=0.5, label='Campaign A', color='blue', bins=15)
ax1.hist(campaign_b, alpha=0.5, label='Campaign B', color='red', bins=15)
ax1.set_xlabel('Clicks per Day')
ax1.set_ylabel('Frequency')
ax1.legend()
ax1.set_title('Distribution Comparison')

# Box plots
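ax2.boxplot([campaign_a, campaign_b])
# Set tick labels separately: boxplot's `labels` argument was renamed `tick_labels` in Matplotlib 3.9
ax2.set_xticklabels(['Campaign A', 'Campaign B'])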
ax2.set_ylabel('Clicks per Day')
ax2.set_title('Box Plot Comparison')

plt.tight_layout()
plt.show()

**Exercise 2 – Customer Satisfaction Comparison (medium)**  
Compare satisfaction scores between two customer service teams.


In [None]:
# Your turn
# team_1 = [8, 7, 9, 8, 7, 8, 9, 8, 7, 8, 9, 8]  # scores out of 10
# team_2 = [7, 6, 7, 8, 6, 7, 6, 7, 8, 7, 6, 7]


<details>
<summary><b>Solution</b></summary>

```python
team_1 = np.array([8, 7, 9, 8, 7, 8, 9, 8, 7, 8, 9, 8])  # scores out of 10
team_2 = np.array([7, 6, 7, 8, 6, 7, 6, 7, 8, 7, 6, 7])

print(f"Team 1: {team_1.mean():.2f} ± {team_1.std():.2f}")
print(f"Team 2: {team_2.mean():.2f} ± {team_2.std():.2f}")

# Two-sample t-test
t_stat, p_value = stats.ttest_ind(team_1, team_2)

print(f"\nDifference: {team_1.mean() - team_2.mean():.2f} points")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

# Effect size (Cohen's d), using sample standard deviations (ddof=1)
pooled_std = np.sqrt((team_1.std(ddof=1)**2 + team_2.std(ddof=1)**2) / 2)
cohens_d = (team_1.mean() - team_2.mean()) / pooled_std
print(f"Cohen's d: {cohens_d:.2f} (effect size)")

if p_value < 0.05:
    print(f"\n✅ Team 1 significantly outperforms Team 2")
else:
    print(f"\n❌ No significant difference")
```
</details>

## 3. Understanding p-values & Significance

In [None]:
# Simulate multiple tests to understand p-values
np.random.seed(42)
n_simulations = 1000
p_values = []

# Run many tests on random data (no real effect)
for _ in range(n_simulations):
    group_a = np.random.normal(100, 15, 30)
    group_b = np.random.normal(100, 15, 30)  # Same distribution!
    _, p = stats.ttest_ind(group_a, group_b)
    p_values.append(p)

# How many false positives at α = 0.05?
false_positives = sum(p < 0.05 for p in p_values)
print(f"False positive rate: {false_positives/n_simulations:.1%}")
print(f"Expected: 5.0%")
print("\nThis is a Type I error: rejecting H₀ when it is true!")

In [None]:
# Visualize p-value distribution under null hypothesis
plt.figure(figsize=(10, 5))
plt.hist(p_values, bins=50, edgecolor='black', alpha=0.7)
plt.axvline(x=0.05, color='red', linestyle='--', label='α = 0.05')
plt.xlabel('p-value')
plt.ylabel('Frequency')
plt.title('p-value Distribution Under Null Hypothesis (Should be Uniform)')
plt.legend()
plt.show()
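
The simulation above covers Type I errors. A Type II error is the mirror case: failing to reject H₀ when a real effect exists. The sketch below is illustrative only. It assumes a true difference of 5 points between the groups (a hypothetical value) and uses NumPy's `default_rng` instead of the global seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_simulations = 1000
misses = 0

for _ in range(n_simulations):
    group_a = rng.normal(100, 15, 30)
    group_b = rng.normal(105, 15, 30)  # assumed real effect: +5 points (d ≈ 0.33)
    _, p = stats.ttest_ind(group_a, group_b)
    if p >= 0.05:  # test failed to flag a difference that really exists
        misses += 1

type_ii_rate = misses / n_simulations
print(f"Type II error rate (β): {type_ii_rate:.1%}")
print(f"Estimated power (1 - β): {1 - type_ii_rate:.1%}")
```

With only 30 observations per group and an effect this small, the test misses the difference in the majority of runs, which is why sample size matters for power.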

## 4. Statistical Power & Sample Size

In [None]:
# How sample size affects our ability to detect differences
def run_test_with_sample_size(n, effect_size=0.5):
    """Run t-test with given sample size and effect"""
    np.random.seed(42)
    group_a = np.random.normal(100, 15, n)
    group_b = np.random.normal(100 + effect_size*15, 15, n)  # Mean shifted by effect_size SDs (d=0.5 is a medium effect)
    _, p = stats.ttest_ind(group_a, group_b)
    return p < 0.05  # Did we detect the difference?

sample_sizes = [10, 20, 50, 100, 200, 500]
for n in sample_sizes:
    detected = run_test_with_sample_size(n)
    print(f"n={n:3d}: {'✅ Detected' if detected else '❌ Missed'} the difference")
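
A single seeded run per sample size gives one yes/no answer, which can change with the seed. Power is the long-run detection rate, so one way to estimate it is to repeat the experiment many times and count the detections. In the sketch below, `estimate_power` is a hypothetical helper, and the 500 repetitions and the seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

def estimate_power(n, effect_size=0.5, n_simulations=500, alpha=0.05, seed=0):
    """Estimate power as the share of simulated experiments with p < alpha."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_simulations):
        group_a = rng.normal(100, 15, n)
        group_b = rng.normal(100 + effect_size * 15, 15, n)
        _, p = stats.ttest_ind(group_a, group_b)
        hits += int(p < alpha)
    return hits / n_simulations

for n in [10, 20, 50, 100, 200]:
    print(f"n={n:3d}: estimated power = {estimate_power(n):.0%}")
```

The estimates should rise towards 100% as `n` grows, giving a smoother picture than the detected/missed output of one run.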

In [None]:
# Power calculation: How many samples do we need?
from statsmodels.stats.power import tt_ind_solve_power

# For 80% power to detect medium effect (d=0.5)
effect_size = 0.5
alpha = 0.05
power = 0.8

# Calculate required sample size
n_required = tt_ind_solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    ratio=1  # Equal group sizes
)

# Round up: a fractional participant cannot be recruited
n_per_group = int(np.ceil(n_required))
print(f"Required sample size per group: {n_per_group}")
print(f"Total participants needed: {2 * n_per_group}")

**Exercise 3 – Power Analysis (medium)**  
Calculate the sample size needed to detect a 10% relative improvement in conversion rate.


In [None]:
# Your turn
# Current conversion rate: 5%
# Want to detect: 5.5% (10% relative improvement)
# Calculate required sample size


<details>
<summary><b>Solution</b></summary>

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import zt_ind_solve_power

# Current and target conversion rates
p1 = 0.05  # 5%
p2 = 0.055  # 5.5% (10% relative improvement)

# Calculate effect size for proportions
effect = proportion_effectsize(p1, p2)
print(f"Effect size: {effect:.3f}")

# Calculate required sample size
n_required = zt_ind_solve_power(
    effect_size=effect,
    alpha=0.05,
    power=0.8,
    ratio=1
)

n_per_group = int(np.ceil(n_required))  # round up to whole visitors
print(f"\nRequired sample size per group: {n_per_group:,}")
print(f"Total visitors needed: {2 * n_per_group:,}")

# Reality check
if n_required > 10000:
    print("\n⚠️ Large sample needed! Consider:")
    print("- Running test longer")
    print("- Accepting lower power")
    print("- Looking for larger effects")
```
</details>

## 5. Multiple Testing Correction

In [None]:
# The multiple comparisons problem
np.random.seed(42)

# Test 20 different metrics (all from same distribution)
n_tests = 20
p_values = []

for i in range(n_tests):
    control = np.random.normal(100, 15, 100)
    treatment = np.random.normal(100, 15, 100)  # No real difference!
    _, p = stats.ttest_ind(control, treatment)
    p_values.append(p)

# Without correction
significant_raw = sum(p < 0.05 for p in p_values)
print(f"Without correction: {significant_raw}/{n_tests} 'significant' results")
print(f"That's {significant_raw/n_tests:.0%} false positives!\n")

In [None]:
# Bonferroni correction
bonferroni_alpha = 0.05 / n_tests
significant_bonferroni = sum(p < bonferroni_alpha for p in p_values)

print(f"Bonferroni correction:")
print(f"Adjusted α = {bonferroni_alpha:.4f}")
print(f"Significant results: {significant_bonferroni}/{n_tests}")

# Benjamini-Hochberg (less conservative)
from statsmodels.stats.multitest import multipletests
reject, p_adjusted, _, _ = multipletests(p_values, method='fdr_bh')
print(f"\nBenjamini-Hochberg (FDR):")
print(f"Significant results: {sum(reject)}/{n_tests}")
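
The reason corrections are needed follows from simple arithmetic. If each test has a 5% false positive rate and the tests are independent (an assumption made here for illustration), the chance of at least one false positive across m tests is 1 − (1 − α)^m. A minimal sketch, with `family_wise_error_rate` as a hypothetical helper name:

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

alpha = 0.05
for m in [1, 5, 20, 100]:
    uncorrected = family_wise_error_rate(alpha, m)
    corrected = family_wise_error_rate(alpha / m, m)  # Bonferroni: test each at alpha / m
    print(f"{m:3d} tests: uncorrected = {uncorrected:.1%}, with Bonferroni = {corrected:.1%}")
```

For 20 tests the uncorrected rate is about 64%, while testing each metric at α / 20 keeps the family-wise rate just under 5%.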

**Exercise 4 – A/B Test Analysis (hard)**  
Analyze results from an A/B test with multiple metrics.


In [None]:
# Your turn
# Metrics: clicks, time_on_site, bounce_rate, conversion
# control_data = {
#     'clicks': [120, 115, 125, 110, 130, 122, 118, 124],
#     'time_on_site': [45, 52, 48, 51, 47, 50, 49, 46],
#     'bounce_rate': [0.35, 0.32, 0.38, 0.33, 0.36, 0.34, 0.37, 0.35],
#     'conversion': [0.045, 0.042, 0.048, 0.041, 0.046, 0.044, 0.043, 0.045]
# }
# treatment_data = {
#     'clicks': [135, 128, 140, 132, 138, 136, 134, 137],
#     'time_on_site': [52, 58, 55, 57, 54, 56, 53, 55],
#     'bounce_rate': [0.28, 0.25, 0.30, 0.27, 0.29, 0.26, 0.28, 0.27],
#     'conversion': [0.052, 0.055, 0.058, 0.054, 0.056, 0.053, 0.057, 0.055]
# }


<details>
<summary><b>Solution</b></summary>

```python
control_data = {
    'clicks': [120, 115, 125, 110, 130, 122, 118, 124],
    'time_on_site': [45, 52, 48, 51, 47, 50, 49, 46],
    'bounce_rate': [0.35, 0.32, 0.38, 0.33, 0.36, 0.34, 0.37, 0.35],
    'conversion': [0.045, 0.042, 0.048, 0.041, 0.046, 0.044, 0.043, 0.045]
}
treatment_data = {
    'clicks': [135, 128, 140, 132, 138, 136, 134, 137],
    'time_on_site': [52, 58, 55, 57, 54, 56, 53, 55],
    'bounce_rate': [0.28, 0.25, 0.30, 0.27, 0.29, 0.26, 0.28, 0.27],
    'conversion': [0.052, 0.055, 0.058, 0.054, 0.056, 0.053, 0.057, 0.055]
}

# Run tests for each metric
results = []
for metric in control_data.keys():
    control = np.array(control_data[metric])
    treatment = np.array(treatment_data[metric])
    
    t_stat, p_value = stats.ttest_ind(control, treatment)
    
    # Calculate percentage change
    pct_change = (treatment.mean() - control.mean()) / control.mean() * 100
    
    results.append({
        'metric': metric,
        'control_mean': control.mean(),
        'treatment_mean': treatment.mean(),
        'pct_change': pct_change,
        'p_value': p_value
    })

# Create results dataframe
results_df = pd.DataFrame(results)

# Apply Bonferroni correction
results_df['p_value_adjusted'] = (results_df['p_value'] * len(results_df)).clip(upper=1.0)  # a p-value cannot exceed 1
results_df['significant'] = results_df['p_value_adjusted'] < 0.05

print("A/B Test Results Summary:")
print("="*60)
for _, row in results_df.iterrows():
    print(f"\n{row['metric'].upper()}:")
    print(f"  Control: {row['control_mean']:.3f}")
    print(f"  Treatment: {row['treatment_mean']:.3f}")
    print(f"  Change: {row['pct_change']:+.1f}%")
    print(f"  p-value: {row['p_value']:.4f}")
    print(f"  Adjusted p-value: {row['p_value_adjusted']:.4f}")
    print(f"  Significant: {'✅ Yes' if row['significant'] else '❌ No'}")

print("\n" + "="*60)
print("RECOMMENDATION:")
if results_df['significant'].any():
    print("✅ Significant differences found. Check each change is in the desired direction before implementing.")
else:
    print("⚠️ No significant improvements after correction. Need more data.")
```
</details>

## 6. Mini-Challenges
- **M1 (easy):** Test whether a coin is fair (60 heads in 100 flips)
- **M2 (medium):** Run a paired t-test on before/after data from a weight loss program
- **M3 (hard):** Analyze an A/B test with 3 groups (ANOVA)

In [None]:
# Your turn - try the challenges!


<details>
<summary><b>Solutions</b></summary>

```python
# M1 - Coin fairness test
# binom_test was removed in SciPy 1.12; binomtest (SciPy >= 1.7) replaces it
heads = 60
flips = 100
p_value = stats.binomtest(heads, flips, p=0.5).pvalue
print(f"Observed: {heads}/{flips} heads ({heads/flips:.0%})")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("✅ Coin appears biased!")
else:
    print("❌ Cannot conclude coin is biased")

# M2 - Paired t-test
before = [85, 90, 78, 92, 87, 83, 88, 91, 79, 86]
after = [82, 87, 76, 88, 85, 80, 85, 87, 77, 83]
t_stat, p_value = stats.ttest_rel(before, after)
weight_loss = np.mean(np.array(before) - np.array(after))
print(f"\nAverage weight loss: {weight_loss:.1f} kg")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("✅ Significant weight loss!")

# M3 - ANOVA for 3 groups
control = [100, 102, 98, 105, 103, 99, 101, 104]
variant_a = [108, 110, 107, 112, 109, 111, 108, 110]
variant_b = [115, 118, 116, 120, 117, 119, 115, 118]

f_stat, p_value = stats.f_oneway(control, variant_a, variant_b)
print(f"\nF-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.6f}")
if p_value < 0.05:
    print("✅ Significant difference between groups!")
    print("Run post-hoc tests to find which groups differ")
```
</details>
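
The M3 solution ends by suggesting post-hoc tests. One simple option, sketched here as an assumption and not as the notebook's prescribed method, is pairwise t-tests with the Bonferroni correction from Section 5. Dedicated procedures such as Tukey's HSD are a common alternative.

```python
from itertools import combinations
from scipy import stats

# Same three groups as the M3 solution
groups = {
    'control': [100, 102, 98, 105, 103, 99, 101, 104],
    'variant_a': [108, 110, 107, 112, 109, 111, 108, 110],
    'variant_b': [115, 118, 116, 120, 117, 119, 115, 118],
}

pairs = list(combinations(groups, 2))  # every pair of group names
posthoc = {}
for name_1, name_2 in pairs:
    _, p = stats.ttest_ind(groups[name_1], groups[name_2])
    p_adjusted = min(p * len(pairs), 1.0)  # Bonferroni, capped at 1
    posthoc[(name_1, name_2)] = p_adjusted
    verdict = 'differ' if p_adjusted < 0.05 else 'no clear difference'
    print(f"{name_1} vs {name_2}: adjusted p = {p_adjusted:.6f} ({verdict})")
```

Each pair is tested at an effective α of 0.05 / 3, so a pair is flagged only if it survives the correction for three comparisons.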

## Wrap-Up & Next Steps
✅ You can formulate null and alternative hypotheses  
✅ You understand p-values and statistical significance  
✅ You know about Type I/II errors and power analysis  
✅ You can handle multiple comparisons  

**Next:** A/B Testing Framework - Putting it all together for real experiments!
