# Hypothesis Testing II: Exercises

**MAT0611 - Mathematics for Machine Learning**

This notebook contains exercises on two-sample tests, chi-square tests, and likelihood ratio tests.

---

## Part 1: Step-by-Step Exercises

Work through these problems manually to understand the underlying mechanics of the tests.

### Exercise 1.1: Chi-Square Goodness of Fit (Manual)

A genetics experiment produces offspring with the following observed counts across four phenotypes:

| Phenotype | A | B | C | D |
|-----------|---|---|---|---|
| Observed  | 30| 10| 15| 5 |

According to Mendelian genetics, the expected ratio should be 9:3:3:1.

**Tasks:**

a) Calculate the expected frequencies for each phenotype.

b) Compute the chi-square test statistic $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ and $E_i$ are the observed and expected counts for phenotype $i$.

c) Determine the degrees of freedom.

d) Using $\alpha = 0.05$ and a chi-square table, find the critical value.

e) State your conclusion: Do the data support the Mendelian ratio?

**Work space:** (Use paper or the cell below for calculations)

In [None]:
# Optional: Use this cell to check your calculations
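
One possible way to check the hand calculation is sketched below. It assumes `scipy` is available and recomputes the statistic from the formula in task (b) before comparing it with `scipy.stats.chisquare`; the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

observed = np.array([30, 10, 15, 5])  # counts from the table above
ratio = np.array([9, 3, 3, 1])        # hypothesized Mendelian ratio

# Expected counts: split the total sample size according to the ratio
expected = observed.sum() * ratio / ratio.sum()

# Chi-square statistic computed directly from the formula in task (b)
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: number of categories minus one
df = len(observed) - 1

# Critical value at alpha = 0.05 and the matching p-value
critical_value = stats.chi2.ppf(0.95, df)
p_value = stats.chi2.sf(chi2_manual, df)

# Cross-check against scipy's built-in goodness-of-fit test
chi2_scipy, p_scipy = stats.chisquare(observed, f_exp=expected)

print(f"Expected counts: {expected}")
print(f"chi2 = {chi2_manual:.4f}, df = {df}, critical value = {critical_value:.3f}")
print(f"p-value = {p_value:.4f} (scipy: chi2 = {chi2_scipy:.4f}, p = {p_scipy:.4f})")
```

When reading the output, keep in mind that the chi-square approximation is usually considered rough when an expected count falls below 5.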

### Exercise 1.2: Two-Sample t-test (Manual)

Two teaching methods are compared using the scores of two groups:

- Group A (n=5): [78, 82, 85, 80, 75]
- Group B (n=5): [88, 92, 85, 90, 85]

**Tasks:**

a) Calculate the mean and standard deviation for each group.

b) Calculate the pooled standard deviation $s_p$.

c) Compute the t-statistic for testing $H_0: \mu_A = \mu_B$ vs $H_1: \mu_A \neq \mu_B$.

d) Determine the degrees of freedom.

e) Using a t-table with $\alpha = 0.05$ (two-tailed), find the critical value.

f) State your conclusion.

**Work space:** (Use paper or the cell below)

In [None]:
# Optional: Check your calculations
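
A minimal sketch for verifying the manual steps, assuming the pooled-variance (equal variances) form of the test that tasks (b) and (c) describe. The intermediate names (`s_p`, `t_manual`, and so on) are chosen here for illustration.

```python
import numpy as np
from scipy import stats

group_a = np.array([78, 82, 85, 80, 75])
group_b = np.array([88, 92, 85, 90, 85])
n_a, n_b = len(group_a), len(group_b)

# Sample variances use ddof=1 (divide by n - 1)
var_a = group_a.var(ddof=1)
var_b = group_b.var(ddof=1)

# Pooled standard deviation weights each variance by its degrees of freedom
df = n_a + n_b - 2
s_p = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / df)

# t-statistic for H0: mu_A = mu_B
t_manual = (group_a.mean() - group_b.mean()) / (s_p * np.sqrt(1 / n_a + 1 / n_b))

# Two-tailed critical value at alpha = 0.05
t_critical = stats.t.ppf(0.975, df)

# Cross-check with scipy's pooled-variance t-test
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f"s_p = {s_p:.4f}, t = {t_manual:.4f}, df = {df}, critical = ±{t_critical:.3f}")
print(f"scipy: t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```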

### Exercise 1.3: Chi-Square Test of Independence (Manual)

A study examines the relationship between coffee consumption and insomnia:

|               | Insomnia | No Insomnia | Total |
|---------------|----------|----------------|-------|
| Coffee Drinker| 25       | 75             | 100   |
| Non-Drinker   | 10       | 90             | 100   |
| Total         | 35       | 165            | 200   |

**Tasks:**

a) Calculate expected frequencies for each cell under independence.

b) Compute the chi-square statistic.

c) Determine degrees of freedom.

d) Using $\alpha = 0.05$, is there evidence of an association?

**Work space:**

In [None]:
# Check your work
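
The sketch below is one way to check the table by machine. It builds the expected counts from the row and column totals and compares the result with `scipy.stats.chi2_contingency`. Note that scipy applies Yates' continuity correction to 2x2 tables by default, so `correction=False` is passed here to match the uncorrected formula used by hand.

```python
import numpy as np
from scipy import stats

observed = np.array([[25, 75],   # coffee drinkers
                     [10, 90]])  # non-drinkers

# Expected count per cell = (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Uncorrected Pearson chi-square statistic
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
critical_value = stats.chi2.ppf(0.95, df)

# correction=False disables Yates' continuity correction
chi2_scipy, p_scipy, df_scipy, expected_scipy = stats.chi2_contingency(
    observed, correction=False
)

print(f"Expected counts:\n{expected}")
print(f"chi2 = {chi2_manual:.4f}, df = {df}, critical value = {critical_value:.3f}")
print(f"scipy: chi2 = {chi2_scipy:.4f}, p = {p_scipy:.4f}")
```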

---

## Part 2: Applied Exercises with Python

Now let's use Python libraries to analyze simulated datasets and more complex scenarios.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

### Exercise 2.1: Two-Sample t-test with Simulated Data

A pharmaceutical company tests a new drug to reduce blood pressure. They measure systolic blood pressure (mmHg) for treatment and control groups.

**Task:** Determine if the drug significantly reduces blood pressure.

In [None]:
# Generate simulated data
np.random.seed(123)
control_group = np.random.normal(140, 15, 50)  # Mean: 140, SD: 15
treatment_group = np.random.normal(132, 15, 50)  # Mean: 132, SD: 15

print(
    f"Control group (n={len(control_group)}): Mean = {control_group.mean():.2f}, SD = {control_group.std(ddof=1):.2f}"
)
print(
    f"Treatment group (n={len(treatment_group)}): Mean = {treatment_group.mean():.2f}, SD = {treatment_group.std(ddof=1):.2f}"
)

**Your tasks:**

a) Create a visualization comparing the two groups (box plots or histograms).

b) Check the assumption of equal variances using Levene's test.

c) Perform an appropriate two-sample t-test.

d) Calculate Cohen's d effect size to measure the practical significance.

e) Interpret the results and state your conclusion.
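
As a starting point for tasks (b) to (d), the sketch below shows the general pattern on hypothetical data (not the blood pressure samples). The helper `cohens_d` and the 0.05 cutoff for Levene's test are assumptions made for illustration, not a prescribed procedure.

```python
import numpy as np
from scipy import stats


def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) / (n_x + n_y - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)


# Hypothetical illustration data, not the exercise data
rng = np.random.default_rng(0)
sample_x = rng.normal(100, 10, 40)
sample_y = rng.normal(95, 10, 40)

# Levene's test: a small p-value suggests the variances differ
levene_stat, levene_p = stats.levene(sample_x, sample_y)

# Use the pooled test only if Levene's test does not reject equal variances;
# otherwise fall back to Welch's t-test (equal_var=False)
equal_var = levene_p > 0.05
t_stat, p_value = stats.ttest_ind(sample_x, sample_y, equal_var=equal_var)

d = cohens_d(sample_x, sample_y)
print(f"Levene p = {levene_p:.3f}, equal_var = {equal_var}")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, Cohen's d = {d:.3f}")
```

Since the task asks whether the drug *reduces* blood pressure, a one-sided test may be the better fit; recent scipy versions accept an `alternative` argument in `ttest_ind` for this.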

In [None]:
# a) Visualization
# YOUR CODE HERE

In [None]:
# b) Levene's test for equal variances
# YOUR CODE HERE

In [None]:
# c) Two-sample t-test
# YOUR CODE HERE

In [None]:
# d) Cohen's d effect size
# YOUR CODE HERE

In [None]:
# e) Interpretation
# YOUR CODE HERE

**e) Your interpretation:**

(Write your conclusion here)

### Exercise 2.2: Paired t-test

A fitness program claims to reduce body weight. Weights (kg) were measured before and after the 8-week program for 20 participants.

In [None]:
# Generate paired data
np.random.seed(456)
n_participants = 20
before = np.random.normal(85, 12, n_participants)
weight_loss = np.random.normal(3, 2, n_participants)  # Average loss of 3 kg
after = before - weight_loss

# Create DataFrame
weight_data = pd.DataFrame(
    {
        "Participant": range(1, n_participants + 1),
        "Before": before,
        "After": after,
        "Difference": before - after,
    }
)

print(weight_data.head(10))
print(f"\nMean difference: {weight_data['Difference'].mean():.2f} kg")

**Your tasks:**

a) Create a visualization showing before/after weights (e.g., scatter plot with diagonal line, or paired plot).

b) Perform a paired t-test to determine if the program significantly reduces weight.

c) Calculate a 95% confidence interval for the mean weight loss.

d) What would happen if you incorrectly used an independent t-test instead? Compare the results.

e) Interpret your findings.
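
The following sketch illustrates tasks (b) to (d) on a small hypothetical sample, so the exercise data is left for you to analyze. It assumes a two-sided test and builds the confidence interval from the t-distribution of the paired differences.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after measurements for six subjects
before_demo = np.array([82.0, 90.5, 77.3, 95.1, 88.4, 79.9])
after_demo = np.array([80.1, 87.0, 76.8, 91.5, 86.0, 78.2])
diff = before_demo - after_demo
n = len(diff)

# Paired t-test: equivalent to a one-sample t-test on the differences
t_paired, p_paired = stats.ttest_rel(before_demo, after_demo)

# 95% confidence interval for the mean difference
mean_diff = diff.mean()
sem = diff.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean_diff - t_crit * sem, mean_diff + t_crit * sem

# Independent t-test on the same data ignores the pairing
t_ind, p_ind = stats.ttest_ind(before_demo, after_demo)

print(f"Paired:      t = {t_paired:.3f}, p = {p_paired:.4f}")
print(f"95% CI for mean difference: [{ci_low:.2f}, {ci_high:.2f}]")
print(f"Independent: t = {t_ind:.3f}, p = {p_ind:.4f}")
```

In this toy sample the between-subject spread is much larger than the spread of the differences, which is why the independent test loses most of its power.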

In [None]:
# a) Visualization
# YOUR CODE HERE

In [None]:
# b) Paired t-test
# YOUR CODE HERE

In [None]:
# c) 95% Confidence interval
# YOUR CODE HERE

In [None]:
# d) Compare with independent t-test (incorrect approach)
# YOUR CODE HERE

**e) Your interpretation:**

(Write your findings here)

### Exercise 2.3: Mann-Whitney U Test (Nonparametric)

Two different websites are tested for user satisfaction scores (1-10 scale). The data are ordinal and not normally distributed.

In [None]:
# Ordinal satisfaction scores (1-10)
np.random.seed(789)
website_a = np.random.choice([5, 6, 7, 8], size=30, p=[0.1, 0.3, 0.4, 0.2])
website_b = np.random.choice([6, 7, 8, 9, 10], size=30, p=[0.1, 0.2, 0.3, 0.3, 0.1])

print("Website A scores:")
print(f"  Median: {np.median(website_a)}, Mean: {np.mean(website_a):.2f}")
print(f"  Distribution: {np.bincount(website_a)[5:]}\n")

print("Website B scores:")
print(f"  Median: {np.median(website_b)}, Mean: {np.mean(website_b):.2f}")
print(f"  Distribution: {np.bincount(website_b)[6:]}")

**Your tasks:**

a) Create a visualization comparing the distributions (histograms or box plots).

b) Explain why Mann-Whitney U test is more appropriate than a t-test here.

c) Perform a Mann-Whitney U test.

d) For comparison, also perform a t-test. How do the results differ?

e) State your conclusion about which website has higher satisfaction.
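
To make the U statistic less of a black box, the sketch below counts it directly on hypothetical ratings and compares the count with `scipy.stats.mannwhitneyu`. The arrays are invented for illustration and contain ties, as ordinal data usually does.

```python
import numpy as np
from scipy import stats

# Hypothetical ordinal ratings with ties (not the exercise data)
ratings_x = np.array([5, 6, 6, 7, 7, 7, 8])
ratings_y = np.array([6, 7, 8, 8, 9, 9, 10])

# U for the first sample: number of (x, y) pairs where x > y,
# with tied pairs counting one half
greater = (ratings_x[:, None] > ratings_y[None, :]).sum()
ties = (ratings_x[:, None] == ratings_y[None, :]).sum()
u_manual = greater + 0.5 * ties

# scipy reports the U statistic for the first sample
u_stat, p_value = stats.mannwhitneyu(ratings_x, ratings_y, alternative="two-sided")

print(f"U (manual) = {u_manual}, U (scipy) = {u_stat}, p = {p_value:.4f}")
```

Because the test only uses the ordering of the scores, it does not rely on the spacing between rating levels being meaningful.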

In [None]:
# a) Visualization
# YOUR CODE HERE

**b) Why Mann-Whitney U is appropriate:**

(Explain here)

In [None]:
# c) Mann-Whitney U test
# YOUR CODE HERE

In [None]:
# d) Compare with t-test
# YOUR CODE HERE

**e) Your conclusion:**

(Write here)

### Exercise 2.3.5: Kolmogorov-Smirnov Test

A machine learning model generates prediction errors that should theoretically follow a normal distribution. You've collected 100 prediction errors and want to test this assumption. Additionally, you have errors from two different models and want to compare their distributions.

**Your tasks:**

a) Visualize the distributions of both models using histograms and Q-Q plots.

b) Perform a one-sample KS test to check if Model 1 errors follow a standard normal distribution N(0,1).

c) Perform a two-sample KS test to compare the distributions of Model 1 and Model 2 errors.

d) Compare your KS test results with a t-test. What different information does each test provide?

e) Based on your analysis, which test (KS or t-test) is more appropriate for this scenario and why?
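
The contrast between the KS test and the t-test can be seen on hypothetical samples that share a mean but differ in spread, as sketched below. The sample sizes and parameters are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical samples: same mean, different spread
rng = np.random.default_rng(1)
narrow = rng.normal(0, 1, 500)
wide = rng.normal(0, 3, 500)

# One-sample KS test against the standard normal N(0, 1);
# the reference parameters are fixed in advance, not estimated from the data
ks_one, p_one = stats.kstest(wide, "norm")

# Two-sample KS test compares the two empirical distribution functions
ks_two, p_two = stats.ks_2samp(narrow, wide)

# The t-test only compares means, so it is largely blind to the spread difference
t_stat, p_t = stats.ttest_ind(narrow, wide, equal_var=False)

print(f"One-sample KS: D = {ks_one:.3f}, p = {p_one:.2e}")
print(f"Two-sample KS: D = {ks_two:.3f}, p = {p_two:.2e}")
print(f"Welch t-test:  t = {t_stat:.3f}, p = {p_t:.4f}")
```

If the reference distribution's mean and standard deviation were estimated from the same sample, the standard KS p-value would no longer be valid as computed here.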

In [None]:
# Prediction errors from Model 1
np.random.seed(789)
errors_model1 = np.concatenate(
    [
        np.random.normal(0, 1, 80),  # Most errors are normal
        np.random.uniform(-3, 3, 20),  # But some outliers
    ]
)

# Prediction errors from Model 2
errors_model2 = np.random.normal(0.5, 1.2, 100)  # Slightly different distribution

print("Model 1 errors - sample statistics:")
print(f"  Mean: {np.mean(errors_model1):.3f}, Std: {np.std(errors_model1, ddof=1):.3f}")
print(f"  Min: {np.min(errors_model1):.2f}, Max: {np.max(errors_model1):.2f}")

print("\nModel 2 errors - sample statistics:")
print(f"  Mean: {np.mean(errors_model2):.3f}, Std: {np.std(errors_model2, ddof=1):.3f}")
print(f"  Min: {np.min(errors_model2):.2f}, Max: {np.max(errors_model2):.2f}")

In [None]:
# a) Visualizations
# YOUR CODE HERE

In [None]:
# b) One-sample KS test (Model 1 vs. standard normal)
# YOUR CODE HERE

In [None]:
# c) Two-sample KS test (Model 1 vs. Model 2)
# YOUR CODE HERE

In [None]:
# d) Compare with t-test
# YOUR CODE HERE

**e) Your analysis and recommendation:**

(Explain which test is more appropriate and why)

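### Exercise 2.4: Chi-Square Test of Independence

The contingency table below records whether customers made a purchase, split by weekday versus weekend.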

In [None]:
# Create contingency table
purchase_data = pd.DataFrame(
    {
        "Day": ["Weekday", "Weekend"],
        "Purchased": [120, 80],
        "Did Not Purchase": [180, 120],
    }
)
purchase_data = purchase_data.set_index("Day")

print("Contingency Table:")
print(purchase_data)
print(f"\nTotal customers: {purchase_data.sum().sum()}")

**Your tasks:**

a) Visualize the contingency table using a grouped bar chart or heatmap.

b) Calculate expected frequencies manually and verify with scipy.

c) Perform chi-square test of independence.

d) Calculate Cramér's V to measure effect size.

e) Interpret: Is there evidence that purchase behavior depends on weekday vs weekend?
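
For task (d), a small helper like the one below may be useful. It uses the common definition $V = \sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}$ with the uncorrected chi-square statistic; the function name `cramers_v` and the demo table are hypothetical.

```python
import numpy as np
from scipy import stats


def cramers_v(table):
    """Cramér's V from a contingency table, based on the uncorrected chi-square."""
    table = np.asarray(table, dtype=float)
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.sum()
    min_dim = min(table.shape) - 1  # min(rows - 1, columns - 1)
    return np.sqrt(chi2 / (n * min_dim))


# Hypothetical 2x2 table with a clear association (not the exercise data)
demo_table = np.array([[30, 10],
                       [10, 30]])
v_demo = cramers_v(demo_table)
print(f"Cramér's V = {v_demo:.3f}")
```

Values near 0 indicate little association and values near 1 a strong one; the cutoffs for "small" or "large" are conventions rather than fixed rules.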

In [None]:
# a) Visualization
# YOUR CODE HERE

In [None]:
# b) Calculate expected frequencies
# YOUR CODE HERE

In [None]:
# c) Chi-square test of independence
# YOUR CODE HERE

In [None]:
# d) Cramér's V effect size
# YOUR CODE HERE

**e) Your interpretation:**

(Write here)

### Exercise 2.5: Chi-Square Test of Homogeneity

A company has three regional offices and wants to know if employee job satisfaction levels are similar across regions.

In [None]:
# Job satisfaction data by region
satisfaction_data = pd.DataFrame(
    {
        "Region": ["North", "South", "West"],
        "Low": [25, 30, 20],
        "Medium": [45, 40, 50],
        "High": [30, 30, 30],
    }
)
satisfaction_data = satisfaction_data.set_index("Region")

print("Job Satisfaction by Region:")
print(satisfaction_data)
print(f"\nTotal employees: {satisfaction_data.sum().sum()}")

**Your tasks:**

a) Create a stacked bar chart showing satisfaction distribution by region.

b) State the null and alternative hypotheses for a test of homogeneity.

c) Perform the chi-square test of homogeneity.

d) Examine the standardized residuals to identify which cells contribute most to the chi-square statistic.

e) Conclude: Do all regions have similar satisfaction distributions?
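
For task (d), the sketch below computes two kinds of residuals on a hypothetical table. Pearson residuals, $(O - E)/\sqrt{E}$, square and sum to the chi-square statistic; adjusted residuals additionally divide by $\sqrt{(1 - p_{i+})(1 - p_{+j})}$ so that they are approximately standard normal under the null hypothesis. Which of the two the exercise intends is an assumption left open here.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 table of counts (not the exercise data)
demo_counts = np.array([[20, 30, 50],
                        [35, 30, 35]])

chi2, p_value, dof, expected = stats.chi2_contingency(demo_counts, correction=False)

# Pearson residuals: their squares sum to the chi-square statistic
pearson_resid = (demo_counts - expected) / np.sqrt(expected)

# Adjusted residuals: rescaled using the row and column proportions
n = demo_counts.sum()
row_prop = demo_counts.sum(axis=1, keepdims=True) / n
col_prop = demo_counts.sum(axis=0, keepdims=True) / n
adjusted_resid = pearson_resid / np.sqrt((1 - row_prop) * (1 - col_prop))

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
print("Pearson residuals:\n", np.round(pearson_resid, 2))
print("Adjusted residuals:\n", np.round(adjusted_resid, 2))
```

A common rule of thumb flags cells whose adjusted residual exceeds about 2 in absolute value.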

In [None]:
# a) Stacked bar chart
# YOUR CODE HERE

**b) Hypotheses:**

- H₀: (Write null hypothesis)
- H₁: (Write alternative hypothesis)

In [None]:
# c) Chi-square test of homogeneity
# YOUR CODE HERE

In [None]:
# d) Standardized residuals
# Hint: Pearson (standardized) residuals = (observed - expected) / sqrt(expected)
# YOUR CODE HERE

**e) Your conclusion:**

(Write here)

### Exercise 2.6: Likelihood Ratio Test for Regression Models

You're building a model to predict house prices. You want to test whether adding square footage and number of bedrooms significantly improves the model beyond just using location.

In [None]:
# Generate synthetic housing data
np.random.seed(2024)
n = 100

location_score = np.random.uniform(1, 10, n)  # Location quality (1-10)
sqft = np.random.uniform(800, 3000, n)  # Square footage
bedrooms = np.random.randint(1, 6, n)  # Number of bedrooms

# Price depends on all factors
price = (
    50000
    + 30000 * location_score
    + 100 * sqft
    + 15000 * bedrooms
    + np.random.normal(0, 20000, n)
)

housing_df = pd.DataFrame(
    {
        "location_score": location_score,
        "sqft": sqft,
        "bedrooms": bedrooms,
        "price": price,
    }
)

print(housing_df.head())
print("\nCorrelation with price:")
print(housing_df.corr()["price"].sort_values(ascending=False))

**Your tasks:**

a) Fit a reduced model: `price ~ location_score`

b) Fit a full model: `price ~ location_score + sqft + bedrooms`

c) Perform a likelihood ratio test to compare the models.

d) Compare with the F-test provided by statsmodels.

e) Interpret: Is the additional complexity justified?
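
To show what the likelihood ratio test computes, the sketch below works it out with NumPy alone on hypothetical data, assuming a linear model with Gaussian errors. The helper `gaussian_loglik` is written for this illustration; with statsmodels, the fitted model's `llf` attribute should give the same quantity.

```python
import numpy as np
from scipy import stats


def gaussian_loglik(X, y):
    """Maximized Gaussian log-likelihood of an OLS fit (X includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n = len(y)
    # Plug the MLE of the error variance, rss / n, into the normal log-likelihood
    return -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)


# Hypothetical data: y depends on both x1 and x2
rng = np.random.default_rng(2)
n_obs = 200
x1 = rng.normal(size=n_obs)
x2 = rng.normal(size=n_obs)
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n_obs)

X_reduced = np.column_stack([np.ones(n_obs), x1])   # nested in the full model
X_full = np.column_stack([np.ones(n_obs), x1, x2])

ll_reduced = gaussian_loglik(X_reduced, y)
ll_full = gaussian_loglik(X_full, y)

# G = 2 * (loglik_full - loglik_reduced), compared to chi-square with
# degrees of freedom equal to the number of added parameters
G = 2 * (ll_full - ll_reduced)
df_diff = X_full.shape[1] - X_reduced.shape[1]
p_value = stats.chi2.sf(G, df_diff)

print(f"G = {G:.2f}, df = {df_diff}, p = {p_value:.2e}")
```

The chi-square reference distribution is an asymptotic approximation, which is one reason to compare the result with the exact F-test in task (d).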

In [None]:
import statsmodels.api as sm

# a) Reduced model
# YOUR CODE HERE

In [None]:
# b) Full model
# YOUR CODE HERE

In [None]:
# c) Likelihood ratio test
# G = 2 * (loglik_full - loglik_reduced)
# YOUR CODE HERE

In [None]:
# d) Compare with F-test
# Hint: fitted OLS results provide compare_f_test(restricted) and compare_lr_test(restricted)
# YOUR CODE HERE

**e) Your interpretation:**

(Write here)

---

## Challenge Exercise: Comprehensive Analysis

A medical study investigates the effectiveness of three different treatments for reducing cholesterol levels. Patients are randomly assigned to one of three groups, and cholesterol levels are measured before and after treatment.

In [None]:
# Generate data for three treatment groups
np.random.seed(999)
n_per_group = 25

# Treatment A: moderate effect
before_a = np.random.normal(220, 25, n_per_group)
after_a = before_a - np.random.normal(15, 8, n_per_group)

# Treatment B: strong effect
before_b = np.random.normal(220, 25, n_per_group)
after_b = before_b - np.random.normal(25, 10, n_per_group)

# Treatment C (control): minimal effect
before_c = np.random.normal(220, 25, n_per_group)
after_c = before_c - np.random.normal(5, 8, n_per_group)

cholesterol_df = pd.DataFrame(
    {
        "Treatment": ["A"] * n_per_group + ["B"] * n_per_group + ["C"] * n_per_group,
        "Before": np.concatenate([before_a, before_b, before_c]),
        "After": np.concatenate([after_a, after_b, after_c]),
    }
)
cholesterol_df["Change"] = cholesterol_df["Before"] - cholesterol_df["After"]

print(cholesterol_df.groupby("Treatment")[["Before", "After", "Change"]].describe())

**Your comprehensive analysis should include:**

1. **Exploratory Data Analysis**: Create visualizations showing distributions and changes.

2. **Within-group analysis**: Use paired t-tests to check if each treatment significantly reduces cholesterol.

3. **Between-group comparison**: Compare the effectiveness of the three treatments (you'll need ANOVA or Kruskal-Wallis for this - choose appropriately).

4. **Post-hoc analysis**: If there are overall differences, perform pairwise comparisons.

5. **Effect sizes**: Calculate appropriate effect size measures.

6. **Clinical significance**: Beyond statistical significance, discuss practical implications.

7. **Final recommendation**: Which treatment would you recommend and why?
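
For steps 3 and 4, one possible workflow is sketched below on hypothetical groups. It runs both a one-way ANOVA and a Kruskal-Wallis test, then performs pairwise Welch t-tests with a Bonferroni adjustment. The choice of Bonferroni and of Welch's test is an assumption made for simplicity; other post-hoc procedures may suit the data better.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical outcome values for three groups (not the cholesterol data)
rng = np.random.default_rng(3)
groups = {
    "X": rng.normal(10, 3, 30),
    "Y": rng.normal(14, 3, 30),
    "Z": rng.normal(10, 3, 30),
}

# Omnibus tests: parametric and rank-based
f_stat, p_anova = stats.f_oneway(*groups.values())
h_stat, p_kruskal = stats.kruskal(*groups.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.2e}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kruskal:.2e}")

# Pairwise Welch t-tests with Bonferroni adjustment:
# multiply each raw p-value by the number of comparisons, capped at 1
pairs = list(combinations(groups, 2))
adjusted = {}
for name_1, name_2 in pairs:
    _, p_raw = stats.ttest_ind(groups[name_1], groups[name_2], equal_var=False)
    adjusted[(name_1, name_2)] = min(1.0, p_raw * len(pairs))

for pair, p_adj in adjusted.items():
    print(f"{pair[0]} vs {pair[1]}: Bonferroni-adjusted p = {p_adj:.4f}")
```

For the cholesterol data, the natural outcome to compare across treatments is the `Change` column, since it already accounts for each patient's baseline.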

In [None]:
# 1. Exploratory Data Analysis
# YOUR CODE HERE

In [None]:
# 2. Within-group analysis (paired t-tests)
# YOUR CODE HERE

In [None]:
# 3. Between-group comparison
# YOUR CODE HERE

In [None]:
# 4. Post-hoc pairwise comparisons
# YOUR CODE HERE

In [None]:
# 5. Effect sizes
# YOUR CODE HERE

**6. Clinical significance discussion:**

(Discuss here)

**7. Final recommendation:**

(Your recommendation here)

---

## Reflection Questions

1. When should you use a paired t-test vs an independent t-test? Give an example where using the wrong test could lead to incorrect conclusions.

2. What are the advantages and disadvantages of nonparametric tests like Mann-Whitney U compared to parametric tests?

3. Explain the conceptual difference between a chi-square test of independence and a test of homogeneity. Why do they use the same formula?

4. When comparing nested models, what are the advantages of using a likelihood ratio test versus examining individual coefficient p-values?

5. How would you explain to a non-statistician the difference between statistical significance and practical significance?