# Hypothesis testing

In this lab you learn how to test statistical hypotheses using sample data. We use several statistical tests to make decisions under uncertainty.

In [None]:
import numpy as np
from scipy import stats

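## ✍️ p-value
Compute the one-sided p-value under the null hypothesis $H_0: X \sim \mathcal{N}(55, 81^2)$ for the following values of $x$: 15, 120, 63, 888. For each value, use the tail in the direction of the observation: $P(X \le x)$ when $x < \mu$ and $P(X \ge x)$ otherwise.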

In [2]:
mu = 55
sigma = 81

for x in [15, 120, 63, 888]:
    p_right = 1 - stats.norm.cdf(x, mu, sigma)
    p_left = stats.norm.cdf(x, mu, sigma)

    if x < mu:
        print(f"x = {x:3d}: P(X≤x) = {p_left:.4f}")
    else:
        print(f"x = {x:3d}: P(X≥x) = {p_right:.4f}")

x =  15: P(X≤x) = 0.3107
x = 120: P(X≥x) = 0.2111
x =  63: P(X≥x) = 0.4607
x = 888: P(X≥x) = 0.0000
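
The same tail probabilities can also be obtained by standardizing each observation first. The sketch below is one possible variant, not part of the original exercise: the helper name `one_sided_p_value` is hypothetical, and it uses `stats.norm.sf` (the survival function) so that very small tails, such as the one for $x = 888$, are not rounded to zero before printing.

```python
from scipy import stats

mu, sigma = 55, 81  # parameters of the null distribution N(55, 81^2)


def one_sided_p_value(x, mu, sigma):
    """Tail probability in the direction of x under N(mu, sigma^2).

    Hypothetical helper for illustration only.
    """
    z = (x - mu) / sigma
    # sf(|z|) is the upper tail of the standard normal; by symmetry it
    # equals the lower tail P(X <= x) when x lies below the mean.
    return stats.norm.sf(abs(z))


for x in [15, 120, 63, 888]:
    z = (x - mu) / sigma
    print(f"x = {x:3d}: z = {z:6.2f}, p = {one_sided_p_value(x, mu, sigma):.3e}")
```

Printing in scientific notation shows that the p-value for 888 is tiny but not exactly zero.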


## ✍️ One-Sample t-test

A company claims that the average commute time of its employees is 30 minutes. A sample of 20 employees shows the following commute times (in minutes):

`[28, 35, 32, 29, 31, 33, 27, 36, 30, 34, 31, 29, 32, 28, 35, 31, 33, 30, 32, 29]`

Test whether the mean commute time differs significantly from 30 minutes (α = 0.05). Do this both by manually computing the test statistic and with `scipy.stats.ttest_1samp`.

In [3]:
# Sample data
commute_times = np.array(
    [28, 35, 32, 29, 31, 33, 27, 36, 30, 34, 31, 29, 32, 28, 35, 31, 33, 30, 32, 29]
)
mu_0 = 30  # Hypothesized mean

# Manual calculation
# Calculate sample statistics
sample_mean = commute_times.mean()
sample_std = commute_times.std(ddof=1)  # N-1 for sample standard deviation
n = len(commute_times)
se = sample_std / np.sqrt(n)  # Standard error

# Calculate t-statistic manually
t_statistic_manual = (sample_mean - mu_0) / se

# Degrees of freedom
df = n - 1

# Calculate p-value (two-tailed test)
p_value_manual = 2 * (1 - stats.t.cdf(abs(t_statistic_manual), df))

print("Manual")
print(f"t-statistic: {t_statistic_manual:.4f}")
print(f"p-value (two-tailed): {p_value_manual:.4f}")

# Using scipy.stats.ttest_1samp
t_statistic_scipy, p_value_scipy = stats.ttest_1samp(commute_times, mu_0)


print("\nUSING stats.ttest_1samp():")
print(f"t-statistic: {t_statistic_scipy:.4f}")
print(f"p-value (two-tailed): {p_value_scipy:.4f}")

# Hypothesis test conclusion
print("\nTest result:")
if p_value_scipy < 0.05:
    print("The mean commute time significantly differs from 30 minutes.")
else:
    print("No significant evidence that mean commute time differs from 30 minutes.")

Manual
t-statistic: 2.1904
p-value (two-tailed): 0.0412

USING stats.ttest_1samp():
t-statistic: 2.1904
p-value (two-tailed): 0.0412

Test result:
The mean commute time significantly differs from 30 minutes.
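
The two-sided test can also be read as a critical-value or confidence-interval check: $H_0$ is rejected at level α exactly when $|t|$ exceeds the critical value $t_{1-\alpha/2,\,n-1}$, which is the same as $\mu_0$ falling outside the $(1-\alpha)$ confidence interval for the mean. The following is a minimal sketch of that equivalent view, reusing the data from the exercise. The variable names `t_crit`, `ci_low`, `ci_high` and `reject` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

commute_times = np.array(
    [28, 35, 32, 29, 31, 33, 27, 36, 30, 34, 31, 29, 32, 28, 35, 31, 33, 30, 32, 29]
)
mu_0 = 30     # hypothesized mean
alpha = 0.05  # significance level

n = len(commute_times)
sample_mean = commute_times.mean()
se = commute_times.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# Two-sided critical value of the t-distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

# (1 - alpha) confidence interval for the population mean
ci_low = sample_mean - t_crit * se
ci_high = sample_mean + t_crit * se

# Reject H0 when the hypothesized mean lies outside the interval
reject = not (ci_low <= mu_0 <= ci_high)

print(f"critical value: {t_crit:.4f}")
print(f"{1 - alpha:.0%} CI for the mean: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"reject H0: {reject}")
```

Unlike the p-value alone, the interval also shows how far the plausible means lie from 30 minutes.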


## ✍️ Two-Sample t-test

A pharmaceutical company is testing a new medicine to lower blood pressure. They compare two groups:
- **Control group** (placebo): [120, 118, 122, 125, 119, 121, 123, 120, 124, 122, 121, 119, 123, 120, 122]
- **Treatment group** (medicine): [115, 112, 118, 114, 116, 113, 115, 117, 114, 116, 115, 113, 112, 118, 116]

Test whether the medicine significantly lowers blood pressure (one-sided test, α = 0.05), assuming a pooled sample variance (equal variances in both groups). Do this both by manually computing the test statistic and with `scipy.stats.ttest_ind`.

In [5]:
# Blood pressure data
control = np.array([120, 118, 122, 125, 119, 121, 123, 120, 124, 122, 121, 119, 123, 120, 122])
treatment = np.array([115, 112, 118, 114, 116, 113, 115, 117, 114, 116, 115, 113, 112, 118, 116])

# Manual calculation
n1 = len(control)
n2 = len(treatment)
mean1 = control.mean()
mean2 = treatment.mean()
std1 = control.std(ddof=1)
std2 = treatment.std(ddof=1)

# Pooled standard deviation (assuming equal variances)
pooled_std = np.sqrt(((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2))

# Standard error of difference
se_diff = pooled_std * np.sqrt(1 / n1 + 1 / n2)

# t-statistic
t_stat_manual = (mean1 - mean2) / se_diff

# Degrees of freedom
df_manual = n1 + n2 - 2

# p-value for one-sided test (control > treatment)
p_value_manual = 1 - stats.t.cdf(t_stat_manual, df_manual)

print("Manual")
print(f"t-statistic: {t_stat_manual:.4f}")
print(f"p-value: {p_value_manual:.4f}")

# Using scipy.stats.ttest_ind
t_stat_scipy, p_value_scipy = stats.ttest_ind(control, treatment, alternative="greater")

print("\nUsing scipy.stats.ttest_ind():")
print(f"t-statistic: {t_stat_scipy:.4f}")
print(f"p-value: {p_value_scipy:.4f}")

# Hypothesis test conclusion
if p_value_scipy < 0.05:
    print("The medicine significantly lowers blood pressure.")
else:
    print("No significant evidence that the medicine lowers blood pressure.")

Manual
t-statistic: 8.8369
p-value: 0.0000

Using scipy.stats.ttest_ind():
t-statistic: 8.8369
p-value: 0.0000
The medicine significantly lowers blood pressure.
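
The exercise assumes equal variances in both groups. As a sanity check on that assumption, one could compare the pooled test with Welch's t-test, which `scipy.stats.ttest_ind` runs when `equal_var=False` is passed. The sketch below is an optional addition, not part of the original solution. It prints the p-values in scientific notation, because four decimals only show `0.0000`.

```python
import numpy as np
from scipy import stats

control = np.array([120, 118, 122, 125, 119, 121, 123, 120, 124, 122, 121, 119, 123, 120, 122])
treatment = np.array([115, 112, 118, 114, 116, 113, 115, 117, 114, 116, 115, 113, 112, 118, 116])

# Pooled-variance (Student) t-test, as in the exercise
t_pooled, p_pooled = stats.ttest_ind(control, treatment, alternative="greater")

# Welch's t-test: no equal-variance assumption
t_welch, p_welch = stats.ttest_ind(
    control, treatment, equal_var=False, alternative="greater"
)

print(f"pooled: t = {t_pooled:.4f}, p = {p_pooled:.3e}")
print(f"Welch:  t = {t_welch:.4f}, p = {p_welch:.3e}")
```

Because both groups have the same size, the two t-statistics coincide and only the degrees of freedom differ, so the conclusion here does not depend on the equal-variance assumption.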


## ✍️ Binomial test

An online shop claims that 80% of its customers are satisfied. In a sample of 100 customers, 72 are satisfied.

Test whether satisfaction is significantly lower than 80% (one-sided test, α = 0.05). Use `scipy.stats.binomtest` for this.

In [7]:
# Binomial test parameters
n_trials = 100  # Total number of customers
n_successes = 72  # Number of satisfied customers
p_claimed = 0.80  # Claimed satisfaction rate

result = stats.binomtest(n_successes, n_trials, p_claimed, alternative="less")

# Calculate sample proportion
p_observed = n_successes / n_trials

print(f"Sample proportion: {p_observed:.2%}")
print(f"p-value (one-sided): {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("Satisfaction is significantly lower than 80%.")
else:
    print("No significant evidence that satisfaction is lower than 80%.")

Sample proportion: 72.00%
p-value (one-sided): 0.0342
Satisfaction is significantly lower than 80%.
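
For the alternative "less", the exact binomial p-value is the probability of observing at most 72 satisfied customers if the true rate were 80%, that is $P(X \le 72)$ with $X \sim \mathrm{Bin}(100, 0.8)$. The sketch below checks this by hand with `stats.binom.cdf`. The names `p_manual` and `p_scipy` are introduced here for illustration.

```python
from scipy import stats

n_trials = 100    # total number of customers
n_successes = 72  # number of satisfied customers
p_claimed = 0.80  # claimed satisfaction rate

# Lower-tail probability P(X <= 72) under Bin(100, 0.8)
p_manual = stats.binom.cdf(n_successes, n_trials, p_claimed)

# Same quantity via the exact binomial test
p_scipy = stats.binomtest(
    n_successes, n_trials, p_claimed, alternative="less"
).pvalue

print(f"manual: {p_manual:.4f}")
print(f"scipy:  {p_scipy:.4f}")
```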
