When planning an experiment, we often do not know in advance how many trials to run (Section 11.2.2 of *Mathematical Statistics and Data Analysis* by John Rice discusses this as well). We can, however, determine the smallest number of samples that meets our specifications using a two-sided t-test (a statistical test for a difference in means) under the Neyman-Pearson criterion.

A t-test is formulated as 

$$
t = \frac{\bar{x} - \bar{y}}{s/\sqrt{n}}
$$
where
$$
\begin{aligned}
|t| &> \gamma \implies \text{accept } H_1 \\
|t| &< \gamma \implies \text{accept } H_0
\end{aligned}
$$

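such that $\bar{x}$ and $\bar{y}$ are sample averages, $s$ is the standard deviation (in practice estimated from the samples, which is why the statistic follows a Student-t rather than a normal distribution), $n$ is the number of samples (the statistic has $n-1$ degrees of freedom), $\gamma$ is the decision threshold, $H_1$ is the alternative hypothesis, and $H_0$ is the null hypothesis.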
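As a minimal sketch of this decision rule, the snippet below plugs summary values into the statistic and compares the result with a threshold. The function names `t_statistic` and `decide`, the example inputs, and the hard-coded threshold 4.303 (the 97.5th percentile of a Student-t distribution with 2 degrees of freedom, i.e. $\alpha = 0.05$ and $n = 3$) are illustrative assumptions, not part of the original analysis.

```python
import math


def t_statistic(mean_x: float, mean_y: float, s: float, n: int) -> float:
    """Evaluate t = (mean_x - mean_y) / (s / sqrt(n))."""
    return (mean_x - mean_y) / (s / math.sqrt(n))


def decide(t: float, gamma: float) -> str:
    """Two-sided rule: accept H1 when |t| exceeds the threshold gamma."""
    return "accept H1" if abs(t) > gamma else "accept H0"


# Hypothetical summary values: means 2.5 apart, s = 0.5, n = 3 samples.
t_value = t_statistic(mean_x=5.0, mean_y=2.5, s=0.5, n=3)

# Assumed threshold for alpha = 0.05 with n - 1 = 2 degrees of freedom.
gamma = 4.303

print(f"t = {t_value:.3f}, decision: {decide(t_value, gamma)}")
```

With these inputs the statistic is about 8.66, well above the threshold, so the rule accepts $H_1$.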

Under the null hypothesis $H_0$, $x \sim \mathcal{N}(\mu, s^2)$ and $y \sim \mathcal{N}(\mu, s^2)$, so $t$ follows a Student-t distribution.

Under the alternative hypothesis $H_1$, $x \sim \mathcal{N}(\mu_x, s^2)$ and $y \sim \mathcal{N}(\mu_y, s^2)$, so $t$ follows a non-central Student-t distribution.

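In the Neyman-Pearson criterion, the goal is to maximize the probability of detection $P(H_1 \mid H_1)$, called the power $\beta$, for a fixed probability of false positives $P(H_1 \mid H_0)$, called the size $\alpha$. (Many texts write the power as $1 - \beta$; here $\beta$ denotes the power itself, matching the variable `beta` in the code.)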
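The roles of size and power can be checked with a small Monte Carlo sketch. It assumes the statistic is built from $n$ observations of the difference between the two populations, with the standard deviation estimated from those observations, so that it has $n-1$ degrees of freedom and non-centrality $\sqrt{n}\,\delta/\sigma$. The helper `simulate_rejection_rate` and the values `n = 5`, `delta = 1.0`, `sigma = 1.0` are hypothetical choices for illustration only.

```python
import numpy as np
from scipy.stats import nct, t as student_t


def simulate_rejection_rate(delta, sigma, n, gamma, trials, rng):
    """Fraction of simulated experiments in which |t| > gamma."""
    # Each row is one experiment with n observed differences.
    d = rng.normal(delta, sigma, size=(trials, n))
    t_stats = d.mean(axis=1) / (d.std(axis=1, ddof=1) / np.sqrt(n))
    return np.mean(np.abs(t_stats) > gamma)


rng = np.random.default_rng(0)
n, delta, sigma, alpha = 5, 1.0, 1.0, 0.05  # hypothetical settings

# Threshold that fixes the size of the two-sided test at alpha.
gamma = student_t.ppf(1 - alpha / 2, n - 1)

# Under H0 (no difference) the rejection rate estimates the size.
size_mc = simulate_rejection_rate(0.0, sigma, n, gamma, 20000, rng)

# Under H1 (difference delta) the rejection rate estimates the power.
power_mc = simulate_rejection_rate(delta, sigma, n, gamma, 20000, rng)

# Power predicted by the non-central t distribution.
ncp = np.sqrt(n) * delta / sigma
power_theory = nct.sf(gamma, n - 1, ncp) + nct.cdf(-gamma, n - 1, ncp)

print(f"simulated size:  {size_mc:.3f} (target alpha = {alpha})")
print(f"simulated power: {power_mc:.3f} (non-central t: {power_theory:.3f})")
```

The simulated size should sit close to $\alpha$, and the simulated power close to the non-central t prediction, up to Monte Carlo noise.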

In [17]:
import numpy as np
from scipy.stats import nct

alpha = 0.05
beta = 0.8
delta_mean = 2.5
sigmax = 0.5
sigmay = 0.5
pooled_sample_variance = (sigmax**2 + sigmay**2) / 2 # Assume same number of trials for each population
cohens_d = delta_mean / np.sqrt(pooled_sample_variance)
print(f"Effect size (Cohen's d): {cohens_d}")

Effect size (Cohen's d): 5.0


Given these inputs, for each candidate $n$ we compute the threshold $\gamma$ and the resulting power, and stop once we reach the desired power.

In [19]:
def calcBeta(N: float, verbose: bool = False) -> float:
    """Power of the two-sided t-test for N samples (uses the globals alpha and cohens_d)."""
    # Threshold of the Neyman-Pearson test, chosen so that the false-alarm
    # probability satisfies P(|test_statistic| > gamma | H_0) = alpha.
    # With zero non-centrality, nct reduces to the central Student-t distribution.
    gamma = nct.ppf(1 - alpha/2, N-1, 0.0)
    if verbose:
        print(f"Threshold for two-sided t-test: {gamma}")

    # Power (probability of detection): under H_1 the statistic follows a
    # non-central t-distribution with non-centrality sqrt(N) * cohens_d.
    ncp = np.sqrt(N) * cohens_d
    upper_prob = 1.0 - nct.cdf(gamma, N-1, ncp)  # P(t > gamma | H_1)
    lower_prob = nct.cdf(-gamma, N-1, ncp)  # P(t < -gamma | H_1)
    power = upper_prob + lower_prob
    if verbose:
        print(f"Power (or probability of detection): {power}")
    return power

Use the bisection algorithm to quickly find the smallest $n$ that reaches the desired power.

In [20]:

MAX_N = 1000
MIN_N = 2
ERROR = 1e-3
# Assume the power grows monotonically with n: smallest at MIN_N, largest at MAX_N

max_n = MAX_N
min_n = MIN_N
while True:
    # n is treated as a continuous quantity during the search
    n_proposal = (max_n + min_n) / 2.0
    beta_n_proposal = calcBeta(n_proposal)
    if abs(beta_n_proposal - beta) > ERROR and abs(n_proposal - MIN_N) > ERROR and abs(n_proposal - MAX_N) > ERROR:
        if beta_n_proposal > beta:
            max_n = n_proposal
        else:
            min_n = n_proposal
    else:
        break

print(f"{n_proposal:0.3f} samples gives a test with power {beta_n_proposal:0.3f}")


2.491 samples gives a test with power 0.800
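Since a trial count must be a whole number, the fractional result has to be rounded up in practice. One possible sketch, assuming the same power model as `calcBeta` (non-centrality $\sqrt{n}\,d$ with $n-1$ degrees of freedom), searches the integers directly. The helper names `two_sided_power` and `smallest_integer_n` are hypothetical and not part of the original notebook.

```python
import numpy as np
from scipy.stats import nct, t as student_t


def two_sided_power(n: int, effect_size: float, alpha: float) -> float:
    """Power of the two-sided t-test with n samples (same model as calcBeta)."""
    df = n - 1
    gamma = student_t.ppf(1 - alpha / 2, df)  # threshold fixing the size at alpha
    ncp = np.sqrt(n) * effect_size            # non-centrality under H1
    return nct.sf(gamma, df, ncp) + nct.cdf(-gamma, df, ncp)


def smallest_integer_n(effect_size: float, alpha: float,
                       target_power: float, max_n: int = 1000) -> int:
    """Return the first integer n >= 2 whose power reaches target_power."""
    for n in range(2, max_n + 1):
        if two_sided_power(n, effect_size, alpha) >= target_power:
            return n
    raise ValueError(f"target power not reached for n <= {max_n}")


# Same settings as above: Cohen's d = 5.0, alpha = 0.05, desired power 0.8.
n_required = smallest_integer_n(effect_size=5.0, alpha=0.05, target_power=0.8)
achieved = two_sided_power(n_required, 5.0, 0.05)
print(f"{n_required} samples give a test with power {achieved:.3f}")
```

A linear scan is enough here because the answer is small; for weaker effects, the bisection above could be run first and its result rounded up, then verified with `two_sided_power`.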
