# Why ANOVA Matters

In [1]:
import numpy as np
from scipy.stats import ttest_ind
from itertools import combinations

In [2]:
def sample():
    """
    Draws a sample of 100 independent observations from the same population.
    In this case, the true population is standard normal.
    """
    return np.random.normal(size=100)

## Explanation of Significance

In [3]:
def do_pairwise_t_test():
    
    """
    Draws two independent samples from the population and performs an independent t-test.
    The t-test checks whether the means of the samples are statistically different.
    Returns True if the null hypothesis is rejected, i.e. if the samples have statistically different means.
    The significance level of the test is .05.
    """
    
    sample_1 = sample()
    sample_2 = sample()

    p_value = ttest_ind(sample_1, sample_2).pvalue

    return p_value < .05

If we run 1000 t-tests on 1000 pairs of samples drawn from the same distribution, each at a significance level of .05, we should expect approximately 50 false positives, for an empirical false positive rate (FPR) of about .05.
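The expected count comes from simple arithmetic. As a rough sketch, assume each test on same-population data rejects independently with probability equal to the significance level, so the number of false positives is binomial. The variable names below are illustrative and not part of the original notebook.

```python
import math

# Assumption: each test is an independent Bernoulli trial that
# rejects the (true) null with probability alpha.
num_tests = 1000
alpha = 0.05

# Mean and standard deviation of a Binomial(num_tests, alpha) count.
expected_false_positives = num_tests * alpha
std_false_positives = math.sqrt(num_tests * alpha * (1 - alpha))

print(expected_false_positives)
print(round(std_false_positives, 2))
```

Under these assumptions the standard deviation is roughly 7, so a simulated count somewhat above or below 50 is not surprising.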

As an application, consider a drug trial where one sample is from a control group and the other is from a group taking an experimental drug. Each observation is the (standardized) result of some medical test on an individual in the group. Since I've created the data-generating process, I know that the drug actually does nothing; both groups come from the same population. However, the researchers do not know this. Suppose there are 1000 research groups, each studying either the same or a different ineffective drug. If each group tests its results at a significance level of .05, about 50 groups will report that their drug is effective!

This is one reason why we should not trust medical results unless there is a plausible scientific explanation underlying the result.

In [4]:
num_trials = 1000
results = [do_pairwise_t_test() for _ in range(num_trials)]

num_false_positives = results.count(True)
empirical_FPR = num_false_positives / num_trials

print(num_false_positives)
print(empirical_FPR)

50
0.05


## More than Two Groups

In [5]:
def do_all_way_pairwise_tests(num_groups, pairwise_sig):
    
    """
    Draw num_groups samples from the population.
    Perform a t-test for every pair.
    Returns True if any pairwise test is significant at level pairwise_sig.
    """
    
    groups = [sample() for _ in range(num_groups)]
    
    pairs = combinations(groups, r=2)

    return any(
        ttest_ind(a, b).pvalue < pairwise_sig for a, b in pairs
    )

In [6]:
num_trials = 1000
results = [do_all_way_pairwise_tests(num_groups=3, pairwise_sig=.05) for _ in range(num_trials)]

num_false_positives = results.count(True)
empirical_FPR = num_false_positives / num_trials

print(num_false_positives)
print(empirical_FPR)

111
0.111


Instead of the FPR of .05 that we would hope for after setting the significance level to .05, we get many more false positives: an empirical FPR of about .11.

Since we have 3 samples, we make 3 pairwise tests, each at a significance level of .05. If any of these tests rejects the null, we declare a false positive. So, if the tests were independent, this testing process would result in an FPR of $1-(1-.05)^{3}=0.142625$. (In practice the tests are not independent: each sample takes part in two of the three comparisons, so the tests are correlated, which produces an FPR lower than this theoretical value.)
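To see how quickly this grows with the number of groups, here is a minimal sketch that evaluates the same formula for several group counts. It relies on the independence assumption from the paragraph above, so the values are approximations rather than exact rates, and the helper name `independent_fpr` is hypothetical, not part of the original notebook.

```python
from math import comb


def independent_fpr(num_groups, pairwise_sig=0.05):
    """
    Approximate FPR of all-way pairwise testing,
    ASSUMING the pairwise tests are independent.
    """
    # Number of distinct pairs among num_groups samples.
    num_tests = comb(num_groups, 2)
    # Probability that at least one of the tests rejects.
    return 1 - (1 - pairwise_sig) ** num_tests


for k in (2, 3, 5, 10):
    print(k, comb(k, 2), round(independent_fpr(k), 4))
```

With 2 groups there is a single test, so the formula returns the pairwise significance level itself. With 3 groups it reproduces the 0.142625 computed above.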