# Hypothesis testing (or A/B testing)

## 1. Introduction and vocabulary

In science, we can formulate a hypothesis (based on previous knowledge), and perform multiple experiments to verify whether the hypothesis is correct or not.

We can reject the hypothesis if we get strong, (statistically) convincing evidence that it is wrong.
Sometimes, the evidence against the hypothesis is not (statistically) strong enough, and we fail to reject the hypothesis.

The hypothesis that there is no difference between A and B is called the **null hypothesis**.\
Adopting this formulation for the null hypothesis makes it easier to formulate hypotheses (we do not need previous knowledge to specify in which way A and B are different).

In practice, rejecting the null hypothesis (or failing to do so) occurs via a statistical test. A statistical test requires:
1. experimental data
2. a null hypothesis, to reject or fail to reject
3. an alternative hypothesis.

The alternative hypothesis can be as simple as the exact opposite of the null hypothesis (for instance when comparing two datasets A and B). It becomes more difficult (and more interesting!) when there are more than two datasets, because several alternative hypotheses are then possible.

Note: In the latter case, whichever alternative hypothesis is selected, we can only either reject or fail to reject the null hypothesis. However, the outcome might depend on the alternative hypothesis selected.

## 2. p-values

### 2.1 Definition

In [1]:
### TBD sample vs population.

**p-values** are values between 0 and 1 that indicate whether we can reject the null hypothesis, by comparing the p-value to a user-defined threshold.

In practice, the threshold value for p-values is usually set at 0.05. This means that if p<0.05, we reject the null hypothesis, while if p>0.05, we fail to reject the null hypothesis.\
This threshold value is called **significance level** or **$\alpha$-level**.\
In practice, it also means that if the null hypothesis is true and we repeat the experiment a large number of times, we wrongly reject it in only about 5% of the cases.

Note: A small p-value although there is actually no difference between the two datasets A and B is a **False positive**.\
Depending on how crucial it is that we hit the correct answer (for instance, for measuring the efficiency of a given medical treatment), it is possible to use a much smaller threshold value.\
In more relaxed cases, it would also be possible to use larger threshold values. 

Important note: p-values help us distinguish whether two samples are different. They provide no information on the extent of this difference.

### 2.2 Effect size

In [3]:
### TBD here**effect size

### 2.3 Calculating p-values

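The p-value is the sum of three probabilities:
1. the probability of getting the observed outcome just by random chance
2. the probability of getting any other outcome that is equally rare
3. the probability of getting any outcome that is even rarer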
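As an illustration of the three terms above, here is a minimal sketch for a coin-flip experiment. The scenario (5 heads out of 5 flips of a fair coin) and the helper name `two_sided_p_value` are assumptions chosen for this example; they are not part of the original notes.

```python
from math import comb

def two_sided_p_value(n_flips, n_heads, p_heads=0.5):
    """Sum the probabilities of all outcomes as rare as, or rarer than, the observed one."""
    def prob(k):
        # Binomial probability of exactly k heads in n_flips flips
        return comb(n_flips, k) * p_heads**k * (1 - p_heads) ** (n_flips - k)

    p_observed = prob(n_heads)
    # Keep every outcome whose probability does not exceed the observed one
    # (the small tolerance guards against floating-point ties).
    p_value = sum(
        prob(k) for k in range(n_flips + 1) if prob(k) <= p_observed * (1 + 1e-9)
    )
    return p_observed, p_value

p_observed, p_value = two_sided_p_value(n_flips=5, n_heads=5)
print(p_observed)  # 0.03125 (probability of the observed outcome alone)
print(p_value)     # 0.0625  (5 heads + the equally rare 0 heads; nothing is rarer)
```

Here the p-value (0.0625) is twice the probability of the observed outcome (0.03125), because the equally rare outcome "0 heads" is also counted.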

In [4]:
### TBD EXPAND HERE

From the above, it is clear that the p-value is different from the probability of getting the observed outcome.

We use statistical distributions for calculating p values.

### 2.4 p-hacking

#### 2.4.1 p-hacking

1. With a given threshold, the more tests we run, the more false positives we will get (by construction): with a threshold of 0.05, about 5% of the tests for which the null hypothesis is true end up as FP. This is called the "Multiple testing problem". There are several methods to compensate for it, one of the most popular being the control of the **False Discovery Rate**.

2. The proper sample size has to be determined before starting the experiment, and not adjusted later on in the hope of improving p-values. To determine the sample size, we need to perform a **power analysis**.

#### 2.4.2 False discovery rates (FDR)

Sometimes, bad data nevertheless looks good. FDRs are a way to sort out such data.

Usually, data follows a distribution curve (for instance, a Gaussian distribution) where most of the data points fall relatively close to the mean value for the sample, while only a few data points lie far away from the mean value.\
If we draw pairs of samples from this distribution and compare them, the test will find no significant difference in most cases (the p-value of a statistical test will be > 0.05). In a few cases (about 5% if the threshold value is set at 0.05), the samples will be considered different (the p-value of a statistical test will be < 0.05).\
The latter cases are called **False Positives (FP)** because the statistical test indicates that the samples are drawn from different populations, which is false.\
When the number of tests is large, even 5% FP is a large absolute number (5% of 1 000 000 = 50 000)!
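The 5% figure can be checked with a small simulation. This is only a sketch: the number of tests, the sample size, the seed and the choice of an independent t-test (`scipy.stats.ttest_ind`) are assumptions made for the illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_per_sample = 2000, 30

# Every pair of samples is drawn from the SAME normal distribution,
# so the null hypothesis is true for all tests.
a = rng.normal(loc=0.0, scale=1.0, size=(n_tests, n_per_sample))
b = rng.normal(loc=0.0, scale=1.0, size=(n_tests, n_per_sample))

# One t-test per row
p_values = stats.ttest_ind(a, b, axis=1).pvalue

# Any p-value below the threshold is a false positive here
false_positive_rate = np.mean(p_values < 0.05)
print(f"Fraction of false positives: {false_positive_rate:.3f}")
```

The printed fraction should be close to 0.05, and a histogram of `p_values` should look roughly flat (uniform).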

A **False Discovery Rate (FDR)** correction limits the expected proportion of false positives among the results declared significant, and filters most of them out.

For instance, one can compare the distribution of p-values when samples are drawn from the same distribution (the p-values are uniformly distributed) and the distribution of p-values when samples are drawn from different distributions (the p-values are skewed towards lower values, with a majority of them < 0.05 when the difference and the sample size are large enough).

Note: In the latter case, the drawings ending up with p-values > 0.05 are called **False Negatives (FN)**: they have p-values > 0.05 although they are drawn from different distributions. The number of FNs can be reduced by increasing the size of the samples.

Comparing the two distributions, it becomes possible to identify the FPs by:
1. identifying the level of the uniform distribution.
2. calculating the number N of potential FPs as those N drawings with p-values above the uniform distribution level.
3. selecting the drawings with the N smallest p-values.

This procedure for retrieving FPs is called the **Benjamini-Hochberg method**.\
It adjusts the p-values upward, so that a large fraction of the FPs end up above the 0.05 threshold and are not significant anymore.
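In its usual formulation, the Benjamini-Hochberg adjustment sorts the $m$ p-values in increasing order, $p_{(1)} \le \dots \le p_{(m)}$, and replaces each of them by $\min_{j \ge i} \min\left(1, \frac{m}{j} p_{(j)}\right)$. The sketch below implements this formula with NumPy; the function name `benjamini_hochberg` and the example p-values are made up for the illustration.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices that sort p in increasing order
    ranked = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / rank
    # An adjusted p-value may not exceed the one of a larger raw p-value:
    # take the running minimum, starting from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)      # cap at 1 and restore the original order
    return adjusted

p_raw = [0.01, 0.04, 0.03, 0.005, 0.8, 0.5]
p_adj = benjamini_hochberg(p_raw)
print(p_adj)  # approximately [0.03, 0.06, 0.06, 0.03, 0.8, 0.6]
```

In this example, four raw p-values are below 0.05, but only two remain below 0.05 after adjustment.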

In [5]:
### TBD INCLUDE BH FORMULA HERE

#### 2.4.3 Statistical power

**Statistical power** is the probability of correctly rejecting the null hypothesis (that is, of getting a small p-value when the null hypothesis is actually false).\
Statistical power increases when the two distributions we draw from have little overlap, and decreases when the two distributions have a large overlap. Moreover, statistical power decreases when the sample size decreases, and increases when the sample size increases.\
A power of 0.9 means that the chance of *correctly* rejecting the null hypothesis is 90%. A common value for power is 0.8.

If the p-value is small, but slightly higher than 0.05, we cannot reject the null hypothesis. Increasing the size of the sample afterwards, hoping to reach a p-value < 0.05, would be p-hacking. Instead, performing a power analysis beforehand gives us the sample size required to conduct the experiment.

A **power analysis** evaluates the number of measurements to reach a sufficient statistical power.\
The more the distributions overlap, the larger the samples have to be in order to reach a given power.

A simple way to estimate the overlap between two distributions is to use their means and standard deviations to compute a metric called **effect size** (usually noted d). The simplest way to compute the effect size is the following, but it is possible to select other definitions.

$$
d = \frac{\mu_B - \mu_A}{\sqrt{\frac{\sigma_A^2 + \sigma_B^2}{2}}}
$$

where:
- $\mu_A$ and $\mu_B$ are the means of the A and B distributions
- $\sigma_A$ and $\sigma_B$ are the standard deviations of the A and B distributions.
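The effect size formula, and its link with the sample size, can be sketched with the Python standard library only. Two assumptions are made here: the means and standard deviations are estimated from small made-up samples, and the sample size uses the common normal approximation $n \approx 2\left(\frac{z_{1-\alpha/2} + z_{\text{power}}}{d}\right)^2$ per group for a two-sided test, which is not the exact t-based calculation that a dedicated power analysis tool would perform.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, variance

def effect_size(sample_a, sample_b):
    """Effect size d, with means and variances estimated from the samples."""
    pooled_sd = sqrt((variance(sample_a) + variance(sample_b)) / 2)
    return (mean(sample_b) - mean(sample_a)) / pooled_sd

def approx_sample_size(d, alpha=0.05, power=0.8):
    """Approximate sample size per group (normal approximation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / abs(d)) ** 2)

# Hypothetical measurements for groups A and B
group_a = [4.8, 5.1, 5.0, 4.9, 5.2]
group_b = [5.3, 5.6, 5.5, 5.4, 5.7]

d = effect_size(group_a, group_b)
print(round(d, 3))                 # 3.162: very little overlap

# A smaller effect size (more overlap) requires larger samples
print(approx_sample_size(0.5))     # 63 measurements per group
```

With this approximation, halving the effect size multiplies the required sample size by about four.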

## 3. t-tests

### 3.1 t-tests

A **t-test** tests whether there is a significant difference between the means of two groups.

To reject the null hypothesis, one can directly compare the calculated t-value to a table, or calculate the p-value from the t-value.

**One-sample t-tests** are used to compare the mean of a sample to a known reference mean value. The null hypothesis states that the sample's mean is equal to the reference value (the alternative being that it is not equal).

$$
t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}
$$
where:
- $\bar{x}$ is the mean of the sample
- $\mu$ is the reference value
- $s$ is the standard deviation of the sample
- $n$ is the number of elements in the sample
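The one-sample formula can be compared with `scipy.stats.ttest_1samp`. The measurements and the reference value below are made up for the illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements and reference value
sample = np.array([5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.5])
reference = 5.0

# t-value from the formula (ddof=1 gives the sample standard deviation)
t_manual = (sample.mean() - reference) / (sample.std(ddof=1) / np.sqrt(sample.size))
# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=sample.size - 1)

# Same test with SciPy
result = stats.ttest_1samp(sample, popmean=reference)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

Both approaches should print the same t-value and the same p-value.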


**Paired t-tests** are useful when we have *before* and *after* measurements from the same subjects, for instance before/after administering a drug, or before/after modifying a functionality on a webpage. The null hypothesis states that the mean of the differences between the pairs is zero (the alternative being that it is not zero).

$$
t = \frac{\bar{x_d} - 0}{\frac{s_d}{\sqrt{n}}}
$$
where:
- $\bar{x_d}$ is the mean value of the differences
- $s_d$ is the standard deviation of the differences
- $n$ is the number of pairs
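A paired t-test is a one-sample t-test applied to the differences, which can be checked against `scipy.stats.ttest_rel`. The before/after values below are made up for the illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements on the same six subjects
before = np.array([72.0, 80.0, 65.0, 90.0, 77.0, 84.0])
after = np.array([70.0, 75.0, 66.0, 85.0, 74.0, 80.0])

# t-value from the formula, computed on the pairwise differences
diff = after - before
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size))

# Same test with SciPy (it works on after - before internally)
result = stats.ttest_rel(after, before)
print(t_manual, result.statistic)
print(result.pvalue)
```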

When we have independent samples (group A and group B), **unpaired t-tests**, also called **independent samples** t-tests, are used instead. This is for instance the case when a test group is compared with a control group. The null hypothesis states that the mean values in both groups are equal (the alternative being that they are not equal).\
Two versions are available: one where it can be assumed that the variance of a given quantity in group A is the same as the variance of this quantity in group B (Student's t-test), and a more conservative (= more robust) version where the two variances are allowed to differ (Welch's t-test). The formula below is the one for the second version.

$$
t = \frac{\bar{x_1} - \bar{x_2}}{ \sqrt{ \frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}     }     }
$$
where:
- $\bar{x_1}$ and $\bar{x_2}$ are the means of the samples
- $s_1$ and $s_2$ are the standard deviations of the samples
- $n_1$ and $n_2$ are the numbers of elements in the samples
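The formula above, which does not assume equal variances, corresponds to `scipy.stats.ttest_ind` called with `equal_var=False`. The two groups below are made up for the illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical control and test groups (sizes may differ)
control = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2])
treated = np.array([12.9, 13.4, 12.7, 13.8, 12.6, 13.1])

# t-value from the formula (ddof=1 gives the sample variances)
t_manual = (control.mean() - treated.mean()) / np.sqrt(
    control.var(ddof=1) / control.size + treated.var(ddof=1) / treated.size
)

# Unequal-variance version (Welch) and equal-variance version (Student)
welch = stats.ttest_ind(control, treated, equal_var=False)
student = stats.ttest_ind(control, treated, equal_var=True)

print(t_manual, welch.statistic)       # identical t-values
print(welch.pvalue, student.pvalue)    # the two versions usually give different p-values
```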

### 3.2 Testing the required assumptions

#### 3.2.1 Normality assumption

Performing a t-test requires the tested variable to be normally distributed.

There are several methods for testing for normality:
- the Kolmogorov-Smirnov test
- the Shapiro-Wilk test
- the Anderson-Darling test
- the D'Agostino-Pearson omnibus test

The null hypothesis is that the data are normally distributed.
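Two of these tests are directly available in SciPy: `scipy.stats.shapiro` (Shapiro-Wilk) and `scipy.stats.normaltest` (D'Agostino-Pearson). The sketch below applies them to simulated data; the distributions, sample size and seed are assumptions chosen for the illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(loc=10.0, scale=2.0, size=200)   # drawn from a normal distribution
skewed_data = rng.exponential(scale=2.0, size=200)        # clearly not normal

for name, data in [("normal", normal_data), ("skewed", skewed_data)]:
    _, shapiro_p = stats.shapiro(data)        # Shapiro-Wilk test
    _, dagostino_p = stats.normaltest(data)   # D'Agostino-Pearson omnibus test
    print(f"{name}: Shapiro-Wilk p={shapiro_p:.3g}, D'Agostino-Pearson p={dagostino_p:.3g}")
```

For the skewed data, both p-values should be far below 0.05, so normality is rejected.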

In [6]:
### TBD ADD PYTHON EXAMPLES HERE

Note: if the p-value is smaller than 0.05, we can reject the hypothesis that the data is normally distributed. If the p-value is larger than 0.05, we can only fail to reject it. In practice, we then proceed as if the data were normally distributed, although failing to reject normality does not formally prove it.

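Note: in all the tests above, the p-value calculated depends on the sample size (small samples rarely lead to a rejection, while very large samples lead to a rejection even for negligible deviations), so graphical methods are often preferred.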

The first graphical method is simply to plot a histogram representing the data. However, this is not very robust (especially in the case of small samples), and plotting a quantile-quantile plot (or QQ plot) is much more robust. In a QQ plot, the data is normally distributed when the sample quantiles fall along the line of the theoretical quantiles. This method has the advantage of showing which quantiles deviate from the normal distribution.
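The points of a QQ plot can be computed with `scipy.stats.probplot`. The sketch below only computes them, on simulated data chosen for the illustration; no figure is drawn.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=0.0, scale=1.0, size=100)

# Theoretical quantiles (x-axis), ordered sample values (y-axis),
# and the least-squares line fitted through the points
(theoretical_q, sample_q), (slope, intercept, r) = stats.probplot(data, dist="norm")

# r is the correlation coefficient of the fit: the closer to 1,
# the better the points align on a straight line
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}")
```

Passing a Matplotlib module or axes through the `plot` argument (for instance `stats.probplot(data, dist="norm", plot=plt)`) draws the QQ plot directly.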

In [7]:
### TBD add histogram and QQ plots here

#### 3.2.2 Equal variance assumption

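As mentioned above, one version of the independent t-test assumes equal variances. This can be tested using Levene's test.

**Levene's test** checks whether the variances of several samples are the same. The null hypothesis is that the variances of all groups are equal (the alternative hypothesis being that at least one group has a different variance). The statistic is computed on the absolute deviations of each element from its group mean (or median): the "group mean" and "total mean" in the formula are the means of these absolute deviations.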

$$
L = \frac{\text{number of elements} - \text{number of groups}}{\text{number of groups} - 1} \times \frac{\sum \text{elements per group} \times (\text{group mean} - \text{total mean})^2}{\sum \text{squared deviations within individual groups}}
$$

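The L-value calculated approximately follows an F-distribution under the null hypothesis, so it is treated as an F-value. Combining the F-value and the degrees of freedom, it is possible to calculate the p-value. The two degrees of freedom are:
$$
\text{number of groups} - 1 \quad \text{and} \quad \text{number of elements} - \text{number of groups}
$$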
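The calculation can be sketched by hand and compared with `scipy.stats.levene`. The three groups below are made up for the illustration, and `center="mean"` is selected because SciPy uses deviations from the median by default.

```python
import numpy as np
from scipy import stats

# Hypothetical groups
groups = [
    np.array([5.1, 4.9, 5.3, 5.0, 5.2]),
    np.array([4.2, 5.9, 5.0, 6.3, 3.8, 5.1]),
    np.array([5.0, 5.4, 4.7, 5.2, 4.9]),
]
k = len(groups)
n_total = sum(g.size for g in groups)

# Absolute deviations of each element from its own group mean
z = [np.abs(g - g.mean()) for g in groups]
z_total_mean = np.concatenate(z).mean()

between = sum(zi.size * (zi.mean() - z_total_mean) ** 2 for zi in z)
within = sum(((zi - zi.mean()) ** 2).sum() for zi in z)
l_manual = (n_total - k) / (k - 1) * between / within

# p-value from the F-distribution with (k - 1, n_total - k) degrees of freedom
p_manual = stats.f.sf(l_manual, k - 1, n_total - k)

l_scipy, p_scipy = stats.levene(*groups, center="mean")
print(l_manual, l_scipy)
print(p_manual, p_scipy)
```

Both approaches should give the same statistic and the same p-value.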