# Hypothesis testing (or A/B testing)

## 1. Intro

In science, we can formulate a hypothesis (based on previous knowledge) and perform multiple experiments to verify whether the hypothesis is correct or not.

We can reject the hypothesis if we get strong, (statistically) convincing evidence that it is wrong.
Sometimes, the evidence against the hypothesis is not (statistically) strong enough, and we fail to reject the hypothesis.

The hypothesis that there is no difference between A and B is called the **null hypothesis**.\
It makes it easier to formulate hypotheses (we do not need previous knowledge to specify in which way A and B are different).

In practice, rejecting the null hypothesis (or failing to do so) occurs via a statistical test. A statistical test requires:
1. experimental data
2. a null hypothesis, to reject or fail to reject
3. an alternative hypothesis.

The alternative hypothesis can be as simple as the exact opposite of the null hypothesis (for instance with two datasets A and B). It becomes more difficult (and more interesting!) when there are more datasets, because we can then have several alternative hypotheses.

Note: In the latter case, whichever alternative hypothesis is selected, we can only either reject or fail to reject the null hypothesis. However, the outcome might depend on the alternative hypothesis selected.

TBD sample vs. population.

## 2. p-values

p-values are values between 0 and 1 that measure TBD.\
The closer to 0 a p-value is, the more we are convinced that TBD are different.


In practice, the threshold below which a p-value is considered significant is usually set at 0.05. This means that, if there is actually no difference and we repeat the experiment a large number of times, we wrongly reject the null hypothesis in only about 5% of the cases.

Note: Getting a small p-value when there is actually no difference between the two datasets A and B is a **False Positive**.\
Depending on how crucial it is that we get the correct answer (for instance, when measuring the efficacy of a given medical treatment), it is possible to use a much smaller threshold value.\
In more relaxed cases, it would also be possible to use larger threshold values.

Important note: p-values help us decide whether two samples are different. They provide no information on the extent of this difference.

tbd here **effect size**

## 3. Calculating p-values

The p-value is the sum of three probabilities:
1. the probability of getting the observed outcome just by random chance
2. the probability of getting any other outcome that is equally rare
3. the probability of getting an even rarer outcome

From the above, it is clear that the p-value is different from the probability of getting the observed outcome alone.

We use statistical distributions for calculating p values.
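As an illustration of the three-part sum above, here is a minimal sketch for a coin-flip experiment, assuming a fair coin as the null hypothesis. The function name `binomial_p_value` and its parameters are hypothetical, chosen for this example only.

```python
from math import comb


def binomial_p_value(n_heads, n_flips, p_heads=0.5):
    """Exact p-value for observing n_heads in n_flips (assumed null: fair coin)."""

    def prob(k):
        # Binomial probability of getting exactly k heads
        return comb(n_flips, k) * p_heads**k * (1 - p_heads) ** (n_flips - k)

    p_observed = prob(n_heads)
    # Sum the observed outcome, the equally rare outcomes and the rarer outcomes.
    # The small tolerance guards against floating-point ties.
    return sum(
        prob(k) for k in range(n_flips + 1) if prob(k) <= p_observed * (1 + 1e-9)
    )


# 4 heads in 5 flips: the outcome itself has probability 5/32,
# but the p-value also counts 1 head (equally rare) and 0 or 5 heads (rarer).
print(binomial_p_value(4, 5))
```

With 4 heads out of 5 flips, the probability of the outcome itself is 5/32, whereas the p-value adds the equally rare and the rarer outcomes and is therefore larger.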

## 4. p-hacking

1. With a given threshold, the more tests we run, the more false positives we will find (by construction): with a threshold of 0.05, we should end up with about 5% FPs among the tests where there is actually no difference. This is called the "multiple testing problem". There are several methods to compensate for it, one of the most popular being the control of the **False Discovery Rate**.

2. The proper sample size has to be determined before starting the experiment, and not adjusted later on in the hope of improving p-values. To determine the sample size, we need to perform a **power analysis**.

## 5. False discovery rates (FDR)

Sometimes, bad data nevertheless looks good. FDRs are a way to sort out such data.

Usually, data follows a distribution curve (for instance, a Gaussian distribution) where most of the data points fall relatively close to the mean value for the sample, while only a few data points lie far away from the mean value.\
If we repeatedly draw two samples from this distribution and compare them, they will not be significantly different in most cases (the p-value of a statistical test will be > 0.05). In a few cases (about 5% if the threshold value is set at 0.05), the samples will be considered different (the p-value of a statistical test will be < 0.05).\
The latter cases are called **False Positives (FP)** because the statistical test indicates that the samples are drawn from different populations, which is false.\
When the number of tests is large, even 5% FP is a large absolute number (5% of 1 000 000 = 50 000)!
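The 5% figure can be checked with a small simulation. The sketch below is an assumption-laden illustration: it uses a two-sample z-test with a known standard deviation of 1 (chosen so that only the Python standard library is needed), and the helper names `z_test_p_value` and `false_positive_fraction` are hypothetical.

```python
import random
from math import erfc, sqrt
from statistics import fmean


def z_test_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided z-test p-value, assuming a known common standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    z = (fmean(sample_b) - fmean(sample_a)) / (sigma * sqrt(1 / n_a + 1 / n_b))
    return erfc(abs(z) / sqrt(2))


def false_positive_fraction(n_tests=2000, n=30, alpha=0.05, seed=0):
    """Fraction of tests with p < alpha when both samples share one distribution."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_tests):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(a, b) < alpha:
            count += 1
    return count / n_tests


# Every "significant" result here is a false positive by construction.
print(false_positive_fraction())
```

The printed fraction should land close to 0.05, since both samples always come from the same Gaussian.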

A **False Discovery Rate (FDR)** method estimates how many of the significant results are likely to be false positives and adjusts the p-values so that most of them are filtered out.

For instance, one can compare the distribution of p-values when samples are drawn from the same distribution (the p-values are uniformly distributed) and the distribution of p-values when samples are drawn from different distributions (the p-values are skewed towards lower values, with a majority of them < 0.05 when the distributions are sufficiently different).

Note: In the latter case, the drawings ending up with p-values > 0.05 are called **False Negatives (FN)**: they have p-values > 0.05 although they are drawn from different distributions. The number of FNs can be reduced by increasing the size of the samples.

Comparing the two distributions, it becomes possible to identify the FPs by:
1. identifying the level of the uniform distribution.
2. calculating the number N of potential FPs as those N drawings with p-values above the uniform distribution level.
3. selecting the drawings with the N smallest p-values.

This procedure for retrieving FPs is called the **Benjamini-Hochberg method**.\
It adjusts the p-values upward, so that many of the FPs end up above the 0.05 threshold and are not significant anymore.

TBD INCLUDE BH FORMULA HERE
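For reference, the usual statement of the Benjamini-Hochberg adjustment is the following: with $m$ p-values sorted in increasing order, $p_{(1)} \le \dots \le p_{(m)}$, the adjusted value at rank $i$ is

$$
\tilde{p}_{(i)} = \min_{j \ge i} \min\left(1, \frac{m}{j} \, p_{(j)}\right)
$$

A minimal sketch of this adjustment in plain Python follows. The function name `benjamini_hochberg` is hypothetical, and this is an illustration rather than a replacement for a tested library routine.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.3f} -> adjusted p = {q:.3f}")
```

A result is then kept as significant when its adjusted p-value is below the chosen threshold (0.05 here).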

## 6. Statistical power

**Statistical power** is the probability of correctly rejecting the null hypothesis (that is, of getting a small p-value when the two distributions really are different).\
Statistical power increases when the two distributions we draw from have little overlap, and decreases when the two distributions have a large overlap. Moreover, statistical power decreases when the sample size decreases, and increases when the sample size increases.\
A power of 0.9 means that the chance of *correctly* rejecting the null hypothesis is 90%. A common value for power is 0.8.
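
The dependence of power on sample size can be illustrated with a simulation. This is a sketch under stated assumptions: two Gaussians with standard deviation 1 whose means differ by `shift`, compared with a two-sample z-test with known standard deviation (so that only the standard library is needed). The names `z_test_p_value` and `estimate_power` are hypothetical.

```python
import random
from math import erfc, sqrt
from statistics import fmean


def z_test_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sided z-test p-value, assuming a known common standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    z = (fmean(sample_b) - fmean(sample_a)) / (sigma * sqrt(1 / n_a + 1 / n_b))
    return erfc(abs(z) / sqrt(2))


def estimate_power(shift, n, n_experiments=2000, alpha=0.05, seed=0):
    """Fraction of simulated experiments that correctly reject the null hypothesis."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(shift, 1.0) for _ in range(n)]  # B really differs from A
        if z_test_p_value(a, b) < alpha:
            rejections += 1
    return rejections / n_experiments


for n in (10, 30, 64):
    print(f"n = {n:3d} per group -> estimated power = {estimate_power(0.5, n):.2f}")
```

With the same difference between the means, the estimated power grows as the number of measurements per group grows.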

If the p-value is small, but slightly higher than 0.05, we cannot reject the null hypothesis. Increasing the size of the sample afterwards, hoping to reach a p-value < 0.05, would be p-hacking. Performing a power analysis beforehand tells us the sample size required to conduct the experiment.

A **power analysis** evaluates the number of measurements needed to reach a sufficient statistical power.\
The more the distributions overlap, the larger the samples have to be in order to reach a given power.

A simple way to estimate the overlap between two distributions is to use their means and standard deviations to compute a metric called **effect size** (usually noted d). The simplest way to compute the effect size is the following, but it is possible to select other definitions.

$$
d = \frac{\mu_B - \mu_A}{\sqrt{\frac{\sigma_A^2 + \sigma_B^2}{2}}}
$$

where:
- $\mu_A$ and $\mu_B$ are the means of the A and B distributions
- $\sigma_A$ and $\sigma_B$ are the standard deviations of the A and B distributions.
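
The effect size can be estimated from two samples and then fed into a rough power analysis. The sketch below makes two assumptions that are not part of the notes above: the means and standard deviations are replaced by their sample estimates, and the sample size uses the common normal approximation $n \approx 2 \left( \frac{z_{1-\alpha/2} + z_{\text{power}}}{d} \right)^2$ per group. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist, fmean, variance


def effect_size(sample_a, sample_b):
    """Effect size d, using sample means and sample variances as estimates."""
    pooled_sd = sqrt((variance(sample_a) + variance(sample_b)) / 2)
    return (fmean(sample_b) - fmean(sample_a)) / pooled_sd


def sample_size_per_group(d, alpha=0.05, power=0.8):
    """Approximate measurements per group (normal approximation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


a = [1.0, 2.0, 3.0]
b = [3.0, 4.0, 5.0]
print(effect_size(a, b))           # little overlap -> large d
print(sample_size_per_group(0.5))  # more overlap (smaller d) -> larger samples
print(sample_size_per_group(1.0))
```

A smaller effect size (more overlap) leads to a larger required sample size for the same power, which matches the statement above.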