<h1 align="center"> A/B Testing </h1>

- In this project, we take a deep dive into the main characteristics of A/B Testing and explain how to leverage it in Data Science.

# Introduction 

A/B Testing is a method used to compare two versions of a variable to find out which one performs better in a controlled environment. For instance, let's say you own a company and you want to increase the sales of your product. You can either rely on ad hoc trial and error, or on a statistically grounded experiment. A/B Testing is one of the most prominent statistical tools for the latter.

In the following, we'll see how different statistical methods can be used to make A/B Testing successful.

To understand what A/B Testing is about, let's consider two alternative designs : A and B. Visitors of a website are randomly served with one of the two. Then, data about their activities is collected by web analytics. Given this data, we want to know which design performs better. 

Now, different kinds of metrics can be used to measure a website's efficacy. With discrete metrics, also called binomial metrics, only the two values 0 and 1 are possible. An example is the click-through rate: if a user is shown an advertisement, do they click on it?

<p align="center">
  <img src="./Images/AB_test.PNG" width="600px"/>
</p>

With continuous metrics, also called non-binomial metrics, the metric may take continuous values that are not limited to a set of two discrete states. As an example, how much revenue does a user generate in a month?

<p align="center">
  <img src="./Images/AB_testing_2.PNG" width="600px"/>
</p>
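To make the distinction concrete, here is a minimal sketch that computes one metric of each kind. The logs `clicks_a` and `revenue_a` are hypothetical values made up for illustration, not data from this project.

```python
from statistics import mean

# Hypothetical per-user logs: 1 = clicked, 0 = did not click (discrete metric).
clicks_a = [1, 0, 0, 1, 0, 1, 0, 0]
# Hypothetical monthly revenue per user in euros (continuous metric).
revenue_a = [0.0, 12.5, 3.2, 40.0, 0.0, 7.8]

# A discrete metric is summarised by a proportion.
click_through_rate = sum(clicks_a) / len(clicks_a)
# A continuous metric is summarised by a mean.
average_revenue = mean(revenue_a)

print(f"click-through rate: {click_through_rate:.3f}")
print(f"average revenue per user: {average_revenue:.2f} EUR")
```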

# Statistical significance

With data collected from users' activity on our website, we can compare the efficacy of the two designs A and B. Simply comparing the means wouldn't be meaningful, as we would fail to assess the statistical significance of our observations.

## 1- Make hypothesis

### a - Null hypothesis or H0

The null hypothesis states that there is no difference between the control (group A) and variant (group B) groups, i.e. that they produce an equivalent click-through rate, average revenue per user, etc.

### b - Alternative Hypothesis or H1

The alternative hypothesis challenges the null hypothesis and is basically a hypothesis that the researcher believes to be true. It's what you might hope that your A/B test will prove to be true.

<div class="alert alert-info"> <b> Reminder  p-value </b> : 
    
The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct : 
    
    
\begin{align}
      p\text{-value} = P(\text{data at least as extreme as the actual observation} \mid H_{0})
\end{align}

    
Typically, a small p-value (< 0.05) suggests that the null hypothesis should be rejected, while a large p-value (> 0.05) means that we fail to reject the null hypothesis. A large p-value does not prove that the null hypothesis is true.
</div>

## 2 - Define a metric 

### 2.1 - Discrete metrics

Let's start with discrete metrics such as the click-through rate. We randomly show visitors one of the two designs and we keep track of how many of them click on it.

Let's say that we have collected the following information: 
- $N_{A}$  = 15 visitors saw advertisement A and 7 of them clicked on it. 
- $N_{B}$ = 19 visitors saw advertisement B and 15 of them clicked on it. 

At first glance, it looks like design B was more effective. Let's test this statistically. 

#### Fisher’s exact test

Fisher's exact test is used to assess whether or not there is an association between two categorical features. It is typically used as an alternative to the Chi-Square Test of Independence when one or more of the expected cell counts in a 2×2 table is less than 5. 


<p align="center">
  <img src="./Images/fisher_test.PNG" width="600px"/>
</p>


<p align="center">
    \begin{align}
      p = \frac{C^{a}_{a + b} \, C^{c}_{c + d}}{C^{a + c}_{n}}  = \frac{(a+b)! \, (c+d)! \, (a+c)! \, (b+d)!}{a! \, b! \, c! \, d! \, n!}
    \end{align} 
</p>



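Summing this probability over all the tables that are at least as extreme as the one we observed gives a two-sided p-value of about $ 7.5 \% $. Since this value is greater than 0.05, we cannot reject $ H_{0} $: the data do not provide sufficient evidence that the click-through rate depends on the design.  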
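As a minimal sketch, the two-sided p-value for the click data can be computed with the standard library alone. The function names are our own, and the "sum every table that is no more likely than the observed one" rule is one common convention for the two-sided test, assumed here.

```python
from math import comb

def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value: sum the probabilities of all tables with the same
    margins that are no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    p_value = 0.0
    # x is the top-left cell; the margins determine the other three cells.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p <= p_observed * (1 + 1e-9):  # small tolerance for float ties
            p_value += p
    return p_value

# Rows: design A, design B. Columns: clicked, did not click.
p_value = fisher_exact_two_sided(7, 8, 15, 4)
print(f"{p_value:.3f}")  # 0.075
```

The observed table alone has a probability of about 4.5%. The 7.5% comes from adding the more extreme tables in both directions.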


#### Pearson’s chi-squared test

Fisher’s exact test has the important advantage of computing an exact p-value. But with a large sample size, it may be computationally inefficient. In this case, we use Pearson's chi-squared test to compute an approximation of the p-value. 

<p align="center">
  <img src="./Images/person_test.PNG" width="200px"/>
</p>

With: 
<center>
    $ E_{i,j} = \frac{(O_{i1} \; + \; O_{i2}) (O_{1j} \; + \; O_{2j}) }{O_{11} \;+\; O_{12} \; + \; O_{21} \; +  \; O_{22} } $
  
</center>

- In our case, using Pearson's Chi-squared test, we obtain a p-value ≈ 5.1 \%, which is slightly greater than 5 \%, so the null hypothesis is again not rejected. This test can be used with non-normal data when the sample size is large enough, thanks to the central limit theorem.
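The statistic and its p-value can be sketched as follows for the same 2×2 click table. This version assumes no continuity correction, and it relies on the fact that, for one degree of freedom, the chi-squared tail probability can be written with the complementary error function. The function name is ours.

```python
from math import erfc, sqrt

def chi_squared_2x2(observed):
    """Pearson chi-squared statistic and p-value for a 2x2 table
    (no continuity correction, 1 degree of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence: row total * column total / n.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Rows: design A, design B. Columns: clicked, did not click.
statistic, p_value = chi_squared_2x2([[7, 8], [15, 4]])
print(f"chi2 = {statistic:.2f}, p-value = {p_value:.4f}")
```

This gives a statistic of about 3.82 and a p-value of roughly 5%. Libraries that apply Yates' continuity correction by default for 2×2 tables will report a larger p-value on the same data.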

### 2.2 - Continuous metrics

Now, let's consider the case of a continuous metric such as the average revenue per user. Our goal is to determine which of the two layouts is more efficient, based on how much revenue a user generates. 

#### Z-test

The Z-test can be applied under the following assumptions : 
   - The samples follow a normal distribution (a bell curve) or the sample size is large enough (central limit theorem).
   - The populations have known standard deviations σX and σY.
   
Under the above assumptions, and when the null hypothesis is true, the Z statistic below follows a standard normal distribution. 


<p align="center">
  <img src="./Images/Z_test.PNG" width="300px"/>
</p>



Unfortunately, in real applications the standard deviations are unknown and must be estimated.  
Assuming σX=100 and σY=90, we get a p-value ≈ 9 \% 
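Here is a minimal sketch of the two-sample Z-test with known standard deviations, using the usual statistic $(\overline{X} - \overline{Y}) / \sqrt{\sigma_X^2/n_X + \sigma_Y^2/n_Y}$. The revenue samples `x` and `y` are hypothetical values for illustration only, so the resulting p-value is not the 9 \% quoted above.

```python
from math import sqrt
from statistics import NormalDist, mean

def two_sample_z_test(x, y, sigma_x, sigma_y):
    """Two-sided two-sample Z-test with known standard deviations."""
    standard_error = sqrt(sigma_x**2 / len(x) + sigma_y**2 / len(y))
    z = (mean(x) - mean(y)) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical revenue samples, for illustration only.
x = [220.0, 310.0, 150.0, 405.0, 280.0, 190.0]
y = [330.0, 410.0, 290.0, 520.0, 360.0, 300.0]

z, p_value = two_sample_z_test(x, y, sigma_x=100, sigma_y=90)
print(f"Z = {z:.3f}, p-value = {p_value:.3f}")
```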

#### Student’s t-test 

To overcome the previous problem, Student’s t-test can be applied under the following assumptions : 
   - The samples follow a normal distribution (a bell curve) or the sample size is large enough (central limit theorem).
   - The two populations have similar variances, σX ≈ σY.
   

<p align="center">
  <img src="./Images/Student.PNG" width="400px"/>
</p>


Using Student's t-test, we get a p-value ≈ 8.4%
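The pooled-variance statistic behind Student's t-test can be sketched as below. The samples are hypothetical toy values, and the function only returns the statistic and its degrees of freedom: turning them into a p-value requires the CDF of the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def student_t_statistic(x, y):
    """Pooled-variance t statistic and its degrees of freedom."""
    n_x, n_y = len(x), len(y)
    # Pooled estimate of the common variance (sample variances use n - 1).
    pooled_var = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / n_x + 1 / n_y))
    return t, n_x + n_y - 2

# Hypothetical toy samples, for illustration only.
x = [1.0, 2.0, 3.0]
y = [2.0, 3.0, 4.0]

t, dof = student_t_statistic(x, y)
print(f"t = {t:.4f}, degrees of freedom = {dof}")  # t = -1.2247, degrees of freedom = 4
```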

#### Welch’s t-test
In most cases, Student's t-test can be applied with good results. However, it may occasionally happen that the second condition (similar variances) is violated. In this case, we use Welch's t-test, whose statistic also follows (approximately) a Student’s t distribution, but with a different number of degrees of freedom ν.


<p align="center">
  <img src="./Images/welch.PNG" width="450px"/>
</p>

In our example, using Welch’s t-test we obtain t ≈ -1.848 and ν ≈ 28.51, which gives a p-value ≈ 7.5%
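A minimal sketch of Welch's statistic and of its degrees of freedom, computed with the Welch-Satterthwaite approximation, is given below. The samples are hypothetical and deliberately have very different variances. They are not the data behind the t ≈ -1.848 and ν ≈ 28.51 quoted above.

```python
from math import sqrt
from statistics import mean, variance

def welch_t_statistic(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_x, n_y = len(x), len(y)
    # Squared standard error of each sample mean.
    v_x, v_y = variance(x) / n_x, variance(y) / n_y
    t = (mean(x) - mean(y)) / sqrt(v_x + v_y)
    nu = (v_x + v_y) ** 2 / (v_x**2 / (n_x - 1) + v_y**2 / (n_y - 1))
    return t, nu

# Hypothetical samples with unequal variances and unequal sizes.
x = [1.0, 2.0, 3.0]
y = [0.0, 4.0, 8.0, 12.0]

t, nu = welch_t_statistic(x, y)
print(f"t = {t:.3f}, nu = {nu:.3f}")
```

When both samples have the same size and the same variance, ν reduces to the $n_X + n_Y - 2$ of Student's t-test. Otherwise it is smaller and generally not an integer.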

## 3- Minimum Sample Size Calculation

Before running an A/B test, we should determine the minimum sample size that each group (control and treatment) needs in order to detect a statistically significant effect. When we are dealing with a continuous metric such as the mean order amount, we intend to compare the means of the control and experimental groups. In this case, we usually invoke the Central Limit Theorem and state that the sampling distribution of the mean of both groups follows a normal distribution.


<p align="center">
  <img src="./Images/normale.PNG" width="450"/>
</p>

Hence, to compare the means of two normally distributed samples, we need a minimum sample size that can be calculated as follows : 


<p align="center">
  <img src="./Images/N.PNG" width="350"/>
</p>
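As a sketch, and assuming the widely used normal-approximation formula $n = 2\sigma^2 (z_{1-\alpha/2} + z_{1-\beta})^2 / \delta^2$ per group (which may differ in notation from the figure above), the calculation looks like this. The inputs `sigma` (common standard deviation) and `delta` (smallest difference worth detecting) are hypothetical choices.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size_per_group(sigma, delta, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided comparison of two means,
    assuming a common standard deviation sigma and a minimum detectable
    difference delta (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.8
    return ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)

# Hypothetical inputs: sigma = 6 EUR, smallest difference of interest = 0.75 EUR.
n = min_sample_size_per_group(sigma=6, delta=0.75)
print(n)  # 1005
```

The required size grows with the square of σ/δ: halving the difference we want to detect multiplies the sample size by four.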

## 4 - Two-Tailed and One-Tailed test

A two-tailed test is designed to show whether the sample mean is significantly greater than or significantly less than the mean of a population. 
It gets its name from testing the area under both tails (sides) of a normal distribution. A one-tailed test, on the other hand, is set up to show that the sample mean is higher than the population mean, or that it is lower, but not both. If the test statistic falls into the one-sided critical area, the null hypothesis is rejected in favor of the alternative hypothesis. A one-tailed test is also known as a directional hypothesis or directional test. 



<p align="center">
  <img src="./Images/1_2_tailed.PNG" width="500"/>
</p>

We choose a one-tailed test when we are sure about the direction of the change: it is the one to use when the alternative hypothesis is written with a < or > sign. A two-tailed test is a more conservative choice, as we might be wrong about the direction: we use it when the alternative hypothesis is written with a ≠ sign.
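The difference between the two kinds of test shows up directly in how the p-value is computed from a Z statistic. The sketch below uses our own function and argument names, and the value `z = 1.96` is just a convenient illustration.

```python
from statistics import NormalDist

def z_test_p_value(z, alternative="two-sided"):
    """p-value of a Z statistic for a two-tailed or one-tailed test."""
    std_normal = NormalDist()
    if alternative == "two-sided":  # H1: difference != 0, both tails
        return 2 * (1 - std_normal.cdf(abs(z)))
    if alternative == "greater":    # H1: difference > 0, right tail only
        return 1 - std_normal.cdf(z)
    if alternative == "less":       # H1: difference < 0, left tail only
        return std_normal.cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

z = 1.96
p_two_sided = z_test_p_value(z, "two-sided")  # about 0.05
p_greater = z_test_p_value(z, "greater")      # about 0.025
p_less = z_test_p_value(z, "less")            # about 0.975
print(p_two_sided, p_greater, p_less)
```

For the same statistic, the one-tailed p-value in the expected direction is half the two-tailed one, which is why the two-tailed test is the more conservative choice.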

#### Type I error

A type I error means rejecting the null hypothesis (H0) when it is actually true (it should not be rejected). It corresponds to a false positive. The probability of making this error is the significance level α. It is usually set at 0.05 or 5%, which means that results as extreme as ours would occur only 5% of the time, or less, if the null hypothesis (H0) were actually true.

When our p-value is less than α, our results are statistically significant and consistent with the alternative hypothesis (we reject H0). When it is greater than α, we fail to reject H0.


<p align="center">
  <img src="./Images/type_1.PNG" width="450"/>
</p>

#### Type II error 

A type II error means not rejecting the null hypothesis (H0) when it should be rejected. It corresponds to a false negative: we fail to conclude there was an effect when there actually was one. Its probability is denoted β. 

The type II error rate β is represented by the shaded area on the left side. The remaining area under the curve represents the statistical power, which is 1 – β: it is the probability of finding a statistical difference between the groups in our test when a difference is actually present. The target power is usually set at 0.8 by convention.



<p align="center">
  <img src="./Images/type_2.PNG" width="450"/>
</p>
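As a rough sketch, the power of a two-sided two-sample Z-test can be approximated as follows, assuming both groups have the same size and the same known standard deviation. The numbers (a true difference of 0.75 €, σ = 6 €, 100 users per group) are hypothetical inputs chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided two-sample Z-test
    for a true mean difference delta and a common standard deviation sigma."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # True difference expressed in standard errors of the difference of means.
    shift = delta / (sigma * sqrt(2 / n_per_group))
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

power = z_test_power(delta=0.75, sigma=6, n_per_group=100)
beta = 1 - power
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

With these inputs the power is far below the conventional 0.8, so a real difference of this size would usually go undetected. When the true difference is zero, the same function returns α, the type I error rate.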

Type I and type II errors are inversely related: for a fixed sample size, reducing one increases the other. Depending on the situation, one can be more important to minimize than the other. For example, in a court of law, where the null hypothesis is that the defendant is innocent, it is better to let a guilty person walk (type II error) than to send an innocent person to jail (type I error). Therefore, in this case, the type I error is worse. 



<p align="center">
  <img src="./Images/type_3.PNG" width="450"/>
</p>

Below is a confusion matrix, with type I error α, correct inference (true negative) 1 - α, type II error β and correct inference (true positive) 1 – β:


<p align="center">
  <img src="./Images/confusion_matrix.PNG" width="450"/>
</p>

## 5 - Application

Let's take our previous example. We have two designs and we are interested in knowing which one is more profitable. The average revenues generated by the first and the second groups are 18 € and 18.75 € respectively, with a standard deviation of 6 €. Based on those values, can any inference be made about the difference between the designs?

Each group contains 100 users. We will perform a two-tailed test :

- First things first, we specify our null hypothesis (H0): the mean revenue with the second design equals the 18 € of the first design, i.e. mean = 18 €.
- Then we specify our alternative hypothesis (H1): mean ≠ 18 € (that's what we want to prove).
- Rejection region: $ Z \leq -Z_{2.5} $ or $ Z \geq Z_{2.5} $ (assuming a 5% significance level, split 2.5% on either side).



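Taking the 18 € of the first design as the reference mean $ \mu_{0} $, and $ \overline{Y} $ = 18.75 € as the mean observed over the n = 100 users of the second design:

\begin{align}
     Z = \frac{\overline{Y} - \mu_{0}}{\sigma / \sqrt{n}} = \frac{18.75 - 18}{6 / \sqrt{100}} = 1.25
\end{align}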


- This calculated Z value falls between the two limits $ -Z_{2.5} = -1.96 $ and $ Z_{2.5} = 1.96 $, so it lies outside the rejection region.


<p align="center">
  <img src="./Images/Z_test_last.PNG" width="450px"/>
</p>

We therefore conclude that there is insufficient evidence to infer any difference between the two designs: the null hypothesis cannot be rejected. Equivalently, the p-value = P(Z < -1.25) + P(Z > 1.25) = 2 * 0.1056 = 0.2112 = 21.12%, which is greater than 0.05 (5%) and leads to the same conclusion.
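The numbers of this application can be checked with a few lines of standard-library Python. The value Z = 1.25 corresponds to treating 18 € as a known reference mean (a one-sample test). The sketch also computes, as an assumption on our side, the two-sample variant in which both means are estimated from 100 users each. All variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

mean_a, mean_b = 18.0, 18.75  # average revenue per user (EUR)
sigma, n = 6.0, 100           # standard deviation and users per group
alpha = 0.05

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96

# One-sample reading: 18 EUR is treated as a known reference mean.
z_one_sample = (mean_b - mean_a) / (sigma / sqrt(n))
p_one_sample = 2 * (1 - std_normal.cdf(abs(z_one_sample)))

# Two-sample reading (assumption): both means are estimates from n users.
z_two_sample = (mean_b - mean_a) / sqrt(sigma**2 / n + sigma**2 / n)
p_two_sample = 2 * (1 - std_normal.cdf(abs(z_two_sample)))

for label, z, p in [("one-sample", z_one_sample, p_one_sample),
                    ("two-sample", z_two_sample, p_two_sample)]:
    decision = "reject H0" if abs(z) >= z_crit else "fail to reject H0"
    print(f"{label}: Z = {z:.3f}, p-value = {p:.4f} -> {decision}")
```

The two-sample statistic is smaller (about 0.88) because the uncertainty of both group means is taken into account. In both readings the statistic stays inside the ±1.96 limits, so the null hypothesis is not rejected.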