# A/B Test Process

<img src="../img/a_b_process.png">


## **01. COMING UP WITH THE HYPOTHESIS**

Define the hypothesis, including how you will measure success (the primary metric).

_e.g. Changing the "Sign Up" button to green will increase sign-ups._

ASK YOURSELF: If the chosen metric were to increase while everything else stayed the same, would you achieve your goal and address the problem?

**- Statistical Hypothesis** 
- Null hypothesis (H0): what we want to reject
- Alternative hypothesis (H1): what we accept if H0 is rejected

_e.g.:_
_H0: The CTR of the red button is the same as the CTR of the blue button._
_H1: The CTR of the red button is higher than the CTR of the blue button._
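
Written in terms of click-through probabilities, the example can be stated as follows (a sketch; the symbols $p_{\text{red}}$ and $p_{\text{blue}}$ are notation introduced here for illustration):

$$
H_0: p_{\text{red}} = p_{\text{blue}} \qquad H_1: p_{\text{red}} > p_{\text{blue}}
$$

Because H1 states a direction, this is a one-sided alternative; a test that only asks whether the two rates differ is two-sided.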



***






## **02. DESIGN THE TEST**

### **02.1 POWER ANALYSIS**

<img src="../img/power_analysis.png">

**Determine the POWER (1 - beta, where beta = probability of a type 2 error) of the test**
_Probability of correctly rejecting H0 when there is a real effect, i.e. of avoiding a type 2 error (false negative)_
Usually set to 80%

**Determine the SIGNIFICANCE (alpha) level of the test:**
_Probability of making a type 1 error (false positive)_
Usually set to 5%

**Determine the MINIMUM DETECTABLE EFFECT (delta) of the test**
The MDE is the smallest difference between versions that the test should be able to detect reliably. Choosing it well keeps the test sensitive enough to detect meaningful differences without spending sample size on trivial changes that are not practically significant.

_e.g. Suppose we are looking at CTR with an MDE of 5%. If the true difference in CTR between the two versions is smaller than 5%, the test might not be able to say confidently which version is better._

The smaller the MDE, the larger the sample size you need.
Likewise, if you set the MDE too high, the test needs fewer samples but can miss smaller effects that still matter.


_Suppose your baseline conversion rate is 10%. You want to detect an increase to at least 12% (an absolute increase of 2 percentage points). You set your power at 80% and alpha at 5%. With a sample size of 1,000 per group (control and test), you enter these values into an MDE calculator, and it tells you the minimum effect you can reliably detect given these parameters._
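
To make this example concrete without an online calculator, the sketch below estimates the MDE from the numbers above (baseline 10%, 1,000 users per group, power 80%, alpha 5%). It assumes a two-sided test, equal group sizes, and the arcsine (Cohen's h) normal approximation for two proportions. The function name `minimum_detectable_effect` is illustrative, and only the Python standard library is used.

```python
from math import asin, sin, sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, n_per_group, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a two-sided test of two proportions.

    Assumes equal group sizes and the arcsine (Cohen's h) normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    # Smallest standardized effect (Cohen's h) detectable with this sample size
    h = (z_alpha + z_power) * sqrt(2 / n_per_group)
    # Convert h back to the smallest detectable rate in the test group
    detectable_rate = sin(asin(sqrt(baseline_rate)) + h / 2) ** 2
    return detectable_rate - baseline_rate


mde_abs = minimum_detectable_effect(baseline_rate=0.10, n_per_group=1000)
print(f"Approximate MDE: {mde_abs:.3f} (absolute)")
```

Under these assumptions the MDE comes out at roughly 4 percentage points, so 1,000 users per group would not be enough to reliably detect an increase from 10% to 12%.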


### **02.2 CALCULATE MIN SAMPLE SIZE**
There are different approaches for binary metrics (e.g. user clicked or did not click) and continuous metrics (e.g. a percentage value of clicks).


The examples in the following sections use generated data to play with.

Below is how you can calculate the sample size for a binary metric with Python.


In [14]:
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Parameters
baseline_rate = 0.20  # Baseline conversion rate
mde = 0.05  # Minimum Detectable Effect (absolute): detect 20% -> 25%
# NormalIndPower expects a standardized effect size, not a relative lift;
# for two proportions use Cohen's h
effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)
power = 0.80  # Desired power
alpha = 0.05  # Significance level

# Create an instance of NormalIndPower
power_analysis = NormalIndPower()

# Calculate sample size
sample_size = power_analysis.solve_power(effect_size=effect_size,
                                         power=power,
                                         alpha=alpha,
                                         ratio=1)  # Ratio of the sample sizes in the two groups

print(f"Required sample size per group: {sample_size:.0f}")


Required sample size per group: 1092


### **02.3 TEST DURATION**

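**Duration = N / # visitors per day**
where N is the total minimum sample size (the per-group minimum sample size multiplied by the number of groups) and the visitors per day are those who enter the test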
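
As a rough illustration of the formula, the sketch below computes the duration in days, assuming N counts users across all groups. The traffic figure and the per-group sample size are hypothetical values chosen for the example, and `duration_in_days` is an illustrative helper, not part of any library.

```python
from math import ceil


def duration_in_days(n_per_group, n_groups, visitors_per_day):
    """Days needed to collect the total sample, rounded up to whole days."""
    total_sample = n_per_group * n_groups
    return ceil(total_sample / visitors_per_day)


# Hypothetical inputs: 1,000 users per group, 2 groups, 150 eligible visitors per day
days = duration_in_days(n_per_group=1000, n_groups=2, visitors_per_day=150)
print(f"Run the test for at least {days} days")
```

Rounding the result up to full weeks is one way to make sure both weekday and weekend behaviour are covered.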

NOTES:
- Run the test during a period that is representative for your test (consider external events, seasonality, etc.)
- NEVER STOP THE TEST AS SOON AS YOU REACH SIGNIFICANCE; this is a form of p-hacking. Determine beforehand the number of days the test should run.
- NOVELTY EFFECT: When the period is too short, users tend to react to the change simply because it is new, but the effect can wear off over time.
- MATURATION EFFECT: When you run the test for too long, you risk external effects influencing your results.
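
The warning about stopping early can be checked with a small A/A simulation: both groups share the same true click rate, so every significant result is a false positive. The sketch below compares stopping at the first significant look with evaluating once at the planned end. All parameters (number of looks, users per look, click rate, seed) are hypothetical, and the function names are illustrative; it uses a pooled two-proportion z-test from the standard library rather than the tests used later in this document.

```python
import random
from math import erfc, sqrt


def two_sided_p_value(clicks_a, clicks_b, n):
    """Two-sided p-value of a pooled two-proportion z-test with n users per group."""
    pooled = (clicks_a + clicks_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0  # no variation observed, nothing to test
    z = (clicks_a / n - clicks_b / n) / se
    return erfc(abs(z) / sqrt(2))


def simulate_peeking(n_experiments=300, n_looks=10, users_per_look=100,
                     rate=0.2, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate at fixed end)."""
    rng = random.Random(seed)
    stopped_early = 0   # significant at ANY look (peeking)
    fixed_horizon = 0   # significant at the final, pre-planned look
    for _ in range(n_experiments):
        clicks_a = clicks_b = n = 0
        significant_at_any_look = False
        for _ in range(n_looks):
            # New users arrive in both groups with the same true click rate
            clicks_a += sum(rng.random() < rate for _ in range(users_per_look))
            clicks_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p = two_sided_p_value(clicks_a, clicks_b, n)
            if p < alpha:
                significant_at_any_look = True
        stopped_early += significant_at_any_look
        fixed_horizon += p < alpha  # p now holds the value from the last look
    return stopped_early / n_experiments, fixed_horizon / n_experiments


peeking_rate, fixed_rate = simulate_peeking()
print(f"False positive rate when stopping at first significance: {peeking_rate:.2%}")
print(f"False positive rate with a fixed duration: {fixed_rate:.2%}")
```

In simulations of this kind the first rate is typically well above alpha, while the second stays close to it, which is why the duration should be fixed in advance.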



***

## **03. RUN THE TEST**

In [16]:
import pandas as pd
import numpy as np

N_samples_experiment = 1000
N_samples_control = 1000

# Generate random click data: each user is a Bernoulli draw (1 = clicked, 0 = not clicked)
click_experiment = pd.Series(np.random.binomial(1, 0.5, size=N_samples_experiment))
click_control = pd.Series(np.random.binomial(1, 0.2, size=N_samples_control))


# Simulated example data based on required sample size
# For simplicity, let's assume the experiment resulted in the following observed counts
#click_control = np.random.binomial(n=sample_size, p=baseline_rate)
#click_experiment = np.random.binomial(n=sample_size, p=baseline_rate + mde)



# Generate group identifier
exp_id = pd.Series(np.repeat("exp", N_samples_experiment))
ctr_id = pd.Series(np.repeat("ctr", N_samples_control))

df_exp = pd.concat([click_experiment, exp_id], axis=1)
df_ctr = pd.concat([click_control, ctr_id], axis=1)

df_exp.columns = ["is_clicked", "group"]
df_ctr.columns = ["is_clicked", "group"]

df_ab_test = pd.concat([df_ctr, df_exp], ignore_index=True)  # avoid duplicate index labels

df_ab_test.head()



|   | is_clicked | group |
|---|------------|-------|
| 0 | 0          | ctr   |
| 1 | 0          | ctr   |
| 2 | 1          | ctr   |
| 3 | 1          | ctr   |
| 4 | 0          | ctr   |


## **04. ANALYSE RESULTS**

In [10]:
# Shows the numbers of clicks and non-clicks by the group
df_ab_test_dic = df_ab_test.value_counts().to_dict()
print(f"Clicks by group \n {df_ab_test_dic}\n\n")

# Calculates the probabilities of clicks for both groups
p_hat_exp = df_ab_test_dic[(1, "exp")] / N_samples_experiment
p_hat_ctr = df_ab_test_dic[(1, "ctr")] / N_samples_control

print(f"Click probability in Control {p_hat_ctr}")
print(f"Click probability in Experiment {p_hat_exp}")


Clicks by group 
 {(0, 'ctr'): 795, (1, 'exp'): 505, (0, 'exp'): 495, (1, 'ctr'): 205}


Click probability in Control 0.205
Click probability in Experiment 0.505


In [13]:
from scipy.stats import chi2_contingency, fisher_exact

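# Create the contingency table (rows: groups, columns: clicked / not clicked)
data = np.array([[df_ab_test_dic[(1, "ctr")], df_ab_test_dic[(0, "ctr")]],   # control group
                 [df_ab_test_dic[(1, "exp")], df_ab_test_dic[(0, "exp")]]])  # experimental group

# Perform the chi-square test (scipy applies Yates' continuity correction to 2x2 tables by default)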
chi2, p, dof, expected = chi2_contingency(data)

print(f"Chi2 Statistic: {chi2}")
print(f"P-Value: {p}")

# Interpret the result
alpha = 0.05  # significance level
if p < alpha:
    print("Reject the null hypothesis - there is a significant difference between the groups.")
else:
    print("Fail to reject the null hypothesis - there is no significant difference between the groups.")

# Fisher's Exact Test is preferable when sample sizes are small or any expected
# frequency is below 5; it is run here for comparison
oddsratio, p_value = fisher_exact(data)
print(f"Fisher's Exact Test P-Value: {p_value}")

if p_value < alpha:
    print("Reject the null hypothesis - there is a significant difference between the groups.")
else:
    print("Fail to reject the null hypothesis - there is no significant difference between the groups.")


Chi2 Statistic: 195.2200021836445
P-Value: 2.3067270314680285e-44
Reject the null hypothesis - there is a significant difference between the groups.
Fisher's Exact Test P-Value: 2.2579432263396825e-45
Reject the null hypothesis - there is a significant difference between the groups.
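
The chi-square and Fisher tests above are two-sided, while the H1 in section 01 is directional. As a complement, the sketch below runs a one-sided two-proportion z-test and computes a 95% confidence interval for the difference, using the click counts printed earlier in this section (205 of 1,000 in control, 505 of 1,000 in experiment). It relies only on the standard library; `one_sided_z_test` and `diff_confidence_interval` are illustrative names, and the Wald interval is one of several possible choices.

```python
from math import erfc, sqrt
from statistics import NormalDist


def one_sided_z_test(clicks_exp, n_exp, clicks_ctr, n_ctr):
    """Pooled two-proportion z-test for H1: experiment rate > control rate."""
    p_exp, p_ctr = clicks_exp / n_exp, clicks_ctr / n_ctr
    pooled = (clicks_exp + clicks_ctr) / (n_exp + n_ctr)
    se = sqrt(pooled * (1 - pooled) * (1 / n_exp + 1 / n_ctr))
    z = (p_exp - p_ctr) / se
    p_value = 0.5 * erfc(z / sqrt(2))  # upper-tail probability, stable for large z
    return z, p_value


def diff_confidence_interval(clicks_exp, n_exp, clicks_ctr, n_ctr, confidence=0.95):
    """Wald confidence interval for the difference in click rates (unpooled SE)."""
    p_exp, p_ctr = clicks_exp / n_exp, clicks_ctr / n_ctr
    se = sqrt(p_exp * (1 - p_exp) / n_exp + p_ctr * (1 - p_ctr) / n_ctr)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_exp - p_ctr
    return diff - z_crit * se, diff + z_crit * se


z, p_value = one_sided_z_test(505, 1000, 205, 1000)
low, high = diff_confidence_interval(505, 1000, 205, 1000)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.3g}")
print(f"95% CI for the difference in click rate: [{low:.3f}, {high:.3f}]")
```

The interval conveys the size of the effect, which the p-value alone does not.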


__REFERENCES__

[https://tatevaslanyan.com/complete-guide-to-a-b-testing-design-implementation-and-pitfalls/]