# **Fundamentals of A/B Testing**

*An A/B test is an experiment with two groups to establish which of the two treatments, products, procedures, or the like is superior. Often one of the two treatments is the standard existing treatment or no treatment. If a standard (or no) treatment is used, it is called the control. A typical hypothesis is that a new treatment is better than the control. A/B tests are common in web design and marketing since results are so readily measured. Some examples of A/B testing include:*

*   Testing two soil treatments to determine which produces better seed germination.
*   Testing two therapies to determine which suppresses cancer more effectively.
*   Testing two prices to determine which yields more net profit.
*   Testing two web headlines to determine which produces more clicks.
*   Testing two web ads to determine which generates more conversions.

A/B testing in data science is generally used in a web context. Treatments might be the design of a web page, the price of a product, the wording of a headline, or some other item. Some thought is required to preserve the principles of randomization. Typically the subject in the experiment is the web visitor, and the outcomes we are interested in measuring are clicks, purchases, visit duration, the number of pages visited, whether a particular page is visited and the like. In a standard A/B experiment, we need to decide on one metric ahead of time. Multiple behavior metrics might be collected and be of interest, but if the experiment is expected to lead to a decision between treatment A and treatment B, a single metric or test statistic needs to be established beforehand. Selecting a test statistic after the experiment is conducted opens the door to researcher bias.
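One common way to preserve randomization on the web is to assign each visitor to a treatment by hashing a stable identifier, so that the same visitor always sees the same variant. The sketch below is only an illustration under stated assumptions: the function name `assign_variant`, the experiment label, and the visitor IDs are hypothetical and do not come from the dataset used later.

```python
import hashlib

# Decide on the single decision metric before the experiment starts (hypothetical name).
PRIMARY_METRIC = "clicks"


def assign_variant(visitor_id: str, experiment: str = "headline-test") -> str:
    """Deterministically map a visitor to 'A' or 'B' with roughly equal probability."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).hexdigest()
    # An even hash value goes to A, an odd one to B.
    return "A" if int(digest, 16) % 2 == 0 else "B"


# Hypothetical visitor IDs, used only to show that the split is close to 50/50.
visitors = [f"visitor-{i}" for i in range(10_000)]
assignments = [assign_variant(v) for v in visitors]
share_a = assignments.count("A") / len(assignments)
print(f"Share assigned to A: {share_a:.3f}")
```

Because the assignment depends only on the visitor ID and the experiment label, a returning visitor stays in the same group, and the metric named in `PRIMARY_METRIC` is fixed before any data are examined.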

#### **References:**

> [**Frequentist A/B Testing**](https://ethen8181.github.io/machine-learning/ab_tests/frequentist_ab_test.html)

> [**A/B Test Significance in Python**](https://cosmiccoding.com.au/tutorials/ab_tests)

# **A/B Testing**

A/B testing is a user experience research technique: a statistical test that lets us decide which of two features or strategies performs better. Many behaviors can be tested this way, such as signing up for a site, clicking on advertisements, or completing a purchase. In this dataset, we examine whether a change made to the web interface of a grocery website increases the number of clicks.

In [None]:
# Install Kaggle.
!pip install --upgrade --force-reinstall --no-deps kaggle

In [None]:
# Files Upload.
from google.colab import files

files.upload()

In [None]:
# Create a Kaggle folder (-p avoids an error if it already exists).
!mkdir -p ~/.kaggle

# Copy the kaggle.json to the folder created.
!cp kaggle.json ~/.kaggle/

# Restrict the API key file to owner read/write, as the Kaggle CLI expects.
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Dataset Download.
!kaggle datasets download -d tklimonova/grocery-website-data-for-ab-test

In [None]:
# Unzip Dataset.
!unzip grocery-website-data-for-ab-test.zip

In [None]:
# Import Library.
import pandas as pd
import numpy as np
from scipy.stats import shapiro, mannwhitneyu
import warnings

warnings.filterwarnings("ignore")

# Load Dataset.
data = pd.read_csv("grocerywebsiteabtestdata.csv")
data.head()

Unnamed: 0,RecordID,IP Address,LoggedInFlag,ServerID,VisitPageFlag
0,1,39.13.114.2,1,2,0
1,2,13.3.25.8,1,1,0
2,3,247.8.211.8,1,1,0
3,4,124.8.220.3,0,3,0
4,5,60.10.192.7,0,2,0


In [None]:
# Dataset Shape.
print("Shape of the Dataset is", data.shape)

# Drop Missing Values.
data = data.dropna()

Shape of the Dataset is (184588, 5)


***An IP address may have visited the page more than once. To keep repeat visits from skewing the results, count each IP address only once: set "VisitPageFlag" to 1 for any IP address with one or more visits.***

In [None]:
df = data.groupby(["IP Address", "LoggedInFlag", "ServerID"])["VisitPageFlag"].sum()

df = df.reset_index(name="VisitPageFlagSum")
df.head(10)

Unnamed: 0,IP Address,LoggedInFlag,ServerID,VisitPageFlagSum
0,0.0.108.2,0,1,0
1,0.0.109.6,1,1,0
2,0.0.111.8,0,3,0
3,0.0.160.9,1,2,0
4,0.0.163.1,0,2,0
5,0.0.169.1,1,1,0
6,0.0.178.9,1,2,0
7,0.0.181.9,0,1,1
8,0.0.185.4,1,3,0
9,0.0.192.6,1,3,0


In [None]:
# Flag is 1 if the IP address visited the page at least once, else 0.
df["VisitPageFlag"] = (df["VisitPageFlagSum"] != 0).astype(int)
df.head(10)

Unnamed: 0,IP Address,LoggedInFlag,ServerID,VisitPageFlagSum,VisitPageFlag
0,0.0.108.2,0,1,0,0
1,0.0.109.6,1,1,0,0
2,0.0.111.8,0,3,0,0
3,0.0.160.9,1,2,0,0
4,0.0.163.1,0,2,0,0
5,0.0.169.1,1,1,0,0
6,0.0.178.9,1,2,0,0
7,0.0.181.9,0,1,1,1
8,0.0.185.4,1,3,0,0
9,0.0.192.6,1,3,0,0
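
The sum-then-flag steps above can also be collapsed into a single aggregation, because the maximum of a 0/1 column is 1 exactly when at least one row is 1. The sketch below shows this on a small made-up frame (the toy rows are an assumption for illustration, not taken from the dataset).

```python
import pandas as pd

# Toy data: two IP addresses appear twice, one appears once.
toy = pd.DataFrame(
    {
        "IP Address": ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3", "3.3.3.3"],
        "LoggedInFlag": [1, 1, 0, 0, 0],
        "ServerID": [1, 1, 2, 3, 3],
        "VisitPageFlag": [1, 1, 0, 0, 1],
    }
)

# max() of a binary flag is 1 if any visit hit the page, so each IP is counted once.
deduped = toy.groupby(
    ["IP Address", "LoggedInFlag", "ServerID"], as_index=False
)["VisitPageFlag"].max()
print(deduped)
```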


***Split the dataset into "Test" and "Control" groups using "ServerID": ServerID 1 is the "Test" group, and ServerIDs 2 and 3 together form the "Control" group.***

In [None]:
df["group"] = df["ServerID"].map({1: "Test", 2: "Control", 3: "Control"})
df.drop(["ServerID", "VisitPageFlagSum"], axis=1, inplace=True)

df.head()

Unnamed: 0,IP Address,LoggedInFlag,VisitPageFlag,group
0,0.0.108.2,0,0,Test
1,0.0.109.6,1,0,Test
2,0.0.111.8,0,0,Control
3,0.0.160.9,1,0,Control
4,0.0.163.1,0,0,Control


In [None]:
df_control = df[df["group"] == "Control"].copy()
df_control.reset_index(inplace=True, drop=True)

df_control.head()

Unnamed: 0,IP Address,LoggedInFlag,VisitPageFlag,group
0,0.0.111.8,0,0,Control
1,0.0.160.9,1,0,Control
2,0.0.163.1,0,0,Control
3,0.0.178.9,1,0,Control
4,0.0.185.4,1,0,Control


In [None]:
df_test = df[df["group"] == "Test"].copy()
df_test.reset_index(inplace=True, drop=True)

df_test.head()

Unnamed: 0,IP Address,LoggedInFlag,VisitPageFlag,group
0,0.0.108.2,0,0,Test
1,0.0.109.6,1,0,Test
2,0.0.169.1,1,0,Test
3,0.0.181.9,0,1,Test
4,0.0.195.5,1,0,Test


***Examine the descriptive statistics of the Control Group.***

In [None]:
df_control.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
LoggedInFlag,66460.0,0.503912,0.499988,0.0,0.0,1.0,1.0,1.0
VisitPageFlag,66460.0,0.092251,0.289382,0.0,0.0,0.0,0.0,1.0


***Examine the descriptive statistics of the Test Group.***

In [None]:
df_test.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
LoggedInFlag,33303.0,0.503258,0.499997,0.0,0.0,1.0,1.0,1.0
VisitPageFlag,33303.0,0.115515,0.319647,0.0,0.0,0.0,0.0,1.0
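
The group sizes reported above (66,460 control and 33,303 test visitors) can be sanity-checked against the split we would expect from the design. The sketch below assumes that traffic is distributed evenly across the three servers, so that Control (two servers) should receive about twice as many visitors as Test (one server); that assumption is not stated in the dataset and is labeled in the code.

```python
from scipy.stats import chisquare

# Group sizes taken from the describe() outputs above.
n_control, n_test = 66460, 33303
total = n_control + n_test

# ASSUMPTION: visitors are spread evenly over servers 1, 2 and 3,
# so Control (servers 2 and 3) expects 2/3 of traffic and Test (server 1) 1/3.
expected = [total * 2 / 3, total * 1 / 3]

# Chi-square goodness-of-fit test of observed vs. expected group sizes.
srm_stat, srm_p = chisquare(f_obs=[n_control, n_test], f_exp=expected)
print("Chi-square = %.4f, p-value = %.4f" % (srm_stat, srm_p))
```

A large p-value here would indicate that the observed group sizes are compatible with the assumed 2:1 split; a very small one would suggest a problem with how visitors were assigned.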


***For the control group ("df_control"), calculate the share of visitors who reached the target page out of all visitors to the site.***

In [None]:
control_sum_visit = df_control["VisitPageFlag"].count()
print("Sum visit for control group:", control_sum_visit)

control_visit_1 = df_control[df_control["VisitPageFlag"] == 1]["VisitPageFlag"].count()
print("Visit Page target = 1 :", control_visit_1)

print("Control Group Ratio is", control_visit_1 / control_sum_visit)

Sum visit for control group: 66460
Visit Page target = 1 : 6131
Control Group Ratio is 0.09225097803189888


***For the test group ("df_test"), calculate the share of visitors who reached the target page out of all visitors to the site.***

In [None]:
test_sum_visit = df_test["VisitPageFlag"].count()
print("Sum visit for test group:", test_sum_visit)

test_visit_1 = df_test[df_test["VisitPageFlag"] == 1]["VisitPageFlag"].count()
print("Visit Page target = 1 :", test_visit_1)

print("Test Group Ratio is", test_visit_1 / test_sum_visit)

Sum visit for test group: 33303
Visit Page target = 1 : 3847
Test Group Ratio is 0.11551511875806984


Looking at the raw click rates, the two groups differ: the new feature shown to the test group appears to get more clicks. A raw difference can arise by chance, however, so we need to check whether it is statistically significant. We now run the hypothesis test.
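
To put the gap between the two rates in perspective, the sketch below computes the absolute difference, the relative lift, and an approximate 95% confidence interval for the difference, using only the counts printed above. The normal-approximation (Wald) interval and the critical value 1.96 are choices made for this illustration, not part of the original analysis.

```python
import math

# Counts taken from the outputs above.
control_visits, control_total = 6131, 66460
test_visits, test_total = 3847, 33303

p_control = control_visits / control_total
p_test = test_visits / test_total

diff = p_test - p_control          # absolute difference in click rate
relative_lift = diff / p_control   # difference relative to the control rate

# Standard error of the difference between two independent proportions.
se = math.sqrt(
    p_control * (1 - p_control) / control_total
    + p_test * (1 - p_test) / test_total
)

z_crit = 1.96  # ASSUMPTION: two-sided 95% interval under a normal approximation
ci_low, ci_high = diff - z_crit * se, diff + z_crit * se

print("Absolute difference: %.4f" % diff)
print("Relative lift: %.1f%%" % (100 * relative_lift))
print("Approx. 95%% CI for the difference: (%.4f, %.4f)" % (ci_low, ci_high))
```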

Which hypothesis test applies depends on whether the data satisfy the normality and variance homogeneity assumptions:

   1. If both normality and variance homogeneity hold, an independent two-sample t-test (parametric test) is applied.
   2. If these assumptions do not hold, the **Mann-Whitney U test** (non-parametric test) is performed.
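
The decision rule above can be sketched as a small helper. This is a minimal illustration, not the procedure used in the rest of this notebook: the function name `compare_groups` is hypothetical, Levene's test is assumed as the variance homogeneity check (the notebook itself only checks normality), and the data are synthetic.

```python
import numpy as np
from scipy.stats import levene, mannwhitneyu, shapiro, ttest_ind


def compare_groups(a, b, alpha=0.05):
    """Pick a t-test or a Mann-Whitney U test based on the assumption checks."""
    # Normality is checked separately for each group with the Shapiro-Wilk test.
    normal = shapiro(a).pvalue > alpha and shapiro(b).pvalue > alpha
    # ASSUMPTION: Levene's test stands in for the variance homogeneity check.
    equal_var = levene(a, b).pvalue > alpha
    if normal and equal_var:
        return "t-test", ttest_ind(a, b, equal_var=True).pvalue
    return "mann-whitney", mannwhitneyu(a, b, alternative="two-sided").pvalue


# Synthetic 0/1 outcomes, similar in kind to a click flag.
rng = np.random.default_rng(0)
binary_a = rng.binomial(1, 0.10, size=500)
binary_b = rng.binomial(1, 0.20, size=500)

chosen_test, p_val = compare_groups(binary_a, binary_b)
print(chosen_test, "p-value = %.4f" % p_val)
```

With 0/1 outcomes the normality check fails, so the helper falls through to the Mann-Whitney U branch.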

**How to check the assumption of normality?**

We first state the $H_{0}$ and $H_{1}$ hypotheses.

- $H_{0}$: The data are normally distributed.
- $H_{1}$: The data are not normally distributed.

## **Normality Assumption**

In [None]:
test_stat, p_value = shapiro(df_control["VisitPageFlag"])
print("Test Stat = %.4f, p-value = %.4f" % (test_stat, p_value))

Test Stat = 0.3266, p-value = 0.0000


In [None]:
test_stat, p_value = shapiro(df_test["VisitPageFlag"])
print("Test Stat = %.4f, p-value = %.4f" % (test_stat, p_value))

Test Stat = 0.3711, p-value = 0.0000


$H_{0}$ is rejected for both groups because the p-value is $<$ 0.05, so the normality assumption does not hold. This is expected for a column that takes only the values 0 and 1.

Therefore, we will use the **Mann-Whitney U test**.

- $H_{0}$: There is no significant difference between the two groups in terms of click rate to the desired page.
- $H_{1}$: There is a difference between the two groups in terms of click rate to the desired page.

## **Mann-Whitney U Test**

In [None]:
test_stat, p_value = mannwhitneyu(df_control["VisitPageFlag"], df_test["VisitPageFlag"])
print("Test Stat = %.4f, p-value = %.4f" % (test_stat, p_value))

Test Stat = 1080913226.5000, p-value = 0.0000


$H_{0}$ is rejected because the p-value is $<$ 0.05. In other words, the difference between the two groups is statistically significant.
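
The U statistic is easier to interpret once it is scaled by the number of control/test pairs. The sketch below uses only numbers printed earlier and assumes that the reported statistic is the U value for the first argument (the control group), which is consistent with the counts here. For 0/1 data, the scaled value should equal $0.5 + (p_{control} - p_{test}) / 2$.

```python
# Values taken from the outputs above.
u_stat = 1080913226.5
n_control, n_test = 66460, 33303
p_control = 6131 / n_control
p_test = 3847 / n_test

# Share of (control, test) pairs in which the control visitor has the higher
# flag, counting ties as one half.
prob_superiority = u_stat / (n_control * n_test)

# For binary outcomes the same quantity follows directly from the two rates.
expected_from_rates = 0.5 + (p_control - p_test) / 2

print("U / (n_control * n_test) = %.4f" % prob_superiority)
print("0.5 + (p_control - p_test) / 2 = %.4f" % expected_from_rates)
```

A value below 0.5 indicates that control visitors tend to have lower flag values than test visitors, which matches the lower click rate in the control group.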

In [None]:
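# Count visitors per (group, VisitPageFlag) combination.
group_count = (
    df.groupby(["group", "VisitPageFlag"])["group"].count().reset_index(name="Count")
)

# Pivot the counts into a group x VisitPageFlag table with row and column totals.
groupped = pd.crosstab(
    group_count["group"],
    group_count["VisitPageFlag"],
    values=group_count["Count"],
    aggfunc="sum",  # the string form avoids the np.sum deprecation warning in recent pandas
    margins=True,
)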

In [None]:
100 * groupped.div(groupped["All"], axis=0)

group,VisitPageFlag = 0 (%),VisitPageFlag = 1 (%),All (%)
Control,90.774902,9.225098,100.0
Test,88.448488,11.551512,100.0
All,89.998296,10.001704,100.0
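
Since the outcome is a 0/1 flag, the same comparison can be cross-checked with a chi-square test of independence on the underlying 2x2 count table. This is offered as a complementary sketch, not as the method used above; the counts are derived from the totals printed earlier, and `chi2_contingency` applies a continuity correction to 2x2 tables by default.

```python
from scipy.stats import chi2_contingency

# Rows: Control, Test. Columns: VisitPageFlag = 0, VisitPageFlag = 1.
table = [
    [66460 - 6131, 6131],
    [33303 - 3847, 3847],
]

chi2_stat, chi2_p, dof, expected_counts = chi2_contingency(table)
print("Chi-square = %.2f, dof = %d, p-value = %.3g" % (chi2_stat, dof, chi2_p))
```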


## **Conclusion**

***The rate of clicking on the link was 9.23% in the Control group and 11.55% in the Test group. Based on our tests, this increase is statistically significant and unlikely to be due to chance.***