In [2]:
import scipy.stats as scs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

**A/B Testing** 

The goal of A/B testing is to check whether a change (a treatment) is going to have a significant impact on the target population. It is a randomized controlled experiment in which we test 2 versions of the same variable and evaluate which performs better in the controlled environment.

**How does it work?**

Familiarize yourself with the following terms before we proceed further:

1. **Hypothesis** refers to a claim made about the population. It is a statement whose validity is tested by the statistical analysis underlying A/B testing. A standard test involves two hypotheses. The Null Hypothesis ($H_{o}$) states that any effect of the treatment observed in the sample is purely due to chance. The Alternate Hypothesis ($H_{a}$) is the one the researcher hopes to find evidence for: it states that the treatment affected the response of the sample population.
2. Generally in A/B testing we split the traffic (that is, the population) into 2 groups, the **Control Group** and the **Test Group**. Each group is obtained by random sampling (hence the process being referred to as a randomized experiment). It is very important to ensure the randomness of the sampling process, as it guards against *sampling bias*: we want our sample to be as representative as possible of the true (and significantly larger) population.
Let's understand the 2 groups in the classic context, where A/B testing is used to assess the impact of changes (the treatment) made to a website. The samples obtained from the traffic to version A (original) of the website form the control group, whereas the samples from the traffic to version B (enhanced) form the test group.
3. **Sample Size** is a vital part of the experiment too, as it determines how well we can overcome *undercoverage bias*. Too few samples may not cover the likely types of users of the product.
4. **Errors during the test** There are two types of errors that may occur in our hypothesis testing:

  *Type I error*: We reject the null hypothesis when it is true. That is, we accept variant B when it does not actually perform better than A.

  *Type II error*: We fail to reject the null hypothesis when it is false. That is, we conclude that variant B is no better when it actually performs better than A.
5. **Baseline Conversion Rate and Lift:** The baseline conversion rate is the rate at which existing users sign up on the website, whereas the desired lift is the increase in the conversion rate that we are trying to achieve with the design changes implemented in version B.
6. **Significance Level:** The probability of a Type I error, denoted by $\alpha$. In simple words, it is the probability that we reject the null hypothesis when it is true.
7. **P-value:** The probability of observing a result at least as extreme as the one measured, assuming the null hypothesis is true. It is compared against $\alpha$: a p-value below $\alpha$ leads us to reject the null hypothesis, while a larger p-value means the observed difference is consistent with chance.
8. **Metric for Evaluation** is an integral part of our analysis. It determines how we will quantify the results of our evaluation and use them to make inferences about the data. 
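
To make the link between the significance level and Type I errors concrete, here is a minimal simulation sketch. It runs many "A/A" experiments in which both groups share the same true conversion rate, so every rejection is a false positive. The conversion rate, group size, and number of trials are arbitrary values chosen for illustration, and the helper names are made up for this example.

```python
import random
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # identical all-or-nothing outcomes: no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(p=0.12, n=500, trials=2000, alpha=0.05, seed=0):
    """Share of A/A experiments (no real difference) that still reject H0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Both groups are drawn with the same true conversion rate p
        conv_a = sum(rng.random() < p for _ in range(n))
        conv_b = sum(rng.random() < p for _ in range(n))
        if two_sided_p_value(conv_a, n, conv_b, n) < alpha:
            rejections += 1
    return rejections / trials

rate = false_positive_rate()
print(f"Share of A/A tests that rejected H0: {rate:.3f}")
```

The printed share should land near $\alpha$ (0.05 here), which is what "probability of a Type I error" means in practice.
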
####Let's start with defining the associated variables!

In [3]:
#import packages
import pandas as pd
import math
import statsmodels.stats.api as sms
import scipy.stats as st



In [4]:
dataPath = "/content/drive/My Drive/ab_data.csv"
raw_data = pd.read_csv(dataPath)
df = raw_data.copy()
print("Number of rows: ", df.shape[0], " Number of columns: ", df.shape[1])
df.head()

Number of rows:  294478  Number of columns:  5


Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


Some of the users in the control group experienced the new page and, similarly, a few users in the treatment group were directed to the old page. These mismatched rows cannot be attributed cleanly to either variant, so we drop them.

In [5]:
mask1 = (df["group"] == "control") & (df["landing_page"] == "new_page")
index_to_drop1 = df[mask1].index
df = df.drop(index_to_drop1)

mask2 = (df["group"] == "treatment") & (df["landing_page"] == "old_page")
index_to_drop2 = df[mask2].index
df = df.drop(index_to_drop2)

df["group"].value_counts()

treatment    145311
control      145274
Name: group, dtype: int64
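
The two masks above can also be folded into a single consistency check. The sketch below uses a tiny made-up frame (the rows are illustrative, not taken from `ab_data.csv`) and assumes, as in this dataset, that there are only two groups and two pages: a row is kept when "is control" and "saw the old page" agree.

```python
import pandas as pd

# Toy data for illustration only; these rows are not from ab_data.csv
toy = pd.DataFrame({
    "group":        ["control", "control", "treatment", "treatment", "control"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page", "old_page"],
    "converted":    [0, 1, 0, 1, 1],
})

# Cross-tabulate to see how many rows fall into each group/page combination
print(pd.crosstab(toy["group"], toy["landing_page"]))

# Keep a row only if "group is control" and "page is old_page" are both true or both false
is_control = toy["group"] == "control"
saw_old_page = toy["landing_page"] == "old_page"
toy_clean = toy[is_control == saw_old_page]

print(len(toy_clean), "of", len(toy), "rows kept")
```

The cross-tabulation is a convenient before-and-after check: once the mismatched rows are gone, its off-diagonal cells should be zero.
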

In [6]:
# remove duplicate users from the data
print(df["user_id"].count())
print(df["user_id"].nunique())
df.drop_duplicates(subset ='user_id',keep ='first',inplace = True)

290585
290584


In [7]:
# Show the % split between users who saw new vs old page
# Count conversions and users in each group
mask = (df["group"] == "control")
conversions_control = df["converted"][mask].sum()
total_users_control = df["converted"][mask].count()

mask = (df["group"] == "treatment")
conversions_treatment = df["converted"][mask].sum()
total_users_treatment = df["converted"][mask].count()

print("Split of control users who saw old page vs treatment users who saw new page: ", 
          round(total_users_control / df["converted"].count() * 100, 2), "% ",
          round((total_users_treatment / df["converted"].count()) * 100, 2), "%")

Split of control users who saw old page vs treatment users who saw new page:  49.99 %  50.01 %


In [8]:
# count number of users who converted in each group
print("Number of control users who converted on old page: ", conversions_control)
print("Percentage of control users who converted: ", round((conversions_control / total_users_control) * 100, 2), "%")

mask = (df["group"] == "treatment")
print("Number of treatment users who converted on new page: ", conversions_treatment)
print("Percentage of treatment users who converted: ", round((conversions_treatment/ total_users_treatment) * 100, 2), "%")

Number of control users who converted on old page:  17489
Percentage of control users who converted:  12.04 %
Number of treatment users who converted on new page:  17264
Percentage of treatment users who converted:  11.88 %


**User Defined variables:**

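1. *Baseline Conversion Rate:* The estimate of the metric being analyzed before any changes are made.
2. *Practical Significance Level:* The minimum improvement that would be worthwhile for the business.
3. *Confidence Level:* $1 - \alpha$, the probability of not rejecting the null hypothesis when it is true. A 95% confidence level corresponds to $\alpha = 0.05$, the threshold the p-value is compared against.
4. *Sensitivity:* Also called statistical power, $1 - \beta$. It is the probability that the null hypothesis is rejected when it is false.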

The baseline rate can be estimated using historical data, the practical significance level will depend on what makes sense to the business and the confidence level and sensitivity are generally set at 95% and 80% respectively but can be adjusted to suit different experiments or business needs.
Once these are set, the sample size required can be calculated statistically.

In [20]:
#Check what sample size is required
baseline_rate = conversions_control / total_users_control
practical_significance = 0.01
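confidence_level = 0.05  # this is alpha, i.e. 1 - confidence level (95%)
sensitivity = 0.7  # statistical power (70% here; 80% is the more common choice)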

effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + practical_significance)
sample_size = sms.NormalIndPower().solve_power(effect_size = effect_size, power = sensitivity, 
                                               alpha = confidence_level, ratio=1)
print("Required sample size: ", round(sample_size), " per group")

Required sample size:  13532  per group
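
As a rough cross-check of the value above, the required size per group can also be approximated in closed form as $n = 2\,(z_{1-\alpha/2} + z_{power})^2 / h^2$, where $h$ is Cohen's effect size for two proportions. The sketch below uses only the standard library. The function name `approx_sample_size` is made up for this example, and the baseline rate is recomputed from the counts printed in the earlier outputs. The formula ignores the tiny contribution of the opposite tail, so it may differ slightly from the `statsmodels` result.

```python
from math import asin, sqrt, ceil
from statistics import NormalDist

def approx_sample_size(p_baseline, min_lift, alpha=0.05, power=0.7):
    """Approximate per-group sample size for a two-sided two-proportion test.

    Illustrative helper (not part of statsmodels); assumes equal group sizes.
    """
    # Cohen's h: difference of arcsine-transformed proportions
    h = 2 * asin(sqrt(p_baseline + min_lift)) - 2 * asin(sqrt(p_baseline))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    return 2 * (z_alpha + z_power) ** 2 / h ** 2

baseline = 17489 / 145274  # control conversions / control users, taken from the outputs above
n_per_group = approx_sample_size(baseline, min_lift=0.01)
print("Approximate sample size per group:", ceil(n_per_group))
```

Raising `power` to 0.8 or shrinking `min_lift` increases the required size, which is a quick way to see how sensitive the plan is to these choices.
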


####**Let's begin with the A/B Testing**

Prior to proceeding with the experiment, it is essential to develop an understanding of the experimental design. This includes asking and answering questions such as:
1. How long will the experiment run and how frequently will it be tested?
2. How many users will be considered as part of the experiment? If the number of users is small, we need to increase the run time of the experiment to get sound results.
3. How to account for learning effect? Sometimes, changes made to the website are subtle and not easily recognizable to the users in which case we need to give them some time to get acquainted with the changes.
4. Tools to capture the statistics of the data being used in the experiment. Tools may limit the data accessible to us and sometimes can introduce inherent biases of their own.
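
For the first two questions, a back-of-the-envelope estimate of the run time follows directly from the required sample size. The sketch below is only illustrative: `days_to_run` is a made-up helper, and the daily traffic figure and the share of traffic entering the experiment are hypothetical planning inputs, not values from this dataset.

```python
from math import ceil

def days_to_run(n_per_group, daily_visitors, traffic_fraction=1.0, n_groups=2):
    """Days needed to collect n_per_group users in each group.

    daily_visitors and traffic_fraction are hypothetical planning inputs.
    """
    users_needed = n_per_group * n_groups
    users_per_day = daily_visitors * traffic_fraction
    return ceil(users_needed / users_per_day)

# 13532 per group is the sample size computed earlier; 4000 visitors/day is assumed
print(days_to_run(13532, daily_visitors=4000))                        # all traffic in the test
print(days_to_run(13532, daily_visitors=4000, traffic_fraction=0.5))  # half of the traffic
```
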

We start with a simple setting: 

$H_{o}$ states that the probability of conversion in the treatment group minus the probability of conversion in the control group equals zero, and $H_{a}$ states that the probability of conversion in the treatment group minus the probability of conversion in the control group does not equal zero.

In [21]:
def groupStats(group):
    mask = (df["group"] == group)
    conversions = df["converted"][mask].sum()
    total_users = df["converted"][mask].count()

    return conversions,total_users

conversions_control, total_users_control = groupStats(group="control")
conversions_treatment, total_users_treatment = groupStats(group="treatment")

prob_pooled = (conversions_control + conversions_treatment) / (total_users_control + total_users_treatment)
print(prob_pooled)

0.11959708724499628


In [18]:
# Calculate pooled standard error and margin of error
se_pooled = math.sqrt(prob_pooled * (1 - prob_pooled) * (1 / total_users_control + 1 / total_users_treatment))
z_score = st.norm.ppf(1 - confidence_level / 2)
margin_of_error = se_pooled * z_score

# Calculate the estimated difference between probability of conversions in the experiment and control groups
d_hat = (conversions_treatment / total_users_treatment) - (conversions_control / total_users_control)

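# Build the confidence interval around the estimated difference
lower_bound = d_hat - margin_of_error
upper_bound = d_hat + margin_of_error

# Reject the null hypothesis (difference = 0) only if the interval excludes zero.
# A launch decision would additionally require lower_bound > practical_significance.
if lower_bound > 0 or upper_bound < 0:
    print("Reject null hypothesis")
else:
    print("Do not reject the null hypothesis")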
    
print("The lower bound of the confidence interval is ", round(lower_bound * 100, 2), "%")
print("The upper bound of the confidence interval is ", round(upper_bound * 100, 2), "%")

Do not reject the null hypothesis
The lower bound of the confidence interval is  -0.39 %
The upper bound of the confidence interval is  0.08 %


The process used above is known as the **2-sample Z-test** in statistical terms. We read about it in books, but most of us have never thought about its implementation in the real world.

Let's go over the process. $n_1$ and $n_2$ are the sizes of the two samples, and $p_1$ and $p_2$ are the corresponding sample proportions (here, the conversion rates of the two groups).

Since the null hypothesis states that the two underlying proportions are equal, we use a pooled sample proportion ($p$) to compute the standard error of the sampling distribution.
The pooled probability, $p = \frac{p_1 n_1 + p_2 n_2}{n_1 + n_2}$

Standard error (SE) of the sampling distribution of the difference between two proportions:
$SE = \sqrt{p (1 - p) \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$

The test statistic is a z-score ($z$) defined by the following equation:
$z = \frac{p_1 - p_2}{SE}$

The margin of error combines the SE with the critical value $z_{\alpha/2}$ of the standard normal distribution (about 1.96 for $\alpha = 0.05$), not with the test statistic itself: margin of error $= z_{\alpha/2} \cdot SE$
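
The formulas above can be put together in a few lines. The following is a minimal standard-library sketch, not the notebook's own implementation: `two_proportion_ztest` is a name made up for this example, and the group sizes plugged in at the end are the `value_counts` figures printed earlier, which may be off by one because a duplicate user was dropped afterwards.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_1, n_1, conv_2, n_2, alpha=0.05):
    """Pooled two-proportion z-test; returns (z, two-sided p-value, margin of error)."""
    p_1, p_2 = conv_1 / n_1, conv_2 / n_2
    p_pool = (conv_1 + conv_2) / (n_1 + n_2)            # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se                                 # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))         # two-sided p-value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)         # critical value for the chosen alpha
    return z, p_value, z_crit * se

# Group 1 = treatment, group 2 = control. Conversions come from the printed output;
# group sizes are the value_counts figures and may be off by one after de-duplication.
z, p_value, margin = two_proportion_ztest(17264, 145311, 17489, 145274)
print(f"z = {z:.2f}, p-value = {p_value:.3f}, margin of error = {margin * 100:.2f}%")
```

A p-value above $\alpha = 0.05$ here agrees with the confidence-interval check in the previous cell: the interval contains zero, so the null hypothesis is not rejected.
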


