# A/B Testing
## Table of Contents:
1. [Background reading](#background-reading)
2. [Dataset](#dataset)
3. [Essence of Data](#essence-of-data)
    
   3.1 [What is A/B Testing?](#what-is-ab-testing)
       
   3.2 [Example Code](#example-code)
       
4. [AI in the fabric of society](#ai-in-the-fabric-of-society)
5. [More on A/B testing and ethics of experimentation in tech](#more-on-ab-testing-and-ethics-of-experimentation-in-tech)

## Background reading:

**Suggested Reading:**

*Intro to Statistical Learning*: Chapter 13, Multiple Testing (Sections 13.1 and 13.2, Page 553-561)

**Optional Reading:**

*Ethical Algorithm*: Chapter 3, Games People Play, The Commuting Game, (Page 101-105)

*Ethical Algorithm*: Chapter 4, Lost in the Garden, Past Performance is No Guarantee of Future Returns, (Page 137-141)

## Dataset:
The example from GitHub uses the e-commerce A/B testing dataset from Kaggle.

Link: https://www.kaggle.com/zhangluyuan/ab-testing

Attribute columns:
1. ID - unique user ID
2. Time - what time the advertisement was shown to the user
3. Con_Treat - whether the user is in the control or treatment group
4. Page - whether the user can see the old or new page
5. Converted - whether the user clicked on the ad (Yes = 1, No = 0)

In this case, the outcome being measured is column 5: whether the user clicked on the ad or not. The two variants being compared (A and B) are given by columns 3 and 4: the control group (old page) and the treatment group (new page).

**Article:** https://hbr.org/2017/06/a-refresher-on-ab-testing 


## Essence of Data:

### What is A/B Testing?

![](../../../fig/ABTest/fig1.png)


A/B testing is a way to compare two versions of a feature to see which performs better. It is typically used for websites.

Example: If the advertisement is bigger, will more people click on it? Are users more likely to purchase from a website with a different user interface?

Example Application: Google doing A/B Testing with different shades of blue for link colors
https://bharathbalasubramanian.medium.com/data-driven-decisions-googles-50-shades-of-blue-experiment-996f01819a97 

Another Example Application: Harvard Business School: “Is A/B Testing Effective? Evidence from 35,000 Startups”
https://hbswk.hbs.edu/item/is-ab-testing-effective-evidence-from-35000-startups 

A/B testing is the most basic type of randomized controlled experiment. Because it relies on a statistical test, the number of samples (n) needed to detect a statistically significant difference should be determined before the experiment is run. A null hypothesis and an alternative hypothesis must be stated up front, and a statistical test is then performed to determine whether the difference between the two groups is statistically significant.
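As a rough illustration of fixing the sample size in advance, the sketch below uses the standard normal-approximation formula for a two-sided test of two proportions. The function name, the 12% baseline, the 13% target, and the 80% power are all assumptions chosen for illustration; none of them come from the dataset or the tutorial.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_control + p_treatment) / 2           # average rate under the null hypothesis
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_control * (1 - p_control) + p_treatment * (1 - p_treatment))
    ) ** 2
    return ceil(numerator / (p_control - p_treatment) ** 2)


# Hypothetical planning values: 12% baseline rate, hoping to detect a lift to 13%
n_needed = sample_size_per_group(0.12, 0.13)
print(f"Users needed per group: {n_needed}")
```

Halving the difference to be detected roughly quadruples the required sample, which is why the target effect size should be agreed on before any data is collected.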

![](../../../fig/ABTest/fig2.png)


An A/B test is closely related to the multi-armed bandit problem: a bandit algorithm uses the observed difference in response between the two options to decide whether to exploit the option that currently looks better or to keep exploring the other.

With A/B testing of a graphic e-commerce website advertisement (like our Python tutorial), exploitation would mean treating the advertisement that is more likely to be clicked on as the “winner” and using it to drive clicks to the company's website. Exploration could mean collecting more data on the advertisement that seems “least effective”, for example by continuing user testing to see whether the less popular advertisement is more effective than the “winner” for certain user demographics.
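To make the explore/exploit trade-off concrete, here is a minimal simulation of an epsilon-greedy strategy, one simple bandit approach among many. The click rates (5% and 15%), the 10% exploration rate, and the function name are hypothetical and are not taken from the dataset.

```python
import random


def epsilon_greedy(true_rates, rounds=10_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over ads with the given click rates."""
    rng = random.Random(seed)
    shows = [0] * len(true_rates)    # times each ad was shown
    clicks = [0] * len(true_rates)   # clicks each ad received

    for _ in range(rounds):
        # Explore with probability epsilon, or while some ad has never been shown
        if rng.random() < epsilon or 0 in shows:
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: show the ad with the best observed click rate so far
            arm = max(range(len(true_rates)), key=lambda a: clicks[a] / shows[a])
        shows[arm] += 1
        if rng.random() < true_rates[arm]:   # simulate whether the user clicks
            clicks[arm] += 1
    return shows, clicks


# Hypothetical click rates for an old ad (5%) and a new ad (15%)
shows, clicks = epsilon_greedy([0.05, 0.15])
print(f"Times each ad was shown: {shows}")
```

Unlike a fixed A/B split, this strategy shifts traffic toward the better-performing ad while the experiment is still running, at the cost of collecting less data on the weaker one.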


### Example Code

In [106]:
#Import necessary libraries
import pandas as pd                     # data manipulation and analysis
import math                             # mathematical operations
import random                           # generating random numbers
import numpy as np                      # numerical computations
import seaborn as sns                   # data visualization
import matplotlib.pyplot as plt         # creating plots
from sklearn.model_selection import train_test_split       # splitting data into training and testing sets
from sklearn.linear_model import LogisticRegression       # logistic regression model
from sklearn import metrics                              # model evaluation metrics

In [108]:
#Load the dataset and columns of interest
clickdataset = 'ab_test.csv'
dataset = pd.read_csv(clickdataset)
dataset

| | id | time | con_treat | page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 11:48.6 | control | old_page | 0 |
| 1 | 804228 | 01:45.2 | control | old_page | 0 |
| 2 | 661590 | 55:06.2 | treatment | new_page | 0 |
| 3 | 853541 | 28:03.1 | treatment | new_page | 0 |
| 4 | 864975 | 52:26.2 | control | old_page | 1 |
| ... | ... | ... | ... | ... | ... |
| 294473 | 751197 | 28:38.6 | control | old_page | 0 |
| 294474 | 945152 | 51:57.1 | control | old_page | 0 |
| 294475 | 734608 | 45:03.4 | control | old_page | 0 |
| 294476 | 697314 | 20:29.0 | control | old_page | 0 |


In [110]:
df = dataset.drop('time', axis = 1)
df

| | id | con_treat | page | converted |
|---|---|---|---|---|
| 0 | 851104 | control | old_page | 0 |
| 1 | 804228 | control | old_page | 0 |
| 2 | 661590 | treatment | new_page | 0 |
| 3 | 853541 | treatment | new_page | 0 |
| 4 | 864975 | control | old_page | 1 |
| ... | ... | ... | ... | ... |
| 294473 | 751197 | control | old_page | 0 |
| 294474 | 945152 | control | old_page | 0 |
| 294475 | 734608 | control | old_page | 0 |
| 294476 | 697314 | control | old_page | 0 |


In [112]:
#Checking for null values
df.isna().sum()

id           0
con_treat    0
page         0
converted    0
dtype: int64

In [114]:
df.describe()

| | id | converted |
|---|---|---|
| count | 294478.0 | 294478.0 |
| mean | 787974.124733 | 0.119659 |
| std | 91210.823776 | 0.324563 |
| min | 630000.0 | 0.0 |
| 25% | 709032.25 | 0.0 |
| 50% | 787933.5 | 0.0 |
| 75% | 866911.75 | 0.0 |
| max | 945999.0 | 1.0 |


In [116]:
#Identify rows where the treatment doesn't match new page and the control doesn't match the old page
mismatch1 = len(df.query('page == "new_page" & con_treat != "treatment"'))
mismatch2 = len(df.query('page == "old_page" & con_treat != "control"'))
print(f'Total number of mismatched pages: {mismatch1 + mismatch2}')

Total number of mismatched pages: 3893


In [118]:
#Keep only the rows where the page and group match, in a new dataframe
dfnew = df.query('(page == "new_page" & con_treat == "treatment") | (page == "old_page" & con_treat == "control")')

In [120]:
#Remove any duplicate userids
dfnew = dfnew.drop_duplicates(subset = 'id', keep = "first")

In [122]:
#Check the proportion of unique users that clicked on the ad (both new and old versions of the page)
converted_all = dfnew['converted'].mean() * 100   # share of converted users, as a percentage
print('The proportion of users that click on either version of the advertisement is: {:.4f}%'.format(converted_all))

The proportion of users that click on either version of the advertisement is: 11.9597%


In [124]:
#Find the click rate for users shown the old version of the ad (control group)
converted_control = len(dfnew.query('converted == 1 & con_treat == "control"')['id'])/len(dfnew.query('con_treat == "control"')['id'])
#converted_control is a fraction between 0 and 1, so format it as a percentage
print(f'The proportion of users from the control group that click on the old advertisement is: {converted_control:.2%}')

The proportion of users from the control group that click on the old advertisement is: 12.04%


In [126]:
#Find the click rate for users shown the new version of the ad (treatment group)
converted_treatment = len(dfnew.query('converted == 1 & con_treat == "treatment"')['id'])/len(dfnew.query('con_treat == "treatment"')['id'])
#converted_treatment is a fraction between 0 and 1, so format it as a percentage
print(f'The proportion of users from the treatment group that click on the new advertisement is: {converted_treatment:.2%}')

The proportion of users from the treatment group that click on the new advertisement is: 11.88%


In [128]:
#Calculate the difference between groups (both rates are fractions, so convert to percentage points)
converted_difference = converted_control - converted_treatment
print(f'The difference between the proportion of the control group and treatment group that click on the ad is: {converted_difference * 100:.2f} percentage points')

The difference between the proportion of the control group and treatment group that click on the ad is: 0.16 percentage points


In [130]:
#Test for statistical significance using z test and 95% confidence interval
#Note: this cell uses df (all rows, before the mismatch and duplicate filtering), not dfnew
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

control_results = df[df['con_treat'] == 'control']['converted']
treatment_results = df[df['con_treat'] == 'treatment']['converted']
n_con = control_results.count()
n_treat = treatment_results.count()
successes = [control_results.sum(), treatment_results.sum()]
nobs = [n_con, n_treat]

z_stat, pval = proportions_ztest(successes, nobs = nobs)
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes, nobs = nobs, alpha = 0.05)

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')
print(f'95% confidence interval for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'95% confidence interval for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')

z statistic: 1.24
p-value: 0.216
95% confidence interval for control group: [0.119, 0.122]
95% confidence interval for treatment group: [0.117, 0.121]


In [132]:
#Check the p-value to make the decision
if pval > 0.05:
    print(f'The p-value is {pval:.4f}, which is greater than 0.05. Therefore, we fail to reject the null hypothesis and conclude that the difference between the groups is not statistically significant.')
else:
    print(f'The p-value is {pval:.4f}, which is less than or equal to 0.05. Therefore, we reject the null hypothesis and conclude that the difference between the groups is statistically significant.')

The p-value is 0.2161, which is greater than 0.05. Therefore, we fail to reject the null hypothesis and conclude that the difference between the groups is not statistically significant.
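For readers who want to see what a pooled two-proportion z-test computes, the following is a minimal standard-library sketch. The function name and the counts are hypothetical; the tutorial itself relies on `proportions_ztest` from statsmodels, and this sketch is not a substitute for it.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Return the z statistic and two-sided p-value using a pooled standard error."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)           # rate under the null hypothesis
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
    z = (p_a - p_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))                 # two-sided tail probability
    return z, p_value


# Hypothetical counts: 1,200 of 10,000 control users and 1,150 of 10,000 treatment users converted
z, p = two_proportion_ztest(1200, 10_000, 1150, 10_000)
print(f"z statistic: {z:.2f}, p-value: {p:.3f}")
```

The null hypothesis here is that both groups share one underlying conversion rate, which is why the standard error is built from the pooled rate rather than from each group's own rate.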


### Multiple Testing

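We will perform multiple tests to determine whether the treatment has a significant effect under different conditions (e.g., among rows showing only the new page or only the old page).
This allows a more granular analysis of whether the ad's effectiveness differs depending on page type. Because these subsets are drawn from the unfiltered `dataset`, the control rows in the new-page subset and the treatment rows in the old-page subset are the mismatched rows identified earlier, so each of those two comparisons rests on comparatively few observations for one of its groups.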

When performing multiple tests, the chance of false positives increases. These corrections help control for this:

- **Bonferroni:** Conservative approach that reduces Type I errors but may miss true positives.
- **FDR (Benjamini-Hochberg):** Controls the expected share of false positives among the results declared significant. It is less conservative than Bonferroni, which makes it better suited to settings with many tests.
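A short sketch of why the chance of a false positive grows with the number of tests. It assumes the tests are independent, which is a simplification (the three tests in this tutorial share data), and the function name is ours.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests at level alpha."""
    return 1 - (1 - alpha) ** num_tests


for m in (1, 3, 10, 20):
    print(f"{m:>2} tests: {family_wise_error_rate(m):.3f}")
```

Under this assumption, three tests at the 0.05 level already carry about a 14% chance of at least one false positive, which is the inflation that the corrections above are meant to counter.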

In [135]:
from statsmodels.stats.multitest import multipletests

#Define hypotheses for multiple testing
hypotheses = {
    'New Page Only': dataset.query('page == "new_page"'),
    'Old Page Only': dataset.query('page == "old_page"'),
    'All Data': dataset
}

#Initialize variables
p_values = []
z_stats = []

#Perform z-tests for each hypothesis
for name, data in hypotheses.items():
    control_results = data.query('con_treat == "control"')['converted']
    treatment_results = data.query('con_treat == "treatment"')['converted']
    
    if len(control_results) > 0 and len(treatment_results) > 0:
        # Calculate successes and total observations
        successes = [control_results.sum(), treatment_results.sum()]
        nobs = [len(control_results), len(treatment_results)]
        
        # Perform z-test
        z_stat, pval = proportions_ztest(successes, nobs = nobs)
        z_stats.append(z_stat)
        p_values.append(pval)

In [147]:
#Bonferroni correction (Family-Wise Error Rate)
bonferroni_results = multipletests(p_values, alpha = 0.05, method = 'bonferroni')
bonferroni_pvals = bonferroni_results[1]

#FDR (False Discovery Rate) 
fdr_results = multipletests(p_values, alpha = 0.05, method = 'fdr_bh')
fdr_pvals = fdr_results[1]

#Combine results into a DataFrame to visualize easily
results = pd.DataFrame({
    'Hypothesis': hypotheses.keys(),
    'Original_P-Value': p_values,
    'Bonferroni_P-Value': bonferroni_pvals,
    'FDR_P-Value': fdr_pvals, 
    'Z-Statistic': z_stats})

print("\nSummary of Multiple Testing Results:")
print(results)



Summary of Multiple Testing Results:

| | Hypothesis | Original_P-Value | Bonferroni_P-Value | FDR_P-Value | Z-Statistic |
|---|---|---|---|---|---|
| 0 | New Page Only | 0.729826 | 1.000000 | 0.729826 | 0.345357 |
| 1 | Old Page Only | 0.354842 | 1.000000 | 0.532263 | -0.925238 |
| 2 | All Data | 0.216116 | 0.648348 | 0.532263 | 1.236922 |
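The adjusted p-values in the summary above can be reproduced by hand, which may help show what each correction does. The sketch below is a plain-Python illustration with function names of our own choosing; it is not how `multipletests` is implemented internally.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping a running minimum so that
    # the adjusted values never decrease as the raw p-values increase
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Original p-values from the summary (New Page Only, Old Page Only, All Data)
reported_p_values = [0.729826, 0.354842, 0.216116]
print([round(p, 6) for p in bonferroni(reported_p_values)])
print([round(p, 6) for p in benjamini_hochberg(reported_p_values)])
```

None of the adjusted p-values falls below 0.05 under either correction, consistent with the uncorrected result that no comparison is statistically significant.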
