# A/B Testing

## AKA Applied Hypothesis Testing!

If you went through all the stats up to this point and thought "oh man when am I ever going to use this stuff" - I get it. But one of the most common ways that Hypothesis Testing techniques are used in the real world is through A/B Testing!

One of the most common places you see A/B Testing out in the world is in marketing - companies will run A/B tests on elements of their website, their emails, their calls to action, etc. While you see A/B testing in other places, Marketing is going to be my example lens for today's session.

### A/B Testing in Marketing

Hubspot is a marketing software company, and I'm going to use some of their resources to set up why all this matters. You can access the A/B Testing Kit they put out for marketing optimization at this link: https://drive.google.com/drive/folders/1Wk3J2nA5gguN1Y_41cACxQ9mcJls9TmI

Hubspot's definition of split testing, aka A/B testing:

> Split testing, commonly referred to as A/B testing, is a method of testing through which marketing variables (such as copy, images, layout, etc) are compared to each other to identify the one that brings a better conversion rate. In this context, the element that is being testing is called the “control” and the element that is argued to give a better result is called the “treatment.”

#### Hubspot's 10 Guidelines for Effective A/B Testing: 

1. Only conduct one test (on one asset) at a time
2. Test one variable at a time
3. Test minor changes, too
4. You can A/B test the entire element
5. Measure as far down funnel as possible
6. Set up control & treatment
7. Decide what you want to test
8. Split your sample group randomly 
9. Test at the same time
10. Decide on necessary significance before testing
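
Guideline 8 (splitting the sample randomly) is the easiest one to sketch in code. This is a minimal illustration, assuming a plain list of user ids and a fixed seed for reproducibility. The helper name `assign_groups` and the ids are made up for this example; they are not part of Hubspot's kit or of the dataset used below.

```python
import random

def assign_groups(user_ids, seed=42):
    """Randomly split user ids into control and treatment halves (hypothetical helper)."""
    rng = random.Random(seed)      # local RNG so the split is reproducible
    shuffled = list(user_ids)      # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "treatment": shuffled[half:]}

groups = assign_groups(range(1000, 1010))
print(len(groups["control"]), len(groups["treatment"]))
```

Shuffling once and cutting in half keeps the two groups the same size; assigning each user with an independent coin flip would also be random but gives slightly uneven groups.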

### What will the data look like?

Data source: https://www.kaggle.com/zhangluyuan/ab-testing

Unfortunately, this data has no real metadata associated with it, but the author did say it comes from an e-commerce website.

Full credit to Robbie Geoghegan, now a Data Scientist at Facebook, for giving me the idea and sharing work they did on this dataset: https://medium.com/@robbiegeoghegan/implementing-a-b-tests-in-python-514e9eb5b3a1 

Another blog I referenced: https://medium.com/@RenatoFillinich/ab-testing-with-python-e5964dd66143

Before we go any further, and ideally before a test like this is even run, we need to decide on our significance level. Beyond that, let's assume that the group who ran this test did it properly (ran the variants in parallel, split users randomly, etc.).

Significance Level: $\alpha = .05$
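
It also helps to write the hypotheses down up front. The notebook does not state them explicitly, so the following is an assumption: a two-sided test (which is what `proportions_ztest` runs by default), with $p_{\text{control}}$ and $p_{\text{treatment}}$ (symbols introduced here) standing for the true conversion rates of the old and new page.

$$H_0: p_{\text{control}} = p_{\text{treatment}} \qquad H_a: p_{\text{control}} \neq p_{\text{treatment}}$$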

In [1]:
# Imports
import pandas as pd
import numpy as np

from scipy import stats

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest

In [3]:
# Grab our data - want the column 'timestamp' to be a datetime object
df = pd.read_csv('data/ab_data.csv', parse_dates=['timestamp'])

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   user_id       294478 non-null  int64         
 1   timestamp     294478 non-null  datetime64[ns]
 2   group         294478 non-null  object        
 3   landing_page  294478 non-null  object        
 4   converted     294478 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 11.2+ MB


In [5]:
df.head()

|   | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |


In [7]:
# Check our timeframe
print(df['timestamp'].min())
print(df['timestamp'].max())

2017-01-02 13:42:05.378582
2017-01-24 13:41:54.460509


#### There's an issue...

In [11]:
#duplicate user id entries
df['user_id'].duplicated().sum()

3894

In [10]:
# Let's see...
# control group people got new page -- incorrectly
df.loc[(df['group'] == 'control') & (df['landing_page'] == 'new_page')]

|   | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 22 | 767017 | 2017-01-12 22:58:14.991443 | control | new_page | 0 |
| 240 | 733976 | 2017-01-11 15:11:16.407599 | control | new_page | 0 |
| 490 | 808613 | 2017-01-10 21:44:01.292755 | control | new_page | 0 |
| 846 | 637639 | 2017-01-11 23:09:52.682329 | control | new_page | 1 |
| 850 | 793580 | 2017-01-08 03:25:33.723712 | control | new_page | 1 |
| ... | ... | ... | ... | ... | ... |
| 293894 | 741581 | 2017-01-09 20:49:03.391764 | control | new_page | 0 |
| 293996 | 942612 | 2017-01-08 13:52:28.182648 | control | new_page | 0 |
| 294200 | 928506 | 2017-01-13 21:32:10.491309 | control | new_page | 0 |
| 294253 | 886135 | 2017-01-06 12:49:20.509403 | control | new_page | 0 |


In [12]:
len(df.loc[(df['group'] == 'control') & (df['landing_page'] == 'new_page')])

1928

In [13]:
# treatment/test group people got old page -- incorrectly
len(df.loc[(df['group'] == 'treatment') & (df['landing_page'] == 'old_page')])

1965

In [14]:
control_but_new = df.loc[(df['group'] == 'control') & (df['landing_page'] == 'new_page')]
treatment_but_old = df.loc[(df['group'] == 'treatment') & (df['landing_page'] == 'old_page')]

In [15]:
print(len(control_but_new))
print(len(treatment_but_old))

1928
1965
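
A cross-tabulation shows all four group/page combinations at once instead of filtering them one at a time. Here is a small sketch using `pd.crosstab` on a made-up six-row frame; the toy data is an assumption for illustration, and on the real data you would pass `df['group']` and `df['landing_page']`.

```python
import pandas as pd

toy = pd.DataFrame({
    "group": ["control", "control", "control", "treatment", "treatment", "treatment"],
    "landing_page": ["old_page", "old_page", "new_page", "new_page", "new_page", "old_page"],
})

# Rows are groups, columns are landing pages.
# The control/new_page and treatment/old_page cells are the mismatches.
counts = pd.crosstab(toy["group"], toy["landing_page"])
print(counts)
```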


In [16]:
[*control_but_new.index]

[22,
 240,
 490,
 846,
 850,
 ...]

In [21]:
# Collect the index labels of both mismatched groups so we can drop them
to_drop = [*control_but_new.index, *treatment_but_old.index]

# Fraction of rows we keep after dropping the mismatches
len(df.drop(to_drop)) / len(df)

0.9867799971474949

In [24]:
df = df.drop(to_drop)

In [25]:
len(df)

290585
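
The same cleaning step can also be written as a single boolean mask rather than collecting index labels. This is a sketch on made-up rows, assuming (as above) that the control group should only ever see `old_page` and the treatment group only `new_page`, and that `group` takes just those two values.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "group": ["control", "control", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page"],
})

# A row is consistent when "is control" and "saw the old page" agree
is_control = toy["group"] == "control"
saw_old = toy["landing_page"] == "old_page"
clean = toy.loc[is_control == saw_old]

print(clean["user_id"].tolist())
```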

#### One more thing to check...

In [26]:
# Check it out...
df['user_id'].duplicated().sum()

1

In [27]:
df = df.drop_duplicates(subset=['user_id'])

In [28]:
df['user_id'].duplicated().sum()

0
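
Before dropping a duplicate it can be worth looking at both copies, since `drop_duplicates` keeps the first occurrence by default. Here is a sketch with made-up rows; `keep=False` marks every copy of a repeated `user_id` so they can be inspected side by side.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [10, 11, 11, 12],
    "converted": [0, 0, 1, 1],
})

# keep=False flags all rows that share a user_id, not just the later ones
dupes = toy[toy["user_id"].duplicated(keep=False)]
print(dupes)

# Default keep='first' retains the earliest row for each user_id
deduped = toy.drop_duplicates(subset=["user_id"])
print(len(deduped))
```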

#### Now, let's explore:

In [29]:
# Split out our two groups
control_group = df.loc[df['group'] == 'control']
treat_group = df.loc[df['group'] == 'treatment']

In [30]:
# Check the number of samples, timeframe and conv % for each group
for sub_df in [control_group, treat_group]:
    name = sub_df['group'].iloc[0].title()
    print(f"Number of Samples in our {name} Group: {len(sub_df):,}")
    print(f"Timeframe: {sub_df['timestamp'].min()} - {sub_df['timestamp'].max()}")
    print(f"Number of Conversions in our {name} Group: {sub_df['converted'].sum():,}")
    print(f"Conversion % in our {name} Group: {sub_df['converted'].mean() * 100:.3f}%")
    print("*"*20)

Number of Samples in our Control Group: 145,274
Timeframe: 2017-01-02 13:42:15.234051 - 2017-01-24 13:41:54.460509
Number of Conversions in our Control Group: 17,489
Conversion % in our Control Group: 12.039%
********************
Number of Samples in our Treatment Group: 145,310
Timeframe: 2017-01-02 13:42:05.378582 - 2017-01-24 13:41:44.097174
Number of Conversions in our Treatment Group: 17,264
Conversion % in our Treatment Group: 11.881%
********************
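
The same per-group summary can be produced in one call with `groupby`, which avoids splitting the frame by hand. This is a sketch on made-up data; on the real frame, `df` would take the place of `toy`.

```python
import pandas as pd

toy = pd.DataFrame({
    "group": ["control"] * 4 + ["treatment"] * 4,
    "converted": [1, 0, 0, 0, 1, 1, 0, 0],
})

# count = samples, sum = conversions, mean = conversion rate
summary = toy.groupby("group")["converted"].agg(["count", "sum", "mean"])
print(summary)
```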


In [None]:
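# The conversion rate is slightly lower in the treatment group (11.881% vs. 12.039%), so on the raw numbers we would stick with the control page. The new page did not improve conversions, but we still need a hypothesis test to tell whether the gap is statistically significant.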

Our friend at Facebook, whose [blog](https://medium.com/@robbiegeoghegan/implementing-a-b-tests-in-python-514e9eb5b3a1) and [code](https://github.com/RobbieGeoghegan/AB_Testing/blob/master/AB_Testing.ipynb) inspired this notebook, uses two things you can determine in advance to calculate effect size:

> Baseline rate — an estimate of the metric being analyzed before making any changes
>
> Practical significance level — the minimum change to the baseline rate that is useful to the business, for example an increase in the conversion rate of 0.001% may not be worth the effort required to make the change whereas a 2% change will be

In other words, you can decide on the minimum change you want to detect between your two groups and use that to calculate effect size up front (as opposed to calculating effect size after the study has been conducted, which isn't ideal).

To do this with statsmodels, since we're doing a test on a proportion, we use: https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportion_effectsize.html
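
For reference, the statsmodels documentation gives the formula behind `proportion_effectsize` as $2(\arcsin\sqrt{p_1} - \arcsin\sqrt{p_2})$, often called Cohen's h. The sketch below computes it by hand with only the standard library. The helper name `cohens_h` is made up here, and the inputs are the control conversion rate reported above plus an assumed lift of 1 percentage point.

```python
import math

def cohens_h(prop1, prop2):
    """Effect size for two proportions via the arcsine transform (hypothetical helper)."""
    return 2 * (math.asin(math.sqrt(prop1)) - math.asin(math.sqrt(prop2)))

baseline = 17489 / 145274          # control conversion rate from the summary above
h = cohens_h(baseline, baseline + 0.01)
print(round(h, 4))
```

The result is negative only because the smaller proportion is passed first; for a two-sided power calculation it is the magnitude that matters.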

In [32]:
control_group.head()

|   | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |
| 5 | 936923 | 2017-01-10 15:20:49.083499 | control | old_page | 0 |
| 7 | 719014 | 2017-01-17 01:48:29.539573 | control | old_page | 0 |


In [36]:
len(control_group.loc[control_group['converted'] == 1])

17489

In [38]:
control_group['converted'].sum()

17489

In [41]:
# let's grab some useful variables, going ahead and doing for both groups
n_control = len(control_group)
num_conv_control = control_group['converted'].sum()
prop_conv_control = num_conv_control/n_control #converted proportion

In [42]:
n_treat = len(treat_group)
num_conv_treat = treat_group['converted'].sum()
prop_conv_treat = num_conv_treat/n_treat #converted proportion

In [44]:
# baseline is the rate we expect before making any changes
# here, we'll capture that with the control group's conversion proportion
baseline_rate = prop_conv_control
practical_significance = 0.01 # user defined - want at least a 1 percentage point difference here

# proportion_effectsize is a statsmodels function - effect size for a test comparing two proportions
effect_size = proportion_effectsize(baseline_rate, baseline_rate + practical_significance)

In [45]:
# determine our minimum sample size per group
alpha = 0.05 # user defined significance level - accept a 5% chance of a false positive
power = 0.8 # user defined (1 - beta)

min_sample_size = NormalIndPower().solve_power(effect_size = effect_size,
                                               power = power,
                                               alpha = alpha)

print(f"Required minimum sample size: {min_sample_size:,.0f} per group")

Required minimum sample size: 17,209 per group
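
To see roughly where that number comes from, the usual normal-approximation formula for two equal-sized groups is $n \approx 2\left(\frac{z_{1-\alpha/2} + z_{\text{power}}}{h}\right)^2$ per group, where $h$ is the effect size. The sketch below is a hand-rolled approximation, not statsmodels' solver, so expect it to land close to the figure above rather than match it digit for digit.

```python
import math
from statistics import NormalDist

alpha, power = 0.05, 0.8
baseline = 17489 / 145274                      # control conversion rate from above
h = 2 * (math.asin(math.sqrt(baseline)) - math.asin(math.sqrt(baseline + 0.01)))

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power

n_per_group = 2 * ((z_alpha + z_power) / h) ** 2
print(f"{n_per_group:,.0f}")
```

Because $h$ is squared, halving the minimum detectable difference roughly quadruples the required sample size.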


In [46]:
print(n_control)
print(n_treat)

145274
145310


In [47]:
# Now let's test!
# Using a proportion test (not dealing with means but proportions)
results = proportions_ztest([num_conv_control, num_conv_treat], [n_control, n_treat])

In [48]:
results # Returns z statistic and p-value

(1.3109241984234394, 0.18988337448195103)

In [51]:
print(f"Test Statistic: {results[0]:.3f}, P-Value: {results[1]:.3f}")

Test Statistic: 1.311, P-Value: 0.190
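
A two-proportion z-test with a pooled variance estimate (the default behaviour of `proportions_ztest`) is short enough to compute by hand. This sketch uses only the standard library and the counts reported earlier in the notebook; it should agree with the statsmodels output above up to rounding.

```python
import math
from statistics import NormalDist

conv_c, n_c = 17489, 145274    # control conversions and sample size
conv_t, n_t = 17264, 145310    # treatment conversions and sample size

p_c, p_t = conv_c / n_c, conv_t / n_t
p_pool = (conv_c + conv_t) / (n_c + n_t)       # pooled rate under the null

se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = (p_c - p_t) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided

print(f"z = {z:.3f}, p = {p_value:.3f}")
```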


So?

- A p-value of 0.19 is not < 0.05, so we fail to reject the null hypothesis: the difference in conversion rates between the old and new page is not statistically significant.
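
A confidence interval for the difference tells the same story in the units the business cares about. This is an addition rather than something the notebook computes: a Wald interval with an unpooled standard error, using the counts reported above.

```python
import math
from statistics import NormalDist

conv_c, n_c = 17489, 145274    # control conversions and sample size
conv_t, n_t = 17264, 145310    # treatment conversions and sample size
p_c, p_t = conv_c / n_c, conv_t / n_t

diff = p_c - p_t
se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
z_crit = NormalDist().inv_cdf(0.975)           # 95% two-sided critical value

low, high = diff - z_crit * se, diff + z_crit * se
print(f"95% CI for (control - treatment): [{low:.4f}, {high:.4f}]")
```

An interval that straddles zero is consistent with failing to reject the null at the 5% level.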
