In [1]:
# %matplotlib inline
import numpy as np
import scipy
import pandas as pd
import matplotlib.pyplot as plt
#import statsmodels.formula.api as sm
import seaborn as sns
import sklearn as sl
from sklearn import preprocessing
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [2]:
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 350)

Randomization

A crucial assumption behind an A/B test is that the only difference between test and control is the feature we are testing. This implies that the test and control user distributions are comparable. If this is true, any difference in the metric we are testing can be attributed to the feature change, up to sampling noise.
Comparable test and control user distribution means that, for each relevant segment, the relative proportion of users in test and control is similar. That is, if US users are 10% of users in the test group, we expect to also have ~10% of US users in control. If we have 50% of repeat users in test, we should have a similar percentage in control, and so on.

From a purely statistical standpoint, the above should be true over a large enough number of users. And in A/B testing, we are looking for very small gains, so sample sizes are large and, therefore, test and control distributions should be very close to each other.
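The large-sample claim can be illustrated with a small simulation. This is only a sketch on synthetic data: the function name `share_gap`, the 10% segment share, and the sample sizes are assumptions chosen for illustration, not values from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def share_gap(n, p_segment=0.10):
    """Randomly split n users 50/50 and return the absolute difference
    in the share of one segment between the two groups."""
    in_test = rng.integers(0, 2, size=n).astype(bool)   # fair coin-flip assignment
    in_segment = rng.random(n) < p_segment              # e.g. "is a US user"
    return abs(in_segment[in_test].mean() - in_segment[~in_test].mean())

# With correct randomization the gap shrinks as the sample grows
for n in (1_000, 100_000, 1_000_000):
    print(n, round(share_gap(n), 4))
```

With a correct random assignment, the gap is pure sampling noise and becomes negligible at the sample sizes typical of A/B tests.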

In practice, it is pretty frequent that test and control distributions are different, invalidating the test results. The number one reason for that is bugs or bias in the randomization algorithm that assigns users to test and control, leading to over/under representation of certain segments. For example, we might have more US users in control, and if those users have a higher conversion rate, the difference we see in the metric reflects the different user mix and not only the feature change that we are testing.
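A small worked example makes the mechanism concrete. All numbers below are made up for illustration (the segment names, mixes, and conversion rates are assumptions, not values from the dataset). The feature has no effect in either segment, yet the blended conversion rates differ because the user mix differs.

```python
# Hypothetical per-segment conversion rates, identical in test and control,
# i.e. the feature under test has no effect at all.
conversion_rate = {"US": 0.10, "Other": 0.04}

# Hypothetical share of each segment in the two groups
# (randomization bug: US users are over-represented in control).
mix_control = {"US": 0.30, "Other": 0.70}
mix_test = {"US": 0.10, "Other": 0.90}

def blended_rate(mix, rates):
    """Overall conversion rate as the mix-weighted average of segment rates."""
    return sum(mix[segment] * rates[segment] for segment in mix)

rate_control = blended_rate(mix_control, conversion_rate)
rate_test = blended_rate(mix_test, conversion_rate)

print(f"control: {rate_control:.3f}, test: {rate_test:.3f}")
# control: 0.058, test: 0.046
```

Test looks worse than control even though nothing changed for any individual user segment.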

It is therefore extremely important to check that test and control distributions are similar before doing the statistical test. Let’s see how.

In [3]:
data = pd.read_csv('./dataset/randomization.csv')

In [4]:
data.head()

Unnamed: 0,user_id,source,device,browser_language,browser,sex,age,country,test,conversion
0,1,SEO,Web,EN,Chrome,M,38,Chile,0,0
1,2,SEO,Mobile,ES,Android_App,M,27,Colombia,0,0
2,3,SEO,Mobile,ES,Iphone_App,M,18,Guatemala,1,0
3,5,Ads,Web,ES,Chrome,M,22,Argentina,1,0
4,8,Ads,Mobile,ES,Android_App,M,19,Venezuela,1,0


Check A/B Test Randomization

Checking that randomization worked well simply means making sure that all variables have the same distribution in test and control. So, taking for instance the first variable, source, it would mean checking that proportion of users coming from ads, SEO, and direct is the same.

This can easily be done the following way:

In [5]:
data.groupby("source")["test"].apply(lambda x: len(x[x==0]))

source
Ads       74352
Direct    37238
SEO       73721
Name: test, dtype: int64

In [6]:
data.groupby("source")["test"].apply(lambda x: len(x[x==1]))

source
Ads       86448
Direct    43047
SEO       86279
Name: test, dtype: int64

In [4]:
data_grouped_source = data.groupby("source")["test"].agg(frequency_test_0 = lambda x: len(x[x==0]), frequency_test_1 = lambda x: len(x[x==1]))

In [5]:
data_grouped_source

source,frequency_test_0,frequency_test_1
Ads,74352,86448
Direct,37238,43047
SEO,73721,86279


In [6]:
data_grouped_source.sum()

frequency_test_0    185311
frequency_test_1    215774
dtype: int64

In [4]:
print(data_grouped_source/data_grouped_source.sum())

        frequency_test_0  frequency_test_1
source                                    
Ads                0.401             0.401
Direct             0.201             0.200
SEO                0.398             0.400


As we can see, the relative frequency of each source is essentially the same in the two groups. That is, we have basically the same proportion of users coming from Ads, Direct, and SEO in both test and control.

We could potentially keep checking all the variables like this. But it would be extremely time consuming (and boring), especially when you start considering numerical variables and categorical variables with many levels.
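For the categorical variables, the manual check can at least be looped. The sketch below is one possible way to do it, not the method used in this notebook: it runs a chi-square test of independence between each variable and the group assignment. The helper name `compare_segments` and the synthetic `demo` frame are assumptions for illustration; on the real data one would pass `data` and its categorical columns.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def compare_segments(df, columns, group_col="test"):
    """For each categorical column, test whether its distribution
    is independent of the test/control assignment."""
    rows = []
    for col in columns:
        counts = pd.crosstab(df[col], df[group_col])      # level x group counts
        chi2, p_value, dof, _ = chi2_contingency(counts)
        rows.append({"variable": col, "chi2": chi2, "p_value": p_value})
    return pd.DataFrame(rows)

# Synthetic stand-in data (not the real dataset)
rng = np.random.default_rng(0)
n = 20_000
demo = pd.DataFrame({
    "test": rng.integers(0, 2, size=n),
    "source": rng.choice(["Ads", "Direct", "SEO"], size=n, p=[0.4, 0.2, 0.4]),
})
# Deliberately skewed variable: "AR" is more common in test than in control
p_ar = np.where(demo["test"] == 1, 0.17, 0.05)
demo["country"] = np.where(rng.random(n) < p_ar, "AR", "Other")

print(compare_segments(demo, ["source", "country"]))
```

A very small p-value flags a variable whose distribution differs between the groups. Numerical variables would need a different test, and with many variables some small p-values will show up by chance, which is part of why the tree-based approach is convenient.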

So we turn this into a machine learning problem and let an algorithm do the boring work for us. The approach is:

Get rid of the conversion variable for now. We don’t care about it here. We are just checking whether the two user distributions are the same. This happens before we compare conversion rates between the groups.

Use the variable test as our label. Try to build a model that manages to separate the users whose test value is 0 vs those whose test value is 1. If randomization worked well, this will be impossible because the two groups are exactly the same. If all variable relative frequencies were the same as for source, no model would be able to separate test == 1 vs test == 0. If randomization did not work well, the model will manage to use a given variable to separate the two groups.

As a model, pick a decision tree. This will allow you to clearly see which variable (if any) is used for the split. That’s where randomization failed.
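The three steps above can be sketched on synthetic data before applying them to the real dataset. Everything here is an illustrative assumption: the helper names `make_users` and `fit_randomization_tree`, the column names, and the country skew (5% vs 17%) are made up. The tree settings mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def make_users(n, p_ar_control, p_ar_test, seed):
    """Synthetic users; the share of country 'AR' may differ by group."""
    rng = np.random.default_rng(seed)
    test = rng.integers(0, 2, size=n)
    p_ar = np.where(test == 1, p_ar_test, p_ar_control)
    country = np.where(rng.random(n) < p_ar, "AR",
                       rng.choice(["MX", "CO", "CL", "PE"], size=n))
    return pd.DataFrame({
        "test": test,
        "age": rng.integers(18, 70, size=n),
        "device": rng.choice(["Web", "Mobile"], size=n),
        "country": country,
    })

def fit_randomization_tree(df):
    """Fit a tree that tries to predict the test/control label."""
    X = pd.get_dummies(df.drop(columns="test"), dtype=int)
    tree = DecisionTreeClassifier(class_weight="balanced",
                                  min_impurity_decrease=0.001,
                                  random_state=0)
    tree.fit(X, df["test"])
    return tree, list(X.columns)

# Same country mix in both groups vs. 'AR' over-represented in test
balanced_tree, _ = fit_randomization_tree(make_users(20_000, 0.05, 0.05, seed=1))
skewed_tree, cols = fit_randomization_tree(make_users(20_000, 0.05, 0.17, seed=1))

print("nodes with clean randomization:", balanced_tree.tree_.node_count)
print("root split with skewed country mix:", cols[skewed_tree.tree_.feature[0]])
```

When the groups are comparable, no split clears the `min_impurity_decrease` threshold and the tree stays a single node. When one segment is skewed, the root split points at the offending variable.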

In [7]:
import graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from graphviz import Source

In [8]:
data.head()

Unnamed: 0,user_id,source,device,browser_language,browser,sex,age,country,test,conversion
0,1,SEO,Web,EN,Chrome,M,38,Chile,0,0
1,2,SEO,Mobile,ES,Android_App,M,27,Colombia,0,0
2,3,SEO,Mobile,ES,Iphone_App,M,18,Guatemala,1,0
3,5,Ads,Web,ES,Chrome,M,22,Argentina,1,0
4,8,Ads,Mobile,ES,Android_App,M,19,Venezuela,1,0


In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401085 entries, 0 to 401084
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   user_id           401085 non-null  int64 
 1   source            401085 non-null  object
 2   device            401085 non-null  object
 3   browser_language  401085 non-null  object
 4   browser           401085 non-null  object
 5   sex               401085 non-null  object
 6   age               401085 non-null  int64 
 7   country           401085 non-null  object
 8   test              401085 non-null  int64 
 9   conversion        401085 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 30.6+ MB


In [9]:
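#dtype=int keeps the dummy columns as 0/1 (recent pandas versions default to True/False)
data_dummy = pd.get_dummies(data, dtype=int)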

In [10]:
data_dummy.head()

Unnamed: 0,user_id,age,test,conversion,source_Ads,source_Direct,source_SEO,device_Mobile,device_Web,browser_language_EN,...,country_El Salvador,country_Guatemala,country_Honduras,country_Mexico,country_Nicaragua,country_Panama,country_Paraguay,country_Peru,country_Uruguay,country_Venezuela
0,1,38,0,0,0,0,1,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,27,0,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,18,1,0,0,0,1,1,0,0,...,0,1,0,0,0,0,0,0,0,0
3,5,22,1,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,8,19,1,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1


In [11]:
train_cols = data_dummy.drop(['test', 'conversion'], axis=1)

In [12]:
tree=DecisionTreeClassifier(
    #change class weights so that the two groups count equally, as if the data set were perfectly balanced. This makes the tree output easier to read
    class_weight="balanced",
    #only split if it's worthwhile. The default value of 0 means split whenever impurity decreases at all, which creates tons of noisy and irrelevant splits
    min_impurity_decrease = 0.001
    )

In [13]:
tree.fit(train_cols,data_dummy['test'])

DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.001)

In [14]:
#write the fitted tree to a .dot file, showing class proportions at each node
export_graphviz(tree, out_file="tree_test.dot", feature_names=list(train_cols.columns), proportion=True, rotate=True)
#render the .dot file to PDF and open it in the default viewer
s = Source.from_file("tree_test.dot")
s.view()

'tree_test.dot.pdf'

From the above PDF file, we can see that test and control are not the same! Users from Argentina and Uruguay are way more likely to be in test than control. When country_Argentina is 1, the tree shows that users in control are ~23% and in test ~77% instead of 50/50. For Uruguay, the proportions are even more extreme: 11% in control and 89% in test! Not good!
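If opening the PDF is inconvenient, the splits can also be read as plain text with `sklearn.tree.export_text`. The sketch below uses a small synthetic frame so that it runs on its own (the data and the 5% vs 17% skew are assumptions for illustration). In this notebook, one would pass the fitted `tree` and `list(train_cols.columns)` instead.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data: one country flag is over-represented in test
rng = np.random.default_rng(42)
n = 20_000
label = rng.integers(0, 2, size=n)
is_ar = (rng.random(n) < np.where(label == 1, 0.17, 0.05)).astype(int)
features = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "country_Argentina": is_ar,
})

demo_tree = DecisionTreeClassifier(class_weight="balanced",
                                   min_impurity_decrease=0.001,
                                   random_state=0).fit(features, label)

# One line per node; split conditions name the feature that separates the groups
tree_text = export_text(demo_tree, feature_names=list(features.columns))
print(tree_text)
```

The text dump shows the same split variables as the rendered graph, which is usually enough to spot where randomization failed.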

Let’s double check this manually in our dataset.

In [21]:
print(data_dummy.groupby("test")[["country_Argentina", "country_Uruguay"]].mean())

      country_Argentina  country_Uruguay
test                                    
0                 0.050            0.002
1                 0.173            0.017


Our tree was right! In test, 17% of users are from Argentina, but in control only 5% of users are from Argentina. Uruguay is even more extreme: test has 1.7% of users from Uruguay and control has just 0.2% of Uruguayan users.

And this is a big problem because it means we are no longer comparing apples to apples in our A/B test. The difference we might see in conversion rate may well be driven by the fact that the users in the two groups are different.

Let’s check it in practice:

In [15]:
from scipy import stats
  
#these are the test results using the original dataset
original_data = stats.ttest_ind(data_dummy.loc[data['test'] == 1]['conversion'], 
                                data_dummy.loc[data['test'] == 0]['conversion'], 
                                equal_var=False)

In [16]:
#this is after removing Argentina and Uruguay
not_AR_UR = (data_dummy['country_Argentina'] == 0) & (data_dummy['country_Uruguay'] == 0)
data_no_AR_UR = stats.ttest_ind(data_dummy.loc[(data['test'] == 1) & not_AR_UR, 'conversion'],
                                data_dummy.loc[(data['test'] == 0) & not_AR_UR, 'conversion'],
                                equal_var=False)

In [17]:
print(pd.DataFrame( {"data_type" : ["Full", "Removed_Argentina_Uruguay"], 
                         "p_value" : [original_data.pvalue, data_no_AR_UR.pvalue],
                         "t_statistic" : [original_data.statistic, data_no_AR_UR.statistic]
                         }))

                   data_type  p_value  t_statistic
0                       Full    0.000       -7.354
1  Removed_Argentina_Uruguay    0.720        0.358


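Huge difference! On the full, biased dataset, where some countries are over/under represented, the result is statistically significant with a negative t statistic (the p-value prints as 0.000 only because of the three-decimal display format). So test looks worse than control! After removing those two countries, the result is no longer significant (p-value 0.72).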

At this point, we have two options:

Acknowledge that there was a bug, go talk to the software engineer in charge of randomization, figure out what went wrong, fix it, and re-run the test. Note that finding one bug may be a sign that more things are messed up, not just the one we found. So when we find a bug, always try to get to the bottom of it.

If we find out that everything else was fine and, for some reason, the problem was limited to those two countries, we can potentially adjust the weights for those two segments so that the relative frequencies become the same, and then re-run the test.
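The reweighting option can be sketched as follows. This is a minimal illustration under stated assumptions, not a full analysis: the helper `reweight_to_control` is hypothetical, the data is synthetic (conversion depends on country only, so the true effect of the feature is zero), and the example only compares weighted means rather than re-running a weighted significance test.

```python
import numpy as np
import pandas as pd

def reweight_to_control(df, segment_col="country", group_col="test"):
    """Per-row weights that make the test group's segment mix match control's."""
    share = pd.crosstab(df[segment_col], df[group_col], normalize="columns")
    ratio = share[0] / share[1]            # control share / test share, per segment
    weights = pd.Series(1.0, index=df.index)
    in_test = df[group_col] == 1
    weights.loc[in_test] = df.loc[in_test, segment_col].map(ratio)
    return weights

# Synthetic data: Argentina is over-represented in test and converts less
rng = np.random.default_rng(7)
n = 200_000
test = rng.integers(0, 2, size=n)
country = np.where(rng.random(n) < np.where(test == 1, 0.17, 0.05),
                   "Argentina", "Other")
conv_p = np.where(country == "Argentina", 0.01, 0.10)
demo = pd.DataFrame({"test": test, "country": country,
                     "conversion": (rng.random(n) < conv_p).astype(int)})

w = reweight_to_control(demo)
is_test = demo["test"] == 1
control_rate = demo.loc[~is_test, "conversion"].mean()

raw_gap = demo.loc[is_test, "conversion"].mean() - control_rate
adj_gap = np.average(demo.loc[is_test, "conversion"], weights=w[is_test]) - control_rate

print(f"raw gap: {raw_gap:.4f}, reweighted gap: {adj_gap:.4f}")
```

After reweighting, the weighted country mix in test equals the mix in control, so the remaining gap is no longer driven by the over-represented segment.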