<a href="https://colab.research.google.com/github/ZhangxjMia/AB_Testing/blob/main/website_transaltion_abtest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Website Translation - Project Description

Company XYZ is a worldwide e-commerce site with localized versions of the site. A data scientist at XYZ noticed that **Spain-based users have a much higher conversion rate than any other Spanish-speaking country**.

The Spain and LatAm country manager suggested that one reason could be translation. All Spanish-speaking countries had the same translation of the site, which was written by a Spaniard. Therefore, they agreed to try a test where each country would have its own translation written by a local. That is, Argentinian users would see a translation written by an Argentinian, Mexican users one written by a Mexican, and so on. Nothing would change for users from Spain.

After running the test, however, they are surprised: the test is negative. That is, the single, non-localized translation appears to be doing better!

## Problem to be solved
1. *Confirm* that the test is actually negative, i.e., that the old version of the site, with a single translation across Spain and LatAm, performs better.
2. *Explain* why that might be happening. Are the localized translations really worse?
3. If you identified what was wrong, *design an algorithm* that would return FALSE if the same problem is happening in the future and TRUE if everything is good and results can be trusted.

## Dataset: test_table & user_table
> "test_table" - general information about the test results.

**Columns:**

* **user_id** : the id of the user. Unique by user. Can be joined to user id in the other table. For each user, we just check whether conversion happens the first time they land on the site since the test started.
* **date** : when they came to the site for the first time since the test started
* **source** : marketing channel: *Ads, SEO, Direct*. *Direct* means everything except ads and SEO, such as typing the site URL directly into the browser, downloading the app without coming from SEO or Ads, a referral from a friend, etc.
* **device** : device used by the user. It can be mobile or web
* **browser_language** : in browser or app settings, the language chosen by the user. It can be *EN, ES, Other* (Other means any language except for English and Spanish)
* **ads_channel** : if marketing channel is ads, this is the site where the ad was displayed. It can be: *Google, Facebook, Bing, Yahoo, Other*. If the user didn't come via an ad, this field is NA
* **browser** : user browser. It can be: *IE, Chrome, Android_App, FireFox, Iphone_App, Safari, Opera*
* **conversion** : whether the user converted (1) or not (0). This is our label. A test is considered successful if it increases the proportion of users who convert.
* **test** : users are randomly split into test (1) and control (0). Test users see the new translation and control the old one. For Spain-based users, this is obviously always 0 since there is no change there.

> "user_table" - some information about the user

**Columns:**

* **user_id** : the id of the user. It can be joined to user id in the other table
* **sex** : user sex: Male or Female
* **age** : user age (self-reported)
* **country** : user country based on ip address

## Project Key Points
A key assumption of the A/B test is that the only difference between the treatment group and the control group is the feature we are testing, which means that the treatment group and the control group are comparable in user distribution. If this is true, then we can accurately estimate the impact of feature changes on any metrics of our experiment.

Comparability means that, for each relevant segment, the share of users is similar in the treatment group and the control group. For example, if US users make up 10% of the treatment group, we expect US users to make up about 10% of the control group as well.

From a statistical point of view, random assignment makes this hold approximately once there are enough users. In an A/B test we are usually looking for a very small gain, so the sample needs to be large enough both to detect that gain and to keep the distributions of the treatment and control groups close to each other.

Therefore, before performing statistical tests, it is extremely important to check whether the distribution of the treatment group and the control group are similar.
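
To see why an unequal user mix matters, here is a minimal sketch with made-up numbers (the segment names, rates, and shares are assumptions for illustration, not values from this dataset). Each segment converts at the same rate in both groups, so the true effect is zero, yet the overall rates differ because the groups contain different shares of each segment.

```python
# Hypothetical numbers, chosen only to illustrate the effect of a skewed user mix.
# Each segment converts at the same rate in both groups, so the true effect is zero.
segment_rate = {"high_converting": 0.08, "low_converting": 0.02}

# Share of each group's users that falls in each segment.
control_mix = {"high_converting": 0.50, "low_converting": 0.50}
treatment_mix = {"high_converting": 0.30, "low_converting": 0.70}


def overall_rate(mix, rates):
    """Overall conversion rate as the mix-weighted average of segment rates."""
    return sum(mix[segment] * rates[segment] for segment in mix)


control_rate = overall_rate(control_mix, segment_rate)
treatment_rate = overall_rate(treatment_mix, segment_rate)
print(f"control:   {control_rate:.3f}")
print(f"treatment: {treatment_rate:.3f}")
```

The treatment group looks worse here purely because it holds more users from the low-converting segment.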

# Data Pre-processing

In [None]:
import pandas as pd
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 350)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# Load data
user = pd.read_csv("user_table.csv")
test = pd.read_csv('test_table.csv')

In [None]:
print("User table's shape: ", user.shape)
print("Test table's shape: ", test.shape)

User table's shape:  (452867, 4)
Test table's shape:  (453321, 9)


In [None]:
user.head()
test.head()

Unnamed: 0,user_id,sex,age,country
0,765821,M,20,Mexico
1,343561,F,27,Nicaragua
2,118744,M,23,Colombia
3,987753,F,27,Venezuela
4,554597,F,20,Spain


Unnamed: 0,user_id,date,source,device,browser_language,ads_channel,browser,conversion,test
0,315281,2015-12-03,Direct,Web,ES,,IE,1,0
1,497851,2015-12-04,Ads,Web,ES,Google,IE,0,1
2,848402,2015-12-04,Ads,Web,ES,Facebook,Chrome,0,0
3,290051,2015-12-03,Ads,Mobile,Other,Facebook,Android_App,0,1
4,548435,2015-11-30,Ads,Web,ES,Google,FireFox,0,1


In [None]:
# Check whether either table has duplicate user_id values
user['user_id'].nunique() == len(user['user_id'])
test['user_id'].nunique() == len(test['user_id'])

True

True

In [None]:
len(user['user_id']) - len(test['user_id'])

-454
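
The size difference alone does not show which ids are unmatched. One possible way to list them is a left join with pandas' `indicator=True` option, sketched below on toy stand-ins for the two tables (`test_toy` and `user_toy` are hypothetical, not the project data).

```python
import pandas as pd

# Toy stand-ins for test_table and user_table (hypothetical values).
test_toy = pd.DataFrame({"user_id": [1, 2, 3, 4], "conversion": [0, 1, 0, 0]})
user_toy = pd.DataFrame({"user_id": [1, 2, 4], "country": ["Mexico", "Spain", "Peru"]})

# A left join with indicator=True labels each row as 'both' or 'left_only'.
joined = test_toy.merge(user_toy, on="user_id", how="left", indicator=True)

# ids present in the test table but absent from the user table
missing_ids = joined.loc[joined["_merge"] == "left_only", "user_id"].tolist()
print(missing_ids)
```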

In [None]:
# Some user_ids are missing from the user table, but relatively few, so we can use an inner join to drop those rows.
mydata = test.merge(user, on = ['user_id'])

mydata.head()
mydata.describe().T
mydata.info()

Unnamed: 0,user_id,date,source,device,browser_language,ads_channel,browser,conversion,test,sex,age,country
0,315281,2015-12-03,Direct,Web,ES,,IE,1,0,M,32,Spain
1,497851,2015-12-04,Ads,Web,ES,Google,IE,0,1,M,21,Mexico
2,848402,2015-12-04,Ads,Web,ES,Facebook,Chrome,0,0,M,34,Spain
3,290051,2015-12-03,Ads,Mobile,Other,Facebook,Android_App,0,1,F,22,Mexico
4,548435,2015-11-30,Ads,Web,ES,Google,FireFox,0,1,M,19,Mexico


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
user_id,452867.0,499944.805166,288676.264784,1.0,249819.0,500019.0,749543.0,1000000.0
conversion,452867.0,0.04956,0.217034,0.0,0.0,0.0,0.0,1.0
test,452867.0,0.476462,0.499446,0.0,0.0,0.0,1.0,1.0
age,452867.0,27.13074,6.776678,18.0,22.0,26.0,31.0,70.0


<class 'pandas.core.frame.DataFrame'>
Int64Index: 452867 entries, 0 to 452866
Data columns (total 12 columns):
user_id             452867 non-null int64
date                452867 non-null object
source              452867 non-null object
device              452867 non-null object
browser_language    452867 non-null object
ads_channel         181693 non-null object
browser             452867 non-null object
conversion          452867 non-null int64
test                452867 non-null int64
sex                 452867 non-null object
age                 452867 non-null int64
country             452867 non-null object
dtypes: int64(4), object(8)
memory usage: 44.9+ MB


In [None]:
# Convert date from object to datetime
mydata['date'] = pd.to_datetime(mydata['date'])

In [None]:
# Confirm that Spain's original conversion rate is higher than that of the Latin American countries

country_conversion = mydata.query('test == 0').groupby(['country'])['conversion'].mean()
country_conversion.sort_values(ascending = False)

In [None]:
# Nothing changes for Spain, so we can remove it
mydata = mydata.query('country != "Spain"')

In [None]:
mydata.groupby('test')['conversion'].mean()
# The control group's conversion rate is higher than the treatment group

test
0    0.048292
1    0.043411
Name: conversion, dtype: float64

### Compare two population means: two-sample t-test
When it applies: two unrelated groups (sample sizes may be equal or unequal), testing the difference between their means.

#### 1. First, check whether the two groups have equal variances, using Levene's test for equality of variances.

In [None]:
from scipy import stats
stats.levene(mydata[mydata['test'] == 1]['conversion'],
             mydata[mydata['test'] == 0]['conversion'])

LeveneResult(statistic=54.497646998915, pvalue=1.5593292774404536e-13)

The p-value is below 0.05, so we reject equal variances and set the equal_var parameter to "False" (Welch's t-test).
Notice:
1. If the two groups actually have equal variances but equal_var = False is set, the test loses a little power, so the p-value tends to be slightly larger than necessary.
2. If the two groups have unequal variances but equal_var is left at its default (True), the p-value is unreliable; when the smaller group has the larger variance, it comes out too small.
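
A small simulation, on synthetic data rather than the project data, can give a feel for why the equal_var setting matters when group sizes and variances both differ. The group sizes, standard deviations, and number of repetitions below are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims = 2000

# Both groups share the same mean, so every rejection is a false positive.
# The small group has the larger spread (all numbers are illustrative).
small = rng.normal(loc=0.0, scale=5.0, size=(n_sims, 20))
large = rng.normal(loc=0.0, scale=1.0, size=(n_sims, 200))

# One t-test per simulated experiment (row), with and without pooled variance.
pooled = stats.ttest_ind(small, large, axis=1, equal_var=True)
welch = stats.ttest_ind(small, large, axis=1, equal_var=False)

pooled_rate = float(np.mean(pooled.pvalue < 0.05))
welch_rate = float(np.mean(welch.pvalue < 0.05))
print(f"false-positive rate, equal_var=True:  {pooled_rate:.3f}")
print(f"false-positive rate, equal_var=False: {welch_rate:.3f}")
```

In this setup the pooled test should reject far more often than the nominal 5%, while Welch's test should stay close to it.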

In [None]:
test_result = stats.ttest_ind(mydata[mydata['test'] == 1]['conversion'],
                              mydata[mydata['test'] == 0]['conversion'],
                              equal_var = False)
test_result

Ttest_indResult(statistic=-7.353895203080277, pvalue=1.9289178577799033e-13)

The p-value is below 0.05, so we reject the null hypothesis and conclude that the conversion rates of the treatment and control groups differ significantly, with treatment lower.
Possible reasons for such an unexpected result:
1. The sample size is too small.
2. Something went wrong during the experiment, so users were not randomly assigned to treatment and control.
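
Because conversion is a 0/1 outcome, a two-proportion z-test is a natural cross-check on the t-test. The sketch below uses only the standard library; `two_proportion_z_test` is a hypothetical helper, the rates are the group means printed earlier, and the group sizes are assumed to be the column totals of the per-source counts that appear in the source table of this notebook (Spain excluded).

```python
import math


def two_proportion_z_test(rate_a, n_a, rate_b, n_b):
    """Two-sided z-test for the difference rate_b - rate_a of two proportions."""
    pooled = (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Group sizes: sums of the Ads, Direct, and SEO counts for each group.
n_control = 74352 + 37238 + 73721
n_treatment = 86448 + 43047 + 86279

z, p_value = two_proportion_z_test(0.048292, n_control, 0.043411, n_treatment)
print(f"z = {z:.2f}, p = {p_value:.2e}")
```

With samples this large, the z statistic should land close to the t statistic reported above.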

In [None]:
# Plot to see whether the unexpected result is consistent over time or appears suddenly.
import matplotlib.pyplot as plt

# Daily conversion rate per group: rows are dates, columns are test = 0 / 1
daily_rate = mydata.groupby(['date', 'test'])['conversion'].mean().unstack('test')

# Ratio of treatment to control conversion rate for each day
data_daily = daily_rate[1] / daily_rate[0]
data_daily.plot()
plt.show()

<Figure size 640x480 with 1 Axes>

The treatment group is consistently worse than the control group, and the gap varies relatively little from day to day. This suggests that the sample size is sufficient and that something went wrong in the experiment itself.

#### 2. Check if treatment and control have similar distributions.
Checking that the A/B test was randomized effectively means confirming that every invariant has the same distribution in treatment and control. Take the first invariant, source, as an example: we check that the proportions of users from *Ads*, *SEO*, and *Direct* are the same in both groups.

In [None]:
# group by source and estimate relative frequencies
data_source = mydata.groupby('source')['test'].agg(
[lambda x: len(x[x == 0]),
 lambda x: len(x[x == 1])])

data_source

data_source_rate = data_source/data_source.sum()
data_source_rate.rename(index = str, columns = {'<lambda_0>': 'freq_test_0', '<lambda_1>': 'freq_test_1'})

source,<lambda_0>,<lambda_1>
Ads,74352,86448
Direct,37238,43047
SEO,73721,86279


source,freq_test_0,freq_test_1
Ads,0.401228,0.400641
Direct,0.200949,0.1995
SEO,0.397823,0.399858


As we can see, the relative frequencies of the sources are nearly identical in the two groups, which means the proportion of users from each source is the same. We could continue checking the other invariants one by one like this, but it is time-consuming.
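
The visual comparison can be backed by a chi-square test of independence between source and group. This is a sketch using `scipy.stats.chi2_contingency` on the counts from the table above; it is an addition to the original analysis, not part of it.

```python
from scipy.stats import chi2_contingency

# Rows: Ads, Direct, SEO. Columns: control (test == 0), treatment (test == 1).
# Counts are copied from the table above.
source_counts = [
    [74352, 86448],
    [37238, 43047],
    [73721, 86279],
]

chi2, p_value, dof, expected = chi2_contingency(source_counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
```

A p-value above the chosen threshold would be consistent with source being balanced across the two groups.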

#### 3. Easy method: set conversion aside and only check whether the user distribution in the two groups is the same.

* Use the 'test' variable as the label and try to build a model that separates test == 0 from test == 1. If randomization in the A/B test worked, no model can separate them, since the two groups are similar. If randomization did not work, the model will separate the two groups using the given variables.
* Choose a decision tree as the model. It clearly shows which variables are used to split, which points to where randomization failed in the A/B test.
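
The idea in the bullets above can be tried on synthetic data first. In this sketch (all data and names are made up, and scoring by cross-validated AUC is an assumption rather than something the original analysis does), a shallow tree cannot predict a fair assignment but can predict one that over-assigns a single country to treatment.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_users = 6000

# Synthetic covariates: a binary country flag and an unrelated numeric feature.
in_country = rng.integers(0, 2, size=n_users)
noise = rng.normal(size=n_users)
features = np.column_stack([in_country, noise])

# Fair assignment ignores the covariates; skewed assignment over-assigns
# users from the flagged country to treatment (80% instead of 50%).
fair_group = rng.integers(0, 2, size=n_users)
skewed_group = (rng.random(n_users) < np.where(in_country == 1, 0.8, 0.5)).astype(int)


def assignment_auc(X, group):
    """Cross-validated AUC of a shallow tree predicting group membership."""
    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    return cross_val_score(model, X, group, cv=5, scoring="roc_auc").mean()


fair_auc = assignment_auc(features, fair_group)
skewed_auc = assignment_auc(features, skewed_group)
print(f"AUC with fair assignment:   {fair_auc:.2f}")
print(f"AUC with skewed assignment: {skewed_auc:.2f}")
```

An AUC near 0.5 means the model cannot tell the groups apart; a clearly higher value signals that some covariate is tied to the assignment.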

In [None]:
import graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from graphviz import Source
from sklearn import tree

# Convert date to string so it is easy to turn into dummy variables
mydata['date'] = mydata['date'].apply(str)

data_dummy = pd.get_dummies(mydata)
data_dummy

# Remove conversion; 'test' is now the label
train_cols = data_dummy.drop(['test', 'conversion'], axis = 1)
train_cols

Unnamed: 0,user_id,conversion,test,age,date_2015-11-30 00:00:00,date_2015-12-01 00:00:00,date_2015-12-02 00:00:00,date_2015-12-03 00:00:00,date_2015-12-04 00:00:00,source_Ads,...,country_El Salvador,country_Guatemala,country_Honduras,country_Mexico,country_Nicaragua,country_Panama,country_Paraguay,country_Peru,country_Uruguay,country_Venezuela
1,497851,0,1,21,0,0,0,0,1,1,...,0,0,0,1,0,0,0,0,0,0
3,290051,0,1,22,0,0,0,1,0,1,...,0,0,0,1,0,0,0,0,0,0
4,548435,0,1,19,1,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
5,540675,0,1,22,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
6,863394,0,0,35,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
452861,783089,0,0,20,0,0,1,0,0,1,...,0,0,0,1,0,0,0,0,0,0
452862,425010,0,0,50,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
452863,826793,0,1,20,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
452865,785224,0,1,21,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,user_id,age,date_2015-11-30 00:00:00,date_2015-12-01 00:00:00,date_2015-12-02 00:00:00,date_2015-12-03 00:00:00,date_2015-12-04 00:00:00,source_Ads,source_Direct,source_SEO,...,country_El Salvador,country_Guatemala,country_Honduras,country_Mexico,country_Nicaragua,country_Panama,country_Paraguay,country_Peru,country_Uruguay,country_Venezuela
1,497851,21,0,0,0,0,1,1,0,0,...,0,0,0,1,0,0,0,0,0,0
3,290051,22,0,0,0,1,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
4,548435,19,1,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
5,540675,22,0,0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
6,863394,35,0,0,0,0,1,0,0,1,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
452861,783089,20,0,0,1,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
452862,425010,50,0,0,0,0,1,0,0,1,...,0,0,0,1,0,0,0,0,0,0
452863,826793,20,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
452865,785224,21,0,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [None]:
dtclf = DecisionTreeClassifier(
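    # Class weights, mainly to keep the tree from favoring classes that have many more training samples.
    # Weights can be given explicitly; with 'balanced' they are computed automatically, so the rarer class gets a higher weight.
    class_weight='balanced',
    # Limits tree growth: a node is split only if the split decreases the (weighted) impurity by at least this value;
    # otherwise the node becomes a leaf.
    min_impurity_decrease = 0.001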
    )
dtclf.fit(train_cols, data_dummy['test'])

DecisionTreeClassifier(class_weight='balanced', criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.001, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [None]:
export_graphviz(dtclf, out_file = 'tree_test.dot', feature_names = train_cols.columns, proportion = True, rotate = True)
with open('tree_test.dot') as f:
    dot_graph = f.read()

s = Source.from_file('tree_test.dot')
s.view()

'tree_test.dot.pdf'

We can see that the distributions of treatment and control are not the same. When country_Argentina = 1, the tree shows that 23% of users are in control (1 - 87.3% × 88.3%) and 77% are in treatment (87.3% × 88.3%). For Uruguay, control : treatment = 12.7% : 87.3%.

In [None]:
# Double-check the user proportions for Argentina and Uruguay
data_dummy.groupby('test')[['country_Argentina', 'country_Uruguay']].mean()

test,country_Argentina,country_Uruguay
0,0.050488,0.002239
1,0.173223,0.017236


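The decision tree is right. Argentina accounts for about 17% of treatment users but only about 5% of control users; Uruguay accounts for about 1.7% of treatment users but only about 0.2% of control users. This imbalance is the problem: the significant difference in conversion rate is likely due to the different country mix of the two groups rather than to the translation.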

In [None]:
# Verify: rerun the t-test with and without Argentina and Uruguay

# Use the original dataset
original_data = stats.ttest_ind(data_dummy[mydata['test'] == 1]['conversion'],
                                data_dummy[mydata['test'] == 0]['conversion'],
                                equal_var = False)

# Use dataset without Argentina and Uruguay
removed_data = stats.ttest_ind(data_dummy[(mydata['test'] == 1) &
                                          (data_dummy['country_Argentina'] == 0) &
                                          (data_dummy['country_Uruguay'] == 0)
                                         ]['conversion'],
                               data_dummy[(mydata['test'] == 0) &
                                          (data_dummy['country_Argentina'] == 0) &
                                          (data_dummy['country_Uruguay'] == 0)
                                         ]['conversion'],
                               equal_var = False)

print(pd.DataFrame({'data_type' : ['original', 'removed'],
                    'p_value' : [original_data.pvalue, removed_data.pvalue],
                    't_stat': [original_data.statistic, removed_data.statistic]}))

  data_type       p_value    t_stat
0  original  1.928918e-13 -7.353895
1   removed  7.200849e-01  0.358346


With all countries included, the difference is significant (p ≈ 1.9e-13). After removing Argentina and Uruguay, the difference is no longer significant (p ≈ 0.72, t ≈ 0.36). Our goal was to improve the conversion rate, so this is not a big success, but at least there is no evidence that the localized translation makes the conversion rate worse.

## Project Suggestions
1. Talk to the engineer responsible for randomization, find out what went wrong, fix it, and run the test again.
2. If everything turns out to be fine apart from these two countries, you can adjust the weights of these two countries.
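
For the third question in the problem statement, one possible shape for an automated check is sketched below. `randomization_looks_ok` is a hypothetical helper, and both the chi-square test per covariate and the Bonferroni-adjusted threshold are assumptions rather than the method used above; it returns False when any categorical covariate is significantly associated with the assignment and True otherwise.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def randomization_looks_ok(df, group_col, covariates, alpha=0.05):
    """Return True if no categorical covariate is significantly tied to assignment.

    Runs one chi-square test of independence per covariate and compares each
    p-value with a Bonferroni-adjusted threshold.
    """
    threshold = alpha / len(covariates)
    for col in covariates:
        table = pd.crosstab(df[col], df[group_col])
        p_value = chi2_contingency(table)[1]
        if p_value < threshold:
            return False
    return True


# Toy data: the same users, once split evenly and once with Argentina skewed.
balanced = pd.DataFrame({
    "country": ["Mexico"] * 400 + ["Argentina"] * 200,
    "test": [0, 1] * 200 + [0, 1] * 100,
})
skewed = pd.DataFrame({
    "country": ["Mexico"] * 400 + ["Argentina"] * 200,
    "test": [0, 1] * 200 + [0] * 40 + [1] * 160,
})

print(randomization_looks_ok(balanced, "test", ["country"]))
print(randomization_looks_ok(skewed, "test", ["country"]))
```

Numeric covariates such as age would need binning or a different test before being passed to a check like this.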