# Statistics with Python: A/B test

A/B testing lets us test two or more ideas against each other and see which one empirically performs better.  

Why is A/B testing important?  
1) No guessing: because the test runs in the real world, we learn which idea is better under the conditions that matter most.  
2) It provides accurate answers quickly.  
3) It allows us to iterate rapidly on ideas.  
4) It is one of the only statistically sound ways to establish causal relationships.  

**A/B testing**      
A/B testing works by exposing distinct, randomly assigned groups of users to each of the ideas we want to test. We then observe these users and, by measuring how they behave, untangle the impact of each idea and determine which is best.  

If we have users and ideas, then chances are we can run an A/B test.       
Before performing A/B testing, we must first understand what is worth testing and optimizing for.

**Key Performance Indicator (KPIs)**            
A/B tests are typically run to improve KPIs, the metrics that matter most to the business or organisation.  
For a drug company: likelihood of a side effect, etc.  
For a mobile game company: revenue, play time per user, etc.  

Identifying KPIs requires domain knowledge, experience, and exploratory data analysis (EDA).  



## 1. Key Performance Indicator (KPIs)

### 1.1 Identifying and understanding KPIs

Example: Meditation App      
A mobile app that offers meditation services for a paid subscription as well as one-off in-app purchases.       

Our KPI is the free-trial-to-paying-user conversion rate, which we want to keep strong.  

Dataset 1: User demographics  
Each row is tied to a unique user ID number.

In [1]:
import pandas as pd

In [2]:
df_customer_data = pd.read_csv("../python_basics/data/user_demographics_v1.csv")

In [3]:
df_customer_data.head()

Unnamed: 0,uid,reg_date,device,gender,country,age
0,54030035.0,2017-06-29T00:00:00Z,and,M,USA,19
1,72574201.0,2018-03-05T00:00:00Z,iOS,F,TUR,22
2,64187558.0,2016-02-07T00:00:00Z,iOS,M,USA,16
3,92513925.0,2017-05-25T00:00:00Z,and,M,BRA,41
4,99231338.0,2017-03-26T00:00:00Z,iOS,M,FRA,59


In [4]:
df_customer_data_mod=df_customer_data.rename(columns = {'reg_date':'date'})
df_customer_data_mod['date'] = df_customer_data_mod['date'].str.slice(0,10) # only the year and month and day

In [5]:
df_customer_data_mod.head()

Unnamed: 0,uid,date,device,gender,country,age
0,54030035.0,2017-06-29,and,M,USA,19
1,72574201.0,2018-03-05,iOS,F,TUR,22
2,64187558.0,2016-02-07,iOS,M,USA,16
3,92513925.0,2017-05-25,and,M,BRA,41
4,99231338.0,2017-03-26,iOS,M,FRA,59


Dataset 2: User actions        
Contains the date of each purchase, the SKU purchased, and the price paid in cents.

In [6]:
df_app_purchases = pd.read_csv("../python_basics/data/purchase_data_v1.csv")

In [7]:
df_app_purchases.head()

Unnamed: 0,date,uid,sku,price
0,2017-07-10,41195147,sku_three_499,499
1,2017-07-15,41195147,sku_three_499,499
2,2017-11-12,41195147,sku_four_599,599
3,2017-09-26,91591874,sku_two_299,299
4,2017-12-01,91591874,sku_four_599,599


KPI: conversion rate       
One question in defining our KPI is the interval over which to measure the conversion rate: immediately after the trial lapses, one week after, or one month after.  

**Choosing KPIs**:  
One way to decide is to check how well each candidate statistic generalises across different demographic groups.  
Stability across groups is desirable.  
We can also check whether one candidate is more strongly correlated than the others with important factors such as retention or spending.  
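
As a rough illustration of comparing candidate intervals across groups, the sketch below computes the conversion rate within several windows for each device type. The `users` table, its `sub_time` column (days between trial lapse and subscription, NaN if the user never subscribed) and the chosen windows are all hypothetical toy values, not part of the datasets above.

```python
import pandas as pd

# Hypothetical toy data: one row per user whose free trial has lapsed.
# sub_time = days between lapse and subscription (NaN = never subscribed).
users = pd.DataFrame({
    "uid": [1, 2, 3, 4, 5, 6, 7, 8],
    "device": ["and", "and", "and", "and", "iOS", "iOS", "iOS", "iOS"],
    "sub_time": [0, 5, 20, None, 0, 10, None, None],
})

windows = [0, 7, 30]  # candidate intervals, in days after lapse

# For each window, flag users who converted within it, then average per device.
# Comparisons with NaN evaluate to False, so non-subscribers count as not converted.
rates = pd.DataFrame({
    f"within_{w}d": users.assign(converted=users["sub_time"] <= w)
                         .groupby("device")["converted"].mean()
    for w in windows
})
print(rates)
```

A candidate interval whose rates look similar across groups would be the more stable choice in the sense described above.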

We can match demographic data to purchase data to explore specific relationships.


In [30]:
# Merge on the 'uid' field
df_uid_combined_data_mod = df_app_purchases.merge(df_customer_data_mod, on=['uid'], how='inner') #SQL-style JOIN, df1.merge(df2)

# Examine the results 
print(df_uid_combined_data_mod.head())
print(len(df_uid_combined_data_mod))

       date_x       uid            sku  price      date_y device gender  \
0  2017-07-10  41195147  sku_three_499    499  2017-06-26    and      M   
1  2017-07-15  41195147  sku_three_499    499  2017-06-26    and      M   
2  2017-11-12  41195147   sku_four_599    599  2017-06-26    and      M   
3  2017-09-26  91591874    sku_two_299    299  2017-01-05    and      M   
4  2017-12-01  91591874   sku_four_599    599  2017-01-05    and      M   

  country  age  
0     BRA   17  
1     BRA   17  
2     BRA   17  
3     TUR   17  
4     TUR   17  
9006


In [31]:
# To look at purchases that happened on the date of registration
# Merge on the 'uid' and 'date' field
df_uid_date_combined_data_mod = df_app_purchases.merge(df_customer_data_mod, on=['uid', 'date'], how='inner')

# Examine the results 
print(df_uid_date_combined_data_mod.head())
print(len(df_uid_date_combined_data_mod))

         date       uid             sku  price device gender country  age
0  2016-03-30  94055095    sku_four_599    599    iOS      F     BRA   16
1  2015-10-28  69627745     sku_one_199    199    and      F     BRA   18
2  2017-02-02  11604973  sku_seven_1499    499    and      F     USA   16
3  2016-06-05  22495315    sku_four_599    599    and      F     USA   19
4  2018-02-17  51365662     sku_two_299    299    iOS      M     TUR   16
35


In [32]:
# original merge without changing "reg_date"
# Merge on the 'uid' field
df_uid_combined_data = df_app_purchases.merge(df_customer_data, on=['uid'], how='inner') #SQL-style JOIN, df1.merge(df2)

# Examine the results 
print(df_uid_combined_data.head())
print(len(df_uid_combined_data))

         date       uid            sku  price              reg_date device  \
0  2017-07-10  41195147  sku_three_499    499  2017-06-26T00:00:00Z    and   
1  2017-07-15  41195147  sku_three_499    499  2017-06-26T00:00:00Z    and   
2  2017-11-12  41195147   sku_four_599    599  2017-06-26T00:00:00Z    and   
3  2017-09-26  91591874    sku_two_299    299  2017-01-05T00:00:00Z    and   
4  2017-12-01  91591874   sku_four_599    599  2017-01-05T00:00:00Z    and   

  gender country  age  
0      M     BRA   17  
1      M     BRA   17  
2      M     BRA   17  
3      M     TUR   17  
4      M     TUR   17  
9006


In [33]:
# Calculate the mean purchase price 
purchase_price_mean = df_uid_combined_data.price.agg('mean')

# Examine the output 
print(purchase_price_mean)

406.77259604707973


In [34]:
# Calculate the mean and median purchase price 
purchase_price_summary = df_uid_combined_data.price.agg(['mean', 'median'])

# Examine the output 
print(purchase_price_summary)

mean      406.772596
median    299.000000
Name: price, dtype: float64


In [35]:
# Calculate the mean and median of price and age
purchase_summary = df_uid_combined_data.agg({'price': ['mean', 'median'], 'age': ['mean', 'median']})

# Examine the output 
print(purchase_summary)

             price        age
mean    406.772596  23.922274
median  299.000000  21.000000


In [36]:
# Custom aggregation: mean of the values between the 10th and 90th percentiles
def truncated_mean(data):
    top_val = data.quantile(.9)
    bot_val = data.quantile(.1)
    trunc_data = data[(data <= top_val) & (data >= bot_val)]
    return trunc_data.mean()

df_uid_combined_data.agg({"age":[truncated_mean]})

Unnamed: 0,age
truncated_mean,21.671545


In [15]:
# Group the data 
grouped_purchase_data = df_uid_combined_data.groupby(by = ['device', 'gender'])

# Aggregate the data
purchase_summary = grouped_purchase_data.agg({'price': ['mean', 'median', 'std']})

# Examine the results
print(purchase_summary)

                    price                   
                     mean median         std
device gender                               
and    F       400.747504    299  179.984378
       M       416.237308    499  195.001520
iOS    F       404.435330    299  181.524952
       M       405.272401    299  196.843197


### 1.2 Practical example of calculating KPIs

You're now going to take what you've learned and work through calculating a KPI yourself. Specifically, you'll calculate the average amount paid per purchase within a user's first 28 days, using the merged purchase DataFrame (`df_uid_combined_data`) from before.  

This KPI can provide a sense of the popularity of different in-app purchase price points to users within their first month. 

In [16]:
import numpy as np
from datetime import datetime, timedelta

In [37]:
df_uid_combined_data.head()

Unnamed: 0,date,uid,sku,price,reg_date,device,gender,country,age
0,2017-07-10,41195147,sku_three_499,499,2017-06-26T00:00:00Z,and,M,BRA,17
1,2017-07-15,41195147,sku_three_499,499,2017-06-26T00:00:00Z,and,M,BRA,17
2,2017-11-12,41195147,sku_four_599,599,2017-06-26T00:00:00Z,and,M,BRA,17
3,2017-09-26,91591874,sku_two_299,299,2017-01-05T00:00:00Z,and,M,TUR,17
4,2017-12-01,91591874,sku_four_599,599,2017-01-05T00:00:00Z,and,M,TUR,17


In [38]:
df_uid_combined_data['reg_date'] = df_uid_combined_data['reg_date'].str.slice(0,10)
df_uid_combined_data['date'] = pd.to_datetime(df_uid_combined_data['date'])
df_uid_combined_data['reg_date'] = pd.to_datetime(df_uid_combined_data['reg_date'])

In [19]:
df_uid_combined_data.head()

Unnamed: 0,date,uid,sku,price,reg_date,device,gender,country,age
0,2017-07-10,41195147,sku_three_499,499,2017-06-26,and,M,BRA,17
1,2017-07-15,41195147,sku_three_499,499,2017-06-26,and,M,BRA,17
2,2017-11-12,41195147,sku_four_599,599,2017-06-26,and,M,BRA,17
3,2017-09-26,91591874,sku_two_299,299,2017-01-05,and,M,TUR,17
4,2017-12-01,91591874,sku_four_599,599,2017-01-05,and,M,TUR,17


In [20]:
df_uid_combined_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9006 entries, 0 to 9005
Data columns (total 9 columns):
date        9006 non-null datetime64[ns]
uid         9006 non-null int64
sku         9006 non-null object
price       9006 non-null int64
reg_date    9006 non-null datetime64[ns]
device      9006 non-null object
gender      9006 non-null object
country     9006 non-null object
age         9006 non-null int64
dtypes: datetime64[ns](2), int64(3), object(4)
memory usage: 1023.6+ KB


In [21]:
current_date = pd.to_datetime('2018-03-17 00:00:00')
current_date

Timestamp('2018-03-17 00:00:00')

In [22]:
# Subtract timedelta(days=28) from current_date to find the latest registration
# date for which a full 28 days of purchase history is available.
max_purchase_date = current_date - timedelta(days=28)

In [23]:
# Exclude users who registered within the last 28 days, since their first
# 28 days are not yet complete.
# That is, keep only rows whose reg_date is before max_purchase_date.
purchase_data_filt = df_uid_combined_data[df_uid_combined_data.reg_date < max_purchase_date]

In [24]:
# Filter this dataset to only include purchases that occurred on a date within the first 28 days. 
# Recall that the date of purchase is stored in the date column

# Filter to contain only purchases within the first 28 days of registration
purchase_data_filt = purchase_data_filt[(purchase_data_filt.date <=
                                         purchase_data_filt.reg_date + 
                                         timedelta(days=28))]

In [25]:
# Output the mean price paid per purchase
print(purchase_data_filt.price.agg("mean"))

414.4237288135593


**Average purchase price by cohort**          

Building on the previous exercise, let's look at the same KPI, average purchase price, and a similar one, median purchase price, within the first 28 days. Additionally, let's look at these metrics not limited to 28 days to compare.        

We can calculate these metrics across a set of cohorts and see what differences emerge. This is a useful task as it can help us understand how behaviors vary across cohorts.         

Note that in our data the price variable is given in cents.

Use np.where to create an array month1 containing:  

the price of the purchase, if  

the user registration .reg_date occurred at least 28 days ago (i.e. before max_reg_date), and  

the date of purchase .date occurred within 28 days of registration date .reg_date;  

NaN, otherwise.

In [26]:
# Set the max registration date to be one month before today
max_reg_date = current_date - timedelta(days=28)

# Find the month 1 values:
month1 = np.where((df_uid_combined_data.reg_date < max_reg_date) &
                    (df_uid_combined_data.date < df_uid_combined_data.reg_date + timedelta(days=28)),
                  df_uid_combined_data.price, 
                  np.nan)  # np.nan rather than np.NaN, which NumPy 2.0 removed
                 
# Update the value in the DataFrame 
df_uid_combined_data['month1'] = month1

In [27]:
# Group the data by gender and device 
purchase_data_upd = df_uid_combined_data.groupby(by=['gender', 'device'], as_index=False)

In [28]:
# Aggregate the month1 and price data 
purchase_summary = purchase_data_upd.agg(
                        {'month1': ['mean', 'median'],
                        'price': ['mean', 'median']})

# Examine the results 
print(purchase_summary)

  gender device      month1              price       
                       mean median        mean median
0      F    and  388.204545  299.0  400.747504    299
1      F    iOS  432.587786  499.0  404.435330    299
2      M    and  413.705882  399.0  416.237308    499
3      M    iOS  433.313725  499.0  405.272401    299


In [29]:
df_uid_combined_data.head()

Unnamed: 0,date,uid,sku,price,reg_date,device,gender,country,age,month1
0,2017-07-10,41195147,sku_three_499,499,2017-06-26,and,M,BRA,17,499.0
1,2017-07-15,41195147,sku_three_499,499,2017-06-26,and,M,BRA,17,499.0
2,2017-11-12,41195147,sku_four_599,599,2017-06-26,and,M,BRA,17,
3,2017-09-26,91591874,sku_two_299,299,2017-01-05,and,M,TUR,17,
4,2017-12-01,91591874,sku_four_599,599,2017-01-05,and,M,TUR,17,


## 2. Exploring and visualisation

In [None]:
# Using timedelta to work with date intervals.
# Assumes sub_data_demo is a DataFrame with datetime columns lapse_date and
# subscription_date (it is not defined elsewhere in this notebook).
current_date = pd.to_datetime("2018-03-17")

# To add or subtract a time interval from a date, use timedelta
# (specify the unit of time and the number of units)
max_lapse_date = current_date - timedelta(days=14)
# .copy() avoids a SettingWithCopyWarning when columns are added below
conv_sub_data = sub_data_demo[sub_data_demo.lapse_date < max_lapse_date].copy()

# Time between a user's lapse and subscription dates (a Timedelta column)
sub_time = conv_sub_data.subscription_date - conv_sub_data.lapse_date
# Add this as a new column
conv_sub_data["sub_time"] = sub_time

# Convert the Timedelta values to a whole number of days with .dt.days
# (NaN for users who never subscribed)
conv_sub_data["sub_time"] = conv_sub_data.sub_time.dt.days

# Users who did not subscribe in week 1 (never subscribed, or subscribed after
# day 7) and who lapsed two or more weeks ago
conv_base = conv_sub_data[(conv_sub_data.sub_time.isnull()) | (conv_sub_data.sub_time > 7)]
total_users = len(conv_base)

# Of those remaining users, the ones who subscribed between day 8 and day 14
total_subs = np.where(conv_base.sub_time.notnull() & (conv_base.sub_time <= 14), 1, 0)
total_subs = sum(total_subs)

# Conversion rate
conversion_rate = total_subs / total_users

## 3. A/B test

An A/B test is an experiment in which we test two different values of the same variable against one another, to determine which one is "better" by evaluating their performance among two randomly assigned groups of real users.  

Often we test the current state against a variant, in which case the two versions are called the control and the treatment respectively.  

Consider the initial subscription paywall in our meditation app: we have two candidate phrases that pop up asking the user to subscribe once the free trial is over, and we would like to see which phrase yields the better conversion rate.  

To run this as an A/B test, we randomly select a subset of users, show them the new phrase, and monitor the conversion behavior of this group compared to the other users.  

We can then observe whether one group converts at a higher rate than the other.  

The key to A/B testing is **randomly** assigning the users. Random assignment ensures that no confounding variable is likely to explain the change in performance we observe.  
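
A minimal sketch of random assignment, using only the standard library. The function name `assign_groups`, the group labels and the range of user IDs are hypothetical; a real system would typically also need to persist the assignment so that each user always sees the same variant.

```python
import random
from collections import Counter

def assign_groups(user_ids, seed=None):
    """Independently assign each user to 'control' or 'treatment' with equal probability."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    return {uid: rng.choice(["control", "treatment"]) for uid in user_ids}

user_ids = range(1000)  # hypothetical user IDs
assignment = assign_groups(user_ids, seed=42)

# With purely random assignment the two groups should be of roughly equal size
print(Counter(assignment.values()))
```

Because the assignment ignores everything about the user, traits such as device, country or age should be balanced across the two groups on average.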

Good problems for A/B testing are those where users are impacted individually. Bad problems are those with network effects among users, that is, where the change we want to test affects, or is directly related to, an area in which users interact with one another.  



In an A/B test, we must define a response variable that we will use to measure our impact. This should be either a KPI or something directly related to a KPI. Additionally, we should select a response that is directly measurable, like purchases, rather than something hard to measure.  

Next we have a set of factors that impact our response, such as the effect of the color of a paywall on purchases. Related are variants, which are particular manifestations of a factor, such as a red and a blue paywall.  

Next we have our experimental unit. This is the unit over which metrics are measured before aggregating over the control or treatment group as a whole. For example, if our response were purchases of a consumable, we could use users as our experimental unit and compare the average number of purchases per user across our two groups.  
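
To make the idea of an experimental unit concrete, here is a small sketch with the user as the unit: purchases are first counted per user, and only then averaged within each group. The `purchases` table and its `group` column are hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data: one row per purchase, tagged with the buyer's group
purchases = pd.DataFrame({
    "uid":   [1, 1, 2, 3, 3, 3, 4],
    "group": ["control", "control", "control",
              "treatment", "treatment", "treatment", "treatment"],
})

# Step 1: measure the metric per experimental unit (purchases per user)
purchases_per_user = purchases.groupby(["group", "uid"]).size()

# Step 2: aggregate the per-unit values over each group
avg_purchases_per_user = purchases_per_user.groupby(level="group").mean()
print(avg_purchases_per_user)
```

Note that users with no purchases do not appear in a purchase log like this one, so in practice they would need to be added back with a count of zero before averaging.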

**Experimental units: revenue per user-day**  
We are going to check what happens when we add a consumable paywall to our app. A paywall is a feature of a website or other technology that requires payment from users in order to access additional content or services.          

Here, you'll practice calculating experimental units and baseline values related to our consumable paywall. Both measure revenue only among users who viewed a paywall. Your job is to calculate revenue per user-day, with user-day as the experimental unit.            

In [40]:
purchase_data = df_uid_combined_data
purchase_data.head()

Unnamed: 0,date,uid,sku,price,reg_date,device,gender,country,age
0,2017-07-10,41195147,sku_three_499,499,2017-06-26,and,M,BRA,17
1,2017-07-15,41195147,sku_three_499,499,2017-06-26,and,M,BRA,17
2,2017-11-12,41195147,sku_four_599,599,2017-06-26,and,M,BRA,17
3,2017-09-26,91591874,sku_two_299,299,2017-01-05,and,M,TUR,17
4,2017-12-01,91591874,sku_four_599,599,2017-01-05,and,M,TUR,17


In [41]:
# Truncate each timestamp to its day
purchase_data.date = purchase_data.date.dt.floor('d')

# Replace the NaN price values with 0 
purchase_data.price = np.where(np.isnan(purchase_data.price), 0, purchase_data.price)

# Aggregate the data by 'uid' & 'date', summing only the price column
# (summing every column can fail on datetime columns in recent pandas versions)
purchase_data_agg = purchase_data.groupby(by=['uid', 'date'], as_index=False)
revenue_user_day = purchase_data_agg['price'].sum()

# Calculate the final average
revenue_user_day = revenue_user_day.price.mean()
print(revenue_user_day)

407.95033407572384


### 3.1 Preparing to run an A/B test

There are two primary concerns in test design: ensuring that the test can practically be run, and ensuring that we can derive meaningful results from it. These two objectives are strongly connected. A good question to ask is what percentage change in our response variable would be meaningful to detect.  

The minimum level of change we want to detect is called **sensitivity**. A good exercise is to look at what different sensitivities look like for our experimental unit of choice, e.g. what different percentage changes would mean for our revenue per user over the period of the test.  
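
The exercise of looking at different sensitivities can be sketched as below. The baseline of 400 cents per user and the list of candidate sensitivities are hypothetical values chosen for illustration.

```python
# Hypothetical baseline: average revenue per user (in cents) over the test period
baseline_revenue_per_user = 400.0

def revenue_at_sensitivity(baseline, sensitivity):
    """Revenue per user if the treatment lifts the baseline by `sensitivity` (0.05 = 5%)."""
    return baseline * (1 + sensitivity)

# Show what each candidate sensitivity would mean in absolute terms
for sensitivity in [0.01, 0.05, 0.10, 0.20]:
    new_value = revenue_at_sensitivity(baseline_revenue_per_user, sensitivity)
    change = new_value - baseline_revenue_per_user
    print(f"{sensitivity:.0%} lift: {baseline_revenue_per_user:.2f} -> {new_value:.2f} "
          f"(+{change:.2f} cents per user)")
```

Seeing the absolute change next to the percentage helps judge whether a given sensitivity is worth the larger sample that a smaller detectable change requires.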

It is also important to understand the **latent variability** in the data. Here it makes sense to check whether the purchase amount is consistent across all users or varies widely: a change due to the treatment is more easily detected in the former case. Once the test has run, we can use the standard deviation of the test results to evaluate it, but for planning we must rely on the variability of our initial data. A metric such as std/mean indicates how much variability there is relative to the mean.
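
A short sketch of the std/mean metric (often called the coefficient of variation), using the standard library. Both lists of per-user purchase amounts are made-up numbers meant to contrast a consistent population with a widely varying one.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical per-user purchase amounts, in cents
consistent_users = [300, 310, 290, 305, 295]
varied_users = [100, 900, 50, 300, 150]

print(coefficient_of_variation(consistent_users))
print(coefficient_of_variation(varied_users))
```

The second group has a much larger ratio, so a treatment effect of a given size would be harder to distinguish from noise there.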

### 3.2 Null hypothesis

This is the hypothesis that our control and treatment (here, our two phrases) have the same impact on the response, so that any observed difference is due to randomness alone. If we can conclude that this is not the case, we say our results are statistically significant and that there is a difference.  

Rejecting the null hypothesis when it is true is called a type I error.  
Retaining the null hypothesis when it is false is called a type II error.  

We define the probability of **NOT** making a type I error as the **confidence level**. The higher we set this value, the larger the sample we will need. A common value is 0.95.  

Related to this is the idea of **statistical power**. Power is the probability of finding statistically significant results when the null hypothesis is false.  

Power and confidence level are connected to the standard error and sensitivity of our test. To estimate the sample size we need, we choose our desired sensitivity, set our desired confidence level and power, and then find the sample size whose standard error is small enough to meet them.  
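
One common way to turn these choices into a number is the normal-approximation formula for comparing two proportions, sketched below with the standard library. This is an illustrative approximation rather than the only valid method; the function name and the example conversion rates (10% baseline, 12% target) are assumptions.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p1, p2, confidence_level=0.95, power=0.8):
    """Approximate users needed per group to detect a change from rate p1 to p2
    with a two-sided test (normal approximation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2
    return ceil(n)

# Hypothetical baseline conversion of 10%, and a sensitivity of +2 percentage points
print(sample_size_per_group(0.10, 0.12))
# Raising the power, or shrinking the detectable change, increases the requirement
print(sample_size_per_group(0.10, 0.12, power=0.9))
print(sample_size_per_group(0.10, 0.11))
```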

After the A/B test, we determine whether the difference between the treatment and control groups is statistically significant, that is, whether the values differ by more than would be expected from randomness alone. If so, we can reject the null hypothesis. To decide this we calculate the *p-value*.  

The p-value is the probability, under the null hypothesis, of observing a value as extreme as or more extreme than the one actually observed. A low p-value means that such a result would be unlikely if the null hypothesis were true, so it serves as a measure of evidence against retaining the null hypothesis.  
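
For a conversion-rate response, one standard way to obtain a p-value is a two-sided, two-proportion z-test, sketched below with the standard library. The function name and the counts in the example are hypothetical; other tests may be more appropriate for other response variables.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates,
    using the pooled-variance z-test (normal approximation)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under the null hypothesis both groups share one conversion rate
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a z-statistic at least this extreme in either direction
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical results: 100/1000 conversions in control, 130/1000 in treatment
print(two_proportion_p_value(100, 1000, 130, 1000))
```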

Conventional interpretation of p-values:  
< 0.01: very strong evidence against the null hypothesis.  
0.01 - 0.05: strong evidence against the null hypothesis.  
0.05 - 0.10: very weak evidence against the null hypothesis.  
\> 0.10: little to no evidence against the null hypothesis.

Usually we will report:  
the sample size, the run time, and the mean and variance of the two groups, along with the estimated lift and its confidence interval.
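
As a sketch of the last two items, the function below estimates the lift as the absolute difference in conversion rates and builds a normal-approximation confidence interval around it. The function name and the example counts are hypothetical, and lift is sometimes defined as a relative change instead.

```python
from math import sqrt
from statistics import NormalDist

def lift_and_confidence_interval(conv_c, n_c, conv_t, n_t, confidence_level=0.95):
    """Absolute lift (treatment rate minus control rate) and its confidence interval."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    lift = p_t - p_c
    # Standard error of the difference between two independent proportions
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return lift, (lift - z * se, lift + z * se)

# Hypothetical results: 100/1000 conversions in control, 120/1000 in treatment
lift, (lower, upper) = lift_and_confidence_interval(100, 1000, 120, 1000)
print(f"lift = {lift:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

If the interval contains zero, as it does for these made-up counts, the data are consistent with no difference at the chosen confidence level.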