# Statistics with Python: A/B test

A/B testing is a tool that allows us to test two or more different ideas against each other, and to see which one empirically performs better.      

Why is A/B testing important?
1) No guessing: we run the test in the real world, so we learn which idea is better under the conditions that matter most.
2) It provides accurate answers quickly.
3) It allows us to iterate rapidly on ideas.
4) It is one of the only statistically sound ways to establish causal relationships.

**A/B testing**      
A/B testing works by exposing distinct, randomly assigned groups of users to each of the ideas we want to test. We then observe these users and, by measuring how they behave, untangle the impact of each idea and determine which is best.
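
The random assignment described above can be sketched as follows. This is a minimal illustration, not the procedure used for the datasets in this notebook: the `users` table, its `uid` values, and the group labels `"A"` and `"B"` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical user IDs; in practice these would come from the user table.
users = pd.DataFrame({"uid": np.arange(1000, 1010)})

# A seeded generator makes the assignment reproducible.
rng = np.random.default_rng(seed=42)

# Assign each user independently and at random to group "A" or group "B".
users["group"] = rng.choice(["A", "B"], size=len(users))

# Check how many users landed in each group.
print(users["group"].value_counts())
```

Because each user is assigned independently, the two groups are only approximately equal in size, which is usually acceptable when the number of users is large.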

If we have users and ideas, then chances are we can run an A/B test.       
Before performing A/B testing, we must first understand what is worth testing and optimizing for.

**Key Performance Indicators (KPIs)**  
Typically, A/B tests are run to improve KPIs.  
These are the metrics that matter most to the business or organisation.  
For a drug company: likelihood of a side effect, etc.  
For a mobile game company: revenue, play time per user, etc.  

Identifying KPIs requires domain knowledge, experience, and exploratory data analysis (EDA).



## 1. Key Performance Indicators (KPIs)

### 1.1 Identifying and understanding KPIs

Example: Meditation App      
A mobile app that offers meditation services for a paid subscription as well as one-off in-app purchases.       

We are motivated to maintain a strong free-trial-to-paying-user conversion rate. This conversion rate is our KPI.

Dataset 1: User demographics, tied to a unique user ID number.

In [1]:
import pandas as pd

In [2]:
df_customer_data = pd.read_csv("../python_basics/data/user_demographics_v1.csv")
df_customer_data = df_customer_data.rename(columns={'reg_date': 'date'})
df_customer_data['date'] = df_customer_data['date'].str.slice(0, 10)  # keep only the date part (YYYY-MM-DD)

In [3]:
df_customer_data.head()

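|   | uid | date | device | gender | country | age |
|---|---|---|---|---|---|---|
| 0 | 54030035.0 | 2017-06-29 | and | M | USA | 19 |
| 1 | 72574201.0 | 2018-03-05 | iOS | F | TUR | 22 |
| 2 | 64187558.0 | 2016-02-07 | iOS | M | USA | 16 |
| 3 | 92513925.0 | 2017-05-25 | and | M | BRA | 41 |
| 4 | 99231338.0 | 2017-03-26 | iOS | M | FRA | 59 |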


Dataset 2: User actions        
Contains trial-period data, the date of purchase, and the price paid in cents.

In [4]:
df_app_purchases = pd.read_csv("../python_basics/data/purchase_data_v1.csv")

In [5]:
df_app_purchases.head()

|   | date | uid | sku | price |
|---|---|---|---|---|
| 0 | 2017-07-10 | 41195147 | sku_three_499 | 499 |
| 1 | 2017-07-15 | 41195147 | sku_three_499 | 499 |
| 2 | 2017-11-12 | 41195147 | sku_four_599 | 599 |
| 3 | 2017-09-26 | 91591874 | sku_two_299 | 299 |
| 4 | 2017-12-01 | 91591874 | sku_four_599 | 599 |


KPI: conversion rate       
One question in defining our KPI is over what interval we should measure the conversion rate: immediately after the trial lapses, 1 week after, or 1 month after.
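
To make the choice of interval concrete, here is a minimal sketch that computes the conversion rate for several windows. The `trials` table and its columns `lapse_date` and `subscription_date` are hypothetical and are not part of the datasets loaded above.

```python
import pandas as pd

# Hypothetical data: one row per user, with the date the free trial lapsed
# and the date the user subscribed (NaT if the user never subscribed).
trials = pd.DataFrame({
    "uid": [1, 2, 3, 4],
    "lapse_date": pd.to_datetime(
        ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-03"]),
    "subscription_date": pd.to_datetime(
        ["2018-01-01", "2018-01-05", None, "2018-01-25"]),
})

def conversion_rate(df, max_days):
    """Share of users who subscribed within max_days of their trial lapse."""
    days_to_convert = (df["subscription_date"] - df["lapse_date"]).dt.days
    # Users who never subscribed have NaN here and count as not converted.
    converted = days_to_convert.between(0, max_days)
    return converted.mean()

# Compare the same-day, 1-week, and 1-month conversion rates.
for window in (0, 7, 30):
    print(window, conversion_rate(trials, window))
```

Longer windows can only increase the measured rate, so the windows should be compared on how informative and stable they are, not on their raw values.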

**Choosing KPIs**:  
One way to decide this is to see the generalisability of these statistics across different demographic groups.      
Stability in this way is desired.       
We can also check if one is more correlated with important factors like retention or spending than the others.      

We can match demographic data to subscription data to explore specific relationships.
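
One way to check stability across demographic groups is to compute each candidate KPI per group and compare the spread. The sketch below assumes a hypothetical per-user table with boolean columns `converted_7d` and `converted_30d`; these columns do not exist in the datasets above.

```python
import pandas as pd

# Hypothetical per-user table: a demographic field plus one boolean flag
# per candidate KPI definition (converted within 7 days / within 30 days).
users = pd.DataFrame({
    "device": ["and", "and", "iOS", "iOS", "iOS", "and"],
    "converted_7d": [True, False, True, True, False, False],
    "converted_30d": [True, True, True, True, False, True],
})

# The mean of a boolean column is the conversion rate within each group.
by_device = users.groupby("device")[["converted_7d", "converted_30d"]].mean()
print(by_device)

# A smaller gap between groups suggests a more stable candidate KPI.
spread = by_device.max() - by_device.min()
print(spread)
```

With real data, the same comparison could be repeated for other fields such as `gender`, `country`, or an age bracket.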


In [6]:
# Merge on the 'uid' field
df_uid_combined_data = df_app_purchases.merge(df_customer_data, on=['uid'], how='inner') #SQL-style JOIN, df1.merge(df2)

# Examine the results 
print(df_uid_combined_data.head())
print(len(df_uid_combined_data))

       date_x       uid            sku  price      date_y device gender  \
0  2017-07-10  41195147  sku_three_499    499  2017-06-26    and      M   
1  2017-07-15  41195147  sku_three_499    499  2017-06-26    and      M   
2  2017-11-12  41195147   sku_four_599    599  2017-06-26    and      M   
3  2017-09-26  91591874    sku_two_299    299  2017-01-05    and      M   
4  2017-12-01  91591874   sku_four_599    599  2017-01-05    and      M   

  country  age  
0     BRA   17  
1     BRA   17  
2     BRA   17  
3     TUR   17  
4     TUR   17  
9006
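
The `date_x` and `date_y` columns in the output above appear because both tables have a `date` column that is not part of the merge key. As a sketch, the `suffixes` parameter of `merge` can give them clearer names. The two small tables and the suffix names below are made up for illustration.

```python
import pandas as pd

# Small made-up stand-ins for the purchase and demographics tables.
purchases = pd.DataFrame({
    "uid": [1, 1, 2],
    "date": ["2017-07-10", "2017-07-15", "2017-09-26"],
    "price": [499, 499, 299],
})
customers = pd.DataFrame({
    "uid": [1, 2, 3],
    "date": ["2017-06-26", "2017-01-05", "2017-02-01"],
    "country": ["BRA", "TUR", "USA"],
})

# Inner join on uid; the overlapping 'date' columns get descriptive suffixes
# instead of the default '_x' and '_y'.
merged = purchases.merge(
    customers, on="uid", how="inner",
    suffixes=("_purchase", "_registration"),
)
print(merged)
```

User 3 has no purchases, so the inner join drops that row, while user 1 appears twice because there are two purchases.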


In [7]:
# To look at purchases that happened on the date of registration
# Merge on the 'uid' and 'date' field
df_uid_date_combined_data = df_app_purchases.merge(df_customer_data, on=['uid', 'date'], how='inner')

# Examine the results 
print(df_uid_date_combined_data.head())
print(len(df_uid_date_combined_data))

         date       uid             sku  price device gender country  age
0  2016-03-30  94055095    sku_four_599    599    iOS      F     BRA   16
1  2015-10-28  69627745     sku_one_199    199    and      F     BRA   18
2  2017-02-02  11604973  sku_seven_1499    499    and      F     USA   16
3  2016-06-05  22495315    sku_four_599    599    and      F     USA   19
4  2018-02-17  51365662     sku_two_299    299    iOS      M     TUR   16
35


In [None]:
# Calculate the mean purchase price
# (uses df_app_purchases, the purchase data loaded above)
purchase_price_mean = df_app_purchases.price.agg('mean')

# Examine the output 
print(purchase_price_mean)

In [None]:
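# Group the merged purchase and demographic data by country and device
# (df_uid_combined_data from the merge above is assumed to be the intended input)
sub_data_grp = df_uid_combined_data.groupby(by=["country", "device"], as_index=False)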

In [None]:
sub_data_grp.price.mean()

In [None]:
sub_data_grp.price.agg("mean")
sub_data_grp.price.agg(["mean", "median"])

# Pass a dict mapping column names to lists of aggregation functions
sub_data_grp.agg({"price": ["mean", "min", "max"],
                  "age": ["mean", "min", "max"]})

# Custom aggregation function
def truncated_mean(data):
    """Mean of the values between the 10th and 90th percentiles (inclusive)."""
    top_val = data.quantile(.9)
    bot_val = data.quantile(.1)
    trunc_data = data[(data <= top_val) & (data >= bot_val)]
    return trunc_data.mean()

sub_data_grp.agg({"age": [truncated_mean]})
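
To see what the truncated mean does, here is a small self-contained sketch on made-up values with one obvious outlier. The `values` Series is hypothetical, and the function is repeated so the example runs on its own.

```python
import pandas as pd

def truncated_mean(data):
    """Mean of the values between the 10th and 90th percentiles (inclusive)."""
    top_val = data.quantile(.9)
    bot_val = data.quantile(.1)
    trunc_data = data[(data <= top_val) & (data >= bot_val)]
    return trunc_data.mean()

# Made-up values with a single large outlier.
values = pd.Series([1, 2, 3, 4, 100])

print(values.mean())           # plain mean, pulled up by the outlier
print(truncated_mean(values))  # mean of the middle values only
```

Here the 10th and 90th percentiles are 1.4 and 61.6, so only 2, 3, and 4 are kept, and the truncated mean is far less sensitive to the outlier than the plain mean.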