# A/B TEST Project

Chi Zhang

## Introduction to A/B Test and Background

A/B tests are controlled experiments that run one or more variations against an original control. They are often used online to test a feature and see how users respond differently. This project comes from an online course presented by Google; the course teaches how to pick and characterize metrics for a test, design experiments with the required statistical power, analyze the results, and provide a recommendation.

## Project Introduction

### Initial Setup

Currently, Udacity, an online learning platform, has two options on its course overview page: "Start free trial" and "Access course materials". If someone chooses the first option, they are asked to enter their credit card information and are charged automatically after 14 days unless they cancel first. If someone chooses the second option, they can view the materials for free, but without support from the coaching staff.

### Experimental Change 

If someone clicks "Start free trial", they are asked how many hours they can devote to the course. If the student can commit 5 or more hours per week, they are taken through the checkout process as usual. Otherwise, a message suggests that they access the course materials for free instead. The student can then choose to continue enrolling in the free trial or to access the materials for free.

### Hypothesis

Null hypothesis: the change will not affect student enrollment behavior, and therefore will neither improve the overall student experience nor improve coaches' capacity to support students who are likely to complete the course.

Alternative hypothesis: the change will reduce the number of frustrated students who leave the free trial because they don't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course.

### Additional information

Unit of diversion: cookie
Students who enroll in the free trial are tracked by user-id from that point forward.

## Experiment Design

### Metric Choice

For each metric provided, I justify whether to use it as an invariant or an evaluation metric.
Invariant metrics are used in the sanity check to verify that the experiment is set up correctly. They are not expected to differ between the control and experiment groups.
Evaluation metrics indicate whether the test achieves its goal. We expect them to differ between the control and experiment groups, and dmin is the practical significance boundary for each metric.

Number of cookies: number of unique cookies to view the course overview page. (dmin=3000)
Invariant. Cookies are counted before the experiment has any effect, so we expect roughly the same number of cookies in the control and experiment groups.

Number of user-ids: number of users who enroll in the free trial. (dmin=50)
Neither. The number of users enrolled in the free trial is expected to differ between the control and experiment groups, so it cannot be an invariant metric. It is also a weak evaluation metric: as a raw count it is not normalized, so it cannot cleanly show whether fewer students without enough time are enrolling.

Number of clicks: number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
Invariant. The click happens before the screener is shown, so the number of unique cookies that click the button should be about the same in both groups.

Click-through-probability: number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
Invariant. Like the number of clicks, click-through probability is determined before the screener is shown, so it should be about the same in both groups.

Gross conversion: number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
Evaluation. To achieve the goal of reducing the number of frustrated students with limited time, I expect gross conversion to decrease. The practical significance boundary is 0.01.

Retention: number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
Evaluation. I expect retention to increase, and I pick this metric because it indicates the effectiveness of the change.

Net conversion: number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)
Evaluation. Relating to the goal of allocating coaching resources more effectively, I expect net conversion to increase slightly. The practical significance boundary is 0.0075.

To conclude:
Invariant metrics: No. of cookies, No. of clicks, click through probability
Evaluation metrics: gross conversion, net conversion, retention
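
The three evaluation metrics are linked: retention is payments over enrollments, so it equals net conversion divided by gross conversion. The sketch below shows the arithmetic. The 3200 clicks and 660 enrollments match the baseline figures used later in this write-up, while the 350 payments is a hypothetical value chosen only for illustration.

```python
# Illustrative daily counts. 3200 clicks and 660 enrollments match the
# baseline figures used later; 350 payments is a hypothetical value.
clicks = 3200
enrollments = 660
payments = 350

gross_conversion = enrollments / clicks   # enrollments per click
retention = payments / enrollments        # payments per enrollment
net_conversion = payments / clicks        # payments per click

print(round(gross_conversion, 5), round(retention, 4), round(net_conversion, 5))

# Retention is the ratio of the other two metrics.
ratio = net_conversion / gross_conversion
print(abs(ratio - retention) < 1e-12)
```

The same relationship holds for the baseline values: 0.109313 / 0.20625 is approximately 0.53.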


### Baseline value of each metric

In [16]:
# Baseline values are provided by Udacity.
# I store the estimates of the invariant and evaluation metrics in a dictionary.
baseline_values = {"No_Cookies":40000, "No_Clicks":3200,"CTP":0.08,"Gross_Conversion":0.20625,
                   "Retention":0.53,"Net_Conversion":0.109313}

### Measuring Standard Deviation

Given those baseline values, I then estimate the standard deviation of the sampling distribution of each evaluation metric. For a sample of 5000 cookies visiting the course overview page, I produce an analytic estimate of the SD, assuming each metric follows a binomial distribution.

In addition, the analytic estimate is comparable to the empirical variability when the unit of diversion (how samples are differentiated and assigned to each group) is the same as the unit of analysis, which is the denominator of the probability metric. Otherwise, we need to compute the variability empirically.

We calculate the variability of the evaluation metrics because, if a metric varies too much, it may not be feasible to detect the practical significance we are looking for.

In [17]:
import math
# Measuring SD
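# gross conversion
p = baseline_values['Gross_Conversion']
prop = 3200/40000 # baseline clicks per pageview
n = prop*5000 # expected number of clicks in a sample of 5000 pageviews
SD_gc = math.sqrt(p*(1-p)/n) # binomial standard deviation of a proportion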
SD_gc

0.020230604137049392

In [9]:
# net conversion 
p = baseline_values['Net_Conversion']
prop = 3200/40000
n = prop*5000
SD_nc = math.sqrt(p*(1-p)/n) 
SD_nc

0.015601575884425905

In [5]:
# retention
p = baseline_values['Retention']
prop = 660/40000 # baseline enrollments per pageview
n = prop*5000 # expected number of enrollments in a sample of 5000 pageviews
SD_re = math.sqrt(p*(1-p)/n)
SD_re

0.05494901217850908

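In conclusion, I expect the analytic estimates for gross conversion and net conversion to be comparable to the empirical variability, because their unit of analysis (cookie) is the same as the unit of diversion. This is not the case for retention, whose unit of analysis is the user-id, so its empirical variability may differ from the analytic estimate.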
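
The three calculations above share one formula, so they can be wrapped in a single helper. This is a minimal sketch, not part of the original notebook; the function name `analytic_sd` and its parameters are made up for illustration.

```python
import math

def analytic_sd(p, units_per_pageview, pageviews=5000):
    """Binomial SD of a proportion p, measured on the number of units
    (clicks or enrollments) expected from `pageviews` cookies."""
    n = units_per_pageview * pageviews
    return math.sqrt(p * (1 - p) / n)

clicks_per_pv = 3200 / 40000    # baseline clicks per pageview
enroll_per_pv = 660 / 40000     # baseline enrollments per pageview

sd = {
    "Gross_Conversion": analytic_sd(0.20625, clicks_per_pv),
    "Net_Conversion": analytic_sd(0.109313, clicks_per_pv),
    "Retention": analytic_sd(0.53, enroll_per_pv),
}
for name, value in sd.items():
    print(name, round(value, 4))
```

Retention has the largest SD because only about 82 enrollments are expected from 5000 pageviews, compared with 400 clicks for the other two metrics.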

### Sizing

Determining the sample size is fundamental in an A/B test. Next, I use the baseline values and the variability of the metrics to calculate the size of the experiment for a given statistical power and significance level.

I was asked to estimate the number of pageviews needed across both groups, given alpha = 0.05 and beta = 0.2.

In [30]:
# first approach: use online sample size calculation tool
# link of the tool: https://www.evanmiller.org/ab-testing/sample-size.html
# takes parameters alpha, beta, baseline conversion and dmin

online_results = {"Gross_Conversion":25835, "Net_Conversion": 27413,  "Retention": 39115}


Next, I convert the online calculator results into the number of pageviews needed for each metric. I then consider the duration of the test and pick the largest feasible sample size among the metrics.


In [31]:
# For Gross Conversion, the online calculator says we need at least 25835 clicks in each group.
# To get the required number of pageviews across both groups:
prop = 3200/40000
page_views_gc = round(online_results['Gross_Conversion']/prop*2)
page_views_gc

645875

In [32]:
# For Net Conversion
prop = 3200/40000
page_views_nc = round(online_results['Net_Conversion']/prop*2)
page_views_nc

685325

In [33]:
# For Retention 
prop = 660/40000
page_views_re = round(online_results['Retention']/prop*2)
page_views_re

4741212

As a result, retention needs over 4.7 million pageviews, while the estimated traffic is 40000 pageviews per day. Even with all traffic diverted to the experiment, that is about 119 days, which is too long, so I drop the retention metric. Therefore, 685325 pageviews is the sample size I will use. The duration depends on how much traffic is diverted: with 50% of traffic (20000 pageviews per day), the experiment takes about 35 days (685325 / 20000 ≈ 34.3, rounded up to whole days).
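
The trade-off between diverted traffic and duration can be tabulated with a few lines of Python. This is a minimal sketch; the helper name `days_needed` is hypothetical, and it rounds up because a partial day still has to be run.

```python
import math

def days_needed(pageviews_required, daily_traffic=40000, fraction=0.5):
    """Whole days needed to collect the required pageviews when
    `fraction` of the daily traffic is diverted to the experiment."""
    return math.ceil(pageviews_required / (daily_traffic * fraction))

# Net conversion drives the sample size once retention is dropped.
for fraction in (0.5, 0.75, 1.0):
    print(fraction, days_needed(685325, fraction=fraction))

# Retention, even with all traffic diverted:
print("retention:", days_needed(4741212, fraction=1.0))
```

Diverting more traffic shortens the test but exposes more users to an unproven change, so the fraction is a judgment call rather than a purely statistical choice.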

In [19]:
# second approach: compute the z-score from alpha, then loop over candidate Ns,
# computing beta from dmin and the standard deviation of the sampling distribution
from scipy.stats import norm

In [26]:
def get_z_score(alpha_value):
    z = norm.ppf(alpha_value/2)
    return -z

In [19]:
def get_beta(z_star, s, d_min, N):
    SE = s / math.sqrt(N)
    return(norm.cdf(z_star * SE, loc=d_min, scale=SE))

In [20]:
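def get_size(s, d_min, Ns=20000, alpha=0.05, beta=0.2):
    z_star = get_z_score(alpha)  # compute once instead of on every iteration
    for i in range(1, Ns):
        if get_beta(z_star, s, d_min, i) <= beta:
            return i
    return None  # no N below Ns reaches the required power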
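
The loop in `get_size` can be cross-checked against a closed form: setting the expression in `get_beta` equal to beta and solving for N gives N = ((z_(1-alpha/2) + z_(1-beta)) * s / d_min)^2. The sketch below uses only the standard library (`statistics.NormalDist`). The choice s = sqrt(2 p (1 - p)), the per-unit standard deviation of a difference between two groups, is an assumption that the notebook does not state, and `closed_form_size` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def closed_form_size(s, d_min, alpha=0.05, beta=0.2):
    """Smallest N for which Phi(z - d_min * sqrt(N) / s) <= beta."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(1 - beta)
    return math.ceil(((z_alpha + z_beta) * s / d_min) ** 2)

p = 0.20625                       # baseline gross conversion
s = math.sqrt(2 * p * (1 - p))    # assumed per-unit SD of the difference
n_clicks = closed_form_size(s, d_min=0.01)
print(n_clicks)
```

Under this assumption the result is roughly 25,700 clicks per group, in the same range as the online calculator's 25835. It is also above the default `Ns=20000` search limit of `get_size`, so the loop would need a larger `Ns` to reach it.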

## Experiment Analysis

The data for the two groups is described by the following data dictionary:

    Pageviews: Number of unique cookies to view the course overview page that day.

    Clicks: Number of unique cookies to click the "Start free trial" button on the course overview page that day.

    Enrollments: Number of user-ids to enroll in the free trial that day.

    Payments: Number of user-ids who enrolled on that day and remained enrolled for 14 days, thus making a payment. (Note that the date for this column is the start date, that is, the date of enrollment, rather than the date of the payment. The payment happened 14 days later. Because of this, the enrollments and payments are tracked for 14 fewer days than the other columns.)

In [5]:
import pandas as pd
# read in the data

df_c = pd.read_excel('Final Project Results.xlsx', sheet_name = 'Control')
df_e = pd.read_excel('Final Project Results.xlsx', sheet_name = 'Experiment')

### Sanity Check of invariant metrics

A sanity check is used to ensure that the data was collected correctly and that no other factors affected the experiment. As mentioned above, I identified 3 invariant metrics, and here I use two different approaches (difference between counts and difference between probabilities) to complete the sanity check.

In [22]:
# First metric: Number of cookies (pageviews)
# We expect the total pageviews to be split evenly between the two groups;
# if not, the experiment is biased and I will not continue.
# For a count metric like this, each cookie is assigned to a group with probability 0.5,
# so the number assigned to one group is binomial. By the CLT, the observed proportion
# is approximately normal with mean 0.5 and SD sqrt(p(1-p)/N).
# Lastly, we test whether the observed proportion of cookies falls inside the 95% confidence interval.

def get_z_score(alpha_value):
    z = norm.ppf(alpha_value/2)
    return -z

pageviews_total = df_c['Pageviews'].sum() + df_e['Pageviews'].sum()
p_actual = round(df_e['Pageviews'].sum()/pageviews_total, 4)
print('The actual proportion of number of cookies in the experiment group '+ str(p_actual))

p = 0.5
alpha = 0.05
SD = math.sqrt(p*(1-p)/pageviews_total)
z_score = get_z_score(alpha)
CI_upper = round(0.5 + z_score*SD, 4)
CI_lower = round(0.5 - z_score*SD, 4)

print('The 95% confidence interval is between '+ str(CI_lower) + ' and ' +  str(CI_upper))

# Number of cookies passed the sanity check. 

The actual proportion of number of cookies in the experiment group 0.4994
The 95% confidence interval is between 0.4988 and 0.5012


In [24]:
# Second Metric: Number of clicks 

clicks_total = df_c['Clicks'].sum() + df_e['Clicks'].sum()
p_actual = round(df_e['Clicks'].sum()/clicks_total, 4)
print('The actual proportion of number of clicks in the experiment group '+ str(p_actual))

SD_click = math.sqrt(p*(1-p)/clicks_total)
CI_upper = round(0.5 + z_score*SD_click, 4)
CI_lower = round(0.5 - z_score*SD_click, 4)

print('The 95% confidence interval is between '+ str(CI_lower) + ' and ' +  str(CI_upper))

# Number of clicks passed the sanity check. 

The actual proportion of number of clicks in the experiment group 0.4995
The 95% confidence interval is between 0.4959 and 0.5041
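
The two count checks above follow the same recipe, so they can be expressed as one function. The sketch below is self-contained and uses only the standard library. As sample input it takes the pageviews for the ten days displayed in the sign-test table later in this write-up, not the full data set, so its numbers differ slightly from the cell outputs above. The helper name `count_sanity_check` is hypothetical.

```python
import math
from statistics import NormalDist

def count_sanity_check(control_count, experiment_count, alpha=0.05):
    """Check that the experiment share of a count is consistent with 0.5."""
    total = control_count + experiment_count
    observed = experiment_count / total
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * math.sqrt(0.5 * 0.5 / total)
    lower, upper = 0.5 - margin, 0.5 + margin
    return observed, (lower, upper), lower <= observed <= upper

# Daily pageviews for the ten days shown in the sign-test table.
pageviews_c = [7723, 9102, 10511, 9871, 10014, 9670, 9008, 7434, 8459, 10667]
pageviews_e = [7716, 9288, 10480, 9867, 9793, 9500, 9088, 7664, 8434, 10496]

observed, interval, passed = count_sanity_check(sum(pageviews_c), sum(pageviews_e))
print(round(observed, 4), [round(b, 4) for b in interval], passed)
```

The check passes when the observed share falls inside the interval around 0.5, which is the case for this ten-day subset.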


In [27]:
# Third Metric: CTP
# To test whether there is a significant difference of CTP between the control and experiment groups, 
# We calculate pooled standard error

p_pool = clicks_total/pageviews_total
SD_pool = math.sqrt(p_pool*(1-p_pool)*(1/df_c['Pageviews'].sum()+1/df_e['Pageviews'].sum()))
CI_upper = round(0 + z_score*SD_pool, 4)
CI_lower = round(0 - z_score*SD_pool, 4)

print('The 95% confidence interval for the difference is between '+ str(CI_lower) + ' and ' +  str(CI_upper))

CTP_control = df_c['Clicks'].sum()/df_c['Pageviews'].sum()
CTP_experiment = df_e['Clicks'].sum()/df_e['Pageviews'].sum()
CTP_diff = round(CTP_control - CTP_experiment, 4)

print('The actual difference of CTP between two groups '+ str(CTP_diff))

# CTP passed the sanity check. 

The 95% confidence interval for the difference is between -0.0013 and 0.0013
The actual difference of CTP between two groups -0.0001


## Results Analysis

### Effect Size Tests
For each of the evaluation metrics, we will calculate a 95% confidence interval around the difference between the experiment and control groups and indicate whether each metric is statistically and practically significant.

In [37]:
## Gross Conversion is defined as the number of enrollments divided by the number of clicks (dmin = 0.01)

new_c = df_c[df_c['Enrollments'].notna()] # keep only the days that have enrollment data
new_e = df_e[df_e['Enrollments'].notna()]

GC_c = new_c['Enrollments'].sum() / new_c['Clicks'].sum()
GC_e = new_e['Enrollments'].sum() / new_e['Clicks'].sum()
diff_actual = round(GC_e - GC_c, 4)
print('The difference of gross conversion between two groups is ' + str(diff_actual))

GC_pool = (new_c['Enrollments'].sum() + new_e['Enrollments'].sum())/(new_c['Clicks'].sum() + new_e['Clicks'].sum())
SD_pool = math.sqrt(GC_pool*(1-GC_pool)*(1/new_c['Clicks'].sum()+1/new_e['Clicks'].sum()))

CI_upper = round(diff_actual + z_score*SD_pool, 4)
CI_lower = round(diff_actual  - z_score*SD_pool, 4)
print('The 95% confidence interval for the difference is between '+ str(CI_lower) + ' and ' +  str(CI_upper))

The difference of gross conversion between two groups is -0.0206
The 95% confidence interval for the difference is between -0.0292 and -0.012


Based on the results above, the confidence interval from -0.0292 to -0.0120 includes neither 0 nor the practical significance boundary (-dmin = -0.01), which means the change is both statistically and practically significant. We see a significant decrease in gross conversion, as expected.

In [38]:
## Net Conversion is defined as number of payments divided by the number of clicks. (dmin= 0.0075)

NC_c = new_c['Payments'].sum() / new_c['Clicks'].sum()
NC_e = new_e['Payments'].sum() / new_e['Clicks'].sum()
diff_actual = round(NC_e - NC_c, 4)
print('The difference of net conversion between two groups is ' + str(diff_actual))

NC_pool = (new_c['Payments'].sum() + new_e['Payments'].sum())/(new_c['Clicks'].sum() + new_e['Clicks'].sum())
SD_pool = math.sqrt(NC_pool*(1-NC_pool)*(1/new_c['Clicks'].sum()+1/new_e['Clicks'].sum()))

CI_upper = round(diff_actual + z_score*SD_pool, 4)
CI_lower = round(diff_actual  - z_score*SD_pool, 4)
print('The 95% confidence interval for the difference is between '+ str(CI_lower) + ' and ' +  str(CI_upper))

The difference of net conversion between two groups is -0.0049
The 95% confidence interval for the difference is between -0.0116 and 0.0018


Because the interval includes both 0 and the practical significance boundary (-dmin = -0.0075), the change is neither statistically nor practically significant.
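
The decision rule used in both cells above can be written down explicitly. The sketch below is one way to do so, under the reading that a change is practically significant only when the whole interval lies beyond the dmin boundary. It is run on the clicks and enrollments of the ten days displayed in the sign-test table below, not on the full 23 days, so its numbers differ from the outputs above. The helper names `diff_ci` and `classify` are hypothetical.

```python
import math
from statistics import NormalDist

def diff_ci(x_c, n_c, x_e, n_e, alpha=0.05):
    """Difference in proportions (experiment - control) with a pooled-SE interval."""
    p_pool = (x_c + x_e) / (n_c + n_e)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_e))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = x_e / n_e - x_c / n_c
    return diff, diff - z * se, diff + z * se

def classify(lower, upper, d_min):
    """Return (statistically significant, practically significant)."""
    statistically = not (lower <= 0 <= upper)
    practically = upper < -d_min or lower > d_min
    return statistically, practically

# Ten days shown in the sign-test table (control, then experiment).
clicks_c = [687, 779, 909, 836, 837, 823, 748, 632, 691, 861]
enroll_c = [134, 147, 167, 156, 163, 138, 146, 110, 131, 165]
clicks_e = [686, 785, 884, 827, 832, 788, 780, 652, 697, 860]
enroll_e = [105, 116, 145, 138, 140, 129, 127, 94, 120, 153]

diff, lower, upper = diff_ci(sum(enroll_c), sum(clicks_c), sum(enroll_e), sum(clicks_e))
print(round(diff, 4), round(lower, 4), round(upper, 4), classify(lower, upper, d_min=0.01))
```

On this subset the gross conversion interval also lies entirely below -0.01, consistent with the full-data result.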

### Sign Tests

For each evaluation metric, we conduct a sign test using the day-by-day data and report the p-value of the sign test and whether the result is statistically significant.

In [48]:
## Creating the signs for each row for Gross Conversion
import numpy as np
df_new = df_c.merge(df_e, on = 'Date', suffixes=('_c', '_e'))
df_new = df_new[df_new['Enrollments_c'].notna()].copy() # keep days with enrollment data; copy so new columns can be added safely
gc_c = df_new['Enrollments_c']/df_new['Clicks_c']
gc_e = df_new['Enrollments_e']/df_new['Clicks_e']
df_new['Sign_gc'] = np.where(gc_e < gc_c, 1, 0)
df_new

Unnamed: 0,Date,Pageviews_c,Clicks_c,Enrollments_c,Payments_c,Pageviews_e,Clicks_e,Enrollments_e,Payments_e,Sign_gc
0,"Sat, Oct 11",7723,687,134.0,70.0,7716,686,105.0,34.0,1
1,"Sun, Oct 12",9102,779,147.0,70.0,9288,785,116.0,91.0,1
2,"Mon, Oct 13",10511,909,167.0,95.0,10480,884,145.0,79.0,1
3,"Tue, Oct 14",9871,836,156.0,105.0,9867,827,138.0,92.0,1
4,"Wed, Oct 15",10014,837,163.0,64.0,9793,832,140.0,94.0,1
5,"Thu, Oct 16",9670,823,138.0,82.0,9500,788,129.0,61.0,1
6,"Fri, Oct 17",9008,748,146.0,76.0,9088,780,127.0,44.0,1
7,"Sat, Oct 18",7434,632,110.0,70.0,7664,652,94.0,62.0,1
8,"Sun, Oct 19",8459,691,131.0,60.0,8434,697,120.0,77.0,1
9,"Mon, Oct 20",10667,861,165.0,97.0,10496,860,153.0,98.0,1


In [49]:
nc_c = df_new['Payments_c']/df_new['Clicks_c']
nc_e = df_new['Payments_e']/df_new['Clicks_e']
df_new['Sign_nc'] = np.where(nc_e < nc_c, 1, 0)
df_new

Unnamed: 0,Date,Pageviews_c,Clicks_c,Enrollments_c,Payments_c,Pageviews_e,Clicks_e,Enrollments_e,Payments_e,Sign_gc,Sign_nc
0,"Sat, Oct 11",7723,687,134.0,70.0,7716,686,105.0,34.0,1,1
1,"Sun, Oct 12",9102,779,147.0,70.0,9288,785,116.0,91.0,1,0
2,"Mon, Oct 13",10511,909,167.0,95.0,10480,884,145.0,79.0,1,1
3,"Tue, Oct 14",9871,836,156.0,105.0,9867,827,138.0,92.0,1,1
4,"Wed, Oct 15",10014,837,163.0,64.0,9793,832,140.0,94.0,1,0
5,"Thu, Oct 16",9670,823,138.0,82.0,9500,788,129.0,61.0,1,1
6,"Fri, Oct 17",9008,748,146.0,76.0,9088,780,127.0,44.0,1,1
7,"Sat, Oct 18",7434,632,110.0,70.0,7664,652,94.0,62.0,1,1
8,"Sun, Oct 19",8459,691,131.0,60.0,8434,697,120.0,77.0,1,0
9,"Mon, Oct 20",10667,861,165.0,97.0,10496,860,153.0,98.0,1,0


In [53]:
df_new.shape[0]
sign_p_gc = df_new[df_new['Sign_gc'] == 1].shape[0]
sign_p_nc = df_new[df_new['Sign_nc'] == 1].shape[0]
print('number of gc in experiment lower than the one in control is '+ str(sign_p_gc) + ' out of ' + str(df_new.shape[0]))
print('number of nc in experiment lower than the one in control is '+ str(sign_p_nc) + ' out of ' + str(df_new.shape[0]))

number of gc in experiment lower than the one in control is 19 out of 23
number of nc in experiment lower than the one in control is 13 out of 23


### Sign Test Design

The sign test uses the number of positive or negative signs to determine whether the difference reflects a real effect or simply happened by chance. Under the null hypothesis, the chance that sign == 1 on any given day is 50%, so the number of such days follows a binomial distribution. We can then use the binomial formula to calculate the p-value and determine whether the difference is significant.

Here we use a two-tailed test because we want to determine whether there is any difference between the groups, which covers both positive and negative differences. A one-tailed test, on the other hand, only detects a difference in one specific direction. In our case, we are not completely uninterested in the outcome where the experiment group is higher than the control group.

In [63]:
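def pvalue(x, n):
    """
    Takes the number of successes x and the sample size n,
    and returns the two-tailed p-value.
    Assumes x >= n/2, so the upper tail P(X >= x) is the smaller one.
    """
    p = 0
    for i in range(n, x-1, -1):  # sum the upper tail from n down to x (was hard-coded to 23)
        p += math.factorial(n)/(math.factorial(i)*math.factorial(n-i)) * 0.5**i * (1-0.5)**(n-i)
    p_value = 2 * p
    return p_value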

In [64]:
pvalue(19, 23) # p value for gc

0.002599477767944336

In [65]:
pvalue(13,23) # p value for nc

0.6776394844055176

As a result, the p-value for gross conversion is lower than 0.05, so the difference is statistically significant. This is not the case for net conversion.
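
The `pvalue` function above only sums the upper tail, so it is valid when the number of successes is at least half the number of days. A slightly more general sketch, using `math.comb` from the standard library (Python 3.8+), picks the more extreme tail automatically and caps the result at 1. The name `sign_test_pvalue` is hypothetical.

```python
import math

def sign_test_pvalue(successes, n):
    """Exact two-sided sign test p-value under P(success) = 0.5."""
    k = max(successes, n - successes)   # count in the more extreme tail
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_pvalue(19, 23))   # gross conversion
print(sign_test_pvalue(13, 23))   # net conversion
print(sign_test_pvalue(4, 23))    # mirror image of 19 out of 23
```

The first two calls reproduce the p-values reported above, and the third shows that 4 out of 23 is treated the same as 19 out of 23, as a two-sided test requires.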

## Recommendation 

Going back to our goal, Udacity wants to find out whether the change reduces the number of students who do not have enough time, so that the overall student learning experience improves, without significantly reducing the number of students who end up purchasing the course. The A/B test shows that the decrease in gross conversion is both statistically and practically significant, while net conversion does not show a significant difference. Thus, we see fewer students enrolling in the free trial without a significant decrease in the number of students who end up purchasing the course. The change achieves the goal, and I recommend launching it.