                                 A/B Testing Project - Udacity Free Trial Screener 

The Udacity course home pages have two options: "start free trial" and "access course materials." Clicking "start free trial" prompts the user to enter their credit card information, subsequently enrolling them in a 14 day free trial of the course, after which they are automatically charged. Users who click "access course materials" will be able to view course content but receive no coaching support, verified certificate, or project feedback.

For this experiment Udacity tested a change wherein those users who clicked "start free trial" were asked how much time they were willing to devote to the course. Users choosing 5 or more hours per week would be taken through the checkout process as usual. For users indicating fewer than 5 hours per week a message would appear indicating the need for a greater time commitment to enable success and suggesting they might like to access the free content. At this point the student would have the option to continue enrolling in the free trial or access the course materials for free.

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

In [1]:
# import tools
from __future__ import division  # must come before any other import

import numpy as np
import pandas as pd

In [2]:
# import baseline data
df_baseline = pd.read_csv("Final Project Baseline Values.csv", index_col=False, header=None, names=['metric', 'baseline_value'])
df_baseline

Unnamed: 0,metric,baseline_value
0,Unique cookies to view page per day:,40000.0
1,"Unique cookies to click ""Start free trial"" per...",3200.0
2,Enrollments per day:,660.0
3,"Click-through-probability on ""Start free trial"":",0.08
4,"Probability of enrolling, given click:",0.20625
5,"Probability of payment, given enroll:",0.53
6,"Probability of payment, given click",0.109313


In [3]:
# import experiment data
df_control = pd.read_csv("Final Project Results _Control.csv")
df_experiment = pd.read_csv("Final Project Results _Experiment.csv")

In [4]:
# Preview control group data
df_control.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


In [5]:
# Preview experiment group data
df_experiment.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0


Measuring Standard Deviation

The numbers of clicks and enrollments follow a binomial distribution, so by the central limit theorem the three rates (gross conversion, retention, and net conversion) are approximately Gaussian for samples of this size.

Given a daily sample of 5000 cookies, the expected numbers of clicks and enrollments can be scaled from the baseline values.

In [6]:
# Given baseline values
cookie_base = 40000
click_base = 3200
enrollment_base = 660

# Daily pageview cookies
dailycookie = 5000
n_click = dailycookie * click_base / cookie_base
n_enrollment = dailycookie * enrollment_base / cookie_base

In [7]:
# Print out the results
print 'the number of clicks is %d' % (n_click)
print 'the number of enrollment is %f' % (n_enrollment)

the number of clicks is 400
the number of enrollment is 82.500000


In [8]:
# Calculate standard deviation
# Gross conversion
grossconv_sd = round(np.sqrt((.206250*(1-.206250))/400),4)
print 'the standard deviation of gross conversion is %s' %(grossconv_sd)

the standard deviation of gross conversion is 0.0202


In [9]:
# Retention
retention_sd = round(np.sqrt((.53*(1-.53))/82.5),4)
print 'the standard deviation of retention is %s' %(retention_sd)

the standard deviation of retention is 0.0549


In [10]:
# Net conversion
netconv_sd = round(np.sqrt((.109313*(1-.109313))/400),4)
print 'the standard deviation of net conversion is %s' %(netconv_sd)

the standard deviation of net conversion is 0.0156
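
The three calculations above share one formula, the analytic standard deviation of a binomial proportion, sqrt(p(1-p)/n). A minimal sketch that wraps it in a helper and reproduces the printed values; the name `binomial_sd` is illustrative and does not appear in the original notebook.

```python
import math

def binomial_sd(p, n):
    """Analytic standard deviation of a proportion p estimated from n units."""
    return math.sqrt(p * (1.0 - p) / n)

# Baseline probabilities and per-day unit counts taken from the cells above
gross_sd = round(binomial_sd(0.20625, 400), 4)     # unit of diversion: clicks
retention_sd = round(binomial_sd(0.53, 82.5), 4)   # unit: enrollments
net_sd = round(binomial_sd(0.109313, 400), 4)      # unit: clicks

print("gross conversion: %s, retention: %s, net conversion: %s"
      % (gross_sd, retention_sd, net_sd))
```

Note that the denominator differs by metric: gross and net conversion are measured per click, while retention is measured per enrollment, which is why its standard deviation is the largest of the three.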


Sizing: Number of Samples vs. Power

The three evaluation metrics in the experiment are highly correlated, so the Bonferroni correction would be too conservative and is not used in the analysis phase.

Given a type I error rate of α = 0.05, a type II error rate of β = 0.20, and the minimum detectable effect for each evaluation metric, the sample size required to power the experiment appropriately can be calculated using Evan Miller's online sample size calculator. The total number of pageviews is then obtained by dividing by the unit-to-pageview ratio:
Click/pageview ratio = 3200/40000 = 0.08
Enrollment/pageview ratio = 660/40000 = 0.0165
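
The sizing step was done in a web tool, so it is not reproduced in the notebook. The sketch below uses the standard two-proportion sample-size formula, which is an assumption about what the calculator computes, with the z-scores for α = 0.05 (two-sided) and β = 0.20 hard-coded. The function name and the minimum detectable effect of 0.01 in the usage line are illustrative only; this write-up does not list the values that were actually entered.

```python
import math

Z_ALPHA = 1.959964  # two-sided z-score for alpha = 0.05
Z_BETA = 0.841621   # z-score for power = 1 - beta = 0.80

def sample_size_per_group(p, d_min):
    """Units needed per group to detect an absolute change d_min from baseline p."""
    sd_null = math.sqrt(2.0 * p * (1.0 - p))
    sd_alt = math.sqrt(p * (1.0 - p) + (p + d_min) * (1.0 - p - d_min))
    n = ((Z_ALPHA * sd_null + Z_BETA * sd_alt) / d_min) ** 2
    return int(math.ceil(n))

# Hypothetical input: baseline gross conversion with an assumed d_min of 0.01
n_clicks = sample_size_per_group(0.20625, 0.01)
print("clicks needed per group: %d" % n_clicks)
```

The result is a count of clicks, the unit for gross conversion, so it still has to be divided by the click/pageview ratio to obtain pageviews.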

Table 2. Results of sample size calculation
Evaluation Metric	Baseline Value	Standard Deviation	Sample Size (per group)	Unit/Pageview Ratio	Total Pageviews (both groups)
Gross conversion	0.20625	0.0202	25,838	0.08	645,950
Retention	0.53	0.0549	39,115	0.0165	4,741,213
Net conversion	0.1093125	0.0156	27,413	0.08	685,325
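
The totals in the table above are consistent with treating each sample size as a per-group figure: it is doubled to cover control and experiment, then divided by the unit/pageview ratio. A short sketch of that arithmetic, using the sample sizes and ratios from the table (the function name is illustrative):

```python
def total_pageviews(sample_size, unit_per_pageview, n_groups=2):
    """Pageviews needed so that every group collects sample_size units."""
    return n_groups * sample_size / float(unit_per_pageview)

for name, n, ratio in [("Gross conversion", 25838, 0.08),
                       ("Retention", 39115, 0.0165),
                       ("Net conversion", 27413, 0.08)]:
    print("%s: %.0f pageviews" % (name, total_pageviews(n, ratio)))
```

Retention needs far more pageviews than the other two metrics because only 1.65% of pageviews result in an enrollment.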


Duration vs. Exposure

With a daily baseline of 40,000 pageviews, collecting enough pageviews for retention would take about 119 days (4,741,213/40,000), even if 100% of traffic were diverted. That is unreasonably long for an A/B test, so I drop retention as an evaluation metric, which reduces the required total to 685,325 pageviews. Because the change is small and poses little risk to the overall business, I choose to direct 70% of traffic (40,000 × 0.7 = 28,000 pageviews per day) to the experiment. The experiment would therefore take approximately 25 days (685,325/28,000 ≈ 24.5, rounded up).
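
The duration arithmetic can be checked with a few lines. This is a minimal sketch; the function name is illustrative, and it rounds up because a partial day still requires a full day of data collection.

```python
import math

def days_needed(pageviews, daily_pageviews, traffic_fraction):
    """Whole days required to collect pageviews at the given diversion rate."""
    per_day = float(daily_pageviews) * traffic_fraction
    return int(math.ceil(pageviews / per_day))

retention_days = days_needed(4741213, 40000, 1.0)  # retention, 100% of traffic
net_conv_days = days_needed(685325, 40000, 0.7)    # net conversion, 70% of traffic

print("retention: %d days, net conversion: %d days"
      % (retention_days, net_conv_days))
```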


Sanity Checks

Having conducted the experiment, we need to check each invariant metric to confirm that the underlying assumptions hold. Cookies and clicks are expected to be split evenly between the control and experiment groups. Using an expected diversion rate of 0.5, the standard error can be calculated and a 95% confidence interval constructed around the expected value; a metric passes if its observed fraction falls inside that interval.

In [21]:
# Summarize data to create a table 
results = {"Control":pd.Series([df_control.Pageviews.sum(),df_control.Clicks.sum()],
                                  index = ["cookies","clicks"]),
           "Experiment":pd.Series([df_experiment.Pageviews.sum(),df_experiment.Clicks.sum()],
                               index = ["cookies","clicks"])}
df_results = pd.DataFrame(results)
df_results

Unnamed: 0,Control,Experiment
cookies,345543,344660
clicks,28378,28325


In [22]:
# Calculate SE, ME, CI and Diff
df_results['Total'] = df_results.Control + df_results.Experiment
df_results['Expected_value'] = 0.5
df_results['Observed_value'] = df_results.Experiment/df_results.Total
# Standard error of the observed fraction under the null: sqrt(p * (1 - p) / N)
df_results['Standard_Error'] = np.sqrt(df_results.Expected_value * (1 - df_results.Expected_value) / df_results.Total)
df_results['Marginal_Error'] = df_results.Standard_Error * 1.96
df_results['CI_lower'] = df_results.Expected_value - df_results.Marginal_Error
df_results['CI_upper'] = df_results.Expected_value + df_results.Marginal_Error
df_results["Pass_Sanity"] = df_results.apply(lambda x: (x.Observed_value > x.CI_lower) and (x.Observed_value < x.CI_upper),axis=1)
df_results['Diff'] = abs((df_results.Experiment - df_results.Control)/df_results.Total)

In [23]:
# Overview the results
df_results

Unnamed: 0,Control,Experiment,Total,Expected_value,Observed_value,Standard_Error,Marginal_Error,CI_lower,CI_upper,Pass_Sanity,Diff
cookies,345543,344660,690203,0.5,0.49936,0.000602,0.00118,0.49882,0.50118,True,0.001279
clicks,28378,28325,56703,0.5,0.499533,0.0021,0.004116,0.495884,0.504116,True,0.000935
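
As a cross-check, the same sanity test can be written as a standalone function that does not depend on pandas. This sketch assumes the standard error of the observed fraction under the null hypothesis is sqrt(p(1-p)/N), where N is the combined count of both groups; the function name is illustrative.

```python
import math

def sanity_check(n_control, n_experiment, p=0.5, z=1.96):
    """Return (observed fraction, CI lower, CI upper, passed) for an invariant metric."""
    total = n_control + n_experiment
    std_err = math.sqrt(p * (1.0 - p) / total)
    margin = z * std_err
    observed = n_experiment / float(total)
    lower, upper = p - margin, p + margin
    return observed, lower, upper, lower < observed < upper

# Totals from the summary table above
for name, n_cont, n_exp in [("cookies", 345543, 344660),
                            ("clicks", 28378, 28325)]:
    observed, lower, upper, passed = sanity_check(n_cont, n_exp)
    print("%s: observed %.4f, CI [%.4f, %.4f], pass=%s"
          % (name, observed, lower, upper, passed))
```

Both invariant metrics fall inside their intervals, so the split between groups looks consistent with a 50/50 diversion.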


Result Analysis

Effect Size Tests

For each evaluation metric, both statistical and practical significance (whether the size of the effect is relevant from a business standpoint) should be tested. The minimum detectable effect is the smallest difference between the experiment and control groups that we would consider practically significant.

Using the data collected, we calculate the rate in the experiment and control groups for each evaluation metric (gross conversion and net conversion), and then define a new variable that is the difference between the rates (experiment - control). Around this difference we construct a 95% confidence interval, using the pooled rate of both groups to estimate the standard error.

In [24]:
# Subset data by dropping rows with null enrollments, using each group's own mask

df_control_sub = df_control[df_control.Enrollments.notnull()]
df_experiment_sub = df_experiment[df_experiment.Enrollments.notnull()]

In [27]:
# summarize data to create a table

results_sub = {"Control":pd.Series([df_control_sub.Pageviews.sum(),df_control_sub.Clicks.sum(),
                                  df_control_sub.Enrollments.sum(),df_control_sub.Payments.sum()],
                                  index = ["cookies","clicks","enrollments","payments"]),
           "Experiment":pd.Series([df_experiment_sub.Pageviews.sum(),df_experiment_sub.Clicks.sum(),
                               df_experiment_sub.Enrollments.sum(),df_experiment_sub.Payments.sum()],
                               index = ["cookies","clicks","enrollments","payments"])}
df_results_sub = pd.DataFrame(results_sub)

# Preview new data summary
df_results_sub

Unnamed: 0,Control,Experiment
cookies,212163.0,211362.0
clicks,17293.0,17260.0
enrollments,3785.0,3423.0
payments,2033.0,1945.0


In [28]:
# Create a new variable: Total
df_results_sub['Total'] = df_results_sub.Control + df_results_sub.Experiment
df_results_sub

Unnamed: 0,Control,Experiment,Total
cookies,212163.0,211362.0,423525.0
clicks,17293.0,17260.0,34553.0
enrollments,3785.0,3423.0,7208.0
payments,2033.0,1945.0,3978.0


In [30]:
# Get values for calculation

# experiment values

enrollments_exp = df_results_sub.loc["enrollments"].Experiment
clicks_exp = df_results_sub.loc["clicks"].Experiment
payments_exp = df_results_sub.loc["payments"].Experiment

# control values

enrollments_cont = df_results_sub.loc["enrollments"].Control
clicks_cont = df_results_sub.loc["clicks"].Control
payments_cont = df_results_sub.loc["payments"].Control

In [34]:
# Calculate GrossConversion for both control and experiment groups, as well as for total 

GrossConversion_exp = enrollments_exp/clicks_exp
GrossConversion_cont = enrollments_cont/clicks_cont
GrossConversion_total = (enrollments_exp + enrollments_cont)/(clicks_exp + clicks_cont)

print 'the Gross Conversion for experiment group is %s, '\
      'the Gross Conversion for control group is %s, '\
      'the Gross Conversion for total is %s' %(GrossConversion_exp, GrossConversion_cont,GrossConversion_total )

the Gross Conversion for experiment group is 0.1983198146, the Gross Conversion for control group is 0.218874689181, the Gross Conversion for total is 0.208607067404


In [37]:
# Calculate NetConversion for both control and experiment groups, as well as for total 

NetConversion_exp = payments_exp/clicks_exp
NetConversion_cont = payments_cont/clicks_cont
NetConversion_total = (payments_exp + payments_cont)/(clicks_exp + clicks_cont)

print 'the Net Conversion for experiment group is %s, '\
      'the Net Conversion for control group is %s, '\
      'the Net Conversion for total is %s' %(NetConversion_exp, NetConversion_cont,NetConversion_total )

the Net Conversion for experiment group is 0.11268829664, the Net Conversion for control group is 0.117562019314, the Net Conversion for total is 0.115127485312


In [38]:
# calculate the difference between two groups
GrossConversion_diff = GrossConversion_exp - GrossConversion_cont
GrossConversion_diff

-0.020554874580361565

In [39]:
NetConversion_diff = NetConversion_exp - NetConversion_cont
NetConversion_diff

-0.0048737226745441675

In [40]:
# define a function to construct 95% CI

def construct_CI (p_hat,z_score,N_cont,N_exp,diff):
    std_err = np.sqrt((p_hat * (1- p_hat ))*(1/N_cont + 1/N_exp))
    marg_err = z_score * std_err
    ci_lower = diff - marg_err
    ci_upper = diff + marg_err
    
    return std_err,marg_err,ci_lower,ci_upper

In [42]:
# Get statistics for Gross Conversion

se_gross,me_gross,cil_gross,ciu_gross = construct_CI(GrossConversion_total,1.96,clicks_cont, clicks_exp,GrossConversion_diff)

print se_gross,me_gross,cil_gross,ciu_gross

0.00437167538523 0.00856848375504 -0.0291233583354 -0.0119863908253


In [43]:
# Get statistics for Net Conversion

se_net,me_net,cil_net,ciu_net = construct_CI(NetConversion_total,1.96, clicks_cont, clicks_exp, NetConversion_diff)

print se_net,me_net,cil_net,ciu_net

0.00343413351293 0.00673090168535 -0.0116046243599 0.0018571790108


Summary of effect size tests: the 95% confidence interval for gross conversion, [-0.0291, -0.0120], contains neither zero nor the practical significance boundary (the minimum detectable effect), so the change is both statistically and practically significant. The 95% confidence interval for net conversion, [-0.0116, 0.0019], contains both zero and the practical significance boundary, so the change is neither statistically nor practically significant.

Sign Tests

To further test each evaluation metric, a binomial sign test can be run over the days of the experiment: for each day, we check whether the difference across groups (experiment - control) is positive or negative. Each positive difference counts as a success and each negative difference as a failure. The resulting two-tailed p-value for each metric is then compared with α = 0.05 to determine significance.

In [52]:
# Create a summary table for each day
df_signtest =pd.DataFrame()
df_signtest['Date']=df_control_sub.Date
df_signtest['GC_cont'] = df_control_sub.Enrollments/df_control_sub.Clicks
df_signtest['GC_exp'] = df_experiment_sub.Enrollments/df_experiment_sub.Clicks
df_signtest['NC_cont'] = df_control_sub.Payments/df_control_sub.Clicks
df_signtest['NC_exp'] = df_experiment_sub.Payments/df_experiment_sub.Clicks

In [61]:
# Calculate positive or negative difference across groups 

df_signtest['GC_Sign'] = df_signtest.GC_exp - df_signtest.GC_cont
df_signtest['NC_Sign'] = df_signtest.NC_exp - df_signtest.NC_cont

df_signtest.head()

Unnamed: 0,Date,GC_cont,GC_exp,NC_cont,NC_exp,GC_Sign,NC_Sign
0,"Sat, Oct 11",0.195051,0.153061,0.101892,0.049563,-0.04199,-0.05233
1,"Sun, Oct 12",0.188703,0.147771,0.089859,0.115924,-0.040933,0.026065
2,"Mon, Oct 13",0.183718,0.164027,0.10451,0.089367,-0.019691,-0.015144
3,"Tue, Oct 14",0.186603,0.166868,0.125598,0.111245,-0.019735,-0.014353
4,"Wed, Oct 15",0.194743,0.168269,0.076464,0.112981,-0.026474,0.036517


In [62]:
# Count the number of successes (days with a positive difference)
Total = len(df_signtest)
GC_success = len(df_signtest[df_signtest.GC_Sign > 0])
NC_success = len(df_signtest[df_signtest.NC_Sign > 0])

print 'The number of trial is %s,'\
      'The number of Gross Conversion success is %s,'\
      'The number of Net Conversion success is %s.' %(Total,GC_success,NC_success)


The number of trial is 23,The number of Gross Conversion success is 4,The number of Net Conversion success is 10.


In [59]:
# %R sum(dbinom(19:23,size=23,p=0.5)) * 2
# Gross conversion: p = 0.002599478 < 0.05, reject H0

# %R sum(dbinom(13:23,size=23,p=0.5)) * 2
# Net conversion: p = 0.6776395 > 0.05, fail to reject H0
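
The p-values above were computed in R. A rough Python equivalent, using only the standard library, is sketched below; the function names are illustrative. It doubles the smaller tail of a Binomial(n, 0.5) distribution and assumes no day has a difference of exactly zero.

```python
from math import factorial

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    coeff = factorial(n) // (factorial(k) * factorial(n - k))
    return coeff * p ** k * (1.0 - p) ** (n - k)

def sign_test_p_value(successes, n):
    """Two-tailed sign-test p-value under H0: success probability is 0.5."""
    tail = min(successes, n - successes)
    p_value = 2.0 * sum(binom_pmf(k, n) for k in range(tail + 1))
    return min(p_value, 1.0)

# Success counts from the cell above: 4 of 23 days and 10 of 23 days
p_gross = sign_test_p_value(4, 23)
p_net = sign_test_p_value(10, 23)

print("gross conversion: p = %.6f" % p_gross)
print("net conversion:   p = %.6f" % p_net)
```

Because the binomial distribution is symmetric at p = 0.5, summing the lower tail (0 to 4 successes) gives the same result as the upper tail (19 to 23) used in the R call.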

Significance definitions
A metric is statistically significant if the confidence interval does not include 0 (that is, you can be confident there was a change), and it is practically significant if the confidence interval does not include the practical significance boundary (that is, you can be confident there is a change that matters to the business.)
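
These definitions can be expressed as a small helper. In this sketch, practical significance is read as the whole interval lying beyond the boundary on one side, which is one interpretation of the definition above. The function name is illustrative, and the `d_min` values in the usage lines are assumed for illustration because the write-up does not list them; the interval bounds are the ones printed in the effect size tests.

```python
def classify(ci_lower, ci_upper, d_min):
    """Return (statistically significant, practically significant) for a CI."""
    # Statistical: zero lies outside the interval
    statistical = not (ci_lower <= 0.0 <= ci_upper)
    # Practical: the whole interval lies beyond the boundary on one side
    practical = ci_upper < -d_min or ci_lower > d_min
    return statistical, practical

# d_min values below are assumptions, not taken from the notebook
gross_flags = classify(-0.0291233583354, -0.0119863908253, d_min=0.01)
net_flags = classify(-0.0116046243599, 0.0018571790108, d_min=0.0075)

print("gross conversion: statistical=%s, practical=%s" % gross_flags)
print("net conversion:   statistical=%s, practical=%s" % net_flags)
```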