# A/B Testing Analysis: Free Trial Screener 

In [111]:
import math
import pandas as pd
from scipy.stats import binom_test  # removed in SciPy 1.12; newer releases provide binomtest instead

## Experiment Overview:
At the time of this experiment, Udacity courses have two options on the course overview page: "start free trial" and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.


In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. This screenshot shows what the experiment looks like.


The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.


The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.
Any place "unique cookies" are mentioned, the uniqueness is determined by day. (That is, the same cookie visiting on different days would be counted twice.) User-ids are automatically unique since the site does not allow the same user-id to enroll twice.

## Metric Choice
* Number of cookies: That is, number of unique cookies to view the course overview page. (dmin=3000)
* Number of user-ids: That is, number of users who enroll in the free trial. (dmin=50)
* Number of clicks: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
* Click-through-probability: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
* Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin=0.01)
* Retention: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
* Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin=0.0075)

Since the free trial screener will only influence the steps at and after enrollment, any metrics related to the steps before enrollment should be invariant.

Invariants:
* Number of cookies
* Number of clicks
* Click-through probability

Evaluation Metrics:
* Gross conversion
* Retention
* Net conversion

### Calculate the Standard Deviation for Evaluation Metrics

Whether a visitor clicks or enrolls is a yes/no outcome, so each metric follows a Bernoulli distribution. The standard deviation of a Bernoulli distribution is $$SD = \sqrt{p(1-p)}$$

Because each metric is a proportion estimated from $n$ units, this project uses the standard error of that proportion instead: $$SE = \sqrt{\frac{p(1-p)}{n}}$$

In [2]:
n_pageviews=40000 #Unique cookies to view course overview page per day
n_clicks=3200 #Unique cookies to click "Start free trial" per day
n_enroll=660 #Enrollments per day
n_sample=5000 # number of cookies visiting the course overview page

click_through_probability=0.08 #clicks / pageviews
gross_conversion=0.20625 # enroll / click
retention=0.53 # payment / enroll
net_conversion=0.1093125 # payment / click

In [11]:
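# calculate the analytic standard error of each evaluation metric;
# n is scaled from the pageview sample to the metric's own denominator (clicks or enrollments)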

se_gross_conversion = round(math.sqrt(gross_conversion*(1-gross_conversion)/(n_clicks/n_pageviews*n_sample)), 4)
se_retention = round(math.sqrt(retention*(1-retention)/(n_enroll/n_pageviews*n_sample)), 4)
se_net_conversion = round(math.sqrt(net_conversion*(1-net_conversion)/(n_clicks/n_pageviews*n_sample)), 4)

In [12]:
print("standard error of gross conversion is: " + str(se_gross_conversion))
print("standard error of retention is: " + str(se_retention))
print("standard error of net conversion is: " + str(se_net_conversion))

standard error of gross conversion is: 0.0202
standard error of retention is: 0.0549
standard error of net conversion is: 0.0156
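
The scaling of $n$ in the cell above can be written out by hand for gross conversion. Its denominator is clicks rather than pageviews, so a sample of 5000 pageviews contributes only the share of cookies that click:

$$n_{clicks} = 5000 \cdot \frac{3200}{40000} = 400, \qquad SE_{gross} = \sqrt{\frac{0.20625 \cdot (1 - 0.20625)}{400}} \approx 0.0202$$

Retention is measured per enrollment, so the same 5000 pageviews yield only $5000 \cdot 660 / 40000 = 82.5$ units, which is why its standard error is the largest of the three.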


### Calculate Sizing of Evaluation Metrics

We use Evan Miller's online sample size calculator, with alpha = 0.05 and beta = 0.2 (80% power), to calculate the sample sizes: https://www.evanmiller.org/ab-testing/sample-size.html#!20;80;5;5;0

* Gross conversion:
    * Baseline rate: 20.625%
    * Minimum detectable effect: 0.01
    * Sample size: 25,835 clicks/group
    * Total sample size: 25,835 * 2 = 51,670 clicks
    * Pageviews = 51,670 / 0.08 (clicks / pageviews) = 645,875

* Retention:
    * Baseline rate: 53%
    * Minimum detectable effect: 0.01
    * Sample size: 39,115 enrollments/group
    * Total sample size: 39,115 * 2 = 78,230 enrollments
    * Pageviews = 78,230 / (660 / 40,000) (enrollments / pageviews) = 4,741,212

* Net conversion:
    * Baseline rate: 10.93125%
    * Minimum detectable effect: 0.0075
    * Sample size: 27,413 clicks/group
    * Total sample size: 27,413 * 2 = 54,826 clicks
    * Pageviews = 54,826 / 0.08 (clicks / pageviews) = 685,325
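
The calculator's numbers can be approximated offline. The sketch below uses a common normal-approximation formula for comparing two proportions. It is an assumption that this matches what the online calculator does internally, and the function name `sample_size_per_group` is introduced here only for illustration. The results should land close to the per-group sizes listed above, though small rounding differences are possible.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p, d_min, alpha=0.05, beta=0.2):
    """Approximate units needed per group to detect an absolute change d_min.

    Hypothetical helper: a textbook two-proportion approximation, not code
    taken from the original notebook.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(1 - beta)        # power = 1 - beta
    # Shift the baseline toward 0.5, where the variance is larger (conservative).
    shifted = p + d_min if p < 0.5 else p - d_min
    sd_null = math.sqrt(2 * p * (1 - p))
    sd_alt = math.sqrt(p * (1 - p) + shifted * (1 - shifted))
    return (z_alpha * sd_null + z_beta * sd_alt) ** 2 / d_min ** 2


for name, p, d_min in [
    ("gross conversion", 0.20625, 0.01),
    ("retention", 0.53, 0.01),
    ("net conversion", 0.1093125, 0.0075),
]:
    print(name, round(sample_size_per_group(p, d_min)))
```

Dividing twice the per-group size by the fraction of pageviews that reach the metric's denominator (0.08 for clicks, 660/40000 for enrollments) then gives the pageview requirement, as in the list above.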

### Calculate the Duration of Running the Experiments
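Retention needs far too many pageviews (about 4.7 million) to reach the desired sensitivity (alpha = 0.05 and beta = 0.2), so going forward we keep only gross conversion and net conversion as evaluation metrics. With 70% of daily traffic diverted to the experiment, the larger of the two remaining requirements (685,325 pageviews) takes about 24.5 days, that is, 25 full days. This is less than a month, so running the experiment is not too risky for the company.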

In [28]:
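maximum_pageviews = 685325  # pageviews needed for net conversion, the larger requirement
fraction = 0.7  # fraction of daily traffic diverted to the experiment
duration = maximum_pageviews/(fraction*n_pageviews)  # in days
duration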

24.475892857142856
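
To see how the choice of traffic fraction drives the duration, the calculation can be repeated for a few values. This is a small sketch: `days_needed` is a hypothetical helper, and the alternative fractions 0.5 and 1.0 are assumptions chosen only for comparison. It rounds up, because a partial day still has to be run in full.

```python
import math


def days_needed(total_pageviews, daily_pageviews, fraction):
    """Whole days required when `fraction` of daily traffic enters the experiment."""
    return math.ceil(total_pageviews / (daily_pageviews * fraction))


daily = 40000  # baseline unique cookies per day, from the notebook

# Net conversion requirement at different traffic fractions
for fraction in (0.5, 0.7, 1.0):
    print(fraction, days_needed(685325, daily, fraction))

# Retention requirement, even with all traffic diverted
print("retention:", days_needed(4741212, daily, 1.0))
```

Even at full traffic the retention requirement runs to roughly four months, which is the reason retention is dropped as an evaluation metric.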

### Sanity Check for the Invariants

Start by checking whether the invariant metrics are equivalent between the two groups. If the invariant metric is a simple count that should be randomly split between the two groups, we can use a binomial test. Otherwise, we need to construct a confidence interval for a difference in proportions, then check whether the observed difference between the groups falls within that confidence interval.

In [32]:
df_control = pd.read_csv("Final Project Results - Control.csv")
df_exp = pd.read_csv("Final Project Results - Experiment.csv")

In [33]:
df_control.head()

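|   | Date | Pageviews | Clicks | Enrollments | Payments |
|---|------|-----------|--------|-------------|----------|
| 0 | Sat, Oct 11 | 7723 | 687 | 134.0 | 70.0 |
| 1 | Sun, Oct 12 | 9102 | 779 | 147.0 | 70.0 |
| 2 | Mon, Oct 13 | 10511 | 909 | 167.0 | 95.0 |
| 3 | Tue, Oct 14 | 9871 | 836 | 156.0 | 105.0 |
| 4 | Wed, Oct 15 | 10014 | 837 | 163.0 | 64.0 |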


The observed fraction of pageviews assigned to the control group is about 0.5006, which falls inside the 95% confidence interval, so this invariant metric passes.

In [42]:
# Pageviews
sum_pageviews_con = df_control['Pageviews'].sum()
sum_pageviews_exp = df_exp['Pageviews'].sum()
prob_of_assign = sum_pageviews_con/(sum_pageviews_con+sum_pageviews_exp)
SD_pageviews = math.sqrt(0.5*(1-0.5)/(sum_pageviews_con+sum_pageviews_exp))
lower_bound = 0.5 - 1.96*SD_pageviews
upper_bound = 0.5 + 1.96*SD_pageviews

print("Confidence Interval for pageviews: [{},{}]".format(round(lower_bound,4),round(upper_bound,4)))
print("Observed: ",round(prob_of_assign,4))

Confidence Interval for pageviews: [0.4988,0.5012]
Observed:  0.5006


In [44]:
# enrollment (not an invariant: the screener is expected to change it, so it may fall outside the interval)
sum_enrollment_con = df_control['Enrollments'].sum()
sum_enrollment_exp = df_exp['Enrollments'].sum()
prob_of_assign = sum_enrollment_con/(sum_enrollment_con+sum_enrollment_exp)
SD_enrollment = math.sqrt(0.5*(1-0.5)/(sum_enrollment_con+sum_enrollment_exp))
lower_bound = 0.5 - 1.96*SD_enrollment
upper_bound = 0.5 + 1.96*SD_enrollment

print("Confidence Interval for enrollment: [{},{}]".format(round(lower_bound,4),round(upper_bound,4)))
print("Observed: ",round(prob_of_assign,4))

Confidence Interval for enrollment: [0.4885,0.5115]
Observed:  0.5251


The observed fraction of clicks in the control group, 0.5005, also falls inside the 95% confidence interval.

In [45]:
# Clicks
sum_clicks_con = df_control['Clicks'].sum()
sum_clicks_exp = df_exp['Clicks'].sum()
prob_of_assign = sum_clicks_con/(sum_clicks_con+sum_clicks_exp)
SD_clicks = math.sqrt(0.5*(1-0.5)/(sum_clicks_con+sum_clicks_exp))
lower_bound = 0.5 - 1.96*SD_clicks
upper_bound = 0.5 + 1.96*SD_clicks

print("Confidence Interval for clicks: [{},{}]".format(round(lower_bound,4),round(upper_bound,4)))
print("Observed: ",round(prob_of_assign,4))

Confidence Interval for clicks: [0.4959,0.5041]
Observed:  0.5005
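
The count checks above all follow the same steps, so they could be wrapped in one function. The sketch below is one way to do that. The name `count_invariant_check` and the totals passed to it are made up for illustration and are not the experiment's data.

```python
import math


def count_invariant_check(n_control, n_experiment, p=0.5, z=1.96):
    """Return (lower, upper, observed, passed) for a count expected to split evenly."""
    total = n_control + n_experiment
    se = math.sqrt(p * (1 - p) / total)   # standard error of the control fraction
    lower, upper = p - z * se, p + z * se
    observed = n_control / total          # fraction actually assigned to control
    return lower, upper, observed, lower <= observed <= upper


# Made-up totals for illustration only
lower, upper, observed, passed = count_invariant_check(10050, 9950)
print("Confidence Interval: [{},{}]".format(round(lower, 4), round(upper, 4)))
print("Observed: ", round(observed, 4), "Passed:", passed)
```

With the notebook's DataFrames, the arguments would be the column sums, for example `df_control['Pageviews'].sum()` and `df_exp['Pageviews'].sum()`.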


The observed difference in click-through-probability between the two groups is about 0.0001, which also falls inside the 95% confidence interval. All three invariant metrics pass their sanity checks, so we can start the analysis.

In [47]:
# Click-through-probability
ctp_control = sum_clicks_con/sum_pageviews_con
ctp_exp = sum_clicks_exp/sum_pageviews_exp
diff_exp = abs(ctp_control-ctp_exp)

pool_ctp = (sum_clicks_con+sum_clicks_exp)/(sum_pageviews_con+sum_pageviews_exp)
SE_pool = math.sqrt(pool_ctp*(1-pool_ctp)*(1/sum_pageviews_con + 1/sum_pageviews_exp))
lower_bound = 0 - 1.96*SE_pool
upper_bound = 0 + 1.96*SE_pool

print("Confidence Interval for click-through-probability: [{},{}]".format(round(lower_bound,4),round(upper_bound,4)))
print("Observed: ",round(diff_exp,4))

Confidence Interval for click-through-probability: [-0.0013,0.0013]
Observed:  0.0001


### Effect Size Tests

Significance definitions
A metric is statistically significant if the confidence interval does not include 0 (that is, you can be confident there was a change), and it is practically significant if the confidence interval does not include the practical significance boundary (that is, you can be confident there is a change that matters to the business.)
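
These two definitions can be expressed as a small function. The sketch assumes the change may go in either direction, so the practical significance boundary is checked at both -d_min and +d_min. `classify_effect` and the example intervals are hypothetical and only illustrate the three possible outcomes.

```python
def classify_effect(lower, upper, d_min):
    """Classify a confidence interval for a difference against 0 and +/- d_min.

    Returns (statistically_significant, practically_significant).
    """
    statistically = lower > 0 or upper < 0
    # Practically significant only if the whole interval lies beyond one boundary.
    practically = lower > d_min or upper < -d_min
    return statistically, practically


# Illustrative intervals (made up), with d_min = 0.01
print(classify_effect(-0.030, -0.015, 0.01))  # entirely below -d_min
print(classify_effect(-0.015, -0.002, 0.01))  # excludes 0 but straddles -d_min
print(classify_effect(-0.012, 0.002, 0.01))   # includes 0
```

Checking only the positive boundary would label any interval that lies entirely below zero as practically significant, which is why both signs are tested here.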

The Z-test is a statistical test used to compare the means or proportions of two populations when the variances are known and the sample size is large (usually above 30, so that the sampling distribution is approximately normal).

The confidence interval is:
$$\mathit{P}_{exp} = \frac{\mathit{X}_{exp}}{\mathit{N}_{exp}}$$
$$\mathit{P}_{con} = \frac{\mathit{X}_{con}}{\mathit{N}_{con}}$$

$$\mathit{Diff} = \mathit{P}_{exp} - \mathit{P}_{con}$$

$$\mathit{P}_{pool} = \frac{\mathit{X}_{exp} + \mathit{X}_{con}}{\mathit{N}_{exp} + \mathit{N}_{con}}$$
$$\mathit{SE}_{pool} = \sqrt{\mathit{P}_{pool}(1-\mathit{P}_{pool})\left(\frac{1}{\mathit{N}_{exp}} + \frac{1}{\mathit{N}_{con}}\right)}$$

$$ \mathit{CI} = \mathit{Diff} \pm 1.96 \cdot \mathit{SE}_{pool}$$
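
The formulas above translate directly into a reusable function. This is a minimal sketch: `pooled_diff_ci` is a hypothetical name, and the counts in the example are made up rather than taken from the experiment.

```python
import math


def pooled_diff_ci(x_con, n_con, x_exp, n_exp, z=1.96):
    """Return (diff, lower, upper) for P_exp - P_con using the pooled term above."""
    p_con = x_con / n_con
    p_exp = x_exp / n_exp
    diff = p_exp - p_con
    p_pool = (x_con + x_exp) / (n_con + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_con + 1 / n_exp))
    return diff, diff - z * se_pool, diff + z * se_pool


# Made-up counts for illustration only
diff, lower, upper = pooled_diff_ci(x_con=200, n_con=1000, x_exp=150, n_exp=1000)
print(round(diff, 4), round(lower, 4), round(upper, 4))
```

The same denominators must be used for the group rates and for the pooled rate. Mixing totals from different date ranges changes the width of the interval.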

The gross conversion in the experiment group is lower than in the control group, and the difference is both statistically and practically significant. This makes sense because the screener should reduce the number of enrollments. However, the net conversion is not significantly different from that of the control group.


In [83]:
# Gross conversion

# Enrollments are only recorded for the first n days, so only clicks from those days are counted
# (the slice assumes the days without enrollment data are the trailing rows)
n = len(df_control) - df_control['Enrollments'].isna().sum()
sum_clicks_con = df_control['Clicks'][:n].sum()
sum_clicks_exp = df_exp['Clicks'][:n].sum()
gross_conversion_con = sum_enrollment_con/sum_clicks_con
gross_conversion_exp = sum_enrollment_exp/sum_clicks_exp
diff_gc_exp = gross_conversion_exp-gross_conversion_con

pool_gc = (sum_enrollment_con+sum_enrollment_exp)/(sum_clicks_con+sum_clicks_exp)
SE_pool_gc = math.sqrt(pool_gc*(1-pool_gc)*(1/sum_clicks_con + 1/sum_clicks_exp))
lower_bound = diff_gc_exp - 1.96*SE_pool_gc
upper_bound = diff_gc_exp + 1.96*SE_pool_gc

print("Confidence Interval for gross conversion: [{},{}]".format(round(lower_bound,4),round(upper_bound,4)))
print("Observed: ",round(diff_gc_exp,4))

practical_sig_gc = 0.01
if 0 < lower_bound or 0 > upper_bound:
    # the change may go in either direction, so compare against both -dmin and +dmin
    if upper_bound < -practical_sig_gc or lower_bound > practical_sig_gc:
        print("the difference is statistically and practically significant")
    else:
        print("the difference is only statistically significant")
else:
    print("the difference is not statistically significant")

Confidence Interval for gross conversion: [-0.0291,-0.012]
Observed:  -0.0206
the difference is statistically and practically significant


In [92]:
# net conversion

# Payments are only recorded for the first n days, so only clicks from those days are counted
n = len(df_control) - df_control['Payments'].isna().sum()
sum_payments_con = df_control['Payments'].sum()
sum_payments_exp = df_exp['Payments'].sum()
sum_clicks_con = df_control['Clicks'][:n].sum()
sum_clicks_exp = df_exp['Clicks'][:n].sum()
net_conversion_con = sum_payments_con/sum_clicks_con
net_conversion_exp = sum_payments_exp/sum_clicks_exp
diff_nc_exp = net_conversion_exp-net_conversion_con

pool_nc = (sum_payments_con+sum_payments_exp)/(sum_clicks_con+sum_clicks_exp)
SE_pool_nc = math.sqrt(pool_nc*(1-pool_nc)*(1/sum_clicks_con + 1/sum_clicks_exp))
lower_bound = diff_nc_exp - 1.96*SE_pool_nc
upper_bound = diff_nc_exp + 1.96*SE_pool_nc

print("Confidence Interval for net conversion: [{},{}]".format(round(lower_bound,4),round(upper_bound,4)))
print("Observed: ",round(diff_nc_exp,4))

practical_sig_nc = 0.0075
if 0 < lower_bound or 0 > upper_bound:
    # the change may go in either direction, so compare against both -dmin and +dmin
    if upper_bound < -practical_sig_nc or lower_bound > practical_sig_nc:
        print("the difference is statistically and practically significant")
    else:
        print("the difference is only statistically significant")
else:
    print("the difference is not statistically significant")

Confidence Interval for net conversion: [-0.0116,0.0019]
Observed:  -0.0049
the difference is not statistically significant


### Sign Tests

For each evaluation metric, we run a sign test using the day-by-day breakdown. If the sign test does not agree with the confidence interval for the difference, the reason should be investigated.

We use a binomial test to see whether the p-value is less than alpha (0.05). If the p-value is that small, we can reject the null hypothesis. Both test results agree with the effect size tests.

In [103]:
df_exp['Pageviews_sign'] = df_exp['Pageviews'] - df_control['Pageviews']
df_exp['Clicks_sign'] = df_exp['Clicks'] - df_control['Clicks']
df_exp['Enrollments_sign'] = df_exp['Enrollments'] - df_control['Enrollments']
df_exp['GC_sign']= df_exp['Enrollments']/df_exp['Clicks'] - df_control['Enrollments']/df_control['Clicks']
df_exp['NC_sign'] = df_exp['Payments']/df_exp['Clicks'] - df_control['Payments']/df_control['Clicks']

In [114]:
# what is the probability that this many days with a positive change would occur by chance? Use the binomial distribution
days = len(df_exp) - df_exp['GC_sign'].isna().sum()
n_positive_change_gc = df_exp[df_exp['GC_sign'] > 0]['GC_sign'].count()
n_positive_change_nc = df_exp[df_exp['NC_sign'] > 0]['NC_sign'].count()

# under the null hypothesis, the chance that gross conversion is higher in the experiment group than in the control group on any given day is 0.5
p_value_gc=binom_test(n_positive_change_gc, n=days, p=0.5)
print("p-value of gross conversion is:",p_value_gc,"Statistically Significant:",p_value_gc<0.05)

p_value_nc=binom_test(n_positive_change_nc, n=days, p=0.5)
print("p-value of net conversion is:",p_value_nc,"Statistically Significant:",p_value_nc<0.05)


p-value of gross conversion is: 0.0025994777679443364 Statistically Significant: True
p-value of net conversion is: 0.6776394844055175 Statistically Significant: False
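
As a cross-check that does not depend on SciPy, the two-sided sign test can be computed exactly with the standard library. This sketch relies on the null probability being 0.5, which makes the binomial distribution symmetric. `sign_test_p_value` is a hypothetical helper, and the day counts below are assumed for illustration rather than read from the data files.

```python
from math import comb


def sign_test_p_value(n_positive, n_days):
    """Exact two-sided binomial p-value under p = 0.5 (symmetric case)."""
    k = min(n_positive, n_days - n_positive)
    # Probability of a result at least as extreme in one tail
    tail = sum(comb(n_days, i) for i in range(k + 1)) / 2 ** n_days
    return min(1.0, 2 * tail)


# Illustrative counts (assumed, not read from the data files)
print(sign_test_p_value(4, 23))
print(sign_test_p_value(10, 23))
```

With these assumed counts the function returns roughly 0.0026 and 0.68, the same size as the p-values printed above.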


### Recommendation

The number of user-ids would be lower in the experiment group, but what does it measure? The user-ids can be divided into two groups: students with fewer than 5 hours per week and students with 5 or more hours per week. When students in the first group see the screener, they may be discouraged and choose not to enroll in the course. This group can be labeled "frustrated students". The second group is determined and has more time for the course, so they can be called "resolute students". The screener aims to reduce the number of "frustrated students" who might quit before the course ends without reducing the number of "resolute students". Gross conversion captures the frustrated portion, and net conversion captures the resolute students. As expected, gross conversion in the experiment group is statistically significantly different from gross conversion in the control group, while net conversion shows no statistically significant difference. Therefore, we recommend launching the product.