# Udacity AB Testing Final Project

## Experiment overview
### Description of the experiment
In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead.
### Experiment hypothesis
The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.
### Metrics
* Number of cookies: That is, number of unique cookies to view the course overview page. (dmin=3000)
* Number of user-ids: That is, number of users who enroll in the free trial. (dmin=50)
* Number of clicks: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
* Click-through-probability: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
* Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
* Retention: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
* Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)
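
As a quick illustration of how the ratio metrics relate to the raw counts, the sketch below recomputes them from the daily baseline figures reported in this notebook (40,000 cookies, 3,200 clicks, 660 enrollments, and a 0.53 probability of payment given enrollment). The function name `ratio_metrics` and its `payments` argument are hypothetical helpers introduced only for this example.

```python
def ratio_metrics(cookies, clicks, enrollments, payments):
    """Compute the four ratio metrics from raw counts (illustrative helper)."""
    return {
        "click_through_probability": clicks / cookies,
        "gross_conversion": enrollments / clicks,
        "retention": payments / enrollments,
        "net_conversion": payments / clicks,
    }

# Baseline daily counts. payments is derived from the 0.53 baseline retention,
# so it is an assumed expected value (non-integer), not observed data.
metrics = ratio_metrics(cookies=40000, clicks=3200, enrollments=660,
                        payments=0.53 * 660)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Because retention and gross conversion share the enrollment count, net conversion is simply their product, which is why the baseline net conversion (about 0.1093) equals 0.20625 * 0.53.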


## Metric choice
* Invariant metrics (used for sanity checks): number of cookies, number of clicks, click-through-probability
* Evaluation metrics: gross conversion, retention, net conversion

## Baseline values for these metrics

In [15]:
import pandas as pd
df = pd.read_csv("Final Project Baseline.csv", names=["Metric", "Baseline Value"])
pd.set_option('display.max_colwidth', None)
df

Unnamed: 0,Metric,Baseline Value
0,Unique cookies to view course overview page per day:,40000.0
1,"Unique cookies to click ""Start free trial"" per day:",3200.0
2,Enrollments per day:,660.0
3,"Click-through-probability on ""Start free trial"":",0.08
4,"Probability of enrolling, given click:",0.20625
5,"Probability of payment, given enroll:",0.53
6,"Probability of payment, given click",0.109313


In [16]:
baselines = {
    "Cookies":40000,
    "Clicks":3200,
    "Enrollments":660,
    "CTP":0.08,
    "Gross_Conversion":0.206250,
    "Retention":0.53,
    "Net_Conversion":0.109313
    }
baselines

{'Cookies': 40000,
 'Clicks': 3200,
 'Enrollments': 660,
 'CTP': 0.08,
 'Gross_Conversion': 0.20625,
 'Retention': 0.53,
 'Net_Conversion': 0.109313}

## Estimating standard deviation

### Calculate standard deviation for evaluation metrics
* Gross conversion = Number of enrollments / Number of clicks
* Retention = Number of payments / Number of enrollments
* Net conversion = Number of payments / Number of clicks

In [17]:
import math
import numpy as np
GC = {}
GC["d_min"] = 0.01
GC['p'] =baselines['Gross_Conversion']
GC['n'] = baselines['Clicks']
GC['sd'] = round(np.sqrt((GC['p']*(1-GC['p']))/GC['n']),4)
Retention = {}
Retention["d_min"] = 0.01
Retention['p'] = baselines['Retention']
Retention['n'] = baselines['Enrollments']
Retention['sd'] = round(math.sqrt((Retention['p']*(1-Retention['p']))/Retention['n']),4)
NC = {}
NC["d_min"] = 0.0075
NC['p'] = baselines['Net_Conversion']
NC['n'] = baselines['Clicks']
NC['sd'] = round(math.sqrt((NC['p']*(1-NC['p']))/NC['n']),4)
GC, Retention, NC

({'d_min': 0.01, 'p': 0.20625, 'n': 3200, 'sd': 0.0072},
 {'d_min': 0.01, 'p': 0.53, 'n': 660, 'sd': 0.0194},
 {'d_min': 0.0075, 'p': 0.109313, 'n': 3200, 'sd': 0.0055})

The values above assume a full day of traffic (40,000 cookies). For a sample of 5,000 cookies, each n must be scaled by the ratio 5000/40000 = 0.125 before computing the standard deviation.
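
Written out, the scaling amounts to the following, where $N_{\text{cookies}}$ is the cookie sample size and $n_{\text{day}}$ is the daily baseline count of the metric's unit of analysis (clicks or enrollments). These symbols are introduced here for illustration and do not appear in the original notebook.

```latex
\[
n' = n_{\text{day}} \cdot \frac{N_{\text{cookies}}}{40000}, \qquad
SD = \sqrt{\frac{p\,(1-p)}{n'}}
\]
\[
\text{Gross conversion: } n' = 3200 \cdot \tfrac{5000}{40000} = 400, \quad
SD = \sqrt{\tfrac{0.20625 \cdot 0.79375}{400}} \approx 0.0202
\]
\[
\text{Retention: } n' = 660 \cdot \tfrac{5000}{40000} = 82.5, \quad
SD = \sqrt{\tfrac{0.53 \cdot 0.47}{82.5}} \approx 0.0549
\]
```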

In [18]:
scale_ratio = 5000/40000
gc_sd = round(np.sqrt((GC['p']*(1-GC['p']))/(GC['n']*scale_ratio)),4)
retention_sd = round(np.sqrt((Retention['p']*(1-Retention['p']))/(Retention['n']*scale_ratio)),4)
nc_sd = round(np.sqrt((NC['p']*(1-NC['p']))/(NC['n']*scale_ratio)),4)
print("The standard deviation for Gross Conversion is", gc_sd, ", given a sample size of 5000 cookies")
print("The standard deviation for Retention is", retention_sd, ", given a sample size of 5000 cookies")
print("The standard deviation for Net Conversion is", nc_sd, ", given a sample size of 5000 cookies")

The standard deviation for Gross Conversion is 0.0202 , given a sample size of 5000 cookies
The standard deviation for Retention is 0.0549 , given a sample size of 5000 cookies
The standard deviation for Net Conversion is 0.0156 , given a sample size of 5000 cookies


## Sample Size Calculation
* alpha = 0.05
* beta = 0.2

In [19]:
alpha = 0.05
beta = 0.2

In [20]:
import scipy.stats as stats
def get_sd(p, dmin):
    sd1 = math.sqrt(2*p*(1-p))
    sd2 = math.sqrt(p*(1-p) + (p+dmin)*(1-(p+dmin)))
    return [sd1, sd2]
def get_z_score(x): # x is the percentage of the distribution
    return stats.norm.ppf(x)
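
For reference, `get_sd` returns the two standard-deviation terms of the usual two-proportion sample-size formula that the cells below evaluate. A sketch of that formula, with $p$ the baseline probability, $d_{min}$ the practical significance boundary, and $n$ the required number of units (clicks or enrollments) per group:

```latex
\[
n = \frac{\left( z_{1-\alpha/2}\,\sqrt{2p(1-p)}
      \;+\; z_{1-\beta}\,\sqrt{p(1-p) + (p+d_{min})\bigl(1-(p+d_{min})\bigr)} \right)^{2}}
     {d_{min}^{2}}
\]
```

With $\alpha = 0.05$ and $\beta = 0.2$, the two quantiles are approximately $z_{1-\alpha/2} \approx 1.96$ and $z_{1-\beta} \approx 0.84$.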

### Gross conversion

In [21]:
sd1,sd2 = get_sd(GC['p'], GC['d_min'])
z_alpha = get_z_score(1-alpha/2)
z_beta = get_z_score(1-beta)
GC['sample_size'] = round((z_alpha*sd1 + z_beta*sd2)**2/GC['d_min']**2)
print("Number of clicks needed per group:",GC['sample_size'])

Number of clicks needed per group: 25835


#### Derive the needed page views for gross conversion
* The ratio of clicks to page views is 0.08, so the number of page views needed per group is Number of clicks / 0.08
* There are two groups in the experiment, so the total number of page views needed is Number of page views needed per group * 2

In [22]:
GC['pageviews'] = GC['sample_size'] / 0.08 * 2
print("Number of pageviews needed in experiment:",GC['pageviews']) #GC['n']

Number of pageviews needed in experiment: 645875.0


### Retention

In [23]:
sd1, sd2 = get_sd(Retention['p'], Retention['d_min'])
z_alpha = get_z_score(1-alpha/2)
z_beta = get_z_score(1-beta)
Retention['sample_size'] = round((z_alpha*sd1 + z_beta*sd2)**2/Retention['d_min']**2)
print("Number of enrollments needed per group:",Retention['sample_size'])


Number of enrollments needed per group: 39087


#### Derive the needed page views for Retention
* The ratio of enrollments to clicks is 0.20625, so the number of clicks needed per group is Number of enrollments / 0.20625
* The ratio of clicks to page views is 0.08, so the number of page views needed per group is Number of clicks / 0.08
* There are two groups in the experiment, so the total number of page views needed is Number of page views needed per group * 2

In [24]:
Retention['pageviews'] = round(Retention['sample_size'] / baselines["Gross_Conversion"] / baselines["CTP"]* 2)
print("Number of pageviews needed in experiment:",Retention['pageviews'])

Number of pageviews needed in experiment: 4737818


### Net conversion

In [25]:
sd1, sd2 = get_sd(NC['p'], NC['d_min'])
z_alpha = get_z_score(1-alpha/2)
z_beta = get_z_score(1-beta)
NC['sample_size'] = round((z_alpha*sd1 + z_beta*sd2)**2/NC['d_min']**2)
print("Number of clicks needed per group:",NC['sample_size']) 

Number of clicks needed per group: 27413


#### Derive the needed page views for Net conversion
* The ratio of clicks to page views is 0.08, so the number of page views needed per group is Number of clicks / 0.08
* There are two groups in the experiment, so the total number of page views needed is Number of page views needed per group * 2

In [26]:
NC['pageviews'] = round(NC['sample_size']/0.08*2)
print("Number of pageviews needed in experiment:",NC['pageviews'])

Number of pageviews needed in experiment: 685325


### How many page views are needed for the experiment?
* The number of page views needed for the experiment is the maximum of the page views needed across the evaluation metrics.

In [27]:
print("Number of pageviews needed in experiment:",max(GC['pageviews'], Retention['pageviews'], NC['pageviews']))

Number of pageviews needed in experiment: 4737818
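
The three per-metric calculations above follow the same pattern, so they can be folded into two small functions. The sketch below is one possible consolidation, not the notebook's own code: the names `sample_size_per_group` and `total_pageviews` are hypothetical, and it uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm.ppf` so that it runs without SciPy.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p, d_min, alpha=0.05, beta=0.2):
    """Units per group for a two-proportion test (same formula as the cells above)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    sd_null = math.sqrt(2 * p * (1 - p))
    sd_alt = math.sqrt(p * (1 - p) + (p + d_min) * (1 - (p + d_min)))
    return round((z_alpha * sd_null + z_beta * sd_alt) ** 2 / d_min ** 2)

def total_pageviews(units_per_group, units_per_pageview, groups=2):
    """Convert units needed per group into total pageviews across all groups."""
    return round(units_per_group / units_per_pageview * groups)

CTP = 0.08                  # clicks per pageview (baseline)
GROSS_CONVERSION = 0.20625  # enrollments per click (baseline)

requirements = {
    "gross_conversion": total_pageviews(sample_size_per_group(0.20625, 0.01), CTP),
    # Retention is measured per enrollment, so go enrollments -> clicks -> pageviews.
    "retention": total_pageviews(sample_size_per_group(0.53, 0.01),
                                 CTP * GROSS_CONVERSION),
    "net_conversion": total_pageviews(sample_size_per_group(0.109313, 0.0075), CTP),
}
print(requirements)
print("Maximum:", max(requirements.values()))
```

Keeping the unit-of-analysis conversion in a single argument (`units_per_pageview`) makes it harder to forget the extra enrollment-to-click step that retention needs.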


### Approximate the duration of the experiment for each metric
* As the results below show, the retention metric would need about 119 days of full traffic, which is an unreasonably long time to run the experiment. Therefore, we will not use retention as an evaluation metric.
* Gross conversion and net conversion need about 16.1 and 17.1 days of full traffic respectively, so covering both requires running the experiment for 18 days.

In [28]:
print("Duration for Gross Conversion:",GC['pageviews']/40000, "days")
print("Duration for Retention:",Retention['pageviews']/40000, "days")
print("Duration for Net Conversion:",NC['pageviews']/40000, "days")

Duration for Gross Conversion: 16.146875 days
Duration for Retention: 118.44545 days
Duration for Net Conversion: 17.133125 days
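
The 18-day figure comes from rounding the longer of the two remaining durations up to whole days. A minimal sketch, assuming all 40,000 daily pageviews are diverted into the experiment; the `traffic_fraction` parameter is a hypothetical knob added for illustration and is not part of the original analysis.

```python
import math

DAILY_PAGEVIEWS = 40000  # baseline unique cookies viewing the overview page per day

def days_needed(pageviews, traffic_fraction=1.0):
    """Whole days required to collect `pageviews` at the given share of traffic."""
    return math.ceil(pageviews / (DAILY_PAGEVIEWS * traffic_fraction))

durations = {
    "gross_conversion": days_needed(645875),
    "retention": days_needed(4737818),
    "net_conversion": days_needed(685325),
}
print(durations)
# {'gross_conversion': 17, 'retention': 119, 'net_conversion': 18}
```

Diverting less traffic lengthens the run proportionally; for example, `days_needed(685325, traffic_fraction=0.5)` gives 35 days.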


## Sanity check for invariant metrics
* Number of cookies
* Number of clicks
* Click-through-probability

For each of the invariant metrics, we calculate the 95% confidence interval around the expected value and check whether the observed value falls inside it.

In [29]:
data_control = pd.read_csv('Final Project Results - Control.csv')
data_exp = pd.read_csv('Final Project Results - Experiment.csv')
data_control.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


In [30]:
data_exp.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0


### Number of cookies (pageviews)
* H0: p  = 0.5
* H1: p != 0.5
* Check if the ratio of control cookies to total cookies is significantly different from 0.5. 
* The observed value is 0.5006, which is within the confidence interval [0.4988, 0.5012]. Therefore, we can't reject the null hypothesis. The number of cookies metric passes the sanity check.


In [31]:
control_pageviews = data_control['Pageviews'].sum()
exp_pageviews = data_exp['Pageviews'].sum()
total_pageviews = control_pageviews + exp_pageviews

In [32]:
p = 0.5
alpha = 0.05
ME_pageviews = 1.96 * math.sqrt(p*(1-p)/total_pageviews)
p_pageviews = control_pageviews/total_pageviews
CI_pageviews = [p - ME_pageviews, p + ME_pageviews]
print("Pageviews CI:",CI_pageviews)
print("Observed Pageviews:",p_pageviews)
# ME_clicks = 1.96 * math.sqrt(p*(1-p)/total_clicks)

Pageviews CI: [0.49882039214902313, 0.5011796078509769]
Observed Pageviews: 0.5006396668806133


### Number of clicks
* H0: p  = 0.5
* H1: p != 0.5
* Check if the ratio of control clicks to total clicks is significantly different from 0.5. 
* The observed value is 0.5005, which is within the confidence interval [0.4959, 0.5041]. Therefore, we can't reject the null hypothesis. The number of clicks metric passes the sanity check.

In [33]:
control_clicks = data_control['Clicks'].sum()
exp_clicks = data_exp['Clicks'].sum()
total_clicks = control_clicks + exp_clicks

In [34]:
p = 0.5
alpha = 0.05
ME_clicks = 1.96 * math.sqrt(p*(1-p)/total_clicks)
p_clicks = control_clicks/total_clicks
CI_clicks = [p - ME_clicks, p + ME_clicks]
print("Clicks CI:",CI_clicks)
print("Observed Clicks:",p_clicks)

Clicks CI: [0.49588449572378945, 0.5041155042762105]
Observed Clicks: 0.5004673474066628


### Click through probability
* H0: d  = 0
* H1: d != 0
* Here, we calculate the difference between the experiment and control click-through-probabilities and check whether it is significantly different from 0.
* The observed value is 0.0001, which is within the confidence interval [-0.0013, 0.0013]. Therefore, we can't reject the null hypothesis. The click through probability metric passes the sanity check.

In [35]:
control_ctp = control_clicks / control_pageviews
exp_ctp = exp_clicks / exp_pageviews
d_hat = round(exp_ctp - control_ctp,4)
pool_p = (control_clicks+exp_clicks)/(control_pageviews+exp_pageviews)
sd = math.sqrt(
    pool_p*(1-pool_p)*(1/control_pageviews+1/exp_pageviews)
)
CI = [0 - 1.96*sd, 0 + 1.96*sd]
print("Click-through-probability CI:",CI)
print("Observed Click-through-probability:",d_hat)

Click-through-probability CI: [-0.0012956791986518956, 0.0012956791986518956]
Observed Click-through-probability: 0.0001
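
The pooled standard error used in the cell above can be written as follows, where $X$ denotes clicks and $N$ pageviews, with subscripts $c$ and $e$ for the control and experiment groups (notation introduced here for illustration):

```latex
\[
\hat{p}_{pool} = \frac{X_c + X_e}{N_c + N_e}, \qquad
SE_{pool} = \sqrt{\hat{p}_{pool}\,(1-\hat{p}_{pool})\left(\frac{1}{N_c} + \frac{1}{N_e}\right)}
\]
\[
\hat{d} = \frac{X_e}{N_e} - \frac{X_c}{N_c}, \qquad
\text{pass if } \; -1.96\,SE_{pool} \;\le\; \hat{d} \;\le\; 1.96\,SE_{pool}
\]
```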


## Effect size tests

### Gross Conversion

* From the results below, the 95% confidence interval is [-0.0291, -0.0120], which does not include 0. Therefore, we can reject the null hypothesis: the change in gross conversion is statistically significant. The interval also lies entirely below -dmin = -0.01, so the change is practically significant as well.

In [36]:
data_control= data_control[data_control['Enrollments'].notna()]
data_exp= data_exp[data_exp['Enrollments'].notna()]
control_clicks = data_control['Clicks'].sum()
exp_clicks = data_exp['Clicks'].sum()
control_enroll = data_control['Enrollments'].sum()
exp_enroll = data_exp['Enrollments'].sum()
control_GC = control_enroll / control_clicks
exp_GC = exp_enroll / exp_clicks

In [37]:
dmin = GC['d_min']
d_hat = exp_GC - control_GC
alpha = 0.05
pool_p = (control_enroll+exp_enroll)/(control_clicks+exp_clicks)
se = math.sqrt(pool_p*(1-pool_p)*(1/exp_clicks+1/control_clicks))
ME = get_z_score(1-alpha/2)*se
print("Confidence interval: ",[d_hat-ME, d_hat+ME])
print ("The change is statistically significant if the CI doesn't include 0. On the other hand, it is practically significant if",-GC["d_min"],"is not in the CI as well.")

Confidence interval:  [-0.02912320088750467, -0.011986548273218463]
The change is statistically significant if the CI doesn't include 0. On the other hand, it is practically significant if -0.01 is not in the CI as well.


### Net conversion

* From the results below, the 95% confidence interval is [-0.0116, 0.0019], which includes 0. Therefore, we can't reject the null hypothesis: the change in net conversion is not statistically significant, and so it is not practically significant either. Note, however, that the interval extends below -dmin = -0.0075, so a practically significant decrease cannot be ruled out.

In [38]:
control_clicks = data_control['Clicks'].sum()
exp_clicks = data_exp['Clicks'].sum()
control_payments = data_control['Payments'].sum()
exp_payments = data_exp['Payments'].sum()
control_NC = control_payments / control_clicks
exp_NC = exp_payments / exp_clicks

In [39]:
alpha = 0.05
dmin = NC['d_min']
d_hat = exp_NC - control_NC
pool_p = (control_payments+exp_payments)/(control_clicks+exp_clicks)
se = math.sqrt(pool_p*(1-pool_p)*(1/control_clicks+1/exp_clicks))
ME = get_z_score(1-alpha/2)*se
print("Confidence interval: ",[d_hat-ME, d_hat+ME])
print ("The change is NOT statistically significant as the CI includes 0. Therefore, it is NOT practically significant as well.")

Confidence interval:  [-0.011604500677993734, 0.0018570553289053993]
The change is NOT statistically significant as the CI includes 0. Therefore, it is NOT practically significant as well.
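
The decision rules applied to the two intervals above can be made explicit in a small helper. This is a sketch only: `classify_effect` and its result keys are hypothetical names, and the inputs are the rounded interval bounds reported in this section.

```python
def classify_effect(ci_low, ci_high, d_min):
    """Classify a confidence interval for a difference against 0 and +/- d_min."""
    # Statistically significant: the interval excludes 0.
    statistically_significant = ci_low > 0 or ci_high < 0
    # Practically significant: the whole interval lies beyond a d_min boundary.
    practically_significant = ci_low > d_min or ci_high < -d_min
    # A boundary inside the interval means a practically relevant change
    # is neither established nor ruled out.
    boundary_inside_ci = (ci_low <= -d_min <= ci_high) or (ci_low <= d_min <= ci_high)
    return {
        "statistically_significant": statistically_significant,
        "practically_significant": practically_significant,
        "boundary_inside_ci": boundary_inside_ci,
    }

gross = classify_effect(-0.0291, -0.0120, d_min=0.01)
net = classify_effect(-0.0116, 0.0019, d_min=0.0075)
print("Gross conversion:", gross)
print("Net conversion:", net)
```

For net conversion the third flag is true because -0.0075 falls inside the interval, which is the main reason for caution about launching the change.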


## Sign tests

* H0: positive and negative signs are equally likely (p = 0.5)
* Ha: positive and negative signs are not equally likely (p != 0.5)
* From the Gross Conversion results, we see that the p-value is 0.0026, which is less than 0.05. Therefore, we can reject the null hypothesis. The sign test is statistically significant. The observed success rate is 0.174, far lower than the null hypothesis expectation of 0.5, meaning that the positive signs are much rarer than the negative signs.

* From the Net Conversion results, we see that the p-value is 0.6776, which is greater than 0.05. Therefore, we can't reject the null hypothesis. The sign test is not statistically significant, which is consistent with the effect size test for net conversion.

In [None]:
full = pd.merge(data_control,data_exp, on='Date', suffixes=('_control','_exp'))
full["control_GC"] = full["Enrollments_control"]/full["Clicks_control"]
full["control_NC"] = full["Payments_control"]/full["Clicks_control"]
full["exp_GC"] = full["Enrollments_exp"]/full["Clicks_exp"]
full["exp_NC"] = full["Payments_exp"]/full["Clicks_exp"]
full['sign_GC'] = np.where(full["exp_GC"]>full["control_GC"],1,0)
full['sign_NC'] = np.where(full["exp_NC"]>full["control_NC"],1,0)

In [64]:
import scipy.stats as stats
GC_res = stats.binomtest(full['sign_GC'].sum(), full.shape[0], p=0.5)
print(f"Sign test result for Gross Conversion:")
print(f"p-value:{GC_res.pvalue}, success rate:{GC_res.statistic}")
NC_res = stats.binomtest(full['sign_NC'].sum(), full.shape[0], p=0.5)
print(f"Sign test result for Net Conversion:")
print(f"p-value:{NC_res.pvalue}, success rate:{NC_res.statistic}")

Sign test result for Gross Conversion:
p-value:0.002599477767944336, success rate:0.17391304347826086
Sign test result for Net Conversion:
p-value:0.6776394844055176, success rate:0.43478260869565216
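
To show what `binomtest` computes here, the sketch below evaluates the exact two-sided binomial p-value with the standard library only. It assumes the reported success rates correspond to 4 and 10 positive days out of 23 (0.1739 and 0.4348), which is inferred from the output above rather than stated in it. The function name `sign_test_p_value` is hypothetical, and doubling the smaller tail is only valid because the null probability is 0.5, which makes the distribution symmetric.

```python
from math import comb

def sign_test_p_value(successes, n):
    """Two-sided exact binomial p-value under H0: p = 0.5."""
    # Under p = 0.5 the distribution is symmetric, so double the smaller tail.
    k = min(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print("Gross conversion:", sign_test_p_value(4, 23))
print("Net conversion:", sign_test_p_value(10, 23))
```

Both values agree with the `binomtest` p-values printed above.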


## Conclusion
* We define number of cookies, number of clicks, and click-through-probability as invariant metrics. We use these metrics to check whether the experiment was conducted correctly.
* We define gross conversion, retention, and net conversion as the evaluation metrics. We use these metrics to check whether the experiment was successful.
* The evaluation metric "Retention" would require about 119 days of full traffic, which is an unreasonably long time to run the experiment. Therefore, we do not evaluate the retention metric.
* We use sanity checks to verify that the invariant metrics are consistent with their expected values. All invariant metrics pass the sanity check.
* We use effect size tests to check whether the evaluation metrics are statistically and practically significant. Gross conversion is both statistically and practically significant. Net conversion is neither statistically nor practically significant.
* We use sign tests to check whether the day-by-day direction of the difference agrees with the effect size tests. For gross conversion the sign test is statistically significant, that is, the experimental group indeed has a lower gross conversion rate on most days. For net conversion the sign test is not statistically significant.

## Recommendation
In the experimental group, the gross conversion rate is significantly lower than in the control group. This means that, among students who click "Start free trial", fewer complete checkout and enroll in the free trial. However, the net conversion rate is not significantly different between the two groups, which indicates that the change does not increase the number of users who make at least one payment. Therefore, we recommend not launching the change tested in this experiment.