# Udacity AB Testing

In [1]:
import numpy as np
import pandas as pd
from ab_tools.Metrics import Metric, MetricAnalyzer
from ab_tools.Power import Power_analyzer
from ab_tools.utils import generate_ci_report
import scipy.stats as ss

---

### ⚡ Project Objective

This project replicates an experiment that the online education platform Udacity ran to test whether a change to its free-trial subscription procedure had a significant effect.

---

### 💡 Free Trial procedure Overview

#### 1- Control

Without any change introduced, each course overview page offers two options: "start free trial" and "access course materials". If the student chooses "start free trial", they are prompted to provide their credit card information and are enrolled in a free trial of the paid course. After 14 days, they are charged automatically unless they cancel before the trial period ends. If the student instead selects "access course materials", they can view the videos and take the quizzes for free, but they do not get coaching support or a verified certificate, and they cannot submit their final project for feedback.

#### 2- Experiment

The change Udacity wants to test works as follows: when a student clicks "start free trial", they are asked how much time they have available to devote to the course. If the student indicates 5 or more hours per week, they are taken through the checkout process as usual. If they indicate fewer than 5 hours per week, a message appears explaining that Udacity courses usually require a greater time commitment for successful completion and suggesting that the student might like to access the course materials for free. At this point, the student can either continue enrolling in the free trial or access the course materials for free instead.
The proposed pop-up message looks as follows:

![Free Trial screener](Free_trial_screener.png)

---

### ⚖ The Hypothesis

$\large H_{0}:$ There is no difference in the chosen metric between the control group and the experiment group     

$\large Vs$     
      
$\large H_{1}:$ There is a difference in the chosen metric between the control group and the experiment group

---

### 📏 Metrics

In the [experiment instructions](Final_project_instructions.pdf), several metrics are provided along with their practical significance boundaries ($d_{min}$). Two types of metrics have to be chosen from them: invariant metrics for sanity checks and evaluation metrics for testing $\large H_{0}$ against $\large H_{1}$.

#### - Invariant Metrics

Invariant metrics are those that should not differ between the control and experiment groups. The chosen invariant metrics are:    
- **Number of cookies:** The number of unique cookies to view the course overview page. $d_{min} = 3000$     
- **Number of clicks:** The number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). $d_{min} = 240$     

#### - Evaluation Metrics

The chosen evaluation metrics are:      
- **Retention:** The number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. $d_{min} = .01$     
- **Net conversion:** The number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. $d_{min} = 0.0075$

---

### 📊 Baseline Metrics' estimates

A baseline estimate for each metric is provided with the experiment: 

In [2]:
baseline = pd.read_csv('data/Baseline_Values.csv', names=["Metric Description", "Baseline estimate"])

baseline["metric"] = ["cookies_total", "clicks_trial", "enrollments", "CTP", "gross_conversion", "retention", "net_conversion"]
baseline.set_index("metric", inplace=True)

baseline.style.set_properties(**{'text-align': 'left', 'display.max_colwidth': None})

| metric | Metric Description | Baseline estimate |
| :--- | :--- | :--- |
| cookies_total | Unique cookies to view course overview page per day: | 40000.0 |
| clicks_trial | Unique cookies to click "Start free trial" per day: | 3200.0 |
| enrollments | Enrollments per day: | 660.0 |
| CTP | Click-through-probability on "Start free trial": | 0.08 |
| gross_conversion | Probability of enrolling, given click: | 0.20625 |
| retention | Probability of payment, given enroll: | 0.53 |
| net_conversion | Probability of payment, given click | 0.109313 |


### Calculating standard deviation Analytically 

According to the [experiment instructions](Final_project_instructions.pdf), the standard deviations are to be calculated for a sample of **5000 cookies** visiting the course overview page. Therefore, the count-based baseline estimates have to be rescaled to that sample size first.

In [3]:
sample_size = 5000
sample_ratio = sample_size/40_000

In [4]:
baseline["Baseline estimate rescaled"] = baseline["Baseline estimate"].copy()

In [5]:
baseline.loc[:'enrollments', "Baseline estimate rescaled"] = baseline.loc[:'enrollments', "Baseline estimate rescaled"] * sample_ratio

In [6]:
baseline

| metric | Metric Description | Baseline estimate | Baseline estimate rescaled |
| :--- | :--- | :--- | :--- |
| cookies_total | Unique cookies to view course overview page pe... | 40000.0 | 5000.0 |
| clicks_trial | Unique cookies to click "Start free trial" per... | 3200.0 | 400.0 |
| enrollments | Enrollments per day: | 660.0 | 82.5 |
| CTP | Click-through-probability on "Start free trial": | 0.08 | 0.08 |
| gross_conversion | Probability of enrolling, given click: | 0.20625 | 0.20625 |
| retention | Probability of payment, given enroll: | 0.53 | 0.53 |
| net_conversion | Probability of payment, given click | 0.109313 | 0.109313 |


In [7]:
print(f"CTP = {3200/40_000}")
print(f"gross conversion = {660/3200}")
print(f"retention = {.53}")
print(f"net conversion = {660*.53/3200}")

CTP = 0.08
gross conversion = 0.20625
retention = 0.53
net conversion = 0.1093125


In [8]:
pwr_analysis = Power_analyzer(alpha=.05, beta=.2)

In [9]:
rescaled_data = baseline.loc[:, "Baseline estimate rescaled"]
rescaled_data

metric
cookies_total       5000.000000
clicks_trial         400.000000
enrollments           82.500000
CTP                    0.080000
gross_conversion       0.206250
retention              0.530000
net_conversion         0.109313
Name: Baseline estimate rescaled, dtype: float64

By defining each metric as an instance of the ***Metric class*** [[Source Code](ab_tools/Metrics.py)], the standard error is calculated automatically from the baseline estimate and the corresponding number of units of analysis passed when the Metric instance is initialized.

In [10]:
retentiton = Metric(p=rescaled_data["retention"], n=rescaled_data["enrollments"], d_min=.01)
retentiton

Metric(p=0.53, n=82.5, d_min=0.01, se=0.0549)

In [11]:
net_conversion = Metric(p=rescaled_data["net_conversion"], n=rescaled_data["clicks_trial"], d_min=.0075)
net_conversion

Metric(p=0.1093125, n=400.0, d_min=0.0075, se=0.0156)
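
The `Metric` source is not reproduced here, but the reported standard errors are consistent with the usual analytic formula for a binomial proportion, $se = \sqrt{p(1-p)/n}$. A minimal sketch, assuming that formula (the helper name `proportion_se` is hypothetical and not part of `ab_tools`):

```python
from math import sqrt


def proportion_se(p: float, n: float) -> float:
    """Analytic standard error of a proportion p estimated from n units."""
    return sqrt(p * (1 - p) / n)


# Baseline values rescaled to 5000 page views, taken from the table above.
se_retention = proportion_se(0.53, 82.5)            # n = enrollments
se_net_conversion = proportion_se(0.1093125, 400)   # n = clicks

print(f"retention se = {se_retention:.4f}")
print(f"net conversion se = {se_net_conversion:.4f}")
```

Note that `n` is the unit of analysis of each metric (enrollments for retention, clicks for net conversion), not the number of page views.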

---

### ⚖ Sizing

#### - retention

In [12]:
sample_size_retention = np.ceil( pwr_analysis.calculate_sample_size(retentiton.p, retentiton.d_min) )
print(f"sample size for retention per group (control/treatment) = {int(sample_size_retention)} enrollment")

sample size for retention per group (control/treatment) = 39099 enrollment
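
The formula used inside `Power_analyzer.calculate_sample_size` is not shown in this notebook. As a rough cross-check, the sketch below uses a common textbook approximation for a two-sided two-proportion z-test; it is an assumption, not the library's implementation, and the helper name `sample_size_per_group` is hypothetical. Its result should land close to the figure above, though small differences are expected.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p: float, d_min: float,
                          alpha: float = 0.05, beta: float = 0.2) -> int:
    """Approximate per-group sample size to detect a shift of d_min from p.

    Assumption: textbook approximation, not necessarily what Power_analyzer does.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(1 - beta)        # power = 1 - beta
    sd_null = sqrt(2 * p * (1 - p))                # both groups at the baseline p
    sd_alt = sqrt(p * (1 - p) + (p + d_min) * (1 - p - d_min))  # one group shifted
    return ceil(((z_alpha * sd_null + z_beta * sd_alt) / d_min) ** 2)


n_retention = sample_size_per_group(p=0.53, d_min=0.01)
print(f"approximate sample size for retention per group = {n_retention}")
```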


The result of **sample_size_retention** is for each of the two groups, control and experiment. Therefore, the total sample size for **retention** is double that number.

In [13]:
retention_total_enrollments = 2 * sample_size_retention
print(f"total sample size for retention is {int(retention_total_enrollments)} enrollments")

total sample size for retention is 78198 enrollments


retention = $\frac{\text{Number of enrollments that remain past the 14-day boundary}}{\text{Total number of enrollments}} \ \Rightarrow \quad$ Unit of analysis is **enrollments**    
       
$\therefore \quad$ The ratio of the unit of analysis to the number of unique cookies viewing the course overview page per day = $\frac{660}{40{,}000} = 0.0165$ *enrollments/page view*     
     
$\large \Rightarrow \quad$ The total number of page views $n$ required to get **78198** enrollments satisfies     
     
$\frac{78198}{n} = 0.0165 \quad \Rightarrow \quad n = \frac{78198}{0.0165} \approx 4{,}739{,}273$ *page views*
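
The conversion from units of analysis to page views can be sketched as a small helper. The function name `pageviews_needed` is hypothetical; it uses integer ceiling division so that the result is not affected by floating-point rounding.

```python
def pageviews_needed(units_needed: int, units_per_day: int, pageviews_per_day: int) -> int:
    """Page views required to collect `units_needed` units of analysis.

    Equivalent to ceil(units_needed / (units_per_day / pageviews_per_day)),
    computed with integers only.
    """
    return -(-units_needed * pageviews_per_day // units_per_day)


# 78198 enrollments in total; baseline: 660 enrollments per 40,000 page views per day.
retention_pageviews = pageviews_needed(78198, 660, 40_000)
print(f"page views required for retention = {retention_pageviews}")
```

The same helper applies to any metric once its unit of analysis and the baseline daily count of that unit are known.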

#### - net conversion

In [14]:
sample_size_net_conversion = np.ceil( pwr_analysis.calculate_sample_size(p=net_conversion.p, 
                                                                         d_min=net_conversion.d_min) )

print(f"sample size for net conversion per group (control/treatment) = {int(sample_size_net_conversion)} clicks")

sample size for net conversion per group (control/treatment) = 27422 clicks


In [15]:
net_conversion_total_clicks = 2 * sample_size_net_conversion
print(f"total sample size for net conversion is {int(net_conversion_total_clicks)} clicks")

total sample size for net conversion is 54844 clicks


Unit of analysis is **clicks** on the "Start free trial" button  
$\therefore \quad$  The ratio of the unit of analysis to the number of unique cookies viewing the course overview page per day = $\frac{3200}{40{,}000} = 0.08$ *clicks/page view*     
     
$\large \Rightarrow \quad$ The total number of page views required to get **54844** clicks is     
     
$\frac{54844}{0.08} = 685{,}550$ *page views*

#### - Metrics Summary

|       | Retention | Net Conversion |
| :---- | :---      | :---           |
| Baseline estimate | .53 | .1093125 |
| d_min | .01 | .0075 |
| sample size per group | 39,099 enrollments | 27,422 clicks |
| total sample size = (sample size per group) x 2 | 78,198 enrollments | 54,844 clicks |
| standard error (at 5000 page views) | .0549 | .0156 |
| total page views required | 4,739,273 | 685,550 |

$\therefore \quad$*The required page views* (the larger of the two requirements) = ***4,739,273*** *page views*

---

### ⏱ Duration vs Exposure

- Is the change risky enough that you wouldn't want to run on all traffic?      
No. Exposing all traffic to the change is not expected to have a drastic effect on other metrics or on the user experience, so it is safe to run the experiment on all traffic if possible.
- What percentage of Udacity's traffic would you divert to this experiment (assuming there were no other experiments you wanted to run simultaneously)?     
To choose a percentage, I compare how long data collection would take for each metric at different traffic levels.     
1. 100% Traffic:     
    a. duration for retention:  $\large \frac{4,739,273}{40,000}$ ≈ *119 days*    
        
    b. duration for net conversion: $\large \frac{685,550}{40,000}$ ≈ *18 days*
2. 50% Traffic:     
    a. duration for retention:  $\large \frac{4,739,273}{20,000}$ ≈ *237 days*    
        
    b. duration for net conversion: $\large \frac{685,550}{20,000}$ ≈ *35 days*     
     
> Although there are no red flags against running the experiment on 100% of the traffic, I choose to run it on 50% to leave room in case another experiment needs to be run.     
     
- The [experiment instructions](Final_project_instructions.pdf) state that an experiment lasting longer than a few weeks is unreasonably long, in which case an earlier decision needs to be reconsidered.      
The retention metric would require the experiment to run for several months, far longer than a few weeks. In light of that, the ***retention*** metric is dropped from the evaluation metrics, leaving only ***net conversion***.
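
The durations discussed above follow from dividing the required page views by the daily traffic diverted to the experiment and rounding up to whole days. A minimal sketch, assuming the baseline of 40,000 unique cookies per day (the helper name `days_required` is hypothetical):

```python
from math import ceil

DAILY_PAGEVIEWS = 40_000  # baseline unique cookies viewing the overview page per day


def days_required(pageviews_needed: int, traffic_fraction: float) -> int:
    """Whole days needed to collect `pageviews_needed` at a given share of traffic."""
    return ceil(pageviews_needed / (DAILY_PAGEVIEWS * traffic_fraction))


required = {"retention": 4_739_273, "net conversion": 685_550}
for fraction in (1.0, 0.5):
    for metric, pageviews in required.items():
        days = days_required(pageviews, fraction)
        print(f"{fraction:.0%} traffic, {metric}: {days} days")
```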

---

### 🔬 Analysis

There are two data files provided with the experiment, one for the [control group](data/collected_data/control.csv) and one for the [experiment group](data/collected_data/experiment.csv). Both files record daily pageviews, clicks on the free-trial button, enrollments, and payments.

#### - Data exploration

In [16]:
data_cont = pd.read_csv('data/collected_data/control.csv', index_col='Date')
data_cont.head()

| Date | Pageviews | Clicks | Enrollments | Payments |
| :--- | :--- | :--- | :--- | :--- |
| Sat, Oct 11 | 7723 | 687 | 134.0 | 70.0 |
| Sun, Oct 12 | 9102 | 779 | 147.0 | 70.0 |
| Mon, Oct 13 | 10511 | 909 | 167.0 | 95.0 |
| Tue, Oct 14 | 9871 | 836 | 156.0 | 105.0 |
| Wed, Oct 15 | 10014 | 837 | 163.0 | 64.0 |


In [17]:
data_exp = pd.read_csv('data/collected_data/experiment.csv', index_col='Date')
data_exp.head()

| Date | Pageviews | Clicks | Enrollments | Payments |
| :--- | :--- | :--- | :--- | :--- |
| Sat, Oct 11 | 7716 | 686 | 105.0 | 34.0 |
| Sun, Oct 12 | 9288 | 785 | 116.0 | 91.0 |
| Mon, Oct 13 | 10480 | 884 | 145.0 | 79.0 |
| Tue, Oct 14 | 9867 | 827 | 138.0 | 92.0 |
| Wed, Oct 15 | 9793 | 832 | 140.0 | 94.0 |


#### - Sanity checks

In [18]:
page_views_control = data_cont.Pageviews.sum()
page_views_experiment = data_exp.Pageviews.sum()
total_page_views = page_views_control + page_views_experiment

In [19]:
print(f"pageviews in the control group = {page_views_control}")
print(f"pageviews in the experiment group = {page_views_experiment}")
print(f"total pageviews = {total_page_views}")

pageviews in the control group = 345543
pageviews in the experiment group = 344660
total pageviews = 690203


The total number of pageviews (690,203) exceeds the 685,550 pageviews required for the net conversion metric calculated earlier, so the analysis of net conversion should have a power of at least 80%. However, the control group has 883 more pageviews than the experiment group. A sanity check is needed to determine whether this difference in sample sizes is significant or due to chance.

***H₀***: P(pageview diverted to control group) = P(pageview diverted to experiment group) = .5     
***Vs***      
***H₁***: P(pageview diverted to control group) ≠ P(pageview diverted to experiment group)

In [20]:
metric_analyzer = MetricAnalyzer()

In [21]:
metric_analyzer.sanity_check(page_views_control, page_views_experiment, return_report=True)

p_hat=0.5006, ci=(0.4988, 0.5012), 0.4988 ≤ p_hat ≤ 0.5012, PASSED


As we see, the observed proportion of pageviews assigned to the control group falls within the boundaries of the confidence interval; therefore, we fail to reject ***H₀*** $\quad \Rightarrow \quad$ according to the data, there is no evidence that the pageviews in the control group and the experiment group differ significantly.
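
The reported interval is consistent with a normal-approximation confidence interval built around the expected share of 0.5. The sketch below assumes that construction; it is not the source of `MetricAnalyzer.sanity_check`, and the helper name `split_sanity_check` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def split_sanity_check(n_control: int, n_experiment: int, alpha: float = 0.05):
    """Check whether the control share is compatible with a 50/50 split."""
    total = n_control + n_experiment
    se = sqrt(0.5 * 0.5 / total)                     # SE under H0: p = 0.5
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    p_hat = n_control / total                        # observed control share
    lower, upper = 0.5 - margin, 0.5 + margin
    return p_hat, (lower, upper), lower <= p_hat <= upper


# Pageview totals printed earlier in the notebook.
p_hat, ci, passed = split_sanity_check(345543, 344660)
print(f"p_hat={p_hat:.4f}, ci=({ci[0]:.4f}, {ci[1]:.4f}), passed={passed}")
```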

In [22]:
clicks_control = data_cont.Clicks.sum()
clicks_experiment = data_exp.Clicks.sum()
total_clicks = clicks_control + clicks_experiment

In [23]:
metric_analyzer.sanity_check(clicks_control, clicks_experiment, return_report=True)

p_hat=0.5005, ci=(0.4959, 0.5041), 0.4959 ≤ p_hat ≤ 0.5041, PASSED


The same procedure is applicable to the number of clicks on the free-trial button. There is no evidence that the number of clicks on the free-trial button in the control group differs from its counterpart in the experiment group.

#### - Analysis of the Evaluation metrics

##### - Analytically

In [24]:
payments_cont = data_cont.Payments[~data_cont.Payments.isnull()]
clicks_cont = data_cont.Clicks[~data_cont.Payments.isnull()]

payments_cont_sum = payments_cont.sum()
clicks_cont_sum = clicks_cont.sum()
net_conversion_cont = payments_cont_sum/clicks_cont_sum

In [25]:
payments_exp = data_exp.Payments[~data_exp.Payments.isnull()]
clicks_exp = data_exp.Clicks[~data_exp.Payments.isnull()]

payments_exp_sum = payments_exp.sum()
clicks_exp_sum = clicks_exp.sum()
net_conversion_exp = payments_exp_sum/clicks_exp_sum

In [26]:
net_conversion_diff = net_conversion_exp - net_conversion_cont
np.round(net_conversion_diff, 4)

-0.0049

***H₀***: conversion rate in control group = conversion rate in experiment group     
***Vs***      
***H₁***: conversion rate in control group ≠ conversion rate in experiment group 

In [27]:
net_conversion_pooled = (payments_cont_sum + payments_exp_sum) / (clicks_cont_sum + clicks_exp_sum)

In [28]:
net_conversion_se = metric_analyzer.calculate_pooled_proportion_se(net_conversion_pooled, 
                                                                clicks_cont_sum, 
                                                                clicks_exp_sum)

In [29]:
ci = metric_analyzer.construct_CI(net_conversion_diff, net_conversion_se)
ci

(-0.0116, 0.0018)

In [30]:
generate_ci_report(0, ci)

d_hat=0, ci=(-0.0116, 0.0018), -0.0116 ≤ d_hat ≤ 0.0018, 


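The confidence interval contains 0. Therefore, we fail to reject H₀: the observed change in net conversion is not statistically significant. Because the interval also lies entirely below the practical significance boundary $d_{min} = 0.0075$, the change is not a practically significant improvement either. Note that the interval does extend below $-d_{min}$, so a practically significant decrease cannot be ruled out.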
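
The pooled standard error and confidence interval used above can be sketched in plain Python. This assumes the standard pooled two-proportion formula; the function name `diff_ci` is hypothetical and is not part of `ab_tools`. To stay within data visible in this notebook, the example uses only the five days shown by `head()`, so its numbers differ from the full analysis.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(x_cont, n_cont, x_exp, n_exp, alpha=0.05):
    """Difference in proportions (experiment - control) with a pooled-SE CI."""
    p_pooled = (x_cont + x_exp) / (n_cont + n_exp)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_cont + 1 / n_exp))
    diff = x_exp / n_exp - x_cont / n_cont
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    return diff, (diff - margin, diff + margin)


# First five days only (Oct 11 to Oct 15), copied from the head() outputs.
clicks_c, payments_c = [687, 779, 909, 836, 837], [70, 70, 95, 105, 64]
clicks_e, payments_e = [686, 785, 884, 827, 832], [34, 91, 79, 92, 94]

diff, (lower, upper) = diff_ci(sum(payments_c), sum(clicks_c),
                               sum(payments_e), sum(clicks_e))
d_min = 0.0075  # practical significance boundary for net conversion

print(f"diff={diff:.4f}, ci=({lower:.4f}, {upper:.4f})")
print("statistically significant:", not (lower <= 0 <= upper))
print("practically significant increase:", lower > d_min)
```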

##### - Empirically

In [31]:
nc_cont = payments_cont/clicks_cont
nc_exp = payments_exp/clicks_exp

nc_change = nc_cont < nc_exp

***H₀***: chances for positive change = 50% $\quad Vs \quad$ ***H₁***: chances for positive change ≠ 50%

In [32]:
binomtest = ss.binomtest(k=nc_change.sum(), n=nc_change.count())
binomtest

BinomTestResult(k=10, n=23, alternative='two-sided', proportion_estimate=0.43478260869565216, pvalue=0.6776394844055176)

From the binomial test above, ***the p-value = .678 > .05***, so we fail to reject ***H₀***. The experiment group had a higher net conversion than the control group on 10 of 23 days, which is consistent with chance: there is no evidence that the direction of the daily change differs from a coin flip.     
<br> 
***Finally***, the binomial test agrees with the analytical test, which found neither statistical nor practical significance for the change.
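
For readers who want to see what the sign test computes, the exact two-sided binomial p-value can be reproduced with the standard library. This is a sketch of the usual "sum of outcomes no more likely than the observed one" definition at $p = 0.5$, with a hypothetical helper name `sign_test_pvalue`.

```python
from math import comb


def sign_test_pvalue(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for k successes in n trials at p = 0.5."""
    pmf = [comb(n, i) / 2**n for i in range(n + 1)]
    # Sum the probability of every outcome at most as likely as the observed one;
    # the small relative tolerance guards against floating-point ties.
    threshold = pmf[k] * (1 + 1e-7)
    return min(1.0, sum(prob for prob in pmf if prob <= threshold))


# 10 of 23 days had a higher net conversion in the experiment group.
p_value = sign_test_pvalue(k=10, n=23)
print(f"p-value = {p_value:.4f}")
```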

### 🧾 Recommendation

In a nutshell, the experiment tested whether screening out students who do not have enough time to dedicate to courses would result in better utilization of the coaching resources and a better student experience for those who continue their subscription after the free-trial period, without decreasing the enrollment rate. The results show neither statistical nor practical significance for the net conversion metric. In light of that, it is recommended not to launch the change.