# Design an A/B Test

[Cédric Campguilhem](https://github.com/ccampguilhem/Udacity-DataAnalyst), March 2018

<a id="Top">

## Table of contents

- [Introduction](#Introduction)
- [Experiment design](#Design)
    - [Metric choice](#Metric)
    - [Measuring standard error](#Standard deviation)
    - [Sizing](#Sizing)
- [Experiment analysis](#Analysis)
    - [Sanity checks](#Sanity)
    - [Result analysis](#Result)
    - [Recommendations](#Recommendations)
- [Follow-up experiment](#Followup)
- [Appendix](#Appendix)

<a id="Introduction">

## Introduction [*top*](#Top)

This project is part of the A/B testing course of the Udacity Data Analyst Nanodegree program. Its purpose is to analyse an experiment run at Udacity.

The experiment concerns a change triggered when a student clicks the "start free trial" button: a message asks them how much time they can dedicate to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead.

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

<a id="Design">

## Experiment design [*top*](#Top)

<a id="Metric">

### Metric choice [*Experiment design*](#Design)

The following parameters have been selected as invariants of the analysis (i.e. parameters which should not be affected by the change being analyzed). 

Invariant                    | Description                      
:----------------------------|:---------------------------------
Number of cookies            | Number of unique cookies to view the course overview page. 
Number of clicks             | Number of unique cookies to click on "Start free trial" button 
Click-through-probability    | Number of unique cookies to click on "Start free trial" button divided by the number of unique cookies to view the course overview page.

The following parameters have been selected as evaluation metrics because they collect information downstream of the change and are related to the objectives of this A/B test, which are:

- minimizing the proportion of enrolled students quitting during the trial
- keeping the same proportion of students clicking the "Start free trial" button and continuing the course afterwards

Evaluation metrics  | Description | Practical significance boundary
:-------------------|:------------|:-------------------------------
Gross conversion    | Number of user-ids to enroll in the free trial divided by the number of unique cookies to click on the "Start free trial" button. | ${d}_{min} = 0.01$
Retention           | Number of user-ids to remain enrolled after the trial divided by the number of user-ids enrolled during the trial. | ${d}_{min} = 0.01$
Net conversion      | Number of user-ids to remain enrolled after the trial divided by the number of unique cookies to click the "Start free trial" button. | ${d}_{min} = 0.0075$

Reasons for choosing the evaluation and invariant metrics:

- **Number of cookies**: A good invariant metric because it is directly randomized between the experiment and control groups. A bad evaluation metric because it will not differ between experiment and control.
- **Number of user-ids**: A bad invariant metric because the experiment may change the number of users entering the free trial period; a user-id only exists once the user checks out at enrollment. It could have been used as an evaluation metric but is redundant with gross conversion. The latter has the advantage of being normalized, so that comparisons between control and experiment groups are easier.
- **Number of clicks**: A good invariant because it is recorded before the user has a chance to see the change brought by the experiment. A bad evaluation metric for the same reason: it is recorded upstream of the change.
- **Click-through-probability**: For the same reason as the number of clicks, the click-through-probability is recorded upstream of the change. It is therefore a good invariant and a bad evaluation metric.
- **Gross conversion**: A bad invariant because it is recorded downstream of the change and may be affected by it. A good evaluation metric because it relates to the hypothesis being tested: it captures the proportion of students changing their mind after the time commitment warning.
- **Retention**: A bad invariant because it is recorded downstream of the change and may be affected by it. A good evaluation metric because it relates to the first objective: minimizing the proportion of students quitting during the trial.
- **Net conversion**: A bad invariant because it is recorded downstream of the change and may be affected by it. A good evaluation metric because it relates to the second objective: keeping the same proportion of students enrolled in the long term.

To launch this experiment we expect the gross conversion to decrease with practical significance, the retention to increase with practical significance (fewer enrolled students quitting during the trial) and the net conversion not to decrease with practical significance.

<a id="Standard deviation">

### Measuring standard error [*Experiment design*](#Design)

The standard error for each evaluation metric will be calculated with the following data:

Parameter | Value
:---------|:------
Unique cookies to view course overview page per day |	40000
Unique cookies to click "Start free trial" per day | 3200
Enrollments per day |	660
Click-through-probability on "Start free trial" |	0.08
Probability of enrolling, given click |	0.20625
Probability of payment, given enroll |	0.53
Probability of payment, given click	| 0.1093125

The *probability of enrolling, given click* corresponds to the *gross conversion* metric. *Probability of payment, given enroll* corresponds to the *retention* metric. Finally, *probability of payment, given click* corresponds to the *net conversion* metric. As we are dealing with probabilities, we will assume a binomial distribution. We can then estimate the standard error of each metric analytically:

\begin{align}
SE = \sqrt{\frac{p(1-p)}{n}}
\end{align}

Where:
- p is the probability of event
- n is the number of repetitions of the event

In [153]:
nb_cookies = 5000.
nb_clicks = nb_cookies * 3200. / 40000.
nb_enrollments = nb_clicks * 660. / 3200.
print nb_clicks, nb_enrollments

400.0 82.5


In [154]:
import math
stddev_gross = math.sqrt(0.20625 * (1 - 0.20625) / nb_clicks)
stddev_retention = math.sqrt(0.53 * (1 - 0.53) / nb_enrollments)
stddev_conversion = math.sqrt(0.1093125 * (1 - 0.1093125) / nb_clicks)
print stddev_gross, stddev_retention, stddev_conversion

0.020230604137 0.0549490121785 0.0156015445825


The sample size is 5000 cookies. We can then assume that we will have 400 clicks on the "Start free trial" button and 82.5 enrollments. The standard errors are reported in the table below:

Evaluation metrics | Units of analysis (n) | Estimated standard error
:------------------|:----------------------|:----------------------------
Gross conversion   | cookie (400)          | 0.0202
Retention          | user-id (82.5)        | 0.0549
Net conversion     | cookie (400)          | 0.0156

Gross conversion and net conversion use the cookie both as unit of analysis and as unit of diversion, so the analytical standard error calculated here should be close to the empirical value. This is not the case for the retention metric: its unit of analysis is the user-id while the unit of diversion is the cookie, so the empirical variability may differ from the estimate above.

<a id="Sizing">

### Sizing [*Experiment design*](#Design)

#### Number of samples vs power

Several hypotheses are tested in this experiment. Testing multiple hypotheses increases the risk of making a type I error (incorrect rejection of the null hypothesis). The Bonferroni correction may be used in a context where type I errors must be avoided, but is discouraged where type II errors must be avoided (see this [paper](https://www.onlinelibrary.wiley.com/doi/pdf/10.1111/opo.12131) for reference). I will not use the Bonferroni correction, in order to limit type II errors (incorrect retention of the null hypothesis).

I have used the online calculator provided by [Evan Miller](http://www.evanmiller.org/ab-testing/sample-size.html) to estimate the sample size for the A/B test. The results are provided in the table below:

Parameter            | Base conversion rate | Practical significance | $\alpha$ | $1 - \beta$ | Sample size per variation
:--------------------|:---------------------|:-----------------------|:---------|:------------|:-------------------------
Gross conversion     | 20.625 %             | 1.0 %                  | 5.0 %    | 80.0 %      | 25835
Retention            | 53.0 %               | 1.0 %                  | 5.0 %    | 80.0 %      | 39115
Net conversion       | 10.93125 %           | 0.75 %                 | 5.0 %    | 80.0 %      | 27413
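
The sample sizes above can be approximated offline. The block below is a minimal sketch, assuming the calculator uses the usual normal-approximation formula for a two-sided test on two proportions; the function name `sample_size_per_variation` and the mirroring of base rates above 50 % are assumptions made here because they reproduce the three rows of the table, not a documented description of the calculator. It is written for Python 3 with the standard library only, unlike the Python 2 cells of this notebook.

```python
from statistics import NormalDist


def sample_size_per_variation(base_rate, min_effect, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    # Assumption: rates above 50% are mirrored (p -> 1 - p) before adding the
    # minimum detectable effect; this choice reproduces the retention row.
    p = min(base_rate, 1.0 - base_rate)
    q = p + min_effect
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 5%
    z_beta = NormalDist().inv_cdf(power)               # about 0.84 for power = 80%
    sd_null = (2.0 * p * (1.0 - p)) ** 0.5             # spread under the null hypothesis
    sd_alt = (p * (1.0 - p) + q * (1.0 - q)) ** 0.5    # spread under the alternative
    return round((z_alpha * sd_null + z_beta * sd_alt) ** 2 / min_effect ** 2)


for name, rate, effect in [("Gross conversion", 0.20625, 0.01),
                           ("Retention", 0.53, 0.01),
                           ("Net conversion", 0.1093125, 0.0075)]:
    print(name, sample_size_per_variation(rate, effect))
```

The result is a number of units of analysis per group (clicks for the two conversion metrics, enrollments for retention), so it still has to be converted to page views.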

The retention metric is the one requiring the most samples per variation. As this metric uses the user-id (enrollment) as unit of analysis, the sample size must also be converted from enrollments to page views, which increases the number of page views even further: only 8 % of views lead to clicks and only 20.625 % of clicks lead to enrollments:

\begin{equation}
{pageviews} = \frac{39115 * 2}{0.08 * 0.20625}
\end{equation}

The equation above assumes that both control and test groups are seeing the same number of pages and leads to 4741212 page views.

In [156]:
print 1 - (0.95 * 0.95)
print 39115. / (0.08 * 0.20625) * 2.

0.0975
4741212.12121


#### Duration vs exposure

We have 40000 unique cookies viewing the course overview page per day. If we divert half of the traffic to the experiment, the duration would be:

\begin{equation}
duration = \frac{4741212}{40000 * 0.5}
\end{equation}

The equation above leads to 238 days! That is a long experiment, and Udacity does not want to run it for that long. We need to rework some of the decisions made previously.

The retention metric is really demanding in terms of page views. If we drop this metric, the sizing metric becomes net conversion, which requires 685325 page views. If we increase the diverted fraction to two-thirds of the traffic, this leads to a duration of 26 days, which is much more manageable.
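
The conversion from sample size to page views and then to days can be wrapped in a small helper, which makes it easy to try other traffic fractions. This is a minimal Python 3 sketch; the function name `experiment_days` and its parameter names are illustrative and not part of the original analysis.

```python
import math


def experiment_days(sample_per_group, units_per_pageview, traffic_fraction,
                    daily_pageviews=40000):
    """Return (page views needed, days needed) for a two-group experiment."""
    # Two groups, and only a fraction of page views produces a unit of analysis.
    pageviews = 2.0 * sample_per_group / units_per_pageview
    # Only part of the daily traffic is diverted to the experiment.
    days = math.ceil(pageviews / (daily_pageviews * traffic_fraction))
    return pageviews, days


# Retention: enrollments per page view = 0.08 * 0.20625, half of the traffic.
print(experiment_days(39115, 0.08 * 0.20625, 0.5))
# Net conversion: clicks per page view = 0.08, two-thirds of the traffic.
print(experiment_days(27413, 0.08, 2.0 / 3.0))
```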

In [160]:
print 4741212 / (40000 * 0.5)

237.0606


In [161]:
print 27413 / 0.08 * 2.
print 685325.0 / (40000 * 0.66)

685325.0
25.959280303


This design exposes one third of students to the new feature for less than one month. The purpose of the feature is to reduce the number of students who start the free trial without being willing to dedicate more than 5 hours a week to the course. It should not change the mind of students who want to take the course and agree to dedicate enough time to it. Additionally, no personal data is required, so there is no ethical risk, and the change itself is harmless to users. Overall, running this test is probably an acceptable risk.

<a id="Analysis">

## Experiment analysis [*top*](#Top)

<a id="Sanity">

### Sanity checks [*Experiment analysis*](#Analysis)

With an alpha of 0.05, the critical $z^*$ value for a two-tailed test is 1.96.

The notation used is:

- number of cookies in the experiment group: $n_1$
- number of cookies in the control group: $n_2$
- expected probability of being assigned to a given group (by design): $p=0.5$
- observed probability of being in the control group: ${p}_{obs}=\frac{n_2}{n_1+n_2}$


The pooled standard error is:

\begin{equation}
SE = \sqrt{\frac{p(1-p)}{n_1+n_2}}
\end{equation}

The margin of error is:

\begin{equation}
margin = SE * z^*
\end{equation}

The confidence interval is:

\begin{equation}
CI = [0.5 - margin, 0.5 + margin]
\end{equation}

The sanity check is passed if the observed probability $p_{obs}$ is within the confidence interval.

In [162]:
import scipy.stats

#Probability
alpha = 0.05
z_star = -scipy.stats.norm.ppf(alpha / 2.) #two-tailed tests
n1 = 344660.
n2 = 345543.
p = 0.5
p_obs = n2 / (n1 + n2)

#Pooled Standard error
SE = math.sqrt(p * (1 - p) / (n1 + n2))

#Confidence interval
me = SE * z_star
ci = p - me, p + me

#Measure
print z_star, SE, ci, p_obs, ci[0] <= p_obs <= ci[1]

1.95996398454 0.000601840740294 (0.49882041382459419, 0.50117958617540581) 0.500639666881 True


For number of cookies:
- confidence interval: [0.4988, 0.5012]
- observed probability: 0.5006
- sanity check: **passed**

In [163]:
#Probability
alpha = 0.05
z_star = -scipy.stats.norm.ppf(alpha / 2.) #two-tailed test
n1 = 28325.
n2 = 28378.
p = 0.5
p_obs = n2 / (n1 + n2)

#Pooled Standard error
SE = math.sqrt(p * (1 - p) / (n1 + n2))

#Confidence interval
me = SE * z_star
ci = p - me, p + me

#Measure
print z_star, SE, ci, p_obs, ci[0] <= p_obs <= ci[1]

1.95996398454 0.0020997470797 (0.49588457134714631, 0.50411542865285364) 0.500467347407 True


For number of clicks:
- confidence interval: [0.4959, 0.5041]
- observed probability: 0.5005
- sanity check: **passed**

In [164]:
#Probability
alpha = 0.05
z_star = -scipy.stats.norm.ppf(alpha / 2.) #two-tailed test
n1 = 344660.
c1 = 28325.
n2 = 345543.
c2 = 28378.
p = c2 / n2
p_obs = c1 / n1

#Standard error
SE = math.sqrt(p * (1 - p) / n2)

#Confidence interval
me = SE * z_star
ci = p - me, p + me

#Measure
print z_star, SE, ci, p, p_obs, ci[0] <= p_obs <= ci[1]

1.95996398454 0.000467068276555 (0.081210376574208529, 0.083041250574945116) 0.0821258135746 0.0821824406662 True


For the click-through-probability we use a slightly different approach. The click-through-probability of the control group (0.0821) is taken as the reference, and the standard error is computed from the control group only instead of being pooled. We then check that the click-through-probability of the experiment group lies within the confidence interval:

- confidence interval: [0.0812, 0.0830]
- observed probability: 0.0822
- sanity check: **passed**
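
As a cross-check, the same invariant can also be tested on the difference between the two click-through-probabilities, with a pooled standard error as in the effect size formulas of the result analysis. This is an alternative to the approach above rather than the method used in this notebook; the function name `ctp_difference_check` is illustrative, and the sketch uses Python 3 and the standard library only.

```python
import math
from statistics import NormalDist


def ctp_difference_check(clicks_control, cookies_control,
                         clicks_experiment, cookies_experiment, alpha=0.05):
    """Check that the difference in click-through-probability is compatible with 0."""
    z_star = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-tailed critical value
    p_pool = (clicks_control + clicks_experiment) / (cookies_control + cookies_experiment)
    se = math.sqrt(p_pool * (1 - p_pool)
                   * (1 / cookies_control + 1 / cookies_experiment))
    d = clicks_experiment / cookies_experiment - clicks_control / cookies_control
    margin = z_star * se
    # The check passes when 0 lies inside [d - margin, d + margin].
    return d, margin, abs(d) <= margin


# Counts taken from the sanity check cells above.
d, margin, passed = ctp_difference_check(28378, 345543, 28325, 344660)
print(d, margin, passed)
```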

<a id="Result">

### Result analysis [*Experiment analysis*](#Analysis)

#### Effect size tests

The dataset records 23 days of experiment in terms of unique cookies, clicks, enrollments and payments. This is shorter than the duration we designed for. All other things being equal, this means that we lose statistical power $1 - \beta$ (the percentage of the time the minimum effect size will be detected, assuming it exists): fewer samples means a higher $\beta$.

That being said, we can calculate a confidence interval and state whether each evaluation metric is statistically and practically significant.

The confidence interval is now calculated on the difference between the two groups, with a pooled probability:

\begin{equation}
p = \frac{events_{control} + events_{experiment}}{clicks_{control} + clicks_{experiment}} \\
SE = \sqrt{p(1-p)*\Bigl(\frac{1}{clicks_{control}}+\frac{1}{clicks_{experiment}}\Bigr)} \\
margin = SE * z^* \\
d = \frac{events_{experiment}}{clicks_{experiment}} - \frac{events_{control}}{clicks_{control}} \\
CI = [d - margin, d + margin]
\end{equation}

The $z^*$ value is calculated from alpha assuming a two-tailed test:

\begin{equation}
z^* = 1.96
\end{equation}

A metric is statistically significant if 0 is not included in the confidence interval (there is a high chance that experiment and control differ). Additionally, it is practically significant if the whole interval lies beyond the practical significance boundary, that is below $-d_{min}$ or above $d_{min}$: there is a high chance that the business sees a difference.

In [165]:
z_star = -scipy.stats.norm.ppf(0.05 / 2.) #two-tailed test, no Bonferroni correction
#z_star = -scipy.stats.norm.ppf(0.025 / 2.) #two-tailed test + Bonferroni correction
print z_star

1.95996398454


In [166]:
#Gross conversion
events_control = 3785
clicks_control = 17293.
events_experiment = 3423
clicks_experiment = 17260.
d_min = 0.01
p = (events_control + events_experiment) / (clicks_control + clicks_experiment)
SE = math.sqrt(p*(1-p)*(1./clicks_control + 1./clicks_experiment))
margin = SE * z_star
d = events_experiment / clicks_experiment - events_control / clicks_control
CI = d - margin, d + margin
print p, SE, margin, d, CI, not(CI[0] <= 0. <= CI[1]), not(CI[0] <= d_min <= CI[1])

0.208607067404 0.00437167538523 0.00856832630714 -0.0205548745804 (-0.029123200887504669, -0.011986548273218461) True True


In [167]:
#Net conversion
events_control = 2033
clicks_control = 17293.
events_experiment = 1945
clicks_experiment = 17260.
d_min = 0.0075
p = (events_control + events_experiment) / (clicks_control + clicks_experiment)
SE = math.sqrt(p*(1-p)*(1./clicks_control + 1./clicks_experiment))
margin = SE * z_star
d = events_experiment / clicks_experiment - events_control / clicks_control
CI = d - margin, d + margin
print p, SE, margin, d, CI, not(CI[0] <= 0. <= CI[1]), not(CI[0] <= d_min <= CI[1])

0.115127485312 0.00343413351293 0.00673077800345 -0.00487372267454 (-0.011604500677993734, 0.0018570553289054001) False True


The results are reported in the table below:

Evaluation metric | Lower bound | Upper bound | Statistical significance | Practical significance
:-----------------|:------------|:------------|:-------------------------|:----------------------
gross conversion  | -0.0291     | -0.0120     | Yes                      | Yes
net conversion    | -0.0116     |  0.0019     | No                       | No
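
The two significance columns of the table can be derived programmatically from the interval bounds. The block below is a minimal Python 3 sketch (standard library only); the function name `effect_size_interval` is illustrative, and it assumes that "practically significant" means the whole interval lies below $-d_{min}$ or above $d_{min}$.

```python
import math
from statistics import NormalDist


def effect_size_interval(events_control, clicks_control,
                         events_experiment, clicks_experiment,
                         d_min, alpha=0.05):
    """Return (lower, upper, statistically significant, practically significant)."""
    z_star = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-tailed critical value
    p_pool = (events_control + events_experiment) / (clicks_control + clicks_experiment)
    se = math.sqrt(p_pool * (1 - p_pool)
                   * (1 / clicks_control + 1 / clicks_experiment))
    d = events_experiment / clicks_experiment - events_control / clicks_control
    lower, upper = d - z_star * se, d + z_star * se
    statistically_significant = not (lower <= 0.0 <= upper)
    # Assumption: the whole interval must lie beyond -d_min or beyond +d_min.
    practically_significant = upper < -d_min or lower > d_min
    return lower, upper, statistically_significant, practically_significant


gross = effect_size_interval(3785, 17293, 3423, 17260, d_min=0.01)
net = effect_size_interval(2033, 17293, 1945, 17260, d_min=0.0075)
print(gross)
print(net)
```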

#### Sign tests

We have records for 23 days of experiment. The number of enrollments is higher in the experiment group on 4 days (in a row). The number of payments is higher in the experiment group on 10 different days.

With the use of [GraphPad](https://www.graphpad.com/quickcalcs/binomial1.cfm), setting the number of "successes" to 4 and 10 respectively and a probability of 0.5 we get the following numbers:

Gross conversion:
- Number of successes: 4
- Probability: 0.5
- Two-tail p-value: 0.0026
- Alpha value: 0.05 (the p-value reported above is already two-tailed)
- The result is statistically significant.

Net conversion:
- Number of successes: 10
- Probability: 0.5
- Two-tail p-value: 0.6776 
- Alpha value: 0.05 (the p-value reported above is already two-tailed)
- The result is not statistically significant.
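
The two p-values can be reproduced without the online calculator. The block below is a minimal Python 3 sketch of an exact two-tailed binomial sign test with a success probability of 0.5; the function name `sign_test_p_value` is illustrative, and doubling the smaller tail is valid here because the distribution is symmetric.

```python
from math import comb


def sign_test_p_value(successes, trials):
    """Exact two-tailed p-value of a sign test (success probability 0.5)."""
    total = 2 ** trials  # number of equally likely outcomes
    lower_tail = sum(comb(trials, k) for k in range(0, successes + 1)) / total
    upper_tail = sum(comb(trials, k) for k in range(successes, trials + 1)) / total
    # The distribution is symmetric for p = 0.5, so double the smaller tail.
    return min(1.0, 2.0 * min(lower_tail, upper_tail))


print(round(sign_test_p_value(4, 23), 4))   # gross conversion
print(round(sign_test_p_value(10, 23), 4))  # net conversion
```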

#### Summary

As introduced in section [Number of samples vs power](#Sizing), I have not used the Bonferroni correction, because it should not be used where type II errors have to be avoided.

In this experiment, the null hypotheses can be written as follows:

- $H_{0,1}$: the proportion of students quitting during the trial period remains the same.
- $H_{0,2}$: the proportion of students clicking the "start trial" button and continuing after the trial period remains the same.

The second null hypothesis is critical: if the proportion decreases, it may affect Udacity's revenue.

A type II error in this context would be to find no change in the proportion of students continuing after the trial period while the proportion has actually decreased. This is probably not an acceptable risk for the business.

<a id="Recommendations">

### Recommendations [*Experiment analysis*](#Analysis)

The experiment shows that the impact on gross conversion is both statistically and practically significant. The change in net conversion is not significant.

The initial objectives of the change are:

- minimizing the proportion of enrolled students quitting during the trial (retention)
- keeping the same proportion of students clicking the "Start free trial" button and continuing the course afterwards (net conversion)

As a reminder, the null hypotheses are:

- $H_{0,1}$: the proportion of students quitting during the trial period remains the same.
- $H_{0,2}$: the proportion of students clicking the "start trial" button and continuing after the trial period remains the same.

The significant (statistical and practical) decrease in gross conversion shows that the proportion of clicking students who enroll in the free trial has decreased. We are confident about this at the 95 % level because both the lower and upper bounds of the confidence interval are negative.

Yet the experiment failed to reject the second null hypothesis. Failing to reject the null hypothesis is not a proof that the null hypothesis is valid. The confidence interval for net conversion ranges from negative to positive values and includes the negative practical significance boundary (-0.0075), meaning that there is a risk that the proportion of students remaining enrolled after the free trial decreases.

This is a risk of revenue loss for Udacity, so I would not recommend implementing the change.

<a id="Followup">

## Follow-up experiment [*top*](#Top)

I really wanted to follow the Data Analyst course, so quitting before the end of the free trial was not an option for me. I already knew that I had to dedicate time to the course: my feeling is that you cannot learn something if you are not willing to spend some time on it!

However, as a French person, I must say that the way Udacity courses work is quite different from my experience with French colleges and high schools, and I would have advised any colleague interested in the course to get a taste of it before committing. The balance between theory and practical examples is the complete opposite: Udacity courses offer far more practical examples than theory. I like it, but it is quite different from what I had experienced so far.

I am not sure how other countries balance theory and practice in their education, but different cultures may lead to different kinds of frustration.

One experiment idea would be to suggest a specific lesson explaining educational choices made by Udacity to give a taste to students before enrollment. The idea would be similar to the experiment conducted here: reduce frustrations of students in the free trial period while focusing Udacity coaches on long-term enrolled students without decreasing significantly long-term enrollments. The specific lesson may not have any coaching at all and students willing to start a new course would be redirected to such lesson first (whatever the course they have chosen).

Due to the very close nature of the experiment with the one we have conducted here, I would make almost the same design choices:

- Evaluation metrics: net conversion, gross conversion, retention.
- Invariants: number of unique cookies, number of clicks on "start free trial" button, click-through-probability for "start free trial button".

The risk of taking such an experiment is the entry cost with no guarantee of break even:

- Modification of the user interface to redirect free trials to that specific lesson
- Creation of the lesson itself requires some time

I would keep the same unit of diversion (cookies), as this specific lesson would not require user checkout. Without changing the practical significance boundaries, the duration of the experiment would be the same as for this one: 26 days if we divert two-thirds of the traffic.

The Bonferroni correction would not be used here either, to limit the risk of missing an actual decrease in net conversion.

The test may run at the same time as other tests as long as they do not overlap with this experiment (that is, other tests related to free trials and enrollments). Overlapping tests would increase the risk of false positives (type I errors), just as if we were considering more null hypotheses. A correction such as Bonferroni could limit this increase, but at the cost of more type II errors. As this is something we want to avoid, I would discourage any overlapping experiment.

If, at the end of this experiment, gross conversion shows a statistically and practically significant decrease and net conversion does not, then this new feature may be launched.

<a id="Appendix">

## Appendix [*top*](#Top)

[Evan Miller](http://www.evanmiller.org/ab-testing/sample-size.html) online calculator for A/B tests sizing.<hr>

[Discussion](https://www.widerfunnel.com/3-mistakes-invalidate-ab-test-results/) on whether or not to use Bonferroni correction. [Paper](https://www.onlinelibrary.wiley.com/doi/pdf/10.1111/opo.12131) on the same theme.<hr>

[Bonferroni](https://en.wikipedia.org/wiki/Bonferroni_correction) correction on Wikipedia.<hr>

Discussion on running multiple A/B tests at the [same time](https://conversionxl.com/blog/can-you-run-multiple-ab-tests-at-the-same-time/).