Udacity is considering online experiments to test potential improvements to its website. In each experiment, two versions of the website are shown to different users, usually the existing website and a potential change. My goal is to design and analyze an A/B test and write up a recommendation on whether Udacity should launch the new version of the website.

The project involves choosing and characterizing metrics to evaluate experiments, designing an experiment with enough statistical power, analyzing the results and drawing valid conclusions, and ensuring that the participants of experiments are adequately protected.

Here is a screenshot of the experiment:
![Free trial screener](images/experiment_screenshot.png)

## Experiment Design

### Metric Choice
>For each metric, explain both why you did or did not use it as an invariant metric and why you did or did not use it as an evaluation metric. Also, state what results you will look for in your evaluation metrics in order to launch the experiment.

**Invariant metrics**
- a) Number of cookies: Number of unique cookies to view the course overview page.
- b) Number of clicks: Number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered).
- c) Click-through-probability: Number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page.

**Evaluation metrics**
- a) Gross conversion: Number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button.
- b) Retention: Number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout.
- c) Net conversion: Number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button.

**Reasons behind metric selection**
- Number of cookies is a good population sizing invariant because cookies are randomly assigned between the control and experiment groups.
- Number of cookies, Number of clicks, and Click-through-probability are all measured before the free trial screener is triggered, so the change cannot affect them and they should be invariant.
- Gross conversion, Retention, and Net conversion can be used as evaluation metrics because the numerator of each metric is measured after the free trial screener is triggered.
- These are the metrics that Udacity is trying to track in this experiment. By adding the free trial screener after the "Start free trial" button, the experiment could affect the number of user-ids that complete checkout.

**Results to look for in order to launch the experiment**
- Our evaluation metrics need to be both statistically and practically significant. In other words, we need to be confident that the true difference between the experiment and control groups is at least as large as the minimum difference required to declare practical significance.
- Since the workload expectations are set upfront, the number of students completing checkout might go down, so we expect Gross conversion to decrease.
- We expect Retention to increase.
- Net conversion might go down; however, its direction is hard to predict beforehand.

### Measuring Standard Deviation
>For each of your evaluation metrics, indicate whether you think the analytic estimate would be comparable to the empirical variability, or whether you expect them to be different (in which case it might be worth doing an empirical estimate if there is time). Briefly give your reasoning in each case.

For a binomial distribution with probability $p$ and sample size $N$, the analytical standard deviation of the estimated proportion is $sd = \sqrt{\frac{p(1-p)}{N}}$.

**_Analytical Estimate of Standard Deviation given 5000 cookies per day_**

| Evaluation Metric | Standard Deviation |
|:-------------------:|:--------------------:|
| Gross Conversion  | .02023 |
| Retention         | .05495 |
| Net Conversion    | .01560 |

- The analytically computed variability is likely to be close to the empirically computed variability when the unit of diversion and the unit of analysis are the same (cookie, in this case). This holds for Gross conversion and Net conversion.
- For the Retention metric, the unit of analysis is user-id, which is not the same as the unit of diversion (cookie). The empirical variability of this metric is likely to be higher than the analytical estimate, so it might be worth doing an empirical estimate of variability for this metric if there is time.
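The values in the table above can be reproduced with a short sketch. It assumes a click-through probability of 0.08 (the value implied by the ratio of sample size to pageviews in the Sizing table below) and uses the baseline rates listed there; the function and variable names are my own.

```python
import math

def binomial_sd(p, n):
    """Analytic standard deviation of a proportion p estimated from n units."""
    return math.sqrt(p * (1 - p) / n)

cookies = 5000              # cookies viewing the course overview page
ctp = 0.08                  # assumed click-through probability (not stated above)
gross_conversion = 0.20625  # baseline rates from the Sizing table
retention = 0.53
net_conversion = 0.10931

clicks = cookies * ctp                   # denominator of Gross and Net conversion
enrollments = clicks * gross_conversion  # denominator of Retention

sd_gross = binomial_sd(gross_conversion, clicks)
sd_retention = binomial_sd(retention, enrollments)
sd_net = binomial_sd(net_conversion, clicks)

print(f"Gross conversion: {sd_gross:.5f}")
print(f"Retention:        {sd_retention:.5f}")
print(f"Net conversion:   {sd_net:.5f}")
```

Retention has the largest standard deviation because its denominator (enrollments) is much smaller than the number of clicks.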

### Sizing
#### Number of Samples vs. Power
>Using the analytic estimates of variance, how many pageviews total (across both groups) would you need to collect to adequately power the experiment? Use an alpha of 0.05 and a beta of 0.2. Make sure you have enough power for each metric.

I did not use the Bonferroni correction because we are measuring three highly correlated metrics, for which the correction would be too conservative. The sample sizes needed to adequately power the experiment were calculated using this [online calculator](http://www.evanmiller.org/ab-testing/sample-size.html).

| Evaluation Metrics | Baseline Conversion Rate | dmin | alpha | beta | Sample Size (per group) | # Pageviews (total) |
|:---------------:|:----------------:|:----------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| Retention         | .53 | 0.01 | 0.05 | 0.2 | 39115 | 4741212 |
| Net Conversion    | .10931 | 0.0075 | 0.05 | 0.2 | 27413 | 685325 |
| Gross Conversion  | .20625 | 0.01 | 0.05 | 0.2 | 25835 | 645875 |
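The pageview column follows from the per-group sample size: double it for the two groups, then divide by the fraction of pageviews that reach the metric's denominator. The sketch below is one way to write that conversion, assuming a click-through probability of 0.08 (an assumption consistent with the table, not a value stated in this write-up); the helper name is hypothetical.

```python
def pageviews_needed(sample_size_per_group, units_per_pageview):
    """Total pageviews (both groups) needed to collect the required sample.

    units_per_pageview is the expected number of analysis units (clicks or
    enrollments) produced by one pageview. The result is rounded to the
    nearest integer to match the table.
    """
    return round(2 * sample_size_per_group / units_per_pageview)

ctp = 0.08                  # assumed click-through probability
gross_conversion = 0.20625  # baseline enrollments per click

clicks_per_pageview = ctp
enrollments_per_pageview = ctp * gross_conversion

pageviews = {
    "Retention": pageviews_needed(39115, enrollments_per_pageview),
    "Net Conversion": pageviews_needed(27413, clicks_per_pageview),
    "Gross Conversion": pageviews_needed(25835, clicks_per_pageview),
}

for metric, n in pageviews.items():
    print(f"{metric}: {n}")
```

Retention needs far more pageviews because only a small fraction of pageviews end in an enrollment.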

#### Duration vs. Exposure
>What percentage of Udacity's traffic would you divert to this experiment (assuming there were no other experiments you wanted to run simultaneously)? Is the change risky enough that you wouldn't want to run on all traffic?

>Given the percentage you chose, how long would the experiment take to run, using the analytic estimates of variance? 

| Evaluation Metrics | Traffic diverted | Duration (days) |
|:-------------------:|:--------------------:|:--------------------:|
| Retention       | 1 | 119 |
| Net Conversion  | 1 | 18 |
| Gross Conversion    | 1 | 17 |

If we divert 100% of Udacity’s traffic, the experiment still needs 119 days to collect enough pageviews for Retention, which is unreasonably long.
However, if we drop Retention and measure only Net conversion and Gross conversion, the experiment can be run in 18 days with enough pageviews to adequately power it.

The experiment poses minimal risk to both students and Udacity. All traffic can be directed to this experiment if no other experiments are running in parallel. Even if we divert only 50% of Udacity's traffic, the experiment can be completed in about 36 days.
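The durations in the table can be sketched as a simple division of required pageviews by daily traffic. The daily figure of 40,000 pageviews is an assumption on my part (it is not stated in this write-up, though it is consistent with the durations above), and the function name is hypothetical.

```python
import math

DAILY_PAGEVIEWS = 40000  # assumed daily traffic; not given in this write-up

def days_needed(pageviews, daily_pageviews=DAILY_PAGEVIEWS, traffic_fraction=1.0):
    """Whole days required to collect `pageviews` at the given exposure."""
    return math.ceil(pageviews / (daily_pageviews * traffic_fraction))

durations = {
    "Retention": days_needed(4741212),
    "Net Conversion": days_needed(685325),
    "Gross Conversion": days_needed(645875),
}

for metric, days in durations.items():
    print(f"{metric}: {days} days at 100% traffic")
```

Lowering `traffic_fraction` lengthens the experiment in inverse proportion, which is the trade-off between duration and exposure discussed above.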

## Experiment Analysis

### Sanity Checks
> For each of your invariant metrics, give the 95% confidence interval for the value you expect to observe, the actual observed value, and whether the metric passes your sanity check.

| Invariant metrics | Lower bound | Upper bound | Observed | Passes |
|:--------------:|:---------------:|:---------------:|:---------------:|:---------------:|
| Number of cookies                               | 0.4988  | 0.5012 | 0.5006 | Yes |
| Number of clicks on “Start free trial”          | 0.4959  | 0.5041 | 0.5005 | Yes |
| Click through probability on “Start free trial” | -0.0013 | 0.0013 | 0.0001 | Yes |
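For the two count metrics, the check asks whether the fraction of units in the control group is consistent with an even 50/50 split. A minimal sketch of that check follows; the control and experiment totals are assumed inputs chosen for illustration (they are not listed in this write-up), and the function name is my own.

```python
import math

def split_sanity_check(control_count, experiment_count, z=1.96):
    """95% interval for the control-group fraction under an even split."""
    total = control_count + experiment_count
    se = math.sqrt(0.5 * 0.5 / total)  # binomial SD with p = 0.5
    margin = z * se
    observed = control_count / total
    lower, upper = 0.5 - margin, 0.5 + margin
    return lower, upper, observed, lower <= observed <= upper

# Assumed cookie totals for the control and experiment groups.
lower, upper, observed, passed = split_sanity_check(345543, 344660)
print(f"[{lower:.4f}, {upper:.4f}] observed={observed:.4f} passes={passed}")
```

The click-through probability row is different: it reports the difference between the two groups, so its interval is centered on 0 rather than 0.5.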

### Result Analysis
#### Effect Size Tests
>For each of your evaluation metrics, give a 95% confidence interval around the difference between the experiment and control groups. Indicate whether each metric is statistically and practically significant.

| Evaluation metrics | Lower bound | Upper bound | Observed | Statistical significance | Practical significance |
|:-------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| Retention        | 0.008104  | 0.05409  | 0.03109  | Yes | No  |
| Net conversion   | -0.01160  | 0.001857 | -0.0049  | No  | No  |
| Gross conversion | -0.02912  | -0.01199 | -0.02055 | Yes | Yes |

A metric was considered statistically significant if its confidence interval did not contain 0, and practically significant if the interval also lay entirely beyond the practical significance boundary ($\pm d_{min}$).
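One common way to build such an interval is a pooled standard error for the difference of two proportions; the sketch below uses that approach, which I am assuming rather than quoting from the analysis. The click and enrollment counts are assumed inputs for illustration (they do not appear in this write-up), and both function names are hypothetical.

```python
import math

def diff_confidence_interval(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """95% CI for (experiment rate - control rate) using a pooled SE."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    diff = x_exp / n_exp - x_cont / n_cont
    return diff - z * se, diff + z * se, diff

def significance(lower, upper, d_min):
    """Return (statistically significant, practically significant)."""
    statistically = not (lower <= 0 <= upper)
    # Practical significance: the whole interval lies beyond +/- d_min.
    practically = lower > d_min or upper < -d_min
    return statistically, practically

# Assumed Gross conversion counts: enrollments and clicks per group.
lower, upper, diff = diff_confidence_interval(3785, 17293, 3423, 17260)
print(f"Gross conversion: [{lower:.5f}, {upper:.5f}], observed {diff:.5f}")
print(significance(lower, upper, d_min=0.01))
```

Applying `significance` to the bounds in the table with each metric's $d_{min}$ gives the same Yes/No pattern shown there.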

#### Sign Tests
>For each of your evaluation metrics, do a sign test using the day-by-day data, and report the p-value of the sign test and whether the result is statistically significant.

We performed the sign test using this [calculator](http://graphpad.com/quickcalcs/binomial1.cfm).

Under the null hypothesis of no change, each day has a 0.5 probability of showing a positive change:

| Evaluation metrics | # days with positive change | # days | two-tail P value  | 
|:--------------:|:---------------:|:---------------:|:---------------:|
| Retention        | 13 | 23 | 0.6776  | 
| Net conversion   | 10 | 23 | 0.6776  | 
| Gross conversion | 4  | 23 | 0.0026  | 
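The same p-values can be computed without the online calculator. The sketch below is an exact two-sided binomial test with a success probability of 0.5, written with only the standard library; the function name is my own.

```python
from math import comb

def sign_test_p_value(positive_days, total_days):
    """Exact two-sided sign test p-value under a 0.5 success probability."""
    def cdf(k):
        # P(X <= k) for X ~ Binomial(total_days, 0.5); cdf(-1) is 0.
        return sum(comb(total_days, i) for i in range(k + 1)) / 2 ** total_days

    lower_tail = cdf(positive_days)          # P(X <= observed)
    upper_tail = 1 - cdf(positive_days - 1)  # P(X >= observed)
    return min(1.0, 2 * min(lower_tail, upper_tail))

p_retention = sign_test_p_value(13, 23)
p_net = sign_test_p_value(10, 23)
p_gross = sign_test_p_value(4, 23)

print(f"Retention:        {p_retention:.4f}")
print(f"Net conversion:   {p_net:.4f}")
print(f"Gross conversion: {p_gross:.4f}")
```

Doubling the smaller tail is valid here because the binomial distribution is symmetric when the success probability is 0.5.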

#### Summary
>State whether you used the Bonferroni correction, and explain why or why not. If there are any discrepancies between the effect size hypothesis tests and the sign tests, describe the discrepancy and why you think it arose.

We did not use the Bonferroni correction for the sign tests because we were only measuring one metric at a time.

The effect size test for Retention is statistically significant, whereas the sign test is not. There are two possible reasons:
1) We did not run the experiment long enough to collect adequate pageviews, and hence did not have enough power for this metric.
2) The sign test is a non-parametric test and has lower power than the effect size test.

For Net conversion, neither the effect size test nor the sign test is statistically significant, whereas for Gross conversion both tests are statistically significant.

### Recommendation

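There was a statistically and practically significant decrease in Gross conversion, but no significant change in Net conversion. In other words, the "free trial screener" decreased the number of students who completed checkout, but did not significantly change the number of students who remained enrolled past the 14-day boundary. In addition, the confidence interval for Net conversion (-0.0116 to 0.0019) includes the negative practical significance boundary of -0.0075, so a meaningful loss in paying students cannot be ruled out. My recommendation is not to launch the change.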

## Follow-Up Experiment
>Give a high-level description of the follow up experiment you would run, what your hypothesis would be, what metrics you would want to measure, what your unit of diversion would be, and your reasoning for these choices.

Udacity can test a change that adds a “**Book Welcome Videochat**” button after a student enrolls in the course. If the student clicks "Start free trial" and subsequently checks out and enrolls, they will be required to attend a video chat appointment with a Udacity coach within the 14-day free trial.

The **hypothesis** is that talking to a Udacity coach will give students a personal touch, boost their morale, and set clear expectations about the course and the benefits of completing it, illustrated with past examples. If the hypothesis holds, the change will boost students' enthusiasm and give them momentum to complete the course.

The **unit of diversion** will be user-id because the change takes effect after the student enrolls in the course. We want each student to have a consistent experience independent of platform and device. The same user-id cannot book a video chat appointment twice.

**Invariant metrics**
- a) Number of cookies
- b) Number of clicks
- c) Click-through-probability
- d) Number of user-ids
- e) Gross conversion 

- User-ids are explicitly randomized between the control and experiment groups. One user-id can correspond to multiple cookies, so cookies should be split evenly as well. Number of user-ids and Number of cookies will serve as good population sizing invariants.

- Number of clicks, Click-through-probability, and Gross conversion are all measured before the change is shown and hence should be invariant.

**Evaluation metrics**

- a) Retention
- b) Net conversion

Retention and Net conversion are the metrics that Udacity is trying to measure in this experiment. The experiment will affect the number of user-ids to remain enrolled past the 14-day boundary.

## References

1) [A/B testing by Google](https://www.udacity.com/course/ab-testing--ud257)

2) [How not to run A/B test](http://www.evanmiller.org/how-not-to-run-an-ab-test.html)

3) [Sign and binomial test](http://graphpad.com/quickcalcs/binomial2/)

4) [Sample size per variation for A/B test](http://www.evanmiller.org/ab-testing/sample-size.html)

5) [Bootstrap explained](http://stats.stackexchange.com/questions/26088/explaining-to-laypeople-why-bootstrapping-works)