# AB Testing Notes

### Overview

A/B testing is a general methodology used to test out a new product or feature. A general overview of the procedure is as follows:

- Take two sets of users
- One set is shown an existing product
- Second set is given a treated version
- How do the customers respond differently? Determine which one is better based on some metric

Can you use A/B tests for everything? No.

A/B testing is good for optimizing an existing product but not good for developing a new product based on an existing one.

A few examples of big companies doing A/B tests:

- Amazon did A/B tests for personalized recommendations and found that they had an increase in revenue when given the personalized recommendations.
- Google tested 41 different shades of blue to use on their website using A/B tests
- LinkedIn A/B tested a ranking process where they checked whether it's better to show news articles or an encouragement to add more contacts on a user's "stream".
- Amazon determined that every 100ms increase in page load time decreased sales by 1% by using A/B tests

For all A/B tests you need a consistent measurable response from your control and experiment groups to determine whether the experience was improved (or made worse) by the treatment.

### What can't you do with AB tests?

- **Test out brand new experiences**
    - Change aversion - Users refuse to participate in the test
    - Novelty effect - Too drastic of a change leads customers to "try out everything". Can't test out a specific treatment
- **No baseline for comparison**
    - Can't set up a control group if there is no baseline.
    - Same as the first point
- **A short term decision**
    - You need the plateaued experience to make a "robust decision". The metric being observed will be noisy in the beginning and only when it has stabilised can you check for any statistically significant changes to your metric. Thus you need time to complete A/B tests
- **Long term effects are difficult to test**
    - It is difficult to measure changes in your metric over a long time period where other aspects of your product or users will change (can't attribute the change in your metric to the treatment)
- **Can't test whether you're missing something in your product**
    - There is no treatment for a "missing feature"
    - Can't set up a control and treatment group because, what do you change about the treatment group?

### Techniques other than A/B tests

- Logs of what users did on your website. Analyse them retrospectively or observationally to see if a hypothesis can be developed about what caused changes in their behaviour. This can then be used to design an experiment. 
- User experience research, focus groups, surveys, human evaluation
- A/B testing gives quantitative data, other techniques give qualitative data.

In online A/B tests, you don't know much about your users. You're using online user data and so it's difficult to distinguish whether a user is a single person, internet cafe etc.

The goal is to determine whether a new feature is desirable. To do this, you need to design an experiment that can be **repeatable**.

### Online Case Study

Audacity creates online finance courses.

**User flow/Customer funnel**

<img src="images/Customer_Funnel.png" width=600>

Customers who start at the top of the funnel will be passed down, with more and more users dropping out along the way.

**The hypothesis:**

Changing the "Start Now" button from orange to pink will **increase** how many students explore Audacity's courses

**Possible metrics to use**

- **Total number of courses completed**
    - Will take too much time. Students may take months to complete a course
- **How many users click on the "Start Now" button**
    - Assumes that users who progress through the top of the customer funnel will continue through the rest of it, i.e. that no students drop out of the funnel along the way
    - Unequal sizes of the treatment and control groups will also lead to different total numbers of clicks, so the count doesn't just measure students' exploration of courses.
- **CTR: $\frac{\text{Number of clicks}}{\text{Number of page views}}$**
    - Called Click-Through-Rate
    - Single users can click more than once and inflate the CTR
- **CTP: $\frac{\text{Unique visitors who click}}{\text{Unique visitors to the page}}$**
    - Called Click-Through-Probability
    - The best metric to use in this case.
    
Changing the "Start Now" button from orange to pink will **increase** the Click-Through-Probability of the button. We assume an increased probability of clicking the button leads to more students exploring Audacity's courses.
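To make the difference between the two ratio metrics concrete, here is a minimal sketch that computes both from a small event log. The log format (`(visitor_id, event)` pairs) and the visitor names are made up for illustration and are not part of the case study.

```python
# Hypothetical event log: (visitor_id, event) pairs. Illustrative data only.
events = [
    ("u1", "view"), ("u1", "click"), ("u1", "click"),  # u1 double-clicks
    ("u2", "view"),
    ("u3", "view"), ("u3", "view"), ("u3", "click"),   # u3 reloads, then clicks
    ("u4", "view"),
]


def click_through_rate(events):
    """Total number of clicks divided by total number of page views."""
    clicks = sum(1 for _, event in events if event == "click")
    views = sum(1 for _, event in events if event == "view")
    return clicks / views


def click_through_probability(events):
    """Unique visitors who clicked divided by unique visitors who viewed."""
    viewers = {visitor for visitor, event in events if event == "view"}
    clickers = {visitor for visitor, event in events if event == "click"}
    return len(clickers & viewers) / len(viewers)


ctr = click_through_rate(events)
ctp = click_through_probability(events)
print(f"CTR = {ctr:.2f}, CTP = {ctp:.2f}")
```

The double click by `u1` and the reload by `u3` both move the CTR, whereas the CTP counts each visitor at most once.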

**When to use CTR vs. CTP?**

Generally:

- Use a rate when you want to measure **usability**
    - Users have a number of different places they can press, you use a rate to measure how often the users clicked a specific button
    - Will have to change the website to log every page view of the website and every click of a button
- Use a probability when you want to measure **total impact**
    - You don't want to count when users double clicked, reloaded, etc when measuring a total effect (e.g. getting to the second level of a page)
    - Will have to change the website to match each page view with all of their "child clicks" to count at most one click per page view

<br>
<br>

___

<br>
<br>

# The statistics

**Which distribution?**

Say for example the CTP was computed to be $p_1 = \frac{100}{1000} = 0.1$ for one particular sample. When using a different sample, the CTP was instead computed to be $p_2 = 0.15$. 

Is $0.15$ considered to be a surprising result? Could this difference in proportion be a result of sampling variation? How inherently variable are our sample computed CTPs anyway? 

A standard result of statistical inference states that if the [Success-Failure Conditions](https://www.statisticshowto.datasciencecentral.com/success-failure-condition/) are satisfied then, assuming the two underlying proportions are the same, the **difference of two proportions** (in our case $\hat{d} = p_2 - p_1 = 0.05$) should be distributed normally around $0$ with a given **Standard Error**. We can then perform a **Hypothesis Test** or compute a **Confidence Interval** to check statistically whether there is actually a difference between the two proportions or not. If the observed difference is too large to be explained by sampling variation, we reject the notion that the two proportions are the same.

Mathematically, this is written as 

$$H_0: d = 0$$
$$H_A: d \neq 0$$
$$\hat{d} = p_2 - p_1 \sim N(0, SE_{d}) \text{ under } H_0$$
where $d$ is the true difference that $\hat{d}$ estimates, and
$$SE_{d} = \sqrt{\frac{p_{pooled}(1-p_{pooled})}{n_1} + \frac{p_{pooled}(1-p_{pooled})}{n_2}}$$
$$p_{pooled} = \frac{\text{Number of successes in both samples}}{\text{Number of cases in both samples}} = \frac{x_1 + x_2}{n_1 + n_2}$$
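As a rough sketch of how these formulas translate into code, the function below computes the pooled standard error, the z statistic, a two-sided p-value and a confidence interval for $\hat{d} = p_2 - p_1$. It uses the first sample from above (100 clicks out of 1000); the second sample size of 1000 (150 clicks) is an assumption, since only $p_2 = 0.15$ is given. The function name and the choice to reuse the pooled SE for the interval are also assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(x1, n1, x2, n2, alpha=0.05):
    """Pooled z-test and confidence interval for d_hat = p2 - p1."""
    p1, p2 = x1 / n1, x2 / n2
    d_hat = p2 - p1
    p_pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = d_hat / se
    # Two-sided p-value under the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Critical value for a (1 - alpha) confidence interval, about 1.96 for alpha = 0.05.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * se
    return {
        "d_hat": d_hat,
        "se": se,
        "z": z,
        "p_value": p_value,
        "ci": (d_hat - margin, d_hat + margin),
    }


# n2 = 1000 is an assumed sample size for the second sample.
result = two_proportion_test(x1=100, n1=1000, x2=150, n2=1000)
print(result)
```

With these assumed counts the interval excludes zero, so a difference of 0.05 would be unlikely to arise from sampling variation alone.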

The full mathematical procedure for computing a confidence interval is outlined below

<img src="images/difference_of_two_proportions.png" width=600>

**Practical significance**

We have to decide how big of a change is practically significant (aka substantive) to warrant changing the existing system. Statistical significance of any arbitrary difference in proportion can be achieved with a big enough sample size; however, a small difference may not be practically significant.

- Each change may require an investment in resources and so a small change may not warrant the investment
- Online A/B tests have a smaller margin for practical significance
- We need to make sure for online A/B tests that the change is **repeatable**.
- We want a sample size big enough that the smallest difference the test can detect as statistically significant is smaller than the practical significance bar, which helps ensure repeatability

We will decide that a 2% change in CTP is practically significant

**Size vs. Power tradeoff**

The power of a hypothesis test is the probability that the test rejects the null hypothesis $H_0$ when a specific alternative hypothesis $H_A$ is true. The idea is that, given a practically significant effect size and a significance level, we want the hypothesis test to be able to detect the effect (by rejecting the null hypothesis) with a high enough probability, which can be controlled by increasing the sample size.

$\alpha$ is the probability of making a type 1 error i.e. False Positive (rejecting a true null hypothesis). $\beta$ is the probability of making a type 2 error (failing to reject a false null hypothesis). **Statistical Power** is equal to $1-\beta$

When the CI captures the null value of $H_0$ (zero difference), the result is not statistically significant (recall that you create a CI around the point estimate $\hat{p}$ or $\hat{d} = \hat{p}_2 - \hat{p}_1$). When the CI lies entirely beyond both the null value **and** the practical significance level $d_{\text{min}}$, then we can agree to launch the change. For cases in between, where the CI is too wide or excludes the null value but still captures $d_{\text{min}}$, we have to use our best judgement. More on these decisions and their subsequent recommendations later.
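The decision rule can be sketched as a small function. The function name and the returned labels are illustrative, and the branch for a clearly negative interval is an assumption added for completeness, since the text only describes positive changes.

```python
def launch_decision(ci_low, ci_high, d_min):
    """Classify a confidence interval for d_hat against 0 and d_min."""
    if ci_low <= 0 <= ci_high:
        # The interval captures the null value. A very wide interval that
        # also reaches d_min may simply mean more data is needed.
        return "not statistically significant"
    if ci_low > d_min:
        # The whole interval lies beyond both 0 and d_min.
        return "launch"
    if ci_high < 0:
        # Assumption of this sketch: a significant decrease is not launched.
        return "do not launch"
    return "judgement call"


# Illustrative intervals with d_min = 0.02.
print(launch_decision(0.021, 0.079, 0.02))
print(launch_decision(-0.010, 0.030, 0.02))
print(launch_decision(0.005, 0.030, 0.02))
```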

<br>
<br>

___

<br>
<br>

# The Ethics

Participants in any kind of experimental test need to be adequately protected

**Example**

Facebook ran an experiment to gauge the effect of altering users' news feeds on their emotions. For this study in particular, the experimenters gave no discussion of the benefits of the study being conducted.

Four main principles govern the ethics of experiments with regard to the safety of their participants:

1. Risk - What risks are the participants being exposed to?
2. Benefit - What benefits might be the outcome of the study?
3. Choice - What other choices do participants have?
4. Privacy - What expectation of privacy and confidentiality do participants have?

**Risk**

Does the risk exceed that of "minimal risk"? **Minimal risk** is defined as the probability (and magnitude) of harm that a participant would be exposed to in normal daily life. In most online A/B testing, the risk (of the test) does not exceed minimal risk, although there are grey areas such as in the Facebook example.

**Benefits**

How might the results of the experiment help? It is important to be able to state what the benefit would be from completing the study. In most online A/B testing, the benefits are around improving the product.

**Alternatives**

Do participants of the test really have a choice in whether to participate or not (and how does that affect the risks and the benefits)? For example, in medical clinical trials testing out new drugs for terminal illnesses, the alternative for most participants is death. Thus, the risk allowable for participants, given **informed consent**, is quite high.

For online experiments, if users do not want to participate in the testing, we must consider how this may inconvenience the users (for example costs, time, information etc. required to switch services).

**Data Sensitivity**

How sensitive is the data? What is the re-identification risk of individuals from the data? As the sensitivity and the risk increases, then the level of data protection must increase: confidentiality, access control, security, monitoring & auditing, etc. Sensitive data includes bank information, health information etc. whilst the re-identification risk of the data is determined by whether it is considered to be identified, pseudonymous, anonymous or anonymized.

**Identified data** means that data is stored and collected with personally identifiable information. This can be names, IDs such as a social security number or driver’s license ID, phone numbers, etc. HIPAA is a common standard, and that standard has 18 identifiers (see the Safe Harbor method) that it considers personally identifiable. Device IDs, such as a smartphone’s device ID, are considered personally identifiable in many instances.

**Anonymous data** means that data is stored and collected without any personally identifiable information. This data can be considered **pseudonymous** if it is stored with a randomly generated id such as a cookie that gets assigned on some event, such as the first time that a user goes to an app or website and does not have such an id stored.

In most cases, anonymous data still has time-stamps -- which is one of the HIPAA 18 identifiers. Why? Well, we need to distinguish between anonymous data and anonymized data. **Anonymized data** is identified or anonymous data that has been looked at and guaranteed in some way that the re-identification risk is low to non-existent, i.e. given the data, it would be hard to impossible for someone to be able to figure out which individual this data refers to. Often, this guarantee is done statistically, and looks at how many individuals would fall into every possible bucket (i.e., combination of values).

What this means is that anonymous data may still have high re-identification risk. **Aggregated data** is usually not sensitive.
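The "how many individuals fall into every bucket" check can be sketched as follows. This is one simple way to look at it (counting the smallest bucket, in the spirit of what is often called k-anonymity), not a complete anonymization procedure. The records and field names are invented for illustration.

```python
from collections import Counter

# Hypothetical, made-up records with no direct identifiers.
records = [
    {"zip": "94110", "age_band": "30-39"},
    {"zip": "94110", "age_band": "30-39"},
    {"zip": "94110", "age_band": "30-39"},
    {"zip": "94110", "age_band": "40-49"},
    {"zip": "94110", "age_band": "40-49"},
    {"zip": "94117", "age_band": "30-39"},
]


def smallest_bucket(records, fields):
    """Size of the smallest group sharing the same combination of values."""
    buckets = Counter(tuple(record[field] for field in fields) for record in records)
    return min(buckets.values())


k_combined = smallest_bucket(records, ["zip", "age_band"])
k_age_only = smallest_bucket(records, ["age_band"])
print(k_combined, k_age_only)
```

A smallest bucket of size 1 means at least one combination of values points at a single individual, so the data would still carry re-identification risk even without names or IDs.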

For online A/B testing, questions that must be considered include:

- Are users being informed about the data being gathered via a ToS or Privacy policy?
- What user identifiers are tied to the data being gathered? Is any identified data being gathered?
- What type of data is being collected? Any health or financial data?
- What level of confidentiality and security is the data subject to? Is access to the data being logged and audited?

**Informed consent**

Participants are told about the risks that they may face if they take part in the study, what benefits might result, what other options they have, what data is being gathered and how that data is being handled. Typically informed consent is handled by giving participants a document detailing all of the aforementioned information and participants can then choose whether they want to participate or not.



<br>
<br>

___

<br>
<br>

# Metrics

There are two main uses for metrics in A/B testing:

- Invariant checking - The metrics that shouldn't change across your treatment and control. 
- Evaluation - To check whether the treatment group is "performing better" than your control group

Two types of Evaluation metrics

- High level metrics
- Well defined metrics



### High level metrics

These usually relate to the business objective. They are not directly used to perform the A/B test but they help to decide on which metrics will eventually be used to do so.

In a customer driven business, we can use a **customer funnel** to brainstorm the possible high level metrics that will be important.

Each level in the customer funnel is a high level metric that can be used to answer questions such as "what business objective are you tracking?" 

You can categorize every metric into three different types:
- Count
- Rate
- Probability

You want the metric that you eventually use in an A/B test to be a KPI for your business, but these high level metrics still need to be transformed into formal definitions.

### Difficult metrics

Some metrics will be difficult to use in an A/B test. This comes down to two specific reasons

- **Time**: For some metrics, the data will just take too long to collect. It is also difficult to ensure the smooth operation of a long-term ongoing online experiment  
- **Availability**: For other metrics, the data may not be readily available to use. The business may not have access to the data

### Other techniques for coming up with metrics

There are other techniques you can use to help you get an understanding of your users, which you can use to come up with ideas for metrics, validate your existing metrics, or even brainstorm ideas of what you might want to test in your experiments in the first place. See [here](https://s3-us-west-2.amazonaws.com/gae-supplemental-media/additional-techniquespdf/additional_techniques.pdf)

<img src="images/gathering_additional_data.png" width=600>

### Well defined metrics

To convert a high level metric into a well defined metric, you must consider two main things:

- What data are we actually going to look at to compute the metric (cookies? page views? clicks? are we going to filter the data?)
- Given the events (e.g. clicks) how will we summarize the metric? (mean? median? etc)

Essentially you must now consider the logistics of collecting and summarizing the metric

Example:

$\text{CTP} = \frac{\text{Unique visitors who click}}{\text{Unique visitors to the page}}$

This is not a well defined metric yet

- How do we determine when two events are from the same user?
    - Let's use a cookie
- What time period do we use to count events? hour? day? week?


A few well defined metrics that act like CTP are:

1 - $\text{Cookie Probability} = \frac{\text{Number of cookies that click}}{\text{Total number of cookies}}$ for every hour (or any other time interval)

2 - $\text{Pageview probability} = \frac{\text{Pageview with click}}{\text{Total number of page views}}$ for every hour (or any other time interval)

3 - $\text{CTR (with time period)} = \frac{\text{Total number of clicks}}{\text{Total number of page views}}$ for every hour (or any other time interval)

### Segmenting and filtering data

Often you want to filter out any **unexpected** traffic or participants from your experiments. An example would be spam IP addresses that don't represent a typical user: you do not care how a change to your website affects these participants, so you do not want them to skew your metric.

Restricting the experiment to the specific subset of users that a change targets is also important because you want to avoid **diluting your results**. Results are diluted when a change in your metric can no longer be detected due to a **reduction in power** of the A/B test caused by including participants whom you know will not be affected. $N$ is then much larger than it should be, and legitimate changes to the behaviour of the participants (that you suspect should be affected) may be mistaken for **sampling variation** in the hypothesis test.

Filtering is used to **de-bias** the data whilst avoiding **introducing** bias to the data. Bias may be introduced if you choose to use a filter but it is removing participants from certain subsets disproportionately. The experiment is no longer **randomized**.

Use **slicing** to check if your filter is introducing bias

### Summary metrics

For count metrics in particular and other well defined metrics (e.g. load time of a video) you have many different ways to summarize the metric e.g. mean, median (50th percentile), 75th percentile, 90th percentile etc.

To choose between these metrics, you should consider two main factors:

- **Sensitivity** - You want your metric to be sensitive enough to changes in the treatment
- **Robustness** - You want your metric to be robust to ongoing changes in your system that aren't the treatment but you have no control over

To choose how to summarize the metric, you can perform a **retrospective analysis** and produce a **histogram** of the metric. For skewed distributions, the mean is not necessarily the best measure of central tendency and a median or %ile may be more appropriate.

>Let’s talk about some common distributions that come up when you look at real user data. For example, let’s measure the rate at which users click on a result on our search page, analogously, we could measure the average staytime on the results page before traveling to a result. In this case, you’d probably see what we call a **Poisson distribution**, or that the stay times would be exponentially distributed. Another common distribution of user data is a “power-law,” Zipfian or **Pareto distribution**. That basically means that the probability of a more extreme value, z, decreases like 1/z (or 1/z^exponent). This distribution also comes up in other rare events such as the frequency of words in a text (the most common word is really really common compared to the next word on the list). These types of heavy-tailed distributions are common in internet data. Finally, you may have data that is a composition of different distributions - latency often has this characteristic because users on fast internet connection form one group and users on dial-up or cell phone networks form another. Even on mobile phones you may have differences between carriers, or newer cell phones vs. older text-based displays. This forms what is called a **mixture distribution** that can be hard to detect or characterize well. The key here is not to necessarily come up with a distribution to match if the answer isn’t clear - that can be helpful - but to choose summary statistics that make the most sense for what you do have. If you have a distribution that is lopsided with a very long tail, choosing the mean probably doesn’t work for you very well - and in the case of something like the Pareto, the mean may be infinite!

There are 4 main types of summary metrics

<img src="images/categories_for_summary_metrics.png" width=600>

A review of the literature may need to be considered to select the best summary metric (e.g. say a study showed that people take 5 seconds to internalize the information of a web page, then use the number of users who spent 5 seconds or longer on a page)

### Sensitivity or Robustness

The mean is **not a robust metric** as it is sensitive to outliers: each value influences the mean in proportion to its size, so a few extreme values can dominate it. The median is robust but may not be sensitive enough.
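A quick sketch of this effect, using made-up page load times in seconds: a single extreme value shifts the mean substantially while the median barely moves.

```python
from statistics import mean, median

# Made-up load times (seconds) for illustration.
load_times = [1.1, 1.2, 1.2, 1.3, 1.3, 1.4, 1.5, 1.6, 1.8, 2.0]
with_outlier = load_times + [60.0]  # one stalled request

mean_shift = mean(with_outlier) - mean(load_times)
median_shift = median(with_outlier) - median(load_times)

print(f"mean shift:   {mean_shift:.2f} s")
print(f"median shift: {median_shift:.2f} s")
```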

To measure the **sensitivity** of a summary metric, you can perform a **retrospective study** and look at previous changes to the website (that match the change you're trying to test) to see if it affected the metric in ways that you expect. You can also perform separate simple experiments to check if your summary metric responds to the change in expected ways. Looking back at previous experiments performed may also be done.

To measure the **robustness** of a summary metric, you can perform an **A/A test** where you don't change anything and see if the metric picks up on any spurious changes. The metric is not robust if there are more statistically significant changes to your metric between A/A groups than expected due to sampling variation. (See the spreadsheet for example). You can also perform retrospective studies or simple experiments and check to see if your metric is robust between comparable groups.
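The A/A idea can be sketched with a simulation. Here both groups are drawn from the same click probability, a pooled two-proportion z-test is applied to each pair, and the fraction of "significant" results is compared with the significance level. The baseline probability, group size, number of runs and function name are all assumptions chosen for illustration.

```python
import random
from math import sqrt


def aa_false_positive_rate(p=0.1, n=1000, runs=400, z_crit=1.96, seed=0):
    """Fraction of simulated A/A comparisons flagged as significant."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(runs):
        # Both groups share the same true click probability p.
        x1 = sum(rng.random() < p for _ in range(n))
        x2 = sum(rng.random() < p for _ in range(n))
        p_pooled = (x1 + x2) / (2 * n)
        se = sqrt(p_pooled * (1 - p_pooled) * (2 / n))
        if se > 0 and abs(x2 - x1) / n / se > z_crit:
            significant += 1
    return significant / runs


rate = aa_false_positive_rate()
print(f"share of significant A/A comparisons: {rate:.3f}")
```

For a well-behaved metric the share should sit near the significance level (about 5% here). A clearly larger share on real A/A data would suggest the metric, or the variance estimate behind the test, is not robust.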

Here is an example of a simple experiment to check for robustness where each video should be comparable in latency (each video is the same size, resolution etc). 

<img src="images/robustness.png" width=600>


Here is an example of a simple experiment to check for sensitivity where the higher the video number, the lower the resolution (i.e. we expect the latency to decrease w.r.t video number)

<img src="images/sensitivity.png" width=600>

**From the experiments, the 85th percentile might be a good metric to use based on sensitivity and robustness**

### Absolute vs Relative difference

How do you compute the difference in your metric between treatment and control? If you run multiple experiments, then using a relative difference (e.g. percentage change) means that you only need to define one practical effect size (e.g. 2% increase).

**Absolute vs. relative difference**

Suppose you run an experiment where you measure the number of visits to your homepage, and you measure 5000 visits in the control and 7000 in the experiment. Then the absolute difference is the result of subtracting one from the other, that is, 2000. The relative difference is the absolute difference divided by the control metric, that is, 40%.

**Relative differences in probabilities**

For probability metrics, people often use percentage points to refer to absolute differences and percentages to refer to relative differences. For example, if your control click-through-probability were 5%, and your experiment click-through-probability were 7%, the absolute difference would be 2 percentage points, and the relative difference would be 40 percent. However, sometimes people will refer to the absolute difference as a 2 percent change, so if someone gives you a percentage, it's important to clarify whether they mean a relative or absolute difference!
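The two worked examples above can be reproduced with a couple of helper functions (the function names are illustrative):

```python
def absolute_difference(control, experiment):
    """Experiment value minus control value."""
    return experiment - control


def relative_difference(control, experiment):
    """Absolute difference expressed as a fraction of the control value."""
    return (experiment - control) / control


# Count metric: homepage visits.
visits_abs = absolute_difference(5000, 7000)
visits_rel = relative_difference(5000, 7000)

# Probability metric: click-through-probability.
ctp_abs = absolute_difference(0.05, 0.07)  # in probability units (percentage points / 100)
ctp_rel = relative_difference(0.05, 0.07)

print(visits_abs, f"{visits_rel:.0%}")
print(f"{ctp_abs * 100:.0f} percentage points", f"{ctp_rel:.0%}")
```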

### Variability

You need to compute the variance of the metric in order to check whether a statistically (or practically) significant difference in your metric has resulted from the treatment.

For some non-standard metrics (e.g. ratios, 90%ile) it is difficult to calculate the variance analytically, and so it is better to compute the variance empirically. Here is a list of common metrics and their sample distribution/sample variance

<img src="images/metrics_and_variance.png" width=600>

For metrics such as the median, other percentiles and ratios, the analytical SE formulas are complicated and rest on distributional assumptions that you may not want to make. In these cases, there are two simple solutions to help you proceed with performing the A/B test:

- **Sign test** - For example, run multiple experiments (e.g. 20 experiments) using the same sample size $N$. If on 15 of those experiments the metric increased, use the binomial distribution to perform a hypothesis test to verify the chance of this happening due to sampling variation (assume $p=0.5$ is the probability that there is an increase in the metric).
    - Doesn't help estimate the size of the effect (practical significance)
    - Part of a broader range of methods called non-parametric methods

- **Empirically compute SE and CI** - This can be done by again running multiple experiments using the same sample size $N$, but without administering a treatment (i.e. there is no difference between the control and treatment groups). We then compute the standard deviation of the metric across these experiments and use that as an **empirical estimate** of the Standard Error of our metric. This type of test is called an A/A test.
    - Can be done beforehand to estimate the SE of our metric
    - We can extrapolate the SE to different sample sizes $N$ (the SE typically scales as $1/\sqrt{N}$), thus A/A tests can be done well before an A/B test is ready
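The sign test from the first bullet can be sketched with an exact binomial calculation. The helper below is a minimal two-sided version for $p = 0.5$; the function name is an assumption, and libraries such as SciPy offer equivalent tests.

```python
from math import comb


def sign_test_p_value(increases, experiments, p=0.5):
    """Two-sided exact binomial p-value for the number of increases."""
    def pmf(k):
        return comb(experiments, k) * p**k * (1 - p) ** (experiments - k)

    upper_tail = sum(pmf(k) for k in range(increases, experiments + 1))
    lower_tail = sum(pmf(k) for k in range(0, increases + 1))
    # Double the smaller tail, capped at 1.
    return min(1.0, 2 * min(upper_tail, lower_tail))


p_value = sign_test_p_value(increases=15, experiments=20)
print(f"p-value = {p_value:.4f}")
```

For 15 increases out of 20 experiments the two-sided p-value is about 0.041, so at $\alpha = 0.05$ the increase would be unlikely to be due to sampling variation alone.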

See the [spreadsheet](Emperical_Variance_Notes.xlsx) for an example of performing A/A tests to empirically compute Confidence Intervals and estimate the Standard Errors for CTP. 

If you don't have the time to perform multiple small experiments, you can instead perform one large experiment and estimate the CI and SE by sampling with replacement from the large experiment to simulate performing multiple smaller experiments. This technique is called **bootstrapping**

See this [spreadsheet](Empiral_Variance_Bootstrapping.xlsx) for an example of performing bootstrapping to empirically estimate the CI and SE of CTP.
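As a rough code counterpart, the sketch below bootstraps the SE and a percentile confidence interval for CTP. The "large experiment" is a made-up list of 10,000 visitors of whom 1,000 clicked, and the resample size and count are arbitrary choices for illustration.

```python
import random
from statistics import stdev


def bootstrap_ctp(clicked, sample_size, resamples=1000, seed=0):
    """Bootstrap SE and 95% percentile CI of CTP for a given sample size."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(resamples):
        # Sample visitors with replacement to mimic one smaller experiment.
        sample = rng.choices(clicked, k=sample_size)
        estimates.append(sum(sample) / sample_size)
    estimates.sort()
    se = stdev(estimates)
    lower = estimates[int(0.025 * resamples)]
    upper = estimates[int(0.975 * resamples) - 1]
    return se, (lower, upper)


# Hypothetical large experiment: 1 = visitor clicked, 0 = visitor did not.
clicked = [1] * 1000 + [0] * 9000
se, ci = bootstrap_ctp(clicked, sample_size=1000)
print(f"SE = {se:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

For a simple proportion like this, the bootstrapped SE should land close to the analytical $\sqrt{p(1-p)/N}$. The benefit of the approach is that it works the same way for metrics without a convenient formula.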

<br>
<br>

___

<br>
<br>

# Designing an Experiment

### Units of diversion

In A/B testing, we need to assign different subjects to our control and treatment groups. How do you decide what is a subject in the experiment? With a user visible change, ideally we want a unique person to be a subject in our experiment; however, only imperfect proxies for a unique person exist in online testing. The way we define/approximate a subject in our experiment is called the **unit of diversion**, where you divert each unit into the control or treatment groups.

We may try to distinguish users by

- **User ID**
    - A single user may sometimes have multiple accounts
    - Is personally identifiable
- **Cookie** 
    - If you switch browser, you get assigned a different cookie
    - If you clear your history, you get assigned a different cookie
- **Event**
    - Users will not get a consistent experience
- **Device ID**
    - Only available for mobile
    - Tied to a specific device
    - Is personally identifiable
- **IP address**
    - Can change

### Consistency of diversion

There are three main things to consider when selecting the unit of diversion

**Consistency and Statefulness**

The idea is to make sure that the subjects in your experiment have a consistent experience during testing, i.e. the experience of each subject is the same throughout the lifetime of the experiment.

For user visible changes, it's best to use User ID or Cookies to ensure consistency of diversion

For non user visible changes, e.g. latency changes, backend infrastructure changes, ranking changes etc, you do not need to worry about consistency and thus you can use an event based unit of diversion


If the metric you're measuring is some kind of learning effect, i.e. if a user adapts to changes made, then you need to track the same user over a long period of time. You then need to use a **stateful** unit of diversion that is attached to a single user, for which User ID and Cookie are most appropriate.
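One common way to get a consistent assignment per unit is to hash the unit's identifier, so the same User ID or cookie always lands in the same group without storing any state. The sketch below is an illustration of that idea, not a description of any particular platform; the experiment name, ID format and function name are hypothetical.

```python
import hashlib


def assign_group(unit_id, experiment="start_now_button_colour", treatment_share=0.5):
    """Deterministically map a unit of diversion to 'control' or 'treatment'."""
    # Including the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# The same (hypothetical) cookie always receives the same experience.
print(assign_group("cookie-123"), assign_group("cookie-123"))
```

With an event based unit of diversion you would instead hash something that changes per event (or draw a random number), which is why the experience is not consistent for a given user.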

**Ethical considerations**

Cookie and User ID are personally identifiable units of diversion. To use these, you need to consider whether you're prepared to go through all of the necessary data protection steps that are required when collecting such data. This includes, for example, collecting informed consent from your participants.

**Variability**

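When the **unit of analysis** is the same as the unit of diversion, the analytical estimate of the variance of your metric tends to agree with the **empirical variability** observed due to sampling variation. When the unit of diversion is broader than the unit of analysis (e.g. diverting by cookie but analysing page views), events from the same unit are correlated, so the empirical variability is higher and the analytical estimate becomes an underestimate (which reduces the power of the A/B test for a given size). The unit of analysis is the denominator of the metric (if your metric is a ratio).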

You then want to choose a **unit of diversion** that matches the denominator of your metric when you can (which usually ends up being an event based diversion).
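A small simulation can illustrate why matching the two units matters. Below, traffic arrives by cookie (the unit of diversion) but the metric is computed per page view (the unit of analysis), and each cookie has its own click probability, so page views from the same cookie are correlated. All numbers (200 cookies, 10 page views each, click probabilities of 0.02 or 0.30) are invented for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_pageview_ctr(rng, cookies=200, views_per_cookie=10):
    """One simulated sample: clicks per page view, diverted by cookie."""
    clicks = 0
    for _ in range(cookies):
        # Each cookie is either a low or a high clicker (assumed behaviour).
        p_cookie = rng.choice([0.02, 0.30])
        clicks += sum(rng.random() < p_cookie for _ in range(views_per_cookie))
    return clicks / (cookies * views_per_cookie)


rng = random.Random(0)
estimates = [simulate_pageview_ctr(rng) for _ in range(300)]

p_bar = mean(estimates)
n_pageviews = 200 * 10
# Binomial SE that treats every page view as independent.
analytical_se = sqrt(p_bar * (1 - p_bar) / n_pageviews)
# SE actually observed across repeated samples.
empirical_se = stdev(estimates)

print(f"analytical SE = {analytical_se:.4f}, empirical SE = {empirical_se:.4f}")
```

In this setup the empirical SE comes out noticeably larger than the analytical one, because the binomial formula assumes independent page views. Diverting by page view instead would remove that gap.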

<img src="images/unit_of_diversion_unit_of_analysis.png" width=600>

### Inter vs. Intra user experiment

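A/B testing is an inter user experiment where there are different participants in each of the control and treatment groups. In an intra user experiment, by contrast, the same participant is exposed to both the control and the treatment (for example by switching a feature on and off over time). More information in [this paper](http://www.cs.cornell.edu/people/tj/publications/chapelle_etal_12a.pdf)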

### Target population

There are various reasons why you would want to target a specific group in the population. This is particularly important if you know in advance who will be affected by the changes made in your treatment group. Some reasons to target a specific group in your population include (but are not limited to):

- If you're testing a high-profile feature that you're unsure about whether it will be released or not, you may want to restrict the change so that only a limited amount of users experience it and as a result avoid getting press coverage etc.
- You may want to restrict the target population by language as you want to avoid going through the trouble of testing in different languages.
- You may not be sure whether your feature works on all browsers and thus you may want to restrict the target population to specific browsers
- If you're running multiple experiments at your company, you may not want to overlap participants of the study (have subjects take part in multiple experiments)
- You may not want to dilute the effect of your experiment if you know your change will only affect a specific group of your population. Increasing $N$ whilst keeping the number of participants affected the same will reduce the effect size i.e. the difference between proportions for CTP and thus reduce the power of your experiment (see filtering section)

You should ask the engineering team if they have an idea about who the changes will target.

After the experiment, you will want to test the changes on your global population just to check if there are any unwanted effects on the traffic you were not targeting. An example of diluting the effect size is shown below:

<img src="images/diluting_variance_1.png" width=600>

<img src="images/diluting_variance_2.png" width=600>

Adding all of the unaffected traffic outside of the population that would be affected (New Zealand traffic) diluted the difference of the proportions (disproportionately to the reduction in SE) and thus the difference is no longer statistically significant.
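The dilution effect can also be reproduced numerically. The counts below are made up and are not the figures from the images: 1,000 affected users per group show a lift from 10% to 13%, and then 9,000 unaffected users per group, all clicking at 10%, are added to both sides.

```python
from math import sqrt


def z_statistic(x_control, n_control, x_treatment, n_treatment):
    """Pooled two-proportion z statistic for treatment minus control."""
    d_hat = x_treatment / n_treatment - x_control / n_control
    p_pooled = (x_control + x_treatment) / (n_control + n_treatment)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_treatment))
    return d_hat / se


# Affected population only (hypothetical counts).
z_targeted = z_statistic(100, 1000, 130, 1000)

# Same clicks plus 9,000 unaffected users per group at a 10% click rate.
z_diluted = z_statistic(100 + 900, 1000 + 9000, 130 + 900, 1000 + 9000)

print(f"targeted z = {z_targeted:.2f}, diluted z = {z_diluted:.2f}")
```

The extra traffic shrinks the SE, but it shrinks the observed difference much more, so the z statistic drops below the usual 1.96 threshold.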

### Cohort

If you just divert your users by Cookie or User ID, you may have participants drop out of your groups or join your groups during the lifetime of your experiment. Cohorts are groups of users who entered the experiment at the same time and/or have a similar level of activity. You typically want to use a cohort in experiments when:

- You're looking for learning effects (whether users are adapting to a change or not)
- Examining user retention
- Want to increase user activity
- Anything requiring the user to be established i.e. has used the site for at least a specific number of hours

This is essentially a form of filtering to ensure that the participants of the study match the target population of the test. In the Audacity case study, they may change the structure of a specific course and only target a cohort of participants who haven't completed the course yet.

### Sizing the experiment

Previously, we only considered how to size the experiment based on practical significance, statistical significance and sensitivity (e.g. ensuring the experiment has at least 80% power). From what we just learned, we now also need to take into account the unit of diversion vs. unit of analysis as well as the target population, and then decide whether the size is realistic relative to how long you have to run the experiment.

To perform the calculation, you just do a power calculation for $N$, the sample size required to ensure a certain $\alpha$ (usually 0.05), $\beta$ (usually 0.2) and $d_{\text{min}}$ whilst estimating SE empirically (or analytically for simple metrics).

To reduce the size of the experiment required whilst maintaining the same $\alpha$, $\beta$ and $d_{\text{min}}$, you could try:
- Changing the unit of diversion to match the unit of analysis
- Target experiment to specific traffic (filtering or using cohort)
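
The power calculation for $N$ can be sketched in closed form. This is a minimal sketch, assuming a two-sided test, a normal approximation, and a standard error of the form $SE = s / \sqrt{N}$; the function name and the example values are hypothetical, and the small contribution of the opposite rejection tail is ignored.

```python
import math
from scipy.stats import norm

def sample_size_per_group(s, d_min, alpha=0.05, beta=0.2):
    """Smallest N per group so that a two-sided test at level alpha has
    power of at least 1 - beta to detect a true difference of d_min,
    assuming SE(d) = s / sqrt(N)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(1 - beta)
    return math.ceil(((z_alpha + z_beta) * s / d_min) ** 2)

# Hypothetical metric: baseline proportion of 0.10, practical significance of 0.02
p = 0.10
s = math.sqrt(2 * p * (1 - p))  # SE of the difference when N = 1 in each group
n = sample_size_per_group(s, d_min=0.02)
print("Required sample size per group:", n)
```

Because $N$ scales with $(s / d_{\text{min}})^2$, anything that lowers the variability of the metric (such as matching the unit of diversion to the unit of analysis) reduces the required size.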

### Sizing triggering

Even after deciding on a well-defined metric, you often don't know beforehand whether the population that is supposed to be exposed to the changes in your treatment is truly exposed to them. A good idea is to run a "pilot" where you turn on the changes, observe who is actually affected and see if that matches your intuitions.

### Duration

Depending on the sizing calculations relative to the traffic on your website, you will need to decide on a suitable **duration** and **exposure proportion**.

For example, if your website gets $100000$ users per day, and the sizing calculations resulted in a required sample size of $N=1000000$, if you expose your change to your website's entire traffic, then you will need to run your experiment for 10 days.

However, exposing the change to the entire traffic isn't a good idea because of reasons mentioned before in the "Target Population" section such as avoiding press coverage, overlapping experiments, filtering by cohorts etc.

Reducing the exposure proportion will then increase the required duration of your experiment.
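
The trade-off between exposure proportion and duration is simple arithmetic. The sketch below reuses the traffic and sample size from the example above; `experiment_duration_days` is a hypothetical helper name.

```python
import math

def experiment_duration_days(required_n, daily_traffic, exposure):
    """Days needed to collect required_n units when only a fraction `exposure`
    of the daily traffic is diverted into the experiment (both groups combined)."""
    if not 0 < exposure <= 1:
        raise ValueError("exposure must be in (0, 1]")
    return math.ceil(required_n / (daily_traffic * exposure))

for exposure in (1.0, 0.5, 0.25):
    days = experiment_duration_days(1_000_000, 100_000, exposure)
    print(f"exposure={exposure:.2f} -> {days} days")
```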

Another reason is the fact that you are trying to account for confounding factors by **randomizing across your unit of diversion**. However, other **temporal factors** might be affecting your subjects/metric which you have to also account for. For example, if you run your online experiment over a holiday, the results of the experiment (if statistically and practically significant) may not apply all year round. 

By having a longer running experiment, or by running your experiment on specific days of the week/year, you can ensure that the results of your experiment are **generalizable**.

<img src="images/limit_exposure.png" width=600>

### Learning effects

When you want to measure learning effects, you're trying to quantify whether users are adapting to the changes (in your treatment group) or not. Measuring this type of effect is difficult due to the two reasons mentioned before:

- **Change aversion** - Users refuse to participate in the test
- **Novelty effect** - Too drastic of a change leads customers to "try out everything". Can't test out a specific treatment

The key idea is that for learning effects, you need to consider the duration of the experiment that is required for the user behaviour to plateau or converge.

These are the things you need to keep in mind when measuring learning effects (some of which have been mentioned already):

- You need a **stateful** unit of diversion
- You need to consider the dosage (how often the user experiences the change) and thus you should select a **cohort** based on dosage
- Learning effects are usually high risk and thus you'll probably want to test on a smaller proportion of users for a longer duration of time (to meet the sample size $N$ quota)

**Pre-period vs. post-period experiments** are a method you can use to ensure that user experience and populations are comparable between your control and treatment groups. The idea is that you run A/A tests **before and after** the actual A/B test on your treatment and control groups to ensure that there are no inherent differences between the two groups. This ensures that after your treatment group is exposed to the change, the differences between the two groups can be attributed to the treatment and not any other confounding factors.

<br>
<br>

___

<br>
<br>

# Analyzing Results

### Sanity checks

Before analyzing and interpreting the results of an A/B test, you should do a few sanity checks to make sure your experiment was run properly. Specifically, you need to check that the **invariant metrics**, which are quantitative measures that you expect to be the same between the treatment and control groups, are actually the same. Invariant metrics come in two types:

- **Population sizing** - Experiment and control groups should be comparable in size and types of users in each group
- **Other invariants** - Other aspects of the experiment that aren't part of the treatment need to be comparable between the two groups

The total number of units of diversion (e.g. total number of signed-in users) in each group is always a population sizing invariant as it defines the size of your treatment and control groups.

Other invariants are those that measure an effect that subjects in both groups experience equally (e.g. CTR of a button that is the same between treatment and control)

The metric you're testing (and the quantities used to compute it) is not and should not be invariant, as you expect it to change as a result of the treatment.

<img src = "images/checking_invariants.png" width=600>

To actually check if the invariant metrics are the same between the two groups, you perform the appropriate hypothesis test (e.g. **one sample proportion inference** for testing the total number of units of diversion in each group) or compute a confidence interval and see if it captures the null hypothesis.
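
A population sizing check of this kind might look like the sketch below. It assumes traffic is meant to be split 50/50, builds a confidence interval around the expected control share, and checks whether the observed share falls inside it. The counts and the function name are hypothetical.

```python
import math
from scipy.stats import norm

def check_population_sizing(n_control, n_treatment, p_expected=0.5, alpha=0.05):
    """Return (passed, (lower, upper), observed control share)."""
    n_total = n_control + n_treatment
    se = math.sqrt(p_expected * (1 - p_expected) / n_total)
    margin = norm.ppf(1 - alpha / 2) * se
    lower, upper = p_expected - margin, p_expected + margin
    observed = n_control / n_total
    return bool(lower <= observed <= upper), (lower, upper), observed

# Hypothetical counts of units of diversion in each group
ok_split = check_population_sizing(50_200, 49_800)
bad_split = check_population_sizing(51_000, 49_000)
print("50200 vs 49800 passes:", ok_split[0])
print("51000 vs 49000 passes:", bad_split[0])
```

With 100,000 units in total, even a 51/49 split falls outside the interval, which is why seemingly small imbalances are worth investigating.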

If your sanity checks fail, do not proceed with the conclusions of your A/B test. Here are a couple of solutions and/or next steps:

- Check with the engineers to see if there is any problem with the infrastructure which is resulting in different population sizing
- Perform a retrospective analysis that tries to recreate the experiment diversion (sample from a previous data capture that emulates the diversion used in the experiment) to see if there is something endemic about the experiment conditions i.e. perhaps the invariant metric is not invariant as expected
- Perform pre-period and post-period A/A tests to check if there's something wrong with the treatment conditions (change in an invariant during the A/B test but not during pre-period) or if there's something that's systematically wrong with the setup or data capture (a change in an invariant in pre-period and post-period).

Some common problems that cause invariants to change:

- **Data capture** - Capturing a new experience or something that happens rarely. Effects aren't being measured correctly or in the same way between your treatment and control groups
- **Setup** - Having a filter (e.g. English traffic only) but the filter works differently between your treatment and control groups
- **Infrastructure** - Something systematically wrong, for example resetting cookies for one group but not the other.

Recall that invariants are what should be kept the same in both experiment groups. If the change is in something that in reality shouldn't be an invariant and was previously not considered, then it must now be accounted for in a new experimental setup (e.g. by targeting a new specific population or cohort).

Learning effects should appear to slowly deviate from invariance and so can be ruled out if there are big immediate changes to invariants between treatment and control groups.

### Single metrics

To evaluate the results of an A/B test on a single metric, you follow the standard statistical procedure of computing a confidence interval or performing a hypothesis test.

Recall that for a **difference of two proportions** test, the following relationship holds:

$$
\begin{align}
SE & \propto \sqrt{\frac{1}{N_1} + \frac{1}{N_2}} \\
\Rightarrow \frac{SE}{\sqrt{\frac{1}{N_1} + \frac{1}{N_2}}} & = \frac{\hat{SE}}{\sqrt{\frac{1}{\hat{N_1}} + \frac{1}{\hat{N_2}}}}
\end{align}
$$

Where $N_1$, $N_2$ are the total units of diversion in the control and experiment groups respectively, and the $\hat{\text{hat}}$ symbols are their respective quantities for a different sample size.

This relationship states that, because of the proportionality between SE and the sizes of the samples used in the test, we can extrapolate the empirically computed SE of a previous A/A test to compute the SE of our metric in the actual A/B test.
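
As a rough sketch of this extrapolation, the helper below rescales an empirical SE measured at one pair of group sizes to another pair. The function name and the numbers are hypothetical.

```python
import math

def extrapolate_se(se_ref, n1_ref, n2_ref, n1_new, n2_new):
    """Rescale an empirical SE from reference group sizes to new group sizes,
    using SE proportional to sqrt(1/N1 + 1/N2)."""
    scale_ref = math.sqrt(1 / n1_ref + 1 / n2_ref)
    scale_new = math.sqrt(1 / n1_new + 1 / n2_new)
    return se_ref * scale_new / scale_ref

# Hypothetical: an A/A test with 5,000 units per group gave an empirical SE of 0.0062
se_ab = extrapolate_se(0.0062, 5_000, 5_000, 20_000, 20_000)
print(f"Extrapolated SE for 20,000 units per group: {se_ab:.4f}")
```

Quadrupling both group sizes halves the SE, as expected from the square-root relationship.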

To see an example of analyzing the results of a single metric A/B test, see [this spreadsheet](Single_Metric_Example.xlsx). Use [this website](https://www.graphpad.com/quickcalcs/binomial1.cfm) to quickly perform a sign test. 
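
The sign test can also be computed directly. The sketch below is one way to do it, assuming you have one treatment-minus-control difference per day; the data are made up and `sign_test_p_value` is a hypothetical helper built on the binomial distribution from SciPy.

```python
from scipy.stats import binom

def sign_test_p_value(daily_diffs):
    """Two-sided sign test. Under the null hypothesis each non-zero daily
    difference is equally likely to be positive or negative."""
    nonzero = [d for d in daily_diffs if d != 0]
    n = len(nonzero)
    k = sum(d > 0 for d in nonzero)
    # Probability of a result at least as extreme in the nearer tail, doubled
    tail = min(binom.cdf(k, n, 0.5), binom.sf(k - 1, n, 0.5))
    return min(1.0, 2 * tail)

# Hypothetical daily (treatment - control) differences over 14 days
diffs = [0.012, 0.004, -0.003, 0.008, 0.010, 0.002, 0.007,
         0.009, -0.001, 0.006, 0.011, 0.003, 0.005, 0.004]
p_value = sign_test_p_value(diffs)
print(f"Sign test p-value: {p_value:.4f}")
```

The sign test only uses the direction of each daily difference, so it needs fewer assumptions than the parametric test but has less power.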

**Interpreting the results**

You need to decide if you've observed a statistically significant change in your experiment metric. We also want to estimate the magnitude and direction of the change via a confidence interval. Once you have this information, you want to make a decision on whether you recommend that your business launches the change. 

Things to keep in mind when interpreting the results:

- If there is no statistically significant change in the parametric test then try filtering or segmenting your target population e.g. by platform, days of the week, etc, to check for bugs or come up with a new hypothesis for who the affected population might be.
- Cross check your results with the sign test to see if it agrees with the parametric test. If they don't agree, which of the experiments are not as expected? Is there a trend in conditions for these particular experiments?
    
If you find your feature performing differently for a specific filter of your target population, then a phenomenon called [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) may be occurring in which a trend appears in several different groups of data (e.g. increase in metric) but disappears or reverses when these groups are combined (e.g. decrease in metric).

<img src="images/Simpsons_paradox.png" width=600>

It usually occurs when the slices make up different proportions of the control and treatment groups, i.e. when population sizing invariants are violated.
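
A small made-up example shows how this reversal can arise. In the sketch below, the treatment has the higher rate in each slice, yet the lower rate overall, because the two groups contain very different mixes of the two slices. All counts are hypothetical.

```python
# (conversions, users) for each slice; all counts are made up for illustration
control = {"slice_a": (160, 800), "slice_b": (10, 200)}
treatment = {"slice_a": (44, 200), "slice_b": (56, 800)}

def rate(conversions, users):
    return conversions / users

def overall_rate(group):
    conversions = sum(c for c, _ in group.values())
    users = sum(u for _, u in group.values())
    return conversions / users

for name in control:
    print(f"{name}: control={rate(*control[name]):.3f}, "
          f"treatment={rate(*treatment[name]):.3f}")
print(f"overall: control={overall_rate(control):.3f}, "
      f"treatment={overall_rate(treatment):.3f}")
```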

### Multiple Metrics

When performing multiple hypothesis tests, it becomes increasingly likely that you observe statistical significance purely by chance (due to sampling variation). As a result the significance level $\alpha$ needs to be adjusted when analyzing the p-values of multiple hypothesis tests. See [this article](https://en.wikipedia.org/wiki/Multiple_comparisons_problem) for more details on multiple comparisons.

<img src="images/multiple_metric_tests.png" width=600>

The **Family-Wise Error Rate (FWER)** is the **probability of making one or more false discoveries** (type 1 errors i.e. rejecting a true null hypothesis) across all of the tests, and it is the quantity you control instead of the per-test $\alpha$ when analyzing multiple p-values. Essentially the per-test significance threshold shrinks such that $\text{FWER} \leq \alpha$ to account for multiple comparisons. You only declare a test as statistically significant when its p-value is below this adjusted threshold. One such method that controls the FWER is the **Bonferroni correction**, which compares each p-value against $\alpha / m$ for $m$ tests and ensures that the FWER is at most equal to $\alpha$. See [this article](https://en.wikipedia.org/wiki/Family-wise_error_rate) for more details on FWER.

The Bonferroni correction makes no assumption about how the metrics being tested are related, so it remains valid even when they are correlated. The price is that it is too conservative when you are monitoring several metrics that are likely to move in unison i.e. they're correlated: the actual FWER ends up well below $\alpha$, which reduces the power of your A/B tests too drastically.
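
To make the numbers concrete, the sketch below computes how quickly the uncorrected FWER grows (for this figure only, assuming the tests are independent) and applies the Bonferroni threshold $\alpha / m$ to a set of hypothetical p-values. The function names are illustrative.

```python
def fwer_uncorrected(alpha, m):
    """FWER when m independent tests are each run at level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_reject(p_values, alpha=0.05):
    """Reject the null for each p-value at or below alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical p-values for three metrics
p_values = [0.004, 0.03, 0.2]
print(f"Uncorrected FWER for 3 tests: {fwer_uncorrected(0.05, 3):.4f}")
print("Bonferroni decisions:", bonferroni_reject(p_values))
```

Note that the second metric (p = 0.03) would have been declared significant at $\alpha = 0.05$ without the correction.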

Another class of methods modifies $\alpha$ to control the proportion of discoveries (rejected null hypotheses) that are false (type 1 errors). This proportion is the **expected proportion of false positives among all significant tests** and is called the **False Discovery Rate (FDR)**. As with FWER, you compare each p-value against an adjusted threshold and declare the tests that fall below it to be statistically significant. One such method that controls the FDR is the **Benjamini–Hochberg procedure** which ensures that the FDR is at most equal to $\alpha$. See [this article](https://en.wikipedia.org/wiki/False_discovery_rate) for more details on FDR.

FDR-controlling procedures are a good solution when a huge number of A/B tests have been conducted. They have greater power, at the cost of increased numbers of Type I errors.
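
A minimal sketch of the Benjamini–Hochberg step-up procedure is shown below, written from the standard definition: sort the p-values, find the largest rank $k$ with $p_{(k)} \leq \frac{k}{m}\alpha$, and reject the $k$ smallest. The p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans (in the input order) marking rejected nulls."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank whose p-value sits at or below its step-up threshold
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

# Hypothetical p-values for five metrics
p_values = [0.045, 0.001, 0.2, 0.025, 0.012]
print("Benjamini-Hochberg decisions:", benjamini_hochberg(p_values))
```

For these five p-values the Bonferroni threshold would be $0.05 / 5 = 0.01$, so only the smallest p-value would be rejected, whereas this procedure rejects three.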

Links to more advanced methods: [Closed Testing](https://en.wikipedia.org/wiki/Closed_testing_procedure), [Holm-Bonferroni](https://en.wikipedia.org/wiki/Holm%E2%80%93Bonferroni_method), [Boole's Inequality](https://en.wikipedia.org/wiki/Boole%27s_inequality)

### Interpreting multiple metrics

What you're hoping for is that **related metrics** are going to move in the same direction, e.g. CTR and CTP should hopefully both increase. **Composite metrics** that are computed using similar quantities should also hopefully move in the same direction and analysing the formula for the composite metrics will hopefully tell you why they have moved in a particular direction. 

Multiple metrics however can be unruly. It's usually a better idea to eventually come up with a single value **Overall Evaluation Criterion (OEC)** as a KPI and use that as a measure of whether a desirable change has been made to the business.

How do you come up with one? It's usually a business decision where you have to decide how much you weight the improvement of particular metrics over the others. This is especially important when optimizing one metric compromises another, and thus you must come up with some OEC combination of multiple metrics that tells you, using a single number, that you're not improving one metric too much at the expense of another. See for example how [F1 score combines and weights two rate metrics (Precision and Recall)](https://en.wikipedia.org/wiki/F1_score#Definition) depending on which one the business finds more important.

An OEC built as a weighted sum of other metrics doesn't necessarily have to be used to make a launch decision. It can instead be used to clarify which metrics the business finds important (reweighting as appropriate).
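
As an illustration only, a weighted-sum OEC could be sketched as below. The metric names, relative changes and weights are all hypothetical; in practice the weights are a business decision.

```python
def oec(relative_changes, weights):
    """Weighted sum of per-metric relative changes (treatment vs. control)."""
    missing = set(weights) - set(relative_changes)
    if missing:
        raise KeyError(f"missing metrics: {sorted(missing)}")
    return sum(weights[m] * relative_changes[m] for m in weights)

# Hypothetical relative changes observed in an experiment
changes = {"ctr": 0.04, "time_on_page": -0.02, "revenue_per_user": 0.01}
# Hypothetical weights expressing how much the business values each metric
weights = {"ctr": 0.3, "time_on_page": 0.2, "revenue_per_user": 0.5}

score = oec(changes, weights)
print(f"OEC: {score:.4f}")
```

A positive score here means the gains in the heavily weighted metrics outweigh the loss in time on page, under this particular choice of weights.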

### Conclusions - Making a recommendation

A summary of the possible scenarios, recommendations and conclusions that can be made after performing A/B tests:

- **Interpreting the p-values of multiple metrics**
    - Use sophisticated methods to control FWER and FDR and to account for correlation between metrics
    - Discuss results with decision makers and launch non-risky changes or run further experiments
- **If you have a statistically significant change in some metrics, but not in others**
    - Try to understand why this is happening. Should these changes actually be moving in unison? Is it OK for small changes in multiple metrics not to move in unison? Is it OK for big changes not to move in unison?
    - For example, improving the design of a page can counter-intuitively reduce the reading time of a page and increase the CTR/CTP of buttons.
- **If you have a positive impact for one slice (e.g. English traffic) and no impact or a negative impact for another slice (e.g. Korean traffic) (Simpson's paradox)**
    - Again, try to understand why this is happening. Initial intuitions of how different slices should react may be incorrect. Can you replicate the effect in an observational study? Do you have a bug?
    - For example, boldfacing words in English provides emphasis but this is not the case for Chinese, Japanese, Korean since it makes words harder to read.
    - Is there something wrong with the set-up of the experiment? Are the different slices actually being exposed to the change in the same way?
- **If your parametric tests (e.g. difference of two proportions) and non-parametric tests (e.g. sign test) disagree with each other**
    - Which experiments in the sign test didn't agree with the overall parametric test? Can you modify the metric or target population?
    - For example, the sign test tells you that there is no impact throughout the week but there is a positive impact on weekends, whereas the parametric test tells you that there is a positive impact throughout the week.
- **Invariant sanity checks fail**
    - Do not proceed with interpreting p-values or confidence intervals
    - Perform pre-period and post-period A/A tests to check for problems in the setup, data capture or treatment (is the treatment being exposed properly to the two groups?)
    - Perform retrospective analysis to see if you can recreate the problem
    - Consult engineers to check if there is a problem with the infrastructure for population invariants
- **When do you actually launch a change?**
    - Do you have statistically and practically significant changes?
    - Do you understand why the change occurred?
    - Is it worth it? i.e. how much does it cost to launch or maintain the change?
    - Have you only run 1 experiment? Should you run more than 1? How important is the change?
- **Slicing reveals that 30% of users experience a positive impact whilst 70% experience no impact. Or 70% is improved whilst 30% is negatively impacted**
    - Wait and fine tune the change?
    - Launch it as is?

### Adapting an A/B test over time

If you decide to launch the changes of an A/B test, it is often a good idea to instead **slowly ramp up** the experiment i.e. increase the traffic diverted to the experiment group until it is eventually fully launched, for example start with 1% of traffic and increase this, if tests are passed, until 100% of traffic is diverted to the experiment group.

You can slowly **remove filters** such that the target population is less and less segmented and becomes the entire traffic of the website. For example, test changes on English traffic, then English and French, then eventually all languages.

However increasing the % of traffic diverted to the experiment group or removing filters will naturally dilute the effect size. Confounding factors start taking effect e.g. more student traffic, temporal factors such as holidays, etc.

You can use a **hold-out** group which never sees the change and continue comparing the effect size against that group, to see whether the dilution is also present in the hold-out and to verify that the effect is still impactful and significant over time for the entire traffic.

Novelty effect and change aversion can come into play and also start diluting the effect size once you ramp up and remove filters. This is when cohort analysis is useful once again (targeting participants that see the change at the same time). Pre-period and post-period experiments are also useful to monitor how users' behaviours change over time.

<br>
<br>

___

<br>
<br>

# Final project

### Experiment Overview: Free Trial Screener

At the time of this experiment, Udacity courses currently have two options on the course overview page: "start free trial", and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.


In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. [This screenshot](https://drive.google.com/file/d/0ByAfiG8HpNUMakVrS0s4cGN2TjQ/view) shows what the experiment looks like.


The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.


The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

### Experiment Design

**Metric Choice**

The following metrics will be used as **invariant metrics:**
- **Number of cookies** - The number of unique cookies to view the course overview page
    - This is the unit of diversion for participants who have not enrolled yet. Unit of diversion metrics are always invariant
- **Number of clicks** - The number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered)
    - This metric measures an effect that is experienced by users in both groups equally. The "Start free trial" button will not change in the experiment.
- **CTP** - $\frac{\text{Number of unique cookies to click the "Start free trial" button}}{\text{Number of unique cookies to view the course overview page}}$
    - Another metric that measures an effect that is the same for both groups. Since the "Start free trial" button is the same for both groups, the CTP should also remain the same between both groups.

To decide on the evaluation metrics to use, we must keep in mind what the hypothesis of the experiment is.

The following metrics will be used as **evaluation metrics:**
- **Gross Conversion** - $\frac{\text{Number of User-IDs to complete checkout and enroll in the free trial}}{\text{Number of unique cookies to click the "Start free trial" button}}$
    - This measures an effect that is expected to be different between the experiment and control groups so it will not be invariant. 
    - The experiment group will see the screener page and the control group will not, thus the number of User-IDs who enroll should be different between the two groups.
    - According to the hypothesis this metric should decrease. By setting clearer expectations for students upfront, only students who are committed to completing the course should enroll in the free trial (numerator decreases, denominator stays the same).
- **Retention** - $\frac{\text{Number of User-IDs to remain enrolled past the 14-day boundary (and thus make at least one payment)}}{\text{Number of User-IDs to complete checkout and enroll}}$
    - This measures an effect that is expected to be different between the experiment and control groups so it will not be invariant.
    - The experiment group will see the screener page and the control group will not, thus the number of User-IDs who remain enrolled as well as the Number of User-IDs to complete checkout should be different between the two groups.
    - According to the hypothesis this metric should increase. If only students who are committed to completing the course complete checkout then the idea is, a larger proportion of these students should remain enrolled (numerator increases, denominator decreases).
- **Net conversion** - $\frac{\text{Number of User-IDs to remain enrolled past the 14-day boundary (and thus make at least one payment)}}{\text{Number of unique cookies to click the "Start free trial" button}}$
    - This measures an effect that is expected to be different between the experiment and control groups so it will not be invariant.
    - The experiment group will see the screener page and the control group will not, thus the number of User-IDs who remain enrolled should be different between the two groups (the number of unique cookies to click the "Start free trial" button is an invariant, so the denominator should not change).
    - According to the hypothesis this metric should increase. If students have their expectations set early then the number of User-IDs to remain enrolled should increase (numerator increases, denominator stays the same)
    - In reality, it is **unclear** whether this metric will increase as the fraction is out of all cookies who click the "Start free trial" button. The number of students enrolling could decrease due to only committed students enrolling, which would decrease the overall number of students who remain enrolled (the numerator) but still increase the **Retention** rate 

**Calculating standard deviation**

Previous analysis shows the following baseline values for the following metrics:

- Unique cookies to view course overview page per day - 40000
- Unique cookies to click "Start free trial" per day - 3200
- Enrollments per day - 660
- CTP on "Start free trial" button - 0.08
- Probability of enrolling, given click - 0.20625
- Probability of remaining enrolled, given enroll - 0.53
- Probability of payment, given click - 0.1093125
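
As a quick consistency check, the baseline probabilities above can be recomputed from the daily counts. This sketch only restates the listed values; the variable names are illustrative.

```python
import math

cookies_per_day = 40000      # unique cookies to view the course overview page
clicks_per_day = 3200        # unique cookies to click "Start free trial"
enrollments_per_day = 660
p_remain_given_enroll = 0.53

ctp = clicks_per_day / cookies_per_day
gross_conversion = enrollments_per_day / clicks_per_day
net_conversion = gross_conversion * p_remain_given_enroll

print("CTP:", ctp)
print("Gross conversion (enroll given click):", gross_conversion)
print("Net conversion (payment given click):", round(net_conversion, 7))

# The recomputed values agree with the listed baselines
assert math.isclose(ctp, 0.08)
assert math.isclose(gross_conversion, 0.20625)
assert math.isclose(net_conversion, 0.1093125)
```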

Standard Error of a proportion:
$$SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Figure out how many units of analysis will correspond to 5000 cookies for each metric. 

$\text{Number of unique cookies to click the "Start free trial" button} = 5000 \times 0.08 = 400$

$\text{Number of User-IDs to complete checkout and enroll} = 400 \times 0.20625 = 82.5$

Substitute for $N$ and $p$ (the metric's corresponding probability) in the SE formula above. 

    import numpy as np

    # Gross conversion SE (unit of analysis: cookies that click)
    total_num_UOA = 5000 * 0.08
    print("{:.4f}".format(np.sqrt((0.20625*(1-0.20625))/total_num_UOA)))
    # 0.0202

    # Retention SE (unit of analysis: User-IDs that enroll)
    total_num_UOA = 5000 * 0.08 * 0.20625
    print("{:.4f}".format(np.sqrt((0.53*(1-0.53))/total_num_UOA)))
    # 0.0549

    # Net conversion SE (unit of analysis: cookies that click)
    total_num_UOA = 5000 * 0.08
    print("{:.4f}".format(np.sqrt((0.1093125*(1-0.1093125))/total_num_UOA)))
    # 0.0156

For Gross Conversion and Net Conversion, the unit of analysis (the denominator) is a cookie, which matches the unit of diversion. The analytical variance should therefore be comparable to the empirical variance for these two metrics.

For Retention, the unit of analysis is a User-ID, which does not match the unit of diversion (a cookie). The analytical variance is therefore likely to underestimate the empirical variance for Retention.

If there is time, it is worth doing an empirical estimate of the variance for Retention.

### Sizing

Since there are 3 evaluation metrics, it will be best to use the Bonferroni correction to analyse the p-values of the hypothesis tests.

$$
\begin{align}
\alpha & = 0.05 \\
\Rightarrow \alpha^{*} & = \frac{\alpha}{m} = \frac{0.05}{3} \\
\alpha^{*} & = 0.0167
\end{align}
$$

The difference of two proportions $d$ will have standard error
$$SE_{d} = \sqrt{\frac{p_{pooled}(1-p_{pooled})}{n_1} + \frac{p_{pooled}(1-p_{pooled})}{n_2}}$$

Since this sizing calculation is performed before the A/B test is run, we will use the probabilities acquired from the baseline values as a substitute for $p_{pooled}$ to estimate $SE_{d}$. We will use this to calculate the most conservative estimate for the number of enrollees required **per group** to ensure $\alpha^{*} = 0.0167$, $\beta = 0.2$ for the practical significance size $d_{min}$ of the metric that requires the most enrollees.

    import numpy as np
    from scipy.stats import norm

    def get_z_star(alpha):
        # Two-sided critical value for significance level alpha
        return -norm.ppf(alpha/2)

    def get_beta(z_star, s, d_min, N):
        SE = s / np.sqrt(N)  # The SE of d varies with N which subsequently affects the power
        # Probability of failing to reject the null when the true difference is d_min
        return norm.cdf(z_star*SE, loc=d_min, scale=SE)

    def required_size(s, d_min, N_max=100000, alpha=0.05, beta=0.2):
        """
        s is the SE of the metric with N=1 in each group
        """
        z_star = get_z_star(alpha)
        for n in range(1, N_max):
            if get_beta(z_star, s, d_min, n) <= beta:
                return n
        return -1  # no N below N_max reaches the requested power

    # Retention: enrollees required per group, converted to pageviews across both groups
    num_enrollees = required_size(s=np.sqrt(0.53*(1-0.53)*2), d_min=0.01, alpha=0.05)
    num_pageviews = (num_enrollees/(0.08 * 0.20625))*2
    print("Number of pageviews required: {}".format(num_pageviews))
    # Number of pageviews required: 4739878.787878788

    # Net conversion: clicks required per group
    required_size(s=np.sqrt(0.1093125*(1-0.1093125)*2), d_min=0.0075, alpha=0.05)
    # 27172

    # Pageviews across both groups for 27413 clicks per group (CTP = 0.08)
    (27413/(0.08))*2
    # 685325.0

    # Days required at 20000 pageviews per day
    685325/20000
    # 34.26625