In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir('../../notebook_format')
from formats import load_style
load_style()

In [2]:
os.chdir(path)
import numpy as np
import scipy.stats as stats

# A/B Test Caveats

## Avoid Biased Stopping Times

When you run an A/B test, you should avoid stopping the experiment as soon as the results "look" significant. Using a stopping time that is dependent upon the results of the experiment can inflate your false-positive rate substantially.

To understand why, let's look at a simpler experimental problem. Say we have a coin in front of us, and we want to know whether it's biased -- whether it lands heads-up with probability other than 50%. If we flip the coin $n$ times and it lands heads-up on $k$ of them, then, starting from a uniform prior, the posterior distribution for the coin's bias is $p \sim Beta(k+1, n-k+1)$. If 0.5 isn't within a 95% credible interval for $p$, we conclude that the coin is biased, accepting a false positive rate of roughly 5%. This is all fine as long as the number of flips we perform, $n$, doesn't depend on the results of the previous flips. If we let $n$ depend on those results, we bias the experiment in favor of extreme outcomes.
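As a minimal sketch of the check just described, the equal-tailed credible interval can be read off the Beta posterior with `scipy.stats`. The helper names `credible_interval` and `excludes_fair_coin` and the example counts are illustrative assumptions, not part of the original notebook.

```python
import scipy.stats as stats


def credible_interval(k, n, level=0.95):
    """Equal-tailed credible interval for p, assuming a uniform prior."""
    posterior = stats.beta(k + 1, n - k + 1)
    tail = (1 - level) / 2
    return posterior.ppf(tail), posterior.ppf(1 - tail)


def excludes_fair_coin(k, n, level=0.95):
    """True if 0.5 lies outside the credible interval for p."""
    lower, upper = credible_interval(k, n, level)
    return not (lower <= 0.5 <= upper)


# hypothetical counts: 560 heads and 500 heads out of 1000 flips
print(credible_interval(k=560, n=1000))
print(excludes_fair_coin(k=560, n=1000))
print(excludes_fair_coin(k=500, n=1000))
```

Asking whether 0.5 falls outside this interval is the same as asking whether the posterior CDF evaluated at 0.5 is at most 0.025 or at least 0.975, which is the form the simulation code in this notebook uses.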

Let's clarify this by simulating two experimental procedures in code.

- **Unbiased Procedure:** We flip the coin 1000 times. Let $k$ be the number of times that the coin landed heads-up. After all 1000 flips, we look at the $p \sim Beta(k+1, 1000-k+1)$ distribution. If 0.5 lies outside the 95% credible interval for $p$, then we conclude that $p \neq 0.5$; if 0.5 does lie within the 95% credible interval, then we're not sure -- we don't reject the idea that $p = 0.5$.
- **Biased Procedure:** We start flipping the coin. For each $n$ with $1 \leq n \leq 1000$, let $k_n$ be the number of times the coin lands heads-up in the first $n$ flips. After each flip, we look at the distribution $p \sim Beta(k_n+1, n-k_n+1)$. If 0.5 lies outside the 95% credible interval for $p$, then we immediately halt the experiment and conclude that $p \neq 0.5$; if 0.5 does lie within the 95% credible interval, we continue flipping. If we make it to 1000 flips, we stop completely and follow the unbiased procedure.

How many false positives do you think that the Biased Procedure will produce? We chose a 95% credible interval so that the false positive rate would be about 5%. Let's repeat each procedure 10,000 times, assuming that the coin really is fair, and see what the false positive rates really are:

In [3]:
def unbiased_procedure(n):
    """
    Parameters
    ----------
    n : int
        number of experiments (1000 coin flips) to run
    """  
    false_positives = 0
    
    for _ in range(n):
        # success[-1] : total number of heads after the 1000 flips
        success  = ( np.random.random(size = 1000) > 0.5 ).cumsum()
        beta_cdf = stats.beta( success[-1] + 1, 1000 - success[-1] + 1 ).cdf(0.5)

        if beta_cdf >= 0.975 or beta_cdf <= 0.025:
            false_positives += 1

    return false_positives / n

In [4]:
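def biased_procedure(n):
    """
    Parameters
    ----------
    n : int
        number of experiments (up to 1000 coin flips each) to run
    """
    false_positives = 0

    for _ in range(n):
        # success[i] : running number of heads after the first i + 1 flips
        success = ( np.random.random(size = 1000) > 0.5 ).cumsum()
        trials  = np.arange( 1, 1001 )
        # posterior cdf at 0.5 after every flip, i.e. one "peek" per flip
        history = stats.beta( success + 1, trials - success + 1 ).cdf(0.5)

        # halting at the first significant peek is equivalent to
        # asking whether any peek was ever significant
        if ( history >= 0.975 ).any() or ( history <= 0.025 ).any():
            false_positives += 1

    return false_positives / n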

In [5]:
# simulating 10k experiments under each procedure
print( unbiased_procedure(10000) )
print( biased_procedure(10000) )

0.0567
0.4931


Almost half of the experiments under the biased procedure produced a false positive. By contrast, only around 5% of the experiments under the unbiased procedure resulted in a false positive, which is close to the 5% false positive rate that we expected, given our 95% credible interval.
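To see how the inflation depends on how often we peek, here is a rough sketch that applies the same stopping rule but only looks at the data every `check_every` flips. The function name `false_positive_rate`, its parameters, and the peeking schedules are assumptions made for illustration; they do not appear in the original notebook.

```python
import numpy as np
import scipy.stats as stats


def false_positive_rate(check_every, n_experiments=1000, n_flips=1000, seed=0):
    """Share of fair-coin experiments that reject p = 0.5 at any peek."""
    rng = np.random.default_rng(seed)
    # one row per experiment; running count of heads after each flip
    heads = (rng.random(size=(n_experiments, n_flips)) > 0.5).cumsum(axis=1)

    # 0-based indices of the flips after which we peek;
    # the final flip is always included
    peeks = np.arange(check_every - 1, n_flips, check_every)
    if peeks.size == 0 or peeks[-1] != n_flips - 1:
        peeks = np.append(peeks, n_flips - 1)

    k = heads[:, peeks]  # heads seen at each peek
    n = peeks + 1        # flips seen at each peek
    cdf = stats.beta.cdf(0.5, k + 1, n - k + 1)
    rejected = ((cdf <= 0.025) | (cdf >= 0.975)).any(axis=1)
    return rejected.mean()


for check_every in (1, 10, 100, 1000):
    print(check_every, false_positive_rate(check_every))
```

Every call reuses the same seed, and each coarser schedule here peeks at a subset of the moments the finer one does, so the rate can only fall as peeks become rarer. Only a single look at the very end (`check_every=1000`) corresponds to the unbiased procedure.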

---

The easiest way to avoid this problem is to **choose a stopping time that's independent of the test results**. You could, for example, decide in advance to run the test for exactly two weeks, no matter the results you observe during the test's tenure. Or you could decide to run the test until each bucket has received more than 10,000 visitors, again ignoring the test results until that condition is met.

For example, if 99% of your visitors who convert do so within one week of arriving, then you should do the following:

- Run your test for two weeks.
- Include in the test only users who show up in the first week. If a user shows up on day 13, you have not given them enough time to convert.
- At the end of the test, if a user who showed up on day 2 converts more than 7 days after he first arrived, he must be counted as a non-conversion.
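The bookkeeping in the list above can be sketched as follows. The `Visit` record, its field names, and the sample data are hypothetical; the 7-day enrollment period and 7-day conversion window come from the example.

```python
from dataclasses import dataclass
from typing import Optional

ENROLL_DAYS = 7        # only users arriving in the first week are included
CONVERSION_WINDOW = 7  # days a user is given to convert after arriving


@dataclass
class Visit:
    arrival_day: int               # day of the test the user first showed up
    conversion_day: Optional[int]  # day the user converted, None if never


def summarize(visits):
    """Return (enrolled users, conversions counted within the window)."""
    enrolled = [v for v in visits if v.arrival_day <= ENROLL_DAYS]
    conversions = sum(
        1
        for v in enrolled
        if v.conversion_day is not None
        and v.conversion_day - v.arrival_day <= CONVERSION_WINDOW
    )
    return len(enrolled), conversions


# made-up visits for one bucket
visits = [
    Visit(arrival_day=2, conversion_day=5),      # counted as a conversion
    Visit(arrival_day=2, conversion_day=12),     # too late: non-conversion
    Visit(arrival_day=7, conversion_day=14),     # exactly 7 days: counted
    Visit(arrival_day=3, conversion_day=None),   # never converted
    Visit(arrival_day=13, conversion_day=13),    # arrived too late: excluded
]
print(summarize(visits))
```

Every enrolled user gets the same amount of time to convert, so users who arrive early are not favored over users who arrive late.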

When you can afford to wait, setting a results-independent stopping time is the easiest and most reliable way to avoid biased stopping times.

## Do Follow Up Tests and Watch your Overall Success Rate

If you're running a lot of A/B tests, you should run follow-up tests and pay attention to your base success rate.

Let's talk about these in reverse order. Imagine that you do everything right. You use the Beta distribution. You implement a hierarchical model. You set your stopping time in advance, and keep it independent from the test results. You set a relatively high success criterion: a probability of at least 95% that the variant is better than the control (formally, $p \leq 0.05$). You do all of that. You run 100 tests, each with all the rigor just described. In the end, of those 100 tests, 5 of them claim that the variant will beat the control.

How many of those variants do you think are really better than the control, though? If you run 20 tests in a row in which the "best" variant is worse than or statistically indistinguishable from the control, then you should be suspicious when your 21st test comes out positive. If a button-color test failed to elicit a winner six months ago, but did produce one today, you should be skeptical. Why now but not then?
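A back-of-the-envelope sketch makes the role of the base success rate concrete. It treats the 95% criterion as a 5% false positive rate, as the text does, and it assumes values for two quantities the text does not give: the share of tested variants that are truly better (`base_rate`) and the chance that a truly better variant is detected (`power`). Both numbers below are illustrative guesses, not measurements.

```python
def share_of_real_winners(n_tests, base_rate, power, alpha=0.05):
    """Expected fraction of declared winners that are truly better."""
    true_positives = n_tests * base_rate * power
    false_positives = n_tests * (1 - base_rate) * alpha
    return true_positives / (true_positives + false_positives)


# assumed base rates of truly better variants, with an assumed power of 80%
for base_rate in (0.02, 0.10, 0.30):
    share = share_of_real_winners(100, base_rate, power=0.8)
    print(f"base rate {base_rate:.0%}: about {share:.0%} of winners are real")
```

The lower the base rate of genuinely better variants, the larger the share of declared winners that are false positives, which is why a long run of failed tests should make you more skeptical of the next win.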

Here's an intuitive way of thinking about this problem. Let’s say you have a class of students who each take a 100-item true/false test on a certain subject. Suppose each student chooses randomly on all questions. Each student would achieve a random score between 0 and 100, with an average of 50.

Now take only the top-scoring 10% of the class and, declaring them "winners", give them a second test, on which they again choose randomly. On average, they will score lower on the second test than on the first. That’s because, no matter what they scored on the first test, they will still average 50 correct answers on the second test.
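The thought experiment is easy to simulate. In this sketch the class size of 10,000 and the fixed seed are assumptions chosen only to make the averages stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_questions = 10_000, 100

# every answer is a coin flip, so each score is Binomial(100, 0.5)
first = rng.binomial(n_questions, 0.5, size=n_students)
second = rng.binomial(n_questions, 0.5, size=n_students)

# "winners": roughly the top 10% on the first test
# (ties at the cutoff can make the group slightly larger than 10%)
cutoff = np.quantile(first, 0.9)
winners = first >= cutoff

print("winners, first test  :", first[winners].mean())
print("winners, second test :", second[winners].mean())
print("everyone, second test:", second.mean())
```

The winners were selected for a high first score, so their first-test average sits well above 50, while their second-test average falls back to roughly the same level as everyone else's.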

The same thing happens in A/B testing. If your original winning test was a false positive then any further testing will of course show a reduction in the uplift caused by the variant. 

Because of this, follow-up testing is recommended. Suppose you tried out three variants, B, C, and D, against the control A, and Variant C won. Don't deploy it fully; drive 95% of your traffic to Variant C and 5% to Variant A (or some variation on this; the percent split is not important as long as you will have reasonable statistical power within an acceptable time period). It's a small cost in terms of conversion, 0.05 * (C's rate - A's rate) for the duration of the test, and it will give you more information about C's true performance relative to A.
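A follow-up test like this can be evaluated with the same Beta posteriors used earlier. The sketch below uses made-up counts for a 95/5 split, and the helper names `prob_c_beats_a` and `holdout_cost` are hypothetical; it estimates the probability that C's true rate exceeds A's by sampling from both posteriors, and it computes the cost of the holdout from the formula above.

```python
import numpy as np


def prob_c_beats_a(conv_a, n_a, conv_c, n_c, draws=100_000, seed=0):
    """Monte Carlo estimate of P(rate_C > rate_A) under uniform priors."""
    rng = np.random.default_rng(seed)
    rate_a = rng.beta(conv_a + 1, n_a - conv_a + 1, size=draws)
    rate_c = rng.beta(conv_c + 1, n_c - conv_c + 1, size=draws)
    return (rate_c > rate_a).mean()


def holdout_cost(rate_a, rate_c, holdout_share=0.05):
    """Conversion rate given up per visitor by keeping a holdout on A."""
    return holdout_share * (rate_c - rate_a)


# made-up follow-up data: 5% of traffic on A, 95% on C
n_a, conv_a = 1_000, 50
n_c, conv_c = 19_000, 1_140

print(prob_c_beats_a(conv_a, n_a, conv_c, n_c))
print(holdout_cost(conv_a / n_a, conv_c / n_c))
```

With only 5% of traffic on A, the posterior for A stays wide, so the follow-up may need to run for a while before the comparison becomes decisive.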

For the same reason, it's worth keeping a record of previous tests: when they were run, the variants that were tried, and the posterior distributions for their true conversion rates. This historical record gives you an idea of what's reasonable. It won't tell you directly what rates to expect from future tests, because the absolute numbers are extremely time dependent: the raw numbers you get today can be completely different from the ones you would get six months later. It does, however, give you an idea of what's plausible in terms of the relative performance of each test.

## Reference

- [MOST winning A/B test results are illusory](http://www.qubit.com/sites/default/files/pdf/mostwinningabtestresultsareillusory_0.pdf)
- [Using data science with A/B tests: Bayesian analysis](https://econsultancy.com/blog/65755-using-data-science-with-a-b-tests-bayesian-analysis/)
- [Don't use bandit algorithms, they probably won't work for you](https://www.chrisstucchio.com/blog/2015/dont_use_bandits.html)