# A/B testing

* A user experience research methodology
* Used to establish which of two treatments/products/procedures/features is more effective


**Key terms**

| Term | Definition |
| --- | --- |
| Subject | The item that is exposed to a treatment (e.g. patient, website visitor etc.) |
| Treatment | Something to which a subject is exposed (e.g. drug, website feature, price etc.) |
| Treatment group | A group of subjects exposed to a specific treatment |
| Control group | A group of subjects NOT exposed to the treatment |
| Randomization | The process of randomly assigning subjects to treatment/control groups |
| Test statistic | The metric used to measure the effect of the treatment |
| Null hypothesis | A baseline assumption that any difference in effect between groups is due to chance |
| Alternative hypothesis | Counterpoint to the null. There is a difference between the groups  |
| One-way test | A hypothesis test that counts chance results only in one direction (AKA one-tailed test) |
| Two-way test | A hypothesis test that counts chance results in two directions (AKA two-tailed test) |
| Permutation test | A method of estimating the sampling distribution given a null model. It utilises random resampling *without* replacement |
| Bootstrap permutation test | A variant of the above, which utilises random resampling with replacement |
| Exhaustive permutation test | A variant of the permutation test. Instead of random resampling, all possible divisions of the data are tested. Sometimes called an exact test |
| p-value | The probability of obtaining results at least as extreme as the observed results given a chance model that embodies the null hypothesis |
| Alpha | The probability threshold that chance results must surpass for actual outcomes to be deemed statistically significant. It represents the acceptable probability of rejecting a true null hypothesis. |
| Type 1 error | False positive. Mistakenly concluding an effect is real. |
| Type 2 error | False negative. Mistakenly concluding an effect is due to chance. |


**Key ideas**

* Subjects are assigned *randomly* to two (or more) groups.
* Groups are treated exactly the same BUT the treatment under study differs between groups
* Randomization means differences between groups are due to either:
  * The effect of the different treatments
  * Random chance (i.e. by chance naturally better performing subjects are allocated to one group)
* A test statistic is measured for each group to assess the effect of treatment
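
Random assignment can be sketched in a few lines. This is a minimal illustration, not a production allocator: the function name `assign_groups`, the subject IDs and the seed are all hypothetical, and it assumes a simple 50/50 split into two groups.

```python
import random

def assign_groups(subject_ids, seed=None):
    """Randomly split subjects into two groups, A and B (sizes differ by at most 1)."""
    rng = random.Random(seed)      # seeded only to make the example repeatable
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)          # every ordering is equally likely
    half = len(shuffled) // 2
    return {"A": shuffled[:half], "B": shuffled[half:]}

# Hypothetical subject IDs 0..9
groups = assign_groups(range(10), seed=42)
print(groups)
```

Because allocation depends only on the shuffle, no property of a subject can influence which group it lands in.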

  
**Experimental design**

1) Define a null & alternate hypothesis
2) Choose a single metric, called the *test statistic*, to assess results (e.g. mean spend, % conversion)
3) Choose a significance level (*alpha*) and decide between a *one-way or two-way* hypothesis test
4) Perform the hypothesis test


**Hypothesis tests**

* Used to assess whether random chance is a reasonable explanation for an observed difference between groups A and B
  
*How it works:*

1) The null hypothesis is assumed to be true
2) The sampling distribution of the test statistic is computed using a permutation test or standard reference distribution. This indicates the random variability in the test statistic
3) A p-value is calculated for the test statistic using the sampling distribution
4) A significance test is conducted. The p-value is compared to a significance threshold. If it is below the threshold, the null hypothesis is rejected; otherwise it is not rejected.
   * For a one-tailed test, compare the one-tailed p-value to the *alpha* significance level
   * For a two-tailed test, compare the two-tailed p-value to the *alpha* significance level (equivalently, compare the one-tailed p-value to half the *alpha* significance level)

*Definitions:*

* The p-value represents the probability, given a chance model that embodies the null hypothesis, of a result *at least* as extreme as the observed effect. It therefore indicates whether the observed effect is a reasonable outcome of a "null" model.
* *Alpha* is the probability threshold that chance results must surpass for actual outcomes to be deemed statistically significant. It represents the acceptable probability of rejecting a true null hypothesis.

*p-value controversy:*

* A p-value does NOT give the probability the alternate hypothesis is true.
* A p-value gives you the probability that a null model would produce a result at least as extreme as the observed result.

**Permutation tests (AKA randomization or random permutation tests)**

* Used to estimate the sampling distribution when the null hypothesis is true
* A p-value can then be calculated as the proportion of resamples in which a difference equal to or greater than the observed difference occurs
* A non-parametric method 
  
*Advantage(s):*

1) No assumptions are made about the shape of the distribution (e.g. normal etc.)

*How it works:*

1) Combine A & B groups and shuffle subjects
2) Use sampling without replacement to recreate each of the groups (with their original size)
3) Recalculate the test statistic (e.g. mean values for A and B) and the difference in magnitude between the groups (e.g. mean B - mean A)
4) Repeat the above steps multiple times (e.g. 1,000 repetitions give a p-value resolution of 0.001) to yield a distribution for the *difference* in the test statistic (i.e. effect size)
5) Use this sampling distribution to estimate a p-value for the observed difference between A & B
6) Conduct a significance test by comparing the estimated p-value to the chosen *alpha* level
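
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the test statistic is the mean, the alternative is one-tailed (B is greater than A), and the data, the function name `permutation_test` and the seed are hypothetical.

```python
import random
from statistics import mean

def permutation_test(a, b, n_resamples=1000, seed=None):
    """Estimate a one-tailed p-value for mean(b) - mean(a) by random permutation."""
    rng = random.Random(seed)
    observed = mean(b) - mean(a)
    pooled = list(a) + list(b)
    n_a = len(a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                      # combine groups and shuffle
        perm_a = pooled[:n_a]                    # redraw A without replacement
        perm_b = pooled[n_a:]                    # the remainder forms B
        diff = mean(perm_b) - mean(perm_a)       # recalculate the difference
        if diff >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_resamples

# Hypothetical session times (seconds) for two page designs
group_a = [21, 24, 19, 30, 27, 22, 25, 18]
group_b = [28, 31, 26, 35, 29, 33, 27, 30]

alpha = 0.05
observed, p_value = permutation_test(group_a, group_b, seed=0)
print(f"observed difference = {observed:.3f}, p-value = {p_value:.3f}")
print("reject null" if p_value < alpha else "fail to reject null")
```

Slicing a shuffled pool is equivalent to sampling without replacement, and it keeps both groups at their original sizes.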

Note: Exhaustive and bootstrap permutation tests are variants of the above process. See key terms.
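
As a sketch of the exhaustive variant mentioned in the note above, the example below enumerates every possible division of the pooled data instead of shuffling at random. The function name and the toy data are hypothetical, and the approach is only practical for small samples because the number of divisions grows combinatorially.

```python
from itertools import combinations
from statistics import mean

def exhaustive_permutation_test(a, b):
    """Exact one-tailed p-value for mean(b) - mean(a) over all possible splits."""
    observed = mean(b) - mean(a)
    pooled = list(a) + list(b)
    indices = range(len(pooled))
    at_least_as_extreme = 0
    total = 0
    for a_idx in combinations(indices, len(a)):   # every way to choose group A
        chosen = set(a_idx)
        perm_a = [pooled[i] for i in chosen]
        perm_b = [pooled[i] for i in indices if i not in chosen]
        total += 1
        if mean(perm_b) - mean(perm_a) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / total

# Toy data: only 1 of the 20 possible splits is as extreme as the observed one
print(exhaustive_permutation_test([1, 2, 3], [4, 5, 6]))  # 0.05
```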


**Standard reference distributions**
* An alternative method of estimating the sampling distribution of a test statistic and calculating a p-value
* A parametric method

*Advantage(s):* 
* It doesn't require resampling so is computationally efficient
  
*Disadvantage(s):*
* You must make an assumption about the shape of the sampling distribution

*How it works:*

1) Choose an appropriate distribution to model the sampling distribution when the null hypothesis is true. Examples include:
   * z-distribution (AKA standard normal distribution)
   * t-distribution (AKA Student's t-distribution)
   * chi-squared distribution
2) Compute a standardized version of the test statistic. This means it is on the same scale as the standard reference distribution. Examples include:
   * z-score (AKA z-statistic)
   * t-score (AKA t-statistic)
   * chi-squared statistic
  
3) Calculate the p-value that corresponds to the standardised statistic
4) Conduct a significance test. Compare the p-value to a pre-selected *alpha* significance level. If it is below the significance level, the null hypothesis is rejected (and vice versa).
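
As one concrete instance of the steps above, here is a minimal sketch of a two-proportion z-test. It assumes the test statistic is a conversion rate, that samples are large enough for the normal approximation to hold, and that the alternative is one-tailed (B converts better than A). The function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z-score, one-tailed p-value) for rate_b > rate_a."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null both groups share one rate, so pool the data to estimate it
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error            # standardized test statistic
    p_value = 1 - NormalDist().cdf(z)            # area in the upper tail
    return z, p_value

# Hypothetical counts: 200/1000 conversions for A, 250/1000 for B
alpha = 0.05
z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("reject null" if p_value < alpha else "fail to reject null")
```

No resampling is needed here: the p-value comes straight from the standard normal distribution, which is where the computational saving comes from.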


**Choosing the significance level**

1) *Choose a one-way or two-way hypothesis test*

<img src="figures/one-tailed vs two-tailed test.png" align="center" width="400" />


* The null and alternate hypothesis determine the type of hypothesis test
* One-tailed tests are used when the alternative hypothesis is directional (e.g. A is better than B). Compare the one-tailed p-value to the *alpha* significance level
* Two-tailed tests are used when the alternative hypothesis is not directional (e.g. A is different to B). Compare the two-tailed p-value to the *alpha* significance level, which is equivalent to comparing the one-tailed p-value to half the *alpha* significance level

Note: Two-tailed tests are often used even with directional hypotheses, as they are more conservative and protect you from being fooled by chance in either direction.
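
The difference between the two test types can be illustrated with a z-score. This is a sketch that assumes a standard normal reference distribution and a directional alternative of "B is greater than A"; the function name and the z-score of 1.8 are hypothetical.

```python
from statistics import NormalDist

def p_values_from_z(z):
    """Return (one-tailed, two-tailed) p-values for a z-score."""
    std_normal = NormalDist()
    one_tailed = 1 - std_normal.cdf(z)             # upper tail only
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))  # both tails
    return one_tailed, two_tailed

alpha = 0.05
one_tailed, two_tailed = p_values_from_z(1.8)
print(f"one-tailed p = {one_tailed:.4f}, significant: {one_tailed < alpha}")
print(f"two-tailed p = {two_tailed:.4f}, significant: {two_tailed < alpha}")
```

For this z-score the one-tailed test is significant at *alpha* = 0.05 while the two-tailed test is not, which shows why the two-tailed test is the more conservative choice.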

2) *Consider the trade-off between α and the probability of error types*

<img src="figures/statistical power.png" align="center" width="300" />

Let sampling distribution 1 be for group A and sampling distribution 2 be hypothetical for group B. Lowering *alpha* decreases the probability of a type I error (false positive) but increases the likelihood of a type II error (false negative).

* α = Probability of a type I error (i.e. false positive) when the null hypothesis is true
* β = Probability of a type II error (i.e. false negative) when the alternate hypothesis is true
* 1 - α = Probability of a true negative when the null hypothesis is true
* 1 - β = Probability of a true positive when the alternate hypothesis is true (called statistical power)
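
The trade-off can be made concrete with a small calculation. This sketch assumes a one-tailed z-test in which the standardized test statistic is N(0, 1) under the null hypothesis and N(`effect_in_se`, 1) under the alternative; the function name and the effect of 2.5 standard errors are hypothetical.

```python
from statistics import NormalDist

def error_rates(alpha, effect_in_se):
    """Return (beta, power) for a one-tailed z-test at the given alpha."""
    # Critical value: reject the null when the z-score exceeds this
    critical_z = NormalDist().inv_cdf(1 - alpha)
    # beta: chance the statistic falls below the critical value under the alternative
    beta = NormalDist(mu=effect_in_se, sigma=1).cdf(critical_z)
    return beta, 1 - beta

for alpha in (0.10, 0.05, 0.01):
    beta, power = error_rates(alpha, effect_in_se=2.5)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}  power = {power:.3f}")
```

As *alpha* shrinks, the critical value moves further into the tail, so β rises and power falls.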


**Estimating statistical power & sample size**
* Statistical power = 1 - β
* The probability of a true positive when the alternative hypothesis is true

*Estimating power:*

1) Specify an expected/desired effect size, *alpha* significance level (usually 0.05) and sample size
2) Use prior knowledge (e.g. literature or pilot study) to create a population distribution for group A

   Note: If the test statistic is a %, a box of 0s and 1s works. If it's a mean, use a probability distribution (e.g. normal with an estimate of the mean and s.d)
3) Create a population distribution for group B by adding the effect size
4) Draw a bootstrap sample from both distributions. Calculate the test statistic for A and B. Determine the observed effect size.
5) Use a permutation test or standard reference distribution to estimate the sampling distribution were the null hypothesis true and calculate the p-value (see above)
6) Conduct a significance test using the p-value and *alpha* significance level.
7) Repeat steps 4-6 many times. The proportion of times the difference is significant is the power.
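
The procedure above can be sketched as a simulation. This is a minimal illustration under several assumptions: the test statistic is a conversion rate (a box of 0s and 1s), the significance test is a one-tailed two-proportion z-test rather than a permutation test (to keep the run time short), and both groups have the same size. The function names, rates and sample size are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def one_tailed_p_value(conv_a, conv_b, n):
    """One-tailed two-proportion z-test p-value for two groups of size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_error = sqrt(pooled * (1 - pooled) * 2 / n)
    if std_error == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b - conv_a) / n / std_error
    return 1 - NormalDist().cdf(z)

def simulate_power(rate_a, effect, n, alpha=0.05, n_sims=500, seed=None):
    """Estimate power as the share of simulated experiments that are significant."""
    rng = random.Random(seed)
    rate_b = rate_a + effect                 # population B = population A + effect
    significant = 0
    for _ in range(n_sims):
        # Drawing n times from a box of 0s and 1s with the given rate
        conv_a = sum(rng.random() < rate_a for _ in range(n))
        conv_b = sum(rng.random() < rate_b for _ in range(n))
        if one_tailed_p_value(conv_a, conv_b, n) < alpha:
            significant += 1
    return significant / n_sims

# Hypothetical inputs: 10% baseline conversion, +2 points effect, 2000 per group
print(simulate_power(rate_a=0.10, effect=0.02, n=2000, seed=0))
```

The estimate is itself noisy, so increasing `n_sims` gives a more stable power figure at the cost of run time.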

*Estimating sample size:*

<img src="figures/effect_size.png" align="center" width="300" />


1) Specify an expected/desired effect size, desired statistical power (usually 0.8) and *alpha* significance level (usually 0.05)
2) Repeat the above steps to calculate statistical power as a function of sample size
3) Select the sample size which provides the desired statistical power
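
The search over sample sizes can be sketched as below. To keep the example short, it swaps the simulation for a normal-approximation formula for the power of a one-tailed two-proportion z-test with equal group sizes; that substitution, the function names and the rates are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(rate_a, rate_b, n, alpha=0.05):
    """Approximate power of a one-tailed two-proportion z-test (n per group)."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    pooled = (rate_a + rate_b) / 2
    se_null = sqrt(2 * pooled * (1 - pooled) / n)   # spread if the null is true
    se_alt = sqrt((rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / n)
    # Probability the observed difference clears the rejection threshold
    threshold = z_crit * se_null
    return 1 - NormalDist().cdf((threshold - (rate_b - rate_a)) / se_alt)

def required_sample_size(rate_a, rate_b, target_power=0.8, alpha=0.05,
                         step=50, max_n=1_000_000):
    """Smallest per-group n (in multiples of step) reaching the target power."""
    for n in range(step, max_n + 1, step):
        if approx_power(rate_a, rate_b, n, alpha) >= target_power:
            return n
    return None  # target not reachable within max_n

# Hypothetical inputs: 10% baseline conversion, 12% expected with the treatment
n = required_sample_size(0.10, 0.12)
print(n, approx_power(0.10, 0.12, n))
```

Smaller effect sizes push the required sample size up sharply, since the standard errors shrink only with the square root of `n`.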

# Test statistics
* Z-test
* T-test
* Chi-squared statistic
* hypothesis testing for population proportions

Page 56-57 of data science interview text book

# Multiple testing
* ANOVA (one-way/two-way)
* Chi-square test
* Fisher's exact test
* Multi-arm bandit algorithm


Page 112 onwards of stats text book
Page 248 of data science interview text book

## Mathematics

* A/B testing involves statistical hypothesis testing

**1. Hypothesis testing**

1) Choose a null hypothesis, alternative hypothesis & significance level
2) Choose a test statistic & calculate a p-value
3) Compare the p-value to a significance level

**2. Test statistics**


*Example:* A company selling a service wants to test which of two web presentations does a better selling job. Due to the high value of the service being sold, sales are infrequent and the sales cycle is lengthy. So the company decides to measure the results using a proxy variable, how long people spend on the page. This is chosen based on the strength of its association with the true variable (total sales).



**3. Population proportions**


**4. P-Values & confidence intervals**

