# AB Testing Notes

A general methodology used to test out a new product or feature

- Take two sets of users
- One set is shown an existing product
- Second set is given a treated version
- How do the customers respond differently? Determine which one is better based on some metric

Can you use AB tests for everything?

"AB testing is good for optimizing an existing product but not good for developing a new product based on an existing one"

Amazon ran AB tests on personalized recommendations and found that revenue increased when users were shown the personalized recommendations.

Google tested 41 different shades of blue

LinkedIn tested a ranking process where they checked whether it's better to show news articles or an encouragement to add more contacts on a user's "stream". (Use click-through rate as metric?)

Amazon determined that every 100 ms increase in page load time decreased sales by 1%

Need a consistent response from your control and experiment groups

What can't you do with AB tests?

- **Test out new experiences**
    - Change aversion - Users refuse to participate in the test
    - Novelty effect - Too drastic of a change leads customers to "try out everything". Can't test out a specific treatment
- No baseline for comparison
    - Can't set up a control group if there is no baseline. 
- How much time do you need to have your users adapt to the new experience?
    - Need the plateaued experience to make a "robust decision". The metric being observed will be noisy in the beginning and only when it has stabilised can you check for any statistically significant changes to your metric.
- Long term effects are difficult to test
    - Difficult to measure changes in your metric over a long time period where other aspects of your product or users will change (can't attribute the change in your metric to the treatment)
- Can't test whether you're missing something in your product
    - No baseline, can't set up a control and treatment group because what do you change about the treatment group?

<br>
<br>

**Other techniques**

- Logs of what users did on your website. Analyse them retrospectively or observationally to see if a hypothesis can be developed about what caused changes in their behaviour. This can then be used to design an experiment. 
- User experience research, focus groups, surveys, human evaluation
- A/B testing gives quantitative data, other techniques give qualitative data.
- A completely new product is difficult to test

<br>
<br>

In online A/B tests, you don't know much about your users. You're using online user data and so it's difficult to distinguish whether a user is a single person, internet cafe etc.

The goal is to determine whether a new feature is desirable. To do this, you need to design an experiment that is **repeatable**.

<br>
<br>

# Online Case Study

Audacity

Creates online finance courses

**User flow/Customer funnel**

- Homepage visits
- Explore the site
- Create an account
- Completion

Listed in decreasing number of users.

**The hypothesis:**

Changing the "Start Now" button from orange to pink will **increase** how many students explore Audacity's courses

**Possible metrics**

- ~~**Total number of courses completed**~~
    - Will take too much time. Students may take months to complete a course
- ~~**How many users click on the "Start Now" button**~~
    - Assumes that users who progress through the top of the customer funnel will eventually lead to more users being passed through the rest of the customer funnel
    - In unequal control/treatment groups, the number of users in the group will affect the total number of clicks
- **CTR: $\frac{\text{Number of clicks}}{\text{Number of page views}}$**
    - Called Click-Through-Rate
    - Single users can click more than once and inflate the CTR
- **CTP: $\frac{\text{Unique visitors who click}}{\text{Unique visitors to the page}}$**
    - Called Click-Through-Probability
    - The better metric to use in this case.
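
The difference between the two metrics can be sketched on a toy event log. This is only an illustration: the `Event` record, its field names, and the sample data are hypothetical and not taken from any real logging system.

```python
from collections import namedtuple

# Hypothetical event record: kind is either "view" or "click".
Event = namedtuple("Event", ["visitor_id", "kind"])


def click_through_rate(events):
    """Total clicks divided by total page views."""
    views = sum(1 for e in events if e.kind == "view")
    clicks = sum(1 for e in events if e.kind == "click")
    return clicks / views if views else 0.0


def click_through_probability(events):
    """Unique visitors who clicked divided by unique visitors who viewed."""
    visitors = {e.visitor_id for e in events if e.kind == "view"}
    clickers = {e.visitor_id for e in events if e.kind == "click"}
    return len(clickers & visitors) / len(visitors) if visitors else 0.0


# Visitor "a" clicks three times on one page view and inflates the rate.
events = [
    Event("a", "view"), Event("a", "click"), Event("a", "click"), Event("a", "click"),
    Event("b", "view"),
    Event("c", "view"),
    Event("d", "view"), Event("d", "click"),
]

ctr = click_through_rate(events)         # 4 clicks / 4 views
ctp = click_through_probability(events)  # 2 clicking visitors / 4 visitors
print(ctr, ctp)
```

Here the rate comes out at 1.0 even though only half of the visitors clicked, which is the inflation the CTR bullet warns about; the probability stays at 0.5.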
    
**Updated metric**

Changing the "Start Now" button from orange to pink will **increase** the Click-Through-Probability of the button

<br>
<br>

**When do you use CTR vs. CTP?**

Generally:

- Use a rate when you want to measure **usability**
    - Users have a number of different places they can press, you use a rate to measure how often the users clicked a specific button
    - Will have to change the website to log every page view of the website and every click of a button
- Use a probability when you want to measure **total impact**
    - You don't want to count when users double clicked, reloaded etc when measuring a total effect (e.g. getting to the second level of a page)
    - Will have to change the website to match each page view with all of their "child clicks" to count at most one click per page view

<br>
<br>

# The statistics

**Which distribution?**

When producing the CTP, the sample proportion $p_0$ was computed to be 0.1 $(\frac{100}{1000})$. When using a different sample to compute the CTP, the sample proportion was instead computed to be 0.15.

Is 0.15 or 15% considered to be surprising? How do you know how variable your estimate is likely to be?

We compare the computed sample proportion to the **binomial** distribution, where we model each unique visitor as a Bernoulli trial: the visitor either clicks the button (success) or doesn't (failure).
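
As a short derivation under this model, the variability of a sample proportion follows from the binomial variance. The symbols $X$ (number of visitors who click), $n$ (number of unique visitors) and $p$ (true click probability) are introduced here for illustration, and the derivation assumes visitors click independently with the same probability:

$$
X \sim \text{Binomial}(n, p), \qquad \hat{p} = \frac{X}{n}, \qquad \text{Var}(\hat{p}) = \frac{\text{Var}(X)}{n^2} = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}
$$

Taking the square root and substituting the estimate $\hat{p}$ for the unknown $p$ gives the standard error of the sample proportion.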

**Variance**

We can use the standard error formula for a sample proportion, $SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, to estimate how much the sample proportion $p_0$ should vary as a result of sampling variability. Then we compare the second computed sample proportion (0.15) to the first, relative to this standard error, to see if it is a surprising value or not.

Either compute a CI or perform a hypothesis test
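
Both options can be sketched with the numbers from the example (100 clicks from 1000 visitors, then a second estimate of 0.15). This is a simplified illustration: it uses the normal approximation to the binomial, assumes a 95% confidence level, and treats 0.10 as a fixed reference value, ignoring the sampling variability of the second estimate.

```python
from math import sqrt
from statistics import NormalDist

n = 1000            # unique visitors in the first sample
p_hat = 100 / n     # first sample proportion, 0.10
p_new = 0.15        # proportion observed in the second sample

# Standard error of the sample proportion.
se = sqrt(p_hat * (1 - p_hat) / n)

# Option 1: 95% confidence interval around the first estimate.
z_crit = NormalDist().inv_cdf(0.975)
ci = (p_hat - z_crit * se, p_hat + z_crit * se)

# Option 2: z statistic and two-sided p-value for the new proportion.
z = (p_new - p_hat) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"SE = {se:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"z = {z:.2f}, p-value = {p_value:.2g}")
```

With a standard error of roughly 0.0095, a value of 0.15 is more than five standard errors away from 0.10, so under these assumptions it would be a surprising result.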

**Practical significance**

We have to decide how big a change must be to count as practically significant (aka substantive) and warrant changing the existing system. Any difference in proportion, however small, can be made statistically significant with a big enough sample size, but a small difference may not be practically significant.

- Each change may require an investment in resources and so a small change may not warrant the investment
- Online A/B tests have a smaller margin for practical significance
- We need to make sure for online A/B tests that the change is **repeatable**.
    - We want a sample size big enough that the statistical significance bar is lower than the practical significance bar, to ensure repeatability

We will decide that a 2% change in CTP is practically significant. This threshold is the $d_{\text{min}}$ used below.
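
To illustrate why statistical significance alone is not enough, the sketch below holds a tiny difference fixed (0.100 vs. 0.101, far below the 2% threshold) and only increases the sample size. The function name, the equal group sizes, and the assumption that the observed proportions land exactly on these values are all illustrative choices; the test is a pooled two-proportion z-test under the normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(p_ctrl, p_trt, n_per_group):
    """Pooled two-proportion z-test p-value, assuming equal group sizes."""
    p_pool = (p_ctrl + p_trt) / 2
    se = sqrt(2 * p_pool * (1 - p_pool) / n_per_group)
    z = (p_trt - p_ctrl) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same tiny difference, increasingly large samples.
for n in (1_000, 100_000, 10_000_000):
    print(f"n per group = {n:>10,}: p-value = {two_sided_p_value(0.100, 0.101, n):.3g}")
```

The p-value shrinks as the sample grows even though the difference itself never changes, so the practical significance threshold has to be chosen separately from the test.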

**Size vs. Power tradeoff**

The power of a hypothesis test is the probability that the test rejects the null hypothesis $H_0$ when a specific alternative hypothesis $H_1$ is true. The idea is that, given a practically significant effect size and a significance level, we want the hypothesis test to be able to detect the effect (by rejecting the null hypothesis) with a high enough probability, which can be controlled by increasing the sample size.

$\beta$ is the probability of making a type 2 error (failing to reject the null hypothesis when it is false). Statistical power is equal to $1-\beta$
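
The link between power and sample size can be sketched with a common normal-approximation formula for comparing two proportions. This formula is not from the notes, and the inputs are assumptions made for illustration: a baseline CTP of 0.10, a minimum detectable change of 0.02 read as an absolute difference, $\alpha = 0.05$ (two-sided), and equal group sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_base, d_min, alpha=0.05, power=0.80):
    """Approximate visitors needed per group to detect a change of d_min."""
    p_alt = p_base + d_min
    p_bar = (p_base + p_alt) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_base * (1 - p_base) + p_alt * (1 - p_alt))
    ) ** 2
    return ceil(numerator / d_min ** 2)


n_80 = sample_size_per_group(0.10, 0.02, power=0.80)
n_90 = sample_size_per_group(0.10, 0.02, power=0.90)
print(n_80, n_90)
```

Asking for higher power (a smaller $\beta$) at the same significance level requires more visitors per group, which is the tradeoff this section describes.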

When the CI captures the null hypothesis $H_0$, the result is not statistically significant (recall that you create a CI around the point estimate $\hat{p}$ or $\hat{d} = \hat{p}_1 - \hat{p}_2$). When the CI excludes $H_0$ **and** lies entirely beyond the practical significance level $d_{\text{min}}$, then we can agree to launch the change. For cases in between, where the CI is too wide or does not capture $H_0$ but does capture $d_{\text{min}}$, we have to use our best judgement.
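
These cases can be sketched as a small decision helper. The function names, the returned labels, and the example counts are hypothetical. The sketch assumes the difference is computed as treatment minus control, that an increase is the desired direction, and that "too wide" means a CI containing both 0 and $d_{\text{min}}$; the branch for a significantly negative change is an added assumption.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(x_ctrl, n_ctrl, x_trt, n_trt, conf=0.95):
    """CI for the difference in proportions (treatment minus control), unpooled SE."""
    p_c, p_t = x_ctrl / n_ctrl, x_trt / n_trt
    se = sqrt(p_c * (1 - p_c) / n_ctrl + p_t * (1 - p_t) / n_trt)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    d_hat = p_t - p_c
    return d_hat - z * se, d_hat + z * se


def decide(low, high, d_min):
    """Map a CI for the difference onto the cases described above."""
    if low <= 0.0 <= high:
        # CI captures H0; if it also reaches d_min it is too wide to conclude.
        return "use judgement" if high >= d_min else "not statistically significant"
    if low > d_min:
        return "launch"          # beyond both H0 and d_min
    if high < 0.0:
        return "do not launch"   # assumption: significantly negative change
    return "use judgement"       # significant, but the CI still captures d_min


D_MIN = 0.02
for counts in [(1000, 10000, 1400, 10000),
               (1000, 10000, 1250, 10000),
               (10000, 100000, 10050, 100000)]:
    low, high = diff_ci(*counts)
    print(f"CI = ({low:.4f}, {high:.4f}) -> {decide(low, high, D_MIN)}")
```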