# Worksheet 2: A/B Testing and principled peeking

## Review from worksheet 1

Some basic concepts to recall:

### Sampling Distribution

- The *sampling distribution* is the distribution of a statistic (e.g., sample mean, sample proportion, t-statistic, z-score).
    - The sampling distribution is *different from* the sample distribution
    - The sampling distribution is *different from* the population distribution
        
- We need a sampling distribution to make probabilistic statements about our statistic.
    - For example: if the population mean is actually 0 (in practice we do not know this; it is what we want to test), what is the probability that the sample mean would be greater than 1?
    
- The problem is that the sampling distribution is usually unknown, mainly because the population distribution is unknown.
    
- You may be able to derive the sampling distribution mathematically if you know the population distribution (rarely the case in practice).
    - For example, if your sample comes from a Normal distribution, then the sample mean is Normal as well

- In certain cases, you can use results of the CLT if your sample size is large and additional assumptions are met.
    - For example, for a sample of independent and identically distributed random variables, if the sample size is large, the sampling distribution of the mean is approximately Normal
    
- You can use bootstrapping (although conditions exist as well) to approximate the sampling distribution.
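
The three routes above can be compared in a small simulation. The sketch below is only an illustration under assumed settings (an Exponential population with rate 1 and samples of size 100, neither of which comes from the worksheet): it approximates the standard error of the sample mean by repeated sampling from the known population, by the CLT formula, and by bootstrapping a single sample.

```r
set.seed(1)
n <- 100 # assumed sample size, for illustration only

# 1. Repeated sampling from the (here known) population: only possible in a simulation.
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

# 2. CLT approximation: the population SD of Exp(1) is 1, so SE = 1 / sqrt(n).
clt_se <- 1 / sqrt(n)

# 3. Bootstrap: resample with replacement from a single observed sample.
one_sample <- rexp(n, rate = 1)
boot_means <- replicate(5000, mean(sample(one_sample, size = n, replace = TRUE)))

c(simulated_se = sd(sample_means), clt_se = clt_se, bootstrap_se = sd(boot_means))
```

The bootstrap estimate depends on the one sample that happened to be drawn, so it will typically be close to, but not the same as, the other two values.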

### Errors in Hypothesis Tests

There are 2 types of errors in a hypothesis testing problem: 

- **Type I error**: rejecting $H_0$ when $H_0$ is true

- **Type II error**: failing to reject $H_0$ when $H_0$ is false

The probability of the type I error is usually called **significance level** (aka $\alpha$) and it is set by the analyst when designing a test.

Another important measure used to design a test is the **power**:

- **Power**: the probability of rejecting $H_0$ when $H_0$ is false (i.e., power = $1 - P(\text{type II error})$)
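
Both quantities can be estimated by simulation. The following is a minimal sketch, assuming Normal data with SD 1, 50 observations per group, and a true difference of 0.5 for the power calculation (all hypothetical values chosen for illustration); `reject_h0` is a helper defined here, not part of the worksheet.

```r
set.seed(42)
alpha <- 0.05

# Simulate one fixed-sample experiment and report whether H0 is rejected.
reject_h0 <- function(true_diff, n = 50) {
  control <- rnorm(n, mean = 0, sd = 1)
  treatment <- rnorm(n, mean = true_diff, sd = 1)
  t.test(treatment, control, var.equal = TRUE)$p.value < alpha
}

# H0 is true: the proportion of rejections estimates the type I error rate.
type_I_error_rate <- mean(replicate(2000, reject_h0(true_diff = 0)))

# H0 is false: the proportion of rejections estimates the power.
power <- mean(replicate(2000, reject_h0(true_diff = 0.5)))

c(type_I_error_rate = type_I_error_rate, power = power)
```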

### $p$-value

The $p$-value can be used to assess the significance of the observed results by comparing its value to the specified significance level:
   - Is $p < \alpha$?? 

But what is a $p$-value?? It's been greatly misused for sure!!

- **$p$-value**: the probability, under the model specified in $H_0$, that a statistic would be at least as extreme as its observed value 

Note that the $p$-value is **NOT**:

- the probability that $H_0$ is true 
- the probability that $H_0$ is false
- the probability that the statistic observed was produced by random chance alone
- a measure of the importance of the observed effect
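
To make the definition concrete, the sketch below computes a one-sided $p$-value by hand for a one-sample $t$-test and checks it against `t.test()`. The data are simulated with arbitrary, assumed parameters.

```r
set.seed(7)
x <- rnorm(30, mean = 0.3, sd = 1) # hypothetical sample

# Observed t-statistic for H0: mu = 0.
t_obs <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))

# Probability, under H0, of a statistic at least as large as the observed one.
p_by_hand <- pt(t_obs, df = length(x) - 1, lower.tail = FALSE)

p_from_t_test <- t.test(x, mu = 0, alternative = "greater")$p.value

all.equal(p_by_hand, p_from_t_test)
```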

## Learning Objectives 

After completing this week's worksheet and tutorial work, you will be able to:

1. Discuss why the methods learned in past courses are not sufficient to answer the more complex research problems being posed in this course (in particular stopping an A/B test early).
2. Explain sequential testing and principled peeking and how they can be used for early stopping of an experiment (e.g., A/B testing).
3. Write a computer script to perform A/B testing optimization with and without using principled peeking.
4. Discuss the tradeoff between stopping earlier and certainty of significance, and the real-world implications (e.g., what does the FDA require for early stopping of clinical trials versus Facebook ads optimization?).
5. List other questions related to A/B testing optimization that may be relevant in a real data application (e.g., what features cause a Facebook ad to perform best?)

## Loading packages

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(infer)
library(broom)
if (!requireNamespace("gsDesign", quietly = TRUE)) install.packages("gsDesign") # install only if missing
library(gsDesign)

source("tests_worksheet_02.R")

# Part I (Tuesday)

## 1. A/B Testing Optimization

### 1.1 A/B testing

**A/B testing** refers to an experiment, in which users are randomly assigned to one of two variations of a product or service: control (A) and variation (B) to see if variation B should be used for improvement.

- A/B testing became very popular in the context of updating and improving websites. However, it can be used in many other contexts to monitor and update products and/or services.

### Case study: Obama's 60 million dollar experiment

Obama's 2008 campaign was looking to increase the total amount of donations. In December 2007, they ran an experiment to compare how different versions of the website yielded different responses from visitors.

In the original experiment, they compared 24 different combinations of buttons and media. Visitors were randomly assigned to one of these variations, and the team tracked different response variables (e.g., conversion rate, amount donated).

**The original website**

![img](https://www.optimizely.com/contentassets/30f67b662917481e867d7c6b602d05d6/obama_homepage_original-445x313.png?width=785&mode=crop)

**The winner website**

![img](https://www.optimizely.com/contentassets/30f67b662917481e867d7c6b602d05d6/obama_winner-373x335.png?width=785&mode=crop)

> "This experiment taught us that every visitor to our website was an opportunity and that taking advantage of that opportunity through **website optimization and A/B testing** could help us raise tens of millions of dollars." [Dan Siroker](https://www.optimizely.com/insights/blog/how-obama-raised-60-million-by-running-a-simple-experiment/)



### 1.2 Experimental Design

In any statistical analysis there are some important steps to define:

- pose the *question(s)* you want to answer using data

- *design* the experiment to address your question(s)

    - an **important quantity** defined in the design of the experiment is the **sample size**, which ideally is calculated based on a power analysis

    - identify appropriate methodologies to analyze the data. For example, you may prefer a test statistic that can control Type I error with high power (many of the statistics you've learned have this property!)

- run the experiment to collect data 

- analyze the data according to the experimental design and make decisions

    - for example, if the $p$-value of the test is smaller than the specified significance level, reject the null hypothesis
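
As an illustration of the sample size step, base R's `power.t.test()` solves for the sample size per group once the other design quantities are fixed. The planning values below are hypothetical and chosen only for this sketch; they are not the values used in the campaign's experiment.

```r
# Hypothetical planning values, chosen only for illustration.
design <- power.t.test(
  delta = 10,        # smallest difference in mean donation worth detecting
  sd = 50,           # assumed standard deviation of donation sizes
  sig.level = 0.05,  # significance level
  power = 0.8,       # desired power
  type = "two.sample",
  alternative = "one.sided"
)

# Round up, since a fraction of a visitor cannot be sampled.
n_per_group <- ceiling(design$n)
n_per_group
```

Changing any one of `delta`, `sd`, `sig.level`, or `power` changes the required sample size, which is why these quantities need to be agreed on before data collection starts.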

**Note**: This process is usually non-linear and may take multiple iterations among some of these steps, nicely described as "epicycles" by R. Peng and E. Matsui in ["The Art of Data Science"](https://bookdown.org/rdpeng/artofdatascience/images/epicycle.png).

**Question 1.2.0**
<br>{points: 1}

An important question that motivated Obama's website experiment was if a variation of the current website could yield higher donations from its visitors. 

A key factor of the experimental design is deciding which statistical tool will be used to analyze the data collected. Given that they do not have previous information about the population distribution, they decided to use a classical $t$-statistic to compare differences of means.

Another important planned quantity is the significance level of the test, a.k.a **Type I error rate**, which is:


**A.** the probability of finding a significant difference in donation sizes between the two variations, if the new website indeed attracts, on average, larger donations.

**B.** the probability of *not* finding a significant difference in donation sizes between the two variations, if the new website indeed attracts, on average, larger donations.

**C.** the probability of finding a significant difference in donation sizes between the two variations, if the mean donation sizes of both websites are equal.

**D.** the probability that the new website indeed attracts larger donations.

*Assign your answer to an object called `answer1.2.0`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer1.2.0 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2.0()

**Question 1.2.1**
<br>{points: 1}

The designers of the experiment also need to decide how large the experiment will be since there are large costs (including opportunity costs) related to the experiment. 

Thus, for the statistical test planned, they decide to conduct a **power analysis** to:

**A.** estimate the minimum sample size required, given a desired significance level, expected difference in mean donations, and statistical power.

**B.** maximize the probability of finding a significant difference in donation sizes between the two variations, if the new website indeed attracts larger donations.

**C.** minimize the probability of not finding a significant difference in donation sizes between the two variations, if the new website indeed attracts larger donations.


*Assign your answer to an object called `answer1.2.1`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer1.2.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2.1()

**Question 1.2.2**
<br>{points: 1}

After deciding on the sample size required and *randomly* assigning visitors to each variation of the website, the company will start analyzing the size of the donations made by visitors. 

Since the sample size planned is large enough, the analysts will conduct a classical hypothesis test and compute $p$-values and confidence intervals based on results from the CLT.

Considering the opportunity costs involved in this experiment, the analysts are going to monitor the size of the donations closely and stop the experiment earlier if they find (using a standard two-sample $t$-test) that the new website attracts higher donations. 

**However, computing (raw) $p$-values before collecting *all* the data can seriously bias the results of the experiment. True or False??**

**Note**: "raw" here means the $p$-values calculated with a $t$ sampling distribution based on results derived from the CLT. 

*Assign your answer to an object called answer1.2.2. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
# answer1.2.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.2.2()

## 2. Early Stopping in A/B Testing

> ***In classical hypothesis testing theory, the sample size must be fixed in advance when the experiment is designed!!***

### 2.1 Dashboards to monitor A/B testing

In the last decade, many A/B testing platforms have been developed to assist companies to analyze, report and visualize the results of their experiments. Some notable examples include:

   - Optimizely
   
   - Google Analytics
   
   - Crazy Egg
   
These platforms allow the users to *continuously monitor* the $p$-values and confidence intervals in order to re-adjust their experiment dynamically. But,

> ***Is it ok to peek at results before all the data are collected??***

![img](https://miro.medium.com/max/1400/1*W8mUB5A96ufsbMWLqScVOA.png)
<font color=grey>Figure by D. Meisner in Towards Data Science </font>


### 2.2 Early stopping

**Early stopping** refers to ending the experiment earlier than originally designed.

> ***Can we stop or re-design the experiment earlier if we have supporting evidence to do so??***

- a company would like to stop an experiment that results in losses in revenue

- a clinical trial may want to switch patients as early as possible to a more effective drug


#### Let's use data to answer this question!

### 2.3 A/A testing

To examine the problem of early stopping, let's simulate data for which $H_0$ is true (i.e., there is no effect)
  - we can think of a scenario where both groups are equal (aka A/A testing)
  
  - in this scenario, we know that claiming a significant result is a false discovery
  
**Note**: although this seems artificial, it is a widely used technique to test experiments and platforms

To compute error rates, we will generate 100 such experiments:

![img](img/aa-Obama.png)
<font color=grey>Figure by [R. Lourenzutti](https://lourenzutti.github.io/tutorials/ab-testing/ab-test.html) </font>



#### Experimental design

The campaign organizers have decided to:

- run a balanced experiment with a *pre-set* sample size of 1000 visitors per variation (total sample size of 2000) 

- **sequentially collect** the data in batches of 50 visitors per group

- **sequentially analyze** the data using two-sample $t$-tests

- **sequentially compute and monitor** (raw) $p$-values 

- **stop** the experiment **once a significant result** is found 

Run this experiment 100 times

#### Simulation function

We have prepared a function for you that: 

- generates two samples, each of size `n`, from two (known) Normal distributions, the control and the variation
  - note that this can only be done in a simulation study. In a real data analysis we collect data from an *unknown* distribution

- analyzes the data in an incremental way by `sample_increase_step` until all `n` samples in each treatment group are analyzed 
  - for example: compares donations by batches of visitors of each website variation until data from all planned visitors are collected 

- returns the $t$-statistic and $p$-value (computed by a two-sample $t$-test) for every set of collected data
  - for example: $p$-values to assess the difference in means of the size of donations made by visitors of each website as batches of data are collected

For example, if `sample_increase_step` is 20, and `n=500`, the function will:
1. draw samples of 500 experimental units from the control distribution and 500 from the variation distribution;
2. subset the first 20 experimental units from each sample;
3. perform the two-sample $t$-test and return the associated $t$-statistic and $p$-value;
4. add 20 more experimental units to each group;
5. perform the two-sample $t$-test (now based on 40 experimental units per group) and return the associated $t$-statistic and $p$-value;
6. add another 20 experimental units to each group;
7. perform the two-sample $t$-test (now based on 60 experimental units per group) and return the associated $t$-statistic and $p$-value;

$$
\vdots
$$

and so on, until the total sample size in each group is 500 (as originally planned).

The function returns a tibble that has three columns:

- `inc_sample_size`: the sample size of the set of data analyzed 
- `statistic`: $t$-statistic calculated by the `t.test()` function
- `p_value`: $p$-value calculated by the `t.test()` function

In [None]:
# Two-sample t-test with tracking sequential statistic and p-values by incremental sample sizes until getting to n.

# @param n (numeric): Initially planned sample size for each group (for simplicity, n needs to be a multiple of sample_increase_step).
# @param d_0 (numeric): effect size.
# @param mean_current (numeric): Population mean for control variation.
# @param sd_current (numeric): Population standard deviation for current variation.
# @param sd_new (numeric): Population standard deviation for new variation.
# @param sample_increase_step (numeric): Sample size increment.

# @return p.value.df: A tibble that has 3 columns:
# inc_sample_size, statistic, and p_value 

incremental_t_test <- function(n, d_0, mean_current, sd_current, sd_new, sample_increase_step) {
  sample_current <- rnorm(n, mean = mean_current, sd = sd_current)
  sample_new <- rnorm(n, mean = mean_current + d_0, sd = sd_new)

  p.value.df <- tibble(
    inc_sample_size = rep(0, n / sample_increase_step),
    statistic = rep(0, n / sample_increase_step),
    p_value = rep(0, n / sample_increase_step)
  )

  current_sample_size <- sample_increase_step
  
  for (i in seq_len(nrow(p.value.df))) {
    # Test only the first `current_sample_size` observations of each group.
    t_test_results <- t.test(sample_new[1:current_sample_size], sample_current[1:current_sample_size],
      var.equal = TRUE,
      alternative = "greater"
    )
    p.value.df$statistic[i] <- unname(t_test_results$statistic)
    p.value.df$p_value[i] <- t_test_results$p.value
    p.value.df$inc_sample_size[i] <- current_sample_size
    current_sample_size <- current_sample_size + sample_increase_step
  }

  return(p.value.df)
}

**Question 2.3.0**
<br>{points: 1}

In a simulation study, we know the true population distributions! Furthermore, in A/A testing, we know that there is no true difference between the population means.

The function used to simulate the data assumes:

**A.** the sample distributions are $\mathcal{N}(0,1)$

**B.** the population distribution of the donation sizes of visitors of the current website is $\mathcal{N}(\mu_0,\sigma_0^2)$, where $\mu_0$ = mean_current and $\sigma_0$ = sd_current

**C.** the sample distribution of the donation sizes of visitors of the current website is $\mathcal{N}(\mu_0,\sigma_0^2)$, where $\mu_0$ = mean_current and $\sigma_0$ = sd_current

*Assign your answer to an object called `answer2.3.0`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer2.3.0 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.3.0()

**Question 2.3.1**
<br>{points: 1}

In Obama's campaign example, we simulate data that reflects no difference in the expected size of the donations (since all visitors are exposed to the same website).

Then, suppose that the campaign organizers want to analyze the data in batches of 50 visitors per group until a total of $n = 1000$ visitors have visited each website.

Use the `incremental_t_test` function to conduct the company's experiment. 

*Save the result in an object called `answer2.3.1`. Your answer should be a tibble with three columns: `inc_sample_size`, `statistic`, and `p_value`.*

In [None]:
set.seed(301) # do not change this.

#answer2.3.1 <- 
#    incremental_t_test(n = ..., d_0 = ..., sample_increase_step = ..., mean_current = 200, sd_current = 50, sd_new = 50)

# your code here
fail() # No Answer - remove if you provide an answer

answer2.3.1

In [None]:
test_2.3.1()

**Question 2.3.2**
<br>{points: 1}

Using the data stored in `answer2.3.1`, plot the $p$-value sequence as a **line** with the incremental sample size on the $x$-axis and $p$-value on the $y$-axis. Add a dashed horizontal red line that indicates a threshold of the significance level $\alpha = 0.05$. The `ggplot()` object's name will be `sequential_pvalue`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
options(repr.plot.width = 15, repr.plot.height = 9) # Adjust these numbers so the plot looks good on your screen.

# sequential_pvalue <- 
#   answer2.3.1 %>%
#   ggplot() +
#   geom_line(aes(x = ..., y = ...)) +
#   theme(
#     text = element_text(size = 18),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold")
#   ) +
#   geom_point(aes(x = ..., y = ...)) +
#   ggtitle("Evolution of p-values in Experiment 1") +
#   ylab("p-value") +
#   xlab("Sample Size") +
#   geom_hline(
#     yintercept = ...,
#     colour = "red",
#     linetype = "twodash"
#   ) +
#   coord_cartesian(ylim = c(0, 1)) +
#   scale_y_continuous(breaks = seq(0, 1, by = 0.05))


# your code here
fail() # No Answer - remove if you provide an answer

sequential_pvalue

In [None]:
test_2.3.2()

**Question 2.3.3**
<br>{points: 1}

As mentioned in the **Experimental Design** section, the campaign organizers want to implement early stopping (before reaching the maximum sample size of `n = 1000` visitors per website) to save time and resources allocated for the experiment.

> Using a significance level $\alpha = 0.05$, they would stop the experiment as soon as they find a significant result. 

Given the results in **Question 2.3.2**, the campaign organizers would stop the experiment 

**A.** once they finish collecting and analyzing all the data

**B.** after 100 visitors have visited each website since the $p$-value is below the specified significance level

**C.** after 150 visitors have visited each website since results are getting worse after that point

*Assign your answer to an object called `answer2.3.3`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer2.3.3 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.3.3()

**Question 2.3.4**
<br>{points: 1}

Since the simulated data correspond to an **A/A testing** design, what error, if any, are the campaign organizers making by stopping the experiment as noted in **Question 2.3.3**?

**A.** No error.

**B.** Type I Error.

**C.** Type II Error.

*Assign your answer to an object called `answer2.3.4`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer2.3.4 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.3.4()

**Question 2.3.5**
<br>{points: 1}

The hypothesis test was designed with a $5\%$ probability of falsely rejecting $H_0$ 
  - in Obama's experiment, a false rejection means concluding that the variation website attracts larger donations when there is no real difference between the two population means. 
  
So the possibility of making a mistake was always part of the original plan. However, the strategy of stopping earlier may increase the (overall, family-wise) probability of wrongly rejecting $H_0$. 

To examine this potential problem, the campaign organizers decided to: 

- perform the **A/A testing** experiment 100 times 

- count how many times they would wrongly reject $H_0$ with their strategy, and

- compare it with the expected number of rejections given the significance level $\alpha = 0.05$

We wrote the code to perform the first step. Read it and learn from it!! Then, you need to work on the rest!

Your answer will be a tibble with two columns: `n_rejections` and `expected_n_rejections`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
set.seed(120)

### Run this before continuing
multiple_times_sequential_tests <- 
    tibble(experiment = 1:100) %>% 
    mutate(seq_test = map(.x = experiment, 
                          .f = function(x) incremental_t_test(n = 1000, d_0 = 0, sample_increase_step = 50, 
                              mean_current = 200, sd_current = 50, sd_new = 50)))


In [None]:
#answer2.3.5 <- multiple_times_sequential_tests %>% 
#    mutate(reject = map_dbl(.x = seq_test, .f = function(x) sum(x$p_value< ...) > 0)) %>% 
#    summarise(n_rejections = ...(reject),
#              expected_n_rejections = ...)

# your code here
fail() # No Answer - remove if you provide an answer
                            
answer2.3.5

In [None]:
test_2.3.5()

**Question 2.3.6**
<br>{points: 1}

Select the right option to complete the sentence below:

> *With the strategy used by the company, the probability of Type I error is approximately ... the specified one.* 

**A.** equal to

**B.** 3 times lower than

**C.** 5 times lower than

**D.** 3 times higher than

**E.** 5 times higher than

*Assign your answer to an object called `answer2.3.6`. Your answer should be one of `"A"`, `"B"`, `"C"`, `"D"`,  or `"E"` surrounded by quotes.*

In [None]:
#answer2.3.6 <- ""

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.3.6()

**Note**: not only is the type I error rate affected by this problem, but also the estimates themselves! If we analyze samples until the means of both groups are significantly far apart, we would also overestimate the effect size (difference between means).
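
The overestimation can be seen in a small simulation. The sketch below is an illustration only: it assumes a true effect of 5 (a hypothetical value), reuses the population mean of 200, the SD of 50, and the batches of 50 up to 1000 visitors per group from the experimental design, and records the estimated difference in means at the moment the experiment would be stopped. The helper `estimate_at_stop` is defined here and is not part of the worksheet.

```r
set.seed(99)
true_effect <- 5 # hypothetical true difference in mean donation
look_sizes <- seq(50, 1000, by = 50)

# Run one experiment with peeking; return the estimated effect at the first
# significant look, or NA if no look is significant.
estimate_at_stop <- function() {
  control <- rnorm(1000, mean = 200, sd = 50)
  variation <- rnorm(1000, mean = 200 + true_effect, sd = 50)
  for (m in look_sizes) {
    test <- t.test(variation[1:m], control[1:m],
      var.equal = TRUE, alternative = "greater"
    )
    if (test$p.value < 0.05) {
      return(mean(variation[1:m]) - mean(control[1:m]))
    }
  }
  NA_real_
}

estimates <- replicate(300, estimate_at_stop())

# Average estimated effect among the experiments that were stopped.
c(true_effect = true_effect, mean_estimate_at_stop = mean(estimates, na.rm = TRUE))
```

Experiments that stop at an early look can only do so when the observed difference is large relative to its (still large) standard error, which pushes the average estimate above the true effect.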

#### <font color=blue> In *classical hypothesis testing*, monitoring results in a dashboard and *stopping experiments earlier* than planned will increase the probability of *incorrectly* rejecting the null hypothesis (i.e., when there is no real effect). </font>

   - the test was planned and designed for a *specific sample size*. If you conduct the analysis once *all* the data is collected, then the probability of falsely rejecting $H_0$ is the selected level of significance. 
    
   - *but*, if instead you check the results $k$ times as you collect data, you have $k$ opportunities to *falsely reject* $H_0$. The probability of a type I error is larger than the set value specified when the experiment was designed!!
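
The effect of the number of looks $k$ can be explored with a short simulation. This is a sketch under the A/A setting used in this worksheet (both groups drawn from a Normal distribution with mean 200 and SD 50, at most 1000 visitors per group); the choices of $k$ and the 500 repetitions are arbitrary, and `any_rejection` is a helper defined here for illustration.

```r
set.seed(2024)
alpha <- 0.05
n_max <- 1000

# Simulate one A/A experiment with k equally spaced looks and report whether
# any raw p-value falls below alpha.
any_rejection <- function(k) {
  control <- rnorm(n_max, mean = 200, sd = 50)
  variation <- rnorm(n_max, mean = 200, sd = 50)
  look_sizes <- seq(n_max / k, n_max, by = n_max / k)
  p_values <- sapply(look_sizes, function(m) {
    t.test(variation[1:m], control[1:m],
      var.equal = TRUE, alternative = "greater"
    )$p.value
  })
  any(p_values < alpha)
}

looks <- c(1, 5, 20)
false_rejection_rate <- sapply(looks, function(k) mean(replicate(500, any_rejection(k))))
data.frame(looks, false_rejection_rate)
```

With a single look at the full sample, the estimated rate should stay close to $\alpha$; adding looks can only add opportunities to reject.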

![img](img/aa-Obama-pval100.png)
<font color=grey>Figure by [R. Lourenzutti](https://lourenzutti.github.io/tutorials/ab-testing/ab-test.html) </font>

**Interesting note**:
> It can be proved, mathematically, that under the null hypothesis, the classical $p$-value will *always* cross $\alpha$ if the experimenter waits long enough$^{*}$. This means that with increasing data, the probability of falsely rejecting a true $H_0$ approaches 1!

[*] David Siegmund. 1985. Sequential analysis: tests and confidence intervals. Springer.

# Part II (Thursday)

## Review from last class

Last class we learned that:

- One may be tempted to peek at results of A/B tests as data are being collected

- Stopping an experiment and rejecting $H_0$ as soon as the $p$-value is below the specified significance level can drastically inflate the type I error

- Controlling the risk of wrongly rejecting the null hypothesis is not an easy task in A/B testing if peeking and early stops are allowed

## Today:

> How do experimentation platforms, like Optimizely, address this challenge?? 

- Users need to adaptively determine the sample size of the experiments since there are large opportunity costs associated with longer experiments. 

- When done correctly, stopping an experiment earlier (or re-designing it) can be beneficial in many contexts. 

- Users *are* monitoring results as they collect and analyze data and are making decisions accordingly.

> How can we allow a user to stop when they wish, while still controlling the probability of falsely rejecting $H_0$ at the pre-specified level $\alpha$?

*Hint*: an **appropriate measure of "enough evidence"** is required

## 3. Sequential testing 

**Sequential tests** are decision rules that allow users to test data sequentially as data come in. The experiment may be stopped earlier, meaning the sample size is dynamic, rather than fixed. 

> Sequential testing is just another flavour of a multiple comparison problem. If you make lots of comparisons, but don’t correct for it, your error rates are inflated!!


Many methods have been proposed to address the characteristics of the **A/B testing experimental designs**:  

- For example, one way to control the type I error rate inflation in multiple testing problems is to adjust the $p$-values (e.g., Bonferroni, BH). 

- Some new methods propose using different test statistics and computing $p$-values differently
    - In the *Optimizely platform*, mixture sequential probability ratio tests (mSPRT) are performed and *always valid* $p$-values are constructed to allow users to trade off between power and sample size dynamically while controlling the false rejection probability at the level $\alpha$. 


There are different classes of sequential approaches: 

- **Group sequential designs**: the analyst pre-specifies when to inspect the data (interim analysis) and performs each analysis as a fixed sample one. The significance level of each interim analysis is set at some level that controls the Type I error, no matter when the user chooses to stop the test.

- **Full sequential designs**: the analyst performs an analysis after every new observation, sequentially, in a principled way.

### 3.1 Bonferroni

In the next few exercises, you are going to investigate if a Bonferroni correction controls the type I error rate in A/B testing.

We will again use the **A/A experimental design** of Obama's campaign since in that scenario we know that the (true) expected difference in donation sizes is zero (i.e., effect size = 0).

**Recall**:
Bonferroni's method can be thought of as: 

- an adjustment of the $p$-values by multiplying them by the number of comparisons and keeping the significance level at a desired threshold, or 

- an adjustment of the significance threshold $\alpha$ by dividing it by the number of comparisons, or

- an adjustment of the critical value, computed with a sampling distribution, corresponding to the adjusted significance threshold
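
The three views lead to the same decisions, as the sketch below illustrates. The number of comparisons (`m = 10`), the degrees of freedom, and the observed $t$-statistics are hypothetical values chosen only for this example.

```r
alpha <- 0.05
m <- 10                    # hypothetical number of comparisons
df <- 100                  # hypothetical degrees of freedom
t_obs <- c(1.2, 2.1, 2.9)  # hypothetical observed t-statistics (one-sided tests)

p_raw <- pt(t_obs, df = df, lower.tail = FALSE)

# 1. Adjust the p-values (multiply by m, capped at 1) and keep alpha.
reject_adjusted_p <- p.adjust(p_raw, method = "bonferroni", n = m) < alpha

# 2. Keep the raw p-values and adjust the significance threshold.
reject_adjusted_alpha <- p_raw < alpha / m

# 3. Compare the statistic to the critical value of the adjusted threshold.
reject_critical_value <- t_obs > qt(1 - alpha / m, df = df)

data.frame(t_obs, reject_adjusted_p, reject_adjusted_alpha, reject_critical_value)
```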

**Question 3.1.0**
<br>{points: 1}

Since Obama's campaign organizers have decided to monitor the data every 50 visitors per website, they will perform 20 sequential tests. 

Suppose that after each interim analysis, they will use a Bonferroni correction to control the type I error rate at $5\%$. Thus, using a classical two-sample $t$-test, they will **reject $H_0$** if the raw $p$-value is: 

**A.** smaller than 0.05

**B.** smaller than 0.0025 (adjusted threshold)

**C.** greater than 0.0025 (adjusted threshold)

**D.** greater than 0.05 when multiplied by 4


*Assign your answer to an object called `answer3.1.0`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"`, surrounded by quotes.*

In [None]:
# answer3.1.0 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.1.0()

**Question 3.1.1**
<br>{points: 1}

Continuing with the problem stated in **Question 3.1.0**, the campaign organizers can also **reject $H_0$** if the observed $t$-statistic is:

**A.** greater than `qt(1 - 0.05,998) = 1.65` 

**B.** greater than `qt(1 - 0.025,998) = 1.96`

**C.** greater than `qt(1 - 0.0025,998) = 2.81` 

**D.** greater than 0.05


*Assign your answer to an object called `answer3.1.1`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"`, surrounded by quotes.*

In [None]:
# answer3.1.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.1.1()

**Question 3.1.2**
<br>{points: 1}

In **Question 2.3.1** you performed 20 interim analyses of simulated data for an A/A design. Modify the code of **Question 2.3.5** to implement a Bonferroni correction as specified in **Question 3.1.0** in 100 experiments.

Then compare the estimated type I error rate when a Bonferroni correction is used with the expected type I error rate value.

*Assign your answer to an object called `answer3.1.2`. Your answer should be a tibble with two columns: `n_rejections_Bonf` and `expected_n_rejections`.*

In [None]:
#answer3.1.2 <- ...

# your code here
fail() # No Answer - remove if you provide an answer
                            
answer3.1.2 

In [None]:
test_3.1.2()

#### <font color=blue> Using a Bonferroni correction, the data can be sequentially analyzed and the experiment can be stopped earlier while controlling the type I error rate. </font>
- For the simulated experiments in the example above, the type I error rate was 2%, which is now below the planned 5% value. 

As with other multiple comparison problems, the Bonferroni correction in sequential analysis is very conservative and can affect the power of the test!!

### 3.2 Pocock boundaries

As we recalled in **Question 3.1.1**, the Bonferroni correction can be implemented by adjusting the critical value to `qt(1 - 0.0025, 998) = 2.81`. 

However, Bonferroni's correction ignores the dependence between tests: it controls the type I error rate under any dependence structure, so it becomes needlessly conservative when the tests are strongly correlated, which is the case here by design. 

> Sequential tests are nested so they are *not* independent. Thus, the Bonferroni correction is not commonly used in A/B testing experiments

Other methods tailored to A/B testing have been proposed. 

In this section we will examine the **Pocock method** to compute alternative critical values to evaluate interim analyses in sequential A/B testing.

Similarly to Bonferroni's method, the **Pocock method** computes a *common* critical value for all interim analyses. However, the Pocock boundary is not an adjustment of the quantile of a $t$-distribution.
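
The idea of a common critical value can be illustrated with a Monte Carlo approximation. The sketch below is not how `gsDesign` computes its boundaries; it is a rough simulation under stated assumptions: $z$-statistics with known variance, equally sized batches, a one-sided test, and a hypothetical design with `k = 5` looks. It searches for the single value that the largest of the $k$ interim statistics exceeds with probability $\alpha$ when $H_0$ is true.

```r
set.seed(2023)
alpha <- 0.05
k <- 5          # hypothetical number of equally spaced looks
n_sim <- 20000  # number of simulated experiments under H0

# Each row holds k independent standard Normal increments, one per batch of data.
increments <- matrix(rnorm(n_sim * k), nrow = n_sim, ncol = k)

# z-statistic at look j: sum of the first j increments divided by sqrt(j).
cumulative <- t(apply(increments, 1, cumsum))
z_stats <- sweep(cumulative, 2, sqrt(seq_len(k)), "/")

# A common boundary is crossed at some look exactly when the maximum crosses it.
max_z <- apply(z_stats, 1, max)
common_critical_value <- quantile(max_z, probs = 1 - alpha, names = FALSE)

c(unadjusted = qnorm(1 - alpha), common = common_critical_value)
```

Because the interim statistics share data, their maximum is driven by strongly correlated quantities, and the simulation accounts for that correlation directly.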

We can easily get the critical values for this design using `gsDesign::gsDesign()`.

**Note 1**: `gsDesign()` outputs a full sequential design, not just the critical values to control a desired type I error!! Particularly important is the computation of the required sample size to achieve the designed power!! You can read more about this package [here](https://keaven.github.io/gsDesign/reference/gsDesign.html)

**Note 2**: a caveat about this package is that two-sample tests are based on $z$-statistics, i.e., a case for which we assume that samples are drawn from Normal distributions with known SD. While this is usually an unrealistic assumption and in practice we use a $t$-test to compare means of two populations, results are nearly equivalent to a $z$-test. More can be read [here](https://keaven.github.io/gsDesign/articles/nNormal.html)  

In the following exercises we will examine if the critical values of the Pocock design can be used to control the type I error rate. 

Let's start by computing a Pocock design. Save the output in the object `design_pocock`. Extract Pocock's critical values for each interim analysis and save them in an object called `crit_pocock`.

In [None]:
# Run this cell to get a Pocock design!

design_pocock <- gsDesign(k = 20, # number of analyses planned (interim and final)
                          test.type = 1, # for one-sided tests
                          delta = 0, # default effect size
                          alpha = 0.05, # type I error rate
                          beta = 0.2, # type II error rate
                          sfu = 'Pocock')
                          
crit_pocock <- design_pocock$upper$bound

**Question 3.2.0**
<br>{points: 1}

As we know, when performing a hypothesis test, we can either compare the $p$-value to a pre-specified significance level $\alpha$ *or* we can compare the observed statistic to a critical value. 

Based on previous results, the Pocock method is more conservative than the Bonferroni correction. **True or False??**

*Assign your answer to an object called `answer3.2.0`. Your answer should be either "true" or "false", surrounded by quotes.*

In [None]:
#answer3.2.0 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.2.0()

**Question 3.2.1**
<br>{points: 1}

Using the data stored in `answer2.3.1`, plot the sequence of observed statistics for each interim analysis as a **line** with the incremental sample size on the $x$-axis and the value of the observed statistic on the $y$-axis. 

Add 3 dashed horizontal lines that indicate the following 3 boundaries (critical values): 

- a red line for the Pocock's critical values

- a blue line for the Bonferroni's critical values

- a black line for the unadjusted critical values

The `ggplot()` object's name will be `sequential_stat`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
options(repr.plot.width = 15, repr.plot.height = 9) # Adjust these numbers so the plot looks good on your screen.

#crit_unadj <- qt(1 - ..., ...)
#crit_bonferroni <- ...(1 - ..., ...)

#sequential_stat <- 
#  answer2.3.1 %>%
#  ggplot() +
#  geom_line(aes(x = inc_sample_size, y = statistic)) +
#  geom_point(aes(x = ..., y = ...)) +
#  geom_hline(yintercept = ..., colour = "red", linetype = "twodash") +
#  geom_point(aes(x = inc_sample_size, y = ...), colour = "red") +
#  geom_text(x=850, y=crit_pocock + 0.15, size=6, label="Pocock",colour = "red") +
#  geom_hline(yintercept = ..., colour = "blue", linetype = "twodash") +
#  geom_point(aes(x = inc_sample_size, y = rep(crit_bonferroni, 20)), colour = "blue") +
#  geom_text(x=850, y=crit_bonferroni + 0.15, size=6, label="Bonferroni",colour = "blue") +
#  geom_hline(yintercept = ..., linetype = "twodash") +
#  geom_point(aes(x = inc_sample_size, y = rep(..., 20))) +
#  geom_text(x=850, y=crit_unadj + 0.15, size=6, label="Unadjusted") +
#  theme(
#    text = element_text(size = 18),
#    plot.title = element_text(face = "bold"),
#    axis.title = element_text(face = "bold")
#  ) +
#  ggtitle("Critical values in Sequential Designs") +
#  ylab("Statistic") +
#  xlab("Sample Size") +
#  coord_cartesian(ylim = c(-1, 3)) +
#  scale_y_continuous(breaks = seq(-1, 3, by = 0.5))

# your code here
fail() # No Answer - remove if you provide an answer

sequential_stat

In [None]:
test_3.2.1()

**Question 3.2.2**
<br>{points: 1}

The campaign organizers have decided to monitor the data every 50 visitors per website and stop the experiment early if there's evidence of a difference between the group means. According to the data plotted in **Question 3.2.1**, which of the following statements is correct?? 

**A.** The campaign organizers would never stop the experiment, regardless of the boundary used

**B.** The campaign organizers would erroneously stop the experiment after the analysis of the second test, regardless of the boundary used

**C.** The campaign organizers would erroneously stop the experiment after the analysis of the second test, only if they correct the critical values using Bonferroni's method to control the type I error rate

**D.** The campaign organizers would erroneously stop the experiment after the analysis of the second test if they use unadjusted $t$ critical values 

*Assign your answer to an object called `answer3.2.2`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer3.2.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.2.2()

**Question 3.2.3**
<br>{points: 1}

In **Question 2.3.1** you performed 20 interim analyses of simulated data for an A/A design. Modify the code of **Question 2.3.5** to implement a sequential analysis using Pocock's boundary to control the type I error rate in 100 experiments.

Then compare the estimated type I error rate when Pocock's method is used with the expected type I error rate value.

*Assign your answer to an object called `answer3.2.3`. Your answer should be a tibble with two columns: `n_rejections_Pocock` and `expected_n_rejections`.*

In [None]:
#answer3.2.3 <- ...

# your code here
fail() # No Answer - remove if you provide an answer
                            
answer3.2.3 

In [None]:
test_3.2.3()

#### <font color=blue> Using the Pocock's method, the data can be sequentially analyzed and the experiment can be stopped earlier while controlling the type I error rate. </font>
- For the simulated experiments in the example above, the type I error rate was 7%, which is close to the planned 5% value

- As expected, this method is less conservative than the Bonferroni's correction

In Tutorial 2, you will implement another sequential test method available in the `gsDesign` package, called the **O’Brien-Fleming method**, which has conservative critical values for early interim analyses and less conservative values (closer to the unadjusted critical values) as more data are collected. In other words, the bounds are not uniform. 
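
The shape of these non-uniform bounds can be sketched with the classical O'Brien-Fleming form, in which the critical value at analysis $k$ of $K$ is $C\sqrt{K/k}$. The Monte Carlo approximation of $C$ below assumes equally spaced analyses and $z$-statistics under $H_0$. It is an illustration with hypothetical object names, not the `gsDesign` implementation used in Tutorial 2.

```r
set.seed(302)
n_sim   <- 20000   # number of simulated experiments under H0
n_looks <- 20      # K, the number of equally spaced analyses
alpha   <- 0.05    # one-sided type I error rate

increments   <- matrix(rnorm(n_sim * n_looks), nrow = n_sim, ncol = n_looks)
partial_sums <- t(apply(increments, 1, cumsum))

# With Z_k = S_k / sqrt(k), rejecting when Z_k > C * sqrt(K / k) is the
# same as rejecting when S_k / sqrt(K) > C, so C is the (1 - alpha)
# quantile of the largest scaled partial sum.
scaled_max <- apply(partial_sums, 1, max) / sqrt(n_looks)
const_obf  <- unname(quantile(scaled_max, 1 - alpha))

# Critical value at each analysis: very strict early, relaxing over time
crit_obf_mc <- const_obf * sqrt(n_looks / seq_len(n_looks))
round(crit_obf_mc[c(1, 5, 10, 20)], 2)
```

The resulting sequence decreases with each analysis, so rejecting early requires much stronger evidence than rejecting at the final analysis.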

There are many other methods to implement principled peeking strategies in A/B testing. 
- A very popular and flexible method, implemented by [Optimizely](https://www.optimizely.com), uses a mixture sequential probability ratio test (mSPRT) and *always valid* $p$-values. The methodology and implementation are beyond the scope of this course, but here's a nice [video](https://www.youtube.com/watch?v=AJX4W3MwKzU) that explains its key points without too many technical details.

## 4. Summary and key concepts learned

1. A/B testing refers to an experiment in which users are randomly assigned to one of two variations of a product or service, control (A) and variation (B), to see if variation B should be used for improvement.


2. The statistic used to test a hypothesis, the sample size calculation, the type I error rate specification and the desired power are all important and interconnected pieces of the experimental design! 


3. In classical hypothesis testing theory, the sample size must be fixed in advance when the experiment is designed!!


4. Modern platforms allow the users to continuously monitor the p-values and confidence intervals of their tests as data are collected (peeking) in order to re-adjust their experiment dynamically. 


5. In particular, users would like to stop their experiments earlier depending on the results of interim analyses.


6. Naively stopping experiments earlier than planned will increase the probability of *incorrectly* rejecting the null hypothesis (i.e., when there is no real effect). Stops must be part of the experimental design and appropriate testing methods must be used!


7. Sequential testing is just another flavour of a multiple comparison problem. If you make lots of comparisons but don’t correct for it, your error rates are inflated!! However, in sequential testing, tests are nested and not independent.


8. A possible way to control the type I error rate is to use a Bonferroni adjustment of the $p$-values (or equivalently the significance level or critical values). As with other multiple comparison problems, the Bonferroni correction in sequential analysis is very conservative and can reduce the power of the test!!


9. Pocock's method offers a less conservative way of controlling the type I error rate in sequential testing with early stops.


10. *Principled* peeking is ok and even beneficial in A/B testing.

> The experimental design is a very important piece of any statistical analysis! 