# Worksheet 1: Introduction to Statistical Modelling and Planned Peeking for A/B Testing Optimization

## Welcome to STAT 301: Statistical Modelling for Data Science

Each week you will complete a lecture assignment like this one. Before we get started, there are some administrative details.

You cannot learn technical subjects without hands-on practice. The weekly lecture worksheets and tutorials are an essential part of the course. Collaborating on lecture worksheets and tutorial assignments is more than okay -- it is encouraged! You should rarely be stuck for more than a few minutes on questions in lecture or tutorial, so ask a neighbour, TA or an instructor for help (explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it). Please do not just share answers, though. Everyone must submit a copy of their own work.

You can read more about course policies on the course website.

## Learning Objectives

After completing this week's worksheet and tutorial work, you will be able to:

1. Discuss why the methods you have learned in past courses are not sufficient to answer the more complex research problems being posed in this course (in particular, stopping an A/B test early).
2. Explain principled peeking and how it can be used for early stopping of an experiment (e.g., A/B testing). 
3. Write a computer script to perform A/B testing optimization using principled peeking.
4. Discuss the limitations of principled peeking for A/B testing optimization/early stopping of experiments.

## Loading packages

In [None]:
# Run this cell before continuing.
library(tidyverse)
library(digest)
library(testthat)
library(infer)
library(broom)
source("tests_worksheet_01.R")

## 1. Warm Up Questions

**Question 1.0**
<br>{points: 1}

In DSCI 100, we learned about [6 different types of data analysis questions we can ask and answer](https://ubc-dsci.github.io/introduction-to-datascience/). Moreover, in STAT 201, we reviewed what an inferential question is. Now, it is time to do a more comprehensive exercise to identify what class of data analysis a given real-life question implicates.

Below is a table that lists various data analysis questions in the left column:

| **Question** | **Type** |
| ------------------------------- | ----------------------- |
| Is wearing sunscreen associated with a decreased probability of developing skin cancer in Canada? | `answer1.0.0` |
| Is there a relationship between alcohol consumption and socioeconomic status in the 2018 City of Vancouver survey dataset? | `answer1.0.1` |
| Does a more concise Google ad lead to an increased number of visits to the advertised company's website? | `answer1.0.2` |
| How do changes in human behaviour lead to a reduction in the number of COVID-19 confirmed cases? | `answer1.0.3` |
| Does reduced caloric intake cause weight-loss? | `answer1.0.4` |
| Do tweets with GIFs get more impressions than tweets that do not? | `answer1.0.5` |
| Does including a GIF in tweets lead to more profile visits than tweets that do not include a GIF? | `answer1.0.6` |
| How many mentions will my next tweet get? | `answer1.0.7` |
| How many accounts are there on Twitter today? | `answer1.0.8` |
| Does increasing the contrast in images lead to better visual discrimination of visually impaired image content? | `answer1.0.9` |

The right column of the table is empty, but it should contain one of the following types of statistical question:

**A.** Descriptive.

**B.** Exploratory.

**C.** Inferential.

**D.** Predictive.

**E.** Causal.

**F.** Mechanistic.

*Assign your answers to the objects `answer1.0.0`, `answer1.0.1`, `answer1.0.2`, `answer1.0.3`, `answer1.0.4`, `answer1.0.5`, `answer1.0.6`, `answer1.0.7`, `answer1.0.8`, and `answer1.0.9`. Each answer should be a single character (`"A"`, `"B"`, `"C"`, `"D"`, `"E"`, or `"F"`) surrounded by quotes.*

In [None]:
# answer1.0.0 <- ...
# answer1.0.1 <- ...
# answer1.0.2 <- ...
# answer1.0.3 <- ...
# answer1.0.4 <- ...
# answer1.0.5 <- ...
# answer1.0.6 <- ...
# answer1.0.7 <- ...
# answer1.0.8 <- ...
# answer1.0.9 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.0()

**Question 1.1**
<br>{points: 1}

We must have language/terminology that we can use to discuss concepts related to experimentation and causal inference, as in A/B testing. It takes time and practice to commit these terms and corresponding definitions to our memory to use them fluidly in practice. Let us get some more training by matching language/terminology with their definitions.

Read the table below and assign the correct term in the right column.

| **Definition** | **Term** |
| ------------------------------- | ----------------------- |
| Technique to investigate effects of several variables in one study; experimental units are assigned to all possible combinations of factors. | `answer1.1.0` |
| Explanatory variable manipulated by the experimenter. | `answer1.1.1` |
| The entity/object in the sample that is assigned to a treatment and for which information is collected. | `answer1.1.2` |
| Repetition of an experimental treatment. | `answer1.1.3` |
| Equal number of experimental units for each treatment group. | `answer1.1.4` |
| Process of randomly assigning explanatory variable(s) of interest to experimental units | `answer1.1.5` |
| A combination of factor levels. | `answer1.1.6` |
| Statistically comparing a key performance indicator (conversion rate, dwell time, etc.) between two versions of a webpage/app/ad to assess which one performs better. | `answer1.1.7` |

*Assign your answers to the objects `answer1.1.0`, `answer1.1.1`, `answer1.1.2`, `answer1.1.3`, `answer1.1.4`, `answer1.1.5`, `answer1.1.6`, and `answer1.1.7`. Each answer should be a single string (`"randomization"`, `"A/B testing"`, `"treatment"`, `"factor"`, `"experimental unit"`, `"replicate"`, `"balanced design"`, or `"factorial design"`) surrounded by quotes.*

In [None]:
# answer1.1.0 <- ...
# answer1.1.1 <- ...
# answer1.1.2 <- ...
# answer1.1.3 <- ...
# answer1.1.4 <- ...
# answer1.1.5 <- ...
# answer1.1.6 <- ...
# answer1.1.7 <- ...

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_1.1()

## 2. Introduction to Statistical Modelling for A/B Testing Optimization

In A/B testing optimization, we are trying to compare the parameters of two populations. The parameter being compared can vary depending on the problem. For example, you could be interested in the proportion of website visitors who register for the newsletter. Or, you could be interested in the average amount of money spent by each visitor. Naturally, the statistical analysis will change depending on the parameters being tested (remember the different formulae for hypothesis testing using CLT you learned in STAT 201?). 
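As a minimal sketch of how the analysis changes with the parameter, the toy example below uses simulated data. The group size, registration rates, and spending distributions are made-up assumptions, not course data. A comparison of two proportions can use `prop.test()`, whereas a comparison of two means can use `t.test()`.

```r
# Toy illustration with simulated data; all numbers below are assumptions.
set.seed(1)
n_per_group <- 500

# Proportion parameter: did each visitor register for the newsletter (1) or not (0)?
registered_a <- rbinom(n_per_group, size = 1, prob = 0.10)
registered_b <- rbinom(n_per_group, size = 1, prob = 0.13)
prop_result <- prop.test(
  x = c(sum(registered_a), sum(registered_b)),
  n = c(n_per_group, n_per_group)
)

# Mean parameter: amount of money spent by each visitor.
spent_a <- rgamma(n_per_group, shape = 2, rate = 0.10)
spent_b <- rgamma(n_per_group, shape = 2, rate = 0.09)
mean_result <- t.test(spent_a, spent_b)

prop_result$p.value
mean_result$p.value
```

Both calls return an object of class `htest`, so the results can be inspected in the same way even though the underlying test differs.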

Suppose a company's marketing team has developed a new video for their TikTok ad. They want to know if this **new** ad will increase the ad engagement (which they will measure via *ad dwell time* in seconds, i.e., a continuous response) compared to the **current** ad.

<div>
    <img src="attachment:image.png" width="600px"/>
</div>

**Question 2.0**
<br>{points: 1}

The null hypothesis, $H_0$, generally refers to the status quo, i.e., there is no change in ad engagement. Let $\mu_{\text{new}}$ and $\mu_{\text{current}}$ be the mean dwell times of the new and current ads, respectively. What is the null hypothesis we are testing?

**A.** $H_0: \mu_{\text{new}} > \mu_{\text{current}}$

**B.** $H_0: \mu_{\text{new}} < \mu_{\text{current}}$

**C.** $H_0: \mu_{\text{new}} = \mu_{\text{current}}$

**D.** $H_0: \mu_{\text{new}} \neq \mu_{\text{current}}$

*Assign your answer to an object called `answer2.0`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer2.0 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.0()

**Question 2.1**
<br>{points: 1}

The alternative hypothesis, $H_1$, generally refers to the researcher's hypothesis of interest, i.e., the new ad increases the ad engagement. Let $\mu_{\text{new}}$ and $\mu_{\text{current}}$ be the mean dwell times of the new and current ads, respectively. What is the alternative hypothesis we are testing?

**A.** $H_1: \mu_{\text{new}} > \mu_{\text{current}}$

**B.** $H_1: \mu_{\text{new}} < \mu_{\text{current}}$

**C.** $H_1: \mu_{\text{new}} = \mu_{\text{current}}$

**D.** $H_1: \mu_{\text{new}} \neq \mu_{\text{current}}$

*Assign your answer to an object called `answer2.1`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer2.1 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.1()

**Question 2.2**
<br>{points: 1}

The company would like to run an experiment on TikTok users in the age demographic into which most of their customers fall (between 16 and 24 years old). They would randomize a sample of $n = 2000$ TikTok users to view one of the two ads. The sample will be split in half, i.e., $n_{\text{current}} = n_{\text{new}} = 1000$.

Once the data is collected, we will need to conduct a specific statistical analysis. This analysis will depend on the nature of our response and on the approach we want to use: bootstrapping or the Central Limit Theorem (CLT). If we opt for using the CLT to conduct the analysis, what is the specific test we need to perform?

**A.** One-sample $z$-test. 

**B.** One-sample $t$-test.

**C.** Two-sample $z$-test.

**D.** Two-sample $t$-test.

**E.** Two-way ANOVA.

*Assign your answer to an object called `answer2.2`. Your answer should be one of `"A"`, `"B"`, `"C"`, `"D"`, or `"E"` surrounded by quotes.*

In [None]:
# answer2.2 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.2()

In practice, we would run an A/B test by drawing a sample of $n$ experimental units (i.e., subjects) from the population. Then, we would split the sample so that some subjects receive one treatment (in this case, they will see the current ad) and the remaining subjects receive the other treatment (in this case, they will see the new ad). However, since in practice we never know the truth, in this exercise we are going to use simulated data to explore the behavior of our inference methods.
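The random split described above can be sketched in base R as follows. The pool of 20 users and the column names are hypothetical placeholders chosen for illustration; shuffling a balanced vector of labels is only one of several ways to randomize.

```r
# Hypothetical pool of 20 experimental units; the IDs are placeholders.
set.seed(2)
units <- data.frame(user = 1:20)

# Shuffle a balanced vector of treatment labels so that assignment is random.
treatments <- rep(c("current", "new"), each = nrow(units) / 2)
units$ad_watched <- sample(treatments)

# Each treatment receives the same number of units (a balanced design).
table(units$ad_watched)
```

Because the labels are shuffled rather than drawn independently for each unit, the two groups always end up with equal sizes.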

Suppose we have one million TikTok users who are between 16 and 24 years old. The object `tiktok_pop` stores the dwell time (in seconds) of each user for the current ad (`dwell_time_current_ad`) and for the new ad (`dwell_time_new_ad`).

In [None]:
# run this cell before continuing
tiktok_pop <-
    read_csv("data/tiktok_pop.csv")

head(tiktok_pop)

**Question 2.3**
<br>{points: 1}

Calculate the true mean and true standard deviation of dwell time for both ads. 

*Save the result in a tibble called `tiktok_true_params`. The tibble should have four columns: `mean_current_ad`, `sd_current_ad`, `mean_new_ad`, and `sd_new_ad`.*

In [None]:
# tiktok_true_params <- 
#     ... %>% 
#     ...(mean_current_ad = ...,
#               sd_current_ad = ...,
#               mean_new_ad = ...,
#               sd_new_ad = ...)


# your code here
fail() # No Answer - remove if you provide an answer

tiktok_true_params

In [None]:
test_2.3()

**Question 2.4**
<br>{points: 1}

Although we had access to the population and its true parameters in the previous exercise, this is not the case in practice! Let's see how things actually work in practice.
Here's what you need to do:

1. Take one sample of size 200 users from the population. 
2. The first 100 users in our sample will watch the current ad, and the remaining will watch the new ad. 

*Save the sample in a tibble called `tiktok_sample`. The tibble should have three columns: `user`, `ad_watched`, and `dwell_time`.*

In [None]:
set.seed(432121) # do not change this!

# tiktok_sample <-
#     ...%>% 
#     rep_sample_n(...) %>% 
#     ungroup() %>% 
#     mutate(row = row_number(),
#            ad_watched = if_else(row <= ..., "current", "new"),
#            dwell_time = if_else(row <= 100, dwell_time_current_ad, dwell_time_new_ad)) %>% 
#     select(...)

# your code here
fail() # No Answer - remove if you provide an answer

head(tiktok_sample)

In [None]:
test_2.4()

**Question 2.5**
<br>{points: 1}

Once we have collected our experimental samples for both treatments, it is time to conduct the statistical analysis. However, before testing the hypothesis, it is always good to graphically compare the distributions and spread in both samples of dwell times.

Make the side-by-side plot of the boxplots of each sample distribution, `current` and `new`, stored in the `ad_watched` column. Since boxplot does not show the mean value, let's add a point on top of each boxplot to represent the mean. The function `stat_summary()` can help with that. 

*Store the plot in an object named `dwell_time_boxplots`.*

In [None]:
options(repr.plot.width = 15, repr.plot.height = 9) # Adjust these numbers so the plot looks good on your screen.

# dwell_time_boxplots <- 
#    ... %>%
#   ggplot() +
#   ...(aes(..., ..., fill = ...)) +
#   theme(
#     text = element_text(size = 22),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold")
#   ) +
#   ggtitle(...) +
#   xlab(...) +
#   ylab(...) +
#   guides(fill = "none") +
#   stat_summary(aes(..., ..., fill = ...),
#     fun = ..., colour = "yellow", geom = "point",
#     shape = 18, size = 5
#   )

# your code here
fail() # No Answer - remove if you provide an answer

dwell_time_boxplots

In [None]:
test_2.5()

**Question 2.6**
<br>{points: 1}

Based on your findings in **Question 2.5**, what can you conclude about the behaviour of the sampled data?

**A.** The current ad's median and mean are higher than the new ad's ones. Moreover, both data spreads are quite similar.

**C.** The new ad's median and mean are significantly higher than the current ad's ones. Moreover, both data spreads are quite similar.

**D.** The new ad's median and mean seem to be higher than the current ad's ones. However, there's the possibility that this difference is due to sampling variability. Both data spreads are quite similar.

**E.** The new ad's median and mean are significantly higher than the current ad's ones. Moreover, the data spreads are quite different by treatment.

*Assign your answer to an object called `answer2.6`. Your answer should be one of `"A"`, `"B"`, `"C"`, `"D"`, or `"E"` surrounded by quotes.*

In [None]:
# answer2.6 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.6()

**Question 2.7**
<br>{points: 1}

The previous plot indicates that the new ad's mean is higher than the current ad's one. Nonetheless, given the variations found in each treatment's dwell times, how likely would it be for us to see a sample difference at least as extreme as the observed one *if there were no difference in means*? In other words, is the observed difference statistically significant? 

Recall that the company wants to know whether this new ad will increase the population's ad engagement compared to the current ad. The hypotheses were already defined in **Questions 2.0** and **2.1**. Now, let us define the test statistic we will need to conduct this test:

$$
T = \frac{\bar{x}_{\text{new}} - \bar{x}_{\text{current}}}{\sqrt{\frac{s^2_{\text{new}}}{n_{\text{new}}}+\frac{s^2_{\text{current}}}{n_{\text{current}}}}}
$$

where $\bar{x}_{\text{new}}$ and $\bar{x}_{\text{current}}$ are the sample means of the dwell times for the new and current ads, respectively; $s^2_{\text{new}}$ and $s^2_{\text{current}}$ are the sample variances for the new and current ads, respectively; and $n_{\text{new}}$ and $n_{\text{current}}$ are the sample sizes for the new and current ads, respectively.

Furthermore, under the null hypothesis $H_0$, the $T$ statistic follows a $t$ distribution with approximately

$$
\nu = \frac{
    \left(\frac{s_{\text{new}}^2}{n_\text{new}}+\frac{s_{\text{current}}^2}{n_\text{current}}\right)^2
}
{
\frac{s_{\text{new}}^4}{n_{\text{new}}^2(n_{\text{new}}-1)}+\frac{s_{\text{current}}^4}{n_{\text{current}}^2(n_{\text{current}}-1)}
}
$$

degrees of freedom. Use the corresponding `R` function to automatically obtain all these calculations. Make sure to use `broom::tidy()` to get a more organized result.
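To see how the two formulas fit together, here is a small sketch that computes $T$ and $\nu$ by hand on simulated toy samples and compares them with the output of `t.test()`, which applies the unequal-variance (Welch) version by default. The sample sizes, means, and standard deviations are assumptions made only for this illustration.

```r
# Simulated toy samples; the means, SDs, and sizes are assumptions.
set.seed(3)
x_new <- rnorm(40, mean = 5, sd = 2)
x_current <- rnorm(50, mean = 4, sd = 3)

# Per-group components s^2 / n of the squared standard error.
v_new <- var(x_new) / length(x_new)
v_current <- var(x_current) / length(x_current)

# Test statistic T and approximate degrees of freedom nu.
# Note that (s^2 / n)^2 / (n - 1) equals s^4 / (n^2 * (n - 1)).
t_manual <- (mean(x_new) - mean(x_current)) / sqrt(v_new + v_current)
nu_manual <- (v_new + v_current)^2 /
  (v_new^2 / (length(x_new) - 1) + v_current^2 / (length(x_current) - 1))

# Built-in test for comparison (var.equal = FALSE is the default).
welch <- t.test(x_new, x_current)

c(manual = t_manual, built_in = unname(welch$statistic))
c(manual = nu_manual, built_in = unname(welch$parameter))
```

The manual and built-in values should agree up to floating-point error, and $\nu$ is generally not an integer.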

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it. Assign your answer to an object called `answer2.7`.*

In [None]:
# answer2.7 <- tidy(...(
#   x = ...,
#   y = ...,
#   alternative = ...,
#   var.equal = ...,
# ))

# your code here
fail() # No Answer - remove if you provide an answer

answer2.7

In [None]:
test_2.7()

**Question 2.8**
<br>{points: 1}

What is your decision at the 5% significance level?

**A.** Since the p-value is less than 0.05, we reject $H_0$. Therefore, we have statistical evidence to state that the current ad's mean dwell time is larger than the new ad's one.

**B.** Since the p-value is less than 0.05, we fail to reject $H_0$. Therefore, we have statistical evidence to state that the new ad's mean dwell time is equal to the current ad's one.

**C.** Since the p-value is less than 0.05, we reject $H_0$. Therefore, we have statistical evidence to state that the new ad's mean dwell time is larger than the current ad's one.

**D.** Since the p-value is less than 0.05, we fail to reject $H_0$. Therefore, we have statistical evidence to state that the new ad's mean dwell time is larger than the current ad's one.

*Assign your answer to an object called `answer2.8`. Your answer should be one of `"A"`, `"B"`, `"C"`, or `"D"` surrounded by quotes.*

In [None]:
# answer2.8 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_2.8()

**Question 2.9**
<br>{points: 1}

As an alternative to the asymptotic approximation, one could use bootstrapping to test the equality of the parameters of two populations. Using 1000 replications, test the hypothesis $H_0: \mu_{\text{current}}=\mu_{\text{new}}$ vs $H_1: \mu_{\text{current}}<\mu_{\text{new}}$.
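For intuition only, the sketch below shows one possible resampling scheme in base R on simulated toy data: both samples are shifted to a common mean so that $H_0$ holds, and each is then resampled with replacement. The data, object names, and the recentring scheme are assumptions for illustration; this is not necessarily the workflow the worksheet's tests expect.

```r
# Simulated toy samples; the means, SDs, and sizes are assumptions.
set.seed(4)
x_current <- rnorm(30, mean = 4, sd = 2)
x_new <- rnorm(30, mean = 5, sd = 2)
obs_diff <- mean(x_new) - mean(x_current)

# Impose H0 by shifting both samples to the pooled mean.
pooled_mean <- mean(c(x_current, x_new))
null_current <- x_current - mean(x_current) + pooled_mean
null_new <- x_new - mean(x_new) + pooled_mean

# Resample each shifted group with replacement and recompute the statistic.
boot_diffs <- replicate(1000, {
  mean(sample(null_new, replace = TRUE)) - mean(sample(null_current, replace = TRUE))
})

# One-sided p-value: share of null differences at least as large as observed.
p_value <- mean(boot_diffs >= obs_diff)
p_value
```

The p-value estimate depends on the seed and on the number of replications, so small changes between runs are expected.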

*Store the result in an object named `tiktok_bootstrap_results`. Your answer should be a tibble with the observed test statistic, and the bootstrap p-value estimate.*

In [None]:
set.seed(10)

# your code here
fail() # No Answer - remove if you provide an answer

tiktok_bootstrap_results

In [None]:
test_2.9()

## 3. Inferential Implications of Early Stopping in Hypothesis Testing

Let us reconsider the TikTok ad example. In practice, the company testing the ads wants to present the new ad to all users as soon as possible, provided that the new ad is actually superior. For example, imagine the company decided to conduct a balanced A/B test with 5000 users for each ad, in other words, $n_{\text{new}}=n_{\text{current}}=5000$. Now, suppose that 4000 users have watched the current ad and 4000 users have watched the new ad, and the dwell time of the new ad is 90% higher. Does the company really need to wait until 5000 users have watched each ad? Wouldn't it be better for the company to start using the new ad right away and start profiting more? This is called early stopping (because the test would be stopped earlier than planned).

But how early? What if, instead of 4000 users for each ad, the company had observed only 3000? Would it be okay to stop then? What about 500? 100? 50? 10? What if the company kept monitoring the test statistic after each view of the ads?

In this part, we are going to explore the consequences of online monitoring and early stopping. We'll explore more in the tutorial.


Suppose that they initially plan to run a balanced experiment with a sample size of 500 per treatment (ad watched), i.e., $n_\text{current} = n_\text{new} = 500$. To allow for early stopping, however, the team would collect the data **gradually** for both groups and stop the experiment **once they find a significant result** (i.e., $p\text{-value} < \alpha$, where $\alpha$ is the significance level).

**Question 3.0**
<br>{points: 1}

Suppose that after only ten users watched each ad, the company decided to peek at the test statistic and found the difference to be significant. Therefore they rejected $H_0$ and stopped the experiment. True or false:

> The problem with this scenario is that 10 is a fairly small sample size, which considerably hinders the sensitivity of the test to detect if there's a difference. In addition, for a sample of size 10, the probability of Type I Error is very high. Therefore, the company should not rely on this result and stop the experiment.


*Assign your answer to an object called `answer3.0`. Your answer should be either `"true"` or `"false"`, surrounded by quotes.*

In [None]:
# answer3.0 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.0()

In the next few exercises, we will study how the $p$-values evolve as we increase the sample size per treatment until getting to $n_\text{current} = n_\text{new} = 500$. We prepared a function for you that performs a two-sample $t$-test beginning with 10 experimental units per treatment group (ad watched) and then gradually increases the data collection by `sample_increase_step` units until reaching `n` in each treatment group (equal sample size per treatment) for a single experiment.

For example, if `sample_increase_step` is 20, and `n=500`, the function will:
1. draw a sample of 10 units;
2. perform the two-sample t-test;
3. draw another 20 units and perform the two-sample t-test now with all 30 units drawn; 
4. draw another 20 units and perform the two-sample t-test now with all 50 units drawn; 
5. draw another 20 units and perform the two-sample t-test now with all 70 units drawn; 
$$
\vdots
$$
and so on, for as long as the next sample size does not exceed `n`. With these values, the last test uses 490 units per group, because the next step (510) would exceed `n = 500`.
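The schedule of looks in the steps above can be sketched with `seq()`. The values of `n` and `sample_increase_step` below are hypothetical and were chosen so that the schedule ends exactly at `n`; `look_sizes` is an illustrative name that does not appear in the worksheet's function.

```r
# Hypothetical settings chosen so that the schedule ends exactly at n.
n <- 100
sample_increase_step <- 30

# Per-group sample sizes at which a test would be run, starting from 10.
look_sizes <- seq(from = 10, to = n, by = sample_increase_step)
look_sizes
# 10 40 70 100

# Number of tests performed in a single experiment.
length(look_sizes)
# 4
```

When `n - 10` is not a multiple of the step, `seq()` stops at the last value that does not exceed `n`.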

The function returns a tibble that has two columns:

- `inc_sample_size`: the cumulative sample size per group at which the $t$-test is performed.
- `p_value`: $p$-value from performing a given $t$-test.

In [None]:
# Two-sample t-test for tracking p-values by incremental sample sizes until getting to n.

# @param n (numeric): Initially planned sample size for each group, n = n_current = n_new.
# @param d_0 (numeric): True difference in population means (new minus current); 0 means no effect.
# @param sd_current (numeric): Population standard deviation for current ad.
# @param sd_new (numeric): Population standard deviation for new ad.
# @param sample_increase_step (numeric): Sample size increment.

# @return p.value.df: A tibble that has 2 columns:
# inc_sample_size (increasing sample size) and p_value (p-value from performing a two-sample t-test).

incremental_t_test <- function(n, d_0, sd_current, sd_new, sample_increase_step) {
  sample_current <- rnorm(n, mean = 4, sd = sd_current)
  sample_new <- rnorm(n, mean = 4 + d_0, sd = sd_new)

  # Number of looks: the first at 10 units per group, then one per full increment up to n.
  n_looks <- 1 + floor((n - 10) / sample_increase_step)
  p.value.df <- tibble(
    inc_sample_size = rep(0, n_looks),
    p_value = rep(0, n_looks)
  )

  current_sample_size <- 10

  for (i in seq_len(nrow(p.value.df))) {
    # Test using only the first `current_sample_size` observations of each group.
    t_test_results <- t.test(sample_new[1:current_sample_size], sample_current[1:current_sample_size],
      var.equal = TRUE,
      alternative = "greater"
    )
    p.value.df[i, "p_value"] <- t_test_results$p.value
    p.value.df[i, "inc_sample_size"] <- current_sample_size
    current_sample_size <- current_sample_size + sample_increase_step
  }

  return(p.value.df)
}


**Question 3.1**
<br>{points: 1}

Let's focus on the case where there is no difference in the dwell time means for the current and new ads. 
The ad company will now start testing the new ad. The process will be this:

1. For every 60 users, the company will assign the current ad to 30, and the new ad to 30;
2. Then, the company conducts the hypothesis test.
3. The company will repeat steps 1 and 2 until $n = 1000$ users have watched each of the ads.

Use the `incremental_t_test` function to conduct the company's experiment. 

*Save the result in an object called `answer3.1`. Your answer should be a tibble with two columns: `inc_sample_size`, and `p_value`.*

In [None]:
set.seed(28) # do not change this.

#answer3.1 <- 
#    incremental_t_test(n = ..., d_0 = ..., sample_increase_step = ..., sd_current = 8, sd_new = 8)

# your code here
fail() # No Answer - remove if you provide an answer

answer3.1

In [None]:
test_3.1()

**Question 3.2**
<br>{points: 1}

Using the data stored in `answer3.1`, plot the $p$-value sequence as a **line** with the incremental sample size on the $x$-axis and $p$-value on the $y$-axis. Add a dashed horizontal red line that indicates a threshold of the significance level $\alpha = 0.05$. The `ggplot()` object's name will be `dwell_time_pvalue_evolution`.

*Fill out those parts indicated with `...`, uncomment the corresponding code in the cell below, and run it.*

In [None]:
# dwell_time_pvalue_evolution <- 
#   answer3.1 %>%
#   ggplot() +
#   geom_line(aes(x = ..., y = ...)) +
#   theme(
#     text = element_text(size = 18),
#     plot.title = element_text(face = "bold"),
#     axis.title = element_text(face = "bold")
#   ) +
#   ggtitle("Evolution of p-values in a Single Experiment by Increasing Data Collection") +
#   ylab("p-value") +
#   xlab("Sample Size") +
#   geom_hline(
#     yintercept = ...,
#     colour = "red",
#     linetype = "twodash"
#   ) +
#   coord_cartesian(ylim = c(0, 1)) +
#   scale_y_continuous(breaks = seq(0, 1, by = 0.05))


# your code here
fail() # No Answer - remove if you provide an answer

dwell_time_pvalue_evolution

In [None]:
test_3.2()

**Question 3.3**
<br>{points: 1}

Suppose you want to implement early stopping (before reaching the maximum sample size of `n = 1000` experimental units per treatment). To save time and resources allocated for the A/B testing, the company initially decided to peek every **60 users** (30 for each ad) during the experiment (i.e., at each row in `answer3.1`). Using a significance level $\alpha = 0.05$, you would stop the experiment as soon as you find a significant result.

Given the results in **Question 3.2**, the company would have stopped the experiment when 550 users had watched each ad. What error, if any, is the company committing? 

**A.** No error.

**B.** Type I Error.

**C.** Type II Error.

*Assign your answer to an object called `answer3.3`. Your answer should be one of `"A"`, `"B"`, or `"C"` surrounded by quotes.*

In [None]:
# answer3.3 <- 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.3()

**Question 3.4**
<br>{points: 1}

Since the company is testing the hypothesis at every **60 users** (30 for each ad) during the experiment, the company is conducting 34 hypothesis tests, and they need to reject only once to conclude that there's a difference. Intuitively, we would expect that the probability of wrongly rejecting $H_0$ would be inflated (think about it!). Let's see if that's the case.  
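As a rough reference point for this intuition, the sketch below computes the probability of at least one false rejection across 34 tests under the simplifying assumption that the tests are independent. That assumption does not hold here, because each look reuses the data from earlier looks, so the true rate for this strategy is lower than this figure while still exceeding $\alpha$. The object names are illustrative.

```r
alpha <- 0.05
n_looks <- 34

# Probability of at least one false rejection if the looks were independent.
# Sequential looks share data, so this is a reference point, not the actual rate.
fwer_if_independent <- 1 - (1 - alpha)^n_looks
fwer_if_independent
```

The simulation in this question estimates the actual rate for correlated looks rather than relying on the independence assumption.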


In the code below, we performed this experiment 100 times.  Count how many times the company would wrongly reject $H_0$ by following this strategy. Compare it with the expected number of rejections given the significance level $\alpha = 0.05$. 

*Assign your answer to an object called `answer3.4`. Your answer should be a tibble with two columns: `n_rejections` and `expected_n_rejections`.*

In [None]:
set.seed(12)

### Run this before continuing
multiple_times_sequential_tests <- 
    tibble(experiment = 1:100) %>% 
    mutate(seq_test = map(.x = experiment, 
                          .f = function(x) incremental_t_test(n = 1000, d_0 = 0, sample_increase_step = 30, sd_current = 8, sd_new = 8)))


In [None]:
#answer3.4 <- ...

# your code here
fail() # No Answer - remove if you provide an answer
                            
answer3.4

In [None]:
test_3.4()

**Question 3.5**
<br>{points: 1}

Select the right option to complete the sentence below:

> *With the strategy used by the company, the probability of Type I error is approximately ... the specified one.* 

**A.** equal to

**B.** 3 times lower than

**C.** 5 times lower than

**D.** 3 times higher than

**E.** 5 times higher than

*Assign your answer to an object called `answer3.5`. Your answer should be one of `"A"`, `"B"`, `"C"`, `"D"`,  or `"E"` surrounded by quotes.*

In [None]:
#answer3.5 <- ""

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_3.5()