## Lab 3

### UGBA 88: Data and Decisions

<img src="what-ab-test.webp" alt="Drawing" style="width: 500px;"/>

<br>

This lab is designed to be completed in class. However, in case you need additional time, this assignment is due **Tuesday, October 22nd at 11:59pm**.

The lab will be graded for **completion**. Lab office hours are held by Connector Assistants on Tuesdays after labs from 2-4pm in the DS Nexus in Moffitt.

## Experiments and the Statistical Power of a Test

The **statistical power** of an experiment is the likelihood of detecting a *statistically significant* treatment effect for the outcome of interest when a treatment effect is actually present. The purpose of this lab is to explore how the statistical power of an experiment depends on the sample size and the true effect size. Statistical power is an essential consideration when designing an experiment because experiments without adequate power may not teach us much about the causal effect of the treatment.

### Outline

*Dependencies:* 
- datascience
- NumPy
- Matplotlib

In [None]:
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import numpy as np

from client.api.notebook import Notebook
ok = Notebook('lab3.ok')
_ = ok.auth(inline=True) 

## Testing a New Landing Page

Our motivating application in this lab is a common one for online companies: testing a new design for a webpage. 

Consider a company that provides some service to customers. Potential customers that click on an advertisement or referral link are redirected to a landing page that describes the service in more detail. Most importantly, potential customers can *sign up* for the service via the landing page.

<img src="slack_landing.png" alt="Drawing" style="width: 500px;"/>
<center>a Slack landing page</center>

The landing page is an important path for generating new business, but some in the company believe the page is too confusing and may be an impediment to customer growth. The company's designers have redesigned the landing page in an attempt to increase user engagement. The outcome of interest is whether a user who arrives at the landing page actually signs up for the service. We'll call the share of arriving users who sign up the **clickthrough rate**.

In practice, most companies will test whether a new design is better than the old one by running an A/B test. That is, they would run an experiment that randomly assigned some users to the old landing page and others to the new landing page, and then compare outcomes. In this lab, for the sake of simplicity, we'll compare the new design against a *fixed* benchmark. The benchmark we'll use is **20%**. You can think of this as the *historic* clickthrough rate for the old design. In the experiment we will randomly assign a set of users to see the new design. If we have sufficient evidence that the clickthrough rate for the new design would be higher than 20%, we'll call the new design a success and launch it to all users.

## Section 1: Review of Hypothesis Testing

As you may recall from Data 8, we use **hypothesis tests** to choose between two views about how data are generated. These two views are called the **null** and **alternative** hypotheses.

For reference, I've reprinted the Data 8 definitions of the null and alternative hypotheses below.

**The null hypothesis** is a clearly defined model about chances. It says that the data were generated at random under clearly specified assumptions about the randomness. The word "null" reinforces the idea that if the data look different from what the null hypothesis predicts, the difference is due to nothing but chance.

**The alternative hypothesis** says that some reason other than chance made the data differ from the predictions of the model in the null hypothesis.

<br>

**Q1.1** State the null and alternative hypotheses for our landing page experiment.

*write answer here*

Null hypothesis: ...

Alternative hypothesis: ...

### Running a Hypothesis Test

Remember the general flavor of a hypothesis test from Data 8:
- Choose a test statistic
- Simulate the empirical distribution of the test statistic under the null hypothesis
- Compare the observed test statistic to this empirical distribution
- From this comparison, determine whether the null hypothesis is supported or not
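
The four steps above can be sketched on a deliberately different toy problem, so the lab questions below stay open. This is only an illustration under stated assumptions: the coin-flip setup, the observed value of 0.58, and all variable names are hypothetical and are not part of the lab.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Hypothetical toy setup (not the lab's data): 100 coin flips, 58% heads observed
num_flips = 100
observed_heads_rate = 0.58

# Step 1: the test statistic is the share of heads
# Step 2: simulate that statistic many times under the null (a fair coin)
null_rates = rng.binomial(num_flips, 0.5, size=10_000) / num_flips

# Step 3: compare the observed statistic with the simulated distribution
share_at_or_above = np.mean(null_rates >= observed_heads_rate)

# Step 4: a small share means the observation would be unusual under the null
print("Share of null simulations at or above the observed rate:", share_at_or_above)
```

The same pattern applies to the landing page experiment, with the clickthrough rate playing the role of the share of heads.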

A natural test statistic to use here is the **clickthrough rate**. We will now build towards simulating the empirical distribution of the clickthrough rate under the null hypothesis.

**Q1.2** Run the cell below, which initializes the `sample_proportion` function.

This function takes two inputs, the sample size and the population probability of success, and returns the success rate for a random sample drawn from the population. (Note: this is a slight simplification of the function `sample_proportions` from Data 8.)


In [None]:
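def sample_proportion(n, prob):
    """
    Return the share of successes in a random sample of size n.

    n (int): the sample size
    prob (float): the population probability of success, between 0 and 1
    """
    #raise an error for invalid probabilities rather than returning a message string
    if prob > 1:
        raise ValueError('probability of success must not exceed 1')
    if prob < 0:
        raise ValueError('probability of success must not be negative')
    return np.random.binomial(n, prob) / n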

**Q1.3** Let's draw one sample of the data under the null hypothesis using the `sample_proportion` function.

Suppose our sample size is 100 customers. Compare the clickthrough rate of the data simulated under the null with the expected clickthrough rate under the null. Notice a difference?

In [None]:
#sample size
n = 100

#success rate under null hypothesis (a value of 0.2 corresponds to 20% clickthrough rate)
null_clickthrough_rate = 0.2

#simulate values for clickthrough rate under the null
#hint: use sample_proportion function defined above
observed_clickthrough_rate = ...

print("Expected Clickthrough Rate:", null_clickthrough_rate)
print("Observed Clickthrough Rate:", observed_clickthrough_rate)

Try re-running the cell a few times. Notice that the simulated clickthrough rate generally does not match the expected clickthrough rate exactly, but is usually close.

We'll call the latest clickthrough rate you simulate the **observed clickthrough rate**.

### Simulating an Empirical Distribution for the Null

Next, we will simulate an empirical distribution of the clickthrough rate for samples of size 100 under the null hypothesis. This will give us a sense of how variable the clickthrough rate could be. We will generate 10,000 samples and calculate the clickthrough rate for each sample.

**Q1.4** Fill out the function below that returns an array of the clickthrough rates of simulated trials. Then run the cell below to ensure that you get reasonable values.

In [None]:
#number of simulated draws for empirical distribution
model_draws = 10000

def simulate_n_trials(num_trials, prob, sample_size):
    """
    num_trials: the number of simulated samples to draw
    prob: the probability of success under the model
    sample_size: the number of observations in each simulated sample
    """
    model_clickthrough_rates = make_array()
    for i in np.arange(num_trials):
        sim_model_clickthrough_rate = ...
        model_clickthrough_rates = np.append(model_clickthrough_rates, sim_model_clickthrough_rate)
    return model_clickthrough_rates

model_clickthrough_rates = simulate_n_trials(model_draws, null_clickthrough_rate, n)
model_clickthrough_rates

### Evaluating Data under the Null

To evaluate data under the null, we will start by plotting a histogram of the empirical distribution. Then we will check how your observed clickthrough rate compares.

**Q1.5** Make a histogram of the simulated clickthrough rates.

For comparison, the code below will add a red dot for your `observed_clickthrough_rate` value from earlier.

In [None]:
#plot a histogram of model_clickthrough_rates values
...

#this line just draws a red dot corresponding to the observed clickthrough rate
plt.scatter(observed_clickthrough_rate, 0, color='red', s=30);

Was the clickthrough rate you simulated significantly different from what we would predict under the null? 

### The P-Value

To be more precise about what we mean by "significance", we use the *p-value* as defined in Data 8. I've reprinted the Data 8 definition below:

"**The P-value** is the chance, based on the model in the null hypothesis, that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative.

If a P-value is small, that means the tail beyond the observed statistic is small and so the observed statistic is far away from what the null predicts. This implies that the data support the alternative hypothesis better than they support the null."

In other words, when the p-value is small, we **reject** the null hypothesis.

Let's compute the p-value corresponding to your observed clickthrough rate. This is the share of values in the array `model_clickthrough_rates` that are greater than or equal to `observed_clickthrough_rate`.

[Link to relevant data 8 material](https://www.inferentialthinking.com/chapters/11/3/decisions-and-uncertainty.html#Conventional-Cut-offs-and-the-P-value)

**Q1.6** Calculate the p-value for your observed clickthrough rate.

In [None]:
#calculate the p-value
p_value = ...
p_value

To determine whether an observed statistic is 'significant', a common cutoff to use is a p-value of 0.05 or below. More specifically, we would say an observed statistic with a p-value of 0.05 or below under the null is 'statistically significant at the 5% significance level'.

How large a clickthrough rate would we have needed to observe to declare it statistically significant and reject the null hypothesis? We can compute this by taking the 95th percentile of our empirical distribution of the clickthrough rate under the null. By construction, at most about 5% of values in the empirical distribution are larger than this cutoff.

*Run the cell below to calculate the cutoff for a significance level of 5%.*

In [None]:
cutoff = percentile(95, model_clickthrough_rates)
cutoff

Your cutoff should be around 0.27.
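
As a cross-check on the simulated cutoff, the 95th percentile can also be computed exactly from the binomial distribution. The sketch below is not part of the original lab: the helper `binomial_cdf` and the names `n_users`, `null_rate`, and `exact_cutoff` are hypothetical, and it uses only the Python standard library.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n_users = 100     # sample size used in this section
null_rate = 0.2   # clickthrough rate under the null hypothesis

# Smallest signup count whose cumulative probability reaches 95% under the null
exact_cutoff_count = next(
    k for k in range(n_users + 1)
    if binomial_cdf(k, n_users, null_rate) >= 0.95
)
exact_cutoff = exact_cutoff_count / n_users
print("Exact 95th percentile of the null clickthrough rate:", exact_cutoff)
```

Because the simulated distribution only approximates the exact one, the two cutoffs may occasionally differ by one user's worth of clickthrough rate.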

### Simulating Experiment Results

The cutoff above defines a decision rule. If our experiment produces a result that's above the cutoff, we launch the new design.

In this section we will ask the following question: if the null hypothesis is true and the clickthrough rate for the new design is indeed 20%, how often will we reject the null hypothesis and launch the new design? 

We will assume the clickthrough rate for the new design is also 20%, and simulate experimental results under that distribution. A simulated experiment here is a random draw of 100 observations. For each draw, we will apply our decision rule and decide whether to reject the null hypothesis. Then, across all our simulations, we will measure how often we reject the null hypothesis.

We'll simulate 1,000 experiment results given this clickthrough rate. For each draw, let's also store the *p-value* and whether or not the null hypothesis is rejected.

**Q1.7** Simulate 1,000 experiments. For each experiment, store its p-value and whether or not we would reject the null. Store the results in `experiments_table`.

In [None]:
#set clickthrough rate for the new design
#at first we'll set this to the null clickthrough rate
true_clickthrough_rate = null_clickthrough_rate

#number of experiments we'll simulate
exp_draws = 1000

#simulate experiments
simulated_data = simulate_n_trials(..., true_clickthrough_rate, n)

#initiate arrays to store p-value and whether null is rejected for each simulated experiment
p_values = make_array()
rejections = make_array()

#loop through elements of simulated_data
for simulated_val in simulated_data:
    #calculate p-value
    p_value = ...
    #add p_value to array
    p_values = np.append(p_values, p_value)
    #calculate indicator for whether we reject the null hypothesis
    rejection = ...
    #add rejection to array
    rejections = np.append(rejections, rejection)

#store results in table, one row per experiment
experiments_table = Table().with_columns("simulated_values", simulated_data, "p-values", p_values, "rejections", rejections) 
experiments_table

In [None]:
_ = ok.grade("q1_7")

Next we'll look at the distribution of experiment outcomes. First we'll look at the distribution of p-values across all of our simulated experiments.

**Q1.8** Produce a histogram of the p-values for the table above.

In [None]:
#plot histogram of p-values

Even though the null hypothesis is true, we get a variety of p-values. In fact, we may actually **reject** the null hypothesis some of the time.

**Q1.9** Across simulated experiments, how often do we reject the null hypothesis?

In [None]:
#calculate the rate that null hypothesis is rejected

Even when the null hypothesis is true, we will reject the null about **5%** of the time (your number may differ slightly).

### False-Positives

That value should sound familiar. We applied a 5% significance level, or set our cutoff for 'significance' at $p \le 0.05$. 

Recall that, in our example, the null hypothesis is in fact true. So if we decide to reject the null hypothesis and launch the design change, we are in fact making a *mistake*. We call this type of mistake a **false-positive** or a **Type I** error. Our experiment gave us a significant result, but it occurred by chance, and not because the new design actually pushed the clickthrough rate above 20%.

When we set our cutoff at $p = 0.05$ ('5% significance level'), we are setting a decision rule under which, if the null hypothesis is true, we expect to make this type of mistake at most about 5% of the time.

Depending on how costly a false-positive is, we may decide to set the cutoff at a higher or lower value.

## Section 2: False-Negatives

Suppose the true clickthrough rate is instead **25%**. In this case, the clickthrough rate is large enough that the *correct* decision is to launch the new design. If we fail to reject the null hypothesis and do not launch the new design, we are making a different kind of mistake. In this case, we call the mistake a **false-negative** or a **Type II** error. 

Even if the null hypothesis is false, the evidence may not be strong enough for us to (correctly) reject the null hypothesis. As you calculated above, we will reject the null hypothesis if the clickthrough rate we calculate is above **27%**. So if the true clickthrough rate is 25%, it's quite possible we will draw a sample that does not provide strong enough evidence to reject the null hypothesis.

The **power** of a statistical test is 1 minus the probability of a **false-negative**. In other words, it is the probability of correctly rejecting the null hypothesis when it is false. Note that the power depends on the specific value for the true clickthrough rate that we're considering. In this case, the value we're considering is 25%. An experiment with low power is one that's unlikely to produce enough evidence to reject the null hypothesis even if it's in fact false. In the context of our landing page, with a low-power experiment we will have trouble identifying new designs that are in fact effective, missing opportunities for improving clickthrough rates.
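
For intuition, both error rates can be computed exactly from the binomial distribution rather than by simulation. This is a rough sketch under stated assumptions, not the lab's method: the helper `binomial_upper_tail` and the variable names are hypothetical, and the rule "reject when at least 27 of 100 users sign up" is an assumption about how the cutoff itself is treated.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_users = 100
cutoff_count = 27  # assumption: reject when at least 27 of 100 users sign up

# Chance of rejecting when the null is true (false-positive rate)
false_positive_rate = binomial_upper_tail(cutoff_count, n_users, 0.20)

# Chance of rejecting when the true clickthrough rate is 25% (power)
power_at_25 = binomial_upper_tail(cutoff_count, n_users, 0.25)

print("False-positive rate:", round(false_positive_rate, 3))
print("Power at a 25% true rate:", round(power_at_25, 3))
print("False-negative rate at 25%:", round(1 - power_at_25, 3))
```

Whether a result exactly at the cutoff counts as a rejection shifts these numbers noticeably at this sample size, so simulated values can differ somewhat from the exact ones.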

**Q2.1** Let's calculate the power of our test. First, simulate the distribution of clickthrough rates you will observe if the true clickthrough rate is 25%. As above, simulate 1,000 experiments.

In [None]:
#set success rate for the new design
true_clickthrough_rate = 0.25

#simulate values for observed clickthrough rates using true_clickthrough_rate value
#remember: simulate for 1,000 experiments (use exp_draws value)
true_clickthrough_rates = ...
true_clickthrough_rates

**Q2.2** Plot histogram of simulated clickthrough rates.

In [None]:
#plot histogram of simulated clickthrough rates
...

#code below draws a vertical line at the cutoff for rejecting the null hypothesis
#note: `cutoff` is already defined above
plt.axvline(x= cutoff, color='red', lw = 0.5)

As you can see, for many simulated experiments, the clickthrough rate we will observe is below the cutoff for rejecting the null hypothesis.

**Q2.3** Below, calculate the rate at which you would reject the null hypothesis if the true clickthrough rate is in fact 25%.

In [None]:
#calculate rejection rates

You should find a value of about 30%. That means that, if our new landing page improves the clickthrough rate from 20% to 25%, there is about a **70%** chance of committing a Type II error: we fail to reject the null hypothesis and therefore do not adopt the new and improved design!

How can we reduce the Type II error rate? There are two approaches: (1) we can increase the p-value cutoff we use in rejecting the null hypothesis or (2) we can increase the sample size. Increasing the p-value cutoff increases the power of our test by reducing our threshold for rejecting the null hypothesis. Increasing the sample size helps both by reducing our threshold for rejecting the null hypothesis (for a fixed p-value cutoff) and by making it more likely that the clickthrough rate we observe in the experiment is close to the true clickthrough rate.


Both involve tradeoffs. Increasing the significance level of our test will increase our chances of getting a false positive. Increasing our sample size is usually costly--in this case, it will take longer to accumulate enough users for inclusion in the experiment.

Let's investigate how the statistical power of our test changes with the sample size.

**Q2.4** Calculate the rate at which you would reject the null hypothesis for a range of sample sizes. We suggest looking at sample sizes ranging from 20 to 1000 and have provided code for generating an array with this range, but you are welcome to change those values.

In [None]:
#first, we create an array of sample size values that you will loop through
#you can change these values if you would like

#set number of sample sizes to try
num_sizes = 20

#set minimum and maximum sample size
n_min = 20
n_max = 1000

#generate sequence of sample sizes ranging from n_min to n_max
row_num = np.arange(0, num_sizes)

#note: np.floor rounds down, and astype(int) converts the result to integer sample sizes
sample_sizes = np.floor(n_min + ((n_max - n_min)/(num_sizes - 1))*row_num).astype(int)
sample_sizes

In [None]:
#calculate null rejection rates for various sample sizes

#create array of rejection rates for storage
reject_rates = make_array()

#create array of cutoff values for storage
cutoff_values = make_array()

#loop through sample sizes
for n in sample_sizes:

    #simulate empirical distribution for observed clickthrough rates *under the null hypothesis*
    model_clickthrough_rates = ...
    
    #calculate cutoff for this sample size
    cutoff_n = ...
    
    #store cutoff values for rejecting null hypothesis
    cutoff_values = np.append(cutoff_values, cutoff_n)

    #simulate empirical distribution for observed clickthrough rates using true_clickthrough_rate value
    #note: as above, simulate 1,000 experiments (use exp_draws value)
    true_clickthrough_rates = ...
    
    #calculate the rate at which the null hypothesis is rejected
    reject_rate = ...

    #store reject rate for given sample size
    reject_rates = np.append(reject_rates, reject_rate)

reject_rates

Note that the cutoff values are *decreasing* in sample size. With a bigger sample, we don't need to observe as large a clickthrough rate to conclude that the null hypothesis (that the true clickthrough rate is 20%) is unlikely to be true.
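
One way to see why the cutoff shrinks is the normal approximation to the sampling distribution of a proportion, which is a swapped-in technique here and not the simulation the lab uses. Under that approximation the 5% cutoff is roughly the null rate plus 1.645 standard errors, and the standard error falls like $1/\sqrt{n}$. The function `approx_cutoff` and the variable names below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

null_rate = 0.2
z_95 = NormalDist().inv_cdf(0.95)  # about 1.645

def approx_cutoff(n, p0=null_rate):
    """Normal-approximation cutoff for a one-sided test at the 5% level."""
    standard_error = sqrt(p0 * (1 - p0) / n)
    return p0 + z_95 * standard_error

# The approximate cutoff moves toward the null rate as the sample grows
for n_users in (20, 100, 400, 1000):
    print(n_users, round(approx_cutoff(n_users), 3))
```

The approximation is rough for small samples, so expect the simulated cutoffs to differ from these values, especially near the low end of the range.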

In [None]:
#run this to see the cutoff values
cutoff_values

**Q2.5** Run the cell below to plot statistical **power** as a function of sample size (when the true clickthrough rate is 25%).

In [None]:
#plot rejection rates as function of sample size

plt.scatter(sample_sizes, reject_rates)
#plt.ylim(bottom=0.0, top=1.0)
plt.xlabel('Sample Size')
plt.ylabel('Test Power (Null Rejection Rate)')
plt.title('Sample Size and Rejection Rates')

**Q2.6** Describe the pattern you see. How does the power of the test change with the sample size?

*Write answer here*

**Q2.7** Next, let's see how statistical power varies with the true clickthrough rate. As before, fix the sample size at 100. In the code provided below, we allow the true clickthrough rate to range from 21% to 35%.

In [None]:
#first, we create an array of true clickthrough rate values that you will loop through
#you can change these values if you would like

#revert to original sample size
n = 100

#set number of true clickthrough rate values to try
num_true_rates = 15

#set minimum and maximum true clickthrough rate
r_min = 0.21
r_max = 0.35

#generate sequence of true clickthrough rates ranging from r_min to r_max
row_num = np.arange(0, num_true_rates)
true_rates = r_min + ((r_max - r_min)/(num_true_rates - 1))*row_num

In [None]:
#create array of rejection rates for storage
reject_rates = make_array()

#loop through values of true_clickthrough_rate
for r in true_rates:
    #simulate empirical distribution for observed clickthrough rate using given value from true_rates array
    #note: as above, simulate 1,000 experiments (use exp_draws value)
    true_clickthrough_rates = ...
    
    #calculate rejection rate for given value from true_rates array
    #note: can use previously defined `cutoff`, which was measured for n = 100
    reject_rate = ...

    #store reject rate for given true clickthrough rate
    reject_rates = np.append(reject_rates, reject_rate)

**Q2.8** Run the cell below to plot statistical power as a function of the true clickthrough rate.

In [None]:
#plot rejection rates as function of true clickthrough rate
plt.scatter(true_rates, reject_rates)
plt.xlabel('True Clickthrough Rate')
plt.ylabel('Test Power (Null Rejection Rate)')
plt.title('True Clickthrough Rate and Rejection Rates')

**Q2.9** Describe the pattern you see. How does the power of the test change with the true clickthrough rate?

*Write answer here*

In conclusion: when designing an experiment, it is important to consider the magnitude of the treatment effect you would like to be able to detect, and to choose a sample size accordingly. It is standard to aim for a sample size that provides 80% power for a target effect size. In this case, if that target effect size is in fact the size of the true treatment effect, you will (correctly) reject the null hypothesis of no treatment effect *80% of the time.* From **Q2.4**, you should see that we need a sample size of about 400 to have 80% power if the true clickthrough rate is 0.25. You can measure this for other values of the true clickthrough rate by changing the value for `true_clickthrough_rate`.
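
As a rough cross-check on the sample size read off the plot, a standard normal-approximation formula for a one-sided, one-sample test of a proportion can be sketched as follows. This is an approximation under stated assumptions and not the simulation approach used in this lab; the function `approx_sample_size` and its parameter names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def approx_sample_size(p0, p1, alpha=0.05, power=0.80):
    """Approximate n for a one-sided test of H0: rate = p0 against a true rate p1 > p0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)      # quantile for the target power
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_power * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)

# Benchmark rate of 20% against a true rate of 25%
print("Approximate sample size for 80% power:", approx_sample_size(0.20, 0.25))
```

Larger target effects need far fewer users: calling `approx_sample_size(0.20, 0.30)` returns a much smaller number than the call above.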

----
Congratulations, you've finished Lab 3! To submit the lab, run the two cells below:

In [None]:
# For your convenience, you can run this cell to run all the tests at once
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

In [None]:
_ = ok.submit()