# Designing Randomized Control Experiments

## Introduction
---

The focus of this lab is on choosing the size or length of an experiment. This stage of experiment design is called a **power analysis**. In determining the scale of an experiment, we face a key tradeoff: larger samples give us more precise and definitive answers, which help us make decisions, but at some cost (e.g. opportunity cost, time).

In this lab, rather than design an experiment from scratch, we will take an existing experiment--a pricing experiment conducted by the job matching platform ZipRecruiter--and consider how things might have gone differently if ZipRecruiter had made different choices when designing the experiment. This will illustrate the importance of carefully considering the tradeoffs associated with experiment scale up front.

**This week:** We discuss experiment design under practical constraints, including situations where it is difficult to force people into or exclude people from the treatment and how to determine the size or duration of an experiment.

**Next week:** We will talk about what to do when we cannot or have not run any experiment at all. We will discuss how to identify "natural" quasi-experiments that can approximate the virtues of an RCT.

## Background: A Pricing Experiment at ZipRecruiter
---

In August 2015, ZipRecruiter was the fastest-growing HR company in the United States. The platform matched recruiters with qualified prospective employees using a data-driven algorithm. Job applicants could upload their resumés and apply for jobs for free. Business customers paid a monthly fee to gain access to the resumés of matched applicants. Over 40,000 registered companies posted jobs on the platform per month.

ZipRecruiter’s largest business segment consisted of “starter” firms, small companies with fewer than 5 employees. This segment represented nearly 50% of its business customer base. Encouraged by its growth, in early 2015, ZipRecruiter raised its base monthly price in the starter segment to \\$99 per month. As both the customer base and revenues continued to grow, ZipRecruiter took more interest in its pricing.

In early August 2015, the ZipRecruiter management team met with two marketing professors from the University of Chicago Booth School of Business, Jean-Pierre Dubé and Sanjog Misra, to discuss pricing. They designed and implemented a pricing experiment focused exclusively on new businesses in the starter segment that were reaching ZipRecruiter’s paywall for the first time.

In order to obtain a price quote from ZipRecruiter, a firm must first register for ZipRecruiter’s services by navigating a series of pages on the ziprecruiter.com website until it reaches the paywall. At the paywall, the new firm is issued a price quote. It must then use a credit card to pay the subscription fee in order to gain access to ZipRecruiter’s services and network of applicants and resumés.

The experiment was conducted between August 28, 2015 and September 29, 2015. During this period, 7,867 unique prospective starter customers reached ZipRecruiter’s paywall. Each prospective customer was randomly assigned to one of ten prices, ranging from \\$19 to \\$399 per month. One of the price points included in the experiment was ZipRecruiter’s standard \\$99 monthly rate, which we can think of as the control group. About 800 customers were quoted each price.

In this lab we will analyze data from this experiment. We will also consider the following thought experiment: what would have happened if ZipRecruiter had run a smaller experiment with fewer customers and fewer price options? This thought experiment will clarify the importance of power analyses and thinking strategically when designing an experiment.

### Table of Contents

1. [Opening the data](#openingdata) <br>
2. [Analyzing the experiment](#analyzingexperiment) <br>
3. [What if ZipRecruiter had run a smaller experiment?](#whatif) <br>
4. [Conducting a power analysis](#poweranalysis) <br>
5. [Introducing the `power.t.test` function](#function) <br>
6. [Power calculations for a range of sample sizes](#powerbysample) <br>
7. [Power calculations for a range of effect sizes](#powerbyeffect) <br>

### Opening the data  <a id='openingdata'></a>

We will use data on prospective starter customers that were included in the experiment. Each row represents a prospective customer. The data include three columns: `price`, `paid`, and `revenue`.

`price`: the price offered to the customer

`paid`: an indicator for whether the customer signed up for the service

`revenue`: monthly revenue associated with the customer (`revenue` = `price`*`paid`)

In [None]:
#load libraries

library(dplyr)
library(purrr)
library(ggplot2)
library(estimatr)

In [None]:
# Load the data
ZipRecruiter <- read.csv("Ziprecruiter.csv")

head(ZipRecruiter)

### Analyzing the experiment <a id='analyzingexperiment'></a>

Let's begin by evaluating the original experiment. There are two outcomes we will look at: conversion rates (`paid`) and revenue per prospective customer (`revenue`). We can evaluate the experiment using a regression model, where we regress the outcome on indicator variables for each value of `price`.

We will start by looking at the conversion rate.

   <span style="color:red">**Warning:** </span> we must write `as.factor(price)`, not `price`, to get indicator variables for each value of `price`.
- `as.factor(price)` is interpreted by R as a series of indicator variables, one for each price level. We will use the `ref` argument to set the reference level (i.e., the omitted price).
- `price` on its own would be interpreted by R as the original continuous variable, the price faced by the customer.
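
The difference between the two specifications can be seen on a small simulated data set. This is a minimal sketch: the data frame `toy`, its three price levels, and the conversion probabilities are made up for illustration and are not the ZipRecruiter data. It uses base R's `lm` and `relevel`, which is one way (assumed here, not necessarily the lab's intended solution) to set the reference level through the `ref` argument.

```r
set.seed(1)

# Hypothetical toy data: 3 price levels, 200 simulated customers each
toy <- data.frame(price = rep(c(19, 99, 249), each = 200))

# Made-up conversion probabilities that fall as price rises
conv_prob <- ifelse(toy$price == 19, 0.4, ifelse(toy$price == 99, 0.3, 0.2))
toy$paid <- rbinom(nrow(toy), size = 1, prob = conv_prob)

# Indicator variables for each price, with 99 as the omitted (reference) level
toy$price_f <- relevel(as.factor(toy$price), ref = "99")
factor_model <- lm(paid ~ price_f, data = toy)

# Price as a single continuous regressor: one slope instead of one
# coefficient per price level
linear_model <- lm(paid ~ price, data = toy)

names(coef(factor_model))
names(coef(linear_model))
```

In the factor specification the intercept is the mean conversion rate in the reference group, and each remaining coefficient is the difference between one price level and that reference.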

In [None]:
#regression model for conversion rates
#note: set price = 99 as reference level
conversion_model <-
summary(conversion_model)

Of course, conversion rates are decreasing in price. But conversion rates are perhaps surprisingly high even at the highest price point.

It is perhaps easier to see the results in a bar chart.

In [None]:
#store summary statistics by price from the experiment
#note: `price` is kept automatically because it is the grouping variable
results <- ZipRecruiter %>% 
    group_by(price) %>%
    summarize(rate = mean(paid),
              rate_se = sd(paid)/sqrt(n()),
              avg_revenue = mean(revenue),
              revenue_se = sd(revenue)/sqrt(n())) %>%
    mutate(control = ifelse(price == 99, 1, 0))

#what does this new table look like?
head(results)

In [None]:
#plot conversion rates by price
ggplot(results, aes(x = price, y = rate, fill = as.factor(control))) +
    geom_bar(stat = "identity", position = "dodge") +
    geom_errorbar(aes(ymin = rate - 1.96*rate_se, ymax = rate + 1.96*rate_se), width = 10) +
  xlab('Price') + ylab("Conversion Rate") + theme(legend.position="none")


Given that conversion rates don't tank completely at higher price levels, ZipRecruiter may actually make more money by charging a higher price than \\$99. Let's look at a similar regression for revenue per lead.

In [None]:
#regression model for revenue per lead
#note: set price = 99 as reference level
revenue_model <- 

summary(revenue_model)

Again we can look at the results in bar chart form:

In [None]:
#plot revenue per lead by price
ggplot(results, aes(x = price, y = avg_revenue, fill = as.factor(control))) +
    geom_bar(stat = "identity", position = "dodge") +
    geom_errorbar(aes(ymin = avg_revenue - 1.96*revenue_se, ymax = avg_revenue + 1.96*revenue_se), width = 10) +
  xlab('Price') + ylab("Revenue per Lead") + theme(legend.position="none")


Despite the fact that conversion rates are decreasing in price, revenue per lead is actually *increasing* in price, at least up to \\$249 or so. That much is clear, even given the uncertainty reflected in confidence intervals. 

**ZipRecruiter ultimately chose \\$249 as its price, 3 days after the experiment ended.**

### What if ZipRecruiter had run a smaller experiment? <a id='whatif'></a>

Consider this thought experiment: what if ZipRecruiter had decided to conduct a smaller experiment? What would they have concluded if they had taken a smaller sample and just looked at two prices, say \\$99 and \\$249? Would they have come to the same conclusion and increased their price? Or might they have made a mistake and come to a different conclusion?

Suppose that only 20% as many customers were assigned to prices of \\$99 and \\$249. That's about 160 (= 0.2 * 800) customers in each group.

The cell below draws a random sample of observations from the original experiment with 20% as many observations as the original experiment. You can run the cell below a few times to see how your answer changes.

In [None]:
#what are results using 20% sample?
sample <- ZipRecruiter %>%
filter(...) %>%
slice_sample(prop = 0.20, replace = TRUE)

#re-estimate regression using sample
sample_model <- 

summary(sample_model)

You should see that, from iteration to iteration, the treatment effect estimate can vary quite a bit. And certainly the t-statistic is smaller than in the full experiment. In some samples, we may not even get a statistically significant effect. This is the cost of running a smaller experiment--more noise in our estimates.

To look at this pattern more systematically, let's draw 1,000 random samples and look at how much t-statistics can vary and how often we would (correctly) reject the null hypothesis that prices of \\$99 and \\$249 lead to the same revenue per lead.

In [None]:
#create data frame to store estimates
estimates = data.frame()

#loop through the following code 1000 times
for (i in 1:1000) {

    #draw sample
    sample <- 

    #estimate model
    sample_model <- 

    #store coefficients
    #store coefficient estimate and t-statistic from the same sample model
    output = c(summary(sample_model)$coefficients[2,1], summary(sample_model)$coefficients[2,3])

    #append coefficients to data frame
    estimates = rbind(estimates, output)
}

#label column names
colnames(estimates)<-c("estimate", "tstat")

#what does the end result look like?
head(estimates)

For each new sample we draw, we run another hypothesis test and get another t-statistic. Here's what the distribution of those t-statistics looks like:

In [None]:
#histogram of t-statistics
ggplot(estimates, aes(x=tstat)) + geom_histogram() + 
    theme_classic() + ggtitle("Histogram of t-Statistics") + 
    geom_vline(xintercept = 1.96, color="red") + 
    xlab("t-Stat") + ylab("Frequency")

Finally, let's calculate how often we (correctly) reject the null hypothesis.

In [None]:
#calculate null rejection rate

We would only correctly reject the null about 60% of the time. That means that, if we had run this smaller experiment instead, we would have drawn the wrong conclusion 40% of the time and left a sizeable amount of revenue on the table.
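
The resampling exercise above can be sketched end to end on simulated data. This is only an illustration under stated assumptions: the data frame `sim`, its group means, and the standard deviation of 72 are made up to mimic an effect of roughly 0.25 SD, and `rnorm` stands in for the real revenue distribution. The lab's own cells draw from the actual experiment instead, so the rate printed here need not match the 60% figure.

```r
set.seed(42)

# Hypothetical stand-in for the two price groups (800 simulated customers each).
# Means and SD are assumptions chosen to give an effect near 18/72 = 0.25 SD.
sim <- data.frame(
  treated = rep(c(0, 1), each = 800),
  revenue = c(rnorm(800, mean = 30, sd = 72), rnorm(800, mean = 48, sd = 72))
)

n_draws <- 1000
tstats <- numeric(n_draws)

for (i in seq_len(n_draws)) {
  # Draw a 20% sample with replacement (320 of the 1600 rows)
  draw <- sim[sample(nrow(sim), size = 320, replace = TRUE), ]
  # Regress revenue on the treatment indicator and keep the t-statistic
  fit <- lm(revenue ~ treated, data = draw)
  tstats[i] <- summary(fit)$coefficients[2, 3]
}

# Share of samples in which the null of no difference is rejected at the 5% level
rejection_rate <- mean(abs(tstats) > 1.96)
rejection_rate
```

Preallocating `tstats` and filling it by index avoids growing a data frame with `rbind` inside the loop, which is a design choice of this sketch rather than a requirement.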


What would happen if we had chosen another price that was even closer to the status quo price of \\$99? You can see this by looking at the results for \\$79. Its conversion rate and revenue per prospective customer are essentially indistinguishable from those for \\$99, even using all the experiment data. If we had just run an experiment with \\$99 and \\$79, we would have needed a bigger experiment to determine whether that was a good idea.

### Conducting a power analysis <a id='poweranalysis'></a>

Larger sample sizes mean more precise estimates of treatment effects.

- But it can take a lot of resources to run a large experiment.

What sample size should we choose? We can answer this question with *power calculations*.

The *power* of a statistical test is **the probability of finding a statistically significant difference** when there truly is an effect of a given size. 

We often measure effect size in **standard deviations**. Let's calculate the standard deviation of `revenue` in the experiment when the price is \\$99 or \\$249.



In [None]:
#calculate standard deviation of `revenue` if price is $99 or $249

   - The standard deviation of our outcomes is about **\\$72**.
  
   - We know from the original experiment that charging \\$249 increases revenue per lead by about **\\$18.**
  
   - Thus the effect size in standard deviations (sd) is 18/72 = **0.25 SD**.
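
The arithmetic behind the standardized effect size is a one-liner. The sketch below uses the rounded figures quoted above; the variable names are chosen here for illustration.

```r
# Rounded figures quoted in the text
revenue_sd <- 72   # standard deviation of revenue per lead, in dollars
raw_effect <- 18   # revenue gain per lead from charging $249 instead of $99

# Standardized effect size: raw effect divided by the outcome's standard deviation
effect_size_sd <- raw_effect / revenue_sd
effect_size_sd
```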

### Introducing the `power.t.test` function <a id='function'></a>
    
How well powered was our n = 320 (160 per group) experiment to detect a 0.25 SD effect at the 0.05 significance level? 
    
  - We will use the `power.t.test` function for our power calculations.

In [None]:
 # Arguments:
  # delta = effect size (sd=1 means the outcome is in standard deviations).
    # our effect size is 0.25.
  # sig.level is significance level: we choose a 0.05 significance threshold.
  # n is our sample size (per group).

power.t.test(delta=...,sd=...,sig.level=...,n=...)

Our 320-customer sample was only powered at the 0.61 level. This is close to what we got in the simulation above.

- This means that a significant relationship would only be detected 61% of the time. The other 39% would be **false negatives**.

What about the original 800 per group experiment?

In [None]:
power.t.test(delta=...,sd=...,sig.level=...,n =...)

Now our power is 0.9988. This is potentially overkill.
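
For reference, the two calculations can be written out with the values stated in the text (a 0.25 SD effect, a 0.05 significance level, and 160 or 800 customers per group). This is one way to make the calls, with `type = "two.sample"` spelled out even though it is the function's default.

```r
# Power of the smaller experiment: 160 customers per group
power_small <- power.t.test(n = 160, delta = 0.25, sd = 1,
                            sig.level = 0.05, type = "two.sample")$power

# Power of the original experiment: about 800 customers per group
power_full <- power.t.test(n = 800, delta = 0.25, sd = 1,
                           sig.level = 0.05, type = "two.sample")$power

round(c(small = power_small, full = power_full), 4)
```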

### Power calculations for a range of sample sizes <a id='powerbysample'></a>

So far:
 - We calculated the **power** given an effect size, significance level, and a sample size. 

But `power.t.test` is more flexible than that. If you feed `power.t.test` **three** of the following inputs, it will give you the fourth.

 - **Sample size**: number of observations per experimental group
 - **Significance level** (we typically choose p < 0.05 or 0.01)
 - **Power** (we typically choose power = 0.8 or 0.9)
 - **Effect size**. Setting SD = 1 means the effect size will be measured in standard deviations
     
For example: 
- **Input** significance level, power, and effect size. `power.t.test` outputs **sample size**.

- **Input** significance level, sample size, effect size. `power.t.test` outputs **power**.
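
As a sketch of the first case, leaving `n` unspecified and supplying `power` asks the function to solve for the per-group sample size. The 0.25 SD effect is the value from the thought experiment above, and the 0.8 power target is one of the conventional choices listed.

```r
# Solve for the sample size per group needed to detect a 0.25 SD effect
# with 80% power at the 5% significance level (n is left out on purpose)
needed <- power.t.test(delta = 0.25, sd = 1, sig.level = 0.05,
                       power = 0.8, type = "two.sample")

# Round up, since we cannot recruit a fraction of a customer
ceiling(needed$n)
```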

It's often useful to look at the power across a range of sample sizes.

- As an example, let's consider sample sizes per price ranging from 100 to 3000 prospective customers for an effect size of 0.1 standard deviations.

In [None]:
# This code generates a list from 100 to 3000 counting by 50.
samplesizes <- seq(from=100,to=3000,by=50)

# Now we calculate power for each value on the list.
power.samplesizes <- power.t.test(n=samplesizes,delta=0.1,sd=1,sig.level=0.05,type="two.sample")$power

# And plot the results:
plot(samplesizes,
     power.samplesizes,
     xlim=c(0,3000),
     xlab="Sample size",
     ylab="Expected power: 0.1 sd effect",
     ylim=c(0,1),
     col="blue")

So, if we want to achieve power of 0.8 or more, we will need at least ~1600 participants per group!

What if we thought the effect size was **0.3 standard deviations**?

In [None]:
power.samplesizes <- power.t.test(n=...,delta=...,sd=...,sig.level=...,type="two.sample")$power

plot(samplesizes,
     power.samplesizes,
     xlim=c(0,3000),
     xlab="Sample size",
     ylab="Expected power: 0.3 sd effect",
     ylim=c(0,1),
     col="red")

We would need only ~200 people per group to achieve our desired statistical power!

**Challenge**: What effect size do we use?

(Note: if we already knew this, we wouldn't need to run an experiment!)

- Use smallest effect that would pass cost-benefit assessment or some other benchmark
- Or: use informed guess based on context or prior findings

### Power calculations for a range of effect sizes <a id='powerbyeffect'></a>

We can also think about what effect size we would be able to reliably detect for a given sample size. This is relevant for when you're deciding what is worth testing in the first place. For a relatively small experiment, a subtle intervention or small tweak may have such a small expected effect size that it's not worth testing.

Here we'll set a sample size of 800 per group. We'll consider effect sizes ranging from 0.01 to 0.3 SDs in increments of 0.01 SDs.

In [None]:
# This code generates a list from 0.01 to 0.3 counting by 0.01.
effectsizes <- seq(from=0.01,to=0.3,by=0.01)

# Now we calculate power for each effect size value on the list.
power.effectsizes <- power.t.test(n= ...,delta=...,sd=...,sig.level=...,type="two.sample")$power

# And plot the results:
plot(effectsizes,
     power.effectsizes,
     xlim=c(0,0.3),
     xlab="Effect size",
     ylab="Expected power: sample of 800",
     ylim=c(0,1),
     col="blue")

For a sample size of 800 per group, an effect needs to be roughly 0.14 to 0.15 SDs or larger for power to exceed 0.8. With this sample size, we should reconsider testing interventions whose plausible effects are smaller than that.
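
The same threshold can be found without reading it off the plot, by leaving `delta` unspecified and asking for a target power. This is a sketch assuming the conventional 0.8 power target; the result is the smallest effect size, in standard deviations, that an experiment with 800 customers per group would detect with that power.

```r
# Solve for the minimum detectable effect (in SDs) at 80% power
# with 800 customers per group and a 5% significance level
mde <- power.t.test(n = 800, sd = 1, sig.level = 0.05,
                    power = 0.8, type = "two.sample")$delta

round(mde, 2)
```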