# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109A Introduction to Data Science: 

## Homework 8: AB Testing


**Harvard University**<br/>
**Fall 2019**<br/>
**Instructors**: Pavlos Protopapas, Kevin Rader and Chris Tanner

<hr style="height:2pt">



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels as sm
import scipy as sp


### Problem 1: Amazon expenditures

Amazon is considering updating their app.  They decide to perform an AB test to determine whether the new version of their app will lead to more expenditures per customer.  Unbeknownst to them, each customer is randomly assigned to either the `old` or `new` version of the app and will interact with the app for an entire month in this selected format.  Total expenditure is measured at the end of the month.

The data set ``amazon.csv`` contains a subset of data collected for this AB test.  The variables measured are:

- **expenditure**: the amount spent by the customer that month, in dollars
- **treatment**: either 'new' or 'old' representing the version of the app
- **prime**: a binary indiciator for whether or not the customer is an Amaozn Prime member.

**Note**: do not do any transformations on the data for this problem.

**1.1** Perform a formal hypothesis test to determine whether the new app has a different average monthly expenditure per customer than the old version of the app.  Report the test statistic and the conclusion from the test in the separate quiz for this assignment.  

**1.2** Perform a formal hypothesis test to determine whether the new app has a different average monthly expenditure per customer than the old version of the app within the 
Prime customers subgroup.  Report the test statistic in the separate quiz for this assignment. 

**1.3** Perform a formal hypothesis test to determine whether the new app has a different average monthly expenditure per customer than the old version of the app within the 
non-Prime customers subgroup.  Report the test statistic in the separate quiz for this assignment. 

**1.4** Provide one glaring way to improve this study based on the results seen above.  This explanation only needs 1 or 2 sentences of justification.

---

### Problem 2: Obama 2008 Data (click-through rate)

In this problem, we will step you through analyzing the data we saw in lecture 23 regarding the Obama 2008 data set.

**2.1** Perform a formal hypothesis test to determine whether there is a difference in sign-up rates across the 4 button choices.  Report the test statistic and the conclusion from the test in the separate quiz for this assignment.  The sign-up rates were:

- `sign up`: $5851/77858 = 0.07515$ 
- `sign up now`: $5660/77151 = 0.07336$ 
- `join us now`: $5915/77644 = 0.07618$
- `learn= more`: $6927/77729 = 0.08912$

**2.2** Perform a formal hypothesis test to determine whether the ``learn more`` button has a different true sign up rate than the default ``sign up`` button that was originally used.  Report the test statistic in the separate quiz for this assignment.


---

### Problem 3: Multiple Comparisons Simulation

As mentioned in class, a sequential randomized design is prone to the issue of multiple comparisons.  This problem steps you through determining this and correcting it 2 different ways.  Imagine the following AB test with a binary outcome and 2 treatment groups: after a burn-in period of 1200 subjects (600 in each arm), the appropriate $z$-test for proportions is performed to compare the two groups after every 200 subjects until up to 3000 total subjects are enrolled (so at $n=1200,1400,...,3000$ with exactly half in each group).  If either treatment is deemed statistically significantly better than the other (at a two-sided $\alpha=0.05$ level), than the study is stopped. 

Assume that the null hypothesis is true (there is no actual effect of treatment) and the rate of success is truly 20% in both groups (you can think of the response variable as the click-through rate and the treatment groups represent two different website layouts).

**3.1** Determine the overall Type I error under these conditions when each test is performed at the $\alpha=0.05$ level.  Perform 2,000 simulation replicates.  Report this overall Type I error rate in the separate quiz.

**3.2** Determine a single $\alpha$ such that the overall Type I error will be 0.05.  Report this adjusted $\alpha$ rate in the separate quiz.

**3.3** An alternative approach to adjusting $\alpha$ for multiple comparisons is to vary the $\alpha$ of each test as the study progresses.  One common approach to doing this is to increase the $\alpha$ by a specific amount each time a test is run.  So for example, the first test could be run at $\alpha_1=0.001$, the second at $\alpha_2=0.002$, the third at $\alpha_3=0.003$, $\dots$, the tenth $\alpha_{10}=0.010$ (these are not necessarily the values needed to fix overall Type I error rate at 0.05).  Provide an advantage to each of the two choices for adjusting $\alpha$ as described here and in **3.2**. These can be short 1-2 sentence explanations for each.



In [None]:
# Type I error rate: False Positive
# Type II error rate: false negative

In [15]:
import statsmodels.stats.weightstats as sms

In [18]:
from statsmodels.stats.proportion import proportions_ztest

In [64]:
right_survive

239836

In [94]:
from random import choices
population=[0,1]
weights=[0.8,0.2]

right_survive = []
left_survive = []
right_tot = 0
left_tot = 0
pval_list = []

for j in range(0,2000):
    right_survive = []
    left_survive = []
    right_tot = 0
    left_tot = 0

    z=0
    y=1
    x=0
    arm = right_tot
    while x < 2401:
        x+=1
        if x<600:
            right_tot+=1
            left_tot+=1
            right_survive.append(choices(population, weights)[0])
            left_survive.append(choices(population, weights)[0])
        if x == 600:
            right_tot+=1
            left_tot+=1
            right_survive.append(choices(population, weights)[0])
            left_survive.append(choices(population, weights)[0])
            stat, pval =proportions_ztest([sum(left_survive),sum(right_survive)],[left_tot,right_tot])
            if pval<0.05:
                z=1
                pass
        
        if x > 600:
            for survive, tot in zip([left_survive, right_survive],[left_tot, right_tot]):
                y=1
                while y == 1:
                    x+=1
                    tot+=1
                    surv = choices(population, weights)[0]
                    survive.append(surv)
                    if x %200 == 0:
                        stat, pval =proportions_ztest([sum(left_survive),sum(right_survive)],[left_tot,right_tot])
                        if pval<0.05:
                            z=1
                            pass

                    y = surv                
                
    pval_list.append(z)
    

In [95]:
sum(pval_list)/len(pval_list)

0.2265

In [96]:
len(pval_list)

2000

In [145]:
from random import choices
population=[0,1]
weights=[0.8,0.2]

right_survive = []
left_survive = []
right_tot = 0
left_tot = 0
pval_list2 = []

for j in range(0,2000):
    right_survive = []
    left_survive = []
    right_tot = 0
    left_tot = 0

    z=1
    y=1
    x=0
    arm = right_tot
    while x < 3001:
        x+=1
        if x<600:
            right_tot+=1
            left_tot+=1
            right_survive.append(choices(population, weights)[0])
            left_survive.append(choices(population, weights)[0])
        if x == 600:
            right_tot+=1
            left_tot+=1
            right_survive.append(choices(population, weights)[0])
            left_survive.append(choices(population, weights)[0])
            stat, pval =proportions_ztest([sum(left_survive),sum(right_survive)],[left_tot,right_tot])
            z=pval
        
        if x > 600:
            y=1
            while y == 1:
                left_tot+=1
                surv = choices(population, weights)[0]
                left_survive.append(surv)
                if x %200 == 0:
                    stat, pval =proportions_ztest([sum(left_survive),sum(right_survive)],[left_tot,right_tot])
                    z=min([z, pval])
                
                x+=1
                y = surv                

            y=1
            while y == 1:
                right_tot+=1
                surv = choices(population, weights)[0]
                right_survive.append(surv)
                if x %200 == 0:
                    stat, pval =proportions_ztest([sum(left_survive),sum(right_survive)],[left_tot,right_tot])
                    z=min([z, pval])
                
                x+=1
                y = surv                
    pval_list2.append(z)

    
    

In [146]:
len(pval_list2)*0.05

100.0

In [148]:
pval_list2_sort=sorted(pval_list2)
pval_list2_sort[100]

0.017684036050601076

3.3 An alternative approach to adjusting  ùõº
Œ±
  for multiple comparisons is to vary the  ùõº
Œ±
  of each test as the study progresses. One common approach to doing this is to increase the  ùõº
Œ±
  by a specific amount each time a test is run. So for example, the first test could be run at  ùõº1=0.001
Œ±
1
=
0.001
 , the second at  ùõº2=0.002
Œ±
2
=
0.002
 , the third at  ùõº3=0.003
Œ±
3
=
0.003
 ,  ‚Ä¶
‚Ä¶
 , the tenth  ùõº10=0.010
Œ±
10
=
0.010
  (these are not necessarily the values needed to fix overall Type I error rate at 0.05). Provide an advantage to each of the two choices for adjusting  ùõº
Œ±
  as described here and in 3.2. These can be short 1-2 sentence explanations for each.

**3.3 Answer:**
The advantage of using an increasing alpha value would be the better opportunity to stop the experiment earlier, which would be important when testing in medicine. The advantage for the same alpha value could be computational ease, which would be important for very large datasets. 