This example demonstrates how to use the randomization inference tools to analyze the results of a two-variant A/B test where the primary metric is continuous (e.g. dollars spent). We'll first look at an example where the distribution of the metric is well behaved in both the control and treatment groups. Here, conventional techniques like a t-test are sufficient, so we'll compare the t-test results to the randomization inference results.

In a second example, we'll look at data from an experiment where the outcome metric is zero-inflated, i.e. heavily skewed by a large share of zeros. In this case, the usual t-test approach is not appropriate. Instead, we'll use the randomization inference tools to look at the experiment results without needing to make any parametric assumptions about our data.
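
To make the idea concrete before we load the tools, here is a minimal sketch of randomization inference for a difference in means, written with plain NumPy. It is not the implementation in skewed_metric_utils; the function name `permutation_test_diff_in_means`, the number of permutations, and the toy zero-inflated data (10% vs. 15% of units spending anything, exponential spend) are all assumptions made for illustration.

```python
import numpy as np


def permutation_test_diff_in_means(control, treatment, n_permutations=2000, seed=0):
    """Two-sided randomization (permutation) test for a difference in means.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    observed = treatment.mean() - control.mean()

    # Under the sharp null of no effect, group labels are exchangeable,
    # so we re-deal the pooled outcomes into groups of the original sizes.
    pooled = np.concatenate([control, treatment])
    n_control = control.size
    null_diffs = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(pooled)
        null_diffs[i] = shuffled[n_control:].mean() - shuffled[:n_control].mean()

    # Share of shuffled differences at least as extreme as the observed one.
    # The add-one correction keeps the p-value strictly positive.
    n_extreme = np.sum(np.abs(null_diffs) >= abs(observed))
    p_value = (1 + n_extreme) / (n_permutations + 1)
    return observed, p_value


# Toy zero-inflated data (assumed values): most units spend nothing.
rng = np.random.default_rng(42)
n = 2000
control = rng.binomial(1, 0.10, size=n) * rng.exponential(100.0, size=n)
treatment = rng.binomial(1, 0.15, size=n) * rng.exponential(100.0, size=n)

observed_diff, p_value = permutation_test_diff_in_means(control, treatment)
print(f"share of zeros in control: {np.mean(control == 0):.2f}")
print(f"observed difference in means: {observed_diff:.2f}")
print(f"permutation p-value: {p_value:.4f}")
```

The only assumption the test relies on is that assignment was random, which is why it remains usable when the outcome distribution is far from normal.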

In [1]:
import sys
import pandas as pd

In [2]:
# For now, this is a simple way to import the tools we need.
# Be sure to update the path to point to where the relevant code is stored on your machine
# TBD on eventually making this importable
sys.path.append('../../../ab_testing_utils/')

In [4]:
import skewed_metric_utils
import sim_utils

### Conventional example

First, we need to simulate some data.  We can use some of the pre-built tools in sim_utils.  

Here, we'll assume that we're looking at the results of a two-variant A/B test. Furthermore, we'll assume that:

1. The experiment took place over the course of 4 weeks (4 * 7 = 28 days)
2. On average, we had about 10,000 eligible units per day that could be assigned to either treatment or control
3. The assignment to treatment or control is even, i.e. the probability of being assigned to the control group is 50%
4. Our primary metric here is dollars spent per unit
5. In the control group, we'll assume that the mean dollars spent per unit is \\$1,000.  In the treatment group, we'll assume that the mean is \\$2,000, i.e. the treatment really did have a positive impact.  

To keep this simple, we'll assume that the dollars spent in each group follow a normal distribution centered on that group's mean. This means that the difference in mean dollars spent between the two groups is an appropriate test statistic here.
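
For readers who want to see what data under these assumptions looks like without relying on sim_utils, the following is a rough NumPy/pandas sketch. It is not how sim_utils generates data. The standard deviation of 500, the random seed, the column names `day` and `dollars_spent`, and the use of exactly 10,000 units per day (rather than about 10,000) are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Values taken from the assumptions listed above
daily_num_observations = 10_000
number_of_days_for_experiment = 4 * 7
control_mean, treatment_mean = 1_000.0, 2_000.0

# Assumptions made for this sketch only (not stated in the text)
outcome_std = 500.0
seed = 0

rng = np.random.default_rng(seed)
n_units = daily_num_observations * number_of_days_for_experiment

# Even assignment: each unit lands in control or treatment with probability 0.5
group = rng.choice(["control", "treatment"], size=n_units, p=[0.5, 0.5])

# Simplification: exactly the same number of units arrives each day
day = np.repeat(np.arange(1, number_of_days_for_experiment + 1), daily_num_observations)

# Normally distributed spend, centered on the mean of each unit's group
unit_means = np.where(group == "treatment", treatment_mean, control_mean)
dollars_spent = rng.normal(loc=unit_means, scale=outcome_std)

sketch_df = pd.DataFrame({"day": day, "group": group, "dollars_spent": dollars_spent})
print(sketch_df.groupby("group")["dollars_spent"].agg(["count", "mean"]))
```

With this many units per group, the group means should land very close to the assumed values of 1,000 and 2,000.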

In [None]:
daily_num_observations = 10000
number_of_days_for_experiment = 4 * 7
p_vals = "equal"
group_col = 'group'

In [5]:
simulate_data = sim_utils.SimulateSkewedContinuous()

In [None]:
df = simulate_data.simulate_zero_skewed_outcomes()