# How to use the auto_ab library

In [None]:
import sys, yaml, os, json
import pandas as pd
import numpy as np

# Make the parent directory importable so that auto_ab can be found
sys.path.append('../')
from auto_ab import ABTest, Splitter, VarianceReduction, Graphics

## Loading config file

The config file is in *yaml* format and is located in the root of the library.
After loading, its contents are available through the *config* variable.

In [None]:
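try:
    # The project root is the parent of the notebook's working directory
    project_dir = os.path.dirname(os.path.abspath(''))
    config_file = os.path.join(project_dir, 'config.yaml')
    with open(config_file, 'r') as file:
        config = yaml.safe_load(file)
except yaml.YAMLError as exc:
    # The file was read but does not contain valid YAML
    print(exc)
    sys.exit(1)
except Exception as exc:
    # Any other failure, e.g. the file is missing or unreadable
    print(f'Error reading the config file: {exc}')
    sys.exit(1)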

gf = Graphics()
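
The nested lookups used throughout this guide imply a config layout roughly like the one below. This is only a sketch: the key names are taken from the `config[...]` calls in this notebook, but every value (including the `strategy` name and the strata weights) is a placeholder assumption, not a library default. The dict mirrors what `yaml.safe_load` would return for such a file.

```python
# Hypothetical config contents; keys follow the lookups in this guide,
# values are illustrative placeholders only.
config = {
    "splitter": {"split_rate": 0.5},
    "hypothesis": {
        "alpha": 0.05,
        "beta": 0.2,
        "alternative": "two-sided",
        "strategy": "placeholder_strategy",  # assumption: real names come from the library
        "strata": "country",
        "strata_weights": {"US": 0.6, "UK": 0.4},  # assumption: made-up weights
    },
    "data": {"id_col": "id", "target": "height"},
    "simulation": {
        "n_iter": 100,
        "n_boot_samples": 200,
        "split_rates": [0.1, 0.3, 0.5],
        "increment": {"vars": [1, 2, 3, 4, 5], "extra_params": []},
    },
    "metric": {"metric_type": "solid"},
    "result": {"to_csv": True, "csv_path": "simulation_log.csv"},
}

# Check that every (section, key) pair read later in the guide is present.
required = [
    ("splitter", "split_rate"),
    ("hypothesis", "alpha"), ("hypothesis", "beta"),
    ("hypothesis", "alternative"), ("hypothesis", "strategy"),
    ("hypothesis", "strata"), ("hypothesis", "strata_weights"),
    ("data", "id_col"), ("data", "target"),
    ("simulation", "n_iter"), ("simulation", "n_boot_samples"),
    ("simulation", "split_rates"), ("simulation", "increment"),
    ("metric", "metric_type"),
    ("result", "to_csv"), ("result", "csv_path"),
]
missing = [(section, key) for section, key in required
           if key not in config.get(section, {})]
print(missing)  # an empty list means nothing is missing
```

A check like this fails early with a readable list of missing keys instead of a `KeyError` in the middle of the notebook.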

# Preparing for the experiment

## Loading dataset

- **sex, married, country** — features
- **height** — target if target is continuous
- **clicks, sessions** — numerator and denominator if target is ratio

In [None]:
data = pd.read_csv(os.path.join(project_dir, 'data/internal/guide/data.csv'), index_col='id')
data.head()

## Initialization of splitter

If you are going to run the MDE simulation, the **split_rate** parameter can be omitted, since the simulation sets it on the splitter itself.

In [None]:
splitter = Splitter(split_rate=config['splitter']['split_rate'])
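
To make the role of **split_rate** concrete, here is a minimal sketch of a random split. It is not the `Splitter` implementation: `random_split` is a hypothetical helper, and it assumes that the split rate is the share of units assigned to control (group `'A'`), which the library may define differently.

```python
import random

def random_split(ids, split_rate, seed=None):
    """Hypothetical helper: map each id to 'A' (control) or 'B' (treatment)."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    # Assumption: split_rate is the fraction of units that go to control.
    n_control = round(len(shuffled) * split_rate)
    control = set(shuffled[:n_control])
    return {i: ("A" if i in control else "B") for i in ids}

groups = random_split(range(1000), split_rate=0.3, seed=42)
n_a = sum(1 for g in groups.values() if g == "A")
print(n_a, len(groups) - n_a)  # 300 700
```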

## Initialization of A/B-test

Here
- **alpha** — significance level
- **beta** — probability of type II error
- **alternative** — 'less', 'more', 'two-sided'

In [None]:
ab = ABTest(alpha=config['hypothesis']['alpha'], 
            beta=config['hypothesis']['beta'],
            alternative=config['hypothesis']['alternative'])
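
As a rough cross-check for the simulation that follows, **alpha** and **beta** also determine the minimum detectable effect (MDE) analytically. The sketch below uses the textbook normal approximation for a two-sided test on a difference in means. It is not what `auto_ab` computes: `analytic_mde`, `sigma` and `n_total` are hypothetical names introduced for illustration.

```python
from math import sqrt
from statistics import NormalDist

def analytic_mde(sigma, n_total, split_rate, alpha=0.05, beta=0.2):
    """Approximate MDE of a two-sided z-test for a difference in means."""
    z = NormalDist()
    n_first = n_total * split_rate      # size of one group
    n_second = n_total - n_first        # size of the other group
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = z.inv_cdf(1 - beta)       # quantile for the power 1 - beta
    return (z_alpha + z_power) * sigma * sqrt(1 / n_first + 1 / n_second)

balanced = analytic_mde(sigma=10, n_total=10_000, split_rate=0.5)
skewed = analytic_mde(sigma=10, n_total=10_000, split_rate=0.1)
print(round(balanced, 2), round(skewed, 2))
```

Under this approximation a balanced split gives the smallest MDE for a fixed total sample size, which is one reason to explore several split rates.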

### Set the loaded dataset as the one to analyze

Here
- **id_col** — id column of a dataset

In [None]:
ab.use_dataset(data, id_col=config['data']['id_col'],
              target=config['data']['target'])

### Set previously defined splitter for test

Assign the splitter defined above to the test.

In [None]:
ab.splitter = splitter

### Set list of split rates for MDE exploration

Set the list of control/treatment split rates you are going to test.

In [None]:
ab.split_rates = config['simulation']['split_rates']

### Set list of increments for MDE exploration

Here
- **inc_var** — list of increments, e.g. [1, 2, 3, 4, 5]
- **extra_params** — extra parameters for increment, currently not used in analysis

In [None]:
ab.set_increment(inc_var=config['simulation']['increment']['vars'],
                extra_params=config['simulation']['increment']['extra_params'])

### Create the metric you want to compare

In the example below, we want to compare the median (50th percentile) of the control and treatment distributions.
The metric must be a function that takes an array of numbers and returns a single value.

In [None]:
def metric(X: np.ndarray) -> float:
    return np.quantile(X, 0.5)
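
Any function with the same shape (array in, single number out) can serve as the metric. The sketch below shows a few interchangeable candidates. The function names are illustrative only and are not part of `auto_ab`.

```python
import numpy as np

def mean_metric(x: np.ndarray) -> float:
    return float(np.mean(x))

def median_metric(x: np.ndarray) -> float:
    return float(np.quantile(x, 0.5))

def p10_metric(x: np.ndarray) -> float:
    # 10th percentile: the value below which 10% of observations fall
    return float(np.quantile(x, 0.1))

sample = np.arange(1, 102)  # the integers 1..101
for candidate in (mean_metric, median_metric, p10_metric):
    print(candidate.__name__, candidate(sample))
```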

### MDE simulation to find the best combination of split rate and increment

Here
- **n_iter** — number of iterations of simulation
- **n_boot_samples** — set if you chose bootstrap hypothesis testing
- **metric_type** — metric type: ratio or solid (continuous)
- **metric** — Python function as tested metric (quantile, median, mean, etc)
- **strategy** — strategy of hypothesis testing
- **strata** — strata column name for variance reduction
- **strata_weights** — weights of each unique value in strata column as a dictionary
- **to_csv** — whether or not to save the result to csv file
- **csv_path** — path to the newly created csv file

In [None]:
res = ab.mde_simulation(n_iter=config['simulation']['n_iter'],
                        n_boot_samples=config['simulation']['n_boot_samples'],
                        metric_type=config['metric']['metric_type'],
                        metric=metric,
                        strategy=config['hypothesis']['strategy'],
                        strata=config['hypothesis']['strata'],
                        strata_weights=config['hypothesis']['strata_weights'],
                        to_csv=config['result']['to_csv'],
                        csv_path=config['result']['csv_path'])

### Print simulation log

Here
- **first key** — split rate
- **second key** — increment
- **value** — share of rejected H0

In [None]:
print(json.dumps(res, indent=4))
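
Given the nested structure described above, one way to read the log is to look, for each split rate, for the smallest increment whose share of rejected H0 reaches the target power `1 - beta`. The sketch below works on a made-up log: the numbers, the key types and the helper `smallest_detectable` are assumptions for illustration, not output of `mde_simulation`.

```python
# Hypothetical log: {split_rate: {increment: share of rejected H0}}
example_log = {
    0.3: {1: 0.21, 2: 0.55, 3: 0.78, 4: 0.95},
    0.5: {1: 0.27, 2: 0.66, 3: 0.91, 4: 0.98},
}

def smallest_detectable(log, beta):
    """For each split rate, return the smallest increment reaching power 1 - beta."""
    target_power = 1 - beta
    result = {}
    for split_rate, by_increment in log.items():
        reached = [inc for inc, share in by_increment.items() if share >= target_power]
        result[split_rate] = min(reached) if reached else None
    return result

best = smallest_detectable(example_log, beta=0.2)
print(best)  # {0.3: 4, 0.5: 3}
```

A value of `None` would mean that none of the tested increments reached the target power for that split rate.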

### Visualize simulation log in plot

In [None]:
gf.plot_simulation_log(config['result']['csv_path'])

# Actual A/B test

At this step, the experiment is run and the dataset of outcomes is gathered for analysis.

# A/A test

An A/A test should normally be run before the A/B test. Here, let's assume that we already have the A/B test data and now need to make sure that the splitter works correctly.

In [None]:
ab_data = pd.read_csv(os.path.join(project_dir, 'data/internal/guide/ab_data.csv'))
ab_data.head()

In [None]:
splitter = Splitter(split_rate=0.5)
res = splitter.aa_test(X=ab_data, target='height_now', alpha=0.05, n_iter=1000)
print(f'Share of iterations when control and treatment groups are equal: {res}')
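
The idea behind an A/A test can be sketched without the library: split the same data at random many times, test the two halves against each other, and count how often no difference is found. With a sound splitter this share should be close to `1 - alpha`. The helper below is a hypothetical illustration that uses a normal-approximation test on means and synthetic data. It is not how `Splitter.aa_test` is implemented.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance

def aa_share_not_rejected(values, alpha=0.05, n_iter=300, seed=0):
    """Share of random 50/50 splits in which the two halves do not differ."""
    rng = random.Random(seed)
    values = list(values)
    half = len(values) // 2
    not_rejected = 0
    for _ in range(n_iter):
        rng.shuffle(values)
        a, b = values[:half], values[half:]
        # Two-sided z-test on the difference in means
        se = sqrt(variance(a) / len(a) + variance(b) / len(b))
        z = (fmean(a) - fmean(b)) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        if p_value >= alpha:
            not_rejected += 1
    return not_rejected / n_iter

data_rng = random.Random(1)
heights = [data_rng.gauss(175, 10) for _ in range(400)]  # synthetic metric values
share = aa_share_not_rejected(heights)
print(f"Share of iterations without a detected difference: {share:.2f}")
```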

# Variance reduction

## Loading dataset generated during A/B-test

Here
- **height_now** — experiment metric during experiment
- **height_prev** — experiment metric before experiment
- **weight_now** — highly correlated feature with metric during experiment
- **weight_prev** — highly correlated feature with metric before experiment
- **noise_now** — feature during experiment that is just noise
- **noise_prev** — feature before experiment that is just noise
- **groups** — groups column

In [None]:
ab_data = pd.read_csv(os.path.join(project_dir, 'data/internal/guide/ab_data.csv'))
ab_data.head()

## Initial distribution of tested metrics

In [None]:
gf.plot_distributions(ab_data, 'height_now', 'groups', 50)

As can be seen, the distributions of the two groups are practically identical.

## Add increment to the treatment group

In [None]:
ab = ABTest(alpha=config['hypothesis']['alpha'], 
            beta=config['hypothesis']['beta'],
            alternative=config['hypothesis']['alternative'])

treatment = ab_data.loc[ab_data.groups == 'B', 'height_now']
treatment_increased = ab._add_increment('solid', treatment, 5)
ab_data.loc[ab_data.groups == 'B', 'height_now'] = treatment_increased

gf.plot_distribution(treatment_increased, bins=50)

## Initial control and increased treatment distribution

In [None]:
gf.plot_distributions(ab_data, 'height_now', 'groups', 50)

## Use CUPED to reduce variance

After execution, a new column, **height_now_cuped**, is added to the dataset.

In [None]:
vr = VarianceReduction()
ab_data_cuped = vr.cuped(ab_data, target='height_now', groups='groups', covariate='height_prev')
print(ab_data_cuped.head())

In [None]:
gf.plot_distributions(ab_data_cuped, 'height_now_cuped', 'groups', 50)

As can be seen, the spread of both distributions narrowed: for control, the left edge moved **from 160 to 170** and the right edge **from 190 to 180**; for treatment, the left edge moved **from 165 to 175** and the right edge **from 195 to 185**.
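
For reference, the standard CUPED adjustment can be written in a few lines. The sketch below uses synthetic data and computes a single pooled `theta`. Whether `VarianceReduction.cuped` pools the groups or handles them separately is not shown in this guide, so treat this as an illustration of the technique rather than of the library's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: the pre-experiment metric is strongly correlated with the current one
height_prev = rng.normal(175, 10, size=5_000)
height_now = height_prev + rng.normal(0, 3, size=5_000)

# theta = cov(Y, X) / var(X), where Y is the metric and X is the covariate
theta = np.cov(height_now, height_prev)[0, 1] / np.var(height_prev, ddof=1)

# Subtract the part of Y explained by X; centering X keeps the mean of Y unchanged
height_now_cuped = height_now - theta * (height_prev - height_prev.mean())

print(f"variance before: {height_now.var(ddof=1):.1f}")
print(f"variance after:  {height_now_cuped.var(ddof=1):.1f}")
```

The stronger the correlation between the covariate and the metric, the larger the reduction. A covariate that is pure noise leaves the variance essentially unchanged.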

## Use CUPAC to reduce variance

The call below fits a model that predicts the covariate for the experiment period.
After execution, a new column, **target_pred**, is added to the dataset.

In [None]:
ab_data_cupac = vr.cupac(ab_data, target_prev='height_prev', target_now='height_now',
               factors_prev=['weight_prev'],
               factors_now=['weight_now'], groups='groups')

In [None]:
print(ab_data_cupac.head())

In [None]:
gf.plot_distributions(ab_data_cupac, 'height_now_cuped', 'groups', 50)

As can be seen, the spread narrowed: the left edge of the distribution moved **from 160 to 170** and the right edge **from 190 to 180**.

# A/B-test analysis

The metric tested in the experiment is the 10th percentile.

In [None]:
def metric(X: np.ndarray) -> float:
    return np.quantile(X, 0.1)

In [None]:
ab = ABTest(alpha=config['hypothesis']['alpha'],
            beta=config['hypothesis']['beta'],
            alternative=config['hypothesis']['alternative'])

control = ab_data_cuped.loc[ab_data_cuped.groups == 'A', 'height_now_cuped'].to_numpy()
treatment = ab_data_cuped.loc[ab_data_cuped.groups == 'B', 'height_now_cuped'].to_numpy()

is_rejected = ab.test_hypothesis_buckets(control, treatment, metric, 100)
result = 'rejected' if is_rejected == 1 else 'not rejected'
print(f'H0: {result}')
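
A bucket-based test is commonly done as follows: shuffle each group, cut it into equally sized buckets, compute the metric inside every bucket, and compare the bucket-level values with an ordinary two-sample test. The sketch below follows that recipe on synthetic data. `bucket_test` is a hypothetical helper and the normal-approximation test is an assumption, so it may differ from what `test_hypothesis_buckets` does internally.

```python
import numpy as np
from statistics import NormalDist

def bucket_test(control, treatment, metric, n_buckets=100, alpha=0.05, seed=0):
    """Return 1 if H0 (no difference in the metric) is rejected, else 0."""
    rng = np.random.default_rng(seed)

    def bucket_metrics(x):
        # Shuffle, cut into buckets, and evaluate the metric in each bucket
        shuffled = rng.permutation(np.asarray(x))
        return np.array([metric(b) for b in np.array_split(shuffled, n_buckets)])

    c, t = bucket_metrics(control), bucket_metrics(treatment)
    # Two-sided z-test on the means of the bucket-level metric values
    se = np.sqrt(c.var(ddof=1) / len(c) + t.var(ddof=1) / len(t))
    z = float((t.mean() - c.mean()) / se)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return int(p_value < alpha)

def p10(x):
    return float(np.quantile(x, 0.1))

data_rng = np.random.default_rng(1)
control_demo = data_rng.normal(175, 10, size=10_000)
treatment_demo = data_rng.normal(180, 10, size=10_000)  # synthetic shift of 5
is_rejected_demo = bucket_test(control_demo, treatment_demo, p10)
print("H0:", "rejected" if is_rejected_demo == 1 else "not rejected")
```

Working with bucket-level values makes it possible to test metrics such as quantiles, for which no simple closed-form test on raw observations is available.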