This tutorial and the assets can be downloaded as part of the [Wallaroo Tutorials repository](https://github.com/WallarooLabs/Wallaroo_Tutorials/tree/main/wallaroo-testing-tutorials/abtesting).

## A/B Testing

A/B testing is a method for comparing ML models on performance, accuracy, or other useful benchmarks.  It contrasts with the Wallaroo Shadow Deployment feature.  In both cases, two sets of models are added to a pipeline step:

* Control or Champion model:  The model currently used for inferences.
* Challenger model(s): One or more models that are to be compared to the champion model.

The two features differ as follows:

| Feature | Description |
|---|---|
| A/B Testing | Each inference request is submitted to either the champion ML model or a challenger ML model. |
| Shadow Deploy | All inferences are submitted to the champion model and one or more challenger models. |

To repeat:  A/B testing submits *some* of the inference requests to the champion model and the rest to the challenger model, with one set of outputs per request, while shadow deployment submits *all* of the inference requests to both the champion and the shadow models and keeps their outputs separate.
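
The difference between the two routing styles can be sketched in plain Python. This is only an illustration and does not use the Wallaroo SDK: the functions `champion`, `challenger`, `ab_route`, and `shadow_route` are hypothetical stand-ins, and the 2:1 weighting is borrowed from the pipeline defined later in this tutorial.

```python
import random

def champion(x):
    # Hypothetical stand-in for the champion model
    return f"champion({x})"

def challenger(x):
    # Hypothetical stand-in for a challenger model
    return f"challenger({x})"

def ab_route(x, rng, champion_weight=2, challenger_weight=1):
    """A/B testing: each request goes to exactly one model, picked at random by weight."""
    model = rng.choices([champion, challenger],
                        weights=[champion_weight, challenger_weight])[0]
    return {"outputs": model(x)}

def shadow_route(x):
    """Shadow deployment: every request goes to all models; outputs are kept separate."""
    return {"outputs": champion(x), "shadow_data": {"challenger": challenger(x)}}

rng = random.Random(0)
ab_result = ab_route("example.com", rng)
shadow_result = shadow_route("example.com")
print(ab_result)
print(shadow_result)
```

In the A/B case a single output comes back from whichever model was chosen; in the shadow case the champion output is returned alongside the challenger output.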

This tutorial demonstrates how to conduct A/B testing in Wallaroo.  For this example we will use an open source [Aloha CNN LSTM model](https://www.researchgate.net/publication/348920204_Using_Auxiliary_Inputs_in_Deep_Learning_Models_for_Detecting_DGA-based_Domain_Names) that classifies domain names as either legitimate or used for nefarious purposes such as malware distribution.

For our example, we will perform the following:

* Create a workspace for our work.
* Upload the Aloha model and a challenger model.
* Create a pipeline that can ingest our submitted data with the champion model and the challenger model set into an A/B step.
* Run a series of sample inferences to show which are handled by the champion model and which by the challenger model, then determine which is more efficient.

All sample data and models are available through the [Wallaroo Quick Start Guide Samples repository](https://github.com/WallarooLabs/quickstartguide_samples).

## Steps

### Import libraries

Here we will import the libraries needed for this notebook.

In [5]:
import wallaroo
from wallaroo.object import EntityNotFoundError
import os
import pandas as pd
import json
from IPython.display import display

# used to display dataframe information without truncating
pd.set_option('display.max_colwidth', None)

In [6]:
wallaroo.__version__

'2023.1.0rc1'

### Arrow Support

As of the 2023.1 release, Wallaroo supports dataframe and Arrow formats for inference inputs.  This tutorial allows users to adjust their experience based on whether they have enabled Arrow support in their Wallaroo instance or not.

If Arrow support has been enabled, set `arrowEnabled=True`.  If it is disabled or you're not sure, set `arrowEnabled=False`.

The examples below are shown in an Arrow-enabled environment.

In [7]:
import os
# Only set the line below to make the OS environment variable ARROW_ENABLED "True".  Otherwise, leave as is.
os.environ["ARROW_ENABLED"]="True"

if "ARROW_ENABLED" not in os.environ or os.environ["ARROW_ENABLED"].casefold() == "False".casefold():
    arrowEnabled = False
else:
    arrowEnabled = True
print(arrowEnabled)

True
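
The same check can be written as a small function, which makes its edge cases easier to see. This is a sketch rather than part of the Wallaroo SDK; the helper name `arrow_enabled` is hypothetical, and it takes a plain dictionary so it can be tried without changing the real environment.

```python
def arrow_enabled(environ):
    """Return False when ARROW_ENABLED is missing or set to "false" in any letter case."""
    value = environ.get("ARROW_ENABLED")
    if value is None:
        return False
    # Mirrors the cell above: only the string "false" (any case) disables Arrow
    return value.casefold() != "false"

unset_case = arrow_enabled({})
true_case = arrow_enabled({"ARROW_ENABLED": "True"})
false_case = arrow_enabled({"ARROW_ENABLED": "FALSE"})
other_case = arrow_enabled({"ARROW_ENABLED": "0"})
print(unset_case, true_case, false_case, other_case)
```

Note that, as in the original cell, any value other than "false" counts as enabled, including values such as "0".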


### Connect to the Wallaroo Instance

This command sets up a connection to the Wallaroo cluster, which allows creating and using Wallaroo inference engines.

In [8]:
# Client connection from local Wallaroo instance

# wl = wallaroo.Client()

# SSO login through keycloak

wallarooPrefix = "doc-test"
wallarooSuffix = "wallaroocommunity.ninja"

wl = wallaroo.Client(api_endpoint=f"https://{wallarooPrefix}.api.{wallarooSuffix}", 
                    auth_endpoint=f"https://{wallarooPrefix}.keycloak.{wallarooSuffix}", 
                    auth_type="sso")

ERROR:root:Keycloak token refresh got error: 400 - {"error":"invalid_grant","error_description":"Invalid refresh token"}


Please log into the following URL in a web browser:

	https://doc-test.keycloak.wallaroocommunity.ninja/auth/realms/master/device?user_code=FOAN-WGMH

Login successful!


### Create Workspace

We will create a workspace to manage our pipeline and models.  The following variables will set the name of our sample workspace then set it as the current workspace for all other commands.

To allow this tutorial to be run multiple times or by multiple users in the same Wallaroo instance, a random 4 character prefix can be added to the workspace, pipeline, and model names.  The cells below use fixed names.
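
One way such a prefix could be generated is sketched below using only the Python standard library. The helper `random_prefix` and the prefixed names are illustrative assumptions; the cells in this tutorial use the unprefixed names.

```python
import random
import string

def random_prefix(length=4):
    """Build a random lowercase prefix, e.g. for workspace or pipeline names."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

prefix = random_prefix()
# Hypothetical prefixed names; the rest of this tutorial uses the plain names
prefixed_workspace_name = f"{prefix}abtestingworkspace"
prefixed_pipeline_name = f"{prefix}randomsplitpipeline-demo"
print(prefixed_workspace_name, prefixed_pipeline_name)
```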

In [9]:
workspace_name = 'abtestingworkspace'


In [10]:
def get_workspace(name):
    # Return the workspace with the given name, creating it if it does not exist
    workspace = None
    for ws in wl.list_workspaces():
        if ws.name() == name:
            workspace = ws
    if workspace is None:
        workspace = wl.create_workspace(name)
    return workspace

In [11]:
workspace = get_workspace(workspace_name)

wl.set_current_workspace(workspace)

{'name': 'abtestingworkspace', 'id': 5, 'archived': False, 'created_by': '42247718-e3c7-41eb-9b23-a3639ff77fd5', 'created_at': '2023-03-01T19:29:35.565813+00:00', 'models': [], 'pipelines': []}

### Set Up the Champion and Challenger Models

Now we upload the Champion and Challenger models to our workspace.  We will use two models:

1. `aloha-cnn-lstm`: the champion model.
2. `aloha-cnn-lstm-new`: a retrained version, used as the challenger.

### Set the Champion Model

We upload our champion model, labeled as `control`.

In [12]:
control = wl.upload_model("aloha-control", 'models/aloha-cnn-lstm.zip').configure('tensorflow')

### Set the Challenger Model

Now we upload the Challenger model, labeled as `challenger`.

In [13]:
challenger = wl.upload_model("aloha-challenger", 'models/aloha-cnn-lstm-new.zip').configure('tensorflow')

### Define The Pipeline

Here we configure a pipeline with two models and a random split weighted so that the control model receives about 2/3 of the inference requests.  Because the split is random, the observed counts may deviate from a strict 2:1 ratio, but the more inferences are run, the closer the observed split tends to be to that ratio.

In [14]:
pipeline = (wl.build_pipeline("randomsplitpipeline-demo")
            .add_random_split([(2, control), (1, challenger)], "session_id"))
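
The effect of sample size on a weighted random split can be simulated without a Wallaroo instance. The sketch below uses only the standard library and assumes every request is routed independently; `simulate_split` is a hypothetical helper, not an SDK function.

```python
import random
from collections import Counter

def simulate_split(n_requests, weights, seed=0):
    """Route n_requests at random among the named models, in proportion to weights."""
    rng = random.Random(seed)
    names = list(weights)
    picks = rng.choices(names, weights=[weights[name] for name in names], k=n_requests)
    return Counter(picks)

for n in (10, 1000, 100000):
    counts = simulate_split(n, {"control": 2, "challenger": 1})
    control_share = counts["control"] / n
    print(n, dict(counts), round(control_share, 3))
```

With only a handful of requests the control share can be far from 2/3; with many requests it settles close to it.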

### Deploy the pipeline

Now we deploy the pipeline so we can run our inference through it.

In [15]:
experiment_pipeline = pipeline.deploy()

In [16]:
experiment_pipeline.status()

{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.244.13.15',
   'name': 'engine-6cd755545-778dc',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'randomsplitpipeline-demo',
      'status': 'Running'}]},
   'model_statuses': {'models': [{'name': 'aloha-control',
      'version': '65e28c87-aaa8-451e-9a0a-be587f0d500c',
      'sha': 'fd998cd5e4964bbbb4f8d29d245a8ac67df81b62be767afbceb96a03d1a01520',
      'status': 'Running'},
     {'name': 'aloha-challenger',
      'version': '74ce34c9-2e52-45b1-9e50-2d5e0c1be000',
      'sha': '223d26869d24976942f53ccb40b432e8b7c39f9ffcf1f719f3929d7595bceaf3',
      'status': 'Running'}]}}],
 'engine_lbs': [{'ip': '10.244.15.48',
   'name': 'engine-lb-ddd995646-8wc2v',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': []}
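
A status dictionary like the one above can be checked programmatically before inferences are sent. The following is a minimal sketch that assumes the layout shown in this output; `all_models_running` is a hypothetical helper and `sample_status` is an abbreviated copy of the output above.

```python
def all_models_running(status):
    """Return True when the pipeline and every model on every engine report 'Running'."""
    if status.get("status") != "Running":
        return False
    for engine in status.get("engines", []):
        for model in engine.get("model_statuses", {}).get("models", []):
            if model.get("status") != "Running":
                return False
    return True

# Abbreviated copy of the status output shown above
sample_status = {
    "status": "Running",
    "engines": [{
        "status": "Running",
        "model_statuses": {"models": [
            {"name": "aloha-control", "status": "Running"},
            {"name": "aloha-challenger", "status": "Running"},
        ]},
    }],
}

ready = all_models_running(sample_status)
print(ready)
```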

### Run a single inference

Now that our deployment is set up, let's run a single inference.  The results show the inference output as well as which model handled the request, under `model_name`.  We'll run the inference request 5 times, so the odds are that the challenger model is run at least once.

In [17]:
results = []
if arrowEnabled is True:
    # use dataframe JSON files
    for x in range(5):
        result = experiment_pipeline.infer_from_file("data/data-1.df.json")
        #display(result.loc[:,["model_name","outputs"]])
        display(result)
else:
    # use Wallaroo JSON files
    for x in range(5):
        results.append(experiment_pipeline.infer_from_file("data/data-1.json"))
    for result in results:
        print(result[0].model())
        print(result[0].data()[7])

Unnamed: 0,model_name,model_version,pipeline_name,outputs,elapsed,time,original_data,check_failures,shadow_data
0,aloha-control,65e28c87-aaa8-451e-9a0a-be587f0d500c,randomsplitpipeline-demo,"[{'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0015195842133834958]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.9829147458076477]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.012099549174308777]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [4.759115836350247e-05]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [2.028934977715835e-05]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.00031977228354662657]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.011029261164367199]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.9975640177726746]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.010341613553464413]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.008038961328566074]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.016155054792761803]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0062362332828342915]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0009985746582970023]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [1.79333675513837e-26]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [1.388984431455466e-27]}}, {'String': {'v': 1, 'dim': [1, 1], 'data': ['{""name"":""aloha-control"",""version"":""65e28c87-aaa8-451e-9a0a-be587f0d500c"",""sha"":""fd998cd5e4964bbbb4f8d29d245a8ac67df81b62be767afbceb96a03d1a01520""}']}}]",230640288,2023-03-01 19:30:01.241,,0,{}


Unnamed: 0,model_name,model_version,pipeline_name,outputs,elapsed,time,original_data,check_failures,shadow_data
0,aloha-control,65e28c87-aaa8-451e-9a0a-be587f0d500c,randomsplitpipeline-demo,"[{'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0015195842133834958]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.9829147458076477]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.012099549174308777]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [4.759115836350247e-05]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [2.028934977715835e-05]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.00031977228354662657]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.011029261164367199]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.9975640177726746]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.010341613553464413]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.008038961328566074]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.016155054792761803]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0062362332828342915]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0009985746582970023]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [1.79333675513837e-26]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [1.388984431455466e-27]}}, {'String': {'v': 1, 'dim': [1, 1], 'data': ['{""name"":""aloha-control"",""version"":""65e28c87-aaa8-451e-9a0a-be587f0d500c"",""sha"":""fd998cd5e4964bbbb4f8d29d245a8ac67df81b62be767afbceb96a03d1a01520""}']}}]",185191185,2023-03-01 19:30:01.859,,0,{}


Unnamed: 0,model_name,model_version,pipeline_name,outputs,elapsed,time,original_data,check_failures,shadow_data
0,aloha-challenger,74ce34c9-2e52-45b1-9e50-2d5e0c1be000,randomsplitpipeline-demo,"[{'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0015195842133834958]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.9829147458076477]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.012099549174308777]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [4.759115836350247e-05]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [2.028934977715835e-05]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.00031977228354662657]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.011029261164367199]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.9975640177726746]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.010341613553464413]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.008038961328566074]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.016155054792761803]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0062362332828342915]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0009985746582970023]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [1.79333675513837e-26]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [1.388984431455466e-27]}}, {'String': {'v': 1, 'dim': [1, 1], 'data': ['{""name"":""aloha-challenger"",""version"":""74ce34c9-2e52-45b1-9e50-2d5e0c1be000"",""sha"":""223d26869d24976942f53ccb40b432e8b7c39f9ffcf1f719f3929d7595bceaf3""}']}}]",186747119,2023-03-01 19:30:02.438,,0,{}


Unnamed: 0,model_name,model_version,pipeline_name,outputs,elapsed,time,original_data,check_failures,shadow_data
0,aloha-control,65e28c87-aaa8-451e-9a0a-be587f0d500c,randomsplitpipeline-demo,"[{'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0015195842133834958]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.9829147458076477]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.012099549174308777]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [4.759115836350247e-05]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [2.028934977715835e-05]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.00031977228354662657]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.011029261164367199]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.9975640177726746]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.010341613553464413]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.008038961328566074]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.016155054792761803]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0062362332828342915]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0009985746582970023]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [1.79333675513837e-26]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [1.388984431455466e-27]}}, {'String': {'v': 1, 'dim': [1, 1], 'data': ['{""name"":""aloha-control"",""version"":""65e28c87-aaa8-451e-9a0a-be587f0d500c"",""sha"":""fd998cd5e4964bbbb4f8d29d245a8ac67df81b62be767afbceb96a03d1a01520""}']}}]",10970042,2023-03-01 19:30:03.027,,0,{}


Unnamed: 0,model_name,model_version,pipeline_name,outputs,elapsed,time,original_data,check_failures,shadow_data
0,aloha-control,65e28c87-aaa8-451e-9a0a-be587f0d500c,randomsplitpipeline-demo,"[{'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0015195842133834958]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.9829147458076477]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.012099549174308777]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [4.759115836350247e-05]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [2.028934977715835e-05]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.00031977228354662657]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.011029261164367199]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.9975640177726746]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.010341613553464413]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.008038961328566074]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.016155054792761803]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0062362332828342915]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [0.0009985746582970023]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [1.79333675513837e-26]}}, {'Float': {'v': 1, 'dim': [1, 1], 'data': [1.388984431455466e-27]}}, {'String': {'v': 1, 'dim': [1, 1], 'data': ['{""name"":""aloha-control"",""version"":""65e28c87-aaa8-451e-9a0a-be587f0d500c"",""sha"":""fd998cd5e4964bbbb4f8d29d245a8ac67df81b62be767afbceb96a03d1a01520""}']}}]",11199847,2023-03-01 19:30:03.448,,0,{}
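
Each `outputs` cell above is a list of typed entries: a series of `Float` entries followed by a final `String` entry that holds the serving model's name, version, and SHA as a JSON string. A minimal sketch of reading both kinds of entry follows. It assumes the structure shown above; `float_output` and `model_metadata` are hypothetical helpers, and `sample_outputs` is abbreviated to the first two `Float` entries plus the `String` entry.

```python
import json

def float_output(outputs, index):
    """Return the single value stored in the Float entry at the given position."""
    return outputs[index]["Float"]["data"][0]

def model_metadata(outputs):
    """Parse the JSON string in the final String entry that identifies the model."""
    return json.loads(outputs[-1]["String"]["data"][0])

# Abbreviated from the first result above
sample_outputs = [
    {"Float": {"v": 1, "dim": [1, 1], "data": [0.0015195842133834958]}},
    {"Float": {"v": 1, "dim": [1, 1], "data": [0.9829147458076477]}},
    {"String": {"v": 1, "dim": [1, 1], "data": [
        '{"name":"aloha-control","version":"65e28c87-aaa8-451e-9a0a-be587f0d500c",'
        '"sha":"fd998cd5e4964bbbb4f8d29d245a8ac67df81b62be767afbceb96a03d1a01520"}'
    ]}},
]

second_value = float_output(sample_outputs, 1)
metadata = model_metadata(sample_outputs)
print(second_value, metadata["name"])
```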


### Run Inference Batch

We will submit 1000 rows of test data through the pipeline, then loop through the responses and display which model handled each inference.  The split between the control and challenger models should be approximately 2:1.

In [18]:
if arrowEnabled is True:
    #Read in the test data as one dataframe
    test_data = pd.read_json('data/data-1k.df.json')
    # For each row, submit that row as a separate dataframe
    # and collect the resulting dataframes in a list
    frames = []
    for index, row in test_data.head(1000).iterrows():
        frames.append(experiment_pipeline.infer(row.to_frame('text_input').reset_index()))
    # DataFrame.append was removed in pandas 2.0; concatenate the per-row results instead
    df = pd.concat(frames)
    display(df.model_name.value_counts())
else:
    l = []
    responses =[]
    from data import test_data
    for nth in range(1000):
        responses.extend(experiment_pipeline.infer(test_data.data[nth]))
    l = [r.raw['model_name'] for r in responses]
    df = pd.DataFrame({'model': l})
    display(df.model.value_counts())

aloha-control       685
aloha-challenger    315
Name: model_name, dtype: int64
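
Whether 685 out of 1000 is a plausible outcome for a 2:1 split can be checked with a short binomial calculation. The sketch below assumes each request is routed independently with probability 2/3 of reaching the control model.

```python
import math

n = 1000                 # inference requests submitted
p_control = 2 / 3        # probability that a request reaches the control model
observed_control = 685   # count reported above

expected = n * p_control
std_dev = math.sqrt(n * p_control * (1 - p_control))
z = (observed_control - expected) / std_dev
print(f"expected {expected:.1f}, std dev {std_dev:.1f}, z = {z:.2f}")
# prints: expected 666.7, std dev 14.9, z = 1.23
```

The observed count sits a little over one standard deviation above the expected value, which is ordinary random variation under that assumption.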

### Test Challenger

Now that we have run a large amount of data, we can compare the results.

For this experiment we are looking for a significant change in the fraction of inferences that predicted a probability for the seventh category (output index 7 in the code below) higher than 0.5, so we can determine whether our challenger model is more "successful" than the champion model at identifying category 7.

In [19]:
control_count = 0
challenger_count = 0

control_success = 0
challenger_success = 0

if arrowEnabled is True:
    # results are rows of a pandas DataFrame
    for index, row in df.iterrows():
        if row["model_name"] == "aloha-control":
            # display(row["model_name"])
            control_count += 1
            if row["outputs"][7]['Float']['data'][0] > .5:
                control_success += 1
        else:
            challenger_count += 1
            if row["outputs"][7]['Float']['data'][0] > .5:
                challenger_success += 1
else:
    for r in responses:
        if r.raw['model_name'] == "aloha-control":
            control_count += 1
            if(r.raw['outputs'][7]['Float']['data'][0] > .5):
                control_success += 1
        else:
            challenger_count +=1
            if(r.raw['outputs'][7]['Float']['data'][0] > .5):
                challenger_success += 1
           
            
print("control class 7 prediction rate: " + str(control_success/control_count))
print("challenger class 7 prediction rate: " + str(challenger_success/challenger_count))

control class 7 prediction rate: 0.9766423357664233
challenger class 7 prediction rate: 0.9777777777777777
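
The text above asks for a significant change, and one common way to check for that is a two-proportion z-test. The sketch below is an illustration, not part of the original notebook. The success counts 669 of 685 and 308 of 315 are recovered from the printed rates and the model counts reported earlier, and the normal approximation is only rough here because few inferences fall below the 0.5 threshold.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return the z statistic and two-sided p-value for the difference of two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal variable
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# control: 669 of 685, challenger: 308 of 315 (derived from the output above)
z, p_value = two_proportion_z_test(669, 685, 308, 315)
print(f"z = {z:.2f}, p-value = {p_value:.2f}")
```

A z value near 0.11 with a p-value near 0.91 indicates that this run gives no evidence of a difference between the two models on this metric.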


### Logs

Logs can be viewed with the Pipeline method `logs()`.

In Arrow enabled Wallaroo environments, logs are returned as a Pandas DataFrame object.

In Arrow disabled Wallaroo environments, logs are returned as the Wallaroo Log object.

In either case, only the first 5 logs will be shown.

In [23]:
logs = experiment_pipeline.logs(limit=5)
display(logs)

Unnamed: 0,time,message
0,2023-03-01 20:57:02.548,"{""model_name"":""anomaly-housing-model"",""model_version"":""4354ad0b-d1c8-4b7f-a4b3-52837f4ffdcb"",""pipeline_name"":""anomaly-housing-pipeline3"",""outputs"":[{""Float"":{""v"":1,""dim"":[1,1],""data"":[28.64939308166504]}}],""elapsed"":326218,""time"":1677704220674,""original_data"":null,""check_failures"":[[]],""shadow_data"":{}}"
1,2023-03-01 20:57:02.548,"{""model_name"":""anomaly-housing-model"",""model_version"":""4354ad0b-d1c8-4b7f-a4b3-52837f4ffdcb"",""pipeline_name"":""anomaly-housing-pipeline3"",""outputs"":[{""Float"":{""v"":1,""dim"":[1,1],""data"":[27.588991165161133]}}],""elapsed"":296016,""time"":1677704221094,""original_data"":null,""check_failures"":[[]],""shadow_data"":{}}"
2,2023-03-01 20:57:02.548,"{""model_name"":""anomaly-housing-model"",""model_version"":""4354ad0b-d1c8-4b7f-a4b3-52837f4ffdcb"",""pipeline_name"":""anomaly-housing-pipeline3"",""outputs"":[{""Float"":{""v"":1,""dim"":[1,1],""data"":[23.101274490356445]}}],""elapsed"":307917,""time"":1677704221538,""original_data"":null,""check_failures"":[[]],""shadow_data"":{}}"
3,2023-03-01 20:59:55.650,"{""model_name"":""anomaly-housing-model"",""model_version"":""4354ad0b-d1c8-4b7f-a4b3-52837f4ffdcb"",""pipeline_name"":""anomaly-housing-pipeline3"",""outputs"":[{""Float"":{""v"":1,""dim"":[1,1],""data"":[350.46990966796875]}}],""elapsed"":348312,""time"":1677704394641,""original_data"":null,""check_failures"":[[{""False"":{""expr"":""anomaly-housing-model.outputs[0][0] < 100""}}]],""shadow_data"":{}}"
4,2023-03-01 21:00:24.111,"{""model_name"":""anomaly-housing-model"",""model_version"":""4354ad0b-d1c8-4b7f-a4b3-52837f4ffdcb"",""pipeline_name"":""anomaly-housing-pipeline3"",""outputs"":[{""Float"":{""v"":1,""dim"":[1,1],""data"":[350.46990966796875]}}],""elapsed"":336412,""time"":1677704423102,""original_data"":null,""check_failures"":[[{""False"":{""expr"":""anomaly-housing-model.outputs[0][0] < 100""}}]],""shadow_data"":{}}"
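
Each `message` value in the log output above is a JSON string, so it can be parsed with the standard `json` module. The sketch below assumes the field layout shown in that output; `summarize_log_message` is a hypothetical helper and `sample_message` is copied from row 3.

```python
import json

# Copied from row 3 of the log output above
sample_message = (
    '{"model_name":"anomaly-housing-model",'
    '"model_version":"4354ad0b-d1c8-4b7f-a4b3-52837f4ffdcb",'
    '"pipeline_name":"anomaly-housing-pipeline3",'
    '"outputs":[{"Float":{"v":1,"dim":[1,1],"data":[350.46990966796875]}}],'
    '"elapsed":348312,"time":1677704394641,"original_data":null,'
    '"check_failures":[[{"False":{"expr":"anomaly-housing-model.outputs[0][0] < 100"}}]],'
    '"shadow_data":{}}'
)

def summarize_log_message(message):
    """Extract the model name, first output value, and failed check expressions."""
    record = json.loads(message)
    failed_checks = [
        check["False"]["expr"]
        for group in record["check_failures"]
        for check in group
        if "False" in check
    ]
    return {
        "model_name": record["model_name"],
        "first_output": record["outputs"][0]["Float"]["data"][0],
        "failed_checks": failed_checks,
    }

summary = summarize_log_message(sample_message)
print(summary)
```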


### Undeploy Pipeline

With the testing complete, we undeploy the pipeline to return its resources to the environment.

In [21]:
experiment_pipeline.undeploy()


| | |
|---|---|
| name | randomsplitpipeline-demo |
| created | 2023-03-01 19:29:40.280260+00:00 |
| last_updated | 2023-03-01 19:29:41.631810+00:00 |
| deployed | False |
| tags | |
| versions | 10964262-1ef6-4e2e-a4ab-7477717a9daa, 661856be-d9e8-402d-b25a-5e0e389eecdf |
| steps | aloha-control |
