# A/B Testing Two TensorFlow Models

A/B testing, also known as split testing, is a method used to compare two versions of a product or system to determine which one performs better. In the context of machine learning models deployed on a SageMaker endpoint with production variants, A/B testing involves directing a portion of the incoming traffic to different variants of the deployed models to assess their performance under real-world conditions.

Here's how A/B testing is implemented within the framework of SageMaker production variants:

Create Production Variants: Each version of the machine learning model is represented as a production variant. These variants could differ based on factors like hardware (CPU/GPU), data subsets (e.g., comedy/drama movies), or deployment regions (e.g., US West or Germany North).

Specify Traffic Allocation: Allocate a certain percentage of incoming traffic to each production variant. For instance, you might decide to split the traffic evenly between two variants (50%/50%) for a straightforward A/B test.

Deploy and Monitor: Deploy the production variants within a SageMaker endpoint. As traffic flows into the endpoint, requests are distributed according to the specified allocation. Monitor the performance metrics, such as accuracy, latency, or other relevant KPIs, for each variant.

Analyze Results: Compare the performance of the different variants based on the collected metrics. This analysis helps determine which variant is more effective in meeting the desired objectives, such as improved accuracy or reduced latency.

Adjust and Iterate: Depending on the results, you can adjust the traffic allocation, make modifications to the models, or even introduce entirely new variants. A/B testing is an iterative process that allows for continuous improvement.

Decision Making: Based on the findings, you can make informed decisions about whether to choose one variant over another for full deployment or to introduce further enhancements.

By leveraging A/B testing in the context of SageMaker production variants, organizations can systematically evaluate and compare different versions of machine learning models, ensuring that decisions are data-driven and align with business objectives. This approach mitigates risks associated with deploying untested models and provides a mechanism for ongoing optimization.
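
To make the traffic-allocation step concrete, here is a minimal local simulation, assuming each request is routed to a variant with probability equal to that variant's weight divided by the sum of all weights. The function name `route_requests` and the request count are illustrative assumptions; the sketch does not call SageMaker.

```python
import random
from collections import Counter


def route_requests(weights, n_requests, seed=0):
    """Simulate weight-proportional routing (hypothetical helper, not a SageMaker API).

    weights: dict mapping variant name -> non-negative weight.
    """
    if not weights or sum(weights.values()) <= 0:
        raise ValueError("at least one variant needs a positive weight")
    rng = random.Random(seed)
    names = list(weights)
    # Each request picks a variant with probability weight / sum(weights).
    chosen = rng.choices(names, weights=[weights[n] for n in names], k=n_requests)
    return Counter(chosen)


counts = route_requests({"VariantA": 50, "VariantB": 50}, n_requests=10_000)
for name in sorted(counts):
    print(name, counts[name], "{:.1%}".format(counts[name] / 10_000))
```

Because routing is random per request, the observed split only approximates the configured split, and the gap shrinks as the number of requests grows.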


![](../data/readme_pics/AB-Testing.png)

I have used traffic splitting to direct subsets of users to different model variants so that the models can be compared and tested in live production. The goal is to see which variant performs better. Such tests often need to run for weeks before the results are statistically significant. The figure shows two different recommendation models deployed using a random 50-50 traffic split between the two variants.
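
One common way to check whether an observed difference between two variants is statistically significant is a two-proportion z-test. The sketch below is an assumption about how such a check could look, not part of the original notebook: it presumes each request can be scored as a success or a failure (for example, a correct or incorrect `star_rating` prediction), and the counts in the usage line are hypothetical.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share one success rate."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both variants need at least one observation")
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled success rate under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z_test(successes_a=120, n_a=1000, successes_b=150, n_b=1000)
print("z = {:.3f}, p = {:.4f}".format(z, p_value))
```

With small differences between variants, the standard error shrinks only as the sample grows, which is why a test may need to collect traffic for a long time before the p-value becomes informative.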

In [None]:
import boto3
import sagemaker
import pandas as pd
import time
import csv

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name="sagemaker", region_name=region)
cw = boto3.Session().client(service_name="cloudwatch", region_name=region)

In [None]:
# Delete prev SageMaker Endpoint
%store -r autopilot_endpoint_name
sm.delete_endpoint(EndpointName=autopilot_endpoint_name)
print("Autopilot Endpoint has been deleted to save resources.")
%store -r training_job_name
print(training_job_name)

In [None]:
# Copy the Model to the Notebook
!aws s3 cp s3://$bucket/$training_job_name/output/model.tar.gz ./model.tar.gz
!mkdir -p ./model/
!tar -xvzf ./model.tar.gz -C ./model/
# Show the Prediction Signature
!saved_model_cli show --all --dir ./model/tensorflow/saved_model/0/

In [None]:
!pygmentize ./inference.py

In [None]:
# Variant A Model
inference_image_uri = sagemaker.image_uris.retrieve(
    framework="tensorflow",
    region=region,
    version="2.3.1",
    py_version="py37",
    instance_type="ml.m5.4xlarge",
    image_scope="inference",
)
print(inference_image_uri)

timestamp = "{}".format(int(time.time()))

model_a_name = "{}-{}-{}".format(training_job_name, "varianta", timestamp)

sess.create_model_from_job(
    name=model_a_name, training_job_name=training_job_name, role=role, image_uri=inference_image_uri
)

In [None]:
# Variant B Model
model_b_name = "{}-{}-{}".format(training_job_name, "variantb", timestamp)

sess.create_model_from_job(
    name=model_b_name, training_job_name=training_job_name, role=role, image_uri=inference_image_uri
)

# Canary Rollouts and A/B Testing

Canary rollouts are a deployment strategy commonly employed to introduce new machine learning models into production environments cautiously. This method involves releasing the new model to only a small subset of users, typically around 5%, allowing for live testing in a production setting without affecting the entire user base immediately. The rationale behind canary rollouts is to minimize potential negative impacts by exposing the new model to a limited audience before widespread adoption.

To implement canary rollouts with Amazon SageMaker, you replace the single deploy() call with an Endpoint Configuration that defines multiple variants. In SageMaker, an "Endpoint Configuration" is a collection of settings, including the type and number of instances for each variant, that determines how an inference endpoint is configured.

Here's how a canary rollout is achieved:

Create Endpoint Configuration: Define an endpoint configuration with multiple variants, each corresponding to a different version of your machine learning model. For a canary rollout, this would include the existing model version (baseline) and the new version (canary).

Specify Traffic Distribution: Assign the desired traffic distribution for each variant. In the case of a canary rollout, you might allocate 95% of the traffic to the existing model (baseline) and 5% to the new model (canary).

Update Endpoint: Apply the new endpoint configuration to your existing inference endpoint. Once the update completes, SageMaker routes traffic to the variants according to the specified distribution.

Monitor and Evaluate: Monitor the performance and behavior of the canary model in the production environment. This includes tracking key metrics, such as accuracy and latency, to ensure that the new model meets expectations.

Gradual Rollout or Rollback: Depending on the observed performance, you can choose to gradually increase the traffic to the canary model or roll back to the baseline model if issues arise.
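
The canary steps above can be sketched as a small helper that builds the per-variant weights for each stage of a rollout. The helper name `canary_weights` and the ramp schedule are hypothetical; the payload shape follows the `DesiredWeightsAndCapacities` list used later in this notebook with `update_endpoint_weights_and_capacities`.

```python
def canary_weights(baseline_name, canary_name, canary_percent):
    """Build a DesiredWeightsAndCapacities-style list (hypothetical helper)."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return [
        {"VariantName": baseline_name, "DesiredWeight": 100 - canary_percent},
        {"VariantName": canary_name, "DesiredWeight": canary_percent},
    ]


# Hypothetical ramp schedule: start at 5% and widen only while metrics look healthy.
ramp_schedule = [5, 25, 50, 100]
for percent in ramp_schedule:
    payload = canary_weights("VariantA", "VariantB", percent)
    print(percent, payload)
    # In a live rollout, each payload could be passed to
    # sm.update_endpoint_weights_and_capacities(
    #     EndpointName=model_ab_endpoint_name, DesiredWeightsAndCapacities=payload)
    # after the canary's metrics at the previous stage have been reviewed.
```

Rolling back amounts to calling the same helper with `canary_percent=0`, which sends all traffic to the baseline variant again.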

In [None]:
from sagemaker.session import production_variant

timestamp = "{}".format(int(time.time()))

endpoint_config_name = "{}-{}-{}".format(training_job_name, "abtest", timestamp)

variantA = production_variant(
    model_name=model_a_name,
    instance_type="ml.m5.4xlarge",
    initial_instance_count=1,
    variant_name="VariantA",
    initial_weight=50,
)

variantB = production_variant(
    model_name=model_b_name,
    instance_type="ml.m5.4xlarge",
    initial_instance_count=1,
    variant_name="VariantB",
    initial_weight=50,
)

endpoint_config = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name, ProductionVariants=[variantA, variantB]
)

In [None]:
model_ab_endpoint_name = "{}-{}-{}".format(training_job_name, "abtest", timestamp)

endpoint_response = sm.create_endpoint(EndpointName=model_ab_endpoint_name, EndpointConfigName=endpoint_config_name)
%store model_ab_endpoint_name
%store -r experiment_name
%store -r trial_name

In [None]:
from smexperiments.trial import Trial
from smexperiments.tracker import Tracker

timestamp = "{}".format(int(time.time()))
trial = Trial.load(trial_name=trial_name)
print(trial)
tracker_deploy = Tracker.create(display_name="deploy", sagemaker_boto_client=sm)
deploy_trial_component_name = tracker_deploy.trial_component.trial_component_name
print("Deploy trial component name {}".format(deploy_trial_component_name))

In [None]:
# Attach the `deploy` Trial Component and Tracker as a Component to the Trial
trial.add_trial_component(tracker_deploy.trial_component)
# Track the Endpoint Name
tracker_deploy.log_parameters(
    {
        "endpoint_name": model_ab_endpoint_name,
    }
)

# must save after logging
tracker_deploy.trial_component.save()

from sagemaker.analytics import ExperimentAnalytics

lineage_table = ExperimentAnalytics(
    sagemaker_session=sess,
    experiment_name=experiment_name,
    metric_names=["validation:accuracy"],
    sort_by="CreationTime",
    sort_order="Ascending",
)

lineage_df = lineage_table.dataframe()
lineage_df.shape

In [None]:
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=model_ab_endpoint_name)

In [None]:
# Simulate a Prediction from an Application
from sagemaker.tensorflow.model import TensorFlowPredictor
from sagemaker.serializers import JSONLinesSerializer
from sagemaker.deserializers import JSONLinesDeserializer

predictor = TensorFlowPredictor(
    endpoint_name=model_ab_endpoint_name,
    sagemaker_session=sess,
    model_name="saved_model",
    model_version=0,
    content_type="application/jsonlines",
    accept_type="application/jsonlines",
    serializer=JSONLinesSerializer(),
    deserializer=JSONLinesDeserializer(),
)

In [None]:
# Predict the `star_rating` with Ad Hoc `review_body` Samples
inputs = [{"features": ["This is great!"]}, {"features": ["This is bad."]}]

predicted_classes = predictor.predict(inputs)

for predicted_class in predicted_classes:
    print("Predicted star_rating: {}".format(predicted_class))


# Predict the `star_rating` with `review_body` Samples from our TSV's
df_reviews = pd.read_csv(
    "./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
    delimiter="\t",
    quoting=csv.QUOTE_NONE,
    compression="gzip",
)
df_sample_reviews = df_reviews[["review_body", "star_rating"]].sample(n=50)
df_sample_reviews = df_sample_reviews.reset_index()
print(df_sample_reviews.shape)


def predict(review_body):
    inputs = [{"features": [review_body]}]
    predicted_classes = predictor.predict(inputs)
    return predicted_classes[0]["predicted_label"]


df_sample_reviews["predicted_class"] = df_sample_reviews["review_body"].map(predict)
df_sample_reviews.head(5)

# Review the REST Endpoint Performance Metrics in CloudWatch



Amazon SageMaker emits metrics such as latency and invocations for each variant to Amazon CloudWatch. I have queried CloudWatch for the per-variant invocation counts to show how invocations are split across variants.
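
As a rough illustration of how the split can be read from such metrics, the sketch below sums the `Sum` statistic of CloudWatch-style datapoints per variant and converts the totals to shares. The helper name `invocation_share` and the sample datapoints are made up for illustration; real values would come from `get_metric_statistics`.

```python
def invocation_share(datapoints_by_variant):
    """Return each variant's share of total invocations (hypothetical helper).

    datapoints_by_variant maps a variant name to a list of CloudWatch-style
    datapoints, each a dict with a "Sum" key.
    """
    totals = {
        name: sum(point["Sum"] for point in points)
        for name, points in datapoints_by_variant.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No traffic recorded yet, so every share is zero.
        return {name: 0.0 for name in totals}
    return {name: total / grand_total for name, total in totals.items()}


# Made-up datapoints for illustration only; real values come from CloudWatch.
sample = {
    "VariantA": [{"Sum": 12.0}, {"Sum": 14.0}],
    "VariantB": [{"Sum": 13.0}, {"Sum": 11.0}],
}
print(invocation_share(sample))
```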

In [None]:
# Review the REST Endpoint Performance Metrics in a Dataframe
from datetime import datetime, timedelta

import boto3
import pandas as pd


def get_invocation_metrics_for_endpoint_variant(
    endpoint_name, namespace_name, metric_name, variant_name, start_time, end_time
):
    metrics = cw.get_metric_statistics(
        Namespace=namespace_name,
        MetricName=metric_name,
        StartTime=start_time,
        EndTime=end_time,
        Period=60,
        Statistics=["Sum"],
        Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}, {"Name": "VariantName", "Value": variant_name}],
    )

    if metrics["Datapoints"]:
        return (
            pd.DataFrame(metrics["Datapoints"])
            .sort_values("Timestamp")
            .set_index("Timestamp")
            .drop("Unit", axis=1)
            .rename(columns={"Sum": variant_name})
        )
    else:
        return pd.DataFrame()


def plot_endpoint_metrics_for_variants(endpoint_name, namespace_name, metric_name, start_time=None):
    try:
        start_time = start_time or datetime.now() - timedelta(minutes=60)
        end_time = datetime.now()

        # Query each variant using the endpoint name passed in by the caller
        metrics_variantA = get_invocation_metrics_for_endpoint_variant(
            endpoint_name=endpoint_name,
            namespace_name=namespace_name,
            metric_name=metric_name,
            variant_name=variantA["VariantName"],
            start_time=start_time,
            end_time=end_time,
        )

        metrics_variantB = get_invocation_metrics_for_endpoint_variant(
            endpoint_name=endpoint_name,
            namespace_name=namespace_name,
            metric_name=metric_name,
            variant_name=variantB["VariantName"],
            start_time=start_time,
            end_time=end_time,
        )

        # Align both variants on their timestamps and plot them together
        metrics_variants = metrics_variantA.join(metrics_variantB, how="outer")
        metrics_variants.plot()
    except Exception as e:
        # Metrics may not be available yet; report the problem instead of hiding it
        print("Unable to plot {}: {}".format(metric_name, e))

# Showing the Metrics for Each Variant

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(20)
plot_endpoint_metrics_for_variants(
    endpoint_name=model_ab_endpoint_name, namespace_name="/aws/sagemaker/Endpoints", metric_name="CPUUtilization"
)

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(5)
plot_endpoint_metrics_for_variants(
    endpoint_name=model_ab_endpoint_name, namespace_name="AWS/SageMaker", metric_name="Invocations"
)

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(5)
plot_endpoint_metrics_for_variants(
    endpoint_name=model_ab_endpoint_name, namespace_name="AWS/SageMaker", metric_name="InvocationsPerInstance"
)

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(5)
plot_endpoint_metrics_for_variants(
    endpoint_name=model_ab_endpoint_name, namespace_name="AWS/SageMaker", metric_name="ModelLatency"
)

# Shifting All Traffic to Variant B


In [None]:
updated_endpoint_config = [
    {
        "VariantName": variantA["VariantName"],
        "DesiredWeight": 0,
    },
    {
        "VariantName": variantB["VariantName"],
        "DesiredWeight": 100,
    },
]
sm.update_endpoint_weights_and_capacities(
    EndpointName=model_ab_endpoint_name, DesiredWeightsAndCapacities=updated_endpoint_config
)

waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=model_ab_endpoint_name)

In [None]:
# Run Some Predictions
df_sample_reviews["predicted_class"] = df_sample_reviews["review_body"].map(predict)
df_sample_reviews.head(5)

# Metrics for Each Variant

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(20)
plot_endpoint_metrics_for_variants(
    endpoint_name=model_ab_endpoint_name, namespace_name="/aws/sagemaker/Endpoints", metric_name="CPUUtilization"
)

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(5)
plot_endpoint_metrics_for_variants(
    endpoint_name=model_ab_endpoint_name, namespace_name="AWS/SageMaker", metric_name="Invocations"
)

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(5)
plot_endpoint_metrics_for_variants(
    endpoint_name=model_ab_endpoint_name, namespace_name="AWS/SageMaker", metric_name="InvocationsPerInstance"
)

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(5)
plot_endpoint_metrics_for_variants(
    endpoint_name=model_ab_endpoint_name, namespace_name="AWS/SageMaker", metric_name="ModelLatency"
)

In [None]:
# Remove Variant A to Reduce Cost
# Modify the Endpoint Configuration to use only Variant B.
import time

timestamp = "{}".format(int(time.time()))

updated_endpoint_config_name = "{}-{}".format(training_job_name, timestamp)

updated_endpoint_config = sm.create_endpoint_config(
    EndpointConfigName=updated_endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": variantB["VariantName"],
            "ModelName": model_b_name,  # Only specify variant B to remove variant A
            "InstanceType": "ml.m5.4xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 100,
        }
    ],
)

In [None]:
sm.update_endpoint(EndpointName=model_ab_endpoint_name, EndpointConfigName=updated_endpoint_config_name)

In [None]:
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=model_ab_endpoint_name)

In [None]:
# Run Some Predictions
df_sample_reviews["predicted_class"] = df_sample_reviews["review_body"].map(predict)
df_sample_reviews

# Show the Metrics for Each Variant

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(20)
plot_endpoint_metrics_for_variants(
    endpoint_name=model_ab_endpoint_name, namespace_name="/aws/sagemaker/Endpoints", metric_name="CPUUtilization"
)

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(5)
plot_endpoint_metrics_for_variants(
    endpoint_name=model_ab_endpoint_name, namespace_name="AWS/SageMaker", metric_name="Invocations"
)

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(5)
plot_endpoint_metrics_for_variants(
    endpoint_name=model_ab_endpoint_name, namespace_name="AWS/SageMaker", metric_name="InvocationsPerInstance"
)

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(5)
plot_endpoint_metrics_for_variants(
    endpoint_name=model_ab_endpoint_name, namespace_name="AWS/SageMaker", metric_name="ModelLatency"
)

In [None]:
sm.delete_endpoint(EndpointName=model_ab_endpoint_name)