

# A/B Testing Deployment with Amazon SageMaker

This Jupyter notebook guides you through implementing an A/B testing deployment strategy for machine learning models using Amazon SageMaker.

## Prerequisites

- An AWS account with SageMaker access
- Basic understanding of Python and machine learning concepts


## Lab Overview

In this lab, you will:

1. Set up a SageMaker environment
2. Prepare a small dataset and train two simple models
3. Deploy models using A/B testing
4. Evaluate model performance
5. Clean up resources

Let's get started!

## 1. Environment Setup

In [None]:
import boto3
import sagemaker
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = get_execution_role()

## 2. Data Preparation and Model Training

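We'll use the first 100 samples of the iris dataset for quick training. These samples cover only two of the three classes (setosa and versicolor), so both models solve a binary classification problem: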

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data[:100], iris.target[:100]  # First 100 samples: classes 0 and 1 only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_data = pd.DataFrame(X_train, columns=iris.feature_names)
train_data['target'] = y_train
test_data = pd.DataFrame(X_test, columns=iris.feature_names)
test_data['target'] = y_test

train_data.to_csv('train.csv', index=False, header=False)
test_data.to_csv('test.csv', index=False, header=False)

s3_input_train = session.upload_data(path='train.csv', key_prefix='sagemaker/iris/train')
s3_input_test = session.upload_data(path='test.csv', key_prefix='sagemaker/iris/test')

Now, let's train two simple models:

In [None]:
%%writefile train_model_a.py
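from sklearn.ensemble import RandomForestClassifier
import argparse, joblib, os
import numpy as np


def model_fn(model_dir):
    # Required by the SageMaker scikit-learn serving container to load the model
    return joblib.load(os.path.join(model_dir, 'model.joblib'))


# Guard the training code so it does not run when the script is imported for inference
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--n-estimators', type=int, default=10)
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    args, _ = parser.parse_known_args()

    # The CSV has no header; the label is the last column
    train_data = np.loadtxt(os.path.join(args.train, 'train.csv'), delimiter=',')
    X = train_data[:, :-1]
    y = train_data[:, -1]

    model = RandomForestClassifier(n_estimators=args.n_estimators)
    model.fit(X, y)

    joblib.dump(model, os.path.join(args.model_dir, 'model.joblib'))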

In [None]:
%%writefile train_model_b.py
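from sklearn.svm import SVC
import argparse, joblib, os
import numpy as np


def model_fn(model_dir):
    # Required by the SageMaker scikit-learn serving container to load the model
    return joblib.load(os.path.join(model_dir, 'model.joblib'))


# Guard the training code so it does not run when the script is imported for inference
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--C', type=float, default=1.0)
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    args, _ = parser.parse_known_args()

    # The CSV has no header; the label is the last column
    train_data = np.loadtxt(os.path.join(args.train, 'train.csv'), delimiter=',')
    X = train_data[:, :-1]
    y = train_data[:, -1]

    model = SVC(C=args.C)
    model.fit(X, y)

    joblib.dump(model, os.path.join(args.model_dir, 'model.joblib'))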

In [None]:
# Train Model A with Spot Instances
sklearn_estimator_a = SKLearn(
    entry_point='train_model_a.py',
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',  # choose one that is available in your Region
    framework_version='0.23-1',
    hyperparameters={'n-estimators': 10},
    use_spot_instances=True,
    max_run=3600,  # 1 hour maximum runtime
    max_wait=3605  # Total time limit, including the wait for Spot capacity; must be >= max_run
)

sklearn_estimator_a.fit({'train': s3_input_train})

# Train Model B with Spot Instances
sklearn_estimator_b = SKLearn(
    entry_point='train_model_b.py',
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',  # choose one that is available in your Region
    framework_version='0.23-1',
    hyperparameters={'C': 1.0},
    use_spot_instances=True,
    max_run=3600,  # 1 hour maximum runtime
    max_wait=3605  # Total time limit, including the wait for Spot capacity; must be >= max_run
)

sklearn_estimator_b.fit({'train': s3_input_train})
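
Both jobs above run with `use_spot_instances=True`. To get a feel for what that saves, here is a minimal sketch of the savings calculation, assuming you have the job's training time and billable time in seconds (SageMaker reports these as `TrainingTimeInSeconds` and `BillableTimeInSeconds` when you describe a training job). The helper name `spot_savings_percent` and the sample numbers are hypothetical and not taken from a real job.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Return the percentage saved by Managed Spot Training for one job."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    # Savings are the share of training time that was not billed
    return (1 - billable_seconds / training_seconds) * 100


# Hypothetical job: 120 seconds of training, 36 seconds billed
print(f"Spot savings: {spot_savings_percent(120, 36):.1f}%")
```

A job with equal training and billable seconds shows no savings, which is what you would see without Spot capacity.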

## 3. A/B Testing Deployment

Now, let's deploy both models behind a single endpoint as two production variants, each receiving half of the traffic:

In [None]:
from sagemaker.session import production_variant
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
from datetime import datetime

# Explicitly create SageMaker Models from trained estimators
model_a = sklearn_estimator_a.create_model()
model_b = sklearn_estimator_b.create_model()

# Explicitly register the models with unique names
timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
model_a.name = f"model-a-{timestamp}"
model_b.name = f"model-b-{timestamp}"

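# _create_sagemaker_model is a private SDK method, so it may change between SDK versions.
# It registers each model with SageMaker so the variants below can refer to it by name.
model_a._create_sagemaker_model(instance_type='ml.m5.large', accelerator_type=None)
model_b._create_sagemaker_model(instance_type='ml.m5.large', accelerator_type=None)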

# Create production variants for A/B testing
variant1 = production_variant(
    model_name=model_a.name,
    instance_type="ml.m5.large",
    initial_instance_count=1,
    variant_name='ModelA',
    initial_weight=50
)

variant2 = production_variant(
    model_name=model_b.name,
    instance_type="ml.m5.large",
    initial_instance_count=1,
    variant_name='ModelB',
    initial_weight=50
)

endpoint_name = 'iris-ab-test-endpoint'
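
The `initial_weight` values are relative, not percentages: SageMaker routes each variant a share of traffic equal to its weight divided by the sum of all variant weights. The sketch below illustrates that arithmetic locally. The helper `traffic_shares` is a hypothetical name used only for this example and is not part of the SageMaker SDK.

```python
def traffic_shares(weights):
    """Map variant weights to the fraction of requests each variant should receive."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


# The 50/50 split used in this lab
print(traffic_shares({"ModelA": 50, "ModelB": 50}))

# A more cautious rollout: ModelB gets 1 part in 10
print(traffic_shares({"ModelA": 9, "ModelB": 1}))
```

Weights of 50 and 50 behave the same as 1 and 1. Only the ratio matters.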




In [None]:
# Deploy the endpoint with the variants
session.endpoint_from_production_variants(
    name=endpoint_name,
    production_variants=[variant1, variant2]
)

# Initialize predictor
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=session,
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer()
)

## 4. Evaluate Model Performance

Let's test our A/B deployment by sending 50 requests and counting how many each variant serves:

In [None]:
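import time

# predictor.predict() returns only the deserialized prediction, so call the
# runtime API directly: its response names the variant that served the request
runtime = boto3.client('sagemaker-runtime')
payload = ','.join(str(v) for v in X_test[0].tolist())

results = {'ModelA': 0, 'ModelB': 0}

for _ in range(50):
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='text/csv',
        Body=payload
    )
    variant_used = response['InvokedProductionVariant']
    results[variant_used] += 1
    time.sleep(0.1)  # To avoid throttling

print("Model A was used:", results['ModelA'], "times")
print("Model B was used:", results['ModelB'], "times")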
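
With only 50 requests, the counts will rarely be exactly 25 and 25. As a rough check on whether an observed split is consistent with the configured weights, the sketch below computes an exact two-sided binomial p-value using only the standard library. The function name `split_p_value` is hypothetical, and the counts in the usage lines are illustrative values rather than results from this endpoint.

```python
from math import comb


def split_p_value(count_a, total, share_a=0.5):
    """Exact two-sided binomial p-value for seeing count_a of total requests on variant A."""
    def pmf(k):
        return comb(total, k) * share_a**k * (1 - share_a)**(total - k)

    observed = pmf(count_a)
    # Add up the probability of every outcome at least as unlikely as the observed one
    tail = sum(pmf(k) for k in range(total + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Illustrative counts: a large p-value means the split is consistent with 50/50
print(split_p_value(22, 50))
print(split_p_value(45, 50))
```

A very small p-value would suggest the traffic is not being split according to the weights you configured.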

## 5. Clean Up Resources

Always remember to clean up your resources to avoid unnecessary charges:

In [None]:
# Clean Up Resources - IMPORTANT to avoid ongoing charges

# 1. Delete the endpoint
print("Cleaning up resources...")
try:
    predictor.delete_endpoint()
    print("Endpoint deleted successfully")
except Exception as e:
    print(f"Error deleting endpoint: {e}")

# 2. Delete any remaining endpoint configurations for this endpoint
# (predictor.delete_endpoint() normally removes the configuration too; this is a safety net)
sm_client = boto3.client('sagemaker')
try:
    # NameContains filters on the server side; an unfiltered listing is paginated and can miss entries
    endpoint_configs = sm_client.list_endpoint_configs(NameContains=endpoint_name)
    for config in endpoint_configs['EndpointConfigs']:
        config_name = config['EndpointConfigName']
        print(f"Deleting endpoint configuration: {config_name}")
        sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    print("Endpoint configurations deleted")
except Exception as e:
    print(f"Error deleting endpoint configurations: {e}")

# 3. Delete the two models created for this lab, addressed directly by name
for model_name in (model_a.name, model_b.name):
    try:
        print(f"Deleting model: {model_name}")
        sm_client.delete_model(ModelName=model_name)
    except Exception as e:
        print(f"Error deleting model {model_name}: {e}")

print("All resources have been cleaned up")

## Conclusion

In this lab, you learned how to implement A/B testing deployment for machine learning models using Amazon SageMaker. You trained two simple models, deployed them as weighted production variants behind one endpoint, and measured how traffic was split between them. Remember to always clean up your resources after completing your experiments.

## Common Mistakes and Best Practices

- Ensure you have sufficient permissions to create and manage SageMaker resources.
- Always use the smallest instance type that can handle your workload to minimize costs.
- Remember to delete endpoints and models after you're done to avoid ongoing charges.
- When implementing A/B testing in production, consider using more sophisticated metrics and longer evaluation periods.
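
To make the last point concrete, here is a minimal sketch of one common way to compare two variants: a two-proportion z-test on their accuracy, using only the standard library. This is an illustration and not something the lab itself computes. The function name `two_proportion_z_test` and all counts below are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference between two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # Both rates are all-success or all-failure: no evidence of a difference
    z = (p_a - p_b) / se
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - normal_cdf)


# Hypothetical counts: the same 90% vs 80% accuracy gap at two sample sizes
print(two_proportion_z_test(9, 10, 8, 10))          # short evaluation period
print(two_proportion_z_test(900, 1000, 800, 1000))  # longer evaluation period
```

The same accuracy gap can be indistinguishable from noise on a handful of requests and clearly significant on a larger sample, which is why longer evaluation periods matter.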

