# Multi-Armed Bandits and Reinforcement Learning with Amazon SageMaker

We demonstrate how you can manage your own contextual multi-armed bandit workflow on SageMaker using the built-in [AWS Reinforcement Learning Container](https://github.com/aws/sagemaker-rl-container) to train and deploy contextual bandit models. We show how to train models that interact with a live environment (using a simulated client application) and continuously update the model with efficient exploration.

### Why Contextual Bandits?

Wherever we look to personalize content for a user (content layout, ads, search, product recommendations, etc.), contextual bandits come in handy. Traditional personalization methods collect a training dataset, build a model, and deploy it to generate recommendations. However, the training algorithm does not tell us how to collect this dataset, especially in a production system where generating poor recommendations leads to loss of revenue. Contextual bandit algorithms help us collect this data strategically by trading off between exploiting known information and exploring recommendations that may yield higher benefits. The collected data is used to update the personalization model in an online manner. Therefore, contextual bandits help us train a personalization model while minimizing the impact of poor recommendations.

![](img/multi_armed_bandit_maximize_reward.png)

To implement the exploration-exploitation strategy, we need an iterative training and deployment system that: (1) recommends an action using the contextual bandit model based on user context, (2) captures the implicit feedback over time and (3) continuously trains the model with incremental interaction data. In this notebook, we show how to set up the infrastructure needed for such an iterative learning system. While the example demonstrates a bandits application, these continual learning systems are useful more generally in dynamic scenarios where models need to be continually updated to capture recent trends in the data (e.g. tracking fraud behaviors based on detection mechanisms or tracking user interests over time).
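To make the three steps above concrete, here is a minimal, self-contained sketch of such a loop. It is not the SageMaker workflow used in this notebook: the environment, the context names, the reward probabilities, and the simple epsilon-greedy policy are all assumptions chosen for illustration. The "deployed" policy is refreshed only once per batch, which mimics periodic retraining and redeployment.

```python
import random


def run_bandit_loop(num_rounds=6000, batch_size=100, epsilon=0.1, seed=0):
    """Toy contextual bandit loop: recommend, collect feedback, retrain in batches."""
    rng = random.Random(seed)
    # Hypothetical environment: success probability of each action per context.
    true_reward_prob = {"short_review": [0.8, 0.2], "long_review": [0.2, 0.8]}
    contexts = list(true_reward_prob)
    num_actions = 2

    # "Deployed" policy: estimated mean reward per (context, action).
    deployed = {c: [0.0] * num_actions for c in contexts}
    # Running statistics accumulated from logged interactions.
    sums = {c: [0.0] * num_actions for c in contexts}
    counts = {c: [0] * num_actions for c in contexts}
    pending = []  # interactions logged since the last training step
    pulls = {c: [0] * num_actions for c in contexts}

    for step in range(1, num_rounds + 1):
        context = rng.choice(contexts)
        # (1) Recommend: explore with probability epsilon, otherwise exploit.
        if rng.random() < epsilon:
            action = rng.randrange(num_actions)
        else:
            estimates = deployed[context]
            action = estimates.index(max(estimates))
        pulls[context][action] += 1
        # (2) Capture feedback from the (simulated) environment.
        reward = 1 if rng.random() < true_reward_prob[context][action] else 0
        pending.append((context, action, reward))
        # (3) Periodically retrain on the new batch and "redeploy" the policy.
        if step % batch_size == 0:
            for c, a, r in pending:
                sums[c][a] += r
                counts[c][a] += 1
            pending.clear()
            deployed = {
                c: [sums[c][a] / counts[c][a] if counts[c][a] else 0.0
                    for a in range(num_actions)]
                for c in contexts
            }
    return pulls


pulls = run_bandit_loop()
print(pulls)  # number of times each action was chosen, per context
```

Over time the loop should send most traffic for each context to the action with the higher reward probability, while still spending a small share of traffic on exploration.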

In a typical supervised learning setup, the model is trained with a SageMaker training job and hosted behind a SageMaker hosting endpoint. The client application calls the endpoint for inference and receives a response. In bandits, the client application also sends back the reward (a score assigned to each recommendation generated by the model), which becomes part of the dataset for subsequent model training.

# Relevant Links

In-Practice
* [AWS Blog Post on Contextual Multi-Armed Bandits](https://aws.amazon.com/blogs/machine-learning/power-contextual-bandits-using-continual-learning-with-amazon-sagemaker-rl/)
* [Multi-Armed Bandits at StitchFix](https://multithreaded.stitchfix.com/blog/2020/08/05/bandits/)
* [Introduction to Contextual Bandits](https://getstream.io/blog/introduction-contextual-bandits/)
* [Vowpal Wabbit Contextual Bandit Algorithms](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Contextual-Bandit-algorithms)

Theory
* [Learning to Interact](https://hunch.net/~jl/interact.pdf)
* [Contextual Bandit Bake-Off](https://arxiv.org/pdf/1802.04064.pdf)
* [Doubly-Robust Policy Evaluation and Learning](https://arxiv.org/pdf/1103.4601.pdf)

Code
* [AWS Open Source Reinforcement Learning Containers](https://github.com/aws/sagemaker-rl-container)
* [AWS Open Source Bandit Experiment Manager](./common/sagemaker_rl/orchestrator/workflow/manager)
* [Vowpal Wabbit Reinforcement Learning Framework](https://github.com/VowpalWabbit/)

# AWS Open Source Bandit `ExperimentManager` Library

![](img/multi_armed_bandit_traffic_shift.png)

The bandit model is implemented by the open source [**Bandit Experiment Manager**](./common/sagemaker_rl/orchestrator/workflow/manager/) provided with this example.  This implementation continuously updates a Vowpal Wabbit reinforcement learning model using Amazon SageMaker, DynamoDB, Kinesis, and S3.

The client application, a recommender system with a review service in our case, pings the SageMaker hosting endpoint that is serving the bandit model.  The application sends an `event` with the `context` (i.e. user, product, and review text) to the bandit model and receives a recommended action from the bandit model.  In our case, the action is 1 of 2 BERT models that we are testing.  The bandit model stores this event data (given context and recommended action) in S3 using Amazon Kinesis.  _Note:  The context makes this a "contextual bandit" and differentiates this implementation from a regular multi-armed bandit._

The client application uses the recommended BERT model to classify the review text as star rating 1 through 5 and compares the predicted star rating to the user-selected star rating.  If the BERT model correctly predicts the star rating of the review text (i.e. matches the user-selected star rating), then the bandit model is rewarded with `reward=1`.  If the BERT model incorrectly classifies the star rating of the review text, the bandit model is not rewarded (`reward=0`).

The client application stores the rewards data in S3 using Amazon Kinesis.  Periodically (e.g. every 100 rewards), we incrementally train an updated bandit model with the latest reward and event data.  This updated bandit model is evaluated against the current model using a holdout dataset of rewards and events.  If the bandit model accuracy is above a given threshold relative to the existing model, it is automatically deployed in a blue/green manner with no downtime.  SageMaker RL supports offline evaluation by performing counterfactual analysis (CFA).  By default, we apply the [**doubly robust (DR) estimation**](https://arxiv.org/pdf/1103.4601.pdf) method.  The bandit model tries to minimize the cost (`1 - reward`), so a smaller evaluation score indicates better bandit model performance.
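To give a feel for how a doubly robust evaluation score is computed, here is a minimal sketch of the textbook DR estimator applied to logged `(context, action, action probability, reward)` tuples, using `cost = 1 - reward` as described above. This is not the implementation used inside the SageMaker RL container or Vowpal Wabbit. The function name, the tuple layout, the two fixed target policies, and the constant cost model are all assumptions made for illustration.

```python
def doubly_robust_cost(logged_events, target_policy, cost_model, actions):
    """Estimate the average cost a target policy would incur on logged data."""
    total = 0.0
    for context, action, action_prob, reward in logged_events:
        cost = 1 - reward
        pi = target_policy(context)  # dict: action -> probability under the new policy
        # Direct-method term: model-predicted cost under the target policy.
        direct = sum(pi[a] * cost_model(context, a) for a in actions)
        # Importance-weighted correction using the cost that was actually observed.
        correction = pi[action] / action_prob * (cost - cost_model(context, action))
        total += direct + correction
    return total / len(logged_events)


# Two logged events, each collected by a policy that chose uniformly between 2 actions.
logged_events = [
    ("review-a", 1, 0.5, 1),  # BERT model 1 was chosen and was correct (reward=1)
    ("review-b", 2, 0.5, 0),  # BERT model 2 was chosen and was wrong (reward=0)
]


def always_model_1(context):
    return {1: 1.0, 2: 0.0}


def always_model_2(context):
    return {1: 0.0, 2: 1.0}


def naive_cost_model(context, action):
    # Hypothetical cost model that knows nothing and always predicts 0.5.
    return 0.5


cost_1 = doubly_robust_cost(logged_events, always_model_1, naive_cost_model, actions=(1, 2))
cost_2 = doubly_robust_cost(logged_events, always_model_2, naive_cost_model, actions=(1, 2))
print(cost_1, cost_2)  # the policy with the lower estimated cost is preferred
```

On this tiny log, the policy that always picks BERT model 1 receives the lower estimated cost, which matches the logged outcomes. In practice the cost model would be learned from data, and the estimate would be computed on a holdout set.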

Unlike traditional A/B tests, the bandit model will learn the best BERT model (action) for a given context over time and begin to shift traffic to the best model.  Depending on the aggressiveness of the bandit model algorithm selected, the bandit model will continue to explore the under-performing models, but start to favor and exploit the better-performing models.  And unlike A/B tests, multi-armed bandits allow you to add a new action (i.e. BERT model) dynamically throughout the life of the experiment.  When the bandit model sees the new BERT model, it will start sending traffic to it and exploring its accuracy alongside the existing BERT models in the experiment.
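The traffic-shifting behavior and the ability to add an action mid-experiment can be illustrated with a small simulation. The sketch below uses non-contextual Thompson sampling with Beta priors, which is a simplification and not necessarily the exploration algorithm configured for this example. The class name, the accuracy values, and the round counts are assumptions chosen for illustration.

```python
import random


class ThompsonSamplingBandit:
    """Non-contextual Thompson sampling over 0/1 rewards (illustrative only)."""

    def __init__(self, actions, seed=0):
        self.rng = random.Random(seed)
        self.successes = {action: 0 for action in actions}
        self.failures = {action: 0 for action in actions}

    def add_action(self, action):
        # A new arm starts with no observations, i.e. a uniform Beta(1, 1) prior.
        self.successes.setdefault(action, 0)
        self.failures.setdefault(action, 0)

    def choose_action(self):
        # Sample a plausible success rate per arm and pick the largest sample.
        samples = {
            action: self.rng.betavariate(self.successes[action] + 1,
                                         self.failures[action] + 1)
            for action in self.successes
        }
        return max(samples, key=samples.get)

    def update(self, action, reward):
        if reward == 1:
            self.successes[action] += 1
        else:
            self.failures[action] += 1


def simulate(bandit, accuracy, num_rounds, env_rng):
    """Route traffic through the bandit and return how often each action was used."""
    traffic = {action: 0 for action in accuracy}
    for _ in range(num_rounds):
        action = bandit.choose_action()
        reward = 1 if env_rng.random() < accuracy[action] else 0
        bandit.update(action, reward)
        traffic[action] += 1
    return traffic


env_rng = random.Random(1)
bandit = ThompsonSamplingBandit(actions=[1, 2])
# Hypothetical accuracies for two BERT models.
traffic_before = simulate(bandit, {1: 0.5, 2: 0.6}, 1000, env_rng)

# Add a third, more accurate model while the experiment is running.
bandit.add_action(3)
traffic_after = simulate(bandit, {1: 0.5, 2: 0.6, 3: 0.9}, 3000, env_rng)
print(traffic_before, traffic_after)
```

After the third action is added, the bandit explores it, observes its higher reward rate, and routes most of the remaining traffic to it without restarting the experiment.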

#### Local Mode

To facilitate experimentation, we provide a `local_mode` that runs the contextual bandit example using the SageMaker Notebook instance itself instead of the SageMaker training and hosting cluster instances.  The workflow remains the same in `local_mode`, but runs much faster for small datasets.  Hence, it is a useful tool for experimenting and debugging.  However, it will not scale to production use cases with high throughput and large datasets.  In `local_mode`, the training, evaluation, and hosting are done in the local [SageMaker Vowpal Wabbit Docker Container](https://github.com/aws/sagemaker-rl-container).

In [1]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

In [2]:
%store -r tensorflow_endpoint_name

In [3]:
try:
    tensorflow_endpoint_name
    print('[OK]')
except NameError:
    print('+++++++++++++++++++++++++++++++')
    print('[ERROR] Please run the notebooks in this section before you continue.')
    print('+++++++++++++++++++++++++++++++')

[OK]


In [4]:
print(tensorflow_endpoint_name)

tensorflow-training-2021-01-23-06-16-08-737-tf-1611432312


In [5]:
%store -r pytorch_endpoint_name

In [6]:
try:
    pytorch_endpoint_name
    print('[OK]')    
except NameError:
    print('+++++++++++++++++++++++++++++++')
    print('[ERROR] Please run the notebooks in this section before you continue.')
    print('+++++++++++++++++++++++++++++++')

[OK]


In [7]:
print(pytorch_endpoint_name)

tensorflow-training-2021-01-23-06-16-08-737-pt-1611433340


# Configure the 2 BERT Models to Test with our Bandit Experiment

Once the bandit model is deployed as a SageMaker Endpoint, the client application will send the context to the endpoint and receive the recommended action.  The bandit model will recommend 1 of 2 actions in our example:  `1` or `2`, which correspond to BERT model 1 and BERT model 2, respectively.  Let's configure these 2 BERT models below.

In [8]:
model1_endpoint_name = tensorflow_endpoint_name

In [9]:
print(model1_endpoint_name)

tensorflow-training-2021-01-23-06-16-08-737-tf-1611432312


In [10]:
try:
    waiter = sm.get_waiter('endpoint_in_service')
    waiter.wait(EndpointName=model1_endpoint_name)
except Exception:
    print('###################')
    print('The endpoint is not running.')
    print('Please re-run the previous section to deploy the endpoint.')
    print('###################')    

In [11]:
import json
from sagemaker.tensorflow.model import TensorFlowPredictor

model1_predictor = TensorFlowPredictor(endpoint_name=model1_endpoint_name,
                                       sagemaker_session=sess,
                                       model_name='saved_model',
                                       model_version=0,
                                       content_type='application/jsonlines',
                                       accept_type='application/jsonlines')

content_type is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [12]:
inputs = [
    {"review_body": "This is great!"},
    {"review_body": "This is bad."}
]

predicted1_classes_str = model1_predictor.predict(inputs)
predicted1_classes = predicted1_classes_str.splitlines()

for predicted1_class_json, input_data in zip(predicted1_classes, inputs):
    predicted1_class = json.loads(predicted1_class_json)['predicted_label']
    print('Predicted star_rating: {} for review_body "{}"'.format(predicted1_class, input_data["review_body"]))

Predicted star_rating: 5 for review_body "This is great!"
Predicted star_rating: 3 for review_body "This is bad."


In [13]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/endpoints/{}">Model 1 SageMaker REST Endpoint</a></b>'.format(region, model1_endpoint_name)))


In [14]:
model2_endpoint_name = pytorch_endpoint_name

In [15]:
print(model2_endpoint_name)

tensorflow-training-2021-01-23-06-16-08-737-pt-1611433340


In [16]:
try:
    waiter = sm.get_waiter('endpoint_in_service')
    waiter.wait(EndpointName=model2_endpoint_name)
except Exception:
    print('###################')
    print('The endpoint is not running.')
    print('Please re-run the previous section to deploy the endpoint.')
    print('###################')    

In [17]:
import json
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
        
model2_predictor = Predictor(endpoint_name=model2_endpoint_name,
                             sagemaker_session=sess,
                             serializer=JSONSerializer(), 
                             deserializer=JSONDeserializer(),
                             content_type='application/jsonlines',
                             accept_type='application/jsonlines')

content_type is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [18]:
inputs = [
    {"review_body": "This is great!"},
    {"review_body": "This is bad."}
]

predicted2_classes_str = model2_predictor.predict(inputs)
predicted2_classes = predicted2_classes_str.splitlines()

for predicted2_class_json, input_data in zip(predicted2_classes, inputs):
    predicted2_class = json.loads(predicted2_class_json)['predicted_label']
    print('Predicted star_rating: {} for review_body "{}"'.format(predicted2_class, input_data["review_body"]))

Predicted star_rating: 4 for review_body "This is great!"
Predicted star_rating: 4 for review_body "This is bad."


In [19]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/endpoints/{}">Model 2 SageMaker REST Endpoint</a></b>'.format(region, model2_endpoint_name)))


In [20]:
import yaml
import sys
import numpy as np
import time
import sagemaker

sys.path.append('common')
sys.path.append('common/sagemaker_rl')

In [21]:
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams

%matplotlib inline
%config InlineBackend.figure_format='retina'

### Configuration

The configuration for the bandits application can be specified in a `config.yaml` file, as shown below. It configures the AWS resources needed. The DynamoDB tables are used to store metadata related to experiments, models and data joins. The `private_resource` section specifies the SageMaker instance types and counts used for training, evaluation and hosting. The SageMaker container image is used for the bandits application. This config file also contains algorithm and SageMaker-specific setups.  Note that all the data generated and used for the bandits application will be stored in `s3://sagemaker-{REGION}-{AWS_ACCOUNT_ID}/{experiment_id}/`.

Please make sure that the `num_arms` parameter in the config is equal to the number of actions in the client application (which is defined in the cell below).

The Docker image is defined here:  https://github.com/aws/sagemaker-rl-container/blob/master/vw/docker/8.7.0/Dockerfile

In [22]:
!pygmentize 'config.yaml'

resource:
  shared_resource:
    resources_cf_stack_name: "BanditsSharedResourceStack" # cloud formation stack
    experiment_db:
      table_name: "BanditsExperimentTable" # Dynamo table for status of an experiment
    model_db:
      table_name: "BanditsModelTable" # Dynamo table for status of all models trained
    join_db:
      table_name: "BanditsJoinTable" # Dynamo table for status of all joining job for reward ingestion
    iam_role:
      role_name: "BanditsIAMRole"
  private_resource:
    

In [23]:
config_file = 'config.yaml'
with open(config_file, 'r') as yaml_file:
    config = yaml.load(yaml_file, Loader=yaml.FullLoader)

# Additional permissions for the IAM role
The IAM role requires additional permissions for [AWS CloudFormation](https://aws.amazon.com/cloudformation/), [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), [Amazon Kinesis Data Firehose](https://aws.amazon.com/kinesis/data-firehose/) and [Amazon Athena](https://aws.amazon.com/athena/). Make sure the SageMaker role you are using has these permissions.

In [24]:
# from markdown_helper import *
# from IPython.display import Markdown

# display(Markdown(generate_help_for_experiment_manager_permissions(role)))

### Client Application (Environment)
The client application simulates a live environment that uses the bandit model to recommend a BERT model to classify review text submitted by the application user. 

The logic of reward generation resides in the client application.  We simulate the online learning loop with feedback.  There are 2 actions, 1 for each BERT model under test.  If the BERT model selected by the bandit model predicts the correct star rating, then the bandit model is rewarded with `reward=1`.  Otherwise, the bandit model receives `reward=0`.

The workflow of the client application is as follows:
- Our client application picks sample review text at random, which is sent to the bandit model (SageMaker endpoint) to recommend an action (BERT model) to classify the review text into star rating 1 through 5.
- The bandit model returns an action, an action probability, and an `event_id` for this prediction event.
- Since the client application uses the Amazon Customer Reviews Dataset, we know the true star rating for the review text.
- The client application compares the predicted and true star rating and assigns a reward to the bandit model using Amazon Kinesis, S3, and DynamoDB.  (The `event_id` is used to join the event and reward data.)

`event_id` is a unique identifier for each interaction. It is used to join inference data `<state, action, action_probability>` with the reward data. 
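The join on `event_id` can be pictured with a few lines of plain Python. This is only a sketch of the idea: in this example the actual join is performed by the `ExperimentManager`, and the function name and sample records below are hypothetical. The field names follow the joined-data format used by this notebook.

```python
def join_events_with_rewards(inference_events, reward_events):
    """Inner-join inference records with rewards on their shared event_id."""
    rewards_by_event_id = {r["event_id"]: r["reward"] for r in reward_events}
    joined = []
    for event in inference_events:
        event_id = event["event_id"]
        # Assumption: events that never received a reward are dropped.
        if event_id in rewards_by_event_id:
            joined.append({**event, "reward": rewards_by_event_id[event_id]})
    return joined


# Hypothetical inference records, as logged by the bandit endpoint.
inference_events = [
    {"event_id": 101, "action": 1, "action_prob": 0.9, "model_id": "model-a",
     "observation": [54], "sample_prob": 0.4},
    {"event_id": 102, "action": 2, "action_prob": 0.1, "model_id": "model-a",
     "observation": [7], "sample_prob": 0.7},
]
# Hypothetical reward records, as sent by the client application.
reward_events = [
    {"event_id": 101, "reward": 1},
    {"event_id": 999, "reward": 0},  # no matching inference event
]

joined = join_events_with_rewards(inference_events, reward_events)
print(joined)
```

Only event `101` appears in the result, because it is the only `event_id` present on both sides.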

In a later cell of this notebook, we illustrate how the client application interacts with the bandit model endpoint and receives the recommended action (BERT model).

### Step-by-step bandits model development

[**Bandit Experiment Manager**](./common/sagemaker_rl/orchestrator/workflow/manager/) is the top level class for all the Bandits/RL and continual learning workflows. Similar to the estimators in the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk), `ExperimentManager` contains methods for training, deployment and evaluation. It keeps track of the job status and reflects current progress in the workflow.

Start the application using the `ExperimentManager` class.

In [25]:
import time
timestamp = int(time.time())

bandit_experiment_name = 'bandits-{}'.format(timestamp)

# `ExperimentManager` will create an AWS CloudFormation Stack of additional resources needed for the Bandit experiment. 

In [26]:
from orchestrator.workflow.manager.experiment_manager import ExperimentManager

bandit_experiment_manager = ExperimentManager(config, experiment_id=bandit_experiment_name)

INFO:orchestrator.resource_manager:Creating a new CloudFormation stack for Shared Resources. You can always reuse this StackName in your other experiments
INFO:orchestrator.resource_manager:[
    {
        "ParameterKey": "IAMRoleName",
        "ParameterValue": "BanditsIAMRole",
        "UsePreviousValue": true,
        "ResolvedValue": "string"
    },
    {
        "ParameterKey": "ExperimentDbName",
        "ParameterValue": "BanditsExperimentTable",
        "UsePreviousValue": true,
        "ResolvedValue": "string"
    },
    {
        "ParameterKey": "ExperimentDbRCU",
        "ParameterValue": "5",
        "UsePreviousValue": true,
        "ResolvedValue": "string"
    },
    {
        "ParameterKey": "ExperimentDbWCU",
        "ParameterValue": "5",
        "UsePreviousValue": true,
        "ResolvedValue": "string"
    },
    {
        "ParameterKey": "ModelDbName",
        "ParameterValue": "BanditsModelTable",
        "UsePreviousValue": true,
        "ResolvedValue": "strin

In [27]:
try:
    bandit_experiment_manager.clean_resource(experiment_id=bandit_experiment_manager.experiment_id)
    bandit_experiment_manager.clean_table_records(experiment_id=bandit_experiment_manager.experiment_id)
except Exception:
    print('Ignore any errors.  Errors are OK.')



Ignore any errors.  Errors are OK.


In [28]:
bandit_experiment_manager = ExperimentManager(config, experiment_id=bandit_experiment_name)

INFO:orchestrator.resource_manager:Using Resources in CloudFormation stack named: BanditsSharedResourceStack for Shared Resources.


# Initialize the Bandit Model
To start a new experiment, we need to initialize the first bandit model or "policy" in reinforcement learning terminology.  

If we have historical data in the format `(state, action, action probability, reward)`, we can perform a "warm start" and learn the bandit model offline.  
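As an illustration of what such historical data can look like, the sketch below converts `(state, action, action probability, reward)` records into Vowpal Wabbit's text format for logged contextual bandit data, `action:cost:probability | features`, with `cost = 1 - reward`. Whether the `ExperimentManager` expects this exact format for a warm start is not stated here, so treat it as an assumption. The helper name `to_vw_cb_line` and the feature names are hypothetical.

```python
def to_vw_cb_line(features, action, action_prob, reward):
    """Format one logged interaction as a Vowpal Wabbit contextual bandit example."""
    cost = 1 - reward  # the bandit minimizes cost, so a reward of 1 becomes a cost of 0
    feature_str = " ".join(f"{name}:{value}" for name, value in features.items())
    return f"{action}:{cost}:{action_prob} | {feature_str}"


# Hypothetical historical records: (state, action, action probability, reward).
historical_data = [
    ({"review_length": 42, "verified": 1}, 2, 0.5, 1),
    ({"review_length": 7, "verified": 0}, 1, 0.5, 0),
]

vw_lines = [to_vw_cb_line(*record) for record in historical_data]
for line in vw_lines:
    print(line)
```

Each line records which action was taken, the cost that was observed, and the probability with which the logging policy chose that action. The probability is what allows an offline learner to correct for the logging policy's bias.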

However, let's assume we are starting with no historical data and initialize a random bandit model using `initialize_first_model()`.

In [29]:
bandit_experiment_manager.initialize_first_model()

INFO:orchestrator:Next Model name would be bandits-1611645944-model-id-1611645987
INFO:orchestrator:Start training job for model 'bandits-1611645944-model-id-1611645987''
INFO:orchestrator:Training job will be executed in 'SageMaker' mode


2021-01-26 07:26:28 Starting - Starting the training job..



.
2021-01-26 07:26:53 Starting - Launching requested ML instancesProfilerReport-1611645987: InProgress
...

ERROR:orchestrator:Failed to start new Model Training job for ModelId {next_model_to_train_id}
ERROR:orchestrator:An error occurred (ThrottlingException) when calling the DescribeTrainingJob operation (reached max retries: 4): Rate exceeded


# ^^ Ignore `Failed to delete: /tmp/...` message above.  This is OK. ^^

# Check Experiment State:  TRAINED
`training_state`: `TRAINED`

Note the `last_trained_model_id` variable.

In [30]:
from pprint import pprint

pprint(bandit_experiment_manager._jsonify())

{'evaluation_workflow_metadata': {'evaluation_state': None,
                                  'last_evaluation_job_id': None,
                                  'next_evaluation_job_id': None},
 'experiment_id': 'bandits-1611645944',
 'hosting_workflow_metadata': {'hosting_endpoint': None,
                               'hosting_state': None,
                               'last_hosted_model_id': None,
                               'next_model_to_host_id': None},
 'joining_workflow_metadata': {'joining_state': None,
                               'last_joined_job_id': None,
                               'next_join_job_id': None},
 'training_workflow_metadata': {'last_trained_model_id': 'bandits-1611645944-model-id-1611645987',
                                'next_model_to_train_id': None,
                                'training_state': 'TRAINED'}}


# Deploy the Bandit Model

Once training and evaluation are done, we can deploy the model.

In [31]:
print('Deploying newly-trained bandit model: {}'.format(bandit_experiment_manager.last_trained_model_id))


Deploying newly-trained bandit model: bandits-1611645944-model-id-1611645987


In [32]:
print('Deploying bandit model_id {}'.format(bandit_experiment_manager.last_trained_model_id))

bandit_experiment_manager.deploy_model(model_id=bandit_experiment_manager.last_trained_model_id) 


Deploying bandit model_id bandits-1611645944-model-id-1611645987


INFO:orchestrator:Model 'bandits-1611645944-model-id-1611645987' is ready to deploy.
INFO:orchestrator:No hosting endpoint found, creating a new hosting endpoint.
INFO:orchestrator.resource_manager:Successfully create S3 bucket 'sagemaker-us-east-1-835319576252' for storing sagemaker data
INFO:orchestrator.resource_manager:Creating firehose delivery stream...
INFO:orchestrator.resource_manager:Creating firehose delivery stream...
INFO:orchestrator.resource_manager:Creating firehose delivery stream...
INFO:orchestrator.resource_manager:Creating firehose delivery stream...
INFO:orchestrator.resource_manager:Creating firehose delivery stream...
INFO:orchestrator.resource_manager:Successfully created delivery stream 'bandits-1611645944'


-----------------!

# Check Experiment State:  DEPLOYED
`hosting_state`: `DEPLOYED`

The `last_trained_model_id` and `last_hosted_model_id` are now the same as we just deployed the bandit model.

In [33]:
from pprint import pprint

pprint(bandit_experiment_manager._jsonify())

{'evaluation_workflow_metadata': {'evaluation_state': None,
                                  'last_evaluation_job_id': None,
                                  'next_evaluation_job_id': None},
 'experiment_id': 'bandits-1611645944',
 'hosting_workflow_metadata': {'hosting_endpoint': 'arn:aws:sagemaker:us-east-1:835319576252:endpoint/bandits-1611645944',
                               'hosting_state': 'DEPLOYED',
                               'last_hosted_model_id': 'bandits-1611645944-model-id-1611645987',
                               'next_model_to_host_id': None},
 'joining_workflow_metadata': {'joining_state': None,
                               'last_joined_job_id': None,
                               'next_join_job_id': None},
 'training_workflow_metadata': {'last_trained_model_id': 'bandits-1611645944-model-id-1611645987',
                                'next_model_to_train_id': None,
                                'training_state': 'TRAINED'}}


# Initialize the Client Application

In [34]:
import csv
import json
import numpy as np
import pandas as pd

class ClientApp():
    def __init__(self, data, num_events, bandit_model, bert_model_map):
        self.bandit_model = bandit_model
        self.bert_model_map = bert_model_map
        
        self.num_actions = 2

        df_reviews = pd.read_csv(data, 
                                 delimiter='\t', 
                                 quoting=csv.QUOTE_NONE,
                                 compression='gzip')
        # Sample `num_events` reviews and keep only the review text and its label
        df_scrubbed = df_reviews[['review_body', 'star_rating']].sample(n=num_events)
        df_scrubbed = df_scrubbed.reset_index(drop=True)
        np_reviews = df_scrubbed.to_numpy()
        
        # Last column is the label, the rest are the features (contexts)
        self.labels = np_reviews[:, -1]
        self.contexts = np_reviews[:, :-1].tolist()

        self.optimal_rewards = [1]
        self.rewards_tmp_buffer = []
        self.joined_data_tmp_buffer = []
        self.all_joined_data_buffer = []
        
        self.action_count = {}

    def increment_action_count(self, action):
        # Track how many times each action (BERT model) has been selected
        self.action_count[action] = self.action_count.get(action, 0) + 1
                
    def choose_random_context(self):
        context_index = np.random.choice(len(self.contexts))
        context = self.contexts[context_index]
        return context_index, context    

    def clear_tmp_buffers(self):
        self.rewards_tmp_buffer.clear()
        self.joined_data_tmp_buffer.clear()

    def get_reward(self, 
                   context_index, 
                   action, 
                   event_id, 
                   bandit_model_id, 
                   action_prob, 
                   sample_prob, 
                   local_mode):

        context_to_predict = self.contexts[context_index][0]
    
        label = self.labels[context_index]
        
        bert_model = self.bert_model_map[action]

        self.increment_action_count(action)
        
        inputs = [
            {"review_body": context_to_predict},
        ]

        predicted_classes_str = bert_model.predict(inputs)
        predicted_classes = predicted_classes_str.splitlines()

        for predicted_class_json, input_data in zip(predicted_classes, inputs):
            predicted_class = json.loads(predicted_class_json)['predicted_label']
            print('Predicted star_rating: {}, actual star_rating {}, review_body "{}"'.format(predicted_class, label, input_data["review_body"]))
               
        # Reward is 1 only when the predicted star rating exactly matches the label
        if int(predicted_class) == int(label):
            reward = 1
        else:
            reward = 0

        if local_mode:
            json_blob = {
                         "reward": reward,
                         "event_id": event_id,
                         "action": action,
                         "action_prob": action_prob,
                         "model_id": bandit_model_id,
                         "observation": [context_index],
                         "sample_prob": sample_prob
                        }
            
            self.joined_data_tmp_buffer.append(json_blob)            
        else:
            json_blob = {
                         "reward": reward, 
                         "event_id": event_id
                        }
            self.rewards_tmp_buffer.append(json_blob)
        
        return reward
    

In [35]:
bandit_model = bandit_experiment_manager.predictor
print(bandit_model)

if not bandit_model:
    raise Exception("No predictor")

<orchestrator.resource_manager.Predictor object at 0x7f5445a5ea90>


In [36]:
client_app = ClientApp(data='./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz',
                       num_events=100,
                       bandit_model=bandit_model,
                       bert_model_map={
                         1: model1_predictor,
                         2: model2_predictor
                       })

Make sure that `num_arms` specified in `config.yaml` is equal to the total number of unique actions in the simulation application.

In [37]:
# print('Testing {} BERT models'.format(client_app.num_actions))

# assert client_app.num_actions == bandit_experiment_manager.config["algor"]["algorithms_parameters"]["num_arms"]
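A slightly more defensive version of the commented-out check above could look like the following sketch. It reads `num_arms` from the same config path used in that assertion and compares it with the number of BERT models in the action map. The helper name `check_num_arms` and the stand-in config and model map are hypothetical.

```python
def check_num_arms(config, bert_model_map):
    """Raise a descriptive error if the config and the client disagree on the action count."""
    num_arms = config["algor"]["algorithms_parameters"]["num_arms"]
    num_actions = len(bert_model_map)
    if num_arms != num_actions:
        raise ValueError(
            "config.yaml declares num_arms={} but the client application "
            "defines {} actions".format(num_arms, num_actions)
        )
    return num_arms


# Stand-in values for illustration; in the notebook these would be the loaded
# `config` and the `bert_model_map` passed to `ClientApp`.
example_config = {"algor": {"algorithms_parameters": {"num_arms": 2}}}
example_model_map = {1: "model1_predictor", 2: "model2_predictor"}

num_arms = check_num_arms(example_config, example_model_map)
print('Testing {} BERT models'.format(num_arms))
```

Raising a `ValueError` with both numbers in the message makes a mismatch easier to diagnose than a bare `assert`.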

In [38]:
# import time

# context_index, context = client_app.choose_random_context()
# action, event_id, bandit_model_id, action_prob, sample_prob = bandit_model.get_action(obs=[context_index])

# print('event ID: {}\nbert_model_id: {}\naction_probability: {}\nbandit_model_id: {}'.format(event_id, action, action_prob, bandit_model_id))

# Generate Sample Events to Test the Bandit `ExperimentManager`
This will generate sample contexts to pass as events to the bandit using the Amazon Customer Reviews Dataset.  The bandit model will recommend an action (BERT model) based on the context and current state of the bandit.  We will assign a reward using the star ratings from the Amazon Customer Reviews Dataset.

The client application generates a reward after receiving the recommended action and stores the tuple `<eventID, reward>` in S3. In this case, the reward is 1 if the recommended BERT model predicts the true star rating, and 0 otherwise. The SageMaker hosting endpoint saves all the inferences `<eventID, state, action, action probability>` to S3 using [**Kinesis Firehose**](https://aws.amazon.com/kinesis/data-firehose/). The `ExperimentManager` joins the reward with the state, action and action probability using [**Amazon Athena**](https://aws.amazon.com/athena/).

In [39]:
# local_mode = bandit_experiment_manager.local_mode

# num_events = 100 

# print('Generating {} sample events...'.format(num_events))

# for i in range(num_events):
#     context_index, context = client_app.choose_random_context()
#     action, event_id, bandit_model_id, action_prob, sample_prob = bandit_model.get_action(obs=[context_index])

#     reward = client_app.get_reward(context_index=context_index, 
#                                    action=action, 
#                                    event_id=event_id, 
#                                    bandit_model_id=bandit_model_id, 
#                                    action_prob=action_prob, 
#                                    sample_prob=sample_prob, 
#                                    local_mode=local_mode)    

# Create Bandit Model Training Data

Join `Event` and `Reward` data and upload to S3 in the following format:

```
{
 'reward': 0, # 0 if the model is wrong, +1 if the model is correct
 'event_id': 131181492351609994318271340276526219266, # unique event id
 'action': 1, # suggested action (bert_model_id 1 or 2)
 'action_prob': 0.9995, # probability with which the bandit model selected this action
 'model_id': 'bandits-1597631299-model-id-1597631304', # unique bandit_model_id
 'observation': [54], # feature (review_id)
 'sample_prob': 0.43410828171830174 
}
```
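Before training on joined data, it can help to check how traffic and rewards are distributed across the actions. The sketch below summarizes records in the format shown above. The function name and the sample records are hypothetical, and only the `action` and `reward` fields are used.

```python
from collections import defaultdict


def summarize_joined_data(records):
    """Return the event count and mean reward for each action in the joined data."""
    counts = defaultdict(int)
    reward_sums = defaultdict(int)
    for record in records:
        counts[record["action"]] += 1
        reward_sums[record["action"]] += record["reward"]
    return {
        action: {"count": counts[action],
                 "mean_reward": reward_sums[action] / counts[action]}
        for action in sorted(counts)
    }


# Hypothetical joined records (other fields omitted for brevity).
sample_records = [
    {"reward": 1, "event_id": 1, "action": 1},
    {"reward": 0, "event_id": 2, "action": 1},
    {"reward": 1, "event_id": 3, "action": 2},
]

summary = summarize_joined_data(sample_records)
print(summary)
```

A large gap in counts between actions is expected once the bandit starts exploiting, but an action with zero events would suggest that exploration has stopped or that `num_arms` is misconfigured.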


In [40]:
# if local_mode:
#     print('Using local mode with memory buffers.')
#     print()
#     print(client_app.joined_data_tmp_buffer)
#     bandit_experiment_manager.ingest_joined_data(client_app.joined_data_tmp_buffer)
# else:
#     print("Using production mode with Kinesis Firehose.  Waiting to flush to S3...")
#     print()
#     time.sleep(60) # Wait for firehose to flush data to S3
#     rewards_s3_prefix = bandit_experiment_manager.ingest_rewards(client_app.rewards_tmp_buffer)
#     bandit_experiment_manager.join(rewards_s3_prefix)

# Check Experiment Status:  JOINED
`joining_workflow_metadata: {'joining_state': 'SUCCEEDED'}`

In [41]:
# from pprint import pprint

# pprint(bandit_experiment_manager._jsonify())

# Review Bandit Model Training Data

In [42]:
# print('Bandit model training data {}'.format(bandit_experiment_manager.last_joined_job_train_data))

In [43]:
# from sagemaker.s3 import S3Downloader

# bandit_model_train_data_s3_uri = S3Downloader.list(bandit_experiment_manager.last_joined_job_train_data)[0]
# print(bandit_model_train_data_s3_uri)

In [44]:
# from sagemaker.s3 import S3Downloader

# bandit_model_train_data = S3Downloader.read_file(bandit_model_train_data_s3_uri)
# print(bandit_model_train_data)

# Train the Bandit Model

Now we can train a new model with newly collected experiences, and host the resulting model.

In [45]:
# print('Trained bandit model id {}'.format(bandit_experiment_manager.last_trained_model_id))

# bandit_experiment_manager.train_next_model(input_data_s3_prefix=bandit_experiment_manager.last_joined_job_train_data)

# Ignore ^^ `Failed to delete` Error Above ^^ 

# Deploy the Bandit Model

In [46]:
# print('Deploying bandit model id {}'.format(bandit_experiment_manager.last_trained_model_id))

# bandit_experiment_manager.deploy_model(model_id=bandit_experiment_manager.last_trained_model_id)

# Check Experiment Status:  DEPLOYED
`deploying_state`:  `SUCCEEDED`

In [47]:
# from pprint import pprint

# pprint(bandit_experiment_manager._jsonify())

# Continuously Train, Evaluate, and Deploy Bandit Models
The above cells explained the individual steps in the training workflow. To train a model to convergence, we will continually train the model based on data collected with client application interactions. We demonstrate the continual training and evaluation loop in a single cell below.

_**Train and Evaluate**_:
After every training cycle, we evaluate if the newly trained model (`last_trained_model_id`) would perform better than the one currently deployed (`last_hosted_model_id`) using a holdout evaluation dataset.  Details of the join, train, and evaluation steps are tracked in the `BanditsJoinTable` and `BanditsModelTable` DynamoDB tables.  When you have multiple experiments, you can compare them in the `BanditsExperimentTable` DynamoDB table.

_**Deploy**_: If the new bandit model is better than the current bandit model (based on offline evaluation), we will automatically deploy the new bandit model using a blue-green deployment to avoid downtime.
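
The deployment decision can be sketched as a small comparison. The helper name `should_deploy` is hypothetical and not part of the `ExperimentManager` API; it only mirrors the rule used in the loop below, where the evaluation score measures regret and a lower score is therefore better.

```python
def should_deploy(new_model_score: float, current_model_score: float) -> bool:
    """Return True when the new model's regret-style score is lower or equal."""
    return new_model_score <= current_model_score


# Hypothetical evaluation scores (lower is better)
print(should_deploy(new_model_score=0.18, current_model_score=0.25))  # True
print(should_deploy(new_model_score=0.30, current_model_score=0.25))  # False
```

Ties favor the new model in this sketch, matching the `<=` comparison in the training loop.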

In [None]:
do_evaluation = False
total_loops = 1 # Increase for higher accuracy
retrain_batch_size = 100 # Model is retrained after every `retrain_batch_size` data instances
rewards_list = []
event_list = []

all_joined_train_data_s3_uri_list = []
all_joined_eval_data_s3_uri_list = []

local_mode = bandit_experiment_manager.local_mode

start_time = time.time()
for loop_no in range(total_loops):
    print(f"""
    ################################
    # Incremental Training Loop {loop_no+1}
    ################################
    """)
    
    # Generate experiences and log them
    for i in range(retrain_batch_size):
        context_index, context = client_app.choose_random_context()
        action, event_id, bandit_model_id, action_prob, sample_prob = bandit_model.get_action(obs=[context_index])

        reward = client_app.get_reward(context_index=context_index, 
                                       action=action, 
                                       event_id=event_id, 
                                       bandit_model_id=bandit_model_id, 
                                       action_prob=action_prob, 
                                       sample_prob=sample_prob, 
                                       local_mode=local_mode)

        rewards_list.append(reward)
        
    # Publish rewards sum for this batch to CloudWatch for monitoring 
    bandit_experiment_manager.cw_logger.publish_rewards_for_simulation(
        bandit_experiment_manager.experiment_id,
        sum(rewards_list[-retrain_batch_size:])/retrain_batch_size
    )
    
    # Join the events and rewards data to use for the next bandit-model training job
    # Use 90% as the training dataset and 10% as the holdout evaluation dataset
    if local_mode:        
        bandit_experiment_manager.ingest_joined_data(client_app.joined_data_tmp_buffer,
                                                     ratio=0.90)
    else:
        # Kinesis Firehose => S3 => Athena
        print('Waiting for firehose to flush data to s3...')
        time.sleep(60) 
        rewards_s3_prefix = bandit_experiment_manager.ingest_rewards(client_app.rewards_tmp_buffer)
        bandit_experiment_manager.join(rewards_s3_prefix, ratio=0.90)
            
    # Train 
    bandit_experiment_manager.train_next_model(
        input_data_s3_prefix=bandit_experiment_manager.last_joined_job_train_data)

    all_joined_train_data_s3_uri_list.append(bandit_experiment_manager.last_joined_job_train_data)

    # Evaluate and/or deploy the new bandit model
    if do_evaluation:
        bandit_experiment_manager.evaluate_model(
            input_data_s3_prefix=bandit_experiment_manager.last_joined_job_eval_data,
            evaluate_model_id=bandit_experiment_manager.last_trained_model_id)

        eval_score_last_trained_model = bandit_experiment_manager.get_eval_score(
            evaluate_model_id=bandit_experiment_manager.last_trained_model_id,
            eval_data_path=bandit_experiment_manager.last_joined_job_eval_data)

        bandit_experiment_manager.evaluate_model(
            input_data_s3_prefix=bandit_experiment_manager.last_joined_job_eval_data,
            evaluate_model_id=bandit_experiment_manager.last_hosted_model_id) 

        all_joined_eval_data_s3_uri_list.append(bandit_experiment_manager.last_joined_job_eval_data)
    
        # Eval score is a measure of `regret`, so a lower eval score is better
        eval_score_last_hosted_model = bandit_experiment_manager.get_eval_score(
            evaluate_model_id=bandit_experiment_manager.last_hosted_model_id, 
            eval_data_path=bandit_experiment_manager.last_joined_job_eval_data)
    
        print('New bandit model evaluation score {}'.format(eval_score_last_trained_model))
        print('Current bandit model evaluation score {}'.format(eval_score_last_hosted_model))

        if eval_score_last_trained_model <= eval_score_last_hosted_model:
            print('Deploying new bandit model id {} in loop {}'.format(bandit_experiment_manager.last_trained_model_id, loop_no))
            bandit_experiment_manager.deploy_model(model_id=bandit_experiment_manager.last_trained_model_id)
        else:
            print('Not deploying bandit model id {} in loop {}'.format(bandit_experiment_manager.last_trained_model_id, loop_no))
    else:
        # Just deploy the new bandit model without evaluating against previous model
        print('Deploying new bandit model id {} in loop {}'.format(bandit_experiment_manager.last_trained_model_id, loop_no))
        bandit_experiment_manager.deploy_model(model_id=bandit_experiment_manager.last_trained_model_id)
    
    client_app.clear_tmp_buffers()
    
print(f'Total time taken to complete {total_loops} loops: {time.time() - start_time}')


    ################################
    # Incremental Training Loop 1
    ################################
    
Predicted star_rating: 3, actual star_rating 5, review_body "I have had other internet security suites that seemed to work well.  This one works as well as any I have purchased or downloaded.  I even forget I have it until it warns me of a problem.  Matched with their anti-virus software makes this the best I have found to keep my laptop running without problems."
Predicted star_rating: 3, actual star_rating 5, review_body "A very good product. Meets my expectations. Why do you require so many words? I hope this is enough. Well?"
Predicted star_rating: 3, actual star_rating 5, review_body "Have used TurboTax for Federal & PA state returns for many tears.  Since most forms are for same payers and payees as previous year, most of the work is carried over from the past. And that's one of the best features - namely, the ability to to transfer data from last year's returns to th

INFO:orchestrator.resource_manager:Successfully create S3 bucket 'sagemaker-us-east-1-835319576252' for storing sagemaker data
INFO:orchestrator:Waiting for reward data to be uploaded.
INFO:orchestrator:Successfully upload reward files to s3 bucket path s3://sagemaker-us-east-1-835319576252/bandits-1611645944/rewards_data/bandits-1611645944-1611646854/rewards-1611646854
INFO:orchestrator:Creating resource for joining job...
INFO:orchestrator:Successfully create S3 bucket 'sagemaker-us-east-1-835319576252' for athena queries
INFO:orchestrator:Started joining job...
INFO:orchestrator:Splitting data into train/evaluation set with ratio of 0.9
INFO:orchestrator:Joined data will be stored under s3://sagemaker-us-east-1-835319576252/bandits-1611645944/joined_data/bandits-1611645944-join-job-id-1611646854
INFO:orchestrator:Use last trained model bandits-1611645944-model-id-1611645987 as pre-trained model for training
INFO:orchestrator:Starting training job for ModelId 'bandits-1611645944-mode

2021-01-26 07:41:51 Starting - Starting the training job...
2021-01-26 07:42:15 Starting - Launching requested ML instances
ProfilerReport-1611646911: InProgress
...

ERROR:orchestrator:An error occurred (ThrottlingException) when calling the DescribeTrainingJob operation (reached max retries: 4): Rate exceeded
INFO:orchestrator:Model 'bandits-1611645944-model-id-1611646910' is ready to deploy.


Deploying new bandit model id bandits-1611645944-model-id-1611646910 in loop 0




# _Ignore Any Errors ^^ Above ^^_

# Check Experiment State:  EVALUATED

`evaluation_state`: `EVALUATED`

The same bandit_model_id will appear in both `last_trained_model_id` and `last_evaluation_job_id` fields below.

In [None]:
from pprint import pprint

pprint(bandit_experiment_manager._jsonify())

# Check Experiment Status:  JOINED
`joining_state`:  `SUCCEEDED`

In [None]:
from pprint import pprint

pprint(bandit_experiment_manager._jsonify())

# Check Experiment Status:  DEPLOYED
`deploying_state`:  `SUCCEEDED`

In [None]:
from pprint import pprint

pprint(bandit_experiment_manager._jsonify())

In [None]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/endpoints/{}">Bandit SageMaker REST Endpoint</a></b>'.format(region, bandit_experiment_name)))


# Copy All Joined Event and Reward Data from S3 to Local Notebook

In [None]:
import csv

import pandas as pd
from sagemaker.s3 import S3Downloader

all_joined_data_s3_uri_list = all_joined_train_data_s3_uri_list + all_joined_eval_data_s3_uri_list

df_list = []

for joined_data_s3_prefix_uri in all_joined_data_s3_uri_list:
    joined_data_s3_uri_file_path = './'

    joined_data_s3_uri = S3Downloader.list(joined_data_s3_prefix_uri)[0]    
    S3Downloader.download(joined_data_s3_uri, joined_data_s3_uri_file_path)
    joined_data_local_file_path = joined_data_s3_uri.split('/')[-1]

    df = pd.read_csv(joined_data_local_file_path, 
                     delimiter=',', 
                     quoting=csv.QUOTE_ALL)
    
    df_list.append(df)

In [None]:
all_joined_data_df = pd.concat(df_list, ignore_index=True)
all_joined_data_df.tail(10)

# Review Invocations of BERT Model 1 and 2

In [None]:
print('Total Invocations of BERT Model 1:  {}'.format(client_app.action_count[1]))
print('Total Invocations of BERT Model 2:  {}'.format(client_app.action_count[2]))

In [None]:
from datetime import datetime, timedelta

import boto3
import pandas as pd

cw = boto3.Session().client(service_name='cloudwatch', region_name=region)

def get_invocation_metrics_for_endpoint_variant(endpoint_name,
                                                namespace_name,
                                                metric_name,
                                                variant_name,
                                                start_time,
                                                end_time):
    metrics = cw.get_metric_statistics(
        Namespace=namespace_name,
        MetricName=metric_name,
        StartTime=start_time,
        EndTime=end_time,
        Period=60,
        Statistics=["Sum"],
        Dimensions=[
            {
                "Name": "EndpointName",
                "Value": endpoint_name
            },
            {
                "Name": "VariantName",
                "Value": variant_name
            }
        ]
    )

    if metrics['Datapoints']:
        return pd.DataFrame(metrics["Datapoints"])\
                .sort_values("Timestamp")\
                .set_index("Timestamp")\
                .drop("Unit", axis=1)\
                .rename(columns={"Sum": variant_name})
    else:
        return pd.DataFrame()
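
To see what the DataFrame chain in the function above produces without calling AWS, the sketch below applies the same transformations to a mock response. The `metrics` dictionary is invented; it only imitates the `Datapoints` shape that `get_metric_statistics` returns.

```python
from datetime import datetime

import pandas as pd

# Hypothetical response imitating the shape of get_metric_statistics output
metrics = {
    "Datapoints": [
        {"Timestamp": datetime(2021, 1, 26, 7, 45), "Sum": 12.0, "Unit": "None"},
        {"Timestamp": datetime(2021, 1, 26, 7, 44), "Sum": 30.0, "Unit": "None"},
    ]
}

# Same steps as in the function: sort by time, index by time,
# drop the unit column, and name the value column after the variant
invocations = (
    pd.DataFrame(metrics["Datapoints"])
    .sort_values("Timestamp")
    .set_index("Timestamp")
    .drop("Unit", axis=1)
    .rename(columns={"Sum": "AllTraffic"})
)
print(invocations)
```

The result has one row per 60-second period and a single column named after the variant, which is what the plotting cells later index with `['AllTraffic']`.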


# Gather BERT Model 1 Invocations Metrics
_Please be patient.  This will take 1-2 minutes._

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(75)

start_time = start_time or datetime.now() - timedelta(minutes=60)
end_time = datetime.now()
        
model1_endpoint_invocations = get_invocation_metrics_for_endpoint_variant(
                                    endpoint_name=model1_endpoint_name,
                                    namespace_name='AWS/SageMaker',                                   
                                    metric_name='Invocations',
                                    variant_name='AllTraffic',
                                    start_time=start_time, 
                                    end_time=end_time)

model1_endpoint_invocations

# Gather BERT Model 2 Invocations Metrics
_Please be patient.  This will take 1-2 minutes._

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

time.sleep(75)

start_time = start_time or datetime.now() - timedelta(minutes=60)
end_time = datetime.now()
        
model2_endpoint_invocations = get_invocation_metrics_for_endpoint_variant(
                                    endpoint_name=model2_endpoint_name,
                                    namespace_name='AWS/SageMaker',                                   
                                    metric_name='Invocations',
                                    variant_name='AllTraffic',
                                    start_time=start_time, 
                                    end_time=end_time)

model2_endpoint_invocations

In [None]:
rcParams['figure.figsize'] = 15, 10

# Use the row count (len) rather than DataFrame.size, which also multiplies by the column count
x1 = range(len(model1_endpoint_invocations))
y1 = model1_endpoint_invocations['AllTraffic']
plt.plot(x1, y1, label="BERT Model 1")

x2 = range(len(model2_endpoint_invocations))
y2 = model2_endpoint_invocations['AllTraffic']
plt.plot(x2, y2, label="BERT Model 2")

plt.legend(loc=0, prop={'size': 20})
plt.xlabel('Time (Minutes)')
plt.ylabel('Number of Invocations')

# Check the Invocation Metrics for the BERT Models

In [None]:
from IPython.display import display, HTML
    
display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#metricsV2:namespace=AWS/SageMaker;dimensions=EndpointName,VariantName;search={}">Model 1 SageMaker REST Endpoint</a></b>'.format(region, model1_endpoint_name)))


In [None]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#metricsV2:namespace=AWS/SageMaker;dimensions=EndpointName,VariantName;search={}">Model 2 SageMaker REST Endpoint</a></b>'.format(region, model2_endpoint_name)))


# Visualize Bandit Action Probabilities
This is the probability that the bandit model will choose a particular BERT model (action).

In [None]:
rcParams['figure.figsize'] = 15, 10

x1 = all_joined_data_df.query('action==1').index
y1 = all_joined_data_df.query('action==1').action_prob
plt.scatter(x1, y1, label="BERT Model 1")

x2 = all_joined_data_df.query('action==2').index
y2 = all_joined_data_df.query('action==2').action_prob
plt.scatter(x2, y2, label="BERT Model 2")

plt.legend(loc=3, prop={'size': 20})
plt.xlabel('Bandit Model Training Instances')
plt.ylabel('Action Probability')

In [None]:
print('Mean action probability for BERT Model 1: {}'.format(all_joined_data_df.query('action==1')['action_prob'].mean()))

In [None]:
print('Mean action probability for BERT Model 2: {}'.format(all_joined_data_df.query('action==2')['action_prob'].mean()))

# Visualize Bandit Sample Probabilities
Despite the action probability, we sample from all actions (BERT models).  Below is the sample probability for the chosen BERT model.

In [None]:
rcParams['figure.figsize'] = 15, 10

x1 = all_joined_data_df.query('action==1').index
y1 = all_joined_data_df.query('action==1').sample_prob
plt.scatter(x1, y1, label="BERT Model 1")

x2 = all_joined_data_df.query('action==2').index
y2 = all_joined_data_df.query('action==2').sample_prob
plt.scatter(x2, y2, label="BERT Model 2")

plt.legend(loc=0, prop={'size': 20})
plt.xlabel('Bandit Model Training Instances')
plt.ylabel('Sample Probability')

In [None]:
print('Mean sample probability for BERT Model 1: {}'.format(all_joined_data_df.query('action==1')['sample_prob'].mean()))

In [None]:
print('Mean sample probability for BERT Model 2: {}'.format(all_joined_data_df.query('action==2')['sample_prob'].mean()))

# Visualize Bandit Rewards

You can visualize the bandit-model training performance by plotting the rolling mean reward across client interactions.

Here rolling mean reward is calculated on the last `rolling_window` number of data instances, where each data instance corresponds to a single client interaction.
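
A small sketch of how the rolling mean behaves, using an invented reward list and a short window. The point to notice is that pandas returns `NaN` until a full window of data is available.

```python
import pandas as pd

# Hypothetical per-interaction rewards (1 = correct, 0 = wrong)
rewards = [0, 0, 1, 1, 1, 1]
window = 2

# Mean over the last `window` interactions at each position
rolling_mean = pd.Series(rewards).rolling(window).mean()
print(rolling_mean)
```

The first `window - 1` entries are `NaN`. With the settings used in this notebook (one loop of 100 interactions and `rolling_window = 100`), only the final row of `rewards_df` would hold a value, so increasing `total_loops` is needed to see a trend.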

In [None]:
rolling_window = 100

rcParams['figure.figsize'] = 15, 10
lwd = 5
cmap = plt.get_cmap('tab20')
colors=plt.cm.tab20(np.linspace(0, 1, 20))

rewards_df = pd.DataFrame(rewards_list, columns=['bandit']).rolling(rolling_window).mean()
#rewards_df['perfect'] = sum(client_app.optimal_rewards) / len(client_app.optimal_rewards)

rewards_df.tail(10)

In [None]:
rewards_df.plot(y=['bandit'],  #, 'perfect'], 
                linewidth=lwd)
plt.legend(loc=4, prop={'size': 20})
plt.tick_params(axis='both', which='major', labelsize=15)
plt.yticks([0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00])
plt.xticks([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])

plt.xlabel('Training Instances (Model is Updated Every %s Instances)' % retrain_batch_size, size=20)
plt.ylabel('Rolling {} Mean Reward'.format(rolling_window), size=30)
plt.grid()
plt.show()

In [None]:
rewards_df['bandit'].mean()

# Monitor the Bandit Model in CloudWatch

In [None]:
from markdown_helper import *
from IPython.display import Markdown

display(Markdown(bandit_experiment_manager.get_cloudwatch_dashboard_details()))

# Review the DynamoDB Tables and S3 Data

In [None]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/dynamodb/home?region={}#tables:selected=BanditsExperimentTable;tab=items">Bandits Experiment DynamoDB Table</a></b>'.format(region)))


In [None]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">Bandits Experiment S3 Data</a></b>'.format(bucket, bandit_experiment_manager.experiment_id, region)))


In [None]:
%store bandit_experiment_name

# Release Resources

We have three DynamoDB tables from the bandits application above.  To better maintain them, we should remove the related records if the experiment has finished. 

Only execute the cleanup cells below when you have finished the current experiment and want to delete everything associated with it.

_The CloudWatch metrics will be removed during this cleanup step._

In [None]:
# try:
#     sm.delete_endpoint(
#          EndpointName=bandit_experiment_name
#     )
# except:
#     pass

In [None]:
print('Cleaning up experiment_id {}'.format(bandit_experiment_manager.experiment_id))
try:
    bandit_experiment_manager.clean_resource(experiment_id=bandit_experiment_manager.experiment_id)
    bandit_experiment_manager.clean_table_records(experiment_id=bandit_experiment_manager.experiment_id)
except Exception as e:
    # Cleanup is best-effort: report the error and continue
    print('Ignoring cleanup error (errors are OK here): {}'.format(e))

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [None]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}