# Contextual Bandits with Amazon SageMaker RL

We demonstrate how you can manage your own contextual multi-armed bandit workflow on SageMaker using the built-in [Vowpal Wabbit (VW)](https://github.com/VowpalWabbit/vowpal_wabbit) container to train and deploy contextual bandit models. We show how to train models that interact with a live environment (here, a simulated client application) and how to continuously update them with efficient exploration.

### Why Contextual Bandits?

Wherever we look to personalize content for a user (content layout, ads, search, product recommendations, etc.), contextual bandits come in handy. Traditional personalization methods collect a training dataset, build a model and deploy it for generating recommendations. However, the training algorithm does not tell us how to collect this dataset, especially in a production system where generating poor recommendations leads to a loss of revenue. Contextual bandit algorithms help us collect this data in a strategic manner by trading off between exploiting known information and exploring recommendations which may yield higher benefits. The collected data is used to update the personalization model in an online manner. Therefore, contextual bandits help us train a personalization model while minimizing the impact of poor recommendations.

### What does this notebook contain?

To implement the exploration-exploitation strategy, we need an iterative training and deployment system that: (1) recommends an action using the contextual bandit model based on user context, (2) captures the implicit feedback over time and (3) continuously trains the model with incremental interaction data. In this notebook, we show how to set up the infrastructure needed for such an iterative learning system. While the example demonstrates a bandits application, these continual learning systems are useful more generally in dynamic scenarios where models need to be continually updated to capture recent trends in the data (e.g. tracking fraud behaviors based on detection mechanisms or tracking user interests over time).

In a typical supervised learning setup, the model is trained with a SageMaker training job and it is hosted behind a SageMaker hosting endpoint. The client application calls the endpoint for inference and receives a response. In bandits, the client application also sends the reward (a score assigned to each recommendation generated by the model) back for subsequent model training. These rewards will be part of the dataset for the subsequent model training. 
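The recommend, collect-reward, update cycle described above can be sketched with a toy epsilon-greedy bandit. This is only an illustration of the loop, not the algorithm the VW container runs: the `EpsilonGreedyBandit` class, the `simulate` helper and the two-context environment are all hypothetical names invented for this sketch.

```python
import random


class EpsilonGreedyBandit:
    """Toy contextual bandit: one running mean reward per (context, action) pair."""

    def __init__(self, num_actions, epsilon=0.1, seed=0):
        self.num_actions = num_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {}  # (context, action) -> number of observed rewards
        self.means = {}   # (context, action) -> running mean reward

    def get_action(self, context):
        """Return (action, probability with which that action was chosen)."""
        estimates = [self.means.get((context, a), 0.0) for a in range(self.num_actions)]
        greedy = max(range(self.num_actions), key=lambda a: estimates[a])
        if self.rng.random() < self.epsilon:
            action = self.rng.randrange(self.num_actions)  # explore
        else:
            action = greedy                                # exploit
        # Every action gets epsilon / K; the greedy action gets the remaining mass too.
        prob = self.epsilon / self.num_actions
        if action == greedy:
            prob += 1.0 - self.epsilon
        return action, prob

    def update(self, context, action, reward):
        """Incrementally update the running mean for (context, action)."""
        key = (context, action)
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        mean = self.means.get(key, 0.0)
        self.means[key] = mean + (reward - mean) / n


def simulate(num_rounds=2000, seed=0):
    """Run the loop against a hypothetical environment with two contexts."""
    best_action = {"a": 0, "b": 1}  # assumed ground truth, for illustration only
    env_rng = random.Random(seed)
    bandit = EpsilonGreedyBandit(num_actions=2, epsilon=0.1, seed=seed)
    total_reward = 0
    for _ in range(num_rounds):
        context = env_rng.choice(["a", "b"])                # (1) observe the state
        action, _prob = bandit.get_action(context)          # (1) recommend an action
        reward = 1 if action == best_action[context] else 0  # (2) capture feedback
        bandit.update(context, action, reward)              # (3) train incrementally
        total_reward += reward
    return bandit, total_reward / num_rounds
```

Calling `simulate()` returns the trained toy bandit and its mean reward per round. In the real workflow, step (3) is a batched SageMaker training job rather than a per-event update.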

# More Resources

* https://aws.amazon.com/blogs/machine-learning/power-contextual-bandits-using-continual-learning-with-amazon-sagemaker-rl/
* https://getstream.io/blog/introduction-contextual-bandits/
* https://github.com/VowpalWabbit/
* https://github.com/aws/sagemaker-rl-container
* [Bandit Experiment Manager](./common/sagemaker_rl/orchestrator/workflow/manager)

![](../../../img/multi_armed_bandit_maximize_reward.png)

![](../../../img/multi_armed_bandit_traffic_shift.png)

The contextual bandit training workflow is controlled by an experiment manager provided with this example. The client application (say a recommender system application) pings the SageMaker hosting endpoint that is serving the bandits model. The application sends the state (user features) as input and receives an action (recommendation) as a response. The client application sends the recommended action to the user and stores the received reward in S3. The SageMaker hosted endpoint also stores inference data (state and action) in S3. The experiment manager joins the inference data with rewards as they become available. The joined data is used to update the model with a SageMaker training job. The updated model is evaluated offline and deployed to SageMaker hosting endpoint if the model evaluation score improves upon prior models. 

Below is an overview of the subsequent cells in the notebook: 
* Configuration: this includes details related to SageMaker and other AWS resources needed for the bandits application. 
* IAM role setup: this creates appropriate execution role and shows how to add more permissions to the role, needed for specific AWS resources.
* Client application (Environment): this shows the simulated client application.
* Step-by-step bandits model development: 
 1. Model Initialization (random or warm-start) 
 2. Deploy the First Model 
 3. Initialize the Client Application 
 4. Reward Ingestion 
 5. Model Re-training and Re-deployment 
* Bandits model deployment with the end-to-end loop. 
* Visualization 
* Cleanup 

#### Local Mode

To facilitate experimentation, we provide a `local_mode` that runs the contextual bandit example using the SageMaker Notebook instance itself instead of SageMaker training and hosting instances. The workflow remains the same in `local_mode`, but runs much faster for small datasets. Hence, it is a useful tool for experimentation and debugging. However, it will not scale to production use cases with high throughput and large datasets. 

In `local_mode`, the training, evaluation and hosting are done with the SageMaker VW docker container. The join is not handled by SageMaker, and is done inside the client application. The rest of the textual explanation assumes that the notebook is run in SageMaker mode.

In [None]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

In [None]:
import yaml
import sys
import numpy as np
import time
import sagemaker

sys.path.append('common')
sys.path.append('common/sagemaker_rl')

from markdown_helper import *
from IPython.display import Markdown

### Configuration

The configuration for the bandits application can be specified in a `config.yaml` file as can be seen below. It configures the AWS resources needed. The DynamoDB tables are used to store metadata related to experiments, models and data joins. The `private_resource` section specifies the SageMaker instance types and counts used for training, evaluation and hosting. The SageMaker container image is used for the bandits application. This config file also contains algorithm and SageMaker-specific setups. Note that all the data generated and used for the bandits application will be stored in `s3://sagemaker-{REGION}-{AWS_ACCOUNT_ID}/{experiment_id}/`.

Please make sure that the `num_arms` parameter in the config is equal to the number of actions in the client application (which is defined in the cell below).
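As a minimal sketch, the check can be wrapped in a small helper that fails with a readable message. The helper name `check_num_arms` is hypothetical, and the key path `config["algor"]["algorithms_parameters"]["num_arms"]` is taken from the assertion used later in this notebook; adjust it if your `config.yaml` is laid out differently.

```python
def check_num_arms(config, num_actions):
    """Raise ValueError if the configured arm count differs from the client's action count."""
    # Key path assumed from the assertion used later in this notebook.
    num_arms = config["algor"]["algorithms_parameters"]["num_arms"]
    if num_arms != num_actions:
        raise ValueError(
            "num_arms in config is {} but the client application has {} actions".format(
                num_arms, num_actions
            )
        )
    return num_arms


# Hypothetical, stripped-down config used only to illustrate the call.
example_config = {"algor": {"algorithms_parameters": {"num_arms": 2}}}
checked_num_arms = check_num_arms(example_config, num_actions=2)
```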

The Docker image is defined here:  https://github.com/aws/sagemaker-rl-container/blob/master/vw/docker/8.7.0/Dockerfile

In [None]:
!pygmentize 'config.yaml'

In [None]:
config_file = 'config.yaml'
with open(config_file, 'r') as yaml_file:
    # safe_load avoids the deprecated (and, in PyYAML 6+, invalid) Loader-less yaml.load call
    config = yaml.safe_load(yaml_file)

# Additional permissions for the IAM role
The IAM role requires additional permissions for [AWS CloudFormation](https://aws.amazon.com/cloudformation/), [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), [Amazon Kinesis Data Firehose](https://aws.amazon.com/kinesis/data-firehose/) and [Amazon Athena](https://aws.amazon.com/athena/). Make sure the SageMaker role you are using has these permissions.

In [None]:
# display(Markdown(generate_help_for_experiment_manager_permissions(sagemaker_role)))

### Client application (Environment)
The client application simulates a live environment that uses the SageMaker bandits model to serve recommendations to users. The logic of reward generation resides in the client application. We simulate the online learning loop with feedback. The data consists of 5 classes (the star ratings). If the BERT model selected by the agent predicts the right class, the reward is 1. Otherwise, the agent obtains a reward of -1.

The workflow of the client application is as follows:
- The client application picks a context at random, which is sent to the SageMaker endpoint for retrieving an action.
- The SageMaker endpoint returns an action, the associated probability and an `event_id`.
- Since this simulator was generated from the dataset, we know the true class for that context. 
- The application reports the reward to the experiment manager using S3, along with the corresponding `event_id`.

`event_id` is a unique identifier for each interaction. It is used to join inference data `<state, action, action probability>` with the rewards. 

In a later cell of this notebook, where there exists a hosted endpoint, we illustrate how the client application interacts with the endpoint and gets the recommended action.

### Step-by-step bandits model development

[**Bandit Experiment Manager**](./common/sagemaker_rl/orchestrator/workflow/manager/) is the top level class for all the Bandits/RL and continual learning workflows. Similar to the estimators in the [Sagemaker Python SDK](https://github.com/aws/sagemaker-python-sdk), `ExperimentManager` contains methods for training, deployment and evaluation. It keeps track of the job status and reflects current progress in the workflow.

Start the application using the `ExperimentManager` class 

In [None]:
import time

timestamp = int(time.time())

experiment_name = 'bandits-{}'.format(timestamp)

# `ExperimentManager` will create an AWS CloudFormation Stack of additional resources needed for the Bandit experiment. 

In [None]:
from orchestrator.workflow.manager.experiment_manager import ExperimentManager

bandit_experiment_manager = ExperimentManager(config, experiment_id=experiment_name)

In [None]:
try:
    bandit_experiment_manager.clean_resource(experiment_id=bandit_experiment_manager.experiment_id)
    bandit_experiment_manager.clean_table_records(experiment_id=bandit_experiment_manager.experiment_id)
except Exception:
    print('Ignore any errors.  Errors are OK.')

In [None]:
bandit_experiment_manager = ExperimentManager(config, experiment_id=experiment_name)

# Initialize the Bandit Model
To start a new experiment, we need to initialize the first bandit model or "policy" in reinforcement learning terminology.  

If we have historical data in the format `(state, action, action probability, reward)`, we can perform a "warm start" and learn the bandit model offline.  
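For illustration, historical tuples can be rendered in Vowpal Wabbit's contextual bandit text format, `action:cost:probability | features`. The sketch below is an assumption-laden example, not the container's own preprocessing: the helper name `to_vw_cb_line` and the feature names `f0`, `f1`, ... are hypothetical, and the cost is taken as `1 - reward` to match the cost convention mentioned later in this notebook.

```python
def to_vw_cb_line(state, action, action_prob, reward):
    """Format one logged interaction as a VW contextual bandit example.

    VW's cb text format is `action:cost:probability | features`.
    Cost is computed as 1 - reward (an assumption of this sketch).
    """
    cost = 1 - reward
    # Hypothetical feature names f0, f1, ... paired with the state values.
    features = " ".join("f{}:{}".format(i, value) for i, value in enumerate(state))
    return "{}:{}:{} | {}".format(action, cost, action_prob, features)


# Hypothetical historical records: (state, action, action probability, reward)
history = [
    ([0.5, 1.0], 1, 0.8, 1),
    ([0.5], 2, 0.25, -1),
]
vw_lines = [to_vw_cb_line(*record) for record in history]
```

Each resulting line carries the logged action, its cost and the probability with which it was chosen, which is what off-policy learning from logged data needs.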

However, let's assume we are starting with no historical data and initialize a random bandit model using `initialize_first_model()`.

In [None]:
bandit_experiment_manager.initialize_first_model()

# ^^ Ignore `Failed to delete: /tmp/...` message above.  This is OK. ^^

# Check Experiment State:  TRAINED
`training_state`: `TRAINED`

Remember the `last_trained_model_id`.

In [None]:
bandit_experiment_manager._jsonify()

# Deploy the Bandit Model

Once training and evaluation are done, we can deploy the model.

In [None]:
import boto3
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)

In [None]:
# Check the model_id of the last model trained.
print('Deploying newly-trained bandit model: {}'.format(bandit_experiment_manager.last_trained_model_id))

In [None]:
print('Deploying bandit model_id {}'.format(bandit_experiment_manager.last_trained_model_id))

bandit_experiment_manager.deploy_model(model_id=bandit_experiment_manager.last_trained_model_id) 

# Check Experiment State
`hosting_state`: `DEPLOYED`

The `last_trained_model_id` and `last_hosted_model_id` are now the same as we just deployed the bandit model.

In [None]:
bandit_experiment_manager._jsonify()

# Initialize the Client Application

Now that the last trained model is hosted, the client application can send out the state, hit the endpoint, and receive the recommended action. There are 2 models that we want to test: model1 and model2. This translates to 2 actions that the bandit model will predict.



In [None]:
%store -r step_functions_pipeline_endpoint_name

In [None]:
model_1_endpoint_name = step_functions_pipeline_endpoint_name

In [None]:
print(model_1_endpoint_name)

In [None]:
client = boto3.client('sagemaker')
waiter = client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=model_1_endpoint_name)

In [None]:
import json
from sagemaker.tensorflow.serving import Predictor

model1 = Predictor(endpoint_name=model_1_endpoint_name,
                   sagemaker_session=sess,
                   content_type='application/json',
                   model_name='saved_model',
                   model_version=0)

In [None]:
reviews = ["This is not good."]

model1_predicted_classes = model1.predict(reviews)

for predicted_class, review in zip(model1_predicted_classes, reviews):
    print('[Predicted Star Rating: {}]'.format(predicted_class), review)

In [None]:
from IPython.core.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/endpoints/{}">Model 1 SageMaker REST Endpoint</a></b>'.format(region, model_1_endpoint_name)))


In [None]:
%store -r step_functions_pipeline_endpoint_name_random

In [None]:
model_2_endpoint_name = step_functions_pipeline_endpoint_name_random

In [None]:
print(model_2_endpoint_name)

In [None]:
waiter = sm.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=model_2_endpoint_name)

In [None]:
import json
from sagemaker.tensorflow.serving import Predictor

model2 = Predictor(endpoint_name=model_2_endpoint_name,
                   sagemaker_session=sess,
                   content_type='application/json',
                   model_name='saved_model',
                   model_version=0)

In [None]:
reviews = ["This is not good."]

model2_predicted_classes = model2.predict(reviews)

for predicted_class, review in zip(model2_predicted_classes, reviews):
    print('[Predicted Star Rating: {}]'.format(predicted_class), review)

In [None]:
from IPython.core.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/endpoints/{}">Model 2 SageMaker REST Endpoint</a></b>'.format(region, model_2_endpoint_name)))


In [None]:
import csv
import numpy as np

class Simulation():
    def __init__(self, data, num_events, bandit_model, bert_model_map):
        self.bandit_model = bandit_model
        self.bert_model_map = bert_model_map
        
        self.num_actions = 2

        df_reviews = pd.read_csv(data, 
                                 delimiter='\t', 
                                 quoting=csv.QUOTE_NONE,
                                 compression='gzip')
        df_scrubbed = df_reviews[['review_id', 'star_rating']].sample(n=num_events) #.query('star_rating == 1')
        df_scrubbed = df_scrubbed.reset_index()
        df_scrubbed.shape
        np_reviews = df_scrubbed.to_numpy()

        np_reviews = np.delete(np_reviews, 0, 1)
        
        # Last column is label, the rest are the features (contexts)
        labels = np_reviews[:, -1] #.astype(int)
        contexts = np_reviews[:, :-1]

        self.contexts = contexts
        self.labels = labels

        print(self.contexts)
        print(self.labels)
        
        self.optimal_rewards = [1]
        self.rewards_buffer = []
        self.joined_data_buffer = []

    def choose_random_context(self):
        context_index = np.random.choice(self.contexts.shape[0])
        context = self.contexts[context_index]
        return context_index, context    

    def clear_buffer(self):
        self.rewards_buffer.clear()
        self.joined_data_buffer.clear()    

    def get_reward(self, 
                   context_index, 
                   action, 
                   event_id, 
                   bandit_model_id, 
                   action_prob, 
                   sample_prob, 
                   local_mode):
#        print('context_index {}'.format(context_index))       
    
        context = [context_index]
    
#        print('context {}'.format(context))
        label = self.labels[context_index]
#        print('label {}'.format(label))
#        print('action {}'.format(action))
#        print('event_id {}'.format(event_id))
#        print('bandit_model_id {}'.format(bandit_model_id))
#        print('action_prob {}'.format(action_prob))

        bert_model = self.bert_model_map[action]
        bert_predicted_class = bert_model.predict([context])[0]
#        print('bert_predicted_class {}'.format(bert_predicted_class))
        
        if bert_predicted_class == label:
            reward = 1
        else:
            reward = -1
#        print('reward {}'.format(reward))

        if local_mode:
            json_blob = {"reward": reward,
                         "event_id": event_id,
                         "action": action,
                         "action_prob": action_prob,
                         "model_id": bandit_model_id,
                         "observation": context,
                         "sample_prob": sample_prob}
            
            self.joined_data_buffer.append(json_blob)
        else:
            json_blob = {"reward": reward, "event_id": event_id}
            self.rewards_buffer.append(json_blob)
        
        return reward
    

In [None]:
bandit_model = bandit_experiment_manager.predictor

sim_app = Simulation(data='./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz',
                     num_events=100,
                     bandit_model=bandit_model,
                     bert_model_map={
                                     1: model1,
                                     2: model2
                                    }
                    )

Make sure that `num_arms` specified in `config.yaml` is equal to the total unique actions in the simulation application.

In [None]:
print('Testing {} BERT models'.format(sim_app.num_actions))

assert sim_app.num_actions == bandit_experiment_manager.config["algor"]["algorithms_parameters"]["num_arms"]

In [None]:
import time

context_index, context = sim_app.choose_random_context()
action, event_id, bandit_model_id, action_prob, sample_prob = bandit_model.get_action(obs=context)

print('event ID: {}\nbert_model_id: {}\naction_probability: {}\nbandit_model_id: {}'.format(event_id, action, action_prob, bandit_model_id))

# Ingest Reward

The client application generates a reward after receiving the recommended action and stores the tuple `<eventID, reward>` in S3. In this case, the reward is 1 if the BERT model selected by the action predicts the true class, and -1 otherwise. The SageMaker hosting endpoint saves all the inferences `<eventID, state, action, action probability>` to S3 using [**Kinesis Firehose**](https://aws.amazon.com/kinesis/data-firehose/). The `ExperimentManager` joins the reward with state, action and action probability using [**Amazon Athena**](https://aws.amazon.com/athena/). 
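Conceptually, the join is an inner join on the event ID. The following in-memory sketch shows the idea only; the real join is run by the `ExperimentManager` over data in S3, and the helper name `join_events_with_rewards` and the sample records are hypothetical.

```python
def join_events_with_rewards(inferences, rewards):
    """Inner-join inference records with rewards on `event_id`."""
    reward_by_event = {r["event_id"]: r["reward"] for r in rewards}
    joined = []
    for record in inferences:
        event_id = record["event_id"]
        # Events whose reward has not arrived yet are left out of the training data.
        if event_id in reward_by_event:
            joined.append({**record, "reward": reward_by_event[event_id]})
    return joined


# Hypothetical inference log (as saved by the endpoint) and reward log (as saved by the client).
inference_log = [
    {"event_id": 101, "observation": [54], "action": 1, "action_prob": 0.9},
    {"event_id": 102, "observation": [17], "action": 2, "action_prob": 0.1},
    {"event_id": 103, "observation": [3], "action": 1, "action_prob": 0.9},
]
reward_log = [
    {"event_id": 101, "reward": 1},
    {"event_id": 103, "reward": -1},
]
joined_records = join_events_with_rewards(inference_log, reward_log)
```

Event 102 has no reward yet, so it does not appear in `joined_records`.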

In [None]:
local_mode = bandit_experiment_manager.local_mode
num_events = 100 # collect events
print('Collecting {} events...'.format(num_events))

# Generate experiences and log them
for i in range(num_events):
    context_index, context = sim_app.choose_random_context()
    action, event_id, bandit_model_id, action_prob, sample_prob = bandit_model.get_action(obs=context)

    # print('Context Index {}'.format(context_index))
    # print('Context {}'.format(context))    
    # print('Action (bert model to invoke) {}'.format(action))
    # print('Event ID {}'.format(event_id))
    # print('Bandit Model ID {}'.format(bandit_model_id))
    # print('Action Probability {}'.format(action_prob))
    # print('Sample Probability {}'.format(sample_prob))
    
    reward = sim_app.get_reward(context_index=context_index, 
                                action=action, 
                                event_id=event_id, 
                                bandit_model_id=bandit_model_id, 
                                action_prob=action_prob, 
                                sample_prob=sample_prob, 
                                local_mode=local_mode)
    

# Create Bandit Model Training Data

Join the `Event` and `Reward` data and upload the result to S3 in the following format:

```
{
 'reward': -1, # -1 if the model is wrong, +1 if the model is correct
 'event_id': 131181492351609994318271340276526219266, # unique event id
 'action': 1, # suggested action (bert_model_id 1 or 2)
 'action_prob': 0.9995, # probability with which the bandit model chose this action
 'model_id': 'bandits-1597631299-model-id-1597631304', # unique bandit_model_id
 'observation': [54], # feature (review_id)
 'sample_prob': 0.43410828171830174 
}
```


In [None]:
if local_mode:
    print('Using local mode with memory buffers.')
    print(sim_app.joined_data_buffer)
    bandit_experiment_manager.ingest_joined_data(sim_app.joined_data_buffer)
else:
    print("Using production mode with Kinesis Firehose.  Waiting to flush to S3...")
    time.sleep(60) # Wait for firehose to flush data to S3
    rewards_s3_prefix = bandit_experiment_manager.ingest_rewards(sim_app.rewards_buffer)
    bandit_experiment_manager.join(rewards_s3_prefix)

# Check Experiment Status:  Joined
`joining_state`:  `SUCCEEDED`

In [None]:
bandit_experiment_manager._jsonify()

# Review Bandit Model Training Data

In [None]:
print('Bandit model training data {}'.format(bandit_experiment_manager.last_joined_job_train_data))

In [None]:
from sagemaker.s3 import S3Downloader

bandit_model_train_data_s3_uri = S3Downloader.list(bandit_experiment_manager.last_joined_job_train_data)[0]
print(bandit_model_train_data_s3_uri)

In [None]:
from sagemaker.s3 import S3Downloader

bandit_model_train_data = S3Downloader.read_file(bandit_model_train_data_s3_uri)
print(bandit_model_train_data)

# Train the Bandit Model

Now we can train a new model with newly collected experiences, and host the resulting model.

In [None]:
print('Trained bandit model id {}'.format(bandit_experiment_manager.last_trained_model_id))

bandit_experiment_manager.train_next_model(input_data_s3_prefix=bandit_experiment_manager.last_joined_job_train_data)

# Ignore ^^ `Failed to delete` Error Above ^^ 

# Deploy the Bandit Model

In [None]:
print('Deploying bandit model id {}'.format(bandit_experiment_manager.last_trained_model_id))

bandit_experiment_manager.deploy_model(model_id=bandit_experiment_manager.last_trained_model_id)

# Continuously Deploy New Bandit Models

The above cells explained the individual steps in the training workflow. To train a model to convergence, we will continually train the model based on data collected with client application interactions. We demonstrate the continual training loop in a single cell below.

We include an evaluation step in each loop before deployment to compare the model just trained (`last_trained_model_id`) against the model that is currently hosted (`last_hosted_model_id`).
Details of each joining and training job can be tracked in `join_db` and `model_db` respectively. `model_db` also stores the evaluation scores. When you have multiple experiments, you can check their status in `experiment_db`.

# Evaluate Current Model Against Historical Model

After every training cycle, we evaluate if the newly trained model is better than the one currently deployed. Using the evaluation dataset, we evaluate how the new model would perform compared to the model that is currently deployed. SageMaker RL supports offline evaluation by performing counterfactual analysis (CFA). By default, we apply the [**doubly robust (DR) estimation**](https://arxiv.org/pdf/1103.4601.pdf) method. The bandit policy tries to minimize the cost (1-reward) in this case, so a smaller evaluation score indicates better policy performance.
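To make the idea of counterfactual evaluation concrete, here is a minimal sketch of two off-policy estimators for a deterministic target policy: inverse propensity scoring (IPS) and the doubly robust estimator from the paper linked above. It is a textbook-style illustration, not the evaluation code inside the container; the function names, the `cost_model` callable and the logged records are assumptions made for this example.

```python
def ips_estimate(logged, target_policy):
    """Inverse propensity scoring estimate of the target policy's mean cost.

    `logged` holds (context, action, action_prob, cost) tuples collected
    by the logging policy.
    """
    total = 0.0
    for context, action, prob, cost in logged:
        if target_policy(context) == action:
            total += cost / prob
    return total / len(logged)


def doubly_robust_estimate(logged, target_policy, cost_model):
    """Doubly robust estimate: model prediction plus an IPS-weighted correction."""
    total = 0.0
    for context, action, prob, cost in logged:
        target_action = target_policy(context)
        estimate = cost_model(context, target_action)
        if target_action == action:
            # Correct the model's error on the action that was actually logged.
            estimate += (cost - cost_model(context, action)) / prob
        total += estimate
    return total / len(logged)


# Hypothetical log: uniform logging policy over actions {0, 1};
# cost is 0 when the action matches the context's best action, 1 otherwise.
logged_data = [
    ("a", 0, 0.5, 0.0),
    ("a", 1, 0.5, 1.0),
    ("b", 0, 0.5, 1.0),
    ("b", 1, 0.5, 0.0),
]


def best_policy(context):
    return {"a": 0, "b": 1}[context]


def always_zero_policy(context):
    return 0


def zero_cost_model(context, action):
    # A deliberately uninformative cost model: DR then reduces to IPS.
    return 0.0
```

With an uninformative cost model the doubly robust estimate coincides with IPS. A better cost model reduces variance while the correction term keeps the estimate unbiased when the logged probabilities are correct.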

_If you want the loops to finish faster, you can skip the evaluation by setting `do_evaluation=False` in the cell below._


In [None]:
do_evaluation = True

In [None]:
display(Markdown(bandit_experiment_manager.get_cloudwatch_dashboard_details()))

In [None]:
start_time = time.time()
total_loops = 5 # Increase for higher accuracy
retrain_batch_size = 100 # Model will be retrained after every `retrain_batch_size` data instances
rewards_list = []

event_list = []

local_mode = bandit_experiment_manager.local_mode
for loop_no in range(total_loops):
    print(f"""
    #############
    #### Loop {loop_no+1}
    #############
    """)
    
    # Generate experiences and log them
    for i in range(retrain_batch_size):
        context_index, context = sim_app.choose_random_context()
        action, event_id, bandit_model_id, action_prob, sample_prob = bandit_model.get_action(obs=context)

        print('Context Index {}'.format(context_index))
        print('Context {}'.format(context))    
        print('Action (bert model to invoke) {}'.format(action))
        print('Event ID {}'.format(event_id))
        print('Bandit Model ID {}'.format(bandit_model_id))
        print('Action Probability {}'.format(action_prob))
        print('Sample Probability {}'.format(sample_prob))

        # reward = sim_app.get_reward(user_id, action, event_id, bandit_model_id, action_prob, sample_prob, local_mode)
        reward = sim_app.get_reward(context_index=context_index, 
                                    action=action, 
                                    event_id=event_id, 
                                    bandit_model_id=bandit_model_id, 
                                    action_prob=action_prob, 
                                    sample_prob=sample_prob, 
                                    local_mode=local_mode)

        rewards_list.append(reward)  
        
    # Publish the mean reward for this batch to CloudWatch for monitoring
    bandit_experiment_manager.cw_logger.publish_rewards_for_simulation(
        bandit_experiment_manager.experiment_id,
        sum(rewards_list[-retrain_batch_size:])/retrain_batch_size
    )
    
    # Join the events and rewards data to use for the next bandit-model training job
    if local_mode:
        bandit_experiment_manager.ingest_joined_data(sim_app.joined_data_buffer,
                                                     ratio=0.90)
    else:
        # Kinesis Firehose => S3 => Athena
        print("Waiting for firehose to flush data to s3...")
        time.sleep(60) 
        rewards_s3_prefix = bandit_experiment_manager.ingest_rewards(sim_app.rewards_buffer)
        bandit_experiment_manager.join(rewards_s3_prefix, ratio=0.90)
    
    # Train 
    bandit_experiment_manager.train_next_model(
        input_data_s3_prefix=bandit_experiment_manager.last_joined_job_train_data)

    # Evaluate and/or deploy the new bandit model
    if do_evaluation:
        bandit_experiment_manager.evaluate_model(
            input_data_s3_prefix=bandit_experiment_manager.last_joined_job_eval_data,
            evaluate_model_id=bandit_experiment_manager.last_trained_model_id)

        eval_score_last_trained_model = bandit_experiment_manager.get_eval_score(
            evaluate_model_id=bandit_experiment_manager.last_trained_model_id,
            eval_data_path=bandit_experiment_manager.last_joined_job_eval_data)

        bandit_experiment_manager.evaluate_model(
            input_data_s3_prefix=bandit_experiment_manager.last_joined_job_eval_data,
            evaluate_model_id=bandit_experiment_manager.last_hosted_model_id)

        eval_score_last_hosted_model = bandit_experiment_manager.get_eval_score(
            evaluate_model_id=bandit_experiment_manager.last_hosted_model_id, 
            eval_data_path=bandit_experiment_manager.last_joined_job_eval_data)
    
        if eval_score_last_trained_model <= eval_score_last_hosted_model:
            bandit_experiment_manager.deploy_model(model_id=bandit_experiment_manager.last_trained_model_id)
        else:
            print('Not deploying model in loop {}'.format(loop_no + 1))
    else:
        # Just deploy the new bandit model without evaluating against previous model
        bandit_experiment_manager.deploy_model(model_id=bandit_experiment_manager.last_trained_model_id)
    
    sim_app.clear_buffer()
    
print(f"Total time taken to complete {total_loops} loops: {time.time() - start_time}")

# Review Bandit Model Joined Event and Reward Data

In [None]:
print('Bandit model event and reward data {}'.format(bandit_experiment_manager.last_joined_job_eval_data))

In [None]:
from sagemaker.s3 import S3Downloader

bandit_model_joined_event_and_reward_data_s3_uri = S3Downloader.list(bandit_experiment_manager.last_joined_job_eval_data)[0]
print(bandit_model_joined_event_and_reward_data_s3_uri)

In [None]:
from sagemaker.s3 import S3Downloader

bandit_model_joined_event_and_reward_data = S3Downloader.read_file(bandit_model_joined_event_and_reward_data_s3_uri)
print(bandit_model_joined_event_and_reward_data)

# Copy Joined Event and Reward Data from S3 to Local Notebook

In [None]:
bandit_model_joined_event_and_reward_data_file_path = './'
bandit_model_joined_event_and_reward_data = S3Downloader.download(bandit_model_joined_event_and_reward_data_s3_uri, bandit_model_joined_event_and_reward_data_file_path)


In [None]:
bandit_model_joined_event_and_reward_data_local_file_path = bandit_model_joined_event_and_reward_data_s3_uri.split('/')[-1]

df_joined_events_and_rewards = pd.read_csv(bandit_model_joined_event_and_reward_data_local_file_path, 
                                    delimiter=',', 
                                    quoting=csv.QUOTE_ALL)
df_joined_events_and_rewards.query('action==1')

# Check Experiment State

`evaluation_state`: `EVALUATED`

The same bandit_model_id will appear in both `last_trained_model_id` and `last_evaluation_job_id` fields below.

In [None]:
bandit_experiment_manager._jsonify()

# Visualize the Bandit Rewards

You can visualize the bandit-model training performance by plotting the rolling mean reward across client interactions.

Here rolling mean reward is calculated on the last `rolling_window` number of data instances, where each data instance corresponds to a single client interaction.
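As a small illustration of what the rolling mean computes, here is a dependency-free sketch. The helper name `rolling_mean` is hypothetical. It returns `None` until the window is full, much like pandas' `rolling(window).mean()` yields `NaN` for the first `window - 1` entries.

```python
from collections import deque


def rolling_mean(values, window):
    """Mean over the last `window` values; None until the window is full."""
    buffer = deque(maxlen=window)
    running_sum = 0.0
    means = []
    for value in values:
        if len(buffer) == window:
            running_sum -= buffer[0]  # value that is about to be evicted
        buffer.append(value)
        running_sum += value
        means.append(running_sum / window if len(buffer) == window else None)
    return means


# Rewards of +1 / -1 as produced by the simulated client application.
example_rolling = rolling_mean([1, -1, 1, 1], window=2)
```

A rising rolling mean indicates that the bandit is choosing the better-performing model more often as training progresses.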

In [None]:
import pandas as pd
from pylab import rcParams
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

In [None]:
def get_mean_reward(reward_lst, batch_size=retrain_batch_size):
    # Cumulative mean reward per data instance; assumes each entry of `reward_lst` is a per-batch reward sum
    mean_rew = list()
    for r in range(len(reward_lst)):
        mean_rew.append(sum(reward_lst[:r+1]) * 1.0 / ((r+1)*batch_size))
    return mean_rew

In [None]:
rolling_window = 100

rcParams['figure.figsize'] = 15, 10
lwd = 5
cmap = plt.get_cmap('tab20')
colors=plt.cm.tab20(np.linspace(0, 1, 20))

rewards_df = pd.DataFrame(rewards_list, columns=['bandit']).rolling(rolling_window).mean()
#rewards_df['oracle'] = sum(sim_app.optimal_rewards) / len(sim_app.optimal_rewards)

rewards_df.plot(y=['bandit'], # 'oracle'], 
                linewidth=lwd)
plt.legend(loc=4, prop={'size': 20})
plt.tick_params(axis='both', which='major', labelsize=15)
plt.xlabel('Data instances (models were updated every %s data instances)' % retrain_batch_size, size=20)
plt.ylabel('Rolling {} Mean Reward'.format(rolling_window), size=30)
plt.grid()
plt.show()

# Check the Invocation Metrics for the BERT Models

In [None]:
from IPython.core.display import display, HTML
    
display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#metricsV2:namespace=AWS/SageMaker;dimensions=EndpointName,VariantName;search={}">Model 1 SageMaker REST Endpoint</a></b>'.format(region, model_1_endpoint_name)))


In [None]:
from IPython.core.display import display, HTML

display(HTML('<b>Review <a target="blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#metricsV2:namespace=AWS/SageMaker;dimensions=EndpointName,VariantName;search={}">Model 2 SageMaker REST Endpoint</a></b>'.format(region, model_2_endpoint_name)))


# Analyze the Reward Data Across All Models

In [None]:
rewards_df

In [None]:
rewards_df.bandit.mean()

# Clean Up

We have three DynamoDB tables (experiment, join, model) from the bandits application above (e.g. `experiment_id='bandits-...'`). To better maintain them, we should remove the related records if the experiment has finished. Besides, having an endpoint running will incur costs. Therefore, we delete these components as part of the clean up process.

Only execute the clean up cells below when you've finished the current experiment and want to deprecate everything associated with it. 

_The CloudWatch metrics will be removed during this cleanup step._

In [None]:
# print('Cleaning up experiment_id {}'.format(bandit_experiment_manager.experiment_id))
# try:
#     bandit_experiment_manager.clean_resource(experiment_id=bandit_experiment_manager.experiment_id)
#     bandit_experiment_manager.clean_table_records(experiment_id=bandit_experiment_manager.experiment_id)
#     sim_app.clear_buffer()
# except:
#     print('Ignore any errors.  Errors are OK.')