# Contextual Bandits with Amazon SageMaker RL

We demonstrate how you can manage your own contextual multi-armed bandit workflow on SageMaker using the built-in [Vowpal Wabbit (VW)](https://github.com/VowpalWabbit/vowpal_wabbit) container to train and deploy contextual bandit models. We show how to train these models that interact with a live environment (using a simulated client application) and continuously update the model with efficient exploration.

### Why Contextual Bandits?

Wherever we look to personalize content for a user (content layout, ads, search, product recommendations, etc.), contextual bandits come in handy. Traditional personalization methods collect a training dataset, build a model and deploy it for generating recommendations. However, the training algorithm does not inform us on how to collect this dataset, especially in a production system where generating poor recommendations lead to loss of revenue. Contextual bandit algorithms help us collect this data in a strategic manner by trading off between exploiting known information and exploring recommendations which may yield higher benefits. The collected data is used to update the personalization model in an online manner. Therefore, contextual bandits help us train a personalization model while minimizing the impact of poor recommendations.

### What does this notebook contain?

To implement the exploration-exploitation strategy, we need an iterative training and deployment system that: (1) recommends an action using the contextual bandit model based on user context, (2) captures the implicit feedback over time and (3) continuously trains the model with incremental interaction data. In this notebook, we show how to setup the infrastructure needed for such an iterative learning system. While the example demonstrates a bandits application, these continual learning systems are useful more generally in dynamic scenarios where models need to be continually updated to capture the recent trends in the data (e.g. tracking fraud behaviors based on detection mechanisms or tracking user interests over time). 

In a typical supervised learning setup, the model is trained with a SageMaker training job and it is hosted behind a SageMaker hosting endpoint. The client application calls the endpoint for inference and receives a response. In bandits, the client application also sends the reward (a score assigned to each recommendation generated by the model) back for subsequent model training. These rewards will be part of the dataset for the subsequent model training. 

# Based on this blog post:

https://aws.amazon.com/blogs/machine-learning/power-contextual-bandits-using-continual-learning-with-amazon-sagemaker-rl/

![](../../../img/multi_armed_bandit_maximize_reward.png)

![](../../../img/multi_armed_bandit_traffic_shift.png)

The contextual bandit training workflow is controlled by an experiment manager provided with this example. The client application (say a recommender system application) pings the SageMaker hosting endpoint that is serving the bandits model. The application sends the state (user features) as input and receives an action (recommendation) as a response. The client application sends the recommended action to the user and stores the received reward in S3. The SageMaker hosted endpoint also stores inference data (state and action) in S3. The experiment manager joins the inference data with rewards as they become available. The joined data is used to update the model with a SageMaker training job. The updated model is evaluated offline and deployed to SageMaker hosting endpoint if the model evaluation score improves upon prior models. 

Below is an overview of the subsequent cells in the notebook: 
* Configuration: this includes details related to SageMaker and other AWS resources needed for the bandits application. 
* IAM role setup: this creates appropriate execution role and shows how to add more permissions to the role, needed for specific AWS resources.
* Client application (Environment): this shows the simulated client application.
* Step-by-step bandits model development: 
 1. Model Initialization (random or warm-start) 
 2. Deploy the First Model 
 3. Initialize the Client Application 
 4. Reward Ingestion 
 5. Model Re-training and Re-deployment 
* Bandits model deployment with the end-to-end loop. 
* Visualization 
* Cleanup 

#### Local Mode

To facilitate experimentation, we provide a `local_mode` that runs the contextual bandit example using the SageMaker Notebook instance itself instead of SageMaker training and hosting instances. The workflow remains the same in `local_mode`, but runs much faster for small datasets. Hence, it is a useful tool for experimentation and debugging. However, it will not scale to production use cases with high throughput and large datasets. 

In `local_mode`, the training, evaluation and hosting is done with the SageMaker VW docker container. The join is not handled by SageMaker, and is done inside the client application. The rest of the textual explanation assumes that the notebook is run in SageMaker mode.

In [3]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

In [4]:
import yaml
import sys
import numpy as np
import time
import sagemaker

sys.path.append('common')
sys.path.append('common/sagemaker_rl')

from markdown_helper import *
from IPython.display import Markdown

### Configuration

The configuration for the bandits application can be specified in a `config.yaml` file as can be seen below. It configures the AWS resources needed. The DynamoDB tables are used to store metadata related to experiments, models and data joins. The `private_resource` specifices the SageMaker instance types and counts used for training, evaluation and hosting. The SageMaker container image is used for the bandits application. This config file also contains algorithm and SageMaker-specific setups.  Note that all the data generated and used for the bandits application will be stored in `s3://sagemaker-{REGION}-{AWS_ACCOUNT_ID}/{experiment_id}/`.

Please make sure that the `num_arms` parameter in the config is equal to the number of actions in the client application (which is defined in the cell below).

The Docker image is defined here:  https://github.com/aws/sagemaker-rl-container/blob/master/vw/docker/8.7.0/Dockerfile

In [5]:
!pygmentize 'config.yaml'

[94mresource[39;49;00m:
  [94mshared_resource[39;49;00m:
    [37m# cloud formation stack[39;49;00m
    [94mresources_cf_stack_name[39;49;00m: [33m"[39;49;00m[33mBanditsSharedResourceStack[39;49;00m[33m"[39;49;00m
    [37m# Dynamo table for status of an experiment[39;49;00m
    [94mexperiment_db[39;49;00m:
      [94mtable_name[39;49;00m: [33m"[39;49;00m[33mBanditsExperimentTable[39;49;00m[33m"[39;49;00m
    [37m# Dynamo table for status of all models trained[39;49;00m
    [94mmodel_db[39;49;00m:
      [94mtable_name[39;49;00m: [33m"[39;49;00m[33mBanditsModelTable[39;49;00m[33m"[39;49;00m
    [37m# Dynamo table for status of all joining job for reward ingestion[39;49;00m
    [94mjoin_db[39;49;00m:
      [94mtable_name[39;49;00m: [33m"[39;49;00m[33mBanditsJoinTable[39;49;00m[33m"[39;49;00m
    [94miam_role[39;49;00m:
      [94mrole_name[39;49;00m: [33m"[39;49;00m[33mBanditsIAMRole[39;49;00m[33m"[39;49;00m
  [94mpr

In [6]:
config_file = 'config.yaml'
with open(config_file, 'r') as yaml_file:
    config = yaml.load(yaml_file)

  app.launch_new_instance()


# Additional permissions for the IAM role
IAM role requires additional permissions for [AWS CloudFormation](https://aws.amazon.com/cloudformation/), [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), [Amazon Kinesis Data Firehose](https://aws.amazon.com/kinesis/data-firehose/) and [Amazon Athena](https://aws.amazon.com/athena/). Make sure the SageMaker role you are using has the permissions.

In [7]:
# display(Markdown(generate_help_for_experiment_manager_permissions(sagemaker_role)))

### Client application (Environment)
The client application simulates a live environment that uses the SageMaker bandits model to serve recommendations to users. The logic of reward generation resides in the client application. We simulate the online learning loop with feedback.  The data consists of 5 classes, and if the agent selects the right class, then reward is 1.  Otherwise, the agent obtains a reward 0.

The workflow of the client application is as follows:
- The client application picks a context at random, which is sent to the SageMaker endpoint for retrieving an action.
- SageMaker endpoint returns an action, associated probability and `event_id`.
- Since this simulator was generated from the dataset, we know the true class for that context. 
- The application reports the reward to the experiment manager using S3, along with the corresponding `event_id`.

`event_id` is a unique identifier for each interaction. It is used to join inference data `<state, action, action probability>` with the rewards. 

In a later cell of this notebook, where there exists a hosted endpoint, we illustrate how the client application interacts with the endpoint and gets the recommended action.

### Step-by-step bandits model development

[**ExperimentManager**](./common/sagemaker_rl/orchestrator/workflow/manager/experiment_manager.py) is the top level class for all the Bandits/RL and continual learning workflows. Similar to the estimators in the [Sagemaker Python SDK](https://github.com/aws/sagemaker-python-sdk), `ExperimentManager` contains methods for training, deployment and evaluation. It keeps track of the job status and reflects current progress in the workflow.

Start the application using the `ExperimentManager` class 

In [8]:
import time

timestamp = int(time.time())

experiment_name = 'bandits-{}'.format(timestamp)

# `ExperimentManager` will create a AWS CloudFormation Stack of additional resources needed for the Bandit experiment. 

In [9]:
from orchestrator.workflow.manager.experiment_manager import ExperimentManager

bandits_experiment = ExperimentManager(config, experiment_id=experiment_name)

INFO:orchestrator.resource_manager:Using Resources in CloudFormation stack named: BanditsSharedResourceStack for Shared Resources.


In [10]:
try:
    bandits_experiment.clean_resource(experiment_id=bandits_experiment.experiment_id)
    bandits_experiment.clean_table_records(experiment_id=bandits_experiment.experiment_id)
except:
    print('Ignore any errors.  Errors are OK.')

INFO:orchestrator:Deleting hosting endpoint 'bandits-1597616270'...


In [11]:
bandits_experiment = ExperimentManager(config, experiment_id=experiment_name)

INFO:orchestrator.resource_manager:Using Resources in CloudFormation stack named: BanditsSharedResourceStack for Shared Resources.


# Initialize the Bandit Model (aka. Warm Start)

In [12]:
# !pip install -q wrapt --upgrade --ignore-installed
# !pip install -q transformers==2.8.0
# !pip install -q tensorflow==2.1.0

In [13]:
# from transformers import DistilBertTokenizer

# tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [14]:
import pandas as pd
import time
import uuid
import boto3
from urllib.parse import urlparse
import datetime
import json
import io
import numpy as np

# def remove_underrepresented_classes(features, labels, thresh=0.0005):
#     """Removes classes when number of datapoints fraction is below a threshold."""
#     total_count = labels.shape[0]
#     unique, counts = np.unique(labels, return_counts=True)
#     ratios = counts.astype('float') / total_count
#     vals_and_ratios = dict(zip(unique, ratios))
#     print('Unique classes and their ratio of total: %s' % vals_and_ratios)
#     keep = [vals_and_ratios[v] >= thresh for v in labels]
#     return features[keep], labels[np.array(keep)]

# def safe_std(values):
#     """Remove zero std values for ones."""
#     return np.array([val if val != 0.0 else 1.0 for val in values])

# def create_bandit_contexts_and_labels(contexts, labels, num_actions=None):
#     print('Contexts: {}'.format(contexts))
#     print('Labels: {}'.format(labels))
#     """Normalize contexts and encode deterministic rewards."""
#     if num_actions is None:
#         num_actions = np.max(labels) + 1
#     print('Num Actions {}'.format(num_actions))

#     num_contexts = contexts.shape[0]
#     print('Num Contexts {}'.format(num_contexts))

#     # Due to random subsampling in small problems, some features may be constant
#     sstd = safe_std(np.std(contexts, axis=0, keepdims=True)[0, :])
#     print('sstd: {}'.format(sstd))
    
#     # Normalize features
#     print('contexts before {}'.format(contexts))
#     contexts_after = ((contexts - np.mean(contexts, axis=0, keepdims=True)) / sstd)
#     print('contexts after {}'.format(contexts_after))
    
#     # One hot encode labels as rewards
#     rewards = np.zeros((num_contexts, num_actions))
#     rewards[np.arange(num_contexts), labels] = 1.0

#     print('Contexts {}'.format(contexts_after))
#     return contexts_after, rewards, (np.ones(num_contexts), labels)

In [15]:
# import numpy as np
# import pandas as pd
# import boto3
# from src.io_utils import parse_s3_uri
# import csv

# # Generate a batch of experiences for warm starting the policy.
# def generate_sample_warm_start_data(data, batch_size=50):
#     num_actions = 2
    
#     ############
#     # TODO:  Factor this code out
#     data = pd.read_csv(data, header=0).to_numpy()
# #    df = pd.read_csv(data, 
# #                     delimiter='\t', 
# #                     quoting=csv.QUOTE_NONE,
# #                     compression='gzip')
# #    df_scrubbed = df[['review_body', 'star_rating']].sample(n=100)
# #    df_scrubbed = df_scrubbed.reset_index()
# #    df_scrubbed.shape
# #    data = df_scrubbed.to_numpy()
    
#     print('data {}'.format(data))
    
#     # Last column is label, the rest are the features    
#     contexts = data[:, :-1]
#     labels = data[:, -1].astype(int) - 1  # convert to 0 based index

#     # TODO:  Convert raw text into tokens
    
#     print('Contexts before remove underrepresented {}'.format(contexts))
#     print('Labels before remove underrepresented {}'.format(labels))

#     contexts_without_underrepresented_classes, labels_without_underrepresented_classes = remove_underrepresented_classes(contexts, labels)
    
#     print('Contexts after remove underrepresented {}'.format(contexts_without_underrepresented_classes))
#     print('Labels after remove underrepresented {}'.format(labels_without_underrepresented_classes))
    
#     bandit_context, bandit_labels, _ = create_bandit_contexts_and_labels(contexts_without_underrepresented_classes, 
#                                                                          labels_without_underrepresented_classes, 
#                                                                          num_actions)

#     joined_data_buffer = []
#     for i in range(0, batch_size):
#         context_index_i = np.random.choice(bandit_context.shape[0])
#         context_i = bandit_context[context_index_i]
#         action = np.random.choice(num_actions) + 1 # random action
#         action_prob = 1 / num_actions # probability of picking a random action
#         reward = 1 if bandit_labels[context_index_i][action-1] == 1 else 0

#         json_blob = {
#                      "reward": reward,
#                      "event_id": 'not-apply-to-warm-start',
#                      "action": action,
#                      "action_prob": action_prob,
#                      "model_id": 'not-apply-to-warm-start',
#                      "observation": context_i.tolist(),
#                      "sample_prob": np.random.uniform(0.0, 1.0)
#         }

#         joined_data_buffer.append(json_blob)

#     print('joined_data_buffer {}'.format(joined_data_buffer))

#     return joined_data_buffer

# # def download_historical_data_from_s3(data_s3_prefix):
# #     """Download the warm start data from S3."""
# #     s3_client = boto3.client('s3')
# #     bucket, prefix, _ = parse_s3_uri(data_s3_prefix)

# #     results = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
# #     contents = results.get('Contents')
# #     key = contents[0].get('Key')
    
# #     data_file_name = 'statlog_warm_start.data'
# #     s3_client.download_file(bucket, key, data_file_name)

# # def evaluate_historical_data(data_file):
# #     """Calculate policy value of the logged policy."""
# #     # Assume logged data comes from same policy 
# #     # so no need for counterfactual analysis
# #     offline_data = pd.read_csv(data_file, sep=",")
# #     offline_data_mean = offline_data['reward'].mean()
# #     offline_data_cost = 1 - offline_data_mean
# #     offline_data_cost
# #     return offline_data_cost

To start a new experiment, we need to initialize the first model. In the case where historical data is available and is in the format of `<state, action, action probability, reward>`, we can warm start by learning the bandit model offline. Otherwise, we can initiate a random bandit model.  Let's generate a batch of randomly selected samples with size `batch_size`.  

# Load Warm Start Data from Local Disk

In [16]:
# batch_size = 10

# generated_sample_warm_start_data_buffer = generate_sample_warm_start_data(data='./data/historical_reviews_and_star_ratings.csv',
#                                                                           batch_size=batch_size)
# print(generated_sample_warm_start_data_buffer)

# Ingest Warm Start Data (into S3)

During ingestion, we split the batch into train and validation sets using the parameter `ratio`.  For example, if `ratio=0.90`, then 90% is used for training and 10% for validation.

In [17]:
# bandits_experiment.ingest_joined_data(generated_sample_warm_start_data_buffer,
#                                       ratio=0.9)

In [18]:
# print('Ingested warm start data here\n{}'.format(bandits_experiment.last_joined_job_train_data))

In [19]:
# !aws s3 ls --recursive s3://sagemaker-us-east-1-835319576252/bandits-1597538280/joined_data/bandits-1597538280-join-job-id-1597538297/train
# !aws s3 cp s3://$bucket/bandits-1597538280/joined_data/bandits-1597538280-join-job-id-1597538297/train/local-joined-data-1597538298.csv .

In [20]:
# bandits_experiment._jsonify()

In [21]:
# !cat ./local-joined-data-1597538298.csv

In [22]:
#print(bandits_experiment.last_joined_job_train_data)

In [23]:
bandits_experiment.initialize_first_model() # input_data_s3_prefix=bandits_experiment.last_joined_job_train_data) 

INFO:orchestrator:Next Model name would be bandits-1597616270-model-id-1597616291
INFO:orchestrator:Start training job for model 'bandits-1597616270-model-id-1597616291''
INFO:orchestrator:Training job will be executed in 'local' mode


Creating tmpagbc2cxt_algo-1-8d3b6_1 ... 
[1BAttaching to tmpagbc2cxt_algo-1-8d3b6_12mdone[0m
[36malgo-1-8d3b6_1  |[0m 2020-08-16 22:18:13,548 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
[36malgo-1-8d3b6_1  |[0m 2020-08-16 22:18:13,560 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
[36malgo-1-8d3b6_1  |[0m 2020-08-16 22:18:13,572 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
[36malgo-1-8d3b6_1  |[0m 2020-08-16 22:18:13,581 sagemaker-containers INFO     Invoking user script
[36malgo-1-8d3b6_1  |[0m 
[36malgo-1-8d3b6_1  |[0m Training Env:
[36malgo-1-8d3b6_1  |[0m 
[36malgo-1-8d3b6_1  |[0m {
[36malgo-1-8d3b6_1  |[0m     "additional_framework_parameters": {
[36malgo-1-8d3b6_1  |[0m         "sagemaker_estimator": "RLEstimator"
[36malgo-1-8d3b6_1  |[0m     },
[36malgo-1-8d3b6_1  |[0m     "channel_input_dirs": {},
[36malgo-1-8d3b6_1  |[0m     "current_host": "algo-1-8d3b6",



[36mtmpagbc2cxt_algo-1-8d3b6_1 exited with code 0
[0mAborting on container exit...
===== Job Complete =====


# Check Experiment State
`training_state`: `TRAINED`

Remember the `last_trained_model_id`.

In [24]:
bandits_experiment._jsonify()

{'experiment_id': 'bandits-1597616270',
 'training_workflow_metadata': {'last_trained_model_id': 'bandits-1597616270-model-id-1597616291',
  'next_model_to_train_id': None,
  'training_state': <TrainingState.TRAINED: 'TRAINED'>},
 'hosting_workflow_metadata': {'last_hosted_model_id': None,
  'hosting_endpoint': None,
  'hosting_state': None,
  'next_model_to_host_id': None},
 'joining_workflow_metadata': {'joining_state': None,
  'next_join_job_id': None,
  'last_joined_job_id': None},
 'evaluation_workflow_metadata': {'evaluation_state': None,
  'last_evaluation_job_id': None,
  'next_evaluation_job_id': None}}

# Evaluate current model against historical model

After every training cycle, we evaluate if the newly trained model is better than the one currently deployed. Using the evaluation dataset, we evaluate how the new model would perform compared to the model that is currently deployed. SageMaker RL supports offline evaluation by performing counterfactual analysis (CFA). By default, we apply [**doubly robust (DR) estimation**](https://arxiv.org/pdf/1103.4601.pdf) method. The bandit policy tries to minimize the cost (1-reward) value in this case, so a smaller evaluation score indicates better policy performance.

In [25]:
# # evaluate the current model
# bandits_experiment.evaluate_model(
#     input_data_s3_prefix=bandits_experiment.last_joined_job_eval_data,
#     evaluate_model_id=bandits_experiment.last_trained_model_id)

# eval_score_last_trained_model = bandits_experiment.get_eval_score(
#     evaluate_model_id=bandits_experiment.last_trained_model_id,
#     eval_data_path=bandits_experiment.last_joined_job_eval_data
# )

In [26]:
# # WHAT IS THIS
# print(bandits_experiment.last_joined_job_eval_data)

In [27]:
# # get baseline performance from the historical (warm start) data
# download_historical_data_from_s3(data_s3_prefix=bandits_experiment.last_joined_job_eval_data)
# baseline_score = evaluate_historical_data(data_file='statlog_warm_start.data')
# baseline_score

# Check Experiment State
`TRAINED` => `EVALUATED`

The model just trained appears in both `last_trained_model_id` and `last_evaluation_job_id`.

In [28]:
# bandits_experiment._jsonify()

# Deploy the Bandit Model

Once training and evaluation is done, we can deploy the model.

In [29]:
import boto3
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


In [30]:
# Check the model_id of the last model trained.
print('Deploying newly-trained bandit model: {}'.format(bandits_experiment.last_trained_model_id))

Deploying newly-trained bandit model: bandits-1597616270-model-id-1597616291


In [31]:
print('Deploying bandit model_id {}'.format(bandits_experiment.last_trained_model_id))

bandits_experiment.deploy_model(model_id=bandits_experiment.last_trained_model_id) 

Deploying bandit model_id bandits-1597616270-model-id-1597616291


INFO:orchestrator:Model 'bandits-1597616270-model-id-1597616291' is ready to deploy.


Attaching to tmp6z5ug40e_algo-1-z1mdo_1
[36malgo-1-z1mdo_1  |[0m 17:C 16 Aug 2020 22:18:34.721 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
[36malgo-1-z1mdo_1  |[0m 17:C 16 Aug 2020 22:18:34.721 # Redis version=5.0.6, bits=64, commit=00000000, modified=0, pid=17, just started
[36malgo-1-z1mdo_1  |[0m 17:C 16 Aug 2020 22:18:34.721 # Configuration loaded
[36malgo-1-z1mdo_1  |[0m 17:M 16 Aug 2020 22:18:34.722 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
[36malgo-1-z1mdo_1  |[0m 17:M 16 Aug 2020 22:18:34.722 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
[36malgo-1-z1mdo_1  |[0m 17:M 16 Aug 2020 22:18:34.722 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
[36malgo-1-z1mdo_1  |[0m 17:M 16 Aug 2020 22:18:34.723 # Server initialized
[36malgo-1-z1mdo_1  |[0m [08/16/2020 22:18:37 INFO 14

# Check Experiment State
`hosting_state`: `DEPLOYED`

The `last_trained_model_id` and `last_hosted_model_id` are now the same as we just deployed the bandit model.

In [32]:
bandits_experiment._jsonify()

{'experiment_id': 'bandits-1597616270',
 'training_workflow_metadata': {'next_model_to_train_id': None,
  'last_trained_model_id': 'bandits-1597616270-model-id-1597616291',
  'training_state': 'TRAINED'},
 'hosting_workflow_metadata': {'hosting_endpoint': 'local:arn-does-not-matter',
  'hosting_state': <HostingState.DEPLOYED: 'DEPLOYED'>,
  'last_hosted_model_id': 'bandits-1597616270-model-id-1597616291',
  'next_model_to_host_id': None},
 'joining_workflow_metadata': {'joining_state': None,
  'next_join_job_id': None,
  'last_joined_job_id': None},
 'evaluation_workflow_metadata': {'evaluation_state': None,
  'last_evaluation_job_id': None,
  'next_evaluation_job_id': None}}

# Initialize the Client Application

Now that the last trained model is hosted, client application can send out the state, hit the endpoint, and receive the recommended action. There are 2 models that we want to test:  model1 and model2.  This translates to 2 actions that the bandit model will predict.



In [33]:
%store -r step_functions_pipeline_endpoint_name

In [34]:
print(step_functions_pipeline_endpoint_name)

training-pipeline-2020-08-14-14-25-27


In [35]:
client = boto3.client('sagemaker')
waiter = client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=step_functions_pipeline_endpoint_name)

In [36]:
import json
from sagemaker.tensorflow.serving import Predictor

model1 = Predictor(endpoint_name=step_functions_pipeline_endpoint_name,
                   sagemaker_session=sess,
                   content_type='application/json',
                   model_name='saved_model',
                   model_version=0)

In [37]:
reviews = ["This is not good."]

model1_predicted_classes = model1.predict(reviews)

for predicted_class, review in zip(model1_predicted_classes, reviews):
    print('[Predicted Star Rating: {}]'.format(predicted_class), review)

[Predicted Star Rating: 2] This is not good.


In [38]:
%store -r step_functions_pipeline_endpoint_name_random

In [39]:
print(step_functions_pipeline_endpoint_name_random)

training-pipeline-2020-08-15-05-04-23


In [40]:
client = boto3.client('sagemaker')
waiter = client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=step_functions_pipeline_endpoint_name_random)

In [41]:
import json
from sagemaker.tensorflow.serving import Predictor

model2 = Predictor(endpoint_name=step_functions_pipeline_endpoint_name_random,
                   sagemaker_session=sess,
                   content_type='application/json',
                   model_name='saved_model',
                   model_version=0)


In [42]:
reviews = ["This is not good."]

model2_predicted_classes = model2.predict(reviews)

for predicted_class, review in zip(model2_predicted_classes, reviews):
    print('[Predicted Star Rating: {}]'.format(predicted_class), review)

[Predicted Star Rating: 1] This is not good.


In [43]:
import csv
import numpy as np

# Application simulation
class Simulation():
    def __init__(self, data, bandit_model, bert_model_map):
        self.bandit_model = bandit_model
        self.bert_model_map = bert_model_map
        
        self.num_actions = 2

        ############
        # TODO:  Factor this code out
#        data = pd.read_csv(data_file_path, header=0).to_numpy()
        df_reviews = pd.read_csv(data, 
                                 delimiter='\t', 
                                 quoting=csv.QUOTE_NONE,
                                 compression='gzip')
        df_scrubbed = df_reviews[['review_body', 'star_rating']].sample(n=100)
        df_scrubbed = df_scrubbed.reset_index()
        df_scrubbed.shape
        np_reviews = df_scrubbed.to_numpy()

        np_reviews = np.delete(np_reviews, 0, 1)
#        print('np_reviews {}'.format(np_reviews))
        
        # Last column is label, the rest are the features    
        contexts = np_reviews[:, :-1]
#        labels = data[:, -1].astype(int) - 1  # convert to 0 based index
        labels = np_reviews[:, -1] #.astype(int)

#        contexts, labels = remove_underrepresented_classes(contexts, labels)
#        self.contexts, self.labels, _ = create_bandit_contexts_and_labels(context, labels, self.num_actions)
        self.contexts = contexts
        self.labels = labels

        print(self.contexts)
        print(self.labels)        
        
        self.opt_rewards = [1]        
        self.rewards_buffer = []
        self.joined_data_buffer = []

#     def choose_random_context(self):
#         context_index = np.random.choice(self.contexts.shape[0])
#         context = self.contexts[context_index]
#         return context_index, context    
    
    def get_reward(self, 
                   context_index, 
                   action, 
                   event_id, 
                   bandit_model_id, 
                   action_prob, 
                   sample_prob, 
                   local_mode):
        print('context_index {}'.format(context_index))       
        context = self.contexts[context_index]
        print('context {}'.format(context))
        label = self.labels[context_index]
        print('label {}'.format(label))
        print('action {}'.format(action))
        print('event_id {}'.format(event_id))
        print('bandit_model_id {}'.format(bandit_model_id))
        print('action_prob {}'.format(action_prob))
        print('self.labels {}'.format(self.labels))
#        print('self.labels[context_index][action-1] {}'.format(self.labels[context_index][action-1]))
#        print('self.labels[context_index][action] {}'.format(self.labels[context_index][action]))
        
#        reward = 1 if self.labels[context_index][action-1] == 1 else 0

        # TODO:  Invoke bert model and assign reward

#        if self.labels[context_index]

        bert_model = self.bert_model_map[action]
        print(type(context))
        import base64
        context_as_str = base64.binascii.b2a_base64(context).decode("ascii")
        print(type(context_as_str))
        bert_predicted_class = bert_model.predict([context_as_str])
        print('bert_predicted_class {}'.format(bert_predicted_class))
        
        if bert_predicted_class[0] == label:
            reward = 1
        else:
            reward = -1
        print('reward {}'.format(reward))

        if local_mode:
            json_blob = {"reward": reward,
                         "event_id": event_id,
                         "action": action,
                         "action_prob": action_prob,
                         "model_id": bandit_model_id,
                         "observation": self.contexts[context_index], #.tolist(),
                         "sample_prob": sample_prob}
            self.joined_data_buffer.append(json_blob)
        else:
            json_blob = {"reward": reward, "event_id": event_id}
            self.rewards_buffer.append(json_blob)
        
        return reward
    
    def clear_buffer(self):
        self.rewards_buffer.clear()
        self.joined_data_buffer.clear()

In [44]:
bandit_model = bandits_experiment.predictor

sim_app = Simulation(data='./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz',
                     bandit_model=bandit_model,
                     bert_model_map={
                                     1: model1,
                                     2: model2
                                    }
                    )

[['Would not take any emails I have so made one up and was the only way could get product to download. Not impressed.']
 ['QuickBooks is the easy answer to your small business financial records.<br />Less time spend on keeping books allows more time to be spent taking care of<br />clients.  Growing the business has never been easier.']
 ["I love Quicken because it organizes s much personal finance information.  Most accounts can be downloaded right into the program.  I gave 3 stars for this product because the business side is lacking.  It now allows you to see your dealings in an app on your smartphone but the program is very bare bones on the business side and not super user friendly.  It's not as close to QuickBoQuick books as I imagined it would be."]
 ['This new version has a lot of the same as previous versions that can be printed as pre-made cards, banners, calendars, posters, etc. But, now every individual item must be downloaded and often times there are additional charges; $0

Make sure that `num_arms` specified in `config.yaml` is equal to the total unique actions in the simulation application.

In [45]:
print('num_actions {}'.format(sim_app.num_actions))
print('config num_actions {}'.format(bandits_experiment.config["algor"]["algorithms_parameters"]["num_arms"]))

assert sim_app.num_actions == bandits_experiment.config["algor"]["algorithms_parameters"]["num_arms"]

num_actions 2
config num_actions 2


In [46]:
import time

# context = sim_app.choose_random_context()
# print(type('Type {}'.format(context)))
# print(context)

review_body = 'foo'
action, event_id, bandit_model_id, action_prob, sample_prob = bandit_model.get_action(obs=review_body) # obs=context)

# Check prediction response by uncommenting the lines below
print('event ID: {}\naction-bert-model-id: {}\naction-probability: {}\nbandit-model-id: {}\n'.format(event_id, action, action_prob, bandit_model_id))

event ID: 152898010924890051113667195582138155010
action-bert-model-id: 1
action-probability: 0.9995
bandit-model-id: bandits-1597616270-model-id-1597616291



In [47]:
import time

# context = sim_app.choose_random_context()
# print(type('Type {}'.format(context)))
# print(context)

review_body = 'bar'
action, event_id, bandit_model_id, action_prob, sample_prob = bandit_model.get_action(obs=review_body) # obs=context)

# Check prediction response by uncommenting the lines below
print('event ID: {}\naction-bert-model-id: {}\naction-probability: {}\nbandit-model-id: {}\n'.format(event_id, action, action_prob, bandit_model_id))

event ID: 152903366748676015383627109492069629954
action-bert-model-id: 1
action-probability: 0.9995
bandit-model-id: bandits-1597616270-model-id-1597616291



# Ingest Reward

Client application generates a reward after receiving the recommended action and stores the tuple `<eventID, reward>` in S3. In this case, reward is 1 if predicted action is the true class, and 0 otherwise. SageMaker hosting endpoint saves all the inferences `<eventID, state, action, action probability>` to S3 using [**Kinesis Firehose**](https://aws.amazon.com/kinesis/data-firehose/). The `ExperimentManager` joins the reward with state, action and action probability using [**Amazon Athena**](https://aws.amazon.com/athena/). 

In [48]:
local_mode = bandits_experiment.local_mode
batch_size = 100 # collect 500 data instances
print("Collecting batch of experience data...")

# Generate experiences and log them
for i in range(batch_size):
#     user_id, user_context = sim_app.choose_random_context()
#     print('User ID {}'.format(user_id))
#     print('User Context {}'.format(user_context.tolist()))    
#    review_body = 'foo'
    review_index = np.random.choice(100)
    review_body = sim_app.contexts[0]
    action, event_id, bandit_model_id, action_prob, sample_prob = bandit_model.get_action(obs=review_body) #obs=user_context.tolist())
    print('Action (bert model to invoke) {}'.format(action))
    print('Event ID {}'.format(event_id))
    print('Bandit Model ID {}'.format(bandit_model_id))
    print('Action Probability {}'.format(action_prob))
    print('Sample Probability {}'.format(sample_prob))
    
    reward = sim_app.get_reward(context_index=review_index, 
                                action=action, 
                                event_id=event_id, 
                                bandit_model_id=bandit_model_id, 
                                action_prob=action_prob, 
                                sample_prob=sample_prob, 
                                local_mode=local_mode)
    

Collecting batch of experience data...
Action (bert model to invoke) 1
Event ID 157316401546014168694818913395288440834
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.7426008290027138
context_index 24
context ['Great stuff!!!!!']
label 5
action 1
event_id 157316401546014168694818913395288440834
bandit_model_id bandits-1597616270-model-id-1597616291
action_prob 0.9995
self.labels [1 5 3 3 4 2 5 1 1 5 2 5 5 5 1 5 4 4 2 4 5 4 5 5 5 1 4 3 5 1 5 5 1 5 1 4 5
 5 5 5 1 1 2 5 5 1 1 5 1 5 5 5 5 2 5 5 1 1 5 5 4 1 1 5 5 5 4 1 5 1 4 5 5 1
 1 1 1 4 5 5 5 5 1 4 5 5 1 5 5 5 3 3 5 5 4 2 5 2 4 5]
<class 'numpy.ndarray'>
<class 'str'>
bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 157476310123261583835314090956570099714
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.012282591918139296
context_index 54
context ['Great product.  I have been using the software for a doze

bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 159077445528299979258416795398110838786
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.7864370694864601
context_index 81
context ["Use this at least once a week just to clean out all the junk that accumulates.  It's hard to believe that many cookies can nest inside your computer without even realizing it.  Easy to use, cleans well."]
label 5
action 1
event_id 159077445528299979258416795398110838786
bandit_model_id bandits-1597616270-model-id-1597616291
action_prob 0.9995
self.labels [1 5 3 3 4 2 5 1 1 5 2 5 5 5 1 5 4 4 2 4 5 4 5 5 5 1 4 3 5 1 5 5 1 5 1 4 5
 5 5 5 1 1 2 5 5 1 1 5 1 5 5 5 5 2 5 5 1 1 5 5 4 1 1 5 5 5 4 1 5 1 4 5 5 1
 1 1 1 4 5 5 5 5 1 4 5 5 1 5 5 5 3 3 5 5 4 2 5 2 4 5]
<class 'numpy.ndarray'>
<class 'str'>
bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 159211321315908457422865486288191094786
Bandit Model ID bandits

bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 160326385605668839991089873300608909314
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.5743943938576627
context_index 20
context ['The software was received by download. H&R Block tax software was chosen because it handles our two state returns properly. TurboTax never properly attributed the tax we paid in another state and in the next year, even after I informed them, they did not correct the problem.']
label 5
action 1
event_id 160326385605668839991089873300608909314
bandit_model_id bandits-1597616270-model-id-1597616291
action_prob 0.9995
self.labels [1 5 3 3 4 2 5 1 1 5 2 5 5 5 1 5 4 4 2 4 5 4 5 5 5 1 4 3 5 1 5 5 1 5 1 4 5
 5 5 5 1 1 2 5 5 1 1 5 1 5 5 5 5 2 5 5 1 1 5 5 4 1 1 5 5 5 4 1 5 1 4 5 5 1
 1 1 1 4 5 5 5 5 1 4 5 5 1 5 5 5 3 3 5 5 4 2 5 2 4 5]
<class 'numpy.ndarray'>
<class 'str'>
bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1

bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 161736601738370112069582527296538017794
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.8733444362221723
context_index 12
context ['Excellent.']
label 5
action 1
event_id 161736601738370112069582527296538017794
bandit_model_id bandits-1597616270-model-id-1597616291
action_prob 0.9995
self.labels [1 5 3 3 4 2 5 1 1 5 2 5 5 5 1 5 4 4 2 4 5 4 5 5 5 1 4 3 5 1 5 5 1 5 1 4 5
 5 5 5 1 1 2 5 5 1 1 5 1 5 5 5 5 2 5 5 1 1 5 5 4 1 1 5 5 5 4 1 5 1 4 5 5 1
 1 1 1 4 5 5 5 5 1 4 5 5 1 5 5 5 3 3 5 5 4 2 5 2 4 5]
<class 'numpy.ndarray'>
<class 'str'>
bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 161862802693875833447648531584149225474
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.5962490302893584
context_index 22
context ["Ever since my husband and I found avast! antivirus a few years ago, w

bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 163177552059873917572772002825248047106
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.6411380887677466
context_index 36
context ["I not only have PC MATIC  but I also have VIPRE voted by PC Mag as one of the top 5 best Antivirus programs out there. I was comparing both of them and have to say that PC MATIC not only does just as good a job as VIPRE but PC MATIC has several other benefits that VIPRE does not have, so for the money and what you get with PC MATIC it's a No-Brainer for me I'm renewing my PC MATIC subscription and letting VIPRE expire."]
label 5
action 1
event_id 163177552059873917572772002825248047106
bandit_model_id bandits-1597616270-model-id-1597616291
action_prob 0.9995
self.labels [1 5 3 3 4 2 5 1 1 5 2 5 5 5 1 5 4 4 2 4 5 4 5 5 5 1 4 3 5 1 5 5 1 5 1 4 5
 5 5 5 1 1 2 5 5 1 1 5 1 5 5 5 5 2 5 5 1 1 5 5 4 1 1 5 5 5 4 1 5 1 4 5 5 1
 1 1 1 4 5 

bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 164627768907265391427171351327256215554
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.37814700645799904
context_index 32
context ['I have been using Quicken for over 15 years very successfully.  This version crashes and crashes and wipes out files.  Very frustrating.']
label 1
action 1
event_id 164627768907265391427171351327256215554
bandit_model_id bandits-1597616270-model-id-1597616291
action_prob 0.9995
self.labels [1 5 3 3 4 2 5 1 1 5 2 5 5 5 1 5 4 4 2 4 5 4 5 5 5 1 4 3 5 1 5 5 1 5 1 4 5
 5 5 5 1 1 2 5 5 1 1 5 1 5 5 5 5 2 5 5 1 1 5 5 4 1 1 5 5 5 4 1 5 1 4 5 5 1
 1 1 1 4 5 5 5 5 1 4 5 5 1 5 5 5 3 3 5 5 4 2 5 2 4 5]
<class 'numpy.ndarray'>
<class 'str'>
bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 164825745824319285441618410599183679490
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample 

bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 165823298903438511147920177526918479874
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.9904476042609042
context_index 49
context ['il kie it']
label 5
action 1
event_id 165823298903438511147920177526918479874
bandit_model_id bandits-1597616270-model-id-1597616291
action_prob 0.9995
self.labels [1 5 3 3 4 2 5 1 1 5 2 5 5 5 1 5 4 4 2 4 5 4 5 5 5 1 4 3 5 1 5 5 1 5 1 4 5
 5 5 5 1 1 2 5 5 1 1 5 1 5 5 5 5 2 5 5 1 1 5 5 4 1 1 5 5 5 4 1 5 1 4 5 5 1
 1 1 1 4 5 5 5 5 1 4 5 5 1 5 5 5 3 3 5 5 4 2 5 2 4 5]
<class 'numpy.ndarray'>
<class 'str'>
bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 166115562879855755720225625869873446914
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.7561618579315595
context_index 93
context ['Easy and works very well. Simple download and setup.']
label 5
action 

bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 167388861655789627023229222706631278594
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.638309642827643
context_index 99
context ["Great service - best online asset for home users.  I've used it for years and am thankful for Avast!  I recommend it!"]
label 5
action 1
event_id 167388861655789627023229222706631278594
bandit_model_id bandits-1597616270-model-id-1597616291
action_prob 0.9995
self.labels [1 5 3 3 4 2 5 1 1 5 2 5 5 5 1 5 4 4 2 4 5 4 5 5 5 1 4 3 5 1 5 5 1 5 1 4 5
 5 5 5 1 1 2 5 5 1 1 5 1 5 5 5 5 2 5 5 1 1 5 5 4 1 1 5 5 5 4 1 5 1 4 5 5 1
 1 1 1 4 5 5 5 5 1 4 5 5 1 5 5 5 3 3 5 5 4 2 5 2 4 5]
<class 'numpy.ndarray'>
<class 'str'>
bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 167523589146145133529307044194177646594
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.0162089

bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 168547208290731552204320056520633090050
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.41891037039402645
context_index 54
context ['Great product.  I have been using the software for a dozen years and will continue using year after year.  I recommend using this product.']
label 5
action 1
event_id 168547208290731552204320056520633090050
bandit_model_id bandits-1597616270-model-id-1597616291
action_prob 0.9995
self.labels [1 5 3 3 4 2 5 1 1 5 2 5 5 5 1 5 4 4 2 4 5 4 5 5 5 1 4 3 5 1 5 5 1 5 1 4 5
 5 5 5 1 1 2 5 5 1 1 5 1 5 5 5 5 2 5 5 1 1 5 5 4 1 1 5 5 5 4 1 5 1 4 5 5 1
 1 1 1 4 5 5 5 5 1 4 5 5 1 5 5 5 3 3 5 5 4 2 5 2 4 5]
<class 'numpy.ndarray'>
<class 'str'>
bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 168682398473556142014129424304849420290
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sampl

bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 170049419412054637175766078139055341570
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.03260422922426209
context_index 57
context ["The product worked fine, until I had re refresh my PC and re install.  When I attempted to re-install, the system no longer recognized m account...so I wasted my weekend trying to just get logged in.  Tech support and Customer service are outsourced and useless.  Save the headache of dealing with these guys...they don't deserve to be considered professional."]
label 1
action 1
event_id 170049419412054637175766078139055341570
bandit_model_id bandits-1597616270-model-id-1597616291
action_prob 0.9995
self.labels [1 5 3 3 4 2 5 1 1 5 2 5 5 5 1 5 4 4 2 4 5 4 5 5 5 1 4 3 5 1 5 5 1 5 1 4 5
 5 5 5 1 1 2 5 5 1 1 5 1 5 5 5 5 2 5 5 1 1 5 5 4 1 1 5 5 5 4 1 5 1 4 5 5 1
 1 1 1 4 5 5 5 5 1 4 5 5 1 5 5 5 3 3 5 5 4 2 5 2 4 5]
<class 'numpy.ndar

bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 171406059876700513423891225844888240130
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.738769370487238
context_index 38
context ['Great software and have no issues from the installation or updates. I left the software scan my drives at least once or twice a month.']
label 5
action 1
event_id 171406059876700513423891225844888240130
bandit_model_id bandits-1597616270-model-id-1597616291
action_prob 0.9995
self.labels [1 5 3 3 4 2 5 1 1 5 2 5 5 5 1 5 4 4 2 4 5 4 5 5 5 1 4 3 5 1 5 5 1 5 1 4 5
 5 5 5 1 1 2 5 5 1 1 5 1 5 5 5 5 2 5 5 1 1 5 5 4 1 1 5 5 5 4 1 5 1 4 5 5 1
 1 1 1 4 5 5 5 5 1 4 5 5 1 5 5 5 3 3 5 5 4 2 5 2 4 5]
<class 'numpy.ndarray'>
<class 'str'>
bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 171532145159088963976762933897220718594
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Prob

bert_predicted_class [2]
reward -1
Action (bert model to invoke) 1
Event ID 172524554745897763641266921459539902466
Bandit Model ID bandits-1597616270-model-id-1597616291
Action Probability 0.9995
Sample Probability 0.128476156433594
context_index 21
context ["I've used Norton for years. I've tried others but never had the same results. Good price for 1/3 as I can use them as needed."]
label 4
action 1
event_id 172524554745897763641266921459539902466
bandit_model_id bandits-1597616270-model-id-1597616291
action_prob 0.9995
self.labels [1 5 3 3 4 2 5 1 1 5 2 5 5 5 1 5 4 4 2 4 5 4 5 5 5 1 4 3 5 1 5 5 1 5 1 4 5
 5 5 5 1 1 2 5 5 1 1 5 1 5 5 5 5 2 5 5 1 1 5 5 4 1 1 5 5 5 4 1 5 1 4 5 5 1
 1 1 1 4 5 5 5 5 1 4 5 5 1 5 5 5 3 3 5 5 4 2 5 2 4 5]
<class 'numpy.ndarray'>
<class 'str'>
bert_predicted_class [2]
reward -1


In [49]:
# Join (observation, action) with rewards (can be delayed) and upload the data to S3
if local_mode:
    print(sim_app.joined_data_buffer)
    bandits_experiment.ingest_joined_data(sim_app.joined_data_buffer)
else:
    print("Waiting for firehose to flush data to s3...")
    time.sleep(60) # Wait for firehose to flush data to S3
    rewards_s3_prefix = bandits_experiment.ingest_rewards(sim_app.rewards_buffer)
    bandits_experiment.join(rewards_s3_prefix)

[{'reward': -1, 'event_id': 157316401546014168694818913395288440834, 'action': 1, 'action_prob': 0.9995, 'model_id': 'bandits-1597616270-model-id-1597616291', 'observation': array(['Great stuff!!!!!'], dtype=object), 'sample_prob': 0.7426008290027138}, {'reward': -1, 'event_id': 157476310123261583835314090956570099714, 'action': 1, 'action_prob': 0.9995, 'model_id': 'bandits-1597616270-model-id-1597616291', 'observation': array(['Great product.  I have been using the software for a dozen years and will continue using year after year.  I recommend using this product.'],
      dtype=object), 'sample_prob': 0.012282591918139296}, {'reward': -1, 'event_id': 157638393513570015531613595146017439746, 'action': 1, 'action_prob': 0.9995, 'model_id': 'bandits-1597616270-model-id-1597616291', 'observation': array(["The product worked fine, until I had re refresh my PC and re install.  When I attempted to re-install, the system no longer recognized m account...so I wasted my weekend trying to just

INFO:orchestrator:Successfully create S3 bucket 'sagemaker-us-east-1-835319576252' for athena queries
INFO:orchestrator:Started dummy local joining job...
INFO:orchestrator:Splitting data into train/evaluation set with ratio of 0.8
INFO:orchestrator:Joined data will be stored under s3://sagemaker-us-east-1-835319576252/bandits-1597616270/joined_data/bandits-1597616270-join-job-id-1597616350
INFO:orchestrator:_upload_data_buffer_as_joined_data_format put s3://sagemaker-us-east-1-835319576252/bandits-1597616270/joined_data/bandits-1597616270-join-job-id-1597616350/train/local-joined-data-1597616350.csv
INFO:orchestrator:_upload_data_buffer_as_joined_data_format put s3://sagemaker-us-east-1-835319576252/bandits-1597616270/joined_data/bandits-1597616270-join-job-id-1597616350/eval/local-joined-data-1597616350.csv


# Reset the Simulator 

In [50]:
sim_app.clear_buffer()

# Check Experiment Status
`joining_state`:  `SUCCEEDED`

In [51]:
bandits_experiment._jsonify()

{'experiment_id': 'bandits-1597616270',
 'training_workflow_metadata': {'next_model_to_train_id': None,
  'last_trained_model_id': 'bandits-1597616270-model-id-1597616291',
  'training_state': 'TRAINED'},
 'hosting_workflow_metadata': {'hosting_endpoint': 'local:arn-does-not-matter',
  'hosting_state': 'DEPLOYED',
  'last_hosted_model_id': 'bandits-1597616270-model-id-1597616291',
  'next_model_to_host_id': None},
 'joining_workflow_metadata': {'joining_state': 'SUCCEEDED',
  'last_joined_job_id': 'bandits-1597616270-join-job-id-1597616350',
  'next_join_job_id': None},
 'evaluation_workflow_metadata': {'evaluation_state': None,
  'last_evaluation_job_id': None,
  'next_evaluation_job_id': None}}

In [52]:
bandits_experiment.last_joined_job_train_data

's3://sagemaker-us-east-1-835319576252/bandits-1597616270/joined_data/bandits-1597616270-join-job-id-1597616350/train'

In [53]:
!aws s3 ls --recursive $bandits_experiment.last_joined_job_train_data
!aws s3 cp --recursive $bandits_experiment.last_joined_job_train_data ./last_joined_job_train_data/

2020-08-16 22:19:11      44544 bandits-1597616270/joined_data/bandits-1597616270-join-job-id-1597616350/train/local-joined-data-1597616350.csv
download: s3://sagemaker-us-east-1-835319576252/bandits-1597616270/joined_data/bandits-1597616270-join-job-id-1597616350/train/local-joined-data-1597616350.csv to last_joined_job_train_data/local-joined-data-1597616350.csv


In [None]:
!cat ./last_joined_job_train_data/local-joined-data-1597616350.csv

# Re-train and Re-deploy

Now we can train a new model with newly collected experiences, and host the resulting model.

In [56]:
bandits_experiment.train_next_model(input_data_s3_prefix=bandits_experiment.last_joined_job_train_data)

INFO:orchestrator:Use last trained model bandits-1597616270-model-id-1597616291 as pre-trained model for training
INFO:orchestrator:Starting training job for ModelId 'bandits-1597616270-model-id-1597616385''
INFO:orchestrator:Training job will be executed in 'local' mode


Creating tmpt_nxmiq7_algo-1-ifzgj_1 ... 
[1BAttaching to tmpt_nxmiq7_algo-1-ifzgj_12mdone[0m
[36malgo-1-ifzgj_1  |[0m 2020-08-16 22:19:48,073 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
[36malgo-1-ifzgj_1  |[0m 2020-08-16 22:19:48,085 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
[36malgo-1-ifzgj_1  |[0m 2020-08-16 22:19:48,097 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
[36malgo-1-ifzgj_1  |[0m 2020-08-16 22:19:48,106 sagemaker-containers INFO     Invoking user script
[36malgo-1-ifzgj_1  |[0m 
[36malgo-1-ifzgj_1  |[0m Training Env:
[36malgo-1-ifzgj_1  |[0m 
[36malgo-1-ifzgj_1  |[0m {
[36malgo-1-ifzgj_1  |[0m     "additional_framework_parameters": {
[36malgo-1-ifzgj_1  |[0m         "sagemaker_estimator": "RLEstimator"
[36malgo-1-ifzgj_1  |[0m     },
[36malgo-1-ifzgj_1  |[0m     "channel_input_dirs": {
[36malgo-1-ifzgj_1  |[0m         "training": "/opt/ml/input/da

ERROR:orchestrator:Failed to run: ['docker-compose', '-f', '/tmp/tmpt_nxmiq7/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1


[36mtmpt_nxmiq7_algo-1-ifzgj_1 exited with code 1
[0mAborting on container exit...




KeyboardInterrupt: 

In [None]:
bandits_experiment.last_trained_model_id

In [None]:
bandits_experiment.deploy_model(model_id=bandits_experiment.last_trained_model_id)

In [None]:
bandits_experiment.last_hosted_model_id

# Continuously Deploy New Bandit Models

The above cells explained the individual steps in the training workflow. To train a model to convergence, we will continually train the model based on data collected with client application interactions. We demonstrate the continual training loop in a single cell below.

We include the evaluation step at each step before deployment to compare the model just trained (`last_trained_model_id`) against the model that is currently hosted (`last_hosted_model_id`). If you want the loops to finish faster, you can set `do_evaluation=False` in the cell below.

Details of each joining and training job can be tracked in `join_db` and `model_db` respectively. `model_db` also stores the evaluation scores. When you have multiple experiments, you can check their status in `experiment_db`.

In [None]:
do_evaluation = True

# You can also monitor your loop progress on CloudWatch Dashboard 
display(Markdown(bandits_experiment.get_cloudwatch_dashboard_details()))

In [None]:
start_time = time.time()
total_loops = 2 # Increase for higher accuracy
batch_size = 10 # Model will be trained after every `batch_size` number of data instances
rewards_list = []

local_mode = bandits_experiment.local_mode
for loop_no in range(total_loops):
    print(f"""
    #############
    #### Loop {loop_no+1}
    #############
    """)
    
    # Generate experiences and log them
    for i in range(batch_size):
        user_id, user_context = sim_app.choose_random_context()
        action, event_id, bandit_model_id, action_prob, sample_prob = bandit_model.get_action(obs=user_context.tolist())
        reward = sim_app.get_reward(user_id, action, event_id, bandit_model_id, action_prob, sample_prob, local_mode)
        rewards_list.append(reward)
        
    # Publish rewards sum for this batch to CloudWatch for monitoring 
    bandits_experiment.cw_logger.publish_rewards_for_simulation(
        bandits_experiment.experiment_id,
        sum(rewards_list[-batch_size:])/batch_size
    )
    
    # Join the events and rewards data to use for the next bandit-model training job
    if local_mode:
        bandits_experiment.ingest_joined_data(sim_app.joined_data_buffer,
                                              ratio=0.85)
    else:
        # Kinesis Firehose => S3 => Athena
        print("Waiting for firehose to flush data to s3...")
        time.sleep(60) 
        rewards_s3_prefix = bandits_experiment.ingest_rewards(sim_app.rewards_buffer)
        bandits_experiment.join(rewards_s3_prefix, ratio=0.85)
    
    # Train 
    bandits_experiment.train_next_model(
        input_data_s3_prefix=bandits_experiment.last_joined_job_train_data)

    # Deply the new bandit model
    if do_evaluation:
        # Evaluate against the currently-deployed bandit model
        bandits_experiment.evaluate_model(
            input_data_s3_prefix=bandits_experiment.last_joined_job_eval_data,
            evaluate_model_id=bandits_experiment.last_trained_model_id)
        eval_score_last_trained_model = bandits_experiment.get_eval_score(
            evaluate_model_id=bandits_experiment.last_trained_model_id,
            eval_data_path=bandits_experiment.last_joined_job_eval_data)

        bandits_experiment.evaluate_model(
            input_data_s3_prefix=bandits_experiment.last_joined_job_eval_data,
            evaluate_model_id=bandits_experiment.last_hosted_model_id)

        eval_score_last_hosted_model = bandits_experiment.get_eval_score(
            evaluate_model_id=bandits_experiment.last_hosted_model_id, 
            eval_data_path=bandits_experiment.last_joined_job_eval_data)
    
        # Deploy
        if eval_score_last_trained_model <= eval_score_last_hosted_model:
            bandits_experiment.deploy_model(model_id=bandits_experiment.last_trained_model_id)
        else:
            print('Not deploying model in loop {}'.format(loop_no))
    else:
        # Just deploy the new bandit model without evaluating against previous model
        bandits_experiment.deploy_model(model_id=bandits_experiment.last_trained_model_id)
    
    sim_app.clear_buffer()

print(f"Total time taken to complete {total_loops} loops: {time.time() - start_time}")

# Visualize the Bandit Rewards

You can visualize the bandit-model training performance by plotting the rolling mean reward across client interactions.

Here rolling mean reward is calculated on the last `rolling_window` number of data instances, where each data instance corresponds to a single client interaction.

In [None]:
%%time

import matplotlib.pyplot as plt
from pylab import rcParams
import pandas as pd
%matplotlib inline

def get_mean_reward(reward_lst, batch_size=batch_size):
    mean_rew=list()
    for r in range(len(reward_lst)):
        mean_rew.append(sum(reward_lst[:r+1]) * 1.0 / ((r+1)*batch_size))
    return mean_rew

rcParams['figure.figsize'] = 15, 10
lwd = 5
cmap = plt.get_cmap('tab20')
colors=plt.cm.tab20(np.linspace(0, 1, 20))

rolling_window = 100
rewards_df = pd.DataFrame(rewards_list, columns=['bandit']).rolling(rolling_window).mean()
rewards_df['oracle'] = sum(sim_app.opt_rewards) / len(sim_app.opt_rewards)

rewards_df.plot(y=['bandit','oracle'],linewidth=lwd)
plt.legend(loc=4, prop={'size': 20})
plt.tick_params(axis='both', which='major', labelsize=15)
plt.xlabel('Data instances (models were updated every %s data instances)' % batch_size, size=20)
plt.ylabel('Rolling Mean Reward', size=30)
plt.grid()
plt.show()

#### Get mean rewards

In [None]:
rewards_df.bandit.mean()

### Clean up

We have three DynamoDB tables (experiment, join, model) from the bandits application above (e.g. `experiment_id='bandits-...'`). To better maintain them, we should remove the related records if the experiment has finished. Besides, having an endpoint running will incur costs. Therefore, we delete these components as part of the clean up process.

> Only execute the clean up cells below when you've finished the current experiment and want to deprecate everything associated with it. After the cleanup, the Cloudwatch metrics will not be populated anymore.

In [2]:
try:
    bandits_experiment.clean_resource(experiment_id=bandits_experiment.experiment_id)
    bandits_experiment.clean_table_records(experiment_id=bandits_experiment.experiment_id)
except:
    print('Ignore any errors.  Errors are OK.')

Ignore any errors.  Errors are OK.
