# Contextual Bandits with Amazon SageMaker RL

We demonstrate how to manage your own contextual multi-armed bandit workflow on SageMaker using the built-in [Vowpal Wabbit (VW)](https://github.com/VowpalWabbit/vowpal_wabbit) container to train and deploy contextual bandit models. We show how to train models that interact with a live environment (here, a simulated client application) and how to continuously update them with efficient exploration.

### Why Contextual Bandits?

Wherever we look to personalize content for a user (content layout, ads, search, product recommendations, etc.), contextual bandits come in handy. Traditional personalization methods collect a training dataset, build a model and deploy it for generating recommendations. However, the training algorithm does not tell us how to collect this dataset, especially in a production system where generating poor recommendations leads to loss of revenue. Contextual bandit algorithms help us collect this data in a strategic manner by trading off between exploiting known information and exploring recommendations which may yield higher benefits. The collected data is used to update the personalization model in an online manner. Therefore, contextual bandits help us train a personalization model while minimizing the impact of poor recommendations.

### What does this notebook contain?

To implement the exploration-exploitation strategy, we need an iterative training and deployment system that: (1) recommends an action using the contextual bandit model based on user context, (2) captures the implicit feedback over time and (3) continuously trains the model with incremental interaction data. In this notebook, we show how to set up the infrastructure needed for such an iterative learning system. While the example demonstrates a bandits application, these continual learning systems are useful more generally in dynamic scenarios where models need to be continually updated to capture recent trends in the data (e.g. tracking fraud behaviors based on detection mechanisms or tracking user interests over time).
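
The three numbered steps can be sketched as a toy loop. This is only an illustration under simplifying assumptions, not the SageMaker workflow: contexts are discrete IDs, the policy is a tabular epsilon-greedy learner, and the reward rule is made up. `EpsilonGreedyBandit` and `run_loop` are hypothetical names that do not appear elsewhere in this notebook.

```python
import random
from collections import defaultdict


class EpsilonGreedyBandit:
    """Toy tabular contextual bandit (hypothetical, for illustration only)."""

    def __init__(self, num_arms, epsilon=0.1, seed=0):
        self.num_arms = num_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, arm) -> number of pulls
        self.values = defaultdict(float)  # (context, arm) -> mean reward so far

    def recommend(self, context):
        # Step 1: explore with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.num_arms)
        return max(range(self.num_arms),
                   key=lambda arm: self.values[(context, arm)])

    def update(self, context, arm, reward):
        # Step 3: incremental mean update from one logged interaction.
        key = (context, arm)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


def run_loop(num_rounds=2000, num_contexts=3, num_arms=4, seed=0):
    """Run recommend -> feedback -> update and return the average reward."""
    rng = random.Random(seed)
    bandit = EpsilonGreedyBandit(num_arms, epsilon=0.1, seed=seed)
    total_reward = 0
    for _ in range(num_rounds):
        context = rng.randrange(num_contexts)
        arm = bandit.recommend(context)
        # Step 2: feedback. Assumed rule: the "right" arm is context % num_arms.
        reward = 1 if arm == context % num_arms else 0
        bandit.update(context, arm, reward)
        total_reward += reward
    return total_reward / num_rounds


average_reward = run_loop()
```

With these settings the average reward should end up well above the 0.25 that uniform random guessing over four arms would give, because the learner exploits arms it has already found to pay off.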

In a typical supervised learning setup, the model is trained with a SageMaker training job and hosted behind a SageMaker hosting endpoint. The client application calls the endpoint for inference and receives a response. In bandits, the client application also sends back the reward (a score assigned to each recommendation generated by the model). These rewards become part of the dataset for subsequent model training.

# Based on this blog post:

https://aws.amazon.com/blogs/machine-learning/power-contextual-bandits-using-continual-learning-with-amazon-sagemaker-rl/

![](../../../img/multi_armed_bandit_maximize_reward.png)

![](../../../img/multi_armed_bandit_traffic_shift.png)

The contextual bandit training workflow is controlled by an experiment manager provided with this example. The client application (say a recommender system application) calls the SageMaker hosting endpoint that is serving the bandits model. The application sends the state (user features) as input and receives an action (recommendation) as a response. The client application sends the recommended action to the user and stores the received reward in S3. The SageMaker hosted endpoint also stores inference data (state and action) in S3. The experiment manager joins the inference data with rewards as they become available. The joined data is used to update the model with a SageMaker training job. The updated model is evaluated offline and deployed to the SageMaker hosting endpoint if its evaluation score improves upon prior models.

Below is an overview of the subsequent cells in the notebook: 
* Configuration: this includes details related to SageMaker and other AWS resources needed for the bandits application. 
* IAM role setup: this creates an appropriate execution role and shows how to add the extra permissions the role needs for specific AWS resources.
* Client application (Environment): this shows the simulated client application.
* Step-by-step bandits model development: 
 1. Model Initialization (random or warm-start) 
 2. Deploy the First Model 
 3. Initialize the Client Application 
 4. Reward Ingestion 
 5. Model Re-training and Re-deployment 
* Bandits model deployment with the end-to-end loop. 
* Visualization 
* Cleanup 

#### Local Mode

To facilitate experimentation, we provide a `local_mode` that runs the contextual bandit example using the SageMaker Notebook instance itself instead of SageMaker training and hosting instances. The workflow remains the same in `local_mode`, but runs much faster for small datasets. Hence, it is a useful tool for experimentation and debugging. However, it will not scale to production use cases with high throughput and large datasets. 

In `local_mode`, training, evaluation and hosting are done with the SageMaker VW docker container. The join is not handled by SageMaker, and is done inside the client application. The rest of the textual explanation assumes that the notebook is run in SageMaker mode.

In [1]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

In [2]:
import yaml
import sys
import numpy as np
import time
import sagemaker

sys.path.append('common')
sys.path.append('common/sagemaker_rl')

from markdown_helper import *
from IPython.display import Markdown

### Configuration

The configuration for the bandits application can be specified in a `config.yaml` file, as shown below. It configures the AWS resources needed. The DynamoDB tables are used to store metadata related to experiments, models and data joins. The `private_resource` section specifies the SageMaker instance types and counts used for training, evaluation and hosting. The SageMaker container image is used for the bandits application. This config file also contains algorithm and SageMaker-specific setups. Note that all the data generated and used for the bandits application will be stored in `s3://sagemaker-{REGION}-{AWS_ACCOUNT_ID}/{experiment_id}/`.
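
As a small illustration of the storage location pattern above, the prefix can be assembled from its three parts. `experiment_s3_prefix` is a hypothetical helper written for this example, and the account ID below is a placeholder, not a real account.

```python
def experiment_s3_prefix(region, account_id, experiment_id):
    """Build the S3 prefix under which an experiment's data is stored.

    Hypothetical helper: it only formats the pattern
    s3://sagemaker-{REGION}-{AWS_ACCOUNT_ID}/{experiment_id}/ described above.
    """
    return "s3://sagemaker-{}-{}/{}/".format(region, account_id, experiment_id)


# Placeholder values for illustration only
prefix = experiment_s3_prefix("us-east-1", "123456789012", "bandits-example")
print(prefix)
```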

Please make sure that the `num_arms` parameter in the config is equal to the number of actions in the client application (which is defined in the cell below).

The Docker image is defined here:  https://github.com/aws/sagemaker-rl-container/blob/master/vw/docker/8.7.0/Dockerfile

In [3]:
!pygmentize 'config.yaml'

resource:
  shared_resource:
    # cloud formation stack
    resources_cf_stack_name: "BanditsSharedResourceStack"
    # Dynamo table for status of an experiment
    experiment_db:
      table_name: "BanditsExperimentTable"
    # Dynamo table for status of all models trained
    model_db:
      table_name: "BanditsModelTable"
    # Dynamo table for status of all joining job for reward ingestion
    join_db:
      table_name: "BanditsJoinTable"
    iam_role:
      role_name: "BanditsIAMRole"
  pr

In [4]:
config_file = 'config.yaml'
with open(config_file, 'r') as yaml_file:
    config = yaml.safe_load(yaml_file)


# Additional permissions for the IAM role
The IAM role requires additional permissions for [AWS CloudFormation](https://aws.amazon.com/cloudformation/), [Amazon DynamoDB](https://aws.amazon.com/dynamodb/), [Amazon Kinesis Data Firehose](https://aws.amazon.com/kinesis/data-firehose/) and [Amazon Athena](https://aws.amazon.com/athena/). Make sure the SageMaker role you are using has these permissions.

In [5]:
# display(Markdown(generate_help_for_experiment_manager_permissions(sagemaker_role)))

### Client application (Environment)
The client application simulates a live environment that uses the SageMaker bandits model to serve recommendations to users. The logic of reward generation resides in the client application. We simulate the online learning loop with feedback using the [Statlog (Shuttle) Data Set](https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)). The Statlog data consists of 7 classes; if the agent selects the right class, the reward is 1, and otherwise the reward is 0. Note that the code cells in this notebook load `./data/model_rewards.csv` and set `num_actions = 5`, so the simulation below runs with 5 classes.

The workflow of the client application is as follows:
- The client application picks a context at random, which is sent to the SageMaker endpoint for retrieving an action.
- SageMaker endpoint returns an action, associated probability and `event_id`.
- Since this simulator was generated from the Statlog dataset, we know the true class for that context. 
- The application reports the reward to the experiment manager using S3, along with the corresponding `event_id`.

`event_id` is a unique identifier for each interaction. It is used to join inference data `<state, action, action probability>` with the rewards. 
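
The join keyed on `event_id` can be pictured with plain Python dictionaries. This is a sketch of the idea only: in SageMaker mode the join is performed by the experiment manager, and `join_on_event_id` is a hypothetical helper. The field names mirror the JSON blobs built by the simulator in this notebook.

```python
def join_on_event_id(inference_records, reward_records):
    """Inner-join inference records with rewards on `event_id`.

    Inference records without a matching reward are dropped, since
    rewards may arrive late or not at all.
    """
    rewards_by_event = {r["event_id"]: r["reward"] for r in reward_records}
    joined = []
    for record in inference_records:
        event_id = record["event_id"]
        if event_id in rewards_by_event:
            # Copy the inference record and attach its reward
            joined.append({**record, "reward": rewards_by_event[event_id]})
    return joined


# Made-up example data
inference = [
    {"event_id": "e1", "observation": [0.1, 0.2], "action": 3, "action_prob": 0.9},
    {"event_id": "e2", "observation": [0.4, 0.1], "action": 1, "action_prob": 0.1},
]
rewards = [{"event_id": "e1", "reward": 1}]

joined = join_on_event_id(inference, rewards)
```

Here only event `e1` survives the join, because no reward has been reported for `e2` yet.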

In a later cell of this notebook, where there exists a hosted endpoint, we illustrate how the client application interacts with the endpoint and gets the recommended action.

### Step-by-step bandits model development

`ExperimentManager` is the top level class for all the Bandits/RL and continual learning workflows. Similar to the estimators in the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk), `ExperimentManager` contains methods for training, deployment and evaluation. It keeps track of the job status and reflects current progress in the workflow.

Start the application using the `ExperimentManager` class.

In [6]:
import time

timestamp = int(time.time())

experiment_name = 'bandits-{}'.format(timestamp)

# `ExperimentManager` will create an AWS CloudFormation Stack of additional resources needed for the Bandit experiment. 

In [7]:
from orchestrator.workflow.manager.experiment_manager import ExperimentManager

bandits_experiment = ExperimentManager(config, experiment_id=experiment_name)

INFO:orchestrator.resource_manager:Using Resources in CloudFormation stack named: BanditsSharedResourceStack for Shared Resources.


In [8]:
try:
    bandits_experiment.clean_resource(experiment_id=bandits_experiment.experiment_id)

    bandits_experiment.clean_table_records(experiment_id=bandits_experiment.experiment_id)
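except Exception as e:
    # Cleanup is best-effort: the resources may not exist yet on a first run.
    print('Ignoring cleanup error (this is OK): {}'.format(e))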

bandits_experiment = ExperimentManager(config, experiment_id=experiment_name)

INFO:orchestrator:Deleting hosting endpoint 'bandits-1597300705'...
INFO:orchestrator.resource_manager:Using Resources in CloudFormation stack named: BanditsSharedResourceStack for Shared Resources.


# Initialize Model

To start a new experiment, we need to initialize the first model. If historical data is available in the format `<state, action, action probability, reward>`, we can warm start by learning the policy offline. Otherwise, we can initialize a random policy.

**Warm start the policy**

We showcase the warm start by generating a batch of randomly selected samples with size `batch_size`. Then we split it into a training set and an evaluation set using the parameter `ratio`.
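
A rough sketch of what such a warm-start batch and split could look like is given below. It assumes a uniform random logging policy, so every logged action has probability `1 / num_actions`, and a simple head/tail split. How `ingest_joined_data` actually splits the data is not shown in this notebook, so `split_by_ratio` is an assumption, and both function names are hypothetical.

```python
import random


def make_uniform_warm_start(num_samples, num_actions, true_labels, seed=0):
    """Log `num_samples` interactions from a uniform random policy.

    `true_labels` holds the 0-based correct class for each context;
    actions are 1-based, as in the rest of this notebook.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(num_samples):
        context_index = rng.randrange(len(true_labels))
        action = rng.randrange(num_actions) + 1
        batch.append({
            "context_index": context_index,
            "action": action,
            "action_prob": 1.0 / num_actions,  # uniform logging policy
            "reward": 1 if action - 1 == true_labels[context_index] else 0,
        })
    return batch


def split_by_ratio(records, ratio=0.8):
    """Put the first `ratio` share of records in train, the rest in eval."""
    cut = int(len(records) * ratio)
    return records[:cut], records[cut:]


batch = make_uniform_warm_start(100, 5, true_labels=[0, 1, 2, 3, 4])
train_set, eval_set = split_by_ratio(batch, ratio=0.8)
```

Recording `action_prob` alongside each logged action is what later makes offline (counterfactual) evaluation of a different policy possible.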

In [9]:
!pip install -q wrapt --upgrade --ignore-installed
!pip install -q transformers==2.8.0
!pip install -q tensorflow==2.1.0

ERROR: astroid 2.3.3 has requirement wrapt==1.11.*, but you'll have wrapt 1.12.1 which is incompatible.
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [10]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [11]:
import pandas as pd
import time
import uuid
import boto3
from urllib.parse import urlparse
import datetime
import json
import io
import numpy as np

def remove_underrepresented_classes(features, labels, thresh=0.0005):
    """Remove classes whose fraction of the datapoints is below a threshold."""
    total_count = labels.shape[0]
    unique, counts = np.unique(labels, return_counts=True)
    ratios = counts.astype('float') / total_count
    vals_and_ratios = dict(zip(unique, ratios))
    print('Unique classes and their ratio of total: %s' % vals_and_ratios)
    # Boolean mask: True for rows whose class is frequent enough to keep
    keep = np.array([vals_and_ratios[v] >= thresh for v in labels])
    return features[keep], labels[keep]

def safe_std(values):
    """Replace zero std values with ones to avoid division by zero."""
    return np.array([val if val != 0.0 else 1.0 for val in values])

def classification_to_bandit_problem(contexts, labels, num_actions=None):
    """Normalize contexts and encode deterministic rewards."""
    if num_actions is None:
        num_actions = np.max(labels) + 1
    num_contexts = contexts.shape[0]

    # Due to random subsampling in small problems, some features may be constant
    sstd = safe_std(np.std(contexts, axis=0, keepdims=True)[0, :])

    # Normalize features
    contexts = ((contexts - np.mean(contexts, axis=0, keepdims=True)) / sstd)

    # One hot encode labels as rewards
    rewards = np.zeros((num_contexts, num_actions))
    rewards[np.arange(num_contexts), labels] = 1.0

    return contexts, rewards, (np.ones(num_contexts), labels)


class StatlogSimApp():
    """
    A client application simulator using Statlog data.
    """
    def __init__(self, predictor, data):
# #        file_name = 'sim_app/shuttle.trn'
#         self.num_actions = 5
# #        self.data_size = 43483
        
#         with open(file_name, 'r') as f:
#             data = np.loadtxt(f)

#         # Shuffle data
#         np.random.shuffle(data)

#         # Last column is label, rest are features
#         contexts = data[:, :-1]
#         labels = data[:, -1].astype(int) - 1  # convert to 0 based index

        self.num_actions = 5

        ############
        # TODO:  Factor this code out
        data = pd.read_csv(data).to_numpy()

    #    df = pd.read_csv(data, 
    #                     delimiter='\t', 
    #                     quoting=csv.QUOTE_NONE,
    #                     compression='gzip')
    #    df_scrubbed = df[['review_body', 'star_rating']].sample(n=100)
    #    df_scrubbed = df_scrubbed.reset_index()
    #    df_scrubbed.shape
    #    data = df_scrubbed.to_numpy()

        # Last column is label, the rest are the features    
        data_without_index = data[:,1:]
        contexts = data_without_index[:, :-1]
        labels = data_without_index[:, -1].astype(int) - 1  # convert to 0 based index
        ############
    
        context, labels = remove_underrepresented_classes(contexts, labels)
        self.context, self.labels, _ = classification_to_bandit_problem(
                                        context, labels, self.num_actions)
        self.opt_rewards = [1]
        
        self.rewards_buffer = []
        self.joined_data_buffer = []

    def choose_random_user(self):
        context_index = np.random.choice(self.context.shape[0])
        context = self.context[context_index]
        return context_index, context
    
    def get_reward(self, 
                   context_index, 
                   action, 
                   event_id, 
                   model_id, 
                   action_prob, 
                   sample_prob, 
                   local_mode):

        reward = 1 if self.labels[context_index][action-1] == 1 else 0

        if local_mode:
            json_blob = {"reward": reward,
                         "event_id": event_id,
                         "action": action,
                         "action_prob": action_prob,
                         "model_id": model_id,
                         "observation": self.context[context_index].tolist(),
                         "sample_prob": sample_prob}
            self.joined_data_buffer.append(json_blob)
        else:
            json_blob = {"reward": reward, "event_id": event_id}
            self.rewards_buffer.append(json_blob)
        
        return reward
    
    def clear_buffer(self):
        self.rewards_buffer.clear()
        self.joined_data_buffer.clear()

In [12]:
import numpy as np
import pandas as pd
import boto3
from src.io_utils import parse_s3_uri
import csv

def prepare_warm_start_data(data, batch_size=100):
    """
    Generate a batch of experiences for warm starting the policy.
    """
    
    num_actions = 5
    
    ############
    # TODO:  Factor this code out
    data = pd.read_csv(data).to_numpy()
        
#    df = pd.read_csv(data, 
#                     delimiter='\t', 
#                     quoting=csv.QUOTE_NONE,
#                     compression='gzip')
#    df_scrubbed = df[['review_body', 'star_rating']].sample(n=100)
#    df_scrubbed = df_scrubbed.reset_index()
#    df_scrubbed.shape
#    data = df_scrubbed.to_numpy()
    
    # Last column is label, the rest are the features    
    data_without_index = data[:,1:]
    contexts = data_without_index[:, :-1]
    labels = data_without_index[:, -1].astype(int) - 1  # convert to 0 based index

    # TODO:  Convert raw text into tokens
    
    # print(contexts)
    # print(labels)

    context, labels = remove_underrepresented_classes(contexts, labels)
    
    print(type(labels[0]))
    print(type(context[0]))
    
    statlog_context, statlog_labels, _ = classification_to_bandit_problem(
                                    context, labels, num_actions)

    joined_data_buffer = []
    for i in range(0, batch_size):
        context_index_i = np.random.choice(statlog_context.shape[0])
        context_i = statlog_context[context_index_i]
        action = np.random.choice(num_actions) + 1 #random action
        action_prob = 1 / num_actions # probability of picking a random action
        reward = 1 if statlog_labels[context_index_i][action-1] == 1 else 0

        json_blob = {"reward": reward,
                    "event_id": 'not-apply-to-warm-start',
                    "action": action,
                    "action_prob": action_prob,
                    "model_id": 'not-apply-to-warm-start',
                    "observation": context_i.tolist(),
                    "sample_prob": np.random.uniform(0.0, 1.0)}

        joined_data_buffer.append(json_blob)

    return joined_data_buffer

# def download_historical_data_from_s3(data_s3_prefix):
#     """Download the warm start data from S3."""
#     s3_client = boto3.client('s3')
#     bucket, prefix, _ = parse_s3_uri(data_s3_prefix)

#     results = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
#     contents = results.get('Contents')
#     key = contents[0].get('Key')
    
#     data_file_name = 'statlog_warm_start.data'
#     s3_client.download_file(bucket, key, data_file_name)

# def evaluate_historical_data(data_file):
#     """Calculate policy value of the logged policy."""
#     # Assume logged data comes from same policy 
#     # so no need for counterfactual analysis
#     offline_data = pd.read_csv(data_file, sep=",")
#     offline_data_mean = offline_data['reward'].mean()
#     offline_data_cost = 1 - offline_data_mean
#     offline_data_cost
#     return offline_data_cost

In [13]:
batch_size = 100
#warm_start_data_buffer = prepare_warm_start_data('./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz',
warm_start_data_buffer = prepare_warm_start_data(data='./data/model_rewards.csv',
                                                 batch_size=batch_size)

# upload to s3
bandits_experiment.ingest_joined_data(warm_start_data_buffer,
                                      ratio=0.8)


Unique classes and their ratio of total: {0: 0.1111111111111111, 1: 0.2222222222222222, 2: 0.2222222222222222, 3: 0.2222222222222222, 4: 0.2222222222222222}
<class 'numpy.int64'>
<class 'numpy.ndarray'>


INFO:orchestrator:Successfully create S3 bucket 'sagemaker-us-east-1-835319576252' for athena queries
INFO:orchestrator:Started dummy local joining job...
INFO:orchestrator:Splitting data into train/evaluation set with ratio of 0.8
INFO:orchestrator:Joined data will be stored under s3://sagemaker-us-east-1-835319576252/bandits-1597300705/joined_data/bandits-1597300705-join-job-id-1597300713
INFO:orchestrator:_upload_data_buffer_as_joined_data_format put s3://sagemaker-us-east-1-835319576252/bandits-1597300705/joined_data/bandits-1597300705-join-job-id-1597300713/train/local-joined-data-1597300713.csv
INFO:orchestrator:_upload_data_buffer_as_joined_data_format put s3://sagemaker-us-east-1-835319576252/bandits-1597300705/joined_data/bandits-1597300705-join-job-id-1597300713/eval/local-joined-data-1597300713.csv


In [14]:
bandits_experiment._jsonify()

{'experiment_id': 'bandits-1597300705',
 'training_workflow_metadata': {'next_model_to_train_id': None,
  'last_trained_model_id': None,
  'training_state': None},
 'hosting_workflow_metadata': {'last_hosted_model_id': None,
  'hosting_endpoint': None,
  'hosting_state': None,
  'next_model_to_host_id': None},
 'joining_workflow_metadata': {'joining_state': 'SUCCEEDED',
  'last_joined_job_id': 'bandits-1597300705-join-job-id-1597300713',
  'next_join_job_id': None},
 'evaluation_workflow_metadata': {'evaluation_state': None,
  'last_evaluation_job_id': None,
  'next_evaluation_job_id': None}}

In [15]:
bandits_experiment.initialize_first_model(input_data_s3_prefix=bandits_experiment.last_joined_job_train_data) 

INFO:orchestrator:Next Model name would be bandits-1597300705-model-id-1597300721
INFO:orchestrator:Start training job for model 'bandits-1597300705-model-id-1597300721''
INFO:orchestrator:Training job will be executed in 'local' mode


Creating tmpbdure2z3_algo-1-6m82s_1 ... done
Attaching to tmpbdure2z3_algo-1-6m82s_1
algo-1-6m82s_1  | 2020-08-13 06:38:44,715 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-6m82s_1  | 2020-08-13 06:38:44,741 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-6m82s_1  | 2020-08-13 06:38:44,752 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-6m82s_1  | 2020-08-13 06:38:44,762 sagemaker-containers INFO     Invoking user script
algo-1-6m82s_1  |
algo-1-6m82s_1  | Training Env:
algo-1-6m82s_1  |
algo-1-6m82s_1  | {
algo-1-6m82s_1  |     "additional_framework_parameters": {
algo-1-6m82s_1  |         "sagemaker_estimator": "RLEstimator"
algo-1-6m82s_1  |     },
algo-1-6m82s_1  |     "channel_input_dirs": {
algo-1-6m82s_1  |         "training": "/opt/ml/input/da

tmpbdure2z3_algo-1-6m82s_1 exited with code 0
Aborting on container exit...
===== Job Complete =====


# Evaluate current model against historical model

After every training cycle, we check whether the newly trained model is better than the one currently deployed. Using the evaluation dataset, we estimate how the new model would perform compared to the deployed model. SageMaker RL supports offline evaluation by performing counterfactual analysis (CFA). By default, we apply the [doubly robust (DR) estimation](https://arxiv.org/pdf/1103.4601.pdf) method. The bandit policy tries to minimize the cost (1 - reward) in this case, so a smaller evaluation score indicates better policy performance.
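
To make the idea of doubly robust estimation concrete, here is a minimal sketch for a deterministic target policy. It is not the implementation used by the VW container: `doubly_robust_cost`, the record layout and the toy data are assumptions made for this example. For each logged interaction, the estimator starts from a cost model's prediction for the action the target policy would take, and adds an importance-weighted correction whenever the logged action agrees with that action.

```python
def doubly_robust_cost(logged, target_policy, cost_model):
    """Doubly robust estimate of a deterministic policy's average cost.

    logged:        list of dicts with keys
                   'observation', 'action', 'action_prob', 'cost'
    target_policy: callable, observation -> action
    cost_model:    callable, (observation, action) -> predicted cost
    """
    total = 0.0
    for row in logged:
        obs = row["observation"]
        target_action = target_policy(obs)
        # Direct-method term: model prediction for the target action
        estimate = cost_model(obs, target_action)
        # Correction term: only applies when the logged action matches
        if row["action"] == target_action:
            residual = row["cost"] - cost_model(obs, row["action"])
            estimate += residual / row["action_prob"]
        total += estimate
    return total / len(logged)


# Toy log from a uniform policy over two actions (made-up data)
logged = [
    {"observation": 0, "action": 1, "action_prob": 0.5, "cost": 0},
    {"observation": 1, "action": 2, "action_prob": 0.5, "cost": 1},
    {"observation": 2, "action": 1, "action_prob": 0.5, "cost": 1},
    {"observation": 3, "action": 2, "action_prob": 0.5, "cost": 0},
]

always_first = lambda obs: 1

# With an all-zero cost model, DR reduces to inverse propensity scoring
ips_like = doubly_robust_cost(logged, always_first, lambda obs, a: 0.0)
# With a constant 0.5 cost model, the correction term compensates for model error
with_model = doubly_robust_cost(logged, always_first, lambda obs, a: 0.5)
```

Because of the importance-weighted correction, individual terms (and hence the overall estimate) are not confined to the [0, 1] range of the per-sample cost.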

In [16]:
# evaluate the current model
bandits_experiment.evaluate_model(
    input_data_s3_prefix=bandits_experiment.last_joined_job_eval_data,
    evaluate_model_id=bandits_experiment.last_trained_model_id)

eval_score_last_trained_model = bandits_experiment.get_eval_score(
    evaluate_model_id=bandits_experiment.last_trained_model_id,
    eval_data_path=bandits_experiment.last_joined_job_eval_data
)

INFO:orchestrator:Evaluating model 'bandits-1597300705-model-id-1597300721' with evaluation job id 'bandits-1597300705-model-id-1597300721-eval-1597300735'
INFO:orchestrator:Evaluation job will be executed in 'local' mode
INFO:orchestrator:Getting eval scores for model 'bandits-1597300705-model-id-1597300721' on eval data set 's3://sagemaker-us-east-1-835319576252/bandits-1597300705/joined_data/bandits-1597300705-join-job-id-1597300713/eval'
INFO:orchestrator:Evaluation score for model 'bandits-1597300705-model-id-1597300721'with data 's3://sagemaker-us-east-1-835319576252/bandits-1597300705/joined_data/bandits-1597300705-join-job-id-1597300713/eval' is 1.001176.


In [17]:
# # get baseline performance from the historical (warm start) data
# download_historical_data_from_s3(data_s3_prefix=bandits_experiment.last_joined_job_eval_data)
# baseline_score = evaluate_historical_data(data_file='statlog_warm_start.data')
# baseline_score

In [18]:
# Check the model_id of the last model trained.
bandits_experiment.last_trained_model_id

'bandits-1597300705-model-id-1597300721'

# Deploy the First Model

Once training and evaluation are done, we can deploy the model.

In [19]:
import boto3
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


In [20]:
bandits_experiment.deploy_model(model_id=bandits_experiment.last_trained_model_id) 

INFO:orchestrator:Model 'bandits-1597300705-model-id-1597300721' is ready to deploy.


Attaching to tmp89xlh7wl_algo-1-psdo0_1
algo-1-psdo0_1  | 18:C 13 Aug 2020 06:39:10.890 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
algo-1-psdo0_1  | 18:C 13 Aug 2020 06:39:10.890 # Redis version=5.0.6, bits=64, commit=00000000, modified=0, pid=18, just started
algo-1-psdo0_1  | 18:C 13 Aug 2020 06:39:10.890 # Configuration loaded
algo-1-psdo0_1  | 18:M 13 Aug 2020 06:39:10.891 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
algo-1-psdo0_1  | 18:M 13 Aug 2020 06:39:10.891 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
algo-1-psdo0_1  | 18:M 13 Aug 2020 06:39:10.891 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
algo-1-psdo0_1  | 18:M 13 Aug 2020 06:39:10.891 # Server initialized
algo-1-psdo0_1  | [08/13/2020 06:39:13 INFO 13

You can check the experiment state at any point by executing:

In [21]:
bandits_experiment._jsonify()

{'experiment_id': 'bandits-1597300705',
 'training_workflow_metadata': {'next_model_to_train_id': None,
  'last_trained_model_id': 'bandits-1597300705-model-id-1597300721',
  'training_state': 'TRAINED'},
 'hosting_workflow_metadata': {'hosting_endpoint': 'local:arn-does-not-matter',
  'hosting_state': <HostingState.DEPLOYED: 'DEPLOYED'>,
  'last_hosted_model_id': 'bandits-1597300705-model-id-1597300721',
  'next_model_to_host_id': None},
 'joining_workflow_metadata': {'joining_state': 'SUCCEEDED',
  'last_joined_job_id': 'bandits-1597300705-join-job-id-1597300713',
  'next_join_job_id': None},
 'evaluation_workflow_metadata': {'evaluation_state': 'EVALUATED',
  'last_evaluation_job_id': 'bandits-1597300705-model-id-1597300721-eval-1597300735',
  'next_evaluation_job_id': None}}

The model just trained appears in both `last_trained_model_id` and `last_hosted_model_id`.

# Initialize the Client Application

Now that the last trained model is hosted, the client application can send the state to the endpoint and receive the recommended action. The Statlog data has 7 classes, corresponding to 7 actions; the simulator in this notebook is configured with `num_actions = 5`, matching the 5 classes found in `./data/model_rewards.csv`.

In [22]:
predictor = bandits_experiment.predictor

sim_app = StatlogSimApp(data='./data/model_rewards.csv',
                        predictor=predictor)

Unique classes and their ratio of total: {0: 0.1111111111111111, 1: 0.2222222222222222, 2: 0.2222222222222222, 3: 0.2222222222222222, 4: 0.2222222222222222}


Make sure that `num_arms` specified in `config.yaml` is equal to the total unique actions in the simulation application.

In [23]:
assert sim_app.num_actions == bandits_experiment.config["algor"]["algorithms_parameters"]["num_arms"]

In [24]:
import time

user_id, user_context = sim_app.choose_random_user()

action, event_id, model_id, action_prob, sample_prob = predictor.get_action(obs=user_context)

# Print the prediction response
print('Selected action: {}, event ID: {}, model ID: {}, probability: {}'.format(action, event_id, model_id, action_prob))

Selected action: 3, event ID: 243479965267372479126869264948909965314, model ID: bandits-1597300705-model-id-1597300721, probability: 0.9992000000000001


# Ingest Reward

The client application generates a reward after receiving the recommended action and stores the tuple `<eventID, reward>` in S3. In this case, the reward is 1 if the predicted action is the true class, and 0 otherwise. The SageMaker hosting endpoint saves all the inferences `<eventID, state, action, action probability>` to S3 using [Kinesis Firehose](https://aws.amazon.com/kinesis/data-firehose/). The experiment manager joins the reward with the state, action and action probability using [Amazon Athena](https://aws.amazon.com/athena/).

In [25]:
local_mode = bandits_experiment.local_mode
batch_size = 500 # collect 500 data instances
print("Collecting batch of experience data...")

# Generate experiences and log them
for i in range(batch_size):
    user_id, user_context = sim_app.choose_random_user()
    action, event_id, model_id, action_prob, sample_prob = predictor.get_action(obs=user_context.tolist())
    reward = sim_app.get_reward(user_id, action, event_id, model_id, action_prob, sample_prob, local_mode)
    
# Join (observation, action) with rewards (can be delayed) and upload the data to S3
if local_mode:
    bandits_experiment.ingest_joined_data(sim_app.joined_data_buffer)
else:
    print("Waiting for firehose to flush data to s3...")
    time.sleep(60) # Wait for firehose to flush data to S3
    rewards_s3_prefix = bandits_experiment.ingest_rewards(sim_app.rewards_buffer)
    bandits_experiment.join(rewards_s3_prefix)
    
sim_app.clear_buffer()

Collecting batch of experience data...


INFO:orchestrator:Successfully create S3 bucket 'sagemaker-us-east-1-835319576252' for athena queries
INFO:orchestrator:Started dummy local joining job...
INFO:orchestrator:Splitting data into train/evaluation set with ratio of 0.8
INFO:orchestrator:Joined data will be stored under s3://sagemaker-us-east-1-835319576252/bandits-1597300705/joined_data/bandits-1597300705-join-job-id-1597300760
INFO:orchestrator:_upload_data_buffer_as_joined_data_format put s3://sagemaker-us-east-1-835319576252/bandits-1597300705/joined_data/bandits-1597300705-join-job-id-1597300760/train/local-joined-data-1597300760.csv
INFO:orchestrator:_upload_data_buffer_as_joined_data_format put s3://sagemaker-us-east-1-835319576252/bandits-1597300705/joined_data/bandits-1597300705-join-job-id-1597300760/eval/local-joined-data-1597300760.csv


In [26]:
bandits_experiment.last_joined_job_train_data

's3://sagemaker-us-east-1-835319576252/bandits-1597300705/joined_data/bandits-1597300705-join-job-id-1597300760/train'

In [27]:
# Check the workflow to see if join job has completed successfully
bandits_experiment._jsonify()

{'experiment_id': 'bandits-1597300705',
 'training_workflow_metadata': {'next_model_to_train_id': None,
  'last_trained_model_id': 'bandits-1597300705-model-id-1597300721',
  'training_state': 'TRAINED'},
 'hosting_workflow_metadata': {'hosting_endpoint': 'local:arn-does-not-matter',
  'hosting_state': 'DEPLOYED',
  'last_hosted_model_id': 'bandits-1597300705-model-id-1597300721',
  'next_model_to_host_id': None},
 'joining_workflow_metadata': {'joining_state': 'SUCCEEDED',
  'last_joined_job_id': 'bandits-1597300705-join-job-id-1597300760',
  'next_join_job_id': None},
 'evaluation_workflow_metadata': {'evaluation_state': 'EVALUATED',
  'last_evaluation_job_id': 'bandits-1597300705-model-id-1597300721-eval-1597300735',
  'next_evaluation_job_id': None}}
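
The dictionary above can also be checked programmatically rather than by eye. The sketch below assumes only the structure visible in that output: each `*_workflow_metadata` entry holds a matching `*_state` field. The helper name `workflow_summary` is hypothetical, and the sample dictionary is a trimmed copy of the output above.

```python
def workflow_summary(state):
    """Collect the state field from each '<name>_workflow_metadata' entry."""
    suffix = "_workflow_metadata"
    summary = {}
    for key, meta in state.items():
        if key.endswith(suffix):
            name = key[: -len(suffix)]
            # In the output above, each entry carries a '<name>_state' field
            summary[name] = meta[f"{name}_state"]
    return summary


# Trimmed copy of the dictionary printed above
state = {
    "experiment_id": "bandits-1597300705",
    "training_workflow_metadata": {"training_state": "TRAINED"},
    "hosting_workflow_metadata": {"hosting_state": "DEPLOYED"},
    "joining_workflow_metadata": {"joining_state": "SUCCEEDED"},
    "evaluation_workflow_metadata": {"evaluation_state": "EVALUATED"},
}

summary = workflow_summary(state)
join_done = summary["joining"] == "SUCCEEDED"
print(summary, join_done)
```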

# Re-train and Re-deploy

Now we can train a new model on the newly collected experiences and host the resulting model.

In [28]:
bandits_experiment.train_next_model(input_data_s3_prefix=bandits_experiment.last_joined_job_train_data)

INFO:orchestrator:Use last trained model bandits-1597300705-model-id-1597300721 as pre-trained model for training
INFO:orchestrator:Starting training job for ModelId 'bandits-1597300705-model-id-1597300768''
INFO:orchestrator:Training job will be executed in 'local' mode


Creating tmpjboqt5rk_algo-1-whyov_1 ... done
Attaching to tmpjboqt5rk_algo-1-whyov_1
algo-1-whyov_1  | 2020-08-13 06:39:31,340 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-whyov_1  | 2020-08-13 06:39:31,352 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-whyov_1  | 2020-08-13 06:39:31,363 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-whyov_1  | 2020-08-13 06:39:31,372 sagemaker-containers INFO     Invoking user script
algo-1-whyov_1  | 
algo-1-whyov_1  | Training Env:
algo-1-whyov_1  | 
algo-1-whyov_1  | {
algo-1-whyov_1  |     "additional_framework_parameters": {
algo-1-whyov_1  |         "sagemaker_estimator": "RLEstimator"
algo-1-whyov_1  |     },
algo-1-whyov_1  |     "channel_input_dirs": {
algo-1-whyov_1  |         "training": "/opt/ml/input/da

tmpjboqt5rk_algo-1-whyov_1 exited with code 0
Aborting on container exit...




===== Job Complete =====


In [29]:
bandits_experiment.last_trained_model_id

'bandits-1597300705-model-id-1597300768'

In [30]:
bandits_experiment.deploy_model(model_id=bandits_experiment.last_trained_model_id)

INFO:orchestrator:Model 'bandits-1597300705-model-id-1597300768' is ready to deploy.


algo-1-psdo0_1  | [08/13/2020 06:39:42 INFO 139656814556928] Found new model! Trying to replace Model ID: bandits-1597300705-model-id-1597300721 with Model ID: bandits-1597300705-model-id-1597300768
algo-1-psdo0_1  | [2020-08-13 06:39:42 +0000] [24] [INFO] Handling signal: hup
algo-1-psdo0_1  | [2020-08-13 06:39:42 +0000] [24] [INFO] Hang up: Master
algo-1-psdo0_1  | [2020-08-13 06:39:42 +0000] [39] [INFO] Booting worker with pid: 39
algo-1-psdo0_1  | [2020-08-13 06:39:42 +0000] [40] [INFO] Booting worker with pid: 40
algo-1-psdo0_1  | [08/13/2020 06:39:42 INFO 139656814556928] creating an instance of VWModel
algo-1-psdo0_1  | [08/13/2020 06:39:42 INFO 139656814556928] successfully created VWModel
algo-1-psdo0_1  | [08/13/2020 06:39:42 INFO 139656814556928] command: ['vw', '--cb_explore', '5', '--epsilon', '0.001', '-p', '/dev/stdout', '--quiet', '--testonly', '-i', '/opt/ml/downloads/6jbeFT0P/vw.model']

In [31]:
bandits_experiment.last_hosted_model_id

'bandits-1597300705-model-id-1597300768'
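
The serving command in the deployment log above passes `--cb_explore 5 --epsilon 0.001` to VW, that is, five actions with epsilon-greedy exploration. As a rough illustration of what an epsilon-greedy policy does (a sketch, not VW's code), the greedy action receives most of the probability mass and `epsilon` is spread uniformly over all actions. The function names and the choice of greedy action below are hypothetical.

```python
import random


def epsilon_greedy_probs(num_actions, greedy_action, epsilon):
    """Spread epsilon uniformly over all actions; give the rest to the greedy action."""
    probs = [epsilon / num_actions] * num_actions
    probs[greedy_action] += 1.0 - epsilon
    return probs


def sample_action(probs, rng):
    """Sample an action index and return it with the probability it was chosen with."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs[action]


# Five actions and epsilon=0.001 as in the logged command; greedy_action=2 is arbitrary
probs = epsilon_greedy_probs(num_actions=5, greedy_action=2, epsilon=0.001)
action, action_prob = sample_action(probs, random.Random(0))
print(probs, action, action_prob)
```

Recording the probability of the sampled action alongside the action itself is what makes later counterfactual evaluation of a different policy possible, which is presumably why the cells above carry an `action_prob` value through to the reward logging.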

# Continuously Deploy New Bandit Models

The cells above walked through the individual steps of the training workflow. To train a model to convergence, we repeatedly retrain it on data collected from client application interactions. The single cell below demonstrates this continual training loop.

Each loop includes an evaluation step before deployment that compares the newly trained model (`last_trained_model_id`) against the currently hosted model (`last_hosted_model_id`). To make the loops finish faster, set `do_evaluation=False` in the cell below.

Details of each joining and training job can be tracked in `join_db` and `model_db` respectively. `model_db` also stores the evaluation scores. When you have multiple experiments, you can check their status in `experiment_db`.
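
The deployment gate described above can be reduced to a small decision function. This sketch mirrors the comparison used in this notebook's loop cell, where the new model is deployed when its evaluation score is less than or equal to the hosted model's score, and deployment is unconditional when evaluation is skipped. The function name `should_deploy` is hypothetical.

```python
def should_deploy(trained_score, hosted_score, do_evaluation=True):
    """Decide whether the newly trained model should replace the hosted one."""
    if not do_evaluation:
        # Without evaluation there is nothing to compare, so always deploy
        return True
    # Same comparison as the loop cell: deploy when the new score is not higher
    return trained_score <= hosted_score


print(should_deploy(0.30, 0.50))  # new model scores lower than the hosted one
print(should_deploy(0.60, 0.50))  # new model scores higher than the hosted one
print(should_deploy(None, None, do_evaluation=False))
```

Keeping the gate in one place makes it easy to change the rule later, for example to require a minimum margin before replacing the hosted model.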

In [32]:
do_evaluation = True

# You can also monitor your loop progress on CloudWatch Dashboard 
display(Markdown(bandits_experiment.get_cloudwatch_dashboard_details()))

You can monitor your Training/Hosting evaluation metrics on this [CloudWatch Dashboard](https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=bandits-1597300705;start=PT1H)

(Note: This would need Trained/Hosted Models to be evaluated in order to publish Evaluation Scores)

In [None]:
start_time = time.time()
total_loops = 2 # Increase for higher accuracy
batch_size = 500 # Model will be trained after every 500 data instances
rewards_list = []

local_mode = bandits_experiment.local_mode
for loop_no in range(total_loops):
    print(f"""
    #### Loop {loop_no+1}
    """)
    
    # Generate experiences and log them
    for i in range(batch_size):
        user_id, user_context = sim_app.choose_random_user()
        action, event_id, model_id, action_prob, sample_prob = predictor.get_action(obs=user_context.tolist())
        reward = sim_app.get_reward(user_id, action, event_id, model_id, action_prob, sample_prob, local_mode)
        rewards_list.append(reward)
    
    
    # Publish the mean reward for this batch to CloudWatch for monitoring
    bandits_experiment.cw_logger.publish_rewards_for_simulation(
        bandits_experiment.experiment_id,
        sum(rewards_list[-batch_size:])/batch_size
    )
    
    # Local/Athena join
    if local_mode:
        bandits_experiment.ingest_joined_data(sim_app.joined_data_buffer,ratio=0.85)
    else:
        print("Waiting for firehose to flush data to s3...")
        time.sleep(60) 
        rewards_s3_prefix = bandits_experiment.ingest_rewards(sim_app.rewards_buffer)
        bandits_experiment.join(rewards_s3_prefix, ratio=0.85)
    
    # Train 
    bandits_experiment.train_next_model(
        input_data_s3_prefix=bandits_experiment.last_joined_job_train_data)
    
    if do_evaluation:
        # Evaluate the newly trained model and the hosted model on the same eval data
        bandits_experiment.evaluate_model(
            input_data_s3_prefix=bandits_experiment.last_joined_job_eval_data,
            evaluate_model_id=bandits_experiment.last_trained_model_id)
        eval_score_last_trained_model = bandits_experiment.get_eval_score(
            evaluate_model_id=bandits_experiment.last_trained_model_id,
            eval_data_path=bandits_experiment.last_joined_job_eval_data)

        bandits_experiment.evaluate_model(
            input_data_s3_prefix=bandits_experiment.last_joined_job_eval_data,
            evaluate_model_id=bandits_experiment.last_hosted_model_id)

        eval_score_last_hosted_model = bandits_experiment.get_eval_score(
            evaluate_model_id=bandits_experiment.last_hosted_model_id, 
            eval_data_path=bandits_experiment.last_joined_job_eval_data)
    
        # Deploy only if the newly trained model's score is no higher than the hosted model's
        if eval_score_last_trained_model <= eval_score_last_hosted_model:
            bandits_experiment.deploy_model(model_id=bandits_experiment.last_trained_model_id)
        else:
            print('Not deploying model in loop {}'.format(loop_no + 1))
    else:
        bandits_experiment.deploy_model(model_id=bandits_experiment.last_trained_model_id)
    
    sim_app.clear_buffer()

print(f"Total time taken to complete {total_loops} loops: {time.time() - start_time}")


    #### Loop 1
    


INFO:orchestrator:Successfully create S3 bucket 'sagemaker-us-east-1-835319576252' for athena queries
INFO:orchestrator:Started dummy local joining job...
INFO:orchestrator:Splitting data into train/evaluation set with ratio of 0.85
INFO:orchestrator:Joined data will be stored under s3://sagemaker-us-east-1-835319576252/bandits-1597300705/joined_data/bandits-1597300705-join-job-id-1597300788
INFO:orchestrator:_upload_data_buffer_as_joined_data_format put s3://sagemaker-us-east-1-835319576252/bandits-1597300705/joined_data/bandits-1597300705-join-job-id-1597300788/train/local-joined-data-1597300789.csv
INFO:orchestrator:_upload_data_buffer_as_joined_data_format put s3://sagemaker-us-east-1-835319576252/bandits-1597300705/joined_data/bandits-1597300705-join-job-id-1597300788/eval/local-joined-data-1597300789.csv
INFO:orchestrator:Use last trained model bandits-1597300705-model-id-1597300768 as pre-trained model for training
INFO:orchestrator:Starting training job for ModelId 'bandits-159

Creating tmpce5k3r3h_algo-1-g73hb_1 ... done
Attaching to tmpce5k3r3h_algo-1-g73hb_1
algo-1-g73hb_1  | 2020-08-13 06:39:59,754 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-g73hb_1  | 2020-08-13 06:39:59,767 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-g73hb_1  | 2020-08-13 06:39:59,778 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-g73hb_1  | 2020-08-13 06:39:59,787 sagemaker-containers INFO     Invoking user script
algo-1-g73hb_1  | 
algo-1-g73hb_1  | Training Env:
algo-1-g73hb_1  | 
algo-1-g73hb_1  | {
algo-1-g73hb_1  |     "additional_framework_parameters": {
algo-1-g73hb_1  |         "sagemaker_estimator": "RLEstimator"
algo-1-g73hb_1  |     },
algo-1-g73hb_1  |     "channel_input_dirs": {
algo-1-g73hb_1  |         "training": "/opt/ml/input/da

tmpce5k3r3h_algo-1-g73hb_1 exited with code 0
Aborting on container exit...
===== Job Complete =====


# Visualize the Bandit Rewards

You can visualize model performance over the training loops by plotting the rolling mean reward across client interactions. The rolling mean reward is computed over the last `rolling_window` data instances, where each data instance corresponds to a single client interaction.

> Note: The plot below cannot be generated if the notebook has been restarted after executing the cell above, because `rewards_list` is held only in memory.
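
To make the rolling mean concrete, here is a small standard-library sketch of the computation. It follows the default behavior of pandas' `rolling(window).mean()`, which yields no value until a full window is available (pandas reports NaN there; this sketch uses `None`). The function name `rolling_mean` and the sample rewards are hypothetical.

```python
from collections import deque


def rolling_mean(values, window):
    """Mean over the last `window` values; None until a full window is available."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            # The oldest value is about to be evicted from the window
            total -= buf[0]
        buf.append(v)
        total += v
        out.append(total / window if len(buf) == window else None)
    return out


rewards = [0, 1, 1, 0, 1, 1]  # one reward per client interaction
smoothed = rolling_mean(rewards, window=3)
print(smoothed)
```

A larger window gives a smoother curve but delays the first plotted point and reacts more slowly to model updates.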



In [None]:
%%time

import matplotlib.pyplot as plt
import numpy as np  # needed for the color palette below
import pandas as pd
from pylab import rcParams
%matplotlib inline

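def get_mean_reward(reward_lst, batch_size=batch_size):
    # Cumulative mean reward per data instance, assuming each entry of
    # reward_lst is the total reward of one batch. Not used by the plot below.
    mean_rew = []
    for r in range(len(reward_lst)):
        mean_rew.append(sum(reward_lst[:r+1]) * 1.0 / ((r+1)*batch_size))
    return mean_rew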

rcParams['figure.figsize'] = 15, 10
lwd = 5
cmap = plt.get_cmap('tab20')
colors=plt.cm.tab20(np.linspace(0, 1, 20))

rolling_window = 100
rewards_df = pd.DataFrame(rewards_list, columns=['bandit']).rolling(rolling_window).mean()
rewards_df['oracle'] = sum(sim_app.opt_rewards) / len(sim_app.opt_rewards)

rewards_df.plot(y=['bandit','oracle'],linewidth=lwd)
plt.legend(loc=4, prop={'size': 20})
plt.tick_params(axis='both', which='major', labelsize=15)
plt.xlabel('Data instances (models were updated every %s data instances)' % batch_size, size=20)
plt.ylabel('Rolling Mean Reward', size=30)
plt.grid()
plt.show()

#### Get mean rewards

In [None]:
rewards_df.bandit.mean()

### Clean up

The bandits application above created three DynamoDB tables (experiment, join, model) with records for this experiment (e.g. `experiment_id='bandits-...'`). Once the experiment has finished, remove its records to keep these tables manageable. A running endpoint also incurs costs, so we delete these components as part of the clean up process.

> Only execute the clean up cells below when you've finished the current experiment and want to remove everything associated with it. After the cleanup, the CloudWatch metrics will no longer be populated.

In [None]:
# Uncomment to clean up the resources and table records of this experiment
# try:
#     bandits_experiment.clean_resource(experiment_id=bandits_experiment.experiment_id)
#     bandits_experiment.clean_table_records(experiment_id=bandits_experiment.experiment_id)
# except Exception:
#     print('Ignore any errors.  This is OK.')