# Contextual Bandits with Parametric Actions -- Experimentation Mode

We demonstrate how to use a varying number of actions with contextual bandit algorithms in SageMaker. This notebook builds on 
the [Contextual Bandits example notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/reinforcement_learning/bandits_statlog_vw_customEnv/bandits_statlog_vw_customEnv.ipynb), which used a fixed number of actions. Please refer to that notebook for the basics of contextual 
bandits. 

In the contextual bandit setting, an agent recommends an action given a state. This notebook introduces three features to bandit 
algorithms that make them applicable to a broader set of real-world problems. We use the movie recommendation problem as an example.
1. The number of actions available to the agent can change over time. For example, the movies in the catalog change over time.
2. Each action may have features associated with it. For the movie recommendation problem, each movie can have features such as 
genre, cast, etc.
3. The agent can pick multiple actions. When recommending movies, it is natural that multiple movies are recommended at a time.
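
To make these features concrete, the sketch below formats one interaction in the multi-line text format that Vowpal Wabbit uses for contextual bandits with action-dependent features (ADF). It is an illustration only: the helper `to_vw_adf` and the namespaces `User` and `Movie` are hypothetical, and the training code in this notebook may build its examples differently.

```python
def to_vw_adf(shared_features, actions, label=None):
    """Format one interaction as a VW ADF multi-line example.

    shared_features: feature strings describing the user/context.
    actions: one list of feature strings per candidate action.
    label: optional (chosen_index, cost, probability) for a logged action.
    """
    lines = ["shared |User " + " ".join(shared_features)]
    for index, features in enumerate(actions):
        prefix = ""
        if label is not None and label[0] == index:
            _, cost, prob = label
            # VW labels the chosen action line as action:cost:probability.
            prefix = f"0:{cost}:{prob} "
        lines.append(prefix + "|Movie " + " ".join(features))
    return "\n".join(lines)


# Three candidate movies here; another interaction could list more or fewer.
example = to_vw_adf(
    shared_features=["age_group=25-34", "occupation=engineer"],
    actions=[["genre=comedy"], ["genre=drama", "genre=romance"], ["genre=action"]],
    label=(1, -1.0, 0.4),
)
print(example)
```

Each candidate action is one line with its own features, so the number of lines can differ between interactions. The cost is the negative of the reward, which is why a click is logged as `-1.0` here.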

The contextual bandit agent will trade off exploitation against exploration to quickly learn user preferences and minimize 
poor recommendations. Bandit algorithms are appropriate for recommendation problems when there are many cold items (items with little or no interaction data) in the catalog or when user preferences change over time.
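
As a rough illustration of this trade-off, the sketch below picks a slate of `top_k` items from a candidate list of any length. It uses epsilon-greedy exploration over a linear scoring model purely because it is short; this is not the exploration policy configured later in this notebook (`regcbopt`), and the function name, weights and features are hypothetical.

```python
import numpy as np


def recommend_slate(weights, user_features, item_features, top_k, epsilon, rng):
    """Pick up to top_k distinct items from a candidate list of any length.

    Each (user, item) pair is scored with a linear model. For every slot,
    with probability epsilon a random remaining item is chosen (explore);
    otherwise the best remaining item is chosen (exploit).
    """
    n_items = len(item_features)
    # One row per candidate: user features followed by that item's features.
    pairs = np.hstack([np.tile(user_features, (n_items, 1)), item_features])
    scores = pairs @ weights
    remaining = [int(i) for i in np.argsort(-scores)]  # best score first
    slate = []
    for _ in range(min(top_k, n_items)):
        if rng.random() < epsilon:
            choice = remaining[int(rng.integers(len(remaining)))]
        else:
            choice = remaining[0]
        remaining.remove(choice)
        slate.append(choice)
    return slate


rng = np.random.default_rng(0)
user = np.array([1.0, 0.0])
items = np.array([[0.1], [0.9], [0.5]])  # one feature per candidate movie
weights = np.array([0.0, 0.0, 1.0])      # toy model: score equals the item feature
print(recommend_slate(weights, user, items, top_k=2, epsilon=0.0, rng=rng))
```

With `epsilon=0.0` the slate is simply the two highest-scoring items; raising `epsilon` mixes in random items, which is how cold items get a chance to collect feedback.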

### What is Experimentation Mode?

Contextual bandits are often used to train models by interacting with the real world. In movie recommendation, the bandit learns user preferences from their feedback on past interactions. To test whether bandit algorithms are applicable to your use case, you may want to try different algorithms and understand the impact of different features and hyperparameters. Experimenting with real users can lead to a poor experience due to unanticipated issues or poor performance. Experimenting in production also comes with the complexity of working with infrastructure components (e.g. web services, data engines, databases) designed for scale.

With Experimentation Mode, you can get started with a small dataset or a simulator and identify the algorithm, features and hyperparameters that best fit your use case. Experimentation is much faster, does not impact real users, and is easier to work with. Once you are satisfied with the algorithm's performance, you can switch to Deployment Mode, where we provide infrastructure support that scales to production requirements.

## Pre-requisites 

### Imports

To get started, we'll import the Python libraries we need and set up the environment with a few prerequisites for permissions and configurations.

In [None]:
import sagemaker
import boto3
import sys
import os
import json
import glob
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import subprocess
from IPython.display import HTML
import time
from time import gmtime, strftime
sys.path.append("common")
from misc import get_execution_role, wait_for_s3_object
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework
%matplotlib inline

### Setup S3 bucket

Set up the connection and authentication to the S3 bucket that you want to use for checkpoints and metadata. 

In [None]:
sage_session = sagemaker.session.Session()
s3_bucket = sage_session.default_bucket()  
s3_output_path = 's3://{}/'.format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

### Configure where training happens

You can run your RL training jobs on a SageMaker notebook instance or on your own machine. In both scenarios, you can run in either local mode or SageMaker mode. Local mode uses the SageMaker Python SDK to run your code in a local container before deploying to SageMaker. This can speed up iterative testing and debugging while using the same familiar Python SDK interface. You just need to set `local_mode = True`.

In [None]:
# run in local mode?
local_mode = True

if local_mode:
    instance_type = 'local'
else:
    instance_type = "ml.c5.xlarge"

### Create an IAM role

When running from a SageMaker notebook instance, get the execution role with `role = sagemaker.get_execution_role()`. When running from a local notebook instance, use the utility method `role = get_execution_role()` to create an execution role.

In [None]:
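try:
    # Works when running on a SageMaker notebook instance.
    role = sagemaker.get_execution_role()
except Exception:
    # Fall back to the helper from common/misc.py when running locally.
    role = get_execution_role()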

print("Using IAM role arn: {}".format(role))

### Client application (MovieLens Environment)
The client application simulates a live environment that uses the SageMaker bandits model to serve recommendations to users. The logic of reward generation resides in the client application. We simulate the online learning loop with feedback using a recommendation simulator. The simulator uses the MovieLens 100K dataset.

The workflow of the client application is as follows:
- **User sampling and candidate list generation**: The client application picks a user u and a list of 100 items (defined by `item_pool_size`) at random, which are sent to the SageMaker endpoint to retrieve a recommendation. This list consists of movies that the user u has rated in the past, as we know the true user preferences (ratings) for these movies.
- **Bandit Slate recommendation**: The SageMaker endpoint returns a recommendation: a list of top-k items, the associated probability, and an `event_id`.
- **Feedback generation by simulating user behaviour**: The reward is given to the agent based on user ratings in the dataset. We assume a Cascade Click model, where the user scans the list top-down and clicks on the first item that they like. We give a reward of 0 to all the items above the clicked item and a reward of 1 to the item that was clicked.
- **Rewards logging**: The application reports the reward to the experiment manager using S3, along with the corresponding `event_id`.
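
The feedback step of this workflow can be sketched as follows. This is a minimal illustration rather than the simulator's actual code: the function name, the `like_threshold` cutoff of 4 stars, and the choice to mark items below the click as unobserved (`None`) are assumptions.

```python
def cascade_click_rewards(slate, user_ratings, like_threshold=4):
    """Return one reward per slate position under a cascade click model.

    The user scans the slate top-down and clicks the first item rated at or
    above like_threshold (an assumed cutoff). Items above the click get 0,
    the clicked item gets 1, and items below it stay None (not examined).
    """
    rewards = [None] * len(slate)
    for position, item in enumerate(slate):
        if user_ratings.get(item, 0) >= like_threshold:
            rewards[position] = 1
            break
        rewards[position] = 0
    return rewards


# Hypothetical ratings for one user: movie id -> stars.
ratings = {10: 2, 20: 5, 30: 4}
print(cascade_click_rewards([10, 20, 30], ratings))
```

If the user likes none of the items, every position receives a reward of 0 in this sketch.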

`event_id` is a unique identifier for each interaction. It is used to join inference data `<shared_context, actions_context, action, action probability>` with the rewards. 
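
The join on `event_id` can be sketched with pandas. The column names and values below are made up for illustration; the actual inference and reward records may use a different schema.

```python
import pandas as pd

# Hypothetical inference log: one row per recommendation served.
inference_log = pd.DataFrame({
    "event_id": ["e1", "e2", "e3"],
    "action": [12, 3, 8],
    "action_prob": [0.6, 0.3, 0.8],
})

# Hypothetical reward log: rewards can arrive late, out of order, or not at all.
rewards_log = pd.DataFrame({
    "event_id": ["e3", "e1"],
    "reward": [1, 0],
})

# Keep only the interactions that have both an inference record and a reward.
joined = inference_log.merge(rewards_log, on="event_id", how="inner")
print(joined)
```

Interactions without a reported reward (here `e2`) drop out of the inner join and are not used as labeled training examples.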

In a later cell of this notebook, once a hosted endpoint exists, we illustrate how to interact with the endpoint and get the recommended actions.

#### MovieLens 100K usage license
Please be aware of the following requirements regarding acknowledgment, copyright and availability, cited from the [data set description page](http://files.grouplens.org/datasets/movielens/ml-100k-README.txt).

The data set may be used for any research purposes under the following conditions:

 * The user may not state or imply any endorsement from the
   University of Minnesota or the GroupLens Research Group.
 * The user must acknowledge the use of the data set in
   publications resulting from the use of the data set
   (see below for citation information).
 * The user may not redistribute the data without separate
   permission.
 * The user may not use this information for any commercial or
   revenue-bearing purposes without first obtaining permission
   from a faculty member of the GroupLens Research Project at the
   University of Minnesota.
   
If you have any further questions or comments, please contact GroupLens (grouplens-info@cs.umn.edu).

#### Download MovieLens 100K and upload to S3

In [None]:
%%bash
curl -o ml-100k.zip http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip

In [None]:
movielens_data_s3_path = sage_session.upload_data(path="ml-100k", bucket=s3_bucket, key_prefix="movielens/data")

## Train the Bandit model using the Python SDK Script mode

If you are using local mode, the training will run on the notebook instance. When using SageMaker for training, you can select a GPU or CPU instance. The RLEstimator is used for training RL and bandit jobs. 

1. Specify the source directory where the environment, presets and training code are uploaded.
2. Specify the entry point as the training code.
3. Specify the input dataset.
4. Specify the container image.
5. Define the training parameters such as the instance count, job name and S3 path for output.
6. Specify the hyperparameters for the bandit algorithm. 
7. Define the metric definitions that you are interested in capturing in your logs. These can also be visualized in CloudWatch and SageMaker Notebooks. 
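
For step 7, SageMaker estimators accept a `metric_definitions` list in which each entry pairs a metric name with a regular expression applied to the training logs. The sketch below is hypothetical: it assumes the training script prints lines such as `average_regret: 0.214`, which may not match what `train.py` actually logs.

```python
import re

# Hypothetical metric; the name and log format are assumptions.
metric_definitions = [
    {"Name": "average_regret", "Regex": r"average_regret: ([0-9\.]+)"},
]

# Check the regex locally against a made-up log line before starting a job.
sample_log_line = "interaction 500 average_regret: 0.214"
match = re.search(metric_definitions[0]["Regex"], sample_log_line)
print(match.group(1))
```

The list would then be passed to the estimator as `metric_definitions=metric_definitions`. Testing the regex locally first avoids discovering a mismatch only after a training job has finished.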

#### Define the hyperparameters and the training job name

In [None]:
hyperparameters = {
                   # Algorithm params
                   "arm_features": True,
                   "exploration_policy": "regcbopt",
                   "mellowness": 0.01,
                   
                   # Env params
                   "item_pool_size": 100,
                   "top_k": 5,
                   "total_interactions": 1000,
                   "max_users": 100,
                   }

job_name = "testbed-bandits-1"

In [None]:
# Configure the training job; image_name points to the Vowpal Wabbit bandits container.
estimator = RLEstimator(entry_point="train.py",
                        source_dir="src",
                        dependencies=["common/sagemaker_rl"],
                        image_name="462105765813.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rl-vw-container:adf",
                        role=role,
                        train_instance_type=instance_type,
                        train_instance_count=1,
                        output_path=s3_output_path,
                        base_job_name=job_name,
                        hyperparameters=hyperparameters)

estimator.fit(inputs={"movielens": movielens_data_s3_path}, wait=True)

#### Download the outputs to plot performance

In [None]:
if local_mode:
    output_path_prefix = f"{estimator.latest_training_job.job_name}/output.tar.gz"
else:
    output_path_prefix = f"{estimator.latest_training_job.job_name}/output/output.tar.gz"
    
sage_session.download_data(path="./output", bucket=s3_bucket, key_prefix=output_path_prefix)

In [None]:
%%bash
tar -C ./output -xvzf ./output/output.tar.gz

In [None]:
if local_mode:
    output_path_local = "output/data/output.json"
else:
    output_path_local = "output/output.json"

with open(output_path_local) as f:
    all_regrets = json.load(f)

In [None]:
# Turn each per-interaction regret series into a cumulative regret curve.
all_regrets = {key: np.cumsum(val) for key, val in all_regrets.items()}
df = pd.DataFrame(all_regrets)
df.plot()
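
The plot shows cumulative regret, so a curve that flattens over time suggests that the agent is making fewer poor recommendations. As a complementary view, the sketch below computes the running average regret per interaction on made-up data. The series names `bandit` and `random` and their values are hypothetical; the actual keys depend on what the training code writes to `output.json`.

```python
import numpy as np

# Hypothetical per-interaction regrets (1 = poor recommendation, 0 = good one).
per_step_regret = {"bandit": [1, 0, 1, 0, 0], "random": [1, 1, 0, 1, 1]}

# Cumulative regret, as plotted above.
cumulative = {name: np.cumsum(vals) for name, vals in per_step_regret.items()}

# Running average: cumulative regret divided by the number of interactions so far.
average = {
    name: cum / np.arange(1, len(cum) + 1) for name, cum in cumulative.items()
}

for name in per_step_regret:
    print(name, cumulative[name].tolist(), round(float(average[name][-1]), 2))
```

A running average that keeps decreasing indicates learning, whereas a roughly constant average is what a policy that never improves would produce.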