# Offline RL Training of the Cart-pole Model with SageMaker RL and Ray RLlib

---
## Introduction

In this notebook, we demonstrate how to use the Ray RLlib toolkit along with SageMaker RL to perform offline RL training. In offline RL (also known as batch RL), the agent is trained using previously generated experience datasets rather than by interacting with the environment during training. This is highly valuable when simulating the actual environment is expensive. We consider the familiar cartpole problem.

We have structured this notebook in three parts: <br />
1) Generate the necessary experience dataset. <br />
2) Train the offline RL model. <br />
3) Evaluate the performance of the trained agent. <br />

The objective of the cartpole problem is to keep the pole balanced in the vertical position by moving the cart. For more details regarding the observation and action spaces and the reward structure, see [Neuronlike adaptive elements that can solve difficult learning control problems](https://ieeexplore.ieee.org/document/6313077).
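
To make the problem concrete, here is a minimal sketch of the cartpole dynamics in plain Python. The constants (masses, pole length, force, time step, and termination limits) follow the widely used Gym CartPole setup and are assumptions for illustration; they are not taken from this notebook or its training scripts. The observation is the tuple (cart position, cart velocity, pole angle, pole angular velocity), the action is 0 (push left) or 1 (push right), and each step earns a reward of 1 until the pole falls or the cart leaves the track.

```python
import math

# Assumed constants, following the common Gym CartPole configuration.
GRAVITY = 9.8
CART_MASS = 1.0
POLE_MASS = 0.1
POLE_HALF_LENGTH = 0.5
FORCE_MAG = 10.0
TAU = 0.02  # seconds between state updates
X_LIMIT = 2.4  # cart position limit
THETA_LIMIT = 12 * math.pi / 180  # pole angle limit in radians


def step(state, action):
    """Advance one Euler step. action: 0 pushes left, 1 pushes right."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = CART_MASS + POLE_MASS
    pole_mass_length = POLE_MASS * POLE_HALF_LENGTH
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + pole_mass_length * theta_dot**2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t**2 / total_mass)
    )
    x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass
    next_state = (
        x + TAU * x_dot,
        x_dot + TAU * x_acc,
        theta + TAU * theta_dot,
        theta_dot + TAU * theta_acc,
    )
    # The episode ends when the cart leaves the track or the pole tilts too far.
    done = abs(next_state[0]) > X_LIMIT or abs(next_state[2]) > THETA_LIMIT
    return next_state, 1.0, done


def rollout(policy, max_steps=500):
    """Run one episode from the upright rest state and return the total reward."""
    state, total = (0.0, 0.0, 0.0, 0.0), 0.0
    for _ in range(max_steps):
        state, reward, done = step(state, policy(state))
        total += reward
        if done:
            break
    return total
```

For example, `rollout(lambda state: 1)` always pushes right, so the pole tips over after a handful of steps; a trained agent should keep the episode going much longer.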


## Pre-requisites

### Imports


We will start by importing the necessary Python libraries.

In [None]:
import sagemaker
import boto3
import sys
import os
import glob
import re
import subprocess
import numpy as np
from pathlib import Path
import tarfile
from IPython.display import HTML
import time
from time import gmtime, strftime

sys.path.append("common")
from misc import get_execution_role, wait_for_s3_object
from docker_utils import build_and_push_docker_image
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

### Setup S3 bucket


As a next step, we set up the S3 bucket for storing the experiences and other training artifacts.

In [None]:
sage_session = sagemaker.session.Session()
s3_bucket = sage_session.default_bucket()
s3_output_path = "s3://{}/".format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

### Define Variables 


Next we will define a job name prefix. The job name prefix ends in "-gen" to indicate that the training job is used to generate the experience dataset. 

In [None]:
job_name_prefix = "rl-cartpole-ray-gen"

### Configure where training happens



The next step is to define the type of training instance. For this example, we will not be using local mode to generate the training experiences. 

In [None]:
local_mode = False
instance_type = "ml.c5.2xlarge"

### Create an IAM role

Obtain an execution role with the IAM permissions for SageMaker training.


In [None]:
try:
    role = sagemaker.get_execution_role()
except Exception:
    # Fall back to the helper imported from misc (in the common directory)
    role = get_execution_role()

print("Using IAM role arn: {}".format(role))

### Use docker image


To train the model and generate the experiences, we need a Docker image of the Ray container. We use a public Docker image for RLlib from the [Amazon SageMaker RL containers repository](https://github.com/aws/sagemaker-rl-container).

In [None]:
cpu_or_gpu = "gpu" if instance_type.startswith("ml.p") else "cpu"
aws_region = boto3.Session().region_name
custom_image_name = (
    "462105765813.dkr.ecr.%s.amazonaws.com/sagemaker-rl-ray-container:ray-0.8.5-tf-%s-py36"
    % (aws_region, cpu_or_gpu)
)

### Write the Training Code for Generating Offline Data
The training code for generating the experiences is written in the file "train-rl-cartpole-ray-gen.py" located in the /src directory. We will use the PPO algorithm to generate the training experiences. Note that we use the "output" key in the config dictionary to specify the location where the experience data will be stored.




In [None]:
!pygmentize src/train-{job_name_prefix}.py
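
The script itself is not reproduced in this text. As a rough, hypothetical illustration of where the "output" key sits, a generation config might look like the sketch below. The environment name, the output path, and the file-size limit are all assumptions; the path is chosen on the assumption that SageMaker packages files under /opt/ml/output/data into output.tar.gz.

```python
# Hypothetical excerpt of a generation config; the actual script in src/ may differ.
generation_config = {
    "env": "CartPole-v1",  # assumed to match the environment used for evaluation
    "run": "PPO",
    "config": {
        # Directory where RLlib writes the sampled experiences as JSON files.
        # This path is an assumption for illustration only.
        "output": "/opt/ml/output/data/cartpole-out",
        # Upper bound on the size of each output file, in bytes (assumed value).
        "output_max_file_size": 5000000,
    },
}
```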

## Generate the offline dataset using the Python SDK Script mode

Here we specify the estimator object that will be used to generate the dataset. The entry point script, the source directory, the Docker image, etc. are included in the estimator definition. We start the training job by calling `estimator.fit()`.

In [None]:
%%time

metric_definitions = RLEstimator.default_metric_definitions(RLToolkit.RAY)

estimator = RLEstimator(
    entry_point="train-%s.py" % job_name_prefix,
    source_dir="src",
    dependencies=["common/sagemaker_rl"],
    image_uri=custom_image_name,
    role=role,
    instance_type=instance_type,
    instance_count=1,
    output_path=s3_output_path,
    base_job_name=job_name_prefix,
    metric_definitions=metric_definitions,
    hyperparameters={
        # Attention scientists!  You can override any Ray algorithm parameter here:
        # "rl.training.config.horizon": 5000,
        # "rl.training.config.num_sgd_iter": 10,
    },
)

estimator.fit(wait=local_mode)
job_name = estimator.latest_training_job.job_name
print("Training job: %s" % job_name)
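
Because `fit` is called with `wait=local_mode` and `local_mode` is `False`, the call returns before the job finishes, so the output archive may not exist yet when it is downloaded. One possible way to block until the job ends is sketched below. The helper name `wait_for_training_job` and its parameters are hypothetical; the status lookup is passed in as a callable so the polling logic does not depend on AWS access.

```python
import time


def wait_for_training_job(describe, job_name, poll_seconds=30, timeout_seconds=3600):
    """Poll until the job reaches a terminal status and return that status.

    `describe` is any callable that takes a job name and returns a dict with a
    "TrainingJobStatus" key.
    """
    terminal = {"Completed", "Failed", "Stopped"}
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe(job_name)["TrainingJobStatus"]
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Job {} is still {} after the timeout".format(job_name, status))
        time.sleep(poll_seconds)
```

With boto3, the callable could be `lambda name: boto3.client("sagemaker").describe_training_job(TrainingJobName=name)`, after which `wait_for_training_job(describe, job_name)` returns once the job has completed, failed, or been stopped.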

## Retrieving the Experience Data

Once the training is complete, the experience data that includes the sequence of observations, actions and rewards generated by the environment will be stored among the training artifacts in Amazon S3. We will retrieve this data here.

In [None]:
data_folder = "cartpole_data"
exp_loc = "cartpole-out"
s3 = boto3.client("s3")
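# exist_ok avoids a FileExistsError when this cell is re-run
os.makedirs("src/{}".format(data_folder), exist_ok=True)
s3.download_file(
    s3_bucket,
    "{}/output/output.tar.gz".format(job_name),
    "src/{}/output.tar.gz".format(data_folder),
)
# Extract the experience files and close the archive afterwards
with tarfile.open("src/{}/output.tar.gz".format(data_folder)) as exp_tar:
    exp_tar.extractall("src/{}".format(data_folder))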

The experience data generated during training can now be found in JSON format inside `data_folder`.

In [None]:
expfolder = Path.cwd() / "src" / data_folder / exp_loc
offline_data = list(expfolder.rglob("*.json"))
offline_data
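
A quick sanity check on the downloaded files can be useful before training on them. The sketch below assumes that each file holds one JSON object per line and that each object stores its per-step rewards as a list under a "rewards" key; both are assumptions about the file layout, so adjust the key if your files differ. The function name `summarize_experiences` is hypothetical.

```python
import json
from pathlib import Path


def summarize_experiences(paths):
    """Count batches, transitions and total reward across JSON-lines files."""
    batches = 0
    transitions = 0
    total_reward = 0.0
    for path in paths:
        with Path(path).open() as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                batch = json.loads(line)
                rewards = batch.get("rewards", [])  # assumed column name
                batches += 1
                transitions += len(rewards)
                total_reward += sum(rewards)
    return {"batches": batches, "transitions": transitions, "total_reward": total_reward}
```

Calling `summarize_experiences(offline_data)` then gives a rough idea of how much experience is available to the offline agent.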

## Training the Offline RL Agent Using IMPALA

Next, we train the offline RL agent on the data generated above. We will use an off-policy RL algorithm called IMPALA to train the agent. The code for training the offline RL agent using IMPALA is written in train-rl-cartpole-ray-offline-IMPALA.py, located in the /src directory. In the training config, we specify the location where the training experiences are stored using the "input" key. Note that the "explore" key is set to "False" in the training config since we will only be sampling actions from the offline dataset.


In [None]:
job_name_prefix = "rl-cartpole-ray-offline-IMPALA"

In [None]:
!pygmentize src/train-{job_name_prefix}.py
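
As with the generation script, the training script is not reproduced in this text. A hypothetical sketch of how the "input" and "explore" keys described above might appear in the config is shown here. The environment name and the input path are assumptions; the path presumes that the contents of `src` are made available under /opt/ml/code inside the training container.

```python
# Hypothetical excerpt of an offline training config; the actual script may differ.
offline_config = {
    "env": "CartPole-v1",  # assumed to match the environment used for evaluation
    "run": "IMPALA",
    "config": {
        # Location of the previously generated experience files (assumed path).
        "input": "/opt/ml/code/cartpole_data/cartpole-out",
        # Actions come from the offline dataset, so exploration is turned off.
        "explore": False,
    },
}
```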

In [None]:
%%time

metric_definitions = RLEstimator.default_metric_definitions(RLToolkit.RAY)

estimator2 = RLEstimator(
    entry_point="train-%s.py" % job_name_prefix,
    source_dir="src",
    dependencies=["common/sagemaker_rl"],
    image_uri=custom_image_name,
    role=role,
    instance_type=instance_type,
    instance_count=1,
    output_path=s3_output_path,
    base_job_name=job_name_prefix,
    metric_definitions=metric_definitions,
    hyperparameters={
        # Attention scientists!  You can override any Ray algorithm parameter here:
        # "rl.training.config.horizon": 5000,
        # "rl.training.config.num_sgd_iter": 10,
    },
)

estimator2.fit(wait=local_mode)
job_name = estimator2.latest_training_job.job_name
print("Training job: %s" % job_name)

## Evaluation of the RL model


In this step, we run evaluation of the RL Agent using the final stored checkpoint. We will move the checkpoint data to either the local directory or Amazon S3 depending on whether evaluation is run in local mode or using SageMaker mode.

In [None]:
tmp_dir = "/tmp/{}".format(job_name)
os.makedirs(tmp_dir, exist_ok=True)


if local_mode:
    model_tar_key = "{}/model.tar.gz".format(job_name)
else:
    model_tar_key = "{}/output/model.tar.gz".format(job_name)

local_checkpoint_dir = "{}/model".format(tmp_dir)

wait_for_s3_object(s3_bucket, model_tar_key, tmp_dir, training_job_name=job_name)

if not os.path.isfile("{}/model.tar.gz".format(tmp_dir)):
    raise FileNotFoundError("File model.tar.gz not found")

os.makedirs(local_checkpoint_dir, exist_ok=True)
# Extract the model archive into the local checkpoint directory
with tarfile.open("{}/model.tar.gz".format(tmp_dir)) as model_tar:
    model_tar.extractall(local_checkpoint_dir)

print("Checkpoint directory {}".format(local_checkpoint_dir))

In [None]:
if local_mode:
    checkpoint_path = "file://{}".format(local_checkpoint_dir)
    print("Local checkpoint file path: {}".format(local_checkpoint_dir))
else:
    checkpoint_path = "s3://{}/{}/checkpoint/".format(s3_bucket, job_name)
    if not os.listdir(local_checkpoint_dir):
        raise FileNotFoundError("Checkpoint files not found under the path")
    os.system("aws s3 cp --recursive {} {}".format(local_checkpoint_dir, checkpoint_path))
    print("S3 checkpoint file path: {}".format(checkpoint_path))

Evaluation metrics such as mean_reward, max_reward, etc. can now be calculated by calling `estimator_eval.fit()`.

In [None]:
%%time

estimator_eval = RLEstimator(
    entry_point="evaluate-ray.py",
    source_dir="src",
    dependencies=["common/sagemaker_rl"],
    image_uri=custom_image_name,
    role=role,
    instance_type=instance_type,
    instance_count=1,
    base_job_name=job_name_prefix + "-evaluation",
    hyperparameters={"evaluate_episodes": 10, "algorithm": "IMPALA", "env": "CartPole-v1"},
)

estimator_eval.fit({"model": checkpoint_path})
job_name = estimator_eval.latest_training_job.job_name
print("Evaluation job: %s" % job_name)
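
The evaluation script is not shown in this text. As a rough illustration of what metrics such as mean_reward and max_reward summarize, the sketch below computes them from a list of per-episode total rewards. The function name `summarize_rewards` and the `min_reward` entry are hypothetical additions, not part of the evaluation script.

```python
from statistics import mean


def summarize_rewards(episode_rewards):
    """Return simple summary statistics for per-episode total rewards."""
    if not episode_rewards:
        raise ValueError("episode_rewards must not be empty")
    return {
        "mean_reward": mean(episode_rewards),
        "max_reward": max(episode_rewards),
        "min_reward": min(episode_rewards),
    }
```

With `"evaluate_episodes": 10`, such a summary would be computed over ten episode totals.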