# GAM306 - Land a rocket (and a good game) with reinforcement learning

# Competition

Use this notebook to train your lunar lander agent for the competition! This notebook is the same as the tutorial notebook you went through, so everything you learned there applies here. The only difference is that you can now choose which scenario to work on, as covered in Step 1. There are two notebooks, `lunarlander-1.ipynb` and `lunarlander-2.ipynb`, so that you and your teammate can work on two different scenarios at the same time if you choose to do so. Divide and conquer!


## Scenarios

Each scenario below is defined by coefficients (c1 through c4) that weight the terms of the reward function. Since the goal of each scenario is different, each one needs a different reward function so the agent can learn appropriately. Only the coefficients that differ from the default are listed for each scenario.

### OpenAI Gym default - tutorial
c1 = c2 = c3 = 100; c4 = 0.3

### Softest landing
c1 = 500
            
### Most centered landing
c2 = 500   
            
### Most level landing
c3 = 500            
            
### Minimum fuel
c4 = 0.5
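
As a rough illustration of how such coefficients could enter the reward, the sketch below weights the shaping terms of the default reward function from the Reward section. The mapping of c1 to landing speed, c2 to distance from the pad, c3 to tilt, and c4 to main-engine fuel cost is an assumption inferred from the scenario names, as is the idea that unlisted coefficients keep their default values. The names `coefficients_for`, `weighted_shaping`, and `fuel_penalty` are hypothetical and are not taken from the competition source files.

```python
import math

# Assumed mapping, inferred from the scenario names (not confirmed by the source files):
#   c1 -> landing speed, c2 -> distance from the pad, c3 -> tilt, c4 -> main-engine fuel cost.
# Coefficients a scenario does not list are assumed to keep their default values.
DEFAULT_COEFFS = {"c1": 100, "c2": 100, "c3": 100, "c4": 0.3}
SCENARIO_OVERRIDES = {
    "0": {},             # OpenAI Gym default (tutorial)
    "1": {"c1": 500},    # softest landing
    "2": {"c2": 500},    # most centered landing
    "3": {"c3": 500},    # most level landing
    "4": {"c4": 0.5},    # minimum fuel
}


def coefficients_for(scenario):
    """Merge a scenario's overrides into the default coefficients."""
    coeffs = dict(DEFAULT_COEFFS)
    coeffs.update(SCENARIO_OVERRIDES[scenario])
    return coeffs


def weighted_shaping(state, coeffs):
    """Shaping term with per-scenario weights; state is the 8-element observation."""
    x, y, vx, vy, angle = state[:5]
    return (
        - coeffs["c2"] * math.hypot(x, y)      # distance from the pad
        - coeffs["c1"] * math.hypot(vx, vy)    # speed
        - coeffs["c3"] * abs(angle)            # tilt
        + 10 * state[6] + 10 * state[7]        # leg contact bonus
    )


def fuel_penalty(m_power, s_power, coeffs):
    """Per-frame engine cost; the side-engine weight is kept at the default 0.03."""
    return m_power * coeffs["c4"] + s_power * 0.03


state = [0.0, 0.0, 0.3, 0.4, 0.0, 0.0, 0.0, 0.0]  # on the pad, but still moving
for scenario in ("0", "1"):
    print(scenario, weighted_shaping(state, coefficients_for(scenario)))
```

Under these assumptions, the same residual speed is penalized five times more heavily in the softest landing scenario than in the tutorial scenario, which is the kind of pressure that would push the agent toward slower touchdowns.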

# Step 1: Choose your scenario

Choose which scenario you want to work on by setting the _**scenario**_ variable in the next code cell to the appropriate number. For example, if you want to train for scenario 1 - softest landing, set `scenario = '1'`. The competition only counts scenarios 1-4.

1. Set `scenario = '1'` below to do the softest landing scenario. Or, set this value to whichever scenario you want to work on.

2. Run the code cell below to set the scenario variable by hitting the **Run** button in the toolbar above.

Make sure to re-run the code cell below to reset the scenario variable every time you want to change the scenario you are working on throughout this workshop.  

This code cell first defines that we are using the [Lunar Lander](https://gym.openai.com/envs/LunarLander-v2/) problem in the [Box2D](https://gym.openai.com/envs/#box2d) environment from [OpenAI Gym](https://openai.com/). 

It then sets the training algorithm to **PPO**, which stands for Proximal Policy Optimization. This is a class of reinforcement learning algorithms that is easy to use and offers good performance. You can read more about PPO [here](https://openai.com/blog/openai-baselines-ppo/#ppo).

Finally, it imports necessary files for training.

In [16]:
box2d_problem = 'lunarlander'

# Algorithm 
algo = 'PPO'

#Choose scenario 0, 1, 2, 3 or 4
#Scenario 0 - Tutorial
#Scenario 1 - Softest landing
#Scenario 2 - Accurate landing
#Scenario 3 - Level landing
#Scenario 4 - Minimal fuel usage
#Rerun cell to be able to set a different scenario 
scenario='1'

trainscript = 'train-{}-{}.py'.format(box2d_problem,algo)

!cp src/lunar_lander-{scenario}.py src/lunar_lander.py

In [17]:
trainscript

'train-lunarlander-PPO.py'
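
Because the scenario is a string that ends up in a file name, a typo only surfaces when the `cp` command fails. The sketch below shows one way to check the value before copying. The helper `scenario_source_file` is hypothetical; the scenario names and the file name pattern are taken from the cell above.

```python
# Scenario numbers and names, as listed in the comments of the cell above
SCENARIOS = {
    "0": "Tutorial",
    "1": "Softest landing",
    "2": "Accurate landing",
    "3": "Level landing",
    "4": "Minimal fuel usage",
}


def scenario_source_file(scenario):
    """Return the environment file that the notebook copies for a scenario."""
    if scenario not in SCENARIOS:
        raise ValueError(
            "scenario must be one of {}, got {!r}".format(sorted(SCENARIOS), scenario)
        )
    return "src/lunar_lander-{}.py".format(scenario)


print(SCENARIOS["1"], "->", scenario_source_file("1"))
```

Note that the keys are strings, so passing the integer `1` instead of `'1'` is rejected, which matches how the notebook sets `scenario='1'`.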

### States

What are the states of this agent? This is a sample that shows how the state of the lunar lander agent is determined. It contains information about the state of the agent at every step of the process, including the position of the agent for example. 

```python
        pos = self.lander.position
        vel = self.lander.linearVelocity
        state = [
            (pos.x - VIEWPORT_W/SCALE/2) / (VIEWPORT_W/SCALE/2),
            (pos.y - (self.helipad_y+LEG_DOWN/SCALE)) / (VIEWPORT_H/SCALE/2),
            vel.x*(VIEWPORT_W/SCALE/2)/FPS,
            vel.y*(VIEWPORT_H/SCALE/2)/FPS,
            self.lander.angle,
            20.0*self.lander.angularVelocity/FPS,
            1.0 if self.legs[0].ground_contact else 0.0,
            1.0 if self.legs[1].ground_contact else 0.0
        ]
```       
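
To make the eight entries easier to refer to, the sketch below gives each position in the state vector a name. The field names and the helper `both_legs_down` are my own labels for illustration and do not appear in the environment source; only the order of the entries is taken from the excerpt above.

```python
from collections import namedtuple

# Field order follows the state list above; the names themselves are illustrative
LanderState = namedtuple("LanderState", [
    "x",                 # horizontal offset from the pad, normalized
    "y",                 # height above the pad, normalized
    "vx",                # horizontal velocity, scaled
    "vy",                # vertical velocity, scaled
    "angle",             # lander angle in radians
    "angular_velocity",  # angular velocity, scaled
    "leg0_contact",      # 1.0 if legs[0] touches the ground, else 0.0
    "leg1_contact",      # 1.0 if legs[1] touches the ground, else 0.0
])


def both_legs_down(state):
    """True when both leg-contact flags are set."""
    s = LanderState(*state)
    return s.leg0_contact == 1.0 and s.leg1_contact == 1.0


example = [0.02, 0.0, 0.0, -0.01, 0.03, 0.0, 1.0, 1.0]
print(LanderState(*example))
print(both_legs_down(example))
```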

### Actions

What are the actions the agent can take? For example, the agent might be able to take no action. Or maybe the agent fires the left engine, or the main engine, or the right engine. 

According to Pontryagin's maximum principle, it is optimal to either fire the engine at full throttle or turn it off. This is why it is acceptable for this environment to use discrete actions (engine on or off).

```python        
        if self.continuous:
            # Action is two floats [main engine, left-right engines].
            # Main engine: -1..0 off, 0..+1 throttle from 50% to 100% power. Engine can't work with less than 50% power.
            # Left-right:  -1.0..-0.5 fire left engine, +0.5..+1.0 fire right engine, -0.5..0.5 off
            self.action_space = spaces.Box(-1, +1, (2,), dtype=np.float32)
        else:
            # Nop, fire left engine, main engine, right engine
            self.action_space = spaces.Discrete(4)
```
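
The comments in the excerpt above can be turned into a small decoder. The sketch below names the four discrete actions in the order given by the comment and converts a continuous action into a main-engine power and a side-engine choice. The names `DiscreteAction` and `decode_continuous` are hypothetical, and treating exactly ±0.5 as "off" is an assumption, since the comment leaves the boundary ambiguous.

```python
from enum import IntEnum


class DiscreteAction(IntEnum):
    """Discrete action indices, in the order given by the source comment."""
    NOOP = 0
    FIRE_LEFT = 1
    FIRE_MAIN = 2
    FIRE_RIGHT = 3


def decode_continuous(action):
    """Translate [main, lateral] floats into (main_power, side_engine)."""
    main, lateral = action
    # Main engine: off for values <= 0, otherwise throttle from 50% to 100% power
    main_power = 0.0 if main <= 0.0 else 0.5 + 0.5 * min(main, 1.0)
    # Side engines: fire only outside the -0.5..0.5 dead zone (boundary choice is assumed)
    if lateral < -0.5:
        side_engine = "left"
    elif lateral > 0.5:
        side_engine = "right"
    else:
        side_engine = None
    return main_power, side_engine


print(list(DiscreteAction))
print(decode_continuous((0.5, 0.0)))
print(decode_continuous((-0.2, -0.9)))
```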

### Reward

How is the agent rewarded? This is a sample reward function for the lunar lander agent. In this function, you can see that the reward increases or decreases depending on the state of the lunar lander agent. For example, the reward increases for each leg that makes contact with the moon. This trains the lunar lander to land right-side up instead of upside down. Similarly, the reward decreases depending on how much fuel is used. This trains the lunar lander to conserve as much fuel as possible.

The reward is officially calculated as follows:

The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector shown above. The reward for moving from the top of the screen to the landing pad with zero speed is about 100-140 points. If the lander moves away from the landing pad, it loses reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg ground contact is +10. Firing the main engine is -0.3 points each frame. Firing the side engine is -0.03 points each frame. The problem is considered solved at 200 points.

Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. You can see the source code for more details.

```python
        reward = 0
        shaping = \
            - 100*np.sqrt(state[0]*state[0] + state[1]*state[1]) \
            - 100*np.sqrt(state[2]*state[2] + state[3]*state[3]) \
            - 100*abs(state[4]) + 10*state[6] + 10*state[7]
        # Add ten points for legs contact, the idea is if you
        # lose contact again after landing, you get negative reward
        if self.prev_shaping is not None:
            reward = shaping - self.prev_shaping
        self.prev_shaping = shaping

        reward -= m_power*0.30  # less fuel spent is better, about -30 for heuristic landing
        reward -= s_power*0.03

        done = False
        if self.game_over or abs(state[0]) >= 1.0:
            done   = True
            reward = -100
        if not self.lander.awake:
            done   = True
            reward = +100
```
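
The key idea in the excerpt above is that the per-step reward is the change in the shaping value between two consecutive states, minus the engine costs. The sketch below isolates that idea in two standalone functions. `shaping` and `step_reward` are hypothetical names, and unlike the environment code the previous state is passed in explicitly rather than stored in `self.prev_shaping`.

```python
import math


def shaping(state):
    """Shaping value for one state, using the default weights from the excerpt."""
    return (
        - 100 * math.hypot(state[0], state[1])   # distance from the pad
        - 100 * math.hypot(state[2], state[3])   # speed
        - 100 * abs(state[4])                    # tilt
        + 10 * state[6] + 10 * state[7]          # leg contact
    )


def step_reward(prev_state, state, m_power=0.0, s_power=0.0):
    """Reward for one step: change in shaping minus fuel costs."""
    reward = shaping(state) - shaping(prev_state)
    reward -= m_power * 0.30
    reward -= s_power * 0.03
    return reward


hovering = [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
one_leg_down = [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]

print(step_reward(hovering, one_leg_down))   # gaining a leg contact
print(step_reward(one_leg_down, hovering))   # losing it again
```

Because only differences are rewarded, touching down with a leg and then bouncing back up cancels out, which is the behavior the comment about losing contact describes.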

# Step 2: Set prerequisites 

### Imports

To get started, we'll import the Python libraries we need and set up the environment with a few prerequisites for permissions and configurations. This code cell imports necessary Python libraries, like **boto3** which is the AWS SDK for Python.

3. Run the code cell below to set the prerequisites needed for this workshop by hitting the **Run** button in the toolbar above.
 

In [18]:
import sagemaker
import boto3
import sys
import os
import glob
import re
import subprocess
import numpy as np
from IPython.display import HTML
import time
from time import gmtime, strftime
sys.path.append("common")
from misc import get_execution_role, wait_for_s3_object
from docker_utils import build_and_push_docker_image
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

# Step 3: Setup S3 bucket

Next, we need to set up the linkage and authentication to an S3 bucket. This will be the bucket where SageMaker stores the output data of training jobs. SageMaker also stores the trained models as **model.tar.gz** files in this S3 bucket, as well as checkpoints and other metadata. 

4. Run the code cell below by hitting the **Run** button in the toolbar above.

This code cell creates a SageMaker session, which helps to manage interactions with the Amazon SageMaker APIs and any other AWS services needed, like S3. The S3 bucket is set to the default SageMaker bucket and the output path of this bucket is defined.

In [19]:
sage_session = sagemaker.session.Session()
s3_bucket = sage_session.default_bucket()  
s3_output_path = 's3://{}/'.format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

S3 bucket path: s3://sagemaker-us-east-1-793689376757/


# Step 4: Define variables and configure training

We define variables such as the job prefix for the training jobs *and the image path for the container (only when this is BYOC).*

5. **Run** the code cell below to set the job name. **DO NOT EDIT THIS CELL AT ALL!** Just run the cell as it is.

In [20]:
# create a descriptive job name 
job_name_prefix = scenario + '-rl-box2d-'+box2d_problem+scenario
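
For reference, the sketch below shows what the prefix looks like for each competition scenario and checks it against the naming pattern for SageMaker training jobs. The pattern is quoted as I understand it from the SageMaker API reference, so treat it as an assumption and confirm it against the current documentation. `build_job_name_prefix` is a hypothetical helper that mirrors the cell above.

```python
import re

# Training job name pattern as I understand it from the SageMaker API reference
# (an assumption; check the current documentation before relying on it)
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def build_job_name_prefix(scenario, box2d_problem="lunarlander"):
    """Mirror the notebook cell: scenario first and last, problem name in between."""
    return scenario + "-rl-box2d-" + box2d_problem + scenario


for scenario in ("1", "2", "3", "4"):
    prefix = build_job_name_prefix(scenario)
    print(prefix, bool(JOB_NAME_PATTERN.match(prefix)))
```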

### Configure where training happens

You can run RL training jobs from a SageMaker notebook instance or from a local notebook instance. In either case, you can run them in local mode or in SageMaker mode. Local mode uses the SageMaker Python SDK to run your code in a local container before deploying to SageMaker. This can speed up iterative testing and debugging while using the same familiar Python SDK interface. To use it, set `local_mode = True`.

6. For now, leave this cell at its default, with `local_mode` set to **False**.

7. **Run** the code cell.

**Tip:** In the future, you can fire off multiple training jobs on different instance types by changing the instance type below before training. Right now, the instance type is **ml.p3.2xlarge**. For this workshop, you can use the following instance types for distributed training:

* (2) ml.p3.2xlarge instances

* (2) ml.p3.8xlarge instances

* (2) ml.p3.16xlarge instances

You also have access to any of the [default SageMaker limits](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html#limits_sagemaker).

In [21]:
# run in local_mode on this machine, or as a SageMaker TrainingJob?
local_mode = False

if local_mode:
    instance_type = 'local'
else:
    # If on SageMaker, pick the instance type
    instance_type = "ml.p3.2xlarge"

### Create an IAM role

When running from a SageMaker notebook instance, get the execution role with `role = sagemaker.get_execution_role()`. When running from a local notebook instance, use the utils method `role = get_execution_role()` to create an execution role.

8. **Run** this code cell to create an IAM role for SageMaker.

In [22]:
try:
    role = sagemaker.get_execution_role()
except Exception:
    # Fall back to the helper from common/misc.py when not on a SageMaker notebook instance
    role = get_execution_role()

print("Using IAM role arn: {}".format(role))

Using IAM role arn: arn:aws:iam::793689376757:role/mod-2a01731acb2241e7-SagemakerIAM-17T4W130FWHO2


### Install docker for `local` mode

To work in `local` mode, you need to have docker installed. When running from your local machine, make sure that you have docker and docker-compose (for local CPU machines) or nvidia-docker (for local GPU machines) installed. Alternatively, when running from a SageMaker notebook instance, you can simply run the following script to install the dependencies.

9. **Run** the code cell below. 

**Tip:** You can only run a single local notebook at one time.

In [23]:
# only run from SageMaker notebook instance
if local_mode:
    !/bin/bash ./common/setup.sh

# Step 5: Build docker container

We must build a custom docker container with Roboschool installed.  This takes care of everything:

* Fetching base container image
* Installing Roboschool and its dependencies
* Uploading the new container image to ECR

This step can take a long time if you are running on a machine with a slow internet connection.  If your notebook instance is in SageMaker or EC2 it should take 3-10 minutes depending on the instance type.

10. **Run** the code cell to build the docker container. 

**Tip:** The output should say `Done pushing` when the build is finished. If your build seems to hang, either not progressing or not appearing to run at all, refresh your notebook and try again.

In [24]:
%%time

cpu_or_gpu = 'gpu' if instance_type.startswith('ml.p') else 'cpu'
repository_short_name = box2d_problem+"-%s" % cpu_or_gpu
docker_build_args = {
    'CPU_OR_GPU': cpu_or_gpu, 
    'AWS_REGION': boto3.Session().region_name
}
custom_image_name = build_and_push_docker_image(repository_short_name, build_args=docker_build_args)
print("Using ECR image %s" % custom_image_name)

Login Succeeded
Logged into ECR
Building docker image lunarlander-gpu from Dockerfile
$ docker build -t lunarlander-gpu -f Dockerfile . --build-arg CPU_OR_GPU=gpu --build-arg AWS_REGION=us-east-1
Sending build context to Docker daemon  726.5kB
Step 1/31 : ARG CPU_OR_GPU
Step 2/31 : ARG AWS_REGION
Step 3/31 : FROM 520713654638.dkr.ecr.${AWS_REGION}.amazonaws.com/sagemaker-rl-tensorflow:ray0.6.5-${CPU_OR_GPU}-py3
 ---> f038be2cc8c4
Step 4/31 : WORKDIR /opt/ml
 ---> Using cache
 ---> e6f5ed3314ff
Step 5/31 : RUN apt-get update
 ---> Using cache
 ---> 647d06cbfc46
Step 6/31 : RUN apt-get install sudo
 ---> Using cache
 ---> 5be1835b8d58
Step 7/31 : RUN apt-get update && apt-get install -y --no-install-recommends apt-utils
 ---> Using cache
 ---> 3832846bf4ba
Step 8/31 : RUN apt-get -y install libpcre3
 ---> Using cache
 ---> 31bc4a6206af
Step 9/31 : RUN apt-get -y install libpcre3-dev
 ---> Using cache
 ---> 884

latest: digest: sha256:ce5c897820bec29633e6d7b5f614b2933f52fefc752a698c2fdd47753729ac58 size: 9976
Done pushing 793689376757.dkr.ecr.us-east-1.amazonaws.com/lunarlander-gpu
Using ECR image 793689376757.dkr.ecr.us-east-1.amazonaws.com/lunarlander-gpu
CPU times: user 130 ms, sys: 40.6 ms, total: 170 ms
Wall time: 6.1 s


In [25]:
custom_image_name

'793689376757.dkr.ecr.us-east-1.amazonaws.com/lunarlander-gpu'

# Step 6: Write the Training Code

The training code is in the file `train-lunarlander-PPO.py` (the value of `trainscript`), which is located in the `src` directory. The script first imports its dependencies, then defines the environment and the launcher class. The cell below prints the script.

In [26]:
!pygmentize src/{trainscript}

[34mimport[39;49;00m [04m[36mjson[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m

[34mimport[39;49;00m [04m[36mgym[39;49;00m
[34mimport[39;49;00m [04m[36mray[39;49;00m
[34mfrom[39;49;00m [04m[36mray.tune[39;49;00m [34mimport[39;49;00m run_experiments
[34mfrom[39;49;00m [04m[36mray.tune.registry[39;49;00m [34mimport[39;49;00m register_env
[34mimport[39;49;00m [04m[36mroboschool[39;49;00m

[34mfrom[39;49;00m [04m[36msagemaker_rl.ray_launcher[39;49;00m [34mimport[39;49;00m SageMakerRayLauncher


[34mdef[39;49;00m [32mcreate_environment[39;49;00m(env_config):
    [37m# This import must happen inside the method so that worker processes import this code[39;49;00m
    [34mimport[39;49;00m [04m[36mroboschool[39;49;00m
    [34mreturn[39;49;00m gym.make([33m'[39;49;00m[33mLunarLander-v2[39;49;00m[33m'[39;49;00m)


[34mclass[39;49;00m [04m[32mMyLauncher[39;49;00m(SageMakerRayLauncher):

    [34mdef[39

## Hyperparameters

To tune hyperparameters, find the `train-lunarlander-PPO.py` file in the `src` folder of your SageMaker notebook. You can tune hyperparameters by editing the values in the `def get_experiment_config(self):` function. Below is a description of the hyperparameters.

11. Research the hyperparameters below to understand how changing them affects the lunar lander agent. Tune the hyperparameters in the `train-lunarlander-PPO.py` file to create the best-performing RL agent.

For details, see the PPO paper (https://arxiv.org/abs/1707.06347) and the Generalized Advantage Estimation paper (https://arxiv.org/pdf/1506.02438.pdf).


### DEFAULT_CONFIG
    # If true, use the Generalized Advantage Estimator (GAE)
    # with a value function, see https://arxiv.org/pdf/1506.02438.pdf.
    "use_gae": True,
    # GAE(lambda) parameter
    "lambda": 1.0,
    # Initial coefficient for KL divergence
    "kl_coeff": 0.2,
    # Size of batches collected from each worker
    "sample_batch_size": 200,
    # Number of timesteps collected for each SGD round
    "train_batch_size": 4000,
    # Total SGD batch size across all devices for SGD
    "sgd_minibatch_size": 128,
    # Whether to shuffle sequences in the batch when training (recommended)
    "shuffle_sequences": True,
    # Number of SGD iterations in each outer loop
    "num_sgd_iter": 30,
    # Stepsize of SGD
    "lr": 5e-5,
    # Learning rate schedule
    "lr_schedule": None,
    # Share layers for value function. If you set this to True, it's important
    # to tune vf_loss_coeff.
    "vf_share_layers": False,
    # Coefficient of the value function loss. It's important to tune this if
    # you set vf_share_layers: True
    "vf_loss_coeff": 1.0,
    # Coefficient of the entropy regularizer
    "entropy_coeff": 0.0,
    # Decay schedule for the entropy regularizer
    "entropy_coeff_schedule": None,
    # PPO clip parameter
    "clip_param": 0.3,
    # Clip param for the value function. Note that this is sensitive to the
    # scale of the rewards. If your expected V is large, increase this.
    "vf_clip_param": 10.0,
    # If specified, clip the global norm of gradients by this amount
    "grad_clip": None,
    # Target value for KL divergence
    "kl_target": 0.01,
    # Whether to rollout "complete_episodes" or "truncate_episodes"
    "batch_mode": "truncate_episodes",
    # Which observation filter to apply to the observation
    "observation_filter": "NoFilter",
    # Uses the sync samples optimizer instead of the multi-gpu one. This does
    # not support minibatches.
    "simple_optimizer": False
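
To give a feel for what `lambda` and `clip_param` control, the sketch below implements the two textbook formulas they come from: the GAE(lambda) advantage estimate and the PPO clipped surrogate term. This is a plain-Python illustration of the formulas in the two papers, not the code that Ray RLlib runs. The discount factor `gamma` is not part of the list above and its value here is an assumption.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=1.0):
    """GAE(lambda) advantages for one trajectory segment.

    rewards[t] and values[t] are per-step lists; last_value is the value
    estimate of the state after the final step (0.0 if the episode ended).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then an exponentially weighted sum of future errors
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, clip_param=0.3):
    """PPO clipped objective term for one sample (to be maximized).

    ratio is new_policy_prob / old_policy_prob for the action taken.
    """
    clipped_ratio = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
    return min(ratio * advantage, clipped_ratio * advantage)


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0]
print(gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=1.0))
print(gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=0.0))
print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

With `lam=1.0` the advantage looks at the whole remaining return (low bias, high variance), while `lam=0.0` uses only the one-step error. A smaller `clip_param` limits how far a single update can move the policy away from the one that collected the data.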

# Step 7: Train the RL model using the Python SDK Script mode

If you are using local mode, the training will run on the notebook instance. When using SageMaker for training, you can select a GPU or CPU instance. The RLEstimator is used for training RL jobs. This code cell does the following:

* Specifies the source directory where the environment, presets and training code is uploaded.
* Specifies the entry point as the training code 
* Specifies the choice of RL toolkit and framework. This automatically resolves to the ECR path for the RL Container. 
* Defines the training parameters such as the instance count, job name, and S3 path for output.
* Specifies the hyperparameters for the RL agent algorithm. The RLCOACH_PRESET or the RLRAY_PRESET can be used to specify the RL agent algorithm you want to use. 
* Defines the metrics definitions that you are interested in capturing in your logs. These can also be visualized in CloudWatch and SageMaker Notebooks. 


12. **Run** this code cell to begin training your RL model. It should take about 10-15 minutes for training to complete. Training time is capped at around 16 minutes as a time constraint for this workshop.

**Tip:** You can also monitor the progress of your training job by going back to the Amazon SageMaker Management Console and find the **Training jobs** link in the left navigation pane.

If you run into an error saying `ResourceLimitExceeded`, then change the EC2 instance type that training is happening on in **Step 4** above.

In [27]:
metric_definitions = RLEstimator.default_metric_definitions(RLToolkit.RAY)

In [28]:
metric_definitions.append({'Name': 'dist_from_center','Regex': 'dist_from_center=(.*?);'})
metric_definitions.append({'Name': 'vel_at_end','Regex': 'vel_at_end=(.*?);'})
metric_definitions.append({'Name': 'angle_at_end','Regex': 'angle_at_end=(.*?);'})
metric_definitions.append({'Name': 'fuel_used','Regex': 'fuel_used=(.*?);'})

In [29]:
%%time
    
estimator = RLEstimator(entry_point=trainscript,
                        source_dir='src',
                        dependencies=["common/sagemaker_rl"],
                        image_name=custom_image_name,
                        role=role,
                        train_instance_type=instance_type,
                        train_instance_count=2,
                        output_path=s3_output_path,
                        base_job_name=job_name_prefix,
                        metric_definitions=metric_definitions,
                        train_max_run=250
                    )

estimator.fit(wait=False)
job_name = estimator.latest_training_job.job_name
print("Training job: %s" % job_name)

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3.2xlarge for training job usage' is 2 Instances, with current utilization of 2 Instances and a request delta of 2 Instances. Please contact AWS support to request an increase for this limit.

# Step 8: Visualization

RL training can take a long time.  So while it's running there are a variety of ways we can track progress of the running training job.  Some intermediate output gets saved to S3 during training, so we'll set up to capture that. This code cell defines the path to where outputs are stored for specific training jobs. 

13. Always keep everything default in this code cell and **run** it. 

In [30]:
print("Job name: {}".format(job_name))

s3_url = "s3://{}/{}".format(s3_bucket,job_name)

intermediate_folder_key = "{}/output/intermediate/".format(job_name)
intermediate_url = "s3://{}/{}".format(s3_bucket, intermediate_folder_key)

print("S3 job path: {}".format(s3_url))
print("Intermediate folder path: {}".format(intermediate_url))
    
tmp_dir = "/tmp/{}".format(job_name)
os.makedirs(tmp_dir, exist_ok=True)  # also succeeds if the folder already exists
print("Create local folder {}".format(tmp_dir))

NameError: name 'job_name' is not defined

## Fetch videos of training rollouts

Videos of certain rollouts get written to S3 during agent training. Here we fetch up to 50 of the most recent videos from S3 and render the last one.

14. **Run** the following code cells to get the most recent video outputs of your trained lunar lander agent. You will be able to display the video in the SageMaker notebook itself. Hit **Run interact** to play the latest video. 

In [None]:
recent_videos = wait_for_s3_object(
            s3_bucket, intermediate_folder_key, tmp_dir, 
            fetch_only=(lambda obj: obj.key.endswith(".mp4") and obj.size>0), 
            limit=50, training_job_name=job_name)

In [None]:
# ls -l --block-size=M /tmp/{job_name}

In [None]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual, Video
import ipywidgets as widgets
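def showvideo(i):
    # Higher indices in the sorted list correspond to later videos
    selected_video = sorted(recent_videos)[i]
    return Video.from_file(selected_video)

print(len(recent_videos))

# Size the slider to the number of fetched videos to avoid an IndexError
last_index = max(len(recent_videos) - 1, 0)
video = interact_manual(showvideo, i=widgets.IntSlider(min=0, max=last_index, step=1, value=last_index));
print("Does landing look better for higher values of i, i.e. later videos?")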

## Plot metrics for training job

We can see the reward metric of the training as it's running, using algorithm metrics that are recorded in Amazon CloudWatch metrics. We can plot this to see the performance of the model over time.

15. **Run** the following code cell to view a plot of metrics from your training. 

16. You can also see the metrics in the **AWS Management Console** by finding your training job, clicking on it to expand details about it, and scrolling down to the Monitor section.

    * To do this, make sure you are on the SageMaker Management Console.
    * Find the Training jobs link on the left navigation pane
    * Find your latest training job and click it to see more details.
    * Scroll down to the monitor section
    * You will be able to see plotted metrics like episode reward mean, episode reward max, as well as other metrics like CPU and memory utilization. 

In [None]:
%matplotlib inline
from sagemaker.analytics import TrainingJobAnalytics

if not local_mode:
    df = TrainingJobAnalytics(job_name, ['episode_reward_mean']).dataframe()
    num_metrics = len(df)
    if num_metrics == 0:
        print("No algorithm metrics found in CloudWatch")
    else:
        # df.plot returns a matplotlib Axes object, so name it accordingly
        ax = df.plot(x='timestamp', y='value', figsize=(12,5), legend=True, style='b-')
        ax.set_ylabel('Mean reward per episode')
        ax.set_xlabel('Training time (s)')
else:
    print("Can't plot metrics in local mode.")

# Thank you!

You have finished exploring the SageMaker notebook, including understanding how to set the scenario, tune hyperparameters, build docker containers, train your RL models, and visualize outputs. 

You can continue to use this notebook for training your agent in other scenarios by starting again at the top and setting the scenario appropriately. 