# Welcome to Duckietown!

This is the companion tutorial file for learning how to use AWS's SageMaker tool to train your Duckietown AIDO submission... **in the cloud**!

We'll be building off of our [Reinforcement Learning](https://goo.gl/YFTjn3) tutorial, where we take DDPG and use SageMaker to speed up training!

This tutorial walks you through, step by step, how to get your SageMaker account running and how to use it to train an AIDO Lane Following submission.

Some prerequisites we expect you to have:
1. An AWS Account (You can get one by signing up [here](https://aws.amazon.com/))
2. A good overview of the code we'll be looking at. We'll be building off [this repository](https://github.com/duckietown/challenge-aido1_LF1-baseline-RL-sim-pytorch), and the code for this tutorial can be found [here](https://github.com/duckietown/aido-on-sagemaker). A good start would be the video tutorial posted above.
3. The ability to submit with `duckietown-shell` (which means you already have a [Duckietown Account](https://www.duckietown.org/research/ai-driving-olympics/ai-do-register)) as well as `git` on your computer

We've broken this tutorial down into five parts:

1. [Getting Started with AWS and Sagemaker](#Getting-Started-with-AWS-and-Sagemaker)
2. [Walking through the code](#Code-Walkthrough)
3. [Training with Sagemaker](#Sagemaker-Training)
4. [Submitting your model](#Submitting-Your-Model)
5. [Improvements and Faster Training with Sagemaker](#Next-Steps)

## Getting Started with AWS and Sagemaker

### Why AWS and the Cloud?


### Why Sagemaker


### Creating an AWS Account


### Creating a Notebook Instance

#### What type of Instance do I use?

#### CPU or GPU?

#### Paying Close Attention to the Region

#### I'm a student | competitor | academic instructor - How can I pay for this?

#### Jupyter Notebook Tips + Resources

### Setting the Correct IAM Permissions


### Cloning our Baseline inside of the Sagemaker Notebook Instance

## Code Walkthrough

### The parts of the sample container

The `container` directory has all the components you need to extend the SageMaker PyTorch container and use it as a sample algorithm:

    .
    ├── Dockerfile
    ├── entrypoint.sh
    ├── build_and_push.sh
    └── duckietown-rl
        ├── train_ddpg.py
        └── ... More stuff (See next cell)

Let's discuss each of these in turn:

* __`Dockerfile`__ describes how to build your Docker container image. More details are provided below.
* __`entrypoint.sh`__ is a script that launches an `Xvfb` process, which is basically a virtual screen, so that `gym-duckietown` can render the images your agent will see.
* __`build_and_push.sh`__ is a script that uses the Dockerfile to build your container image and then pushes it to ECR. We invoke the commands directly later in this notebook, but you can copy and run the script for your own algorithms.
* __`duckietown-rl`__ is the directory which contains our user code to be invoked.


### Training Code

    duckietown-rl/
    ├── train_ddpg.py
    └── ... (the rest of the PyTorch baseline code)

Look familiar? That's because it is! This is the same code as the PyTorch baseline, with only a few SageMaker-specific modifications. We'll focus on the files we need.

* __`train_ddpg.py`__ is the program that implements our training algorithm and handles loading our model for inference.


### The Dockerfile

The Dockerfile describes the image that we want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A running Docker container is quite a bit lighter than a full operating system, however, because it relies on the Linux kernel of the host machine for basic operations.

We start from the SageMaker PyTorch image as the base. The base image is an ECR image, so it will have the following pattern.
* {account}.dkr.ecr.{region}.amazonaws.com/sagemaker-{framework}:{framework_version}-{processor_type}-{python_version}

Here is an explanation of each field.
1. account - AWS account ID the ECR image belongs to. AWS's public deep learning framework images are all under the 520713654638 account.
2. region - The region the ECR image belongs to. [Available regions](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/).
3. framework - The deep learning framework.
4. framework_version - The version of the deep learning framework.
5. processor_type - CPU or GPU.
6. python_version - The supported version of Python.

So the SageMaker PyTorch ECR image would be:
520713654638.dkr.ecr.us-west-2.amazonaws.com/sagemaker-pytorch:0.4.0-cpu-py3
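
As a quick check of the pattern above, here is a minimal Python sketch that assembles the image name from its six fields. The helper name `ecr_image_uri` is our own and is not part of any AWS library; it only does string formatting and does not verify that the image actually exists.

```python
def ecr_image_uri(account, region, framework, framework_version,
                  processor_type, python_version):
    """Build an ECR image name following the pattern described above."""
    return (
        f"{account}.dkr.ecr.{region}.amazonaws.com/"
        f"sagemaker-{framework}:{framework_version}-{processor_type}-{python_version}"
    )


# The field values below are the ones used in this section.
uri = ecr_image_uri("520713654638", "us-west-2", "pytorch", "0.4.0", "cpu", "py3")
print(uri)
```

Swapping `"cpu"` for `"gpu"` or changing the region gives the name of the matching image variant, assuming AWS publishes one for that combination.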

Information on supported frameworks and versions can be found in this [README](https://github.com/aws/sagemaker-python-sdk).

Next, we add the code that implements our specific algorithm to the container and set up the right environment for it to run under.

**DISCLAIMER: As of now, the two environment variables below are only supported by the SageMaker Chainer (4.1.0+) and PyTorch (0.4.0+) containers.**

Finally, we need to specify two environment variables.
1. SAGEMAKER_SUBMIT_DIRECTORY - the directory within the container containing our Python script for training and inference.
2. SAGEMAKER_PROGRAM - the Python script that should be invoked for training and inference.
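
To make the relationship between the two variables concrete, here is a small illustrative sketch of how they combine into the path of the script to run. This is not the container's actual launch code; the helper `resolve_entry_script` and the example values (`/opt/ml/code`, `train_ddpg.py`) are assumptions for illustration only, and the real values are whatever the Dockerfile sets.

```python
import posixpath


def resolve_entry_script(env):
    """Join the submit directory and the program name into one script path."""
    submit_dir = env["SAGEMAKER_SUBMIT_DIRECTORY"]
    program = env["SAGEMAKER_PROGRAM"]
    # posixpath is used because the container runs Linux.
    return posixpath.join(submit_dir, program)


# Hypothetical values for illustration; check the Dockerfile for the real ones.
example_env = {
    "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/code",
    "SAGEMAKER_PROGRAM": "train_ddpg.py",
}
print(resolve_entry_script(example_env))
```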

Let's look at the Dockerfile for this example.

In [None]:
!cat container/Dockerfile

### Building and registering the container

The following shell code shows how to build the container image using `docker build` and push the container image to ECR using `docker push`. This code is also available as the shell script `container/build_and_push.sh`, which you can run as `build_and_push.sh duckietown-extending` to build the image `duckietown-extending`.

This code looks for an ECR repository in the account you're using and the current default region (if you're using a SageMaker notebook instance, this is the region where the notebook instance was created). If the repository doesn't exist, the script will create it. In addition, since we are using the SageMaker PyTorch image as the base, we will need to retrieve ECR credentials to pull this public image.

In [None]:
%%sh

# Note: the notebook's IAM role needs the AmazonEC2ContainerRegistryFullAccess policy attached.

# The name of our algorithm
algorithm_name=duckietown-extending

cd container

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration.
region=$(aws configure get region)
# Uncomment to fall back to a default region if none is configured:
# region=${region:-us-east-1}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Get the login command from ECR in order to pull down the SageMaker PyTorch image
$(aws ecr get-login --registry-ids 520713654638 --region ${region} --no-include-email)

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${algorithm_name} . --build-arg REGION=${region}
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}
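
The "create the repository if it doesn't exist" step in the script above can also be written in Python. The sketch below assumes a boto3 ECR client is passed in (for example `boto3.client("ecr")`); the function name `ensure_repository` is our own. It uses the client's `describe_repositories` and `create_repository` calls, and treats a `RepositoryNotFoundException` as the signal that the repository is missing.

```python
def ensure_repository(ecr_client, name):
    """Create the ECR repository `name` if it is missing.

    Returns True if the repository was created, False if it already existed.
    """
    try:
        ecr_client.describe_repositories(repositoryNames=[name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=name)
        return True
```

With boto3 installed and credentials configured, you would call it as `ensure_repository(boto3.client("ecr"), "duckietown-extending")`. Passing the client in as an argument keeps the function easy to test with a stand-in object.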

## SageMaker Training
To represent our training, we use the Estimator class, which we configure with five parameters:
1. IAM role - our AWS execution role
2. train_instance_count - number of instances to use for training.
3. train_instance_type - type of instance to use for training. For training locally, we specify `local` or `local_gpu`.
4. image_name - our custom PyTorch Docker image we created.
5. hyperparameters - hyperparameters we want to pass.

Let's start with setting up our IAM role. We make use of a helper function within the Python SDK. This function throws an exception if run outside of a SageMaker notebook instance, as it gets metadata from the notebook instance. If you are running outside of one, you must provide an IAM role with the access described above in [Setting the Correct IAM Permissions](#Setting-the-Correct-IAM-Permissions).

In [4]:
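import os
import shutil
import subprocess

from sagemaker import get_execution_role

role = get_execution_role()

instance_type = 'local'

# Use the GPU if `nvidia-smi` is installed and reports a working driver.
# Checking with shutil.which first avoids a FileNotFoundError on CPU-only hosts.
if shutil.which('nvidia-smi') and subprocess.call('nvidia-smi') == 0:
    instance_type = 'local_gpu'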
    
# When you're ready to really train:
# instance_type = 'ml.m4.xlarge'

print("Instance type = " + instance_type)

Instance type = local


In [3]:
from sagemaker.estimator import Estimator

hyperparameters = {'max_timesteps': 75}

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name='duckietown-extending:latest',
                      hyperparameters=hyperparameters)

estimator.fit('file:///tmp', wait=False)
print("All done!")

INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-945394400746
INFO:sagemaker:Creating training-job with name: duckietown-extending-2018-11-16-15-06-27-369


[{'DataUri': 'file:///tmp', 'ChannelName': 'training', 'DataSource': {'FileDataSource': {'FileDataDistributionType': 'FullyReplicated', 'FileUri': 'file:///tmp'}}}]
Creating tmpf1gvqn_algo-1-BV4AH_1_fce55f064bcc ... done
Attaching to tmpf1gvqn_algo-1-BV4AH_1_ef1f213b1226
algo-1-BV4AH_1_ef1f213b1226 | Starting Xvfb
algo-1-BV4AH_1_ef1f213b1226 | Executing command train
algo-1-BV4AH_1_ef1f213b1226 | 2018-11-16 15:06:31,000 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
algo-1-BV4AH_1_ef1f213b1226 | 2018-11-16 15:06:31,003 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-BV4AH_1_ef1f213b1226 | 2018-11-16 15:06:31,016 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
algo-1-BV4AH_1_ef1f213b1226 | 2018-11-16 15:06:31,021 sagemaker_pytorch_container.training INFO     Invoking user training script.
...

algo-1-BV4AH_1_ef1f213b1226 | timestep: 2 | reward: 3.033402040537948
algo-1-BV4AH_1_ef1f213b1226 | timestep: 3 | reward: 3.032298193450097
algo-1-BV4AH_1_ef1f213b1226 | timestep: 4 | reward: 3.023413666881358
algo-1-BV4AH_1_ef1f213b1226 | timestep: 5 | reward: 3.0255559146878275
algo-1-BV4AH_1_ef1f213b1226 | timestep: 6 | reward: 3.0257898139694483
algo-1-BV4AH_1_ef1f213b1226 | timestep: 7 | reward: 3.0253202729014426
algo-1-BV4AH_1_ef1f213b1226 | timestep: 8 | reward: 3.0300721519639433
algo-1-BV4AH_1_ef1f213b1226 | timestep: 9 | reward: 3.021357603972789
algo-1-BV4AH_1_ef1f213b1226 | timestep: 10 | reward: 2.9987801931194937
algo-1-BV4AH_1_ef1f213b1226 | timestep: 11 | reward: 2.9784653167320942
algo-1-BV4AH_1_ef1f213b1226 | timestep: 12 | reward: 2.960300581855909
algo-1-BV4AH_1_ef1f213b1226 | timestep: 13 | reward: 2.9512512631119283
...

### Changing Hyperparameters - Where are They?
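
The `hyperparameters` dictionary passed to the Estimator has to reach the training script somehow. The sketch below assumes the container forwards each entry to the script as a `--name value` command-line flag; that is an assumption about the container, not something shown in this notebook, so check `train_ddpg.py` for how the baseline actually reads its settings. The helper `parse_hyperparameters` and the default value are hypothetical.

```python
import argparse


def parse_hyperparameters(argv):
    """Read hyperparameters from command-line style arguments."""
    parser = argparse.ArgumentParser()
    # The default is a placeholder for illustration, not the baseline's value.
    parser.add_argument("--max_timesteps", type=int, default=1000)
    # Ignore any flags this sketch does not know about.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Mirrors hyperparameters = {'max_timesteps': 75} from the Estimator cell.
args = parse_hyperparameters(["--max_timesteps", "75"])
print(args.max_timesteps)
```

Using `parse_known_args` rather than `parse_args` keeps the script from failing if extra flags are passed alongside the ones it expects.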

## Submitting Your Model


## Next Steps

### Bigger Instances

### GPUs

### Architectures, State Representations