<h1>Basic Custom Training Container</h1>

This notebook demonstrates how to build and use a basic custom Docker container for training with Amazon SageMaker. Reference documentation is available at https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html

We start by defining a few variables: the current execution role, the name of the ECR repository to which we will push the custom Docker container, and the default Amazon S3 bucket used by Amazon SageMaker.

In [1]:
import boto3
import sagemaker
from sagemaker import get_execution_role



ecr_repository_name = 'matterport_mask_rcnn'
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)

230755935769
us-west-2
arn:aws:iam::230755935769:role/SageMakerExecutionRoleMLOps
sagemaker-us-west-2-230755935769
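The expression role.split(':')[4] works because an ARN is a colon-separated string whose fifth field is the account ID. A slightly more defensive version of the same idea is sketched below; the Arn tuple and the parse_arn helper are hypothetical names introduced for illustration and are not part of boto3 or the SageMaker SDK.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn: str) -> Arn:
    """Split an ARN of the form arn:partition:service:region:account-id:resource."""
    # maxsplit=5 keeps any ':' that appears inside the resource part intact.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    _, partition, service, region, account_id, resource = parts
    return Arn(partition, service, region, account_id, resource)


role_arn = "arn:aws:iam::230755935769:role/SageMakerExecutionRoleMLOps"
parsed = parse_arn(role_arn)
print(parsed.account_id)  # 230755935769
print(parsed.resource)    # role/SageMakerExecutionRoleMLOps
```

IAM ARNs carry an empty region field, which is why the region is read from the boto3 session rather than from the role.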


Let's take a look at the Dockerfile, which contains the instructions for building our custom SageMaker training container:

In [2]:
! pygmentize Dockerfile

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.4-gpu-py36-cu100-ubuntu18.04

RUN apt update
RUN apt install -y python3-opencv

RUN git clone https://github.com/catwhiskers/Mask_RCNN.git && cd Mask_RCNN && pip install -r requirements.txt && python setup.py install


At a high level, the Dockerfile specifies the following operations for building this container:
<ul>
    <li>Start from the AWS-provided <strong>tensorflow-training</strong> image hosted in Amazon ECR (TensorFlow 1.15.4, GPU, Python 3.6, CUDA 10.0, Ubuntu 18.04)</li>
    <li>Refresh the apt package index with <strong>apt update</strong></li>
    <li>Install OpenCV for Python 3 (<strong>python3-opencv</strong>) with apt</li>
    <li>Clone the Mask_RCNN repository, install its Python requirements with pip, and install the package itself with <strong>python setup.py install</strong></li>
</ul>
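A quick way to confirm that an image built from this Dockerfile actually exposes the Python packages the training code needs is to check, from inside the container, whether each module can be located. The sketch below assumes the module names cv2 (from python3-opencv), mrcnn (the package installed by the Mask_RCNN repository) and tensorflow (from the base image); missing_modules is a hypothetical helper, and the names should be adjusted to whatever the image really ships.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located by the interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed module names; they are not confirmed by the notebook itself.
expected = ["cv2", "mrcnn", "tensorflow"]
missing = missing_modules(expected)
print("missing:", missing if missing else "none")
```

Because find_spec only locates a module without importing it, the check is cheap and does not initialize TensorFlow. It is worth running with the same python executable that the training job uses, since packages installed with apt and packages installed with pip may end up in different interpreters.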

<h3>Build and push the container</h3>
We are now ready to build this container and push it to Amazon ECR. This task is performed by the build_and_push.sh shell script. Let's take a look at the script and then execute it.

In [3]:
! pygmentize build_and_push.sh

#!/usr/bin/env bash

# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.

# The argument to this script is the image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
source $DIR/set_env.sh

# set region
region=
if [ "$#" -eq 1 ]; then
    region=$1
else

<h3>--------------------------------------------------------------------------------------------------------------------</h3>

The script builds the Docker image, creates the ECR repository if it does not already exist, and finally pushes the image to that repository. The first build takes a few minutes; after that, Docker caches the build layers and reuses them in subsequent builds.
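The "create the repository if it does not exist" step can also be expressed in Python. The sketch below is not what build_and_push.sh does internally (the relevant part of the script is not shown here); it only illustrates the same logic with the boto3 ECR client calls describe_repositories and create_repository. The ensure_repository helper is a hypothetical name, and the client is passed in so that the function can be exercised without AWS credentials.

```python
def ensure_repository(ecr_client, name):
    """Create the ECR repository `name` unless it already exists.

    Returns True if the repository was created, False if it was already there.
    """
    try:
        ecr_client.describe_repositories(repositoryNames=[name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=name)
        return True


# Usage inside the notebook (needs AWS credentials, so it is left commented out):
# import boto3
# ensure_repository(boto3.client("ecr", region_name=region), ecr_repository_name)
```

Checking first and creating only on RepositoryNotFoundException keeps the step idempotent, so re-running the build does not fail once the repository exists.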

In [6]:

! ./build_and_push.sh us-west-2

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Sending build context to Docker daemon  482.4MB
Step 1/4 : FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.4-gpu-py36-cu100-ubuntu18.04
 ---> 990ca849a51a
Step 2/4 : RUN apt update
 ---> Using cache
 ---> 0d3b66d75207
Step 3/4 : RUN apt install -y python3-opencv
 ---> Using cache
 ---> 933f627f3802
Step 4/4 : RUN git clone https://github.com/catwhiskers/Mask_RCNN.git && cd Mask_RCNN && pip install -r requirements.txt && python setup.py install
 ---> Using cache
 ---> 812b58a06c66
Successfully built 812b58a06c66
Successfully tagged matterport_mask_crnn:latest
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
The push refers to repository [230755935769.dkr.ecr.us-west-2.amazonaws.com/matterport_mask_crnn]

19808a1a: Preparing
c07844ef: Preparing
20cc965c: Preparing
f231bfcd: Preparing
b2438270: Prepa

<h3>Training with Amazon SageMaker</h3>

Once the container has been pushed to Amazon ECR, we are ready to start training with Amazon SageMaker. Starting a training job requires the ECR URI of the Docker image used for training, so the repository name in this URI must match the repository that the build script pushed to.

In [7]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

230755935769.dkr.ecr.us-west-2.amazonaws.com/matterport_mask_rcnn:latest
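A typo in the repository name only surfaces when the training job tries to pull the image, so it can help to take the URI apart and compare its pieces with what was pushed. The following is a minimal sketch under the assumption that the image lives in a standard-partition registry of the form account.dkr.ecr.region.amazonaws.com; parse_ecr_uri is a hypothetical helper, not an SDK function.

```python
import re

_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:@]+)(?::(?P<tag>[^:@]+))?$"
)


def parse_ecr_uri(uri):
    """Return account, region, repository and tag of an ECR image URI as a dict."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    parts = match.groupdict()
    parts["tag"] = parts["tag"] or "latest"  # Docker's default tag when none is given
    return parts


uri = "230755935769.dkr.ecr.us-west-2.amazonaws.com/matterport_mask_rcnn:latest"
print(parse_ecr_uri(uri)["repository"])  # matterport_mask_rcnn
```

Comparing the parsed repository with the name printed by the push step ("The push refers to repository [...]") is a cheap sanity check before launching a job.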


Since the purpose of this example is to explain how to build custom containers, we are not going to train a real model. The script that will be executed contains no real training logic: it prints the configuration injected by SageMaker and runs a dummy training loop on dummy training data. Let's look at the code first:

In [None]:
! pygmentize ../docker/code/main.py
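Since the contents of main.py are not reproduced here, the following is only a sketch of what such a script could look like, not the actual file. It relies on the documented SageMaker container layout: hyperparameters in /opt/ml/input/config/hyperparameters.json, one directory per channel under /opt/ml/input/data/, and model artifacts written to /opt/ml/model. The function names and the base_dir parameter are assumptions added so that the code can be tried outside a container.

```python
import json
import os


def load_hyperparameters(base_dir="/opt/ml"):
    """Read the hyperparameters SageMaker writes for the training container."""
    path = os.path.join(base_dir, "input", "config", "hyperparameters.json")
    with open(path) as f:
        return json.load(f)  # values arrive as strings


def list_channel_files(channel, base_dir="/opt/ml"):
    """List the files SageMaker copied into the given input channel."""
    channel_dir = os.path.join(base_dir, "input", "data", channel)
    return sorted(os.listdir(channel_dir))


def train(base_dir="/opt/ml", epochs=3):
    """Dummy loop: print the injected configuration, then write a placeholder model."""
    print("hyperparameters:", load_hyperparameters(base_dir))
    for channel in ("train", "validation"):
        print(channel, "files:", list_channel_files(channel, base_dir))
    for epoch in range(1, epochs + 1):
        print(f"epoch {epoch}/{epochs}")
    model_dir = os.path.join(base_dir, "model")
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.txt"), "w") as f:
        f.write("dummy model\n")


# Inside the container the entry point would simply call: train()
```

Anything written to /opt/ml/model is packaged by SageMaker and uploaded to S3 when the job finishes, which is why even a dummy script writes a placeholder file there.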

We upload some dummy data to Amazon S3, in order to define our S3-based training channels.

In [None]:
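# S3 key prefix for this example, also reused below as the training job base name.
# Assumed value: the notebook never defines `prefix`, so pick any hyphenated name.
prefix = 'matterport-mask-rcnn'
! echo "val1, val2, val3" > dummy.csv
print(sagemaker_session.upload_data('dummy.csv', bucket, prefix + '/train'))
print(sagemaker_session.upload_data('dummy.csv', bucket, prefix + '/val'))
! rm dummy.csv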

Finally, we can execute the training job by calling the fit() method of the generic Estimator object defined in the Amazon SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py). With an ML instance type such as ml.m5.xlarge, this corresponds to calling the CreateTrainingJob API (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html); with the instance type set to 'local', the SDK runs the container with Docker on the notebook instance instead.

In [None]:
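import sagemaker

# Generic Estimator for a custom container image.
# The argument names below follow version 1 of the SageMaker Python SDK; version 2 renames them
# to instance_count / instance_type and replaces s3_input with sagemaker.inputs.TrainingInput.
est = sagemaker.estimator.Estimator(container_image_uri,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='local', # use local mode
                                    #train_instance_type='ml.m5.xlarge',
                                    base_job_name=prefix)

# Hyperparameters are passed to the container as strings.
est.set_hyperparameters(hp1='value1',
                        hp2=300,
                        hp3=0.001)

# One S3-backed input channel per dataset uploaded above.
train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
val_config = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')

est.fit({'train': train_config, 'validation': val_config })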
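To make the link between fit() and CreateTrainingJob more concrete, the sketch below assembles a request body with the fields that API expects. It is an illustration rather than the request the SDK actually sends: build_training_job_request is a hypothetical helper, and the job name, bucket, volume size and runtime limit are placeholder values.

```python
def build_training_job_request(job_name, image_uri, role_arn, channels,
                               output_path, hyperparameters,
                               instance_type="ml.m5.xlarge", instance_count=1):
    """Assemble a CreateTrainingJob request body as a plain dict."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": name,
                "ContentType": "text/csv",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
            for name, s3_uri in channels.items()
        ],
        "OutputDataConfig": {"S3OutputPath": output_path},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 30,  # placeholder value
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},  # placeholder value
        # The API accepts hyperparameters only as a string-to-string map.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
    }


request = build_training_job_request(
    job_name="custom-container-demo",  # hypothetical job name
    image_uri="230755935769.dkr.ecr.us-west-2.amazonaws.com/matterport_mask_rcnn:latest",
    role_arn="arn:aws:iam::230755935769:role/SageMakerExecutionRoleMLOps",
    channels={"train": "s3://example-bucket/demo/train/",      # placeholder bucket
              "validation": "s3://example-bucket/demo/val/"},
    output_path="s3://example-bucket/demo/output/",
    hyperparameters={"hp1": "value1", "hp2": 300, "hp3": 0.001},
)
print(request["HyperParameters"])  # {'hp1': 'value1', 'hp2': '300', 'hp3': '0.001'}
# With credentials: boto3.client("sagemaker").create_training_job(**request)
```

The string conversion of the hyperparameters explains why the training script sees '300' and '0.001' rather than numbers and has to parse them itself.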