# Building a custom training container
1. [Part 1: Packaging your algorithm for use with Amazon SageMaker](#Part-1:-Packaging-your-algorithm-for-use-with-Amazon-SageMaker)
    1. [An overview of Docker](#An-overview-of-Docker)
    1. [How Amazon SageMaker runs your Docker container](#How-Amazon-SageMaker-runs-your-Docker-container)
      1. [Running your container during training](#Running-your-container-during-training)
        1. [The input](#The-input)
        1. [The output](#The-output)
      1. [Running your container during hosting](#Running-your-container-during-hosting)
    1. [The parts of the sample training container](#The-parts-of-the-sample-training-container)
    1. [The Dockerfile](#The-Dockerfile)
1. [Part 2: Building and registering the container](#Part-2:-Building-and-registering-the-container)
1. [Part 3: Use the container for training in Amazon SageMaker](#Part-3:-Use-the-container-for-training-in-Amazon-SageMaker)
  1. [Set up the environment](#Set-up-the-environment)
  1. [Training on SageMaker](#Training-on-SageMaker) 

## Part 1: Packaging your algorithm for use with Amazon SageMaker

### An overview of Docker

If you're familiar with Docker already, you can skip ahead to the next section.

For many data scientists, Docker containers are a new technology. They are not difficult to use, however, and they can significantly simplify the deployment of your software packages.

Docker provides a simple way to package arbitrary code into an _image_ that is totally self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. Containers are isolated from each other and from the host environment, so the way your program is set up is the way it runs, no matter where you run it.

Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands and environment variables.

A Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run simultaneously on the same physical or virtual machine instance.

Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. An example is provided below. You can build your Docker images based on Docker images built by yourself or by others, which can simplify things quite a bit.

Docker has become very popular in programming and devops communities due to its flexibility and its well-defined specification of how code can be run in its containers. It is the underpinning of many services built in the past few years, such as [Amazon ECS].

Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.

In Amazon SageMaker, Docker containers are invoked in one way for training and in another, slightly different, way for hosting. The following sections outline how to build containers for the SageMaker environment.

Some helpful links:

* [Docker home page](http://www.docker.com)
* [Getting started with Docker](https://docs.docker.com/get-started/)
* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
* [`docker run` reference](https://docs.docker.com/engine/reference/run/)

[Amazon ECS]: https://aws.amazon.com/ecs/

### How Amazon SageMaker runs your Docker container

Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container. All SageMaker framework containers already cover this requirement and will trigger your defined training algorithm and inference code.

* If you specify a program as an `ENTRYPOINT` in the Dockerfile, that program will be run at startup and its first argument will be `train` or `serve`. The program can then look at that argument and decide what to do.
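
As a rough illustration of this dispatch, a minimal Python entry point might look like the sketch below. The names `run_training`, `run_serving`, and `dispatch` are hypothetical placeholders and are not part of any SageMaker library; a real container would start its training loop or model server in their place.

```python
def run_training():
    # Hypothetical placeholder: a real container would start training here.
    return "training"


def run_serving():
    # Hypothetical placeholder: a real container would start a model server here.
    return "serving"


def dispatch(argv):
    """Choose an action from the first argument (`train` or `serve`)."""
    if len(argv) < 2 or argv[1] not in ("train", "serve"):
        raise ValueError("expected 'train' or 'serve' as the first argument")
    if argv[1] == "train":
        return run_training()
    return run_serving()
```

In an `ENTRYPOINT` script you would typically call `dispatch(sys.argv)`, so that the argument supplied by SageMaker selects the branch.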

#### Running your container during training

Currently, our SageMaker PyTorch container utilizes [console_scripts](http://python-packaging.readthedocs.io/en/latest/command-line-scripts.html#the-console-scripts-entry-point) to make use of the `train` command issued at training time. The line that gets invoked during `train` is defined within the setup.py file inside [SageMaker Containers](https://github.com/aws/sagemaker-containers/blob/master/setup.py#L48), our common SageMaker deep learning container framework. When this command is run, it will invoke the [trainer class](https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/cli/train.py) to run, which will finally invoke our [PyTorch container code](https://github.com/aws/sagemaker-pytorch-container/blob/master/src/sagemaker_pytorch_container/training.py) to run your Python file.

A number of files are laid out for your use, under the `/opt/ml` directory:

    /opt/ml
    |-- input
    |   |-- config
    |   |   |-- hyperparameters.json
    |   |   `-- resourceConfig.json
    |   `-- data
    |       `-- <channel_name>
    |           `-- <input data>
    |-- model
    |   `-- <model files>
    `-- output
        `-- failure

##### The input

* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values are always strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it's generally important that channels match algorithm expectations. The files for each channel are copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure. 
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
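
The sketch below shows one way a training script could read these File-mode inputs. The helper names `load_hyperparameters` and `list_channel_files` are assumptions made for illustration, as is the `epochs` hyperparameter mentioned afterwards.

```python
import json
import os


def load_hyperparameters(config_dir="/opt/ml/input/config"):
    """Read hyperparameters.json; every value arrives as a string."""
    path = os.path.join(config_dir, "hyperparameters.json")
    with open(path) as f:
        return json.load(f)


def list_channel_files(channel_name, data_dir="/opt/ml/input/data"):
    """Return all file paths under a File-mode channel directory, sorted."""
    channel_dir = os.path.join(data_dir, channel_name)
    paths = []
    for root, _dirs, files in os.walk(channel_dir):
        paths.extend(os.path.join(root, name) for name in files)
    return sorted(paths)
```

Because the values are strings, a script would convert them explicitly, for example `epochs = int(load_hyperparameters().get("epochs", "10"))`.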

##### The output

* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker packages any files in this directory into a compressed tar archive file. This file is made available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file are returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it is ignored.
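
A minimal sketch of how a training script might use these two directories follows. The wrapper `run_and_report`, the file name `model.pkl`, and the use of `pickle` are assumptions chosen for illustration; your model can be saved in any format.

```python
import os
import pickle
import traceback


def run_and_report(train_fn, model_dir="/opt/ml/model", output_dir="/opt/ml/output"):
    """Run train_fn, save its result to model_dir, and write a failure file on error."""
    try:
        model = train_fn()
        os.makedirs(model_dir, exist_ok=True)
        # Anything written under model_dir is what SageMaker packages after training.
        with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
            pickle.dump(model, f)
        return True
    except Exception as exc:
        os.makedirs(output_dir, exist_ok=True)
        # The contents of this file describe why the job failed.
        with open(os.path.join(output_dir, "failure"), "w") as f:
            f.write("Training failed: {}\n{}".format(exc, traceback.format_exc()))
        return False
```

A real entry point would probably also exit with a non-zero status when the function returns `False`, so that the job is reported as failed.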

### The parts of the sample training container

The `training_container` directory has all the components you need to extend the SageMaker scikit-learn container to use as a sample algorithm:

    .
    |-- Dockerfile
    |-- train.py

Let's discuss each of these in turn:

* __`Dockerfile`__ describes how to build your Docker container image. More details are provided below.
* __`train.py`__ is the program that implements our training algorithm and handles unloading/serialization of our model for use in the inference container.

In this simple application, we install only one file in the container. A single file may be all you need, but if you have many supporting routines, you may wish to install more.

### The Dockerfile

The Dockerfile describes the image that we want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A running Docker container is quite a bit lighter than a full operating system, however, because it takes advantage of Linux on the host machine for the basic operations.

Let's look at the [Dockerfile](./training_container/Dockerfile) for this example.

We start from the SageMaker scikit-learn image as the base. The base image is an ECR image, so it will have the following pattern.

`{account}.dkr.ecr.{region}.amazonaws.com/sagemaker-{framework}:{framework_version}-{processor_type}-{python_version}`

Here is an explanation of each field.
1. account - AWS account ID the ECR image belongs to. Our public scikit-learn framework images are under the 683313688378 account for the `us-east-1` region.
2. region - The region the ECR image belongs to. [Available regions](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/).
3. framework - The framework.
4. framework_version - The version of the framework.
5. processor_type - CPU or GPU.
6. python_version - The supported version of Python.

So the SageMaker scikit-learn ECR image would be:
`683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3`
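
To make the pattern concrete, the short sketch below fills in the fields programmatically. The function name `framework_image_uri` is a hypothetical helper, not part of the SageMaker SDK, and the values are the ones used in the example above.

```python
def framework_image_uri(account, region, framework, framework_version,
                        processor_type, python_version):
    """Fill in the ECR naming pattern for a SageMaker framework image."""
    return "{}.dkr.ecr.{}.amazonaws.com/sagemaker-{}:{}-{}-{}".format(
        account, region, framework, framework_version,
        processor_type, python_version)


print(framework_image_uri("683313688378", "us-east-1", "scikit-learn",
                          "0.20.0", "cpu", "py3"))
```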

Information on supported frameworks and versions can be found in this [README](https://github.com/aws/sagemaker-python-sdk).

Next, we add the code that implements our specific algorithm to the container and set up the right environment for it to run under.

Finally, we need to specify an environment variable.
- SAGEMAKER_PROGRAM - the Python script that should be invoked for training and inference.

# Part 2: Building and registering the container

The Amazon SageMaker Studio Image Build convenience package lets data scientists and developers build custom container images from Studio notebooks via [a new CLI](https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/). The CLI eliminates the need to manually set up and connect to Docker build environments for building container images in Amazon SageMaker Studio.

To use the CLI, we need to ensure the Amazon SageMaker execution role used by your Studio notebook environment (or another AWS Identity and Access Management (IAM) role, if you prefer) has the required permissions to interact with the resources used by the CLI, including access to CodeBuild and Amazon ECR.

Your role should have a trust policy with CodeBuild. See the following code:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "codebuild.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```
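
If you want to check a trust policy before running a build, a small helper like the one below could be used. This is only a sketch: `allows_codebuild` is a hypothetical function, and it assumes the policy has already been loaded into a Python dictionary (for instance with `json.loads` from a JSON document shaped like the one above).

```python
def allows_codebuild(trust_policy):
    """Return True if an Allow statement lets codebuild.amazonaws.com assume the role."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        # "Action" may be a single string or a list of strings.
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = stmt.get("Principal", {})
        if not isinstance(principal, dict):
            continue
        # "Service" may also be a single string or a list of strings.
        services = principal.get("Service", [])
        if isinstance(services, str):
            services = [services]
        if "codebuild.amazonaws.com" in services:
            return True
    return False
```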

You also need to make sure the appropriate permissions are included in your role to run the build in CodeBuild, create a repository in Amazon ECR, and push images to that repository. The following code is an example policy that you should modify as necessary to meet your needs and security requirements:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "codebuild:DeleteProject",
                "codebuild:CreateProject",
                "codebuild:BatchGetBuilds",
                "codebuild:StartBuild"
            ],
            "Resource": "arn:aws:codebuild:*:*:project/sagemaker-studio*"
        },
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogStream",
            "Resource": "arn:aws:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:GetLogEvents",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*:log-stream:*"
        },
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:CreateRepository",
                "ecr:BatchGetImage",
                "ecr:CompleteLayerUpload",
                "ecr:DescribeImages",
                "ecr:DescribeRepositories",
                "ecr:UploadLayerPart",
                "ecr:ListImages",
                "ecr:InitiateLayerUpload",
                "ecr:BatchCheckLayerAvailability",
                "ecr:PutImage"
            ],
            "Resource": "arn:aws:ecr:*:*:repository/sagemaker-studio*"
        },
        {
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
              "s3:GetObject",
              "s3:DeleteObject",
              "s3:PutObject"
              ],
            "Resource": "arn:aws:s3:::sagemaker-*/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket"
            ],
            "Resource": "arn:aws:s3:::sagemaker*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:ListRoles"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/*",
            "Condition": {
                "StringLikeIfExists": {
                    "iam:PassedToService": "codebuild.amazonaws.com"
                }
            }
        }
    ]
}
```

The CLI can easily be installed in the Studio notebook environment using the command `!pip install sagemaker-studio-image-build` within your notebook environment. 

However, because the scikit-learn repository is not included by default in the current release of the CLI ([fix pending](https://github.com/aws-samples/sagemaker-studio-image-build-cli/issues/13)), we will compile our own version. 

1. In the root directory, clone the [SageMaker Build CLI repository](https://github.com/aws-samples/sagemaker-studio-image-build-cli).
2. Navigate to `sagemaker-studio-image-build-cli/sagemaker_studio_image_build/data/buildspec.template.yml`
3. Replace the content of the file with [this](https://raw.githubusercontent.com/athewsey/sagemaker-studio-image-build-cli/fbf39b22dde7a3d1375b10897e510cc9dadb9ebc/sagemaker_studio_image_build/data/buildspec.template.yml)
4. Check that the file `sagemaker-studio-image-build-cli/sagemaker_studio_image_build/cli.py` contains a *,* at the end of line *77*

Now you are ready to compile the Image Build CLI. Open a new Terminal window in SageMaker Studio and input the following commands:
```
cd ~/sagemaker-studio-image-build-cli
make install
```

You should get a message that the Image Build CLI has been successfully installed. Now you can take advantage of the new CLI to easily build your custom bring-your-own Docker images from Amazon SageMaker Studio without worrying about the underlying setup and configuration of build services.

To use the CLI, from the same terminal window navigate to the directory containing your Dockerfile and enter the code below:
```
cd ~/training_container
sm-docker build . --repository lightfm:1.0
``` 

The `--repository` flag allows you to give a custom name and version label to the container. 

It’s that simple! The command automatically logs build output to your terminal and, if the operation succeeds, returns the image URI of your Docker image. The URI looks similar to the following:
`{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/lightfm:1.0`

# Part 3: Use the container for training in Amazon SageMaker

Once you have your container packaged, you can use it to train models. Let's do that with the algorithm we made above.

## Set up the environment
Here we retrieve the execution role that is used for working with SageMaker.

```
from sagemaker import get_execution_role
role = get_execution_role()
```

The helper function below generates the ECR URI for a given repository name and tag:

```
import boto3


def get_container_uri(ecr_repository, tag):
    # Look up the AWS account ID of the current caller
    account_id = boto3.client('sts').get_caller_identity().get('Account')

    # Region of the current boto3 session
    region = boto3.session.Session().region_name

    # China regions use a different domain suffix
    uri_suffix = 'amazonaws.com'
    if region in ['cn-north-1', 'cn-northwest-1']:
        uri_suffix = 'amazonaws.com.cn'

    return '{}.dkr.ecr.{}.{}/{}:{}'.format(account_id, region, uri_suffix, ecr_repository, tag)


print(get_container_uri('lightfm', '1.0'))
```

## Training on SageMaker
Training a model on SageMaker with the Python SDK is done by using the high-level abstraction of the Estimator class. 

This is where we specify the URI of the ECR image that we pushed above.

```
from sagemaker.estimator import Estimator

# URI of the custom image built in Part 2
byoc_image_uri = get_container_uri('lightfm', '1.0')

# S3 prefix (defined for reference; not used by the calls below)
prefix = 'light-fm-training-demo'

estimator = Estimator(image_uri=byoc_image_uri,
                      role=get_execution_role(),
                      base_job_name='light-fm-custom-container-train-job',
                      instance_count=1,
                      instance_type='ml.m4.xlarge')

# Launch the training job; no input channels are passed in this example
estimator.fit()
```

If the training was successful, the trained model has been stored on S3. You can check the location by navigating to the SageMaker training job in the interface and consulting the output S3 location.
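
If you prefer to locate the artifact programmatically, the sketch below splits an S3 URI into its bucket and key. It assumes that the fitted estimator exposes the artifact location through its `model_data` attribute; `split_s3_uri` is a hypothetical helper, and the bucket and key in the usage note are made-up values.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split an s3://bucket/key URI into a (bucket, key) tuple."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {}".format(s3_uri))
    return parsed.netloc, parsed.path.lstrip("/")
```

For example, `split_s3_uri("s3://my-bucket/job/output/model.tar.gz")` yields the bucket `my-bucket` and the key `job/output/model.tar.gz`; in the notebook you would pass `estimator.model_data` instead.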