# Create Training Job

An Amazon SageMaker *training job* is a compute process that trains an ML model in a containerized environment. In this notebook, you will create a training job with your own custom container on Amazon SageMaker. To read more about training jobs, refer to the [official docs](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html).

The outline of this notebook is:
- create a service execution role that SageMaker assumes to run a training job
- build a lightweight container based on `continuumio/miniconda`
- test your container locally
- push your container to ECR
- upload your training data to an S3 bucket
- create a training job with everything you did above

In [1]:
# setup
!wget https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/master/sagemaker-fundamentals/execution-role/iam_helpers.py

--2021-03-02 00:54:41--  https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/master/sagemaker-fundamentals/execution-role/iam_helpers.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3350 (3.3K) [text/plain]
Saving to: ‘iam_helpers.py’


2021-03-02 00:54:41 (68.9 MB/s) - ‘iam_helpers.py’ saved [3350/3350]



In [13]:
import boto3
import pprint

pp = pprint.PrettyPrinter(indent=1)
iam = boto3.client('iam')

## Create an IAM service role

To review IAM roles, see the [notebook on execution roles](https://github.com/hsl89/amazon-sagemaker-examples/blob/execution-role/sagemaker-fundamentals/execution-role/execution-role.ipynb).

The service role is intended to be assumed by the SageMaker service. For simplicity, we attach the `AmazonSageMakerFullAccess` managed policy to it, which grants more permissions than this notebook actually needs. You are encouraged to experiment with the helper functions in `iam_helpers.py` to work out the minimum permissions required to run this notebook.
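To make "assumed by the SageMaker service" concrete, here is a minimal sketch of the trust policy such a role typically carries. The function names `build_sagemaker_trust_policy` and `create_service_role` are hypothetical and are not taken from `iam_helpers.py`; the actual `create_execution_role` helper may be implemented differently.

```python
import json


def build_sagemaker_trust_policy():
    """Return a trust policy that lets the SageMaker service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_service_role(iam_client, role_name):
    """Hypothetical helper: create a role with the trust policy above.

    `iam_client` is expected to behave like boto3.client('iam').
    """
    return iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_sagemaker_trust_policy()),
    )


# Inspect the policy document without calling AWS
print(json.dumps(build_sagemaker_trust_policy(), indent=2))
```

The trust policy only controls who may assume the role. What the role is allowed to do is decided by the permission policies attached to it afterwards, such as `AmazonSageMakerFullAccess` in this notebook.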


In [10]:
from iam_helpers import *

role_name = 'sm'
# create the role that the SageMaker service will assume
role = create_execution_role(role_name=role_name)['Role']

In [15]:
# attach AmazonSageMakerFullAccess
res = iam.attach_role_policy(
    RoleName=role['RoleName'],
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess',
)

pp.pprint(res)

{'ResponseMetadata': {'HTTPHeaders': {'content-length': '212',
                                      'content-type': 'text/xml',
                                      'date': 'Tue, 02 Mar 2021 01:21:01 GMT',
                                      'x-amzn-requestid': '8e298791-e0bd-4721-a8a0-12d71d9802d7'},
                      'HTTPStatusCode': 200,
                      'RequestId': '8e298791-e0bd-4721-a8a0-12d71d9802d7',
                      'RetryAttempts': 0}}


## Build the training environment into a Docker image

Before creating a training job on Amazon SageMaker, you need to package the entire runtime environment of your ML project into a Docker image and push that image to the Elastic Container Registry (ECR) under your account.

When you trigger a training job, the SageMaker instance you requested pulls that image from your ECR repository and runs it against the data at the S3 URI you specified.
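As a rough illustration of how the image and the S3 data come together, the sketch below assembles a request for the boto3 `create_training_job` call. The helper name `build_training_job_request`, the account ID, region, repository, bucket, instance type, and the channel name `train` are all placeholder assumptions for illustration, not values from this notebook.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri):
    """Hypothetical helper: build keyword arguments for create_training_job."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # the image SageMaker pulls from your ECR repository
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        # the service role SageMaker assumes
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",  # assumed channel name
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",  # assumed instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 5,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 600},
    }


# All identifiers below are placeholders
request = build_training_job_request(
    job_name="example-training-job",
    image_uri="123456789012.dkr.ecr.us-west-2.amazonaws.com/example-image:latest",
    role_arn="arn:aws:iam::123456789012:role/sm",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
)

# With valid AWS credentials and real resources, the request could be sent as:
# boto3.client("sagemaker").create_training_job(**request)
```

Building the request as a plain dictionary first makes it easy to inspect before anything is submitted to AWS.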

It is important to know how SageMaker runs your image. For a **training job**, SageMaker runs your image like this:
```
docker run <image> train
```
That is, your image needs an executable named `train`, and that executable starts the model training process. You will see later in the notebook how to create it.

The next natural question is how the image running on a SageMaker instance accesses the data the model is trained on. SageMaker requires you to reserve the `/opt/ml` directory inside your image so that it can provide training information there. When you trigger a training job, you specify the location of your training data, and the SageMaker instance running your image makes that data available under `/opt/ml/input` (each input channel appears at `/opt/ml/input/data/<channel_name>`).

To read more about how SageMaker uses `/opt/ml` to provide training information, refer to the [official docs](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html).
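The `/opt/ml` layout described above can be sketched from the point of view of a training script. This is only an illustration: the helper names `sagemaker_paths` and `list_channel_files` are hypothetical, the channel name `train` is an assumption, and the `prefix` argument exists so the same code can be dry-run locally against a temporary directory instead of the real `/opt/ml`.

```python
import os
import tempfile


def sagemaker_paths(prefix="/opt/ml"):
    """Return the conventional locations SageMaker uses inside the container."""
    return {
        "hyperparameters": os.path.join(prefix, "input", "config", "hyperparameters.json"),
        "data": os.path.join(prefix, "input", "data"),
        "model": os.path.join(prefix, "model"),
        "failure": os.path.join(prefix, "output", "failure"),
    }


def list_channel_files(prefix="/opt/ml", channel="train"):
    """List the files SageMaker placed in one input channel."""
    channel_dir = os.path.join(sagemaker_paths(prefix)["data"], channel)
    return sorted(os.listdir(channel_dir))


# Local dry run: mimic the layout in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    train_dir = os.path.join(tmp, "input", "data", "train")
    os.makedirs(train_dir)
    with open(os.path.join(train_dir, "sample.csv"), "w") as f:
        f.write("x,y\n1,2\n")
    files = list_channel_files(prefix=tmp, channel="train")
    print(files)
```

Inside a real training container you would call these helpers with the default prefix and write the trained model artifacts to the `model` path so that SageMaker can collect them when the job ends.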