# Create Training Job

An Amazon SageMaker *training job* is a compute process that trains an ML model in a containerized environment. In this notebook, you will create a training job with your own custom container on Amazon SageMaker. To read more about training jobs, refer to the [official docs](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html).

The outline of this notebook is:
- create a service execution role for SageMaker to run a training job
- build a lightweight container based on continuumio/miniconda
- test your container locally
- push your container to Elastic Container Registry (ECR)
- upload your training data to an S3 bucket
- create a training job with everything you did above

In [188]:
%%bash
# remove any stale copy first, otherwise wget saves the
# download under a new name such as iam_helpers.py.1
rm -f iam_helpers.py

wget https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/master/sagemaker-fundamentals/execution-role/iam_helpers.py


In [1]:
# setup
!wget https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/master/sagemaker-fundamentals/execution-role/iam_helpers.py

--2021-03-03 19:44:56--  https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/master/sagemaker-fundamentals/execution-role/iam_helpers.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3350 (3.3K) [text/plain]
Saving to: ‘iam_helpers.py.1’


2021-03-03 19:45:02 (70.4 MB/s) - ‘iam_helpers.py.1’ saved [3350/3350]



In [58]:
import boto3
import datetime
import pprint
import os

pp = pprint.PrettyPrinter(indent=1)
iam = boto3.client('iam')

In [50]:
# some helper functions
def current_time():
    # timestamp such as 2021-03-03-22-46-41, safe to embed in resource names
    return datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

def account_id():
    return boto3.client('sts').get_caller_identity()['Account']

## Create an IAM service role

To review IAM roles, see the [notebook on execution role](https://github.com/hsl89/amazon-sagemaker-examples/blob/execution-role/sagemaker-fundamentals/execution-role/execution-role.ipynb).

The service role is intended to be assumed by the SageMaker service. For simplicity, we will give it the `AmazonSageMakerFullAccess` permission, although this notebook does not need such a comprehensive permission. You are highly encouraged to play with the helper functions we provide in `iam_helpers.py` to figure out the minimum permissions needed to run this notebook.
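
The trust relationship behind such a service role can be sketched as a plain policy document. The snippet below is a minimal illustration, not the contents of `iam_helpers.py`; the function name `sagemaker_trust_policy` is hypothetical. It only builds the JSON that allows the SageMaker service principal to assume the role and makes no AWS calls.

```python
import json


def sagemaker_trust_policy():
    """Return a trust policy that lets the SageMaker service assume a role.

    Hypothetical helper for illustration only; iam_helpers.py may
    build its policy differently.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": ["sagemaker.amazonaws.com"]},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# IAM expects the document as a JSON string, for example as the
# AssumeRolePolicyDocument argument of iam.create_role(...)
trust_policy_json = json.dumps(sagemaker_trust_policy())
print(trust_policy_json)
```

The permissions policy (what the role may do) is attached separately from this trust policy (who may assume the role).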


In [123]:
from iam_helpers import create_execution_role, attach_permission

role_name='sm' 
role = create_execution_role(role_name=role_name)['Role']
print(role)

{'Path': '/', 'RoleName': 'sm', 'RoleId': 'AROA2ATYEUMKISEX5EO2V', 'Arn': 'arn:aws:iam::688520471316:role/sm', 'CreateDate': datetime.datetime(2021, 3, 3, 23, 37, 7, tzinfo=tzlocal()), 'AssumeRolePolicyDocument': {'Version': '2012-10-17', 'Statement': [{'Effect': 'Allow', 'Principal': {'AWS': 'arn:aws:iam::688520471316:user/hongshan', 'Service': ['sagemaker.amazonaws.com']}, 'Action': 'sts:AssumeRole'}]}}


In [80]:
# attach AmazonSageMakerFullAccess
res = iam.attach_role_policy(
    RoleName=role['RoleName'],
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess',
)

pp.pprint(res)

{'ResponseMetadata': {'HTTPHeaders': {'content-length': '212',
                                      'content-type': 'text/xml',
                                      'date': 'Wed, 03 Mar 2021 23:37:08 GMT',
                                      'x-amzn-requestid': '8549d495-f371-41bd-ba92-974326a857f9'},
                      'HTTPStatusCode': 200,
                      'RequestId': '8549d495-f371-41bd-ba92-974326a857f9',
                      'RetryAttempts': 0}}


## Build the training environment into a docker image

Before creating a training job on Amazon SageMaker, you need to package the entire runtime environment of your ML project into a docker image and push the image into the Elastic Container Registry (ECR) under your account. 

When triggering a training job, your requested SageMaker instance will pull that image from your ECR and execute it with the data you specified in an S3 URI. 

It is important to know how SageMaker runs your image. For a **training job**, SageMaker runs your image like
```
docker run <image> train
```
i.e. your image needs to have an executable `train` and it is the executable that starts the model training process. You will see later in the notebook how to create it. 

The next natural question is how the image running on a SageMaker instance accesses the data that the model is trained on. SageMaker requires you to reserve the `/opt/ml` directory inside your image so that it can provide training information there. When you trigger a training job, you specify the location of your training data, and the SageMaker instance running your image makes that data available under `/opt/ml/input/data/<channel_name>`.

To read more about how SageMaker uses `/opt/ml` to provide training information, refer to the [official docs](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html).
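
As a rough map of that layout, the sketch below lists the well-known locations under `/opt/ml` for a given set of channels. The helper name `sagemaker_paths` and the dictionary keys are made up for this illustration; the paths themselves follow the documentation linked above. `posixpath` is used so the result looks the same on any operating system.

```python
import posixpath


def sagemaker_paths(channels, base="/opt/ml"):
    """Map the well-known SageMaker training locations under `base`.

    `channels` is a list of channel names such as ["train", "test"].
    Hypothetical helper, for orientation only.
    """
    paths = {
        # hyperparameters passed to the training job, as a JSON file
        "hyperparameters": posixpath.join(base, "input", "config", "hyperparameters.json"),
        # everything written here is uploaded to S3 when training ends
        "model": posixpath.join(base, "model"),
        # a description of the error goes here when training fails
        "failure": posixpath.join(base, "output", "failure"),
    }
    for name in channels:
        # one directory per input channel
        paths["channel:" + name] = posixpath.join(base, "input", "data", name)
    return paths


for key, path in sagemaker_paths(["train", "test"]).items():
    print(key, "->", path)
```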

In [5]:
# View the Dockerfile
!cat container/Dockerfile

FROM continuumio/miniconda:latest 

# SageMaker uses /opt/ml for input / output data 
# throughout the training 
RUN mkdir -p /opt/ml

# Copy the training script into /usr/bin 
# as an executable
COPY train.py /usr/bin/train

# make /usr/bin/train an executable
RUN chmod +x /usr/bin/train



### Explanation

`train.py` in `container/` is the main script for training the model. We copied it into `/usr/bin`, renamed it to `train`, and made it executable in the docker image. This way, when the container is executed as
```
docker run <image> train
```
the script at `/usr/bin/train` (in the container) will run.

Note that this is only one way to run the training logic on SageMaker. As long as the command
```
docker run <image> train
```
triggers your training logic, you can structure the image however you want.

Next, we build the image.

In [6]:
%%sh
# build the image
cd container/

# tag it as example-image:latest
docker build -t example-image:latest .

Sending build context to Docker daemon  18.43kB
Step 1/4 : FROM continuumio/miniconda:latest
 ---> b8ea69b5c41c
Step 2/4 : RUN mkdir -p /opt/ml
 ---> Using cache
 ---> a170cc3fed03
Step 3/4 : COPY train.py /usr/bin/train
 ---> Using cache
 ---> 315ae4eff0a2
Step 4/4 : RUN chmod +x /usr/bin/train
 ---> Using cache
 ---> 0213a62c189a
Successfully built 0213a62c189a
Successfully tagged example-image:latest


Let's inspect what's in the training script `container/train.py`

In [7]:
!pygmentize container/train.py

#!/usr/bin/env python

# A sample script for training an ML model
# It does 2 things
# load csv data in /opt/ml/data

from __future__ import print_function

import os
import pickle

# where SageMaker injects training data inside container
data_dir="/opt/ml/input/data"

# SageMaker treats "/opt/ml/model" as checkpoint directory
# and it will send everything there to S3 output path you 
# specified 
model_dir="/opt/ml/model"

def main():
    print("== Files in train channel ==")
    for f in os.lis

### Explanation

This is a skeleton of typical ML training logic. The main function reads training data from `/opt/ml/input/data/train`. To verify that the container indeed has access to the data, it prints the names of the files in `/opt/ml/input/data/train`. When you run this training logic on SageMaker, you can view the stdout through CloudWatch. We will discuss this in more detail later in this notebook.

When the main function finishes model training, it saves the model checkpoint in `/opt/ml/model`. The SageMaker instance running your container then uploads everything in `/opt/ml/model` to an S3 URI that you will configure later.
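
Because the listing of `train.py` above is cut off, here is a minimal sketch of a script with the same shape. It is not the actual `container/train.py`: the function name `run_training` and the `base_dir` parameter are assumptions added so the logic can be tried against a throwaway directory instead of `/opt/ml`. The "model" is a placeholder `None`.

```python
import os
import pickle
import tempfile


def run_training(base_dir="/opt/ml"):
    """List the files of every input channel, then save a placeholder model."""
    data_dir = os.path.join(base_dir, "input", "data")
    model_dir = os.path.join(base_dir, "model")

    # each sub-directory of input/data is one channel (train, test, ...)
    for channel in sorted(os.listdir(data_dir)):
        print("== Files in {} channel ==".format(channel))
        for name in sorted(os.listdir(os.path.join(data_dir, channel))):
            print(name)

    model = None  # placeholder for a real fitted model

    print("== Saving model checkpoint ==")
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model_path


# local dry run against a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    train_dir = os.path.join(tmp, "input", "data", "train")
    os.makedirs(train_dir)
    open(os.path.join(train_dir, "data_batch1.csv"), "w").close()

    saved = run_training(base_dir=tmp)
    with open(saved, "rb") as f:
        restored = pickle.load(f)
```

Keeping the base directory as a parameter is a design choice for testability; inside the container the default `/opt/ml` applies.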

## Test your container

It is good practice to test your container before sending it to SageMaker, because you can debug and iterate much faster on your local machine.

You are strongly encouraged to read through the section on [How Amazon SageMaker Provides Training Information](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html) in the official docs and figure out a local testing environment that replicates how SageMaker provides training information to your container.

We will use the Docker Python client to execute the container. To see our implementation of the local testing environment, run the following cell.

In [8]:
!pygmentize container/local_test/test_container.py

# This script tests your own container before running
# on SageMaker infrastructure. It mimics how SageMaker provides
# training info to your container and how it executes it. 

import docker
import os

dirname = os.path.dirname(
    os.path.realpath(__file__)
    )

client = docker.from_env()

container = client.containers.run(
    'example-image:latest', 'train', # docker run example-image:latest train 
    volumes={
        # mount ml/ to /opt/ml as volume
        # it's a mechanism for the operating 
        # system to communicate with inside of
        # a docker container
        os.path.join(dirname, 'ml'

### Explanation

Our testing script runs the docker image `example-image:latest` with `train` command, mimicking how SageMaker runs your container for a training job. It mounts the local directory `container/local_test/ml/` to `/opt/ml` in the docker image, mimicking how SageMaker provides the training information to the container. 

The directory `container/local_test/ml` looks like:

In [9]:
!ls -R container/local_test/ml

container/local_test/ml:
input  model  output

container/local_test/ml/input:
data

container/local_test/ml/input/data:
test  train

container/local_test/ml/input/data/test:
test_data_batch1.csv  test_data_batch2.csv  test_data_batch3.csv

container/local_test/ml/input/data/train:
data_batch1.csv  data_batch2.csv  data_batch3.csv

container/local_test/ml/model:
model.pkl

container/local_test/ml/output:
failure

container/local_test/ml/output/failure:


The directories `container/local_test/ml/input/data/train` and `container/local_test/ml/input/data/test` contain some csv files, which will be available in `/opt/ml/input/data/train` and `/opt/ml/input/data/test` as the training and testing data.
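
If you want to build such a directory tree yourself, the sketch below creates an `/opt/ml` look-alike in a temporary location. The helper `make_local_ml_dir`, the file names, and the CSV content are illustrative assumptions. The `volumes` dictionary at the end uses the mapping format that the Docker Python client accepts for bind mounts; it is only constructed here, not passed to Docker.

```python
import os
import tempfile


def make_local_ml_dir(root, channels):
    """Create an /opt/ml look-alike under `root`.

    `channels` maps a channel name to the list of file names to create.
    Hypothetical helper for local testing.
    """
    for sub in ("model", os.path.join("output", "failure")):
        os.makedirs(os.path.join(root, sub), exist_ok=True)

    for channel, files in channels.items():
        channel_dir = os.path.join(root, "input", "data", channel)
        os.makedirs(channel_dir, exist_ok=True)
        for name in files:
            # tiny stand-in CSV so the channel is not empty
            with open(os.path.join(channel_dir, name), "w") as f:
                f.write("col_a,col_b\n1,2\n")
    return root


ml_root = make_local_ml_dir(
    tempfile.mkdtemp(),
    {"train": ["data_batch1.csv"], "test": ["test_data_batch1.csv"]},
)

# bind-mount specification: host directory -> /opt/ml inside the container
volumes = {ml_root: {"bind": "/opt/ml", "mode": "rw"}}
print(sorted(os.listdir(os.path.join(ml_root, "input", "data"))))
```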

In [10]:
# run the test
!python container/local_test/test_container.py

== Files in train channel ==
data_batch2.csv
data_batch3.csv
data_batch1.csv
== Files in the test channel ==
test_data_batch1.csv
test_data_batch2.csv
test_data_batch3.csv
== Saving model checkpoint ==
== training completed ==



Now, you should see a model checkpoint in `container/local_test/ml/model`.

In [11]:
!ls container/local_test/ml/model

model.pkl


## Push your docker image to ECR

You have now built your image and tested it locally. The next step is to push it to the Elastic Container Registry (ECR) under your account. Later, when you trigger a training job, the SageMaker instance you requested will pull that image.

To do so, you need to create a repository in ECR to host it. This operation requires permissions on your ECR resources. You (the principal running this notebook) need permission to create a repository in ECR and to get an authorization token from it. The role you created earlier (which you will later pass to SageMaker) needs permission to get an authorization token and pull the image.

If you have `AdministratorAccess`, you have permission to do everything on your AWS resources. The service role `sm` created at the beginning of this notebook has `AmazonSageMakerFullAccess` attached, a broad policy that includes common actions such as pulling an image from ECR. It is still worth verifying that you and your agent (the service role) have the necessary permissions.

To do so, you can use the `SimulatePrincipalPolicy` API from IAM. It simulates the policies attached to a principal and tells you whether certain actions are allowed. For more detail on `SimulatePrincipalPolicy`, refer to the [API reference](https://docs.aws.amazon.com/IAM/latest/APIReference/API_SimulatePrincipalPolicy.html) in the official docs or its [python equivalent](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/iam.html#IAM.Client.simulate_principal_policy) in the boto3 documentation.

In [12]:
user_arn = boto3.client('sts').get_caller_identity()['Arn'] # you

user_prp = iam.simulate_principal_policy(
    PolicySourceArn=user_arn,
    ActionNames=['ecr:GetAuthorizationToken', 'ecr:CreateRepository']
)
print("== User's Permission Evaluation ==")
pp.pprint(user_prp['EvaluationResults'])

== User's Permission Evaluation ==
[{'EvalActionName': 'ecr:GetAuthorizationToken',
  'EvalDecision': 'allowed',
  'EvalResourceName': '*',
  'MatchedStatements': [{'EndPosition': {'Column': 10, 'Line': 149},
                         'SourcePolicyId': 'AmazonSageMakerFullAccess',
                         'SourcePolicyType': 'IAM Policy',
                         'StartPosition': {'Column': 10, 'Line': 43}},
                        {'EndPosition': {'Column': 6, 'Line': 8},
                         'SourcePolicyId': 'AdministratorAccess',
                         'SourcePolicyType': 'IAM Policy',
                         'StartPosition': {'Column': 17, 'Line': 3}},
                        {'EndPosition': {'Column': 6, 'Line': 11},
                         'SourcePolicyId': 'AmazonEC2ContainerRegistryFullAccess',
                         'SourcePolicyType': 'IAM Policy',
                         'StartPosition': {'Column': 17, 'Line': 3}}],
  'MissingContextValues': []},
 {'EvalActionName

In [13]:
role_arn=role['Arn'] # your agent 

role_prp = iam.simulate_principal_policy(
    PolicySourceArn=role_arn,
    ActionNames=['ecr:GetAuthorizationToken']
)
print("== Service Role Permission Evaluation ==")
pp.pprint(role_prp['EvaluationResults'])

== Service Role Permission Evaluation ==
[{'EvalActionName': 'ecr:GetAuthorizationToken',
  'EvalDecision': 'allowed',
  'EvalResourceName': '*',
  'MatchedStatements': [{'EndPosition': {'Column': 10, 'Line': 149},
                         'SourcePolicyId': 'AmazonSageMakerFullAccess',
                         'SourcePolicyType': 'IAM Policy',
                         'StartPosition': {'Column': 10, 'Line': 43}}],
  'MissingContextValues': []}]


Note: if you do not have enough permissions on the ECR resources under your organization's account, the admin of the account needs to grant you the ECR permissions.
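
The `EvaluationResults` lists printed above are verbose. A small helper can reduce them to the actions that are not allowed. This is a sketch: the function names are made up, and `sample_results` is a hand-written stand-in shaped like the API response, not real output from an account.

```python
def summarize_decisions(evaluation_results):
    """Map each simulated action to its decision string."""
    return {r["EvalActionName"]: r["EvalDecision"] for r in evaluation_results}


def missing_permissions(evaluation_results):
    """Return the actions whose decision is anything other than 'allowed'."""
    decisions = summarize_decisions(evaluation_results)
    return sorted(action for action, decision in decisions.items()
                  if decision != "allowed")


# hand-written sample in the shape of simulate_principal_policy(...)['EvaluationResults']
sample_results = [
    {"EvalActionName": "ecr:GetAuthorizationToken", "EvalDecision": "allowed"},
    {"EvalActionName": "ecr:CreateRepository", "EvalDecision": "implicitDeny"},
]

print(missing_permissions(sample_results))
```

With real data you would pass `user_prp['EvaluationResults']` or `role_prp['EvaluationResults']` instead of the sample list; an empty result means every simulated action was allowed.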

### Create a repository in your ECR

Assuming you have enough ECR permissions, we now create a repository in ECR to host the image `example-image:latest`. It is convenient to give the repository the same name as the image.

In [14]:
ecr = boto3.client('ecr')

try:
    # The repository might already exist
    # in your ECR
    cr_res = ecr.create_repository(
        repositoryName='example-image')
    pp.pprint(cr_res)
except Exception as e:
    print(e)

An error occurred (RepositoryAlreadyExistsException) when calling the CreateRepository operation: The repository with name 'example-image' already exists in the registry with id '688520471316'


If you already have a repository called `example-image`, there are two ways to continue:
* Delete the repository and create a new one with the same name
* Create a repository using a name other than `example-image`

We provide code for the first route below. Run it with caution, because an existing `example-image` repository may be used by your organization in production and merely happen to coincide with our choice of repository name.

In [77]:
"""
If you want to delete the `example-image` repository,
Change this cell from markdown to python, then run it. 
"""
try:
    ecr.delete_repository(
        repositoryName='example-image')
    
    ecr.create_repository(
        repositoryName='example-image')
except Exception as e:
    print(e)

### Tag your image and push to ECR

Now, let's tag the image with the full address of the repository we just created and push it there. Before doing that, you will need to grant docker access to your ECR. Refer to the [registry authentication section](https://docs.aws.amazon.com/AmazonECR/latest/userguide/registry_auth.html) from the ECR documentation for more detail. 
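
The full address of an image in ECR follows a fixed pattern built from the account ID, the region, the repository name, and the tag. The sketch below spells that pattern out in Python. The helper name `ecr_image_uri` and the account ID `123456789012` are placeholders, and the pattern assumes the standard `amazonaws.com` domain.

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Build the full ECR address of an image.

    Assumes the standard commercial AWS domain (amazonaws.com).
    """
    registry = "{}.dkr.ecr.{}.amazonaws.com".format(account, region)
    return "{}/{}:{}".format(registry, repository, tag)


# placeholder account ID, for illustration only
print(ecr_image_uri("123456789012", "us-west-2", "example-image"))
```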

In [21]:
%%bash
account=$(aws sts get-caller-identity --query Account | sed -e 's/^"//' -e 's/"$//')
region=$(aws configure get region)
ecr_account=${account}.dkr.ecr.${region}.amazonaws.com

# Give docker your ECR login password
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $ecr_account

# Fullname of the repo
fullname=$ecr_account/example-image:latest

#echo $fullname
# Tag the image with the fullname
docker tag example-image:latest $fullname

# Push to ECR
docker push $fullname

Login Succeeded
The push refers to repository [688520471316.dkr.ecr.us-west-2.amazonaws.com/example-image]
16c0bf8a256b: Preparing
94149d717f86: Preparing
88674bdc7fd9: Preparing
78db50750faa: Preparing
805309d6b0e2: Preparing
2db44bce66cd: Preparing
2db44bce66cd: Waiting
94149d717f86: Pushed
88674bdc7fd9: Pushed
16c0bf8a256b: Pushed
2db44bce66cd: Pushed
78db50750faa: Pushed
805309d6b0e2: Pushed
latest: digest: sha256:64caa0b2f89c35a24e9648f1644d3efc7634054f2779460ea1064d87bb06c8af size: 1574


https://docs.docker.com/engine/reference/commandline/login/#credentials-store



In [23]:
# Inspect the ECR repository
repo_res = ecr.describe_images(
    repositoryName='example-image')
pp.pprint(repo_res)

{'ResponseMetadata': {'HTTPHeaders': {'content-length': '399',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Wed, 03 Mar 2021 19:56:36 GMT',
                                      'x-amzn-requestid': 'c33de26f-f577-4fac-9566-d5e904225ac9'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'c33de26f-f577-4fac-9566-d5e904225ac9',
                      'RetryAttempts': 0},
 'imageDetails': [{'artifactMediaType': 'application/vnd.docker.container.image.v1+json',
                   'imageDigest': 'sha256:64caa0b2f89c35a24e9648f1644d3efc7634054f2779460ea1064d87bb06c8af',
                   'imageManifestMediaType': 'application/vnd.docker.distribution.manifest.v2+json',
                   'imagePushedAt': datetime.datetime(2021, 3, 3, 19, 50, 30, tzinfo=tzlocal()),
                   'imageSizeInBytes': 150950302,
                   'imageTags': ['latest'],
                   'r

## Prepare training data

SageMaker provides training data to your image through an S3 bucket that your service role has read access to. This means that before triggering a training job, you need to make your training data available in such an S3 bucket.

In this notebook, we will use preloaded data on a public bucket `sagemaker-sample-files`.

In [53]:
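# inspect the bucket
public_bucket = "sagemaker-sample-files"
s3 = boto3.client('s3')
# restrict the listing to the dataset we care about;
# a single list_objects_v2 call returns at most 1000 keys
obj_res = s3.list_objects_v2(
    Bucket=public_bucket,
    Prefix="datasets/tabular/fraud_detection/synthethic_fraud_detection_SA")

# print out object keys compactly
for obj in obj_res['Contents']:
    if '/tabular/fraud_detection/synthethic_fraud_detection_SA' in obj['Key']:
        print(obj['Key'])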

datasets/tabular/fraud_detection/synthethic_fraud_detection_SA/
datasets/tabular/fraud_detection/synthethic_fraud_detection_SA/churn.txt
datasets/tabular/fraud_detection/synthethic_fraud_detection_SA/identity.csv
datasets/tabular/fraud_detection/synthethic_fraud_detection_SA/sampled_identity.csv
datasets/tabular/fraud_detection/synthethic_fraud_detection_SA/sampled_transactions.csv
datasets/tabular/fraud_detection/synthethic_fraud_detection_SA/transaction.csv


Let's pretend the data under `datasets/tabular/fraud_detection/synthethic_fraud_detection_SA` is the data for your ML project.

The public bucket `sagemaker-sample-files` is located in us-east-1. We first need to copy the data to a bucket of yours that is in the same region as the SageMaker instance you will use later.

In [51]:
# create a bucket
def create_tmp_bucket():
    """Create an S3 bucket that is intended to be used for short term"""
    bucket = "{}-{}".format(account_id(), current_time())
    region = boto3.Session().region_name
    s3_client = boto3.client('s3')
    if region == 'us-east-1':
        # us-east-1 is the default location and rejects an
        # explicit LocationConstraint
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={
                'LocationConstraint': region
            })
    return bucket

bucket = create_tmp_bucket()

You created the bucket. By default, all objects in it are private and accessible by you. Later, however, SageMaker needs to read input data from it and write the model artifact to it. Therefore, you need to grant the execution role `sm` read and write access to the bucket.
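
One way to express that grant with a narrow scope is sketched below: read access limited to the input prefix, write access limited to the output prefix, and listing on the bucket itself. The function name `scoped_s3_policy` and the bucket name `my-training-bucket` are hypothetical. Whether this set of actions is enough for your training job is an assumption worth checking, for example with `SimulatePrincipalPolicy`.

```python
import json


def scoped_s3_policy(bucket, input_prefix, output_prefix):
    """Build a policy document scoped to one bucket and two prefixes."""
    bucket_arn = "arn:aws:s3:::{}".format(bucket)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # listing is a bucket-level action, so the resource is the bucket
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # reading is an object-level action on the input prefix
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": "{}/{}*".format(bucket_arn, input_prefix),
            },
            {
                # writing is an object-level action on the output prefix
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": "{}/{}*".format(bucket_arn, output_prefix),
            },
        ],
    }


policy_doc = scoped_s3_policy("my-training-bucket", "input_data/", "example/output/")
print(json.dumps(policy_doc, indent=2))
```

Note the ARN format: S3 ARNs leave the region and account fields empty, so a bucket is `arn:aws:s3:::<bucket>` and its objects are `arn:aws:s3:::<bucket>/<key>`.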

In [None]:

import json

# read-only access to the objects in the bucket created above
get = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*"  # allow the role to perform get-related actions, i.e. read access
            ],
            "Resource": [
                "arn:aws:s3:::{}/*".format(bucket)
            ]
        }
    ]
}

# create a new policy
policy_name = 's3get'
policy = iam.create_policy(
    PolicyName=policy_name,
    PolicyDocument=json.dumps(get))['Policy']

# attach the policy to the role
res = iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn=policy['Arn']
    )



In [127]:
bucket_arn = "arn:aws:s3:::{}/*".format(bucket)

get_put = {
    "Version":"2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Put*"
            ],
            "Resource": bucket_arn
        }
    ]
}

pm_res = attach_permission(
    role_name=role['RoleName'],
    policy_name='get_put',
    policy_doc=get_put)

pp.pprint(pm_res)

{'ResponseMetadata': {'HTTPHeaders': {'content-length': '212',
                                      'content-type': 'text/xml',
                                      'date': 'Thu, 04 Mar 2021 02:13:07 GMT',
                                      'x-amzn-requestid': 'b517f83e-13a7-4426-bd2e-defefdcb9272'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'b517f83e-13a7-4426-bd2e-defefdcb9272',
                      'RetryAttempts': 0}}


In [128]:
input_prefix = 'input_data/'

In [129]:
# copy from sagemaker-sample-files to {bucket}
s3 = boto3.client('s3')

# copy remote csv files to local
files = []
data_dir = '/tmp'
for obj in obj_res['Contents']:
    if '/tabular/fraud_detection/synthethic_fraud_detection_SA' in obj['Key']:
        key = obj['Key']
        if key.endswith('.csv'):
            filename=key.split('/')[-1]
            files.append(filename)
            with open(os.path.join(data_dir, filename), 'wb') as f:
                s3.download_fileobj(public_bucket, key, f)

# upload from local to the bucket you just created
for fname in files:
    with open(os.path.join(data_dir, fname), 'rb') as f:
        key = input_prefix + fname
        s3.upload_fileobj(f, bucket, key)

In [130]:
# inspect your bucket
obj_res = s3.list_objects_v2(
    Bucket=bucket)

for obj in obj_res['Contents']:
    print(obj['Key'])

input_data/identity.csv
input_data/sampled_identity.csv
input_data/sampled_transactions.csv
input_data/transaction.csv


## Prepare an S3 URI for saving model artifact

After your image is done with model training, it needs to write the trained model artifact into `/opt/ml/model`. This is the directory where SageMaker looks for the trained model artifact and uploads it to an S3 URI that you will configure later. Naturally, the execution role `sm` needs write permission to this S3 URI.

Refer to the section on [How Amazon SageMaker Processes Training Output](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html) in the official docs for more detail. 
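
For orientation, SageMaker packs the contents of `/opt/ml/model` into a `model.tar.gz` archive under the output path, in a folder named after the training job. The sketch below derives that location; the helper name `model_artifact_uri` and the bucket `my-bucket` are placeholders. The authoritative value is reported in the `ModelArtifacts` field of the `DescribeTrainingJob` response.

```python
def model_artifact_uri(output_path, training_job_name):
    """Derive the S3 URI where the packed model artifact is expected."""
    return "{}/{}/output/model.tar.gz".format(
        output_path.rstrip("/"), training_job_name)


# placeholder bucket and job name, for illustration only
print(model_artifact_uri(
    "s3://my-bucket/example/output/",
    "example-training-job-2021-03-04-02-16-41"))
```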

## Put everything together

Now, you have everything you need to create a training job. Let's review what you have done. You have
* created an execution role for SageMaker service
* built and tested a docker image that includes the runtime and logic of your model training
* made the image accessible to SageMaker by hosting it on ECR
* made the training data available to SageMaker by hosting it on S3
* pointed SageMaker to an S3 bucket to write output 

Let's pull the trigger and create a training job. We will invoke the `CreateTrainingJob` API via boto3. You are strongly encouraged to read through the [description of the API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job) before moving on.
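
Each entry of the `InputDataConfig` argument describes one channel, and the channel name decides the directory under `/opt/ml/input/data/` where the data appears. Since channel entries differ only in name and S3 URI, a small builder keeps them consistent. This is a sketch; `make_channel` and the bucket `my-bucket` are hypothetical.

```python
def make_channel(name, s3_uri):
    """Build one InputDataConfig entry for data stored under an S3 prefix."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                # treat every object under the prefix as input
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                # copy the full dataset to every training instance
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# placeholder bucket; both channels point at the same prefix
channels = [make_channel(name, "s3://my-bucket/input_data/")
            for name in ("train", "test")]
print([c["ChannelName"] for c in channels])
```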

In [131]:
# set up

sm_boto3 = boto3.client('sagemaker')

# name training job
training_job_name = 'example-training-job-{}'.format(current_time())

# input data prefix
data_path = "s3://" + bucket + '/' + input_prefix

# location that SageMaker saves the model artifacts
output_prefix = 'example/output/'
output_path = "s3://" + bucket + '/' + output_prefix

# ECR URI of your image
region = boto3.Session().region_name
account = account_id()
image_uri = "{}.dkr.ecr.{}.amazonaws.com/example-image:latest".format(account, region)

algorithm_specification = {
    'TrainingImage': image_uri,
    'TrainingInputMode': 'File',
}


input_data_config = [
    {
        'ChannelName': 'train',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': data_path,
                'S3DataDistributionType': 'FullyReplicated',
            }
        }
    },
    {
        'ChannelName': 'test',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': data_path,
                'S3DataDistributionType': 'FullyReplicated',
            }
        }
    }
]


output_data_config = {
    'S3OutputPath': output_path
}

resource_config = {
    'InstanceType': 'ml.m5.large',
    'InstanceCount':1,
    'VolumeSizeInGB':10
}

stopping_condition={
    'MaxRuntimeInSeconds':120,
    #'MaxWaitTimeInSeconds': 123
}

enable_network_isolation=False

In [132]:
ct_res = sm_boto3.create_training_job(
    TrainingJobName=training_job_name,
    AlgorithmSpecification=algorithm_specification,
    RoleArn=role_arn,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    ResourceConfig=resource_config,
    StoppingCondition=stopping_condition,
    EnableNetworkIsolation=enable_network_isolation,
    EnableManagedSpotTraining=False,
)

In [109]:
# View the status of the training job
tj_state = sm_boto3.describe_training_job(
    TrainingJobName=training_job_name)
pp.pprint(tj_state.keys())

dict_keys(['TrainingJobName', 'TrainingJobArn', 'ModelArtifacts', 'TrainingJobStatus', 'SecondaryStatus', 'AlgorithmSpecification', 'RoleArn', 'InputDataConfig', 'OutputDataConfig', 'ResourceConfig', 'StoppingCondition', 'CreationTime', 'TrainingStartTime', 'TrainingEndTime', 'LastModifiedTime', 'SecondaryStatusTransitions', 'EnableNetworkIsolation', 'EnableInterContainerTrafficEncryption', 'EnableManagedSpotTraining', 'TrainingTimeInSeconds', 'BillableTimeInSeconds', 'ProfilingStatus', 'ResponseMetadata'])


In [134]:
import time

# check training job status every 30 seconds
stopped = False
while not stopped:
    tj_state = sm_boto3.describe_training_job(
        TrainingJobName=training_job_name)
    if tj_state['TrainingJobStatus'] in ['Completed', 'Stopped', 'Failed']:
        stopped=True
    else:
        print("Training in progress")
        time.sleep(30)

if tj_state['TrainingJobStatus'] == 'Failed':
    print("Training job failed ")
    # the response reports the cause under 'FailureReason'
    print("Failed Reason: {}".format(tj_state['FailureReason']))
else:
    # prints "completed" or "stopped"
    print("Training job {}".format(tj_state['TrainingJobStatus'].lower()))

Training job completed
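
The polling logic can also be written as a reusable function with an upper bound on waiting time. The sketch below takes the "describe" call as a parameter so that it can be exercised without AWS; `wait_for_training_job`, its parameters, and the fake states are assumptions made for this example.

```python
import time

TERMINAL_STATES = ("Completed", "Stopped", "Failed")


def wait_for_training_job(describe, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call `describe()` until the job reaches a terminal state.

    `describe` is any zero-argument callable returning a dict with a
    'TrainingJobStatus' key. Raises TimeoutError after `max_polls` polls.
    """
    for _ in range(max_polls):
        state = describe()
        if state["TrainingJobStatus"] in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("training job did not reach a terminal state")


# offline demonstration with a fake describe function and no real sleeping
fake_states = iter([
    {"TrainingJobStatus": "InProgress"},
    {"TrainingJobStatus": "InProgress"},
    {"TrainingJobStatus": "Completed"},
])
final_state = wait_for_training_job(lambda: next(fake_states), sleep=lambda s: None)
print(final_state["TrainingJobStatus"])
```

Against the real service you would pass `lambda: sm_boto3.describe_training_job(TrainingJobName=training_job_name)` as `describe`. boto3 also provides a built-in waiter, `sm_boto3.get_waiter('training_job_completed_or_stopped')`, which covers the same need.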


## Inspect the trained model artifact

In [143]:
print("== Output config:")
print(tj_state['OutputDataConfig'])

print()

print("== Model artifact:")
pp.pprint(s3.list_objects_v2(Bucket=bucket, Prefix=output_prefix))

== Output config:
{'KmsKeyId': '', 'S3OutputPath': 's3://688520471316-2021-03-03-22-46-41/example/output/'}

== Model artifact:
{'Contents': [{'ETag': '"cea072960b7b3a427bebf56f5dca5071"',
               'Key': 'example/output/example-training-job-2021-03-04-02-16-41/output/model.tar.gz',
               'LastModified': datetime.datetime(2021, 3, 4, 2, 20, 17, tzinfo=tzlocal()),
               'Size': 120,
               'StorageClass': 'STANDARD'}],
 'EncodingType': 'url',
 'IsTruncated': False,
 'KeyCount': 1,
 'MaxKeys': 1000,
 'Name': '688520471316-2021-03-03-22-46-41',
 'Prefix': 'example/output/',
 'ResponseMetadata': {'HTTPHeaders': {'content-type': 'application/xml',
                                      'date': 'Thu, 04 Mar 2021 02:31:07 GMT',
                                      'server': 'AmazonS3',
                                      'transfer-encoding': 'chunked',
                                      'x-amz-bucket-region': 'us-west-2',
                                  

In [153]:
logs = boto3.client('logs')

log_res= logs.describe_log_streams(
    logGroupName='/aws/sagemaker/TrainingJobs',
    logStreamNamePrefix=training_job_name)

for log_stream in log_res['logStreams']:
    # fetch the log events of this stream
    log_event = logs.get_log_events(
        logGroupName='/aws/sagemaker/TrainingJobs',
        logStreamName=log_stream['logStreamName'])
    
    # print out the message of each log event
    for ev in log_event['events']:
        print(ev['message'])

== Files in train channel ==
transaction.csv
sampled_identity.csv
sampled_transactions.csv
identity.csv
== Files in the test channel ==
transaction.csv
sampled_identity.csv
sampled_transactions.csv
identity.csv
== Saving model checkpoint ==
== training completed ==


## Conclusion

Congratulations! You now understand the basics of a training job on SageMaker. After this long notebook, the trained model artifact you get is just a pickled `None` object. But keep in mind that you can follow exactly the same process to train a state-of-the-art model with billions of parameters, and the compute cost is proportional to how long you train your model.
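
To check what a model artifact contains, you can unpack the archive and unpickle the model. The sketch below does this entirely in memory on a stand-in archive built on the spot, so it runs without S3. The helper `load_pickled_model` and the member name `model.pkl` at the archive root are assumptions based on the file saved during the local test. Only unpickle archives you trust.

```python
import io
import pickle
import tarfile


def load_pickled_model(tar_bytes, member_name="model.pkl"):
    """Read one pickled object out of a gzip-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as tar:
        member = tar.extractfile(member_name)
        return pickle.load(member)


# build a stand-in model.tar.gz in memory that holds a pickled None
payload = pickle.dumps(None)
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="model.pkl")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

restored_model = load_pickled_model(buffer.getvalue())
print(restored_model)
```

For the real artifact, you would first download `model.tar.gz` from the output location and pass its bytes to the function.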

## Clean up resources

In [None]:
# delete the ECR repo
del_repo_res = ecr.delete_repository(
    repositoryName='example-image',
    force=True)

pp.pprint(del_repo_res)

In [166]:
# delete the S3 bucket
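def delete_bucket_force(bucket_name):
    # an empty bucket has no 'Contents' key in the response;
    # a single call lists at most 1000 objects, enough for this notebook
    objs = s3.list_objects_v2(Bucket=bucket_name).get('Contents', [])
    for obj in objs:
        s3.delete_object(
            Bucket=bucket_name,
            Key=obj['Key'])
    
    return s3.delete_bucket(Bucket=bucket_name)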

del_buc_res = delete_bucket_force(bucket)

pp.pprint(del_buc_res)

{'ResponseMetadata': {'HTTPHeaders': {'date': 'Thu, 04 Mar 2021 03:09:12 GMT',
                                      'server': 'AmazonS3',
                                      'x-amz-id-2': 'Trw1jLE9sIZbTSBm4VQD3Gpio1DIBdiDrJ5y4IvynC5dnu+0VSUoNYQ5TLJpowZYMFlwxUQlhxM=',
                                      'x-amz-request-id': '6CR3EN9H0WNG6SK2'},
                      'HTTPStatusCode': 204,
                      'HostId': 'Trw1jLE9sIZbTSBm4VQD3Gpio1DIBdiDrJ5y4IvynC5dnu+0VSUoNYQ5TLJpowZYMFlwxUQlhxM=',
                      'RequestId': '6CR3EN9H0WNG6SK2',
                      'RetryAttempts': 0}}
