# Create Training Job (Hyperparameter Injection)

In this notebook, we discuss more complicated setups for the `CreateTrainingJob` API. It assumes you are comfortable with the setups discussed in [the notebook on basics of `CreateTrainingJob`](https://github.com/hsl89/amazon-sagemaker-examples/blob/sagemaker-fundamentals/sagemaker-fundamentals/create-training-job/create_training_job.ipynb).



## What is Hyperparameter Injection?

With hyperparameter injection, you don't need to hard-code the hyperparameters of your ML training in the training image. Instead, you pass your hyperparameters through the `CreateTrainingJob` API and SageMaker makes them available to your training container. This way you can experiment with a list of hyperparameters for your training job without rebuilding the image for each experiment. More importantly, this is the mechanism used by the `CreateHyperParameterTuningJob` API to (you guessed right) create many training jobs to search for the best hyperparameters. We will discuss `CreateHyperParameterTuningJob` in a different notebook.
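To make the "list of hyperparameters" idea concrete, here is a minimal sketch that expands a small search space into one dictionary per experiment. The helper name `hyperparameter_grid` and the example search space are made up for illustration. Values are converted to strings because, as discussed later in this notebook, the API only accepts string keys and string values.

```python
import itertools


def hyperparameter_grid(search_space):
    """Yield one flat string-to-string dict per combination (hypothetical helper).

    `search_space` maps each hyperparameter name to a list of candidate values.
    """
    names = sorted(search_space)
    for values in itertools.product(*(search_space[name] for name in names)):
        # the HyperParameters argument expects string keys and string values
        yield {name: str(value) for name, value in zip(names, values)}


# Example search space; the names and values are made up for illustration
search_space = {"max_depth": [4, 6], "learning_rate": [0.01, 0.1]}
configs = list(hyperparameter_grid(search_space))
```

Each element of `configs` could then be passed to a separate training job that reuses the same image.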

If you remember from [the notebook on basics of `CreateTrainingJob`](https://github.com/hsl89/amazon-sagemaker-examples/blob/sagemaker-fundamentals/sagemaker-fundamentals/create-training-job/create_training_job.ipynb), SageMaker reserves the `/opt/ml` directory "to talk to your container", i.e. to provide training information to your training job and retrieve output from it.

You will pass the hyperparameters of your training job as a dictionary to the `create_training_job` method of the boto3 SageMaker client, and they will become available in `/opt/ml/input/config/hyperparameters.json`. See [reference in the official docs](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html).
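As a rough illustration of the container side, the sketch below reads the injected file from the path mentioned above. The function name `load_hyperparameters` and the empty-dictionary fallback are assumptions made for this example; it is not the training script used later in this notebook.

```python
import json
import os

# Path where SageMaker places the hyperparameters inside the container
HYPERPARAMETERS_PATH = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters(path=HYPERPARAMETERS_PATH):
    """Return the injected hyperparameters as a dict (hypothetical helper).

    Returns an empty dict if the file does not exist, which is an
    assumption made here so the script can also run outside SageMaker.
    """
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)
```

A training script could call `load_hyperparameters()` once at start-up and read its settings from the returned dictionary.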

### Setup

You will build a training image and push it to ECR like in [the notebook on basics of `CreateTrainingJob`](https://github.com/hsl89/amazon-sagemaker-examples/blob/sagemaker-fundamentals/sagemaker-fundamentals/create-training-job/create_training_job.ipynb). The only difference is that the Python script for running the training will print out the hyperparameters in `/opt/ml/input/config/hyperparameters.json` to confirm that the container does have access to the hyperparameters you passed to the `CreateTrainingJob` API.

This training job does not require any data. Therefore, you don't need to configure the `InputDataConfig` parameter for `CreateTrainingJob`. However, SageMaker always needs an S3 URI to save your model artifact, i.e. you still need to configure the `OutputDataConfig` parameter.

In [11]:
import boto3 # your gateway to AWS APIs
import datetime
import pprint
import os
import time
import re

pp = pprint.PrettyPrinter(indent=1)
iam = boto3.client('iam')

In [16]:
# some helper functions
def current_time():
    # timestamp such as 2021-03-24-18-39-57, safe to use in resource names
    return datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

def account_id():
    return boto3.client('sts').get_caller_identity()['Account']

### Set up a service role for SageMaker

Review [notebook on execution role](https://github.com/hsl89/amazon-sagemaker-examples/blob/execution-role/sagemaker-fundamentals/execution-role/execution-role.ipynb) for step-by-step instructions on how to create an IAM Role.

The service role is intended to be assumed by the SageMaker service to procure resources in your AWS account on your behalf. 

1. If you are running this notebook on SageMaker infrastructure like Notebook Instances or Studio, then we will use the role you used to spin up those resources.

2. If you are running this notebook on an EC2 instance, then we will create a service role and attach `AmazonSageMakerFullAccess` to it. If you already have a SageMaker service role, you can paste its `role_arn` here.

First, let's get some helper functions for creating an execution role. We discussed those functions in the [notebook on execution role](https://github.com/hsl89/amazon-sagemaker-examples/blob/execution-role/sagemaker-fundamentals/execution-role/execution-role.ipynb).

In [3]:
%%bash
file=$(ls . | grep iam_helpers.py)

if [ -f "$file" ]
then
    rm $file
fi

wget https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/sagemaker-fundamentals/sagemaker-fundamentals/execution-role/iam_helpers.py


--2021-03-23 23:35:37--  https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/sagemaker-fundamentals/sagemaker-fundamentals/execution-role/iam_helpers.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3659 (3.6K) [text/plain]
Saving to: ‘iam_helpers.py’

     0K ...                                                   100% 63.4M=0s

2021-03-23 23:35:37 (63.4 MB/s) - ‘iam_helpers.py’ saved [3659/3659]



In [30]:
# set up service role for SageMaker
from iam_helpers import create_execution_role

sts = boto3.client('sts')
caller = sts.get_caller_identity()

if ':user/' in caller['Arn']: # as IAM user
    # either paste in the role_arn of an existing role, or create a new
    # one and attach AmazonSageMakerFullAccess
    role_name = 'sm'
    role_arn = create_execution_role(role_name=role_name)['Role']['Arn']
    
    # attach the permission to the role
    # skip it if you want to use a SageMaker service role that 
    # already has AmazonSageMakerFullAccess
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
    )
elif 'assumed-role' in caller['Arn']: # on SageMaker infra
    assumed_role = caller['Arn']
    role_arn = re.sub(r"^(.+)sts::(\d+):assumed-role/(.+?)/.*$", r"\1iam::\2:role/\3", assumed_role)
else:
    print("I assume you are on an EC2 instance launched with an IAM role")
    role_arn = caller['Arn']
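The regular expression used in the `assumed-role` branch can be checked in isolation. The sketch below wraps the same pattern in a function and applies it to a made-up ARN; the function name, account ID, role name and session name are all hypothetical.

```python
import re


def assumed_role_to_role_arn(assumed_role_arn):
    """Rewrite an STS assumed-role ARN as an IAM role ARN (hypothetical helper)."""
    return re.sub(
        r"^(.+)sts::(\d+):assumed-role/(.+?)/.*$",
        r"\1iam::\2:role/\3",
        assumed_role_arn,
    )


# Made-up account ID, role name and session name, for illustration only
example = "arn:aws:sts::123456789012:assumed-role/MySageMakerRole/SageMaker"
print(assumed_role_to_role_arn(example))
# prints arn:aws:iam::123456789012:role/MySageMakerRole
```

Note that an assumed-role ARN omits any IAM path the role was created under (for example `service-role/`), so for such roles the rewritten ARN may not match the real one. Looking the role up by name with `iam.get_role` is likely a safer option in that case.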

## Build a training image and push to ECR

You will build a training image here, like in [the notebook on basics of `CreateTrainingJob`](https://github.com/hsl89/amazon-sagemaker-examples/blob/sagemaker-fundamentals/sagemaker-fundamentals/create-training-job/create_training_job.ipynb).

In [7]:
# View the Dockerfile
!cat container_hyperparameter_injection/Dockerfile

FROM continuumio/miniconda:latest 

# SageMaker uses /opt/ml for input / output data 
# throughout the training 
RUN mkdir -p /opt/ml

# Copy the training script into /usr/bin 
# as an executable
COPY train.py /usr/bin/train

# make /usr/bin/train an executable
RUN chmod +x /usr/bin/train




In [6]:
# View the "training algorithm"
!pygmentize container_hyperparameter_injection/train.py

#!/usr/bin/env python

# A sample script for training an ML model
# It does 2 things
# load csv data in /opt/ml/data

from __future__ import print_function

import os
import json
import pickle


# where SageMaker injects training info inside container
input_dir="/opt/ml/input/"

# SageMaker treats "/opt/ml/model" as checkpoint directory
# and it will send everything there to S3 output path you 
# specified 
model_dir="/opt/ml/model"


def main():
    
    print("== Loading hyperparamters ===")
    with

The algorithm simply prints out the hyperparameters in the JSON file `/opt/ml/input/config/hyperparameters.json` to verify that it can indeed access them.

In [43]:
%%sh
# build the image
cd container_hyperparameter_injection/

# tag it as example-image:latest
docker build -t example-image:latest .

Sending build context to Docker daemon  14.34kB
Step 1/4 : FROM continuumio/miniconda:latest
 ---> b8ea69b5c41c
Step 2/4 : RUN mkdir -p /opt/ml
 ---> Using cache
 ---> a170cc3fed03
Step 3/4 : COPY train.py /usr/bin/train
 ---> 5bc823c42a18
Step 4/4 : RUN chmod +x /usr/bin/train
 ---> Running in 52bd1de8fba4
Removing intermediate container 52bd1de8fba4
 ---> 888c36fefde2
Successfully built 888c36fefde2
Successfully tagged example-image:latest


## Test the container locally
Before pushing the image to ECR, it is always a good practice to test it locally. You need to create a `hyperparameters.json` file and make it available to the container at `/opt/ml/input/config/hyperparameters.json`. To do so, you can mount a local directory to `/opt/ml` as a docker volume like in [the notebook on basics of `CreateTrainingJob`](https://github.com/hsl89/amazon-sagemaker-examples/blob/sagemaker-fundamentals/sagemaker-fundamentals/create-training-job/create_training_job.ipynb).

Check out the test we provide:

In [20]:
!pygmentize container_hyperparameter_injection/local_test/test_container.py

# This script tests your own container before running
# on SageMaker infrastructure. It mimics how SageMaker provides
# training info to your container and how it executes it. 

import docker
import os

dirname = os.path.dirname(
    os.path.realpath(__file__)
    )

client = docker.from_env()

container = client.containers.run(
    'example-image:latest', 'train', # docker run example-image:latest train 
    volumes={
        # mount ml/ to /opt/ml as volume
        # it's a mechanism for the operating 
        # system to communicate with inside of
        # a docker container
        os.path.join(dirname, 'ml') : {'

We made some realistic-looking hyperparameters in `container_hyperparameter_injection/local_test/ml/input/config/hyperparameters.json` and mounted `container_hyperparameter_injection/local_test/ml` to `/opt/ml` as a docker volume, so that the container can access the hyperparameters at `/opt/ml/input/config/hyperparameters.json`.
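If you want to build such a directory yourself, the layout can be sketched as follows. The helper name `prepare_local_ml_dir`, the throwaway directory and the example values are assumptions for illustration; the resulting `ml` directory is what you would mount to `/opt/ml`.

```python
import json
import os
import tempfile


def prepare_local_ml_dir(root, hyperparameters):
    """Create root/input/config/hyperparameters.json and root/model (hypothetical helper)."""
    config_dir = os.path.join(root, "input", "config")
    os.makedirs(config_dir, exist_ok=True)
    # the training script writes its checkpoint to /opt/ml/model
    os.makedirs(os.path.join(root, "model"), exist_ok=True)

    path = os.path.join(config_dir, "hyperparameters.json")
    with open(path, "w") as f:
        json.dump(hyperparameters, f, indent=2)
    return path


# Build the layout in a throwaway directory; the values are made up
ml_root = os.path.join(tempfile.mkdtemp(), "ml")
hp_file = prepare_local_ml_dir(ml_root, {"epochs": "100", "batch_size": "128"})
print(hp_file)
```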

Note that the JSON file `container_hyperparameter_injection/local_test/ml/input/config/hyperparameters.json` is not nested and the values are all strings, even if they are meant to be other data types. This is because when calling `CreateTrainingJob` with hyperparameter injection, the hyperparameters can only be a dictionary of key-value pairs, and both key and value need to be strings. See [API reference](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html).
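Because every value arrives as a string, a training script usually has to convert the values back to the types it needs. A minimal sketch, assuming a hypothetical helper `cast_hyperparameters` and made-up hyperparameter names:

```python
def cast_hyperparameters(raw, types):
    """Convert string hyperparameter values to the given types (hypothetical helper).

    `raw` is the string-to-string dict read from hyperparameters.json.
    `types` maps a name to a callable such as int or float; names that
    are not listed keep their string value.
    """
    return {name: types.get(name, str)(value) for name, value in raw.items()}


raw = {"epochs": "100", "learning_rate": "0.0001", "activation_fn": "sigmoid"}
typed = cast_hyperparameters(raw, {"epochs": int, "learning_rate": float})
print(typed)
# prints {'epochs': 100, 'learning_rate': 0.0001, 'activation_fn': 'sigmoid'}
```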

In [44]:
# run the test
!python container_hyperparameter_injection/local_test/test_container.py

== Loading hyperparamters ===
== Hyperparamters: ==
{u'batch_size': 128,
 u'epochs': 100,
 u'learning_rate': 0.0001,
 u'optional': {u'activation_fn': u'sigmoid',
               u'drop_out': 0.3,
               u'grad_clip': 0.01},
 u'weight_decay': 0.01}
== Saving model checkpoint ==
== training completed ==



Now that you have tested your container, you can push it to ECR and be confident that it will work for a SageMaker training job.

In [23]:
# create a repo in ECR called example-image
ecr = boto3.client('ecr')

try:
    # The repository might already exist
    # in your ECR
    cr_res = ecr.create_repository(
        repositoryName='example-image')
    pp.pprint(cr_res)
except Exception as e:
    print(e)

An error occurred (RepositoryAlreadyExistsException) when calling the CreateRepository operation: The repository with name 'example-image' already exists in the registry with id '688520471316'


In [45]:
%%bash
account=$(aws sts get-caller-identity --query Account | sed -e 's/^"//' -e 's/"$//')
region=$(aws configure get region)
ecr_account=${account}.dkr.ecr.${region}.amazonaws.com

# Give docker your ECR login password
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $ecr_account

# Fullname of the repo
fullname=$ecr_account/example-image:latest

#echo $fullname
# Tag the image with the fullname
docker tag example-image:latest $fullname

# Push to ECR
docker push $fullname

Login Succeeded
The push refers to repository [688520471316.dkr.ecr.us-west-2.amazonaws.com/example-image]
c48ea44f521c: Preparing
42c8a091b042: Preparing
88674bdc7fd9: Preparing
78db50750faa: Preparing
805309d6b0e2: Preparing
2db44bce66cd: Preparing
2db44bce66cd: Waiting
78db50750faa: Layer already exists
805309d6b0e2: Layer already exists
88674bdc7fd9: Layer already exists
2db44bce66cd: Layer already exists
42c8a091b042: Pushed
c48ea44f521c: Pushed
latest: digest: sha256:a4fc409a81a13c7f6c913a1a6d7a5fb3066c1dd2b9da25c84db5b82507441e2e size: 1574





In [26]:
# Inspect the ECR repository
repo_res = ecr.describe_images(
    repositoryName='example-image')
pp.pprint(repo_res)

{'ResponseMetadata': {'HTTPHeaders': {'content-length': '2197',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Wed, 24 Mar 2021 19:03:24 GMT',
                                      'x-amzn-requestid': '3f2d5c18-27a0-4f11-ad79-e81d5eb82df3'},
                      'HTTPStatusCode': 200,
                      'RequestId': '3f2d5c18-27a0-4f11-ad79-e81d5eb82df3',
                      'RetryAttempts': 0},
 'imageDetails': [{'artifactMediaType': 'application/vnd.docker.container.image.v1+json',
                   'imageDigest': 'sha256:9d58547ed7516607ad53e13ca7b41e8a90138b95f994bf5eafee6dbe95c34739',
                   'imageManifestMediaType': 'application/vnd.docker.distribution.manifest.v2+json',
                   'imagePushedAt': datetime.datetime(2021, 3, 16, 23, 49, 3, tzinfo=tzlocal()),
                   'imageSizeInBytes': 1023214837,
                   'registryId': '688520471316',
              

## Prepare an S3 bucket for model artifact
Even though you are not training a real model, SageMaker still requires you to give it an S3 URI to upload the model artifact in `/opt/ml/model`. So let's create a temporary bucket for this.

In [17]:
# create a bucket
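def create_tmp_bucket():
    """Create an S3 bucket that is intended to be used for short term"""
    bucket = f"sagemaker-{current_time()}" # accessible by SageMaker
    region = boto3.Session().region_name
    s3_client = boto3.client('s3')
    if region == 'us-east-1':
        # us-east-1 is the default location and must not be passed
        # as a LocationConstraint
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={
                'LocationConstraint': region
            })
    return bucket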

bucket = create_tmp_bucket()

## Put everything together

Now you have everything you need to create a training job that can ingest hyperparameters from the boto3 call. Let's review what you have done. You have
* created an execution role for SageMaker service
* built and tested a docker image that includes the runtime and logic of your model training
* made the image accessible to SageMaker by hosting it on ECR
* created an S3 bucket for saving model artifact

In [53]:
# set up
import json

sm_boto3 = boto3.client('sagemaker')

# name training job
training_job_name = 'example-training-job-{}'.format(current_time())



# location that SageMaker saves the model artifacts
output_prefix = 'example/output/'
output_path = "s3://" + bucket + '/' + output_prefix

# ECR URI of your image
region = boto3.Session().region_name
account = account_id()
image_uri = "{}.dkr.ecr.{}.amazonaws.com/example-image:latest".format(account, region)

algorithm_specification = {
    'TrainingImage': image_uri,
    'TrainingInputMode': 'File',
}

# inject the following hyperparameters to your container
# you can define `hyperparameters` in whatever way
# you want as long as it can be parsed to a json file (not nested)
# and both key and value are strings

hyperparameters = {
    "num_trees" : "15",
    "max_depth" : "4",
    "n_iter": "30",
    "your_parameter_1": "1",
    "your_parameter_2" : "0.01"
}

output_data_config = {
    'S3OutputPath': output_path
}

resource_config = {
    'InstanceType': 'ml.m5.large',
    'InstanceCount':1,
    'VolumeSizeInGB':10
}

stopping_condition={
    'MaxRuntimeInSeconds':120,
}

enable_network_isolation=False

In [54]:
ct_res = sm_boto3.create_training_job(
    TrainingJobName=training_job_name,
    AlgorithmSpecification=algorithm_specification,
    HyperParameters=hyperparameters, # look here
    RoleArn=role_arn,
    OutputDataConfig=output_data_config,
    ResourceConfig=resource_config,
    StoppingCondition=stopping_condition,
    EnableNetworkIsolation=enable_network_isolation,
    EnableManagedSpotTraining=False,
)

In [56]:
# check training job status every 30 seconds
stopped = False
while not stopped:
    tj_state = sm_boto3.describe_training_job(
        TrainingJobName=training_job_name)
    if tj_state['TrainingJobStatus'] in ['Completed', 'Stopped', 'Failed']:
        stopped=True
    else:
        print("Training in progress")
        time.sleep(30)

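if tj_state['TrainingJobStatus'] == 'Failed':
    print("Training job failed")
    print("Failure reason: {}".format(tj_state['FailureReason']))
else:
    # prints "completed" or "stopped" depending on the final status
    print("Training job {}".format(tj_state['TrainingJobStatus'].lower()))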

Training in progress
Training in progress
Training in progress
Training in progress
Training in progress
Training in progress
Training job completed
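The polling loop above runs until the job reaches a terminal status. A variant with an upper bound on the waiting time can be sketched as follows. The helper `wait_for_training_job` and its default intervals are assumptions for illustration; it takes the describe call as an argument so it can be tried without an AWS account.

```python
import time


def wait_for_training_job(describe, job_name, poll_seconds=30, timeout_seconds=3600):
    """Poll until the job reaches a terminal status or the timeout expires.

    `describe` is any callable that takes TrainingJobName=... and returns a
    dict with a 'TrainingJobStatus' key, e.g. the describe_training_job
    method of a boto3 SageMaker client. This helper is hypothetical.
    """
    terminal = {"Completed", "Stopped", "Failed"}
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = describe(TrainingJobName=job_name)
        if state["TrainingJobStatus"] in terminal:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("Training job {} did not finish in time".format(job_name))
        time.sleep(poll_seconds)
```

With the names used in this notebook, the call would look like `tj_state = wait_for_training_job(sm_boto3.describe_training_job, training_job_name)`. The boto3 SageMaker client also provides a `training_job_completed_or_stopped` waiter, which may be worth a look in the boto3 documentation.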


## Inspect the trained model artifact

In [59]:
print("== Output config:")
print(tj_state['OutputDataConfig'])

print()
s3 = boto3.client('s3')
print("== Model artifact:")
pp.pprint(s3.list_objects_v2(Bucket=bucket, Prefix=output_prefix))

== Output config:
{'KmsKeyId': '', 'S3OutputPath': 's3://sagemaker-2021-03-24-18-39-57/example/output/'}

== Model artifact:
{'Contents': [{'ETag': '"9a969992b9afe3717d2d31dfded7958d"',
               'Key': 'example/output/example-training-job-2021-03-24-19-33-59/output/model.tar.gz',
               'LastModified': datetime.datetime(2021, 3, 24, 19, 36, 57, tzinfo=tzlocal()),
               'Size': 122,
               'StorageClass': 'STANDARD'}],
 'EncodingType': 'url',
 'IsTruncated': False,
 'KeyCount': 1,
 'MaxKeys': 1000,
 'Name': 'sagemaker-2021-03-24-18-39-57',
 'Prefix': 'example/output/',
 'ResponseMetadata': {'HTTPHeaders': {'content-type': 'application/xml',
                                      'date': 'Wed, 24 Mar 2021 19:41:02 GMT',
                                      'server': 'AmazonS3',
                                      'transfer-encoding': 'chunked',
                                      'x-amz-bucket-region': 'us-west-2',
                                      

In [60]:
# print out logs from CloudWatch
logs = boto3.client('logs')

log_res= logs.describe_log_streams(
    logGroupName='/aws/sagemaker/TrainingJobs',
    logStreamNamePrefix=training_job_name)

for log_stream in log_res['logStreams']:
    # get the log events in this stream
    log_event = logs.get_log_events(
        logGroupName='/aws/sagemaker/TrainingJobs',
        logStreamName=log_stream['logStreamName'])
    
    # print out the message of each log event
    for ev in log_event['events']:
        print(ev['message'])

== Loading hyperparamters ===
== Hyperparamters: ==
{u'max_depth': u'4', u'n_iter': u'30', u'num_trees': u'15'}
== Saving model checkpoint ==
== training completed ==


## Conclusion

Congratulations! You now understand how to avoid hard-coding hyperparameters in your training image. To recap,

- Hyperparameter injection allows you to quickly experiment with different hyperparameters for your ML algorithm
- When calling `CreateTrainingJob` with hyperparameter injection, the hyperparameters you pass to `HyperParameters` need to be a dictionary of string : string
- To avoid hating yourself, always test your container before pushing it to ECR
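
The string-to-string requirement can be checked before the API call. The sketch below is a hypothetical helper, `validate_hyperparameters`, that rejects nested or non-string values; it is an illustration, not part of boto3 or SageMaker.

```python
def validate_hyperparameters(hyperparameters):
    """Raise if the dict is not a flat string-to-string mapping (hypothetical helper)."""
    for key, value in hyperparameters.items():
        if not isinstance(key, str):
            raise TypeError("key {!r} is not a string".format(key))
        if not isinstance(value, str):
            raise TypeError(
                "value of {!r} is {}, expected str".format(key, type(value).__name__)
            )
    return hyperparameters
```

Calling it on the dictionary right before `create_training_job` surfaces a mistake such as `{"max_depth": 4}` locally instead of as an API error.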

## Clean up resources

In [62]:
# delete the ECR repo
del_repo_res = ecr.delete_repository(
    repositoryName='example-image',
    force=True)
pp.pprint(del_repo_res)

{'ResponseMetadata': {'HTTPHeaders': {'content-length': '289',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Wed, 24 Mar 2021 19:48:32 GMT',
                                      'x-amzn-requestid': '773042de-4abb-41fa-893d-7e919fae658d'},
                      'HTTPStatusCode': 200,
                      'RequestId': '773042de-4abb-41fa-893d-7e919fae658d',
                      'RetryAttempts': 0},
 'repository': {'createdAt': datetime.datetime(2021, 3, 16, 20, 7, 17, tzinfo=tzlocal()),
                'imageTagMutability': 'MUTABLE',
                'registryId': '688520471316',
                'repositoryArn': 'arn:aws:ecr:us-west-2:688520471316:repository/example-image',
                'repositoryName': 'example-image',
                'repositoryUri': '688520471316.dkr.ecr.us-west-2.amazonaws.com/example-image'}}


In [61]:
# delete the S3 bucket
def delete_bucket_force(bucket_name):
    # 'Contents' is absent from the response when the bucket is empty
    objs = s3.list_objects_v2(Bucket=bucket_name).get('Contents', [])
    for obj in objs:
        s3.delete_object(
            Bucket=bucket_name,
            Key=obj['Key'])
    
    return s3.delete_bucket(Bucket=bucket_name)

del_buc_res = delete_bucket_force(bucket)

pp.pprint(del_buc_res)

{'ResponseMetadata': {'HTTPHeaders': {'date': 'Wed, 24 Mar 2021 19:48:28 GMT',
                                      'server': 'AmazonS3',
                                      'x-amz-id-2': '2ZmF3NwIarE8M3XibzrMf5l2QpKBjp5oBLr6AXyUwPz2bcEOkX6usAJQVkfdp8EFs3G/ykPmILI=',
                                      'x-amz-request-id': 'WQ4864KNX1XG5BXG'},
                      'HTTPStatusCode': 204,
                      'HostId': '2ZmF3NwIarE8M3XibzrMf5l2QpKBjp5oBLr6AXyUwPz2bcEOkX6usAJQVkfdp8EFs3G/ykPmILI=',
                      'RequestId': 'WQ4864KNX1XG5BXG',
                      'RetryAttempts': 0}}
