# Create Training Job (Hyperparameter Injection)

In this notebook, we discuss more complicated setups for the `CreateTrainingJob` API. It assumes you are comfortable with the setups discussed in [the notebook on basics of `CreateTrainingJob`](https://github.com/hsl89/amazon-sagemaker-examples/blob/sagemaker-fundamentals/sagemaker-fundamentals/create-training-job/create_training_job.ipynb).



## What is Hyperparameter Injection?

With hyperparameter injection, you don't need to hard-code the hyperparameters of your ML training in the training image. Instead, you pass your hyperparameters through the `CreateTrainingJob` API and SageMaker makes them available to your training container. This way you can experiment with a list of hyperparameters for your training job without rebuilding the image for each experiment. More importantly, this is the mechanism used by the `CreateHyperParameterTuningJob` API to (you guessed right) create many training jobs to search for the best hyperparameters. We will discuss `CreateHyperParameterTuningJob` in a different notebook.

If you remember from [the notebook on basics of `CreateTrainingJob`](https://github.com/hsl89/amazon-sagemaker-examples/blob/sagemaker-fundamentals/sagemaker-fundamentals/create-training-job/create_training_job.ipynb), SageMaker reserves the `/opt/ml` directory "to talk to your container", i.e. to provide training information to your training job and retrieve output from it.

You will pass the hyperparameters of your training job as a dictionary to the `create_training_job` method of the boto3 SageMaker client, and they will become available in `/opt/ml/input/config/hyperparameters.json` inside the container. See [reference in the official docs](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html).
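
As a rough illustration of the container side, the sketch below reads the hyperparameter file and converts the values back to Python types. SageMaker delivers every hyperparameter value as a string, so the training script has to cast the values itself. The function name `load_hyperparameters`, the JSON-based casting rule, and the sample keys `lr`, `epochs` and `optimizer` are assumptions made for this example; they are not part of the SageMaker API.

```python
import json
import os
import tempfile

# Location where SageMaker places the hyperparameters inside the container
HYPERPARAMETERS_PATH = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters(path=HYPERPARAMETERS_PATH):
    """Read hyperparameters and try to cast each string value to a Python type."""
    with open(path) as f:
        raw = json.load(f)  # e.g. {"lr": "0.01", "epochs": "5"}

    parsed = {}
    for name, value in raw.items():
        try:
            # "0.01" -> 0.01, "5" -> 5, "true" -> True
            parsed[name] = json.loads(value)
        except (TypeError, ValueError):
            # not valid JSON (e.g. "adam"): keep the original string
            parsed[name] = value
    return parsed


# Local dry run with a stand-in file, since /opt/ml only exists in the container
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "hyperparameters.json")
    with open(sample_path, "w") as f:
        json.dump({"lr": "0.01", "epochs": "5", "optimizer": "adam"}, f)
    hps = load_hyperparameters(sample_path)
    print(hps)
```

Inside a real training container you would call `load_hyperparameters()` without arguments so that it reads the default path.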

### Setup

You will build a training image and push it to ECR like in [the notebook on basics of `CreateTrainingJob`](https://github.com/hsl89/amazon-sagemaker-examples/blob/sagemaker-fundamentals/sagemaker-fundamentals/create-training-job/create_training_job.ipynb). The only difference is that the Python script that runs the training will print out the hyperparameters in `/opt/ml/input/config/hyperparameters.json` to confirm that the container does have access to the hyperparameters you passed to the `CreateTrainingJob` API.

This training job does not require any data. Therefore, you don't need to configure the `InputDataConfig` parameter for `CreateTrainingJob`. However, SageMaker always needs an S3 URI to save your model artifact, i.e. you still need to configure the `OutputDataConfig` parameter.

In [1]:
import boto3 # your gateway to AWS APIs
import datetime
import pprint
import os
import time
import re

pp = pprint.PrettyPrinter(indent=1)
iam = boto3.client('iam')

In [3]:
# some helper functions
def current_time():
    ct = datetime.datetime.now()
    return str(ct).replace(":", "-").replace(" ", "-")[:19]

def account_id():
    return boto3.client('sts').get_caller_identity()['Account']

### Set up a service role for SageMaker

Review [notebook on execution role](https://github.com/hsl89/amazon-sagemaker-examples/blob/execution-role/sagemaker-fundamentals/execution-role/execution-role.ipynb) for step-by-step instructions on how to create an IAM Role.

The service role is intended to be assumed by the SageMaker service to procure resources in your AWS account on your behalf. 

1. If you are running this notebook on SageMaker infrastructure like Notebook Instances or Studio, then we will use the role you used to spin up those resources.

2. If you are running this notebook on an EC2 instance, then we will create a service role and attach `AmazonSageMakerFullAccess` to it. If you already have a SageMaker service role, you can paste its `role_arn` here.

First, let's get some helper functions for creating an execution role. We discussed those functions in the [notebook on execution role](https://github.com/hsl89/amazon-sagemaker-examples/blob/execution-role/sagemaker-fundamentals/execution-role/execution-role.ipynb).

In [3]:
%%bash
# remove any stale copy so that wget does not save the new file as iam_helpers.py.1
rm -f iam_helpers.py

wget https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/sagemaker-fundamentals/sagemaker-fundamentals/execution-role/iam_helpers.py


--2021-03-23 23:35:37--  https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/sagemaker-fundamentals/sagemaker-fundamentals/execution-role/iam_helpers.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3659 (3.6K) [text/plain]
Saving to: ‘iam_helpers.py’

     0K ...                                                   100% 63.4M=0s

2021-03-23 23:35:37 (63.4 MB/s) - ‘iam_helpers.py’ saved [3659/3659]



In [4]:
# set up service role for SageMaker
from iam_helpers import create_execution_role

sts = boto3.client('sts')
caller = sts.get_caller_identity()

if ':user/' in caller['Arn']: # as IAM user
    # either paste in an existing role_arn or create a new role and attach
    # AmazonSageMakerFullAccess
    role_name = 'sm'
    role_arn = create_execution_role(role_name=role_name)['Role']['Arn']
    
    # attach the permission to the role
    # skip it if you want to use a SageMaker service role that 
    # already has AmazonSageMakerFullAccess
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
    )
elif 'assumed-role' in caller['Arn']: # on SageMaker infra
    assumed_role = caller['Arn']
    role_arn = re.sub(r"^(.+)sts::(\d+):assumed-role/(.+?)/.*$", r"\1iam::\2:role/\3", assumed_role)
else:
    print("I assume you are on an EC2 instance launched with an IAM role")
    role_arn = caller['Arn']
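
To make the `re.sub` call in the cell above easier to follow, here is a small standalone sketch that applies the same pattern to a made-up ARN. The function name `role_arn_from_assumed_role`, the account ID and the role name are placeholders chosen for illustration.

```python
import re


def role_arn_from_assumed_role(assumed_role_arn):
    """Map an STS assumed-role ARN to the ARN of the underlying IAM role."""
    return re.sub(
        r"^(.+)sts::(\d+):assumed-role/(.+?)/.*$",  # prefix, account, role name
        r"\1iam::\2:role/\3",
        assumed_role_arn,
    )


# Made-up ARN for illustration only
example = "arn:aws:sts::123456789012:assumed-role/MyNotebookRole/SageMaker"
print(role_arn_from_assumed_role(example))
```

One caveat worth checking in your own account: the assumed-role ARN does not carry the role's path, so for a role that lives under a path such as `service-role/` the reconstructed ARN would need the path added back.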

## Build a training image and push to ECR

You will build a training image here, as in [the notebook on basics of `CreateTrainingJob`](https://github.com/hsl89/amazon-sagemaker-examples/blob/sagemaker-fundamentals/sagemaker-fundamentals/create-training-job/create_training_job.ipynb).

In [None]:
# View the Dockerfile
!cat container_intermediate/Dockerfile

In [None]:
%%sh
# build the image
cd container_intermediate/

# tag it as example-image:latest
docker build -t example-image:latest .

In [None]:
# create a repo in ECR called example-image

ecr = boto3.client('ecr')

try:
    cr_res = ecr.create_repository(
        repositoryName='example-image')
    pp.pprint(cr_res)
except ecr.exceptions.RepositoryAlreadyExistsException as e:
    # The repository already exists in your ECR,
    # so it can be reused as-is
    print(e)

In [None]:
%%bash
account=$(aws sts get-caller-identity --query Account | sed -e 's/^"//' -e 's/"$//')
region=$(aws configure get region)
ecr_account=${account}.dkr.ecr.${region}.amazonaws.com

# Give docker your ECR login password
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $ecr_account

# Fullname of the repo
fullname=$ecr_account/example-image:latest

#echo $fullname
# Tag the image with the fullname
docker tag example-image:latest $fullname

# Push to ECR
docker push $fullname


In [None]:
# Inspect the ECR repository
repo_res = ecr.describe_images(
    repositoryName='example-image')
pp.pprint(repo_res)

## Prepare training data

SageMaker provides training data to your image through an S3 bucket that your service role has read access to. This means that before triggering a training job, you need to make your training data available in such an S3 bucket.

In this notebook, we will use preloaded data on a public bucket `sagemaker-sample-files`.

In [None]:
# inspect the bucket
public_bucket = "sagemaker-sample-files"
s3 = boto3.client('s3')
obj_res = s3.list_objects_v2(
    Bucket=public_bucket)

# print out object keys compactly
for obj in obj_res['Contents']:
    if '/tabular/fraud_detection/synthethic_fraud_detection_SA' in obj['Key']:
        print(obj['Key'])

Let's pretend the data under `datasets/tabular/fraud_detection/synthethic_fraud_detection_SA` is the data for your ML project.

The public bucket `sagemaker-sample-files` is located in us-east-1. We first need to copy the data to a bucket of yours that shares the same region as the SageMaker instance you will use later.

In [None]:
# create a bucket
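def create_tmp_bucket():
    """Create an S3 bucket that is intended to be used for short term"""
    bucket = f"sagemaker-{current_time()}" # accessible by SageMaker
    region = boto3.Session().region_name
    s3_client = boto3.client('s3')
    if region == 'us-east-1':
        # us-east-1 is the default location and must not be passed
        # as a LocationConstraint
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={
                'LocationConstraint': region
            })
    return bucket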

bucket = create_tmp_bucket()
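
Since the bucket has to live in the same region as the training instance, it can be worth double-checking where a bucket actually is. The sketch below wraps the S3 `get_bucket_location` call, which reports `None` as the `LocationConstraint` of buckets in us-east-1. The helper name `bucket_region` and the `FakeS3Client` class are hypothetical; the fake client only exists so that the example runs without AWS credentials.

```python
def bucket_region(s3_client, bucket_name):
    """Return the region of a bucket; S3 reports us-east-1 as None."""
    response = s3_client.get_bucket_location(Bucket=bucket_name)
    return response["LocationConstraint"] or "us-east-1"


class FakeS3Client:
    """Stand-in for boto3.client('s3') so the sketch runs offline."""

    def __init__(self, locations):
        self._locations = locations

    def get_bucket_location(self, Bucket):
        return {"LocationConstraint": self._locations[Bucket]}


# Placeholder bucket names for illustration only
fake_s3 = FakeS3Client({"bucket-east": None, "bucket-west": "us-west-2"})
print(bucket_region(fake_s3, "bucket-east"))
print(bucket_region(fake_s3, "bucket-west"))
```

With a real client you would call `bucket_region(boto3.client('s3'), bucket)` and compare the result with `boto3.Session().region_name`.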

The bucket is created by you. By default, all objects in the bucket are private and are accessible only by the owner of the bucket. To make the bucket accessible to the SageMaker service, you would normally need to explicitly grant the SageMaker service role permission to access it. We do not have to do that here, because the bucket name is prefixed by "sagemaker", and the `AmazonSageMakerFullAccess` policy allows the service role to access S3 buckets with "sagemaker" in their names.

In [None]:
input_prefix = 'input_data/'

In [None]:
# copy from sagemaker-sample-files to {bucket}
s3 = boto3.client('s3')

# copy remote csv files to local
files = []
data_dir = '/tmp'
for obj in obj_res['Contents']:
    if '/tabular/fraud_detection/synthethic_fraud_detection_SA' in obj['Key']:
        key = obj['Key']
        if key.endswith('.csv'):
            filename=key.split('/')[-1]
            files.append(filename)
            with open(os.path.join(data_dir, filename), 'wb') as f:
                s3.download_fileobj(public_bucket, key, f)

# upload from local to the bucket you just created
for fname in files:
    with open(os.path.join(data_dir, fname), 'rb') as f:
        key = input_prefix + fname
        s3.upload_fileobj(f, bucket, key)

In [None]:
# inspect your bucket
obj_res = s3.list_objects_v2(
    Bucket=bucket)

for obj in obj_res['Contents']:
    print(obj['Key'])

## Put everything together

Now you have everything you need to create a training job. Let's review what you have done. You have 
* created an execution role for SageMaker service
* built and tested a docker image that includes the runtime and logic of your model training
* made the image accessible to SageMaker by hosting it on ECR
* made the training data available to SageMaker by hosting it on S3
* pointed SageMaker to an S3 bucket to write output 

Let's pull the trigger and create a training job. We will invoke the `CreateTrainingJob` API via boto3. You are strongly encouraged to read through the [description of the API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job) before moving on.

In [None]:
# set up

sm_boto3 = boto3.client('sagemaker')

# name training job
training_job_name = 'example-training-job-{}'.format(current_time())

# input data prefix
data_path = "s3://" + bucket + '/' + input_prefix

# location that SageMaker saves the model artifacts
output_prefix = 'example/output/'
output_path = "s3://" + bucket + '/' + output_prefix

# ECR URI of your image
region = boto3.Session().region_name
account = account_id()
image_uri = "{}.dkr.ecr.{}.amazonaws.com/example-image:latest".format(account, region)

algorithm_specification = {
    'TrainingImage': image_uri,
    'TrainingInputMode': 'File',
}


input_data_config = [
    {
        'ChannelName': 'train',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': data_path,
                'S3DataDistributionType': 'FullyReplicated',
            }
        }
    },
    {
        'ChannelName': 'test',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': data_path,
                'S3DataDistributionType': 'FullyReplicated',
            }
        }
    }
]


output_data_config = {
    'S3OutputPath': output_path
}

resource_config = {
    'InstanceType': 'ml.m5.large',
    'InstanceCount':1,
    'VolumeSizeInGB':10
}

stopping_condition={
    'MaxRuntimeInSeconds':120,
}

enable_network_isolation=False

In [None]:
ct_res = sm_boto3.create_training_job(
    TrainingJobName=training_job_name,
    AlgorithmSpecification=algorithm_specification,
    RoleArn=role_arn,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    ResourceConfig=resource_config,
    StoppingCondition=stopping_condition,
    EnableNetworkIsolation=enable_network_isolation,
    EnableManagedSpotTraining=False,
)
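
The `HyperParameters` argument of `create_training_job` expects a mapping from strings to strings. The sketch below shows one possible way to prepare such a mapping from ordinary Python values. The helper name `to_hyperparameter_strings` and the sample hyperparameters are assumptions for illustration, and JSON-encoding the non-string values is just one convention among several.

```python
import json


def to_hyperparameter_strings(params):
    """Convert a dict of Python values to the str -> str mapping the API expects."""
    converted = {}
    for name, value in params.items():
        # leave strings untouched, JSON-encode everything else
        converted[str(name)] = value if isinstance(value, str) else json.dumps(value)
    return converted


# Sample values for illustration only
hyperparameters = to_hyperparameter_strings(
    {"lr": 0.01, "epochs": 5, "optimizer": "adam", "shuffle": True}
)
print(hyperparameters)

# The mapping would then be passed next to the other arguments, for example:
# sm_boto3.create_training_job(..., HyperParameters=hyperparameters)
```

The training script receives exactly these strings in `/opt/ml/input/config/hyperparameters.json`, so whichever encoding you choose here has to be undone on the container side.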

In [None]:
# View the status of the training job
tj_state = sm_boto3.describe_training_job(
    TrainingJobName=training_job_name)
pp.pprint(tj_state.keys())

In [None]:
# check training job status every 30 seconds
stopped = False
while not stopped:
    tj_state = sm_boto3.describe_training_job(
        TrainingJobName=training_job_name)
    if tj_state['TrainingJobStatus'] in ['Completed', 'Stopped', 'Failed']:
        stopped=True
    else:
        print("Training in progress")
        time.sleep(30)

if tj_state['TrainingJobStatus'] == 'Failed':
    print("Training job failed")
    print("Failure Reason: {}".format(tj_state['FailureReason']))
else:
    print("Training job {}".format(tj_state['TrainingJobStatus'].lower()))
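
The polling loop above runs until the job reaches a terminal state. A variant with an upper bound on the number of polls might look like the sketch below. The function `wait_for_training_job`, its parameters, and the `fake_describe` stand-in are hypothetical names introduced for this example; the fake only lets the code run without calling AWS.

```python
import time

TERMINAL_STATES = ("Completed", "Stopped", "Failed")


def wait_for_training_job(describe, job_name, poll_seconds=30, max_polls=20,
                          sleep=time.sleep):
    """Poll describe(TrainingJobName=...) until a terminal status or max_polls."""
    for _ in range(max_polls):
        state = describe(TrainingJobName=job_name)
        if state["TrainingJobStatus"] in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# Dry run with a fake describe function that finishes on the third call
_responses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName):
    return {"TrainingJobName": TrainingJobName,
            "TrainingJobStatus": next(_responses)}


final_state = wait_for_training_job(fake_describe, "demo-job",
                                    sleep=lambda seconds: None)
print(final_state["TrainingJobStatus"])
```

Against the real service you would call `wait_for_training_job(sm_boto3.describe_training_job, training_job_name)`. boto3 also ships a built-in waiter named `training_job_completed_or_stopped` that serves a similar purpose.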

## Inspect the trained model artifact

In [None]:
print("== Output config:")
print(tj_state['OutputDataConfig'])

print()

print("== Model artifact:")
pp.pprint(s3.list_objects_v2(Bucket=bucket, Prefix=output_prefix))
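
SageMaker uploads the model artifact as `model.tar.gz` under `<S3OutputPath>/<TrainingJobName>/output/`. The sketch below builds that URI and splits it into a bucket and a key, which is handy if you want to download the artifact with the S3 client. The helper names and the bucket and job names are placeholders for illustration.

```python
from urllib.parse import urlparse


def expected_model_artifact_uri(s3_output_path, job_name):
    """Build the S3 URI where SageMaker is expected to upload model.tar.gz."""
    return f"{s3_output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


def split_s3_uri(uri):
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder values for illustration only
artifact_uri = expected_model_artifact_uri(
    "s3://sagemaker-demo-bucket/example/output/", "example-training-job-demo"
)
artifact_bucket, artifact_key = split_s3_uri(artifact_uri)
print(artifact_uri)
print(artifact_bucket, artifact_key)
```

For a finished job, the same URI is also reported by `describe_training_job` under `tj_state['ModelArtifacts']['S3ModelArtifacts']`, so the helper is mainly useful for cross-checking.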

In [None]:
logs = boto3.client('logs')

log_res= logs.describe_log_streams(
    logGroupName='/aws/sagemaker/TrainingJobs',
    logStreamNamePrefix=training_job_name)

for log_stream in log_res['logStreams']:
    # get the log events of this stream
    log_event = logs.get_log_events(
        logGroupName='/aws/sagemaker/TrainingJobs',
        logStreamName=log_stream['logStreamName'])
    
    # print out the message of each log event
    for ev in log_event['events']:
        print(ev['message'])

## Conclusion

Congratulations! You now understand the basics of a training job on SageMaker. It's funny to think that after this long notebook, you get a trained model artifact that is just a pickled `None` instance. But keep in mind that you can follow the exact same process to train a state-of-the-art model with billions of parameters, and the compute cost is proportional to how long you train your model.

## Clean up resources

In [None]:
# delete the ECR repo
del_repo_res = ecr.delete_repository(
    repositoryName='example-image',
    force=True)

pp.pprint(del_repo_res)

In [None]:
# delete the S3 bucket
def delete_bucket_force(bucket_name):
    # 'Contents' is missing from the response when the bucket is empty
    objs = s3.list_objects_v2(Bucket=bucket_name).get('Contents', [])
    for obj in objs:
        s3.delete_object(
            Bucket=bucket_name,
            Key=obj['Key'])
    
    return s3.delete_bucket(Bucket=bucket_name)

del_buc_res = delete_bucket_force(bucket)

pp.pprint(del_buc_res)
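
The cleanup above deletes objects one at a time, which is fine for a handful of files. For larger buckets, the S3 `delete_objects` call accepts up to 1000 keys per request. The sketch below only prepares the request payloads; the helper name `delete_batches` and the generated key names are assumptions for illustration.

```python
def delete_batches(keys, size=1000):
    """Group keys into Delete payloads of at most `size` keys each."""
    return [
        {"Objects": [{"Key": key} for key in keys[start:start + size]]}
        for start in range(0, len(keys), size)
    ]


# Made-up keys for illustration only
sample_keys = [f"input_data/file-{i}.csv" for i in range(2500)]
batches = delete_batches(sample_keys)
print([len(batch["Objects"]) for batch in batches])

# Each payload would then be sent with:
# s3.delete_objects(Bucket=bucket, Delete=batch)
```

Keep in mind that a single `list_objects_v2` response is also capped at 1000 keys, so collecting all keys of a large bucket requires paginating the listing first.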