# Create Training Job

An Amazon SageMaker *training job* is a compute process that trains an ML model in a containerized environment. In this notebook, you will create a training job with your own custom container on Amazon SageMaker. To read more about training jobs, refer to the [official docs](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html).

The outline of this notebook is:
- set up a service execution role for SageMaker to run a training job
- build a lightweight container based on continuumio/miniconda
- test your container locally
- push your container to Elastic Container Registry (ECR)
- upload your training data to an S3 bucket
- create a training job with everything you did above

In [None]:
import boto3 # your gateway to AWS APIs
import datetime
import pprint
import os
import time
import re

pp = pprint.PrettyPrinter(indent=1)
iam = boto3.client('iam')

In [None]:
# some helper functions
def current_time():
    # timestamp like 2021-01-31-12-00-00, safe to use in resource names
    return str(datetime.datetime.now()).replace(":", "-").replace(" ", "-")[:19]

def account_id():
    return boto3.client('sts').get_caller_identity()['Account']

## Set up a service role for SageMaker

Review [notebook on execution role](https://github.com/hsl89/amazon-sagemaker-examples/blob/execution-role/sagemaker-fundamentals/execution-role/execution-role.ipynb) for step-by-step instructions on how to create an IAM Role.

The service role is intended to be assumed by the SageMaker service to procure resources in your AWS account on your behalf. 

1. If you are running this notebook on SageMaker infrastructure like Notebook Instances or Studio, then we will use the role you used to spin up those resources.

2. If you are running this notebook on an EC2 instance, then we will create a service role and attach `AmazonSageMakerFullAccess` to it. If you already have a SageMaker service role, you can paste its `role_arn` here. 
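
What makes a role a *service role* for SageMaker is its trust policy, which names the SageMaker service as the principal allowed to assume it. The following is a minimal sketch of such a trust policy; the function name `sagemaker_trust_policy` is made up for illustration, and the `create_execution_role` helper used later in this notebook may build its policy differently.

```python
import json

def sagemaker_trust_policy():
    """Return a trust policy that lets the SageMaker service assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # the SageMaker service is the principal that assumes the role
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }

# The JSON string is what you would pass as AssumeRolePolicyDocument
# when creating the role with the IAM API
policy_json = json.dumps(sagemaker_trust_policy(), indent=2)
print(policy_json)
```

The trust policy only controls who may assume the role. What the role can do once assumed is decided by the permission policies attached to it, such as `AmazonSageMakerFullAccess`.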

First, let's get some helper functions for creating an execution role. We discussed those functions in the [notebook on execution role](https://github.com/hsl89/amazon-sagemaker-examples/blob/execution-role/sagemaker-fundamentals/execution-role/execution-role.ipynb).

In [None]:
%%bash
# remove any stale copy before downloading a fresh one
rm -f iam_helpers.py

wget https://raw.githubusercontent.com/hsl89/amazon-sagemaker-examples/sagemaker-fundamentals/sagemaker-fundamentals/execution-role/iam_helpers.py

In [None]:
# set up service role for SageMaker
from iam_helpers import create_execution_role

sts = boto3.client('sts')
caller = sts.get_caller_identity()

if ':user/' in caller['Arn']: # as IAM user
    # either paste in an existing role_arn or create a new role and attach 
    # AmazonSageMakerFullAccess
    role_name = 'sm'
    role_arn = create_execution_role(role_name=role_name)['Role']['Arn']
    
    # attach the permission to the role
    # skip it if you want to use a SageMaker service role that 
    # already has AmazonSageMakerFullAccess
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
    )
elif 'assumed-role' in caller['Arn']: # on SageMaker infra
    assumed_role = caller['Arn']
    role_arn = re.sub(r"^(.+)sts::(\d+):assumed-role/(.+?)/.*$", r"\1iam::\2:role/\3", assumed_role)
else:
    print("I assume you are on an EC2 instance launched with an IAM role")
    role_arn = caller['Arn']
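
The `re.sub` call in the cell above turns the ARN of an assumed-role session (reported by STS) into the ARN of the underlying IAM role. Here is the same substitution applied to a made-up ARN so you can see what it does; the account ID and role name are placeholders.

```python
import re

def assumed_role_to_role_arn(assumed_role_arn):
    """Convert an STS assumed-role ARN into the ARN of the IAM role behind it."""
    return re.sub(
        r"^(.+)sts::(\d+):assumed-role/(.+?)/.*$",
        r"\1iam::\2:role/\3",
        assumed_role_arn,
    )

# placeholder ARN, not a real account or role
example = "arn:aws:sts::123456789012:assumed-role/MySageMakerRole/SageMaker"
print(assumed_role_to_role_arn(example))
# → arn:aws:iam::123456789012:role/MySageMakerRole
```

Note that an assumed-role ARN does not carry the role's path, so if your role was created under a path such as `service-role/`, the reconstructed ARN may need that path added back.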

## Build the training environment into a docker image

Before creating a training job on Amazon SageMaker, you need to package the entire runtime environment of your ML project into a docker image and push the image into the Elastic Container Registry (ECR) under your account. 

When triggering a training job, your requested SageMaker instance will pull that image from your ECR and execute it with the data you specified in an S3 URI. 

It is important to know how SageMaker runs your image. For a **training job**, SageMaker runs your image like
```
docker run <image> train
```
i.e. your image needs to have an executable `train` and it is the executable that starts the model training process. You will see later in the notebook how to create it. 

The next natural thing to ask is how does the image running on SageMaker instance access the data that the model needs to be trained on? SageMaker requires you to reserve `/opt/ml` directory inside your image for it to provide training information. When you trigger a training job, you will need to specify the location of your training data, and the SageMaker instance running your image will mount your data into `/opt/ml/input`. 

To read more about how SageMaker uses `/opt/ml` to provide training information, refer to the [official docs](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html).
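
To make the layout concrete, here is a small sketch that lists the locations under `/opt/ml` a training container typically reads from and writes to. The helper name `sagemaker_paths` and the dictionary keys are made up for illustration; the paths themselves follow the documentation linked above.

```python
import posixpath  # container paths are POSIX paths regardless of your local OS

def sagemaker_paths(channels, base_dir="/opt/ml"):
    """Map descriptive names to the paths SageMaker uses inside the container."""
    paths = {
        "hyperparameters": posixpath.join(base_dir, "input", "config", "hyperparameters.json"),
        "resource_config": posixpath.join(base_dir, "input", "config", "resourceconfig.json"),
        "model": posixpath.join(base_dir, "model"),
        "failure": posixpath.join(base_dir, "output", "failure"),
    }
    # each input channel gets its own directory under input/data
    for channel in channels:
        paths["channel:" + channel] = posixpath.join(base_dir, "input", "data", channel)
    return paths

for name, path in sagemaker_paths(["train", "test"]).items():
    print(f"{name:20s} {path}")
```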

In [None]:
# View the Dockerfile
!cat container/Dockerfile

### Explanation

`train.py` in `container/` is the main script for training the model. We copied it into `/usr/bin`, renamed it as `train` and made it an executable in the docker image. This way when the container is executed as 
```
docker run <image> train
```
The script in `/usr/bin/train` (in the container) will run. 

Note that this is one way to run the training logic on SageMaker. As long as the command 
```
docker run <image> train 
```
triggers your training logic you can do whatever you want. 

Next, we build the image.

In [None]:
%%sh
# build the image
cd container/

# tag it as example-image:latest
docker build -t example-image:latest .

Let's inspect what's in the training script `container/train.py`

In [None]:
!pygmentize container/train.py

### Explanation

It is a skeleton of typical ML training logic. The main function fetches training data in `/opt/ml/input/data/train`. To verify we indeed have access to the data, we will print out the names of the files in `/opt/ml/input/data/train`. When you actually run this training logic on SageMaker, you can view the stdout through CloudWatch. We will discuss this in more detail later in this notebook. 

When the main function finishes model training, it saves the model checkpoint in `/opt/ml/model`. The SageMaker Instance running your container will then upload everything in `/opt/ml/model` to an S3 URI that you will later configure yourself. 
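
If you do not have the file at hand, the following is a minimal sketch of a script with the behavior described above. It is not the actual `container/train.py`: the function signature, the `base_dir` parameter, and the file name `model.pkl` are assumptions made so that the sketch can be tried outside a container.

```python
import os
import pickle
import tempfile

def train(base_dir="/opt/ml"):
    """List the training files, 'train' a placeholder model, save a checkpoint."""
    train_dir = os.path.join(base_dir, "input", "data", "train")
    model_dir = os.path.join(base_dir, "model")

    # print the training files so they show up in stdout
    files = sorted(os.listdir(train_dir))
    print("Training files:", files)

    # placeholder for real training logic; the "model" here is just None
    model = None

    # everything written to the model directory is what SageMaker uploads
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "model.pkl")  # hypothetical file name
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model_path

# try it against a throwaway directory instead of the real /opt/ml
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "input", "data", "train"))
    open(os.path.join(tmp, "input", "data", "train", "sample.csv"), "w").close()
    print(train(base_dir=tmp))
```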

## Test your container

It is a good practice to test your container before sending it to SageMaker, because you can debug and iterate much faster on your local machine. 

You are strongly encouraged to read through the section on [How Amazon SageMaker Provides Training Information](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html) in the official docs and work out a local testing environment that replicates how SageMaker provides training information to your container. 

We will use the Docker Python client to execute the container. To see our implementation of the local testing environment, run the following cell.

In [None]:
!pygmentize container/local_test/test_container.py

### Explanation

Our testing script runs the docker image `example-image:latest` with `train` command, mimicking how SageMaker runs your container for a training job. It mounts the local directory `container/local_test/ml/` to `/opt/ml` in the docker image, mimicking how SageMaker provides the training information to the container. 
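
For reference, here is a minimal sketch of how such a run could be expressed with the Docker SDK for Python. It is not the actual `container/local_test/test_container.py`; the helper names `build_run_kwargs` and `run_local_test` are made up, and running it assumes the `docker` package is installed and a Docker daemon is available.

```python
import os

def build_run_kwargs(local_ml_dir, image="example-image:latest"):
    """Arguments that mimic `docker run <image> train` with /opt/ml mounted."""
    return {
        "image": image,
        "command": "train",
        # mount the local directory at /opt/ml inside the container
        "volumes": {os.path.abspath(local_ml_dir): {"bind": "/opt/ml", "mode": "rw"}},
        "remove": True,  # clean up the container after it exits
    }

def run_local_test(local_ml_dir):
    """Run the container locally and return its combined stdout/stderr."""
    import docker  # imported lazily: needs the `docker` package and a daemon

    client = docker.from_env()
    logs = client.containers.run(
        stdout=True, stderr=True, **build_run_kwargs(local_ml_dir)
    )
    return logs.decode("utf-8")

print(build_run_kwargs("container/local_test/ml")["command"])
```

Calling `run_local_test("container/local_test/ml")` would then play the role of the test script, provided the image has already been built.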

The directory `container/local_test/ml` looks like:

In [None]:
!ls -R container/local_test/ml

The directories `container/local_test/ml/input/data/train` and `container/local_test/ml/input/data/test` contain some csv files, which will be available in `/opt/ml/input/data/train` and `/opt/ml/input/data/test` as the training and testing data. 

In [None]:
# run the test
!python container/local_test/test_container.py

Now you should see a model checkpoint in `container/local_test/ml/model`

In [None]:
!ls container/local_test/ml/model

## Push your docker image to ECR

Now you have built your image and tested it locally. The next thing you need to do is push it to the Elastic Container Registry under your account. Later, when you trigger a training job, the SageMaker instance you requested will pull that image. 

To do so, you need to create a repository in ECR to host it. As you might have guessed, this operation requires permissions on your ECR resources. You (the principal running this notebook) need permission to create a repository in ECR and to get an authorization token from it. The role you created earlier (which you will later pass to SageMaker) needs permission to get an authorization token and pull the image. 

Note: if you do not have enough permissions on the ECR resources under your organization's account, the admin of the account needs to grant you the ECR permissions. 
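
If you or your admin need to write these permissions down explicitly, the pull side could look like the sketch below. This is an illustration rather than a policy taken from this notebook: the function name `ecr_pull_policy` and the repository ARN are placeholders, and a managed policy you already attached may cover these actions.

```python
import json

def ecr_pull_policy(repository_arn):
    """Sketch of an IAM policy that allows pulling images from one repository."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # the authorization token is not tied to a single repository
                "Effect": "Allow",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
            {
                # actions needed to download image layers and manifests
                "Effect": "Allow",
                "Action": [
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                ],
                "Resource": repository_arn,
            },
        ],
    }

# placeholder ARN, not a real account
repo_arn = "arn:aws:ecr:us-west-2:123456789012:repository/example-image"
print(json.dumps(ecr_pull_policy(repo_arn), indent=2))
```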

### Create a repository in your ECR

Assuming you have sufficient ECR permissions, we now create a repository in ECR to host the image `example-image:latest`. It is convenient to give the repository the same name as the image. 

In [None]:
ecr = boto3.client('ecr')

try:
    # The repository might already exist
    # in your ECR
    cr_res = ecr.create_repository(
        repositoryName='example-image')
    pp.pprint(cr_res)
except Exception as e:
    print(e)
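
The cell above catches every exception, so an unrelated failure (for example, missing permissions) is printed in the same way as "the repository already exists". A minimal sketch that only tolerates the already-exists case is shown below; the helper name `ensure_repository` is made up for illustration, and the client is passed in so the function does not depend on notebook state.

```python
def ensure_repository(ecr_client, name):
    """Create the repository if needed and return its URI."""
    try:
        res = ecr_client.create_repository(repositoryName=name)
        return res["repository"]["repositoryUri"]
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        # the repository is already there: look it up instead of failing
        res = ecr_client.describe_repositories(repositoryNames=[name])
        return res["repositories"][0]["repositoryUri"]

# usage in this notebook would be, for example:
#   repo_uri = ensure_repository(boto3.client("ecr"), "example-image")
```

Any other error, such as an access-denied response, still surfaces as an exception instead of being silently printed.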

If you already have a repository called `example-image`, then there are two ways you can continue:
* Delete the repository and create a new one with the same name
* Create a repository using a name other than `example-image`

We provide code for the first route below. Run it with caution: a repository called `example-image` that you did not create may be used by your org for production and merely happen to coincide with our choice of repository name. 

In [None]:
"""
If you want to delete the `example-image` repository,
Change this cell from markdown to python, then run it. 
"""
try:
    ecr.delete_repository(
        repositoryName='example-image')
    
    ecr.create_repository(
        repositoryName='example-image')
except Exception as e:
    print(e)

### Tag your image and push to ECR

Now let's tag the image with the full address of the repository we just created and push it there. Before doing that, you will need to grant docker access to your ECR. Refer to the [registry authentication section](https://docs.aws.amazon.com/AmazonECR/latest/userguide/registry_auth.html) from the ECR documentation for more detail. 
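
To see what "granting docker access" involves, the sketch below decodes an ECR authorization token into the username and password that `docker login` expects, and builds the full image address. The helper names are made up for illustration and the token is fabricated; a real token would come from the ECR `GetAuthorizationToken` API.

```python
import base64

def decode_ecr_token(authorization_token):
    """Split a base64 ECR authorization token into (username, password)."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    username, password = decoded.split(":", 1)
    return username, password

def ecr_image_uri(account, region, repository, tag="latest"):
    """Full address of an image in a private ECR registry."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"

# fabricated token for demonstration; a real one would be read from
# boto3.client("ecr").get_authorization_token()["authorizationData"][0]["authorizationToken"]
fake_token = base64.b64encode(b"AWS:not-a-real-password").decode("utf-8")
print(decode_ecr_token(fake_token)[0])
print(ecr_image_uri("123456789012", "us-west-2", "example-image"))
```

The username is always `AWS`, which is why the shell commands in the next cell pass `--username AWS` to `docker login`.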

In [None]:
%%bash
account=$(aws sts get-caller-identity --query Account | sed -e 's/^"//' -e 's/"$//')
region=$(aws configure get region)
ecr_account=${account}.dkr.ecr.${region}.amazonaws.com

# Give docker your ECR login password
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $ecr_account

# Fullname of the repo
fullname=$ecr_account/example-image:latest

#echo $fullname
# Tag the image with the fullname
docker tag example-image:latest $fullname

# Push to ECR
docker push $fullname

In [None]:
# Inspect the ECR repository
repo_res = ecr.describe_images(
    repositoryName='example-image')
pp.pprint(repo_res)

## Prepare training data

SageMaker provides training data to your image through an S3 bucket that your service role has read access to. This means before triggering a training job, you need to make your training data available in such an S3 bucket.

In this notebook, we will use preloaded data on a public bucket `sagemaker-sample-files`.

In [None]:
# inspect the bucket
public_bucket = "sagemaker-sample-files"
s3 = boto3.client('s3')
obj_res = s3.list_objects_v2(
    Bucket="sagemaker-sample-files")

# print out object keys compactly
for obj in obj_res['Contents']:
    if '/tabular/fraud_detection/synthethic_fraud_detection_SA' in obj['Key']:
        print(obj['Key'])

Let's pretend the data under `datasets/tabular/fraud_detection/synthethic_fraud_detection_SA` is the data for your ML project.

The public bucket `sagemaker-sample-files` is located in us-east-1. We first need to copy the data to a bucket of yours that shares the same region with the SageMaker instance you will use later.
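
One caveat about listing a large bucket: a single `list_objects_v2` call returns at most 1000 keys, so filtering its result can miss objects that sit further down the listing. A minimal sketch that asks S3 for a prefix directly and follows all result pages is shown below. The helper name `list_keys` is made up, and the prefix in the usage comment is an assumption pieced together from the key filter used in this notebook.

```python
def list_keys(s3_client, bucket, prefix, suffix=""):
    """Return every key under `prefix` (optionally ending with `suffix`)."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # a page without matches has no "Contents" key
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                keys.append(obj["Key"])
    return keys

# usage in this notebook would be, for example (prefix is an assumption):
#   list_keys(boto3.client("s3"), "sagemaker-sample-files",
#             "datasets/tabular/fraud_detection/synthethic_fraud_detection_SA", ".csv")
```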

In [None]:
# create a bucket
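def create_tmp_bucket():
    """Create an S3 bucket that is intended to be used for short term"""
    bucket = f"sagemaker-{current_time()}" # accessible by SageMaker
    region = boto3.Session().region_name
    s3_client = boto3.client('s3')
    if region == 'us-east-1':
        # us-east-1 is the default location and must not be passed
        # as a LocationConstraint
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={
                'LocationConstraint': region
            })
    return bucket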

bucket = create_tmp_bucket()

The bucket is created by you. By default, all objects in the bucket are private and are accessible only by the owner of the bucket. To make the bucket accessible by the SageMaker service, normally you would need to explicitly grant the SageMaker service role permission to access this bucket. We do not have to do it here, because the bucket name is prefixed by "sagemaker" and the policy `AmazonSageMakerFullAccess` allows the service role to access all S3 buckets prefixed by "sagemaker". 

In [None]:
input_prefix = 'input_data/'

In [None]:
# copy from sagemaker-sample-files to {bucket}
s3 = boto3.client('s3')

# copy remote csv files to local
files = []
data_dir = '/tmp'
for obj in obj_res['Contents']:
    if '/tabular/fraud_detection/synthethic_fraud_detection_SA' in obj['Key']:
        key = obj['Key']
        if key.endswith('.csv'):
            filename=key.split('/')[-1]
            files.append(filename)
            with open(os.path.join(data_dir, filename), 'wb') as f:
                s3.download_fileobj(public_bucket, key, f)

# upload from local to the bucket you just created
for fname in files:
    with open(os.path.join(data_dir, fname), 'rb') as f:
        key = input_prefix + fname
        s3.upload_fileobj(f, bucket, key)

In [None]:
# inspect your bucket
obj_res = s3.list_objects_v2(
    Bucket=bucket)

for obj in obj_res['Contents']:
    print(obj['Key'])

## Prepare an S3 URI for saving model artifact

After your image is done with model training, it needs to write the trained model artifact into `/opt/ml/model`. This is the directory where SageMaker looks for the trained model artifact and uploads it to an S3 URI you will configure later. Naturally, the execution role `sm` needs to have write permission to this S3 URI. 

Refer to the section on [How Amazon SageMaker Processes Training Output](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html) in the official docs for more detail. 
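
As a rough guide to where the artifact ends up, SageMaker compresses the contents of `/opt/ml/model` into `model.tar.gz` and places it under a folder named after the training job inside the output path. The sketch below builds that expected location; the helper name `expected_model_artifact_uri` and the bucket name are placeholders, and the authoritative value is the `ModelArtifacts` field returned by `describe_training_job` once the job has completed.

```python
def expected_model_artifact_uri(output_path, training_job_name):
    """Where the model archive is expected to land for a given output path."""
    return "{}/{}/output/model.tar.gz".format(
        output_path.rstrip("/"), training_job_name
    )

# placeholder bucket and job name
print(expected_model_artifact_uri(
    "s3://my-bucket/example/output/",
    "example-training-job-2021-01-01-00-00-00",
))
# → s3://my-bucket/example/output/example-training-job-2021-01-01-00-00-00/output/model.tar.gz
```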

## Put everything together

Now you have everything you need to create a training job. Let's review what you have done. You have 
* created an execution role for SageMaker service
* built and tested a docker image that includes the runtime and logic of your model training
* made the image accessible to SageMaker by hosting it on ECR
* made the training data available to SageMaker by hosting it on S3
* pointed SageMaker to an S3 bucket to write output 

Let's pull the trigger and create a training job. We will invoke the `CreateTrainingJob` API via boto3. You are strongly encouraged to read through the [description of the API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job) before moving on. 
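
One part of the request worth understanding up front is the input data configuration: it takes one entry per *channel*, and each channel name becomes a directory under `/opt/ml/input/data/` inside the container. The sketch below builds such entries with a small helper to avoid repeating the nested dictionary; the helper name `make_channel` and the bucket name are placeholders.

```python
def make_channel(name, s3_uri):
    """Build one input channel entry that reads every object under an S3 prefix."""
    return {
        "ChannelName": name,  # becomes /opt/ml/input/data/<name> in the container
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }

# placeholder bucket; both channels point at the same prefix, as in this notebook
channels = [make_channel(n, "s3://my-bucket/input_data/") for n in ("train", "test")]
print([c["ChannelName"] for c in channels])
# → ['train', 'test']
```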

In [None]:
# set up

sm_boto3 = boto3.client('sagemaker')

# name training job
training_job_name = 'example-training-job-{}'.format(current_time())

# input data prefix
data_path = "s3://" + bucket + '/' + input_prefix

# location that SageMaker saves the model artifacts
output_prefix = 'example/output/'
output_path = "s3://" + bucket + '/' + output_prefix

# ECR URI of your image
region = boto3.Session().region_name
account = account_id()
image_uri = "{}.dkr.ecr.{}.amazonaws.com/example-image:latest".format(account, region)

algorithm_specification = {
    'TrainingImage': image_uri,
    'TrainingInputMode': 'File',
}


input_data_config = [
    {
        'ChannelName': 'train',
        'DataSource':{
            'S3DataSource':{
                'S3DataType': 'S3Prefix',
                'S3Uri': data_path,
                'S3DataDistributionType': 'FullyReplicated',
            }
        }
    },
    {
        'ChannelName': 'test',
        'DataSource':{
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': data_path,
                'S3DataDistributionType': 'FullyReplicated',
            }
        }
    }
]


output_data_config = {
    'S3OutputPath': output_path
}

resource_config = {
    'InstanceType': 'ml.m5.large',
    'InstanceCount':1,
    'VolumeSizeInGB':10
}

stopping_condition={
    'MaxRuntimeInSeconds':120,
}

enable_network_isolation=False

In [None]:
ct_res = sm_boto3.create_training_job(
    TrainingJobName=training_job_name,
    AlgorithmSpecification=algorithm_specification,
    RoleArn=role_arn,
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    ResourceConfig=resource_config,
    StoppingCondition=stopping_condition,
    EnableNetworkIsolation=enable_network_isolation,
    EnableManagedSpotTraining=False,
)

In [None]:
# View the status of the training job
tj_state = sm_boto3.describe_training_job(
    TrainingJobName=training_job_name)
pp.pprint(tj_state.keys())

In [None]:
# check training job status every 30 seconds
stopped = False
while not stopped:
    tj_state = sm_boto3.describe_training_job(
        TrainingJobName=training_job_name)
    if tj_state['TrainingJobStatus'] in ['Completed', 'Stopped', 'Failed']:
        stopped=True
    else:
        print("Training in progress")
        time.sleep(30)

status = tj_state['TrainingJobStatus']
if status == 'Failed':
    print("Training job failed")
    print("Failure reason: {}".format(tj_state['FailureReason']))
else:
    print("Training job {}".format(status.lower()))

## Inspect the trained model artifact

In [None]:
print("== Output config:")
print(tj_state['OutputDataConfig'])

print()

print("== Model artifact:")
pp.pprint(s3.list_objects_v2(Bucket=bucket, Prefix=output_prefix))

In [None]:
logs = boto3.client('logs')

log_res= logs.describe_log_streams(
    logGroupName='/aws/sagemaker/TrainingJobs',
    logStreamNamePrefix=training_job_name)

for log_stream in log_res['logStreams']:
    # get the log events in this stream
    log_event = logs.get_log_events(
        logGroupName='/aws/sagemaker/TrainingJobs',
        logStreamName=log_stream['logStreamName'])
    
    # print out the message of each log event
    for ev in log_event['events']:
        print(ev['message'])

## Conclusion

Congratulations! You now understand the basics of a training job on SageMaker. It's funny to think that after this long notebook, you get a trained model artifact that is just a pickled `None`. But keep in mind that you can follow the exact same process to train a state-of-the-art model with billions of parameters, and the compute cost is proportional to how long you train your model.

## Clean up resources

In [None]:
# delete the ECR repo
del_repo_res = ecr.delete_repository(
    repositoryName='example-image',
    force=True)

pp.pprint(del_repo_res)

In [None]:
# delete the S3 bucket
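def delete_bucket_force(bucket_name):
    # an empty bucket has no 'Contents' key; list again until nothing is left,
    # because each call returns at most 1000 objects
    while True:
        objs = s3.list_objects_v2(Bucket=bucket_name).get('Contents', [])
        if not objs:
            break
        for obj in objs:
            s3.delete_object(
                Bucket=bucket_name,
                Key=obj['Key'])
    
    return s3.delete_bucket(Bucket=bucket_name)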

del_buc_res = delete_bucket_force(bucket)

pp.pprint(del_buc_res)