# Now, we can start a new training job

We'll send a zip file called **trainingjob.zip** with the following structure:
 - trainingjob.json (SageMaker training job descriptor)
 - monitoring.json (SageMaker monitoring inputs for data capture, baseline and schedule)
 - assets/deploy-model-prd.yml (CloudFormation template for deploying our model into Production)
 - assets/deploy-model-dev.yml (CloudFormation template for deploying our model into Development)

In [None]:
import time
import sagemaker
import boto3
import os

sts_client = boto3.client("sts")

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()

artifact_bucket = os.environ['ARTIFACT_BUCKET']
prefix = os.environ['MODEL_NAME']
image_repo = os.environ['IMAGE_REPO']

print('artifact bucket: {}'.format(artifact_bucket))
print('image repo: {}'.format(image_repo))
print('data bucket: {}/{}'.format(bucket, prefix))
print('role: {}'.format(role))

### Create the training job descriptor

This includes some hyperparameters:

In [None]:
hyperparameters = {
    "epochs": 100,
    "batch_size": 128,
}
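
The SageMaker `CreateTrainingJob` API takes hyperparameters as a map of strings to strings, which is why the notebook converts every value with `str()` before adding it to the job descriptor. A minimal sketch of that conversion follows; the helper name `stringify_hyperparameters` is hypothetical and not part of the notebook.

```python
def stringify_hyperparameters(params):
    """Return a new dict with every value converted to str.

    Hypothetical helper: the notebook does the same conversion in place
    with a for loop before building the job descriptor.
    """
    return {name: str(value) for name, value in params.items()}


example = stringify_hyperparameters({"epochs": 100, "batch_size": 128})
print(example)
```

Returning a new dict rather than mutating the input keeps the original numeric values available if they are needed elsewhere in the notebook.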

Next, set the training image and the job name, then build the rest of the training job descriptor.

In [None]:
account_id = sts_client.get_caller_identity()["Account"]
region = boto3.session.Session().region_name
training_image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region, image_repo)

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = prefix + timestamp

training_params = {}

# Here we set the reference for the training Docker image, stored on ECR (https://aws.amazon.com/pt/ecr/)
training_params["AlgorithmSpecification"] = {
    "TrainingImage": training_image,
    "TrainingInputMode": "File",
    "MetricDefinitions": [
        {'Name':'train:loss', 'Regex':'Train Loss: (.*?);'},
        {'Name':'train:accuracy', 'Regex':'Train Accuracy: (.*?)%;'},
        {'Name':'val:loss', 'Regex':'Validation Loss: (.*?);'},
        {'Name':'val:accuracy', 'Regex':'Validation Accuracy: (.*?)%;'},
        {'Name':'test:loss', 'Regex':'Test Loss: (.*?);'},
        {'Name':'test:accuracy', 'Regex':'Test Accuracy: (.*?)%;'}
    ]
}

# The IAM role with all the permissions given to SageMaker
training_params["RoleArn"] = role

# Here SageMaker will store the final trained model
training_params["OutputDataConfig"] = {
    "S3OutputPath": 's3://{}/{}'.format(bucket, prefix)
}

# This is the config of the instance that will execute the training
training_params["ResourceConfig"] = {
    "InstanceCount": 1,
    "InstanceType": "ml.m4.xlarge",
    "VolumeSizeInGB": 30
}

# The job name. You'll see this name in the Jobs section of the SageMaker console
training_params["TrainingJobName"] = job_name

# Here you will configure the hyperparameters used for training your model.
# SageMaker expects every hyperparameter value to be a string.
training_params["HyperParameters"] = {
    name: str(value) for name, value in hyperparameters.items()
}

# Training timeout (360000 seconds = 100 hours)
training_params["StoppingCondition"] = {
    "MaxRuntimeInSeconds": 360000
}

# The algorithm currently only supports FullyReplicated mode (where the data is copied onto each machine)
training_params["InputDataConfig"] = [{
    "ChannelName": "training",
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": 's3://{}/{}/input/training'.format(bucket, prefix),
            "S3DataDistributionType": "FullyReplicated"
        }
    },
    "ContentType": "text/csv",
    "CompressionType": "None"
},{
    "ChannelName": "validation",
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": 's3://{}/{}/input/validation'.format(bucket, prefix),
            "S3DataDistributionType": "FullyReplicated"
        }
    },
    "ContentType": "text/csv",
    "CompressionType": "None"
}]
training_params["Tags"] = []
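
Each entry in `MetricDefinitions` pairs a metric name with a regular expression. SageMaker applies the expression to the training job's log output and records the captured group as the metric value. The sketch below applies two of the notebook's expressions to a made-up log line; the log format in `sample_line` is an assumption and has to match whatever the training container actually prints, and `extract_metrics` is a hypothetical helper.

```python
import re

# Two of the definitions used in the training job descriptor.
metric_definitions = [
    {"Name": "train:loss", "Regex": "Train Loss: (.*?);"},
    {"Name": "train:accuracy", "Regex": "Train Accuracy: (.*?)%;"},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: float value} for every regex that matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


# Assumed log format: each value is terminated by ';' as the regexes require.
sample_line = "Train Loss: 0.1234; Train Accuracy: 95.50%;"
metrics = extract_metrics(sample_line, metric_definitions)
print(metrics)
```

Running a check like this against a real line from the container's logs is a quick way to confirm the expressions will match before a long training job starts.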

### Upload training data

Validate the training and validation sets, then upload them to S3.

In [None]:
train_loc = sagemaker_session.upload_data(path='input/data/training', key_prefix=prefix+'/input/training')
val_loc = sagemaker_session.upload_data(path='input/data/validation', key_prefix=prefix+'/input/validation')

print('training: {}'.format(train_loc))
print('validation: {}'.format(val_loc))
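
One simple form of validation is to confirm that each local data directory exists and contains at least one file before calling `upload_data`. The following is a minimal sketch under that assumption; `list_data_files` is a hypothetical helper, not something the notebook or the SageMaker SDK provides.

```python
import os


def list_data_files(directory):
    """Return the sorted file names in `directory`, or raise if there are none."""
    if not os.path.isdir(directory):
        raise FileNotFoundError("missing data directory: {}".format(directory))
    names = sorted(
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    if not names:
        raise ValueError("no files to upload in: {}".format(directory))
    return names
```

In this notebook it could be called as `list_data_files('input/data/training')` and `list_data_files('input/data/validation')` before the upload cell.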

### Configure monitoring inputs

Set the data capture configuration for the endpoints. We need the following S3 locations:

1. Data capture log output
2. Baseline input location, with the baseline file uploaded to S3
3. Baseline results location
4. Schedule reports location

In [None]:
data_capture_uri = 's3://{}/{}/datacapture'.format(bucket, prefix)
print('data capture uri: {}'.format(data_capture_uri))

Use the output predictions from testing as the baseline file. Make sure this file has a header row.

In [None]:
# Inspect the output predictions (NOTE: if using scientific format these will be treated as strings)
baseline_file = 'output/data/predictions.csv'
!head -2 $baseline_file
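
The header check can also be done in Python rather than by eye. The sketch below uses a simple heuristic: a first row counts as a header if none of its cells parses as a number. Both the heuristic and the column names in the examples are assumptions, and `first_row_looks_like_header` is a hypothetical helper.

```python
import csv
import io


def first_row_looks_like_header(csv_text):
    """Heuristic: a header row has no cell that parses as a number."""
    reader = csv.reader(io.StringIO(csv_text))
    first_row = next(reader, None)
    if not first_row:
        return False
    for cell in first_row:
        try:
            float(cell)
        except ValueError:
            continue  # not numeric, so it may be a column name
        return False  # a numeric cell means this is probably a data row
    return True


# Hypothetical column names, for illustration only.
print(first_row_looks_like_header("label,prediction\n1,0.93\n"))  # True
print(first_row_looks_like_header("1,0.93\n0,0.12\n"))            # False
```

To check the real file, read the first few lines of `baseline_file` into a string and pass that string to the function.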

In [None]:
# Upload the predictions as baseline file
boto3.Session().resource('s3').Bucket(bucket).Object(baseline_file).upload_file(baseline_file)

In [None]:
# Define the S3 locations of the baseline data file and the baseline results
baseline_prefix = prefix + '/baselining'
baseline_results_prefix = baseline_prefix + '/results'

baseline_data_uri = 's3://{}/{}'.format(bucket, baseline_file)
baseline_results_uri = 's3://{}/{}'.format(bucket, baseline_results_prefix)
print('Baseline data file: {}'.format(baseline_data_uri))
print('Baseline results uri: {}'.format(baseline_results_uri))

Let's define the location for the monitoring schedule outputs.

In [None]:
monitoring_reports_uri = 's3://{}/{}/monitoring/reports'.format(bucket, prefix)

print('monitoring reports: {}'.format(monitoring_reports_uri))

Set the training job hash so we can force an update of the deployment.

Until `AutoPublishCodeSha256` is supported as a way to force Lambda redeployment ([see PR](https://github.com/awslabs/serverless-application-model/pull/1376)), we need to update the Lambda zip contents ourselves.

In [None]:
import hashlib
import json

training_hash = hashlib.sha256(json.dumps(training_params).encode('utf-8')).hexdigest()
print('training hash: {}'.format(training_hash))

# TEMP: Write a new file to the API directory to force refresh
with open('../../api/training_hash.txt', 'w') as f:
    f.write(training_hash)
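
Because `training_params` contains the timestamped job name, this hash changes on every run, which is what forces the redeployment. The hash also depends on the order in which keys were added to the dictionary. The sketch below is a variation, not the notebook's code: it adds `sort_keys=True` so that two dictionaries with the same content always produce the same digest. The helper name `params_hash` and the sample values are hypothetical.

```python
import hashlib
import json


def params_hash(params):
    """SHA-256 of a canonical (key-sorted) JSON serialization of `params`."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content, different insertion order (placeholder values).
a = {"TrainingJobName": "demo", "RoleArn": "role"}
b = {"RoleArn": "role", "TrainingJobName": "demo"}

print(json.dumps(a) == json.dumps(b))    # False: plain dumps keeps insertion order
print(params_hash(a) == params_hash(b))  # True: sorted keys give one canonical form
```

Sorting the keys only matters if the descriptor might be assembled in a different order between runs; with the fixed cell order used here, the unsorted version is already stable.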

Save the training job and monitoring parameters as JSON files.

In [None]:
monitoring_params = {
    'TrainSha256': training_hash,
    'DataCaptureUri': data_capture_uri,
    'MonitoringRoleArn': role,
    'BaselineInputUri': baseline_data_uri,
    'BaselineResultsUri':  baseline_results_uri,
    'ScheduleReportsUri': monitoring_reports_uri,
    'ScheduleMetricName': 'feature_baseline_drift_class_predictions', # alarm on class predictions drift
    'ScheduleMetricThreshold': str(0.4) # Must serialize parameters as string
}

with open('trainingjob.json', 'w') as f:
    json.dump(training_params, f)
with open('monitoring.json', 'w') as f:
    json.dump(monitoring_params, f)
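
Before pushing, it can be worth reading the JSON back and checking that nothing was dropped. The sketch below checks a JSON document for the eight keys that the cell above writes to `monitoring.json`. Whether the downstream pipeline requires all of them is an assumption, and `missing_monitoring_keys` is a hypothetical helper.

```python
import json

# The keys written to monitoring.json by the cell above.
EXPECTED_MONITORING_KEYS = {
    "TrainSha256",
    "DataCaptureUri",
    "MonitoringRoleArn",
    "BaselineInputUri",
    "BaselineResultsUri",
    "ScheduleReportsUri",
    "ScheduleMetricName",
    "ScheduleMetricThreshold",
}


def missing_monitoring_keys(document_text):
    """Return the sorted list of expected keys absent from a JSON document."""
    loaded = json.loads(document_text)
    return sorted(EXPECTED_MONITORING_KEYS - set(loaded))


# A deliberately incomplete document, for illustration only.
incomplete = json.dumps({"TrainSha256": "abc"})
print(missing_monitoring_keys(incomplete))
```

For the real file, pass the result of `open('monitoring.json').read()`; an empty list means every expected key is present.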

### Upload deployment artifacts

Generate the CloudFormation template for the serverless API endpoints, uploading the code to the artifact bucket.

In [None]:
!aws cloudformation package --template-file ../../assets/deploy-model-prd.yml \
    --output-template-file ../../assets/template-model-prd.yml --s3-bucket $artifact_bucket

Verify that the template has been generated correctly.

In [None]:
!cat ../../assets/template-model-prd.yml

## Ok, now it's time to push everything to the repo

In [None]:
%%bash

cd ../../../mlops-workshop-images/master
mkdir -p assets

cp $OLDPWD/trainingjob.json $OLDPWD/monitoring.json .
cp ../../mlops-workshop/assets/template-model-prd.yml assets/deploy-model-prd.yml  # Save as original name
cp ../../mlops-workshop/assets/deploy-model-dev.yml assets/deploy-model-dev.yml
cp ../../mlops-workshop/assets/wait-training-job.yml assets/wait-training-job.yml

git add --all
git commit -a -m " - test updated deployment"
git push

### Ok, now open the AWS console in another tab and go to the CodePipeline console to see the status of our build pipeline

> Finally, click here [NOTEBOOK](04_Check%20Progress%20and%20Test%20the%20endpoint.ipynb) to see the progress and test your endpoint