# Steps to start training your custom TensorFlow model in AWS SageMaker
# Training a TensorFlow 2.1 model using a custom container

Some sections of this notebook have been inspired by the following tutorials:
**SML Keras Training with Amazon SageMaker**
https://github.com/pranaychandekar/keras-sagemaker-train

**Script mode training with custom container from sagemaker-examples**
https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/custom-training-containers/script-mode-container/notebook/script-mode-container.ipynb

In this notebook we describe the most relevant steps to start training a custom algorithm in AWS SageMaker using a custom container, showing how to deal with experiments and how to solve some of the problems that arise with custom models when using SageMaker script mode.

**Problem description**

The following steps will be explained:  
1. Create an Experiment and Trial to keep track of our experiments
2. Load the training data to our training instance
3. Create the scripts to train our custom model, a Transformer
4. Create an Estimator to train our model in a TensorFlow 2.1 container in script mode
5. Create metric definitions to keep track of the metrics in SageMaker
6. Download the trained model to make predictions
7. Resume training using the latest checkpoint from a previous training job


# Set up the environment and load the libraries

Let's start by setting up the environment:

In [1]:
%load_ext autoreload
%autoreload 2

In [1]:
import os
import sagemaker
from sagemaker import get_execution_role
import time
import pickle
import boto3

sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

In [2]:
print(sagemaker_session)
print(role)
print(region)

<sagemaker.session.Session object at 0x7f60bdbddda0>
arn:aws:iam::223817798831:role/service-role/AmazonSageMaker-ExecutionRole-20200708T194212
us-east-1


## Define variables with data location and output location in S3 bucket

In [3]:
#column_list_file = 'iris_train_column_list.txt'
data_folder_name='data'
train_filename = 'spa.txt'
non_breaking_en = 'nonbreaking_prefix.en'
non_breaking_es = 'nonbreaking_prefix.es'
trainedmodel_path = 'trained_model'
output_data_path = 'output_data'
model_info_file = 'model_info.pth'
input_vocab_file = 'in_vocab.pkl'
output_vocab_file = 'out_vocab.pkl'

train_file = os.path.abspath(os.path.join(data_folder_name, train_filename))
non_breaking_en_file = os.path.abspath(os.path.join(data_folder_name, non_breaking_en))
non_breaking_es_file = os.path.abspath(os.path.join(data_folder_name, non_breaking_es))

Define the working bucket name for this project or experiment and the three locations in S3 we will deal with:
- Training data
- Model and Output data
- Checkpoint data

In [4]:
# Specify your bucket name
bucket_name = 'edumunozsala-ml-sagemaker'
project_name = 'transformer-nmt-custom'

training_data_folder = r'{}/data'.format(project_name)
output_folder = r'{}'.format(project_name)
ckpt_folder = r'{}/ckpt'.format(project_name)

training_data_uri = r's3://' + bucket_name + r'/' + training_data_folder
output_data_uri = r's3://' + bucket_name + r'/' + output_folder
ckpt_data_uri = r's3://' + bucket_name + r'/' + ckpt_folder

In [5]:
training_data_uri,output_data_uri,ckpt_data_uri

('s3://edumunozsala-ml-sagemaker/transformer-nmt-custom/data',
 's3://edumunozsala-ml-sagemaker/transformer-nmt-custom',
 's3://edumunozsala-ml-sagemaker/transformer-nmt-custom/ckpt')

If the data is not yet in the S3 folder we upload it in the next code section:

In [6]:
sagemaker_session.upload_data(train_file,
                              bucket=bucket_name, 
                              key_prefix=training_data_folder)

sagemaker_session.upload_data(non_breaking_en_file,
                              bucket=bucket_name, 
                              key_prefix=training_data_folder)

sagemaker_session.upload_data(non_breaking_es_file,
                              bucket=bucket_name, 
                              key_prefix=training_data_folder)

's3://edumunozsala-ml-sagemaker/transformer-nmt-custom/data/nonbreaking_prefix.es'
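
The upload cell above runs unconditionally. A minimal sketch of an "upload only if missing" check could look like the following. It assumes a boto3 S3 client is passed in, and the helper name `s3_key_exists` is hypothetical (it is not part of the SageMaker SDK).

```python
def s3_key_exists(s3_client, bucket, key):
    """Return True if an object with exactly this key exists in the bucket.

    `s3_client` is expected to behave like boto3.client('s3'); only its
    list_objects_v2 method is used.
    """
    # Keys are listed in sorted order, so an exact match (if any) comes first
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    # 'Contents' is absent from the response when nothing matches the prefix
    return any(obj['Key'] == key for obj in response.get('Contents', []))
```

With such a helper, each `sagemaker_session.upload_data(...)` call could be guarded by `if not s3_key_exists(boto3.client('s3'), bucket_name, training_data_folder + '/' + train_filename):`.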

## Create an experiment and trial

In [7]:
# Install the library necessary to handle experiments
!pip install sagemaker-experiments

Collecting sagemaker-experiments
  Using cached sagemaker_experiments-0.1.24-py3-none-any.whl (36 kB)
Installing collected packages: sagemaker-experiments
Successfully installed sagemaker-experiments-0.1.24
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/tensorflow2_p36/bin/python -m pip install --upgrade pip' command.


Load the libraries

In [7]:
# Import the libraries to work with Experiments in SageMaker
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent

In [8]:
# Set the experiment name
experiment_name='tf-transformer-custom'
# Set the trial name 
trial_name="{}-{}".format(experiment_name,'single-gpu-custom')

tags = [{'Key': 'my-experiments', 'Value': 'transformerEngSpa1Custom'}]

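Create the experiment and the trial, or load them if they already exist. The code below first tries to load each one and creates it only when the load fails with a `ResourceNotFound` error: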

In [9]:
# Create the experiment if it doesn't exist
try:
    training_experiment = Experiment.load(experiment_name=experiment_name)
    print('Loaded experiment ',experiment_name)
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        training_experiment = Experiment.create(experiment_name=experiment_name,
                                      description = "Experiment to track trainings on my tensorflow Transformer Eng-Spa", 
                                      tags = tags)
        print('Created experiment ',experiment_name)
    else:
        # Any other error is unexpected, so do not hide it
        raise
# Create the trial if it doesn't exist
try:
    single_gpu_trial = Trial.load(trial_name=trial_name)
    print('Loaded trial ',trial_name)
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        single_gpu_trial = Trial.create(experiment_name=experiment_name, 
                             trial_name= trial_name,
                             tags = tags)
        print('Created trial ',trial_name)
    else:
        # Any other error is unexpected, so do not hide it
        raise

Loaded experiment  tf-transformer-custom
Loaded trial  tf-transformer-custom-single-gpu-custom


In [10]:
# Create a configuration definition for our experiment and trial
trial_comp_name = 'single-gpu-custom-job'
# Set the configuration parameters for the experiment
experiment_config = {'ExperimentName': training_experiment.experiment_name, 
                       'TrialName': single_gpu_trial.trial_name,
                       'TrialComponentDisplayName': trial_comp_name}

Check and show information about the experiment and trial

In [11]:
#"{}-{}".format(trail_name, experiment_name)
print('Experiment: ',training_experiment.experiment_name)
# Show the trials in the experiment
for trial in training_experiment.list_trials():
    print('Trial: ',trial.trial_name)

Experiment:  tf-transformer-custom
Trial:  tf-transformer-custom-single-gpu-custom


# Construct a script for training

This tutorial's training script was adapted from TensorFlow's official [CNN MNIST example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py). We have modified it to handle the ``model_dir`` parameter passed in by SageMaker. This is an S3 path which can be used for data sharing during distributed training and checkpointing and/or model persistence. We have also added an argument-parsing function to handle processing training-related variables.

At the end of the training job we have added a step to export the trained model to the path stored in the environment variable ``SM_MODEL_DIR``, which always points to ``/opt/ml/model``. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at end of training.
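
To make the argument handling described above more concrete, here is a minimal sketch of an argument-parsing function for a SageMaker training script. It is not the actual `train/train.py`: the argument names simply mirror some of the hyperparameters used in the estimator later in this notebook, and the default paths are the usual SageMaker container locations, taken here as assumptions. Note that hyperparameters arrive as strings, so a boolean such as `resume` needs explicit conversion (`bool('False')` is `True` in Python).

```python
import argparse
import os


def str_to_bool(value):
    """Interpret the string form of a boolean hyperparameter."""
    return str(value).strip().lower() in ('true', '1', 'yes')


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters sent by the estimator arrive as command-line strings
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--nsamples', type=int, default=20000)
    parser.add_argument('--resume', type=str_to_bool, default=False)
    parser.add_argument('--train_file', type=str, default='spa.txt')
    # Locations that SageMaker provides through environment variables
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--data_dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING',
                                               '/opt/ml/input/data/training'))
    parser.add_argument('--output_data_dir', type=str,
                        default=os.environ.get('SM_OUTPUT_DATA_DIR',
                                               '/opt/ml/output/data'))
    # Ignore unknown arguments instead of failing on them
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(['--epochs', '5', '--resume', 'False'])
print(args.epochs, args.resume, args.model_dir)
```

Using `parse_known_args` is a design choice for this sketch: it keeps the script from failing if the container passes extra arguments that the script does not declare.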

**output_data_dir**

**checkpoint_uri**

Here is the entire script:

In [21]:
!pygmentize 'train/train.py'

import argparse
import json
import sys
#import sagemaker_containers

import math
import os
import gc
import time
import pandas as pd
import pickle

import tensorflow as tf

# To install tensorflow_datasets
import subprocess

def install(package):
    subprocess.check_call([sys.executable, "-q", "-m", "pip"

## Create the custom container image and register in Amazon ECR

In [54]:
a='Dockerfile.gpu'

In [63]:
%%sh

echo {a}

{a}


In [64]:
%%sh

# The name of our algorithm
algorithm_name=transformer-nmt-custom-gpu

chmod +x train/*

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

# On a SageMaker Notebook Instance, the docker daemon may need to be restarted in order
# to detect your network configuration correctly.  (This is a known issue.)
if [ -d "/home/ec2-user/SageMaker" ]; then
  sudo service docker restart
fi

# Uncomment the line below (and comment out the GPU build) to build a CPU image
#docker build  -t ${algorithm_name} -f Dockerfile.cpu .

# Build the GPU image
docker build  -t ${algorithm_name} -f Dockerfile.gpu . 

docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Login Succeeded
Stopping docker: [  OK  ]
Starting docker:	.[  OK  ]
Sending build context to Docker daemon  42.53MB
Step 1/14 : FROM tensorflow/tensorflow:latest-gpu-py3
latest-gpu-py3: Pulling from tensorflow/tensorflow
7ddbc47eeb70: Pulling fs layer
c1bbdc448b72: Pulling fs layer
8c3b70e39044: Pulling fs layer
45d437916d57: Pulling fs layer
d8f1569ddae6: Pulling fs layer
85386706b020: Pulling fs layer
ee9b457b77d0: Pulling fs layer
bebfcc1316f7: Pulling fs layer
644140fd95a9: Pulling fs layer
d6c0f989e873: Pulling fs layer
7a8e64f26211: Pulling fs layer
c33b03e4dd22: Pulling fs layer
bca93af797c1: Pulling fs layer
47f6c197be35: Pulling fs layer
e5da48aa9554: Pulling fs layer
ca68d98a90c4: Pulling fs layer
644140fd95a9: Waiting
d6c0f989e873: Waiting
7a8e64f26211: Waiting
45d437916d57: Waiting
c33b03e4dd22: Waiting
bca93af797c1: Waiting
d8f1569ddae6: Waiting
85386706b020: Waiting
ee9b457b77d0: Waiting
bebfcc1316f7: Waiting
ca68d98a90c4: Waiting
47f6c197be35: Waiting
e5da48aa9554: 

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



# Create a training job using the `TensorFlow` estimator

The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to an S3 location and creating a SageMaker training job. In this notebook we use the generic `sagemaker.estimator.Estimator` with our own container image instead, so the TensorFlow-specific parameters appear commented out in the estimator definition. Let's call out a couple of important parameters of the `TensorFlow` estimator:

* `py_version` is set to `'py3'` to indicate that we are using script mode, since legacy mode supports only Python 2. Though Python 2 will be deprecated soon, you can use script mode with Python 2 by setting `py_version` to `'py2'` and `script_mode` to `True`.

* `distributions` is used to configure the distributed training setup. It's required only if you are doing distributed training either across a cluster of instances or across multiple GPUs. This notebook trains on a single instance, so `distributions` is left commented out. SageMaker training jobs run on homogeneous clusters. When parameter servers are used as the distributed training schema, SageMaker runs a parameter server on every instance in the cluster to make them more performant, so there is no need to specify the number of parameter servers to launch. Script mode also supports distributed training with [Horovod](https://github.com/horovod/horovod). You can find the full documentation on how to configure `distributions` [here](https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#distributed-training). 



In [12]:
#from sagemaker.tensorflow import TensorFlow
from sagemaker.estimator import Estimator

Define variables for account, region and container image:

In [13]:
account = boto3.client('sts').get_caller_identity().get('Account') # aws account 
#container_image = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account, region, project_name) # algorithm image path in ECR
container_image = '{}.dkr.ecr.{}.amazonaws.com/{}-gpu'.format(account, region, project_name) # algorithm image path in ECR
print('container image uri: ',container_image)

container image uri:  223817798831.dkr.ecr.us-east-1.amazonaws.com/transformer-nmt-custom-gpu


You can also instantiate an estimator to train with a TensorFlow 2.1 script. The only things that you will need to change are the script name and ``framework_version``.

In [14]:
#instance_type='ml.m5.xlarge'
#instance_type='ml.m4.4xlarge'
instance_type='ml.p2.xlarge'
#instance_type='local'

- Define the use of checkpoints and resuming: how to use them and how they work

`checkpoint_local_path` is the local path that the algorithm writes its checkpoints to. SageMaker persists all files under this path to `checkpoint_s3_uri` continually during training. On job startup the reverse happens: data from the S3 location is downloaded to this path before the algorithm is started. If the path is unset, SageMaker assumes the checkpoints will be provided under `/opt/ml/checkpoints/`.

- Define the use of metrics

You can monitor the metrics that a training job emits in real time in the **CloudWatch console**.

To monitor training job metrics (CloudWatch console):

1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
2. Choose Metrics, then choose /aws/sagemaker/TrainingJobs.
3. Choose TrainingJobName.
4. On the All metrics tab, choose the names of the training metrics that you want to monitor.
5. On the Graphed metrics tab, configure the graph options. For more information about using CloudWatch graphs, see Graph Metrics in the Amazon CloudWatch User Guide.

You can also monitor the metrics that a training job emits in real time by using the **SageMaker console**.

To monitor training job metrics (SageMaker console):

1. Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.
2. Choose Training jobs, then choose the training job whose metrics you want to see.
3. Choose TrainingJobName.
4. In the Monitor section, you can review the graphs of instance utilization and algorithm metrics.



In [15]:
# Define the metrics to search for
metric_definitions = [{'Name': 'loss', 'Regex': 'Loss ([0-9\\.]+)'},{'Name': 'Accuracy', 'Regex': 'Accuracy ([0-9\\.]+)'}]
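
Each metric definition pairs a metric name with a regular expression whose first capture group holds the value. The matching itself happens on the SageMaker side; the sketch below only checks the two expressions locally against a line in the format that the training job prints.

```python
import re

metric_definitions = [
    {'Name': 'loss', 'Regex': 'Loss ([0-9\\.]+)'},
    {'Name': 'Accuracy', 'Regex': 'Accuracy ([0-9\\.]+)'},
]

# A line in the format printed by the training job
log_line = 'Epoch 1 Batch 100 Loss 3.5298 Accuracy 0.0241'

extracted = {}
for definition in metric_definitions:
    match = re.search(definition['Regex'], log_line)
    if match:
        # The first capture group holds the metric value
        extracted[definition['Name']] = float(match.group(1))

print(extracted)
```

Testing the expressions this way before launching a job helps catch a regex that never matches, which would otherwise only show up as a missing metric.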

In [16]:
estimator = Estimator(#entry_point='train.py',
                       #source_dir="train",
                       role=role,
                       instance_count=1,
                       instance_type=instance_type,
                       #framework_version='2.1.0',
                       image_uri=container_image,
                       #py_version='py3',
                       output_path=output_data_uri,
                       code_location=output_data_uri,
                       base_job_name='tf-transformer',
                       script_mode= True,
                       #checkpoint_local_path = 'ckpt',
                       checkpoint_s3_uri = ckpt_data_uri,
                       metric_definitions = metric_definitions, 
                       hyperparameters={
                        'epochs': 5,
                        'nsamples': 20000,
                        'resume': False,
                        'train_file': 'spa.txt',
                        'non_breaking_in': 'nonbreaking_prefix.en',
                        'non_breaking_out': 'nonbreaking_prefix.es'
                       })
                       #distributions={'parameter_server': {'enabled': False}})

## Calling ``fit``

To start a training job, we call `estimator.fit(training_data_uri)`.

An S3 location is used here as the input. `fit` creates a default channel named `'training'`, which points to this S3 location. In the training script we can then access the training data from the location stored in `SM_CHANNEL_TRAINING`. `fit` accepts a couple other types of input as well. See the API doc [here](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit) for details.

When training starts, the container executes the training script, passing the `hyperparameters` from the estimator as script arguments. With the hyperparameters defined above, the script execution is roughly as follows (the order of the arguments may differ):
```bash
python train.py --epochs 5 --nsamples 20000 --resume False --train_file spa.txt --non_breaking_in nonbreaking_prefix.en --non_breaking_out nonbreaking_prefix.es
```
When training is complete, the training job uploads the saved model artifacts to the S3 output location.

Call `fit` to train a model with the TensorFlow 2.1 script.

In [17]:
#job_name=f'tensorflow-single-gpu-{time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())}'
job_name = '{}-{}'.format(trial_name,time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()))
print(job_name)

tf-transformer-custom-single-gpu-custom-2020-11-10-17-42-47
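
Because the job name is built from the trial name plus a timestamp, it can grow long. To our understanding of the `CreateTrainingJob` API reference, a training job name may contain only letters, digits and hyphens, must start and end with a letter or digit, and is limited to 63 characters. The sketch below checks a candidate name under those assumed constraints; the helper name is hypothetical.

```python
import re
import time

# Pattern and length limit assumed from the TrainingJobName field documentation
JOB_NAME_PATTERN = re.compile(r'^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$')
MAX_JOB_NAME_LENGTH = 63


def is_valid_job_name(name):
    """Check a candidate training job name against the assumed constraints."""
    return len(name) <= MAX_JOB_NAME_LENGTH and bool(JOB_NAME_PATTERN.match(name))


trial_name = 'tf-transformer-custom-single-gpu-custom'
job_name = '{}-{}'.format(trial_name,
                          time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime()))
print(job_name, is_valid_job_name(job_name))
```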


In [18]:
#estimator.fit({'training':training_data_uri,'testing':testing_data_uri})
estimator.fit({'training':training_data_uri}, job_name = job_name, 
              experiment_config = experiment_config)

INFO:sagemaker:Creating training-job with name: tf-transformer-custom-single-gpu-custom-2020-11-10-17-42-47


2020-11-10 17:42:52 Starting - Starting the training job...
2020-11-10 17:42:54 Starting - Launching requested ML instances.........
2020-11-10 17:44:27 Starting - Preparing the instances for training......
2020-11-10 17:45:40 Downloading - Downloading input data...
2020-11-10 17:46:10 Training - Downloading the training image...............
2020-11-10 17:48:46 Training - Training image download completed. Training in progress.
2020-11-10 17:48:46,233 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/usr/bin/python3 -m pip install -r requirements.txt
2020-11-10 17:48:47,885 sagemaker-training-toolkit INFO     Failed to parse hyperparameter resume value False to Json.
Returning the value itself
2020-11-10 17:48:47,886 sagemaker-training-toolkit INFO     Failed to parse hyperparameter non_breaking_out value nonbreaking_prefix.es to Json.
Returning the value itself
2020-11-10 17:48:47,886 sagemaker

2020-11-10 17:49:15.999078: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-10 17:49:16.206249: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-10 17:49:16.207126: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:00:1e.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8755GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-11-10 17:49:16.207178: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-10 17:49:16.316889: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020

/opt/ml/model None
Get the train data
Tokenize the input and output data and create the vocabularies
Input vocab:  5234
Output vocab:  10042
Creating the checkpoint ...
Training the model ....
Starting epoch 1
Epoch 1 Batch 0 Loss 3.4101 Accuracy 0.0000
Epoch 1 Batch 100 Loss 3.5298 Accuracy 0.0241
Epoch 1 Batch 200 Loss 3.4576 Accuracy 0.0477
Epoch 1 Batch 300 Loss 3.3463 Accuracy 0.0556
Saving checkpoint for epoch 1 in /opt/ml/checkpoints/ckpt-1
Starting epoch 2
Epoch 2 Batch 0 Loss 2.9338 Accuracy 0.0725
Epoch 2 Batch 100 Loss 2.6994 Accuracy 0.0988
Epoch 2 Batch 200 Loss 2.5204 Accuracy 0.1142
Epoch 2 Batch 300 Loss 2.3763 Accuracy 0.1186
Saving checkpoint for epoch 2 in /opt/ml/checkpoints/ckpt-2
Starting epoch 3
Epoch 3 Batch 0 Loss 1.9738 Accuracy 0.1272
Epoch 3 Batch 100 Loss 1.8206 Accu

Save the experiment; you can then view it and its trials in SageMaker Studio.

In [32]:
# Save the trial
single_gpu_trial.save()
# Save the experiment
training_experiment.save()

Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f849be47e80>,experiment_name='tf-transformer',experiment_arn='arn:aws:sagemaker:us-east-1:223817798831:experiment/tf-transformer',display_name='tf-transformer',description='Experiment to track trainings on my tensorflow Transformer Eng-Spa',creation_time=datetime.datetime(2020, 11, 8, 17, 0, 49, 116000, tzinfo=tzlocal()),created_by={},last_modified_time=datetime.datetime(2020, 11, 9, 16, 10, 7, 126000, tzinfo=tzlocal()),last_modified_by={},response_metadata={'RequestId': 'e4e4dea1-1b3f-4e86-9571-bd0fa9a7eaf9', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'e4e4dea1-1b3f-4e86-9571-bd0fa9a7eaf9', 'content-type': 'application/x-amz-json-1.1', 'content-length': '86', 'date': 'Mon, 09 Nov 2020 16:39:02 GMT'}, 'RetryAttempts': 0})

## Show metrics from SageMaker Console

The metrics can be viewed in both places described above: the CloudWatch console and the Monitor section of the training job in the SageMaker console.

# Load the trained model

## Attach a previous training job

Look for the training job whose model you want to restore in the SageMaker console, in the Training jobs section.

We can skip the next cell if the previous `estimator.fit` command was executed.

In [35]:
job_name

'tf-transformer-single-gpu-2020-11-09-16-18-20'

In [33]:
from sagemaker.tensorflow import TensorFlow

# Set the training job you want to attach to the estimator object
# Use this option if the training job was not run in this session
#my_training_job_name = 'single-gpu-2020-11-08-18-40-33'

# If the training job was run in this session, we can retrieve its name from the job_name variable
my_training_job_name = job_name
# Attach the estimator to the selected training job
estimator = TensorFlow.attach(my_training_job_name)

ClientError: An error occurred (ValidationException) when calling the DescribeTrainingJob operation: Requested resource not found.

In [34]:
print('Job name where the model will be restored: ',estimator.latest_training_job.job_name)

Job name where the model will be restored:  tf-transformer-single-gpu-2020-11-09-16-18-20


In [38]:
print('Dir of model data: ',estimator.model_data)
print('Dir of output data: ',output_data_uri)
print('Bucket name: ',bucket_name)

Dir of model data:  s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-09-16-18-20/model.tar.gz
Dir of output data:  s3://edumunozsala-ml-sagemaker/transformer-nmt
Bucket name:  edumunozsala-ml-sagemaker


## Download the trained model

In [43]:
#s3_model_path='transformer-nmt/tf-transformer-2020-11-07-17-49-03-516/output/model.tar.gz'
init_model_path = len('s3://')+len(bucket_name)+1
s3_model_path=estimator.model_data[init_model_path:]
s3_output_data=output_data_uri[init_model_path:]+'/{}/output.tar.gz'.format(job_name)
print('Dir to download trained model: ', s3_model_path)
print('Dir to download model outputs: ', s3_output_data)

Dir to download trained model:  transformer-nmt/tf-transformer-single-gpu-2020-11-09-16-18-20/model.tar.gz
Dir to download model outputs:  transformer-nmt/tf-transformer-single-gpu-2020-11-09-16-18-20/output.tar.gz
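
The cell above derives the S3 key by slicing the URI with the length of the `s3://` prefix and the bucket name. An alternative that does not depend on string lengths is to parse the URI with the standard library, as in this small sketch (the helper name `split_s3_uri` is hypothetical):

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('Not an S3 URI: {}'.format(uri))
    # The path starts with '/', which is not part of the object key
    return parsed.netloc, parsed.path.lstrip('/')


bucket, key = split_s3_uri(
    's3://edumunozsala-ml-sagemaker/transformer-nmt/'
    'tf-transformer-single-gpu-2020-11-09-16-18-20/model.tar.gz')
print(bucket)
print(key)
```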


In [44]:
sagemaker_session.download_data(trainedmodel_path,bucket_name,s3_model_path)

In [45]:
sagemaker_session.download_data(output_data_path,bucket_name,s3_output_data)

Next, extract the contents of the model.tar.gz file returned by the training job in SageMaker:

In [48]:
!tar -zxvf $trainedmodel_path/model.tar.gz

transformer.index
checkpoint
transformer.data-00000-of-00001


Extract the files from output.tar.gz without recreating the directory structure, so that all files are extracted to the working directory:

In [52]:
!tar -xvzf $output_data_path/output.tar.gz --strip-components=1

data/out_vocab.pkl
data/in_vocab.pkl
data/model_info.pth


### Import the tensorflow model and load the model

In [53]:
from train.model import Transformer

We need to restore the parameters of the model we have saved in order to build an instance of the Transformer model

In [54]:
# Read the parameters from a dictionary
#model_info_path = os.path.join(model_dir, 'model_info.pth')
with open(model_info_file, 'rb') as f:
    model_info = pickle.load(f)
print('Model parameters',model_info)

Model parameters {'vocab_size_enc': 1976, 'vocab_size_dec': 3865, 'sos_token_input': [1974], 'eos_token_input': [1975], 'sos_token_output': [3863], 'eos_token_output': [3864], 'n_layers': 4, 'd_model': 64, 'ffn_dim': 128, 'n_heads': 8, 'drop_rate': 0.1}


In [55]:
# Create an instance of the Transformer model; the saved weights are loaded into it below
transformer = Transformer(vocab_size_enc=model_info['vocab_size_enc'],
                          vocab_size_dec=model_info['vocab_size_dec'],
                          d_model=model_info['d_model'],
                          n_layers=model_info['n_layers'],
                          FFN_units=model_info['ffn_dim'],
                          n_heads=model_info['n_heads'],
                          dropout_rate=model_info['drop_rate'])

#Load the saved model
# Use a model_name argument to pass in on training and then apply here
transformer.load_weights('transformer')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f847078cac8>

## Make some predictions

In [56]:
# Install the library necessary to tokenize the sentences
!pip install tensorflow-datasets

Collecting tensorflow-datasets
  Downloading tensorflow_datasets-4.1.0-py3-none-any.whl (3.6 MB)
     |████████████████████████████████| 3.6 MB 14.8 MB/s eta 0:00:01
Collecting dill
  Downloading dill-0.3.3-py2.py3-none-any.whl (81 kB)
     |████████████████████████████████| 81 kB 15.6 MB/s eta 0:00:01
Collecting typing-extensions; python_version < "3.8"
  Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Collecting promise
  Downloading promise-2.3.tar.gz (19 kB)
Collecting importlib-resources; python_version < "3.9"
  Downloading importlib_resources-3.3.0-py2.py3-none-any.whl (26 kB)
Collecting tensorflow-metadata
  Downloading tensorflow_metadata-0.25.0-py3-none-any.whl (44 kB)
     |████████████████████████████████| 44 kB 5.2 MB/s  eta 0:00:01
Collecting dataclasses; python_version < "3.7"
  Downloading dataclasses-0.7-py3-none-any.whl (18 kB)
Collecting googleapis-common-protos<2,>=1.52.0
  Downloading googleapis_common_protos-1.52.0-py2.py3-none-any.whl (100

In [57]:
from serve.predict import translate
import tensorflow_datasets as tfds

INFO:matplotlib.font_manager:generated new fontManager


Load the input and output tokenizers (vocabularies) used in training. We need them to encode and decode the sentences.

In [58]:
# Load the pickled input and output tokenizers
#model_info_path = os.path.join(model_dir, 'model_info.pth')
with open(input_vocab_file, 'rb') as f:
    tokenizer_inputs = pickle.load(f)

with open(output_vocab_file, 'rb') as f:
    tokenizer_outputs = pickle.load(f)


In [59]:
#Show some translations
sentence = "you should pay for it."
print("Input sentence: {}".format(sentence))
predicted_sentence = translate(transformer,sentence,tokenizer_inputs, tokenizer_outputs,15,model_info['sos_token_input'],
                               model_info['eos_token_input'],model_info['sos_token_output'],
                               model_info['eos_token_output'])
print("Output sentence: {}".format(predicted_sentence))

Input sentence: you should pay for it.
Output sentence: 


In [60]:
#Show some translations
sentence = "This is a really powerful method!"
print("Input sentence: {}".format(sentence))
predicted_sentence = translate(transformer,sentence,tokenizer_inputs, tokenizer_outputs,15,model_info['sos_token_input'],
                               model_info['eos_token_input'],model_info['sos_token_output'],
                               model_info['eos_token_output'])
print("Output sentence: {}".format(predicted_sentence))

Input sentence: This is a really powerful method!
Output sentence: 


# Deploy the trained model to an endpoint

To deploy the model on SageMaker, we first try to export it as a SavedModel from the Transformer instance created above.

In [None]:
transformer_fn = "seq2seq_encoder"

In [65]:
import tensorflow as tf

deploy_model_path='deploy_model/transformer_deploy'
tf.saved_model.save(transformer, deploy_model_path)
#transformer.save(deploy_model_path)

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: deploy_model/transformer_deploy/assets


In [66]:
t2 = tf.saved_model.load(deploy_model_path)

The `deploy()` method creates a SageMaker model, which is then deployed to an endpoint to serve prediction requests in real time. We will use the TensorFlow Serving container for the endpoint, because we trained with script mode. This serving container runs an implementation of a web server that is compatible with the SageMaker hosting protocol. The *Using your own inference code* document in the SageMaker documentation explains how SageMaker runs inference containers.

Deploy the trained TensorFlow 2.1 model to an endpoint.

In [67]:
predictor = estimator.deploy(initial_instance_count=1, instance_type=instance_type)

See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
INFO:sagemaker:Creating model with name: tf-transformer-2020-11-09-17-29-19-495
INFO:sagemaker:Creating endpoint with name tf-transformer-2020-11-09-17-29-19-495
INFO:sagemaker.local.image:serving
INFO:sagemaker.local.image:creating hosting dir in /tmp/tmpdig2owyk
INFO:sagemaker.local.image:docker command: docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.1.0-cpu
INFO:sagemaker.local.image:image pulled: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.1.0-cpu
INFO:sagemaker.local.image:No AWS credentials found in session but credentials from EC2 Metadata Service are available.
INFO:sagemaker.local.image:docker compose file: 
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-ugjn2:
    command: serve
    environment:
    - SAGEMAKER_TFS_NGINX_LOGLEVEL=info
    image: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.1.0-cpu
    n

Attaching to tmpdig2owyk_algo-1-ugjn2_1
algo-1-ugjn2_1  | INFO:__main__:starting services
algo-1-ugjn2_1  | Traceback (most recent call last):
algo-1-ugjn2_1  |   File "/sagemaker/serve.py", line 388, in <module>
algo-1-ugjn2_1  |     ServiceManager().start()
algo-1-ugjn2_1  |   File "/sagemaker/serve.py", line 342, in start
algo-1-ugjn2_1  |     self._create_tfs_config()
algo-1-ugjn2_1  |   File "/sagemaker/serve.py", line 92, in _create_tfs_config
algo-1-ugjn2_1  |     raise ValueError('no SavedModel bundles found!')
algo-1-ugjn2_1  | ValueError: no SavedModel bundles found!
tmpdig2owyk_algo-1-ugjn2_1 exited with code 1
Aborting on container exit...


Exception in thread Thread-5:
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/local/image.py", line 627, in run
    _stream_output(self.process)
  File "/home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/local/image.py", line 687, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/sagemaker/local/image.py", line 632, in run
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmpdig2owyk/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Pr

INFO:sagemaker.local.entities:Container still not up, got: -1
INFO:sagemaker.local.entities:Checking if serving container is up, attempt: 45
INFO:sagemaker.local.entities:Container still not up, got: -1
INFO:sagemaker.local.entities:Checking if serving container is up, attempt: 50
INFO:sagemaker.local.entities:Container still not up, got: -1
INFO:sagemaker.local.entities:Checking if serving container is up, attempt: 55
INFO:sagemaker.local.entities:Container still not up, got: -1
INFO:sagemaker.local.entities:Checking if serving container is up, attempt: 60
INFO:sagemaker.local.entities:Container still not up, got: -1


KeyboardInterrupt: 
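
The `ValueError: no SavedModel bundles found!` in the log above suggests that the serving container could not find a TensorFlow SavedModel in the model artifact. As far as we know, the TensorFlow Serving container looks for a `saved_model.pb` file inside a numeric version directory (for example `<model_name>/1/saved_model.pb`), so saving only checkpoints or weights would not be enough. The sketch below is a hypothetical helper (`find_savedmodel_bundles` is our own name, not part of the SageMaker SDK) that checks a local model directory for that layout before deploying:

```python
import os


def find_savedmodel_bundles(model_root):
    """Return the version directories under model_root that look like SavedModel bundles.

    Assumption: a bundle is a directory whose name is a number (the version)
    and which contains a file called saved_model.pb.
    """
    bundles = []
    for dirpath, _dirnames, filenames in os.walk(model_root):
        version = os.path.basename(dirpath)
        if "saved_model.pb" in filenames and version.isdigit():
            bundles.append(dirpath)
    return sorted(bundles)
```

If this returns an empty list for the directory that the training script writes to, the training script probably needs to export the model in SavedModel format under a numbered subdirectory.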

# Resume training from a checkpoint

Let's load the experiment and trial again, then launch a new training job that restores the latest checkpoint and continues training from it.

In [73]:
experiment_name='tf-transformer'
trial_name='single-gpu'
trial_comp_name = 'single-gpu-training-job'

In [74]:
# Load the experiment, or create it if it doesn't exist
try:
    experiment = Experiment.load(experiment_name=experiment_name)
    print('Load the experiment')
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        experiment = Experiment.create(experiment_name=experiment_name)
        print('Create the experiment')
    else:
        raise  # don't silently swallow unrelated errors


# Load the trial, or create it if it doesn't exist
try:
    trial = Trial.load(trial_name=trial_name)
    print('Load the trial')
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        trial = Trial.create(experiment_name=experiment_name, trial_name=trial_name)
        print('Create the trial')
    else:
        raise  # don't silently swallow unrelated errors

Load the experiment
Load the trial


In [75]:
# Set the configuration parameters for the experiment
experiment_config = {'ExperimentName': experiment.experiment_name, 
                       'TrialName': trial.trial_name,
                       'TrialComponentDisplayName': trial_comp_name}

Create an Estimator for a TensorFlow 2.1 model and set the `resume` hyperparameter to `True`, so that the training script restores the latest checkpoint and resumes training for the selected number of epochs.
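
In script mode, hyperparameters reach the training script as command-line arguments, so `'resume': True` should arrive as something like `--resume True`. The following is a minimal sketch of how a training script could parse that flag. It is an assumption about how `train.py` might be written, not a copy of the actual script, and `str2bool` and `parse_args` are hypothetical helper names:

```python
import argparse


def str2bool(value):
    """Convert a command-line string such as 'True' or 'false' to a bool."""
    if isinstance(value, bool):
        return value
    text = value.strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def parse_args(argv=None):
    """Parse the hyperparameters relevant to resuming training."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--resume", type=str2bool, default=False)
    # Ignore the other hyperparameters that this sketch does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args
```

A plain `type=bool` would not work here, because `bool("False")` is `True` in Python; that is why the sketch uses an explicit converter.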

In [76]:
#instance_type='ml.m5.xlarge'
#instance_type='ml.m4.4xlarge'
instance_type='ml.p2.xlarge'
#instance_type='local'

# Define the metrics to search for
metric_definitions = [{'Name': 'loss', 'Regex': 'Loss ([0-9\\.]+)'},{'Name': 'Accuracy', 'Regex': 'Accuracy ([0-9\\.]+)'}]
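
SageMaker applies each `Regex` to the training job's log output and records the first capture group as the metric value. To check the patterns before launching a job, they can be tried locally with Python's `re` module. The sketch below assumes the training script prints lines such as `Epoch 1 Loss 1.2345 Accuracy 0.4567`; that sample line and the `extract_metrics` helper are hypothetical:

```python
import re

metric_definitions = [
    {'Name': 'loss', 'Regex': 'Loss ([0-9\\.]+)'},
    {'Name': 'Accuracy', 'Regex': 'Accuracy ([0-9\\.]+)'},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: value} for every definition whose regex matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition['Regex'], log_line)
        if not match:
            continue
        try:
            found[definition['Name']] = float(match.group(1))
        except ValueError:
            # e.g. a trailing period captured by [0-9\.]+ ; skip such values
            pass
    return found


sample_line = 'Epoch 1 Loss 1.2345 Accuracy 0.4567'  # hypothetical log line
print(extract_metrics(sample_line, metric_definitions))
```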

In [78]:
estimator = TensorFlow(entry_point='train.py',
                       source_dir="train",
                       role=role,
                       instance_count=1,
                       instance_type=instance_type,
                       framework_version='2.1.0',
                       py_version='py3',
                       output_path=output_data_uri,
                       code_location=output_data_uri,
                       base_job_name='tf-transformer',
                       script_mode= True,
                       checkpoint_s3_uri = ckpt_data_uri,
                       metric_definitions = metric_definitions, 
                       hyperparameters={
                        'epochs': 5,
                        'nsamples': 40000,
                        'resume': True,
                        'train_file': 'spa.txt',
                        'non_breaking_in': 'nonbreaking_prefix.en',
                        'non_breaking_out': 'nonbreaking_prefix.es'
                       })
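
With `checkpoint_s3_uri` set, SageMaker syncs the container's local checkpoint directory (by default `/opt/ml/checkpoints`) with that S3 location, so checkpoints from a previous job are available when the new job starts. Inside the training script, `tf.train.latest_checkpoint` or a `tf.train.CheckpointManager` would normally be used to pick the checkpoint to restore. The dependency-free sketch below only illustrates the idea; it assumes checkpoints are named `ckpt-<step>` (so an index file `ckpt-<step>.index` exists), and `latest_checkpoint_prefix` is a hypothetical helper:

```python
import os
import re


def latest_checkpoint_prefix(checkpoint_dir, name="ckpt"):
    """Return the path prefix of the checkpoint with the highest step, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    pattern = re.compile(r"^%s-(\d+)\.index$" % re.escape(name))
    best_step, best_prefix = -1, None
    for filename in os.listdir(checkpoint_dir):
        match = pattern.match(filename)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            # Strip the ".index" suffix to get the checkpoint prefix
            best_prefix = os.path.join(checkpoint_dir, filename[:-len(".index")])
    return best_prefix
```

A training script could call this on `/opt/ml/checkpoints` when `resume` is true and start from scratch when it returns `None`.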

In [79]:
#job_name=f'tensorflow-single-gpu-{time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())}'
job_name = '{}-{}'.format(trial_name,time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()))
print(job_name)

single-gpu-2020-11-08-18-40-33


In [80]:
# Fit or train the model from the latest checkpoint
estimator.fit({'training':training_data_uri}, job_name = job_name, 
              experiment_config = experiment_config)

INFO:sagemaker:Creating training-job with name: single-gpu-2020-11-08-18-40-33


2020-11-08 18:40:47 Starting - Starting the training job...
2020-11-08 18:41:16 Starting - Launching requested ML instances.........
2020-11-08 18:42:47 Starting - Preparing the instances for training.........
2020-11-08 18:44:02 Downloading - Downloading input data...
2020-11-08 18:44:28 Training - Downloading the training image........
2020-11-08 18:45:59,732 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2020-11-08 18:46:00,300 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "resume": true,
        "non_breaking_out": "nonbreaking_prefix.es",
        "nsamples": 40000,
        "train_file": "sp


2020-11-08 18:45:54 Training - Training image download completed. Training in progress.
  Building wheel for promise (setup.py): finished with status 'done'
  Created wheel for promise: filename=promise-2.3-py3-none-any.whl size=21495 sha256=0e4a1bcae031239e48de28b80da81a42374fe6a78c4f4fe531599aada4269554
  Stored in directory: /root/.cache/pip/wheels/59/9a/1d/3f1afbbb5122d0410547bf9eb50955f4a7a98e53a6d8b99bd1
  Building wheel for future (setup.py): started
  Building wheel for future (setup.py): finished with status 'done'
  Created wheel for future: filename=future-0.18.2-py3-none-any.whl size=491058 sha256=ddac4fdd0607303ad63138a91b4c623df4d1244fbac32d172ab477dfaedf8e65
  Stored in directory: /root/.cache/pip/wheels/6e/9c/ed/4499c9865ac1002697793e0ae05ba6be33553d098f3347fb94
Successfully built promise future
Installing collected packages: dill, dataclasses, importlib-resources, googleapis-common-protos, tensorflow-metadata, attrs, tqdm, promise, typing-extensi

## Show the metrics

In [21]:
result[:5]

[[0.00026031278, 0.990563631, 0.00917605218],
 [0.999760091, 0.000239968373, 3.97464422e-10],
 [0.000185193509, 0.974752605, 0.0250621513],
 [9.90088935e-08, 0.241644651, 0.758355319],
 [1.86230598e-09, 0.0252015758, 0.974798381]]

### Show results

## Delete the experiment

In [31]:
training_experiment.delete_all(action="--force")

# Delete the endpoint

Let's delete the TensorFlow 2.1 endpoint we just created to avoid incurring any extra costs.

In [32]:
# In SageMaker Python SDK v2 the endpoint is deleted through the predictor
predictor.delete_endpoint()

Gracefully stopping... (press Ctrl+C again to force)


# Deploy model using artifacts
https://sagemaker.readthedocs.io/en/stable/using_tf.html#deploy-to-a-sagemaker-endpoint

In [81]:
from sagemaker.tensorflow.serving import Model

In [85]:
model_data = estimator.model_data
instance_type='ml.m4.4xlarge'

In [83]:
model_data

's3://edumunozsala-ml-sagemaker/transformer-nmt/single-gpu-2020-11-08-18-40-33/output/model.tar.gz'

In [86]:
model = Model(model_data=model_data, role=role,framework_version='2.1.0')
predictor = model.deploy(initial_instance_count=1, instance_type=instance_type)

See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
INFO:sagemaker:Creating model with name: tensorflow-inference-2020-11-08-19-26-16-925
INFO:sagemaker:Creating endpoint with name tensorflow-inference-2020-11-08-19-26-17-266


---------------------------------*

UnexpectedStatusException: Error hosting endpoint tensorflow-inference-2020-11-08-19-26-17-266: Failed. Reason:  The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
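
A failed ping health check means the serving container did not start correctly; the CloudWatch logs mentioned in the error give the actual reason. One plausible cause, given the earlier local deployment error, is that `model.tar.gz` does not contain a SavedModel. After downloading the archive from the `model_data` location, its contents can be inspected with the standard `tarfile` module. This is only a sketch, and `list_savedmodel_entries` is a hypothetical helper name:

```python
import tarfile


def list_savedmodel_entries(tar_path):
    """Return the archive members that are SavedModel protobuf files."""
    with tarfile.open(tar_path, "r:gz") as archive:
        return sorted(
            name for name in archive.getnames()
            if name.endswith("saved_model.pb")
        )
```

An empty result would indicate that the training job saved only checkpoints or weights, in which case the training script has to export a SavedModel before the artifact can be served.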

In [37]:
X_test[:5]

array([[5.8, 2.7, 4.1, 1. ],
       [4.8, 3.4, 1.6, 0.2],
       [6. , 2.2, 4. , 1. ],
       [6.4, 3.1, 5.5, 1.8],
       [6.7, 2.5, 5.8, 1.8]])

In [38]:
predictor.predict(X_test[:5])

[36malgo-1-oufpd_1  |[0m 172.18.0.1 - - [28/Mar/2020:21:15:01 +0000] "POST /invocations HTTP/1.1" 200 254 "-" "-"


{'predictions': [[0.00026031278, 0.990563631, 0.00917605218],
  [0.999760091, 0.000239968373, 3.97464422e-10],
  [0.000185193509, 0.974752605, 0.0250621513],
  [9.90088935e-08, 0.241644651, 0.758355319],
  [1.86230598e-09, 0.0252015758, 0.974798381]]}

In [39]:
predictor.delete_endpoint()

Gracefully stopping... (press Ctrl+C again to force)


### References

- References for experiments and trials:
https://github.com/shashankprasanna/sagemaker-training-tutorial/blob/master/sagemaker-training-tutorial.ipynb
