# Steps to start training your custom TensorFlow model in AWS SageMaker
# SageMaker Experiments, TensorFlow script mode training and restore checkpoint to resume training

Some sections of this notebook have been inspired by the tutorial:

**Sagemaker Python SDK Examples: tensorflow_script_mode_training_and_serving.ipynb**

https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_script_mode_training_and_serving/tensorflow_script_mode_training_and_serving.ipynb

In this notebook we describe the most relevant steps to start training a custom algorithm in AWS SageMaker without building a custom container. We show how to work with experiments and how to solve some of the problems that arise with custom models in SageMaker script mode. Some basic SageMaker concepts are not detailed, in order to focus on the relevant ones.

The following steps will be explained:
 
1. Create an Experiment and Trial to keep track of our experiments

2. Load the training data to our training instance

3. Create the scripts to train our custom model, a Transformer

4. Create an Estimator to train our model in a TensorFlow 2.1 container in script mode

5. Create metric definitions to keep track of them in SageMaker

6. Download the trained model to make predictions

7. Resume training using the latest checkpoint from a previous training


# Amazon SageMaker Overview

*Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment.*

*Amazon SageMaker Developer Guide*

Amazon SageMaker provides many tools to help developers manage the machine learning lifecycle workflow:
- Fetch, clean and transform the data: you can use SageMaker notebook instances to manipulate and analyze your data, then clean and transform it into the format required by your algorithm. You can also use the Pipelines functionality to serve the data to your model during training.
- Train and evaluate the model: there are many different possibilities to train your model. You can use the built-in algorithms and models provided by SageMaker, or you can use custom code to train in the most popular deep learning frameworks (TensorFlow, PyTorch, Apache MXNet, ...) or even use Apache Spark. Finally, you can build a Docker container with your own custom algorithm and train the model on SageMaker. You can keep track of your model metrics to evaluate its performance.
- Deploy your model: once your model is trained, you can deploy it to an endpoint service in SageMaker and make predictions one at a time or in batch mode.

![Alt](images/ml-concepts-10.png "Machine Learning Lifecycle work flow")

A simple and popular way to get started and work with SageMaker is to use the Amazon SageMaker Python SDK. It provides Python APIs and containers that make it easy to train and deploy models in SageMaker, as well as examples for use with several different machine learning and deep learning frameworks.



# Problem description

For this project we will develop notebooks and scripts to train a Transformer TensorFlow 2 model to solve a neural machine translation problem, translating simple sentences from English to Spanish. This problem and the model are extensively described in my Medium post ["Attention is all you need: Discovering the Transformer paper"](https://towardsdatascience.com/attention-is-all-you-need-discovering-the-transformer-paper-73e5ff5e0634).



# Data description

For this exercise, we’ll use pairs of simple sentences. The source text will be in English, and the target text will be in Spanish, from the Tatoeba project where people contribute, adding translations every day. This is the [link](http://www.manythings.org/anki/) to some translations in different languages. There you can download the Spanish/English `spa_eng.zip` file; it contains 124,457 pairs of sentences.
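To make the data format concrete, here is a minimal sketch of how the sentence pairs could be read. It assumes that each line of `spa.txt` holds an English sentence and its Spanish translation separated by a tab, possibly followed by an attribution column. The function name `read_pairs` and the sample lines are illustrative and are not part of the project's code.

```python
from typing import Iterable, List, Optional, Tuple


def read_pairs(lines: Iterable[str], limit: Optional[int] = None) -> List[Tuple[str, str]]:
    """Parse tab-separated lines into (english, spanish) pairs.

    Assumption: the columns are English, Spanish and, optionally, an attribution.
    """
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pairs.append((fields[0], fields[1]))
        if limit is not None and len(pairs) >= limit:
            break
    return pairs


# Illustrative lines, not copied from the real file
sample = ["Go.\tVe.\tattribution text\n", "\n", "Hi.\tHola.\n"]
pairs = read_pairs(sample)
print(pairs)
```

With a real file, the same function can be fed an open file handle, for example `read_pairs(open(train_file, encoding='utf-8'), limit=60000)`.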

# Set up the environment

Let's start by setting up the environment:

First, we will import and load the libraries to use in our project.

In [1]:
%load_ext autoreload
%autoreload 2

In [5]:
import os
import sagemaker
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
import time
import pickle

import tensorflow as tf
# Create a SageMaker session to work with
sagemaker_session = sagemaker.Session()
# Get the role of our user and the region
role = get_execution_role()
region = sagemaker_session.boto_session.region_name
print(role)
print(region)

arn:aws:iam::223817798831:role/service-role/AmazonSageMaker-ExecutionRole-20200708T194212
us-east-1


## Define global variables and parameters

In [6]:
# Set the variables for data locations
data_folder_name='data'
train_filename = 'spa.txt'
non_breaking_en = 'nonbreaking_prefix.en'
non_breaking_es = 'nonbreaking_prefix.es'
# Set the directories for our model output
trainedmodel_path = 'trained_model'
output_data_path = 'output_data'
# Set the name of the artifacts that our model generate (model not included) 
model_info_file = 'model_info.pth'
input_vocab_file = 'in_vocab.pkl'
output_vocab_file = 'out_vocab.pkl'
# Set the absolute path of the train data 
train_file = os.path.abspath(os.path.join(data_folder_name, train_filename))
non_breaking_en_file = os.path.abspath(os.path.join(data_folder_name, non_breaking_en))
non_breaking_es_file = os.path.abspath(os.path.join(data_folder_name, non_breaking_es))

When working with Amazon SageMaker training jobs, which run in containers on a new instance or "vm", the data has to be shared through an S3 storage folder. For this purpose we define the bucket name and the folder names where our inputs and outputs will be stored. In our case we define:
- The **training data** URI: where our input data is located
- The **output folder**: where our training saves the outputs from our model
- The **checkpoint folder**: where our model uploads the checkpoints


In [7]:
# Specify your bucket name
bucket_name = 'edumunozsala-ml-sagemaker'
# Set the training data folder in S3
training_folder = r'transformer-nmt/train'
# Set the output folder in S3
output_folder = r'transformer-nmt'
# Set the checkpoint in S3 folder for our model 
ckpt_folder = r'transformer-nmt/ckpt'

training_data_uri = r's3://' + bucket_name + r'/' + training_folder
output_data_uri = r's3://' + bucket_name + r'/' + output_folder
ckpt_data_uri = r's3://' + bucket_name + r'/' + ckpt_folder

In [5]:
training_data_uri,output_data_uri,ckpt_data_uri

('s3://edumunozsala-ml-sagemaker/transformer-nmt/train',
 's3://edumunozsala-ml-sagemaker/transformer-nmt',
 's3://edumunozsala-ml-sagemaker/transformer-nmt/ckpt')
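The three URIs above follow the same pattern, so they can also be built with a small helper that avoids doubled or missing slashes. This is only a sketch; `s3_uri` is a hypothetical helper name, not a function of the SageMaker SDK.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key segments into an s3:// URI."""
    # Drop empty segments and surrounding slashes before joining
    segments = [p.strip("/") for p in parts if p and p.strip("/")]
    key = "/".join(segments)
    return "s3://{}/{}".format(bucket, key) if key else "s3://{}".format(bucket)


bucket = "edumunozsala-ml-sagemaker"
print(s3_uri(bucket, "transformer-nmt", "train"))
print(s3_uri(bucket, "transformer-nmt"))
print(s3_uri(bucket, "transformer-nmt/", "/ckpt"))
```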

Then we can upload to the training data folder in S3 the files necessary for training: training data, non breaking prefixes for the inputs (English) and the non breaking prefixes for the outputs (Spanish). Once uploaded they can be loaded for training in the SageMaker container.

In [6]:
inputs = sagemaker_session.upload_data(train_file,
                              bucket=bucket_name, 
                              key_prefix=training_folder)

sagemaker_session.upload_data(non_breaking_en_file,
                              bucket=bucket_name, 
                              key_prefix=training_folder)

sagemaker_session.upload_data(non_breaking_es_file,
                              bucket=bucket_name, 
                              key_prefix=training_folder)

's3://edumunozsala-ml-sagemaker/transformer-nmt/train/nonbreaking_prefix.es'

# Create an experiment and trial

*Amazon SageMaker Experiments* is a capability of Amazon SageMaker that lets you organize, track, compare, and evaluate your machine learning experiments.

Machine learning is an iterative process. You need to experiment with multiple combinations of data, algorithm and parameters, all the while observing the impact of incremental changes on model accuracy. Over time this iterative experimentation can result in thousands of model training runs and model versions. This makes it hard to track the best performing models and their input configurations. It’s also difficult to compare active experiments with past experiments to identify opportunities for further incremental improvements.

Experiments will help us to organize and manage all executions, metrics and results of a ML project.

In [7]:
# Install the library necessary to handle experiments
!pip install sagemaker-experiments

Collecting sagemaker-experiments
  Using cached sagemaker_experiments-0.1.24-py3-none-any.whl (36 kB)
Installing collected packages: sagemaker-experiments
Successfully installed sagemaker-experiments-0.1.24
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/tensorflow2_p36/bin/python -m pip install --upgrade pip' command.


Load the libraries to handle experiments

In [8]:
# Import the libraries to work with Experiments in SageMaker
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent

Set the experiment and trial names and one tag to help us identify the purpose of these items.

In [10]:
# Set the experiment name
experiment_name='tf-transformer'
# Set the trial name 
trial_name="{}-{}".format(experiment_name,'single-gpu')

tags = [{'Key': 'my-experiments', 'Value': 'transformerEngSpa1'}]

You can create an experiment to track all the model training iterations. Experiments are a great way to organize your data science work. You can create experiments to organize all your model development work for a business use case you are addressing (e.g. an experiment named “customer churn prediction”), for a data science team that owns the experiment (e.g. an experiment named “marketing analytics experiment”), or for a specific data science and ML project. Think of it as a “folder” for organizing your “files”.

We will create a Trial to track each training job run. But this is just a simple example, not intended to explore all the capabilities of the product.

In [11]:
# create the experiment if it doesn't exist
try:
    training_experiment = Experiment.load(experiment_name=experiment_name)
    print('Loaded experiment ',experiment_name)
except Exception as ex:
    if "ResourceNotFound" not in str(ex):
        raise  # unexpected error: do not hide it
    training_experiment = Experiment.create(experiment_name=experiment_name,
                                  description = "Experiment to track trainings on my tensorflow Transformer Eng-Spa", 
                                  tags = tags)
    print('Created experiment ',experiment_name)
# create the trial if it doesn't exist
try:
    single_gpu_trial = Trial.load(trial_name=trial_name)
    print('Loaded trial ',trial_name)
except Exception as ex:
    if "ResourceNotFound" not in str(ex):
        raise  # unexpected error: do not hide it
    single_gpu_trial = Trial.create(experiment_name=experiment_name, 
                         trial_name= trial_name,
                         tags = tags)
    print('Created trial ',trial_name)

Loaded experiment  tf-transformer
Loaded trial  tf-transformer-single-gpu
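The experiment and the trial follow the same "load it, or create it if it is missing" pattern, which can be factored into one function. The sketch below is an illustration under assumptions: `load_or_create` is a hypothetical helper, and the `_fake_load` and `_fake_create` functions only stand in for the real service so that the example runs offline.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_create(load: Callable[[], T], create: Callable[[], T],
                   missing_marker: str = "ResourceNotFound") -> T:
    """Return load(); if it fails because the resource is missing, return create().

    Any other error is re-raised so that it is not silently swallowed.
    """
    try:
        return load()
    except Exception as ex:
        if missing_marker in str(ex):
            return create()
        raise


# Stand-ins for the service, used only to exercise the helper offline
_store = {}


def _fake_load(name):
    if name not in _store:
        raise RuntimeError("ResourceNotFound: " + name)
    return _store[name]


def _fake_create(name):
    _store[name] = {"name": name}
    return _store[name]


first = load_or_create(lambda: _fake_load("tf-transformer"),
                       lambda: _fake_create("tf-transformer"))   # created
second = load_or_create(lambda: _fake_load("tf-transformer"),
                        lambda: _fake_create("tf-transformer"))  # loaded
```

With the real classes, the two callables would wrap the calls already used in the cell above, for example `lambda: Experiment.load(experiment_name=experiment_name)` and `lambda: Experiment.create(experiment_name=experiment_name, tags=tags)`.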


## Trackers

Another interesting tool worth mentioning is the Tracker object. Trackers can store information about different types of objects in our model or training process, such as inputs, parameters, artifacts or metrics. The tracker is attached to a trial, associating the object with the training job. We can record that information and analyze it later in the experiment. **Note** that only parameters, input artifacts, and output artifacts are saved to SageMaker. Metrics are saved to file.

As an example, we create a Tracker to register the input data and two parameters about how that data is processed in our project.

In [12]:
from smexperiments.tracker import Tracker
# Create the tracker for the input data
tracker_name='TextPreprocessing'
trial_comp_name = None # Change to an existing TrialComponent name to load it

try:
    tracker = Tracker.load(trial_component_name=trial_comp_name)
    print('Loaded Tracker ',tracker_name)
except Exception as ex:
    tracker = Tracker.create(display_name=tracker_name)
    tracker.log_input(name="EngtoSpa Translations", media_type="s3/uri", value=inputs)
    tracker.log_parameters({
        "Tokenizer": 'Subword',
        "Max Length": 15,
    })
    print('Created Tracker ',tracker_name)
    
# Attach the Tracker to the trial
single_gpu_trial.add_trial_component(tracker.trial_component)

Created Tracker  TextPreprocessing


Our last step consists of creating the experiment configuration, a dictionary that contains the experiment name, the trial name and the trial component display name. It will be used to label our training job.

In [13]:
# Create a configuration definition for our experiment and trial
trial_comp_name = 'single-gpu-components'
# Set the configuration parameters for the experiment
experiment_config = {'ExperimentName': training_experiment.experiment_name, 
                       'TrialName': single_gpu_trial.trial_name,
                       'TrialComponentDisplayName': trial_comp_name}

Check and show information about the experiment and trial

In [19]:
print('Experiment: ',training_experiment.experiment_name)
# Show the trials in the experiment
#for trial in training_experiment.list_trials():
    #print('Trial: ',trial.trial_name)

for trial_comp in TrialComponent.list(trial_name=single_gpu_trial.trial_name):
        print('Trial Components: ',trial_comp)

Experiment:  tf-transformer
Trial Components:  TrialComponentSummary(trial_component_name='TrialComponent-2020-11-12-115920-sbov',trial_component_arn='arn:aws:sagemaker:us-east-1:223817798831:experiment-trial-component/trialcomponent-2020-11-12-115920-sbov',display_name='TextPreprocessing',creation_time=datetime.datetime(2020, 11, 12, 11, 59, 20, 739000, tzinfo=tzlocal()),created_by={},last_modified_time=datetime.datetime(2020, 11, 12, 11, 59, 20, 739000, tzinfo=tzlocal()),last_modified_by={})
Trial Components:  TrialComponentSummary(trial_component_name='tf-transformer-single-gpu-2020-11-12-11-44-28-aws-training-job',trial_component_arn='arn:aws:sagemaker:us-east-1:223817798831:experiment-trial-component/tf-transformer-single-gpu-2020-11-12-11-44-28-aws-training-job',display_name='single-gpu-components',trial_component_source={'SourceArn': 'arn:aws:sagemaker:us-east-1:223817798831:training-job/tf-transformer-single-gpu-2020-11-12-11-44-28', 'SourceType': 'SageMakerTrainingJob'},status

# Construct a script for training

Script mode is a training script format for TensorFlow that lets you execute any TensorFlow training script in SageMaker with minimal modification. The SageMaker Python SDK handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native TensorFlow support sets up training-related environment variables and executes your training script. In this tutorial, we use the SageMaker Python SDK to launch a training job.

Script mode supports training with a Python script, a Python module, or a shell script.

This project's training script was adapted from the TensorFlow Transformer model we developed in the post mentioned previously. We have modified it to handle: 
- the ``train_file``, ``non_breaking_in`` and ``non_breaking_out`` parameters, passed in with the names of the training dataset file, the non-breaking prefixes for the input data and the non-breaking prefixes for the output data.

- the ``data_dir`` parameter passed in by SageMaker with the value of the environment variable `SM_CHANNEL_TRAINING`. This is the local path in the container where SageMaker makes the input data from S3 available during training.

- the ``model_dir`` parameter passed in by SageMaker. This is an S3 path which can be used for data sharing during distributed training and checkpointing and/or model persistence. We have also added an argument-parsing function to handle processing training-related variables.

- the local checkpoint path to store the model checkpoints during training. We use the default value ``/opt/ml/checkpoints``, whose content will be uploaded to S3. We discuss this behavior later when defining our estimator.

- At the end of the training job we have added a step to export the trained model, only the weights, to the path stored in the environment variable ``SM_MODEL_DIR``, which always points to ``/opt/ml/model``. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at the end of training.

- the ``output_data_dir`` parameter passed in by SageMaker with the value of the environment variable `SM_OUTPUT_DATA_DIR`. This is a folder path used to save output data from our model. This folder will be uploaded to S3 and stored as output.tar.gz. In our case we need to save the tokenizer for the input texts, the tokenizer for the outputs, the input and output vocab size and the tokens for ``eos`` and ``sos``. 
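The parameters listed above can be collected with an argument parser along the following lines. This is a minimal sketch and not the project's actual `train.py`: the default values, the `str2bool` helper and the `--sm_model_dir` argument name are assumptions made for illustration. The fallback paths are the usual container locations, used only when the `SM_*` environment variables are not set.

```python
import argparse
import os


def str2bool(value):
    """Turn 'True'/'False' strings into booleans.

    argparse's type=bool would treat the string 'False' as True.
    """
    lowered = str(value).lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters sent from the notebook (defaults are illustrative)
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--nsamples", type=int, default=20000)
    parser.add_argument("--resume", type=str2bool, default=False)
    parser.add_argument("--train_file", type=str, default="spa.txt")
    parser.add_argument("--non_breaking_in", type=str, default="nonbreaking_prefix.en")
    parser.add_argument("--non_breaking_out", type=str, default="nonbreaking_prefix.es")
    # S3 location passed by the estimator
    parser.add_argument("--model_dir", type=str, default=None)
    # Local folders that SageMaker exposes through environment variables
    parser.add_argument("--data_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"))
    parser.add_argument("--sm_model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--output_data_dir", type=str,
                        default=os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data"))
    # parse_known_args ignores any extra argument the container may add
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments the container would pass to the script
args = parse_args(["--epochs", "8", "--resume", "False",
                   "--model_dir", "s3://my-bucket/my-job/model"])
print(args.epochs, args.resume, args.data_dir)
```

Parsing the boolean explicitly matters because hyperparameters reach the script as strings, so a value such as `'resume': False` arrives as text rather than as a Python boolean.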


In addition to the train.py file, our source code folder includes the files:
- model.py: Tensorflow model definition
- utils.py: utility functions to process the text data 
- utils_train.py: contains functions to calculate the loss and learning rate scheduler.


Here is the entire script for the train.py file:

In [18]:
!pygmentize 'train/train.py'

import argparse
import json
import sys
#import sagemaker_containers

import math
import os
import gc
import time
import pandas as pd
import pickle

import tensorflow as tf

# To install tensorflow_datasets
import subprocess

def install(package):
    subprocess.check_call([sys.executable, "-q", "-m", "pip

if __name__ == '__main__':
    # Install tensorflow_datasets
    #install('tensorflow_datasets')

    # All of the model parameters and training parameters are sent as arguments when the script
    # is executed. Here we set up an argument parser to easily access the parameters.

    parser = argparse.ArgumentParser()

    # Training Parameters
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--max-len', type=int, default=

Our source code needs the `tensorflow_datasets` library, which is not included in the TensorFlow 2.1 container image provided by SageMaker. To solve this issue we explicitly install it from our train.py file using the command `subprocess.check_call([sys.executable, "-q", "-m", "pip", "install", package])`.
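As a sketch of this runtime installation, the helper below separates building the command from running it, which makes the command easy to inspect. It is a variant of the command quoted above, not the script's exact code: it passes `--quiet` to pip instead of `-q` to the interpreter, and `pip_command` is a hypothetical helper name.

```python
import subprocess
import sys


def pip_command(package):
    """Build the command that installs `package` with the current interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", "--quiet", package]


def install(package):
    """Install a package at run time; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_command(package))


# Inspect the command without running it
command = pip_command("tensorflow_datasets")
print(command[1:])
```

Calling `install('tensorflow_datasets')` at the top of the script's main block would then make the library importable for the rest of the training run.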

# Create a training job using the `TensorFlow` estimator

The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container where the model will run, uploading your script or source code to an S3 location and creating a SageMaker training job. Let's call out a few important parameters here:

* `source_dir` and `entry_point`, the folder with the source code and the file to run the training.
* `framework_version` is the TensorFlow version we want to run our code with.
* `py_version` is set to `'py3'` to indicate that we are using script mode since legacy mode supports only Python 2. Though Python 2 will be deprecated soon, you can use script mode with Python 2 by setting `py_version` to `'py2'` and `script_mode` to `True`.
* `code_location` is an S3 folder URI where the `source_dir` will be uploaded. When the instance starts, the content of that folder will be downloaded to a local path, `/opt/ml/code`. The `entry_point`, our main code or function, has to be included in that folder.
* `output_path` is the S3 path where all the outputs of our training job will be uploaded when the training ends. In our example we will upload to this S3 folder the local content in the folders `SM_MODEL_DIR` and `SM_OUTPUT_DATA_DIR`.
* the `checkpoint_local_path` and `checkpoint_s3_uri` parameters will be explained in the next section **"Resume training from a checkpoint"**
* `script_mode = True` to set script mode. 

In [14]:
from sagemaker.tensorflow import TensorFlow

In [26]:
# Uncomment the type of instance to use
#instance_type='ml.m4.4xlarge'
instance_type='ml.p2.xlarge'
#instance_type='local'

Another important parameter of our TensorFlow estimator is the `instance_type`, the type of "virtual machine" where the container will run. The values we play around with in this project are:
- local: The container will run locally on the notebook instance. This is very useful to debug or verify that our estimator definition is correct and that train.py runs successfully. It is much faster to run the container locally; the start-up time for a remote instance is too long when you are coding and debugging.
- ml.mX.Yxlarge: It is a CPU instance, useful when you are running your code for a short training, maybe for validation purposes. Check the AWS documentation for a list of alternative instances.
- ml.p2.xlarge: This instance uses a GPU and it is the preferred one when you want to launch a long-running training.

When running in local mode, some estimator functionalities are not available, like uploading the checkpoints to S3, and their parameters should not be defined.

Finally we want to mention the definition of metrics. Using a dictionary, we can define a metric name and the regular expression to extract its value from the messages the training script writes on the logs or the stdout during training. Later we can see those metrics in the SageMaker console. We show you how to do it in a following section.


In [15]:
# Define the metrics to search for
metric_definitions = [{'Name': 'loss', 'Regex': 'Loss ([0-9\\.]+)'},{'Name': 'Accuracy', 'Regex': 'Accuracy ([0-9\\.]+)'}]
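Before launching a job, it can be worth checking locally that each regular expression captures the value we expect. The sketch below applies the same metric definitions to log lines in the format our training script prints; `extract_metrics` is a hypothetical helper written for this check, not part of the SageMaker SDK.

```python
import re

metric_definitions = [
    {"Name": "loss", "Regex": "Loss ([0-9\\.]+)"},
    {"Name": "Accuracy", "Regex": "Accuracy ([0-9\\.]+)"},
]


def extract_metrics(line, definitions):
    """Return {metric name: value} for every definition whose regex matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            # The first capture group holds the metric value
            found[definition["Name"]] = float(match.group(1))
    return found


log_line = "Epoch 1 Batch 100 Loss 4.4878 Accuracy 0.0394"
metrics = extract_metrics(log_line, metric_definitions)
print(metrics)  # {'loss': 4.4878, 'Accuracy': 0.0394}

# Lines without metrics produce no values
print(extract_metrics("Starting epoch 1", metric_definitions))  # {}
```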

Now, we can define the estimator:

In [28]:
# Create the Tensorflow estimator using a Tensorflow 2.1 container
estimator = TensorFlow(entry_point='train.py',
                       source_dir="train",
                       role=role,
                       instance_count=1,
                       instance_type=instance_type,
                       framework_version='2.1.0',
                       py_version='py3',
                       output_path=output_data_uri,
                       code_location=output_data_uri,
                       base_job_name='tf-transformer',
                       script_mode= True,
                       #checkpoint_local_path = 'ckpt', #Use default value /opt/ml/checkpoints
                       checkpoint_s3_uri = ckpt_data_uri,
                       metric_definitions = metric_definitions, 
                       hyperparameters={
                        'epochs': 8,
                        'nsamples': 60000,
                        'resume': False,
                        'train_file': 'spa.txt',
                        'non_breaking_in': 'nonbreaking_prefix.en',
                        'non_breaking_out': 'nonbreaking_prefix.es'
                       })


# Start the training job: ``fit``

To start a training job, we call the `estimator.fit` method with a few parameter values.

- An S3 location is used here as the input. `fit` creates a default channel named `'training'`, which points to this S3 location. In the training script we can access the training data from the local location stored in `SM_CHANNEL_TRAINING`. `fit` accepts a couple other types of input as well. See the API doc [here](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit) for details.
- `job_name` the name for the training job.
- `experiment_config` the dictionary with the name of the experiment and trial to attach this job to.


When training starts, the TensorFlow container executes `train.py`, passing `hyperparameters` and `model_dir` from the estimator as script arguments. Because we didn't explicitly define it, `model_dir` defaults to `s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>/model`, so the script execution is as follows:
```bash
python train.py --model_dir s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>/model --epochs 8 --nsamples 60000 ...
```

When training is complete, the training job will upload the saved model and other output artifacts to S3.
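To illustrate how the `hyperparameters` dictionary turns into script arguments, the following sketch flattens a dictionary into a list of command-line tokens. It only approximates what the container does internally; `to_cmd_args` is a hypothetical name, and the exact ordering and formatting used by SageMaker may differ.

```python
def to_cmd_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--name', 'value', ...] strings."""
    args = []
    for name, value in hyperparameters.items():
        args.extend(["--{}".format(name), str(value)])
    return args


cmd_args = to_cmd_args({"epochs": 8, "nsamples": 60000, "resume": False})
print(cmd_args)  # ['--epochs', '8', '--nsamples', '60000', '--resume', 'False']
```

Every value reaches the script as a string, which is why the training script converts each argument to its intended type when parsing.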

In [29]:
# Set the job name and show it
job_name = '{}-{}'.format(trial_name,time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()))
print(job_name)

tf-transformer-single-gpu-2020-11-12-12-25-30
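Job names built this way can be validated before calling `fit`. The sketch below assumes the naming rules documented for SageMaker training jobs (at most 63 characters, only letters, digits and hyphens, with no leading or trailing hyphen); `make_job_name` is a hypothetical helper.

```python
import re
import time

# Letters, digits and hyphens; the name must start and end with a letter or digit
_JOB_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
_MAX_JOB_NAME_LENGTH = 63


def make_job_name(prefix, when=None):
    """Build '<prefix>-<UTC timestamp>' and check it against the naming rules."""
    moment = time.gmtime() if when is None else when
    name = "{}-{}".format(prefix, time.strftime("%Y-%m-%d-%H-%M-%S", moment))
    if len(name) > _MAX_JOB_NAME_LENGTH or not _JOB_NAME_RE.match(name):
        raise ValueError("invalid training job name: {!r}".format(name))
    return name


# A fixed timestamp (the Unix epoch) makes the example reproducible
example_name = make_job_name("tf-transformer-single-gpu", time.gmtime(0))
print(example_name)  # tf-transformer-single-gpu-1970-01-01-00-00-00
```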


Calling fit to train a model with a TensorFlow 2.1 script.

In [30]:
# Call the fit method to launch the training job
estimator.fit({'training':training_data_uri}, job_name = job_name, 
              experiment_config = experiment_config)

INFO:sagemaker:Creating training-job with name: tf-transformer-single-gpu-2020-11-12-12-25-30


2020-11-12 12:25:33 Starting - Starting the training job...
2020-11-12 12:25:39 Starting - Launching requested ML instances......
2020-11-12 12:26:52 Starting - Preparing the instances for training.........
2020-11-12 12:28:30 Downloading - Downloading input data
2020-11-12 12:28:30 Training - Downloading the training image...........
2020-11-12 12:30:19,406 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2020-11-12 12:30:19,885 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "resume": false,
        "non_breaking_out": "nonbreaking_prefix.es",
        "nsamples": 60000,
        "train_file": "spa.

Input vocab:  11460
Output vocab:  9383
Creating the checkpoint ...
Training the model ....
Starting epoch 1
Epoch 1 Batch 0 Loss 4.2718 Accuracy 0.0000
Epoch 1 Batch 100 Loss 4.4878 Accuracy 0.0394
Epoch 1 Batch 200 Loss 4.4102 Accuracy 0.0553
Epoch 1 Batch 300 Loss 4.2828 Accuracy 0.0607
Epoch 1 Batch 400 Loss 4.1153 Accuracy 0.0634
Epoch 1 Batch 500 Loss 3.9366 Accuracy 0.0683
Epoch 1 Batch 600 Loss 3.7706 Accuracy 0.0784
Epoch 1 Batch 700 Loss 3.6183 Accuracy 0.0863
Epoch 1 Batch 800 Loss 3.4776 Accuracy 0.0956
Epoch 1 Batch 900 Loss 3.3528 Accuracy 0.1038
Saving checkpoint for epoch 1 in /opt/ml/checkpoints/ckpt-1
Starting epoch 2
Epoch 2 Batch 0 Loss 2.1816 Accuracy 0.1741
Epoch 2 Batch 100 Loss 2.2265 Accuracy 0.1770
Epoch 2 Batch 200 Loss 2.2128 Accuracy 0.1801
Epoch 2 Batch 300 Loss 2.1806 Accur

Save the experiment; you can then view it and its trials from SageMaker Studio.

In [31]:
# Save the trial
single_gpu_trial.save()
# Save the experiment
training_experiment.save()

Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f3f90c75d68>,experiment_name='tf-transformer',experiment_arn='arn:aws:sagemaker:us-east-1:223817798831:experiment/tf-transformer',display_name='tf-transformer',description='Experiment to track trainings on my tensorflow Transformer Eng-Spa',creation_time=datetime.datetime(2020, 11, 8, 17, 0, 49, 116000, tzinfo=tzlocal()),created_by={},last_modified_time=datetime.datetime(2020, 11, 12, 11, 50, 27, 732000, tzinfo=tzlocal()),last_modified_by={},response_metadata={'RequestId': '862d03a0-abf6-4215-a759-2ddcc4f622fd', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '862d03a0-abf6-4215-a759-2ddcc4f622fd', 'content-type': 'application/x-amz-json-1.1', 'content-length': '86', 'date': 'Thu, 12 Nov 2020 13:17:51 GMT'}, 'RetryAttempts': 0})

## Show metrics from SageMaker Console

You can monitor the metrics that a training job emits in real time in the **CloudWatch console**:
- Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
- Choose Metrics, then choose /aws/sagemaker/TrainingJobs.
- Choose TrainingJobName.
- On the All metrics tab, choose the names of the training metrics that you want to monitor.


Another option is to monitor the metrics by using the **SageMaker console**.
- Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.
- Choose Training jobs, then choose the training job whose metrics you want to see.
- Choose TrainingJobName.
- In the Monitor section, you can review the graphs of instance utilization and algorithm metrics.

It is a simple way to check how your model is "learning" during the training stage.
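The last reported value of each metric can also be read programmatically. The sketch below extracts them from the dictionary returned by the SageMaker `DescribeTrainingJob` API, whose `FinalMetricDataList` field holds entries with `MetricName` and `Value`. The `final_metrics` helper is hypothetical, and `fake_description` is a hand-written stand-in with placeholder values, not real training results.

```python
def final_metrics(description):
    """Map each metric name to its last reported value in a DescribeTrainingJob response."""
    return {
        metric["MetricName"]: metric["Value"]
        for metric in description.get("FinalMetricDataList", [])
    }


# Hand-written stand-in for a real response; the values are placeholders
fake_description = {
    "TrainingJobName": "example-job",
    "FinalMetricDataList": [
        {"MetricName": "loss", "Value": 1.0},
        {"MetricName": "Accuracy", "Value": 0.5},
    ],
}
print(final_metrics(fake_description))
```

With AWS credentials available, a real response can be obtained with `boto3.client('sagemaker').describe_training_job(TrainingJobName=job_name)` and passed to the same function.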


# Restore a training job and download the trained model

At this point, we have a trained model stored in S3. But we are interested in making some predictions with it.

After you train your model, you can deploy it using Amazon SageMaker to get predictions in any of the following ways:
- To set up a persistent endpoint to get one prediction at a time, use SageMaker hosting services.
- To get predictions for an entire dataset, use SageMaker batch transform.

But in this notebook we do not cover this feature because sometimes we are more interested in reloading our model in a new notebook to apply an evaluation method or study its parameters or gradients. So, here we are going to download the model artifacts from S3 and load them to an "empty" model instance. 

## Attach a previous training job

If we have just trained a model using the estimator variable in this notebook execution, we can skip this step. But you probably trained your model for hours, and now you need to restore the estimator variable from a previous training job. Look up the training job you want to restore in the SageMaker console, copy its name and paste it in the next code cell. Then call the `attach` method of the estimator class, and you can continue working with that training job.

We can skip the next cell if the previous estimator.fit command was executed

In [2]:
from sagemaker.tensorflow import TensorFlow

# Set the training job you want to attach to the estimator object
# Use this option if the training job was not run in this execution
my_training_job_name = 'tf-transformer-single-gpu-2020-11-12-18-36-15'

# If the training job was run in this execution, we can retrieve the name from the job_name variable
#my_training_job_name = job_name
# Attach the estimator to the selected training job
estimator = TensorFlow.attach(my_training_job_name)


2020-11-12 19:13:13 Starting - Preparing the instances for training
2020-11-12 19:13:13 Downloading - Downloading input data
2020-11-12 19:13:13 Training - Training image download completed. Training in progress.
2020-11-12 19:13:13 Uploading - Uploading generated training model
2020-11-12 19:13:13 Completed - Training job completed


In [11]:
# Set the job_name
job_name = estimator.latest_training_job.job_name
print('Job name where the model will be restored: ',estimator.latest_training_job.job_name)

Job name where the model will be restored:  tf-transformer-single-gpu-2020-11-12-18-36-15


In [12]:
print('Dir of model data: ',estimator.model_data)
print('Dir of output data: ',output_data_uri)
print('Bucket name: ',bucket_name)

Dir of model data:  s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/output/model.tar.gz
Dir of output data:  s3://edumunozsala-ml-sagemaker/transformer-nmt
Bucket name:  edumunozsala-ml-sagemaker


## Download the trained model

The estimator attribute `model_data` points to the `model.tar.gz` file that contains the saved model. The other output files of the training job, which we need to rebuild the model and to tokenize or detokenize the sentences, are stored in S3 under `output_path/<job_name>/output/output.tar.gz`. We download both files and extract them.

In [18]:
# Set the model and the output path in S3 to download the data 
init_model_path = len('s3://')+len(bucket_name)+1
s3_model_path=estimator.model_data[init_model_path:]
s3_output_data=output_data_uri[init_model_path:]+'/{}/output/output.tar.gz'.format(job_name)
print('Dir to download trained model: ', s3_model_path)
print('Dir to download model outputs: ', s3_output_data)

Dir to download trained model:  transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/output/model.tar.gz
Dir to download model outputs:  transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/output/output.tar.gz
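
The cell above derives the S3 keys by slicing the URI at a computed offset, which only works while the URI really starts with `s3://` followed by `bucket_name`. As a minimal sketch of an alternative, the standard library can split any S3 URI into its bucket and key. The helper name `split_s3_uri` and the example URI are assumptions made for illustration; they are not part of the original notebook.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into the tuple (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('Not an S3 URI: {}'.format(s3_uri))
    # netloc holds the bucket; path holds the key with a leading slash
    return parsed.netloc, parsed.path.lstrip('/')


# Example with a placeholder URI (bucket and key are made up)
bucket, key = split_s3_uri('s3://my-bucket/transformer-nmt/my-job/output/model.tar.gz')
print(bucket)  # my-bucket
print(key)     # transformer-nmt/my-job/output/model.tar.gz
```

With such a helper, `split_s3_uri(estimator.model_data)` would give the bucket and key to pass to `download_data`, and a malformed URI fails loudly instead of producing a silently wrong key.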


In [19]:
sagemaker_session.download_data(trainedmodel_path,bucket_name,s3_model_path)

In [20]:
sagemaker_session.download_data(output_data_path,bucket_name,s3_output_data)

Next, extract the contents of the `model.tar.gz` file returned by the SageMaker training job:

In [21]:
!tar -zxvf $trainedmodel_path/model.tar.gz

transformer.data-00000-of-00002
transformer.index
transformer.data-00001-of-00002
checkpoint


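Extract the files from `output.tar.gz` into the working directory. The `--strip-components=1` option, which would drop the leading directory from each path, is left commented out because, as the listing below shows, the archive stores its files at the top level.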

In [28]:
!tar -xvzf $output_data_path/output.tar.gz #--strip-components=1

out_vocab.pkl
model_info.pth
in_vocab.pkl


### Import the tensorflow model and load the model

We import the `model.py` file with our model definition, but because only the weights of the model were saved, we need to rebuild the model first. The model parameters were saved during training in the `model_info.pth` file; we just read that file and use the parameters to instantiate an empty model. Then we load the weights into that instance with `load_weights()`.


In [29]:
import pickle

from train.model import Transformer

# Read the model parameters, saved during training as a pickled dictionary
with open(model_info_file, 'rb') as f:
    model_info = pickle.load(f)
print('Model parameters',model_info)

# Create an instance of the Transformer model from the saved parameters
transformer = Transformer(vocab_size_enc=model_info['vocab_size_enc'],
                          vocab_size_dec=model_info['vocab_size_dec'],
                          d_model=model_info['d_model'],
                          n_layers=model_info['n_layers'],
                          FFN_units=model_info['ffn_dim'],
                          n_heads=model_info['n_heads'],
                          dropout_rate=model_info['drop_rate'])

#Load the saved model
# To do: Use variable to store the model name and pass it in as a hyperparameter of the estimator
transformer.load_weights('transformer')

Model parameters {'vocab_size_enc': 11460, 'vocab_size_dec': 9383, 'sos_token_input': [11458], 'eos_token_input': [11459], 'sos_token_output': [9381], 'eos_token_output': [9382], 'n_layers': 4, 'd_model': 64, 'ffn_dim': 128, 'n_heads': 8, 'drop_rate': 0.1}


<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f00f7b045f8>

## Make some predictions

Now everything is ready to make predictions with our trained model:
- Import the `predict.py` file with the functions to make a prediction and to translate a sentence. The code was described in the original post.
- Read the files and load the tokenizers for the input and output sentences.
- Call the `translate` function with the model, the sentence to translate, the tokenizers, the maximum length of the output, and the `sos` and `eos` tokens. It returns the predicted translation detokenized, as plain text.

In [30]:
# Install the library necessary to tokenize the sentences
!pip install tensorflow-datasets

Collecting tensorflow-datasets
  Downloading tensorflow_datasets-4.1.0-py3-none-any.whl (3.6 MB)
Collecting dataclasses; python_version < "3.7"
  Downloading dataclasses-0.8-py3-none-any.whl (19 kB)
Collecting importlib-resources; python_version < "3.9"
  Downloading importlib_resources-3.3.0-py2.py3-none-any.whl (26 kB)
Collecting tensorflow-metadata
  Downloading tensorflow_metadata-0.25.0-py3-none-any.whl (44 kB)
Collecting promise
  Downloading promise-2.3.tar.gz (19 kB)
Collecting dill
  Downloading dill-0.3.3-py2.py3-none-any.whl (81 kB)
Collecting typing-extensions; python_version < "3.8"
  Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Collecting googleapis-common-protos<2,>=1.52.0
  Downloading googleapis_common_protos-1.52.0-py2.py3-none-any.whl (100

In [31]:
from serve.predict import translate
import tensorflow_datasets as tfds

Load the input and output tokenizers (vocabularies) used during training. We need them to encode and decode the sentences.

In [32]:
# Load the pickled tokenizers for the input and output languages
with open(input_vocab_file, 'rb') as f:
    tokenizer_inputs = pickle.load(f)

with open(output_vocab_file, 'rb') as f:
    tokenizer_outputs = pickle.load(f)


In [33]:
#Show some translations
sentence = "you should pay for it."
print("Input sentence: {}".format(sentence))
predicted_sentence = translate(transformer,sentence,tokenizer_inputs, tokenizer_outputs,15,model_info['sos_token_input'],
                               model_info['eos_token_input'],model_info['sos_token_output'],
                               model_info['eos_token_output'])
print("Output sentence: {}".format(predicted_sentence))

Input sentence: you should pay for it.
Output sentence: Deberías pagar por ello.


In [34]:
#Show some translations
sentence = "This is a really powerful method!"
print("Input sentence: {}".format(sentence))
predicted_sentence = translate(transformer,sentence,tokenizer_inputs, tokenizer_outputs,15,model_info['sos_token_input'],
                               model_info['eos_token_input'],model_info['sos_token_output'],
                               model_info['eos_token_output'])
print("Output sentence: {}".format(predicted_sentence))

Input sentence: This is a really powerful method!
Output sentence: ¡Esto es un montón de las carreras de las ocho!


# Resume training from a checkpoint

Sometimes we need to stop training, perhaps to investigate the model's performance or to allocate more resources to the project. Afterwards we need to resume training: restore the model and optimizer states and continue for some more epochs to reach a final model with better performance.

To support this scenario, the estimator parameters `checkpoint_local_path` and `checkpoint_s3_uri` are especially relevant. The first is the local path, inside the container, where the algorithm writes its checkpoints. SageMaker continually persists all files under this path to `checkpoint_s3_uri` during training. On job startup the reverse happens: data from the S3 location is downloaded to this path before the algorithm starts. If the local path is not set, SageMaker assumes the checkpoints are written under `/opt/ml/checkpoints/`. With this feature we can resume training from the last checkpoint (or an earlier one).

For this purpose, we set the hyperparameter `resume` to `True` and call `fit` on the estimator to run another training job.

Load the experiment and trial created in a previous run or create a new one:

In [73]:
# Set the experiment name
experiment_name='tf-transformer'
# Set the trial name 
trial_name="{}-{}".format(experiment_name,'single-gpu')
trial_comp_name = 'single-gpu-training-job'

tags = [{'Key': 'my-experiments', 'Value': 'transformerEngSpa1'}]

In [74]:
# Load the experiment, or create it if it doesn't exist
try:
    experiment = Experiment.load(experiment_name=experiment_name)
    print('Load the experiment')
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        experiment = Experiment.create(experiment_name=experiment_name)
        print('Create the experiment')
    else:
        # Re-raise unexpected errors instead of leaving `experiment` undefined
        raise


# Load the trial, or create it if it doesn't exist
try:
    trial = Trial.load(trial_name=trial_name)
    print('Load the trial')
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        trial = Trial.create(experiment_name=experiment_name, trial_name=trial_name)
        print('Create the trial')
    else:
        # Re-raise unexpected errors instead of leaving `trial` undefined
        raise

Load the experiment
Load the trial


In [75]:
# Set the configuration parameters for the experiment
experiment_config = {'ExperimentName': experiment.experiment_name, 
                       'TrialName': trial.trial_name,
                       'TrialComponentDisplayName': trial_comp_name}

Create an Estimator for a TensorFlow 2.1 model and set the hyperparameter `resume` to `True`, so that the training script restores the latest checkpoint and resumes training for the selected number of epochs:

In [17]:
#instance_type='ml.m5.xlarge'
#instance_type='ml.m4.4xlarge'
instance_type='ml.p2.xlarge'
#instance_type='local'

# Define the metrics to search for
metric_definitions = [{'Name': 'loss', 'Regex': 'Loss ([0-9\\.]+)'},{'Name': 'Accuracy', 'Regex': 'Accuracy ([0-9\\.]+)'}]
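
Before launching a job, it can be useful to check locally that each regular expression captures the intended number. The sketch below applies the two definitions above to a log line in the format the training script prints. The helper `extract_metrics` is an assumption made for this illustration; it only approximates how SageMaker scans the logs.

```python
import re

metric_definitions = [
    {'Name': 'loss', 'Regex': 'Loss ([0-9\\.]+)'},
    {'Name': 'Accuracy', 'Regex': 'Accuracy ([0-9\\.]+)'},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: value} for every definition whose regex matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition['Regex'], log_line)
        if match:
            # The first capture group holds the metric value
            found[definition['Name']] = float(match.group(1))
    return found


sample = 'Epoch 1 Batch 100 Loss 0.7395 Accuracy 0.3574'
print(extract_metrics(sample, metric_definitions))
# {'loss': 0.7395, 'Accuracy': 0.3574}
```

Each regex must contain exactly one capture group around the numeric value, and the text before the group (`Loss `, `Accuracy `) has to match what the script prints, including capitalization.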

In [18]:
# Create an estimator with the hyperparameter resume = True
estimator = TensorFlow(entry_point='train.py',
                       source_dir='train',
                       role=role,
                       instance_count=1,
                       instance_type=instance_type,
                       framework_version='2.1.0',
                       py_version='py3',
                       output_path=output_data_uri,
                       code_location=output_data_uri,
                       base_job_name='tf-transformer',
                       script_mode= True,
                       checkpoint_s3_uri = ckpt_data_uri,
                       metric_definitions = metric_definitions, 
                       hyperparameters={
                        'epochs': 5,
                        'nsamples': 60000,
                        'resume': True,
                        'train_file': 'spa.txt',
                        'non_breaking_in': 'nonbreaking_prefix.en',
                        'non_breaking_out': 'nonbreaking_prefix.es'
                       })
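
In script mode the hyperparameters reach the entry point as command-line arguments, so `'resume': True` arrives as text rather than as a Python boolean. How `train.py` parses it is not shown in this section. The sketch below is one possible way to do it, with hypothetical helper names `str2bool` and `parse_args`. It avoids the common pitfall of `type=bool`, which treats any non-empty string, including `'False'`, as true.

```python
import argparse


def str2bool(value):
    """Convert the text form of a boolean hyperparameter into a bool."""
    if isinstance(value, bool):
        return value
    if value.lower() in ('true', '1', 'yes'):
        return True
    if value.lower() in ('false', '0', 'no'):
        return False
    raise argparse.ArgumentTypeError('Boolean value expected, got {!r}'.format(value))


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--resume', type=str2bool, default=False)
    # parse_known_args ignores the other hyperparameters passed to the script
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(['--epochs', '5', '--resume', 'True'])
print(args.epochs, args.resume)  # 5 True
```

Inside the training script, `args.resume` can then decide whether to restore the latest checkpoint from the local checkpoint path before the training loop starts.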

In [19]:
# Set the job name and show it
job_name = '{}-{}'.format(trial_name,time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()))
print(job_name)

tf-transformer-single-gpu-2020-11-12-18-36-15


In [20]:
# Fit or train the model from the latest checkpoint
estimator.fit({'training':training_data_uri}, job_name = job_name, 
              experiment_config = experiment_config)

INFO:sagemaker:Creating training-job with name: tf-transformer-single-gpu-2020-11-12-18-36-15


2020-11-12 18:36:34 Starting - Starting the training job...
2020-11-12 18:36:39 Starting - Launching requested ML instances......
2020-11-12 18:37:59 Starting - Preparing the instances for training.........
2020-11-12 18:39:13 Downloading - Downloading input data......
2020-11-12 18:40:24 Training - Downloading the training image......
2020-11-12 18:41:37 Training - Training image download completed. Training in progress...
2020-11-12 18:41:43,057 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2020-11-12 18:41:43,563 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "resume": true,
        "non_brea

Successfully installed attrs-20.3.0 dataclasses-0.7 dill-0.3.3 future-0.18.2 googleapis-common-protos-1.52.0 importlib-resources-3.3.0 promise-2.3 tensorflow-datasets-4.1.0 tensorflow-metadata-0.25.0 tqdm-4.51.0 typing-extensions-3.7.4.3
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
/opt/ml/model s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/model
Get the train data
Tokenize the input and output data and create the vocabularies
Input vocab:  11460
Output vocab:  9383
Creating the checkpoint ...
Last checkpoint restored.
Training the model ....
Starting epoch 1
Epoch 1 Batch 0 Loss 0.7465 Accuracy 0.3560
Epoch 1 Batch 100 Loss 0.7395 Accuracy 0.3574
Epoch 1 Batch 200 Loss 0.7495 Accuracy 0.3559
Epoch 1 Batch 300 Loss 0.7552 Accuracy 0.3551
Epoch 1 Batch 4

This training job returns a new trained model, which you can download to make predictions as described in an earlier section.

## Delete the experiment

In [31]:
# Delete the experiment loaded above, together with its trials and trial components
experiment.delete_all(action="--force")

# References

- Reference for experiments and trials:
https://github.com/shashankprasanna/sagemaker-training-tutorial/blob/master/sagemaker-training-tutorial.ipynb
