# Using TensorFlow in SageMaker - Quickstart

Starting with TensorFlow framework version 1.11, you can use the SageMaker TensorFlow Container to train any TensorFlow script.

For this example, you use [Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using Tensorflow](https://github.com/sherjilozair/char-rnn-tensorflow). You can use the same technique for other scripts or repositories, for example the [TensorFlow Model Zoo](https://github.com/tensorflow/models) and the [TensorFlow benchmark scripts](https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks).

## Get the data
For training data, use plain text versions of Sherlock Holmes stories.

In [None]:
import os
data_dir = os.path.join(os.getcwd(), 'sherlock')

os.makedirs(data_dir, exist_ok=True)

!wget https://sherlock-holm.es/stories/plain-text/cnus.txt --force-directories --output-document=sherlock/input.txt

## Preparing the training script

Let's start by cloning the repository that contains the example:

In [None]:
!git clone https://github.com/sherjilozair/char-rnn-tensorflow

This repository includes a [README.md](https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/README.md#basic-usage) with an overview of the project, requirements, and basic usage:

> #### **Basic Usage**
> _To train with default parameters on the tinyshakespeare corpus, run **python train.py**. 
To access all the parameters use **python train.py --help.**_

[train.py](https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/train.py#L11) uses the [argparse](https://docs.python.org/3/library/argparse.html) library and defines, among others, the following arguments:

```python
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# Data and model checkpoints directories
parser.add_argument('--data_dir', type=str, default='data/tinyshakespeare', help='data directory containing input.txt with training examples')
parser.add_argument('--save_dir', type=str, default='save', help='directory to store checkpointed models')
...
args = parser.parse_args()

```
When SageMaker training finishes, it deletes all data generated inside the container, with the exception of the directories **_/opt/ml/model_** and **_/opt/ml/output_**. To ensure that model data is not lost, SageMaker invokes training scripts with an additional argument, **--model_dir**, that the training script needs to handle.

We need to replace the argument **--save_dir** with the required argument **--model_dir**: 

In [None]:
# this command replaces save_dir with model_dir in the training script
!sed -i 's/save_dir/model_dir/g' char-rnn-tensorflow/train.py
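For scripts you write yourself, the same requirement can be met directly in the argument parser. The snippet below is a minimal sketch, not the repository's code: `build_parser` is a hypothetical helper, and the `/opt/ml/model` default is an assumption based on the directory named above.

```python
import argparse


def build_parser():
    """Hypothetical parser that accepts the --model_dir argument SageMaker passes."""
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--data_dir', type=str, default='data/tinyshakespeare',
                        help='data directory containing input.txt with training examples')
    # Takes the place of --save_dir: data written here is kept after the job ends.
    parser.add_argument('--model_dir', type=str, default='/opt/ml/model',
                        help='directory to store checkpointed models')
    return parser


# parse_known_args collects undeclared arguments instead of raising an error.
args, unknown = build_parser().parse_known_args(
    ['--data_dir', '/opt/ml/input/data/training', '--model_dir', '/opt/ml/model'])
print(args.data_dir, args.model_dir)
```

Using `parse_known_args` is a design choice for this sketch: it keeps the script from failing if it receives arguments it does not declare.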

Now the training script can be executed in the container as shown below:

```bash
python train.py --num_epochs 1 --data_dir /opt/ml/input/data/training --model_dir /opt/ml/model
```

## Test locally using SageMaker Python SDK TensorFlow Estimator

You can use the SageMaker Python SDK [TensorFlow](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/README.rst#training-with-tensorflow) estimator to easily train locally and in SageMaker. To train locally, set the instance type to [local](https://github.com/aws/sagemaker-python-sdk#local-mode) as follows:

In [None]:
import sagemaker
from sagemaker.tensorflow import TensorFlow

# sets the script arguments --num_epochs and --data_dir
hyperparameters = {'num_epochs': 1, 
                   'data_dir': '/opt/ml/input/data/training'}

estimator = TensorFlow(entry_point='train.py',
                       source_dir='char-rnn-tensorflow',
                       train_instance_type='local',      # Run in local mode
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(), # Passes to the container the AWS role that you are using on this notebook
                       framework_version='1.11.0', # Uses TensorFlow 1.11
                       py_version='py3',
                       script_mode=True)
             

# Starts training and creates a data channel named training with the contents of
# data_dir, available in the container under /opt/ml/input/data/training
estimator.fit({'training': f'file://{data_dir}'})

## How Script Mode executes the script in the container

The cell above downloads the CPU version of the SageMaker TensorFlow container (Python 3) to the local machine and simulates a SageMaker training job.
When training starts, SageMaker TensorFlow executes **train.py**, passing **hyperparameters** and **model_dir** as script arguments. The example above is executed as follows:
```bash
python -m train --num_epochs 1 --data_dir /opt/ml/input/data/training --model_dir /opt/ml/model
```

Let's explain the values of **--data_dir** and **--model_dir** in more detail:

- **/opt/ml/input/data/training** is the directory inside the container where the training data is downloaded. The data is downloaded to this folder because **training** is the channel name defined in ```estimator.fit({'training': f'file://{data_dir}'})```. See [training data](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-running-container-trainingdata) for more information. 

- **/opt/ml/model** is the directory to use for saving models, checkpoints, or any other data. Any data saved in this folder is uploaded to the S3 bucket defined for training. See [model data](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-envvariables) for more information.
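The mapping from the estimator's hyperparameters to the command line shown above can be sketched as follows. This is only an illustration of the idea, not the container's actual implementation; `to_script_args` and `example_hyperparameters` are hypothetical names.

```python
def to_script_args(hyperparameters, model_dir='/opt/ml/model'):
    """Hypothetical helper: flatten hyperparameters into '--name value' pairs."""
    args = []
    for name, value in hyperparameters.items():
        args.extend([f'--{name}', str(value)])
    # --model_dir is appended so the script knows where to save its output.
    args.extend(['--model_dir', model_dir])
    return args


example_hyperparameters = {'num_epochs': 1,
                           'data_dir': '/opt/ml/input/data/training'}
command = ['python', '-m', 'train'] + to_script_args(example_hyperparameters)
print(' '.join(command))
```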

### Reading additional information from the container

Often, a user script needs additional information from the container that is not available in ```hyperparameters```.
SageMaker containers write this information as **environment variables** that are available inside the script.

For example, the example above can read information about the **training** channel provided in the training job request by adding the environment variable `SM_CHANNEL_TRAINING` as the default value for the `--data_dir` argument:

```python
import argparse
import os

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  # reads the location of the training channel from the environment variable
  parser.add_argument('--data_dir', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
```

Script mode displays the list of available environment variables in the training logs. You can find the [entire list here](https://github.com/aws/sagemaker-containers/blob/master/README.md#environment-variables-full-specification).
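A slightly fuller sketch of the same idea is shown below. It assumes the variable `SM_MODEL_DIR` from the environment variable list linked above, and it falls back to local paths so the script also runs outside a container. `parse_args`, the `environ` parameter, and the fallback values `data` and `save` are hypothetical choices for illustration.

```python
import argparse
import os


def parse_args(argv=None, environ=os.environ):
    """Hypothetical parser whose defaults come from SageMaker environment variables."""
    parser = argparse.ArgumentParser()
    # Fall back to local paths when the variables are unset (outside a container).
    parser.add_argument('--data_dir', type=str,
                        default=environ.get('SM_CHANNEL_TRAINING', 'data'))
    parser.add_argument('--model_dir', type=str,
                        default=environ.get('SM_MODEL_DIR', 'save'))
    return parser.parse_args(argv)


# Simulated container environment, for illustration only.
simulated_env = {'SM_CHANNEL_TRAINING': '/opt/ml/input/data/training',
                 'SM_MODEL_DIR': '/opt/ml/model'}
container_args = parse_args([], environ=simulated_env)
local_args = parse_args([], environ={})
print(container_args.data_dir, container_args.model_dir)
print(local_args.data_dir, local_args.model_dir)
```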

## Training in SageMaker

After you test the training job locally, upload the dataset to an S3 bucket so SageMaker can access the data during training:

In [None]:
import sagemaker

training_data = sagemaker.Session().upload_data(path='sherlock', key_prefix='datasets/sherlock')

The variable `training_data` returned above is a string with the S3 location that SageMaker Training has permission to read the data from. **This upload approach is meant for educational purposes; larger datasets require more robust solutions:**

In [None]:
training_data
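If you need the bucket and key prefix separately, for example to inspect the uploaded files, an S3 location string can be split with the standard library. This is a small sketch; `split_s3_uri` is a hypothetical helper and `example-bucket` is a made-up bucket name.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix' into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3':
        raise ValueError(f'not an S3 URI: {uri}')
    return parsed.netloc, parsed.path.lstrip('/')


# Hypothetical bucket name, used only for illustration.
bucket, prefix = split_s3_uri('s3://example-bucket/datasets/sherlock')
print(bucket, prefix)
```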

To train in SageMaker:
- Change the estimator argument **train_instance_type** to any SageMaker ml instance type available for training.
- Set the **training** channel to an S3 location.

For example:

In [None]:
estimator = TensorFlow(entry_point='train.py',
                       source_dir='char-rnn-tensorflow',
                       train_instance_type='ml.c4.xlarge', # Executes training in a ml.c4.xlarge instance
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       framework_version='1.11.0',
                       py_version='py3',
                       script_mode=True)
             

# Starts training and creates a data channel named training
# with the contents of the S3 location in training_data
estimator.fit({'training': training_data})

## Using script launchers for training
In our previous example, we edited **train.py** to train in SageMaker. Another option is to provide a launcher script as follows:

In [None]:
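%%writefile char-rnn-tensorflow/launcher.sh
# this cell creates a file named launcher.sh under char-rnn-tensorflow
# --save_dir assumes the original train.py; use --model_dir if the sed edit above was applied
python train.py --data_dir /opt/ml/input/data/training --save_dir /opt/ml/model --num_epochs 1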

Then we create a **`TensorFlow`** estimator using **launcher.sh** as the entry point. Notice that the hyperparameters are no longer necessary:

In [None]:
estimator = TensorFlow(entry_point='launcher.sh',
                       source_dir='char-rnn-tensorflow',
                       train_instance_type='local',
                       train_instance_count=1,
                       role=sagemaker.get_execution_role(),
                       framework_version='1.11.0',
                       py_version='py3',
                       script_mode=True)
             

estimator.fit({'training': training_data})

Launcher scripts are executed with the same arguments as Python scripts and have access to the same environment variables:

In [None]:
%%writefile char-rnn-tensorflow/launcher.sh

# prints SageMaker environment variables before training
printenv | grep SM_

# invokes the training script using values from environment variables
python train.py --data_dir ${SM_CHANNEL_TRAINING} --save_dir ${SM_MODEL_DIR} --num_epochs ${SM_HP_EPOCHS}

Let's detail the environment variables used above:

- **`SM_MODEL_DIR`**: the directory inside the container where the model data must be saved, i.e. **/opt/ml/model**.
- **`SM_CHANNEL_TRAINING`**: the directory inside the container where training data from the channel **training** is available, i.e. **/opt/ml/input/data/training**.
- **`SM_HP_EPOCHS`**: value of the hyperparameter named **epochs**. This hyperparameter needs to be provided by the user.
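A Python script can pick up hyperparameters from the environment in the same way. The sketch below assumes the naming convention suggested by the example above, `SM_HP_` followed by the upper-cased hyperparameter name; `hyperparameters_from_env` is a hypothetical helper, and inside a container you would pass `os.environ` instead of the simulated dictionary.

```python
def hyperparameters_from_env(environ):
    """Collect SM_HP_* variables into a dict keyed by lower-cased name."""
    prefix = 'SM_HP_'
    return {name[len(prefix):].lower(): value
            for name, value in environ.items()
            if name.startswith(prefix)}


# Simulated environment, for illustration only.
simulated_env = {'SM_HP_EPOCHS': '1', 'PATH': '/usr/bin'}
hps = hyperparameters_from_env(simulated_env)
print(hps)
```

Environment variable values are always strings, so numeric hyperparameters need an explicit conversion such as `int(hps['epochs'])`.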

Launcher scripts can be handy in multiple scenarios, including:

## Installing requirements before training


```bash
# launcher.sh

apt-get install cowsay -y
pip install tensorflow==1.12
```

```python
TensorFlow(entry_point='launcher.sh', ...).fit()
```

## Training using different programming languages
The example below installs **TensorFlow for C** and executes a hello world program. The cells write files to a **tf_c** directory, which must exist before they run:

In [None]:
%%writefile tf_c/hello_tf.c

#include <stdio.h>
#include <tensorflow/c/c_api.h>

int main() {

    printf("Hello from TensorFlow C library version %s", TF_Version());
    return 0;
}

In [None]:
%%writefile tf_c/launcher.sh

wget -q -t 3 https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-cpu-linux-x86_64-1.11.0.tar.gz
    
tar -xzvf libtensorflow-cpu-linux-x86_64-1.11.0.tar.gz -C /usr/local

ldconfig

gcc -I/usr/local/include -L/usr/local/lib hello_tf.c -ltensorflow -o hello_tf

./hello_tf

In [None]:
tf_c_estimator = TensorFlow(entry_point='launcher.sh',
                            source_dir='tf_c',
                            train_instance_type='local',
                            train_instance_count=1,
                            role=sagemaker.get_execution_role(),
                            framework_version='1.11.0',
                            py_version='py3',
                            script_mode=True)
             

tf_c_estimator.fit()

## Accessing training output information
**`estimator.output_path`** contains the S3 location where training outputs are stored after training. The training output includes data from the special directories **/opt/ml/output** and **/opt/ml/model**.

### Accessing training model data
Model data written to **/opt/ml/model** during training is compressed in a tar file named **model.tar.gz** and uploaded to S3. **`estimator.model_data`** returns the location of **model.tar.gz**.
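Once **model.tar.gz** has been downloaded, it can be unpacked with the Python standard library. The sketch below builds a tiny stand-in archive so that it runs without S3 access; `extract_model` and the file name `checkpoint` are hypothetical.

```python
import os
import tarfile
import tempfile


def extract_model(archive_path, target_dir):
    """Extract a model.tar.gz archive and return the extracted file names."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, 'r:gz') as archive:
        archive.extractall(target_dir)
    return sorted(os.listdir(target_dir))


# Build a small stand-in archive; a real one would come from S3.
workdir = tempfile.mkdtemp()
checkpoint = os.path.join(workdir, 'checkpoint')
with open(checkpoint, 'w') as f:
    f.write('placeholder')
archive_path = os.path.join(workdir, 'model.tar.gz')
with tarfile.open(archive_path, 'w:gz') as archive:
    archive.add(checkpoint, arcname='checkpoint')

extracted = extract_model(archive_path, os.path.join(workdir, 'extracted'))
print(extracted)
```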

### Running multiple training jobs in parallel
```estimator.fit``` waits for the training job to end by default and streams the training logs. You can run training jobs asynchronously by setting the argument ```wait``` to ```False```:

```python
estimator.fit(inputs={'training': training_data}, wait=False)
```

### Retrieving training job information
You can use ```TensorFlow.attach(training_job_name)``` to attach an estimator to an existing training job and retrieve its information:

```python
estimator = TensorFlow.attach('my-training-job')
```
You can use the AWS CLI to access training job information from the command line:

In [None]:
!aws sagemaker describe-training-job --training-job-name {estimator.latest_training_job.job_name}

You can also use the [AWS console](https://console.aws.amazon.com/sagemaker/home?#/jobs) to access information about training jobs.

Another option is the boto3 library:

In [None]:
import boto3

sagemaker_client = boto3.Session().client('sagemaker')

sagemaker_client.describe_training_job(TrainingJobName=estimator.latest_training_job.job_name)
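When several jobs run asynchronously, the same call can be used to poll them until they finish. The function below is a sketch under stated assumptions: `wait_for_jobs` is a hypothetical helper, the 30-second default interval is arbitrary, and the set of terminal states reflects the `TrainingJobStatus` values returned by `describe_training_job`.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATES = {'Completed', 'Failed', 'Stopped'}


def wait_for_jobs(client, job_names, poll_seconds=30):
    """Poll describe_training_job until every job reaches a terminal state."""
    statuses = {}
    pending = set(job_names)
    while pending:
        for name in sorted(pending):
            response = client.describe_training_job(TrainingJobName=name)
            statuses[name] = response['TrainingJobStatus']
            if statuses[name] in TERMINAL_STATES:
                pending.discard(name)
        if pending:
            time.sleep(poll_seconds)
    return statuses
```

A call such as `wait_for_jobs(sagemaker_client, [estimator.latest_training_job.job_name])` would return a dictionary mapping each job name to its final status.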

AWS provides SDKs with SageMaker support for most common languages, including C++, C#, Java, and JavaScript.

## Using boto3 and the command line
Other useful actions accessible from boto3 and the AWS CLI include:

### Stopping a training job
**aws cli**

In [None]:
!aws sagemaker stop-training-job --training-job-name {estimator.latest_training_job.job_name}

**boto3**

In [None]:
sagemaker_client.stop_training_job(TrainingJobName=estimator.latest_training_job.job_name)

### Listing training jobs

**aws cli**

In [None]:
!aws sagemaker list-training-jobs

**boto3**

In [None]:
sagemaker_client.list_training_jobs()
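A single call returns only one page of results. The sketch below follows the `NextToken` field of the response to walk through all pages; `iter_training_job_names` is a hypothetical helper, and any keyword arguments are passed through to `list_training_jobs` unchanged.

```python
def iter_training_job_names(client, **filters):
    """Yield the name of every training job, following NextToken across pages."""
    kwargs = dict(filters)
    while True:
        response = client.list_training_jobs(**kwargs)
        for summary in response['TrainingJobSummaries']:
            yield summary['TrainingJobName']
        token = response.get('NextToken')
        if not token:
            break
        # Ask for the page that follows the one just processed.
        kwargs['NextToken'] = token
```

For example, `list(iter_training_job_names(sagemaker_client, StatusEquals='InProgress'))` would collect the names of all running jobs.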

See the AWS CLI reference for SageMaker for a complete list of commands, or type **`aws sagemaker help`** in the command line.

See the boto3 SageMaker reference for the complete documentation.

## Using the command line for training with SageMaker Tensorflow container

In [None]:
%%writefile train.sh
#!/usr/bin/env bash

# Usage:
#   train.sh <TRAINING_IMAGE> <JOB_NAME> <ROLE_ARN> <TRAINING_DATA> <DEFAULT_BUCKET> <SOURCE_DIR> <ENTRY_POINT>
#

TRAINING_IMAGE=$1
JOB_NAME=$2
ROLE_ARN=$3
TRAINING_DATA=$4
DEFAULT_BUCKET=$5
SOURCE_DIR=$6
ENTRY_POINT=$7

# packages the contents of the source directory
tar -czvf source.tar.gz -C "${SOURCE_DIR}" .

# uploads the package to S3; this location is an assumption,
# any S3 path that the role can read should work
SOURCE_S3_URI="s3://${DEFAULT_BUCKET}/${JOB_NAME}/source/source.tar.gz"
aws s3 cp source.tar.gz "${SOURCE_S3_URI}"

# shorthand syntax: no spaces around '=' or after ','
INPUTS="ChannelName=training,\
DataSource={S3DataSource={S3DataType=S3Prefix,S3Uri=${TRAINING_DATA},S3DataDistributionType=FullyReplicated}},\
CompressionType=None,\
RecordWrapperType=None,\
InputMode=File"

ALGO_SPEC="TrainingImage=${TRAINING_IMAGE},TrainingInputMode=File"

aws sagemaker create-training-job \
--training-job-name "${JOB_NAME}" \
--role-arn "${ROLE_ARN}" \
--algorithm-specification "${ALGO_SPEC}" \
--input-data-config "${INPUTS}" \
--output-data-config "S3OutputPath=s3://${DEFAULT_BUCKET}/${JOB_NAME}" \
--resource-config "InstanceType=ml.c4.xlarge,InstanceCount=1,VolumeSizeInGB=30" \
--stopping-condition "MaxRuntimeInSeconds=3600" \
--hyper-parameters "sagemaker_submit_directory=${SOURCE_S3_URI},sagemaker_program=${ENTRY_POINT}"

In [None]:
training_image = estimator.train_image()
role = sagemaker.get_execution_role()
default_bucket = sagemaker.Session().default_bucket()

!bash train.sh {training_image} tensorflow-quickstart-$(date +%s) {role} {training_data} {default_bucket} char-rnn-tensorflow train.py
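The same request can be expressed in Python for use with boto3. The sketch below only builds the request dictionary; `build_training_job_request` is a hypothetical helper, the instance settings mirror the shell script above, and the placeholder values in the example call are made up. The hyperparameter names `sagemaker_submit_directory` and `sagemaker_program` are taken from that script.

```python
def build_training_job_request(job_name, image, role_arn, training_data,
                               output_path, source_uri, entry_point):
    """Assemble keyword arguments for the boto3 create_training_job call."""
    return {
        'TrainingJobName': job_name,
        'AlgorithmSpecification': {'TrainingImage': image,
                                   'TrainingInputMode': 'File'},
        'RoleArn': role_arn,
        'InputDataConfig': [{
            'ChannelName': 'training',
            'DataSource': {'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': training_data,
                'S3DataDistributionType': 'FullyReplicated'}},
        }],
        'OutputDataConfig': {'S3OutputPath': output_path},
        'ResourceConfig': {'InstanceType': 'ml.c4.xlarge',
                           'InstanceCount': 1,
                           'VolumeSizeInGB': 30},
        'StoppingCondition': {'MaxRuntimeInSeconds': 3600},
        # Hyperparameter values must be strings.
        'HyperParameters': {'sagemaker_submit_directory': source_uri,
                            'sagemaker_program': entry_point},
    }


# Placeholder values, for illustration only.
request = build_training_job_request(
    job_name='tensorflow-quickstart-example',
    image='example-image-uri',
    role_arn='arn:aws:iam::123456789012:role/ExampleRole',
    training_data='s3://example-bucket/datasets/sherlock',
    output_path='s3://example-bucket/tensorflow-quickstart-example',
    source_uri='s3://example-bucket/source/source.tar.gz',
    entry_point='train.py')
print(sorted(request))
```

With real values, the dictionary could be passed as `sagemaker_client.create_training_job(**request)`.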