# Introduction

This notebook demonstrates how to use SageMaker with [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/ "Trainium") to train a text classification model, and then deploy the trained model on [AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/ "Inferentia"). We start with a [pretrained BERT model from Hugging Face](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France.#bert-base-model-uncased) and fine-tune it on the IMDB dataset. This dataset consists of movie reviews labeled as either positive or negative sentiment. The training job runs on an `ml.trn1` instance, which hosts the AWS Trainium accelerator. The trained model weights are then deployed to an endpoint on an `ml.inf1` instance, which hosts the AWS Inferentia accelerator. You will send a sentence to the endpoint, and the model will predict the sentiment of the input sentence.

When launching this notebook in the SageMaker Studio environment, we recommend the image `PyTorch 1.10 Python 3.8 CPU Optimized` with the `Python 3` kernel, on an `ml.t3.medium` instance.

# Train a text classification model

In this notebook, you will use SageMaker to prepare and process the training data, and then run a training job on AWS Trainium. Once the model is trained, you will deploy it to an endpoint. This endpoint accepts a sentence in plain text, and the model classifies the sentence as either positive or negative sentiment. Let's start by installing the necessary libraries and importing them into the SageMaker runtime:

In [None]:
!pip install --upgrade pip

In [None]:
!pip install --no-cache-dir datasets==2.11.0 torch==1.11.0 ipywidgets

In [None]:
!pip install --force-reinstall transformers==4.17.0

In [None]:
!pip install -U sagemaker==2.116.0

In [None]:
!pip install --upgrade s3fs botocore

In [None]:
!pip3 list | grep datasets

At the time of writing, the latest SageMaker SDK version is 2.116.0.

In [None]:
import sagemaker
import transformers
from sagemaker.pytorch import PyTorch
from datasets import load_dataset
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sagemaker import utils
import os
import boto3
import botocore
from datasets.filesystems import S3FileSystem
from pathlib import Path
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor
from datetime import datetime
import json
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
import torch
import time

In [None]:
# check SageMaker SDK version
print(sagemaker.__version__) # expect 2.116.0

In [None]:
print(transformers.__version__) # expect 4.17.0
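
Because later cells depend on specific library versions, it can help to compare all of them in one place. The following is a minimal sketch and not part of the original notebook: the helper name `report_versions` and the `EXPECTED` mapping are hypothetical, and the versions listed are simply the ones pinned by the `pip install` cells above.

```python
from importlib import metadata

# Versions pinned by the pip cells above (hypothetical mapping, for illustration only).
EXPECTED = {
    "sagemaker": "2.116.0",
    "transformers": "4.17.0",
    "datasets": "2.11.0",
    "torch": "1.11.0",
}


def report_versions(expected):
    """Return {package: (installed_version_or_None, expected_version, matches)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted, installed == wanted)
    return report


for name, (installed, wanted, ok) in report_versions(EXPECTED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: installed={installed} expected={wanted} [{status}]")
```

The comparison is an exact string match, so a build with a local suffix (for example a `torch` version ending in `+cpu`) would be reported as a mismatch even though the base version agrees.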

## Create SageMaker session

Next, create a SageMaker session and define an execution role. The default role should suffice.

In [None]:
sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

In [None]:
print(f"sagemaker bucket: {sess.default_bucket()}")

The default bucket printed above is where all the data (model artifacts, training script, and model checkpoints) will be saved. Below, we also define a few more parameters:

In [None]:
instance_count = 1
source_dir = 'train_scripts'

bucket=sagemaker.Session().default_bucket()
base_job_name="imdb-classification"
checkpoint_in_bucket="checkpoints" # dir name to hold pt file. 

# The S3 URI to store the checkpoints
checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)
print(checkpoint_s3_bucket)

## Preprocessing and tokenization

In this section, we use the Hugging Face `datasets` API to download a small text classification dataset. This is the dataset we will use to fine-tune the model. Once the dataset is downloaded, we persist it in our default S3 bucket:

In [None]:
tokenizer_name = 'bert-base-uncased'

# dataset used
dataset_name = 'imdb'

# s3 key prefix for the data
s3_prefix = 'HFDatasets/imdb'

In [None]:
dataset = load_dataset("imdb")

In [None]:
# inspect data
dataset['train'][5]

## Uploading data to `sagemaker_session_bucket`

Here, we use the `S3FileSystem` interface to connect to the default S3 bucket. This interface is passed to the `save_to_disk` API so that the dataset is stored at the designated S3 path.

In [None]:
s3 = S3FileSystem()  

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}'
dataset.save_to_disk(training_input_path, fs=s3)  # saves (or overwrites) the dataset in S3

Now that the training data is stored in S3, we are going to develop the training script and define a PyTorch Estimator to run it.

## Fine-tuning & starting a SageMaker training job

The SageMaker PyTorch estimator requires a training script to run a model training job. Below is the script for fine-tuning a pretrained Hugging Face BERT model on the dataset (IMDB movie reviews) we just put in S3.

In [None]:
!pygmentize ./train_scripts/bert_train_torchrun_trn1.py

In the training script, there are several important details worth mentioning:

1. **Distributed training (hardware)** This is an example of data-parallel distributed training. Because a `trn1` instance has multiple NeuronCores, each NeuronCore receives a copy of the model and a shard of the data. Each NeuronCore is managed by a worker that runs a copy of the training script. The gradients from all workers are aggregated and averaged, so that each worker applies exactly the same update to the model weights. Then the next training iteration begins.


2. **Distributed training (software)** A specialized backend, `torch_xla.distributed.xla_backend`, is required for PyTorch to run on an XLA device such as Trainium. In the training loop, since each worker computes its own gradients, `xm.optimizer_step(optimizer)` makes sure all workers apply the same gradient update before the next training iteration.

3. **Bring your own training data** Hugging Face provides the `load_from_disk` API to load a dataset previously written with `save_to_disk`. SageMaker sets the environment variable `SM_CHANNEL_TRAIN` to the location, inside the training container, of the `train` channel data that we uploaded to S3 in a previous cell. Thus `SM_CHANNEL_TRAIN` is used as the input to the `load_from_disk` API.

4. **Persist trained weights** Trained weights can be used on different hardware targets. The training script writes them to the directory given by the environment variable `SM_MODEL_DIR`. When the job completes, SageMaker packages the contents of that directory and uploads them to S3, here under the session's default bucket.
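
The points above can be illustrated with a small plain-Python sketch. It is a toy model of the idea and not the code in `bert_train_torchrun_trn1.py`: the function names are hypothetical, gradients are plain lists rather than tensors, and the averaging that XLA performs across NeuronCores is replaced by an ordinary loop. The fallback paths are the standard SageMaker container locations and are used only when the environment variables are absent.

```python
import os


def average_gradients(per_worker_grads):
    """Element-wise mean of the gradients computed by each worker.

    per_worker_grads: one equal-length list of floats per worker.
    """
    num_workers = len(per_worker_grads)
    return [sum(values) / num_workers for values in zip(*per_worker_grads)]


def sgd_step(weights, grads, lr):
    """Plain SGD update; every worker applies it with the same averaged grads."""
    return [w - lr * g for w, g in zip(weights, grads)]


def resolve_sagemaker_dirs(env=os.environ):
    """Read the training-data and model directories set by SageMaker."""
    train_dir = env.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    model_dir = env.get("SM_MODEL_DIR", "/opt/ml/model")
    return train_dir, model_dir


# Two workers hold identical weights but see different data shards,
# so they compute different local gradients.
weights = [1.0, 2.0]
grads = average_gradients([[0.5, 1.0], [1.5, 3.0]])
new_weights = sgd_step(weights, grads, lr=0.5)
print(grads)        # [1.0, 2.0]
print(new_weights)  # [0.5, 1.0]
```

Because every worker applies the same averaged gradient to the same starting weights, the model copies stay identical from one iteration to the next.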

In [None]:
instance_count = 1
num_cores = 2
source_dir = 'train_scripts'
date_string = datetime.now().strftime("%Y%m%d-%H%M%S")

bucket=sagemaker.Session().default_bucket()
base_job_name="imdb-classification-" + str(date_string)
checkpoint_in_bucket="checkpoints" # dir name to hold pt file. 

# The S3 URI to store the checkpoints
checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)
print(checkpoint_s3_bucket)

In [None]:
pt_estimator = PyTorch(
    entry_point="bert_train_torchrun_trn1.py", # Specify your train script
    source_dir="train_scripts",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.trn1.2xlarge',
    framework_version='1.11.0',
    py_version='py38',
    disable_profiler=True,
    output_path=checkpoint_s3_bucket,
    base_job_name=base_job_name,
    
    # Parameters required to enable checkpointing
    checkpoint_s3_uri=checkpoint_s3_bucket,
    volume_size=512,
    distribution={
        "torch_distributed": {
            "enabled": True
        }
    }
)

pt_estimator.fit({'train': training_input_path}) 

To find the S3 path where the trained weights are stored:

In [None]:
model_path_uri = pt_estimator.model_data 
print(model_path_uri)

Now that training is done and the model weights are stored in S3, the rest of this notebook focuses on inference. The trained weights file is compressed and stored at the S3 path shown above. It is a dictionary whose keys match those of the original Hugging Face BERT model. This means the trained weights can run on any hardware platform, as long as the Hugging Face library is installed.

# Inference using trained model

In order to deploy the trained model on AWS Inferentia, we need to complete the following steps. They are needed to optimize the model for better performance on Inferentia.

1. Trace the trained model. The model must first be converted to TorchScript format.
2. Understanding the inference script.
3. Compile and deploy the model using Amazon SageMaker endpoints.
4. Test the model.


## 1. Trace the trained model

We have created a helper function that loads the trained model and traces it. Tracing a model into TorchScript (`.pth`) requires the model and an `example_input`, which is a tensor or a tuple of tensors. `torch.jit.trace` records all the tensor ops executed during the forward pass. The result is a TorchScript file: a model graph that captures the sequence of ops applied from the input tensor through to the output nodes.

This step is required for SageMaker Neo to compile our model artifact. Neo takes a `tar.gz` file containing the traced model.

Note: The `.pth` extension when saving our model is required.
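
To make the idea of "recording ops during the forward pass" concrete, here is a toy tracer in plain Python. It is only an illustration of the concept and not how PyTorch implements tracing: `TracedValue`, `toy_trace`, and `replay` are hypothetical names, and numbers stand in for tensors.

```python
class TracedValue:
    """Wraps a number and logs every arithmetic op applied to it."""

    def __init__(self, value, log):
        self.value = value
        self.log = log

    def _record(self, op, operand, result):
        self.log.append((op, operand))
        return TracedValue(result, self.log)

    def __add__(self, other):
        return self._record("add", other, self.value + other)

    def __mul__(self, other):
        return self._record("mul", other, self.value * other)


def toy_trace(fn, example_input):
    """Run fn once on example_input and return the list of recorded ops."""
    log = []
    fn(TracedValue(example_input, log))
    return log


def replay(graph, x):
    """Apply the recorded ops to a new input without calling fn again."""
    for op, operand in graph:
        x = x + operand if op == "add" else x * operand
    return x


def forward(x):
    return (x * 3) + 1


graph = toy_trace(forward, example_input=2)
print(graph)              # [('mul', 3), ('add', 1)]
print(replay(graph, 10))  # 31
```

The real API takes the same two ingredients, a model and an example input, as in `torch.jit.trace(model, example_inputs)`. Because only the ops actually executed for the example input are recorded, Python control flow that depends on the data is not captured by a trace.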

Extract parts of `model_path_uri`:

In [None]:
model_bucket = model_path_uri.replace("s3://","").split("/")[0] # get bucket name.
prefix_path = "/".join(model_path_uri.replace("s3://","").split("/")[1:])
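
The same split can also be written with the standard library's URL parser, which avoids manual string slicing. This is an optional sketch: `split_s3_uri` is a hypothetical helper, and the URI below is made up purely to show the shape of the result.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI, used only for illustration.
bucket_name, key = split_s3_uri("s3://example-bucket/imdb-job/output/model.tar.gz")
print(bucket_name)  # example-bucket
print(key)          # imdb-job/output/model.tar.gz
```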

In [None]:
sess.download_data("./",model_bucket,prefix_path)

Now you should see a `model.tar.gz` file in the current directory. It contains the trained weights. Extract it into the `bert` directory:

In [None]:
! rm -rf bert
! mkdir bert
! tar -xvf ./model.tar.gz -C ./bert

Now invoke the helper function with path to the model weights and sample input sequence.

In [None]:
from traceutils import trace

In [None]:
# Prepare sample input for jit model tracing
seq_0 = "I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered controversial I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot."


Below, we use the `trace` function to convert this model into TorchScript via the `torch.jit.trace` API:

In [None]:
trace_model_path = trace(seq_0,"./bert/checkpoint.pt")

In [None]:
!tar -czvf traced_model.tar.gz -C traced_model model.pth && mv traced_model.tar.gz traced_model/

We upload the traced model tar.gz file to Amazon S3, where our compilation job will download it from:

In [None]:
# A new session for deployment in another region
sess = sagemaker.Session(boto3.session.Session(region_name="us-east-2"))

In [None]:
traced_model_url = sess.upload_data(
    path="traced_model/traced_model.tar.gz",
    key_prefix="neuron-experiments/bert-seq-classification/traced-model",
)

## 2. Understanding the inference script

Before we deploy any model, let's check out the code we have written to do inference on a SageMaker endpoint, with a default uncompiled model.

In [None]:
!pygmentize inference_scripts/inference_inf1.py

As usual, the script defines four handlers. `model_fn` receives the model directory and is responsible for loading and returning the model. `input_fn` and `output_fn` pre-process and check the content types of the endpoint's input and output. `predict_fn` receives the outputs of `model_fn` and `input_fn` (that is, the loaded model and the deserialized, pre-processed input data) and defines how the model runs inference.
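
The way these four handlers fit together can be sketched in plain Python. This is a simplified illustration, not the contents of `inference_inf1.py` and not the serving container's actual code: the word-counting "model" is a stand-in, and `handle_request` is a hypothetical function showing only the order in which the handlers are chained.

```python
import json


def model_fn(model_dir):
    # Stand-in "model" that counts positive words. A real model_fn would
    # load trained weights from model_dir instead.
    positive_words = {"good", "great", "love"}
    return lambda text: sum(word in positive_words for word in text.lower().split())


def input_fn(request_body, request_content_type):
    if request_content_type != "application/json":
        raise ValueError(f"unsupported content type: {request_content_type}")
    return json.loads(request_body)


def predict_fn(input_data, model):
    score = model(input_data)
    return {"label": "positive" if score > 0 else "negative", "score": score}


def output_fn(prediction, accept):
    if accept != "application/json":
        raise ValueError(f"unsupported accept type: {accept}")
    return json.dumps(prediction)


def handle_request(body, content_type, accept, model):
    """Chain the handlers: deserialize, predict, serialize."""
    data = input_fn(body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, accept)


model = model_fn("/opt/ml/model")  # loaded once, when the endpoint starts
response = handle_request(
    json.dumps("I love this movie"), "application/json", "application/json", model
)
print(response)  # {"label": "positive", "score": 1}
```

Keeping model loading in `model_fn` means the cost of loading is paid once per endpoint worker rather than once per request.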


#### Now, let's see what changes in the inference code when we want to do inference with a model that has been compiled for Inferentia

```python
# %load -s model_fn inference_scripts/inference_inf1.py
def model_fn(model_dir):
    
    model_dir = '/opt/ml/model/'
    dir_contents = os.listdir(model_dir)
    model_path = next(filter(lambda item: 'model' in item, dir_contents), None)
    
    tokenizer_init = AutoTokenizer.from_pretrained('bert-base-uncased', max_length=128)
    model = torch.jit.load(os.path.join(model_dir, model_path))

    
    return (model, tokenizer_init)
```

In this case, within the `model_fn` we first grab the model artifact located in `model_dir` (the compilation step will name the artifact `model_neuron.pt`, but we just get the first file containing `model` in its name for script flexibility). Then, **we load the Neuron compiled model with `torch.jit.load`**. 

Other than this change to `model_fn`, we only need to add an extra import `import torch_neuron` to the beginning of the script, and get rid of all `.to(device)` calls, since the Neuron runtime will take care of loading our model to the NeuronCores on our Inferentia instance. All other functions are unchanged.

## 3. Compile and Deploy the model using Amazon SageMaker endpoints

We now create a new `PyTorchModel` that uses `inference_inf1.py` as its entry point script. PyTorch version 1.9.0 is the latest that supports Neo compilation to Inferentia, as the warning in the compilation cell output shows.

In [None]:
prefix = "neuron-experiments/bert-seq-classification"
flavour = "normal"
date_string = datetime.now().strftime("%Y%m-%d%H-%M%S")

compiled_sm_model = PyTorchModel(
    model_data=traced_model_url,
    predictor_cls=Predictor,
    framework_version="1.11.0",
    role=role,
    sagemaker_session=sess,
    entry_point="inference_inf1.py",
    source_dir="inference_scripts",
    py_version="py38",
    name=f"{flavour}-bert-pt181-{date_string}",
    env={"SAGEMAKER_CONTAINER_LOG_LEVEL": "10"},
)

Finally, we are ready to compile the model and deploy it. Two notes here:
* Hugging Face models should be compiled to `dtype` `int64`
* The format for `compiler_options` differs from the standard Python `dict` that you can use when compiling for "normal" instance types; for Inferentia, you must provide a JSON string with CLI arguments, which correspond to the ones supported by the [Neuron Compiler](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-cc/command-line-reference.html) (read more about `compiler_options` [here](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_OutputConfig.html#API_OutputConfig_Contents))
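
The difference between the two `compiler_options` formats is easy to see by printing them. This short sketch uses only the standard library; the variable names are illustrative.

```python
import json

# Inferentia targets: a JSON string holding Neuron compiler CLI arguments.
inferentia_options = json.dumps("--dtype int64")

# Other ("normal") targets: a plain Python dict.
other_target_options = {"dtype": "int64"}

print(inferentia_options)              # "--dtype int64"  (the double quotes are part of the string)
print(type(inferentia_options))        # <class 'str'>
print(json.loads(inferentia_options))  # --dtype int64
print(type(other_target_options))      # <class 'dict'>
```

Note that `json.dumps` on a string produces a JSON string literal, so the value passed for Inferentia includes the surrounding double quotes.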

In [None]:
%%time

hardware = "inf1"
flavour = "compiled-inf"
compilation_job_name = f"bert-{flavour}-{hardware}-" + date_string

compile_start = time.time()
compiled_inf1_model = compiled_sm_model.compile(
    target_instance_family=f"ml_{hardware}",
    input_shape={"input_ids": [1, 128], "attention_mask": [1, 128]},
    job_name=compilation_job_name,
    role=role,
    framework="pytorch",
    framework_version="1.11.0",
    output_path=f"s3://{sess.default_bucket()}/{prefix}/neo-compilations/{flavour}-model",
    compiler_options=json.dumps("--dtype int64"),
    #     compiler_options={'dtype': 'int64'},    # For compiling to "normal" instance types, cpu or gpu-based
    compile_max_run=900,
)
compile_end = time.time()
print('\nSageMaker Neo compile time: {} seconds'.format(compile_end - compile_start))

# After successful compilation, we deploy our model to an inf1.xlarge instance.
print("\nCompilation successful !!! Deploying the model ...")

date_string = datetime.now().strftime("%Y%m-%d%H-%M%S")

deploy_start = time.time()

compiled_inf1_predictor = compiled_inf1_model.deploy(
    instance_type="ml.inf1.xlarge",
    initial_instance_count=1,
    endpoint_name=f"test-neo-{hardware}-{date_string}",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

deploy_end = time.time()
print('\nSageMaker endpoint deployment time: {} seconds'.format(deploy_end - deploy_start))
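
The `input_shape` passed to `compile` fixes both `input_ids` and `attention_mask` to shape `[1, 128]`, so every request must be padded or truncated to 128 tokens before it reaches the compiled model. The sketch below shows that step on plain lists. `pad_or_truncate` is a hypothetical helper; whether `inference_inf1.py` does this itself or delegates to the tokenizer is not shown in this notebook, so treat it as an illustration only.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = list(token_ids[:max_length])
    attention_mask = [1] * len(kept)          # 1 marks a real token
    padding = max_length - len(kept)
    return kept + [pad_id] * padding, attention_mask + [0] * padding  # 0 marks padding


# Hypothetical token ids; a real tokenizer produces these from text.
input_ids, attention_mask = pad_or_truncate([101, 2023, 3185, 102], max_length=8)
print(input_ids)       # [101, 2023, 3185, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

In practice a Hugging Face tokenizer can do this in one call, for example `tokenizer(text, padding="max_length", truncation=True, max_length=128)`, which also keeps the special tokens intact when truncating (this toy version simply cuts the list).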

## 4. Test the model with sample input


In [None]:
# Predict with model endpoint
payload1 = "The new Hugging Face SageMaker DLC makes it super easy to deploy models in production. I love it!"
compiled_inf1_predictor.predict(payload1)

In [None]:
# Predict with model endpoint
payload1 = "This movie is very bad"
compiled_inf1_predictor.predict(payload1)
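
What the endpoint returns depends on the `output_fn` in `inference_inf1.py`. If it returns raw logits for the two classes (an assumption, since the response format is not shown here), they can be turned into a label and a confidence as in the sketch below. The logits are made up, and `to_sentiment` is a hypothetical helper.

```python
import math

LABELS = ["negative", "positive"]  # IMDB convention: 0 = negative, 1 = positive


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_sentiment(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


label, confidence = to_sentiment([-1.2, 2.3])  # made-up logits
print(label, round(confidence, 3))
```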

### Clean up

In [None]:
compiled_inf1_predictor.delete_model()
compiled_inf1_predictor.delete_endpoint()