For ease of use, we advise opening this notebook on an Amazon SageMaker notebook instance with the conda_pytorch_latest_p36 kernel, or in Amazon SageMaker Studio.

# Fine-tuning and deploying a HuggingFace summarization model on SageMaker with your own scripts and dataset

In this notebook, we will see how to fine-tune and deploy one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a summarization task on [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html) with your own scripts and data.

In the first part, "Preparing the dataset", we show how to upload your own dataset to S3 as separate splits for training, validation and testing. We will use the [samsum dataset](https://arxiv.org/pdf/1911.12237.pdf), which contains conversations and their summaries, but we also provide code to do the same for your own custom dataset. In our case the text and summary columns are called `dialogue` and `summary` respectively, and the data is saved in S3 under the prefix `samsum-dataset`.

Afterwards, we walk you through how to create your own train and inference scripts to fine-tune and deploy a HuggingFace model on Amazon SageMaker.

Make sure that the latest version of the SageMaker SDK is installed:

In [None]:
# Install the required libraries
!pip install datasets
!pip install py7zr
!pip install -U sagemaker

## Part 1: Preparing the dataset

One way to prepare your dataset for training on Amazon SageMaker is to save the training, validation and test sets separately. This decouples data preparation from training and ensures, for example, that different models can reuse the same datasets with the same split. In this example we download the [samsum dataset](https://arxiv.org/pdf/1911.12237.pdf) and prepare it for HuggingFace using the [datasets](https://github.com/huggingface/datasets) library. Any dataset containing texts and their summaries can work here.

We first import the required packages and define the S3 prefix under which the data will be saved:

In [None]:
import os
import json
import io, boto3, sagemaker
import pandas as pd

from datasets import load_dataset, filesystems, DatasetDict


s3_resource = boto3.resource('s3')
session = sagemaker.Session()
session_bucket = session.default_bucket()

s3_prefix = 'samsum-dataset'

We download the samsum dataset using curl. If you would like to use your own custom dataset, you do not need to run this.

In [None]:
%%sh
mkdir corpus && cd corpus
curl https://arxiv.org/src/1911.12237v2/anc/corpus.7z --output corpus.7z
py7zr x corpus.7z
rm corpus.7z

In [None]:
# Converting the json files to jsonlines in order to save it in Hugging Face dataset format for optimal speed and efficiency

data_path = 'corpus/'

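for file in os.listdir(data_path):
    if file.endswith('.json'):
        # Each .json file holds a single JSON array of records
        with open(os.path.join(data_path, file)) as f_in:
            records = json.load(f_in)
        # Write one JSON object per line next to the original file
        jsonl_path = os.path.join(data_path, file.replace('.json', '.jsonl'))
        with open(jsonl_path, 'w') as f_out:
            f_out.write('\n'.join(map(json.dumps, records)))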
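
To make the conversion concrete, here is a minimal standard-library sketch of the JSON array to JSON Lines round trip. The two records are made up for illustration (they only reuse the `id`, `dialogue` and `summary` keys) and are not taken from samsum.

```python
import json


def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


def from_jsonl(text):
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Hypothetical records, for illustration only.
records = [
    {"id": "1", "dialogue": "A: Hi\nB: Hello", "summary": "A greets B."},
    {"id": "2", "dialogue": "A: Bye\nB: See you", "summary": "A and B say goodbye."},
]

jsonl_text = to_jsonl(records)
restored = from_jsonl(jsonl_text)
```

Newlines inside a dialogue are escaped by `json.dumps`, so every record stays on a single line even though the dialogues themselves span several turns.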


In [None]:
# TO USE WITH YOUR OWN CUSTOM DATASET PLEASE UNCOMMENT
# If you would like to use your own custom dataset (single CSV/JSON), you can use the datasets.Dataset.train_test_split() method  to shuffle and split your data. 
# The splits will be shuffled by default. You can deactivate this behavior by setting shuffle=False


# # For single JSON file
# dataset_json = load_dataset('json', data_files='path_to_your_file', split='train') # path to your file

# # Replace type to 'csv' if you are using a single CSV file, the rest of the steps are exactly the same
# # dataset_csv = load_dataset('csv', data_files='path_to_your_file', split ='train') # path to your file


# # Split into 70% train, 30% test + validation
# train_test_validation = dataset_json.train_test_split(test_size=0.3)

# # Split 30% test + validation into half test, half validation
# test_validation = train_test_validation['test'].train_test_split(test_size=0.5)

# # Gather the splits  to have a single DatasetDict

# train_test_valid_dataset = DatasetDict({
#     'train': train_test_validation['train'],
#     'validation': test_validation['train'],
#     'test': test_validation['test'],})
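
The two-stage split above yields roughly 70% train, 15% validation and 15% test. The sketch below reproduces that arithmetic on plain row indices with the standard library only; `two_stage_split` is a hypothetical helper, and the exact rounding used by `train_test_split()` may differ.

```python
import random


def two_stage_split(n_rows, holdout_fraction=0.3, seed=7):
    """Shuffle row indices, hold out a fraction, then halve the holdout."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_rows * holdout_fraction)
    holdout, train = indices[:n_holdout], indices[n_holdout:]
    n_test = len(holdout) // 2
    return {
        "train": train,
        "validation": holdout[n_test:],
        "test": holdout[:n_test],
    }


splits = two_stage_split(1000)
sizes = {name: len(rows) for name, rows in splits.items()}
# sizes -> {'train': 700, 'validation': 150, 'test': 150}
```

Fixing the seed keeps the split reproducible, which matters when several models are meant to be compared on the same test set.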

In [None]:
# If you are using the samsum dataset that is already split, you can simply load the separate files

dataset = load_dataset('json', data_files={'train': ['corpus/train.jsonl'],
                                              'validation' : 'corpus/val.jsonl',
                                              'test': 'corpus/test.jsonl'})

In [None]:
dataset

In [None]:
print('DIALOGUE\n{dialogue}'.format(dialogue=dataset['train']['dialogue'][0]))
print('\nSUMMARY\n{summary}'.format(summary=dataset['train']['summary'][0]))

Finally, we upload the training, validation and test splits to S3.

We use the `save_to_disk` method to save the dataset directly to S3 in the Hugging Face dataset format. This format is backed by Apache Arrow, which memory-maps the data and supports zero-copy reads, so large datasets can be processed efficiently without loading them entirely into RAM. In the training script, the `load_from_disk` method loads the dataset back in the format in which it was saved.

In [None]:
s3 = filesystems.S3FileSystem()
dataset.save_to_disk(f's3://{session_bucket}/{s3_prefix}/train/', fs=s3)

## Part 2: Fine-tune and deploy a HuggingFace model on Amazon SageMaker

Now that the data is ready and saved in s3, we will demonstrate how to fine-tune and deploy a HuggingFace model on Amazon SageMaker with your own scripts.

In [None]:
text_column = 'dialogue'
target_column = 'summary'

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`pegasus-xsum`](https://huggingface.co/google/pegasus-xsum) checkpoint.

In [None]:
model_name = 'google/pegasus-xsum'

### Write the training script

To fine-tune a HuggingFace model with a custom dataset on Amazon SageMaker, we will write a training script to be used by the Amazon SageMaker Training Job.

The training script will need to do the following steps:
- Load a pretrained Tokenizer and Model
- Load and Tokenize datasets
- Define the Training Arguments
- Define a Trainer
- Train the model and save the checkpoint with the best performance on the validation set
- Evaluate the best checkpoint on the test set

These steps are carried out in a `train()` function, which relies on three helper functions:
- `tokenize()` takes a batch and the names of the text and target columns, and tokenizes both with the Tokenizer loaded in memory,
- `load_and_tokenize_dataset()` reads one split of the data that SageMaker copied from S3 into the container and applies `tokenize()` to it,
- `compute_metrics()` computes ROUGE scores for evaluation.
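
To give some intuition for the ROUGE scores, here is a simplified ROUGE-1 F-measure based on unigram overlap between lowercased, whitespace-split tokens. It is only a sketch: it leaves out the stemming, sentence splitting and aggregation that the `rouge_score` package performs in the training script, so its numbers will not match exactly. The example sentences are made up.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    """Simplified ROUGE-1 F-measure on lowercased whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("Alex's team won the game",
                 "Alex's team won the basketball game")
```

Here all five predicted tokens appear in the six-token reference, so precision is 1 and recall is 5/6. The training script reports the same kind of F-measure multiplied by 100.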

The script uses `AutoTokenizer` and `AutoModelForSeq2SeqLM`, which work with any [🤗 Transformers](https://github.com/huggingface/transformers) model for summarization. You might however want to change some hyperparameters depending on what works best for each model. For example, here we use Adafactor as the optimizer for Pegasus.

All computations run inside the Amazon SageMaker HuggingFace training and inference containers, which we call using the [SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html).

In [None]:
! mkdir training_code

In [None]:
%%writefile training_code/requirements.txt
nltk
rouge_score

In [None]:
%%writefile training_code/train.py
# This is the script that will be used in the training container
import argparse
import logging
import os
import sys

import numpy as np
import nltk
nltk.download('punkt')
from nltk import sent_tokenize

from datasets import load_metric, load_from_disk
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments


logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))




def tokenize(batch, text_column, target_column, max_source, max_target):
    tokenized_input = tokenizer(batch[text_column], padding='max_length',
                                truncation=True,
                                max_length=max_source)
    tokenized_target = tokenizer(batch[target_column], padding='max_length',
                                 truncation=True,
                                 max_length=max_target)

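    # Replace padding token ids in the labels with -100 so that the loss
    # ignores the padded positions
    tokenized_input['labels'] = [
        [token if token != tokenizer.pad_token_id else -100 for token in ids]
        for ids in tokenized_target['input_ids']
    ]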

    return tokenized_input


def load_and_tokenize_dataset(data_dir, split, text_column, target_column, max_source,
                              max_target):
    
    dataset = load_from_disk(os.path.join(data_dir, split))
    tokenized_dataset = dataset.map(lambda x: tokenize(x, text_column,
                                    target_column, max_source, max_target),
                                    batched=True, batch_size=512)
    tokenized_dataset.set_format('numpy', columns=['input_ids',
                                 'attention_mask', 'labels'])

    return tokenized_dataset


def compute_metrics(eval_pred):
    metric = load_metric("rouge")
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions,
                                           skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in
                     decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label
                      in decoded_labels]
    result = metric.compute(predictions=decoded_preds,
                            references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for
                       pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}


def train(args):
    logger.info("Loading tokenizer...\n")
    global tokenizer 
    global model_name
    model_name = args.model_name
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    logger.info("Loading pretrained model\n")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    logger.info("Pretrained model loaded\n")

    logger.info("Fetching and tokenizing data for training")
    train_dataset = load_and_tokenize_dataset(args.train_data_dir,
                                              'train',
                                              args.text_column,
                                              args.target_column,
                                              args.max_source,
                                              args.max_target)
    
    logger.info("Training data loaded and tokenized")
    
    eval_dataset = load_and_tokenize_dataset(args.train_data_dir,
                                             'validation',
                                             args.text_column,
                                             args.target_column,
                                             args.max_source,
                                             args.max_target)
    test_dataset = load_and_tokenize_dataset(args.train_data_dir,
                                             'test',
                                             args.text_column,
                                             args.target_column,
                                             args.max_source,
                                             args.max_target)

    logger.info("Defining training arguments\n")
    training_args = Seq2SeqTrainingArguments(
        output_dir=args.model_dir,
        num_train_epochs=args.epoch,
        per_device_train_batch_size=args.train_batch_size,
        per_device_eval_batch_size=args.eval_batch_size,
        learning_rate=args.lr,
        warmup_steps=args.warmup_steps,
        weight_decay=args.weight_decay,
        logging_dir=args.log_dir,
        logging_strategy=args.logging_strategy,
        load_best_model_at_end=True,
        adafactor=True,
        do_train=True,
        do_eval=True,
        do_predict=True,
        save_total_limit=3,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        predict_with_generate=True,
        metric_for_best_model='eval_loss',
        seed=7,
    )

    logger.info("Defining seq2seq Trainer")
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )
    
    logger.info("Starting Training")
    trainer.train()
    logger.info('Model trained successfully')
    trainer.save_model()
    logger.info('Model saved successfully')
    
    # Evaluation
    logger.info("*** Evaluate on test set***")

    logger.info(trainer.predict(test_dataset))

    logger.info('Removing unused checkpoints to save space in container')
    os.system(f"rm -rf {args.model_dir}/checkpoint-*/")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", type=str,
                        default='google/pegasus-xsum')
    parser.add_argument("--train-data-dir", type=str,
                        default=os.environ["SM_CHANNEL_TRAIN"])
    #parser.add_argument("--val-data-dir", type=str,
     #                   default=os.environ["SM_CHANNEL_VALIDATION"])
    #parser.add_argument("--test-data-dir", type=str,
    #                    default=os.environ["SM_CHANNEL_TEST"])
    parser.add_argument("--text-column", type=str, default='dialogue')
    parser.add_argument("--target-column", type=str, default='summary')
    parser.add_argument("--max-source", type=int, default=512)
    parser.add_argument("--max-target", type=int, default=80)
    parser.add_argument("--model-dir", type=str,
                        default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--epoch", type=int, default=5)
    parser.add_argument("--train-batch-size", type=int, default=2)
    parser.add_argument("--eval-batch-size", type=int, default=2)
    parser.add_argument("--warmup-steps", type=int, default=500)
    parser.add_argument("--lr", type=float, default=2e-5)
    parser.add_argument("--weight-decay", type=float, default=0.0)
    parser.add_argument("--log-dir", type=str,
                        default=os.environ["SM_OUTPUT_DIR"])
    parser.add_argument("--logging-strategy", type=str, default='epoch')
    train(parser.parse_args())


With the settings above, the Trainer saves a checkpoint at the end of each epoch (keeping at most three) before selecting the best one. Once the best checkpoint has been loaded in memory and saved, the remaining checkpoints are no longer needed. They can be safely deleted (which we do in the last line of `train()`) to free up space in SM_MODEL_DIR, whose content will be used later to create a SageMaker Model and deploy it to an endpoint.
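
The same cleanup can be done without shelling out. The sketch below is an alternative to the `os.system` call in `train()`, not what the script above uses; `remove_checkpoints` is a hypothetical helper, and the demo runs on a temporary directory with made-up checkpoint names.

```python
import glob
import os
import shutil
import tempfile


def remove_checkpoints(model_dir):
    """Delete every checkpoint-* subdirectory of model_dir; return their names."""
    removed = []
    for path in sorted(glob.glob(os.path.join(model_dir, "checkpoint-*"))):
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(os.path.basename(path))
    return removed


# Demo on a throwaway directory standing in for SM_MODEL_DIR.
with tempfile.TemporaryDirectory() as model_dir:
    for name in ("checkpoint-500", "checkpoint-1000"):
        os.makedirs(os.path.join(model_dir, name))
    open(os.path.join(model_dir, "config.json"), "w").close()

    removed = remove_checkpoints(model_dir)
    remaining = os.listdir(model_dir)
```

Only the checkpoint directories are removed; files saved at the top level of the model directory, such as the final weights and the config, are left in place.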

### Fine-tuning the model on SageMaker

We first load a couple of libraries and objects, namely `sagemaker` and the `HuggingFace` SageMaker Estimator which will be used to launch a training job.

In [None]:
import sagemaker

session = sagemaker.Session()
session_bucket = session.default_bucket()
role = sagemaker.get_execution_role()

In [None]:
from sagemaker.huggingface import HuggingFace

In [None]:
output_path = f"s3://{session_bucket}/{s3_prefix}"

We define a few arguments to be sent to the training script which will be read by the parser.

In [None]:
# We set the number of epochs to 1 to reduce the training time in this demo.
# For complete fine-tuning of the model please consider increasing the number of epochs to e.g. 5
hyperparameters = {
    'model-name': model_name,
    'text-column': text_column,
    'target-column': target_column,
    'epoch': 1
}
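
In script mode, SageMaker hands these hyperparameters to the entry point as command-line arguments of the form `--key value`, which is why the keys use the same hyphenated names as the parser in `train.py`. The sketch below imitates that hand-off locally; `to_cli_args` is a hypothetical helper written for this illustration, not part of the SageMaker SDK, and the parser only repeats a few of the script's arguments.

```python
import argparse


def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--key', 'value', ...]."""
    cli_args = []
    for key, value in hyperparameters.items():
        cli_args += [f"--{key}", str(value)]
    return cli_args


# A subset of the arguments defined in train.py.
parser = argparse.ArgumentParser()
parser.add_argument("--model-name", type=str, default="google/pegasus-xsum")
parser.add_argument("--text-column", type=str, default="dialogue")
parser.add_argument("--target-column", type=str, default="summary")
parser.add_argument("--epoch", type=int, default=5)
parser.add_argument("--lr", type=float, default=2e-5)

hyperparameters = {
    "model-name": "google/pegasus-xsum",
    "text-column": "dialogue",
    "target-column": "summary",
    "epoch": 1,
}
args = parser.parse_args(to_cli_args(hyperparameters))
```

`argparse` turns hyphens into underscores, so `--model-name` is read as `args.model_name`, and any argument that is not passed (here `--lr`) keeps the default defined in the script.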

Thanks to the seamless integration of the [🤗 Transformers](https://github.com/huggingface/transformers) `Trainer` with [SageMaker Distributed Data Parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html), we can use instances with several GPUs to parallelize and speed up training without any modification to our training script.

When defining the SageMaker HuggingFace Estimator we specify a training script and a source directory (here containing `train.py` and the `requirements.txt` written above, but it could also contain additional modules), as well as the instance type on which to run the Training Job.

In [None]:
# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}
huggingface_estimator = HuggingFace(entry_point = 'train.py',
                                    source_dir = 'training_code',
                                    base_job_name = 'huggingface-summarizer',
                                    instance_type = 'ml.p3.16xlarge',
                                    instance_count = 1,
                                    volume_size = 200,
                                    transformers_version = '4.6.1',
                                    pytorch_version = '1.7.1',
                                    py_version = 'py36',
                                    output_path = output_path,
                                    role = role,
                                    hyperparameters = hyperparameters,
                                    distribution = distribution
                                   )
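
With data parallelism, each GPU processes its own mini-batch, so the effective batch size is the per-device batch size times the number of GPUs. The back-of-the-envelope sketch below rests on assumptions that are not stated in this notebook: 8 GPUs on a single ml.p3.16xlarge, the script default of 2 samples per device, no gradient accumulation, and roughly 14,700 training dialogues in samsum.

```python
import math


def steps_per_epoch(n_examples, per_device_batch_size, n_gpus):
    """Optimizer steps per epoch under data parallelism (no accumulation)."""
    global_batch_size = per_device_batch_size * n_gpus
    return math.ceil(n_examples / global_batch_size)


# Assumed values, see the paragraph above.
n_gpus = 8
per_device_batch_size = 2
approx_train_examples = 14_700

global_batch = per_device_batch_size * n_gpus
steps = steps_per_epoch(approx_train_examples, per_device_batch_size, n_gpus)
```

Under these assumptions one epoch is about 919 optimizer steps, so the script's default of 500 warmup steps would cover more than half of a one-epoch demo run. You may want to lower `warmup-steps` when training for a single epoch.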

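We then launch the training job by specifying where to read the data from.
Each key of the dictionary passed to `fit()` defines an input channel: the data under the `train` S3 URI is copied into the training container, and its local path is exposed through the `SM_CHANNEL_TRAIN` environment variable. Because we saved the whole `DatasetDict` under a single prefix, this one channel contains the `train`, `validation` and `test` splits as subdirectories, which is where `train.py` reads them from.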

In [None]:
huggingface_estimator.fit({'train':f's3://{session_bucket}/{s3_prefix}/train/'}, wait = False)
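
The mapping from channel name to environment variable and container path can be sketched as follows. `channel_env` is a hypothetical helper that only spells out the naming convention; inside a real training job you would simply read `os.environ["SM_CHANNEL_TRAIN"]`, as `train.py` does.

```python
def channel_env(channel_name):
    """Return the (env var name, container path) for a SageMaker input channel."""
    env_name = f"SM_CHANNEL_{channel_name.upper()}"
    container_path = f"/opt/ml/input/data/{channel_name}"
    return env_name, container_path


env_name, data_dir = channel_env("train")

# train.py joins the channel directory with the split name before loading.
split_dirs = {
    split: f"{data_dir}/{split}" for split in ("train", "validation", "test")
}
```

The `train` split therefore ends up in a `train` subdirectory of the `train` channel, which looks redundant but follows directly from saving the whole `DatasetDict` under the `train/` prefix.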

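With distributed training on an ml.p3.16xlarge instance, training should take around 6 hours for 5 epochs. Because `fit()` is called with `wait = False`, the cell returns immediately; make sure the training job has completed before moving on to the deployment steps further down.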

### Bring your own inference script

Our friends at HuggingFace have made inference on SageMaker for Transformers models simpler than ever thanks to the [SageMaker HuggingFace Inference Toolkit](https://github.com/aws/sagemaker-huggingface-inference-toolkit). You can deploy the previously trained model directly, without writing an inference script, by setting the environment variable "HF_TASK":"summarization". To see the instructions, go to the [HuggingFace website](https://huggingface.co/google/pegasus-xsum), select "Deploy" and then "Amazon SageMaker".

However, when you need specific postprocessing, for example returning several summaries for the same input based on different text generation parameters, bringing your own `inference.py` script can be useful and is relatively straightforward:

In [None]:
! mkdir inference_code

In [None]:
%%writefile inference_code/inference.py

# This is the script that will be used in the inference container
import json 
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def model_fn(model_dir):
    """
    Load the model and tokenizer for inference 
    """
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir).to(device).eval()

    return {'model':model, 'tokenizer':tokenizer} 


def predict_fn(input_data, model_dict):
    """
    Make a prediction with the model
    """
    text = input_data.pop('inputs')
    parameters_list = input_data.pop('parameters_list', None)
    
    tokenizer = model_dict['tokenizer']
    model = model_dict['model']

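    # Tokenize the input texts and move the tensors to the model's device
    encoded = tokenizer(text, truncation=True, padding='longest', return_tensors="pt").to(device)
    input_ids = encoded['input_ids']
    # The attention mask lets generate() ignore padding when several texts are batched
    attention_mask = encoded['attention_mask']

    # Generation parameters may or may not be passed
    if parameters_list:
        predictions = []
        for parameters in parameters_list:
            output = model.generate(input_ids, attention_mask=attention_mask, **parameters)
            predictions.append(tokenizer.batch_decode(output, skip_special_tokens=True))
    else:
        output = model.generate(input_ids, attention_mask=attention_mask)
        predictions = tokenizer.batch_decode(output, skip_special_tokens=True)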
    
    return predictions


def input_fn(request_body, request_content_type):
    """
    Transform the input request to a dictionary
    """
    return json.loads(request_body)

As we can see, the only requirement for such an inference script for HuggingFace on SageMaker is that it contains the following template functions:
- `model_fn()` reads the content of what was saved at the end of the training job inside SM_MODEL_DIR, or of an existing model weights directory saved as a tar.gz in S3. We use it to load the trained Model and associated Tokenizer.
- `input_fn()` is used here simply to parse the data received from a request made to the endpoint.
- `predict_fn()` uses the output of `model_fn()` (here the model and tokenizer) to run inference on the output of `input_fn()`.

Optionally an `output_fn()` can be created for inference formatting, using the output of `predict_fn()`, but we did not use it here.
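
The order in which these functions are chained for each request can be sketched with a stand-in model. This is a simplified illustration, not the toolkit's actual code: `handle` and `fake_model` are hypothetical names, `fake_model` merely truncates text instead of generating a summary, and unlike the script above this `predict_fn()` always returns one list per parameter set.

```python
import json


def input_fn(request_body, request_content_type):
    """Deserialize the JSON request body into a dictionary."""
    return json.loads(request_body)


def predict_fn(input_data, model):
    """Run the stand-in model once per parameter set."""
    texts = input_data["inputs"]
    parameters_list = input_data.get("parameters_list") or [{}]
    return [[model(text, **params) for text in texts]
            for params in parameters_list]


def output_fn(prediction, accept):
    """Serialize the prediction for the response."""
    return json.dumps(prediction)


def handle(request_body, model, content_type="application/json"):
    """Chain the handlers: input_fn, then predict_fn, then output_fn."""
    data = input_fn(request_body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, content_type)


def fake_model(text, max_length=None):
    """Stand-in for generate(): keep the first max_length words."""
    return " ".join(text.split()[:max_length])


request = json.dumps({
    "inputs": ["one two three four"],
    "parameters_list": [{"max_length": 2}, {"max_length": 3}],
})
response = json.loads(handle(request, fake_model))
# response -> [['one two'], ['one two three']]
```

The response contains one inner list per entry of `parameters_list`, each holding one output per input text, which is also the shape returned by the `predict_fn()` above when `parameters_list` is provided.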


### Create and deploy a SageMaker Model to an endpoint and test it

This time we will import the SageMaker `HuggingFaceModel` object, which will help us create a SageMaker Model and deploy it to an endpoint.

In [None]:
from sagemaker.huggingface import HuggingFaceModel

Again, we specify the inference script that we wrote earlier, a source directory (here containing only `inference.py`, but it could also contain additional modules and a requirements.txt) and `model_data`, which specifies where to load the model weights from. `huggingface_estimator.model_data` points directly to the S3 location where the output of the training job was saved, but any S3 URI of pre-trained weights compressed as a tar.gz would work.

In [None]:
# Use a separate variable so that the checkpoint name stored in `model_name` is not overwritten
sagemaker_model_name = 'summarization-model'

model_for_deployment = HuggingFaceModel(entry_point='inference.py',
                                        source_dir='inference_code',
                                        model_data= huggingface_estimator.model_data,
                                        role=role,
                                        pytorch_version='1.7.1',
                                        py_version='py36',
                                        transformers_version='4.6.1',
                                        name=sagemaker_model_name)
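
If you bring weights that were not produced by the training job above, they need to be packaged as a tar.gz before being uploaded to S3. The sketch below shows one way to do that with the standard library, placing the model files at the root of the archive. `package_model` is a hypothetical helper, and the demo uses empty placeholder files rather than real weights.

```python
import os
import tarfile
import tempfile


def package_model(model_dir, archive_path):
    """Create a gzip-compressed tar with the model files at the archive root."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in sorted(os.listdir(model_dir)):
            archive.add(os.path.join(model_dir, name), arcname=name)
    return archive_path


# Demo with empty placeholder files instead of real weights.
with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "model")
    os.makedirs(model_dir)
    for name in ("config.json", "pytorch_model.bin"):
        open(os.path.join(model_dir, name), "w").close()

    archive_path = package_model(model_dir, os.path.join(workdir, "model.tar.gz"))
    with tarfile.open(archive_path) as archive:
        members = archive.getnames()
```

Passing `arcname=name` keeps the files at the top level of the archive instead of nesting them under the local directory path, so `model_fn()` finds them directly in the directory it receives.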

Finally, we deploy the registered model by specifying the instance type.

In [None]:
endpoint_name = 'summarization-endpoint'

predictor = model_for_deployment.deploy(initial_instance_count=1,
                                        instance_type='ml.g4dn.xlarge',
                                        endpoint_name=endpoint_name,
                                        serializer=sagemaker.serializers.JSONSerializer(),
                                        deserializer=sagemaker.deserializers.JSONDeserializer()
                                        )

Once the model is deployed, you can test it directly:

In [None]:
texts = [
    "Alex: Were you able to attend Friday night's basketball game?\nBenjamin: I was unable to make it.\nAlex: You should have been there. It was intense.\nBenjamin: Is that right. Who ended up winning?\nAlex: Our team was victorious.\nBenjamin: I wish I was free that night. I'm kind of mad that I didn't go.\nAlex: It was a great game. Everything alright tough?\nBenjamin: Yeah man thanks for asking, it's just that my mom is sick and I am taking care of her.\nAlex: Oh sorry to hear that. Hope she makes a fast recovery :muscle:\nBenjamin: She will, she just has a nasty flu but she will be alright :D\nAlex: Glad to hear that!\nBenjamin: What was the score at the end of the game?\nAlex: Our team won 101-98.\nBenjamin: Sounds like it was a close game then.\nAlex: That's the reason it was such a great game.\nBenjamin: I'll go to the next one for sure.\nAlex: It's next weekend so you better put on your calendar ahaha\nBenjamin: ahaha I will I will. Talk to you later!\nAlex: Alright! Tell your mom I hope she gets better quickly."
        ]

inputs = {"inputs": texts,
          "parameters_list":[
              {
                  "length_penalty":1,
                  "num_beams":5,
                  "do_sample":True
              },
              {
                  "length_penalty":0.6,
                  "num_beams":3,
                  "do_sample":True
              },
              {
                  "max_length":25,
                  "top_p":0.92,
                  "top_k":50,
                  "do_sample":True
              }
          ]
         }

In [None]:
summary = predictor.predict(inputs)                                                                     
print(summary)

Lastly, please remember to delete the Amazon SageMaker endpoint to avoid charges.

In [None]:
predictor.delete_endpoint()