## Train your first language model with Amazon SageMaker

### Sentiment Analysis with `DistilBERT` and the `imdb` dataset

1. [Introduction](#Introduction)  
2. [Environment and Permissions](#Environment-and-Permissions)
3. [Preprocess - Tokenization of the dataset](#Preprocessing)   
4. [Creating an Estimator and starting a training job](#Creating-an-Estimator-and-starting-a-training-job)  
5. [Deploying the endpoint](#Deploying-the-endpoint)  

# Introduction

Welcome to our end-to-end binary text-classification example. In this demo, we will use the Hugging Face `transformers` and `datasets` libraries together with the SageMaker SDK to launch a SageMaker Training job and fine-tune a pre-trained transformer for binary text classification. In particular, we will take the pre-trained DistilBERT model and fine-tune it on the `imdb` dataset. To get started, we need to set up the environment with a few prerequisite steps covering permissions and configuration.


_**NOTE: You can run this demo in SageMaker Studio, on your local machine, or in SageMaker Notebook Instances.** This notebook has been tested in SageMaker Studio with a PyTorch 1.13 Python 3.9 CPU Optimized kernel._

## Environment and Permissions 

In [None]:
!pip install datasets
!pip install -U sagemaker
!pip install -U transformers
!pip install -U accelerate

In [None]:
import sagemaker
import boto3
import sagemaker.huggingface

sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

## Visualizing our data
We are using the `datasets` library to download the `imdb` [dataset](https://huggingface.co/datasets/imdb). The dataset consists of 25,000 highly polar movie reviews for training, and 25,000 for testing.
Let's see what our dataset looks like.

In [None]:
from datasets import load_dataset

train_dataset, test_dataset = load_dataset("imdb", split=["train", "test"])

In [None]:
train_dataset, test_dataset

In [None]:
train_dataset[10]

The dataset includes the 'text' field, which holds the free-text review of the movie, and the 'label' field, a binary variable coded as 0 for negative reviews and 1 for positive ones.
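To make the label coding concrete, here is a minimal sketch that counts examples per class. The `ID2LABEL` mapping and the `label_distribution` helper are names introduced only for this illustration, and `sample_labels` is a made-up stand-in for the real label column.

```python
from collections import Counter

# Label coding described above: 0 = negative review, 1 = positive review.
ID2LABEL = {0: "negative", 1: "positive"}


def label_distribution(labels):
    """Count how many examples fall into each sentiment class."""
    counts = Counter(labels)
    return {ID2LABEL[i]: counts.get(i, 0) for i in sorted(ID2LABEL)}


# Hypothetical labels standing in for the dataset's 'label' column.
sample_labels = [0, 1, 1, 0, 1]
distribution = label_distribution(sample_labels)
print(distribution)
```

On the real data, passing the 'label' column of `train_dataset` to the same helper should give the per-class counts for the training split.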

# Preprocessing

Before you can train a model on a dataset, the data needs to be preprocessed into the input format the model expects. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors.
For text, use a [Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) to convert the text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
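The steps above (split into tokens, map tokens to ids, pad or truncate to a fixed length) can be sketched in plain Python. This is a toy illustration, not DistilBERT's actual subword tokenizer: the `VOCAB` table, the whitespace splitting, and the `encode` helper are all assumptions made for the example.

```python
# Toy vocabulary; a real tokenizer ships a much larger subword vocabulary.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "great": 4, "movie": 5, "boring": 6}


def encode(text, max_length):
    """Turn text into fixed-length input ids plus an attention mask."""
    # 1. Split into tokens and add the special start/end tokens.
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    # 2. Map each token to its id; unknown words become [UNK].
    ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    # 3. Truncate long inputs, keeping [SEP] as the final token.
    if len(ids) > max_length:
        ids = ids[: max_length - 1] + [VOCAB["[SEP]"]]
    # 4. Pad short inputs; the mask marks real tokens (1) vs padding (0).
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [VOCAB["[PAD]"]] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,
    }


batch = [encode(text, max_length=6)
         for text in ["Great movie", "A boring and very long movie indeed"]]
for item in batch:
    print(item)
```

Every item in the batch ends up with the same length, which is what allows the examples to be stacked into a single tensor.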

## Tokenization 

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer_name = 'distilbert-base-uncased'
dataset_name = 'imdb'
# s3 key prefix for the data
s3_prefix = 'samples/datasets/imdb'

# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load dataset
train_dataset, test_dataset = load_dataset(dataset_name, split=['train', 'test'])
test_dataset = test_dataset.shuffle().select(range(10000)) # reduce the test dataset to 10k samples

# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# set format for pytorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

train_dataset = train_dataset.remove_columns("text")
test_dataset = test_dataset.remove_columns("text")

## Uploading data to `sagemaker_session_bucket`

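After processing the datasets, we save them to local disk and upload the resulting directories to S3 with a small `boto3` helper. The `datasets` library also offers a `FileSystem` [integration](https://huggingface.co/docs/datasets/filesystems.html) for cloud storage, which is not used in the cells below.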
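When a local directory is copied to S3, each file path has to be turned into an object key under a prefix. The following sketch shows only that mapping and makes no AWS calls; `s3_key_for` is a hypothetical helper, and the file names are illustrative rather than a guaranteed listing of what `save_to_disk` writes.

```python
import posixpath
from pathlib import PurePath


def s3_key_for(local_path, local_directory, s3_directory):
    """Build the S3 object key for a file inside local_directory."""
    # Path of the file relative to the directory being uploaded.
    relative = PurePath(local_path).relative_to(local_directory)
    # S3 keys always use forward slashes, whatever the local OS uses.
    return posixpath.join(s3_directory, *relative.parts)


key = s3_key_for("data/train/dataset_info.json", "data/train",
                 "samples/datasets/imdb/train")
nested_key = s3_key_for("data/train/shards/part-0.arrow", "data/train",
                        "samples/datasets/imdb/train")
print(key)
print(nested_key)
```

Keeping the relative layout intact matters because the training job later reads the directory back as a whole.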

In [None]:
import os
import boto3
from botocore.exceptions import NoCredentialsError, ClientError

def upload_directory_to_s3(local_directory, bucket_name, s3_directory):
    """Recursively upload every file under local_directory to s3://bucket_name/s3_directory."""
    s3_client = boto3.client('s3')
    
    for root, dirs, files in os.walk(local_directory):
        for filename in files:
            local_path = os.path.join(root, filename)
            relative_path = os.path.relpath(local_path, local_directory)
            s3_path = os.path.join(s3_directory, relative_path).replace("\\", "/")  # Ensure S3 path uses forward slashes
            
            try:
                s3_client.upload_file(local_path, bucket_name, s3_path)
                print(f'Successfully uploaded {local_path} to s3://{bucket_name}/{s3_path}')
            except FileNotFoundError:
                print(f'The file was not found: {local_path}')
            except NoCredentialsError:
                print('Credentials not available')
            except ClientError as e:
                print(f'Failed to upload {local_path} to s3://{bucket_name}/{s3_path}: {e}')

In [None]:
# save train_dataset to s3
training_input_path = "./data/train"
train_dataset.save_to_disk(training_input_path)

upload_directory_to_s3("./data/train", sess.default_bucket(), f"{s3_prefix}/train")

# save test_dataset to s3
test_input_path = "./data/test"
test_dataset.save_to_disk(test_input_path)

upload_directory_to_s3("./data/test", sess.default_bucket(), f"{s3_prefix}/test")

## Creating an Estimator and starting a training job

From all the models supported by Hugging Face, we selected [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased). It is a light ("distilled") version of the BERT model, introduced in [this paper](https://arxiv.org/abs/1910.01108), where the authors claim to have reduced the size of the model by 40% compared to BERT-base, while retaining 97% of its language understanding capabilities and being 60% faster. A smaller, faster model reduces the instance footprint required for training and inference and shortens training time.
According to the model card on Hugging Face, the raw model can be used for next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task such as sequence classification, token classification, or question answering.
In our case, we will fine-tune the model to classify a piece of text as carrying positive or negative sentiment.

### Language Model Training (Model Fine-Tuning)
Language model sizes range from hundreds of millions to hundreds of billions of parameters. For example, BERT-base has 110M parameters. Because training such models from scratch can take weeks, months, or longer, it is common practice to start from a pre-trained model and tune the network parameters (called model fine-tuning) on domain-specific datasets with supervised learning.

This is what we will do next. As you will see, fine-tuning on the `imdb` dataset takes only about 15 minutes.
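A rough sense of the workload follows from the dataset size and the batch size. The sketch below uses the 25,000 training reviews mentioned earlier and the batch size of 32 from this notebook's hyperparameters. It assumes the batch size is applied per GPU, which depends on a training script that is not shown here, so treat the multi-GPU figure as an estimate. The `steps_per_epoch` helper is a name introduced for this example.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Number of optimizer steps needed to see every example once."""
    effective_batch_size = per_device_batch_size * num_devices
    return math.ceil(num_examples / effective_batch_size)


# One GPU: 25,000 examples in batches of 32.
single_gpu_steps = steps_per_epoch(25_000, 32)
# Four GPUs (an ml.g4dn.12xlarge has four), assuming 32 examples per GPU.
four_gpu_steps = steps_per_epoch(25_000, 32, num_devices=4)
print(single_gpu_steps, four_gpu_steps)
```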

In [None]:
from sagemaker.huggingface import HuggingFace

training_input_path = "s3://{}/{}/train".format(sess.default_bucket(), s3_prefix)
test_input_path = "s3://{}/{}/test".format(sess.default_bucket(), s3_prefix)

# hyperparameters, which are passed into the training job
hyperparameters = {
    "epochs": 1,
    "train_batch_size": 32,
    "model_name": "distilbert-base-uncased",
    "learning_rate": 0.00003,
}

In [None]:
huggingface_estimator = HuggingFace(
    entry_point="train_script.py",
    source_dir="./scripts",
    instance_type="ml.g4dn.12xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters=hyperparameters,
)

In [None]:
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})

While waiting for training to complete, which takes about 15 minutes, you can open the SageMaker console and choose Training -> Training Jobs. There you will see the training job that you started by calling the `fit` method in the cell above, with status InProgress.
When training finishes, the job status changes to Completed. Clicking the job name opens a page with the details of the training job; at the bottom of the page you can find the location of the model artifact you just created.
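For context on how the `hyperparameters` dictionary and the `train`/`test` channels reach the entry point: SageMaker passes hyperparameters to the script as command-line arguments and exposes each input channel through an `SM_CHANNEL_<NAME>` environment variable, with `SM_MODEL_DIR` pointing at the directory whose contents become the model artifact. The contents of `train_script.py` are not shown in this notebook, so the following is only a sketch of how such a script might read its inputs; the `parse_args` helper and the argument names `model_dir`, `training_dir`, and `test_dir` are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and SageMaker-provided paths (hypothetical sketch)."""
    parser = argparse.ArgumentParser()
    # Hyperparameters: keys mirror the dictionary passed to the estimator.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    # Paths: SageMaker sets these environment variables inside the container;
    # the fallbacks are the conventional container locations.
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--training_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--test_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"))
    # Ignore any extra arguments the platform may append.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the arguments the training container would receive.
args = parse_args(["--epochs", "1", "--train_batch_size", "32",
                   "--learning_rate", "3e-05"])
print(args.epochs, args.train_batch_size, args.learning_rate, args.model_name)
```

Parsing with `parse_known_args` keeps the script tolerant of arguments it does not declare, which is a design choice rather than a requirement.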

## Deploying the endpoint

To deploy our endpoint, we call `deploy()` on our HuggingFace estimator object, passing in our desired number of instances and instance type.

In [None]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")

Then, we use the returned predictor object to call the endpoint.

In [None]:
sentiment_input = {
    "inputs": "This is the best movie ever made in history, an absolute sculpted work of art that depicts every emotion of human existence, from suffering, to courage to love, in front of the background of political astuteness and socio-hierarchal analysis."
}

predictor.predict(sentiment_input)

In [None]:
sentiment_input = {
    "inputs": "Another bloated film that gets all the history wrong, turns all of the characters into stick figures and makes piles of money for the star."
}

predictor.predict(sentiment_input)
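
The exact response depends on how the model was saved. Assuming the endpoint returns the usual text-classification shape, a list of dictionaries with `label` and `score` keys, and that no custom label names were stored in the model config (so labels appear as `LABEL_0` and `LABEL_1`), a small helper can turn the raw response into a readable sentiment. The `LABEL_NAMES` mapping, the `summarize` helper, and the example response values are all hypothetical.

```python
# Assumed mapping from default label names to the dataset's label coding.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def summarize(response):
    """Return (sentiment, score) for the highest-scoring prediction."""
    top = max(response, key=lambda item: item["score"])
    # Fall back to the raw label if the endpoint uses other label names.
    return LABEL_NAMES.get(top["label"], top["label"]), top["score"]


# Hypothetical response; real scores will differ.
example_response = [{"label": "LABEL_1", "score": 0.98}]
sentiment, score = summarize(example_response)
print(sentiment, score)
```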

Finally, we delete the model and the endpoint so that they stop incurring charges.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()