# Test1: HuggingFace and AWS SageMaker Training

## Intro

This notebook tests the HuggingFace transformers and datasets libraries together with a custom Amazon SageMaker SDK extension to fine-tune a pre-trained transformer for multi-class text classification.

The pre-trained model will be fine-tuned using the emotion dataset.

In [None]:
!pip install datasets
!pip install transformers

## Permissions

In [None]:
s3_bucket = "govuk-data-infrastructure-integration"

In [None]:
import sagemaker

sess = (
    sagemaker.Session()
)  # Manage interactions with the Amazon SageMaker APIs and any other AWS services needed.
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = s3_bucket
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

## Preprocessing

We use the datasets library to download and preprocess the emotion dataset. After preprocessing, the dataset is uploaded to our sagemaker_session_bucket for use in the training job. The emotion dataset consists of 16000 training examples, 2000 validation examples, and 2000 testing examples; this notebook uses only the train and test splits.

### Tokenization

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# tokenizer used in preprocessing
tokenizer_name = "distilbert-base-uncased"

# dataset used
dataset_name = "emotion"

# s3 key prefix for the data
s3_prefix = "model-data/huggingface_transformer_models/test_bucket/emotion"

In [None]:
# download tokenizer for 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)


# load dataset
train_dataset, test_dataset = load_dataset(dataset_name, split=["train", "test"])

# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# set format for pytorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
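To make the effect of padding="max_length" and truncation=True concrete, here is a minimal sketch in plain Python. It is not the tokenizer's implementation: the function pad_or_truncate and the sample ids are hypothetical, special tokens are ignored, and PAD_ID = 0 is an assumption. It only shows how every example ends up with the same length and how the attention mask separates real tokens from padding.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: id 0 is the padding token


def pad_or_truncate(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Cut token_ids to max_length, then right-pad with PAD_ID up to max_length."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [PAD_ID] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# a sequence shorter than max_length is padded on the right
short = pad_or_truncate([101, 1045, 2514, 102], max_length=6)
# a sequence longer than max_length is cut off
long = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short)
print(long)
```

Because every row has the same length after this step, the columns can be converted to fixed-shape PyTorch tensors by set_format.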

### Uploading data to sagemaker_session_bucket

After processing the datasets, we use the FileSystem integration of the datasets library to upload them to S3.

In [None]:
import botocore
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

# save train_dataset to s3
training_input_path = f"s3://{sess.default_bucket()}/{s3_prefix}/train"
train_dataset.save_to_disk(training_input_path, fs=s3)

# save test_dataset to s3
test_input_path = f"s3://{sess.default_bucket()}/{s3_prefix}/test"
test_dataset.save_to_disk(test_input_path, fs=s3)
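The S3 locations above are built with f-strings, which is fine as long as the prefix has no leading or trailing slash. As a small, optional sketch, a helper like the hypothetical s3_uri below joins the parts defensively; the bucket name example-bucket is a placeholder, not the bucket used in this notebook.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI without doubled slashes."""
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "s3://" + "/".join([bucket.strip("/")] + cleaned)


bucket = "example-bucket"  # hypothetical bucket name
prefix = "model-data/huggingface_transformer_models/test_bucket/emotion"

train_uri = s3_uri(bucket, prefix, "train")
test_uri = s3_uri(bucket, prefix, "test")
print(train_uri)
print(test_uri)
```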

## Fine-tuning & starting SageMaker Training Job

To create a SageMaker training job we need a HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In an Estimator we define which fine-tuning script is used as the entry_point, which instance_type is used, which hyperparameters are passed in, and so on.

In [None]:
!pygmentize ./scripts/train.py

In [None]:
import time

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters = {
    "epochs": 1,  # number of training epochs
    "train_batch_size": 32,  # batch size for training
    "eval_batch_size": 64,  # batch size for evaluation
    "learning_rate": 3e-5,  # learning rate used during training
    "model_id": "distilbert-base-uncased",  # pre-trained model
    "fp16": True,  # Whether to use 16-bit (mixed) precision training
}
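SageMaker passes the hyperparameters dictionary to the entry point script as command-line arguments (for example --epochs 1), and exposes each data channel through an SM_CHANNEL_<NAME> environment variable. The following is a minimal sketch of how a training script could read them. It is not the actual contents of scripts/train.py; the argument names training_dir and test_dir and the helper str_to_bool are assumptions made for illustration.

```python
import argparse
import os
from typing import List, Optional


def str_to_bool(value: str) -> bool:
    """Interpret 'True'/'false'/'1' style strings; plain bool() would treat 'False' as True."""
    return value.strip().lower() in {"true", "1", "yes"}


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # hyperparameters sent from the notebook arrive as "--name value" strings
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=64)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--model_id", type=str, default="distilbert-base-uncased")
    parser.add_argument("--fp16", type=str_to_bool, default=True)
    # data channels: SM_CHANNEL_<NAME> holds the local directory of each input channel
    parser.add_argument("--training_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    # parse_known_args tolerates any additional arguments the script does not declare
    args, _ = parser.parse_known_args(argv)
    return args


# simulate the arguments a training job would receive
args = parse_args(["--epochs", "1", "--learning_rate", "3e-05", "--fp16", "True"])
print(args.epochs, args.learning_rate, args.fp16)
```

Parsing fp16 through a helper rather than type=bool matters because every value arrives as a string, and bool("False") evaluates to True.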

In [None]:
# define Training Job Name
job_name = f'huggingface-test-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point="train.py",  # fine-tuning script used in the training job
    source_dir="./scripts",  # directory where the fine-tuning script is stored
    instance_type="ml.p3.2xlarge",  # instance type used for the training job
    instance_count=1,  # the number of instances used for training
    base_job_name=job_name,  # prefix for the training job name
    role=role,  # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version="4.6.1",  # the transformers version used in the training job
    pytorch_version="1.7.1",  # the PyTorch version used in the training job
    py_version="py36",  # the Python version used in the training job
    hyperparameters=hyperparameters,  # the hyperparameters used for running the training job
)

In [None]:
# define a data input dictionary with our uploaded s3 uris
data = {"train": training_input_path, "test": test_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
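As a rough sanity check on the size of this job, the number of optimizer steps follows from the dataset size and the batch size. The sketch below assumes a single GPU and no gradient accumulation; both are assumptions about the training script, which is not shown here.

```python
import math

train_examples = 16000  # size of the emotion training split
train_batch_size = 32   # from the hyperparameters
epochs = 1              # from the hyperparameters
num_gpus = 1            # assumption: a single GPU and no gradient accumulation

# number of batches needed to see every training example once
steps_per_epoch = math.ceil(train_examples / (train_batch_size * num_gpus))
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

With these values one epoch is 500 steps, so the step counter in the training logs should end near that number if the assumptions hold.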

## Deploying the endpoint

To deploy our endpoint, we call deploy() on our HuggingFace estimator object, passing in our desired number of instances and instance type.

In [None]:
# predictor = huggingface_estimator.deploy(1,"ml.g4dn.xlarge")

Then, we use the returned predictor object to call the endpoint.

In [None]:
# sentences = [{"inputs": "I get so nervous before a demo"}, #fear
#              {"inputs": "I am shocked that the API works so well "}, #surprise
#              {"inputs": "It's a shame that I haven't learned this sooner"}, #sadness
#              {"inputs": "It's a disgrace that AWS is not free"}, #anger
#              {"inputs": "I am delighted to have learned this amazing new technology"}, #joy
#              {"inputs": "I was so shocked at my surprise party. I also hated every minute of it."} #surprise/anger
#             ]

# for sentence in sentences:
#     prediction = predictor.predict(sentence)
#     print(prediction)
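Depending on how the fine-tuned model's config was saved, the endpoint may return generic names such as LABEL_4 instead of emotion names. The sketch below maps them back using the label order of the emotion dataset. The response shape (a list of dicts with label and score) is an assumption, and the helper readable and the example response are hypothetical.

```python
from typing import Dict, List

# label order of the emotion dataset's class labels
EMOTION_LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def readable(prediction: List[Dict]) -> List[Dict]:
    """Replace generic 'LABEL_<n>' names with emotion names; leave other labels untouched."""
    result = []
    for item in prediction:
        label = item["label"]
        if label.startswith("LABEL_") and label[6:].isdigit():
            idx = int(label[6:])
            if idx < len(EMOTION_LABELS):
                label = EMOTION_LABELS[idx]
        result.append({"label": label, "score": item["score"]})
    return result


# hypothetical response, shaped like a text-classification output
example = [{"label": "LABEL_4", "score": 0.97}]
print(readable(example))
```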

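**IMPORTANT** Finally, we delete the inference endpoint, since a running endpoint keeps incurring charges. Run this cell only if the deploy cell above was uncommented and executed; otherwise predictor is undefined.

In [None]:
# predictor.delete_endpoint()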