# Hugging Face SageMaker - Text-Classification
### Sentiment Analysis with `DistilBERT` and `imdb` dataset

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

---

1. [Introduction](#Introduction)  
2. [Environment and Permissions](#Environment-and-Permissions)
3. [Preprocess - Tokenization of the dataset](#Preprocessing)   
4. [Creating an Estimator and start a training job](#Creating-an-Estimator-and-start-a-training-job)  
5. [Deploying the endpoint](#Deploying-the-endpoint)  

# Introduction

Welcome to our end-to-end binary text-classification example. This demo uses the Hugging Face `transformers` and `datasets` libraries together with a custom Amazon SageMaker SDK extension to fine-tune a pre-trained transformer for binary text classification. The pre-trained model is fine-tuned on the `imdb` dataset. The following diagram illustrates what we will do:

![ Architecture](./files/architecture.png)

_**NOTE: You can run this demo in SageMaker Studio, on a SageMaker Notebook Instance, or on your local machine, using a PyTorch 1.13, Python 3.9 environment.**_

# Environment and Permissions 

In [None]:
!pip install datasets
!pip install -U sagemaker

In [None]:
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

## Visualizing our data
We are using the `datasets` library to download the `imdb` [dataset](https://huggingface.co/datasets/imdb). The dataset consists of 25,000 highly polar movie reviews for training and 25,000 for testing.
Let's see what our dataset looks like.

In [None]:
from datasets import load_dataset

train_dataset, test_dataset = load_dataset("imdb", split=["train", "test"])

In [9]:
train_dataset, test_dataset

(Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }),
 Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }))

In [59]:
train_dataset[10]

{'text': 'It was great to see some of my favorite stars of 30 years ago including John Ritter, Ben Gazarra and Audrey Hepburn. They looked quite wonderful. But that was it. They were not given any characters or good lines to work with. I neither understood or cared what the characters were doing.<br /><br />Some of the smaller female roles were fine, Patty Henson and Colleen Camp were quite competent and confident in their small sidekick parts. They showed some talent and it is sad they didn\'t go on to star in more and better films. Sadly, I didn\'t think Dorothy Stratten got a chance to act in this her only important film role.<br /><br />The film appears to have some fans, and I was very open-minded when I started watching it. I am a big Peter Bogdanovich fan and I enjoyed his last movie, "Cat\'s Meow" and all his early ones from "Targets" to "Nickleodeon". So, it really surprised me that I was barely able to keep awake watching this one.<br /><br />It is ironic that this movie is a

# Preprocessing

Before you can train a model on a dataset, the data needs to be preprocessed into the input format the model expects. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors.
For text, use a [Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) to convert the text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
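
To make these three steps concrete, here is a minimal, self-contained sketch. It is a toy whitespace tokenizer, not the Hugging Face tokenizer this notebook relies on; the vocabulary, the `PAD_ID`/`UNK_ID` values, and the helper names `build_vocab` and `encode_batch` are all assumptions made for illustration.

```python
# Toy illustration only: real tokenizers (e.g. the one shipped with DistilBERT)
# split text into subwords and use a fixed, pre-trained vocabulary.
PAD_ID, UNK_ID = 0, 1  # hypothetical special-token ids


def build_vocab(texts):
    """Assign an integer id to every lower-cased whitespace token."""
    vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_batch(texts, vocab, max_length):
    """Convert texts to equal-length id lists plus attention masks."""
    input_ids, attention_mask = [], []
    for text in texts:
        ids = [vocab.get(tok, UNK_ID) for tok in text.lower().split()]
        ids = ids[:max_length]  # truncate long inputs
        n_pad = max_length - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)  # pad short inputs
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


reviews = ["a great movie", "not great"]
vocab = build_vocab(reviews)
batch = encode_batch(reviews, vocab, max_length=4)
print(batch["input_ids"])       # [[2, 3, 4, 0], [5, 3, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 0, 0]]
```

The `input_ids` and `attention_mask` keys mirror the feature names of the processed dataset shown later in this notebook; a framework would then turn these equal-length lists into tensors.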

## Tokenization 

In [None]:
from sagemaker.pytorch.processing import PyTorchProcessor

pytorch_processor = PyTorchProcessor(
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    framework_version="1.13",
    py_version="py39",
)

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from time import gmtime, strftime

s3_prefix = "huggingface-text-classification"
processing_job_name = "{}-{}".format(s3_prefix, strftime("%d-%H-%M-%S", gmtime()))
output_destination = "s3://{}/{}".format(sess.default_bucket(), s3_prefix)

pytorch_processor.run(
    code="preprocessing.py",
    source_dir="scripts/preprocess",
    job_name=processing_job_name,
    outputs=[
        ProcessingOutput(
            output_name="train",
            destination="{}/train".format(output_destination),
            source="/opt/ml/processing/train",
        ),
        ProcessingOutput(
            output_name="test",
            destination="{}/test".format(output_destination),
            source="/opt/ml/processing/test",
        ),
    ],
)
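
The job name and S3 destinations above are assembled with plain string formatting. The sketch below wraps the same logic in two small helpers and adds a sanity check on the job name. The helper names `make_job_name` and `s3_uri`, the bucket name `my-example-bucket`, and the exact naming rule (at most 63 characters, letters, digits, and inner hyphens) are assumptions for illustration; check the SageMaker API reference for the authoritative constraint.

```python
import re
from time import gmtime, strftime

# Assumed rule: starts and ends with a letter or digit, hyphens allowed inside.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$")
MAX_JOB_NAME_LENGTH = 63  # assumed upper bound


def make_job_name(prefix, when=None):
    """Append a day-hour-minute-second UTC suffix, as in the cell above."""
    suffix = strftime("%d-%H-%M-%S", when if when is not None else gmtime())
    name = "{}-{}".format(prefix, suffix)
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.match(name):
        raise ValueError("invalid job name: {!r}".format(name))
    return name


def s3_uri(bucket, *parts):
    """Join a bucket and key parts into an s3:// URI without double slashes."""
    key = "/".join(p.strip("/") for p in parts if p)
    return "s3://{}/{}".format(bucket, key) if key else "s3://{}".format(bucket)


fixed_time = gmtime(0)  # 1970-01-01 00:00:00 UTC, for a reproducible example
print(make_job_name("huggingface-text-classification", fixed_time))
print(s3_uri("my-example-bucket", "huggingface-text-classification", "train"))
```

Because the suffix contains only the day of the month and the time, two jobs started at the same second of the same day in different months would produce the same name; adding the year and month to the format string would avoid that.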

In [None]:
preprocessing_job_description = pytorch_processor.jobs[-1].describe()
preprocessing_job_description

### Visualizing our processed dataset
Let's load our tokenized dataset and see what it looks like.

In [49]:
from datasets import load_from_disk

s3_prefix = "huggingface-text-classification"
processed_train_dataset = load_from_disk(
    "s3://{}/{}/train".format(sess.default_bucket(), s3_prefix)
)
processed_train_dataset

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 25000
})

In [60]:
processed_train_dataset[10]

{'labels': tensor(0),
 'input_ids': tensor([  101,  2009,  2001,  2307,  2000,  2156,  2070,  1997,  2026,  5440,
          3340,  1997,  2382,  2086,  3283,  2164,  2198, 23168,  1010,  3841,
         14474, 11335,  1998, 14166, 22004,  1012,  2027,  2246,  3243,  6919,
          1012,  2021,  2008,  2001,  2009,  1012,  2027,  2020,  2025,  2445,
          2151,  3494,  2030,  2204,  3210,  2000,  2147,  2007,  1012,  1045,
          4445,  5319,  2030,  8725,  2054,  1996,  3494,  2020,  2725,  1012,
          1026,  7987,  1013,  1028,  1026,  7987,  1013,  1028,  2070,  1997,
          1996,  3760,  2931,  4395,  2020,  2986,  1010, 17798, 27227,  1998,
         28385,  3409,  2020,  3243, 17824,  1998,  9657,  1999,  2037,  2235,
         29240,  3033,  1012,  2027,  3662,  2070,  5848,  1998,  2009,  2003,
          6517,  2027,  2134,  1005,  1056,  2175,  2006,  2000,  2732,  1999,
          2062,  1998,  2488,  3152,  1012, 13718,  1010,  1045,  2134,  1005,
          1056,  

# Creating an Estimator and start a training job

In [21]:
from sagemaker.huggingface import HuggingFace

training_input_path = "s3://{}/{}/train".format(sess.default_bucket(), s3_prefix)
test_input_path = "s3://{}/{}/test".format(sess.default_bucket(), s3_prefix)

# hyperparameters, which are passed into the training job
hyperparameters = {
    "epochs": 1,
    "train_batch_size": 32,
    "model_name": "distilbert-base-uncased",
    "learning_rate": 0.00003,
}

In [22]:
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts/train",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters=hyperparameters,
)

In [None]:
# start the training job with the preprocessed datasets in S3 as input
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
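
In SageMaker script mode, each entry of `hyperparameters` is passed to the entry point as a `--name value` command-line argument, and the channels given to `fit()` are exposed inside the container through environment variables such as `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST`. The notebook's `train.py` is not shown here, so the following is only a sketch of how such a script might read these values; the argument names `--training_dir`, `--test_dir`, and `--model_dir`, the defaults, and the fallback paths are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and data locations the way a training script might."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as "--name value" pairs.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    # Data and model locations are set by SageMaker inside the container;
    # the fallbacks are the conventional container paths.
    parser.add_argument(
        "--training_dir",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--test_dir",
        default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"),
    )
    parser.add_argument(
        "--model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    # parse_known_args ignores extra arguments the script does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the command line built from the hyperparameters dict above.
args = parse_args(
    [
        "--epochs", "1",
        "--train_batch_size", "32",
        "--model_name", "distilbert-base-uncased",
        "--learning_rate", "3e-05",
    ]
)
print(args.epochs, args.train_batch_size, args.model_name, args.learning_rate)
```

Typed arguments (`type=int`, `type=float`) matter here because command-line values always arrive as strings.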

# Deploying the endpoint

To deploy our endpoint, we call `deploy()` on our HuggingFace estimator object, passing in our desired number of instances and instance type.

In [None]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")

Then, we use the returned predictor object to call the endpoint.

In [27]:
sentiment_input = {
    "inputs": "This is the best movie ever made in history, an absolute sculpted work of art that depicts every emotion of human existence, from suffering, to courage to love, in front of the background of political astuteness and socio-hierarchal analysis."
}

predictor.predict(sentiment_input)

[{'label': 'LABEL_1', 'score': 0.984623908996582}]

In [28]:
sentiment_input = {
    "inputs": "Another bloated film that gets all the history wrong, turns all of the characters into stick figures and makes piles of money for the star."
}

predictor.predict(sentiment_input)

[{'label': 'LABEL_0', 'score': 0.9781230092048645}]
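
The endpoint returns the generic names `LABEL_0` and `LABEL_1`. In the `imdb` dataset, label `0` marks a negative review and label `1` a positive one, which matches the two predictions above. The sketch below maps a response to a readable result; the helper name `readable_prediction` is hypothetical, and it assumes the fine-tuned model keeps the dataset's label ids.

```python
# Assumption: the model keeps imdb's label ids (0 = negative, 1 = positive).
LABEL_TO_SENTIMENT = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable_prediction(response):
    """Map the first prediction in an endpoint response to a readable dict."""
    top = response[0]
    return {
        "sentiment": LABEL_TO_SENTIMENT[top["label"]],
        "score": round(top["score"], 4),
    }


# Sample responses copied from the outputs above.
print(readable_prediction([{"label": "LABEL_1", "score": 0.984623908996582}]))
print(readable_prediction([{"label": "LABEL_0", "score": 0.9781230092048645}]))
```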

Finally, we delete the model and the endpoint so that they no longer incur charges.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)

![ This badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker-script-mode|pytorch-sagemaker-huggingface|huggingface_text_classification.ipynb)
