# Hugging Face SageMaker - Getting Started Demo
### Sentiment Analysis with `DistilBERT` and `imdb` dataset

1. [Introduction](#Introduction)  
2. [Environment and Permissions](#Environment-and-Permissions)
3. [Preprocess - Tokenization of the dataset](#Preprocessing)   
4. [Creating an Estimator and start a training job](#Creating-an-Estimator-and-start-a-training-job)  
5. [Deploying the endpoint](#Deploying-the-endpoint)  

# Introduction

Welcome to our end-to-end binary text-classification example. In this demo, we will use the Hugging Face `transformers` and `datasets` libraries together with a custom Amazon SageMaker SDK extension to fine-tune a pre-trained transformer for binary text classification. In particular, the pre-trained model will be fine-tuned on the `imdb` dataset. To get started, we need to set up the environment with a few prerequisite steps for permissions, configurations, and so on.

![Architecture](./files/architecture.png)

_**NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances.**_

# Environment and Permissions 

In [2]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [3]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::394934205700:role/service-role/AmazonSageMaker-ExecutionRole-20220201T182213
sagemaker bucket: sagemaker-us-east-1-394934205700
sagemaker session region: us-east-1


## Visualizing our data
We use the `datasets` library to download the `imdb` dataset. The [imdb](http://ai.stanford.edu/~amaas/data/sentiment/) dataset consists of 25,000 highly polar movie reviews for training and 25,000 for testing.
Let's see what our dataset looks like.

In [4]:
from datasets import load_dataset
train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test'])

Found cached dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
100%|██████████| 2/2 [00:00<00:00, 317.67it/s]


In [5]:
train_dataset, test_dataset

(Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }),
 Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }))

In [6]:
train_dataset[0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

# Preprocessing

Before you can train a model on a dataset, the data needs to be preprocessed into the input format the model expects. Whether your data is text, images, or audio, it needs to be converted and assembled into batches of tensors.
For text, use a [Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) to convert the text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
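
To make these three steps concrete, here is a minimal sketch in plain Python. It is a toy whitespace tokenizer with a made-up vocabulary, not the subword tokenizer that the pre-trained model actually uses; the names `TOY_VOCAB` and `encode` and all token IDs are hypothetical. It only illustrates splitting text into tokens, mapping tokens to integer IDs, and padding to a fixed length with an attention mask.

```python
# Toy illustration only: a whitespace tokenizer with a tiny made-up vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "rented": 5, "this": 6, "movie": 7}


def encode(text, max_length=8):
    """Return input_ids and attention_mask, padded or truncated to max_length."""
    # 1. Split the lowercased text into tokens (real tokenizers use subwords).
    tokens = ["[CLS]"] + text.lower().split()
    # Truncate so there is always room for the closing [SEP] token.
    tokens = tokens[: max_length - 1] + ["[SEP]"]
    # 2. Map each token to an integer ID, falling back to [UNK].
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # 3. Pad to a fixed length; the mask marks real tokens (1) vs padding (0).
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    input_ids += [TOY_VOCAB["[PAD]"]] * padding
    attention_mask += [0] * padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = encode("I rented this great movie")
print(encoded)
```

The word "great" is not in the toy vocabulary, so it maps to the `[UNK]` ID, and the final position is padding with a mask value of 0.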

## Tokenization 

In [7]:
from sagemaker import get_execution_role
from sagemaker.pytorch.processing import PyTorchProcessor

pytorch_processor = PyTorchProcessor(role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1,
                                     framework_version='1.13',
                                     py_version='py39')

In [8]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from time import gmtime, strftime 

s3_prefix = "tlv-summit-demo"
processing_job_name = "tlv-summit-demo-{}".format(strftime("%d-%H-%M-%S", gmtime()))
output_destination = 's3://{}/{}'.format(sess.default_bucket(), s3_prefix)

pytorch_processor.run(
    code='preprocessing.py',
    source_dir='scripts/preprocess',
    job_name=processing_job_name,
    outputs=[
        ProcessingOutput(output_name='train',
            destination='{}/train'.format(output_destination),
            source='/opt/ml/processing/train'),
        ProcessingOutput(output_name='test',
            destination='{}/test'.format(output_destination),
            source='/opt/ml/processing/test')
    ]
)

INFO:sagemaker:Creating processing-job with name tlv-summit-demo-30-08-10-36


Using provided s3_resource
Collecting transformers==4.26 (from -r requirements.txt (line 1))
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 53.3 MB/s eta 0:00:00
Collecting datasets==2.10.1 (from -r requirements.txt (line 2))
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 469.0/469.0 kB 48.8 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.26->-r requirements.txt (line 1))
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 224.5/224.5 kB 41.1 MB/s eta 0:00:00
Collecting regex!=2019.12.17 (from transformers==4.26->-r requirements.txt (line 1))
  Downloading regex-2023.3.23-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (768 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 769.0/769.0 kB 68.9 MB/s eta 0:00:00
Collecting tokenizers!=0

In [9]:
preprocessing_job_description = pytorch_processor.jobs[-1].describe()
preprocessing_job_description

{'ProcessingInputs': [{'InputName': 'code',
   'AppManaged': False,
   'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-394934205700/tlv-summit-demo-30-08-10-36/source/sourcedir.tar.gz',
    'LocalPath': '/opt/ml/processing/input/code/',
    'S3DataType': 'S3Prefix',
    'S3InputMode': 'File',
    'S3DataDistributionType': 'FullyReplicated',
    'S3CompressionType': 'None'}},
  {'InputName': 'entrypoint',
   'AppManaged': False,
   'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-394934205700/tlv-summit-demo-30-08-10-36/source/runproc.sh',
    'LocalPath': '/opt/ml/processing/input/entrypoint',
    'S3DataType': 'S3Prefix',
    'S3InputMode': 'File',
    'S3DataDistributionType': 'FullyReplicated',
    'S3CompressionType': 'None'}}],
 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'train',
    'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-394934205700/tlv-summit-demo/train',
     'LocalPath': '/opt/ml/processing/train',
     'S3UploadMode': 'EndOfJob'},
    'AppManaged': False},
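
The job description is a plain dictionary, so the S3 locations of the processing outputs can be read from it directly. The sketch below shows one way to do that; `output_uris` is a hypothetical helper name, and the sample dictionary copies only the `train` entry that is visible in the truncated output above.

```python
def output_uris(job_description):
    """Map each processing output name to its S3 URI."""
    outputs = job_description.get("ProcessingOutputConfig", {}).get("Outputs", [])
    return {out["OutputName"]: out["S3Output"]["S3Uri"] for out in outputs}


# Sample shaped like the (truncated) description shown above.
sample_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "train",
                "S3Output": {
                    "S3Uri": "s3://sagemaker-us-east-1-394934205700/tlv-summit-demo/train",
                    "LocalPath": "/opt/ml/processing/train",
                    "S3UploadMode": "EndOfJob",
                },
                "AppManaged": False,
            }
        ]
    }
}

print(output_uris(sample_description))
```

In the notebook, the same helper could be called as `output_uris(preprocessing_job_description)`.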
 

### Visualizing our processed dataset
Let's load our tokenized dataset and see how it looks.

In [10]:
from datasets import load_from_disk
s3_prefix = "tlv-summit-demo"
processed_train_dataset = load_from_disk('s3://{}/{}/train'.format(sess.default_bucket(), s3_prefix))
processed_train_dataset

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 25000
})

In [11]:
processed_train_dataset[0]

{'labels': tensor(0),
 'input_ids': tensor([  101,  1045, 12524,  1045,  2572,  8025,  1011,  3756,  2013,  2026,
          2678,  3573,  2138,  1997,  2035,  1996,  6704,  2008,  5129,  2009,
          2043,  2009,  2001,  2034,  2207,  1999,  3476,  1012,  1045,  2036,
          2657,  2008,  2012,  2034,  2009,  2001,  8243,  2011,  1057,  1012,
          1055,  1012,  8205,  2065,  2009,  2412,  2699,  2000,  4607,  2023,
          2406,  1010,  3568,  2108,  1037,  5470,  1997,  3152,  2641,  1000,
          6801,  1000,  1045,  2428,  2018,  2000,  2156,  2023,  2005,  2870,
          1012,  1026,  7987,  1013,  1028,  1026,  7987,  1013,  1028,  1996,
          5436,  2003,  8857,  2105,  1037,  2402,  4467,  3689,  3076,  2315,
         14229,  2040,  4122,  2000,  4553,  2673,  2016,  2064,  2055,  2166,
          1012,  1999,  3327,  2016,  4122,  2000,  3579,  2014,  3086,  2015,
          2000,  2437,  2070,  4066,  1997,  4516,  2006,  2054,  1996,  2779,
         25430, 1

## Creating an Estimator and start a training job

In [12]:
from sagemaker.huggingface import HuggingFace

training_input_path = 's3://{}/{}/train'.format(sess.default_bucket(), s3_prefix)
test_input_path = 's3://{}/{}/test'.format(sess.default_bucket(), s3_prefix)

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,
                 'train_batch_size': 32,
                 'model_name':'distilbert-base-uncased',
                 'learning_rate': 0.00003
                 }

In [13]:
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts/train',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.26',
                            pytorch_version='1.13',
                            py_version='py39',
                            hyperparameters = hyperparameters)
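
The `train.py` entry point is not shown in this notebook. As a rough sketch of how such a script could receive its configuration: SageMaker passes each hyperparameter to the entry point as a command-line flag and exposes each input channel through an `SM_CHANNEL_<NAME>` environment variable. The function name `parse_args` and the argument names `--training_dir` and `--test_dir` are assumptions for illustration, not taken from the actual script.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the hyperparameters that SageMaker passes as command-line flags."""
    parser = argparse.ArgumentParser()
    # Hyperparameters: names mirror the keys of the `hyperparameters` dict.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    # Data channels: SageMaker sets SM_CHANNEL_<NAME> for each key passed to fit().
    parser.add_argument("--training_dir", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test_dir", default=os.environ.get("SM_CHANNEL_TEST"))
    # parse_known_args ignores any extra flags the platform may add.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the flags generated from the hyperparameters dict above.
args = parse_args(["--epochs", "1", "--train_batch_size", "32",
                   "--model_name", "distilbert-base-uncased",
                   "--learning_rate", "3e-05"])
print(args.epochs, args.train_batch_size, args.model_name, args.learning_rate)
```

Inside a real training job, `parse_args()` would be called without arguments so that it reads `sys.argv`.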

In [14]:
# start the training job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-04-30-08-16-36-610


Using provided s3_resource
2023-04-30 08:16:37 Starting - Starting the training job...
2023-04-30 08:17:02 Starting - Preparing the instances for training......
2023-04-30 08:18:03 Downloading - Downloading input data...
2023-04-30 08:18:25 Training - Downloading the training image........................
2023-04-30 08:22:11 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-04-30 08:22:27,888 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-04-30 08:22:27,906 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-04-30 08:22:27,918 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-04-30 08:22:27,921 sagemaker_pytorch_container.training INFO     Invoking user training script.

## Deploying the endpoint

To deploy our endpoint, we call `deploy()` on our HuggingFace estimator object, passing in our desired number of instances and instance type.

In [19]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")

INFO:sagemaker:Creating model with name: huggingface-pytorch-training-2023-04-30-10-31-22-825
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-training-2023-04-30-10-31-22-825
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-training-2023-04-30-10-31-22-825


----------!

Then, we use the returned predictor object to call the endpoint.

In [16]:
sentiment_input= {"inputs":"This is the best movie ever made in history, an absolute sculpted work of art that depicts every emotion of human existence, from suffering, to courage to love, in front of the background of political astuteness and socio-hierarchal analysis."}

predictor.predict(sentiment_input)

[{'label': 'LABEL_1', 'score': 0.9899144172668457}]

In [17]:
sentiment_input= {"inputs":"Another bloated film that gets all the history wrong, turns all of the characters into stick figures and makes piles of money for the star."}

predictor.predict(sentiment_input)

[{'label': 'LABEL_0', 'score': 0.972507655620575}]
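
The endpoint returns the generic names `LABEL_0` and `LABEL_1`. Assuming the label IDs follow the `imdb` dataset convention (0 = negative, 1 = positive), which is consistent with the two predictions above, a small helper can make the responses easier to read. `ID2SENTIMENT` and `readable` are hypothetical names used only for this sketch.

```python
# IMDb label convention: 0 = negative review, 1 = positive review.
ID2SENTIMENT = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable(predictions):
    """Replace generic LABEL_<id> names with sentiment names."""
    return [
        {"sentiment": ID2SENTIMENT.get(p["label"], p["label"]), "score": p["score"]}
        for p in predictions
    ]


print(readable([{"label": "LABEL_1", "score": 0.9899144172668457}]))
print(readable([{"label": "LABEL_0", "score": 0.972507655620575}]))
```

With a live endpoint, this could be used as `readable(predictor.predict(sentiment_input))`.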

Finally, we delete the model and the endpoint.

In [18]:
predictor.delete_model()
predictor.delete_endpoint()

INFO:sagemaker:Deleting model with name: huggingface-pytorch-training-2023-04-30-08-32-59-259
INFO:sagemaker:Deleting endpoint configuration with name: huggingface-pytorch-training-2023-04-30-08-32-59-259
INFO:sagemaker:Deleting endpoint with name: huggingface-pytorch-training-2023-04-30-08-32-59-259
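
If a request fails before the cleanup cell runs, the endpoint keeps running. One possible way to guard against that, sketched here under the assumption that the predictor exposes the same `delete_model()` and `delete_endpoint()` methods used above, is a small context manager. `temporary_endpoint` and `FakePredictor` are hypothetical names; the fake class exists only so the sketch runs without AWS access.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(predictor):
    """Yield the predictor and always delete its model and endpoint afterwards."""
    try:
        yield predictor
    finally:
        # Same two calls as in the cell above, run even if a request raised.
        predictor.delete_model()
        predictor.delete_endpoint()


class FakePredictor:
    """Stand-in for a real predictor so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def predict(self, data):
        return [{"label": "LABEL_1", "score": 1.0}]

    def delete_model(self):
        self.calls.append("delete_model")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


fake = FakePredictor()
with temporary_endpoint(fake) as p:
    result = p.predict({"inputs": "A great film."})
print(fake.calls)
```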
