# Text Classification with Amazon SageMaker HuggingFace and Hyperparameter Tuning

Automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs that test a range of hyperparameter values on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric from among the metrics that the algorithm computes. Automatic model tuning then searches the chosen ranges for the combination of values that produces the model that best optimizes the objective metric.


## Introduction

Text classification can be used to solve various use cases such as sentiment analysis, spam detection, and hashtag prediction.


This notebook demonstrates the use of the [HuggingFace `transformers` library](https://huggingface.co/transformers/) together with a custom Amazon SageMaker SDK extension to fine-tune a pre-trained transformer for multi-class text classification. In particular, the pre-trained model will be fine-tuned on the [`20 newsgroups dataset`](http://qwone.com/~jason/20Newsgroups/). To get started, we need to set up the environment with a few prerequisite steps for permissions, configurations, and so on.

## Install Python packages

In [1]:
import sys

!{sys.executable} -m pip install "scikit_learn>=1.0.2" "sagemaker>=2.82.1" "transformers>=4.18.0" "datasets[s3]>=1.18.2" "nltk>=3.7"

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


If you run this notebook in SageMaker Studio, make sure `ipywidgets` is installed and restart the kernel by running the next cell.

In [2]:
%%capture
import IPython
import sys

!{sys.executable} -m pip install ipywidgets
IPython.Application.instance().kernel.do_shutdown(True)  # has to restart kernel so changes are used

## Setup

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK creates a default bucket in the same region, following a predefined naming convention.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the `get_execution_role` method from the SageMaker Python SDK.

In [1]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import pandas as pd
import re
import string
from sklearn.model_selection import train_test_split
import sagemaker.huggingface


sess = sagemaker.Session()
role = get_execution_role()

print(
    role
)  # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

bucket = sess.default_bucket()  # Replace with your own bucket name if needed
print(bucket)
s3_prefix = "huggingface/20_newsgroups"  # Replace with the prefix under which you want to store the data if needed

arn:aws:iam::976939723775:role/service-role/AmazonSageMaker-ExecutionRole-20210317T133000
sagemaker-us-west-2-976939723775


### Data Preparation

Now we'll download a dataset from the web on which we want to train the text classification model.

In this example, let us train the text classification model on the [`20 newsgroups dataset`](http://qwone.com/~jason/20Newsgroups/). The `20 newsgroups dataset` consists of approximately 20,000 messages taken from 20 Usenet newsgroups.

In [2]:
import os
import shutil

data_dir = "20_newsgroups_bulk"
if os.path.exists(data_dir):  # cleanup existing data folder
    shutil.rmtree(data_dir)

In [3]:
!aws s3 cp s3://sagemaker-sample-files/datasets/text/20_newsgroups/20_newsgroups_bulk.tar.gz .

download: s3://sagemaker-sample-files/datasets/text/20_newsgroups/20_newsgroups_bulk.tar.gz to ./20_newsgroups_bulk.tar.gz


In [4]:
!tar xzf 20_newsgroups_bulk.tar.gz
!ls 20_newsgroups_bulk

alt.atheism		  rec.autos	      sci.space
comp.graphics		  rec.motorcycles     soc.religion.christian
comp.os.ms-windows.misc   rec.sport.baseball  talk.politics.guns
comp.sys.ibm.pc.hardware  rec.sport.hockey    talk.politics.mideast
comp.sys.mac.hardware	  sci.crypt	      talk.politics.misc
comp.windows.x		  sci.electronics     talk.religion.misc
misc.forsale		  sci.med


In [5]:
file_list = [os.path.join(data_dir, f) for f in os.listdir(data_dir)]
print("Number of files:", len(file_list))

Number of files: 20


In [6]:
documents_count = 0
for file in file_list:
    df = pd.read_csv(file, header=None, names=["text"])
    documents_count = documents_count + df.shape[0]
print("Number of documents:", documents_count)

Number of documents: 19997


Let's inspect the dataset files and analyze the categories.

In [7]:
categories_list = [f.split("/")[1] for f in file_list]

In [8]:
categories_list

['talk.politics.misc',
 'rec.autos',
 'rec.sport.baseball',
 'alt.atheism',
 'sci.space',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'comp.graphics',
 'rec.sport.hockey',
 'talk.politics.mideast',
 'sci.crypt',
 'talk.politics.guns',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'misc.forsale',
 'sci.electronics',
 'soc.religion.christian',
 'rec.motorcycles',
 'sci.med',
 'talk.religion.misc']

We can see that the dataset consists of 20 topics, each in a separate file.

Let us inspect the dataset to understand how the data and the labels are provided.

In [9]:
df = pd.read_csv("./20_newsgroups_bulk/rec.motorcycles", header=None, names=["text"])
df

Unnamed: 0,text
0,Newsgroups: rec.motorcycles\nPath: cantaloupe....
1,Newsgroups: rec.motorcycles\nPath: cantaloupe....
2,Newsgroups: rec.motorcycles\nPath: cantaloupe....
3,Newsgroups: rec.motorcycles\nPath: cantaloupe....
4,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...
...,...
995,Path: cantaloupe.srv.cs.cmu.edu!rochester!corn...
996,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...
997,Newsgroups: rec.motorcycles\nPath: cantaloupe....
998,Newsgroups: rec.motorcycles\nPath: cantaloupe....


In [10]:
df["text"][0]

'Newsgroups: rec.motorcycles\nPath: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!agate!linus!linus.mitre.org!mbunix.mitre.org!cookson\nFrom: cookson@mbunix.mitre.org (Cookson)\nSubject: Volvo Attack!\nMessage-ID: <1993Apr21.143403.4644@linus.mitre.org>\nSender: news@linus.mitre.org (News Service)\nNntp-Posting-Host: mbunix.mitre.org\nOrganization: The MITRE Corp., Bedford, Ma.\nDate: Wed, 21 Apr 1993 14:34:03 GMT\nLines: 22\n\nI was privelged enough to experience my first Volvo attack this weekend.\n\nI was last in a line of traffic that was about 6 vehicles long, riding\ndown Rt. 40 in Groton Ma.  At the side of the road, sitting off on the\nshoulder was the killer Volvo in question.  No brake lights, no turn signal,\nnothing.  We were doing about 40 mph and I was following the cage in front\nof me about 2.5-3 sec. back.  Well, as said cage passes the Volvo, the\nBrain Dead Idiot (tm) behind the wheel decides that she 

In [11]:
df = pd.read_csv("./20_newsgroups_bulk/comp.sys.mac.hardware", header=None, names=["text"])
df

Unnamed: 0,text
0,Newsgroups: comp.sys.mac.hardware\nPath: canta...
1,Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv....
2,Newsgroups: comp.sys.mac.hardware\nPath: canta...
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...
4,Newsgroups: comp.sys.mac.hardware\nPath: canta...
...,...
995,Newsgroups: comp.sys.mac.hardware\nPath: canta...
996,Xref: cantaloupe.srv.cs.cmu.edu comp.sys.ibm.p...
997,Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv....
998,Newsgroups: comp.sys.mac.hardware\nPath: canta...


In [12]:
df["text"][0]

'Newsgroups: comp.sys.mac.hardware\nPath: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!howland.reston.ans.net!agate!headwall.Stanford.EDU!nntp.Stanford.EDU!cmwand\nFrom: cmwand@leland.Stanford.EDU (Christopher Wand)\nSubject: Re: Syquest 150 ???\nMessage-ID: <1993Apr20.043629.21237@leland.Stanford.EDU>\nSender: news@leland.Stanford.EDU (Mr News)\nOrganization: DSG, Stanford University, CA 94305, USA\nReferences: <93759@hydra.gatech.EDU>\nDistribution: usa\nDate: Tue, 20 Apr 93 04:36:29 GMT\nLines: 30\n\nIn article <93759@hydra.gatech.EDU> gt8798a@prism.gatech.EDU (Anthony S. Kim) writes:\n>I remember someone mention about a 150meg syquest.  Has anyone else\n>heard anything about this?  I\'d be interested in the cost per megabyte and the\n>approximate cost of the drive itself and how they compare to the Bernoulli 150.\n\nI think you must be talking about the Syquest 105 (code named Mesa I believe).\nIt is a 3.5" Winchester technology drive pretty much like the other Syquest\ndrives i

As we can see from the above, there is a single file for each class in the dataset. Each record is plain text with a header, body, footer, and quotes. We will need to process the records into a suitable data format.

## Data Preprocessing
We need to preprocess the dataset to remove the header, footer, quotes, leading/trailing whitespace, extra spaces, tabs, and HTML tags/markups. 

Download the `nltk` tokenizer and other libraries

In [13]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/alfred/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [24]:
from sklearn.datasets._twenty_newsgroups import (
    strip_newsgroup_header,
    strip_newsgroup_quoting,
    strip_newsgroup_footer,
)

The following function removes the header, footer, and quotes (of earlier messages) from each text.

In [25]:
def strip_newsgroup_item(item):
    item = strip_newsgroup_header(item)
    item = strip_newsgroup_quoting(item)
    item = strip_newsgroup_footer(item)
    return item

The following function will take care of removing leading/trailing whitespace, extra spaces, tabs, and HTML tags/markups.

In [26]:
def process_text(texts):
    final_text_list = []
    for text in texts:

        # Treat missing (non-string) values as empty text
        if not isinstance(text, str):
            text = ""

        filtered_sentence = []

        # Lowercase
        text = text.lower()

        # Strip leading/trailing whitespace, then remove bracketed text, URLs,
        # HTML tags, punctuation, newlines, and tokens that contain digits
        text = text.strip()
        text = re.sub(r"\[.*?\]", "", text)
        text = re.sub(r"https?://\S+|www\.\S+", "", text)
        text = re.sub(r"<.*?>+", "", text)
        text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
        text = re.sub(r"\n", "", text)  # note: this joins the words on either side of a line break
        text = re.sub(r"\w*\d\w*", "", text)

        for w in word_tokenize(text):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric
            if not w.isnumeric():
                filtered_sentence.append(w)
        final_string = " ".join(filtered_sentence)  # final string of cleaned words

        final_text_list.append(final_string)

    return final_text_list
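
To see what the regular expressions in `process_text` do, here is a small standalone sketch that applies the same substitutions to a single made-up string. It is only an illustration: the `clean` helper and the sample text are invented here, and `str.split()` stands in for `word_tokenize` so that the snippet runs without NLTK.

```python
import re
import string


def clean(text):
    """Apply the same substitutions as process_text to a single string."""
    text = text.lower().strip()
    text = re.sub(r"\[.*?\]", "", text)  # drop [bracketed] spans
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"<.*?>+", "", text)  # drop HTML-like tags
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # drop punctuation
    text = re.sub(r"\n", "", text)  # drop newlines (joins the surrounding words)
    text = re.sub(r"\w*\d\w*", "", text)  # drop tokens that contain digits
    # str.split() stands in for nltk.word_tokenize in this sketch
    return " ".join(w for w in text.split() if not w.isnumeric())


sample = "Visit <b>our</b> site https://example.com [ad] for 3D cards!\nCall now."
cleaned = clean(sample)
print(cleaned)  # visit our site for cardscall now
```

Note how `cards` and `call` are fused into one token because the newline is replaced with an empty string. Replacing it with a space would keep the words apart, at the cost of changing the word counts reported later in this notebook.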

Now we will read each of the `20_newsgroups` dataset files, call `strip_newsgroup_item` and `process_text` functions we defined earlier, and then aggregate all data into one dataframe.

In [27]:
frames = []

for file in file_list:
    print(f"Processing {file}")
    label = os.path.basename(file)  # the file name is the newsgroup name
    df = pd.read_csv(file, header=None, names=["text"])
    df["text"] = df["text"].apply(strip_newsgroup_item)
    df["text"] = process_text(df["text"].tolist())
    df["label"] = label
    frames.append(df)

# DataFrame.append is deprecated; concatenate once at the end instead
all_categories_df = pd.concat(frames, ignore_index=True)

Processing 20_newsgroups_bulk/talk.politics.misc
Processing 20_newsgroups_bulk/rec.autos
Processing 20_newsgroups_bulk/rec.sport.baseball
Processing 20_newsgroups_bulk/alt.atheism
Processing 20_newsgroups_bulk/sci.space
Processing 20_newsgroups_bulk/comp.sys.mac.hardware
Processing 20_newsgroups_bulk/comp.windows.x
Processing 20_newsgroups_bulk/comp.graphics
Processing 20_newsgroups_bulk/rec.sport.hockey
Processing 20_newsgroups_bulk/talk.politics.mideast
Processing 20_newsgroups_bulk/sci.crypt
Processing 20_newsgroups_bulk/talk.politics.guns
Processing 20_newsgroups_bulk/comp.os.ms-windows.misc
Processing 20_newsgroups_bulk/comp.sys.ibm.pc.hardware
Processing 20_newsgroups_bulk/misc.forsale
Processing 20_newsgroups_bulk/sci.electronics
Processing 20_newsgroups_bulk/soc.religion.christian
Processing 20_newsgroups_bulk/rec.motorcycles
Processing 20_newsgroups_bulk/sci.med
Processing 20_newsgroups_bulk/talk.religion.misc


Let's inspect how many categories there are in our dataset.

In [28]:
all_categories_df["label"].value_counts()

talk.politics.misc          1000
rec.autos                   1000
sci.med                     1000
rec.motorcycles             1000
sci.electronics             1000
misc.forsale                1000
comp.sys.ibm.pc.hardware    1000
comp.os.ms-windows.misc     1000
talk.politics.guns          1000
sci.crypt                   1000
talk.politics.mideast       1000
rec.sport.hockey            1000
comp.graphics               1000
comp.windows.x              1000
comp.sys.mac.hardware       1000
sci.space                   1000
alt.atheism                 1000
rec.sport.baseball          1000
talk.religion.misc          1000
soc.religion.christian       997
Name: label, dtype: int64

Our dataset has 20 categories, which is too many, so we will combine the sub-categories into broader groups.

In [29]:
# Map each of the 20 newsgroups to one of six broader categories
category_map = {
    # politics
    "talk.politics.misc": "politics",
    "talk.politics.guns": "politics",
    "talk.politics.mideast": "politics",
    # recreational
    "rec.sport.hockey": "recreational",
    "rec.sport.baseball": "recreational",
    "rec.autos": "recreational",
    "rec.motorcycles": "recreational",
    # religion
    "soc.religion.christian": "religion",
    "talk.religion.misc": "religion",
    "alt.atheism": "religion",
    # computer
    "comp.windows.x": "computer",
    "comp.sys.ibm.pc.hardware": "computer",
    "comp.os.ms-windows.misc": "computer",
    "comp.graphics": "computer",
    "comp.sys.mac.hardware": "computer",
    # sales
    "misc.forsale": "sales",
    # science
    "sci.crypt": "science",
    "sci.electronics": "science",
    "sci.med": "science",
    "sci.space": "science",
}

# Assign the result back instead of using inplace=True on a column selection
all_categories_df["label"] = all_categories_df["label"].replace(category_map)

Now we are left with 6 categories, which is much better.

In [30]:
all_categories_df["label"].value_counts()

computer        5000
recreational    4000
science         4000
politics        3000
religion        2997
sales           1000
Name: label, dtype: int64

Let's calculate the number of words in each row.

In [31]:
all_categories_df["word_count"] = all_categories_df["text"].apply(lambda x: len(str(x).split()))
all_categories_df.head()

Unnamed: 0,text,label,word_count
0,too many,politics,2
1,when mcmanus says we have the worlds best medi...,politics,819
2,im addicted to chocolate myself,politics,5
3,wow does this mean out of homosexuals will be ...,politics,26
4,if you cant convict em dont bust em plea bargi...,politics,18


Let's get basic statistics about the dataset.

In [32]:
all_categories_df["word_count"].describe()

count    19997.000000
mean       159.346102
std        434.479067
min          0.000000
25%         37.000000
50%         74.000000
75%        148.000000
max      11351.000000
Name: word_count, dtype: float64

The mean is around 159 words, but there are outliers, such as a text with 11,351 words, and the minimum of 0 shows that some texts are empty after cleaning. Very long and empty texts make it harder for the model to train well, so we will drop those rows.

Let's drop empty rows first.

In [33]:
no_text = all_categories_df[all_categories_df["word_count"] == 0]
print(len(no_text))

# drop these rows
all_categories_df.drop(no_text.index, inplace=True)

90


Let's drop the rows that are longer than 256 words. This cutoff is well above the 75th percentile of the word count (148 words), so most documents are kept while the long outliers are removed.

In [34]:
long_text = all_categories_df[all_categories_df["word_count"] > 256]
print(len(long_text))

# drop these rows
all_categories_df.drop(long_text.index, inplace=True)

2409


In [35]:
all_categories_df["label"].value_counts()

computer        4659
recreational    3675
science         3506
politics        2370
religion        2349
sales            939
Name: label, dtype: int64

Let's get basic statistics about the dataset after removing the outliers.

In [36]:
all_categories_df["word_count"].describe()

count    17498.000000
mean        79.797348
std         59.636188
min          1.000000
25%         33.000000
50%         64.000000
75%        113.000000
max        256.000000
Name: word_count, dtype: float64

This looks much more balanced.

Now we drop the `word_count` column, as we no longer need it.

In [37]:
all_categories_df.drop(columns="word_count", inplace=True)

In [38]:
all_categories_df

Unnamed: 0,text,label
0,too many,politics
2,im addicted to chocolate myself,politics
3,wow does this mean out of homosexuals will be ...,politics
4,if you cant convict em dont bust em plea bargi...,politics
5,so it will be interesting to see the reaction ...,politics
...,...,...
19991,this is cute but i see no statement telling me...,religion
19992,im confused could you restate what yer saying ...,religion
19994,not true consider the case of a coin i flip it...,religion
19995,contradicting itself on facts for example,religion


Let's convert the categorical labels to integers to prepare the dataset for training.

In [39]:
categories = all_categories_df["label"].unique().tolist()
categories

['politics', 'recreational', 'religion', 'science', 'computer', 'sales']

In [40]:
categories.index("recreational")

1

In [41]:
all_categories_df["label"] = all_categories_df["label"].apply(lambda x: categories.index(x))

In [42]:
all_categories_df["label"].value_counts()

4    4659
1    3675
3    3506
0    2370
2    2349
5     939
Name: label, dtype: int64
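
The integer labels need to be mapped back to category names when reading predictions later. A minimal sketch of keeping both directions of the mapping, assuming the category order shown in the `categories` output above; the names `label2id` and `id2label` are chosen here for illustration and are not defined elsewhere in this notebook. Dictionary lookups also avoid the linear scan that `categories.index(x)` performs for every row.

```python
# Category order copied from the `categories` output above
categories = ["politics", "recreational", "religion", "science", "computer", "sales"]

# Build both directions once instead of calling list.index() per row
label2id = {name: idx for idx, name in enumerate(categories)}
id2label = {idx: name for name, idx in label2id.items()}

print(label2id["recreational"])  # 1
print(id2label[4])  # computer
```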

We partition the dataset into an 80% training set and a 20% held-out set for validation (named `test_df` in the code) and save both as `csv` files.

In [43]:
train_df, test_df = train_test_split(all_categories_df, test_size=0.2)

In [44]:
train_df.to_csv("train.csv", index=None)

In [45]:
test_df.to_csv("test.csv", index=None)

Let's inspect the label distribution in the training dataset.

In [46]:
train_df["label"].value_counts()

4    3735
1    2936
3    2767
0    1921
2    1880
5     759
Name: label, dtype: int64

Let's inspect the label distribution in the test dataset.

In [47]:
test_df["label"].value_counts()

4    924
3    739
1    739
2    469
0    449
5    180
Name: label, dtype: int64
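
The split above is random and unseeded, so the class proportions and the exact rows in each set change from run to run. The following is a minimal plain-Python sketch of a stratified, reproducible split, shown on toy labels; the function name and the toy data are made up for illustration. With scikit-learn, passing `stratify=` and `random_state=` to `train_test_split` achieves the same goal.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with roughly equal class ratios."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 50 examples of class 0 and 10 examples of class 1
labels = [0] * 50 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 48 12
```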

## Tokenization 

A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most of the tokenizers are available in two flavors: a full Python implementation and a “Fast” implementation based on the Rust library [tokenizers](https://github.com/huggingface/tokenizers). The “Fast” implementations allow:

 - A significant speed-up in particular when doing batched tokenization.
 - Additional methods to map between the original string (character and words) and the token space (e.g. getting the index of the token comprising a given character or the span of characters corresponding to a given token). 

In [48]:
from datasets import load_dataset
from transformers import AutoTokenizer

# tokenizer used in preprocessing
tokenizer_name = "distilbert-base-uncased"

In [49]:
# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

### Load train and test datasets

Let's create a [Dataset](https://huggingface.co/docs/datasets/loading_datasets.html) from our local `csv` files for training and test we saved earlier.

In [50]:
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

Using custom data configuration default-54a3fef1df061eb4


Downloading and preparing dataset csv/default to /home/alfred/.cache/huggingface/datasets/csv/default-54a3fef1df061eb4/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e...


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/2 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /home/alfred/.cache/huggingface/datasets/csv/default-54a3fef1df061eb4/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [51]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 13998
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 3500
    })
})

In [52]:
dataset["train"]

Dataset({
    features: ['text', 'label'],
    num_rows: 13998
})

In [53]:
dataset["train"][0]

{'text': 'in win one may assign hotkeys for the program items within theprogram manager how about the program manager itself is there onealready or is there some way to assign one',
 'label': 4}

In [54]:
dataset["test"]

Dataset({
    features: ['text', 'label'],
    num_rows: 3500
})

In [55]:
dataset["test"][0]

{'text': 'i dont know if it causes the body any harm but in the ive been teaching nine and ten years olds ive never hadone fall over from eating boogers which many kids do on aregular basis',
 'label': 3}

In [56]:
# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)
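
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch that pads or truncates whitespace tokens to a fixed length and builds the matching attention mask. It is not the DistilBERT tokenizer: the real one splits text into WordPiece subwords and adds special tokens, and the vocabulary, IDs, and `toy_encode` helper below are hypothetical.

```python
def toy_encode(text, vocab, max_length=8, pad_id=0, unk_id=1):
    """Whitespace 'tokenizer' that pads or truncates to a fixed length."""
    ids = [vocab.get(token, unk_id) for token in text.split()]
    ids = ids[:max_length]  # mirrors truncation=True
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)  # mirrors padding="max_length"
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


vocab = {"the": 2, "program": 3, "manager": 4}  # hypothetical toy vocabulary
encoded = toy_encode("the program manager itself", vocab, max_length=6)
print(encoded)
# {'input_ids': [2, 3, 4, 1, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0]}
```

Because every example is padded to the same length, the resulting columns can be stacked into fixed-size tensors, and the attention mask tells the model which positions are padding.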

In [57]:
train_dataset = dataset["train"]
test_dataset = dataset["test"]

### Tokenize train and test datasets

Let's tokenize the train dataset

In [58]:
train_dataset = train_dataset.map(tokenize, batched=True)

  0%|          | 0/14 [00:00<?, ?ba/s]

Let's tokenize the test dataset

In [59]:
test_dataset = test_dataset.map(tokenize, batched=True)

  0%|          | 0/4 [00:00<?, ?ba/s]

### Set format for PyTorch

In [60]:
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

## Uploading data to `sagemaker_session_bucket`

Now that the datasets are processed, we upload them to S3.

In [61]:
import botocore
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

# save train_dataset to s3
training_input_path = f"s3://{sess.default_bucket()}/{s3_prefix}/train"
train_dataset.save_to_disk(training_input_path, fs=s3)

# save test_dataset to s3
test_input_path = f"s3://{sess.default_bucket()}/{s3_prefix}/test"
test_dataset.save_to_disk(test_input_path, fs=s3)

print(training_input_path)
print(test_input_path)

s3://sagemaker-us-west-2-976939723775/huggingface/20_newsgroups/train
s3://sagemaker-us-west-2-976939723775/huggingface/20_newsgroups/test


## Set up hyperparameter tuning job
Now that we are done with all the setup that is needed, we are ready to train our HuggingFace model. To begin, let us create a `HuggingFace` estimator object. This estimator will launch the training job.

## Training the HuggingFace model for supervised text classification

To create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In an Estimator we define which fine-tuning script should be used as `entry_point`, which `instance_type` should be used, which `hyperparameters` are passed in, and so on.



```python
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./code',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            volume_size=256,
                            role=role,
                            transformers_version='4.6',
                            pytorch_version='1.7',
                            py_version='py36',
                            hyperparameters = {'epochs': 1,
                                               'model_name':'distilbert-base-uncased',
                                               'num_labels': 6
                                              })
```

When we create a SageMaker training job, SageMaker takes care of starting and managing all the required EC2 instances for us with the `huggingface` container, uploads the provided fine-tuning script `train.py`, and downloads the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. Then, it starts the training job by running:

```bash
/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --num_labels 6
```

The `hyperparameters` you define in the `HuggingFace` estimator are passed in as named arguments. 

SageMaker provides useful properties about the training environment through various environment variables, including the following:

* `SM_MODEL_DIR`: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.

* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.

* `SM_CHANNEL_XXXX`: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator’s fit call, named `train` and `test`, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.
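
The `train.py` script itself is not shown in this notebook. As a rough sketch of how such a script could pick up the named arguments and the environment variables listed above, consider the following. The argument names other than `--epochs`, `--model_name`, and `--num_labels`, and all the local fallback paths, are assumptions made for this example.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths for a training script."""
    parser = argparse.ArgumentParser()

    # Hyperparameters from the estimator arrive as command-line arguments
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
    parser.add_argument("--num_labels", type=int, default=6)

    # Paths come from environment variables; the fallbacks are for local runs only
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--n_gpus", type=int, default=int(os.environ.get("SM_NUM_GPUS", "0")))
    parser.add_argument("--training_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN", "./data/train"))
    parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST", "./data/test"))

    # parse_known_args ignores any extra arguments instead of failing
    args, _ = parser.parse_known_args(argv)
    return args


# Same arguments as in the command line shown above
args = parse_args(["--epochs", "1", "--model_name", "distilbert-base-uncased", "--num_labels", "6"])
print(args.epochs, args.model_name, args.num_labels)
```

Reading the paths from the environment with local defaults lets the same script run both inside the SageMaker container and on a development machine.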


To run your training job locally you can define `instance_type='local'` or `instance_type='local-gpu'` for `gpu` usage. _Note: this does not work within SageMaker Studio_


We create a `metric_definitions` list of regex-based definitions that are used to parse the job logs and extract metrics.

In [62]:
metric_definitions = [
    {"Name": "loss", "Regex": r"'loss': ([0-9]+(\.|e\-)[0-9]+),?"},
    {"Name": "learning_rate", "Regex": r"'learning_rate': ([0-9]+(\.|e\-)[0-9]+),?"},
    {"Name": "eval_loss", "Regex": r"'eval_loss': ([0-9]+(\.|e\-)[0-9]+),?"},
    {"Name": "eval_accuracy", "Regex": r"'eval_accuracy': ([0-9]+(\.|e\-)[0-9]+),?"},
    {"Name": "eval_f1", "Regex": r"'eval_f1': ([0-9]+(\.|e\-)[0-9]+),?"},
    {"Name": "eval_precision", "Regex": r"'eval_precision': ([0-9]+(\.|e\-)[0-9]+),?"},
    {"Name": "eval_recall", "Regex": r"'eval_recall': ([0-9]+(\.|e\-)[0-9]+),?"},
    {"Name": "eval_runtime", "Regex": r"'eval_runtime': ([0-9]+(\.|e\-)[0-9]+),?"},
    {
        "Name": "eval_samples_per_second",
        "Regex": r"'eval_samples_per_second': ([0-9]+(\.|e\-)[0-9]+),?",
    },
    {"Name": "epoch", "Regex": r"'epoch': ([0-9]+(\.|e\-)[0-9]+),?"},
]
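
Before launching a job, it can help to check a metric regex locally against a sample log line. The sketch below does this for `eval_accuracy` with Python's `re` module. The log line is invented for illustration, on the assumption that the training script prints its metrics as a Python dictionary, and the pattern escapes the dot so that it matches only a literal decimal point.

```python
import re

# Pattern for eval_accuracy; the first capture group holds the numeric value
pattern = re.compile(r"'eval_accuracy': ([0-9]+(\.|e\-)[0-9]+),?")

# Hypothetical log line in the style of a printed metrics dictionary
log_line = "{'eval_loss': 0.4213, 'eval_accuracy': 0.8731, 'epoch': 1.0}"

match = pattern.search(log_line)
value = float(match.group(1))
print(value)  # 0.8731
```

If `pattern.search` returns `None` for lines your script actually prints, the metric will not be extracted, so this kind of check is worth doing for the objective metric in particular.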

In [63]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters = {"epochs": 1, "model_name": "distilbert-base-uncased", "num_labels": 6}

Now, let's define the SageMaker `HuggingFace` estimator with resource configurations and hyperparameters to train Text Classification on `20 newsgroups` dataset, running on a `p3.2xlarge` instance.

In [64]:
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./code",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    volume_size=256,
    role=role,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
)

Once we've defined our estimator we can specify the hyperparameters we'd like to tune and their possible values.  We have three different types of hyperparameters.
- Categorical parameters need to take one value from a discrete set.  We define this by passing the list of possible values to `CategoricalParameter(list)`
- Continuous parameters can take any real number value between the minimum and maximum value, defined by `ContinuousParameter(min, max)`
- Integer parameters can take any integer value between the minimum and maximum value, defined by `IntegerParameter(min, max)`

*Note, if possible, it's almost always best to specify a value as the least restrictive type.  For example, tuning learning rate as a continuous value between 0.01 and 0.2 is likely to yield a better result than tuning as a categorical parameter with values 0.01, 0.1, 0.15, or 0.2.*

In [65]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "train_batch_size": IntegerParameter(8, 32),
}

Next we'll specify the objective metric that we'd like to tune and its definition, which includes the regular expression (Regex) needed to extract that metric from the CloudWatch logs of the training job.  If you bring your own algorithm, your algorithm emits metrics by itself. In that case, you'll need to add a `MetricDefinition` object here to define the format of those metrics through regex, so that SageMaker knows how to extract those metrics from your CloudWatch logs.

In this case, we elected to monitor `eval_accuracy` as you can see below. 

In [66]:
objective_metric_name = "eval_accuracy"
objective_type = "Maximize"
hpo_metric_definitions = [
    {"Name": "eval_accuracy", "Regex": r"'eval_accuracy': ([0-9]+(\.|e\-)[0-9]+),?"}
]

Now, we'll create a `HyperparameterTuner` object, to which we pass:
- The `HuggingFace` estimator we created above
- Our hyperparameter ranges
- Objective metric name and definition
- Tuning resource configurations, such as the total number of training jobs to run and how many training jobs can run in parallel.

In [67]:
tuner = HyperparameterTuner(
    huggingface_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    hpo_metric_definitions,
    max_jobs=6,
    max_parallel_jobs=3,
    objective_type=objective_type,
    strategy="Bayesian",  # search strategy used to pick hyperparameter values
)
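
Conceptually, the tuner repeats a simple loop: pick hyperparameter values, run a training job, read the objective metric, and keep the best result. The toy sketch below illustrates that loop with plain random search over an integer range. It is not SageMaker's Bayesian strategy and it runs its "jobs" one after another; `toy_objective` is a made-up stand-in for a real training job.

```python
import random


def toy_objective(train_batch_size):
    """Stand-in for a training job; returns a made-up 'accuracy'."""
    return 0.9 - abs(train_batch_size - 16) * 0.005


def random_search(low, high, max_jobs, seed=0):
    """Try max_jobs integer values in [low, high] and keep the best one."""
    rng = random.Random(seed)
    best_value, best_score = None, float("-inf")
    for _ in range(max_jobs):
        candidate = rng.randint(low, high)  # inclusive on both ends
        score = toy_objective(candidate)
        if score > best_score:  # corresponds to objective_type "Maximize"
            best_value, best_score = candidate, score
    return best_value, best_score


best_value, best_score = random_search(8, 32, max_jobs=6)
print(best_value, round(best_score, 3))
```

A Bayesian strategy differs in that it uses the results of earlier jobs to choose the next candidates rather than sampling them independently.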

## Launch hyperparameter tuning job
Now we can launch a hyperparameter tuning job by calling the `fit()` function. After the hyperparameter tuning job is created, we can go to the SageMaker console to track its progress until it completes.

This should take around 30 minutes to complete. You can work on something else while waiting for the tuning job to finish.

In [68]:
%%time

tuner.fit({"train": training_input_path, "test": test_input_path}, logs=True)

..........................................................................................................................................................................................................................................................................................................................................................!
CPU times: user 1.66 s, sys: 229 ms, total: 1.88 s
Wall time: 29min 8s


## Analyze Results of a Hyperparameter Tuning job

Once a tuning job has completed (or even while it is still running), you can use the code below to analyze the results and understand how each hyperparameter affects the quality of the model.

In [69]:
sm_client = boto3.Session().client("sagemaker")

tuning_job_name = tuner.latest_tuning_job.name
tuning_job_name

'huggingface-pytorch--220510-2100'

## Track hyperparameter tuning job progress
After you launch a tuning job, you can check its progress by calling the `describe_hyper_parameter_tuning_job` API. The response is a dictionary that describes the current state of the tuning job. You can call `list_training_jobs_for_hyper_parameter_tuning_job` to see a detailed list of the training jobs that the tuning job launched.

In [70]:
tuning_job_result = sm_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)

status = tuning_job_result["HyperParameterTuningJobStatus"]
if status != "Completed":
    print("Reminder: the tuning job has not been completed.")

job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" % job_count)

is_minimize = (
    tuning_job_result["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]["Type"]
    != "Maximize"
)
objective_name = tuning_job_result["HyperParameterTuningJobConfig"][
    "HyperParameterTuningJobObjective"
]["MetricName"]

6 training jobs have completed
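
If you prefer to wait for the job from code rather than re-running the cell above, a polling loop along the following lines is one option. This is a sketch: `wait_for_tuning_job` and `TERMINAL_STATUSES` are hypothetical names, and the set of terminal statuses is an assumption. The `describe` callable is passed in so that the loop can be tried without AWS access.

```python
import time

# Statuses after which the tuning job is assumed not to change any further.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_tuning_job(describe, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll until the tuning job reaches a terminal status and return the last response.

    `describe` is any callable that accepts HyperParameterTuningJobName as a keyword
    argument, for example sm_client.describe_hyper_parameter_tuning_job.
    """
    while True:
        response = describe(HyperParameterTuningJobName=job_name)
        status = response["HyperParameterTuningJobStatus"]
        counters = response.get("TrainingJobStatusCounters", {})
        print(f"{status}: {counters.get('Completed', 0)} training jobs completed")
        if status in TERMINAL_STATUSES:
            return response
        sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_tuning_job(sm_client.describe_hyper_parameter_tuning_job, tuning_job_name)`.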


In [71]:
from pprint import pprint

if tuning_job_result.get("BestTrainingJob", None):
    print("Best model found so far:")
    pprint(tuning_job_result["BestTrainingJob"])
else:
    print("No training jobs have reported results yet.")

Best model found so far:
{'CreationTime': datetime.datetime(2022, 5, 10, 21, 0, 15, tzinfo=tzlocal()),
 'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'eval_accuracy',
                                                 'Value': 0.8057143092155457},
 'ObjectiveStatus': 'Succeeded',
 'TrainingEndTime': datetime.datetime(2022, 5, 10, 21, 12, 40, tzinfo=tzlocal()),
 'TrainingJobArn': 'arn:aws:sagemaker:us-west-2:976939723775:training-job/huggingface-pytorch--220510-2100-002-a51ba0eb',
 'TrainingJobName': 'huggingface-pytorch--220510-2100-002-a51ba0eb',
 'TrainingJobStatus': 'Completed',
 'TrainingStartTime': datetime.datetime(2022, 5, 10, 21, 2, 1, tzinfo=tzlocal()),
 'TunedHyperParameters': {'train_batch_size': '15'}}
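
Note that the values under `TunedHyperParameters` come back as strings (`'15'` above). The sketch below shows one way to convert them before reusing them. The helper `tuned_hyperparameters` and its `casts` argument are hypothetical names, and `best_job` is a trimmed copy of the summary printed above.

```python
def tuned_hyperparameters(best_training_job, casts):
    """Return the tuned hyperparameters, converting each value with its matching cast."""
    raw = best_training_job["TunedHyperParameters"]
    # Names without an entry in `casts` are left as strings.
    return {name: casts.get(name, str)(value) for name, value in raw.items()}


# Trimmed copy of the summary printed above.
best_job = {
    "TrainingJobName": "huggingface-pytorch--220510-2100-002-a51ba0eb",
    "FinalHyperParameterTuningJobObjectiveMetric": {
        "MetricName": "eval_accuracy",
        "Value": 0.8057143092155457,
    },
    "TunedHyperParameters": {"train_batch_size": "15"},
}

best_params = tuned_hyperparameters(best_job, casts={"train_batch_size": int})
print(best_params)
```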


## Fetch all results as a `DataFrame`
We can list the hyperparameters and objective metrics of all training jobs and pick out the training job with the best objective metric.

In [72]:
import pandas as pd

tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

full_df = tuner_analytics.dataframe()

df = full_df  # fall back to the unfiltered frame so `df` is always defined

if len(full_df) > 0:
    # Keep only jobs that reported a finite objective value
    df = full_df[full_df["FinalObjectiveValue"] > -float("inf")]
    if len(df) > 0:
        df = df.sort_values("FinalObjectiveValue", ascending=is_minimize)
        print("Number of training jobs with valid objective: %d" % len(df))
        print({"lowest": min(df["FinalObjectiveValue"]), "highest": max(df["FinalObjectiveValue"])})
        pd.set_option("display.max_colwidth", None)  # Don't truncate TrainingJobName
    else:
        print("No training jobs have reported valid results yet.")

df

Number of training jobs with valid objective: 6
{'lowest': 0.7548571228981018, 'highest': 0.8057143092155457}



| | train_batch_size | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|
| 4 | 15.0 | huggingface-pytorch--220510-2100-002-a51ba0eb | Completed | 0.805714 | 2022-05-10 21:02:01+00:00 | 2022-05-10 21:12:40+00:00 | 639.0 |
| 0 | 18.0 | huggingface-pytorch--220510-2100-006-82e267e1 | Completed | 0.792286 | 2022-05-10 21:16:33+00:00 | 2022-05-10 21:26:16+00:00 | 583.0 |
| 3 | 22.0 | huggingface-pytorch--220510-2100-003-f681625a | Completed | 0.79 | 2022-05-10 21:02:10+00:00 | 2022-05-10 21:12:45+00:00 | 635.0 |
| 1 | 25.0 | huggingface-pytorch--220510-2100-005-b806ad96 | Completed | 0.785429 | 2022-05-10 21:16:30+00:00 | 2022-05-10 21:26:54+00:00 | 624.0 |
| 2 | 16.0 | huggingface-pytorch--220510-2100-004-f3c72f5a | Completed | 0.777714 | 2022-05-10 21:16:26+00:00 | 2022-05-10 21:27:01+00:00 | 635.0 |
| 5 | 10.0 | huggingface-pytorch--220510-2100-001-82213142 | Completed | 0.754857 | 2022-05-10 21:02:01+00:00 | 2022-05-10 21:12:36+00:00 | 635.0 |
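
To get a quick feel for how batch size relates to accuracy in this run, the sketch below works on the `(train_batch_size, FinalObjectiveValue)` pairs copied by hand from the table above. It uses plain Python, so it runs without pandas or AWS access.

```python
# (train_batch_size, FinalObjectiveValue) pairs copied from the table above.
results = [
    (15, 0.805714),
    (18, 0.792286),
    (22, 0.79),
    (25, 0.785429),
    (16, 0.777714),
    (10, 0.754857),
]

best_batch_size, best_accuracy = max(results, key=lambda row: row[1])
worst_batch_size, worst_accuracy = min(results, key=lambda row: row[1])
spread = best_accuracy - worst_accuracy

print(f"best:  batch size {best_batch_size}, eval_accuracy {best_accuracy:.4f}")
print(f"worst: batch size {worst_batch_size}, eval_accuracy {worst_accuracy:.4f}")
print(f"spread across jobs: {spread:.4f}")

# List the jobs by batch size to see whether accuracy moves with it.
for batch_size, accuracy in sorted(results):
    print(f"{batch_size:>3}  {accuracy:.4f}")
```

With only six jobs and a single hyperparameter, any trend should be read with caution. Batch sizes 15 and 16 differ by about 0.028 in accuracy, which suggests that run-to-run variation may be comparable to the effect of the hyperparameter itself.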


## Deploy the best trained model
Once training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint, which lets us make predictions (inference) from the model. Calling `deploy()` on the tuner deploys the model from the best training job. We don't have to host on the same instance type that we used for training. Inference usually needs less compute than training, and endpoints stay up and running for long periods, so it's advisable to choose a cheaper instance for hosting.

- `ml.p3.2xlarge` - part of the P3 family, which delivers high-performance compute in the cloud with up to 8 NVIDIA® V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications. The `2xlarge` size has a single V100 GPU.
- `ml.g4dn.xlarge` - part of the G4dn family, with one NVIDIA T4 GPU at this size. It is positioned as a cost-effective and versatile GPU instance for deploying machine learning models such as image classification, object detection, and speech recognition, and for graphics-intensive applications such as remote graphics workstations, game streaming, and graphics rendering.

In [75]:
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.p3.2xlarge")


2022-05-10 21:12:40 Starting - Preparing the instances for training
2022-05-10 21:12:40 Downloading - Downloading input data
2022-05-10 21:12:40 Training - Training image download completed. Training in progress.
2022-05-10 21:12:40 Uploading - Uploading generated training model
2022-05-10 21:12:40 Completed - Training job completed
--------!

Then, we use the returned predictor object to call the endpoint.

In [76]:
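def predict_sentence(sentence):
    """Send one sentence to the endpoint and print the predicted category name."""
    result = predictor.predict({"inputs": sentence})
    # The model returns labels such as "LABEL_3"; the numeric suffix indexes into `categories`
    index = int(result[0]["label"].split("LABEL_")[1])
    print(categories[index])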

In [77]:
sentences = [
    "The modem is an internal AT/(E)ISA 8-bit card (just a little longer than a half-card).",
    "In the cage I usually wave to bikers.  They usually don't wave back.  My wife thinks it's strange but I don't care.",
    "Voyager has the unusual luck to be on a stable trajectory out of the solar system.",
]

# using the same processing logic that we used during data preparation for training
processed_sentences = process_text(sentences)

for sentence in processed_sentences:
    predict_sentence(sentence)

computer
recreational
science
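
The label parsing in `predict_sentence` assumes that every response has the form `LABEL_<index>`. A slightly more defensive variant is sketched below. It runs against hypothetical stand-ins for the endpoint response and the notebook's `categories` list, so it needs no live endpoint; `label_to_category` is an illustrative name, not part of the SageMaker SDK.

```python
def label_to_category(result, category_names):
    """Map a response such as [{"label": "LABEL_2", "score": 0.97}] to a category name."""
    label = result[0]["label"]
    prefix = "LABEL_"
    if not label.startswith(prefix):
        raise ValueError(f"Unexpected label format: {label!r}")
    index = int(label[len(prefix):])
    if not 0 <= index < len(category_names):
        raise ValueError(f"Label index {index} is outside the known categories")
    return category_names[index]


# Hypothetical stand-ins for the endpoint response and the `categories` list.
example_categories = ["computer", "recreational", "science"]
example_result = [{"label": "LABEL_2", "score": 0.97}]

print(label_to_category(example_result, example_categories))
```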


### Clean up
Endpoints should be deleted when no longer in use, since (per the [SageMaker pricing page](https://aws.amazon.com/sagemaker/pricing/)) they're billed by time deployed.


In [78]:
predictor.delete_endpoint()