In [None]:
! pip install -r requirements.txt

# Fine-tuning DistilBERT for sentiment analysis of product reviews

DistilBERT is a Transformer model, smaller and faster than BERT, that was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. It was pretrained on raw text only, with no human labelling (which is why it can use lots of publicly available data); an automatic process generates inputs and labels from the text using the BERT base model. You can learn more about this flavor of BERT here: https://huggingface.co/distilbert-base-uncased
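
To make the teacher/student idea concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. It is only an illustration: the logits and the temperature value are made up, and DistilBERT's actual pretraining objective combines a distillation term like this one with a masked language modelling loss and a cosine embedding loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical logits over a three-token vocabulary
teacher = [2.0, 0.5, -1.0]
close_student = [1.8, 0.6, -0.9]
far_student = [-1.0, 0.5, 2.0]

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

The student whose outputs resemble the teacher's gets the smaller loss, which is the signal that pushes the smaller model to imitate the larger one.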

This notebook showcases the use of pre-trained models in Domino and demonstrates the process of GPU-accelerated fine-tuning using Nvidia GPUs. We use the [Amazon Polarity](https://huggingface.co/datasets/amazon_polarity) dataset, which contains 3.5 million samples of sentiments for product reviews. Due to the size of this dataset, for demonstration purposes we use a [10% sample](https://huggingface.co/datasets/ben-epstein/amazon_polarity_10_pct) of the original data.



In [None]:
from functools import partial
import os

import torch
import argparse

import numpy as np
import nvidia
import pandas as pd
import evaluate

from transformers import (
    enable_full_determinism,
    pipeline,
    Trainer,
    EvalPrediction,
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    TrainingArguments,
    EarlyStoppingCallback,
    BatchEncoding
)

from datasets import load_dataset, Dataset, DatasetDict, ClassLabel
from datasets.formatting.formatting import LazyBatch
import itertools
import mlflow.transformers


In [None]:
# Point the dynamic linker at the CUDA runtime bundled with the nvidia package
cuda_install_dir = os.path.join(os.path.dirname(nvidia.__file__), 'cuda_runtime', 'lib')
os.environ['LD_LIBRARY_PATH'] = cuda_install_dir
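
The assignment above replaces any value `LD_LIBRARY_PATH` already had. If your environment sets other library directories that should be kept, one option is to prepend instead. The helper below is a hypothetical sketch (the function name and the example directory are not part of this project):

```python
import os


def prepend_library_path(new_dir, current):
    """Return new_dir followed by the existing entries, without duplicates."""
    entries = [p for p in (current or "").split(os.pathsep) if p]
    if new_dir in entries:
        entries.remove(new_dir)
    return os.pathsep.join([new_dir] + entries)


example_dir = "/opt/example/cuda_runtime/lib"  # hypothetical directory
print(prepend_library_path(example_dir, "/usr/local/lib"))
print(prepend_library_path(example_dir, None))
```

In the notebook this could be used as `os.environ['LD_LIBRARY_PATH'] = prepend_library_path(cuda_install_dir, os.environ.get('LD_LIBRARY_PATH'))`.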

Let's make sure GPU acceleration is available.

In [None]:
if torch.cuda.is_available():
    print("GPU acceleration is available!")
else:
    print("GPU acceleration is NOT available! Training, fine-tuning, and inference speed will be adversely impacted.")
    
# The first argument is the random seed; 42 matches the seed used for the split below
enable_full_determinism(seed=42)

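_Let's load the pretrained DistilBERT model and classify a handful of test statements. The NLP pipeline produces a label and a prediction score. Because `distilbert-base-uncased` has no sentiment classification head, the head created here is randomly initialized, so these predictions are not yet meaningful._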

In [None]:
model_id = "distilbert-base-uncased"

model = DistilBertForSequenceClassification.from_pretrained(model_id, num_labels=2, id2label={0: "negative", 1: "positive"})
tokenizer = DistilBertTokenizer.from_pretrained(model_id)

nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

sentences = [
    "This towel did not match the description. It is far too small.",
    "I love the colors! This is a perfect birthday gift",
    "It came damaged.",
    "I'm not sure about it yet. I'll check back in a few weeks and update.",
]
results = nlp(sentences)

for sample in zip(sentences, results):
    print(sample)
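
For a two-class model like this one, the label and score come from a softmax over the model's two output logits. The sketch below reproduces that last step in plain Python with made-up logits; it is a simplified illustration, not the pipeline's actual code.

```python
import math

ID2LABEL = {0: "negative", 1: "positive"}  # same mapping as in the cell above


def logits_to_prediction(logits):
    """Softmax the logits, then report the most likely label and its probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Hypothetical logits: an undecided model versus a confident one
print(logits_to_prediction([0.03, -0.02]))
print(logits_to_prediction([-2.0, 3.0]))
```

When the two logits are nearly equal, the score sits just above 0.5, so a score in that range means the model is close to guessing.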

## Amazon Polarity dataset

Let's now load the Amazon Polarity dataset. The dataset has two attributes for this task:

* **content** - The product review, which we will rename `text`
* **label** - sentiment, which we will encode as follows:
    * negative  : 0
    * positive : 1
 
The Hugging Face dataset comes with these labels already encoded as `ClassLabel`s, so we shouldn't need to make any changes.
    
Let's process the dataset and show the first 5 samples.

In [None]:
name = "Domino-ai/amazon_polarity_10_pct"
ds = load_dataset(name)
ds = ds.rename_columns({"content": "text"})

print(ds)

# Look at the first few rows
ds["train"].to_pandas()[:5]


We can now proceed with splitting the data into training, validation, and test sets. As we see, this dataset already has a `train` and a `test` split, so we will split the `train` split into train/validation and keep the `test` split for final evaluation.

### Preparing training, validation, and test subsets

Next, we hold out 10% of the training split as a validation subset, stratified by label.

In [None]:
ds_train_val = ds["train"].train_test_split(test_size=0.1, seed=42, stratify_by_column="label")

ds["train"] = ds_train_val["train"]
ds["validation"] = ds_train_val["test"]

print(f"Samples in train      : {len(ds['train'])}")
print(f"Samples in validation : {len(ds['validation'])}")
print(f"Samples in test       : {len(ds['test'])}")
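
Stratifying by label keeps the class proportions the same in both parts of the split. The toy function below sketches the idea in plain Python; it is not how `datasets` implements `stratify_by_column`, and the label counts are invented for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.1, seed=42):
    """Return (train_idx, val_idx), sampling test_size of each label separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = [0] * 70 + [1] * 30  # hypothetical imbalanced toy labels
train_idx, val_idx = stratified_split(labels)
positive_share = sum(labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), positive_share)
```

The validation part keeps the same 30% share of positive labels as the full toy dataset, which a plain random split would only match on average.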

Now let's score (make predictions on) the first 10 samples of the test set using only the pretrained model.

In [None]:
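# Slice first so that only 10 rows are materialized, not the whole text column
test_sample = ds["test"][:10]
sentences = test_sample["text"]
df_test = pd.DataFrame(test_sample)
results = nlp(sentences)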

We can build a DataFrame with the ground truth and the predictions to see how well the pretrained model performs.

In [None]:
results_df = pd.DataFrame.from_dict(results)
df_test["label"] = df_test["label"].replace([0, 1], ["negative", "positive"])
results_df.columns = ["pred", "score"]
results_df.reset_index(drop=True, inplace=True)

results_df = pd.concat([df_test[["text", "label"]].reset_index(drop=True), results_df], axis=1)

results_df["Correct"] = results_df["label"].eq(results_df["pred"])

results_df.head()

We can calculate the accuracy of the predictions:

In [None]:
accuracy = results_df["Correct"].mean()  # fraction of correct predictions

print("Accuracy : {:.2f}".format(accuracy))

It's always important to look at your performance per-label, so let's do that here:

In [None]:
accuracy_df = pd.concat([results_df["label"].value_counts(), results_df.groupby("label")["Correct"].mean().mul(100).round(2)], axis=1)
accuracy_df = accuracy_df.reset_index()
accuracy_df.columns = ["Label", "Count", "Accuracy"]
accuracy_df.head()

## Model Fine-tuning

The fine-tuning process takes the base model (DistilBERT) and performs additional training, tweaking it towards a more specialized use case. Here, we'll use the training subset of the Amazon Polarity sentiment dataset. This transfer learning approach enables us to produce a more accurate model with less training time.

### Datasets preparation

First, we need to prepare the three datasets (training, validation, and test) by tokenizing their inputs.

You'll notice that we don't pad our inputs. Padding every example up front uses a lot of memory and takes a long time. Instead, in the next step, we pass our tokenizer to the trainer so that inputs are padded dynamically, batch by batch, during training, which uses less memory.

In [None]:
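def preprocess_function(
    tokenizer: DistilBertTokenizer, examples: LazyBatch
) -> BatchEncoding:
    # Truncate to 512 tokens, DistilBERT's maximum sequence length.
    # padding=False defers padding to batch time during training.
    return tokenizer(
        examples["text"], truncation=True, padding=False, max_length=512
    )


# batched=True passes a LazyBatch of examples per call, which is much faster
ds = ds.map(partial(preprocess_function, tokenizer), batched=True)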
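
To see why dynamic padding saves memory, the sketch below counts how many token slots are stored when every batch is padded to 512 versus padded only to the longest sequence in that batch. The token counts and batch size are hypothetical.

```python
def padded_token_count(lengths, batch_size, pad_to=None):
    """Total token slots when each batch is padded to pad_to,
    or to its own longest sequence when pad_to is None."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = pad_to if pad_to is not None else max(batch)
        total += target * len(batch)
    return total


lengths = [12, 40, 35, 8, 120, 60, 15, 22]  # hypothetical token counts per review
static = padded_token_count(lengths, batch_size=4, pad_to=512)
dynamic = padded_token_count(lengths, batch_size=4)
print(f"static padding: {static} slots, dynamic padding: {dynamic} slots")
```

Product reviews are usually much shorter than 512 tokens, so padding per batch tends to store far fewer padding tokens than padding everything to the maximum length.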


### Setting up and training

Next, we define the evaluation metric (in our case, weighted F1) and some additional settings such as the number of training epochs and the batch size.

In [None]:
metric_choice = "f1"
METRIC = evaluate.load(metric_choice)

def compute_metrics(eval_pred: EvalPrediction) -> dict:
    predictions, labels = np.array(eval_pred.predictions), np.array(eval_pred.label_ids)
    predictions = predictions.argmax(axis=1)
    return METRIC.compute(
        predictions=predictions, references=labels, average="weighted"
    )

# Autologging with mlflow directly into Domino
mlflow.transformers.autolog(
    log_input_examples=True,
    log_model_signatures=True,
    log_models=False,
    log_datasets=False
)

args = TrainingArguments(
    output_dir="/mnt/artifacts/temp/",
    evaluation_strategy="steps",
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    weight_decay=0.01,
    metric_for_best_model=metric_choice,
    save_total_limit=2,
    save_strategy="steps",
    load_best_model_at_end=True,
    optim="adamw_torch",
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    compute_metrics=compute_metrics,
)
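
The `average="weighted"` option used in `compute_metrics` computes an F1 score per label and then averages those scores, weighting each by the number of true samples with that label. The plain-Python sketch below illustrates that calculation on made-up predictions; it is meant to explain the metric, not to replace the `evaluate` implementation.

```python
from collections import Counter


def f1_for_label(predictions, references, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == label and r == label for p, r in pairs)
    fp = sum(p == label and r != label for p, r in pairs)
    fn = sum(p != label and r == label for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(predictions, references):
    """Average the per-label F1 scores, weighted by each label's true count."""
    support = Counter(references)
    total = len(references)
    return sum(
        f1_for_label(predictions, references, label) * count / total
        for label, count in support.items()
    )


# Hypothetical labels: four negatives and two positives, one negative misclassified
references = [0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 1, 1, 1]
print(f"weighted F1: {weighted_f1(predictions, references):.4f}")
```

Because each label's score is weighted by its frequency, the majority label has more influence on the result than it would under a simple unweighted (macro) average.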

We can now perform the training.

**Note that you will need a hardware tier with sufficient memory and compute, ideally one that provides GPU acceleration. Otherwise, training can take a substantial amount of time or crash for lack of system memory.**

In [None]:
trainer.train()  

### Model evaluation

We can now evaluate the fine-tuned model by computing its F1 score on the test set.

In [None]:
f1_test = trainer.predict(ds["test"]).metrics["test_f1"]
print(f"F1 on test: {f1_test:.2f}")

### Saving the fine-tuned model

Finally, we can save the fine-tuned model and use it for online predictions via a [Model API](https://docs.dominodatalab.com/en/latest/user_guide/8dbc91/host-models-as-rest-apis/).

In [None]:
# Please change this location accordingly. You might want to change it depending
# on whether you are using a Git-based project or a DFS-based project, and
# whether you want to use this model.
trainer.save_model("/mnt/artifacts/amazon-sentiment/")