#Sequence Classification Model for App Reviews


On this notebook, we will fine-tune a [light-version of Bert](https://huggingface.co/distilbert/distilbert-base-cased), a encoder-only transformer model to do sentiment analysis over the [the App review dataset from Hugging Face](https://huggingface.co/datasets/sealuzh/app_reviews) from HuggingFace. We will be using the
transformers library from HugginFace, which is an open-source python library that provides a varierty of pre-trained deep-learning models.


Transformers support multiple backends, PyTorch and JAX among them.
<br><br>

<p align="center">

<img src="https://i.imgur.com/qDNYLpi.png" width="400" height="200">


<p>



# Install and Import

In [4]:
!pip install transformers




In [5]:
!pip install evaluate datasets


Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m8.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.5


In [6]:
!pip install datasets



On this notebook, we are going to use the following libraries :

1. Transformers from Hugging Face :  Transformers allow us to quickly import ready models and adapt them to our necesities.
  1. AutoTokenizer loads the correct type of tokenizer for a given pretrained model.
  2. AutoModelForSequenceClassification loads a pretrained transformer model for a specific classification task.
  3. DataCollatorWithPadding Dynamically pads input sequences ina batch to the legth of the longest sequence in that batch.
  4. Trainer handles training, evaluation, logs and checkpoints, and TrainingArguments is a config class for it.
2. Evaluate is a library from Hugging Face that loads evaluation metrics.
3. Numpy does standard numerical computing, which uses underlying C functions in Python, which brings more efficiency to our calculations.
4. Datasets loads specific datasets directly from Huggin Face.

In [7]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, Trainer, TrainingArguments
from evaluate import load as load_metric
import numpy as np
from datasets import load_dataset
import random


# App reviews dataset

We will set a random seed and will download [the App review dataset from Hugging Face](https://huggingface.co/datasets/sealuzh/app_reviews). This dataset contains more than 280k rows of reviews.

The dataset comes with four columns : Package name, review, date and star.

Our objective will be to, based on the review, identify the sentiment related and understand identify the amount of start the review correlates to. To achieve this, we will split the dataset in two (training and evaluation). We will also encode star as a categorical variable, as we want the sentiment to be identified to a different class, having as many classes as possible stars in the ranting.

In [8]:
# Selecting a random seed to reproduce our results in any environment
SEED = 42

# Setting the seed for the random library to ensure consistent results
random.seed(SEED)

In [9]:
# Importing the ClassLabel module to represent categorical class labels
from datasets import ClassLabel

# From Hugging Face, we need to name the dataset type and the split it is defined as.
# This dataset only has train, when we check it, so we need to obtain this dataset
# and make further split to have a validation set.

dataset = load_dataset('app_reviews', split='train')

# We want to change the 'star' column in our dataset to a ClassLabel, which means
# that we will use this variable as a category to identify the sentiment, rather
# than a numerical value.

dataset = dataset.class_encode_column('star')

README.md: 0.00B [00:00, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/13.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/288065 [00:00<?, ? examples/s]

Stringifying the column:   0%|          | 0/288065 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/288065 [00:00<?, ? examples/s]

Let's have a look on how our dataset looks like now :

In [10]:
dataset

Dataset({
    features: ['package_name', 'review', 'date', 'star'],
    num_rows: 288065
})

As we mentioned, we will do a classic 80/20 partition to train our model. We will use stratify by column for our desired label star, as we want a homogeneous distribution of the classes (values) within star.

Within the train split, we will divide again a 80/20 partition to evaluate the model as we train it, and we will later on do a final evaluation on the first 20% validation set.

In [11]:

# First split, we have a dataset[train] with 80% of the data and dataset[test]
# with 20 of the data for a final test.
dataset = dataset.train_test_split(test_size=0.2, seed=SEED, stratify_by_column='star')

# As we train our model, we want to see its performance with a validation set to see
# how the model is generalizing to the problem domain.
# So we will further split the training set 80/20. In the end, 64% of data will be for training,
# 14% for evaluating the training performance and 20% for final validation.
dataset_second_split = dataset['train'].train_test_split(test_size=0.2, seed=SEED, stratify_by_column='star')

# Let's regroup on the main dataset
dataset['train'] = dataset_second_split['train']

dataset['val'] = dataset_second_split['test']

Taking a look to our current dataset:

We have 184k rows for training, 57k rows for testing and 46k rows for validation.

In [12]:
dataset

DatasetDict({
    train: Dataset({
        features: ['package_name', 'review', 'date', 'star'],
        num_rows: 184361
    })
    test: Dataset({
        features: ['package_name', 'review', 'date', 'star'],
        num_rows: 57613
    })
    val: Dataset({
        features: ['package_name', 'review', 'date', 'star'],
        num_rows: 46091
    })
})

We are going to use now [Distilbert](https://huggingface.co/distilbert/distilbert-base-cased), a distilled version of Bert base model.

To have a short description, [BERT]( https://arxiv.org/abs/1810.04805) is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.

Distilbert is a ligther and faster version : While the original model is composed by 12 layers and consistig of 110M of parameters, the distilled version has half the amount of layers and 66M of parameters. Created through "knowledge distillation", it means that the distilled model learns from the original's every second layer in a knowledge-sharing training.

We also need a tokenizer compatible for BERT, and that is how we have the AutoTokenizer's input being the name of the model we will soon instance.

In this case, BERT utilizes [WordPiece](https://huggingface.co/docs/transformers/en/tokenizer_summary#wordpiece) tokenizer, which is a subword tokenization algorithm.

In [13]:
MODEL = 'distilbert-base-cased'

tokenizer = AutoTokenizer.from_pretrained(MODEL)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/465 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Let's have a look at the tokenizer output when we pass it a string.

In [14]:
tokenizer('you are good')

{'input_ids': [101, 1128, 1132, 1363, 102], 'attention_mask': [1, 1, 1, 1, 1]}

In [15]:
dataset['train'][0]['review']

'I love it'

Wordpiece tokenization is whitespace sensitive, hence having in our reviews words and emojis separated by a space is treated differently :

In [16]:
tokenizer.decode(tokenizer('Nice😉')['input_ids'])

'[CLS] [UNK] [SEP]'

In [17]:
tokenizer.decode(tokenizer('Nice 😉')['input_ids'])

'[CLS] Nice [UNK] [SEP]'

In [18]:
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [19]:
# simple function to batch tokenize
def preprocess_tokenization(examples):
    return tokenizer(examples["review"], truncation=True)

For now, we are only interested on how the sentiment is valued on stars from the reviews, so we are cleaning our dataset to have only the columns we need.

In [20]:
dataset = dataset.map(preprocess_tokenization, batched=True)

dataset = dataset.rename_column("star", "label")
dataset = dataset.remove_columns(['package_name', 'date'])  # not removing review for later comparison against fine-tuned GPT 3.5
dataset

Map:   0%|          | 0/184361 [00:00<?, ? examples/s]

Map:   0%|          | 0/57613 [00:00<?, ? examples/s]

Map:   0%|          | 0/46091 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['review', 'label', 'input_ids', 'attention_mask'],
        num_rows: 184361
    })
    test: Dataset({
        features: ['review', 'label', 'input_ids', 'attention_mask'],
        num_rows: 57613
    })
    val: Dataset({
        features: ['review', 'label', 'input_ids', 'attention_mask'],
        num_rows: 46091
    })
})

We are using now [DataCollatorWithPadding](https://huggingface.co/docs/transformers/en/main_classes/data_collator) to create batches of data with padding. We add padding with this function as padding is added per batch, which means less interferance and padding time and padding is not stored in the variable, saving memory and computation time.

In [21]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [22]:
data_collator(tokenizer([dataset['train'][0]['review'], dataset['train'][1]['review']]))['input_ids']

tensor([[  101,   146,  1567,  1122,   102,     0,     0,     0],
        [  101,  2119,  3869,  1103,  4779, 16035, 12647,   102]])

In [23]:
data_collator(tokenizer([dataset['train'][0]['review'], dataset['train'][1]['review']]))['attention_mask']

tensor([[1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])

We can see our real tokens and the padding when visualizing the tensor from attention mask.

Now is the time we instance our DistilBert model from Hugging Face :

In [None]:
SCM_distilbert = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=5,
)

model.safetensors:   0%|          | 0.00/263M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Our Sequence classifier model made with Distilbert is composed by:

1. Embeddings : a layer that maps discrete tokens into a vector of numbers. Our embedding layer has 28.9k of vocabulary size with for a 768 extra embedding dimension.

2. Transformer encoder : Distilbert contains 6 Transformer blocks, which contains the typical transformer parameters (q,k,v) related to [self-attention ](https://arxiv.org/abs/1706.03762), a normalization layer, a Feed-forward Network make of 2 linear layers preceded by a dropout layer and an non-linear transformation layer (GELU) and an output normalization layer after the FFN.

3. A classifier layer that will output the label (star)classification of the result.

When we train our model, internally Pytorch applies log_softmax internally, which gives us the probability of being assigned to a specific class.

In [None]:

SCM_distilbert

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


We still can do the computation of the metrics to see the probability score:

In [24]:
def obtain_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"accuracy": (preds == p.label_ids).mean()}

In [None]:
pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.55.2-py3-none-any.whl.metadata (41 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.0/42.0 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
Downloading transformers-4.55.2-py3-none-any.whl (11.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.3/11.3 MB[0m [31m120.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.55.1
    Uninstalling transformers-4.55.1:
      Successfully uninstalled transformers-4.55.1
Successfully installed transformers-4.55.2


In [None]:
epochs = 1

training_args = TrainingArguments(
    output_dir="./bert_clf_results",
    num_train_epochs=epochs,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,
    per_device_eval_batch_size=128,
    load_best_model_at_end=True,

    # some deep learning parameters that the Trainer is able to take in
    warmup_ratio=0.1,
    weight_decay = 0.01,
    learning_rate=2e-5,

    logging_steps=1,
    log_level='info',
    eval_strategy='epoch',
    eval_steps=50,
    save_strategy='epoch'
)

# Define the trainer:

trainer = Trainer(
    model=SCM_distilbert,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['val'],
    compute_metrics=obtain_metrics,
    data_collator=data_collator
)

In [None]:
trainer.train()

The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: review. If review are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 184,361
  Num Epochs = 1
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 2
  Total optimization steps = 1,441
  Number of trainable parameters = 65,785,349
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mdavalpez[0m ([33mdavid-valdivieso[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss,Accuracy
1,0.6096,0.833678,0.712395


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: review. If review are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 46091
  Batch size = 128
Saving model checkpoint to ./bert_clf_results/checkpoint-1441
Configuration saved in ./bert_clf_results/checkpoint-1441/config.json
Model weights saved in ./bert_clf_results/checkpoint-1441/model.safetensors
Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
tokenizer config file saved in ./bert_clf_results/checkpoint-1441/tokenizer_config.json
Special tokens file saved in ./bert_clf_results/checkpoint-1441/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./bert_clf_results/checkpoint-1441 (score: 0.8336780667304993).


TrainOutput(global_step=1441, training_loss=0.8968047148067864, metrics={'train_runtime': 1983.9042, 'train_samples_per_second': 92.928, 'train_steps_per_second': 0.726, 'total_flos': 5958068458371300.0, 'train_loss': 0.8968047148067864, 'epoch': 1.0})

From our training recorded in Weights and biases, we can see the evolution curve of the loss value in our model as the training has progressed :

<img src="https://i.imgur.com/A6RQxel.png">

In [None]:
test_results = trainer.evaluate(dataset['test'])

The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: review. If review are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 57613
  Batch size = 128


Let's see how our model is doing against the test set that hasn't seen before !

In [None]:
# Print the results
print("Test Set Evaluation Results:")
for key, value in test_results.items():
    print(f"{key}: {value}")


Test Set Evaluation Results:
eval_loss: 0.8281859755516052
eval_accuracy: 0.7116102268585215
eval_runtime: 240.9028
eval_samples_per_second: 239.155
eval_steps_per_second: 1.872
epoch: 1.0


So, our model recognizes correctly the correlation between the actual star score and the one guessed by the sentiment related to text a 71,16% percent of the times. It is quite interesting !

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `hf auth whoami` to get more information or `hf auth logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: fineGrained).
The tok

In [None]:
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="davalpez/distilbert-base-case-stars")

In [None]:
trainer.model.push_to_hub('distilbert-base-case-stars')

README.md: 0.00B [00:00, ?B/s]

Configuration saved in /tmp/tmpnyjp5fvj/config.json
Model weights saved in /tmp/tmpnyjp5fvj/model.safetensors
Uploading the following files to davalpez/distilbert-base-case-stars: config.json,model.safetensors,README.md


Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  /tmp/tmpnyjp5fvj/model.safetensors    :   0%|          |  564kB /  263MB            

CommitInfo(commit_url='https://huggingface.co/davalpez/distilbert-base-case-stars/commit/71bb2c607341de9e75e787a57999a6f5e0a3bd85', commit_message='Upload DistilBertForSequenceClassification', commit_description='', oid='71bb2c607341de9e75e787a57999a6f5e0a3bd85', pr_url=None, repo_url=RepoUrl('https://huggingface.co/davalpez/distilbert-base-case-stars', endpoint='https://huggingface.co', repo_type='model', repo_id='davalpez/distilbert-base-case-stars'), pr_revision=None, pr_num=None)

In [None]:
tokenizer.push_to_hub('distilbert-base-case-stars')

tokenizer config file saved in /tmp/tmp0l2g0uyp/tokenizer_config.json
Special tokens file saved in /tmp/tmp0l2g0uyp/special_tokens_map.json
Uploading the following files to davalpez/distilbert-base-case-stars: vocab.txt,tokenizer_config.json,tokenizer.json,special_tokens_map.json,README.md
No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/davalpez/distilbert-base-case-stars/commit/71bb2c607341de9e75e787a57999a6f5e0a3bd85', commit_message='Upload tokenizer', commit_description='', oid='71bb2c607341de9e75e787a57999a6f5e0a3bd85', pr_url=None, repo_url=RepoUrl('https://huggingface.co/davalpez/distilbert-base-case-stars', endpoint='https://huggingface.co', repo_type='model', repo_id='davalpez/distilbert-base-case-stars'), pr_revision=None, pr_num=None)

In [None]:
dataset['test'].select(range(5)).to_pandas()

Unnamed: 0,review,label,input_ids,attention_mask
0,Very good app easy to use... Some bugs though...,4,"[101, 6424, 1363, 12647, 3123, 1106, 1329, 119...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,Wanted to Love it Really wanted to Love this g...,1,"[101, 20601, 1106, 2185, 1122, 8762, 1458, 110...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,Great,3,"[101, 2038, 102]","[1, 1, 1]"
3,Voice command is not do what you say its needs...,0,"[101, 7900, 2663, 1110, 1136, 1202, 1184, 1128...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,Nice app to impress the boss 😀,4,"[101, 8835, 12647, 1106, 19726, 1103, 6054, 10...","[1, 1, 1, 1, 1, 1, 1, 1, 1]"





Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.



I made a pause here, so let's try to get back our model !

In [1]:
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('davalpez/distilbert-base-case-stars').eval().to(0)
model.config.id2label = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
tokenizer = AutoTokenizer.from_pretrained('davalpez/distilbert-base-case-stars')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/822 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/263M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

In [2]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


Let's see how our model is performing. We are interested on seeing if our Fine-tuned transformer model can interpret the sentiment from the reviews and assign it to the star score (label) that user gave originally.

We are not only interesting in the original accuracy ( as if we can interpret the sentiment and define it as the same star value as the user), but also knowing how far was the model when predicting the value if it didn't label it correctly.

* Binary classification ( Okay review or Great review)

We will do now a specific binary classification, where we can tell the reviews are okay ( 3 stars or less) or great ( over 3), so we know how the model classify the sentiment on these two thresholds.

* Absolute error

We will also take the absolute value of the difference between truth values and predictive values so we know how far the prediction is from the actual value.

In [27]:
from tqdm import tqdm

batch_size = 128
clf_pipeline = pipeline('text-classification', model=model, tokenizer=tokenizer, device=0)

# run against dataset['test']['review']
clf_results = []

# Let's use tqdm in our loop so we can have a time estimation of the process
for i in tqdm(range(0, len(dataset['test']['review']), batch_size)):
    #Adjust the new batch values, jumping to the next batch on the i+1 iteration
    batch = dataset['test']['review'][i:i+batch_size]
    # clf_pipeline does the evaluation in the batch, then we add the results to
    # clf_resutls.
    clf_results.extend(clf_pipeline(batch))
    # Now, we just print some i-results to see the accuracy through the test batch
    if i % 100 == 0:
        # Accuracy
        star_preds = np.array([_['label'] for _ in clf_results])
        truth_preds = np.array(dataset['test']['label'][:len(star_preds)])
        acc = (np.array(star_preds) == truth_preds).mean()
        print(f'Accuracy is {acc}')
        # "Binary" classification accuracy
        Binary_accuracy = ((np.array(star_preds) <= 3) == (truth_preds <= 3)).mean()
        print(f'Binary accuracy is {Binary_accuracy}')
        # Accuracy off by one
        off_one = (np.abs((star_preds - truth_preds)) <= 1).mean()
        print(f'Accuracy off by one {off_one}')


Device set to use cuda:0
  0%|          | 1/451 [00:01<07:55,  1.06s/it]

Accuracy is 0.703125
Binary accuracy is 0.7578125
Accuracy off by one 0.90625


  6%|▌         | 26/451 [00:18<04:46,  1.48it/s]

Accuracy is 0.72265625
Binary accuracy is 0.8040865384615384
Accuracy off by one 0.875


 11%|█▏        | 51/451 [00:33<04:03,  1.64it/s]

Accuracy is 0.7162990196078431
Binary accuracy is 0.7981004901960784
Accuracy off by one 0.8670343137254902


 17%|█▋        | 76/451 [00:48<03:38,  1.72it/s]

Accuracy is 0.7152549342105263
Binary accuracy is 0.7970805921052632
Accuracy off by one 0.8657483552631579


 22%|██▏       | 101/451 [01:03<03:42,  1.57it/s]

Accuracy is 0.7148050742574258
Binary accuracy is 0.7955600247524752
Accuracy off by one 0.8651763613861386


 28%|██▊       | 126/451 [01:19<03:32,  1.53it/s]

Accuracy is 0.7132316468253969
Binary accuracy is 0.7953249007936508
Accuracy off by one 0.8654513888888888


 33%|███▎      | 151/451 [01:35<03:00,  1.66it/s]

Accuracy is 0.7147144039735099
Binary accuracy is 0.7978580298013245
Accuracy off by one 0.8665666390728477


 39%|███▉      | 176/451 [01:51<02:46,  1.65it/s]

Accuracy is 0.7137784090909091
Binary accuracy is 0.7966974431818182
Accuracy off by one 0.8655007102272727


 45%|████▍     | 201/451 [02:07<02:54,  1.43it/s]

Accuracy is 0.7136971393034826
Binary accuracy is 0.7955534825870647
Accuracy off by one 0.8650108830845771


 50%|█████     | 226/451 [02:23<02:18,  1.63it/s]

Accuracy is 0.7138066924778761
Binary accuracy is 0.7954922566371682
Accuracy off by one 0.8649751106194691


 56%|█████▌    | 251/451 [02:38<01:56,  1.72it/s]

Accuracy is 0.7139255478087649
Binary accuracy is 0.7951942231075697
Accuracy off by one 0.8647285856573705


 61%|██████    | 276/451 [02:53<01:42,  1.71it/s]

Accuracy is 0.7128623188405797
Binary accuracy is 0.7942425271739131
Accuracy off by one 0.8641587409420289


 67%|██████▋   | 301/451 [03:08<01:35,  1.56it/s]

Accuracy is 0.7135589700996677
Binary accuracy is 0.7949024086378738
Accuracy off by one 0.8645660299003323


 72%|███████▏  | 326/451 [03:23<01:14,  1.68it/s]

Accuracy is 0.7136694785276073
Binary accuracy is 0.7954131518404908
Accuracy off by one 0.864982745398773


 78%|███████▊  | 351/451 [03:38<00:57,  1.73it/s]

Accuracy is 0.7131187678062678
Binary accuracy is 0.7950053418803419
Accuracy off by one 0.8653846153846154


 83%|████████▎ | 376/451 [03:53<00:43,  1.72it/s]

Accuracy is 0.7124542885638298
Binary accuracy is 0.7947140957446809
Accuracy off by one 0.8654213763297872


 89%|████████▉ | 401/451 [04:08<00:32,  1.53it/s]

Accuracy is 0.7120285224438903
Binary accuracy is 0.7944396820448878
Accuracy off by one 0.8650444201995012


 94%|█████████▍| 426/451 [04:24<00:14,  1.67it/s]

Accuracy is 0.7122395833333334
Binary accuracy is 0.7942158157276995
Accuracy off by one 0.8649867957746479


100%|██████████| 451/451 [04:38<00:00,  1.62it/s]

Accuracy is 0.7116102268585215
Binary accuracy is 0.7938659677503341
Accuracy off by one 0.8653255341676358





This is how well our model is doing with the test set we set aside that has never seen before:
*  It has a 71.11 % accuracy assigning the same star rating to a review as the "truth" reviewer.

*  Up to 79.38 % of the time, the model scores the same star or just one star way from the truth reviwer.

* Doing a binary classification of "good vs great", the model would end up scoring up to 86.53 % of assigning the same ranting.

If we collected more data from other sources, we could have a good guess, setting some scores, how content are people with a certain product we are interested in without having to go through individual comments, and just scrapping the needed data from the sources we are interested in.

When we use programs like LM-Studio to connect our local LLM to our API, is interesting for us to see the amount of tokens per seconds consumed. When we use a model like Bert to benchmark its efficiency, we are interested in seeing the amount of samples per second.

As you have seen, when we train and we evaluate, we don't send individual tokens but a whole batch of samples. Bert is a encoder-only model, so it processes the entire input at once and for the classification task, it processes the entire parallel computation result into one final output.

In [None]:
import time
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

def test_bert_throughput(tokenized_test_data,batch_size=32, time_bias_in_seconds=0):

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
    test_dataloader = DataLoader(tokenized_test_data, batch_size=batch_size, collate_fn=data_collator)

    model.eval()  # Set the model to evaluation mode
    total_time = 0
    total_samples = 0

    with torch.no_grad():
        for batch in tqdm(test_dataloader):
            start_time = time.time()
            outputs = model(**{k: v.to(model.device) for k, v in batch.items() if k in ['input_ids', 'attention_mask']})
            end_time = time.time()
            total_time += (end_time - start_time) + time_bias_in_seconds  # Add bias to account for lag
            total_samples += batch['input_ids'].size(0)  # Number of samples in the batch

    throughput = total_samples / total_time  # Samples per second
    print(f"Throughput: {throughput} samples/second")

In [None]:
dataset['test']

Dataset({
    features: ['review', 'label', 'input_ids', 'attention_mask'],
    num_rows: 57613
})

In [None]:
def tokenize_function(examples):
    # Truncate texts to the maximum length the model can accept
    return tokenizer(examples["review"], truncation=True)

# Apply the tokenization to all splits (train, test, val)
print("Tokenizing the dataset...")
tokenized_datasets = dataset.map(tokenize_function, batched=True)
print("Tokenization complete.")

Tokenizing the dataset...


Map:   0%|          | 0/184361 [00:00<?, ? examples/s]

Map:   0%|          | 0/57613 [00:00<?, ? examples/s]

Map:   0%|          | 0/46091 [00:00<?, ? examples/s]

Tokenization complete.


In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['review', 'label', 'input_ids', 'attention_mask'],
        num_rows: 184361
    })
    test: Dataset({
        features: ['review', 'label', 'input_ids', 'attention_mask'],
        num_rows: 57613
    })
    val: Dataset({
        features: ['review', 'label', 'input_ids', 'attention_mask'],
        num_rows: 46091
    })
})

In [None]:
final_dataset = tokenized_datasets.remove_columns('review')

In [None]:
test_bert_throughput(final_dataset['test'], batch_size=64)

100%|██████████| 901/901 [03:16<00:00,  4.58it/s]

Throughput: 309.11576459909054 samples/second





In [None]:
test_bert_throughput(final_dataset['test'], batch_size=32)

100%|██████████| 1801/1801 [02:42<00:00, 11.11it/s]

Throughput: 397.1082774108495 samples/second





In [None]:
test_bert_throughput(final_dataset['test'], batch_size=1)# same assumption as chatgpt that we do 1 at a time

100%|██████████| 57613/57613 [05:24<00:00, 177.40it/s]

Throughput: 195.9162041657233 samples/second





In [None]:
test_bert_throughput(final_dataset['test'],batch_size=32, time_bias_in_seconds=0.5) # even with a lag of a half second on each batch (that's a lot frankly) every batch, not bad

100%|██████████| 1801/1801 [02:39<00:00, 11.30it/s]

Throughput: 54.78276472078889 samples/second





In [None]:
test_bert_throughput(final_dataset['test'], 1, 0.02) # Assuming 20ms lag on each BERT call (pretty darn fast RPC call) we get about the throughput as ChatGPT at one sample at a time

100%|██████████| 57613/57613 [04:39<00:00, 206.38it/s]

Throughput: 40.9807586377695 samples/second





What if we used ChatGPT for this purpose ?

In [None]:
import openai
from openai import AsyncOpenAI
import os


Through the examples from the book LLMs, GPT, and Prompt Engineering for Developers, we can see throughput with the OpenAPI is about 34.896 samples per second, and the esstimated Cost to run test data through ChatGPT: $0.5574.

At the moment, ChatGPT models are all based on subscription ( example gpt-3.5-turbo).

## Why Bert is better than GPT for this use-case

Bert is a encoder-only transformer, which means its architecture it's designed more towards language understanding and correlation, understanding a large understanding of its context.

It is used commonly for text classification, spam detection and sentiment analysis.

On the other hand, GPT is primarily used for text generation as its architecture is decoder-only transformer, which means it is better for doing code generation/ augemented generation of code.

If we wanted text generation transformer, we could also use Gemma 2, which is an open-source decoder transformer model, based on Gemini architecture.

