<a href="https://colab.research.google.com/github/alexlimatds/PyTorch-examples/blob/main/text_classification_with_HF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text classification with Hugging Face


### Basic example of sentiment detection

This example uses a pretrained model to perform sentiment classification.

Source: https://huggingface.co/transformers/quicktour.html

In [1]:
!pip install transformers


The next cell downloads a pipeline that bundles a tokenizer and a pretrained model. The pipeline uses the tokenizer to preprocess the text, runs the model, and post-processes the model output into labels and scores.
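To make the post-processing step more concrete, here is a minimal sketch of what it might look like for a classification model: a softmax over the raw logits followed by picking the most probable class. The function names and the logits are hypothetical, chosen only for illustration; this is not the library's actual implementation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label):
    """Pick the most probable class and report its label and probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for a single sentence; a real model would produce these.
example_logits = [-2.0, 3.0]
example_id2label = {0: "NEGATIVE", 1: "POSITIVE"}

result = postprocess(example_logits, example_id2label)
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
```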

In [2]:
from transformers import pipeline

# Loading a pipeline with pretrained model and its tokenizer
classifier = pipeline('sentiment-analysis')

In [3]:
results = classifier(
    ["We are very happy to show you the 🤗 Transformers library.", 
     "We hope you don't hate it.", 
     "This movie sucks!"])

for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
label: NEGATIVE, with score: 0.9992


We are able to get the tokenizer and the model from the pipeline object.

In [4]:
tokenizer = classifier.tokenizer
print(tokenizer("Saving Private Ryan is a great movie."))

model = classifier.model
print("Model: ", type(model).__name__)

{'input_ids': [101, 7494, 2797, 4575, 2003, 1037, 2307, 3185, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Model:  DistilBertForSequenceClassification


### Fine tuning

We can fine-tune a pretrained model for a new task. It is possible to change the number of labels while keeping the pretrained weights of the core model; fine-tuning then trains the newly initialized output weights (and, by default, also updates the core weights).

Let's download a pretrained model and its tokenizer. The model will predict the number of stars (1 to 5) of a product review.

In [21]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=5) # 5 classes - one for each star
# Customizing the label's output
id2label = {
    0: '1 star', 
    1: '2 stars', 
    2: '3 stars', 
    3: '4 stars', 
    4: '5 stars'}
model.config.id2label = id2label


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

Because we changed the number of classes, the weights of the output layer contain random values; only the core layers are pretrained. Let's feed in some phrases and check the results.

In [6]:
from transformers import TextClassificationPipeline

sequences = [
    "We are very happy to show you the Transformers library.", 
    "We hope you don't hate it.", 
    "This movie sucks!", 
    "The device works fine and its battery longs for hours. I recommend it.", 
    "The movie has good and bad moments.", 
    "The main features work well but some details can improve such the eject button. It's up to you to decide if it worths.", 
    "The product has no flaws, I really recommend it.", 
    "I regret the buying. The main feature is ok, but the others don't perform good enough."]

classifier2 = TextClassificationPipeline(
    model=model, 
    tokenizer=tokenizer)

results = classifier2(sequences)
for r, s in zip(results, sequences):
    print(f"label: {r['label']}, with score: {round(r['score'], 4)} - {s}")

label: 5 stars, with score: 0.2265 - We are very happy to show you the Transformers library.
label: 5 stars, with score: 0.239 - We hope you don't hate it.
label: 5 stars, with score: 0.2387 - This movie sucks!
label: 5 stars, with score: 0.2303 - The device works fine and its battery longs for hours. I recommend it.
label: 5 stars, with score: 0.2355 - The movie has good and bad moments.
label: 5 stars, with score: 0.2341 - The main features work well but some details can improve such the eject button. It's up to you to decide if it worths.
label: 5 stars, with score: 0.24 - The product has no flaws, I really recommend it.
label: 5 stars, with score: 0.2385 - I regret the buying. The main feature is ok, but the others don't perform good enough.
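All scores above sit close to 1/5, which is what we would expect from an untrained output layer with five classes. A minimal sketch of why, assuming that a freshly initialized layer emits small logits near zero (the logit values below are hypothetical):

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from an untrained 5-class head: all close to zero.
untrained_logits = [0.02, -0.01, 0.0, 0.03, 0.05]
untrained_probs = softmax(untrained_logits)

# Every class ends up near the uniform probability 1/5 = 0.2,
# so the winning label is decided by tiny, meaningless differences.
print([round(p, 4) for p in untrained_probs])
```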


Installing the Hugging Face Datasets library.

In [7]:
!pip install datasets


Loading the Multilingual Amazon Reviews Corpus dataset. Details about it can be found at https://huggingface.co/datasets/

In [8]:
from datasets import load_dataset

dataset_full = load_dataset('amazon_reviews_multi', 'en')



Downloading and preparing dataset amazon_reviews_multi/en (download: 82.11 MiB, generated: 58.69 MiB, post-processed: Unknown size, total: 140.79 MiB) to /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd...



Dataset amazon_reviews_multi downloaded and prepared to /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd. Subsequent calls will reuse this data.


In [9]:
print('Number of data instances: ', len(dataset_full['train']))
dataset_full['train'][0]

Number of data instances:  200000


{'language': 'en',
 'product_category': 'furniture',
 'product_id': 'product_en_0740675',
 'review_body': "Arrived broken. Manufacturer defect. Two of the legs of the base were not completely formed, so there was no way to insert the casters. I unpackaged the entire chair and hardware before noticing this. So, I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review of part of a chair I never got to sit in. I will go so far as to include a picture of what their injection molding and quality assurance process missed though. I will be hesitant to buy again. It makes me wonder if there aren't missing structures and supports that don't impede the assembly process.",
 'review_id': 'en_0964290',
 'review_title': "I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review ...",
 'reviewer_id': 'reviewer_en_0342986',
 'stars': 1}

In [10]:
def count_by_stars(ds):
  stars = [1, 2, 3, 4, 5]
  for l in stars:
    print("Instances for {} stars: {}".format(l, ds.filter(lambda record: record['stars'] == l).num_rows))
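The function above filters the dataset once per star value. As an alternative, the counts could be gathered in a single pass with `collections.Counter`. This is only a sketch: the function name and the toy records are hypothetical, and plain dictionaries stand in for dataset rows.

```python
from collections import Counter

def count_by_stars_single_pass(records):
    """Count records per star rating in one pass over the data."""
    counts = Counter(record["stars"] for record in records)
    # Report every rating from 1 to 5, even those with no records.
    return {stars: counts.get(stars, 0) for stars in range(1, 6)}

# Hypothetical records standing in for dataset rows.
toy_records = [{"stars": 1}, {"stars": 5}, {"stars": 5}, {"stars": 3}]

star_counts = count_by_stars_single_pass(toy_records)
for stars, n in star_counts.items():
    print(f"Instances for {stars} stars: {n}")
```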

In [11]:
dataset_train = dataset_full['train']
dataset_val = dataset_full['validation']

print("Train dataset:")
count_by_stars(dataset_train)
print("Validation dataset:")
count_by_stars(dataset_val)

Train dataset:
Instances for 1 stars: 40000
Instances for 2 stars: 40000
Instances for 3 stars: 40000
Instances for 4 stars: 40000
Instances for 5 stars: 40000
Validation dataset:
Instances for 1 stars: 1000
Instances for 2 stars: 1000
Instances for 3 stars: 1000
Instances for 4 stars: 1000
Instances for 5 stars: 1000


In [12]:
print('Features: ', dataset_train.features)
dataset_train[0]

Features:  {'review_id': Value(dtype='string', id=None), 'product_id': Value(dtype='string', id=None), 'reviewer_id': Value(dtype='string', id=None), 'stars': Value(dtype='int32', id=None), 'review_body': Value(dtype='string', id=None), 'review_title': Value(dtype='string', id=None), 'language': Value(dtype='string', id=None), 'product_category': Value(dtype='string', id=None)}


{'language': 'en',
 'product_category': 'furniture',
 'product_id': 'product_en_0740675',
 'review_body': "Arrived broken. Manufacturer defect. Two of the legs of the base were not completely formed, so there was no way to insert the casters. I unpackaged the entire chair and hardware before noticing this. So, I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review of part of a chair I never got to sit in. I will go so far as to include a picture of what their injection molding and quality assurance process missed though. I will be hesitant to buy again. It makes me wonder if there aren't missing structures and supports that don't impede the assembly process.",
 'review_id': 'en_0964290',
 'review_title': "I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review ...",
 'reviewer_id': 'reviewer_en_0342986',
 'stars': 1}

In the following function, we use the tokenizer to preprocess the review text. The `map` method from the dataset object inserts the generated fields (`attention_mask` and `input_ids`) into the dataset.

In [13]:
def preprocess(ds):
  return ds.map(
      lambda batch: tokenizer(
          batch["review_body"], 
          truncation=True, 
          padding='max_length', 
          max_length=60), 
      batched=True)
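To illustrate what `truncation=True`, `padding='max_length'` and `max_length` do to a single sequence, here is a simplified sketch that works on already-encoded token ids. The helper name is hypothetical, the padding id of 0 is an assumption, and the truncation rule (keep the final token, drop the overflow before it) only approximates what the tokenizer does.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    if len(input_ids) > max_length:
        # Simplification: keep the closing token and drop the overflow before it.
        input_ids = input_ids[:max_length - 1] + input_ids[-1:]
    attention_mask = [1] * len(input_ids)   # 1 marks real tokens
    padding = max_length - len(input_ids)
    # 0 in the mask tells the model to ignore the padded positions.
    return input_ids + [pad_id] * padding, attention_mask + [0] * padding

# Token ids printed earlier for "Saving Private Ryan is a great movie."
ids = [101, 7494, 2797, 4575, 2003, 1037, 2307, 3185, 1012, 102]

padded_ids, padded_mask = pad_or_truncate(ids, max_length=12)
short_ids, short_mask = pad_or_truncate(ids, max_length=6)

print(padded_ids, padded_mask)
print(short_ids, short_mask)
```

Padding every review to the same length makes all tensors the same shape, at the cost of computing over padded positions for short reviews.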


In [14]:
dataset_train = preprocess(dataset_train)
dataset_val = preprocess(dataset_val)

# Note the inserted fields in the output
dataset_train.features


{'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'language': Value(dtype='string', id=None),
 'product_category': Value(dtype='string', id=None),
 'product_id': Value(dtype='string', id=None),
 'review_body': Value(dtype='string', id=None),
 'review_id': Value(dtype='string', id=None),
 'review_title': Value(dtype='string', id=None),
 'reviewer_id': Value(dtype='string', id=None),
 'stars': Value(dtype='int32', id=None)}

In [15]:
# Label ids must be numbered from 0 to N-1 (N is the number of classes)
def adjust_labels(data_instance):
  # The model expects the label ids in a column named 'labels'
  data_instance['labels'] = data_instance['stars'] - 1
  return data_instance
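The shift by one is needed because the loss function uses each label id as an index into the model's five output scores. A small sketch of the round trip between star ratings, label ids and the `id2label` mapping defined earlier (the two helper functions are hypothetical names introduced here):

```python
id2label = {0: "1 star", 1: "2 stars", 2: "3 stars", 3: "4 stars", 4: "5 stars"}

def stars_to_label_id(stars):
    """Map a 1-5 star rating to a 0-4 label id."""
    if not 1 <= stars <= 5:
        raise ValueError(f"stars must be between 1 and 5, got {stars}")
    return stars - 1

def label_id_to_stars(label_id):
    """Inverse mapping, from a 0-4 label id back to a 1-5 star rating."""
    return label_id + 1

for stars in range(1, 6):
    label_id = stars_to_label_id(stars)
    print(f"{stars} -> {label_id} -> {id2label[label_id]}")
```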

In [16]:
dataset_train = dataset_train.map(adjust_labels)
dataset_val = dataset_val.map(adjust_labels)

dataset_val.features





{'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'labels': Value(dtype='int64', id=None),
 'language': Value(dtype='string', id=None),
 'product_category': Value(dtype='string', id=None),
 'product_id': Value(dtype='string', id=None),
 'review_body': Value(dtype='string', id=None),
 'review_id': Value(dtype='string', id=None),
 'review_title': Value(dtype='string', id=None),
 'reviewer_id': Value(dtype='string', id=None),
 'stars': Value(dtype='int32', id=None)}

We use the `set_format` method to indicate which columns will be used as input during training.

In [17]:
dataset_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
dataset_val.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

print({key: val.shape for key, val in dataset_train[0].items()})

{'attention_mask': torch.Size([60]), 'input_ids': torch.Size([60]), 'labels': torch.Size([])}




The `compute_metrics` function below will be used by the `Trainer` to measure the model's performance.

In [18]:
from datasets import load_metric
import numpy as np

metric = load_metric('precision')

def compute_metrics(eval_pred):
  labels = eval_pred.label_ids
  predictions = eval_pred.predictions.argmax(-1)
  return metric.compute(predictions=predictions, references=labels, average='macro')
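To show what macro-averaged precision means, here is a hand-written sketch on toy data. It assumes that a class which is never predicted contributes a precision of 0; the function name and the toy arrays are hypothetical, and the library metric may treat edge cases differently.

```python
def macro_precision(predictions, references, num_classes):
    """Average the per-class precision, giving every class the same weight."""
    per_class = []
    for c in range(num_classes):
        # True labels of all examples that were predicted as class c.
        refs_for_predicted_c = [r for p, r in zip(predictions, references) if p == c]
        if not refs_for_predicted_c:
            per_class.append(0.0)  # assumption: a never-predicted class scores 0
            continue
        true_positives = sum(1 for r in refs_for_predicted_c if r == c)
        per_class.append(true_positives / len(refs_for_predicted_c))
    return sum(per_class) / num_classes

# Hypothetical predictions and true labels for three classes.
toy_predictions = [0, 0, 1, 2, 2, 2]
toy_references = [0, 1, 1, 2, 2, 0]

toy_precision = macro_precision(toy_predictions, toy_references, num_classes=3)
print(round(toy_precision, 4))
```

Because the training and validation sets are balanced across the five star values, macro averaging weights each star value equally here.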



Let's train the model. We'll use the Hugging Face `Trainer` utilities, but a plain PyTorch training loop would also work.

In [22]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    logging_dir='./logs',            # directory for storing logs
    num_train_epochs=2,              # total # of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,
    metric_for_best_model='precision', 
    evaluation_strategy='steps'
)

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset_train,         # training dataset
    eval_dataset=dataset_val,            # evaluation dataset
    compute_metrics=compute_metrics
)
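The sketch below estimates how many optimization steps these arguments imply and how `warmup_steps` shapes the learning rate. It assumes a single device and a linear warmup followed by a linear decay to zero, which we believe is the `Trainer` default; the function names are hypothetical.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs

def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to 1.
        return step / max(1, warmup_steps)
    # Decay: shrink linearly from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = total_training_steps(num_examples=200_000, batch_size=32, num_epochs=2)
print("Total steps:", total)

for step in (0, 250, 500, 6500, total):
    print(step, round(linear_warmup_decay(step, warmup_steps=500, total_steps=total), 4))
```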

In [23]:
trainer.train()

| Step | Training Loss | Validation Loss | Precision |
|------|---------------|-----------------|-----------|
| 500  | 1.258423 | 1.086396 | 0.516534 |
| 1000 | 1.070474 | 1.055209 | 0.532111 |
| 1500 | 1.04953  | 1.03292  | 0.542848 |
| 2000 | 1.02396  | 1.004182 | 0.567409 |
| 2500 | 1.003268 | 1.023817 | 0.563023 |
| 3000 | 1.003298 | 1.004298 | 0.548008 |
| 3500 | 0.990334 | 0.999029 | 0.551605 |
| 4000 | 0.98708  | 0.99528  | 0.584598 |
| 4500 | 0.976881 | 0.995796 | 0.556761 |
| 5000 | 0.981573 | 0.973358 | 0.571802 |


TrainOutput(global_step=12500, training_loss=0.9450304125976563)

In [24]:
trainer.evaluate()

{'epoch': 2.0,
 'eval_loss': 0.9714898467063904,
 'eval_precision': 0.5874171874701958}

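The model is now fine-tuned. Let's feed in the same phrases again. We have to recreate the pipeline with the `device` argument, most likely because the `Trainer` moved the model to the GPU during training: a pipeline created without `device` would keep its inputs on the CPU while the model sits on the GPU.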

In [25]:
classifier2 = TextClassificationPipeline(
    model=model, 
    tokenizer=tokenizer, 
    device=model.device.index)  # put the pipeline on the same device as the model

results = classifier2(sequences)
for r, s in zip(results, sequences):
  print(f"label: {r['label']}, with score: {round(r['score'], 4)} - {s}")

label: 5 stars, with score: 0.6697 - We are very happy to show you the Transformers library.
label: 3 stars, with score: 0.3642 - We hope you don't hate it.
label: 1 star, with score: 0.9678 - This movie sucks!
label: 5 stars, with score: 0.6253 - The device works fine and its battery longs for hours. I recommend it.
label: 3 stars, with score: 0.5743 - The movie has good and bad moments.
label: 3 stars, with score: 0.4708 - The main features work well but some details can improve such the eject button. It's up to you to decide if it worths.
label: 5 stars, with score: 0.7908 - The product has no flaws, I really recommend it.
label: 2 stars, with score: 0.4981 - I regret the buying. The main feature is ok, but the others don't perform good enough.
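One caveat about `device=model.device.index`: on a CPU-only machine the device index is `None`, while the pipeline's `device` argument expects an integer (`-1` for CPU, a non-negative number for a GPU), so the call may fail there. A small, hypothetical helper that covers both cases could look like this; it takes the device type and index as plain values so it can be read without PyTorch installed.

```python
from typing import Optional

def pipeline_device_index(device_type: str, device_index: Optional[int]) -> int:
    """Translate a device description into the integer a pipeline expects.

    Assumed convention: -1 means CPU, any non-negative integer is a GPU ordinal.
    """
    if device_type == "cpu":
        return -1
    # A GPU device without an explicit index is treated as the first GPU.
    return 0 if device_index is None else device_index

print(pipeline_device_index("cpu", None))   # CPU
print(pipeline_device_index("cuda", 0))     # first GPU
```

With a PyTorch model, it could be called as `pipeline_device_index(model.device.type, model.device.index)`.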
