<a href="https://colab.research.google.com/github/alexlimatds/PyTorch-examples/blob/main/text_classification_with_HF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text classification with Hugging Face


### Basic example of sentiment detection

This example uses a pretrained model to perform sentiment classification.

Source: https://huggingface.co/transformers/quicktour.html

In [1]:
!pip install transformers



The next cell downloads a pipeline, which encapsulates a tokenizer and a pretrained model. The pipeline uses the tokenizer to preprocess the text and also post-processes the model output.

In [2]:
from transformers import pipeline

# Loading a pipeline with pretrained model and its tokenizer
classifier = pipeline('sentiment-analysis')

In [3]:
results = classifier(
    ["We are very happy to show you the 🤗 Transformers library.", 
     "We hope you don't hate it.", 
     "This movie sucks!"])

for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
label: NEGATIVE, with score: 0.9992
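
The post-processing step performed by the pipeline can be pictured as a softmax over the raw model outputs (logits) followed by picking the most probable class. The sketch below is a simplified illustration in plain Python, not the library's implementation; the logits are hypothetical and the `id2label` mapping is an assumption about the label order.

```python
import math

def postprocess(logits, id2label):
    """Turn raw classifier logits into a label and a score via softmax."""
    # Subtract the max logit for numerical stability before exponentiating.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the class with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for one sentence; the label order is an assumption.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
result = postprocess([-4.0, 4.5], id2label)
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
```

A score close to 0.5, like the one for the second sentence above, means that the two logits were nearly equal.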


We are able to get the tokenizer and the model from the pipeline object.

In [4]:
tokenizer = classifier.tokenizer
print(tokenizer("Saving Private Ryan is a great movie."))

model = classifier.model
print("Model: ", type(model).__name__)

{'input_ids': [101, 7494, 2797, 4575, 2003, 1037, 2307, 3185, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Model:  DistilBertForSequenceClassification


### Fine tuning

We can fine-tune a pretrained model. It is possible to change the number of labels while keeping the pretrained weights of the earlier layers; only the layers of the classification head are newly initialized.

Let's download a pretrained model and its tokenizer. The model will predict the number of stars (1 to 5) of a product review.

In [5]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=5) # 5 classes - one for each star
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

Installing the Hugging Face Datasets library.

In [6]:
!pip install datasets



Loading the Multilingual Amazon Reviews Corpus dataset. Details about it can be found at https://huggingface.co/datasets/

In [7]:
from datasets import load_dataset

dataset_full = load_dataset('amazon_reviews_multi', 'en')

Reusing dataset amazon_reviews_multi (/root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd)


In [8]:
print('Number of data instances: ', len(dataset_full['train']))
dataset_full['train'][0]

Number of data instances:  200000


{'language': 'en',
 'product_category': 'furniture',
 'product_id': 'product_en_0740675',
 'review_body': "Arrived broken. Manufacturer defect. Two of the legs of the base were not completely formed, so there was no way to insert the casters. I unpackaged the entire chair and hardware before noticing this. So, I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review of part of a chair I never got to sit in. I will go so far as to include a picture of what their injection molding and quality assurance process missed though. I will be hesitant to buy again. It makes me wonder if there aren't missing structures and supports that don't impede the assembly process.",
 'review_id': 'en_0964290',
 'review_title': "I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review ...",
 'reviewer_id': 'reviewer_en_0342986',
 'stars': 1}

In [9]:
def count_by_stars(ds):
  stars = [1, 2, 3, 4, 5]
  for l in stars:
    print("Instances for {} stars: {}".format(l, ds.filter(lambda record: record['stars'] == l).num_rows))

In [10]:
dataset_train = dataset_full['train']
dataset_val = dataset_full['validation']

print("Train dataset:")
count_by_stars(dataset_train)
print("Validation dataset:")
count_by_stars(dataset_val)

Loading cached processed dataset at /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd/cache-74e409560f0bf636.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd/cache-714f51de2678f108.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd/cache-5c4e8200da3f95e7.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd/cache-1741aa285ecc9000.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd/cache-5daa13abe1a9fb64.arrow
Loading cached processed datas

Train dataset:
Instances for 1 stars: 40000
Instances for 2 stars: 40000
Instances for 3 stars: 40000
Instances for 4 stars: 40000
Instances for 5 stars: 40000
Validation dataset:
Instances for 1 stars: 1000
Instances for 2 stars: 1000
Instances for 3 stars: 1000
Instances for 4 stars: 1000
Instances for 5 stars: 1000
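
`count_by_stars` filters the dataset once per rating. As a possible alternative, the same counts can be obtained in a single pass with `collections.Counter`. The sketch below works on a plain list of ratings; the helper name `count_by_stars_single_pass` and the toy data are hypothetical, and passing `dataset_train['stars']` assumes that the column can be iterated like a list.

```python
from collections import Counter

def count_by_stars_single_pass(stars_column):
    """Count how many reviews carry each star rating in one pass."""
    counts = Counter(stars_column)
    # Report every rating from 1 to 5, including ratings with no instances.
    return {star: counts.get(star, 0) for star in range(1, 6)}

# Toy stand-in for dataset_train['stars']; the values are made up.
toy_stars = [1, 5, 3, 3, 2, 5, 5]
star_counts = count_by_stars_single_pass(toy_stars)
for star, n in star_counts.items():
    print("Instances for {} stars: {}".format(star, n))
```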


In [11]:
print('Features: ', dataset_train.features)
dataset_train[0]

Features:  {'review_id': Value(dtype='string', id=None), 'product_id': Value(dtype='string', id=None), 'reviewer_id': Value(dtype='string', id=None), 'stars': Value(dtype='int32', id=None), 'review_body': Value(dtype='string', id=None), 'review_title': Value(dtype='string', id=None), 'language': Value(dtype='string', id=None), 'product_category': Value(dtype='string', id=None)}


{'language': 'en',
 'product_category': 'furniture',
 'product_id': 'product_en_0740675',
 'review_body': "Arrived broken. Manufacturer defect. Two of the legs of the base were not completely formed, so there was no way to insert the casters. I unpackaged the entire chair and hardware before noticing this. So, I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review of part of a chair I never got to sit in. I will go so far as to include a picture of what their injection molding and quality assurance process missed though. I will be hesitant to buy again. It makes me wonder if there aren't missing structures and supports that don't impede the assembly process.",
 'review_id': 'en_0964290',
 'review_title': "I'll spend twice the amount of time boxing up the whole useless thing and send it back with a 1-star review ...",
 'reviewer_id': 'reviewer_en_0342986',
 'stars': 1}

In the following function, we use the tokenizer to preprocess the review text. The `map` method from the dataset object inserts the generated fields (`attention_mask` and `input_ids`) into the dataset.

In [12]:
def preprocess(ds):
  return ds.map(
      lambda batch: tokenizer(
          batch["review_body"], 
          truncation=True, 
          padding='max_length', 
          max_length=60), 
      batched=True)
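
To make the effect of `truncation=True`, `padding='max_length'` and `max_length` more concrete, the sketch below mimics them for a single sequence in plain Python. It is a simplified illustration, not the tokenizer's code: the helper `pad_or_truncate` is hypothetical, and the special token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding) are assumed from the uncased BERT vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids assumed for the uncased BERT vocabulary

def pad_or_truncate(token_ids, max_length):
    """Mimic truncation=True with padding='max_length' for a single sequence.

    token_ids holds the content tokens only, without [CLS] and [SEP].
    """
    # Reserve two positions for the special tokens, then cut the content.
    content = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + content + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Fill the remaining positions with padding that the model should ignore.
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

# Content ids taken from the tokenizer output shown earlier in this notebook.
short = pad_or_truncate([7494, 2797, 4575, 2003, 1037, 2307, 3185, 1012], max_length=12)
print(short)

# A sequence that is too long is cut so that the result still ends with [SEP].
long = pad_or_truncate(list(range(1000, 1100)), max_length=12)
print(len(long["input_ids"]), long["input_ids"][-1])
```

With `max_length=60`, as used in `preprocess`, every review ends up with exactly 60 positions, and the attention mask marks which of them hold real tokens.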


In [13]:
dataset_train = preprocess(dataset_train)
dataset_val = preprocess(dataset_val)

# In the output, check the inserted fields
dataset_train.features

Loading cached processed dataset at /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd/cache-a28e4f3885e3776e.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd/cache-a9c680c02a8d5621.arrow


{'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'language': Value(dtype='string', id=None),
 'product_category': Value(dtype='string', id=None),
 'product_id': Value(dtype='string', id=None),
 'review_body': Value(dtype='string', id=None),
 'review_id': Value(dtype='string', id=None),
 'review_title': Value(dtype='string', id=None),
 'reviewer_id': Value(dtype='string', id=None),
 'stars': Value(dtype='int32', id=None)}

In [14]:
# The labels must be numbered from 0 to N-1 (N stands for the number of classes)
def adjust_labels(data_instance):
  data_instance['labels'] = data_instance['stars'] - 1  # The model requires 'labels' as the name of the column containing the labels
  return data_instance

In [15]:
dataset_train = dataset_train.map(adjust_labels)
dataset_val = dataset_val.map(adjust_labels)

dataset_val.features

Loading cached processed dataset at /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd/cache-181535b273002c6b.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/f3357bd271e187385a38574fe31b8fb10055303f67fa9fce55e84d08c4870efd/cache-c6c402349c94a3d2.arrow


{'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'labels': Value(dtype='int64', id=None),
 'language': Value(dtype='string', id=None),
 'product_category': Value(dtype='string', id=None),
 'product_id': Value(dtype='string', id=None),
 'review_body': Value(dtype='string', id=None),
 'review_id': Value(dtype='string', id=None),
 'review_title': Value(dtype='string', id=None),
 'reviewer_id': Value(dtype='string', id=None),
 'stars': Value(dtype='int32', id=None)}

We use the `set_format` method to indicate which columns will be used as input during training.

In [16]:
dataset_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
dataset_val.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

print({key: val.shape for key, val in dataset_train[0].items()})

{'attention_mask': torch.Size([60]), 'input_ids': torch.Size([60]), 'labels': torch.Size([])}




Let's perform training. We'll use the Hugging Face `Trainer` API, although a plain PyTorch training loop would also work.

In [17]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total # of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset_train,         # training dataset
    eval_dataset=dataset_val             # evaluation dataset
)
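
The `warmup_steps` argument controls how the learning rate ramps up at the start of training. The sketch below computes the rate at a given step, assuming the linear warmup followed by linear decay that `Trainer` uses when no other scheduler is configured, and assuming the default base learning rate of 5e-5. The function `linear_warmup_lr` is a hypothetical helper written for illustration; the default `total_steps=6250` matches the `global_step` this run reports.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=6250):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Then decay linearly so the rate reaches 0 at the last step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 250, 500, 3375, 6250):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the rate peaks at step 500 and falls back to zero by the end of the epoch.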

In [18]:
trainer.train()

Step,Training Loss
500,1.254666
1000,1.07308
1500,1.04808
2000,1.019723
2500,0.998885
3000,0.995968
3500,0.984322
4000,0.976184
4500,0.963402
5000,0.968773


TrainOutput(global_step=6250, training_loss=1.0137391796875)
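
The reported `global_step` follows from the dataset size and the batch size. The short calculation below reproduces it, assuming a single device and no gradient accumulation; `optimizer_steps` is a hypothetical helper, not part of the library.

```python
import math

def optimizer_steps(num_examples, batch_size, num_epochs=1, num_devices=1):
    """Number of optimizer steps, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    return steps_per_epoch * num_epochs

# 200000 training reviews, per_device_train_batch_size=32, one epoch.
total_steps = optimizer_steps(num_examples=200_000, batch_size=32)
print(total_steps)
```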

In [20]:
from transformers import TextClassificationPipeline

sequences = [
    "We are very happy to show you the Transformers library.", 
    "We hope you don't hate it.", 
    "This movie sucks!", 
    "The device works fine and its battery longs for hours. I recommend it.", 
    "The movie has good and bad moments.", 
    "The main features work well but some details can improve such the eject button"]

classifier2 = TextClassificationPipeline(
    model=model, 
    tokenizer=tokenizer, 
    device=model.device.index)  # place the pipeline on the model's device (index is None on CPU)

results = classifier2(sequences)
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: LABEL_4, with score: 0.644
label: LABEL_2, with score: 0.3281
label: LABEL_0, with score: 0.95
label: LABEL_4, with score: 0.5817
label: LABEL_2, with score: 0.4097
label: LABEL_3, with score: 0.6181
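
The pipeline reports generic names such as `LABEL_4` because no label names were configured for the fine-tuned model. Since `adjust_labels` stored `stars - 1`, the class index can be mapped back to a star rating by adding 1. The helper `label_to_stars` below is a hypothetical sketch of that conversion.

```python
def label_to_stars(label):
    """Map a generic pipeline label such as 'LABEL_4' back to a 1-5 star rating."""
    class_index = int(label.rsplit("_", 1)[1])
    # Undo the shift applied in adjust_labels (labels = stars - 1).
    return class_index + 1

# Labels copied from the output above.
predicted = ["LABEL_4", "LABEL_2", "LABEL_0", "LABEL_4", "LABEL_2", "LABEL_3"]
stars = [label_to_stars(label) for label in predicted]
print(stars)
```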
