# 4. Fine-tuning a Text Classification Model with Multiple Input Sentences

In [3]:
!pip install -q transformers torch datasets scikit-learn

In [4]:
from datasets import load_dataset
import numpy as np

In [5]:
# The Recognizing Textual Entailment (RTE) datasets come from a series of annual
# textual entailment challenges. We combine the data from RTE1 (Dagan et al.,
# 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5
# (Bentivogli et al., 2009). Examples are constructed based on news and
# Wikipedia text. We convert all datasets to a two-class split, where for
# three-class datasets we collapse neutral and contradiction into not
# entailment, for consistency.
raw_datasets = load_dataset("glue", "rte")

Using the latest cached version of the dataset since glue couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'rte' at /home/jupyter/.cache/huggingface/datasets/glue/rte/1.0.0/fd8e86499fa5c264fcaad392a8f49ddf58bf4037 (last modified on Fri Feb  7 19:44:00 2025).


In [6]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 2490
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 277
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3000
    })
})

In [7]:
raw_datasets['train'].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['entailment', 'not_entailment'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [8]:
raw_datasets['train']['sentence1'][:10]

['No Weapons of Mass Destruction Found in Iraq Yet.',
 'A place of sorrow, after Pope John Paul II died, became a place of celebration, as Roman Catholic faithful gathered in downtown Chicago to mark the installation of new Pope Benedict XVI.',
 'Herceptin was already approved to treat the sickest breast cancer patients, and the company said, Monday, it will discuss with federal regulators the possibility of prescribing the drug for more breast cancer patients.',
 'Judie Vivian, chief executive at ProMedica, a medical service company that helps sustain the 2-year-old Vietnam Heart Institute in Ho Chi Minh City (formerly Saigon), said that so far about 1,500 children have received treatment.',
 "A man is due in court later charged with the murder 26 years ago of a teenager whose case was the first to be featured on BBC One's Crimewatch. Colette Aram, 16, was walking to her boyfriend's house in Keyworth, Nottinghamshire, on 30 October 1983 when she disappeared. Her body was later found i

In [9]:
checkpoint = 'distilbert-base-cased'
# checkpoint = 'bert-base-cased'

In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Test our tokenizer on the first pair of sentences in our dataset:

In [11]:
tokenizer(
    raw_datasets['train']['sentence1'][0],
    raw_datasets['train']['sentence2'][0])

{'input_ids': [101, 1302, 20263, 1104, 8718, 14177, 17993, 17107, 1107, 5008, 6355, 119, 102, 20263, 1104, 8718, 14177, 17993, 17107, 1107, 5008, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [12]:
result = _  # in IPython/Jupyter, `_` holds the output of the previous cell

In [13]:
# distilbert doesn't use token_type_ids
result.keys()

dict_keys(['input_ids', 'attention_mask'])

If we decode the input IDs, we see our two input sentences concatenated into a single string, separated by [SEP]:

In [14]:
tokenizer.decode(result['input_ids'])

'[CLS] No Weapons of Mass Destruction Found in Iraq Yet. [SEP] Weapons of Mass Destruction Found in Iraq. [SEP]'
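To make that layout concrete, here is a minimal sketch of how a sentence pair gets packed into one sequence. `build_pair_input` is a hypothetical helper written just for this illustration (it is not part of `transformers`), and it works on tokens that have already been split. The segment IDs it returns mirror what BERT-style tokenizers emit as `token_type_ids`; DistilBERT's tokenizer leaves them out, as we saw above.

```python
def build_pair_input(tokens_a, tokens_b):
    """Pack two token lists as [CLS] A [SEP] B [SEP] (illustrative helper)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding here, so every position is attended to.
    attention_mask = [1] * len(tokens)
    return tokens, segment_ids, attention_mask


tokens, segment_ids, attention_mask = build_pair_input(
    ["I", "went", "to", "the", "store"], ["I", "am", "a", "bird"]
)
print(" ".join(tokens))
print(segment_ids)
```

The model only ever sees one sequence; the two [SEP] tokens (and, for BERT, the segment IDs) are what tell it where the first sentence ends and the second begins.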

In [15]:
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
training_args = TrainingArguments(
  output_dir='training_dir',
  evaluation_strategy='epoch', # named `eval_strategy` in newer transformers releases
  save_strategy='epoch',
  num_train_epochs=5,
  per_device_train_batch_size=16,
  per_device_eval_batch_size=64,
  logging_steps=150, # log before the first epoch ends (156 steps); otherwise 'No log' appears under Training Loss
)
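The `logging_steps=150` choice is easier to follow with the step arithmetic written out. This is a rough sketch, assuming a single device and no gradient accumulation (the notebook does not state either). `epochs_without_log` is a hypothetical helper, and 500 is what I believe the `TrainingArguments` default for `logging_steps` to be.

```python
import math

num_train_examples = 2490  # rows in the RTE train split
batch_size = 16            # per_device_train_batch_size
num_epochs = 5             # num_train_epochs

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def epochs_without_log(logging_steps):
    """Epochs whose end-of-epoch evaluation happens before any loss was logged."""
    missing = []
    for epoch in range(1, num_epochs + 1):
        end_step = epoch * steps_per_epoch
        if end_step < logging_steps:
            missing.append(epoch)
    return missing


print(steps_per_epoch, total_steps)
print(epochs_without_log(150))
print(epochs_without_log(500))
```

Under these assumptions the total matches the `global_step=780` reported by `trainer.train()`, and a logging interval of 150 steps guarantees that a training loss is available by the time each epoch is evaluated.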

In [17]:
# Note: `load_metric` is deprecated in `datasets`; the separate `evaluate` library replaces it.
from datasets import load_metric
metric = load_metric("glue", "rte")
metric.compute(predictions=[1, 0, 1], references=[1, 0, 0])

  metric = load_metric("glue", "rte")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


{'accuracy': 0.6666666666666666}

The GLUE metric for RTE only reports accuracy, so let's import `f1_score` from scikit-learn and define a `compute_metrics` function that returns both accuracy and F1:

In [18]:
from sklearn.metrics import f1_score

def compute_metrics(logits_and_labels):
    logits, labels = logits_and_labels
    predictions = np.argmax(logits, axis=-1)
    acc = np.mean(predictions == labels)
    f1 = f1_score(labels, predictions)
    return {'accuracy': acc, 'f1': f1}
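To see what `compute_metrics` does to the raw logits, here is a dependency-free sketch of the same computation. The logits are made up so that they reproduce the `predictions=[1, 0, 1]` example from the metric cell, and `argmax` and `accuracy_and_f1` are hypothetical helpers, not library functions. Like scikit-learn's `f1_score` with its default settings, the sketch treats label 1 as the positive class, which in this dataset is `not_entailment`.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_f1(predictions, labels, positive=1):
    """Accuracy and binary F1, treating `positive` as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    acc = sum(p == y for p, y in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1


# Made-up logits: one row per example, one column per class.
toy_logits = [[-0.2, 0.9], [1.3, -0.4], [0.1, 0.5]]
toy_labels = [1, 0, 0]
toy_predictions = [argmax(row) for row in toy_logits]
acc, f1 = accuracy_and_f1(toy_predictions, toy_labels)
print(toy_predictions, acc, f1)
```

One consequence of this choice of positive class: a model that predicts class 0 for nearly every example can still reach roughly 50% accuracy on a balanced set while its F1 stays close to zero.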

Now, let's define our tokenization function. As always, the input to this function is a batch of data from our dataset, and this time we pass both sentence columns to the tokenizer.

In [19]:
def tokenize_fn(batch):
    return tokenizer(batch['sentence1'], batch['sentence2'], truncation=True)
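If the shape of `batch` is unclear: with `batched=True`, `map` hands the function a dict that maps each column name to a list of values, and expects a dict of lists back. The sketch below imitates that with a stand-in function; `toy_tokenize` is hypothetical and splits on whitespace instead of calling a real tokenizer, and the second sentence pair is made up.

```python
def toy_tokenize(batch):
    """Stand-in for tokenize_fn: whitespace split instead of a real tokenizer."""
    input_tokens = [
        ["[CLS]"] + s1.split() + ["[SEP]"] + s2.split() + ["[SEP]"]
        for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
    ]
    # One entry per example; each returned key becomes a new column.
    return {"input_tokens": input_tokens}


# A batch is a dict of columns, not a list of rows.
batch = {
    "sentence1": ["I went to the store", "Elena got the job"],
    "sentence2": ["I am a bird", "Elena has a job"],
}
out = toy_tokenize(batch)
print(out["input_tokens"][0])
```

The real `tokenize_fn` follows the same contract: the tokenizer accepts two lists of strings and returns lists of `input_ids` and `attention_mask`, one per example.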

Now let's create our tokenized dataset:

In [20]:
tokenized_datasets = raw_datasets.map(tokenize_fn, batched=True)


Now let's create our trainer object and train our model:

In [21]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.6968 | 0.69289 | 0.523466 | 0.043478 |
| 2 | 0.6766 | 0.738031 | 0.498195 | 0.603989 |
| 3 | 0.5272 | 0.84568 | 0.552347 | 0.465517 |
| 4 | 0.2737 | 1.328066 | 0.584838 | 0.572491 |
| 5 | 0.1112 | 1.827858 | 0.563177 | 0.550186 |


TrainOutput(global_step=780, training_loss=0.4446157678579673, metrics={'train_runtime': 195.568, 'train_samples_per_second': 63.661, 'train_steps_per_second': 3.988, 'total_flos': 543824207151168.0, 'train_loss': 0.4446157678579673, 'epoch': 5.0})

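As we can see, our model starts overfitting at epoch 2: the training loss keeps falling (from 0.70 to 0.11) while the validation loss climbs from 0.69 to 1.83, and validation accuracy never gets above 0.58. In the real world we should work on this. However, for educational purposes, let's just ignore it and make some predictions with our model.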
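One simple way to act on numbers like these is to pick the checkpoint from the best epoch instead of the last one. The sketch below does that by hand on the metrics copied from the training log; `best_epoch` is a hypothetical helper. In `Trainer`, options such as `load_best_model_at_end` and `EarlyStoppingCallback` serve a similar purpose, but they are not used in this notebook.

```python
# Per-epoch validation metrics copied from the training log.
history = [
    {"epoch": 1, "eval_loss": 0.69289, "accuracy": 0.523466, "f1": 0.043478},
    {"epoch": 2, "eval_loss": 0.738031, "accuracy": 0.498195, "f1": 0.603989},
    {"epoch": 3, "eval_loss": 0.84568, "accuracy": 0.552347, "f1": 0.465517},
    {"epoch": 4, "eval_loss": 1.328066, "accuracy": 0.584838, "f1": 0.572491},
    {"epoch": 5, "eval_loss": 1.827858, "accuracy": 0.563177, "f1": 0.550186},
]


def best_epoch(history, key, higher_is_better):
    """Return the epoch number with the best value for the given metric."""
    pick = max if higher_is_better else min
    return pick(history, key=lambda row: row[key])["epoch"]


print("best by validation loss:", best_epoch(history, "eval_loss", higher_is_better=False))
print("best by accuracy:", best_epoch(history, "accuracy", higher_is_better=True))
print("best by F1:", best_epoch(history, "f1", higher_is_better=True))
```

The three criteria disagree on this run, which is itself a sign of how noisy the validation metrics are on a 277-example set.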

In [22]:
trainer.save_model('my_saved_model')

In [23]:
from transformers import pipeline

p = pipeline('text-classification', model='my_saved_model', device=0)  # device=0 selects the first GPU

In [24]:
p({'text': 'I went to the store', 'text_pair': 'I am a bird'})

{'label': 'LABEL_1', 'score': 0.7081083655357361}

In [25]:
p({'text': 'Elena got the job :)', 'text_pair': 'A stalker got here to try to understand if Elena deserved the job ;)'})

{'label': 'LABEL_1', 'score': 0.9929063320159912}
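The pipeline reports generic names like `LABEL_1` because the saved model carries no class names. A small sketch of mapping them back by hand, assuming the default `LABEL_<index>` naming and the label order shown earlier by `raw_datasets['train'].features`; `readable_label` is a hypothetical helper.

```python
# Order taken from the ClassLabel feature of the RTE dataset.
label_names = ["entailment", "not_entailment"]


def readable_label(prediction, names=label_names):
    """Translate a pipeline output such as {'label': 'LABEL_1', ...} into a class name."""
    index = int(prediction["label"].rsplit("_", 1)[1])
    return names[index]


# The two pipeline outputs from the cells above.
outputs = [
    {"label": "LABEL_1", "score": 0.7081083655357361},
    {"label": "LABEL_1", "score": 0.9929063320159912},
]
decoded = [readable_label(o) for o in outputs]
print(decoded)
```

An alternative that avoids the manual step is to pass `id2label` and `label2id` mappings when loading the model, so that they are saved in its config; that is not done in this notebook.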

**Conclusion:** As the "entailment" definition explains:

_Sentence A entails Sentence B if, necessarily, whenever Sentence A is true, Sentence B must also be true. It's a strong logical connection. It's not just that B is likely to be true if A is true; it must be true._

The pipeline returned `LABEL_1` with a score of 0.99 for the last pair of sentences, and index 1 in the dataset's label names is `not_entailment`. I'll leave it to the reader to conclude!