In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from datasets import load_from_disk
import evaluate

In [2]:
N_EPOCHS = 5
BATCH_SIZE = 8
LEARNING_RATE = 5e-5

# Fine-tuning a model

In [3]:
checkpoint = "bert-base-uncased"

In [16]:
%%time
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

CPU times: user 1.58 s, sys: 508 ms, total: 2.09 s
Wall time: 3.09 s


This warning appears because BERT has not been pretrained on sequence classification, so the head used during pretraining has been discarded and a new classification head (here with 5 output labels) has been added instead. The warnings indicate that some weights were not used (those of the dropped pretraining head) and that others were randomly initialized (those of the new head). The new head has to be trained before its predictions mean anything, which is what the rest of this notebook does.

## Load datasets

In [5]:
preprocessed_datasets = load_from_disk("ratings_dataset")
preprocessed_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 2688
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 672
    })
})

The `Dataset.map()` method takes a `batched` argument that, if set to `True`, causes it to send a batch of examples to the map function at once (the batch size is configurable but defaults to 1,000). This is essential to unlock the speed of the “fast” tokenizers.

When we specify `batched=True` the function receives a dictionary with the fields of the dataset, but each value is now a list of values, and not just a single value.
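To make the shape of a batched call concrete, here is a minimal pure-Python sketch. The helper `toy_batched_map` and the function `add_length` are hypothetical stand-ins written for illustration, not the `datasets` implementation; they only mimic how a mapped function receives a dictionary of lists and returns new columns.

```python
def toy_batched_map(columns, function, batch_size=1000):
    """Toy stand-in for Dataset.map(..., batched=True); not the real implementation.

    Assumes `function` returns only new columns, one value per input example.
    """
    num_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict whose values are lists, one entry per example
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            result.setdefault(name, []).extend(values)
    return result


def add_length(batch):
    # batch["text"] is a list of strings, not a single string
    return {"num_words": [len(text.split()) for text in batch["text"]]}


toy_columns = {"label": [4, 0, 2], "text": ["great product", "bad", "it is fine"]}
mapped = toy_batched_map(toy_columns, add_length, batch_size=2)
print(mapped["num_words"])  # [2, 1, 3]
```

The mapped function never sees a single example: with `batch_size=2` it is called twice here, once with two texts and once with one.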

The function that is responsible for putting together samples inside a batch is called a collate function. It’s an argument you can pass when you build a `DataLoader`; the default is a function that converts your samples to PyTorch tensors and stacks them (recursively if your elements are lists, tuples, or dictionaries). This won’t be possible in our case, since our inputs are not all the same size: we deliberately skip padding at tokenization time so that it can be applied per batch, only as far as each batch needs. This avoids over-long inputs with a lot of padding and speeds up training noticeably.

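`padding="longest"` pads the sequences up to the length of the longest sequence in the batch passed to the tokenizer (to pad up to the model's maximum length instead, use `padding="max_length"`). Padding makes sure all sequences in a batch have the same length by appending a special padding token to the shorter ones.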
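The per-batch padding described above can be sketched in plain Python. This is a simplified illustration, not the code of `DataCollatorWithPadding`: the function `pad_collate` is hypothetical, the token ids in the samples are illustrative, and the sketch returns lists rather than tensors.

```python
PAD_TOKEN_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary


def pad_collate(samples, pad_token_id=PAD_TOKEN_ID):
    """Pad every sample to the longest sequence in this batch only."""
    max_len = max(len(sample["input_ids"]) for sample in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for sample in samples:
        n_real = len(sample["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(sample["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding that the model should ignore
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        batch["labels"].append(sample["label"])
    return batch


samples = [
    {"input_ids": [101, 7592, 102], "label": 3},
    {"input_ids": [101, 7592, 2088, 999, 102], "label": 1},
]
batch = pad_collate(samples)
print(batch["input_ids"][0])       # [101, 7592, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Because the target length is recomputed for each batch, a batch of short reviews stays short even if the dataset contains much longer ones.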

`truncation=True` will truncate the sequences that are longer than the model max length. With Transformer models, there is a limit to the lengths of the sequences we can pass the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. One solution is to truncate the sequences.
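As a rough illustration of what truncation does to a single sequence, the sketch below cuts the content tokens so that the total length, including the two special tokens BERT adds, stays within a limit. The function `truncate_with_special_tokens` is hypothetical, and the real tokenizer supports several truncation strategies that this sketch ignores.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] in the bert-base-uncased vocabulary


def truncate_with_special_tokens(token_ids, max_length):
    """Keep at most max_length ids in total, reserving two slots for [CLS] and [SEP]."""
    body = token_ids[: max_length - 2]
    return [CLS_ID] + body + [SEP_ID]


long_review = list(range(1000, 1010))  # 10 illustrative content token ids
truncated = truncate_with_special_tokens(long_review, max_length=6)
print(truncated)  # [101, 1000, 1001, 1002, 1003, 102]

short_review = [1000, 1001]
print(truncate_with_special_tokens(short_review, max_length=6))  # [101, 1000, 1001, 102]
```

Sequences that already fit are left untouched, so truncation only costs information for the longest inputs.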

The tokenizer object can handle the conversion to specific framework tensors, which can then be directly sent to the model.

In [17]:
def tokenize_function(review):
    return tokenizer(review["text"], truncation=True)  # padding="longest"

tokenized_datasets = preprocessed_datasets.map(tokenize_function, batched=True, batch_size=512)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)



In [18]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2688
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 672
    })
})

## Training with the Trainer API

In [9]:
from transformers import TrainingArguments, Trainer
import numpy as np

The first step before we can define our `Trainer` is to create a `TrainingArguments` object that will contain all the hyperparameters the `Trainer` will use for training and evaluation. The only argument you have to provide is a directory where the trained model will be saved, along with the checkpoints written along the way. For everything else you can keep the defaults, which should work well for basic fine-tuning; here we also set the evaluation strategy, the number of epochs, the batch sizes, and the learning rate.

In [None]:
training_args = TrainingArguments(output_dir="models/bert-trainer",
                                  evaluation_strategy='epoch',
                                  num_train_epochs=N_EPOCHS,
                                  per_device_train_batch_size=BATCH_SIZE,
                                  per_device_eval_batch_size=BATCH_SIZE,
                                  learning_rate=LEARNING_RATE)

In [None]:
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
recall_metric = evaluate.load("recall")
precision_metric = evaluate.load("precision")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    preds = np.argmax(logits, axis=-1)

    results = {}
    results.update(accuracy_metric.compute(predictions=preds, references=labels))
    results.update(f1_metric.compute(predictions=preds, references=labels, average="weighted"))
    results.update(recall_metric.compute(predictions=preds, references=labels, average="weighted"))
    results.update(precision_metric.compute(predictions=preds, references=labels, average="weighted"))
    return results

When we pass the tokenizer, the default `data_collator` used by the `Trainer` is a `DataCollatorWithPadding` like the one defined previously, so the line `data_collator=data_collator` in this call is optional.
- Open question: is it still necessary to initialize `data_collator`, or can that be skipped as well?

In [None]:
trainer = Trainer(model, training_args,
                  train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["validation"],
                  data_collator=data_collator,
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)

In [None]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Recall | Precision |
|---|---|---|---|---|---|---|
| 1 | No log | 1.093178 | 0.525298 | 0.521761 | 0.525298 | 0.557235 |
| 2 | 1.175100 | 1.153612 | 0.534226 | 0.526796 | 0.534226 | 0.555456 |
| 3 | 0.699700 | 1.329214 | 0.547619 | 0.548635 | 0.547619 | 0.568604 |
| 4 | 0.699700 | 1.659811 | 0.544643 | 0.550359 | 0.544643 | 0.572937 |
| 5 | 0.243400 | 1.971264 | 0.553571 | 0.558719 | 0.553571 | 0.576341 |


TrainOutput(global_step=1680, training_loss=0.6402489457811628, metrics={'train_runtime': 379.4533, 'train_samples_per_second': 35.419, 'train_steps_per_second': 4.427, 'total_flos': 860120874051888.0, 'train_loss': 0.6402489457811628, 'epoch': 5.0})
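In the results above, the Recall value equals the Accuracy value in every epoch. This is expected with `average="weighted"`: each class recall is weighted by the share of examples in that class, so the class sizes cancel and the sum reduces to correct predictions divided by total examples. The sketch below checks this on made-up labels; the helper functions are illustrative and do not use the `evaluate` library.

```python
from collections import Counter


def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def weighted_recall(preds, labels):
    support = Counter(labels)  # number of true examples per class
    total = 0.0
    for cls, n_cls in support.items():
        true_positives = sum(p == y == cls for p, y in zip(preds, labels))
        # weight (n_cls / N) times recall (TP / n_cls): n_cls cancels out
        total += (n_cls / len(labels)) * (true_positives / n_cls)
    return total


toy_labels = [0, 0, 0, 1, 1, 2, 3, 4, 4, 4]
toy_preds = [0, 1, 0, 1, 2, 2, 3, 4, 0, 4]

print(round(accuracy(toy_preds, toy_labels), 4))         # 0.7
print(round(weighted_recall(toy_preds, toy_labels), 4))  # 0.7
```

Weighted recall therefore adds no information beyond accuracy; a macro average (`average="macro"`) would be one way to see per-class behaviour that is not dominated by the larger classes.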

## Training with PyTorch

In [8]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from transformers import get_scheduler
from tqdm.auto import tqdm
import copy

In [19]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

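Before building the dataloaders, the tokenized datasets need two adjustments:
- Remove the columns corresponding to values the model does not expect (`text`).
- Set the format of the datasets so they return PyTorch tensors instead of lists.

The `label` column does not need to be renamed by hand: `DataCollatorWithPadding` returns it under the key `labels`, which is the argument name the model expects.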

In [20]:
tokenized_datasets = tokenized_datasets.remove_columns(['text'])
tokenized_datasets.set_format("torch")
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2688
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 672
    })
})

In [21]:
train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=BATCH_SIZE, collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=BATCH_SIZE, collate_fn=data_collator)

dataLoaders = {
    'train': train_dataloader,
    'validation': eval_dataloader
}

dataLoaders

{'train': <torch.utils.data.dataloader.DataLoader at 0x7fe5661b27f0>,
 'validation': <torch.utils.data.dataloader.DataLoader at 0x7fe5661b22b0>}

To quickly check there is no mistake in the data processing, we can inspect a batch like this.

In [None]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 112]),
 'token_type_ids': torch.Size([8, 112]),
 'attention_mask': torch.Size([8, 112]),
 'labels': torch.Size([8])}

The actual shapes will vary since we set `shuffle=True` for the training dataloader and we are padding to the maximum length inside the batch.

To make sure that everything will go smoothly during training, we pass our batch to the model and check that it returns a loss and logits of shape (batch size, number of labels):

In [None]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(1.8351, grad_fn=<NllLossBackward0>) torch.Size([8, 5])
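As a rough sanity check on the loss value above (an addition, not part of the original run): a classifier that assigns equal probability to each of the 5 classes has a cross-entropy loss of ln 5. A freshly initialized head should land in the same range, and 1.8351 does.

```python
import math

num_labels = 5
# Cross-entropy of a uniform prediction over num_labels classes: -ln(1 / num_labels)
uniform_loss = math.log(num_labels)
print(round(uniform_loss, 4))  # 1.6094
```

An initial loss far above this value would be a hint that something is off, for example in the label encoding.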


The learning rate scheduler used by default is just a linear decay from the maximum value (5e-5) to 0. To properly define it, we need to know the number of training steps we will take, which is the number of epochs we want to run multiplied by the number of training batches (which is the length of our training dataloader).

In [12]:
num_training_steps = N_EPOCHS * len(train_dataloader)
print(num_training_steps)

1680


In [22]:
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
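The step count and the linear decay described above can be reproduced with plain arithmetic. This is a sketch under two assumptions: the `DataLoader` keeps the last partial batch (its default, `drop_last=False`), and the schedule has no warmup, as configured here. The function `linear_decay_lr` is illustrative and not part of `transformers`.

```python
import math

N_EPOCHS, BATCH_SIZE, LEARNING_RATE = 5, 8, 5e-5
num_train_rows = 2688  # size of the training split

# Batches per epoch, counting a final partial batch if there is one
steps_per_epoch = math.ceil(num_train_rows / BATCH_SIZE)
num_training_steps = N_EPOCHS * steps_per_epoch
print(steps_per_epoch, num_training_steps)  # 336 1680


def linear_decay_lr(step, base_lr=LEARNING_RATE, total_steps=num_training_steps):
    """Learning rate after `step` optimizer updates, decaying linearly to 0 without warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


for step in (0, 840, 1680):
    print(step, linear_decay_lr(step))
```

The learning rate starts at 5e-5, is halved at the midpoint (step 840), and reaches 0 at the last step, which is why the scheduler must know the total number of steps in advance.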

### The training loop

#### Manual metrics

In [14]:
progress_bar = tqdm(range(num_training_steps))

  0%|          | 0/1680 [00:00<?, ?it/s]

In [15]:
patience = 7  # Number of epochs with no improvement after which training will be stopped (never reached here, since N_EPOCHS = 5)
early_stopping = False
best_loss = float('inf')
epochs_with_no_improvement = 0

for epoch in range(N_EPOCHS):

    epoch_results = {}

    for phase in ["train", "validation"]:
        # This sets the execution mode and informs layers (e.g., Dropout, BatchNorm) designed to behave differently during training and evaluation
        if phase == "train":
            model.train()
        else:
            model.eval()

        running_loss = 0.0
        correct_in_dataset = 0
        
        # For each batch update model parameters / weights
        for batch in dataLoaders[phase]:
            batch = {k: v.to(device) for k, v in batch.items()}
            
            optimizer.zero_grad()               # Sets the gradients of all optimized tensors to zero. Same as model.zero_grad() if all model parameters are in the optimizer
            with torch.set_grad_enabled(phase == "train"):  # Track gradients only in the training phase
                outputs = model(**batch)
            loss = outputs.loss
            
            if phase == "train":
                loss.backward()                 # Computes the gradient of loss w.r.t all the parameters in loss that have requires_grad=True and store them in x.grad (x.grad += dloss/dx)
                optimizer.step()                # Performs a single optimization step (parameter update based on the gradients)
                lr_scheduler.step()
                progress_bar.update(1)
            
            running_loss += loss.item()            
            predictions = torch.argmax(outputs.logits, dim=-1)
            
            correct_in_dataset += (predictions == batch['labels']).sum().item()
        
        if phase == "train":
            epoch_results['train_loss'] = running_loss/len(dataLoaders[phase])
            epoch_results['train_accuracy'] = correct_in_dataset/len(tokenized_datasets[phase])

        else:
            epoch_results['val_loss'] = running_loss/len(dataLoaders[phase])
            epoch_results['val_accuracy'] = correct_in_dataset/len(tokenized_datasets[phase])
            
            if epoch_results['val_loss'] < best_loss:
                best_loss = epoch_results['val_loss']
                best_model = copy.deepcopy(model.state_dict())
                epochs_with_no_improvement = 0
            else:
                epochs_with_no_improvement += 1

            if epochs_with_no_improvement == patience:
                model.load_state_dict(best_model)
                early_stopping = True

    print(f"Epoch {epoch+1}/{N_EPOCHS} | loss: {epoch_results['train_loss']:.4} - accuracy: {epoch_results['train_accuracy']:.4} - val_loss: {epoch_results['val_loss']:.4} - val_accuracy: {epoch_results['val_accuracy']:.4}")

    if early_stopping:
        print('Early stopping!')
        break

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch 1/5 | loss: 1.333 - accuracy: 0.3925 - val_loss: 1.126 - val_accuracy: 0.4985
Epoch 2/5 | loss: 0.963 - accuracy: 0.5893 - val_loss: 1.059 - val_accuracy: 0.5387
Epoch 3/5 | loss: 0.6274 - accuracy: 0.7407 - val_loss: 1.22 - val_accuracy: 0.5432
Epoch 4/5 | loss: 0.3001 - accuracy: 0.8955 - val_loss: 1.505 - val_accuracy: 0.5699
Epoch 5/5 | loss: 0.1178 - accuracy: 0.9702 - val_loss: 1.71 - val_accuracy: 0.5536
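The early-stopping bookkeeping in the loop can be replayed on the validation losses printed above, without touching the model. The function `simulate_early_stopping` is a hypothetical helper that mirrors the counter logic of the loop; it returns the epoch with the best validation loss and the epoch at which training would stop (`None` if it never does).

```python
def simulate_early_stopping(val_losses, patience):
    best_loss = float("inf")
    best_epoch = None
    epochs_with_no_improvement = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_with_no_improvement = 0
        else:
            epochs_with_no_improvement += 1
        if epochs_with_no_improvement == patience:
            return best_epoch, epoch  # training would stop after this epoch
    return best_epoch, None  # patience was never exhausted


val_losses = [1.126, 1.059, 1.22, 1.505, 1.71]  # val_loss per epoch from the run above

print(simulate_early_stopping(val_losses, patience=7))  # (2, None)
print(simulate_early_stopping(val_losses, patience=2))  # (2, 4)
```

With `patience = 7` and only 5 epochs, early stopping cannot trigger, so the weights from epoch 2 (the lowest validation loss) are saved in `best_model` but never loaded back: the loop restores them only when early stopping fires. A smaller patience, or loading `best_model` after the loop, would be one way to end up with the best checkpoint.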


#### HuggingFace metrics

In [23]:
progress_bar = tqdm(range(num_training_steps))

  0%|          | 0/1680 [00:00<?, ?it/s]

In [24]:
patience = 7  # Number of epochs with no improvement after which training will be stopped (never reached here, since N_EPOCHS = 5)
early_stopping = False
best_loss = float('inf')
epochs_with_no_improvement = 0

for epoch in range(N_EPOCHS):

    metrics = {
        'train': {
            'accuracy': evaluate.load("accuracy")
         },
        'validation': {
            'accuracy': evaluate.load("accuracy")
        }
    }

    epoch_results = {}

    for phase in ["train", "validation"]:
        # This sets the execution mode and informs layers (e.g., Dropout, BatchNorm) designed to behave differently during training and evaluation
        if phase == "train":
            model.train()
        else:
            model.eval()

        running_loss = 0.0
        
        # For each batch update model parameters / weights
        for batch in dataLoaders[phase]:
            batch = {k: v.to(device) for k, v in batch.items()}
            
            optimizer.zero_grad()               # Sets the gradients of all optimized tensors to zero. Same as model.zero_grad() if all model parameters are in the optimizer
            with torch.set_grad_enabled(phase == "train"):  # Track gradients only in the training phase
                outputs = model(**batch)
            loss = outputs.loss
            
            if phase == "train":
                loss.backward()                 # Computes the gradient of loss w.r.t all the parameters in loss that have requires_grad=True and store them in x.grad (x.grad += dloss/dx)
                optimizer.step()                # Performs a single optimization step (parameter update based on the gradients)
                lr_scheduler.step()
                progress_bar.update(1)

            running_loss += loss.item()            
            predictions = torch.argmax(outputs.logits, dim=-1)
            metrics[phase]['accuracy'].add_batch(predictions=predictions, references=batch["labels"])
        
        if phase == "train":
            epoch_results['train_loss'] = running_loss/len(dataLoaders[phase])
            epoch_results['train_accuracy'] = metrics[phase]['accuracy'].compute()['accuracy']

        else:
            epoch_results['val_loss'] = running_loss/len(dataLoaders[phase])
            epoch_results['val_accuracy'] = metrics[phase]['accuracy'].compute()['accuracy']
            
            if epoch_results['val_loss'] < best_loss:
                best_loss = epoch_results['val_loss']
                best_model = copy.deepcopy(model.state_dict())
                epochs_with_no_improvement = 0
            else:
                epochs_with_no_improvement += 1

            if epochs_with_no_improvement == patience:
                model.load_state_dict(best_model)
                early_stopping = True

    print(f"Epoch {epoch+1}/{N_EPOCHS} | loss: {epoch_results['train_loss']:.4} - accuracy: {epoch_results['train_accuracy']:.4} - val_loss: {epoch_results['val_loss']:.4} - val_accuracy: {epoch_results['val_accuracy']:.4}")

    if early_stopping:
        print('Early stopping!')
        break

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch 1/5 | loss: 1.278 - accuracy: 0.4196 - val_loss: 1.108 - val_accuracy: 0.5179
Epoch 2/5 | loss: 0.9151 - accuracy: 0.6127 - val_loss: 1.066 - val_accuracy: 0.5372
Epoch 3/5 | loss: 0.5985 - accuracy: 0.7708 - val_loss: 1.26 - val_accuracy: 0.5461
Epoch 4/5 | loss: 0.3131 - accuracy: 0.8943 - val_loss: 1.547 - val_accuracy: 0.5357
Epoch 5/5 | loss: 0.1285 - accuracy: 0.9654 - val_loss: 1.646 - val_accuracy: 0.5461
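The loop above feeds predictions to the metric batch by batch with `add_batch` and asks for the final value once per phase with `compute`. A minimal pure-Python stand-in for that pattern is sketched below; the class `RunningAccuracy` is hypothetical and not part of the `evaluate` library, and resetting the counters inside `compute` is a design choice of this sketch.

```python
class RunningAccuracy:
    """Hypothetical stand-in that mimics the add_batch/compute usage pattern."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Accumulate raw counts so batches of different sizes are weighted correctly
        self.correct += sum(int(p == r) for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        result = {"accuracy": self.correct / self.total}
        # Reset the counters so the object can be reused for the next phase or epoch
        self.correct, self.total = 0, 0
        return result


metric = RunningAccuracy()
metric.add_batch(predictions=[0, 1, 2], references=[0, 1, 1])  # 2 of 3 correct
metric.add_batch(predictions=[3], references=[3])              # 1 of 1 correct
final = metric.compute()
print(final)  # {'accuracy': 0.75}
```

Accumulating counts gives 3 correct out of 4, or 0.75, whereas averaging the two per-batch accuracies would give about 0.83. The two only agree when all batches have the same size.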
