# Fine-tuning BERT

In this example, we fine-tune BERTimbau (a BERT model pre-trained on Brazilian Portuguese) for the Natural Language Inference (NLI) task. In this task, each example is a pair of texts: a premise and a hypothesis. The objective is to determine whether the hypothesis follows logically from the premise (entailment). To fine-tune the model, we use ASSIN2, a prominent NLI dataset in Portuguese. It comprises a few thousand examples, each annotated with one of two classes: ENTAILMENT or NONE.



## We install the packages

In [1]:
!pip install transformers[torch] datasets evaluate



## We load the ASSIN2 dataset from the Hugging Face Hub

In [2]:
from datasets import load_dataset
dataset = load_dataset("assin2")

In [3]:
# The Trainer expects the target column to be named "label"
dataset = dataset.rename_columns({'entailment_judgment': 'label'})

## We tokenize the texts

In [4]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased",
                                          model_max_length=512)

def tokenize_function(examples):
  # Passing premises and hypotheses as two arguments encodes each pair as one sequence
  return tokenizer(examples["premise"], examples["hypothesis"],
                   truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
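
To make the pair encoding concrete, the sketch below mimics the layout that BERT-style tokenizers produce for a premise/hypothesis pair: `[CLS] premise [SEP] hypothesis [SEP]`, with `token_type_ids` set to 0 for the first segment and 1 for the second. It is only a toy: `encode_pair` is a hypothetical helper that splits on whitespace instead of using the WordPiece vocabulary, and the two sentences are made up rather than taken from ASSIN2, so the tokens will not match what BERTimbau's tokenizer returns.

```python
def encode_pair(premise, hypothesis):
    """Toy illustration of BERT's sentence-pair layout (whitespace tokens, not WordPiece)."""
    first = ["[CLS]"] + premise.split() + ["[SEP]"]   # segment 0: premise
    second = hypothesis.split() + ["[SEP]"]           # segment 1: hypothesis
    return {
        "tokens": first + second,
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * (len(first) + len(second)),
    }

# Made-up example pair (not from the dataset)
encoded = encode_pair("Um homem está dormindo", "Uma pessoa dorme")
print(encoded["tokens"])
print(encoded["token_type_ids"])
```

The real tokenizer returns `input_ids` (integers) instead of string tokens, but the segment layout is the same.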


## We instantiate the data collator

In [5]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
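
As a rough illustration of what dynamic padding does, the sketch below pads each batch only to the length of its longest sequence and builds the matching attention mask. `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_token_id=0` is an assumption; the real collator reads the padding id and padding side from the tokenizer.

```python
def pad_batch(batch, pad_token_id=0):
    """Right-pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_token_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks real tokens, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Three made-up sequences of different lengths
batch = [[101, 7, 8, 102], [101, 7, 102], [101, 7, 8, 9, 10, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to `model_max_length` keeps short batches small, which is why the tokenization step above only truncates and leaves padding to the collator.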

## We use a smaller sample of the training set to reduce training time

In [6]:
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
eval_dataset = tokenized_datasets["validation"].shuffle(seed=42)

## We instantiate the pre-trained BERTimbau model

In [7]:
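from transformers import AutoModelForSequenceClassification, AutoConfig, set_seed
# Seed the random number generators so the new classification head is initialized reproducibly
# (a `seed` keyword passed to the config is not a model-config field and does not seed anything)
set_seed(1)
label2id = {'NONE': 0, 'ENTAILMENT': 1}
id2label = {0: 'NONE', 1: 'ENTAILMENT'}
config = AutoConfig.from_pretrained("neuralmind/bert-base-portuguese-cased",
                                    label2id=label2id, id2label=id2label,
                                    num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained("neuralmind/bert-base-portuguese-cased",
                                                           config=config)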

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at neuralmind/bert-base-portuguese-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## We instantiate the Trainer and set up the evaluation metrics.

In [8]:
from transformers import TrainingArguments
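# Evaluate and save a checkpoint at the end of every epoch.
# Note: recent transformers releases rename `evaluation_strategy` to `eval_strategy`.
training_args = TrainingArguments(output_dir="test_trainer", num_train_epochs=3.0,
                                  evaluation_strategy='epoch', save_strategy='epoch')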

In [9]:
import numpy as np
import evaluate
# `datasets.load_metric` is deprecated; the `evaluate` package installed above provides the same metric
metric = evaluate.load("accuracy")


In [10]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
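
To see what this function computes, here is a minimal sketch that applies the same logic (argmax over the logits, then the fraction of correct predictions) to a few made-up logits, without the `evaluate` library. The helper names `argmax` and `accuracy` and the toy values are assumptions for illustration only.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Four made-up examples: columns are the scores for NONE (0) and ENTAILMENT (1)
toy_logits = [[2.0, -1.0], [0.3, 1.2], [-0.5, 0.1], [1.5, 0.2]]
toy_labels = [0, 1, 0, 0]
result = accuracy(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```

The third example is predicted as class 1 while its label is 0, so three of the four predictions are correct.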

In [11]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

## We train the model

In [12]:
trained = trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.556967        | 0.77     |
| 2     | No log        | 0.332104        | 0.896    |
| 3     | No log        | 0.388584        | 0.9      |
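
The "No log" entries in the training-loss column are most likely due to the logging interval: the Trainer reports the training loss every `logging_steps` optimizer steps, and this run finishes before the first report. The arithmetic below is a sketch that assumes the `TrainingArguments` defaults were left unchanged (batch size 8 per device, `logging_steps=500`), a single device, and no gradient accumulation.

```python
import math

train_examples = 1000        # size of the training sample selected above
per_device_batch_size = 8    # TrainingArguments default (assumed unchanged)
num_devices = 1              # assumption: a single GPU or CPU
num_epochs = 3
logging_steps = 500          # TrainingArguments default (assumed unchanged)

steps_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)   # 125 375
print(total_steps < logging_steps)    # True: training ends before the first loss log
```

Under these assumptions, passing a smaller `logging_steps` value or `logging_strategy='epoch'` to `TrainingArguments` should fill in the column.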


## We save the model

In [13]:
trainer.save_model("Bertinho")

## We load and use the model

In [14]:
# Load from the directory written by `save_model` (relative to the current working directory)
bertinho = AutoModelForSequenceClassification.from_pretrained("Bertinho")

In [15]:
from transformers import pipeline
classifier = pipeline(task='text-classification', model=bertinho, tokenizer=tokenizer)
classifier("Não")

[{'label': 'NONE', 'score': 0.9976041913032532}]
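
The `label` and `score` in this output come from the model's two logits: for a two-class model, the pipeline applies a softmax and reports the most probable class using `id2label`. The sketch below reproduces that post-processing in plain Python; the logits are made up and `to_prediction` is a hypothetical helper, not part of the library.

```python
import math

id2label = {0: "NONE", 1: "ENTAILMENT"}

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits):
    """Pick the most probable class and report its label and probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

prediction = to_prediction([3.1, -2.9])  # made-up logits
print(prediction)
```

Since the model was fine-tuned on premise/hypothesis pairs, a single word such as "Não" is not a representative input. The text-classification pipeline also accepts a pair as a dictionary, for example `classifier({"text": premise, "text_pair": hypothesis})`, where `premise` and `hypothesis` are your own strings.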