Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# TRANSFORMERS

In this notebook we will explore [Hugging Face Transformers](https://huggingface.co/docs/transformers/index).
You may also want to check the [Hugging Face course](https://huggingface.co/course/), which explains how to use this technology in much greater depth.

Training transformer models is computationally expensive. Hugging Face makes available several pretrained [models](https://huggingface.co/models) that can be used as is, or fine-tuned for a specific NLP task, such as sentence classification. That's what we'll do in this notebook.

Hugging Face also makes available several [datasets](https://huggingface.co/datasets) that can be used to train or fine-tune a model.

See:
- https://huggingface.co/docs/transformers/tasks/sequence_classification#preprocess
- https://huggingface.co/docs/transformers/training#prepare-a-dataset
- https://huggingface.co/docs/transformers/accelerate

## Loading a dataset

In this notebook, we'll start by using a local dataset (instead of using a dataset stored at Hugging Face).
Let's load data for our classification task.

In [1]:
import pandas as pd

# Importing the dataset
dataset = pd.read_excel("./data/OpArticles_ADUs.xlsx")
dataset = dataset.drop(columns=['article_id', 'annotator', 'node','ranges'])
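# Encode the five class names as integers 0..4 (assigning back avoids an in-place edit on a column copy)
dataset['label'] = dataset['label'].replace(['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy'], [0, 1, 2, 3, 4])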

dataset.head()

| | tokens | label |
|---|---|---|
| 0 | O facto não é apenas fruto da ignorância | 0 |
| 1 | havia no seu humor mais jornalismo (mais inves... | 0 |
| 2 | É tudo cómico na FIFA | 0 |
| 3 | o que todos nós permitimos que esta organizaçã... | 0 |
| 4 | não nos fazem rir à custa dos poderosos | 0 |


For ease of usage with Transformer models, we convert the dataset into a Hugging Face dataset and split it into train, validation and test sets.

In [7]:
from datasets import Dataset

dataset_hf = Dataset.from_pandas(dataset)

In [8]:
from datasets import DatasetDict

# 90% train, 10% test+validation
train_test = dataset_hf.train_test_split(test_size=0.1)

# Split the 10% test+validation set in half test, half validation
valid_test = train_test['test'].train_test_split(test_size=0.5)

# gather the three partitions into a single DatasetDict
train_valid_test_dataset = DatasetDict({
    'train': train_test['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']
})

In [9]:
train_valid_test_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'label'],
        num_rows: 15068
    })
    validation: Dataset({
        features: ['tokens', 'label'],
        num_rows: 837
    })
    test: Dataset({
        features: ['tokens', 'label'],
        num_rows: 838
    })
})
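
The partition sizes shown above can be checked with a little arithmetic. The sketch below assumes that the test share is rounded up and the remainder goes to the training side, which reproduces the numbers in the output; the helper name `split_sizes` is made up for this illustration.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test), rounding the test share up (assumed rounding rule)."""
    n_test = math.ceil(test_fraction * n_rows)
    return n_rows - n_test, n_test

total_rows = 15068 + 837 + 838                      # rows across the three partitions above
n_train, n_heldout = split_sizes(total_rows, 0.1)   # 90% train, 10% held out
n_valid, n_test = split_sizes(n_heldout, 0.5)       # held-out part split in half
print(n_train, n_valid, n_test)
```

Note that the split is random, so the rows that land in each partition change between runs; `train_test_split` accepts a `seed` argument if a reproducible split is wanted.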

## Fine-tuning a pretrained model

As a starting example, we'll use a lighter BERT-based model. We will need to load:
- the [tokenizer](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer) (which is used to [preprocess](https://huggingface.co/docs/transformers/preprocessing) the data before it can be used by the model)
- the [model](https://huggingface.co/docs/transformers/autoclass_tutorial#automodel) itself

In [5]:
model_name = "neuralmind/bert-base-portuguese-cased" # or neuralmind/bert-large-portuguese-cased

### Tokenizer

We first load the tokenizer for our model:

In [6]:
from transformers import AutoTokenizer

# the keyword is model_max_length; it sets the length used when truncation=True
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)

Now we need to [preprocess](https://huggingface.co/docs/transformers/preprocessing) our data. We will do it for the three partitions (train, validation and test) in a single step. For that, we'll make use of [map](https://huggingface.co/docs/datasets/process#map) with the help of an auxiliary function.

In [7]:
def preprocess_function(sample):
    return tokenizer(sample["tokens"], truncation=True)

In [8]:
tokenized_dataset = train_valid_test_dataset.map(preprocess_function, batched=True)

In [9]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 15068
    })
    validation: Dataset({
        features: ['tokens', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 837
    })
    test: Dataset({
        features: ['tokens', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 838
    })
})

When preprocessing the text, we have actually translated the text into numbers, which is known as [encoding](https://huggingface.co/course/chapter2/4?fw=pt#encoding).

In [10]:
tokenized_dataset['train'][321]

{'tokens': 'Equiparar o reconhecimento do direito do criador impedir o aproveitamento económico não autorizado da sua própria expressão artística a qualquer forma de censura é um erro lógico, além de um argumento demagógico',
 'label': 2,
 'input_ids': [101, 6717, 1038, 900, 146, 5035, 171, 2368, 171, 9857, 8039, 146, 18059, 16410, 346, 19115, 180, 327, 2288, 4587, 7941, 123, 1569, 547, 125, 11638, 253, 222, 7441, 219, 9391, 117, 1166, 125, 222, 10438, 6461, 22293, 9391, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Encoding is done in a two-step process: tokenization, followed by conversion to input IDs.

In [11]:
tokens = tokenizer.tokenize(tokenized_dataset['train'][321]['tokens'])
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['Equ', '##ipa', '##rar', 'o', 'reconhecimento', 'do', 'direito', 'do', 'criador', 'impedir', 'o', 'aproveitamento', 'económico', 'não', 'autorizado', 'da', 'sua', 'própria', 'expressão', 'artística', 'a', 'qualquer', 'forma', 'de', 'censura', 'é', 'um', 'erro', 'l', '##ógico', ',', 'além', 'de', 'um', 'argumento', 'dema', '##g', '##ógico']
[6717, 1038, 900, 146, 5035, 171, 2368, 171, 9857, 8039, 146, 18059, 16410, 346, 19115, 180, 327, 2288, 4587, 7941, 123, 1569, 547, 125, 11638, 253, 222, 7441, 219, 9391, 117, 1166, 125, 222, 10438, 6461, 22293, 9391]
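
To make the two steps concrete, here is a minimal sketch of greedy longest-match sub-word splitting followed by an ID lookup. The vocabulary `TOY_VOCAB` and its IDs are invented for illustration and are not those of the real checkpoint; the real tokenizer also deals with punctuation and other details that this sketch ignores.

```python
# Toy vocabulary with hypothetical IDs (not the IDs of the real model).
TOY_VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2,
             "Equ": 3, "##ipa": 4, "##rar": 5, "o": 6, "direito": 7}

def wordpiece(word, vocab):
    """Split one word into sub-word pieces, always taking the longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                      # no piece fits: the whole word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Step 1: tokenization. Step 2: conversion of tokens to IDs."""
    toks = [p for w in text.split() for p in wordpiece(w, vocab)]
    return toks, [vocab[t] for t in toks]

toy_tokens, toy_ids = encode("Equiparar o direito", TOY_VOCAB)
print(toy_tokens)
print(toy_ids)
```

With this toy vocabulary, "Equiparar" is split into the same three pieces as in the real output above, although the IDs differ because the vocabulary is made up.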


The tokenizer actually adds two special tokens when preprocessing: one at the beginning, and one at the end.

In [12]:
inputs = tokenizer(tokenized_dataset['train'][321]['tokens'])
inputs['input_ids']   # or inputs.input_ids

[101, 6717, 1038, 900, 146, 5035, 171, 2368, 171, 9857, 8039, 146, 18059, 16410, 346, 19115, 180, 327, 2288, 4587, 7941, 123, 1569, 547, 125, 11638, 253, 222, 7441, 219, 9391, 117, 1166, 125, 222, 10438, 6461, 22293, 9391, 102]

We can [decode](https://huggingface.co/course/chapter2/4?fw=pt#decoding) the sequence to check what these tokens are:

In [13]:
tokenizer.decode(inputs['input_ids'])

'[CLS] Equiparar o reconhecimento do direito do criador impedir o aproveitamento económico não autorizado da sua própria expressão artística a qualquer forma de censura é um erro lógico, além de um argumento demagógico [SEP]'

As with encoding, we can decode in two separate steps:

In [14]:
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'])
print(tokens)
print(tokenizer.convert_tokens_to_string(tokens))

['[CLS]', 'Equ', '##ipa', '##rar', 'o', 'reconhecimento', 'do', 'direito', 'do', 'criador', 'impedir', 'o', 'aproveitamento', 'económico', 'não', 'autorizado', 'da', 'sua', 'própria', 'expressão', 'artística', 'a', 'qualquer', 'forma', 'de', 'censura', 'é', 'um', 'erro', 'l', '##ógico', ',', 'além', 'de', 'um', 'argumento', 'dema', '##g', '##ógico', '[SEP]']
[CLS] Equiparar o reconhecimento do direito do criador impedir o aproveitamento económico não autorizado da sua própria expressão artística a qualquer forma de censura é um erro lógico, além de um argumento demagógico [SEP]
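
The second decoding step essentially glues the `##` continuation pieces back onto the preceding token. The sketch below is a simplified stand-in for that step, not the library's implementation: the helper name `merge_wordpieces` is made up, and the tidying of spaces around punctuation that the real tokenizer performs is left out.

```python
def merge_wordpieces(pieces):
    """Join tokens with spaces, then attach '##' continuation pieces to the previous token."""
    return " ".join(pieces).replace(" ##", "")

toy_pieces = ['[CLS]', 'Equ', '##ipa', '##rar', 'o', 'reconhecimento', '[SEP]']
merged = merge_wordpieces(toy_pieces)
print(merged)
```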


### Loading the model

We now load the pretrained model:

In [15]:
from transformers import AutoModel

model = AutoModel.from_pretrained(model_name)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Loading the model in this way only gets us the base Transformer module: given some inputs, we obtain the hidden state of the model -- a high-dimensional vector representing the "contextual understanding" of that input by the Transformer model.

In other words, we are leaving out the *head* of the model, which is needed for whatever NLP task we want to address.

Let's look at a particular example:

In [16]:
inputs = tokenizer(train_valid_test_dataset['train'][321]['tokens'], padding=True, truncation=True, return_tensors="pt")

print(train_valid_test_dataset['train'][321])
print(inputs['input_ids'])
print(inputs['input_ids'].shape)

outputs = model(**inputs)
print(outputs.last_hidden_state)   # or outputs["last_hidden_state"]

print(outputs.last_hidden_state.shape)

{'tokens': 'Equiparar o reconhecimento do direito do criador impedir o aproveitamento económico não autorizado da sua própria expressão artística a qualquer forma de censura é um erro lógico, além de um argumento demagógico', 'label': 2}
tensor([[  101,  6717,  1038,   900,   146,  5035,   171,  2368,   171,  9857,
          8039,   146, 18059, 16410,   346, 19115,   180,   327,  2288,  4587,
          7941,   123,  1569,   547,   125, 11638,   253,   222,  7441,   219,
          9391,   117,  1166,   125,   222, 10438,  6461, 22293,  9391,   102]])
torch.Size([1, 40])
tensor([[[ 0.0079, -0.1170,  0.0945,  ..., -0.2194,  0.0439, -0.2412],
         [-0.2003, -0.6054,  0.8070,  ...,  0.3669,  0.1698, -0.4123],
         [ 0.0352, -0.7026,  0.5131,  ...,  0.8078,  0.1978, -0.2678],
         ...,
         [ 0.1231, -0.1651,  0.2532,  ..., -0.0884,  0.1565, -0.9631],
         [ 0.1218, -0.4501,  0.1388,  ...,  0.2905, -0.1073, -1.0777],
         [ 0.0112, -0.1124,  0.0967,  ..., -0.2200,  0.

As you can see, the hidden state representation has three dimensions:
- the *batch size* (in this case we are passing the model a single input sequence)
- the *sequence length*, that is, the number of tokens created by the tokenizer when encoding each input sequence
- the *hidden state size*, which is the vector dimension of each token (768 in the case of this model)

Since we want to use the model for classification, we should load it with an appropriate classification head:

In [17]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the

Now the outputs of the model are quite different: we get *logits*, one raw score for each class.

In [18]:
outputs = model(**inputs)
print(outputs.logits)
print(outputs.logits.shape)

tensor([[-0.2825,  0.2218, -0.1212, -0.1823,  0.1341]],
       grad_fn=<AddmmBackward0>)
torch.Size([1, 5])


Logits are raw, unnormalized scores output by the last layer of the model. To be converted to probabilities, they need to go through a *softmax* layer.

In [19]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

model.config.id2label

tensor([[0.1550, 0.2566, 0.1821, 0.1713, 0.2351]], grad_fn=<SoftmaxBackward0>)


{0: 'LABEL_0', 1: 'LABEL_1', 2: 'LABEL_2', 3: 'LABEL_3', 4: 'LABEL_4'}
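
The softmax step can also be reproduced by hand. The sketch below applies the formula to the logits printed earlier (rounded to four decimals, so the last digit of the result may differ slightly from the tensor above); the function name `softmax` is just a local helper.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # subtracting the max keeps exp() numerically stable
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-0.2825, 0.2218, -0.1212, -0.1823, 0.1341]
probs = softmax(example_logits)
print([round(p, 4) for p in probs])
print("most likely class:", probs.index(max(probs)))
```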

Now we can interpret the obtained values as probabilities, and identify the class to which the model assigns the highest probability for the input example.

Note, however, that for now the model is just guessing the output logits/probabilities: its classification head is newly initialized and it hasn't been trained with our dataset yet. To better see this behavior, ask the user for some input, feed it to the model, and check its predictions.

In [20]:
# your code here


### Fine-tuning

The next step is to [fine-tune](https://huggingface.co/docs/transformers/training) the model with our train data. To do so, we can make use of a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer).
There are several aspects of training that you can specify via [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

In [21]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from datasets import load_metric
import numpy as np

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch", # run validation at the end of each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
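
The data collator used above pads each batch only up to the length of its longest sequence, rather than padding the whole dataset to one fixed length. A minimal sketch of that idea follows; the helper `pad_batch` is hypothetical and the padding ID of 0 is an assumption made for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 0 marks padding positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 6717, 102], [101, 146, 5035, 171, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions.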

In [22]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 15068
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2826


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.0924 | 0.887649 | 0.634409 |
| 2 | 0.7794 | 0.945035 | 0.62724 |
| 3 | 0.6304 | 1.001289 | 0.617682 |

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16
Saving model checkpoint to ./results\checkpoint-942
Configuration saved in ./results\checkpoint-942\config.json
Model weights saved in ./results\checkpoint-942\pytorch_model.bin
tokenizer config file saved in ./results\checkpoint-942\tokenizer_config.json
Special tokens file saved in ./results\checkpoint-942\special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16
Sa

TrainOutput(global_step=2826, training_loss=0.8102845436954701, metrics={'train_runtime': 7226.8242, 'train_samples_per_second': 6.255, 'train_steps_per_second': 0.391, 'total_flos': 1151400880201488.0, 'train_loss': 0.8102845436954701, 'epoch': 3.0})
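
The number of optimization steps reported in the log follows from the dataset size, batch size and number of epochs. The sketch below assumes a single device, no gradient accumulation (as the log states), and that the last, smaller batch of each epoch is kept.

```python
import math

num_examples = 15068   # "Num examples" in the training log
batch_size = 16        # per_device_train_batch_size
num_epochs = 3         # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

This also explains the name of the first checkpoint folder, `checkpoint-942`, which is saved at the end of the first epoch.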

We can check the model's performance on the validation set.

In [23]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16


{'eval_loss': 0.8876485228538513,
 'eval_accuracy': 0.6344086021505376,
 'eval_runtime': 28.1871,
 'eval_samples_per_second': 29.694,
 'eval_steps_per_second': 1.88,
 'epoch': 3.0}
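
The evaluation figures match those logged for epoch 1, not epoch 3. This is consistent with `load_best_model_at_end=True`, assuming the default selection criterion (lowest validation loss) is in effect. The sketch below picks the best epoch from the values logged during training.

```python
# (epoch, training loss, validation loss, accuracy) as logged during training
history = [
    (1, 1.0924, 0.887649, 0.634409),
    (2, 0.7794, 0.945035, 0.627240),
    (3, 0.6304, 1.001289, 0.617682),
]

best_epoch, _, best_val_loss, best_accuracy = min(history, key=lambda row: row[2])
print(best_epoch, best_val_loss, best_accuracy)
```

The training loss keeps falling while the validation loss rises after the first epoch, which is a typical sign of overfitting.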

And more importantly, we can check how the model fares in our test set.

In [24]:
trainer.predict(test_dataset=tokenized_dataset["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 838
  Batch size = 16


PredictionOutput(predictions=array([[ 1.2462692 ,  0.5158316 , -1.884579  ,  0.6054781 , -0.56966305],
       [ 1.5887785 , -1.7311387 , -0.61572343, -0.2537465 ,  1.1889675 ],
       [ 1.8551832 , -0.6577077 , -0.22144026,  2.1115189 , -3.4015794 ],
       ...,
       [ 2.4489844 , -0.7960056 ,  0.21268278,  0.6089523 , -2.453919  ],
       [ 1.9917784 ,  0.24346623, -0.32904103,  0.6129986 , -2.1276066 ],
       [ 2.2117815 , -1.9057046 ,  1.1338502 ,  0.91913486, -2.7005472 ]],
      dtype=float32), label_ids=array([3, 4, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 2, 1, 3, 0, 1, 2, 0, 2, 0, 4,
       0, 0, 3, 0, 4, 0, 3, 3, 2, 0, 0, 2, 3, 0, 0, 0, 3, 0, 0, 4, 3, 3,
       0, 0, 0, 0, 3, 4, 0, 2, 3, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 3, 0, 0,
       0, 0, 0, 4, 2, 3, 2, 2, 3, 3, 2, 0, 0, 0, 1, 0, 0, 0, 0, 3, 3, 0,
       0, 2, 3, 0, 3, 0, 0, 3, 2, 3, 3, 3, 3, 0, 3, 0, 3, 2, 2, 0, 0, 1,
       3, 0, 3, 2, 2, 3, 0, 0, 2, 0, 0, 0, 0, 0, 4, 0, 1, 0, 0, 1, 0, 0,
       0, 3, 1, 1, 3, 0, 0, 0, 3, 0, 0, 3, 0

#### Saving the model

The model can be saved for future loading.

In [27]:
trainer.save_model()

Saving model checkpoint to ./results
Configuration saved in ./results\config.json
Model weights saved in ./results\pytorch_model.bin
tokenizer config file saved in ./results\tokenizer_config.json
Special tokens file saved in ./results\special_tokens_map.json


#### Loading and using a saved model

In [31]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer2 = AutoTokenizer.from_pretrained("./results")
model2 = AutoModelForSequenceClassification.from_pretrained("./results", num_labels=5)

Didn't find file ./results\added_tokens.json. We won't load it.
loading file ./results\vocab.txt
loading file ./results\tokenizer.json
loading file None
loading file ./results\special_tokens_map.json
loading file ./results\tokenizer_config.json
loading configuration file ./results\config.json
Model config BertConfig {
  "_name_or_path": "./results",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidd

To use the saved model, we can rely on a pipeline.

In [32]:
from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(model=model2, tokenizer=tokenizer2) #, return_all_scores=True)

In [33]:
pipe("Considero que a Praxe é muito boa")

[{'label': 'LABEL_1', 'score': 0.6879749298095703}]
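
The pipeline reports generic names such as `LABEL_1` because no class names were stored in the model configuration. A small sketch of mapping them back to the dataset's classes, using the class order from the label encoding at the start of the notebook; the helper `readable` is made up for this example.

```python
LABELS = ['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy']

id2label = dict(enumerate(LABELS))                        # {0: 'Value', 1: 'Value(+)', ...}
label2id = {name: idx for idx, name in id2label.items()}

def readable(pipeline_label):
    """Turn a generic 'LABEL_<n>' string into the dataset's class name."""
    return id2label[int(pipeline_label.rsplit("_", 1)[1])]

print(readable("LABEL_1"))
```

Alternatively, dictionaries like these can be passed as `id2label` and `label2id` when loading the model with `from_pretrained`, so that the names are stored in the configuration and reported directly.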

We can also use the model in a step-by-step fashion, as follows.

In [34]:
import torch

inputs = "Considero que a Praxe é muito boa"

# tokenize inputs
tokenized_inputs = tokenizer2(inputs, return_tensors="pt")
print(tokenized_inputs)

# obtain model outputs
outputs = model2(**tokenized_inputs)
print(outputs)

# get the most likely label
labels = ['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy']
prediction = torch.argmax(outputs.logits)
print(labels[prediction])

{'input_ids': tensor([[  101,  1158,  2776, 22280,   179,   123,  2485,  2650,   253,   785,
          3264,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
SequenceClassifierOutput(loss=None, logits=tensor([[ 1.0468,  2.2777, -1.1819,  0.1349, -2.0844]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
Value(+)


Let's check the performance of the model on the test set again, possibly with additional metrics.

In [35]:
y_pred = []
for p in tokenized_dataset['test']['tokens']:
    ti = tokenizer2(p, return_tensors="pt")
    with torch.no_grad():   # inference only, so no gradients are needed
        out = model2(**ti)
    pred = torch.argmax(out.logits, dim=-1).item()
    y_pred.append(pred)   # predicted class index, from 0 to 4

In [36]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

y_test = tokenized_dataset['test']['label']

print(confusion_matrix(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred, average='macro'))
print('Recall: ', recall_score(y_test, y_pred, average='macro'))
print('F1: ', f1_score(y_test, y_pred, average='macro'))

[[293  22  32  56  13]
 [ 27  27   0   7   1]
 [ 52   1  72  11   0]
 [ 51   7  21 111   1]
 [  8   2   0   0  23]]
Accuracy:  0.6276849642004774
Precision:  0.5837409323379233
Recall:  0.5894688176361955
F1:  0.5856343172939275
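
The four scores can be recomputed directly from the confusion matrix, which makes explicit what macro averaging does: precision, recall and F1 are computed per class and then averaged with equal weight. The sketch below uses only the matrix printed above, where rows are true classes and columns are predicted classes.

```python
cm = [
    [293, 22, 32, 56, 13],
    [27, 27, 0, 7, 1],
    [52, 1, 72, 11, 0],
    [51, 7, 21, 111, 1],
    [8, 2, 0, 0, 23],
]
n = len(cm)
total = sum(sum(row) for row in cm)
true_pos = [cm[i][i] for i in range(n)]
row_sums = [sum(cm[i]) for i in range(n)]                        # examples per true class
col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]   # examples per predicted class

def macro(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)

accuracy = sum(true_pos) / total
precision = [tp / c if c else 0.0 for tp, c in zip(true_pos, col_sums)]
recall = [tp / r if r else 0.0 for tp, r in zip(true_pos, row_sums)]
f1 = [2 * p * r / (p + r) if p + r else 0.0 for p, r in zip(precision, recall)]

print('Accuracy: ', accuracy)
print('Precision: ', macro(precision))
print('Recall: ', macro(recall))
print('F1: ', macro(f1))
```

Because every class counts equally in a macro average, the weaker results on the smaller classes pull these scores below the accuracy.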


We can do the same using a Trainer, as before.

In [37]:
trainer2 = Trainer(
    model=model2,
    tokenizer=tokenizer2,
    compute_metrics=compute_metrics
)

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [38]:
trainer2.predict(test_dataset=tokenized_dataset["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 838
  Batch size = 8


PredictionOutput(predictions=array([[ 1.2462692 ,  0.5158316 , -1.884579  ,  0.6054781 , -0.56966305],
       [ 1.5887785 , -1.7311387 , -0.61572343, -0.2537465 ,  1.1889675 ],
       [ 1.8551832 , -0.65770715, -0.22144029,  2.1115184 , -3.4015794 ],
       ...,
       [ 2.4489844 , -0.7960056 ,  0.21268278,  0.6089523 , -2.453919  ],
       [ 1.9917784 ,  0.24346623, -0.32904103,  0.6129986 , -2.1276066 ],
       [ 2.2117815 , -1.9057046 ,  1.1338502 ,  0.91913486, -2.7005472 ]],
      dtype=float32), label_ids=array([3, 4, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 2, 1, 3, 0, 1, 2, 0, 2, 0, 4,
       0, 0, 3, 0, 4, 0, 3, 3, 2, 0, 0, 2, 3, 0, 0, 0, 3, 0, 0, 4, 3, 3,
       0, 0, 0, 0, 3, 4, 0, 2, 3, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 3, 0, 0,
       0, 0, 0, 4, 2, 3, 2, 2, 3, 3, 2, 0, 0, 0, 1, 0, 0, 0, 0, 3, 3, 0,
       0, 2, 3, 0, 3, 0, 0, 3, 2, 3, 3, 3, 3, 0, 3, 0, 3, 2, 2, 0, 0, 1,
       3, 0, 3, 2, 2, 3, 0, 0, 2, 0, 0, 0, 0, 0, 4, 0, 1, 0, 0, 1, 0, 0,
       0, 3, 1, 1, 3, 0, 0, 0, 3, 0, 0, 3, 0

## Translating to English and using distilbert-base-uncased-finetuned-sst-2-english

### Translate Text

First, translate all tokens:

In [9]:
from time import sleep
from googletrans import Translator

translator = Translator()
translator.raise_Exception = True

def translate_pt_en(text):
    sleep(1)   # pause between requests; calling too fast triggers HTTP 429
    return translator.translate(text, src='pt', dest='en').text

dataset['tokens'] = dataset['tokens'].apply(translate_pt_en)

Exception: Unexpected status code "429" from ('translate.google.com',)
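
Status code 429 means that too many requests were sent in a short time. One common mitigation, sketched here as a suggestion and not as something this notebook did, is to retry failed calls with exponentially growing pauses. The names `with_retries` and `flaky_translate` are made up for the example, and the fake function only stands in for a rate-limited service.

```python
import time

def with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Wrap func so that failed calls are retried after pauses of 1x, 2x, 4x, ... base_delay."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == max_attempts - 1:
                    raise                      # out of attempts: let the error through
                sleep(base_delay * 2 ** attempt)
    return wrapper

# Stand-in for a rate-limited call: fails twice, then succeeds.
calls = {"n": 0}
def flaky_translate(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError('Unexpected status code "429"')
    return text.upper()

delays = []                                    # record the pauses instead of really sleeping
safe_translate = with_retries(flaky_translate, sleep=delays.append)
result = safe_translate("olá")
print(result, delays)
```

In real use the `sleep` argument would be left at its default, and the wrapped function would be the translation call itself.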

In [None]:
from datasets import Dataset

dataset_hf = Dataset.from_pandas(dataset)

In [8]:
from datasets import DatasetDict

# 90% train, 10% test+validation
train_test = dataset_hf.train_test_split(test_size=0.1)

# Split the 10% test+validation set in half test, half validation
valid_test = train_test['test'].train_test_split(test_size=0.5)

# gather the three partitions into a single DatasetDict
train_valid_test_dataset = DatasetDict({
    'train': train_test['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']
})

### Tokenizer

We first load the tokenizer for our model:

In [39]:
from transformers import AutoTokenizer

# the keyword is model_max_length; it sets the length used when truncation=True
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)

loading configuration file https://huggingface.co/siebert/sentiment-roberta-large-english/resolve/main/config.json from cache at C:\Users\up201806451/.cache\huggingface\transformers\228e83e1ade2247aebc5f0725e330fa58dedee3d9eec36c9249f25084a946130.1aece0680a18a95d51d6e1a5f83631412da37b87db65380c52052161354505ba
Model config RobertaConfig {
  "_name_or_path": "siebert/sentiment-roberta-large-english",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_

In [40]:
tokenized_dataset = train_valid_test_dataset.map(preprocess_function, batched=True)

tokens = tokenizer.tokenize(tokenized_dataset['train'][321]['tokens'])
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

In [41]:
inputs = tokenizer(tokenized_dataset['train'][321]['tokens'])
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'])

[[108 308   0   0   0]
 [  2  60   0   0   0]
 [ 66  70   0   0   0]
 [ 48 143   0   0   0]
 [  3  30   0   0   0]]
Accuracy:  0.20047732696897375
Precision:  0.11479411955557799
Recall:  0.24547146401985112
F1:  0.10284628840941075




As before, we can do the same via a Trainer.

In [42]:
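from transformers import AutoModelForSequenceClassification, Trainer

# the sentiment checkpoint has a 2-class head; ignore_mismatched_sizes lets a new 5-class head replace it
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5, ignore_mismatched_sizes=True)
trainer = Trainer(model=model, compute_metrics=compute_metrics)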

loading configuration file https://huggingface.co/siebert/sentiment-roberta-large-english/resolve/main/config.json from cache at C:\Users\up201806451/.cache\huggingface\transformers\228e83e1ade2247aebc5f0725e330fa58dedee3d9eec36c9249f25084a946130.1aece0680a18a95d51d6e1a5f83631412da37b87db65380c52052161354505ba
Model config RobertaConfig {
  "_name_or_path": "siebert/sentiment-roberta-large-english",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_

In [43]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch", # run validation at the end of each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [1]:
trainer.train()

NameError: name 'train_valid_test_dataset' is not defined

In [None]:
trainer.evaluate()

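Note that, without any further training, the model performs poorly on our task: the accuracy above is about 0.20 and all predictions fall into the first two classes, since the model was trained for binary sentiment classification. Fine-tuning it with our training data is therefore needed to adapt it to our five classes.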

In [2]:
trainer.predict(test_dataset=tokenized_dataset["test"])

NameError: name 'trainer' is not defined

In [3]:
trainer.save_model()

NameError: name 'trainer' is not defined

In [4]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer2 = AutoTokenizer.from_pretrained("./results")
model2 = AutoModelForSequenceClassification.from_pretrained("./results", num_labels=5)

In [5]:
from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(model=model2, tokenizer=tokenizer2) #, return_all_scores=True)

In [None]:
import torch

inputs = "Considero que a Praxe é muito boa"

# tokenize inputs
tokenized_inputs = tokenizer2(inputs, return_tensors="pt")
print(tokenized_inputs)

# obtain model outputs
outputs = model2(**tokenized_inputs)
print(outputs)

# get the most likely label
labels = ['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy']
prediction = torch.argmax(outputs.logits)
print(labels[prediction])