Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# TRANSFORMERS

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In this notebook we will explore [Hugging Face Transformers](https://huggingface.co/docs/transformers/index).
You may also want to check the [Hugging Face course](https://huggingface.co/course/), which will explain you how to use this technology in a much greater depth.

Training transformer models is computationally expensive. Hugging Face makes available several pretrained [models](https://huggingface.co/models) that can be used as is, or fine-tuned to a specific NLP task, such as one of sentence classification. That's what we'll do in this notebook.

Hugging Face also makes available several [datasets](https://huggingface.co/datasets) that can be used to train or fine-tune a model.

See:
- https://huggingface.co/docs/transformers/tasks/sequence_classification#preprocess
- https://huggingface.co/docs/transformers/training#prepare-a-dataset
- https://huggingface.co/docs/transformers/accelerate
- https://huggingface.co/docs/transformers/model_summary#autoencoding-models

## Loading a dataset

In this notebook, we'll start by using a local dataset (instead of using a dataset stored at Hugging Face).
Let's load data for our classification task.

In [2]:
!pip install pandas
!pip install datasets
!pip install transformers
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
!pip install optuna

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/cu113
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


For ease of usage with Transformer models, we convert the dataset into a Hugging Face dataset and split it into train, validation and test sets.

In [3]:
import pandas as pd
from datasets import Dataset
from datasets import DatasetDict

model_name = "neuralmind/bert-base-portuguese-cased" # or neuralmind/bert-large-portuguese-cased
# Importing the dataset
train = pd.read_csv("/content/drive/Shareddrives/PLN/Assignment 2/data/augmented/OpArticles_ADUs_train_aug.csv")
train = train.drop(columns=['article_id', 'annotator', 'node','ranges'])
train['label'].replace(['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy'],[0,1,2,3,4], inplace=True)

test_valid = pd.read_csv("/content/drive/Shareddrives/PLN/Assignment 2/data/augmented/OpArticles_ADUs_test.csv")
test_valid = test_valid.drop(columns=['article_id', 'annotator', 'node','ranges'])
test_valid['label'].replace(['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy'],[0,1,2,3,4], inplace=True)

train = Dataset.from_pandas(train)
dataset_hf = Dataset.from_pandas(test_valid)

# Split the 10% test+validation set in half test, half validation
valid_test = dataset_hf.train_test_split(test_size=0.5)

# gather everyone if you want to have a single DatasetDict
train_valid_test_dataset = DatasetDict({
    'train': train,
    'validation': valid_test['train'],
    'test': valid_test['test']
})

## Fine-tuning a pretrained model

As a starting example, we'll use a lighter BERT-based model. We will need to load:
- the [tokenizer](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer) (which is used to [preprocess](https://huggingface.co/docs/transformers/preprocessing) the data before it can be used by the model)
- the [model](https://huggingface.co/docs/transformers/autoclass_tutorial#automodel) itself

In [4]:
model_name = "neuralmind/bert-base-portuguese-cased" # or neuralmind/bert-large-portuguese-cased

### Tokenizer

We first load the tokenizer for our model:

In [5]:
from transformers import AutoTokenizer

def get_tokenizer(name):
    return AutoTokenizer.from_pretrained(name, model_max_len=512, use_fast=True)

tokenizer = get_tokenizer(model_name)

Now we need to [preprocess](https://huggingface.co/docs/transformers/preprocessing) our data. We will do it for the three partitions (train, validation and test) in a single step. For that, we'll make use of [map](https://huggingface.co/docs/datasets/process#map) with the help of an auxiliary function.

In [6]:
def preprocess_function(sample):
    return tokenizer(sample["tokens"], truncation=True)

In [7]:
def get_tokenized_data(dataset, function):
    return dataset.map(function, batched=True)

tokenized_dataset = get_tokenized_data(train_valid_test_dataset,preprocess_function)

  0%|          | 0/33 [00:00<?, ?ba/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [8]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 32405
    })
    validation: Dataset({
        features: ['tokens', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1674
    })
    test: Dataset({
        features: ['tokens', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1675
    })
})

When preprocessing the text, we have actually translated the text into numbers, which is known as [encoding](https://huggingface.co/course/chapter2/4?fw=pt#encoding).

In [9]:
tokenized_dataset['train'][321]

{'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'input_ids': [101,
  15807,
  17337,
  5254,
  146,
  873,
  183,
  117,
  17337,
  5254,
  820,
  146,
  8684,
  310,
  125,
  3233,
  179,
  117,
  538,
  3401,
  173,
  2645,
  117,
  4024,
  185,
  123,
  6716,
  5462,
  19317,
  22290,
  125,
  6437,
  102],
 'label': 0,
 'token_type_ids': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'tokens': 'Quem festejou o veto, festejou apenas o adiamento de algo que, nos termos em presença, repete a coreografia negocial de 2018'}

Encoding is done in a two-step process: tokenization, followed by conversion to input IDs.

In [10]:
tokens = tokenizer.tokenize(tokenized_dataset['train'][321]['tokens'])
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['Quem', 'feste', '##jou', 'o', 've', '##to', ',', 'feste', '##jou', 'apenas', 'o', 'adia', '##mento', 'de', 'algo', 'que', ',', 'nos', 'termos', 'em', 'presença', ',', 'repe', '##te', 'a', 'core', '##ografia', 'negocia', '##l', 'de', '2018']
[15807, 17337, 5254, 146, 873, 183, 117, 17337, 5254, 820, 146, 8684, 310, 125, 3233, 179, 117, 538, 3401, 173, 2645, 117, 4024, 185, 123, 6716, 5462, 19317, 22290, 125, 6437]


The tokenizer actually adds two special tokens when preprocessing: one at the beginning, and one at the end.

In [11]:
inputs = tokenizer(tokenized_dataset['train'][321]['tokens'])
inputs['input_ids']   # or inputs.input_ids

[101,
 15807,
 17337,
 5254,
 146,
 873,
 183,
 117,
 17337,
 5254,
 820,
 146,
 8684,
 310,
 125,
 3233,
 179,
 117,
 538,
 3401,
 173,
 2645,
 117,
 4024,
 185,
 123,
 6716,
 5462,
 19317,
 22290,
 125,
 6437,
 102]

We can [decode](https://huggingface.co/course/chapter2/4?fw=pt#decoding) the sequence to check what are these tokens:

In [12]:
tokenizer.decode(inputs['input_ids'])

'[CLS] Quem festejou o veto, festejou apenas o adiamento de algo que, nos termos em presença, repete a coreografia negocial de 2018 [SEP]'

As with enconding, we can decode in two separate steps:

In [13]:
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'])
print(tokens)
print(tokenizer.convert_tokens_to_string(tokens))

['[CLS]', 'Quem', 'feste', '##jou', 'o', 've', '##to', ',', 'feste', '##jou', 'apenas', 'o', 'adia', '##mento', 'de', 'algo', 'que', ',', 'nos', 'termos', 'em', 'presença', ',', 'repe', '##te', 'a', 'core', '##ografia', 'negocia', '##l', 'de', '2018', '[SEP]']
[CLS] Quem festejou o veto, festejou apenas o adiamento de algo que, nos termos em presença, repete a coreografia negocial de 2018 [SEP]


### Loading the model

We now load the pretrained model:

Since we want to use the model for classification, we should load it with an appropriate classification head:

In [14]:
from transformers import AutoModelForSequenceClassification
import torch

def get_model(name):
    return AutoModelForSequenceClassification.from_pretrained(name, num_labels=5)

model = get_model(model_name)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the

### Fine-tuning

The next step is to [fine-tune](https://huggingface.co/docs/transformers/training) the model with our train data. To do so, we can make use of a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer).
There are several aspects of training that you can specify via [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

In [23]:
from transformers import TrainingArguments, Trainer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import numpy as np
from transformers import DataCollatorWithPadding

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

def get_trainingArgs():
    return TrainingArguments(
        output_dir="./results",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=5,
        weight_decay=0.01,
        data_seed=42,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1"
    )

def get_trainer(model_, args_, dataset_, tokenizer_, data_collator_, compute_metrics_):
    return Trainer(
        model=model_,
        args=args_,
        train_dataset=dataset_["train"],
        eval_dataset=dataset_["validation"],
        tokenizer=tokenizer_,
        data_collator=data_collator_,
        compute_metrics=compute_metrics_
    )

## Testing for augmented data

###Load train and test

In [24]:
tokenizer = get_tokenizer(model_name)
model = get_model(model_name)
tokenized_dataset = get_tokenized_data(train_valid_test_dataset,preprocess_function)

loading configuration file https://huggingface.co/neuralmind/bert-base-portuguese-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/e716e2151985ba669e7197b64cdde2552acee146494d40ffaf0688a3f152e6ed.18a0b8b86f3ebd4c8a1d8d6199178feae9971ff5420f1d12f0ed8326ffdff716
Model config BertConfig {
  "_name_or_path": "neuralmind/bert-base-portuguese-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_tran

  0%|          | 0/33 [00:00<?, ?ba/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [25]:
trainer = get_trainer(
    model,
    get_trainingArgs(),
    tokenized_dataset,
    tokenizer,
    DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics
    )


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [26]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 32405
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 10130


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6692,1.1322,0.553763,0.548304,0.515201,0.628744
2,0.4612,1.264729,0.544205,0.549324,0.526501,0.591261
3,0.3219,1.745684,0.516129,0.528278,0.498591,0.593628
4,0.2495,1.694092,0.553763,0.542477,0.530946,0.55891
5,0.1976,2.019264,0.552569,0.544868,0.528394,0.567529


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1674
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-2026
Configuration saved in ./results/checkpoint-2026/config.json
Model weights saved in ./results/checkpoint-2026/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-2026/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-2026/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1674
  Batch size 

TrainOutput(global_step=10130, training_loss=0.4062878637888119, metrics={'train_runtime': 1771.4873, 'train_samples_per_second': 91.463, 'train_steps_per_second': 5.718, 'total_flos': 4069945859108826.0, 'train_loss': 0.4062878637888119, 'epoch': 5.0})

In [27]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1674
  Batch size = 16


{'epoch': 5.0,
 'eval_accuracy': 0.544205495818399,
 'eval_f1': 0.5493243911565953,
 'eval_loss': 1.264729380607605,
 'eval_precision': 0.52650127947884,
 'eval_recall': 0.5912605485990066,
 'eval_runtime': 5.0133,
 'eval_samples_per_second': 333.914,
 'eval_steps_per_second': 20.944}

In [28]:
trainer.predict(test_dataset=tokenized_dataset["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1675
  Batch size = 16


PredictionOutput(predictions=array([[ 1.786876  ,  0.3544545 , -1.4735757 ,  3.2015884 , -3.473822  ],
       [ 3.9415853 , -2.2783604 , -0.3741341 ,  1.3784047 , -1.7755595 ],
       [ 3.7397726 , -2.0315673 ,  1.1947795 ,  1.382461  , -3.536887  ],
       ...,
       [ 3.1577275 , -1.6957397 , -0.20974092,  3.495463  , -4.0020957 ],
       [ 2.9930718 , -3.201207  ,  3.4517844 , -0.46747637, -2.3843017 ],
       [ 1.810084  , -3.3927107 ,  4.28492   ,  1.1870044 , -3.4373345 ]],
      dtype=float32), label_ids=array([3, 0, 0, ..., 3, 0, 3]), metrics={'test_loss': 1.2017401456832886, 'test_accuracy': 0.5737313432835821, 'test_f1': 0.5843443635929594, 'test_precision': 0.5554221324326652, 'test_recall': 0.6347991614631235, 'test_runtime': 5.1493, 'test_samples_per_second': 325.288, 'test_steps_per_second': 20.391})

In [29]:
!rm -rf ./results/

In [30]:
from transformers import AutoModelForSequenceClassification

model_name = '/content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)
model.cuda()

loading configuration file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/config.json
Model config BertConfig {
  "_name_or_path": "/content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_l

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [31]:
tokenizer = get_tokenizer(model_name)
tokenized_dataset = get_tokenized_data(train_valid_test_dataset,preprocess_function)

Didn't find file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/added_tokens.json. We won't load it.
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/vocab.txt
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/tokenizer.json
loading file None
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/special_tokens_map.json
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/tokenizer_config.json


  0%|          | 0/33 [00:00<?, ?ba/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [32]:
trainer = get_trainer(
    model,
    get_trainingArgs(),
    tokenized_dataset,
    tokenizer,
    DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics
    )


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [33]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 32405
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 10130


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6614,1.148045,0.54779,0.54764,0.512878,0.636104
2,0.4523,1.235134,0.553166,0.553422,0.528962,0.595537
3,0.3125,1.643723,0.543608,0.542056,0.51383,0.59223
4,0.2471,1.663551,0.554958,0.550053,0.543776,0.560313
5,0.1959,2.003325,0.551971,0.546808,0.534209,0.563293


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1674
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-2026
Configuration saved in ./results/checkpoint-2026/config.json
Model weights saved in ./results/checkpoint-2026/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-2026/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-2026/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1674
  Batch size 

TrainOutput(global_step=10130, training_loss=0.3983003787674297, metrics={'train_runtime': 1763.3599, 'train_samples_per_second': 91.884, 'train_steps_per_second': 5.745, 'total_flos': 4069945859108826.0, 'train_loss': 0.3983003787674297, 'epoch': 5.0})

In [34]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1674
  Batch size = 16


{'epoch': 5.0,
 'eval_accuracy': 0.5531660692951016,
 'eval_f1': 0.5534221858863111,
 'eval_loss': 1.2351343631744385,
 'eval_precision': 0.5289616410637142,
 'eval_recall': 0.5955371370680844,
 'eval_runtime': 4.9805,
 'eval_samples_per_second': 336.111,
 'eval_steps_per_second': 21.082}

In [35]:
trainer.predict(test_dataset=tokenized_dataset["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1675
  Batch size = 16


PredictionOutput(predictions=array([[ 2.1459687 , -0.76345384, -1.1445857 ,  3.82321   , -3.6292984 ],
       [ 3.9712572 , -2.6089475 ,  0.36192417,  0.24639913, -2.02225   ],
       [ 3.699785  , -1.1340754 , -0.64084035,  1.7017245 , -3.3674002 ],
       ...,
       [ 3.2491605 , -2.4131625 ,  0.27471036,  3.4092333 , -4.0491557 ],
       [ 2.970577  , -3.156329  ,  2.96932   , -0.492973  , -2.9572744 ],
       [ 1.8632498 , -3.8464572 ,  4.0206547 ,  0.70495933, -3.2639804 ]],
      dtype=float32), label_ids=array([3, 0, 0, ..., 3, 0, 3]), metrics={'test_loss': 1.2006856203079224, 'test_accuracy': 0.5761194029850746, 'test_f1': 0.5795670613595094, 'test_precision': 0.5530626159235299, 'test_recall': 0.6220527945466914, 'test_runtime': 5.1602, 'test_samples_per_second': 324.598, 'test_steps_per_second': 20.348})

In [36]:
!rm -rf ./results/

## Majority dataset

In [50]:
train = pd.read_csv("/content/drive/Shareddrives/PLN/Assignment 2/data/augmented/OpArticles_ADUs_train_aug.csv")
train = train.drop(columns=['article_id',  'node','ranges'])
train['label'].replace(['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy'],[0,1,2,3,4], inplace=True)

df_tmp = train.groupby(['tokens', 'label']).agg({'annotator': 'count'}).reset_index()

dataset_nodups = df_tmp.groupby(['tokens'], as_index=False).agg({'annotator': 'max', 'label': 'first'})
dataset_nodups = dataset_nodups.drop('annotator', 1)

train = dataset_nodups

train = Dataset.from_pandas(train)
model_name = "neuralmind/bert-base-portuguese-cased" # or neuralmind/bert-large-portuguese-cased

  


In [51]:
# gather everyone if you want to have a single DatasetDict
train_valid_test_dataset = DatasetDict({
    'train': train,
    'validation': valid_test['train'],
    'test': valid_test['test']
})

In [52]:
tokenizer = get_tokenizer(model_name)
model = get_model(model_name)
tokenized_dataset = get_tokenized_data(train_valid_test_dataset,preprocess_function)

loading configuration file https://huggingface.co/neuralmind/bert-base-portuguese-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/e716e2151985ba669e7197b64cdde2552acee146494d40ffaf0688a3f152e6ed.18a0b8b86f3ebd4c8a1d8d6199178feae9971ff5420f1d12f0ed8326ffdff716
Model config BertConfig {
  "_name_or_path": "neuralmind/bert-base-portuguese-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_tran

  0%|          | 0/25 [00:00<?, ?ba/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [53]:
trainer = get_trainer(
    model,
    get_trainingArgs(),
    tokenized_dataset,
    tokenizer,
    DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics
    )

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [54]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 24192
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 7560


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.7669,1.292887,0.476105,0.497654,0.491151,0.613053
2,0.4822,1.207808,0.583632,0.560594,0.54457,0.586652
3,0.3017,1.662657,0.559737,0.550222,0.535522,0.568474
4,0.176,2.219794,0.568698,0.545761,0.539701,0.552589
5,0.1176,2.667409,0.567503,0.548371,0.537417,0.561271


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1674
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-1512
Configuration saved in ./results/checkpoint-1512/config.json
Model weights saved in ./results/checkpoint-1512/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1512/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1512/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1674
  Batch size 

TrainOutput(global_step=7560, training_loss=0.3973042343659376, metrics={'train_runtime': 1373.8829, 'train_samples_per_second': 88.042, 'train_steps_per_second': 5.503, 'total_flos': 3163855881616416.0, 'train_loss': 0.3973042343659376, 'epoch': 5.0})

In [55]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1674
  Batch size = 16


{'epoch': 5.0,
 'eval_accuracy': 0.5836320191158901,
 'eval_f1': 0.5605942079056739,
 'eval_loss': 1.207808256149292,
 'eval_precision': 0.5445700818427405,
 'eval_recall': 0.5866524091057681,
 'eval_runtime': 4.9769,
 'eval_samples_per_second': 336.351,
 'eval_steps_per_second': 21.097}

In [56]:
trainer.predict(test_dataset=tokenized_dataset["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1675
  Batch size = 16


PredictionOutput(predictions=array([[ 2.0763144 ,  0.14716604, -1.1590062 ,  2.853799  , -4.025324  ],
       [ 4.327012  , -1.1268412 , -1.0708095 , -0.21249309, -1.75091   ],
       [ 3.9430997 , -1.5910578 ,  0.04863966,  0.5821205 , -3.1503556 ],
       ...,
       [ 3.5016127 , -1.794759  , -0.58938354,  2.7072716 , -3.7692893 ],
       [ 3.6832633 , -2.6793127 ,  1.7017089 ,  0.19842085, -2.9193635 ],
       [ 1.6597028 , -3.092201  ,  4.159952  ,  0.6669837 , -3.1617677 ]],
      dtype=float32), label_ids=array([3, 0, 0, ..., 3, 0, 3]), metrics={'test_loss': 1.202857494354248, 'test_accuracy': 0.582089552238806, 'test_f1': 0.5697100759909098, 'test_precision': 0.5523877241263656, 'test_recall': 0.6041656056818742, 'test_runtime': 5.1848, 'test_samples_per_second': 323.058, 'test_steps_per_second': 20.251})

In [57]:
!rm -rf ./results/

### Domain

In [58]:
from transformers import AutoModelForSequenceClassification

model_name = '/content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)
model.cuda()

loading configuration file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/config.json
Model config BertConfig {
  "_name_or_path": "/content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_l

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [59]:
tokenizer = get_tokenizer(model_name)
tokenized_dataset = get_tokenized_data(train_valid_test_dataset,preprocess_function)

Didn't find file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/added_tokens.json. We won't load it.
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/vocab.txt
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/tokenizer.json
loading file None
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/special_tokens_map.json
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/domain/pretrained/tokenizer_config.json


  0%|          | 0/25 [00:00<?, ?ba/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [60]:
trainer = get_trainer(
    model,
    get_trainingArgs(),
    tokenized_dataset,
    tokenizer,
    DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics
    )


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [61]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 24192
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 7560


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.7648,1.242527,0.497013,0.520345,0.507512,0.627792
2,0.4818,1.184624,0.577061,0.557969,0.544283,0.577845
3,0.2967,1.615035,0.571685,0.556705,0.549433,0.564893
4,0.1775,2.201149,0.569295,0.550824,0.547268,0.55506
5,0.1155,2.60808,0.569892,0.552207,0.547492,0.557675


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1674
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-1512
Configuration saved in ./results/checkpoint-1512/config.json
Model weights saved in ./results/checkpoint-1512/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1512/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1512/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1674
  Batch size 

TrainOutput(global_step=7560, training_loss=0.39255117569020187, metrics={'train_runtime': 1373.6439, 'train_samples_per_second': 88.058, 'train_steps_per_second': 5.504, 'total_flos': 3163855881616416.0, 'train_loss': 0.39255117569020187, 'epoch': 5.0})

In [62]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1674
  Batch size = 16


{'epoch': 5.0,
 'eval_accuracy': 0.5770609318996416,
 'eval_f1': 0.5579694445837335,
 'eval_loss': 1.1846240758895874,
 'eval_precision': 0.5442829021549208,
 'eval_recall': 0.5778451920881258,
 'eval_runtime': 5.0246,
 'eval_samples_per_second': 333.164,
 'eval_steps_per_second': 20.897}

In [63]:
trainer.predict(test_dataset=tokenized_dataset["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1675
  Batch size = 16


PredictionOutput(predictions=array([[ 2.1742306 , -0.04732339, -1.1106657 ,  3.1417248 , -3.7453046 ],
       [ 4.4682956 , -1.6531224 , -0.5391372 ,  0.1264146 , -2.6272156 ],
       [ 3.9498026 , -0.8253517 ,  0.05461192,  0.29590997, -3.5723584 ],
       ...,
       [ 3.6948116 , -1.6892214 , -0.5456111 ,  2.7757246 , -3.7745152 ],
       [ 3.2638645 , -2.751662  ,  2.0042229 ,  0.09867891, -3.0983286 ],
       [ 1.8335099 , -3.2961693 ,  3.9080126 ,  0.5510907 , -3.3416407 ]],
      dtype=float32), label_ids=array([3, 0, 0, ..., 3, 0, 3]), metrics={'test_loss': 1.1897457838058472, 'test_accuracy': 0.5934328358208956, 'test_f1': 0.5740467671247271, 'test_precision': 0.5655486133619395, 'test_recall': 0.5948024366293968, 'test_runtime': 5.2241, 'test_samples_per_second': 320.627, 'test_steps_per_second': 20.099})

In [64]:
!rm -rf ./results/