Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# TRANSFORMERS

In this notebook we will explore [Hugging Face Transformers](https://huggingface.co/docs/transformers/index).
You may also want to check the [Hugging Face course](https://huggingface.co/course/), which will explain you how to use this technology in a much greater depth.

Training transformer models is computationally expensive. Hugging Face makes available several pretrained [models](https://huggingface.co/models) that can be used as is, or fine-tuned to a specific NLP task, such as one of sentence classification. That's what we'll do in this notebook.

Hugging Face also makes available several [datasets](https://huggingface.co/datasets) that can be used to train or fine-tune a model.

See:
- https://huggingface.co/docs/transformers/tasks/sequence_classification#preprocess
- https://huggingface.co/docs/transformers/training#prepare-a-dataset
- https://huggingface.co/docs/transformers/accelerate
- https://huggingface.co/docs/transformers/model_summary#autoencoding-models

## Loading a dataset

In this notebook, we'll start by using a local dataset (instead of using a dataset stored at Hugging Face).
Let's load data for our classification task.

In [1]:
!pip install pandas
!pip install datasets
!pip install transformers
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/cu113


In [2]:
import pandas as pd

# Importing the dataset
dataset = pd.read_excel("./data/OpArticles_ADUs.xlsx")
dataset = dataset.drop(columns=['article_id', 'annotator', 'node','ranges'])
dataset['label'].replace(['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy'],[0,1,2,3,4], inplace=True)

dataset.head()

Unnamed: 0,tokens,label
0,O facto não é apenas fruto da ignorância,0
1,havia no seu humor mais jornalismo (mais inves...,0
2,É tudo cómico na FIFA,0
3,o que todos nós permitimos que esta organizaçã...,0
4,não nos fazem rir à custa dos poderosos,0


For ease of usage with Transformer models, we convert the dataset into a Hugging Face dataset and split it into train, validation and test sets.

In [3]:
from datasets import Dataset

dataset_hf = Dataset.from_pandas(dataset)

In [4]:
from datasets import DatasetDict

# 90% train, 10% test+validation
train_test = dataset_hf.train_test_split(test_size=0.1)

# Split the 10% test+validation set in half test, half validation
valid_test = train_test['test'].train_test_split(test_size=0.5)

# gather everyone if you want to have a single DatasetDict
train_valid_test_dataset = DatasetDict({
    'train': train_test['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']
})

In [5]:
train_valid_test_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'label'],
        num_rows: 15068
    })
    validation: Dataset({
        features: ['tokens', 'label'],
        num_rows: 837
    })
    test: Dataset({
        features: ['tokens', 'label'],
        num_rows: 838
    })
})

## Fine-tuning a pretrained model

As a starting example, we'll use a lighter BERT-based model. We will need to load:
- the [tokenizer](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer) (which is used to [preprocess](https://huggingface.co/docs/transformers/preprocessing) the data before it can be used by the model)
- the [model](https://huggingface.co/docs/transformers/autoclass_tutorial#automodel) itself

In [6]:
model_name = "neuralmind/bert-base-portuguese-cased" # or neuralmind/bert-large-portuguese-cased

### Tokenizer

We first load the tokenizer for our model:

In [7]:
from transformers import AutoTokenizer

def get_tokenizer(name):
    return AutoTokenizer.from_pretrained(name, model_max_len=512)

tokenizer = get_tokenizer(model_name)

Now we need to [preprocess](https://huggingface.co/docs/transformers/preprocessing) our data. We will do it for the three partitions (train, validation and test) in a single step. For that, we'll make use of [map](https://huggingface.co/docs/datasets/process#map) with the help of an auxiliary function.

In [8]:
def preprocess_function(sample):
    return tokenizer(sample["tokens"], truncation=True)

In [9]:
def get_tokenized_data(dataset, function):
    return dataset.map(function, batched=True)

tokenized_dataset = get_tokenized_data(train_valid_test_dataset,preprocess_function)

  0%|          | 0/16 [00:00<?, ?ba/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [10]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 15068
    })
    validation: Dataset({
        features: ['tokens', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 837
    })
    test: Dataset({
        features: ['tokens', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 838
    })
})

When preprocessing the text, we have actually translated the text into numbers, which is known as [encoding](https://huggingface.co/course/chapter2/4?fw=pt#encoding).

In [11]:
tokenized_dataset['train'][321]

{'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'input_ids': [101,
  785,
  1695,
  744,
  262,
  7302,
  173,
  3401,
  125,
  12883,
  179,
  6028,
  22303,
  123,
  333,
  15893,
  285,
  1839,
  366,
  5080,
  298,
  10069,
  19849,
  358,
  125,
  10325,
  6861,
  102],
 'label': 3,
 'token_type_ids': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'tokens': 'muito pouco ainda foi definido em termos de exportação que continuará a ser permitida dentro das regras dos mercados adquirentes de talentos britânicos'}

Encoding is done in a two-step process: tokenization, followed by conversion to input IDs.

In [12]:
tokens = tokenizer.tokenize(tokenized_dataset['train'][321]['tokens'])
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['muito', 'pouco', 'ainda', 'foi', 'definido', 'em', 'termos', 'de', 'exportação', 'que', 'continuar', '##á', 'a', 'ser', 'permiti', '##da', 'dentro', 'das', 'regras', 'dos', 'mercados', 'adquire', '##ntes', 'de', 'talentos', 'britânicos']
[785, 1695, 744, 262, 7302, 173, 3401, 125, 12883, 179, 6028, 22303, 123, 333, 15893, 285, 1839, 366, 5080, 298, 10069, 19849, 358, 125, 10325, 6861]


The tokenizer actually adds two special tokens when preprocessing: one at the beginning, and one at the end.

In [13]:
inputs = tokenizer(tokenized_dataset['train'][321]['tokens'])
inputs['input_ids']   # or inputs.input_ids

[101,
 785,
 1695,
 744,
 262,
 7302,
 173,
 3401,
 125,
 12883,
 179,
 6028,
 22303,
 123,
 333,
 15893,
 285,
 1839,
 366,
 5080,
 298,
 10069,
 19849,
 358,
 125,
 10325,
 6861,
 102]

We can [decode](https://huggingface.co/course/chapter2/4?fw=pt#decoding) the sequence to check what are these tokens:

In [14]:
tokenizer.decode(inputs['input_ids'])

'[CLS] muito pouco ainda foi definido em termos de exportação que continuará a ser permitida dentro das regras dos mercados adquirentes de talentos britânicos [SEP]'

As with enconding, we can decode in two separate steps:

In [15]:
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'])
print(tokens)
print(tokenizer.convert_tokens_to_string(tokens))

['[CLS]', 'muito', 'pouco', 'ainda', 'foi', 'definido', 'em', 'termos', 'de', 'exportação', 'que', 'continuar', '##á', 'a', 'ser', 'permiti', '##da', 'dentro', 'das', 'regras', 'dos', 'mercados', 'adquire', '##ntes', 'de', 'talentos', 'britânicos', '[SEP]']
[CLS] muito pouco ainda foi definido em termos de exportação que continuará a ser permitida dentro das regras dos mercados adquirentes de talentos britânicos [SEP]


### Loading the model

We now load the pretrained model:

In [16]:
from transformers import AutoModel

model = AutoModel.from_pretrained(model_name)
model.cuda()

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(29794, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

Loading the model in this way only gets us the base Transformer module: given some inputs, we obtain the hidden state of the model -- a high-dimensional vector representing the "contextual understanding" of that input by the Transformer model.

In other words, we are leaving out the *head* of the model, which is needed for whatever NLP task we want to address.

Since we want to use the model for classification, we should load it with an appropriate classification head:

In [17]:
from transformers import AutoModelForSequenceClassification
import torch

def get_model(name):
    return AutoModelForSequenceClassification.from_pretrained(name, num_labels=5)

model = get_model(model_name)
model.cuda()

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(29794, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

### Fine-tuning

The next step is to [fine-tune](https://huggingface.co/docs/transformers/training) the model with our train data. To do so, we can make use of a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer).
There are several aspects of training that you can specify via [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

In [18]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from datasets import load_metric
import numpy as np

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def get_trainingArgs():
    return TrainingArguments(
        output_dir="./results",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=10,
        weight_decay=0.01,
        evaluation_strategy="epoch", # run validation at the end of each epoch
        save_strategy="epoch",
        load_best_model_at_end=True,
    )

training_args = get_trainingArgs()

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def get_trainer(model_, args_, dataset_, tokenizer_, data_collator_, compute_metrics_):
    return Trainer(
        model=model_,
        args=args_,
        train_dataset=dataset_["train"],
        eval_dataset=dataset_["validation"],
        tokenizer=tokenizer_,
        data_collator=data_collator_,
        compute_metrics=compute_metrics_
    )

trainer = get_trainer(model,training_args,tokenized_dataset,tokenizer,data_collator,compute_metrics)

In [19]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 15068
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 9420


Epoch,Training Loss,Validation Loss,Accuracy
1,1.1103,0.911923,0.60693
2,0.7863,0.925328,0.609319
3,0.6202,1.034465,0.593787
4,0.5119,1.179615,0.574671
5,0.4406,1.278867,0.584229
6,0.3888,1.483417,0.583035
7,0.3454,1.597407,0.565114
8,0.3113,1.651575,0.574671
9,0.2693,1.821028,0.573477
10,0.2406,1.883662,0.567503


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-942
Configuration saved in ./results/checkpoint-942/config.json
Model weights saved in ./results/checkpoint-942/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-942/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-942/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16
Sa

TrainOutput(global_step=9420, training_loss=0.4949805897512254, metrics={'train_runtime': 1654.402, 'train_samples_per_second': 91.078, 'train_steps_per_second': 5.694, 'total_flos': 3807465469798728.0, 'train_loss': 0.4949805897512254, 'epoch': 10.0})

We can check the model's performance in the evaluation set.

In [20]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16


{'epoch': 10.0,
 'eval_accuracy': 0.6069295101553166,
 'eval_loss': 0.9119234085083008,
 'eval_runtime': 2.7416,
 'eval_samples_per_second': 305.294,
 'eval_steps_per_second': 19.332}

And more importantly, we can check how the model fares in our test set.

In [21]:
trainer.predict(test_dataset=tokenized_dataset["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 838
  Batch size = 16


PredictionOutput(predictions=array([[ 0.3760135 , -1.5194662 ,  2.8067956 ,  0.48898378, -1.9309182 ],
       [ 1.3340023 ,  2.0924182 , -1.9254823 , -0.24892047, -1.7757592 ],
       [ 0.65512687, -0.85574234, -0.71937895,  3.2252967 , -2.5290272 ],
       ...,
       [ 2.8552709 , -1.5310497 ,  0.7302812 ,  1.508096  , -2.8344948 ],
       [ 1.3395143 , -0.9876522 , -1.1002288 , -1.0114254 ,  2.6395814 ],
       [ 0.7530727 , -1.8906747 ,  3.2568834 ,  0.32288778, -2.2073815 ]],
      dtype=float32), label_ids=array([2, 1, 2, 3, 0, 1, 3, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 0, 1, 2,
       0, 3, 0, 0, 0, 2, 2, 0, 3, 3, 0, 2, 2, 0, 0, 2, 0, 0, 2, 2, 2, 3,
       0, 3, 0, 0, 2, 0, 0, 1, 3, 0, 2, 0, 3, 3, 3, 0, 3, 0, 0, 1, 2, 2,
       3, 2, 0, 0, 3, 0, 2, 3, 0, 0, 0, 3, 3, 0, 0, 0, 0, 2, 0, 4, 0, 0,
       0, 3, 0, 3, 0, 0, 0, 0, 3, 0, 3, 0, 3, 4, 0, 0, 0, 0, 0, 2, 3, 0,
       0, 0, 0, 0, 0, 2, 3, 0, 0, 3, 2, 0, 3, 4, 2, 3, 0, 0, 0, 0, 2, 0,
       0, 0, 3, 3, 0, 2, 3, 0, 0, 2, 0, 1, 0

#### Saving the model

The model can be saved for future loading.

In [22]:
trainer.save_model()

Saving model checkpoint to ./results
Configuration saved in ./results/config.json
Model weights saved in ./results/pytorch_model.bin
tokenizer config file saved in ./results/tokenizer_config.json
Special tokens file saved in ./results/special_tokens_map.json


#### Loading and using a saved model

In [23]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer2 = AutoTokenizer.from_pretrained("./results")
model2 = AutoModelForSequenceClassification.from_pretrained("./results", num_labels=5)

Didn't find file ./results/added_tokens.json. We won't load it.
loading file ./results/vocab.txt
loading file ./results/tokenizer.json
loading file None
loading file ./results/special_tokens_map.json
loading file ./results/tokenizer_config.json
loading configuration file ./results/config.json
Model config BertConfig {
  "_name_or_path": "./results",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidd

To exploit the model, we can use a pipeline.

In [24]:
from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(model=model2, tokenizer=tokenizer2) #, return_all_scores=True)

In [25]:
pipe("Considero que a Praxe é muito boa")

[{'label': 'LABEL_1', 'score': 0.6839435696601868}]

We can also use the model in a step-by-step fashion, as follows.

In [26]:
import torch

inputs = "Considero que a Praxe é muito boa"

# tokenize inputs
tokenized_inputs = tokenizer2(inputs, return_tensors="pt")
print(tokenized_inputs)

# obtain model outputs
outputs = model2(**tokenized_inputs)
print(outputs)

# get the most likely label
labels = ['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy']
prediction = torch.argmax(outputs.logits)
print(labels[prediction])

{'input_ids': tensor([[  101,  1158,  2776, 22280,   179,   123,  2485,  2650,   253,   785,
          3264,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
SequenceClassifierOutput(loss=None, logits=tensor([[ 0.8035,  1.9816, -1.2588, -0.4059, -1.7807]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
Value(+)


Let's check again the performance of the model in the test set, possibly with additional metrics.

In [27]:
y_pred= []
for p in tokenized_dataset['test']['tokens']:
    ti = tokenizer2(p, return_tensors="pt")
    out = model2(**ti)
    pred = torch.argmax(out.logits)
    y_pred.append(pred)   # our labels are already 0 and 1

In [28]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

y_test = tokenized_dataset['test']['label']

print(confusion_matrix(y_test, y_pred))
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred, average='macro'))
print('Recall: ', recall_score(y_test, y_pred, average='macro'))
print('F1: ', f1_score(y_test, y_pred, average='macro'))

[[309  22  54  29  11]
 [ 27  25   2   4   1]
 [ 30   0 103   5   2]
 [ 78   9  24  69   0]
 [ 11   1   2   1  19]]
Accuracy:  0.6264916467780429
Precision:  0.5778241183504341
Recall:  0.5657317571096236
F1:  0.5626968419297291


We can do the same using a Trainer, as before.

In [29]:
trainer2 = Trainer(
    model=model2,
    tokenizer=tokenizer2,
    compute_metrics=compute_metrics
)

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [30]:
trainer2.predict(test_dataset=tokenized_dataset["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 838
  Batch size = 8


PredictionOutput(predictions=array([[ 0.37601352, -1.5194659 ,  2.8067954 ,  0.48898408, -1.9309189 ],
       [ 1.3340023 ,  2.0924182 , -1.9254825 , -0.24892056, -1.7757592 ],
       [ 0.6551271 , -0.8557426 , -0.7193791 ,  3.2252967 , -2.5290272 ],
       ...,
       [ 2.8552709 , -1.5310497 ,  0.7302812 ,  1.508096  , -2.8344948 ],
       [ 1.3395143 , -0.9876522 , -1.1002288 , -1.0114254 ,  2.6395814 ],
       [ 0.7530727 , -1.8906747 ,  3.2568834 ,  0.32288778, -2.2073815 ]],
      dtype=float32), label_ids=array([2, 1, 2, 3, 0, 1, 3, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 0, 1, 2,
       0, 3, 0, 0, 0, 2, 2, 0, 3, 3, 0, 2, 2, 0, 0, 2, 0, 0, 2, 2, 2, 3,
       0, 3, 0, 0, 2, 0, 0, 1, 3, 0, 2, 0, 3, 3, 3, 0, 3, 0, 0, 1, 2, 2,
       3, 2, 0, 0, 3, 0, 2, 3, 0, 0, 0, 3, 3, 0, 0, 0, 0, 2, 0, 4, 0, 0,
       0, 3, 0, 3, 0, 0, 0, 0, 3, 0, 3, 0, 3, 4, 0, 0, 0, 0, 0, 2, 3, 0,
       0, 0, 0, 0, 0, 2, 3, 0, 0, 3, 2, 0, 3, 4, 2, 3, 0, 0, 0, 0, 2, 0,
       0, 0, 3, 3, 0, 2, 3, 0, 0, 2, 0, 1, 0

Now to try with large

In [31]:
model_name = "neuralmind/bert-large-portuguese-cased"
tokenizer = get_tokenizer(model_name)
model = get_model(model_name)
model.cuda()

training_args = get_trainingArgs()

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = get_trainer(model,training_args,tokenized_dataset,tokenizer,data_collator,compute_metrics)
trainer.train()

https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/tokenizer_config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp5b4h7ekx


Downloading:   0%|          | 0.00/155 [00:00<?, ?B/s]

storing https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/3a44fa9a74e90f509368a7f2789df38e1fedd153a52c62ef5cc5f4b0f5c99c2a.d61b68f744aef2741575c270d4ba0228cd35693bfa15d8babfb5c1079062d5d7
creating metadata file for /root/.cache/huggingface/transformers/3a44fa9a74e90f509368a7f2789df38e1fedd153a52c62ef5cc5f4b0f5c99c2a.d61b68f744aef2741575c270d4ba0228cd35693bfa15d8babfb5c1079062d5d7
https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpekxwjydm


Downloading:   0%|          | 0.00/648 [00:00<?, ?B/s]

storing https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/c534071830642050813fa94003dbf1234413b3f1d5dc66d259fbc82ff7d5fd59.c8340a82acfbbcd2dd960b86d2886ee120b21896ef0294150f0391918ae6ced5
creating metadata file for /root/.cache/huggingface/transformers/c534071830642050813fa94003dbf1234413b3f1d5dc66d259fbc82ff7d5fd59.c8340a82acfbbcd2dd960b86d2886ee120b21896ef0294150f0391918ae6ced5
loading configuration file https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/c534071830642050813fa94003dbf1234413b3f1d5dc66d259fbc82ff7d5fd59.c8340a82acfbbcd2dd960b86d2886ee120b21896ef0294150f0391918ae6ced5
Model config BertConfig {
  "_name_or_path": "neuralmind/bert-large-portuguese-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  

Downloading:   0%|          | 0.00/205k [00:00<?, ?B/s]

storing https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/vocab.txt in cache at /root/.cache/huggingface/transformers/9cfcd25de0a333b1b5f4a3db227e93a806cfb041d93a49221eeaee6773eaa41c.af25fb1e29ad0175300146695fd80069be69b211c52fa5486fa8aae2754cc814
creating metadata file for /root/.cache/huggingface/transformers/9cfcd25de0a333b1b5f4a3db227e93a806cfb041d93a49221eeaee6773eaa41c.af25fb1e29ad0175300146695fd80069be69b211c52fa5486fa8aae2754cc814
https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/added_tokens.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpgh16e7j2


Downloading:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

storing https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/added_tokens.json in cache at /root/.cache/huggingface/transformers/6a3aa038873b8f0d0ab3a4de0a658f063b89e3afd815920a5f393c0e4ae84259.5cc6e825eb228a7a5cfd27cb4d7151e97a79fb962b31aaf1813aa102e746584b
creating metadata file for /root/.cache/huggingface/transformers/6a3aa038873b8f0d0ab3a4de0a658f063b89e3afd815920a5f393c0e4ae84259.5cc6e825eb228a7a5cfd27cb4d7151e97a79fb962b31aaf1813aa102e746584b
https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/special_tokens_map.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpam1_v_5t


Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

storing https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/special_tokens_map.json in cache at /root/.cache/huggingface/transformers/d5b721c156180bbbcc4a1017e8c72a18f8f96cdc178acec5ddcd45905712b4cf.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
creating metadata file for /root/.cache/huggingface/transformers/d5b721c156180bbbcc4a1017e8c72a18f8f96cdc178acec5ddcd45905712b4cf.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
loading file https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/9cfcd25de0a333b1b5f4a3db227e93a806cfb041d93a49221eeaee6773eaa41c.af25fb1e29ad0175300146695fd80069be69b211c52fa5486fa8aae2754cc814
loading file https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/tokenizer.json from cache at None
loading file https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/added_tokens.json from cache

Downloading:   0%|          | 0.00/1.25G [00:00<?, ?B/s]

storing https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/016fb7702039667c9fb9dd2ceffaf04027b13e525a6248cda2a4a87dbb8687af.881d7200bce807f871637ac9d552c541b2d4b00146a0bf1ab0360f3640031273
creating metadata file for /root/.cache/huggingface/transformers/016fb7702039667c9fb9dd2ceffaf04027b13e525a6248cda2a4a87dbb8687af.881d7200bce807f871637ac9d552c541b2d4b00146a0bf1ab0360f3640031273
loading weights file https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/016fb7702039667c9fb9dd2ceffaf04027b13e525a6248cda2a4a87dbb8687af.881d7200bce807f871637ac9d552c541b2d4b00146a0bf1ab0360f3640031273
Some weights of the model checkpoint at neuralmind/bert-large-portuguese-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictio

Epoch,Training Loss,Validation Loss,Accuracy
1,1.0555,0.907356,0.612903
2,0.7463,0.906321,0.634409
3,0.572,1.07037,0.608124
4,0.4611,1.188678,0.585424
5,0.3953,1.261466,0.592593
6,0.3525,1.525894,0.586619
7,0.3064,1.691817,0.593787
8,0.2691,1.826676,0.587814
9,0.2216,2.081511,0.57945
10,0.1999,2.230173,0.577061


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-942
Configuration saved in ./results/checkpoint-942/config.json
Model weights saved in ./results/checkpoint-942/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-942/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-942/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16
Sa

TrainOutput(global_step=9420, training_loss=0.4522808690486187, metrics={'train_runtime': 5526.3998, 'train_samples_per_second': 27.265, 'train_steps_per_second': 1.705, 'total_flos': 1.3485700452682056e+16, 'train_loss': 0.4522808690486187, 'epoch': 10.0})

In [32]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16


{'epoch': 10.0,
 'eval_accuracy': 0.5770609318996416,
 'eval_loss': 2.230172634124756,
 'eval_runtime': 8.7223,
 'eval_samples_per_second': 95.961,
 'eval_steps_per_second': 6.076}

In [33]:
trainer.predict(test_dataset=tokenized_dataset["test"])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 838
  Batch size = 16


PredictionOutput(predictions=array([[-2.1547718 , -2.7671638 ,  7.155498  , -0.11752454, -1.8404052 ],
       [ 2.8251185 ,  4.594569  , -3.5549378 , -2.374643  , -2.850478  ],
       [-4.315672  ,  1.159939  , -2.0939956 ,  6.453194  , -1.1730695 ],
       ...,
       [ 7.499666  , -2.3837373 , -2.1279833 , -1.7322022 , -1.741272  ],
       [ 2.9533944 , -3.6260717 , -0.3390447 , -2.8510425 ,  3.8079016 ],
       [-4.0607533 , -1.186632  ,  0.9396547 ,  6.576129  , -1.6154824 ]],
      dtype=float32), label_ids=array([2, 1, 2, 3, 0, 1, 3, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 0, 1, 2,
       0, 3, 0, 0, 0, 2, 2, 0, 3, 3, 0, 2, 2, 0, 0, 2, 0, 0, 2, 2, 2, 3,
       0, 3, 0, 0, 2, 0, 0, 1, 3, 0, 2, 0, 3, 3, 3, 0, 3, 0, 0, 1, 2, 2,
       3, 2, 0, 0, 3, 0, 2, 3, 0, 0, 0, 3, 3, 0, 0, 0, 0, 2, 0, 4, 0, 0,
       0, 3, 0, 3, 0, 0, 0, 0, 3, 0, 3, 0, 3, 4, 0, 0, 0, 0, 0, 2, 3, 0,
       0, 0, 0, 0, 0, 2, 3, 0, 0, 3, 2, 0, 3, 4, 2, 3, 0, 0, 0, 0, 2, 0,
       0, 0, 3, 3, 0, 2, 3, 0, 0, 2, 0, 1, 0

In [34]:
trainer.save_model()

Saving model checkpoint to ./results
Configuration saved in ./results/config.json
Model weights saved in ./results/pytorch_model.bin
tokenizer config file saved in ./results/tokenizer_config.json
Special tokens file saved in ./results/special_tokens_map.json


In [35]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer2 = AutoTokenizer.from_pretrained("./results")
model2 = AutoModelForSequenceClassification.from_pretrained("./results", num_labels=5)

from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(model=model2, tokenizer=tokenizer2) #, return_all_scores=True)

import torch

inputs = "I consider that this class is great"

# tokenize inputs
tokenized_inputs = tokenizer2(inputs, return_tensors="pt")
print(tokenized_inputs)

# obtain model outputs
outputs = model2(**tokenized_inputs)
print(outputs)

# get the most likely label
labels = ['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy']
prediction = torch.argmax(outputs.logits)
print(labels[prediction])

Didn't find file ./results/added_tokens.json. We won't load it.
loading file ./results/vocab.txt
loading file ./results/tokenizer.json
loading file None
loading file ./results/special_tokens_map.json
loading file ./results/tokenizer_config.json
loading configuration file ./results/config.json
Model config BertConfig {
  "_name_or_path": "./results",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hid

{'input_ids': tensor([[  101,   290,  4747, 12230,   352, 12230,   145,  1548,   847,  2498,
           352,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
SequenceClassifierOutput(loss=None, logits=tensor([[ 5.3657, -2.2033, -2.5468,  0.7432, -1.9797]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
Value


## Translating to english and using distilbert-base-uncased-finetuned-sst-2-english

### Translate Text

First, translate all tokens:

In [36]:
import pandas as pd

# Importing the dataset
dataset = pd.read_excel("./data/OpArticles_ADUs_translated.xlsx")

dataset.drop(['article_id', 'annotator', 'node','ranges'], axis=1, inplace=True)
dataset['label'].replace(['Value', 'Value(+)', 'Value(-)', 'fact', 'policy'],[0,1,2,3,4], inplace=True)

dataset.head()

Unnamed: 0,tokens,label
0,The fact is not just the result of ignorance,0
1,there was more journalism in his humor (more i...,0
2,It's all comical in FIFA,0
3,what we all allow this organization to do is u...,0
4,do not make us laugh at the expense of the pow...,0


In [37]:
from datasets import Dataset

dataset_hf = Dataset.from_pandas(dataset)

In [38]:
from datasets import DatasetDict

# 90% train, 10% test+validation
train_test = dataset_hf.train_test_split(test_size=0.1)

# Split the 10% test+validation set in half test, half validation
valid_test = train_test['test'].train_test_split(test_size=0.5)

# gather everyone if you want to have a single DatasetDict
train_valid_test_dataset = DatasetDict({
    'train': train_test['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']
})

model_name = "distilbert-base-uncased"

### Tokenizer

We first load the tokenizer for our model:

In [39]:
from transformers import AutoTokenizer

tokenizer = get_tokenizer(model_name)

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.19.2",
  "vocab_size": 30522
}

loading file https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/0e1bbfda7f63a99bb52e3915dcf10

In [40]:
tokenized_dataset = get_tokenized_data(train_valid_test_dataset,preprocess_function)

tokens = tokenizer.tokenize(tokenized_dataset['train'][321]['tokens'])
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

  0%|          | 0/16 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

['read', '##ability', 'is', 'complete']
[3191, 8010, 2003, 3143]


In [41]:
inputs = tokenizer(tokenized_dataset['train'][321]['tokens'])
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'])

As before, we can do the same via a Trainer.

In [42]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from datasets import load_metric
from transformers import AutoModelForSequenceClassification

model = get_model(model_name)
model.cuda()

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": t

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [43]:
training_args = get_trainingArgs()

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = get_trainer(model,training_args,tokenized_dataset,tokenizer,data_collator,compute_metrics)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [44]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 15068
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 9420


Epoch,Training Loss,Validation Loss,Accuracy
1,1.1205,1.018119,0.57945
2,0.8327,1.030167,0.57945
3,0.6777,1.103121,0.572282
4,0.5739,1.211983,0.563919
5,0.4997,1.380552,0.557945
6,0.4212,1.429933,0.555556
7,0.3797,1.573027,0.551971
8,0.3477,1.716951,0.549582
9,0.2964,1.788908,0.544803
10,0.2687,1.850411,0.547192


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-942
Configuration saved in ./results/checkpoint-942/config.json
Model weights saved in ./results/checkpoint-942/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-942/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-942/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 

TrainOutput(global_step=9420, training_loss=0.5340723365735096, metrics={'train_runtime': 816.165, 'train_samples_per_second': 184.62, 'train_steps_per_second': 11.542, 'total_flos': 1804236258808320.0, 'train_loss': 0.5340723365735096, 'epoch': 10.0})

In [45]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16


{'epoch': 10.0,
 'eval_accuracy': 0.5794504181600956,
 'eval_loss': 1.0181193351745605,
 'eval_runtime': 1.2094,
 'eval_samples_per_second': 692.063,
 'eval_steps_per_second': 43.822}

Note that we can still fine-tune the model with our training data, but the performance of the model is already quite good without any further training!

In [46]:
trainer.predict(test_dataset=tokenized_dataset["test"])

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 838
  Batch size = 16


PredictionOutput(predictions=array([[ 2.25409   , -2.1270857 ,  1.6971582 ,  0.07195746, -2.58587   ],
       [ 1.8886373 , -2.1731105 ,  2.256175  , -0.07686514, -2.718885  ],
       [ 0.9941894 , -0.44745404, -0.7372538 ,  2.5214257 , -2.5947573 ],
       ...,
       [ 1.7395928 ,  0.24864103, -1.8731446 , -0.8908734 ,  0.561481  ],
       [ 2.2570903 , -0.2970044 , -1.1959621 ,  0.259524  , -1.4488842 ],
       [ 1.7744514 ,  0.4688517 , -0.94809073,  0.6491876 , -2.1379273 ]],
      dtype=float32), label_ids=array([0, 3, 3, 4, 0, 0, 0, 1, 3, 1, 0, 2, 3, 3, 0, 3, 1, 3, 3, 0, 0, 3,
       1, 2, 0, 0, 3, 4, 0, 0, 3, 0, 0, 0, 2, 0, 3, 0, 2, 3, 0, 0, 3, 2,
       4, 0, 3, 4, 0, 0, 3, 3, 1, 0, 2, 0, 3, 0, 4, 3, 2, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 3, 3, 3, 3, 0, 0, 3,
       0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 3, 3, 0, 3, 0, 0, 2,
       0, 2, 3, 1, 0, 3, 0, 0, 0, 4, 3, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 2, 2, 1, 2, 0, 3, 0, 0, 2, 0

In [47]:
trainer.save_model()

Saving model checkpoint to ./results
Configuration saved in ./results/config.json
Model weights saved in ./results/pytorch_model.bin
tokenizer config file saved in ./results/tokenizer_config.json
Special tokens file saved in ./results/special_tokens_map.json


In [48]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer2 = AutoTokenizer.from_pretrained("./results")
model2 = AutoModelForSequenceClassification.from_pretrained("./results", num_labels=5)

Didn't find file ./results/added_tokens.json. We won't load it.
loading file ./results/vocab.txt
loading file ./results/tokenizer.json
loading file None
loading file ./results/special_tokens_map.json
loading file ./results/tokenizer_config.json
loading configuration file ./results/config.json
Model config DistilBertConfig {
  "_name_or_path": "./results",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropo

In [49]:
from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(model=model2, tokenizer=tokenizer2) #, return_all_scores=True)

In [50]:
import torch

inputs = "I consider that this class is great"

# tokenize inputs
tokenized_inputs = tokenizer2(inputs, return_tensors="pt")
print(tokenized_inputs)

# obtain model outputs
outputs = model2(**tokenized_inputs)
print(outputs)

# get the most likely label
labels = ['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy']
prediction = torch.argmax(outputs.logits)
print(labels[prediction])

{'input_ids': tensor([[ 101, 1045, 5136, 2008, 2023, 2465, 2003, 2307,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
SequenceClassifierOutput(loss=None, logits=tensor([[ 1.6240,  1.5464, -1.6092, -0.5582, -0.8779]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
Value


#### Testing for other models

In [51]:
model_name = "YituTech/conv-bert-base"
tokenizer = get_tokenizer(model_name)
model = get_model(model_name)

training_args = get_trainingArgs()

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = get_trainer(model,training_args,tokenized_dataset,tokenizer,data_collator,compute_metrics)
trainer.train()

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/YituTech/conv-bert-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/7651fc6ae3906f28c62923bc7c76b0436327540c1ebb62a60b454ec79e102dd1.2a398d65585c12446cf5e632a1839e1754dc16cbbf6b87ccf28ba24c8536394e
Model config ConvBertConfig {
  "_name_or_path": "YituTech/conv-bert-base",
  "architectures": [
    "ConvBertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "conv_kernel_size": 9,
  "embedding_size": 768,
  "eos_token_id": 2,
  "head_ratio": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "convbert",
  "num_attention_heads": 12,
  "num_groups": 1,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "tr

Epoch,Training Loss,Validation Loss,Accuracy
1,1.1368,0.999008,0.592593
2,0.7995,1.021791,0.583035
3,0.6578,1.09619,0.585424
4,0.5492,1.220061,0.574671
5,0.4747,1.385403,0.557945
6,0.4049,1.473258,0.565114
7,0.3553,1.677509,0.557945
8,0.3318,1.760551,0.554361
9,0.2789,1.876332,0.562724
10,0.247,1.95364,0.55675


The following columns in the evaluation set don't have a corresponding argument in `ConvBertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `ConvBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-942
Configuration saved in ./results/checkpoint-942/config.json
Model weights saved in ./results/checkpoint-942/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-942/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-942/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `ConvBertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `ConvBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Ba

TrainOutput(global_step=9420, training_loss=0.5133514671568657, metrics={'train_runtime': 1808.32, 'train_samples_per_second': 83.326, 'train_steps_per_second': 5.209, 'total_flos': 3449186822939136.0, 'train_loss': 0.5133514671568657, 'epoch': 10.0})

In [52]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `ConvBertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `ConvBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16


{'epoch': 10.0,
 'eval_accuracy': 0.5925925925925926,
 'eval_loss': 0.9990077018737793,
 'eval_runtime': 2.5681,
 'eval_samples_per_second': 325.922,
 'eval_steps_per_second': 20.638}

In [53]:
trainer.predict(test_dataset=tokenized_dataset["test"])

The following columns in the test set don't have a corresponding argument in `ConvBertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `ConvBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 838
  Batch size = 16


PredictionOutput(predictions=array([[ 2.6817122 , -1.2507981 ,  0.5966191 ,  0.5713838 , -2.0820625 ],
       [ 1.0768876 , -1.9822626 ,  2.4556732 , -0.0561501 , -1.8347462 ],
       [ 1.0974739 , -0.6719573 , -0.4694883 ,  3.0028603 , -2.6525857 ],
       ...,
       [ 1.5312034 , -0.7468105 , -1.3785452 , -1.0388649 ,  1.5676636 ],
       [ 2.6586072 , -0.15121448, -0.71098953,  0.49706525, -1.424864  ],
       [ 1.9450687 ,  0.81677234, -1.168717  ,  1.3485212 , -2.1756992 ]],
      dtype=float32), label_ids=array([0, 3, 3, 4, 0, 0, 0, 1, 3, 1, 0, 2, 3, 3, 0, 3, 1, 3, 3, 0, 0, 3,
       1, 2, 0, 0, 3, 4, 0, 0, 3, 0, 0, 0, 2, 0, 3, 0, 2, 3, 0, 0, 3, 2,
       4, 0, 3, 4, 0, 0, 3, 3, 1, 0, 2, 0, 3, 0, 4, 3, 2, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 3, 3, 3, 3, 0, 0, 3,
       0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 3, 3, 0, 3, 0, 0, 2,
       0, 2, 3, 1, 0, 3, 0, 0, 0, 4, 3, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 2, 2, 1, 2, 0, 3, 0, 0, 2, 0

In [54]:
trainer.save_model()

Saving model checkpoint to ./results
Configuration saved in ./results/config.json
Model weights saved in ./results/pytorch_model.bin
tokenizer config file saved in ./results/tokenizer_config.json
Special tokens file saved in ./results/special_tokens_map.json


In [55]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer2 = AutoTokenizer.from_pretrained("./results")
model2 = AutoModelForSequenceClassification.from_pretrained("./results", num_labels=5)

from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(model=model2, tokenizer=tokenizer2) #, return_all_scores=True)

import torch

inputs = "I consider that this class is great"

# tokenize inputs
tokenized_inputs = tokenizer2(inputs, return_tensors="pt")
print(tokenized_inputs)

# obtain model outputs
outputs = model2(**tokenized_inputs)
print(outputs)

# get the most likely label
labels = ['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy']
prediction = torch.argmax(outputs.logits)
print(labels[prediction])

Didn't find file ./results/added_tokens.json. We won't load it.
loading file ./results/vocab.txt
loading file None
loading file ./results/special_tokens_map.json
loading file ./results/tokenizer_config.json
loading configuration file ./results/config.json
Model config ConvBertConfig {
  "_name_or_path": "./results",
  "architectures": [
    "ConvBertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "conv_kernel_size": 9,
  "embedding_size": 768,
  "eos_token_id": 2,
  "head_ratio": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_ty

{'input_ids': tensor([[ 101, 1045, 4632, 1504, 1519, 1961, 1499, 1803,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
SequenceClassifierOutput(loss=None, logits=tensor([[ 2.0991, -1.7663,  1.1772,  1.3581, -2.6614]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
Value
