In this notebook we will explore [Hugging Face Transformers](https://huggingface.co/docs/transformers/index).
You may also want to check the [Hugging Face course](https://huggingface.co/course/), which explains how to use this technology in much greater depth.

Training transformer models is computationally expensive. Hugging Face makes available several pretrained [models](https://huggingface.co/models) that can be used as is or fine-tuned for a specific NLP task, such as sentence classification. That's what we'll do in this notebook.

Hugging Face also makes available several [datasets](https://huggingface.co/datasets) that can be used to train or fine-tune a model.

See:
- https://huggingface.co/docs/transformers/tasks/sequence_classification#preprocess
- https://huggingface.co/docs/transformers/training#prepare-a-dataset
- https://huggingface.co/docs/transformers/accelerate
- https://huggingface.co/docs/transformers/model_summary#autoencoding-models

In [1]:
!pip install pandas
!pip install datasets
!pip install transformers
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/cu113


## Preparing our Data

In this notebook, we'll start by using a local dataset (instead of using a dataset stored at Hugging Face).
Let's load data for our classification task.

### Loading dataset

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import pandas as pd

# Importing the dataset
dataset = pd.read_excel('/content/drive/Shareddrives/PLN/Assignment 2/data/OpArticles_ADUs.xlsx')
dataset = dataset.drop(columns=['article_id', 'annotator', 'node','ranges'])
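# Encode the string labels as integers (assign the result back rather than using inplace=True on a column)
dataset['label'] = dataset['label'].replace(['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy'], [0, 1, 2, 3, 4])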

print(dataset.info())
print(dataset.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16743 entries, 0 to 16742
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tokens  16743 non-null  object
 1   label   16743 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 261.7+ KB
None
                                              tokens  label
0           O facto não é apenas fruto da ignorância      0
1  havia no seu humor mais jornalismo (mais inves...      0
2                              É tudo cómico na FIFA      0
3  o que todos nós permitimos que esta organizaçã...      0
4            não nos fazem rir à custa dos poderosos      0
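The integer codes above replace the original class names, and the model config printed later in this notebook only shows generic names such as `LABEL_0`. A minimal sketch of keeping the mapping in one place so that predictions can be decoded back to class names; `LABELS`, `LABEL2ID` and `ID2LABEL` are illustrative names, not part of the original notebook.

```python
# Class names in the same order as the integer codes assigned above.
LABELS = ['Value', 'Value(+)', 'Value(-)', 'Fact', 'Policy']

# Hypothetical helper mappings (not defined in the original notebook).
LABEL2ID = {name: idx for idx, name in enumerate(LABELS)}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}

print(LABEL2ID['Fact'])   # integer code used for the 'Fact' class
print(ID2LABEL[4])        # class name behind integer code 4
```

If desired, the same dictionaries can be passed as `id2label=ID2LABEL, label2id=LABEL2ID` when loading the model with `from_pretrained`, so that the saved config carries readable names.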


For ease of usage with Transformer models, we convert the dataset into a Hugging Face dataset and split it into train, validation and test sets.

In [4]:
from datasets import Dataset

dataset_hf = Dataset.from_pandas(dataset)

In [5]:
from datasets import DatasetDict

# 90% train, 10% test+validation
train_test = dataset_hf.train_test_split(test_size=0.1, shuffle=True, seed=42)

# Split the 10% test+validation set in half test, half validation
valid_test = train_test['test'].train_test_split(test_size=0.5, shuffle=True, seed=42)

# Gather the three splits into a single DatasetDict
train_valid_test_dataset = DatasetDict({
    'train': train_test['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']
})

In [6]:
train_valid_test_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'label'],
        num_rows: 15068
    })
    validation: Dataset({
        features: ['tokens', 'label'],
        num_rows: 837
    })
    test: Dataset({
        features: ['tokens', 'label'],
        num_rows: 838
    })
})
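The split sizes printed above can be reproduced by hand. The sketch below assumes the test portion is rounded up and the train portion rounded down, which is consistent with the numbers shown; `split_sizes` is a hypothetical helper, not a function from the `datasets` library.

```python
import math

def split_sizes(n_rows, test_size):
    # Assumed rounding: test portion rounded up, train portion rounded down.
    n_test = math.ceil(n_rows * test_size)
    n_train = math.floor(n_rows * (1 - test_size))
    return n_train, n_test

# First split: 90% train, 10% held out.
n_train, n_holdout = split_sizes(16743, 0.1)

# Second split: the held-out part is halved into validation and test.
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_valid, n_test)  # 15068 837 838
```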

## Fine-tuning a pretrained model

### Tokenizer

We first load the tokenizer for our model:

In [7]:
from transformers import AutoTokenizer

def get_tokenizer(name):
    return AutoTokenizer.from_pretrained(name)

Now we need to [preprocess](https://huggingface.co/docs/transformers/preprocessing) our data.

Obtain the length, in whitespace-separated words, of the longest sequence in each data split:

In [8]:
def find_max_length(dataset):
    # Number of whitespace-separated words in the longest text
    return max(len(x.split()) for x in dataset)

train_max_length = find_max_length(train_valid_test_dataset["train"]["tokens"])
val_max_length = find_max_length(train_valid_test_dataset["validation"]["tokens"])
test_max_length = find_max_length(train_valid_test_dataset["test"]["tokens"])

print(f"Longest sequence in train set has {train_max_length} words")
print(f"Longest sequence in val set has {val_max_length} words")
print(f"Longest sequence in test set has {test_max_length} words")

Longest sequence in train set has 81 words
Longest sequence in val set has 81 words
Longest sequence in test set has 56 words
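Note that these lengths are counted in words, while a BERT tokenizer works on subword pieces and adds special tokens, so a sequence usually has more tokens than words. The following toy sketch illustrates the gap with a greedy longest-match split in the style of WordPiece; the vocabulary is made up for illustration and is not the vocabulary of any real model.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of a single word (toy WordPiece)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Continuation pieces are marked with a leading '##'.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:          # no piece matched: unknown word
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Made-up vocabulary, for illustration only.
toy_vocab = {"humor", "jornal", "##ismo", "organiza", "##ção"}

words = "humor jornalismo organização".split()
tokens = ["[CLS]"] + [p for w in words for p in wordpiece(w, toy_vocab)] + ["[SEP]"]

print(len(words), len(tokens))  # 3 7
print(tokens)
```

With a real tokenizer the ratio differs, but the direction is the same: a text with 81 words may need more than 81 tokens.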


Tokenize the entire dataset:

In [9]:
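# Placeholder: the tokenizer is assigned later, inside train_model() and check_metrics()
tokenizer = None

def tokenize_dataset(sample):
    # max_length counts subword tokens (including special tokens), not words
    return tokenizer(sample["tokens"], truncation=True, max_length=81, padding="max_length")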

def get_tokenized_data(dataset):
    return dataset.map(tokenize_dataset, batched=True)
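To make the effect of `truncation=True` together with `padding="max_length"` concrete, here is a toy sketch that pads or cuts a list of token ids to a fixed length and builds the matching attention mask. `pad_or_truncate` is a hypothetical helper and the ids are made up; a real tokenizer also keeps the closing special token when truncating, which this simplified version does not.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Toy version of truncation=True with padding='max_length'."""
    ids = input_ids[:max_length]          # cut sequences that are too long
    mask = [1] * len(ids)                 # 1 marks real tokens
    n_pad = max_length - len(ids)         # 0 if nothing needs padding
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,   # 0 marks padding
    }

short = pad_or_truncate([101, 7, 8, 102], max_length=6)
long = pad_or_truncate(list(range(1, 10)), max_length=6)

print(short["input_ids"], short["attention_mask"])
print(long["input_ids"], long["attention_mask"])
```

Because every example is already padded to the same length at this stage, the `DataCollatorWithPadding` used later has nothing left to pad.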

### Loading the model

Since we want to use the model for classification, we should load it with an appropriate classification head:

In [10]:
from transformers import AutoModelForSequenceClassification
import torch

def get_model(name):
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=5)
    model.cuda() # Use GPU

    return model

### Fine-tuning

The next step is to [fine-tune](https://huggingface.co/docs/transformers/training) the model with our train data. To do so, we can make use of a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer).
There are several aspects of training that you can specify via [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

In [11]:
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from datasets import load_metric
import numpy as np

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def get_trainingArgs():
    return TrainingArguments(
        output_dir="./results",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
        evaluation_strategy="epoch", # run validation at the end of each epoch
        save_strategy="epoch",
        load_best_model_at_end=True,
    )

def get_trainer(model_, args_, dataset_, tokenizer_, data_collator_, compute_metrics_):
    return Trainer(
        model=model_,
        args=args_,
        train_dataset=dataset_["train"],
        eval_dataset=dataset_["validation"],
        tokenizer=tokenizer_,
        data_collator=data_collator_,
        compute_metrics=compute_metrics_
    )

#### Train, evaluate, predict, save

In [12]:
from IPython.display import display

def train_model(model_name):
  global tokenizer
  tokenizer = get_tokenizer(model_name)
  tokenized_dataset = get_tokenized_data(train_valid_test_dataset)
  model = get_model(model_name)

  trainer = get_trainer(
    model,
    get_trainingArgs(),
    tokenized_dataset,
    tokenizer,
    DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics
    )
  
  # Train Model
  display(trainer.train())

  # Check performance in validation set
  display(trainer.evaluate())

  # Check how the model fares in our test set.
  display(trainer.predict(test_dataset=tokenized_dataset["test"]))

  # Save model for future use
  trainer.save_model('/content/drive/Shareddrives/PLN/Assignment 2/models/' + model_name)
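The functions above share the tokenizer through a module-level `global`. One possible alternative, sketched here under the assumption that the tokenizer is a plain callable, is to bind it in a closure and pass the resulting function to `dataset.map`. `make_tokenize_fn` and `fake_tokenizer` are hypothetical names; the stand-in tokenizer exists only so the sketch runs without downloading a model.

```python
def make_tokenize_fn(tokenizer, max_length=81):
    """Return a function suitable for dataset.map(..., batched=True)."""
    def tokenize_batch(batch):
        return tokenizer(batch["tokens"], truncation=True,
                         max_length=max_length, padding="max_length")
    return tokenize_batch

# Stand-in tokenizer (assumption): reports word counts capped at max_length.
def fake_tokenizer(texts, truncation, max_length, padding):
    return {"n_words": [min(len(t.split()), max_length) for t in texts]}

tokenize = make_tokenize_fn(fake_tokenizer, max_length=3)
result = tokenize({"tokens": ["a b", "a b c d e"]})
print(result)
```

With a real tokenizer, `dataset.map(make_tokenize_fn(tokenizer), batched=True)` would play the role of `get_tokenized_data`, without relying on global state.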

In [13]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

def check_metrics(model_name):
  global tokenizer
  tokenizer = AutoTokenizer.from_pretrained('/content/drive/Shareddrives/PLN/Assignment 2/models/' + model_name)
  tokenized_dataset = get_tokenized_data(train_valid_test_dataset)
  model = AutoModelForSequenceClassification.from_pretrained('/content/drive/Shareddrives/PLN/Assignment 2/models/' + model_name, num_labels=5)

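  y_pred = []
  with torch.no_grad():  # inference only, so skip gradient tracking
    for p in tokenized_dataset['test']['tokens']:
        ti = tokenizer(p, return_tensors="pt")
        out = model(**ti)
        pred = torch.argmax(out.logits, dim=-1).item()  # plain int, as sklearn expects
        y_pred.append(pred)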

  y_test = tokenized_dataset['test']['label']

  print("\n***** Metrics *****")
  print(confusion_matrix(y_test, y_pred))
  print('Accuracy: ', accuracy_score(y_test, y_pred))
  print('Precision: ', precision_score(y_test, y_pred, average='macro'))
  print('Recall: ', recall_score(y_test, y_pred, average='macro'))
  print('F1: ', f1_score(y_test, y_pred, average='macro'))

### BERT - neuralmind/bert-base-portuguese-cased

As a starting example, we'll use a lighter BERT-based model. We will need to load:
- the [tokenizer](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer) (which is used to [preprocess](https://huggingface.co/docs/transformers/preprocessing) the data before it can be used by the model)
- the [model](https://huggingface.co/docs/transformers/autoclass_tutorial#automodel) itself

In [14]:
model_name = "neuralmind/bert-base-portuguese-cased"

In [15]:
train_model(model_name)

  0%|          | 0/16 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 1.0798 | 0.883323 | 0.630824 |
| 2 | 0.7896 | 0.8996 | 0.62724 |
| 3 | 0.6351 | 0.976512 | 0.633214 |


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-942
Configuration saved in ./results/checkpoint-942/config.json
Model weights saved in ./results/checkpoint-942/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-942/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-942/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16
Sa

TrainOutput(global_step=2826, training_loss=0.8115092909935089, metrics={'train_runtime': 735.8232, 'train_samples_per_second': 61.433, 'train_steps_per_second': 3.841, 'total_flos': 1881666784115928.0, 'train_loss': 0.8115092909935089, 'epoch': 3.0})
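The step counts in the log (`checkpoint-942`, `global_step=2826`) follow from the training set size and batch size. A quick check, assuming a single device and no gradient accumulation:

```python
import math

train_rows = 15068   # size of the train split
batch_size = 16      # per_device_train_batch_size
epochs = 3           # num_train_epochs

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 942 2826
```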

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16


{'epoch': 3.0,
 'eval_accuracy': 0.6308243727598566,
 'eval_loss': 0.883323073387146,
 'eval_runtime': 4.2545,
 'eval_samples_per_second': 196.731,
 'eval_steps_per_second': 12.457}
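The evaluation above reports the epoch 1 numbers rather than those of the last epoch. This is consistent with `load_best_model_at_end=True`: when no `metric_for_best_model` is given, the best checkpoint is selected by validation loss. The sketch below replays that selection on the values from the training table; the `history` list is copied by hand from the output and is not produced by the `Trainer`.

```python
# Per-epoch validation results, copied from the training table above.
history = [
    {"epoch": 1, "eval_loss": 0.883323, "eval_accuracy": 0.630824},
    {"epoch": 2, "eval_loss": 0.8996,   "eval_accuracy": 0.62724},
    {"epoch": 3, "eval_loss": 0.976512, "eval_accuracy": 0.633214},
]

# Default criterion: lowest validation loss.
best_by_loss = min(history, key=lambda r: r["eval_loss"])

# Alternative criterion: highest validation accuracy.
best_by_acc = max(history, key=lambda r: r["eval_accuracy"])

print(best_by_loss["epoch"], best_by_acc["epoch"])  # 1 3
```

To select by accuracy instead, `TrainingArguments` accepts `metric_for_best_model="accuracy"`; whether that is preferable here is a judgment call, since the accuracy differences between epochs are small.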

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 838
  Batch size = 16


PredictionOutput(predictions=array([[ 1.7813975 , -0.37949938,  0.21879035,  1.3190378 , -2.9983008 ],
       [ 2.079851  , -1.9285269 ,  1.7446489 ,  0.4217967 , -2.88921   ],
       [ 1.2571154 ,  1.6242653 , -1.6400881 ,  1.313356  , -2.1580842 ],
       ...,
       [ 1.6846246 , -1.8602873 ,  2.048341  ,  0.42598867, -2.7963405 ],
       [ 2.1397846 , -1.0519289 , -0.11578311,  1.7926372 , -2.928538  ],
       [ 2.0890992 ,  0.76624304, -1.65464   ,  0.720956  , -1.7319709 ]],
      dtype=float32), label_ids=array([3, 2, 0, 0, 2, 0, 0, 2, 0, 1, 3, 0, 4, 2, 3, 0, 0, 0, 0, 0, 0, 0,
       3, 1, 0, 0, 2, 0, 0, 0, 1, 0, 3, 3, 3, 3, 1, 4, 0, 3, 0, 0, 0, 3,
       1, 3, 3, 0, 3, 1, 3, 2, 4, 3, 2, 1, 0, 0, 3, 0, 3, 0, 0, 2, 1, 1,
       2, 2, 0, 3, 0, 0, 0, 0, 0, 0, 0, 2, 0, 3, 2, 3, 0, 1, 0, 3, 0, 0,
       3, 0, 2, 2, 2, 0, 3, 0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 3, 4, 0, 2, 3,
       3, 2, 3, 3, 0, 1, 3, 3, 1, 0, 3, 1, 0, 2, 3, 1, 3, 2, 3, 1, 0, 3,
       0, 0, 1, 0, 0, 2, 0, 2, 3, 0, 3, 0, 0
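The `predictions` array holds raw logits, one row per example and one column per class. As a small illustration, the sketch below takes the first three rows and labels shown above, converts the logits to probabilities with a softmax, and picks the predicted class with an argmax. It uses plain Python so that it runs without the model.

```python
import math

def softmax(logits):
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First three rows of `predictions` and of `label_ids` from the output above.
logits = [
    [1.7813975, -0.37949938, 0.21879035, 1.3190378, -2.9983008],
    [2.079851, -1.9285269, 1.7446489, 0.4217967, -2.88921],
    [1.2571154, 1.6242653, -1.6400881, 1.313356, -2.1580842],
]
labels = [3, 2, 0]

# Argmax over the class dimension gives the predicted class id.
pred_ids = [max(range(len(row)), key=row.__getitem__) for row in logits]
probs = [softmax(row) for row in logits]

print(pred_ids)  # [0, 0, 1]
print([round(max(p), 3) for p in probs])
```

None of these three examples is classified correctly, which is a reminder that a handful of rows says little about overall accuracy.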

Saving model checkpoint to /content/drive/Shareddrives/PLN/Assignment 2/models/neuralmind/bert-base-portuguese-cased
Configuration saved in /content/drive/Shareddrives/PLN/Assignment 2/models/neuralmind/bert-base-portuguese-cased/config.json
Model weights saved in /content/drive/Shareddrives/PLN/Assignment 2/models/neuralmind/bert-base-portuguese-cased/pytorch_model.bin
tokenizer config file saved in /content/drive/Shareddrives/PLN/Assignment 2/models/neuralmind/bert-base-portuguese-cased/tokenizer_config.json
Special tokens file saved in /content/drive/Shareddrives/PLN/Assignment 2/models/neuralmind/bert-base-portuguese-cased/special_tokens_map.json


In [16]:
!rm -rf ./results/

In [17]:
check_metrics(model_name)

Didn't find file /content/drive/Shareddrives/PLN/Assignment 2/models/neuralmind/bert-base-portuguese-cased/added_tokens.json. We won't load it.
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/neuralmind/bert-base-portuguese-cased/vocab.txt
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/neuralmind/bert-base-portuguese-cased/tokenizer.json
loading file None
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/neuralmind/bert-base-portuguese-cased/special_tokens_map.json
loading file /content/drive/Shareddrives/PLN/Assignment 2/models/neuralmind/bert-base-portuguese-cased/tokenizer_config.json


  0%|          | 0/16 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

loading configuration file /content/drive/Shareddrives/PLN/Assignment 2/models/neuralmind/bert-base-portuguese-cased/config.json
Model config BertConfig {
  "_name_or_path": "/content/drive/Shareddrives/PLN/Assignment 2/models/neuralmind/bert-base-portuguese-cased",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,


***** Metrics *****
[[295  44  34  36   7]
 [ 27  46   1   2   1]
 [ 53   1  79   7   1]
 [ 55  15  14  92   0]
 [  9   4   0   0  15]]
Accuracy:  0.6288782816229117
Precision:  0.6007767883325045
Recall:  0.5850524918344068
F1:  0.5868360371593153
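The macro-averaged scores can be recomputed from the confusion matrix alone, which also shows what `average='macro'` means: each metric is computed per class and then averaged with equal weight. The sketch assumes the scikit-learn convention of rows as true labels and columns as predicted labels.

```python
# Confusion matrix from the output above (rows: true class, columns: predicted class).
cm = [
    [295, 44, 34, 36, 7],
    [27, 46, 1, 2, 1],
    [53, 1, 79, 7, 1],
    [55, 15, 14, 92, 0],
    [9, 4, 0, 0, 15],
]
n = len(cm)
total = sum(sum(row) for row in cm)

tp = [cm[i][i] for i in range(n)]                                 # correct predictions per class
true_counts = [sum(cm[i]) for i in range(n)]                      # row sums
pred_counts = [sum(cm[r][c] for r in range(n)) for c in range(n)] # column sums

accuracy = sum(tp) / total
precision = [tp[i] / pred_counts[i] for i in range(n)]
recall = [tp[i] / true_counts[i] for i in range(n)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

def macro(values):
    # Unweighted mean over classes.
    return sum(values) / len(values)

print('Accuracy: ', accuracy)
print('Precision: ', macro(precision))
print('Recall: ', macro(recall))
print('F1: ', macro(f1))
```

The per-class lists are also informative on their own: class 1 (`Value(+)`) has the lowest precision, since many examples predicted as class 1 actually belong to class 0.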


### BERT - neuralmind/bert-large-portuguese-cased

In [18]:
model_name = "neuralmind/bert-large-portuguese-cased"

In [None]:
train_model(model_name)

loading configuration file https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/c534071830642050813fa94003dbf1234413b3f1d5dc66d259fbc82ff7d5fd59.c8340a82acfbbcd2dd960b86d2886ee120b21896ef0294150f0391918ae6ced5
Model config BertConfig {
  "_name_or_path": "neuralmind/bert-large-portuguese-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_t

  0%|          | 0/16 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

loading configuration file https://huggingface.co/neuralmind/bert-large-portuguese-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/c534071830642050813fa94003dbf1234413b3f1d5dc66d259fbc82ff7d5fd59.c8340a82acfbbcd2dd960b86d2886ee120b21896ef0294150f0391918ae6ced5
Model config BertConfig {
  "_name_or_path": "neuralmind/bert-large-portuguese-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 1.01 | 0.830431 | 0.658303 |
| 2 | 0.7183 | 0.894269 | 0.630824 |


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-942
Configuration saved in ./results/checkpoint-942/config.json
Model weights saved in ./results/checkpoint-942/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-942/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-942/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 837
  Batch size = 16
Sa

In [None]:
!rm -rf ./results/

In [None]:
check_metrics(model_name)