# Reference notebook

Name: Fabio Grassiotto RA 890441

## Instructions:

Train a BERT model (or a variant) for binary classification on the IMDB dataset (20k/5k training/validation samples) and measure its accuracy.

Important:
- You must implement your own training loop.
- Implement gradient accumulation.
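
As a minimal, framework-free sketch of the gradient-accumulation requirement above: the gradients of several small micro-batches are summed, each scaled by one over the number of micro-batches, and the optimizer steps only once afterwards. In PyTorch this corresponds to calling `(loss / k).backward()` for `k` consecutive batches before a single `optimizer.step()` and `optimizer.zero_grad()`, since `backward()` adds into `.grad`. The toy model, the data, and the helper names below are assumptions made for illustration only.

```python
# Toy model: y_hat = w * x with a mean-squared-error loss (illustrative only).
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1 / number of micro-batches."""
    chunks = [
        (xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
        for i in range(0, len(xs), micro_batch_size)
    ]
    total = 0.0
    for cx, cy in chunks:
        # Mirrors (loss / len(chunks)).backward() adding into .grad
        total += grad_mse(w, cx, cy) / len(chunks)
    return total

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

full = grad_mse(w, xs, ys)                               # one batch of 8
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)  # 4 micro-batches of 2
print(full, accum)
```

With equally sized micro-batches the two gradients agree up to floating-point rounding, so accumulation trades extra forward/backward passes for a larger effective batch size at the memory cost of the small one.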

Tips:
- BERT usually learns a task well in a few epochs (3 to 5). If it takes more than 5 epochs to reach 80% accuracy, adjust the hyperparameters.

- Fix for out-of-memory errors:
  - Using bfloat16 lets you nearly double the batch size.
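
The arithmetic behind the bfloat16 tip can be sketched as follows. A float32 element takes 4 bytes and a bfloat16 element takes 2, so an activation tensor of the same shape costs half the memory. The helper name and the example shape are assumptions for illustration: 512 is the `max_length` used later in this notebook, 768 is the BERT-base hidden size, and 20 is the batch size used in the custom loop.

```python
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}

def activation_bytes(batch_size, seq_len, hidden_size, dtype):
    """Bytes needed for one (batch, seq_len, hidden) activation tensor."""
    return batch_size * seq_len * hidden_size * BYTES_PER_ELEMENT[dtype]

seq_len, hidden = 512, 768

fp32 = activation_bytes(20, seq_len, hidden, "float32")   # batch of 20 in float32
bf16 = activation_bytes(40, seq_len, hidden, "bfloat16")  # batch of 40 in bfloat16
print(fp32, bf16)
```

Both calls report the same byte count, which is why the batch size can roughly double. The gain is "nearly" rather than exactly 2x because not everything shrinks: for example, under mixed precision the weights and optimizer state may still be held in float32.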

Optional:
- You may use the Trainer class from the Transformers/HuggingFace library to check that your training loop is correct. Note that implementing your own loop is still mandatory.

## Imports

In [1]:
%pip install transformers[torch]
%pip install accelerate

Note: you may need to restart the kernel to use updated packages.


In [19]:
import random
import os
import sys
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import numpy as np
import time
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score

## Global variables and initialization

In [3]:
# Seed
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)

# Training 
TRAIN_HUGFACE = False
TRAIN_LOOP = True

# Colab environment
IN_COLAB = 'google.colab' in sys.modules

if (IN_COLAB):
    # Google Drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

    project_folder="/content/drive/MyDrive/Classes/IA024/Aula_4_5"
    os.chdir(project_folder)
    !ls -la

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


## Preparing the data

First, we download the dataset:

In [4]:
if not os.path.exists("aclImdb.tgz"):
    print("Downloading Imdb dataset")
    !wget -nc http://files.fast.ai/data/aclImdb.tgz
    !tar -xzf aclImdb.tgz

## Loading the dataset

We artificially create a training split (20k examples) and a validation split (5k examples).

In [5]:
import os

max_valid = 5000

def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        with open(os.path.join(folder, path), encoding='utf-8') as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
# Labels: 1 = positive, 0 = negative
y_train = [1] * len(x_train_pos) + [0] * len(x_train_neg)
y_test = [1] * len(x_test_pos) + [0] * len(x_test_neg)

# Shuffle the training set before making the train/validation split.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

print(len(x_train), 'training samples.')
print(len(x_valid), 'validation samples.')
print(len(x_test), 'test samples.')

print('First 3 training samples:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('Last 3 training samples:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('First 3 validation samples:')
for x, y in zip(x_valid[:3], y_valid[:3]):
    print(y, x[:100])

print('Last 3 validation samples:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

20000 training samples.
5000 validation samples.
25000 test samples.
First 3 training samples:
0 POSSIBLE SPOILERS<br /><br />The Spy Who Shagged Me is a muchly overrated and over-hyped sequel. Int
0 The long list of "big" names in this flick (including the ubiquitous John Mills) didn't bowl me over
1 Bette Midler showcases her talents and beauty in "Diva Las Vegas". I am thrilled that I taped it and
Last 3 training samples:
0 I was previously unaware that in the early 1990's Devry University (or was it ITT Tech?) added Film 
1 The story and music (George Gershwin!) are wonderful, as are Levant, Guetary, Foch, and, of course, 
1 This is my favorite show. I think it is utterly brilliant. Thanks to David Chase for bringing this i
First 3 validation samples:
1 Why has this not been released? I kind of thought it must be a bit rubbish since it hasn't been. How
1 I was amazingly impressed by this movie. It contained fundamental elements of depression, grief, lon
1 p

In [6]:
print(x_train[0])
print(y_train[0])

POSSIBLE SPOILERS<br /><br />The Spy Who Shagged Me is a muchly overrated and over-hyped sequel. International Man of Mystery came straight out of the blue. It was a lone star that few people had heard of. But it was stunningly original, had sophisticated humour and ample humour, always kept in good taste, and had a brilliant cast. The Spy Who Shagged Me was a lot more commercially advertised and hyped about.<br /><br />OK I'll admit, the first time I saw this film I thought it was very funny, but it's only after watching it two or three times that you see all the flaws. The acting was OK, but Heather Graham cannot act. Her performance didn't seem very convincing and she wasn't near as good as Liz Hurley was in the first one. Those characters who bloomed in the first one, (Scott Evil, Number 2 etc.) are thrown into the background hear and don't get many stand-alone scenes. The film is simply overrun with cameos.<br /><br />In particular, I hated the way they totally disregarded some of

# Tokenizing the dataset
### Note: The code below is based on the tutorial "How to Fine Tune BERT for Text Classification using Transformers in Python" at https://thepythoncode.com/article/finetuning-bert-using-huggingface-transformers-python. The model was fine-tuned on a GPU.

In [7]:
# Bert base model
model_name = "bert-base-uncased"
# max sequence length
max_length = 512
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

In [8]:
train_encodings = tokenizer(x_train, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(x_valid, truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(x_test, truncation=True, padding=True, max_length=max_length)

In [9]:
print(train_encodings[0])

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
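
The encoding above reports 512 tokens because `padding=True` pads every sequence to the longest one in the call, while `truncation=True` caps sequences at `max_length`; presumably at least one review reaches that cap. A toy sketch of the idea follows. `pad_or_truncate` is a hypothetical helper, not part of the Transformers API, and it only mimics the visible effect on `input_ids` and `attention_mask` (101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    if len(token_ids) > max_length:
        # Keep the closing special token when cutting a long sequence
        ids = token_ids[:max_length - 1] + [token_ids[-1]]
    else:
        ids = list(token_ids)
    mask = [1] * len(ids)            # 1 marks real tokens
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the batches shown later end in runs of zeros in both tensors.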


# Dataset class

In [10]:
class ImdbDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = ImdbDataset(train_encodings, y_train)
valid_dataset = ImdbDataset(valid_encodings, y_valid)
test_dataset = ImdbDataset(test_encodings, y_test)

# HuggingFace model

In [21]:
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)

# saving the pre-trained model and tokenizer
pretrained_path = "model/pretrained"
model.save_pretrained(pretrained_path)
tokenizer.save_pretrained(pretrained_path)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


('model/pretrained\\tokenizer_config.json',
 'model/pretrained\\special_tokens_map.json',
 'model/pretrained\\vocab.txt',
 'model/pretrained\\added_tokens.json',
 'model/pretrained\\tokenizer.json')

# Training the model

## With a custom training loop

In [16]:
batch_size = 20
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)  # no need to shuffle for evaluation

sample = next(iter(train_loader))
print(sample)

{'input_ids': tensor([[  101,  1045,  2001,  ...,     0,     0,     0],
        [  101,  1045,  2018,  ...,     0,     0,     0],
        [  101, 13970,  7352,  ...,     0,     0,     0],
        ...,
        [  101,  1037,  2186,  ...,     0,     0,     0],
        [  101,  2026,  2034,  ...,     0,     0,     0],
        [  101,  2023,  2003,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[1],
        [1],
        [1],
        [0],
        [0],
        [1],
        [0],
        [1],
        [1],
        [0],
        [

In [17]:
epochs = 3

def train_loop(model, criterion, optimizer, device):
      # Training loop
      model.train()
      for epoch in range(epochs):

            epoch_start = time.time()
            # Metrics
            epoch_loss = 0
            epoch_correct = 0
            epoch_samples = 0

            for batch in train_loader:
                  # Each batch is a dict of tensors (input_ids, token_type_ids,
                  # attention_mask, labels), not an (inputs, targets) pair.
                  batch = {k: v.to(device) for k, v in batch.items()}
                  targets = batch.pop("labels").view(-1)  # [batch, 1] -> [batch]

                  # Forward pass
                  logits = model(**batch).logits
                  loss = criterion(logits, targets)

                  # Backward pass and optimization
                  optimizer.zero_grad()
                  loss.backward()
                  optimizer.step()

                  # Loss
                  epoch_loss += loss.item()

                  # Predicted
                  _, predicted = torch.max(logits, 1)
                  epoch_correct += (predicted == targets).sum().item()
                  epoch_samples += targets.size(0)

            # Calculate average loss and accuracy for epoch
            avg_loss = epoch_loss / len(train_loader)
            acc = epoch_correct / epoch_samples

            # Perplexity
            perp = torch.exp(torch.tensor(avg_loss))

            epoch_end = time.time()
            epoch_time = epoch_end - epoch_start
            # Print epoch statistics
            print(f'Epoch [{epoch+1}/{epochs}], Time:{epoch_time:.2f}, Loss: {avg_loss:.4f}, Accuracy: {acc:.2%}, Perplexity: {perp:.4f}')

In [20]:
model_loop = BertForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)

# Learning rate: 1e-3 is generally too high for fine-tuning BERT;
# 5e-5 matches the peak rate logged by the Trainer run below.
lr = 5e-5

# Cross Entropy
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.AdamW(model_loop.parameters(), lr=lr)

train_loop(model_loop, criterion, optimizer, device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

### Save the weights of the model (trained with the custom loop) and the tokenizer.

In [None]:
# saving the fine tuned model and tokenizer
loop_path = "model/loop"
model_loop.save_pretrained(loop_path)
tokenizer.save_pretrained(loop_path)

## With the HuggingFace library

In [13]:
def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # calculate accuracy using sklearn's function
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    save_strategy="epoch",           # save a checkpoint at the end of each epoch
    evaluation_strategy="epoch",     # evaluate each epoch
)

trainer = Trainer(
    model=model,                         # Model
    args=training_args,                  # training args
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)

trainer.train()




{'loss': 0.4633, 'grad_norm': 5.360031604766846, 'learning_rate': 5e-05, 'epoch': 0.2}
{'loss': 0.37, 'grad_norm': 3.8223471641540527, 'learning_rate': 4.642857142857143e-05, 'epoch': 0.4}
{'loss': 0.3632, 'grad_norm': 6.134415626525879, 'learning_rate': 4.2857142857142856e-05, 'epoch': 0.6}
{'loss': 0.3172, 'grad_norm': 3.2651147842407227, 'learning_rate': 3.928571428571429e-05, 'epoch': 0.8}
{'loss': 0.2884, 'grad_norm': 7.221029281616211, 'learning_rate': 3.571428571428572e-05, 'epoch': 1.0}



{'eval_loss': 0.3110387325286865, 'eval_accuracy': 0.8846, 'eval_runtime': 182.5949, 'eval_samples_per_second': 27.383, 'eval_steps_per_second': 1.369, 'epoch': 1.0}
{'loss': 0.2182, 'grad_norm': 0.07648228108882904, 'learning_rate': 3.2142857142857144e-05, 'epoch': 1.2}
{'loss': 0.1884, 'grad_norm': 0.6571389436721802, 'learning_rate': 2.857142857142857e-05, 'epoch': 1.4}
{'loss': 0.1835, 'grad_norm': 14.305499076843262, 'learning_rate': 2.5e-05, 'epoch': 1.6}
{'loss': 0.1949, 'grad_norm': 0.09540243446826935, 'learning_rate': 2.1428571428571428e-05, 'epoch': 1.8}
{'loss': 0.1797, 'grad_norm': 3.4848520755767822, 'learning_rate': 1.785714285714286e-05, 'epoch': 2.0}



{'eval_loss': 0.34693506360054016, 'eval_accuracy': 0.9244, 'eval_runtime': 180.5322, 'eval_samples_per_second': 27.696, 'eval_steps_per_second': 1.385, 'epoch': 2.0}
{'loss': 0.0825, 'grad_norm': 0.028008712455630302, 'learning_rate': 1.4285714285714285e-05, 'epoch': 2.2}
{'loss': 0.0765, 'grad_norm': 60.4325065612793, 'learning_rate': 1.0714285714285714e-05, 'epoch': 2.4}
{'loss': 0.0907, 'grad_norm': 0.04618758708238602, 'learning_rate': 7.142857142857143e-06, 'epoch': 2.6}
{'loss': 0.0639, 'grad_norm': 0.021327907219529152, 'learning_rate': 3.5714285714285714e-06, 'epoch': 2.8}
{'loss': 0.088, 'grad_norm': 0.028519421815872192, 'learning_rate': 0.0, 'epoch': 3.0}



{'eval_loss': 0.3816882371902466, 'eval_accuracy': 0.9286, 'eval_runtime': 180.5304, 'eval_samples_per_second': 27.696, 'eval_steps_per_second': 1.385, 'epoch': 3.0}
{'train_runtime': 1996.9425, 'train_samples_per_second': 30.046, 'train_steps_per_second': 3.756, 'train_loss': 0.21122881825764975, 'epoch': 3.0}


TrainOutput(global_step=7500, training_loss=0.21122881825764975, metrics={'train_runtime': 1996.9425, 'train_samples_per_second': 30.046, 'train_steps_per_second': 3.756, 'train_loss': 0.21122881825764975, 'epoch': 3.0})
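
The `learning_rate` values in the log above are consistent with a linear warmup followed by a linear decay to zero. The sketch below reproduces them under these assumptions: a peak rate of 5e-5 (the value logged at step 500), `warmup_steps=500` as set in `TrainingArguments`, and 7500 total steps (20000 training samples / batch size 8 = 2500 steps per epoch, times 3 epochs). The function name is hypothetical and this is not the library's own implementation.

```python
def linear_schedule_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=7500):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)

total_steps = (20000 // 8) * 3  # 2500 steps per epoch, 3 epochs

# Steps 500, 1000, 2500 and 7500 correspond to epochs 0.2, 0.4, 1.0 and 3.0 in the log
for step in (500, 1000, 2500, 7500):
    print(step, linear_schedule_lr(step, total_steps=total_steps))
```

Comparing the printed values with the log entries at epochs 0.2, 0.4, 1.0 and 3.0 is a quick way to check that a custom loop uses the same schedule as the Trainer run.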

### Save the weights of the model (HF) and the trained tokenizer.

In [15]:
# saving the fine tuned model and tokenizer
hf_path = "model/hf"
model.save_pretrained(hf_path)
tokenizer.save_pretrained(hf_path)

('model/hf\\tokenizer_config.json',
 'model/hf\\special_tokens_map.json',
 'model/hf\\vocab.txt',
 'model/hf\\added_tokens.json',
 'model/hf\\tokenizer.json')

# Model evaluation (validation set)

In [None]:
# evaluate the current model after training
trainer.evaluate()

# Model evaluation (test set)

In [None]:
trainer.evaluate(test_dataset)

# Testing inference on everyday text

In [None]:
target_names = ["Negative", "Positive"]

model.eval()  # disable dropout for inference

def get_prediction(text):
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(device)
    # perform inference to our model, without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs[0].softmax(1)
    # executing argmax function to get the candidate label
    return target_names[probs.argmax().item()]

test_neg = "This movie was like the worst thing ever."
test_pos_1 = "One of this generation's best"
test_pos_2 = "This is a movie to die for."

print("Testing predictions:\n")
print(f'{test_neg} => {get_prediction(test_neg)}')
print(f'{test_pos_1} => {get_prediction(test_pos_1)}')
print(f'{test_pos_2} => {get_prediction(test_pos_2)}')
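
The last two steps of `get_prediction` (softmax, then argmax) can be illustrated without a model. The sketch below applies them to a made-up pair of logits; the values are assumptions chosen only for illustration and are not outputs of the fine-tuned model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

target_names = ["Negative", "Positive"]
example_logits = [-1.2, 2.3]  # hypothetical [negative, positive] scores

probs = softmax(example_logits)
label = target_names[probs.index(max(probs))]
print(probs, label)
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits yields the same label; the softmax is only needed when the probabilities themselves are of interest.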