# Reference notebook

Name: Fabio Grassiotto, RA 890441

## Instructions:

Train a BERT model (or one of its variants) for binary classification on the IMDB dataset and measure its accuracy, using 20k/5k samples for training/validation.

Important:
- You must implement your own training loop.
- Implement gradient accumulation.

Tips:
- BERT usually learns a task well in a few epochs (3 to 5). If it takes more than 5 epochs to reach 80% accuracy, adjust the hyperparameters.

- Fix for out-of-memory errors:
  - Using bfloat16 makes it possible to nearly double the batch size
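
As a rough illustration of the bfloat16 tip, the sketch below compares the per-element storage of float32 and bfloat16 tensors and runs a forward pass under `torch.autocast`. The single `nn.Linear` layer and the tensor shapes are stand-ins chosen for the example, not the BERT model used in this notebook.

```python
import torch
import torch.nn as nn

# bfloat16 stores each value in 2 bytes instead of 4, which is why
# activations take roughly half the memory.
fp32_bytes = torch.zeros(1, dtype=torch.float32).element_size()
bf16_bytes = torch.zeros(1, dtype=torch.bfloat16).element_size()
print(fp32_bytes, bf16_bytes)

# Hypothetical stand-in for the real model: a single linear layer.
device_type = "cuda" if torch.cuda.is_available() else "cpu"
layer = nn.Linear(16, 2).to(device_type)
x = torch.randn(4, 16, device=device_type)

# Inside autocast, eligible ops (such as matrix multiplies) run in bfloat16,
# while the parameters themselves stay in float32.
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    logits = layer(x)

print(logits.dtype, layer.weight.dtype)
```

In a training loop, the same `with torch.autocast(...)` context would wrap the forward pass and loss computation, while `loss.backward()` and the optimizer step stay outside it.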

Optional:
- You may use the Trainer class from the Hugging Face Transformers library to check that your training loop is correct. Implementing your own loop is still mandatory.

## Imports

In [1]:
%%capture
%pip install transformers[torch]
%pip install accelerate

In [2]:
import random
import os
import sys
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import numpy as np
import time
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score
from tqdm import tqdm

## Global variables and initialization

In [3]:
# Seed
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)

# Load Models from filesystem
LOAD_PRETRAINED = True
LOAD_HUGFACE = True
LOAD_LOOP = False

# Colab environment
IN_COLAB = 'google.colab' in sys.modules

if (IN_COLAB):
    # Google Drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

    project_folder="/content/drive/MyDrive/Classes/IA024/Aula_4_5"
    os.chdir(project_folder)
    !ls -la

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


## Preparing the data

First, we download the dataset:

In [4]:
if not os.path.exists("aclImdb.tgz"):
    print("Downloading Imdb dataset")
    !wget -nc http://files.fast.ai/data/aclImdb.tgz
    !tar -xzf aclImdb.tgz

## Loading the dataset

We artificially create a training split (20k examples) and a validation split (5k examples) from the original training set.
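
The shuffle-then-split idea can be sketched on toy data using only the standard library. The helper name `shuffle_and_split` and the made-up reviews are assumptions for illustration; the key point is that texts and labels are shuffled as pairs so they stay aligned.

```python
import random

def shuffle_and_split(texts, labels, n_valid, seed=123):
    """Shuffle (text, label) pairs together, then hold out the last n_valid pairs."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)  # local RNG, global state untouched
    train, valid = pairs[:-n_valid], pairs[-n_valid:]
    return train, valid

# Toy data: 6 "positive" and 6 "negative" reviews (made up for the example).
texts = [f"pos review {i}" for i in range(6)] + [f"neg review {i}" for i in range(6)]
labels = [1] * 6 + [0] * 6

train, valid = shuffle_and_split(texts, labels, n_valid=4)
print(len(train), len(valid))
```

Shuffling before slicing matters here because the raw lists hold all positive reviews first; slicing without shuffling would give a validation set with a single class.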

In [5]:
import os

max_valid = 5000

def load_texts(folder):
    texts = []
    for path in os.listdir(folder):
        # Explicit encoding avoids depending on the platform default
        with open(os.path.join(folder, path), encoding="utf-8") as f:
            texts.append(f.read())
    return texts

x_train_pos = load_texts('aclImdb/train/pos')
x_train_neg = load_texts('aclImdb/train/neg')
x_test_pos = load_texts('aclImdb/test/pos')
x_test_neg = load_texts('aclImdb/test/neg')

x_train = x_train_pos + x_train_neg
x_test = x_test_pos + x_test_neg
#y_train = [True] * len(x_train_pos) + [False] * len(x_train_neg)
#y_test = [True] * len(x_test_pos) + [False] * len(x_test_neg)
# Use 0/1 for classes
y_train = [1] * len(x_train_pos) + [0] * len(x_train_neg)
y_test = [1] * len(x_test_pos) + [0] * len(x_test_neg)

# Shuffle the training data before making the train/validation split.
c = list(zip(x_train, y_train))
random.shuffle(c)
x_train, y_train = zip(*c)

x_valid = x_train[-max_valid:]
y_valid = y_train[-max_valid:]
x_train = x_train[:-max_valid]
y_train = y_train[:-max_valid]

print(len(x_train), 'training samples.')
print(len(x_valid), 'validation samples.')
print(len(x_test), 'test samples.')

print('First 3 training samples:')
for x, y in zip(x_train[:3], y_train[:3]):
    print(y, x[:100])

print('Last 3 training samples:')
for x, y in zip(x_train[-3:], y_train[-3:]):
    print(y, x[:100])

print('First 3 validation samples:')
for x, y in zip(x_valid[:3], y_valid[:3]):
    print(y, x[:100])

print('Last 3 validation samples:')
for x, y in zip(x_valid[-3:], y_valid[-3:]):
    print(y, x[:100])

20000 training samples.
5000 validation samples.
25000 test samples.
First 3 training samples:
0 POSSIBLE SPOILERS<br /><br />The Spy Who Shagged Me is a muchly overrated and over-hyped sequel. Int
0 The long list of "big" names in this flick (including the ubiquitous John Mills) didn't bowl me over
1 Bette Midler showcases her talents and beauty in "Diva Las Vegas". I am thrilled that I taped it and
Last 3 training samples:
0 I was previously unaware that in the early 1990's Devry University (or was it ITT Tech?) added Film 
1 The story and music (George Gershwin!) are wonderful, as are Levant, Guetary, Foch, and, of course, 
1 This is my favorite show. I think it is utterly brilliant. Thanks to David Chase for bringing this i
First 3 validation samples:
1 Why has this not been released? I kind of thought it must be a bit rubbish since it hasn't been. How
1 I was amazingly impressed by this movie. It contained fundamental elements of depression, grief, lon
1 p

In [6]:
# Checking output
print(x_train[0])
print(y_train[0])

POSSIBLE SPOILERS<br /><br />The Spy Who Shagged Me is a muchly overrated and over-hyped sequel. International Man of Mystery came straight out of the blue. It was a lone star that few people had heard of. But it was stunningly original, had sophisticated humour and ample humour, always kept in good taste, and had a brilliant cast. The Spy Who Shagged Me was a lot more commercially advertised and hyped about.<br /><br />OK I'll admit, the first time I saw this film I thought it was very funny, but it's only after watching it two or three times that you see all the flaws. The acting was OK, but Heather Graham cannot act. Her performance didn't seem very convincing and she wasn't near as good as Liz Hurley was in the first one. Those characters who bloomed in the first one, (Scott Evil, Number 2 etc.) are thrown into the background hear and don't get many stand-alone scenes. The film is simply overrun with cameos.<br /><br />In particular, I hated the way they totally disregarded some of

# Tokenizing the dataset
#### Note: The code below is based on the tutorial "How to Fine Tune BERT for Text Classification using Transformers in Python" at https://thepythoncode.com/article/finetuning-bert-using-huggingface-transformers-python for its use of the Hugging Face APIs.

In [7]:
if LOAD_PRETRAINED:
    # Load the tokenizer saved alongside the BERT base model
    model_path = "model/pretrained"
    tokenizer = BertTokenizerFast.from_pretrained(model_path)
else:
    model_name = "bert-base-uncased"
    tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

In [8]:
# max sequence length
max_length = 512

train_encodings = tokenizer(x_train, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(x_valid, truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(x_test, truncation=True, padding=True, max_length=max_length)

In [9]:
print(train_encodings[0])

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
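
To make the effect of `truncation=True`, `padding=True`, and `max_length` more concrete, here is a conceptual sketch in plain Python. The helper `pad_and_truncate` and the token IDs are made up for illustration, and the sketch ignores the special tokens that a real tokenizer adds; it is not how the library implements these options internally.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs; a real tokenizer would also add special tokens.
batch = [[7, 8, 9], [5, 6, 7, 8, 9, 10, 11], [4]]
enc = pad_and_truncate(batch, max_length=5)
for ids, mask in zip(enc["input_ids"], enc["attention_mask"]):
    print(ids, mask)
```

The `num_tokens=512` reported above is consistent with this picture: if at least one review reaches the 512-token limit, every other review in the same call is padded up to that length.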


# Dataset class

In [10]:
class ImdbDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = ImdbDataset(train_encodings, y_train)
valid_dataset = ImdbDataset(valid_encodings, y_valid)
test_dataset = ImdbDataset(test_encodings, y_test)
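
To see what a `DataLoader` produces from items laid out this way, the sketch below reuses the same item structure on hand-made encodings. `ToyDataset`, the token IDs, and the labels are assumptions for the example; a plain dict stands in for the tokenizer output.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Same item layout as ImdbDataset, on hand-made encodings."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# Made-up, already padded encodings: 6 examples of 5 tokens each.
encodings = {
    "input_ids": [[101, 7, 8, 102, 0]] * 6,
    "attention_mask": [[1, 1, 1, 1, 0]] * 6,
}
labels = [1, 0, 1, 1, 0, 0]

loader = DataLoader(ToyDataset(encodings, labels), batch_size=4, shuffle=False)
batch = next(iter(loader))
for name, tensor in batch.items():
    print(name, tuple(tensor.shape))
```

Note that wrapping each label in a list gives batched labels of shape `(batch, 1)` rather than `(batch,)`, which matters when comparing them with predictions later on.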

# Hugging Face model

In [11]:
if LOAD_PRETRAINED:  # BERT base model saved locally
    model_path = "model/pretrained"
    model = BertForSequenceClassification.from_pretrained(model_path, local_files_only=True).to(device)
else:
    model_name = "bert-base-uncased"
    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)

    # saving the pre-trained model and tokenizer
    pretrained_path = "model/pretrained"
    model.save_pretrained(pretrained_path)
    tokenizer.save_pretrained(pretrained_path)

# Model training and evaluation

## With a custom training loop

In [12]:
batch_size = 8
epochs = 3
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Shuffling is not needed for validation
val_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)

def train_loop(model, optimizer, device):
    # Training loop: one optimizer step per batch
    model.train()
    for epoch in range(epochs):
        for batch in tqdm(train_loader):
            # Move the batch to the same device as the model
            batch = {k: v.to(device) for k, v in batch.items()}

            # Forward pass; the model computes the loss from batch["labels"]
            outputs = model(**batch)
            loss = outputs.loss

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
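
The instructions also ask for gradient accumulation, which the loop above does not include. A minimal sketch of the technique follows, using a small linear model and random data as hypothetical stand-ins for BERT and the IMDB loader; `accumulation_steps` and the `toy_*` names are assumptions for the example.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-ins for the BERT model and the IMDB loader.
toy_model = nn.Linear(10, 2)
toy_data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
toy_loader = DataLoader(toy_data, batch_size=8, shuffle=True)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4   # effective batch size = 8 * 4 = 32
optimizer_steps = 0

toy_model.train()
toy_optimizer.zero_grad()
for step, (inputs, targets) in enumerate(toy_loader, start=1):
    loss = loss_fn(toy_model(inputs), targets)
    # Scale so that the summed gradients match the mean over the large batch
    (loss / accumulation_steps).backward()

    # Update only every `accumulation_steps` micro-batches (or at the end)
    if step % accumulation_steps == 0 or step == len(toy_loader):
        toy_optimizer.step()
        toy_optimizer.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)
```

Applied to the notebook's loop, the same pattern would divide `outputs.loss` by `accumulation_steps` before calling `backward()` and move `optimizer.zero_grad()` so that it runs only after each `optimizer.step()`.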


In [13]:
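# Fine-tuning BERT usually works best with a small learning rate
# (roughly 2e-5 to 5e-5); much larger values tend to destabilize training.
lr = 2e-5

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)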

model_path = "model/loop"
if (LOAD_LOOP):    
    model = BertForSequenceClassification.from_pretrained(model_path, local_files_only=True).to(device)
else:  
    train_loop(model, optimizer, device)
    # saving the fine tuned model
    model.save_pretrained(model_path)

100%|██████████| 2500/2500 [07:24<00:00,  5.62it/s]
 14%|█▍        | 353/2500 [01:02<06:28,  5.53it/s]

In [None]:
def evaluate(model):
    # Named `evaluate` to avoid shadowing Python's built-in eval()
    model.eval()

    loss_sum = 0.0
    total_sum = 0
    correct_sum = 0

    with torch.no_grad():
        for batch in tqdm(val_loader):
            batch = {k: v.to(device) for k, v in batch.items()}
            # Labels arrive with shape (batch, 1); flatten to (batch,)
            targets = batch['labels'].view(-1)

            outputs = model(**batch)
            loss_sum += outputs.loss.item()

            # Predicted class = index of the largest logit
            predicted = outputs.logits.argmax(dim=-1)

            total_sum += targets.size(0)
            correct_sum += (predicted == targets).sum().item()

    # Calculate accuracy
    acc = 100 * correct_sum / total_sum

    # Average loss over batches and its exponential (perplexity)
    average_loss = loss_sum / len(val_loader)
    average_perplexity = np.exp(average_loss)

    print(f'Validation Accuracy: {acc:.2f}%')
    print(f'Average Loss: {average_loss:.2f}')
    print(f'Average Perplexity: {average_perplexity:.2f}')

evaluate(model)
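
Two details are easy to get wrong when turning logits into an accuracy figure: `torch.max` with a `dim` argument returns a pair of values and indices, and labels shaped `(batch, 1)` broadcast against predictions shaped `(batch,)`. The sketch below shows both on made-up logits and labels.

```python
import torch

# Made-up logits for 4 examples and 2 classes, with labels shaped (batch, 1)
logits = torch.tensor([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.0, 0.9]])
labels = torch.tensor([[0], [1], [0], [0]])

# torch.max with a dim returns a (values, indices) pair, so the
# predicted classes are the `indices` field, the same as argmax.
values, indices = torch.max(logits, dim=-1)
preds = logits.argmax(dim=-1)
same = torch.equal(indices, preds)

# Comparing (4,) against (4, 1) broadcasts to a (4, 4) matrix
broadcast_shape = tuple((preds == labels).shape)

# Flattening the labels gives the intended element-wise comparison
correct = (preds == labels.view(-1)).sum().item()
accuracy = correct / labels.numel()
print(same, broadcast_shape, accuracy)
```

Summing the broadcast `(4, 4)` comparison would count matches between every prediction and every label, which inflates the number of "correct" examples.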

## Hugging Face library

In [None]:
def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # calculate accuracy using sklearn's function
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    save_strategy="epoch",           # save a checkpoint at the end of each epoch
    evaluation_strategy="epoch",     # evaluate each epoch
)

model_path = "model/hf"
if LOAD_HUGFACE:
    # Load the fine-tuned model first so that the Trainer evaluates it
    model = BertForSequenceClassification.from_pretrained(model_path, local_files_only=True).to(device)

trainer = Trainer(
    model=model,                         # Model
    args=training_args,                  # training args
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)

if not LOAD_HUGFACE:
    trainer.train()
    # saving the fine tuned model
    model.save_pretrained(model_path)
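
To make `warmup_steps=500` more concrete, here is a sketch of a linear warmup followed by a linear decay, assuming the default linear scheduler is in use. The base learning rate of 5e-5 (the library default when none is set) and the total of 7500 steps (2500 batches per epoch times 3 epochs) are assumptions derived from the settings above; `lr_at` is a hypothetical helper, not a library function.

```python
def lr_at(step, base_lr=5e-5, warmup_steps=500, total_steps=7500):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

for step in (0, 250, 500, 4000, 7500):
    print(step, f"{lr_at(step):.2e}")
```

The learning rate peaks at step 500 and is halved again by step 4000, so most of the training happens well below the nominal base value.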

### Model evaluation (validation set)

In [None]:
# evaluate the current model after training
trainer.evaluate()

### Model evaluation (test set)

In [None]:
trainer.evaluate(test_dataset)

### Testing inference on ordinary text

In [None]:
target_names = ["Negative", "Positive"]

def get_prediction(text):
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(device)
    # perform inference with our model, without tracking gradients
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    # get output probabilities by doing softmax
    probs = outputs.logits.softmax(dim=1)
    # executing argmax function to get the candidate label
    return target_names[probs.argmax().item()]

test_neg = "This movie was like the worst thing ever."
test_pos_1 = "One of this generation's best"
test_pos_2 = "This is a movie to die for."

print("Testing predictions:\n")
print(f'{test_neg} => {get_prediction(test_neg)}')
print(f'{test_pos_1} => {get_prediction(test_pos_1)}')
print(f'{test_pos_2} => {get_prediction(test_pos_2)}')
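
The softmax-then-argmax step inside `get_prediction` can be sketched without any model. The logits below are made up, and `softmax` is a small helper written for this example in plain Python.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

target_names = ["Negative", "Positive"]

# Made-up logits for one review, in [negative, positive] order
logits = [-1.2, 2.3]
probs = softmax(logits)
label = target_names[probs.index(max(probs))]
print(label, [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits would select the same label; the probabilities are mainly useful for reporting how confident the model is.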