# NLP TP2 - Victor Henrique Silva Ribeiro


## Introduction
In this assignment, I use the pre-trained model `bert-base-portuguese-cased` for the downstream task of POS tagging. For this, I use the `macmorpho` dataset, a POS tagging dataset for Portuguese.

First, I install and import the libraries needed for the assignment.

In [3]:
%pip install torchtext==0.6.0
%pip install transformers
%pip install numpy
%pip install torch
%pip install datasets

import torch

import torch.nn as nn
import torch.optim as optim

from torchtext import data
from torchtext.data import Example, Dataset

from transformers import BertTokenizer, BertModel

import numpy as np

import functools
from datasets import load_dataset
from collections import defaultdict

Note: you may need to restart the kernel to use updated packages.




The first step is to load the Portuguese tokenizer using HuggingFace's `transformers` library. The inputs must use the same special tokens that were used when training `BERT`: the start-of-sentence token, the unknown token and the padding token. In addition, the inputs must be truncated to the maximum number of tokens that `BERT` supports, which is 512.

In [4]:
tokenizer = BertTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')

init_token = tokenizer.cls_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

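# max_model_input_sizes only lists the original BERT checkpoints, so the
# 512-token limit is read from the 'bert-base-uncased' entry.
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']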



Next, we define how the inputs and labels are pre-processed into the format that `BERT` expects. The whole process uses `PyTorch` tensors.

In [5]:
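def inputProcessor(tokens, tokenizer, max_input_length):
    # Truncate to max_input_length - 1 to leave room for the [CLS] token.
    tokens = tokens[:max_input_length-1]
    # Each word is looked up as a whole in the BERT vocabulary;
    # words that are not in it are mapped to the unknown-token id.
    tokens = [tokenizer.convert_tokens_to_ids(token)
              if token in tokenizer.vocab
              else tokenizer.unk_token_id
              for token in tokens]
    return tokens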

def labelProcessor(tokens, max_input_length):
    tokens = tokens[:max_input_length-1]
    return tokens

text_preprocessor = functools.partial(inputProcessor,
                                      tokenizer = tokenizer,
                                      max_input_length = max_input_length)

tag_preprocessor = functools.partial(labelProcessor,
                                     max_input_length = max_input_length)

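# The words are converted to ids by text_preprocessor, so no torchtext
# vocabulary is built (use_vocab = False). Note that lower = True lowercases
# the words before the lookup, although the checkpoint is cased.
TEXT = data.Field(use_vocab = False,
                  lower = True,
                  preprocessing = text_preprocessor,
                  init_token = init_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

# init_token = '<pad>' adds a dummy label at the [CLS] position so that the
# tags stay aligned with the input ids.
UD_TAGS = data.Field(unk_token = None,
                     init_token = '<pad>',
                     preprocessing = tag_preprocessor)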

fields = (("tokens", TEXT), ("pos_tags", UD_TAGS))
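
To make the pre-processing concrete, the sketch below reproduces the same steps on a toy vocabulary, without torchtext or a real tokenizer. The vocabulary, its ids and the helper names (`toy_vocab`, `encode_words`, `align_tags`) are invented for illustration; only the logic (truncate to `max_input_length - 1`, map out-of-vocabulary words to the unknown id, reserve one extra position for `[CLS]`) mirrors the cell above.

```python
# Toy stand-in for the BERT vocabulary (ids are invented for this example).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "o": 3, "gato": 4, "dorme": 5}

def encode_words(words, vocab, max_input_length):
    """Truncate, look each word up, and prepend the [CLS] id."""
    words = words[:max_input_length - 1]          # leave one slot for [CLS]
    ids = [vocab.get(w, vocab["[UNK]"]) for w in words]
    return [vocab["[CLS]"]] + ids

def align_tags(tags, max_input_length, pad_tag="<pad>"):
    """Truncate the tags the same way and put a pad label at the [CLS] position."""
    return [pad_tag] + tags[:max_input_length - 1]

words = ["o", "gato", "ronrona", "dorme"]
tags = ["ART", "N", "V", "V"]

ids = encode_words(words, toy_vocab, max_input_length=4)
labels = align_tags(tags, max_input_length=4)
print(ids)     # [2, 3, 4, 1]
print(labels)  # ['<pad>', 'ART', 'N', 'V']
```

The `<pad>` label at the `[CLS]` position gets index 0 in the tag vocabulary, so it is skipped by the loss through `ignore_index`.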

Here we load the `macmorpho` dataset using HuggingFace's `datasets` library. It is split into training, validation and test sets: the training set is used to train the model, the validation set to select the best model, and the test set to evaluate the final model.

Once the splits are defined, I convert them for use with `PyTorch` using the procedures defined earlier.

In [6]:
def toPytorchDataset(dataset, train_set=None):
    dataset = [(example['tokens'], example['pos_tags']) for example in dataset]

    examples = [Example.fromlist([text, tags], fields=[('text', TEXT), ('udtags', UD_TAGS)]) for text, tags in dataset]
    dataset = Dataset(examples, fields=[('text', TEXT), ('udtags', UD_TAGS)])

    return dataset


dataset = load_dataset('mac_morpho')
train_data_raw = dataset['train']
valid_data_raw = dataset['validation']
test_data_raw = dataset['test']

train_data = toPytorchDataset(train_data_raw)
valid_data = toPytorchDataset(valid_data_raw, train_set=train_data_raw)
test_data = toPytorchDataset(test_data_raw, train_set=train_data_raw)

print(len(train_data.examples))
print(len(valid_data.examples))
print(len(test_data.examples))



37948
1997
9987


The tag vocabulary must be built so that the tags can be indexed during training.

In [7]:
UD_TAGS.build_vocab(train_data)
print(UD_TAGS.vocab.stoi)

defaultdict(None, {'<pad>': 0, 14: 1, 24: 2, 19: 3, 3: 4, 15: 5, 25: 6, 9: 7, 12: 8, 23: 9, 5: 10, 21: 11, 7: 12, 8: 13, 10: 14, 11: 15, 6: 16, 16: 17, 18: 18, 22: 19, 0: 20, 13: 21, 4: 22, 17: 23, 1: 24, 2: 25, 20: 26})
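
The printed mapping goes from the dataset's integer tag ids to the indices used by the classifier. The sketch below shows, on invented data, roughly how such a mapping arises: special tokens come first and the remaining tags are ordered by decreasing frequency. It is an approximation of what `build_vocab` does, not its actual implementation, and `build_tag_vocab` is a hypothetical helper.

```python
from collections import Counter

def build_tag_vocab(tag_sequences, specials=("<pad>",)):
    """Map each tag to an index: specials first, then most frequent tags."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    # Sort by frequency (descending), breaking ties by the tag value.
    ordered = sorted(counts, key=lambda tag: (-counts[tag], tag))
    itos = list(specials) + ordered
    stoi = {tag: index for index, tag in enumerate(itos)}
    return stoi, itos

toy_sequences = [[14, 24, 19], [14, 14, 24], [3]]
stoi, itos = build_tag_vocab(toy_sequences)
print(stoi)  # {'<pad>': 0, 14: 1, 24: 2, 3: 3, 19: 4}
```

Read this way, the output above suggests that tag id 14 is the most frequent tag in the training set, followed by 24 and 19.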


Loading the pre-trained model `bert-base-portuguese-cased` and adding a linear layer on top to classify the tags.

In [8]:
class BERTPoSTagger(nn.Module):
    def __init__(self,
                 bert,
                 output_dim, 
                 dropout):
        
        super().__init__()
        self.bert = bert
        embedding_dim = bert.config.to_dict()['hidden_size']
        self.fc = nn.Linear(embedding_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
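    def forward(self, text):
        # torchtext batches are [seq len, batch size]; BERT expects [batch size, seq len].
        text = text.permute(1, 0)

        # self.bert(text)[0] is the last hidden state: [batch size, seq len, hidden size].
        embedded = self.dropout(self.bert(text)[0])
        # Back to [seq len, batch size, hidden size] to match the layout of the tags.
        embedded = embedded.permute(1, 0, 2)

        # One score per tag for every token position.
        predictions = self.fc(self.dropout(embedded))
        
        return predictions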
    
bert = BertModel.from_pretrained('neuralmind/bert-base-portuguese-cased')

OUTPUT_DIM = len(UD_TAGS.vocab)
DROPOUT = 0.25

model = BERTPoSTagger(bert,
                      OUTPUT_DIM, 
                      DROPOUT)



Now I define the training procedure for the linear layer. The whole process runs on the CPU. Note that the optimizer in this cell is given `model.parameters()`, which also includes the BERT weights.

In [10]:
def sort_key(example):
    return len(example.text)

BATCH_SIZE = 32
device = torch.device('cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device,
    sort_key = sort_key)

LEARNING_RATE = 5e-5
optimizer = optim.Adam(model.parameters(), lr = LEARNING_RATE)

TAG_PAD_IDX = UD_TAGS.vocab.stoi[UD_TAGS.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

model = model.to(device)
criterion = criterion.to(device)
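
`BucketIterator` groups sentences of similar length so that each batch needs little padding. The following sketch illustrates the idea in plain Python on invented sentence lengths; `make_batches` and `padding_needed` are hypothetical helpers, and the shuffling done by the real iterator is omitted.

```python
def make_batches(lengths, batch_size):
    """Split a list of sentence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

def padding_needed(batches):
    """Count pad positions when each batch is padded to its longest sentence."""
    return sum(max(batch) - length for batch in batches for length in batch)

lengths = [3, 40, 5, 38, 4, 41]

# Batching in the original order mixes short and long sentences.
unsorted_pad = padding_needed(make_batches(lengths, batch_size=3))
# Sorting by length first keeps similar sentences together.
bucketed_pad = padding_needed(make_batches(sorted(lengths), batch_size=3))
print(unsorted_pad, bucketed_pad)  # 112 7
```

This is why the iterator is given a `sort_key` based on sentence length: less padding means less wasted computation per batch, which matters when training on the CPU.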

Defining the training and evaluation functions for the model.

In [11]:
def getAccuracy(preds, y, tag_pad_idx):
    max_preds = preds.argmax(dim = 1, keepdim = True)
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])
    return correct.sum() / torch.FloatTensor([y[non_pad_elements].shape[0]]).to(device)

def train(model, iterator, optimizer, criterion, tag_pad_idx):
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        try:
            text = batch.text
            tags = batch.udtags
                    
            optimizer.zero_grad()
            
            predictions = model(text)
            
            predictions = predictions.view(-1, predictions.shape[-1])
            tags = tags.view(-1)
            
            loss = criterion(predictions, tags)
            acc = getAccuracy(predictions, tags, tag_pad_idx)

            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
            epoch_acc += acc.item()

        except KeyError:
            continue
            
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion, tag_pad_idx):
    epoch_loss = 0
    epoch_acc = 0
    confusion_matrix = defaultdict(lambda: {'correct': 0, 'total': 0})

    model.eval()

    with torch.no_grad():
        for batch in iterator:
            try:
                text = batch.text
                tags = batch.udtags

                predictions = model(text)

                predictions = predictions.view(-1, predictions.shape[-1])
                tags = tags.view(-1)

                loss = criterion(predictions, tags)

                acc = getAccuracy(predictions, tags, tag_pad_idx)

                epoch_loss += loss.item()
                epoch_acc += acc.item()

                # Update confusion matrix
                max_preds = predictions.argmax(dim=1)
                for pred, actual in zip(max_preds, tags):
                    pred_tag = UD_TAGS.vocab.itos[pred.item()]
                    actual_tag = UD_TAGS.vocab.itos[actual.item()]
                    confusion_matrix[actual_tag]['total'] += 1
                    if pred_tag == actual_tag:
                        confusion_matrix[actual_tag]['correct'] += 1

            except KeyError:
                continue

    # Calculate accuracy for each tag (named `counts` to avoid shadowing torchtext's `data`)
    for tag, counts in confusion_matrix.items():
        accuracy = counts['correct'] / counts['total']
        print(f"Accuracy for tag {tag}: {accuracy}")

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
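
Both functions return the mean of the per-batch accuracies. When batches contain different numbers of real (non-padding) tokens, that mean can differ from the accuracy computed over all tokens at once. The sketch below contrasts the two on invented predictions; `PAD`, `batch_accuracy` and the toy batches are assumptions made for illustration.

```python
PAD = 0  # index of the padding tag, playing the role of TAG_PAD_IDX

def batch_accuracy(preds, gold):
    """Return (correct, total) over positions whose gold tag is not PAD."""
    pairs = [(p, g) for p, g in zip(preds, gold) if g != PAD]
    correct = sum(p == g for p, g in pairs)
    return correct, len(pairs)

batches = [
    ([1, 2, 3, 4], [1, 2, 3, 4]),   # 4 real tokens, all correct
    ([1, 2, 0, 0], [1, 3, 0, 0]),   # 2 real tokens, 1 correct, 2 pads ignored
]

stats = [batch_accuracy(p, g) for p, g in batches]
mean_of_batches = sum(c / t for c, t in stats) / len(stats)
token_level = sum(c for c, _ in stats) / sum(t for _, t in stats)
print(mean_of_batches, round(token_level, 4))  # 0.75 0.8333
```

Also note that a batch skipped by `except KeyError` contributes nothing to the running sums but is still counted in `len(iterator)`, so skipped batches would pull the reported averages down.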

Training the model. Each epoch took about 22 minutes to complete.

In [30]:
model_path = 'models/pos-tagging-model.pt'

N_EPOCHS = 10
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion, TAG_PAD_IDX)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion, TAG_PAD_IDX)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), model_path)
    
    print('Epoch: %02d' % (epoch+1))
    print('\tTrain Loss: %.3f | Train Acc: %.2f%%' % (train_loss, train_acc*100))
    print('\t Val. Loss: %.3f |  Val. Acc: %.2f%%' % (valid_loss, valid_acc*100))

Loading the model with the lowest validation loss and evaluating it on the test set gives an accuracy of 93.29%.

Based on the per-tag results, the accuracy for tags 24, 9 and 25 is close to 99%. These tags represent punctuation, preposition-article and article, respectively.

Tags 20 and 4 have the lowest accuracy, about 71%, and tag 17 reaches about 79%. These tags represent preposition-adverb, preposition-substantive pronoun and preposition-personal pronoun, respectively. Tag 1 is also low, at about 71%.

In [39]:
model.load_state_dict(torch.load(model_path))

test_loss, test_acc = evaluate(model, test_iterator, criterion, TAG_PAD_IDX)
print('Test Loss: %.3f | Test Acc: %.2f%%' % (test_loss, test_acc*100))

Accuracy for tag <pad>: 0.0
Accuracy for tag 14: 0.9421761261014723
Accuracy for tag 3: 0.8829066265060241
Accuracy for tag 24: 0.997992863514719
Accuracy for tag 19: 0.9713865354370655
Accuracy for tag 23: 0.8883584282041865
Accuracy for tag 9: 0.9881593110871906
Accuracy for tag 12: 0.8860182370820668
Accuracy for tag 18: 0.8901098901098901
Accuracy for tag 25: 0.992368839427663
Accuracy for tag 1: 0.7142857142857143
Accuracy for tag 16: 0.8876834716017868
Accuracy for tag 21: 0.817032967032967
Accuracy for tag 8: 0.9578824217607488
Accuracy for tag 15: 0.9799141733222076
Accuracy for tag 6: 0.9429763560500696
Accuracy for tag 10: 0.8944050433412135
Accuracy for tag 7: 0.9252262888626525
Accuracy for tag 17: 0.7936507936507936
Accuracy for tag 22: 0.8682432432432432
Accuracy for tag 5: 0.9792540278084308
Accuracy for tag 20: 0.7096774193548387
Accuracy for tag 4: 0.7115384615384616
Accuracy for tag 11: 0.9658314350797267
Accuracy for tag 0: 0.9870550161812298
Accuracy for tag 13: 0.8
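
To read the per-tag figures more easily, they can be sorted. The snippet below copies a subset of the accuracies printed above into a dictionary (rounded to four decimals) and lists the lowest and highest ones; the variable names are illustrative.

```python
# Subset of the per-tag accuracies printed above, rounded to four decimals.
tag_accuracy = {
    24: 0.9980, 25: 0.9924, 9: 0.9882, 0: 0.9871,
    17: 0.7937, 13: 0.8000, 1: 0.7143, 4: 0.7115, 20: 0.7097,
}

# Sort the tag ids by their accuracy, lowest first.
ranked = sorted(tag_accuracy, key=tag_accuracy.get)
print("lowest:", ranked[:3])    # lowest: [20, 4, 1]
print("highest:", ranked[-3:])  # highest: [9, 25, 24]
```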