# Natural Language Processing - SBBD 2022 Short Course

# Part-of-speech tagging

This code was developed for the NLP short course at SBBD 2022.

Authors: Helena Caseli, Cláudia Freitas and Roberta Viola

https://sites.google.com/view/brasileiras-pln/

Sources:
* Computational Linguistics course at UFMG - Prof. Thiago Castro Ferreira https://www.youtube.com/playlist?list=PLLrlHSmC0Mw73a1t73DEjgGMPyu8QssWT => This code is based on lecture 7.7
* https://huggingface.co/
* https://github.com/neuralmind-ai/portuguese-bert
* https://pytorch.org/

This code:
* Uses a pre-trained neural model (BERTimbau), based on the Transformer architecture, fine-tuned for token classification.
* Labels each token of a text with its morphosyntactic category (part-of-speech tagging).

Dataset/corpus:
* Mac-Morpho: http://nilc.icmc.usp.br/macmorpho/

**IMPORTANT:** Select a GPU runtime in Colab.

## Installing the dependencies

In [None]:
!pip3 install transformers

## Downloading the corpus

These experiments use Mac-Morpho: http://nilc.icmc.usp.br/macmorpho/

In [None]:
!wget http://nilc.icmc.usp.br/macmorpho/macmorpho-v3.tgz
!tar -xvf macmorpho-v3.tgz

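## Loading the corpus

Loading the corpus in its official training (train) and validation (dev) splits.

In [None]:
# Each line holds one sentence and each token is written as word_TAG.
# Split on the last underscore only and skip blank lines.
with open('macmorpho-train.txt', encoding='utf-8') as f:
  traindata = [[tuple(w.rsplit('_', 1)) for w in snt.split()] for snt in f.read().split('\n') if snt.strip()]

with open('macmorpho-dev.txt', encoding='utf-8') as f:
  devdata = [[tuple(w.rsplit('_', 1)) for w in snt.split()] for snt in f.read().split('\n') if snt.strip()]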
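As a small illustration of the file format, the sketch below parses a single line written in the `word_TAG` convention that the loading code assumes. The helper name `parse_line` and the shortened sample line are assumptions made for this example; the words and tags are taken from the first validation sentence printed later in this notebook.

```python
def parse_line(line):
    """Turn 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    # rsplit on the last underscore, so a word that itself contains '_' stays whole
    return [tuple(token.rsplit('_', 1)) for token in line.split()]

# shortened sample line, rebuilt from the first validation sentence
sample = "Ainda_ADV em_PREP dezembro_N de_PREP 1990_N ,_PU"
pairs = parse_line(sample)
print(pairs)
```

Each sentence therefore becomes a list of pairs, which is the structure that `parse` receives in the next cell.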

In [None]:
def parse(data):
  X = [' '.join([w[0] for w in snt]) for snt in data]
  y = [[w[1] for w in snt] for snt in data]

  # sorted() keeps the tag -> id mapping stable across runs (plain set order is not)
  tags = sorted({tag for snt in y for tag in snt})
  tags.append('<pad>')
  tag2id = { tag:i for i, tag in enumerate(tags) }
  id2tag = { i:tag for i, tag in enumerate(tags) }
  return X, y, tag2id, id2tag

train_X, train_y, tag2id, id2tag = parse(traindata)
dev_X, dev_y, _, _ = parse(devdata)
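To make the two dictionaries concrete, here is a minimal sketch on a made-up tag inventory. The names prefixed with `toy_` are hypothetical and are used only so that this example does not overwrite the notebook's own `tag2id` and `id2tag`.

```python
# two tiny "sentences" of tags, invented for illustration
toy_y = [['ADV', 'PREP', 'N'], ['ART', 'N', 'V']]

# sorted tag inventory plus the extra padding label, as in parse()
toy_tags = sorted({tag for snt in toy_y for tag in snt}) + ['<pad>']
toy_tag2id = {tag: i for i, tag in enumerate(toy_tags)}
toy_id2tag = {i: tag for i, tag in enumerate(toy_tags)}

# encode the tags as integer ids and decode them back
toy_encoded = [[toy_tag2id[tag] for tag in snt] for snt in toy_y]
toy_decoded = [[toy_id2tag[i] for i in snt] for snt in toy_encoded]
print(toy_encoded)
```

The model works with the integer ids, while `id2tag` is what turns its predictions back into readable tags at the end of the notebook.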

Taking a look at the contents of the *corpus*.

In [None]:
print(dev_X[0])
print(dev_y[0])

Ainda em dezembro de 1990 , foi editada a famosa 289 , que modificava a sistemática da arrecadação do ITR e alterava suas alíquotas .
['ADV', 'PREP', 'N', 'PREP', 'N', 'PU', 'V', 'PCP', 'ART', 'ADJ', 'N', 'PU', 'PRO-KS', 'V', 'ART', 'N', 'PREP+ART', 'N', 'PREP+ART', 'NPROP', 'KC', 'V', 'PROADJ', 'N', 'PU']


## Tokenization adjustment

Because the BERTimbau tokenizer may split a word into sub-tokens, the part-of-speech tags must be aligned with those sub-tokens: every sub-token receives the tag of the word it belongs to.

In [None]:
# Aligning tags to sub-words
from transformers import AutoTokenizer

def align(X, y):
  tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased', do_lower_case=False)
  
  procdata = []
  for (X_, y_) in zip(X, y):
    inputs = tokenizer(X_, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    try:
      new_tags = ['<pad>']  # label for [CLS]
      pos = 0
      for token in tokens[1:-1]:
        if token.startswith('##'):
          # continuation piece: repeat the tag of the current word
          new_tags.append(y_[pos-1])
        else:
          new_tags.append(y_[pos])
          pos += 1
      new_tags.append('<pad>')  # label for [SEP]

      procdata.append({ 'X': X_, 'y': ' '.join(new_tags) })
    except IndexError:
      # sentences with more word-initial pieces than tags cannot be aligned and are skipped
      pass
  return procdata

trainset = align(train_X, train_y)
devset = align(dev_X, dev_y)

len(trainset), len(devset)

(29418, 1573)
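The alignment rule can be tried without downloading the tokenizer. The sketch below applies the same logic as `align` to a hand-written list of word pieces; the function name `align_tags` and the segmentation in `toy_pieces` are assumptions made for illustration, and the real tokenizer may split these words differently.

```python
def align_tags(pieces, word_tags, special_tag='<pad>'):
    """Give every word piece the tag of the word it belongs to.

    pieces    -- word pieces including the leading [CLS] and trailing [SEP]
    word_tags -- one tag per whitespace-separated word
    """
    aligned = [special_tag]              # label for [CLS]
    pos = 0
    for piece in pieces[1:-1]:
        if piece.startswith('##'):
            # continuation piece: reuse the tag of the current word
            aligned.append(word_tags[pos - 1])
        else:
            aligned.append(word_tags[pos])
            pos += 1
    aligned.append(special_tag)          # label for [SEP]
    return aligned

# hypothetical segmentation of "a arrecadação do ITR"
toy_pieces = ['[CLS]', 'a', 'arrec', '##ada', '##ção', 'do', 'ITR', '[SEP]']
toy_word_tags = ['ART', 'N', 'PREP+ART', 'NPROP']
toy_aligned = align_tags(toy_pieces, toy_word_tags)
print(list(zip(toy_pieces, toy_aligned)))
```

This heuristic assumes that every piece without the `##` prefix starts a new whitespace-separated word. A word that contains punctuation may be split into several pieces without the prefix, in which case the tags shift or the sentence runs out of tags, which is the situation the `try`/`except` in `align` guards against.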

Importing dependencies.

For more information about these dependencies/libraries, see: https://github.com/neuralmind-ai/portuguese-bert

In [None]:
import os
import torch
import torch.nn as nn
from torch import optim
from transformers import AutoTokenizer, AutoModelForTokenClassification
from sklearn.metrics import accuracy_score, f1_score, classification_report  

Setting the neural network parameters.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
nclasses = len(tag2id)
nepochs = 3
batch_size = 16
batch_status = 32
learning_rate = 1e-5 # use a very low learning rate when fine-tuning pre-trained models

early_stop = 2
max_length = 180
write_path = 'model'

Splitting the training and validation data into batches.

In [None]:
from torch.utils.data import DataLoader

traindata = DataLoader(trainset, batch_size=batch_size, shuffle=True)
# the validation set does not need to be shuffled
devdata = DataLoader(devset, batch_size=batch_size, shuffle=False)
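Since each example is a dictionary of two strings, a batch produced by the `DataLoader` is a dictionary whose values are lists of strings, which is why the later cells read `inp['X']` and `inp['y']`. The pure-Python sketch below imitates that grouping without shuffling; `toy_batches` and the toy examples are hypothetical and are not how PyTorch implements it internally.

```python
def toy_batches(examples, batch_size):
    """Yield batches shaped like {'X': [...], 'y': [...]} from a list of dicts."""
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        # group the values of each key into one list per batch
        yield {key: [example[key] for example in chunk] for key in chunk[0]}

# invented examples in the same format that align() produces
toy_set = [
    {'X': 'O menino', 'y': '<pad> ART N <pad>'},
    {'X': 'foi', 'y': '<pad> V <pad>'},
    {'X': 'a escola', 'y': '<pad> ART N <pad>'},
]
toy_loader = list(toy_batches(toy_set, batch_size=2))
print(toy_loader[0])
```

The last batch may be smaller than `batch_size`, as it is here with three examples and a batch size of two.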

Initializing the tokenizer, model and optimizer. The loss is computed by the model itself whenever labels are passed to it.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased', do_lower_case=False)
model = AutoModelForTokenClassification.from_pretrained('neuralmind/bert-base-portuguese-cased', num_labels=nclasses).to(device)

optimizer = optim.AdamW(model.parameters(), lr=learning_rate)

Some weights of the model checkpoint at neuralmind/bert-base-portuguese-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model check

Defining the evaluation method.

In [None]:
def evaluate(model, testdata):
  model.eval()
  y_real, y_pred = [], []
  for batch_idx, inp in enumerate(testdata):
    texts = inp['X']
    
    labels = []
    for tags in inp['y']:
      # truncate the gold tags to the same length as the tokenized input
      tag_idxs = [tag2id[tag] for tag in tags.split()][:max_length]
      labels.append(tag_idxs)

    # classifying (gradients are not needed for evaluation)
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(device)
    with torch.no_grad():
      output = model(**inputs)

    pred_labels = torch.argmax(output.logits, 2).tolist()

    # drop the [CLS]/[SEP] positions and any batch padding before scoring
    for i in range(len(labels)):
      y_real.extend(labels[i][1:-1])
      seq_size = len(labels[i][1:-1])
      y_pred.extend(pred_labels[i][1:seq_size+1])
    
    if (batch_idx+1) % batch_status == 0:
      print('Progress:', round(batch_idx / len(testdata), 2), batch_idx)
  
  print(classification_report(y_real, y_pred))
  f1 = f1_score(y_real, y_pred, average='weighted')
  acc = accuracy_score(y_real, y_pred)
  return f1, acc
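The slicing inside `evaluate` is easier to follow on a toy case. In the sketch below, the gold sequence carries the `<pad>` id at the first and last positions, and the prediction row is longer because it was padded to the longest sentence in the batch. The function name `strip_special` and the id values are made up for illustration.

```python
def strip_special(gold_ids, pred_ids):
    """Keep only the positions that correspond to real tokens."""
    gold = gold_ids[1:-1]              # drop the labels of [CLS] and [SEP]
    pred = pred_ids[1:len(gold) + 1]   # same span; ignores [CLS], [SEP] and batch padding
    return gold, pred

toy_gold = [9, 0, 3, 2, 9]             # e.g. <pad> ADV PREP N <pad>
toy_pred = [9, 0, 3, 1, 9, 9, 9]       # row padded to a batch length of 7
gold, pred = strip_special(toy_gold, toy_pred)

# token-level accuracy on the remaining positions
toy_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(gold, pred, toy_accuracy)
```

Only the three real tokens are scored here, and two of the three predictions match, so neither the special tokens nor the padding inflate the metrics.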

## Training

**IMPORTANT:** Select a GPU runtime in Colab.

In [None]:
from torch.nn.utils.rnn import pad_sequence

max_f1, repeat = 0, 0
for epoch in range(nepochs):
  model.train()
  losses = []
  for batch_idx, inp in enumerate(traindata):
    texts = inp['X']
    
    labels = []
    for tags in inp['y']:
      tag_idxs = [tag2id[tag] for tag in tags.split()]
      labels.append(torch.tensor(tag_idxs[:max_length]))
    
    # pad so that all sequences in this batch have the same length: (batch, seq_len)
    labels = pad_sequence(labels, batch_first=True, padding_value=tag2id['<pad>'])

    # classifying
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(device)
    output = model(**inputs, labels=labels.to(device))

    # Computing the loss (error)
    loss = output.loss
    losses.append(float(loss))

    # Backpropagation based on the loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Printing the progress
    if (batch_idx+1) % batch_status == 0:
      print('Training epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\tTotal Loss: {:.6f}'.format(epoch, \
        batch_idx+1, len(traindata), 100. * batch_idx / len(traindata), 
        float(loss), round(sum(losses) / len(losses), 5)))
  
  f1, acc = evaluate(model, devdata)
  print('F1: ', f1, 'Accuracy: ', acc)
  if f1 > max_f1:
    # keep the checkpoint with the best validation F1 so far
    model.save_pretrained(os.path.join(write_path, 'model'))
    max_f1 = f1
    repeat = 0
    print('Saving the best model ...')
  else:
    repeat += 1
  
  # early stopping: give up after early_stop epochs without improvement
  if repeat == early_stop:
    break

Progress: 0.31 31
Progress: 0.64 63
Progress: 0.96 95


              precision    recall  f1-score   support

           0       0.98      0.99      0.98      2771
           1       0.96      0.97      0.97      1934
           2       0.94      0.92      0.93       915
           3       0.67      0.64      0.65        25
           4       0.99      1.00      1.00      3938
           5       0.98      0.98      0.98       489
           6       0.97      1.00      0.98       295
           8       0.98      0.98      0.98      7793
           9       0.00      0.00      0.00         6
          10       1.00      1.00      1.00      4081
          11       0.90      0.92      0.91       339
          12       0.96      0.95      0.95        56
          13       0.98      0.97      0.98       945
          14       0.94      0.90      0.92       193
          15       0.97      0.96      0.97       363
          16       1.00      0.52      0.69        23
          17       0.99      0.99      0.99      1811
          18       0.00    
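The stopping rule used in the training loop can be isolated in a few lines. The sketch below replays it over a list of validation F1 scores; the function name `run_with_patience` and the scores are invented for illustration and are not results from this notebook.

```python
def run_with_patience(scores, patience):
    """Return (best_epoch, last_epoch_run) under the notebook's stopping rule."""
    best, best_epoch, repeat = 0, None, 0
    for epoch, score in enumerate(scores):
        if score > best:
            # improvement: remember it and reset the counter
            best, best_epoch, repeat = score, epoch, 0
        else:
            repeat += 1
        if repeat == patience:
            return best_epoch, epoch   # stop early
    return best_epoch, len(scores) - 1

toy_f1 = [0.90, 0.95, 0.94, 0.93, 0.97]   # made-up validation scores
toy_best, toy_last = run_with_patience(toy_f1, patience=2)
print(toy_best, toy_last)
```

In this toy run, training stops after the fourth epoch and the checkpoint from the second epoch is the one kept, so the later score of 0.97 is never reached. With the notebook's settings (`nepochs = 3`, `early_stop = 2`), the rule can only fire on the final epoch, so it has a practical effect only if `nepochs` is increased.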

## Prediction

Testing the model on the example sentence from the chapter that accompanies the short course.

In [None]:
model.eval()
#inputs = tokenizer("O menino foi para a escola de ônibus", return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(device)
inputs = tokenizer("Com estas palavras, André Coruja, além de quebrar o gelo que havia esfriado o clima, devolveu ao recinto a eloquência necessária para que a sessão continuasse.", return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(device)
with torch.no_grad():
  output = model(**inputs)

pred_labels = torch.argmax(output.logits, 2).tolist()

# mapping the predicted ids back to tag names
labels = []
for ids in pred_labels:
  labels.append([id2tag[idx] for idx in ids])

print(inputs)
print(labels)

{'input_ids': tensor([[  101,   761,  3769,  3724,   117,  5845,  1553, 22288,   524,   117,
          1166,   125, 14195,   146,  8096,   179,  1021, 15518, 14996,   146,
          4885,   117,  9865,  1352,   320, 21355,   123,  4129,  2768,  7528,
           221,   179,   123,  8729,  4390, 22281,   236,   119,   102]],
       device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
[['<pad>', 'PREP', 'PROADJ', 'N', 'PU', 'NPROP', 'NPROP', 'NPROP', 'NPROP', 'PU', 'PREP', 'PREP', 'V', 'ART', 'N', 'PRO-KS', 'V', 'PCP', 'PCP', 'ART', 'N', 'PU', 'V', 'V', 'PREP+ART', 'N', 'ART', 'N', 'N', 'ADJ', 'PREP', 'KS', 'ART', 'N', 'V', 'V', 'V', 'PU', '<pad>']]
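The printed list has one tag per word piece, including the two special tokens. To read the result word by word, the pieces can be merged back together while keeping the tag of the first piece of each word. The sketch below shows one way to do this; `merge_pieces` and the segmentation in `toy_pieces` are assumptions for illustration, since the actual pieces would come from `tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])`.

```python
def merge_pieces(pieces, tags):
    """Rebuild words from word pieces, keeping the tag of each word's first piece."""
    words, word_tags = [], []
    # skip the [CLS] and [SEP] positions at both ends
    for piece, tag in zip(pieces[1:-1], tags[1:-1]):
        if piece.startswith('##') and words:
            words[-1] += piece[2:]     # glue the continuation onto the current word
        else:
            words.append(piece)
            word_tags.append(tag)
    return list(zip(words, word_tags))

# hypothetical segmentation and tags
toy_pieces = ['[CLS]', 'André', 'Cor', '##uja', 'devolveu', '[SEP]']
toy_tags = ['<pad>', 'NPROP', 'NPROP', 'NPROP', 'V', '<pad>']
toy_words = merge_pieces(toy_pieces, toy_tags)
print(toy_words)
```

Taking the first piece's tag is a design choice; a majority vote over the pieces of a word would be another reasonable option.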


End of this example.