# Fine-tuning Pretrained Transformers for PoS Tagging

## Introduction

In the previous notebook we showed how to use a BiLSTM with pretrained GloVe embeddings for PoS tagging. In this notebook we'll be using a pretrained [Transformer](https://arxiv.org/abs/1706.03762) model, specifically the pre-trained [BERT](https://arxiv.org/abs/1810.04805) model. Our model will be composed of the Transformer and a simple linear layer.


First, let's import the necessary Python modules.
In Google Colab you will need to install the `transformers` library.

```
!pip install transformers
```

In [None]:
!pip install transformers


### Portuguese Corpus from Linguateca

* The corpus: https://www.linguateca.pt/CETENFolha/
* We'll download the corpus and extract only the sentences with their part-of-speech tags


In [None]:
import numpy as np
import pandas as pd
import io
import re
import gc

from sklearn.model_selection import train_test_split

In [None]:
!wget https://www.linguateca.pt/cetenfolha/download/CETENFolha-1.0_jan2014.cg.gz
!gunzip CETENFolha-1.0_jan2014.cg.gz

--2020-12-25 13:48:46--  https://www.linguateca.pt/cetenfolha/download/CETENFolha-1.0_jan2014.cg.gz
Resolving www.linguateca.pt (www.linguateca.pt)... 193.137.199.21, 2001:690:a00:3038:218::21
Connecting to www.linguateca.pt (www.linguateca.pt)|193.137.199.21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 264395706 (252M) [application/x-gzip]
Saving to: ‘CETENFolha-1.0_jan2014.cg.gz’


2020-12-25 13:49:15 (9.47 MB/s) - ‘CETENFolha-1.0_jan2014.cg.gz’ saved [264395706/264395706]



* Read the corpus

In [None]:
# Use a context manager so the file is closed once it has been read
with io.open("CETENFolha-1.0_jan2014.cg", mode="r", encoding="utf-8") as file:
    corpus = file.read()

* Preprocess, extract sentences and clean the dataset

In [None]:
def clean_and_split(corpus):
    # Strip <...> markup and [...] annotations, then normalize whitespace
    aux = re.sub(r'<.*?>', '', corpus)
    aux = re.sub(r'\[.*?\]', '', aux)
    aux = re.sub(r'  +', ' ', aux)
    aux = re.sub(r'\n\n+', '\n\n', aux)
    # Sentences are separated by a blank line
    return aux.split("\n\n")
    
sentences = clean_and_split(corpus)
del corpus

PUNCT_TAG = "__PUNCT__"
phrases = [[
        #(l.split("\t")[0].lower().strip() if len(l.split("\t")) > 1 else l.split(" ")[0],
        (l.split("\t")[0].strip() if len(l.split("\t")) > 1 else l.split(" ")[0],
        l.split("\t")[1].split(" ")[1] if len(l.split("\t")) > 1 else PUNCT_TAG)
        for l in s.split("\n")] 
        for s in list(filter(None, sentences))]
del sentences

# Remove empty sentences
phrases = list(filter(lambda p: (len(p) != 1 or p[0][0] != ''), phrases))

# Remove any sentence that contains a tag outside the allowed list
#allowed_tags = ['N','DET','PRP','V','PROP','$,','ADJ','$.','ADV','NUM','KC','$"','SPEC','PERS','KS','$)','$(','$:','$--','$?','$;',"$'",'$!','IN','$...','EC','$pause','$`','$+','$=','PRON','$$','$]','$[','PP','$±','$~','GER','PU','$|','M','$_']
allowed_tags = ['N','DET','PRP','V','PROP','ADJ','ADV','NUM','KC','SPEC','PERS','KS','IN','EC','PRON',PUNCT_TAG]
phrases = (list(filter(lambda s: all(tk[1] in allowed_tags for tk in s), phrases)))

print('Amount of sentences: ', len(phrases))

Amount of sentences:  1692979
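
To make the cleaning steps concrete, here is a minimal sketch that applies the same regular expressions to a tiny hand-written sample. The sample text and the `parse_sample` name are hypothetical. The layout (`word<TAB>[annotation] <info> TAG features`, with punctuation on lines that contain no tab) is an assumption inferred from the parsing code above, not copied from the corpus.

```python
import re

PUNCT_TAG = "__PUNCT__"

# Hypothetical sample, hand-written to imitate the layout the parsing code expects.
sample = (
    "<p>\n<s>\n"
    "O\t[o] <artd> DET M S\n"
    "governo\t[governo] N M S\n"
    "$.\n"
    "</s>\n</p>\n"
    "<p>\n<s>\n"
    "PT\t[PT] PROP M S\n"
    "</s>\n</p>\n"
)

def parse_sample(text):
    text = re.sub(r"<.*?>", "", text)      # drop markup such as <s> and <artd>
    text = re.sub(r"\[.*?\]", "", text)    # drop the bracketed annotation
    text = re.sub(r"  +", " ", text)       # collapse runs of spaces
    text = re.sub(r"\n\n+", "\n\n", text)  # exactly one blank line between sentences
    sentences = [s for s in text.split("\n\n") if s]
    parsed = []
    for sentence in sentences:
        pairs = []
        for line in sentence.split("\n"):
            cols = line.split("\t")
            if len(cols) > 1:
                # cols[1] looks like " DET M S": the tag is the second space-separated item
                pairs.append((cols[0].strip(), cols[1].split(" ")[1]))
            else:
                # no tab on the line: a punctuation token such as "$."
                pairs.append((line.split(" ")[0], PUNCT_TAG))
        parsed.append(pairs)
    return parsed

for sentence in parse_sample(sample):
    print(sentence)
```

Each sentence comes back as a list of `(token, tag)` pairs, the same structure as `phrases`.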


In [None]:
phrases[102]

[('Os', 'DET'),
 ('Papéis=De=FLAVIO-SHIRÓ', 'PROP'),
 ('$--', '__PUNCT__'),
 ('A', 'PERS'),
 ('mostra', 'V'),
 ('$,', '__PUNCT__'),
 ('que', 'KS'),
 ('faz', 'V'),
 ('parte', 'N'),
 ('de', 'PRP'),
 ('as', 'DET'),
 ('comemorações', 'N'),
 ('de', 'PRP'),
 ('os', 'DET'),
 ('50', 'NUM'),
 ('anos', 'N'),
 ('de', 'PRP'),
 ('pintura', 'N'),
 ('de', 'PRP'),
 ('o', 'DET'),
 ('artista', 'N'),
 ('japonês', 'ADJ'),
 ('$,', '__PUNCT__'),
 ('reúne', 'V'),
 ('25', 'NUM'),
 ('obras', 'N'),
 ('em', 'PRP'),
 ('pequenos', 'ADJ'),
 ('formatos', 'N'),
 ('$.', '__PUNCT__')]

* Split the dataset into Train, Validation and Test

In [None]:
phrases_train , phrases_test = train_test_split(phrases,test_size=0.2, random_state=42)
phrases_train , phrases_valid = train_test_split(phrases_train,test_size=0.2, random_state=42)
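
Two successive 80/20 splits leave roughly 64% of the sentences for training, 16% for validation and 20% for testing. The sketch below reproduces the arithmetic, assuming the held-out part of each split is rounded up; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_total, test_size=0.2):
    # Assumption: the held-out part is ceil(test_size * n_total), the rest is kept.
    n_held_out = math.ceil(test_size * n_total)
    return n_total - n_held_out, n_held_out

n_sentences = 1692979  # "Amount of sentences" printed above

n_train_full, n_test = split_sizes(n_sentences)
n_train, n_valid = split_sizes(n_train_full)

print(n_train, n_valid, n_test)
```

These sizes can be compared with the dataset lengths printed after loading the files.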

* Save the dataset again as separate files into the expected format for TorchText SequenceTaggingDataset

In [None]:
import random
random.seed(42)

In [None]:
sep_word="\t"
sep_sentence="\n"

f = open("train.txt","w")
for p in phrases_train: 
  for w in p:
    f.write(w[0]+sep_word+w[1]+sep_sentence)
  f.write(sep_sentence)
f.close()


f = open("valid.txt","w")
for p in phrases_valid: 
  for w in p:
    f.write(w[0]+sep_word+w[1]+sep_sentence)
  f.write(sep_sentence)
f.close()

f = open("test.txt","w")
for p in phrases_test: 
  for w in p:
    f.write(w[0]+sep_word+w[1]+sep_sentence)
  f.write(sep_sentence)
f.close()
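
The files use one `token<TAB>tag` pair per line and a blank line between sentences. As a quick sanity check of that format, the following sketch writes a toy dataset to a temporary file and reads it back. `write_tagged` and `read_tagged` are hypothetical helpers introduced only for this illustration.

```python
import os
import tempfile

def write_tagged(path, phrases, sep_word="\t", sep_sentence="\n"):
    with open(path, "w", encoding="utf-8") as f:
        for phrase in phrases:
            for word, tag in phrase:
                f.write(word + sep_word + tag + sep_sentence)
            f.write(sep_sentence)  # a blank line ends the sentence

def read_tagged(path, sep_word="\t"):
    phrases, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                word, tag = line.split(sep_word)
                current.append((word, tag))
            elif current:
                phrases.append(current)
                current = []
    if current:  # file did not end with a blank line
        phrases.append(current)
    return phrases

toy = [[("PT", "PROP"), ("em", "PRP"), ("o", "DET"), ("governo", "N")],
       [("$.", "__PUNCT__")]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.txt")
    write_tagged(path, toy)
    restored = read_tagged(path)

print(restored == toy)
```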

### Preparing the corpus as a TorchText Dataset





In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torchtext import data
from torchtext import datasets

from transformers import BertTokenizer, BertModel

import numpy as np

import time
import random
import functools

Next, we'll set the random seeds for reproducibility.

In [None]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Then, we'll import the BERT tokenizer. This defines how text into the model should be processed, but more importantly contains the vocabulary that the BERT model was pretrained with. We'll be using the `neuralmind/bert-base-portuguese-cased` tokenizer and model, a BERT model pretrained on Portuguese text. It is a cased model, so the text is not lowercased.

In order to use pretrained models for NLP the vocabulary used needs to exactly match that of the pretrained model.

In [None]:
### Here I'll be using a different pre-trained model: https://github.com/neuralmind-ai/portuguese-bert
tokenizer = BertTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')

Another thing that we need to do is make sure the input sequence is formatted in the same way in which the BERT model was trained. 

BERT was trained on sequences that begin with a `[CLS]` token.

So the sequence of tokens

```python
text = ['jack', 'went', 'to', 'the', 'shop']
```

should become:

```python
text = ['[CLS]', 'jack', 'went', 'to', 'the', 'shop']
```

Along with making our vocabularies match we also need to make sure our padding and unk tokens match those used in the pretrained model. By default TorchText uses `<pad>` and `<unk>`, but the BERT model uses `[PAD]` and `[UNK]`.

Let's get the special tokens:

In [None]:
init_token = tokenizer.cls_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, pad_token, unk_token)

[CLS] [PAD] [UNK]


We are mainly interested in the actual integer representations of the special tokens. This is because we aren't using TorchText's vocabulary module, but using the one provided by the pretrained model. 

We get the indexes of the special tokens by passing them through the tokenizer's `convert_tokens_to_ids` function.

In [None]:
init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

print(init_token_idx, pad_token_idx, unk_token_idx)

101 0 100


One other thing is that the pretrained model was trained on sequences up to a maximum length and we need to ensure that our sequences are also trimmed to this length.

In [None]:
max_input_length = tokenizer.max_model_input_sizes['bert-base-cased']

print(max_input_length)

512


Next, we'll define two helper functions that make use of our vocabulary.

The first will cut the sequence of tokens to the desired maximum length, specified by our pretrained model, and then convert the tokens into indexes by passing them through the vocabulary. This is what we will use on our input sequence we want to tag.

Note that we actually cut the tokens to `max_input_length-1`. This is because we need to add the special `[CLS]` token to the start of the sequence.

In [None]:
def cut_and_convert_to_id(tokens, tokenizer, max_input_length):
    tokens = tokens[:max_input_length-1]
    tokens = tokenizer.convert_tokens_to_ids(tokens)
    return tokens

The second helper function simply cuts the sequence to the maximum length. This is used for our tags. We do not pass the tags through the pretrained model's vocabulary, as that vocabulary was built for Portuguese text and not for part-of-speech tags. We will be building the tag vocabulary ourselves.

In [None]:
def cut_to_max_length(tokens, max_input_length):
    tokens = tokens[:max_input_length-1]
    return tokens
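
The sketch below walks through both helpers on a toy example. The vocabulary is hypothetical and stands in for the pretrained tokenizer; the only behaviour assumed from the real tokenizer is that tokens missing from the vocabulary map to the `[UNK]` id.

```python
# Hypothetical toy vocabulary; the real one comes from the pretrained tokenizer.
toy_vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "o": 5, "governo": 6}

def toy_convert_tokens_to_ids(tokens):
    # Tokens missing from the vocabulary fall back to the [UNK] id.
    return [toy_vocab.get(t, toy_vocab["[UNK]"]) for t in tokens]

def cut_and_convert(tokens, max_input_length):
    tokens = tokens[:max_input_length - 1]  # leave one slot for [CLS]
    return toy_convert_tokens_to_ids(tokens)

def cut(tags, max_input_length):
    return tags[:max_input_length - 1]

tokens = ["o", "governo", "Papéis", "$."]
tags = ["DET", "N", "PROP", "__PUNCT__"]

# With a maximum length of 4, three tokens survive and [CLS] fills the fourth slot.
ids = [toy_vocab["[CLS]"]] + cut_and_convert(tokens, max_input_length=4)
kept_tags = ["<pad>"] + cut(tags, max_input_length=4)

print(ids)
print(kept_tags)
```

The text ids and the tags stay the same length, which is what the loss computation relies on later.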

We need to pass the above two functions to the `Field`, the TorchText abstraction that handles a lot of the data processing for us. We make use of Python's `functools.partial`, which lets us pass functions that already have some of their arguments supplied.

In [None]:
text_preprocessor = functools.partial(cut_and_convert_to_id,
                                      tokenizer = tokenizer,
                                      max_input_length = max_input_length)

tag_preprocessor = functools.partial(cut_to_max_length,
                                     max_input_length = max_input_length)
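
As a small standalone illustration of `functools.partial`, the sketch below fixes `max_input_length` ahead of time so that the resulting callable takes only the token list. The value 5 is arbitrary and chosen just for the demonstration.

```python
import functools

def cut_to_max_length(tokens, max_input_length):
    return tokens[:max_input_length - 1]

# Fix max_input_length up front; the result takes a single argument.
cut_tags = functools.partial(cut_to_max_length, max_input_length=5)

print(cut_tags(["a", "b", "c", "d", "e", "f"]))
```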

Next, we define our fields.

For the `TEXT` field, which will be processing the sequences we want to tag, we first tell TorchText that we do not want to use a vocabulary with `use_vocab = False`. As our model is cased, we keep the original casing of the text with `lower = False`. The `preprocessing` argument is a function applied to sequences after they have been tokenized, but before they are numericalized. As we have set `use_vocab` to false, they will never actually be numericalized by TorchText, and as our dataset files already contain one token per line, the function is simply applied to the sequence of tokens. This is where our helper functions from above come in handy: `text_preprocessor` both numericalizes our data using the pretrained model's vocabulary and cuts it to the maximum length. The remaining three arguments define the special tokens required by the pretrained model.

For the `UD_TAGS` field, we need to ensure the length of our tags matches the length of our text sequence. As we have added a `[CLS]` token to the beginning of the text sequence, we need to do the same with the sequence of tags. We do this by adding a `<pad>` token to the beginning which we will later tell our model to not use when calculating losses or accuracy. We won't have unknown tags in our sequence of tags, so we set the `unk_token` to `None`. Finally, we pass our `tag_preprocessor` defined above, which simply cuts the tags to the maximum length our pretrained model can handle.

In [None]:
TEXT = data.Field(use_vocab = False,
                  lower = False,
                  preprocessing = text_preprocessor,
                  init_token = init_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

UD_TAGS = data.Field(unk_token = None,
                     init_token = '<pad>',
                     preprocessing = tag_preprocessor)

Then, we define which of our fields defined above correspond to which fields in the dataset.

In [None]:
fields = (("text", TEXT), ("udtags", UD_TAGS))

Next, we load the data using our fields.

In [None]:
train_data, valid_data, test_data = datasets.UDPOS.splits(
    fields=fields,
    path="./",
    train="train.txt",
    validation="valid.txt",
    test="test.txt", separator='\t')

In [None]:
print(len(train_data))
print(len(valid_data))
print(len(test_data))

1083506
270877
338596


In [None]:
train_data.examples[0]

<torchtext.data.example.Example at 0x7f7886772d68>

We can check an example by printing it. As we have already numericalized our `text` using the vocabulary of the pretrained model, it is already a sequence of integers. The tags have yet to be numericalized. 

In [None]:
idx=0
print(vars(train_data.examples[idx])['text'])
print(vars(train_data.examples[idx])['udtags'])

[100, 17399, 100]
['PROP', 'ADJ', '__PUNCT__']


Our next step is to build the tag vocabulary so they can be numericalized during training. We do this by using the field's `.build_vocab` method on the `train_data`.

In [None]:
UD_TAGS.build_vocab(train_data)

print(UD_TAGS.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7f7a06206e18>, {'<pad>': 0, 'N': 1, 'DET': 2, 'PRP': 3, '__PUNCT__': 4, 'V': 5, 'PROP': 6, 'ADJ': 7, 'ADV': 8, 'NUM': 9, 'KC': 10, 'SPEC': 11, 'PERS': 12, 'KS': 13, 'IN': 14, 'EC': 15, 'PRON': 16})
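
A rough sketch of what building the tag vocabulary amounts to is shown below. It assumes special tokens come first and the remaining tags are ordered by descending frequency, which is consistent with the mapping printed above. `build_tag_vocab` is a hypothetical stand-in, not the TorchText implementation.

```python
from collections import Counter

def build_tag_vocab(tagged_sentences, specials=("<pad>",)):
    counts = Counter(tag for sent in tagged_sentences for tag in sent)
    # Specials first, then tags by descending frequency (ties broken alphabetically).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tag for tag, _ in ordered]
    stoi = {tag: i for i, tag in enumerate(itos)}
    return itos, stoi

toy_tags = [["DET", "N", "V", "DET", "N"], ["N", "PRP", "N"]]
itos, stoi = build_tag_vocab(toy_tags)

print(itos)
print(stoi)
```

Because `<pad>` always lands at index 0, the same index can later be passed to the loss function as the one to ignore.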


Next, we'll define our iterators. This will define how batches of data are provided when training. We set a batch size and define `device`, which will automatically put our batch on to the GPU, if we have one.

The BERT model is quite large, so the batch size here is smaller than usual. However, the BERT paper itself mentions how they also fine-tuned using small batch sizes, so this shouldn't cause too much of an issue.

In [None]:
BATCH_SIZE = 32

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)

## Building the Model

Next up is defining our model. The model is relatively simple, with all of the complicated parts contained inside the BERT module, which we do not have to worry about. We can think of BERT as an embedding layer, and all we do is add a linear layer on top of these embeddings to predict the tag for each token in the input sequence.

![](https://github.com/bentrevett/pytorch-pos-tagging/blob/master/assets/pos-bert.png?raw=1)

Previously the yellow squares were the embeddings provided by the embedding layer, but now they are embeddings provided by the pretrained BERT model. All inputs are passed to BERT at the same time. The arrows between the BERT embeddings indicate that BERT does not calculate the embedding for each token individually; each embedding is based on the other tokens within the sequence. We say the embeddings are *contextualized*.

One thing to note is that we do not define an `embedding_dim` for our model. It is the size of the output of the pretrained BERT model and we cannot change it. Thus, we simply get the `embedding_dim` from the model's `hidden_size` attribute.

BERT also wants sequences with the batch element first, hence we permute our input sequence before passing it to BERT.

In [None]:
class BERTPoSTagger(nn.Module):
    def __init__(self,
                 bert,
                 output_dim, 
                 dropout):
        
        super().__init__()
        
        self.bert = bert
        
        embedding_dim = bert.config.to_dict()['hidden_size']
        
        self.fc = nn.Linear(embedding_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
  
        #text = [sent len, batch size]
    
        text = text.permute(1, 0)
        
        #text = [batch size, sent len]
        
        embedded = self.dropout(self.bert(text)[0])
        
        #embedded = [batch size, seq len, emb dim]
                
        embedded = embedded.permute(1, 0, 2)
                    
        #embedded = [sent len, batch size, emb dim]
        
        predictions = self.fc(self.dropout(embedded))
        
        #predictions = [sent len, batch size, output dim]
        
        return predictions

Next, we load the actual pretrained BERT model - before we only loaded the tokenizer associated with the model.

The first time we run this it will have to download the pretrained parameters.

In [None]:
### Here I'll be using a different pre-trained model: https://github.com/neuralmind-ai/portuguese-bert
bert = BertModel.from_pretrained('neuralmind/bert-base-portuguese-cased')

## Training the Model

We finally get to instantiate our model - a simple linear model that uses the BERT model to get word embeddings.

Best of all, the only hyperparameter is dropout! This value has been chosen as it's a sensible value, so there may be a better value of dropout available.

In [None]:
OUTPUT_DIM = len(UD_TAGS.vocab)
DROPOUT = 0.25

model = BERTPoSTagger(bert,
                      OUTPUT_DIM, 
                      DROPOUT)

We can then count the number of trainable parameters. This includes the linear layer and all of the BERT parameters.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 108,936,209 trainable parameters
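
To see how little the linear layer contributes to that total, here is a back-of-the-envelope calculation. It assumes a hidden size of 768, the usual value for a BERT-base model, and uses the 17 entries of the tag vocabulary printed earlier as the output dimension.

```python
hidden_size = 768   # assumption: the usual hidden size of a BERT-base model
output_dim = 17     # entries in the tag vocabulary printed earlier (indices 0 to 16)

linear_params = hidden_size * output_dim + output_dim  # weights plus biases
total_params = 108_936_209                             # count printed above
bert_params = total_params - linear_params

print(f"linear layer: {linear_params:,}")
print(f"BERT:         {bert_params:,}")
```

Under these assumptions, almost all of the trainable parameters sit inside BERT, which is why fine-tuning is so much slower than training the linear layer alone would be.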


Next, we define our optimizer. Usually when fine-tuning you want to use a lower learning rate than normal. This is because we don't want to drastically change the parameters, as that may cause our model to forget what it has learned. This phenomenon is called catastrophic forgetting.

We pick 5e-5 (0.00005) as it is one of the three values recommended in the BERT paper. Again, there may be better values for this dataset.

In [None]:
LEARNING_RATE = 5e-5

optimizer = optim.Adam(model.parameters(), lr = LEARNING_RATE)

The rest of the notebook is pretty similar to before.

We define a loss function, making sure to ignore losses whenever the target tag is a padding token.

In [None]:
TAG_PAD_IDX = UD_TAGS.vocab.stoi[UD_TAGS.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)
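
The effect of `ignore_index` can be sketched in plain Python: padded positions are skipped and the loss is averaged over the remaining ones. `cross_entropy_ignore` is a hypothetical helper written for this illustration, not the PyTorch implementation, and the scores are made up.

```python
import math

def cross_entropy_ignore(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    losses = []
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: excluded from the average
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses)

PAD = 0
targets = [1, 2, PAD]

# The two sets of scores differ only in the row that belongs to the padded position.
logits_a = [[0.0, 2.0, 0.0], [0.0, 0.0, 1.0], [5.0, 0.0, 0.0]]
logits_b = [[0.0, 2.0, 0.0], [0.0, 0.0, 1.0], [0.0, 9.0, 0.0]]

loss_a = cross_entropy_ignore(logits_a, targets, ignore_index=PAD)
loss_b = cross_entropy_ignore(logits_b, targets, ignore_index=PAD)

print(loss_a == loss_b)
```

Whatever the model predicts for a padded position, it has no influence on the loss.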

Then, we place the model on to the GPU, if we have one.

In [None]:
model = model.to(device)
criterion = criterion.to(device)

Like in the previous tutorial, we define a function which calculates our accuracy of predicting tags, ignoring predictions over padding tokens.

In [None]:
def categorical_accuracy(preds, y, tag_pad_idx):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    max_preds = preds.argmax(dim = 1, keepdim = True) # get the index of the max probability
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])
    return correct.sum() / torch.FloatTensor([y[non_pad_elements].shape[0]]).to(device)
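
The same masking idea applies to accuracy. Here is a minimal plain-Python sketch with made-up tag indices, where `masked_accuracy` is a hypothetical helper that compares predictions only at positions whose gold tag is not padding.

```python
def masked_accuracy(pred_tags, gold_tags, pad_idx):
    # Keep only positions where the gold tag is a real tag.
    pairs = [(p, g) for p, g in zip(pred_tags, gold_tags) if g != pad_idx]
    correct = sum(p == g for p, g in pairs)
    return correct / len(pairs)

gold = [0, 1, 2, 5, 0, 0]   # 0 is the padding index
pred = [3, 1, 2, 4, 1, 0]   # predictions at padded positions are irrelevant

acc = masked_accuracy(pred, gold, pad_idx=0)
print(f"{acc:.3f}")
```

Three positions carry real tags and two of them are predicted correctly, so the accuracy is 2/3 rather than 3/6.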

We then define our `train` and `evaluate` functions to train and test our model. 

In [None]:
def train(model, iterator, optimizer, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        text = batch.text
        tags = batch.udtags
                
        optimizer.zero_grad()
        
        #text = [sent len, batch size]
        
        predictions = model(text)
        
        #predictions = [sent len, batch size, output dim]
        #tags = [sent len, batch size]
        
        predictions = predictions.view(-1, predictions.shape[-1])
        tags = tags.view(-1)
        
        #predictions = [sent len * batch size, output dim]
        #tags = [sent len * batch size]
        
        loss = criterion(predictions, tags)
                
        acc = categorical_accuracy(predictions, tags, tag_pad_idx)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
def evaluate(model, iterator, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text = batch.text
            tags = batch.udtags
            
            predictions = model(text)
            
            predictions = predictions.view(-1, predictions.shape[-1])
            tags = tags.view(-1)
            
            loss = criterion(predictions, tags)
            
            acc = categorical_accuracy(predictions, tags, tag_pad_idx)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Then, we define a helper function used to see how long an epoch takes.

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, we can train our model!

This model takes a considerable amount of time per epoch compared to the last model, as the number of parameters is significantly higher and the corpus is large. In the run below each epoch takes a little over 100 minutes, and the validation accuracy is already above 91% after the first epoch.

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion, TAG_PAD_IDX)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion, TAG_PAD_IDX)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01 | Epoch Time: 103m 35s
	Train Loss: 0.194 | Train Acc: 93.67%
	 Val. Loss: 0.260 |  Val. Acc: 91.65%
Epoch: 02 | Epoch Time: 103m 41s
	Train Loss: 0.163 | Train Acc: 94.63%
	 Val. Loss: 0.260 |  Val. Acc: 91.63%
Epoch: 03 | Epoch Time: 103m 59s
	Train Loss: 0.148 | Train Acc: 95.08%
	 Val. Loss: 0.256 |  Val. Acc: 91.83%
Epoch: 04 | Epoch Time: 101m 49s
	Train Loss: 0.138 | Train Acc: 95.39%
	 Val. Loss: 0.262 |  Val. Acc: 91.69%
Epoch: 05 | Epoch Time: 101m 42s
	Train Loss: 0.130 | Train Acc: 95.65%
	 Val. Loss: 0.261 |  Val. Acc: 91.59%
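
The checkpoint logic in the training loop keeps the weights from the epoch with the lowest validation loss. Replaying it on the (rounded) validation losses printed above shows which epoch that is; this is only an illustration of the selection rule, with no model involved.

```python
valid_losses = [0.260, 0.260, 0.256, 0.262, 0.261]  # rounded values printed above

best_valid_loss = float("inf")
best_epoch = None

for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:   # strict improvement only
        best_valid_loss = valid_loss
        best_epoch = epoch             # the notebook saves the weights at this point

print(best_epoch, best_valid_loss)
```

Training loss keeps falling after that epoch while validation loss does not, which is the usual sign that further epochs mostly fit the training set.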


We can then load our "best" performing model, the one with the lowest validation loss, and try it out on the test set.

The test accuracy lands close to the validation accuracy, at just under 92%.

In [None]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion, TAG_PAD_IDX)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.257 | Test Acc: 91.79%


## Inference

We'll now see how to use our model to tag actual sentences. This is similar to the inference function from the previous notebook with the tokenization changed to match the format of our pretrained model.

If we pass in a string, we need to split it into individual tokens, which we do by using the `tokenize` function of the `tokenizer`. Afterwards, we numericalize our tokens the same way we did before, using `convert_tokens_to_ids`. Then, we add the `[CLS]` token index to the beginning of the sequence.

**Note**: if we forget to add the `[CLS]` token our results will not be good!

We then pass the text sequence through our model to get a prediction for each token and then slice off the predictions for the `[CLS]` token as we do not care about it.

In [None]:
def tag_sentence(model, device, sentence, tokenizer, text_field, tag_field):
    
    model.eval()
    
    if isinstance(sentence, str):
        tokens = tokenizer.tokenize(sentence)
    else:
        tokens = sentence
    
    numericalized_tokens = tokenizer.convert_tokens_to_ids(tokens)
    numericalized_tokens = [text_field.init_token] + numericalized_tokens
        
    unk_idx = text_field.unk_token
    
    unks = [t for t, n in zip(tokens, numericalized_tokens) if n == unk_idx]
    
    token_tensor = torch.LongTensor(numericalized_tokens)
    
    token_tensor = token_tensor.unsqueeze(-1).to(device)
         
    predictions = model(token_tensor)
    
    top_predictions = predictions.argmax(-1)
    
    predicted_tags = [tag_field.vocab.itos[t.item()] for t in top_predictions]
    
    predicted_tags = predicted_tags[1:]
        
    assert len(tokens) == len(predicted_tags)
    
    return tokens, predicted_tags, unks

We can then run an example sentence through our model and receive the predicted tags.

In [None]:
sentence = ' '.join([p[0] for p in phrases[0]])

tokens, tags, unks = tag_sentence(model, 
                                  device, 
                                  sentence,
                                  tokenizer,
                                  TEXT, 
                                  UD_TAGS)

print(unks)

[]


We can then print out the tokens and their corresponding tags.

In this example each word maps to a single token, because all four words are in the tokenizer's vocabulary. Words that are not in the vocabulary are split into smaller pieces, and every piece after the first is prefixed with two hash symbols. This is due to the way the tokenizer works: BERT uses WordPiece, a subword scheme closely related to [byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding), to split words up into more common subsequences of characters.
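
BERT tokenizers split a word by greedy longest-match-first lookup against the vocabulary (WordPiece), marking continuation pieces with `##`. The sketch below implements that lookup with a tiny hypothetical vocabulary, so the splits it produces are illustrative and not those of the real Portuguese tokenizer.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter substring
        if piece is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary.
toy_vocab = {"governo", "govern", "##s", "##ador"}

for word in ["governo", "governos", "governador", "xyz"]:
    print(word, wordpiece(word, toy_vocab))
```

When a word is split this way, the model predicts one tag per piece, so a sentence can yield more predicted tags than it has words.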

In [None]:
print(tokens)

['PT', 'em', 'o', 'governo']


In [None]:
print("Pred. Tag\tToken\n")

for token, tag in zip(tokens, tags):
    print(f"{tag}\t\t{token}")

Pred. Tag	Token

PROP		PT
PRP		em
DET		o
N		governo


In [None]:
phrases[0]

[('PT', 'PROP'), ('em', 'PRP'), ('o', 'DET'), ('governo', 'N')]

We've now fine-tuned a BERT model for part-of-speech tagging! Well done us!

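## K-Fold Cross Validation on the CETENFolha Corpus

* Write the test samples of each fold to files. The model is not retrained for each fold: the saved checkpoint is evaluated on every fold, so the folds also contain sentences that were seen during training.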

In [None]:
from sklearn.model_selection import KFold

In [None]:
kf = KFold(n_splits=5, random_state=42, shuffle=True)

# Sentences have different lengths, so store them in an object array
phrases_data = np.array(phrases, dtype=object)

sep_word="\t"
sep_sentence="\n"
cv_round = 0  # avoid shadowing the built-in round()
for train_index, test_index in kf.split(phrases_data):
    cv_round += 1
    print('CV Round '+str(cv_round))
    print('Train test split: '+ str(len(train_index))+','+str(len(test_index)))
    # Write the test portion of this fold in the same tab-separated format as before
    with open("test-kfold-round-"+str(cv_round)+".txt","w") as f:
        for p in phrases_data[test_index]:
            for w in p:
                f.write(w[0]+sep_word+w[1]+sep_sentence)
            f.write(sep_sentence)


CV Round 1
Train test split: 1354383,338596
CV Round 2
Train test split: 1354383,338596
CV Round 3
Train test split: 1354383,338596
CV Round 4
Train test split: 1354383,338596
CV Round 5
Train test split: 1354384,338595
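
The fold sizes follow from how `KFold` distributes the remainder when the number of samples is not divisible by the number of splits: the first `n_samples % n_splits` folds receive one extra sample. A quick check of the arithmetic, with `kfold_test_sizes` as a hypothetical helper:

```python
def kfold_test_sizes(n_samples, n_splits):
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds get one additional sample.
    return [base + 1 if i < extra else base for i in range(n_splits)]

sizes = kfold_test_sizes(1692979, 5)
print(sizes)
print(sum(sizes))
```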


In [None]:
fold1, fold2, fold3= datasets.UDPOS.splits(
    fields=fields,
    path="./",
    train="test-kfold-round-1.txt",
    validation="test-kfold-round-2.txt",
    test="test-kfold-round-3.txt", separator='\t')

fold4, fold5= datasets.UDPOS.splits(
    fields=fields,
    path="./",
    train="test-kfold-round-4.txt",
    validation="test-kfold-round-5.txt",
    test=None,
    separator='\t')

print(len(fold1))
print(len(fold2))
print(len(fold3))
print(len(fold4))
print(len(fold5))


BATCH_SIZE = 32

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

fold1_iterator, fold2_iterator, fold3_iterator, fold4_iterator, fold5_iterator = data.BucketIterator.splits(
    (fold1, fold2, fold3, fold4, fold5), 
    batch_size = BATCH_SIZE,
    device = device)

338596
338596
338596
338596
338595


In [None]:
%%time
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, fold1_iterator, criterion, TAG_PAD_IDX)
print(f'Fold 1: '+ f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

test_loss, test_acc = evaluate(model, fold2_iterator, criterion, TAG_PAD_IDX)
print(f'Fold 2: '+ f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

test_loss, test_acc = evaluate(model, fold3_iterator, criterion, TAG_PAD_IDX)
print(f'Fold 3: '+ f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

test_loss, test_acc = evaluate(model, fold4_iterator, criterion, TAG_PAD_IDX)
print(f'Fold 4: '+ f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

test_loss, test_acc = evaluate(model, fold5_iterator, criterion, TAG_PAD_IDX)
print(f'Fold 5: '+ f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Fold 1: Test Loss: 0.167 | Test Acc: 94.61%
Fold 2: Test Loss: 0.224 | Test Acc: 92.76%
Fold 3: Test Loss: 0.223 | Test Acc: 92.77%
Fold 4: Test Loss: 0.223 | Test Acc: 92.78%
Fold 5: Test Loss: 0.225 | Test Acc: 92.72%
CPU times: user 21min 36s, sys: 3min 56s, total: 25min 32s
Wall time: 25min 32s
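
The per-fold accuracies can be summarized by their mean and sample standard deviation. The numbers below are copied from the output above; the summary is a convenience added here, not part of the original run.

```python
import statistics

fold_acc = [94.61, 92.76, 92.77, 92.78, 92.72]  # accuracies printed above, in percent

mean_acc = statistics.mean(fold_acc)
std_acc = statistics.stdev(fold_acc)  # sample standard deviation

print(f"{mean_acc:.2f} +/- {std_acc:.2f}")
```

Most of the spread comes from the first fold, which scores noticeably higher than the other four.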

