# What is POS Tagging?

In [1]:
!pip install transformers



Part of Speech (POS) Tagging is a classification task that involves automatically assigning descriptions to tokens. The descriptor, called a tag, represents the part-of-speech of the word it is assigned to.
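
To make the task concrete, here is a minimal hand-written illustration of what a tagger produces. The sentence and its universal dependency tags are an assumed toy example; they are not drawn from the dataset or from model output.

```python
# Toy illustration: each token is paired with exactly one part-of-speech tag.
# The sentence and tags are hand-written for illustration, not model output.
tokens = ["jack", "went", "to", "the", "shop"]
tags = ["PROPN", "VERB", "ADP", "DET", "NOUN"]

# POS tagging is token-level classification: one label per input token.
assert len(tokens) == len(tags)

tagged = list(zip(tokens, tags))
for token, tag in tagged:
    print(f"{token}\t{tag}")
```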

In this tutorial, you will learn how to fine-tune your own POS tagger with BERT. We will use the Universal Dependencies English Web Treebank (UDPOS) dataset. More information on UDPOS is available here: https://pytorch.org/text/stable/_modules/torchtext/datasets/udpos.html

## What you will need 

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.legacy import data
from torchtext.legacy import datasets

import spacy
import numpy as np

import time
import random
import torch.nn.functional as F

from transformers import BertTokenizer, BertModel
import functools

Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.

However, there are some steps you can take to limit the number of sources of nondeterministic behavior for a specific platform, device, and PyTorch release. First, you can control sources of randomness that can cause multiple executions of your application to behave differently. Second, you can configure PyTorch to avoid using nondeterministic algorithms for some operations, so that multiple calls to those operations, given the same inputs, will produce the same result.

We seed Python's random module, NumPy, and PyTorch with the same value; torch.manual_seed() seeds the RNG for all devices (both CPU and CUDA). We also set torch.backends.cudnn.deterministic = True so that cuDNN uses deterministic algorithms:

In [3]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
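
As a small sanity check on what seeding buys you, the sketch below uses only Python's random module (an assumption made to keep it dependency-free). The same idea applies to np.random.seed and torch.manual_seed.

```python
import random

def sample(seed, n=5):
    """Reseed Python's RNG and draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = sample(1234)
second = sample(1234)
other = sample(4321)

# Same seed -> identical sequence; a different seed gives a different
# sequence (with overwhelming probability).
print(first == second)
print(first == other)
```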

Next, we load the pretrained BERT model. The first time we run this, it will have to download the pretrained parameters. There are different BERT models; we use bert-base-cased, which preserves the capitalization of the input. For more information on the different models, check here: https://github.com/google-research/bert

We will also use the tokenizer for the bert-base-cased model. Other tokenizers are also available at https://github.com/google-research/bert

In [4]:
bert = BertModel.from_pretrained('bert-base-cased')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

We need to prepend the [CLS] token to each input sequence so that it matches the format BERT was trained on.

text = ['jack', 'went', 'to', 'the', 'shop']
should become:

text = ['[CLS]', 'jack', 'went', 'to', 'the', 'shop']

We also retrieve the [PAD] and [UNK] tokens from the tokenizer.

In [6]:
init_token = tokenizer.cls_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, pad_token, unk_token)

[CLS] [PAD] [UNK]


We are mainly interested in the actual integer representations of the special tokens. 
This is because we aren't using TorchText's vocabulary module, but using the one provided by the pretrained model.

We get the indexes of the special tokens by passing them through the tokenizer's convert_tokens_to_ids function.

In [7]:
init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

print(init_token_idx, pad_token_idx, unk_token_idx)

101 0 100
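
To show how these three indexes are used, here is a rough sketch that turns tokens into a padded batch of IDs. The tiny toy_vocab is hypothetical (real IDs come from the tokenizer); only 101, 0 and 100 are taken from the output above.

```python
# IDs of the special tokens, as printed above for bert-base-cased.
CLS_ID, PAD_ID, UNK_ID = 101, 0, 100

# Hypothetical toy vocabulary; the real mapping lives inside the tokenizer.
toy_vocab = {"went": 1, "to": 2, "the": 3, "shop": 4}

def encode(tokens):
    """Prepend [CLS] and map out-of-vocabulary tokens to [UNK]."""
    return [CLS_ID] + [toy_vocab.get(tok, UNK_ID) for tok in tokens]

def pad_batch(sequences):
    """Right-pad every sequence with [PAD] up to the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [PAD_ID] * (longest - len(seq)) for seq in sequences]

batch = pad_batch([
    encode(["jack", "went", "to", "the", "shop"]),
    encode(["went", "to"]),
])
print(batch)
```

Here "jack" is missing from the toy vocabulary, so it maps to 100, and the shorter sentence is filled with 0 up to the length of the longer one.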


We check the maximum input length supported by our BERT model.

In [8]:
max_input_length = tokenizer.max_model_input_sizes['bert-base-cased']

print(max_input_length)

512


In [9]:
def cut_and_convert_to_id(tokens, tokenizer, max_input_length):
    # Reserve one position for the [CLS] index that the TEXT field prepends
    tokens = tokens[:max_input_length-1]
    tokens = tokenizer.convert_tokens_to_ids(tokens)
    return tokens

In [10]:
def cut_to_max_length(tokens, max_input_length):
    # Tags are cut to the same length so they stay aligned with the text
    tokens = tokens[:max_input_length-1]
    return tokens

In [11]:
text_preprocessor = functools.partial(cut_and_convert_to_id,
                                      tokenizer = tokenizer,
                                      max_input_length = max_input_length)

tag_preprocessor = functools.partial(cut_to_max_length,
                                     max_input_length = max_input_length)
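
The slice to max_input_length - 1 leaves room for the single [CLS] index that the TEXT field prepends later, so the final sequence still fits the model's limit. The sketch below checks this with a made-up limit of 6 and a stand-in function in place of the real tokenizer; both are assumptions for illustration.

```python
import functools

def fake_convert_tokens_to_ids(tokens):
    """Stand-in for tokenizer.convert_tokens_to_ids (hypothetical IDs)."""
    return [len(tok) for tok in tokens]

def cut_and_convert(tokens, convert, max_input_length):
    # Keep one slot free for the [CLS] index that is prepended later.
    tokens = tokens[:max_input_length - 1]
    return convert(tokens)

# functools.partial pre-binds the arguments, leaving a one-argument callable,
# which is the shape a preprocessing hook expects.
preprocess = functools.partial(cut_and_convert,
                               convert=fake_convert_tokens_to_ids,
                               max_input_length=6)

ids = preprocess(["a", "bb", "ccc", "dddd", "eeeee", "ffffff", "ggggggg"])
with_cls = [101] + ids  # 101 is the [CLS] index printed earlier
print(ids, len(with_cls))
```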

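A TorchText Field defines how your dataset is processed. The TEXT field handles how the text that we need to tag is processed. We set use_vocab = False because the tokens are already converted to integer IDs by the BERT tokenizer, and lower = False because bert-base-cased is case-sensitive.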

Next we will define the Fields for the tags. UDPOS dataset has two different sets of tags namely: the universal dependency (UD) tags and Penn Treebank (PTB) tags. We will train our model with the PTB tags.

TorchText Fields initialize a default unknown token <unk>, which we remove by setting unk_token = None. We do this because we do not want an <unk> entry in our tag set for words without tags. We want every word tagged only with the tags in the tag set.

Because the TEXT field prepends a [CLS] index to every sequence, the tag fields set init_token = '<pad>' so that the tags stay aligned with the text; the loss function later ignores this padding tag.

You can find more information about Field here: https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.Field

In [12]:

TEXT = data.Field(use_vocab = False,
                  lower = False,
                  preprocessing = text_preprocessor,
                  init_token = init_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

UD_TAGS = data.Field(unk_token = None,
                     init_token = '<pad>',
                     preprocessing = tag_preprocessor)


PTB_TAGS = data.Field(unk_token = None,
                     init_token = '<pad>',
                     preprocessing = tag_preprocessor)

We define fields for both UD_TAGS and PTB_TAGS and pass them to the dataset as (name, field) pairs. To load only one of the tag sets, we can tell TorchText to skip the other column by passing None for it, as in:

fields = (("text", TEXT), ("udtags", UD_TAGS), (None, None))

In [13]:
fields = (("text", TEXT), ("udtags", UD_TAGS), ("ptbtags", PTB_TAGS))

We then use these fields to load the train, validation and test splits.

In [14]:
train_data, valid_data, test_data = datasets.UDPOS.splits(fields)


We check the size of our train, valid and test sets. 

In [15]:
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")

Number of training examples: 12543
Number of validation examples: 2002
Number of testing examples: 2077


We can also print individual examples, selecting the text or either set of tags.

In [16]:
print(vars(train_data.examples[0]))

{'text': [2586, 118, 100, 131, 1237, 2088, 1841, 100, 14677, 2393, 118, 100, 117, 1103, 18154, 1120, 1103, 11666, 1107, 1103, 1411, 1104, 100, 117, 1485, 1103, 8697, 3070, 119], 'udtags': ['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT'], 'ptbtags': ['NNP', 'HYPH', 'NNP', ':', 'JJ', 'NNS', 'VBD', 'NNP', 'NNP', 'NNP', 'HYPH', 'NNP', ',', 'DT', 'NN', 'IN', 'DT', 'NN', 'IN', 'DT', 'NN', 'IN', 'NNP', ',', 'IN', 'DT', 'JJ', 'NN', '.']}


In [17]:
print(vars(train_data.examples[0])['text'])

[2586, 118, 100, 131, 1237, 2088, 1841, 100, 14677, 2393, 118, 100, 117, 1103, 18154, 1120, 1103, 11666, 1107, 1103, 1411, 1104, 100, 117, 1485, 1103, 8697, 3070, 119]


In [18]:
print(vars(train_data.examples[0])['udtags'])

['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT']


In [19]:
print(vars(train_data.examples[0])['ptbtags'])

['NNP', 'HYPH', 'NNP', ':', 'JJ', 'NNS', 'VBD', 'NNP', 'NNP', 'NNP', 'HYPH', 'NNP', ',', 'DT', 'NN', 'IN', 'DT', 'NN', 'IN', 'DT', 'NN', 'IN', 'NNP', ',', 'IN', 'DT', 'JJ', 'NN', '.']


Our next step is to build the tag vocabularies so that the tags can be numericalized during training. We do this by calling each field's .build_vocab method on train_data.

In [20]:
UD_TAGS.build_vocab(train_data)
PTB_TAGS.build_vocab(train_data)
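
Conceptually, building a tag vocabulary amounts to counting tags and assigning each one an integer. The following is a plain-Python approximation, not TorchText's actual implementation; the ordering rule (padding tag first, then tags by descending frequency) is an assumption of this sketch.

```python
from collections import Counter

# Two tag sequences in the style of the PTB tags shown above.
tag_sequences = [
    ["NNP", "HYPH", "NNP", ":", "JJ", "NNS", "VBD"],
    ["DT", "NN", "IN", "DT", "NN", "."],
]

counts = Counter(tag for seq in tag_sequences for tag in seq)

# Index 0 is reserved for the padding tag; the remaining tags are sorted by
# descending frequency, with alphabetical order breaking ties.
itos = ["<pad>"] + sorted(counts, key=lambda tag: (-counts[tag], tag))
stoi = {tag: idx for idx, tag in enumerate(itos)}

def numericalize(tags):
    """Map a list of tag strings to their integer indexes."""
    return [stoi[tag] for tag in tags]

print(numericalize(["DT", "NN", "."]))
```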

Finally, we create the iterators, which yield the data in batches of the size we define. We use a bucket iterator. Details are here: https://torchtext.readthedocs.io/en/latest/data.html#iterator

The bucket iterator https://torchtext.readthedocs.io/en/latest/data.html#bucketiterator defines an iterator that batches examples of similar lengths together which minimizes the amount of padding needed while producing freshly shuffled batches for each new epoch.

In [21]:
BATCH_SIZE = 32

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)
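
The benefit of bucketing can be seen by counting padding positions. This is a simplified sketch: it sorts examples by length and slices them into batches, whereas BucketIterator also shuffles. The sentence lengths are made up.

```python
def padding_needed(lengths, batch_size):
    """Count pad positions when each batch is padded to its longest item."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - length for length in batch)
    return total

# Hypothetical sentence lengths, in arrival order.
lengths = [29, 4, 18, 5, 30, 6, 17, 3]

unsorted_padding = padding_needed(lengths, batch_size=4)
bucketed_padding = padding_needed(sorted(lengths), batch_size=4)

print(unsorted_padding, bucketed_padding)
```

With these lengths, grouping similar lengths together reduces the padding from 124 positions to 32.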

We use BERT as an embedding layer. All we do is add a linear layer on top of these embeddings to predict the tag for each token in the input sequence.

The embeddings are provided by the pretrained BERT model. All inputs are passed to BERT at the same time. BERT embeddings are contextualized: the embedding of each token is not computed in isolation but depends on the other tokens in the sequence.

We do not define an embedding_dim for our model: it is the size of the output of the pretrained BERT model, and we cannot change it. Thus, we simply get the embedding_dim from the model's hidden_size attribute.

BERT also wants sequences with the batch element first, hence we permute our input sequence before passing it to BERT.

In [22]:
class BERTPoSTagger(nn.Module):
    def __init__(self,
                 bert,
                 output_dim, 
                 dropout):
        
        super().__init__()
        
        self.bert = bert
        
        embedding_dim = bert.config.to_dict()['hidden_size']
        
        self.fc = nn.Linear(embedding_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
  
        #text = [sent len, batch size]
    
        text = text.permute(1, 0)
        
        #text = [batch size, sent len]
        
        embedded = self.dropout(self.bert(text)[0])
        
        #embedded = [batch size, seq len, emb dim]
                
        embedded = embedded.permute(1, 0, 2)
                    
        #embedded = [sent len, batch size, emb dim]
        
        # Note: dropout was already applied to `embedded` above,
        # so it is applied a second time here.
        predictions = self.fc(self.dropout(embedded))
        
        #predictions = [sent len, batch size, output dim]
        
        return predictions
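
The two permute calls in forward only reorder axes. As a dependency-free illustration (nested lists stand in for tensors, which is an assumption of this sketch), swapping the first two axes of a [sent len, batch size] layout gives [batch size, sent len]:

```python
def swap_first_two_axes(matrix):
    """Transpose a rectangular nested list, like tensor.permute(1, 0)."""
    return [list(column) for column in zip(*matrix)]

# Layout [sent len = 3, batch size = 2]: each row is one time step.
# The IDs are illustrative; 101 is [CLS] and 0 is [PAD].
sent_len_first = [
    [101, 101],
    [2586, 1103],
    [119, 0],
]

batch_first = swap_first_two_axes(sent_len_first)
print(batch_first)
```

Applying the same swap again restores the original layout, which mirrors how the model converts back to [sent len, batch size, emb dim] before the linear layer.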

We can now instantiate our model: a simple linear layer on top of BERT, which supplies the word embeddings.

Best of all, the only hyperparameter is dropout. The value 0.25 was chosen as a sensible default, so a better value may exist for this dataset.

In [23]:
OUTPUT_DIM = len(PTB_TAGS.vocab)
DROPOUT = 0.25

model = BERTPoSTagger(bert,
                      OUTPUT_DIM, 
                      DROPOUT)

Next, we define our optimizer. When fine-tuning, you usually want a lower learning rate than normal, because large updates to the pretrained parameters may cause the model to forget what it has learned. This phenomenon is called catastrophic forgetting.

We pick 5e-5 (0.00005) as it is one of the three values recommended in the BERT paper. Again, there may be better values for this dataset.

In [24]:
LEARNING_RATE = 5e-5

optimizer = optim.Adam(model.parameters(), lr = LEARNING_RATE)

The rest of the notebook follows a standard training and evaluation setup.

We define a loss function, making sure to ignore losses whenever the target tag is a padding token.

In [25]:
TAG_PAD_IDX = PTB_TAGS.vocab.stoi[PTB_TAGS.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)
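
To see what ignore_index does, the following sketch computes a mean cross-entropy by hand over only the non-padding targets. It is a plain-Python approximation of the behavior (mean reduction is assumed), with made-up logits.

```python
import math

def cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over targets != ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing to the loss
        log_sum_exp = math.log(sum(math.exp(value) for value in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)

PAD = 0
logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]
targets = [1, 2, PAD]  # the last position is padding

with_padding_ignored = cross_entropy(logits, targets, ignore_index=PAD)
first_two_only = cross_entropy(logits[:2], targets[:2], ignore_index=PAD)
print(round(with_padding_ignored, 4))
```

The two values are identical: the padded position changes neither the sum of the losses nor the number of positions it is averaged over.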

In [26]:
model = model.to(device)
criterion = criterion.to(device)

In [27]:
def categorical_accuracy(preds, y, tag_pad_idx):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    max_preds = preds.argmax(dim = 1, keepdim = True) # get the index of the max probability
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])
    return correct.sum() / torch.FloatTensor([y[non_pad_elements].shape[0]]).to(device)
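
The same masking idea applies to accuracy. Below is a minimal list-based sketch (not the tensor implementation above) with made-up predictions; padding positions are dropped before counting.

```python
def masked_accuracy(predicted, gold, pad_idx):
    """Fraction of correct predictions among non-padding positions."""
    pairs = [(p, g) for p, g in zip(predicted, gold) if g != pad_idx]
    correct = sum(1 for p, g in pairs if p == g)
    return correct / len(pairs)

PAD = 0
gold      = [3, 1, 2, 2, PAD, PAD]
predicted = [3, 1, 2, 1, 3, 0]

accuracy = masked_accuracy(predicted, gold, PAD)
print(accuracy)  # 3 of the 4 non-padding tags are correct
```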

In [28]:
def train(model, iterator, optimizer, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        text = batch.text
        tags = batch.ptbtags
                
        optimizer.zero_grad()
        
        #text = [sent len, batch size]
        
        predictions = model(text)
        
        #predictions = [sent len, batch size, output dim]
        #tags = [sent len, batch size]
        
        predictions = predictions.view(-1, predictions.shape[-1])
        tags = tags.view(-1)
        
        #predictions = [sent len * batch size, output dim]
        #tags = [sent len * batch size]
        
        loss = criterion(predictions, tags)
                
        acc = categorical_accuracy(predictions, tags, tag_pad_idx)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [29]:
def evaluate(model, iterator, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text = batch.text
            tags = batch.ptbtags
            
            predictions = model(text)
            
            predictions = predictions.view(-1, predictions.shape[-1])
            tags = tags.view(-1)
            
            loss = criterion(predictions, tags)
            
            acc = categorical_accuracy(predictions, tags, tag_pad_idx)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [30]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion, TAG_PAD_IDX)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion, TAG_PAD_IDX)
    
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

	Train Loss: 0.571 | Train Acc: 85.77%
	 Val. Loss: 0.436 |  Val. Acc: 88.62%
	Train Loss: 0.170 | Train Acc: 95.39%
	 Val. Loss: 0.380 |  Val. Acc: 89.39%
	Train Loss: 0.118 | Train Acc: 96.71%
	 Val. Loss: 0.393 |  Val. Acc: 88.92%
	Train Loss: 0.085 | Train Acc: 97.60%
	 Val. Loss: 0.408 |  Val. Acc: 88.48%
	Train Loss: 0.062 | Train Acc: 98.21%
	 Val. Loss: 0.409 |  Val. Acc: 88.64%
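
Because the checkpoint is only overwritten when the validation loss improves, the saved model is not necessarily the one from the last epoch. Replaying the validation losses printed above through the same rule shows which epoch was kept:

```python
# Validation losses copied from the training log above, epochs 1 to 5.
valid_losses = [0.436, 0.380, 0.393, 0.408, 0.409]

best_valid_loss = float("inf")
best_epoch = None

for epoch, valid_loss in enumerate(valid_losses, start=1):
    # Same rule as the training loop: keep the model only on improvement.
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_epoch = epoch

print(best_epoch, best_valid_loss)
```

In this run the saved checkpoint comes from epoch 2. Training loss keeps falling afterwards while validation loss rises, which may indicate overfitting.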


In [31]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion, TAG_PAD_IDX)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.408 | Test Acc: 88.59%


References: https://github.com/flsantos/pos_tagging_bert_fine_tunning/blob/main/Fine_tuning_Pretrained_Transformer_BERT_for_PoS_Tagging_(Portuguese).ipynb