# PoS Tagging with BiLSTM-CRF model

In this section, we walk through a full, fairly involved example of a Bi-LSTM with a Conditional Random Field (CRF) for PoS tagging. A plain LSTM tagger is typically sufficient for part-of-speech tagging, but a sequence model like the CRF becomes essential for strong performance on harder sequence labeling tasks such as NER. Although the name sounds scary, the model is in essence a CRF in which an LSTM provides the features. It is an advanced model, though, and far more complicated than a pure LSTM model. The key challenges of the BiLSTM-CRF model, compared with the LSTM, are:
* Write the recurrence for the **viterbi** variable at step $i$ for tag $k$.
* Modify the recurrence to compute the **forward** variables instead.
* Modify again the above recurrence to compute the forward variables in **log-space** (*hint: log-sum-exp*)
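
As a preview, the three recurrences can be sketched as follows. The notation is an assumption made for this sketch: $\textbf{P}_{i,k}$ is the emission score of tag $k$ at step $i$, $\textbf{A}_{k,j}$ is the score of transitioning to tag $k$ from tag $j$, $v_i(k)$ is the Viterbi variable, and $\alpha_i(k)$ is the forward variable.

\begin{align}
v_i(k) &= \textbf{P}_{i,k} + \max_{j} \left( v_{i-1}(j) + \textbf{A}_{k,j} \right) \\
\alpha_i(k) &= \exp(\textbf{P}_{i,k}) \sum_{j} \alpha_{i-1}(j) \exp(\textbf{A}_{k,j}) \\
\log \alpha_i(k) &= \textbf{P}_{i,k} + \log \sum_{j} \exp\left( \log \alpha_{i-1}(j) + \textbf{A}_{k,j} \right)
\end{align}

The last line is the log-sum-exp form: replacing $\max$ in the Viterbi recurrence with $\log \sum \exp$ yields the forward recurrence in log-space.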





## Theoretical Background


### Intro to BiLSTM-CRF

We'll explain the Bi-LSTM+CRF model with an NER task. Assume that we have five labels: **B-Person, I-Person, B-Organization, I-Organization, O**. In addition, given a sentence $x$ consisting of five words (i.e. $w_0, w_1, w_2, w_3, w_4$), the inputs, outputs and structure of the *BiLSTM* are shown as follows:

![](../figs/CRF-LAYER-4.jpg)

* First, every word in sentence $x$ is represented by a vector: a word embedding, possibly combined with a character embedding. The word embedding usually comes from a pre-trained word embedding file, and the character embedding is initialized randomly. All the embeddings are fine-tuned during the training process.
* Second, the inputs of the BiLSTM model are those embeddings and the outputs are predicted labels for the words in sentence $x$.

Without a **CRF** layer, the **BiLSTM** by itself can handle a sequence labeling task by selecting the label with the highest score for each word.


However, picking the maximum-scored label for each individual word does not guarantee that the whole predicted sequence is valid. Sequences such as `I-Organization I-Person` and `B-Organization I-Person`, for example, are invalid. Adding a **CRF** on top of the **Bi-LSTM** adds constraints to the final predicted labels that steer them towards valid sequences. The CRF layer learns these constraints automatically from the training dataset during training. For instance:
* The label of the first word in a sentence should start with "B-" or "O", not "I-".
* In the pattern "B-label1 I-label2 I-label3 I-…", label1, label2, label3, … should be the same named entity label.

![](../figs/CRF-LAYER-2-v2.png)
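
To make the two constraints concrete, here is a minimal sketch of a checker for BIO tag sequences. The function name `is_valid_bio` is hypothetical and is not part of the model: the CRF layer does not hard-code such rules but learns transition scores that make invalid sequences unlikely.

```python
def is_valid_bio(tags):
    """Return True if a BIO tag sequence obeys the two constraints above."""
    prev = "O"  # treat the start of the sentence like an outside tag
    for tag in tags:
        if tag.startswith("I-"):
            # I-X may only follow B-X or I-X with the same entity label X
            if prev == "O" or prev[2:] != tag[2:]:
                return False
        prev = tag
    return True


print(is_valid_bio(["B-Person", "I-Person", "O", "B-Organization", "O"]))  # True
print(is_valid_bio(["B-Organization", "I-Person", "O"]))                   # False
print(is_valid_bio(["I-Organization", "O"]))                               # False
```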


### Sequence score and loss function 

In the Bi-LSTM CRF, we define two kinds of potentials: **emission** and
**transition**. The emission potential for the word at index $i$ comes
from the hidden state of the Bi-LSTM at timestep $i$. We let $\textbf{P}_{y_i}$ denote the unnormalized score of emitting $y_i$ at index $i$.
The transition scores are stored in a $|T| \times |T|$ matrix
$\textbf{A}$, where $T$ is the tag set. We let $\textbf{A}_{j,k}$ be the score of transitioning
to tag $j$ from tag $k$. So:

\begin{align}\text{Score}(x,y) = \sum_i \log \psi_\text{EMIT}(y_i) + \log \psi_\text{TRANS}(y_{i-1} \rightarrow y_i)\end{align}

\begin{align}= \sum_{i=0} \textbf{P}_{y_i} + \textbf{A}_{y_i, y_{i-1}}\end{align}

A CRF computes the conditional probability of a given sequence. Let
$y$ be a tag sequence and $x$ an input sequence of words. Applying a softmax over the scores of all possible tag sequences, we compute

\begin{align}P(y|x) = \frac{\exp{(\text{Score}(x, y)})}{\sum_{y'} \exp{(\text{Score}(x, y')})}\end{align}

Following maximum likelihood estimation, we obtain the loss function by taking the negative logarithm of $P(y|x)$:
\begin{align}
\text{loss}(\textbf{P}, \textbf{A}) = -\log P(y|x) = \log\Big(\sum_{y'} \exp{(\text{Score}(x, y'))}\Big) - \text{Score}(x,y)
\end{align}
Once we have the above **loss** function, we can learn the model parameters through gradient descent.
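
The loss can be checked on a toy problem. The sketch below uses made-up scores for 3 positions and 2 tags and, as a simplifying assumption, omits start and stop transitions. It computes the log partition term twice: by enumerating every tag sequence, and with the forward recursion in log space. All names (`seq_score`, `log_partition_brute_force`, `log_partition_forward`) are illustrative only.

```python
import itertools
import math

# Toy problem (made-up numbers): 3 positions, 2 tags.
P = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]  # P[i][k]: emission score of tag k at position i
A = [[0.6, -0.4], [0.1, 0.8]]             # A[j][k]: score of moving to tag j from tag k


def seq_score(y):
    """Score(x, y): emissions plus transitions between consecutive tags."""
    s = sum(P[i][y[i]] for i in range(len(y)))
    s += sum(A[y[i]][y[i - 1]] for i in range(1, len(y)))
    return s


def log_partition_brute_force():
    """log of the sum of exp(Score) over every possible tag sequence."""
    scores = [seq_score(y) for y in itertools.product(range(2), repeat=len(P))]
    return math.log(sum(math.exp(s) for s in scores))


def log_partition_forward():
    """The same quantity via the forward recursion in log space."""
    alpha = list(P[0])  # log forward variables at position 0
    for i in range(1, len(P)):
        alpha = [
            P[i][k] + math.log(sum(math.exp(alpha[j] + A[k][j]) for j in range(2)))
            for k in range(2)
        ]
    return math.log(sum(math.exp(a) for a in alpha))


gold = (0, 1, 1)  # the tag sequence treated as the label
loss = log_partition_forward() - seq_score(gold)
print(abs(log_partition_brute_force() - log_partition_forward()) < 1e-9)
print(loss > 0)
```

Both checks print `True`: the recursion matches the enumeration, and the loss is positive because $P(y|x) < 1$. The enumeration grows exponentially with sentence length, which is why the forward recursion is used in practice.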

## Data Preparation

In [0]:
import torch
import torch.nn as nn
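# Note: `data.Field`, `data.BucketIterator` and `datasets.UDPOS.splits` belong to the
# legacy torchtext API (moved to `torchtext.legacy` in 0.9 and removed in 0.12).
from torchtext import data
from torchtext import datasets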

SEED = 1234
#random.seed(SEED)
#np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(lower = True, batch_first=True )#, init_token='sos', eos_token='eos')
UD_TAGS = data.Field(unk_token = None, init_token='<sos>', eos_token='<eos>', batch_first=True )
PTB_TAGS = data.Field(unk_token = None, init_token='<sos>', eos_token='<eos>', batch_first=True )

fields = (("text", TEXT), ("udtags", UD_TAGS), ("ptbtags", PTB_TAGS))
train_data, valid_data, test_data = datasets.UDPOS.splits(fields)

In [2]:
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")

Number of training examples: 12543
Number of validation examples: 2002
Number of testing examples: 2077


In [3]:
print(vars(train_data.examples[0]))

{'text': ['al', '-', 'zaman', ':', 'american', 'forces', 'killed', 'shaikh', 'abdullah', 'al', '-', 'ani', ',', 'the', 'preacher', 'at', 'the', 'mosque', 'in', 'the', 'town', 'of', 'qaim', ',', 'near', 'the', 'syrian', 'border', '.'], 'udtags': ['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT'], 'ptbtags': ['NNP', 'HYPH', 'NNP', ':', 'JJ', 'NNS', 'VBD', 'NNP', 'NNP', 'NNP', 'HYPH', 'NNP', ',', 'DT', 'NN', 'IN', 'DT', 'NN', 'IN', 'DT', 'NN', 'IN', 'NNP', ',', 'IN', 'DT', 'JJ', 'NN', '.']}


In [0]:
MIN_FREQ = 2

TEXT.build_vocab(train_data, 
                 min_freq = MIN_FREQ,
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)


UD_TAGS.build_vocab(train_data)
PTB_TAGS.build_vocab(train_data)

In [5]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in UD_TAG vocabulary: {len(UD_TAGS.vocab)}")
print(f"Unique tokens in PTB_TAG vocabulary: {len(PTB_TAGS.vocab)}")

Unique tokens in TEXT vocabulary: 8866
Unique tokens in UD_TAG vocabulary: 20
Unique tokens in PTB_TAG vocabulary: 53


In [6]:
TEXT.vocab.itos[:10]

['<unk>', '<pad>', 'the', '.', ',', 'to', 'and', 'a', 'of', 'i']

In [7]:
print(UD_TAGS.vocab.itos)

['<pad>', '<sos>', '<eos>', 'NOUN', 'PUNCT', 'VERB', 'PRON', 'ADP', 'DET', 'PROPN', 'ADJ', 'AUX', 'ADV', 'CCONJ', 'PART', 'NUM', 'SCONJ', 'X', 'INTJ', 'SYM']


In [0]:
BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    sort_key=lambda x: len(x.text), # field sorted by len
    sort_within_batch=True,
    batch_sizes = (BATCH_SIZE, BATCH_SIZE, BATCH_SIZE),
    device = device
)

# train_iterator = data.BucketIterator(
#     train_data,
#     sort_key=lambda x: len(x.text), # field sorted by len
#     sort_within_batch=True,
#     batch_size = BATCH_SIZE,
#     device = device
# )

## Build the Model

The model consists of two components: CRF and RNN.

### CRF model

First, we define a helper function that computes the log-sum-exp of an input vector in a numerically stable way; the forward algorithm relies on it.

In [0]:
def log_sum_exp(x):
    # subtract the max along the last dimension before exponentiating to avoid overflow
    m = torch.max(x, -1)[0]
    return m + torch.log(torch.sum(torch.exp(x - m.unsqueeze(-1)), -1))
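
To see why the max is subtracted first, here is a small pure-Python sketch of the same trick on plain lists. The names `naive_log_sum_exp` and `stable_log_sum_exp` are illustrative and are not part of the model code.

```python
import math


def naive_log_sum_exp(xs):
    # direct formula: overflows as soon as one exp() is too large
    return math.log(sum(math.exp(x) for x in xs))


def stable_log_sum_exp(xs):
    # shift by the maximum so that the largest exponent is exp(0) = 1
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


small = [0.5, 1.5, -2.0]
print(abs(naive_log_sum_exp(small) - stable_log_sum_exp(small)) < 1e-12)

large = [1000.0, 1000.0]
try:
    naive_log_sum_exp(large)
except OverflowError:
    print("naive version overflows")
print(stable_log_sum_exp(large))  # equals 1000 + log(2)
```

For moderate inputs the two versions agree; for large scores only the shifted version returns a finite result.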

Then, we define a `CRF` model, with important class functions and variables as follows:

* `self.trans` is the transition score matrix, with `self.trans[i][j]` representing the score of transitioning from tag *j* to tag *i*,
<!-- * `self.neg_log_likelihood` corresponds to the loss function defined above;  -->
* `self.score` computes $\text{Score}(x, y)$,
* `self.forward` computes the log partition function $\log(\sum_{y'} \exp{(\text{Score}(x, y'))})$,
<!-- * `self._get_lstm_features` computes the emission scores (i.e. $\textbf{P}$) from BiLSTM, as the inputs to loss function  -->
* `self.decode` uses the **Viterbi** algorithm to predict/decode the tags of a new sentence, given that the model is already trained, i.e. $\textbf{A}$ and $\textbf{P}$ are known.

In [0]:
class CRF(nn.Module):
    def __init__(self, num_tags):
        super().__init__()
        self.batch_size = 0
        self.num_tags = num_tags

        # matrix of transition scores from j to i
        self.trans = nn.Parameter(torch.randn(num_tags, num_tags))
        self.trans.data[SOS_IDX, :] = -10000 # no transition to SOS
        self.trans.data[:, EOS_IDX] = -10000 # no transition from EOS except to PAD
        self.trans.data[:, PAD_IDX] = -10000 # no transition from PAD except to PAD
        self.trans.data[PAD_IDX, :] = -10000 # no transition to PAD except from EOS
        self.trans.data[PAD_IDX, EOS_IDX] = 0
        self.trans.data[PAD_IDX, PAD_IDX] = 0

    def forward(self, h, mask):
        """
        Log partition function computed with the forward algorithm
        h = [batch_size, seq_len, num_tags]
        mask = [batch_size, seq_len]
        """
        # initialize forward variables in log space, on the same device as h
        score = torch.full((self.batch_size, self.num_tags), -10000., device=h.device) # [batch_size, num_tags]
        score[:, SOS_IDX] = 0.
        trans = self.trans.unsqueeze(0) # [1, num_tags, num_tags]
        for t in range(h.size(1)): # recursion through the sequence
            mask_t = mask[:, t].unsqueeze(1) # [batch_size] -> [batch_size, 1]
            emit_t = h[:, t].unsqueeze(2) # [batch_size, num_tags] -> [batch_size, num_tags, 1]
            score_t = score.unsqueeze(1) + emit_t + trans # [batch_size, 1, num_tags] -> [batch_size, num_tags, num_tags]
            score_t = log_sum_exp(score_t) # [batch_size, num_tags, num_tags] -> [batch_size, num_tags]
            score = score_t * mask_t + score * (1 - mask_t)
        score = log_sum_exp(score + self.trans[EOS_IDX]) ## start from <SOS> and stops at <EOS>
        return score # partition function

    def score(self, h, y0, mask): 
        """
        Calculate the score of a given sequence
        y0 = [batch_size, 1+seq_len] ### start with <SOS>
        h = [batch_size, seq_len, num_tags]
        mask = [batch_size, seq_len]
        """
        score = torch.zeros(self.batch_size, device=h.device) # [batch_size]
        h = h.unsqueeze(3)  # [batch_size, seq_len, num_tags, 1]
        trans = self.trans.unsqueeze(2) # [num_tags, num_tags, 1]
        for t in range(h.size(1)): # recursion through the sequence
            mask_t = mask[:, t] # [batch_size]
            # emission score of the gold tag at step t, for each sentence in the batch
            emit_t = torch.cat([h_b[t, y_b[t + 1]] for h_b, y_b in zip(h, y0)]) # [batch_size]
            # transition score from the previous gold tag to the current one
            trans_t = torch.cat([trans[y_b[t + 1], y_b[t]] for y_b in y0]) # [batch_size]
            score += (emit_t + trans_t) * mask_t
        last_tag = y0.gather(1, mask.sum(1).long().unsqueeze(1)).squeeze(1)
        score += self.trans[EOS_IDX, last_tag]
        return score

    def decode(self, h, mask):
        """
        Viterbi decoding
        h = [batch_size, seq_len, num_tags]
        mask = [batch_size, seq_len]
        Returns a list of tag-index lists, one per sentence (padding excluded)
        """
        # initialize backpointers and viterbi variables in log space
        bptr = torch.empty(0, dtype=torch.long, device=h.device)
        score = torch.full((self.batch_size, self.num_tags), -10000., device=h.device) # [batch_size, num_tags]
        score[:, SOS_IDX] = 0.

        for t in range(h.size(1)): # recursion through the sequence
            mask_t = mask[:, t].unsqueeze(1)
            score_t = score.unsqueeze(1) + self.trans # [B, 1, C] -> [B, C, C]
            score_t, bptr_t = score_t.max(2) # best previous scores and tags
            score_t += h[:, t] # plus emission scores
            bptr = torch.cat((bptr, bptr_t.unsqueeze(1)), 1)
            score = score_t * mask_t + score * (1 - mask_t)
        score += self.trans[EOS_IDX]
        best_score, best_tag = torch.max(score, 1)

        # back-tracking
        bptr = bptr.tolist()
        best_path = [[i] for i in best_tag.tolist()]
        for b in range(self.batch_size):
            i = best_tag[b] # best tag
            j = int(mask[b].sum().item())
            for bptr_t in reversed(bptr[b][:j]):
                i = bptr_t[i]
                best_path[b].append(i)
            best_path[b].pop()
            best_path[b].reverse()

        return best_path
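
To make the back-pointer logic of `decode` concrete, here is a minimal pure-Python sketch of Viterbi decoding for a single sentence. It is an illustration under simplifying assumptions (no batching, no masks, no `<sos>`/`<eos>` transitions), not the class's implementation; the function name `viterbi` and the toy scores are made up.

```python
def viterbi(P, A):
    """Return (best_score, best_path) for emissions P[i][k] and transitions A[to][from]."""
    n_tags = len(P[0])
    score = list(P[0])  # best score of any path ending in each tag at position 0
    backpointers = []
    for i in range(1, len(P)):
        new_score, bptr = [], []
        for k in range(n_tags):
            # best previous tag j for the current tag k
            j_best = max(range(n_tags), key=lambda j: score[j] + A[k][j])
            bptr.append(j_best)
            new_score.append(score[j_best] + A[k][j_best] + P[i][k])
        score = new_score
        backpointers.append(bptr)

    # back-tracking from the best final tag
    tag = max(range(n_tags), key=lambda k: score[k])
    best_score = score[tag]
    path = [tag]
    for bptr in reversed(backpointers):
        tag = bptr[tag]
        path.append(tag)
    path.reverse()
    return best_score, path


P = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]  # emission scores, 3 positions x 2 tags
A = [[0.6, -0.4], [0.1, 0.8]]             # A[j][k]: score of moving to tag j from tag k
best_score, best_path = viterbi(P, A)
print(best_path)  # [0, 1, 1]
```

The batched `decode` above follows the same two phases, a forward pass that stores back-pointers and a backward pass that follows them, and additionally uses the mask to skip padded positions.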

### BiLSTM model

This part is relatively easy and should be almost the same as the previous code example. The two major differences are:
1. we set `batch_first=True` in the RNN constructor, so that the output of the RNN connects easily to the CRF component.
2. we use `rnn.pack_padded_sequence` and `rnn.pad_packed_sequence` to handle padded sequences.

In [0]:
class BiLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_tags,
                 n_layers, bidirectional, dropout):
        super().__init__()
        self.batch_size = 0
        # if padding_idx is specified, the embedding vector at padding_idx is
        # initialized to zeros and receives no gradient updates,
        # so the embedding of padding_idx is not updated during training
        self.embedding= nn.Embedding(vocab_size, embedding_dim, padding_idx=WORD_PAD_IDX)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional,
                           batch_first = True) ### put batch_size at 0 dimension
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, num_tags)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text, mask):
        # text = [batch_size, seq_len]
        # mask = [batch_size, seq_len]

        embedded = self.dropout(self.embedding(text)) ## do we really need dropout here???
        #embedded = [batch_size, seq_len, emb_dim]

        # recent PyTorch versions require the lengths to be a CPU int64 tensor
        x = nn.utils.rnn.pack_padded_sequence(embedded, mask.sum(1).long().cpu(), batch_first = True)
        h, _ = self.rnn(x)
        h, _ = nn.utils.rnn.pad_packed_sequence(h, batch_first = True)
        h = self.fc(self.dropout(h))
        predictions = h * mask.unsqueeze(2)
        #predictions = [batch_size, seq_len, num_tags]

        return predictions

### Assemble RNN + CRF into the BiLSTM-CRF model

In [0]:
class BiLSTM_CRF(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_tags,
                 n_layers, bidirectional, dropout):
        super().__init__()
        self.bilstm = BiLSTM(vocab_size, embedding_dim, hidden_dim, num_tags,
                 n_layers, bidirectional, dropout)
        self.crf = CRF(num_tags)
        #self = self.cuda() if CUDA else self

    def forward(self, xw, y0): # for training
        #self.zero_grad()
        self.bilstm.batch_size = y0.size(0)
        self.crf.batch_size = y0.size(0)
        mask = y0[:, 1:].ne(PAD_IDX).float() ### start from y0[1] to skip <SOS>
        #mask = y0.ne(PAD_IDX).float()
        h = self.bilstm(xw, mask)
        Z = self.crf.forward(h, mask)
        score = self.crf.score(h, y0, mask)
        return torch.mean(Z - score) # average NLL loss of a mini-batch

    def decode(self, xw): # for inference
        self.bilstm.batch_size = xw.size(0)
        self.crf.batch_size = xw.size(0)
        mask = xw.ne(WORD_PAD_IDX).float()
        h = self.bilstm(xw, mask)
        return self.crf.decode(h, mask)

### Create a model instance

Here the tag vocabulary includes the start tag `<sos>` and the end tag `<eos>` (plus `<pad>`), so the transition matrix has a row and a column for each of them.

In [0]:
PAD_IDX = UD_TAGS.vocab.stoi[UD_TAGS.pad_token]
SOS_IDX = UD_TAGS.vocab.stoi[UD_TAGS.init_token]
EOS_IDX = UD_TAGS.vocab.stoi[UD_TAGS.eos_token]
WORD_PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
NUM_TAGS = len(UD_TAGS.vocab) ### including <SOS> and <EOS>
#OUTPUT_DIM = len(UD_TAGS.vocab)
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25


In [0]:
model = BiLSTM_CRF(INPUT_DIM, 
                     EMBEDDING_DIM, 
                     HIDDEN_DIM, 
                     NUM_TAGS, 
                     N_LAYERS, 
                     BIDIRECTIONAL, 
                     DROPOUT)


### Model initialization and parameters check

In [15]:
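def init_weights(m):
    # Note: this re-initializes every parameter, including `crf.trans`, so the
    # -10000 transition constraints set in `CRF.__init__` are overwritten here.
    for name, param in m.named_parameters():
        nn.init.normal_(param.data, mean=0, std=0.1)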
        
model.apply(init_weights)

BiLSTM_CRF(
  (bilstm): BiLSTM(
    (embedding): Embedding(8866, 100, padding_idx=1)
    (rnn): LSTM(100, 128, num_layers=2, batch_first=True, bidirectional=True)
    (fc): Linear(in_features=256, out_features=20, bias=True)
    (dropout): Dropout(p=0.25, inplace=False)
  )
  (crf): CRF()
)

In [16]:
for name, param in model.named_parameters():
    #if param.requires_grad:
    print (name, param.requires_grad, param.data.shape)

bilstm.embedding.weight True torch.Size([8866, 100])
bilstm.rnn.weight_ih_l0 True torch.Size([512, 100])
bilstm.rnn.weight_hh_l0 True torch.Size([512, 128])
bilstm.rnn.bias_ih_l0 True torch.Size([512])
bilstm.rnn.bias_hh_l0 True torch.Size([512])
bilstm.rnn.weight_ih_l0_reverse True torch.Size([512, 100])
bilstm.rnn.weight_hh_l0_reverse True torch.Size([512, 128])
bilstm.rnn.bias_ih_l0_reverse True torch.Size([512])
bilstm.rnn.bias_hh_l0_reverse True torch.Size([512])
bilstm.rnn.weight_ih_l1 True torch.Size([512, 256])
bilstm.rnn.weight_hh_l1 True torch.Size([512, 128])
bilstm.rnn.bias_ih_l1 True torch.Size([512])
bilstm.rnn.bias_hh_l1 True torch.Size([512])
bilstm.rnn.weight_ih_l1_reverse True torch.Size([512, 256])
bilstm.rnn.weight_hh_l1_reverse True torch.Size([512, 128])
bilstm.rnn.bias_ih_l1_reverse True torch.Size([512])
bilstm.rnn.bias_hh_l1_reverse True torch.Size([512])
bilstm.fc.weight True torch.Size([20, 256])
bilstm.fc.bias True torch.Size([20])
crf.trans True torch.Size([20, 20])

In [17]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 1,522,924 trainable parameters


In [18]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([8866, 100])


In [19]:
model.bilstm.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.9261,  2.3049,  0.5502,  ..., -0.3492, -0.5298, -0.1577],
        [-0.5972,  0.0471, -0.2406,  ..., -0.9446, -0.1126, -0.2260],
        [-0.4809,  2.5629,  0.9530,  ...,  0.5278, -0.4588,  0.7294]])

In [20]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.bilstm.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
# use the word-vocabulary pad index here (PAD_IDX is the tag-vocabulary pad index)
model.bilstm.embedding.weight.data[WORD_PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print("<unk> index is", UNK_IDX, ", and <pad> index is", WORD_PAD_IDX)
print(model.bilstm.embedding.weight.data)

<unk> index is 0 , and <pad> index is 1
tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.9261,  2.3049,  0.5502,  ..., -0.3492, -0.5298, -0.1577],
        [-0.5972,  0.0471, -0.2406,  ..., -0.9446, -0.1126, -0.2260],
        [-0.4809,  2.5629,  0.9530,  ...,  0.5278, -0.4588,  0.7294]])


## Model training and evaluation

In [0]:
import torch.optim as optim
optimizer = optim.Adam(model.parameters())

In [0]:
model = model.to(device)
#criterion = criterion.to(device)

### Training process

In [0]:
def categorical_accuracy(preds, y, tag_pad_idx):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = preds[non_pad_elements].eq(y[non_pad_elements])
    acc = correct.sum() / torch.tensor(len(y[non_pad_elements]), dtype=torch.float)
    return acc
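
The idea behind `categorical_accuracy` can be illustrated without tensors. The sketch below uses a hypothetical helper, `masked_accuracy`, and a made-up padding index; positions whose label is padding are ignored, whatever the prediction there is.

```python
def masked_accuracy(preds, labels, pad_idx):
    """Fraction of correct predictions over positions whose label is not padding."""
    pairs = [(p, y) for p, y in zip(preds, labels) if y != pad_idx]
    return sum(p == y for p, y in pairs) / len(pairs)


PAD = 0  # hypothetical padding index
labels = [3, 5, 4, PAD, PAD]
preds = [3, 7, 4, PAD, 9]
print(masked_accuracy(preds, labels, PAD))  # 2 of the 3 real tags are correct
```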

In [0]:
def train(model, iterator, optimizer):
    epoch_loss = 0
    epoch_acc = 0

    model.train() # switch to training mode (dropout is active)

    for batch in iterator:
        text = batch.text
        tags = batch.udtags[:,:-1] ## keep <SOS> and strip <EOS> tag

        optimizer.zero_grad()
        loss = model(text, tags)

        ### compute the accuracy (decoding needs no gradients) ###
        with torch.no_grad():
            preds = model.decode(text)
        preds = nn.utils.rnn.pad_sequence([torch.tensor(x) for x in preds], batch_first=True, padding_value=PAD_IDX)
        preds = preds.flatten() ### preds = [batch_size*seq_len]
        preds = preds.to(device)
        labels = tags[:, 1:].flatten() ### labels = [batch_size*seq_len]
        assert len(preds) == len(labels)

        acc = categorical_accuracy(preds, labels, PAD_IDX)


        loss.backward() ## back propagation
        optimizer.step() ## update parameters
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [0]:
def evaluate(model, iterator):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text = batch.text
            tags = batch.udtags[:,:-1] # keep <SOS> and strip <EOS> tag

            loss = model(text, tags)
            ### compute the accuracy ###
            preds = model.decode(text)
            preds = nn.utils.rnn.pad_sequence([torch.tensor(x) for x in preds], batch_first=True, padding_value=PAD_IDX)
            preds = preds.flatten()
            preds = preds.to(device)
            labels = tags[:, 1:].flatten()
            assert len(preds) == len(labels)

            acc = categorical_accuracy(preds, labels, PAD_IDX)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [0]:
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [27]:
N_EPOCHS = 10

best_valid_loss = float('inf')

MODEL_PARAS_OBJ = 'pos_lstm_crf.pt'

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer)
    valid_loss, valid_acc = evaluate(model, valid_iterator)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), MODEL_PARAS_OBJ)
    
    print(f'Epoch: {epoch+1} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 1 | Epoch Time: 0m 50s
	Train Loss: 18.945 | Train Acc: 63.16%
	 Val. Loss: 6.258 |  Val. Acc: 78.91%
Epoch: 2 | Epoch Time: 0m 49s
	Train Loss: 6.216 | Train Acc: 84.89%
	 Val. Loss: 4.538 |  Val. Acc: 82.09%
Epoch: 3 | Epoch Time: 0m 49s
	Train Loss: 4.580 | Train Acc: 88.09%
	 Val. Loss: 3.964 |  Val. Acc: 82.79%
Epoch: 4 | Epoch Time: 0m 48s
	Train Loss: 3.745 | Train Acc: 89.83%
	 Val. Loss: 3.815 |  Val. Acc: 83.39%
Epoch: 5 | Epoch Time: 0m 48s
	Train Loss: 3.293 | Train Acc: 90.85%
	 Val. Loss: 3.498 |  Val. Acc: 84.00%
Epoch: 6 | Epoch Time: 0m 48s
	Train Loss: 2.885 | Train Acc: 91.85%
	 Val. Loss: 3.583 |  Val. Acc: 83.85%
Epoch: 7 | Epoch Time: 0m 48s
	Train Loss: 2.598 | Train Acc: 92.49%
	 Val. Loss: 3.295 |  Val. Acc: 84.21%
Epoch: 8 | Epoch Time: 0m 49s
	Train Loss: 2.398 | Train Acc: 92.59%
	 Val. Loss: 3.361 |  Val. Acc: 84.17%
Epoch: 9 | Epoch Time: 0m 48s
	Train Loss: 2.216 | Train Acc: 93.22%
	 Val. Loss: 3.294 |  Val. Acc: 84.39%
Epoch: 10 | Epoch Time: 0m 

### Evaluation

In [28]:
model.load_state_dict(torch.load(MODEL_PARAS_OBJ))
test_loss, test_acc = evaluate(model, test_iterator)

print(f'Test Loss: {test_loss:.3f} |  Test Acc: {test_acc*100:.2f}%')

Test Loss: 2.971 |  Test Acc: 85.05%


## References

https://github.com/bentrevett/pytorch-pos-tagging/blob/master/1%20-%20Simple%20RNN%20PoS%20Tagger.ipynb

https://github.com/threelittlemonkeys/lstm-crf-pytorch