# PoS tagging with simple LSTM

Sequence labeling is a basic task in NLP. To address it, we need sequence models, in which the inputs depend on one another through time. The classical example of a sequence model is the Hidden Markov Model for part-of-speech tagging. Another example is the conditional random field. Among neural networks, we will use a **recurrent neural network** (RNN), which is commonly used for analysing sequences.

An RNN is a network that maintains some kind of state. For example, its output could be used as part of the next input, so that information can propagate along as the network passes over the sequence. In the case of an LSTM, for each element in the sequence, there is a corresponding hidden state $h_t$, which in principle can contain information from arbitrary points earlier in the sequence. We can use the hidden state to predict words in a language model, part-of-speech tags, and a myriad of other things.


## LSTM in PyTorch



### Explanation
`nn.LSTM` in PyTorch applies a _multi-layer_ long short-term memory (LSTM) RNN to an input sequence.

For each element in the input sequence, each layer computes the following function:

\begin{array}{ll} \\
            i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\
            f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\
            g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\
            o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\
            c_t = f_t * c_{(t-1)} + i_t * g_t \\
            h_t = o_t * \tanh(c_t) \\
        \end{array}

where $h_t$ and $c_t$ are the hidden state and cell state at time $t$; $x_t$ is the input at time $t$; $h_{(t-1)}$
is the hidden state of the layer at time $t-1$, or the initial hidden
state at time 0; $i_t$, $f_t$, $g_t$,
$o_t$ are the input, forget, cell, and output gates, respectively.
$\sigma$ is the sigmoid function, and $*$ is the Hadamard product.

In a multilayer LSTM, the input $x^{(l)}_t$ of the $l$-th layer
($l \geq 2$) is the hidden state $h^{(l-1)}_t$ of the previous layer multiplied by
dropout $\delta^{(l-1)}_t$ where each $\delta^{(l-1)}_t$ is a Bernoulli random
variable which is 0 with probability `dropout`.
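
To make the gate equations concrete, here is a minimal pure-Python sketch of one time step of a single LSTM layer. It is not how `nn.LSTM` is implemented; the function names and the `params` dictionary layout are assumptions made only for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gate(W_in, b_in, W_hid, b_hid, x, h_prev, activation):
    """activation(W_in x + b_in + W_hid h_prev + b_hid), applied elementwise."""
    a = matvec(W_in, x)
    r = matvec(W_hid, h_prev)
    return [activation(ai + bi + ri + bh)
            for ai, bi, ri, bh in zip(a, b_in, r, b_hid)]

def lstm_cell_step(x, h_prev, c_prev, params):
    """One LSTM step.

    `params` is a hypothetical dict mapping each gate name ("i", "f", "g", "o")
    to a tuple (W_input, b_input, W_hidden, b_hidden).
    """
    i = gate(*params["i"], x, h_prev, sigmoid)     # input gate
    f = gate(*params["f"], x, h_prev, sigmoid)     # forget gate
    g = gate(*params["g"], x, h_prev, math.tanh)   # cell candidate
    o = gate(*params["o"], x, h_prev, sigmoid)     # output gate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Toy check with all-zero parameters: every sigmoid gate is 0.5 and g is 0,
# so the new cell state is half of the previous one.
input_size, hidden_size = 3, 2
zero_gate = (zeros(hidden_size, input_size), [0.0] * hidden_size,
             zeros(hidden_size, hidden_size), [0.0] * hidden_size)
params = {name: zero_gate for name in "ifgo"}
h, c = lstm_cell_step([1.0, -2.0, 0.5], [0.0, 0.0], [1.0, -1.0], params)
print(h, c)
```

With zero weights the step reduces to $c_t = 0.5\,c_{(t-1)}$ and $h_t = 0.5\tanh(c_t)$, which gives an easy way to check the wiring of the gates by hand.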

In [0]:
import torch
import torch.nn as nn

### Inputs
The inputs to an LSTM are `input` and `(h_0, c_0)`.
* `input` is a tensor of shape **(seq_len, batch, input_size)**: containing the features of the input sequence. 
* `h_0` is a tensor of shape **(num_layers * num_directions, batch, hidden_size)** containing the initial hidden state for **each** element in the batch. If the LSTM is bidirectional, num_directions should be 2, else it should be 1. 
* `c_0` is a tensor of shape **(num_layers * num_directions, batch, hidden_size)** containing the initial cell state for **each** element in the batch. 

If `(h_0, c_0)` is not provided, both **h_0** and **c_0** default to zero.

Therefore, **PyTorch's LSTM expects
all of its inputs to be 3D tensors**.
The semantics of the axes of these
tensors are important.
The first axis is the sequence itself, the second indexes instances in the mini-batch, and the third indexes elements of the input. We haven't discussed mini-batching, so let's ignore it for now and assume the second axis always has size 1. If
we want to run the sequence model over the sentence "The cow jumped",
our input should look like:
\begin{align}\begin{bmatrix}
   \overbrace{q_\text{The}}^\text{row vector} \\
   q_\text{cow} \\
   q_\text{jumped}
   \end{bmatrix}\end{align}

**Except remember there is an additional 2nd dimension with size 1**.
Let's see a quick example.

In [2]:
lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # make a sequence of length 5

# initialize the hidden state.
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)
    print(out)
    print(hidden)

tensor([[[-0.0732,  0.0331,  0.0890]]], grad_fn=<StackBackward>)
(tensor([[[-0.0732,  0.0331,  0.0890]]], grad_fn=<StackBackward>), tensor([[[-0.4125,  0.1233,  0.7861]]], grad_fn=<StackBackward>))
tensor([[[0.0403, 0.1489, 0.1011]]], grad_fn=<StackBackward>)
(tensor([[[0.0403, 0.1489, 0.1011]]], grad_fn=<StackBackward>), tensor([[[0.2402, 0.6963, 0.6919]]], grad_fn=<StackBackward>))
tensor([[[0.1820, 0.2100, 0.1130]]], grad_fn=<StackBackward>)
(tensor([[[0.1820, 0.2100, 0.1130]]], grad_fn=<StackBackward>), tensor([[[0.3204, 0.4610, 0.2960]]], grad_fn=<StackBackward>))
tensor([[[0.1399, 0.1771, 0.0479]]], grad_fn=<StackBackward>)
(tensor([[[0.1399, 0.1771, 0.0479]]], grad_fn=<StackBackward>), tensor([[[0.3080, 0.4618, 0.1557]]], grad_fn=<StackBackward>))
tensor([[[ 0.0352,  0.0939, -0.0434]]], grad_fn=<StackBackward>)
(tensor([[[ 0.0352,  0.0939, -0.0434]]], grad_fn=<StackBackward>), tensor([[[ 0.0604,  0.2508, -0.1779]]], grad_fn=<StackBackward>))


### Outputs
The outputs from an LSTM are `output` and `(h_n, c_n)`.
* `output` is a tensor of shape **(seq_len, batch, num_directions * hidden_size)** containing the output features $h_t$ from the **last layer** of the LSTM, for each $t$. If a `torch.nn.utils.rnn.PackedSequence` has been given as the input, the output will also be a packed sequence. 
For the unpacked case, the directions can be separated using **output.view(seq_len, batch, num_directions, hidden_size)**, with forward and backward being direction 0 and 1 respectively. Similarly, the directions can be separated in the packed case. 
* `h_n` is a tensor of shape **(num_layers * num_directions, batch, hidden_size)** containing the hidden state for *t = seq_len*.
Like output, the layers can be separated using **h_n.view(num_layers, num_directions, batch, hidden_size)** and similarly for `c_n`. 

* `c_n` is a tensor of shape **(num_layers * num_directions, batch, hidden_size)** containing the cell state for *t = seq_len*.
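
The shape rules listed for the inputs and outputs can be collected in a small helper. This is only a bookkeeping sketch in plain Python, assuming the default time-major layout; `lstm_shapes` is a hypothetical name, not part of PyTorch.

```python
def lstm_shapes(seq_len, batch, input_size, hidden_size,
                num_layers=1, bidirectional=False):
    """Return the tensor shapes described above as plain tuples."""
    num_directions = 2 if bidirectional else 1
    state = (num_layers * num_directions, batch, hidden_size)
    return {
        "input": (seq_len, batch, input_size),
        "h_0": state,
        "c_0": state,
        "output": (seq_len, batch, num_directions * hidden_size),
        "h_n": state,
        "c_n": state,
    }

# A single-layer, unidirectional LSTM with input and hidden size 3,
# run over a sequence of length 5 with batch size 1.
small = lstm_shapes(seq_len=5, batch=1, input_size=3, hidden_size=3)
print(small["output"], small["h_n"])

# A two-layer bidirectional LSTM (seq_len=20 is an arbitrary example value).
big = lstm_shapes(seq_len=20, batch=64, input_size=100, hidden_size=128,
                  num_layers=2, bidirectional=True)
print(big["output"], big["h_n"])
```

In the bidirectional case the last axis of `output` doubles, which is why a linear layer placed on top of it needs `2 * hidden_size` input features.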

**Note about packed sequences**: To deal with training sequences of different lengths, in addition to *padding*, PyTorch allows us to further *pack* the sequences into a `PackedSequence`, which is built around *two tensors*. One contains the elements of all the sequences, and the other contains the batch size at each time step. This helps recover the actual sequences and tells the RNN how many sequences are still active at each time step. See the example [here](https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch).
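
The packing idea can be sketched with plain Python lists. This is an illustration of the layout only (elements stored time step by time step, plus a batch size per step), not the actual `torch.nn.utils.rnn` implementation; `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(sequences):
    """Pack variable-length sequences time step by time step.

    Sequences are first sorted from longest to shortest. Returns
    (data, batch_sizes): `data` holds the elements in time-major order and
    `batch_sizes[t]` is the number of sequences still active at step t.
    """
    ordered = sorted(sequences, key=len, reverse=True)
    max_len = len(ordered[0]) if ordered else 0
    data, batch_sizes = [], []
    for t in range(max_len):
        step = [seq[t] for seq in ordered if len(seq) > t]
        data.extend(step)
        batch_sizes.append(len(step))
    return data, batch_sizes

sentences = [["the", "cow", "jumped"], ["hello"], ["good", "morning"]]
data, batch_sizes = pack_sequences(sentences)
print(data)         # ['the', 'good', 'hello', 'cow', 'morning', 'jumped']
print(batch_sizes)  # [3, 2, 1]
```

No padding symbol is stored at all: at the third step only one sequence is still active, so the RNN processes a batch of size 1 there.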

Instead of using a `for` loop as shown above, we can process the entire sequence at once.
The first value returned by the LSTM is all of the hidden states throughout
the sequence; the second is just the most recent hidden state
(compare the last slice of `out` with `hidden` below: they are the same).
The reason for returning both is that:
* `out` gives you access to all hidden states in the sequence;
* `hidden` allows you to continue the sequence and backpropagate, by passing it as an argument to the LSTM at a later time.

The code below first adds the extra second dimension to the inputs.

In [3]:
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)

tensor([[[-0.1930,  0.1003, -0.1436]],

        [[ 0.0378,  0.1532, -0.0980]],

        [[ 0.1615,  0.2071, -0.2197]],

        [[ 0.1152,  0.1601, -0.1457]],

        [[-0.0016,  0.0865, -0.1286]]], grad_fn=<StackBackward>)
(tensor([[[-0.0016,  0.0865, -0.1286]]], grad_fn=<StackBackward>), tensor([[[-0.0026,  0.2435, -0.5242]]], grad_fn=<StackBackward>))
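
The relationship between the two return values can be reproduced with a toy recurrence in plain Python. This is a deliberately simplified scalar RNN, not an LSTM; `rnn_step`, `run_sequence`, and the weights `w_x`, `w_h` are assumptions made for the illustration.

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.8):
    """Toy scalar recurrence: new hidden state from input x and previous state h."""
    return math.tanh(w_x * x + w_h * h)

def run_sequence(xs, h0):
    """Process a whole sequence; return (all hidden states, final hidden state)."""
    outputs, h = [], h0
    for x in xs:
        h = rnn_step(x, h)
        outputs.append(h)
    return outputs, h

xs = [0.3, -1.2, 0.7, 0.1, 2.0]
h0 = 0.0

# One call over the whole sequence.
out, hidden = run_sequence(xs, h0)

# Equivalent: feed one element at a time and carry the state forward.
stepwise, h = [], h0
for x in xs:
    step_out, h = run_sequence([x], h)
    stepwise.extend(step_out)

print(out[-1] == hidden)   # the last output is the final hidden state
print(stepwise == out)     # stepping and batching the sequence agree
```

Carrying `h` from one call to the next is what allows a long sequence to be processed in chunks without losing the state.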


## Preparing Data

### Read and parse data
The data used here is the *Universal Dependencies Version 2 POS Tagged data*, which is included in `torchtext.datasets`. We define three `Field` objects, for `text`, `udtags` and `ptbtags`. We use `text` and `udtags` here but not `ptbtags`, which holds the finer-grained Penn Treebank tags.


In [0]:
from torchtext import data
from torchtext import datasets

SEED = 1234
#random.seed(SEED)
#np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(lower = True)
UD_TAGS = data.Field(unk_token = None)
PTB_TAGS = data.Field(unk_token = None)

fields = (("text", TEXT), ("udtags", UD_TAGS), ("ptbtags", PTB_TAGS))
train_data, valid_data, test_data = datasets.UDPOS.splits(fields)

Let's check the data examples.

In [5]:
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")

Number of training examples: 12543
Number of validation examples: 2002
Number of testing examples: 2077


In [6]:
print(vars(train_data.examples[0]))

{'text': ['al', '-', 'zaman', ':', 'american', 'forces', 'killed', 'shaikh', 'abdullah', 'al', '-', 'ani', ',', 'the', 'preacher', 'at', 'the', 'mosque', 'in', 'the', 'town', 'of', 'qaim', ',', 'near', 'the', 'syrian', 'border', '.'], 'udtags': ['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT'], 'ptbtags': ['NNP', 'HYPH', 'NNP', ':', 'JJ', 'NNS', 'VBD', 'NNP', 'NNP', 'NNP', 'HYPH', 'NNP', ',', 'DT', 'NN', 'IN', 'DT', 'NN', 'IN', 'DT', 'NN', 'IN', 'NNP', ',', 'IN', 'DT', 'JJ', 'NN', '.']}


### Build the vocabulary
Here we use pretrained GloVe vectors (`glove.6B.100d`) to initialize the word embeddings.

In [0]:
MIN_FREQ = 2

TEXT.build_vocab(train_data, 
                 min_freq = MIN_FREQ,
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)


UD_TAGS.build_vocab(train_data)
PTB_TAGS.build_vocab(train_data)

Summarize some statistics of the vocabularies.

In [8]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in UD_TAG vocabulary: {len(UD_TAGS.vocab)}")
print(f"Unique tokens in PTB_TAG vocabulary: {len(PTB_TAGS.vocab)}")

Unique tokens in TEXT vocabulary: 8866
Unique tokens in UD_TAG vocabulary: 18
Unique tokens in PTB_TAG vocabulary: 51


In [9]:
print(TEXT.vocab.freqs.most_common(20))

[('the', 9076), ('.', 8640), (',', 7021), ('to', 5137), ('and', 5002), ('a', 3782), ('of', 3622), ('i', 3379), ('in', 3112), ('is', 2239), ('you', 2156), ('that', 2036), ('it', 1850), ('for', 1842), ('-', 1426), ('have', 1359), ('"', 1296), ('on', 1273), ('was', 1244), ('with', 1216)]


In [10]:
print(TEXT.vocab.itos[:10])

['<unk>', '<pad>', 'the', '.', ',', 'to', 'and', 'a', 'of', 'i']


In [11]:
print(UD_TAGS.vocab.itos)

['<pad>', 'NOUN', 'PUNCT', 'VERB', 'PRON', 'ADP', 'DET', 'PROPN', 'ADJ', 'AUX', 'ADV', 'CCONJ', 'PART', 'NUM', 'SCONJ', 'X', 'INTJ', 'SYM']


Notice that the `<unk>` and `<pad>` tokens are included in the `TEXT` vocabulary, while only the `<pad>` token is included in the `UD_TAGS` vocabulary, because that field was created with `unk_token = None`.

In [12]:
print(UD_TAGS.vocab.freqs.most_common())

[('NOUN', 34781), ('PUNCT', 23679), ('VERB', 23081), ('PRON', 18577), ('ADP', 17638), ('DET', 16285), ('PROPN', 12946), ('ADJ', 12477), ('AUX', 12343), ('ADV', 10548), ('CCONJ', 6707), ('PART', 5567), ('NUM', 3999), ('SCONJ', 3843), ('X', 847), ('INTJ', 688), ('SYM', 599)]
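
To show what building a vocabulary with a minimum frequency amounts to, here is a minimal sketch in plain Python. It is not torchtext's implementation; `build_vocab`, `numericalize`, and the alphabetical tie-breaking rule are assumptions of this example.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi, freqs) for a list of tokenized examples."""
    freqs = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first; ties broken alphabetically (an assumption here).
    kept = sorted((tok for tok, n in freqs.items() if n >= min_freq),
                  key=lambda tok: (-freqs[tok], tok))
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi, freqs

def numericalize(tokens, stoi, unk_token="<unk>"):
    """Map tokens to indices, sending unknown tokens to the <unk> index."""
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

corpus = [["the", "cow", "jumped"], ["the", "cow", "slept"], ["the", "moon"]]
itos, stoi, freqs = build_vocab(corpus, min_freq=2)
print(itos)                                            # ['<unk>', '<pad>', 'the', 'cow']
print(numericalize(["the", "moon", "jumped"], stoi))   # [2, 0, 0]
```

Tokens seen fewer than `min_freq` times are dropped from the vocabulary and therefore map to the `<unk>` index, which is why the special tokens occupy the first positions.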


### Batch the data

Use `BucketIterator` to batch samples of similar length, which reduces the amount of padding, and to pad each batch (i.e. fill it with `pad_idx`) up to the length of its longest sample.

In [0]:
BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device
)
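
The benefit of grouping samples of similar length can be seen in a small pure-Python sketch. It only illustrates the idea of bucketing and padding; the real `BucketIterator` also shuffles, and the helper names and toy lengths below are hypothetical.

```python
def pad_batch(batch, pad_token="<pad>"):
    """Pad every sequence in a batch to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_token] * (width - len(seq)) for seq in batch]

def make_batches(sequences, batch_size, sort_by_length):
    """Split sequences into padded batches, optionally grouping similar lengths."""
    if sort_by_length:
        sequences = sorted(sequences, key=len)
    return [pad_batch(sequences[i:i + batch_size])
            for i in range(0, len(sequences), batch_size)]

def count_padding(batches, pad_token="<pad>"):
    """Total number of padding tokens across all batches."""
    return sum(seq.count(pad_token) for batch in batches for seq in batch)

# Hypothetical toy data: sentences with lengths 1, 9, 2, 8, 3, 7.
sentences = [["w"] * n for n in (1, 9, 2, 8, 3, 7)]

naive = make_batches(sentences, batch_size=2, sort_by_length=False)
bucketed = make_batches(sentences, batch_size=2, sort_by_length=True)
print(count_padding(naive), count_padding(bucketed))  # 18 6
```

On this toy data, grouping by length cuts the number of padding tokens from 18 to 6, so less computation is spent on positions that the loss ignores anyway.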

## Build the model

### Define the model structure

In this section, we will use an LSTM to get part of speech tags. We will
not use Viterbi or Forward-Backward or anything like that, but as a
(challenging) exercise to the reader, think about how Viterbi could be
used after you have seen what is going on.

The model is as follows: let our input sentence be
$w_1, \dots, w_M$, where $w_i \in V$, our vocab. Also, let
$T$ be our tag set, and $y_i$ the tag of word $w_i$.
Denote our prediction of the tag of word $w_i$ by
$\hat{y}_i$.

This is a structured prediction model, where our output is a sequence
$\hat{y}_1, \dots, \hat{y}_M$, where $\hat{y}_i \in T$.

To do the prediction, pass an LSTM over the sentence. Denote the hidden
state at timestep $i$ as $h_i$. Also, assign each tag a
unique index (just as the vocabulary assigns an index to each word).
Then our prediction rule for $\hat{y}_i$ is

\begin{align}\hat{y}_i = \text{argmax}_j \  (\log \text{Softmax}(Ah_i + b))_j\end{align}

That is, take the log softmax of the affine map of the hidden state,
and the predicted tag is the tag that has the maximum value in this
vector. Note this implies immediately that the dimensionality of the
target space of $A$ is $|T|$.
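
The prediction rule can be worked through by hand in plain Python. The matrix `A`, bias `b`, hidden state `h_i`, and the three-tag set below are made-up values for illustration; in the model they come from the linear layer and the LSTM.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax of a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def affine(A, h, b):
    """Compute A h + b for a matrix A (list of rows) and vectors h, b."""
    return [sum(a * x for a, x in zip(row, h)) + bj for row, bj in zip(A, b)]

def predict_tag(A, h, b, tags):
    """Return the tag with the largest log-probability, and all log-probabilities."""
    log_probs = log_softmax(affine(A, h, b))
    j = max(range(len(log_probs)), key=log_probs.__getitem__)
    return tags[j], log_probs

# Hypothetical values: |T| = 3 tags and a hidden state of size 2,
# so A has |T| = 3 rows.
tags = ["NOUN", "VERB", "DET"]
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.5, 0.0]
h_i = [0.2, 0.9]

tag, log_probs = predict_tag(A, h_i, b, tags)
print(tag)  # VERB
```

Because log-softmax is monotonic, the argmax over log-probabilities equals the argmax over the raw scores $Ah_i + b$; the normalization matters for the loss, not for picking the tag.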


In [0]:
class RNNPoSTagger(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers, bidirectional, dropout, pad_idx):
        super().__init__()

        # If padding_idx is specified, the embedding vector at padding_idx
        # receives no gradient, so it keeps its initial value
        # (zeros by default) throughout training.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text = [seq_len, batch_size]

        embedded = self.dropout(self.embedding(text)) ## do we really need dropout here???
        #embedded = [seq_len, batch_size, emb_dim]

        outputs, (hidden, cell) = self.rnn(embedded)
        #outputs = [seq_len, batch_size, hid_dim * n_directions]
        #hidden/cell = [n_layers * n_directions, batch_size, hidden_dim]

        predictions = self.fc(self.dropout(outputs))
        #predictions = [seq_len, batch_size, output_dim]

        return predictions 
    


Here we use a **bi-directional**, **two-layer** LSTM with **dropout**.

In [0]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = len(UD_TAGS.vocab)
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNNPoSTagger(INPUT_DIM, 
                     EMBEDDING_DIM, 
                     HIDDEN_DIM, 
                     OUTPUT_DIM, 
                     N_LAYERS, 
                     BIDIRECTIONAL, 
                     DROPOUT, 
                     PAD_IDX)

### Model initialization and parameters check

Initialize the model parameters.

In [16]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.normal_(param.data, mean=0, std=0.1)
        
model.apply(init_weights)

RNNPoSTagger(
  (embedding): Embedding(8866, 100, padding_idx=1)
  (rnn): LSTM(100, 128, num_layers=2, bidirectional=True)
  (fc): Linear(in_features=256, out_features=18, bias=True)
  (dropout): Dropout(p=0.25, inplace=False)
)

In [17]:
for name, param in model.named_parameters():
    #if param.requires_grad:
    print (name, param.requires_grad, param.data.shape)

embedding.weight True torch.Size([8866, 100])
rnn.weight_ih_l0 True torch.Size([512, 100])
rnn.weight_hh_l0 True torch.Size([512, 128])
rnn.bias_ih_l0 True torch.Size([512])
rnn.bias_hh_l0 True torch.Size([512])
rnn.weight_ih_l0_reverse True torch.Size([512, 100])
rnn.weight_hh_l0_reverse True torch.Size([512, 128])
rnn.bias_ih_l0_reverse True torch.Size([512])
rnn.bias_hh_l0_reverse True torch.Size([512])
rnn.weight_ih_l1 True torch.Size([512, 256])
rnn.weight_hh_l1 True torch.Size([512, 128])
rnn.bias_ih_l1 True torch.Size([512])
rnn.bias_hh_l1 True torch.Size([512])
rnn.weight_ih_l1_reverse True torch.Size([512, 256])
rnn.weight_hh_l1_reverse True torch.Size([512, 128])
rnn.bias_ih_l1_reverse True torch.Size([512])
rnn.bias_hh_l1_reverse True torch.Size([512])
fc.weight True torch.Size([18, 256])
fc.bias True torch.Size([18])


In [18]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 1,522,010 trainable parameters
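
The parameter count can be checked by hand from the shapes printed above. The sketch below redoes the arithmetic in plain Python; `lstm_layer_params` is a hypothetical helper, and the factor 4 reflects the four gates stacked in each weight matrix (512 = 4 × 128).

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters of one direction of one LSTM layer (four gates stacked)."""
    weight_ih = 4 * hidden_size * input_size
    weight_hh = 4 * hidden_size * hidden_size
    biases = 2 * 4 * hidden_size  # bias_ih and bias_hh
    return weight_ih + weight_hh + biases

vocab_size, emb_dim, hidden_dim, n_tags = 8866, 100, 128, 18
num_directions = 2  # bidirectional

embedding = vocab_size * emb_dim
layer0 = num_directions * lstm_layer_params(emb_dim, hidden_dim)
# The second layer reads the concatenated forward and backward states.
layer1 = num_directions * lstm_layer_params(num_directions * hidden_dim, hidden_dim)
fc = (num_directions * hidden_dim) * n_tags + n_tags

total = embedding + layer0 + layer1 + fc
print(f"{total:,}")  # 1,522,010
```

More than half of the parameters (886,600) sit in the embedding table, so the vocabulary size dominates the size of this model.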


Inject the GloVe embeddings into the model.

In [19]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([8866, 100])


In [20]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.9261,  2.3049,  0.5502,  ..., -0.3492, -0.5298, -0.1577],
        [-0.5972,  0.0471, -0.2406,  ..., -0.9446, -0.1126, -0.2260],
        [-0.4809,  2.5629,  0.9530,  ...,  0.5278, -0.4588,  0.7294]])

Set the embeddings of `<unk>` and `<pad>` to zeros.

In [21]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print("<unk> index is", UNK_IDX, ", and <pad> index is", PAD_IDX)
print(model.embedding.weight.data)

<unk> index is 0 , and <pad> index is 1
tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.9261,  2.3049,  0.5502,  ..., -0.3492, -0.5298, -0.1577],
        [-0.5972,  0.0471, -0.2406,  ..., -0.9446, -0.1126, -0.2260],
        [-0.4809,  2.5629,  0.9530,  ...,  0.5278, -0.4588,  0.7294]])


## Model training and evaluation


### Set up optimizer and loss

We use `Adam` here, which is a common default optimizer for neural network models.

In [0]:
import torch.optim as optim
optimizer = optim.Adam(model.parameters())

Predicting the tag at each time step of a sequence is, in essence, a classification task, so we use `nn.CrossEntropyLoss` as the loss function.
This criterion combines `nn.LogSoftmax` and `nn.NLLLoss` in one single class.

It is useful when training a classification problem with `C` classes.
If provided, the optional argument `weight` should be a 1D `Tensor`
assigning weight to each of the classes.
This is particularly useful when you have an unbalanced training set.

The `input` is expected to contain raw, unnormalized scores for each class.
`input` has to be a Tensor of size either **(minibatch, C)** or
**(minibatch, C, d_1, d_2, ..., d_K)**
with $K \geq 1$ for the `K`-dimensional case (not used here).

This criterion expects a class index in the range `[0, C-1]` as the
`target` for each value of a 1D tensor of size `minibatch`; if `ignore_index`
is specified, this criterion also accepts this class index (this index may not
necessarily be in the class range).

The loss of unweighted classes can be described as:
${loss}(x, class) = -\log\left(\frac{\exp(x[class])}{\sum_j \exp(x[j])}\right)= -x[class] + \log\left(\sum_j \exp(x[j])\right)$

The losses are averaged over the observations in each minibatch; targets equal to `ignore_index` are excluded from the average.
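
The formula and the effect of `ignore_index` can be reproduced in a few lines of plain Python. This is a sketch of the arithmetic under the default mean reduction, not the PyTorch implementation; the padding index and the logits are made-up values.

```python
import math

def cross_entropy(logits, targets, ignore_index=None):
    """Mean of -x[class] + log(sum_j exp(x[j])) over non-ignored targets."""
    losses = []
    for x, cls in zip(logits, targets):
        if cls == ignore_index:
            continue  # padded position: contributes nothing to the average
        m = max(x)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
        losses.append(-x[cls] + log_sum_exp)
    return sum(losses) / len(losses)

PAD = 0  # hypothetical index of the padding class
logits = [[0.0, 0.0, 0.0], [2.0, 1.0, 0.1], [5.0, -3.0, 1.0]]
targets = [1, 2, PAD]

loss = cross_entropy(logits, targets, ignore_index=PAD)
print(round(loss, 4))

# Changing the scores at the padded position leaves the loss untouched.
altered = [logits[0], logits[1], [-9.0, 9.0, 0.0]]
print(cross_entropy(altered, targets, ignore_index=PAD) == loss)
```

A row of identical scores over `C` classes gives a loss of $\log C$, which is a useful sanity check for an untrained model.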


In [0]:
TAG_PAD_IDX = UD_TAGS.vocab.stoi[UD_TAGS.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

Move the `criterion` and `model` to the selected device (the GPU when one is available).

In [0]:
model = model.to(device)
criterion = criterion.to(device)

### Training process

First, define a `categorical_accuracy` to compute the accuracy of predictions per batch

In [0]:
def categorical_accuracy(preds, y, tag_pad_idx):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    max_preds = preds.argmax(dim = 1, keepdim = True) # index of the highest-scoring class
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])
    # divide by a plain int so the result stays on the same device as `correct`
    return correct.sum().float() / y[non_pad_elements].shape[0]
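
The same masking logic can be written without tensors, which makes it easier to see what is being counted. This is a plain-Python sketch with made-up scores; `masked_accuracy` is a hypothetical name.

```python
def masked_accuracy(scores, targets, pad_idx):
    """Fraction of non-padding positions whose highest-scoring class is correct."""
    correct = total = 0
    for row, target in zip(scores, targets):
        if target == pad_idx:
            continue  # padding positions are not counted at all
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += predicted == target
        total += 1
    return correct / total

PAD = 0  # hypothetical padding index
scores = [
    [0.1, 0.8, 0.1],    # predicts class 1, target 1: correct
    [0.2, 0.5, 0.3],    # predicts class 1, target 2: wrong
    [0.9, 0.05, 0.05],  # padding position: skipped
    [0.0, 2.0, 1.0],    # predicts class 1, target 1: correct
]
targets = [1, 2, PAD, 1]

acc = masked_accuracy(scores, targets, PAD)
print(round(acc, 3))  # 0.667
```

Counting padding positions would distort the score, since they carry no real tag; here the accuracy is 2 out of 3 real tokens rather than 2 out of 4 positions.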

Then, define the `train` and `evaluate` functions. Pay attention to the dimensions of the input data, output predictions, and labels, and make sure that they match.

In [0]:
def train(model, iterator, optimizer, criterion, tag_pad_idx):
    epoch_loss = 0
    epoch_acc = 0

    model.train() # training mode: dropout is active

    for batch in iterator:
        text = batch.text
        tags = batch.udtags

        # text = [sent_len, batch_size]
        # tags = [sent_len, batch_size]

        optimizer.zero_grad()

        
        predictions = model(text)
        # predictions = [sent_len, batch_size, output_dim]
        
        ## reshape tensors to make them fit the loss function
        predictions = predictions.view(-1, predictions.shape[-1])
        tags = tags.view(-1)

        # predictions = [sent_len*batch_size, output_dim]
        # tags = [sent_len*batch_size]

        loss = criterion(predictions, tags)
        acc = categorical_accuracy(predictions, tags, tag_pad_idx)

        loss.backward() ## back propagation
        optimizer.step() ## update parameters
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
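
The reshape step in `train` collapses the time and batch axes so that every token becomes one row for the loss function. The nested-list sketch below mimics that flattening in plain Python with made-up numbers; `flatten_time_major` is a hypothetical helper that plays the role of `view(-1, ...)`.

```python
def flatten_time_major(nested):
    """Collapse the first two axes of a [sent_len][batch_size] nested list."""
    return [item for step in nested for item in step]

# Hypothetical batch: sent_len = 3, batch_size = 2, output_dim = 2.
predictions = [
    [[0.1, 0.9], [0.8, 0.2]],   # time step 0: sentence A, sentence B
    [[0.7, 0.3], [0.4, 0.6]],   # time step 1
    [[0.5, 0.5], [0.2, 0.8]],   # time step 2
]
tags = [[1, 0], [0, 1], [0, 1]]  # same [sent_len][batch_size] layout

flat_predictions = flatten_time_major(predictions)  # 6 rows of output_dim scores
flat_tags = flatten_time_major(tags)                # 6 class indices

print(len(flat_predictions), len(flat_tags))  # 6 6
print(flat_tags)                              # [1, 0, 0, 1, 0, 1]
```

Because both structures are flattened in the same order, row `k` of the predictions still lines up with tag `k`, which is all the loss and the accuracy need.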


In [0]:
def evaluate(model, iterator, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text = batch.text
            tags = batch.udtags
            
            predictions = model(text)
            
            predictions = predictions.view(-1, predictions.shape[-1])
            tags = tags.view(-1)
            
            loss = criterion(predictions, tags)
            
            acc = categorical_accuracy(predictions, tags, tag_pad_idx)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [0]:
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, run the training process as follows. Remember to save the best model based on the validation metrics.

In [29]:
N_EPOCHS = 10

best_valid_loss = float('inf')

MODEL_PARAS_OBJ = 'pos_lstm.pt'

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion, TAG_PAD_IDX)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion, TAG_PAD_IDX)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), MODEL_PARAS_OBJ)
    
    print(f'Epoch: {epoch+1} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 1 | Epoch Time: 0m 3s
	Train Loss: 0.918 | Train Acc: 71.42%
	 Val. Loss: 0.575 |  Val. Acc: 82.35%
Epoch: 2 | Epoch Time: 0m 3s
	Train Loss: 0.345 | Train Acc: 89.14%
	 Val. Loss: 0.480 |  Val. Acc: 85.15%
Epoch: 3 | Epoch Time: 0m 3s
	Train Loss: 0.262 | Train Acc: 91.66%
	 Val. Loss: 0.432 |  Val. Acc: 86.30%
Epoch: 4 | Epoch Time: 0m 3s
	Train Loss: 0.220 | Train Acc: 93.01%
	 Val. Loss: 0.409 |  Val. Acc: 87.93%
Epoch: 5 | Epoch Time: 0m 3s
	Train Loss: 0.194 | Train Acc: 93.76%
	 Val. Loss: 0.388 |  Val. Acc: 88.27%
Epoch: 6 | Epoch Time: 0m 3s
	Train Loss: 0.172 | Train Acc: 94.44%
	 Val. Loss: 0.377 |  Val. Acc: 88.43%
Epoch: 7 | Epoch Time: 0m 3s
	Train Loss: 0.154 | Train Acc: 95.02%
	 Val. Loss: 0.367 |  Val. Acc: 88.70%
Epoch: 8 | Epoch Time: 0m 3s
	Train Loss: 0.140 | Train Acc: 95.47%
	 Val. Loss: 0.370 |  Val. Acc: 88.90%
Epoch: 9 | Epoch Time: 0m 3s
	Train Loss: 0.130 | Train Acc: 95.78%
	 Val. Loss: 0.353 |  Val. Acc: 89.05%
Epoch: 10 | Epoch Time: 0m 3s
	Train 
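
The checkpointing rule in the training loop can be replayed on the validation losses printed in the log. This is a plain-Python sketch; `epochs_that_save` is a hypothetical helper, and only epochs 1 to 9 are included because the log does not show the validation loss for epoch 10.

```python
def epochs_that_save(valid_losses):
    """Return the 1-based epochs at which the validation loss improves, and the best loss."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            best = loss
            saved.append(epoch)  # the training loop would call torch.save here
    return saved, best

# Validation losses for epochs 1-9, copied from the log.
valid_losses = [0.575, 0.480, 0.432, 0.409, 0.388, 0.377, 0.367, 0.370, 0.353]
saved, best = epochs_that_save(valid_losses)
print(saved)  # [1, 2, 3, 4, 5, 6, 7, 9]
print(best)   # 0.353
```

Epoch 8 is the only one of these that would not overwrite the checkpoint, because its validation loss (0.370) is worse than the best value seen so far (0.367).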

### Evaluation

In [30]:
test_loss, test_acc = evaluate(model, test_iterator, criterion, TAG_PAD_IDX)

print(f'Test Loss: {test_loss:.3f} |  Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.365 |  Test Acc: 88.94%


## References
* https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#example-an-lstm-for-part-of-speech-tagging
* https://github.com/bentrevett/pytorch-pos-tagging/blob/master/1%20-%20Simple%20RNN%20PoS%20Tagger.ipynb