In [None]:
%matplotlib inline


In this exercise you will implement a part-of-speech tagger using an LSTM (a sequence-to-sequence task: one tag is predicted for each input word). This notebook is based on [this PyTorch tutorial](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html).

Code that you should implement is marked with TODO.

## LSTMs in PyTorch

Before getting to the example, note a few things. PyTorch's LSTM expects
all of its inputs to be 3D tensors. The semantics of the axes of these
tensors are important. The first axis is the sequence itself, the second
indexes instances in the mini-batch, and the third indexes elements of
the input. We haven't discussed mini-batching, so let's just ignore that
and assume we will always have just 1 dimension on the second axis. If
we want to run the sequence model over the sentence "The cow jumped",
our input should look like

\begin{align}\begin{bmatrix}
\overbrace{q_\text{The}}^\text{row vector} \\
q_\text{cow} \\
q_\text{jumped}
\end{bmatrix}\end{align}

Except remember there is an additional 2nd dimension with size 1.

In addition, you could go through the sequence one element at a time, in
which case the 1st axis will also have size 1.

Let's see a quick example.


In [2]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)


<torch._C.Generator at 0x7ff59dc80990>

In [None]:
lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # make a sequence of length 5
# initialize the hidden state.
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
print(hidden)
for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)


# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument  to the lstm at a later time
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)
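The comments in the cell above make two claims: the last slice of `out` equals the final hidden state, and stepping through the sequence gives the same result as a single call. The following is a minimal sketch that checks both, assuming a fresh `nn.LSTM(3, 3)` and a fixed initial state. The names `h0`, `c0`, `step_outs` and `stepwise` are illustrative and do not appear in the original notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

lstm = nn.LSTM(3, 3)                 # input size 3, hidden size 3
seq = torch.randn(5, 1, 3)           # (seq_len=5, batch=1, input_size=3)
h0 = torch.randn(1, 1, 3)            # initial hidden state
c0 = torch.randn(1, 1, 3)            # initial cell state

with torch.no_grad():
    # Whole sequence in one call.
    out, (h_n, c_n) = lstm(seq, (h0, c0))

    # Same sequence, one element at a time, carrying the state forward.
    state = (h0, c0)
    step_outs = []
    for x in seq:                    # each x has shape (1, 3)
        step_out, state = lstm(x.view(1, 1, -1), state)
        step_outs.append(step_out)
    stepwise = torch.cat(step_outs)  # back to shape (5, 1, 3)

print(out.shape)                                 # torch.Size([5, 1, 3])
print(torch.allclose(out[-1], h_n[0]))           # last output vs. final hidden state
print(torch.allclose(out, stepwise, atol=1e-5))  # one call vs. step by step
```

Note that the two strategies only agree when they start from the same initial state; the cell above draws a new random state before the second run, so its printed values differ between the two runs.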


## Example: An LSTM for Part-of-Speech Tagging

In this section, we will use an LSTM to predict part-of-speech tags, i.e., whether a word is a noun, a verb, etc. We will
not use Viterbi or Forward-Backward or anything like that, but as a
(challenging) exercise to the reader, think about how Viterbi could be
used after you have seen what is going on. In this example, we also refer
to embeddings. If you are unfamiliar with embeddings, you can read up
about them [here](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html).

The model is as follows: let our input sentence be
$w_1, \dots, w_M$, where $w_i \in V$, our vocab. Also, let
$T$ be our tag set, and $y_i$ the tag of word $w_i$.
Denote our prediction of the tag of word $w_i$ by
$\hat{y}_i$.

This is a structured prediction model, where our output is a sequence
$\hat{y}_1, \dots, \hat{y}_M$, where $\hat{y}_i \in T$.

To do the prediction, pass an LSTM over the sentence. Denote the hidden
state at timestep $i$ as $h_i$. Also, assign each tag a
unique index (like how we had word_to_ix in the word embeddings
section). Then our prediction rule for $\hat{y}_i$ is

\begin{align}
\hat{y}_i = \text{argmax}_j \ (\log \text{Softmax}(Ah_i + b))_j
\end{align}

That is, take the log softmax of the affine map of the hidden state,
and the predicted tag is the tag that has the maximum value in this
vector. Note this implies immediately that the dimensionality of the
target space of $A$ is $|T|$.
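To make the prediction rule concrete, here is a small hand-computed sketch in plain Python. All numbers are made up for illustration: a hidden state $h_i$ of size 2, a tag set of size 3, and hence a $3 \times 2$ matrix `A`. The helper `log_softmax` is written out by hand here and is not the PyTorch function.

```python
import math


def log_softmax(scores):
    """Log-softmax of a list of floats, computed in a numerically stable way."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


# Hypothetical values: hidden size 2, tag set T = {DET, NN, V}, so A is 3 x 2.
A = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, -1.0]]
b = [0.0, 0.5, 0.0]
h_i = [0.2, 0.3]

# Affine map A h_i + b: one score per tag.
scores = [sum(a * h for a, h in zip(row, h_i)) + b_j for row, b_j in zip(A, b)]
log_probs = log_softmax(scores)

# Prediction rule: the index j with the largest log-probability.
y_hat = max(range(len(log_probs)), key=lambda j: log_probs[j])
tags = ["DET", "NN", "V"]
print(tags[y_hat])  # NN
```

Because the log softmax is monotonic, the argmax over the log-probabilities is the same as the argmax over the raw scores; the normalization matters for the training loss, not for the predicted tag.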

Prepare data:


In [None]:
training_data = [
    # Tags are: DET - determiner; NN - noun; V - verb
    # For example, the word "The" is a determiner
    ("the dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

# This is alternative training data which requires the model to encode the context to make
# correct predictions.
# training_data = [
#     ("fix the lie".split(), ["V", "DET", "NN"]),
#     ("everybody lie about apple".split(), ["NN", "V", "UNK", "NN"]),
#     ("everybody lie between book".split(), ["NN", "V", "UNK", "NN"]),
#     ("the fix went live".split(), ["DET", "NN", "V", "UNK"]),
#     ("the lie is bad".split(), ["DET", "NN", "V", "UNK"]),
# ]

word_to_ix = {}
# For each words-list (sentence) and tags-list in each tuple of training_data
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:  # word has not been assigned an index yet
            # Assign each word with a unique index
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)
# Assign each tag with a unique index
tag_to_ix = {"DET": 0, "NN": 1, "V": 2, "UNK": 3}


def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6


Create the model:


In [None]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
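To make the tensor shapes inside `forward` concrete, the following sketch pushes a five-word sentence through the same three layers outside of a class. The sizes (a vocabulary of 9 words and 3 tags) and the word indices are assumed for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

vocab_size, tagset_size = 9, 3          # assumed sizes for illustration
embedding_dim, hidden_dim = 6, 6

word_embeddings = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([0, 1, 2, 0, 3], dtype=torch.long)  # 5 word indices

with torch.no_grad():
    embeds = word_embeddings(sentence)                        # (5, 6)
    lstm_out, _ = lstm(embeds.view(len(sentence), 1, -1))     # (5, 1, 6)
    tag_space = hidden2tag(lstm_out.view(len(sentence), -1))  # (5, 3)
    tag_scores = F.log_softmax(tag_space, dim=1)              # (5, 3)

print(embeds.shape, lstm_out.shape, tag_scores.shape)
# Each row of tag_scores holds log-probabilities, so exp() of a row sums to 1.
print(tag_scores.exp().sum(dim=1))
```

The `view` calls only add and remove the mini-batch axis of size 1 that the LSTM expects; they do not change the data.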


Train the model:


In [None]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores)

for epoch in range(300):  # normally you would NOT do 300 epochs; this is toy data
    for sentence, tags in training_data:

        # Step 0. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 1. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 2. Calculate loss
        loss = loss_function(tag_scores, targets)

        # Step 3. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 4. Compute gradients
        loss.backward()

        # Step 5. Update parameters
        optimizer.step()

# We add a function to convert the model output to easily readable tokens


def score_to_tags(scores):
    ix_to_tag = {v: k for k, v in tag_to_ix.items()}
    return [ix_to_tag[s.item()] for s in scores.argmax(dim=1)]


# See what the scores are after training
with torch.no_grad():
    for text, tags in training_data:
        inputs = prepare_sequence(text, word_to_ix)
        tag_scores = model(inputs)
        print(text)
        print("predicted:", score_to_tags(tag_scores), "ground truth:", tags)


The original data is trivial to predict for a sufficiently complex model: every word occurs with only a single tag.
Even a linear model that sees each word in isolation can predict the tags correctly.
Context is not necessary.
To see that the model really learns to use context, train it on the alternative training data (commented out in the cell with the original training data), in which words like "lie" and "fix" occur both as verbs and as nouns.
Correct prediction is then only possible if the previous words are taken into account.
An additional tag, `UNK`, is used for words that have none of the original tags.
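The difference between the two datasets can be checked directly. The sketch below, in plain Python, collects the set of tags each word occurs with and reports the words that have more than one. The helper name `ambiguous_words` is made up for this illustration; the data is copied from the training-data cell.

```python
from collections import defaultdict

original = [
    ("the dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("everybody read that book".split(), ["NN", "V", "DET", "NN"]),
]
alternative = [
    ("fix the lie".split(), ["V", "DET", "NN"]),
    ("everybody lie about apple".split(), ["NN", "V", "UNK", "NN"]),
    ("everybody lie between book".split(), ["NN", "V", "UNK", "NN"]),
    ("the fix went live".split(), ["DET", "NN", "V", "UNK"]),
    ("the lie is bad".split(), ["DET", "NN", "V", "UNK"]),
]


def ambiguous_words(data):
    """Return the words that occur with more than one tag, with their tags."""
    tags_by_word = defaultdict(set)
    for sent, tags in data:
        for word, tag in zip(sent, tags):
            tags_by_word[word].add(tag)
    return {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}


print(ambiguous_words(original))     # {}
print(ambiguous_words(alternative))  # {'fix': ['NN', 'V'], 'lie': ['NN', 'V']}
```

A model that maps each word to a tag independently of its neighbours cannot get both occurrences of "fix" or "lie" right, which is why the alternative data requires context.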
