# Sequence Models and Long-Short Term Memory Networks

A recurrent network is one that maintains some kind of state. For example, its output could be used as part of the next input, so that information can propogate along as the network passes over the sequence. In the case of an LSTM, for each element in the sequence, there is a corresponding hidden state *ht*, which in principle can contain information from arbitrary points earlier in the sequence.

We can use the hidden state to predict words in a language model, part-of-speech tags, and a myriad of other things.

## LSTMs in Pytorch

Pytorch's LSTMs expect all of its inputs to be 3D tensors. The semanstics of its axes are **very important**. 
- The first axis is the **sequence itself**
- The second **indexes instances in the mini-batch**
- The third **indexes elements of the input**

To avoid mini-batching for now, lets just assume we will always have 1 dimension on the 2nd axis.

If we want to run the sequence model over the sentence "The cow jumped", our input should look like:

```python
[
 'The',
 'cow',
 'jumped'
]

```

In addition, you could go through the sequence one at a time, in which case the 1st axis will have size 1.

Here's a quick example:

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x113efa910>

In [2]:
lstm = nn.LSTM(3,3) # Input dim is 3, output dim is 3
inputs = [torch.randn(1,3) for i in range(5)] # Make a sequence of length 5

# Initialize the hidden state
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))

for i in inputs:
    # Step through the sequence one element at a time.
    # After each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)

print(out)
print(hidden)

tensor([[[-0.3600,  0.0893,  0.0215]]])
(tensor([[[-0.3600,  0.0893,  0.0215]]]), tensor([[[-1.1298,  0.4467,  0.0254]]]))


Alternatively, we can do the entire sequence *all at once*. The first value returned by LSTM is all of the hidden states throughout the sequence. The second is just the most recent hidden state (compare the result of "out" with the first of "hidden", its the same)

The reason for this is that:

**"Out"** will give you access to all hidden states in the sequence. **"Hidden"** will allow you to contitnue the sequence and backpropogate, By passing it as an argument to the LSTM at a later time.

Lets add the second dimension

In [3]:
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3)) # Clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)

tensor([[[-0.0187,  0.1713, -0.2944]],

        [[-0.3521,  0.1026, -0.2971]],

        [[-0.3191,  0.0781, -0.1957]],

        [[-0.1634,  0.0941, -0.1637]],

        [[-0.3368,  0.0959, -0.0538]]])
(tensor([[[-0.3368,  0.0959, -0.0538]]]), tensor([[[-0.9825,  0.4715, -0.0633]]]))


## Example: An LSTM for Part-of-Speech Tagging

In this section, we will use LSTMs to get part of speech tags.

The model is as follows: let our input sentence be w1,…,wM, where wi∈V, our vocab. Also, let T be our tag set, and yi the tag of word wi. Denote our prediction of the tag of word wi by ŷ i.

This is a structure prediction, model, where our output is a sequence ŷ 1,…,ŷ M, where ŷ i∈T.

To do the prediction, pass an LSTM over the sentence. Denote the hidden state at timestep i as hi. Also, assign each tag a unique index (like how we had word_to_ix in the word embeddings section). Then our prediction rule for ŷ i is

̂ i=argmaxj (logSoftmax(Ahi+b))j

That is, take the log softmax of the affine map of the hidden state, and the predicted tag is the tag that has the maximum value in this vector. Note this implies immediately that the dimensionality of the target space of A is |T|

First, let's prepare the data:

In [5]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}


Now, to create the model

In [13]:
class LSTMTagger(nn.Module):
    
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # The LSTM takes word embeddings as inputs, and outputs hidden
        # states with dimensionality hidden_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        
        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()
        
    def init_hidden(self):
        # Before we've done anything, there is no hidden state. Refer to
        # the Pytorch documentation to see exactly why they have
        # this dimenstionality.
        # The aces semantics are (num_layers, minibatch_size, hidden_dim)
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))
    
    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out , self.hidden = self.lstm(
            embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

We will initialize the model, loss function, and optimizer. Also, let's see what the scores look like before training:

In [14]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
# We use Stochastic Grad Descent since we have a
# minibatch size of 1
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training:
# Note that element i,j of the output is the score for tag j and word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()

with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores)

tensor([[-1.1673, -1.2724, -0.8949],
        [-1.0838, -1.3116, -0.9357],
        [-0.9590, -1.3362, -1.0388],
        [-1.0451, -1.3285, -0.9585],
        [-1.0733, -1.3155, -0.9422]])


Finally lets train the model

In [19]:
for epoch in range(1000): # Small epoch num. for training only
    for sentence, tags in training_data:
        # 1) Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance.
        model.zero_grad()
        
        # We need to also clear out the hidden state of the LSTM,
        # detaching it from its history on the last instance.
        model.hidden = model.init_hidden()
        
        # 2) Get our inputs ready for the network. That is, turn
        # them into tensors of word indices
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        
        # 3) Forward Pass
        tag_scores = model(sentence_in)
        
        # 4) Compute Loss, Gradients, and update parameters by calling
        # optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

Lets see what the scores are now after training. Notice again how we are initializing the model again, but with torch.no_grad() since we are only evaluating, not training.

The sentence is "the dog ate the apple". i,j corresponds to score for tag j for word i. The predicted tag is the maximum scoring tag. Here, we can see the predicted sequence below is 0 1 2 0 1; since 0 is the index of the maximum value of row 1, 1 is the index of the maximum value of row 2, etc. Which is DET NOUN VERV DET NOUN, the correct sequence!

In [20]:
# See what the scores are after training (hence why we call no_grad())
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    
    
    print(tag_scores)

tensor([[-0.0019, -6.8743, -7.0384],
        [-7.8202, -0.0009, -7.5317],
        [-6.3918, -6.5512, -0.0031],
        [-0.0033, -6.1206, -6.7931],
        [-6.8818, -0.0015, -7.6234]])


## Exercise: Augmenting the LSTM part-of-speech tagger with character-level features