# Sequence Models and Long Short-Term Memory Networks

A recurrent neural network is a network that maintains some kind of state. For example, its output can be fed back as part of the next input, so that information propagates along as the network passes over the sequence. In an LSTM, each element of the sequence has a corresponding hidden state $h_t$, which in principle can contain information from arbitrary points earlier in the sequence.

We can use the hidden state to predict words in a language model, part-of-speech tags, and a myriad of other things.

## LSTMs in PyTorch

PyTorch's LSTM expects all of its inputs to be 3D tensors. The semantics of the axes of these tensors are **very important**:
- The first axis is the **sequence itself**
- The second **indexes instances in the mini-batch**
- The third **indexes elements of the input**

To avoid mini-batching for now, let's assume the second axis always has size 1.
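As a rough sketch of the axis convention described above, the shapes can be inspected directly. The sizes used here (a sequence of length 5, a hidden size of 4) are arbitrary choices for illustration; the hidden size differs from the input size only so the axes are easy to tell apart.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not values from the text).
seq_len, batch_size, input_size, hidden_size = 5, 1, 3, 4

lstm = nn.LSTM(input_size, hidden_size)

# Axis 0: position in the sequence, axis 1: mini-batch instance, axis 2: input features.
x = torch.randn(seq_len, batch_size, input_size)

# When no initial state is passed, the hidden and cell states start at zero.
out, (h_n, c_n) = lstm(x)

print(x.shape)    # torch.Size([5, 1, 3])
print(out.shape)  # torch.Size([5, 1, 4])
print(h_n.shape)  # torch.Size([1, 1, 4])
print(c_n.shape)  # torch.Size([1, 1, 4])
```

The output keeps the sequence and mini-batch axes of the input and replaces the feature axis with the hidden size.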

If we want to run the sequence model over the sentence "The cow jumped", the input has one entry per word along the first axis. Each word here stands for its input vector, such as a word embedding:

```python
[
 'The',
 'cow',
 'jumped'
]
```
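To make the shape concrete, here is a minimal sketch of turning that sentence into a 3D tensor. It assumes the words are mapped to vectors with an `nn.Embedding` layer; the vocabulary `word_to_ix` is built from this one sentence and the embedding size of 4 is arbitrary, both purely for illustration.

```python
import torch
import torch.nn as nn

sentence = ["The", "cow", "jumped"]

# Hypothetical vocabulary built only from this sentence.
word_to_ix = {word: i for i, word in enumerate(sentence)}

EMBEDDING_DIM = 4  # arbitrary size, for illustration
embeddings = nn.Embedding(len(word_to_ix), EMBEDDING_DIM)

# Look up one vector per word: shape (3, 4).
idxs = torch.tensor([word_to_ix[w] for w in sentence], dtype=torch.long)
embedded = embeddings(idxs)

# Insert the mini-batch axis of size 1: shape (3, 1, 4).
lstm_input = embedded.view(len(sentence), 1, -1)

print(lstm_input.shape)  # torch.Size([3, 1, 4])
```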

Remember that there is also the second axis of size 1. You can also go through the sequence one element at a time, in which case the first axis has size 1 as well.

Here's a quick example:

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x113efa910>

In [2]:
lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # Make a sequence of length 5

# Initialize the state: a (hidden state, cell state) pair,
# each of shape (num_layers, batch, hidden_dim)
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))

for i in inputs:
    # Step through the sequence one element at a time.
    # After each step, hidden holds the updated (hidden state, cell state) pair.
    out, hidden = lstm(i.view(1, 1, -1), hidden)

print(out)
print(hidden)

tensor([[[-0.3600,  0.0893,  0.0215]]])
(tensor([[[-0.3600,  0.0893,  0.0215]]]), tensor([[[-1.1298,  0.4467,  0.0254]]]))


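Alternatively, we can process the entire sequence *all at once*. The first value returned by the LSTM contains the hidden states for every step of the sequence. The second is the most recent state, a tuple of the final hidden state and the final cell state. In the step-by-step example above, `out` is identical to the first element of `hidden`; when the whole sequence is processed at once, the last slice of `out` matches it.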

The two return values serve different purposes. **`out`** gives you access to all hidden states in the sequence. **`hidden`** lets you continue the sequence and backpropagate, by passing it as an argument to the LSTM at a later time.

Let's add the extra second dimension:

In [3]:
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3)) # Clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)

tensor([[[-0.0187,  0.1713, -0.2944]],

        [[-0.3521,  0.1026, -0.2971]],

        [[-0.3191,  0.0781, -0.1957]],

        [[-0.1634,  0.0941, -0.1637]],

        [[-0.3368,  0.0959, -0.0538]]])
(tensor([[[-0.3368,  0.0959, -0.0538]]]), tensor([[[-0.9825,  0.4715, -0.0633]]]))
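The relationship between the two return values can be checked directly. The sketch below uses its own seed and a zero initial state (assumptions made here, so the numbers differ from the outputs above). It verifies that the last slice of the output equals the final hidden state, and that feeding the sequence in two chunks while carrying the state across gives the same result as one call.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # separate seed, chosen for this sketch only
lstm = nn.LSTM(3, 3)
inputs = torch.randn(5, 1, 3)  # sequence of length 5, mini-batch of 1
h0 = (torch.zeros(1, 1, 3), torch.zeros(1, 1, 3))

# Whole sequence in a single call.
out_full, hidden_full = lstm(inputs, h0)

# The last slice of the output is the final hidden state.
assert torch.allclose(out_full[-1], hidden_full[0][0])

# Same sequence in two chunks, passing the returned state into the second call.
out_a, hidden_a = lstm(inputs[:2], h0)
out_b, hidden_b = lstm(inputs[2:], hidden_a)

# Concatenated chunk outputs match the single-call output (up to float tolerance).
assert torch.allclose(torch.cat([out_a, out_b]), out_full, atol=1e-5)
# The final cell state matches as well.
assert torch.allclose(hidden_b[1], hidden_full[1], atol=1e-5)
```

Carrying the returned state forward in this way is what allows a long sequence to be processed piece by piece.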


## Example: An LSTM for Part-of-Speech Tagging

In this section, we will use an LSTM to predict part-of-speech tags.

The model is as follows: let our input sentence be $w_1, \dots, w_M$, where $w_i \in V$, our vocabulary. Also, let $T$ be our tag set, and $y_i$ the tag of word $w_i$. Denote our prediction of the tag of word $w_i$ by $\hat{y}_i$.

This is a structured prediction model, where our output is a sequence $\hat{y}_1, \dots, \hat{y}_M$, where $\hat{y}_i \in T$.

To do the prediction, pass an LSTM over the sentence. Denote the hidden state at timestep $i$ as $h_i$. Also, assign each tag a unique index (like how we had `word_to_ix` in the word embeddings section). Then our prediction rule for $\hat{y}_i$ is

$$\hat{y}_i = \operatorname{argmax}_j \left( \log \operatorname{Softmax}(A h_i + b) \right)_j$$

That is, take the log softmax of the affine map of the hidden state; the predicted tag is the one with the maximum value in the resulting vector. Note that this immediately implies that the dimensionality of the target space of $A$ is $|T|$.
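The prediction rule can be sketched in a few lines. This is only an illustration under stated assumptions: the layer name `hidden2tag` is hypothetical, the hidden size of 6 and the three-tag set are example values, and random vectors stand in for the hidden states an LSTM would actually produce.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

HIDDEN_DIM = 6                           # example hidden size
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}  # example tag set T

# The affine map A h_i + b; its output dimension is |T|.
# `hidden2tag` is a hypothetical name used for this sketch.
hidden2tag = nn.Linear(HIDDEN_DIM, len(tag_to_ix))

# Stand-in hidden states for a 5-word sentence (random, for illustration only).
h = torch.randn(5, HIDDEN_DIM)

# Log softmax over the tag axis, then the index of the largest score per word.
tag_scores = F.log_softmax(hidden2tag(h), dim=1)  # shape: (5, 3)
predicted = tag_scores.argmax(dim=1)              # shape: (5,)

print(tag_scores.shape)  # torch.Size([5, 3])
print(predicted.shape)   # torch.Size([5])
```

Because the log softmax is monotonic, the argmax over `tag_scores` is the same as the argmax over the raw affine scores; the log probabilities matter mainly when computing a loss.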

First, let's prepare the data:

In [5]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
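As a quick check of the data preparation, the snippet below applies `prepare_sequence` to the first training sentence and its tags. It repeats the function and the two index dictionaries from the cell above so that it runs on its own.

```python
import torch

def prepare_sequence(seq, to_ix):
    # Map each token to its integer index and pack the result into a LongTensor.
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# Same mappings as built in the cell above.
word_to_ix = {'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4,
              'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

sentence_in = prepare_sequence("The dog ate the apple".split(), word_to_ix)
targets = prepare_sequence(["DET", "NN", "V", "DET", "NN"], tag_to_ix)

print(sentence_in)  # tensor([0, 1, 2, 3, 4])
print(targets)      # tensor([0, 1, 2, 0, 1])
```

Note that "The" and "the" receive different indices because the vocabulary is case-sensitive.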
