# Predicting parts of speech with an LSTM

Let's preview the end result. We want to take a sentence and output the part of speech for each word in that sentence. Something like this:

**Code**

```python
new_sentence = "I is a teeth"

...

predictions = model(new_sentence)

...
```

**Output**

```text
I     => Noun
is    => Verb
a     => Determiner
teeth => Noun
```

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

from random import shuffle
from typing import Dict, List

## The dataset

Our dataset includes a number of labeled sentences.

In [2]:
# We could use a larger dataset...
# Please add sentences here: https://docs.google.com/spreadsheets/d/1HJmlehaYhGWclDo1t0k6i1VHxN15zr8ZmJj7Rf_VEaI/edit#gid=865244837
# You can use this to double check yourself: https://parts-of-speech.info/

# Tags:
#  D - determiner
#  N - noun
#  V - verb
dataset = [
    ("The dog ate the apple".lower().split(), ["D", "N", "V", "D", "N"]),
    ("Everybody read that book".lower().split(), ["N", "V", "D", "N"]),
    ("Trapp is sleeping".lower().split(), ["N", "V", "V"]),
    ("Everybody ate the apple".lower().split(), ["N", "V", "D", "N"]),
    ("Cats are good".lower().split(), ["N", "V", "D"]),
    (
        "Dogs are not as good as cats".lower().split(),
        ["N", "V", "D", "D", "D", "D", "N"],
    ),
    ("Dogs eat dog food".lower().split(), ["N", "V", "N", "N"]),
    ("Watermelon is the best food".lower().split(), ["N", "V", "D", "D", "N"]),
    ("I want a milkshake right now".lower().split(), ["N", "V", "D", "N", "D", "D"]),
    ("I have too much homework".lower().split(), ["N", "V", "D", "D", "N"]),
    ("Zoom won't work".lower().split(), ["N", "D", "V"]),
    ("Pie also sounds good".lower().split(), ["N", "D", "V", "D"]),
    (
        "The college is having the department fair this Friday".lower().split(),
        ["D", "N", "V", "V", "D", "N", "N", "D", "N"],
    ),
    ("Research interests span many areas".lower().split(), ["N", "N", "V", "D", "N"]),
    ("Alex is finishing his Ph.D".lower().split(), ["N", "V", "V", "D", "N"]),
    ("She is the author".lower().split(), ["N", "V", "D", "N"]),
    (
        "It is almost the end of the semester".lower().split(),
        ["N", "V", "D", "D", "N", "D", "D", "N"],
    ),
    ("Blue is a color".lower().split(), ["N", "V", "D", "N"]),
    ("They wrote a book".lower().split(), ["N", "V", "D", "N"]),
    ("The syrup covers the pancake".lower().split(), ["D", "N", "V", "D", "N"]),
    ("Harrison has these teeth".lower().split(), ["N", "V", "D", "N"]),
    ("The numbers are fractions".lower().split(), ["D", "N", "V", "N"]),
    ("Yesterday happened".lower().split(), ["N", "V"]),
    ("Caramel is sweet".lower().split(), ["N", "V", "D"]),
    ("Computers use electricity".lower().split(), ["N", "V", "N"]),
    ("Gold is a valuable thing".lower().split(), ["N", "V", "D", "D", "N"]),
    ("This extension cord helps".lower().split(), ["D", "D", "N", "V"]),
    ("It works on my machine".lower().split(), ["N", "V", "D", "D", "N"]),
    ("We have the words".lower().split(), ["N", "V", "D", "N"]),
    ("Trapp is a dog".lower().split(), ["N", "V", "D", "N"]),
    ("This is a computer".lower().split(), ["N", "V", "D", "N"]),
]

## Preparing data for use as NN input

We can't pass a list of plain text words and tags to a NN. We need to convert them to a more appropriate format.

We'll start by creating a unique index for each word and tag.

In [3]:
word_indices = {}
total_words = 0

tag_indices = {}
tag_list = []
total_tags = 0

for sentence, tags in dataset:
    assert len(sentence) == len(tags)
    total_words += len(sentence)
    for word in sentence:
        if word not in word_indices:
            word_indices[word] = len(word_indices)

    total_tags += len(tags)
    for tag in tags:
        if tag not in tag_indices:
            tag_indices[tag] = len(tag_indices)
            tag_list.append(tag)

In [4]:
print("       Vocabulary Indices")
print("-------------------------------")

for word, index in word_indices.items():
    print(f"{word:>14} => {index:>2}")

print("\nTotal number of words:", total_words)
print("Number of unique words:", len(word_indices))

       Vocabulary Indices
-------------------------------
           the =>  0
           dog =>  1
           ate =>  2
         apple =>  3
     everybody =>  4
          read =>  5
          that =>  6
          book =>  7
         trapp =>  8
            is =>  9
      sleeping => 10
          cats => 11
           are => 12
          good => 13
          dogs => 14
           not => 15
            as => 16
           eat => 17
          food => 18
    watermelon => 19
          best => 20
             i => 21
          want => 22
             a => 23
     milkshake => 24
         right => 25
           now => 26
          have => 27
           too => 28
          much => 29
      homework => 30
          zoom => 31
         won't => 32
          work => 33
           pie => 34
          also => 35
        sounds => 36
       college => 37
        having => 38
    department => 39
          fair => 40
          this => 41
        friday => 42
      research => 43
     interests => 

In [5]:
print("Tag Indices")
print("-----------")

for tag, index in tag_indices.items():
    print(f"  {tag} => {index}")

print("\nTotal number of tags:", total_tags)
print("Number of unique tags:", len(tag_indices))

Tag Indices
-----------
  D => 0
  N => 1
  V => 2

Total number of tags: 139
Number of unique tags: 3
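
The index-building loop above can be sketched on a much smaller example to make the mapping concrete. The two-sentence `toy_dataset` and the `toy_*` names are made up for illustration. The list of tags acts as the inverse of the tag dictionary, which is what later lets us turn a predicted index back into a tag.

```python
# A toy dataset (hypothetical, for illustration only)
toy_dataset = [
    ("the dog ate".split(), ["D", "N", "V"]),
    ("the dog is sleeping".split(), ["D", "N", "V", "V"]),
]

toy_word_indices = {}
toy_tag_indices = {}
toy_tag_list = []

for sentence, tags in toy_dataset:
    for word in sentence:
        # setdefault only assigns a new index the first time a word is seen
        toy_word_indices.setdefault(word, len(toy_word_indices))
    for tag in tags:
        if tag not in toy_tag_indices:
            toy_tag_indices[tag] = len(toy_tag_indices)
            toy_tag_list.append(tag)

# Encode a sentence and its tags, then decode the tags again
encoded_words = [toy_word_indices[w] for w in ["the", "dog", "is", "sleeping"]]
encoded_tags = [toy_tag_indices[t] for t in ["D", "N", "V", "V"]]
decoded_tags = [toy_tag_list[i] for i in encoded_tags]

print(toy_word_indices)
print(encoded_words, encoded_tags, decoded_tags)
```

Repeated words such as "the" and "dog" keep the index they were given the first time they appeared.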


## Letting the NN parameterize words

Once we have a unique identifier for each word, it is useful to start our NN with an [embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding) layer. This layer converts an index into a vector of values.

You can think of each value as indicating something about the word. For example, maybe the first value indicates how much a word conveys happiness vs sadness. Of course, the NN can learn any attributes and it is not limited to things like happy/sad, masculine/feminine, etc.

**Creating an embedding layer**. An embedding layer is created by telling it the size of the vocabulary (the number of words) and an embedding dimension (how many values to use to represent a word).

**Embedding layer input and output**. An embedding layer takes a tensor of indices and returns a matrix with one row (of `embedding_size` values) per index.

In [6]:
def convert_to_indices_tensor(
    input_sequence: List[str], indices_dict: Dict[str, int]
) -> torch.Tensor:
    """Convert a list of words (or tags) into a tensor of their indices."""
    indices = [indices_dict[w] for w in input_sequence]
    return torch.tensor(indices, dtype=torch.long)

In [7]:
vocabulary_count = len(word_indices)  # Depends on the dataset
embedding_size = 6  # Hyperparameter

example_sentence = ["dog", "ate"]
example_sentence_indices = convert_to_indices_tensor(example_sentence, word_indices)

embedding_layer = nn.Embedding(vocabulary_count, embedding_size)
embeddings = embedding_layer(example_sentence_indices)

print("The embedding layer output contains 'embedding_size' values for each word")
print(embeddings.shape)
print(embeddings.data)

The embedding layer output contains 'embedding_size' values for each word
torch.Size([2, 6])
tensor([[-0.3101, -2.0614,  0.6138,  0.6309, -2.1374, -0.3048],
        [-1.9204,  1.1053, -0.4639, -0.1313,  0.1203, -0.5256]])
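
One way to see what the embedding layer is doing is to treat it as a lookup table: each index selects one row of a learnable weight matrix. The sketch below uses a small, made-up layer (`toy_embedding`, 5 words, 3 values per word) rather than the one from the cell above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # Only so that repeated runs give the same values

# Hypothetical sizes, chosen only to keep the printout small
toy_embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)
toy_indices = torch.tensor([1, 4, 1], dtype=torch.long)

looked_up = toy_embedding(toy_indices)
# Indexing the weight matrix directly selects the same rows
rows = toy_embedding.weight[toy_indices]

print(looked_up.shape)  # One row of 3 values per index
print(torch.equal(looked_up, rows))
# The same index always maps to the same vector
print(torch.equal(looked_up[0], looked_up[2]))
```

Because the rows are parameters, training adjusts them just like any other weights in the network.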


## Adding an LSTM layer

The [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM) layer is in charge of processing embeddings such that the network can output the correct classification. Since this is a recurrent layer, it will take into account past words when it creates an output for the current word.

**Creating an LSTM layer**. To create an LSTM you need to tell it the size of its input (the size of an embedding) and the size of its hidden state, which is also the size of its internal cell state.

**LSTM layer input and output**. An LSTM takes a sequence of embeddings (and optionally an initial hidden and cell state) and outputs a value for each word as well as the final hidden and cell states.

If you read the linked LSTM documentation you will see that, by default (`batch_first=False`), it requires input in this format: (seq_len, batch, input_size).

As you can see above, our embedding layer outputs something that is (seq_len, input_size). So, we need to add a dimension in the middle.

In [8]:
unflatten_layer = nn.Unflatten(1, (1, embedding_size))
reshaped_embeddings = unflatten_layer(embeddings)
print(reshaped_embeddings.shape)
print(reshaped_embeddings.data)

torch.Size([2, 1, 6])
tensor([[[-0.3101, -2.0614,  0.6138,  0.6309, -2.1374, -0.3048]],

        [[-1.9204,  1.1053, -0.4639, -0.1313,  0.1203, -0.5256]]])
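
As a side note, `nn.Unflatten` is convenient here because it is a layer and can be placed inside `nn.Sequential`. Outside of a layer stack, the same batch dimension could be added with `Tensor.unsqueeze`. The sketch below compares the two on a random stand-in tensor (`toy_embeddings` is hypothetical, not the output printed above).

```python
import torch
import torch.nn as nn

toy_embeddings = torch.randn(2, 6)  # (seq_len, input_size)

# Split dimension 1 (size 6) into (1, 6), giving (seq_len, 1, input_size)
via_unflatten = nn.Unflatten(1, (1, 6))(toy_embeddings)
# Insert a new dimension of size 1 at position 1
via_unsqueeze = toy_embeddings.unsqueeze(1)

print(via_unflatten.shape, via_unsqueeze.shape)
print(torch.equal(via_unflatten, via_unsqueeze))
```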


In [9]:
lstm_state_size = 4  # Hyperparameter

lstm_layer = nn.LSTM(embedding_size, lstm_state_size)

# We can ignore the hidden and cell state outputs
lstm_output, (h_T, C_T) = lstm_layer(reshaped_embeddings)
print(lstm_output.shape)
print(lstm_output.data)

torch.Size([2, 1, 4])
tensor([[[ 0.2159,  0.0386, -0.2349,  0.0862]],

        [[ 0.2722, -0.0091, -0.4316,  0.0084]]])
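
To see how the per-word outputs relate to the hidden state that we are ignoring, here is a minimal sketch with a made-up, single-layer LSTM (`toy_lstm`) and random input. For a single-layer, one-directional LSTM, the output for the last word is the same as the final hidden state.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 6 input values per word, 4 hidden values
toy_lstm = nn.LSTM(input_size=6, hidden_size=4)
toy_input = torch.randn(3, 1, 6)  # (seq_len, batch, input_size)

with torch.no_grad():
    output, (h_n, c_n) = toy_lstm(toy_input)

print(output.shape)  # (seq_len, batch, hidden_size)
print(h_n.shape, c_n.shape)  # (num_layers, batch, hidden_size)
# The last per-word output matches the final hidden state
print(torch.allclose(output[-1], h_n[0]))
```

Since we need a prediction for every word, the per-word `output` is the part we keep.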


## Classifying the LSTM output

We can now add a fully connected, [linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear) layer to our NN to learn the correct part of speech (classification).

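**Creating a linear layer**. We create a linear layer by specifying the number of input features (here, the LSTM state size) and the number of output features (one neuron per tag).

**Linear layer input and output**. The input is expected to have shape (number_of_words, input_size), and the output has shape (number_of_words, output_size): one score per tag for each word. Below, we first flatten the LSTM output to drop the batch dimension, so that there is one row per word.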

In [10]:
flatten_layer = nn.Flatten()
reshaped_lstm_output = flatten_layer(lstm_output)
print(reshaped_lstm_output.shape)
print(reshaped_lstm_output.data)

torch.Size([2, 4])
tensor([[ 0.2159,  0.0386, -0.2349,  0.0862],
        [ 0.2722, -0.0091, -0.4316,  0.0084]])


In [11]:
tag_count = len(tag_list)

linear = nn.Linear(lstm_state_size, tag_count)
linear_out = linear(reshaped_lstm_output)
print(linear_out.shape)
print(linear_out.data)

torch.Size([2, 3])
tensor([[-0.2860,  0.6836,  0.0550],
        [-0.2402,  0.7779,  0.0714]])
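
The linear layer outputs raw scores, not tags. A minimal sketch of turning those scores into tags follows. The scores are copied from the output above, and `toy_tag_list` assumes the tag order shown earlier (D, N, V). Since this layer is still untrained, the predictions are not meaningful yet.

```python
import torch

toy_tag_list = ["D", "N", "V"]  # Assumed to match the tag indices above

# Scores copied from the linear layer output (one row per word)
scores = torch.tensor([[-0.2860, 0.6836, 0.0550],
                       [-0.2402, 0.7779, 0.0714]])

# Softmax turns each row of scores into probabilities that sum to 1
probabilities = torch.softmax(scores, dim=1)
# The predicted tag is the one with the highest score in each row
predicted_indices = scores.argmax(dim=1)
predicted_tags = [toy_tag_list[i] for i in predicted_indices.tolist()]

print(probabilities.sum(dim=1))
print(predicted_tags)
```

Taking the argmax of the raw scores gives the same tag as taking the argmax of the probabilities, so the softmax is only needed when you want to inspect the probabilities themselves.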


# Training an LSTM model

In [12]:
# Hyperparameters
valid_percent = 0.2

embedding_size = 6
lstm_state_size = 6

learning_rate = 0.1
num_epochs = 300

## Creating training and validation datasets

In [13]:
# Dataset values
N = len(dataset)
vocab_count = len(word_indices)
tag_count = len(tag_indices)

# Shuffle the data so that we can split the dataset randomly
shuffle(dataset)

split_point = int(N * valid_percent)
valid_dataset = dataset[:split_point]
train_dataset = dataset[split_point:]

len(valid_dataset), len(train_dataset)

(6, 25)
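
Note that `shuffle(dataset)` shuffles in place and is unseeded, so each run produces a different split and therefore a different validation accuracy. If a repeatable split is wanted, one option is a seeded generator, sketched below. The stand-in `toy_dataset` and the seed value are arbitrary choices for illustration.

```python
from random import Random

toy_dataset = list(range(31))  # Stand-in for 31 labeled sentences
valid_percent = 0.2

rng = Random(0)  # A private, seeded generator; the seed value is arbitrary
shuffled = toy_dataset[:]  # Copy so the original order is preserved
rng.shuffle(shuffled)

split_point = int(len(shuffled) * valid_percent)
valid_split = shuffled[:split_point]
train_split = shuffled[split_point:]

print(len(valid_split), len(train_split))
```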

## Creating the LSTM model

If you followed the steps above, you might notice that there are some LSTM outputs that we need to ignore. One way to do that is to create a new LSTM layer that simply ignores the unneeded output. That is what `FlatLSTM` does below.

In [14]:
class FlatLSTM(nn.Module):
    """An LSTM layer that ignores the current hidden and cell states."""
    def __init__(self, in_dim, state_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, state_dim)

    def forward(self, x):
        lstm_output, _ = self.lstm(x)
        return lstm_output


model = nn.Sequential(
    nn.Embedding(vocab_count, embedding_size),
    nn.Unflatten(1, (1, embedding_size)),
    FlatLSTM(embedding_size, lstm_state_size),
    nn.Flatten(),
    nn.Linear(lstm_state_size, tag_count),
)
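
The reason for the wrapper can be sketched in isolation: `nn.LSTM` returns a tuple, and `nn.Sequential` would pass that tuple straight to the next layer. The example below uses made-up sizes and a hypothetical `OutputOnlyLSTM` class that plays the same role as `FlatLSTM`, then checks that the full stack produces one row of tag scores per word.

```python
import torch
import torch.nn as nn


class OutputOnlyLSTM(nn.Module):
    """Hypothetical stand-in for FlatLSTM: returns only the per-word outputs."""

    def __init__(self, in_dim, state_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, state_dim)

    def forward(self, x):
        output, _ = self.lstm(x)
        return output


# Made-up sizes for illustration
toy_vocab_count, toy_embedding_size, toy_state_size, toy_tag_count = 10, 6, 6, 3

# nn.LSTM on its own returns a tuple: (output, (h_n, c_n))
raw_result = nn.LSTM(toy_embedding_size, toy_state_size)(
    torch.randn(4, 1, toy_embedding_size)
)
print(type(raw_result).__name__, len(raw_result))

toy_model = nn.Sequential(
    nn.Embedding(toy_vocab_count, toy_embedding_size),  # (4,) -> (4, 6)
    nn.Unflatten(1, (1, toy_embedding_size)),  # -> (4, 1, 6)
    OutputOnlyLSTM(toy_embedding_size, toy_state_size),  # -> (4, 1, 6)
    nn.Flatten(),  # -> (4, 6)
    nn.Linear(toy_state_size, toy_tag_count),  # -> (4, 3)
)

toy_sentence = torch.tensor([0, 3, 7, 2], dtype=torch.long)
with torch.no_grad():
    toy_scores = toy_model(toy_sentence)

print(toy_scores.shape)  # One row of tag scores per word
```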

## Training

In [15]:
def compute_accuracy(dataset):
    """A helper function for computing accuracy on the given dataset."""
    total_words = 0
    total_correct = 0

    with torch.no_grad():
        for sentence, tags in dataset:
            sentence_indices = convert_to_indices_tensor(sentence, word_indices)
            tag_scores = model(sentence_indices)
            predictions = tag_scores.argmax(dim=1)
            total_words += len(sentence)
            total_correct += sum(t == tag_list[p] for t, p in zip(tags, predictions))

    return total_correct / total_words
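
The counting step in the helper above can also be written with tensor operations. This is only an alternative sketch on made-up scores (`tag_scores`, `true_tags`, and `toy_tag_indices` are hypothetical), not a change to the helper.

```python
import torch

toy_tag_indices = {"D": 0, "N": 1, "V": 2}
true_tags = ["N", "V", "D", "N"]

# Made-up scores: one row per word, one column per tag
tag_scores = torch.tensor([[0.1, 0.8, 0.1],
                           [0.2, 0.3, 0.5],
                           [0.1, 0.7, 0.2],
                           [0.0, 0.9, 0.1]])

predictions = tag_scores.argmax(dim=1)
targets = torch.tensor([toy_tag_indices[t] for t in true_tags])

# Compare predicted and true indices element by element, then count matches
correct = (predictions == targets).sum().item()
accuracy = correct / len(true_tags)

print(predictions.tolist(), targets.tolist())
print(f"{correct} of {len(true_tags)} correct ({accuracy * 100:.2f}%)")
```

Here the third word is the only miss: the highest score is in the N column, but the true tag is D.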

In [16]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

accuracy = compute_accuracy(valid_dataset)
print(f"Validation accuracy before training : {accuracy * 100:.2f}%")

for epoch in range(num_epochs):
    # Shuffle the data for each epoch (stochastic gradient descent)
    shuffle(train_dataset)
    
    for sentence, tags in train_dataset:

        model.zero_grad()

        sentence = convert_to_indices_tensor(sentence, word_indices)
        tags = convert_to_indices_tensor(tags, tag_indices)
        
        tag_scores = model(sentence)
        
        loss = criterion(tag_scores, tags)
        
        loss.backward()
        optimizer.step()

accuracy = compute_accuracy(valid_dataset)
print(f"Validation accuracy after training  : {accuracy * 100:.2f}%")

Validation accuracy before training : 25.00%
Validation accuracy after training  : 54.17%
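
It can help to see what `nn.CrossEntropyLoss` computes from the tag scores during training. With its default settings it applies a log-softmax to each row of scores and averages the negative log-probability of the correct tag. The sketch below checks this on made-up inputs; the `targets` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One row of raw scores per word (columns: D, N, V)
scores = torch.tensor([[-0.2860, 0.6836, 0.0550],
                       [-0.2402, 0.7779, 0.0714]])
targets = torch.tensor([1, 2])  # Hypothetical true tags: N, V

loss = nn.CrossEntropyLoss()(scores, targets)

log_probs = F.log_softmax(scores, dim=1)
# Pick out the log-probability of the correct tag for each word
manual = -log_probs[torch.arange(len(targets)), targets].mean()

print(loss.item(), manual.item())
print(torch.allclose(loss, manual))
```

This is why the model does not need its own softmax layer: the loss function works directly on the raw scores.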


## Examining results

Here we look at all words in the full dataset (training and validation sentences alike) that are misclassified by the model.

In [17]:
print("\nMis-predictions after training on entire dataset")
header = "Word".center(14) + " | True Tag | Prediction"
print(header)
print("-" * len(header))

with torch.no_grad():
    for sentence, tags in dataset:
        sentence_indices = convert_to_indices_tensor(sentence, word_indices)
        tag_scores = model(sentence_indices)
        predictions = tag_scores.argmax(dim=1)
        for word, tag, pred in zip(sentence, tags, predictions):
            if tag != tag_list[pred]:
                print(f"{word:>14} |     {tag}    |    {tag_list[pred]}")


Mis-predictions after training on entire dataset
     Word      | True Tag | Prediction
--------------------------------------
      research |     N    |    D
          span |     V    |    N
          many |     D    |    V
         areas |     N    |    V
           has |     V    |    N
         these |     D    |    N
         teeth |     N    |    D
         sweet |     D    |    V
     fractions |     N    |    V
            we |     N    |    V
         words |     N    |    D
          this |     D    |    N


## Using the model for inference

In [18]:
new_sentence = "I is a teeth"

# Convert sentence to lowercase words
sentence = new_sentence.lower().split()

# Check that each word is in our vocabulary
for word in sentence:
    assert word in word_indices, f"Unknown word: {word}"

# Convert input to a tensor
sentence = convert_to_indices_tensor(sentence, word_indices)

# Compute prediction (gradients are not needed for inference)
with torch.no_grad():
    predictions = model(sentence)
predictions = predictions.argmax(dim=1)

# Print results
for word, tag in zip(new_sentence.split(), predictions):
    print(word, "=>", tag_list[tag.item()])

I => N
is => V
a => D
teeth => N
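
The vocabulary check above stops at the first unknown word. A gentler variant, sketched below, reports all unknown words at once. The helper `find_unknown_words` and the tiny `toy_vocabulary` are hypothetical. In the notebook, the vocabulary would be `word_indices`.

```python
def find_unknown_words(sentence, vocabulary):
    """Return the words in the sentence that are missing from the vocabulary."""
    return [word for word in sentence.lower().split() if word not in vocabulary]


# A tiny stand-in vocabulary (hypothetical)
toy_vocabulary = {"i": 0, "is": 1, "a": 2, "teeth": 3}

print(find_unknown_words("I is a teeth", toy_vocabulary))
print(find_unknown_words("I is a giraffe", toy_vocabulary))
```

The model can only tag words it has an index for, so any word reported here would need to be added to the dataset before it could be used for inference.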
