# LSTM for Part-of-Speech Tagging

In this section, we will use an LSTM to predict part-of-speech tags for words. What exactly is part-of-speech tagging?

Part of speech tagging is the process of determining the *category* of a word from the words in its surrounding context. You can think of part of speech tagging as a way to go from words to their [Mad Libs](https://en.wikipedia.org/wiki/Mad_Libs#Format) categories. Mad Libs are incomplete short stories that have many words replaced by blanks. Each blank has a specified word-category, such as `"noun"`, `"verb"`, `"adjective"`, and so on. One player asks another to fill in these blanks (prompted only by the word-category) until they have created a complete, silly story of their own. Here is an example of such categories:

```text
Today, you'll be learning how to [verb]. It may be a [adjective] process, but I think it will be rewarding! 
If you want to take a break you should [verb] and treat yourself to some [plural noun].
```
... and a set of possible words that fall into those categories:
```text
Today, you'll be learning how to code. It may be a challenging process, but I think it will be rewarding! 
If you want to take a break you should stretch and treat yourself to some puppies.
```


### Why Tag Speech?

Tagging parts of speech is often used to help disambiguate natural language phrases because it can be done quickly and with high accuracy. It can help answer: what subject is someone talking about? Tagging can be used for many NLP tasks like creating new sentences using a sequence of tags that make sense together, filling in a Mad Libs style game, and determining correct pronunciation during speech synthesis. It is also used in information retrieval and for word disambiguation (e.g., determining when someone says *right* like the direction versus *right* like "that's right!").

---


### Preparing the Data

Neural networks operate on numbers rather than on raw words, so our first step is to prepare our training data and map each word to a numerical value.

We start by creating a small set of training data: a few simple sentences, each broken down into a list of words and their corresponding word-tags. The sentences are converted to lowercase with `lower()` and then divided into separate words with `split()`, which splits a string on whitespace. The code cells below apply the same idea to a CSV file of Yelp reviews: `tokenize` lowercases each review, replaces punctuation and digits with spaces, and splits the result with the spaCy tokenizer.

#### Words to indices

Then, from this training data, we create a dictionary that maps each unique word in our vocabulary to a numerical value, a unique index `idx`. We do the same for each word-tag; for example, a noun will be represented by the number `1`.
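As a minimal sketch of the mapping described above, the snippet below builds both dictionaries for two toy sentences. The sentences and the names `training_data`, `word2idx`, and `tag2idx` are made up for illustration and are not the data used in the code cells that follow.

```python
# Toy sentences and tags, invented for illustration only.
training_data = [
    ("The cat ate the cheese".lower().split(), ["DET", "NN", "V", "DET", "NN"]),
    ("She read that book".lower().split(), ["NN", "V", "DET", "NN"]),
]

# Assign each unique word the next free index, in order of first appearance.
word2idx = {}
for sentence, _tags in training_data:
    for word in sentence:
        if word not in word2idx:
            word2idx[word] = len(word2idx)

# Do the same for the tags; here a noun ("NN") is represented by 1.
tag2idx = {"DET": 0, "NN": 1, "V": 2}

# Encode the first sentence and its tags as lists of integers.
encoded_words = [word2idx[w] for w in training_data[0][0]]
encoded_tags = [tag2idx[t] for t in training_data[0][1]]

print(word2idx)
print(encoded_words, encoded_tags)
```

Because "the" is lowercased before the lookup, both of its occurrences in the first sentence map to the same index.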

In [1]:
# import resources
import pandas as pd
import torch
import numpy as np

import torch.nn as nn
import re
import spacy
import torch.optim as optim
from collections import Counter
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
import string
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from sklearn.metrics import mean_squared_error
import time

from google.colab import drive
drive.mount('/content/drive')

%matplotlib inline

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
path_to_test = "/content/drive/MyDrive/NLP/yelp_review_full_csv/test.csv"
path_to_train = "/content/drive/MyDrive/NLP/yelp_review_full_csv/train.csv"

In [3]:
reviews = pd.read_csv(path_to_train, header=None, names=["rating", "review"])
reviews_test = pd.read_csv(path_to_test, header=None, names=["rating", "review"])
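# spaCy 3 removed the 'en' shortcut, so load the small English model by its full name (assumed installed)
tok = spacy.load('en_core_web_sm')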

In [4]:
import collections
import re
import string


def tokenize (text):
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]') # remove punctuation and numbers
    nopunct = regex.sub(" ", text.lower())
    return [token.text for token in tok.tokenizer(nopunct)]


def encode_sentence(text, vocab2index, N=160):
    tokenized = tokenize(text)
    encoded = np.zeros(N, dtype=int)
    enc1 = np.array([vocab2index.get(word, vocab2index["UNK"]) for word in tokenized])
    length = min(N, len(enc1))
    encoded[:length] = enc1[:length]
    return encoded, length


def preprocessing_df(reviews, reviews_test):
    zero_numbering = {1:0, 2:1, 3:2, 4:3, 5:4}
    reviews['rating'] = reviews['rating'].apply(lambda x: zero_numbering[x])
    reviews_test["rating"] = reviews_test['rating'].apply(lambda x: zero_numbering[x])

    vocab2index, words = vocabulary(reviews)

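    # keep the (encoded, length) tuple as-is: wrapping it in np.array builds a ragged
    # array, which recent NumPy versions reject
    reviews['encoded'] = reviews['review'].apply(lambda x: encode_sentence(x, vocab2index))
    reviews_test['encoded'] = reviews_test['review'].apply(lambda x: encode_sentence(x, vocab2index))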
    return reviews, reviews_test, vocab2index, words


def preprocessing_words(df):
    counts = collections.Counter()
    for review in df['review']:
        counts.update(tokenize(review))

    for word in list(counts):
        if counts[word] <= 10 or len(word.strip()) <= 2:
            del counts[word]

    return counts


def vocabulary(df):
    vocab2index = {"":0, "UNK":1}
    words = ["", "UNK"]

    counts = preprocessing_words(df)
    for word in sorted(counts, key=lambda x: counts[x], reverse=True )[:6500]:
        vocab2index[word] = len(words)
        words.append(word)

    return vocab2index, words



In [5]:
reviews, reviews_test, vocab2index, words = preprocessing_df(reviews, reviews_test)

Next, print out the created dictionary to see the words and their numerical values! 

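You should see each word in the vocabulary and its index value. Note that the word "the" appears only once because the vocabulary includes only *unique* words. The vocabulary is also smaller than the full training set: `preprocessing_words` drops words that occur 10 times or fewer or that are 2 characters or shorter, and `vocabulary` keeps at most the 6,500 most frequent of the remaining words.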

In [6]:
# print out the created dictionary
print(vocab2index)



In [7]:
# check out what encode_sentence does for one of our training reviews:
print(reviews["review"][0], reviews["encoded"][0][0])
print(vocab2index)

dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank. [   1    1    1 1351  165    1  227    5    1    1 1066    1    1    1
    1   60    3  563    1  700    1  296  141    1    1    1    1  102
    1   30    1  944  164 3480    1    1    1    1   11    1  239 1461
 2337    1    1    1   42    1 1610   13 1050    1    1    1   26 1433
    1  758  139 1684    3    8  190 3431    1    3    8   36   28    1
    1  118    1  296  303    1  118  217   79    1   40    1   41   48
    1    8  190    1    1    1  496   2

---
## Creating the Model

Our model will assume a few things:
1. Our input is broken down into a sequence of words, so a sentence will be [w1, w2, ...]
2. These words come from a larger list of words that we already know (a vocabulary)
3. We have a limited set of tags, `[NN, V, DET]`, which mean: a noun, a verb, and a determiner (words like "the" or "that"), respectively
4. We want to predict\* a tag for each input word

\* To do the prediction, we will pass an LSTM over a test sentence and apply a softmax function to the hidden state of the LSTM; the result is a vector of tag scores from which we can get the predicted tag for a word based on the *maximum* value in this distribution of tag scores. 

Mathematically, we can represent any tag prediction $\hat{y}_i$ as: 

\begin{align}\hat{y}_i = \text{argmax}_j \  (\log \text{Softmax}(Ah_i + b))_j\end{align}

Here, $A$ is a learned weight matrix, $b$ is a learned bias term, and $h_i$ is the hidden state at timestep $i$.
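To make the formula concrete, here is a small sketch that evaluates it on random tensors. The sizes and the random values of `h`, `A`, and `b` are assumptions chosen for illustration; in a real model `h` would come from the LSTM and `A` and `b` would be learned.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes: 5 timesteps, hidden size 4, 3 tags.
seq_len, hidden_dim, num_tags = 5, 4, 3

h = torch.randn(seq_len, hidden_dim)   # stand-in for the hidden states h_i
A = torch.randn(num_tags, hidden_dim)  # stand-in for the learned weight matrix
b = torch.randn(num_tags)              # stand-in for the learned bias

# log Softmax(A h_i + b), computed for every timestep at once.
tag_scores = F.log_softmax(h @ A.T + b, dim=1)  # shape: (seq_len, num_tags)

# The predicted tag for each timestep is the index of the largest score.
y_hat = tag_scores.argmax(dim=1)

print(tag_scores.shape)
print(y_hat)
```

Since the logarithm and the softmax are both monotonic, taking the argmax of the raw scores $Ah_i + b$ would select the same tags.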


### Word embeddings

We know that an LSTM takes in an expected input size and hidden_dim, but sentences are rarely of a consistent size, so how can we define the input of our LSTM?

Well, at the very start of this net, we'll create an `Embedding` layer that takes in the size of our vocabulary and returns a vector of a specified size, `embedding_dim`, for each word in an input sequence of words. It's important that this be the first layer in this net. You can read more about this embedding layer in [the PyTorch documentation](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#word-embeddings-in-pytorch).
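The following sketch shows what such an embedding layer does to a small padded batch. The vocabulary size, embedding size, and word indices are arbitrary values assumed for the example.

```python
import torch
import torch.nn as nn

# Illustrative sizes: a 10-word vocabulary and 6-dimensional word vectors.
vocab_size, embedding_dim = 10, 6

# padding_idx=0 reserves index 0 for padding; its vector is initialized to zeros.
embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

# Two sequences of word indices; the first is padded with 0 to length 5.
batch = torch.tensor([[4, 2, 7, 0, 0],
                      [1, 3, 5, 8, 2]])

# Each index is replaced by its embedding vector.
vectors = embedding(batch)

print(vectors.shape)  # (batch, seq_len, embedding_dim)
```

Every word becomes a vector of the same size, so the LSTM input size can be fixed to `embedding_dim` regardless of how long each sentence is.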

Pictured below is the expected architecture for this tagger model.

<img src='images/speech_tagger.png' height=60% width=60% >


In [24]:
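class LSTM(torch.nn.Module) :
    def __init__(self, vocab_size, embedding_dim, hidden_dim) :
        super().__init__()
        self.hidden_dim = hidden_dim
        self.dropout = nn.Dropout(0.35)
        # index 0 is the padding token; its embedding is zero and is not updated
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        # 5 output scores, one per zero-based rating class (0-4)
        self.linear = nn.Linear(hidden_dim, 5)

    def forward(self, x, s):
        # x: (batch, seq_len) word indices; s: true (unpadded) length of each sequence
        x = self.embeddings(x)
        x = self.dropout(x)
        # pack the batch so the LSTM skips the padded positions
        x_pack = pack_padded_sequence(x, s, batch_first=True, enforce_sorted=False)
        out_pack, (ht, ct) = self.lstm(x_pack)
        # ht[-1] is the final hidden state of the last layer: (batch, hidden_dim)
        out = self.linear(ht[-1])
        return out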
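
The `forward` method above relies on `pack_padded_sequence` so that padded positions do not affect the final hidden state. The sketch below illustrates that behavior on random inputs; the sizes and lengths are assumptions chosen for the example, and the LSTM is untrained.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

# Illustrative sizes only.
embedding_dim, hidden_dim = 6, 4
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

# A batch of 2 already-embedded sequences padded to length 5.
x = torch.randn(2, 5, embedding_dim)
lengths = torch.tensor([3, 5])  # true lengths; kept on the CPU

# Packing tells the LSTM to stop at each sequence's true length.
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
_, (ht, ct) = lstm(packed)
last_hidden = ht[-1]  # (batch, hidden_dim), in the original batch order

# Check: running the first sequence alone, cut to its true length,
# should give the same final hidden state.
_, (ht_single, _) = lstm(x[:1, :3])
same = torch.allclose(last_hidden[0], ht_single[-1, 0], atol=1e-5)

print(last_hidden.shape, same)
```

Without packing, the final hidden state of the shorter sequence would also include the effect of the two padded timesteps.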

In [17]:
X_rev_train = list(reviews['encoded'])
y_rev_train = list(reviews['rating'])
X_rev_test = list(reviews_test['encoded'])
y_rev_test = list(reviews_test['rating'])


from sklearn.model_selection import train_test_split

X_train, X_o, y_train, y_o = train_test_split(X_rev_train,
                                              y_rev_train, 
                                              test_size=0.5,
                                              stratify=y_rev_train)

X_test, X_oo, y_test, y_oo = train_test_split(X_rev_test,
                                              y_rev_test,
                                              test_size=0.5,
                                              stratify=y_rev_test)




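class ReviewsDataset(Dataset):
    def __init__(self, X, Y):
        self.X = X  # list of (encoded review, length) pairs from encode_sentence
        self.y = Y  # list of zero-based ratings

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # returns (padded word indices, rating, true length)
        return torch.from_numpy(self.X[idx][0].astype(np.int32)), self.y[idx], self.X[idx][1]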


batch_size = 128
vocab_size = len(words)
train_dl = DataLoader(ReviewsDataset(X_train, y_train),
                      batch_size=batch_size, shuffle=True)
val_dl = DataLoader(ReviewsDataset(X_test, y_test),
                    batch_size=batch_size)


del X_o
del X_oo
del y_o
del y_oo

## Define how the model trains

To train the model, we have to instantiate it and define the loss and optimizers that we want to use.

First, we define the size of our word embeddings. The `EMBEDDING_DIM` defines the size of our word vectors for our simple vocabulary and training set; we will keep them small so we can see how the weights change as we train.

**Note: the embedding dimension for a complex dataset will usually be much larger, around 64, 128, or 256 dimensional.**


#### Loss and Optimization

Since our model outputs a vector of class scores, we need a loss that compares those scores with the true class. In tandem with a log-softmax layer, `NLLLoss` creates the kind of cross entropy loss that we typically use for analyzing a distribution of class scores. The training code below calls `F.cross_entropy` on the raw scores from the final linear layer, which combines the log-softmax and `NLLLoss` steps in a single call, and it uses the Adam optimizer; you are encouraged to play around with other optimizers!
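As a quick check of the relationship between these losses, the sketch below compares log-softmax followed by `NLLLoss` with `F.cross_entropy` applied to the same raw scores. The scores and targets are random values assumed for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Raw, unnormalized scores for 4 examples and 5 classes, plus their true classes.
logits = torch.randn(4, 5)
targets = torch.tensor([0, 4, 2, 1])

# Option 1: explicit log-softmax, then negative log-likelihood loss.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# Option 2: cross entropy applied directly to the raw scores.
ce = F.cross_entropy(logits, targets)

print(nll.item(), ce.item())
```

The two values agree, so a model that returns raw scores can be trained with `F.cross_entropy` without adding a softmax layer of its own.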

In [29]:
# the embedding dimension defines the size of our word vectors
# for our simple vocabulary and training set, we will keep these small
EMBEDDING_DIM = 50
HIDDEN_DIM = 75

# instantiate our model
model = LSTM(len(vocab2index), EMBEDDING_DIM, HIDDEN_DIM)

Just to check that our model has learned something, let's first look at the scores for a sample test sentence *before* our model is trained. Note that the test sentence *must* be made of words from our vocabulary otherwise its words cannot be turned into indices.

The scores should be tensors with one entry per output class. With the `nn.Linear(hidden_dim, 5)` layer defined above, the cell below prints a `[128, 5]` tensor: one row of five scores for each sequence in the batch.

For the test sentence, "The cheese loves the elephant", we know that this has the tags (DET, NN, V, DET, NN) or `[0, 1, 2, 0, 1]`, but our network does not yet know this. In fact, in this case, our model starts out with a hidden state of all zeroes and so all the scores and the predicted tags should be low, random, and about what you'd expect for a network that is not yet trained!

In [31]:
# see what the scores are before training
# element [i,j] of the output is the *score* for class j for sequence i in the batch.
# to check the initial predictions of our model, we don't need to train, so we use model.eval()
model.eval()

inputs = next(iter(train_dl))

tag_scores = model(inputs[0].long(), inputs[2])
print(tag_scores.shape, "Scores")
print(inputs[2].shape,inputs[1].shape)

# tag_scores holds a vector of scores for each input sequence in the batch
# to get the most likely class index, we grab the index with the maximum score!
# with the nn.Linear(hidden_dim, 5) output layer, these indices range from 0 to 4
_, predicted_tags = torch.max(tag_scores, 1)
print('\n')
print('Predicted tags: \n',predicted_tags)


torch.Size([128, 5]) Scores
torch.Size([128]) torch.Size([128])


Predicted tags: 
 tensor([4, 4, 0, 4, 4, 4, 4, 4, 0, 4, 4, 0, 4, 0, 4, 4, 4, 2, 1, 4, 0, 2, 2, 0,
        0, 4, 0, 0, 4, 0, 0, 4, 2, 4, 0, 2, 4, 4, 0, 1, 2, 4, 0, 0, 0, 0, 2, 4,
        1, 4, 0, 2, 4, 0, 0, 0, 0, 2, 4, 0, 0, 4, 0, 1, 4, 4, 4, 2, 2, 4, 0, 0,
        4, 2, 0, 2, 4, 4, 0, 1, 4, 4, 0, 4, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 2, 1,
        4, 4, 4, 4, 0, 4, 4, 0, 2, 0, 0, 4, 4, 4, 4, 2, 4, 4, 1, 4, 2, 0, 4, 4,
        4, 2, 1, 2, 4, 4, 4, 4])


---
## Train the Model

Loop through all our training data for multiple epochs (again we are using a small epoch value for this simple training data). This loop:

1. Prepares our model for training by zero-ing the gradients
2. Initializes the hidden state of our LSTM
3. Prepares our data for training
4. Runs a forward pass on our inputs to get tag_scores
5. Calculates the loss between tag_scores and the true tag
6. Updates the weights of our model using backpropagation

In this example, we print the average training loss and the validation metrics after every epoch; you should see the loss decrease over time.
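The numbered steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the data is random, `toy_model` is a plain linear layer standing in for the LSTM (so there is no hidden state to initialize), and plain SGD is used on the full batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up data: 64 examples with 8 features each, and 3 possible classes.
X = torch.randn(64, 8)
y = torch.randint(0, 3, (64,))

toy_model = nn.Linear(8, 3)  # stand-in for the LSTM tagger
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

losses = []
for epoch in range(20):
    toy_model.train()
    optimizer.zero_grad()              # zero the gradients
    scores = toy_model(X)              # forward pass to get class scores
    loss = F.cross_entropy(scores, y)  # loss between scores and true classes
    loss.backward()                    # backpropagation
    optimizer.step()                   # weight update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Comparing the first and last entries of `losses` shows the loss going down over the epochs.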

In [9]:
def train_model(model, epochs=10, lr=0.001):
    # only optimize parameters that require gradients
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr)
    for i in range(epochs):
        model.train()
        sum_loss = 0.0
        total = 0
        for x, y, l in train_dl:
            x = x.long().to(device)
            y = y.long().to(device)
            # l (sequence lengths) stays on the CPU, as pack_padded_sequence expects
            optimizer.zero_grad()
            y_pred = model(x, l)
            loss = F.cross_entropy(y_pred, y)
            loss.backward()
            optimizer.step()
            # weight the batch loss by batch size to get a per-example average
            sum_loss += loss.item()*y.shape[0]
            total += y.shape[0]
        val_loss, val_acc, val_rmse = validation_metrics(model, val_dl)
        print("train loss %.3f, val loss %.3f, val accuracy %.3f, and val rmse %.3f" % (sum_loss/total, val_loss, val_acc, val_rmse))

def validation_metrics (model, valid_dl):
    model.eval()
    correct = 0
    total = 0
    sum_loss = 0.0
    sum_rmse = 0.0
    # no gradients are needed for evaluation
    with torch.no_grad():
        for x, y, l in valid_dl:
            x = x.long().to(device)
            y = y.long().to(device)
            y_hat = model(x, l)
            loss = F.cross_entropy(y_hat, y)
            pred = torch.max(y_hat, 1)[1].cpu()
            correct += (pred == y.cpu()).float().sum().item()
            total += y.shape[0]
            sum_loss += loss.item()*y.shape[0]
            # per-batch RMSE between true and predicted ratings, weighted by batch size
            sum_rmse += np.sqrt(mean_squared_error(y.cpu().numpy(), pred.numpy()))*y.shape[0]
    return sum_loss/total, correct/total, sum_rmse/total

In [30]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
model.to(device)

train_model(model, epochs=8, lr=0.005)

cuda
train loss 1.132, val loss 0.976, val accuracy 0.572, and val rmse 0.913
train loss 0.984, val loss 0.956, val accuracy 0.583, and val rmse 0.860
train loss 0.954, val loss 0.938, val accuracy 0.590, and val rmse 0.856
train loss 0.937, val loss 0.936, val accuracy 0.591, and val rmse 0.858
train loss 0.925, val loss 0.920, val accuracy 0.597, and val rmse 0.850
train loss 0.915, val loss 0.922, val accuracy 0.597, and val rmse 0.844
train loss 0.908, val loss 0.913, val accuracy 0.602, and val rmse 0.841
train loss 0.901, val loss 0.911, val accuracy 0.602, and val rmse 0.829
train loss 0.897, val loss 0.910, val accuracy 0.600, and val rmse 0.823
train loss 0.892, val loss 0.929, val accuracy 0.596, and val rmse 0.841


In [31]:
train_model(model, epochs=2, lr=0.002)

train loss 0.874, val loss 0.908, val accuracy 0.604, and val rmse 0.826
train loss 0.867, val loss 0.904, val accuracy 0.606, and val rmse 0.825


**Note: increasing the size of the vocabulary should improve the scores, but training will take much longer. Training on the entire dataset, rather than on half of it, would also likely improve the result.**

## Testing

See how your model performs *after* training. Compare this output with the scores from before training, above.

Again, for the test sentence, "The cheese loves the elephant", we know that this has the tags (DET, NN, V, DET, NN) or `[0, 1, 2, 0, 1]`. Let's see if our model has learned to find these tags!
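One simple way to compare predictions with the goal values is to compute the fraction that match. The sketch below does this for a handful of hand-written scores; `example_scores` and `example_targets` are invented for illustration and are not outputs of the trained model.

```python
import torch

# Hand-written scores for 4 inputs and 3 classes, plus the true classes.
example_scores = torch.tensor([[2.0, 0.1, 0.3],
                               [0.2, 1.5, 0.1],
                               [0.3, 0.2, 0.9],
                               [0.1, 1.1, 0.4]])
example_targets = torch.tensor([0, 1, 2, 0])

# The predicted class is the index of the maximum score in each row.
_, predicted = torch.max(example_scores, 1)

# Accuracy is the fraction of predictions that equal the target.
accuracy = (predicted == example_targets).float().mean().item()

print(predicted, accuracy)
```

Here the first three predictions match their targets and the last one does not, which gives an accuracy of 0.75.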

In [34]:
inputs = next(iter(val_dl))

scores = model(inputs[0].long().to(device), inputs[2])


# scores holds a vector of class scores for each input sequence in the batch
# to get the most likely class index, we grab the index with the maximum score!
# these indices range from 0 to 4, matching the zero-based ratings in inputs[1]
_, predicted_tags = torch.max(scores, 1)
print('\n')
print('Predicted rating: \n',predicted_tags)
print('Goal rating: \n', inputs[1])



Predicted rating: 
 tensor([4, 1, 1, 3, 0, 2, 4, 1, 1, 2, 0, 0, 2, 2, 1, 3, 0, 0, 0, 4, 3, 3, 0, 0,
        0, 2, 2, 0, 4, 0, 2, 3, 4, 0, 1, 4, 3, 0, 0, 1, 1, 3, 1, 3, 3, 1, 1, 2,
        2, 3, 1, 3, 3, 0, 2, 4, 3, 3, 4, 1, 0, 4, 4, 4, 2, 0, 2, 0, 0, 1, 3, 4,
        3, 3, 0, 0, 3, 4, 4, 3, 2, 4, 4, 0, 3, 0, 0, 2, 0, 0, 1, 3, 2, 1, 4, 3,
        0, 1, 1, 4, 4, 0, 1, 3, 2, 1, 1, 3, 0, 2, 0, 2, 3, 3, 4, 0, 1, 3, 0, 0,
        1, 2, 1, 1, 1, 4, 2, 2], device='cuda:0')
Goal rating: 
 tensor([4, 1, 1, 3, 0, 2, 4, 1, 1, 3, 1, 0, 4, 2, 1, 2, 1, 0, 0, 3, 4, 1, 0, 0,
        0, 2, 3, 0, 4, 0, 3, 3, 4, 0, 1, 4, 3, 0, 0, 1, 1, 4, 1, 3, 3, 2, 1, 1,
        1, 3, 1, 3, 4, 0, 2, 4, 3, 4, 0, 2, 0, 4, 4, 4, 2, 0, 3, 0, 1, 1, 3, 4,
        4, 2, 0, 0, 4, 3, 4, 2, 3, 4, 4, 1, 3, 0, 0, 3, 0, 0, 1, 0, 2, 2, 4, 4,
        0, 0, 0, 3, 4, 0, 1, 2, 3, 2, 1, 4, 1, 3, 0, 0, 3, 3, 4, 0, 2, 2, 0, 0,
        1, 2, 1, 1, 1, 4, 0, 3])
