## Model 3: Long Short-Term Memory Network (LSTM)

This section provides a step-by-step tutorial on how to approach the text classification task using the LSTM model, which belongs to the family of Recurrent Neural Networks.

LSTM models handle long-term dependencies in text with an RNN architecture built around three gates: the input, forget, and output gates. Together, the gates decide which information is written to, kept in, and read out of the LSTM cell state.
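
To make the role of the gates concrete, the sketch below recomputes a single LSTM time step by hand and compares it with PyTorch's `nn.LSTMCell`. The toy sizes and the helper name `manual_lstm_step` are assumptions made for illustration; they are not part of the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size, batch = 4, 3, 2   # toy sizes, chosen only for illustration
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(batch, input_size)          # input at the current time step
h_prev = torch.zeros(batch, hidden_size)    # previous hidden state
c_prev = torch.zeros(batch, hidden_size)    # previous cell state


def manual_lstm_step(x, h_prev, c_prev, cell):
    """Hypothetical helper: redo one LSTM step using the cell's own weights."""
    gates = (x @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i, f, g, o = gates.chunk(4, dim=1)      # PyTorch stacks the blocks in this order
    i = torch.sigmoid(i)                    # input gate: how much new content to write
    f = torch.sigmoid(f)                    # forget gate: how much old cell state to keep
    g = torch.tanh(g)                       # candidate content for the cell state
    o = torch.sigmoid(o)                    # output gate: how much of the cell to expose
    c = f * c_prev + i * g                  # updated cell state
    h = o * torch.tanh(c)                   # updated hidden state
    return h, c


with torch.no_grad():
    h_manual, c_manual = manual_lstm_step(x, h_prev, c_prev, cell)
    h_ref, c_ref = cell(x, (h_prev, c_prev))

print(torch.allclose(h_manual, h_ref, atol=1e-5))
print(torch.allclose(c_manual, c_ref, atol=1e-5))
```

The forget gate scales the previous cell state and the input gate scales the new candidate, which is how the cell can carry information across many time steps.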

In [272]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import seaborn as sn

# data preparation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader

# models
import torch.nn as nn
from torch.nn import functional as F

# training
import torch.optim as optim

# evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from tqdm import tqdm
import gc

#### A. Data

In [2]:
# load dataset
df = pd.read_json(r"../data/df_final_document.json") 
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 142
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Reference                     143 non-null    object
 1   Feedback date                 143 non-null    object
 2   User type                     112 non-null    object
 3   Scope                         4 non-null      object
 4   Organisation name             112 non-null    object
 5   Transparency register number  93 non-null     object
 6   Organisation size             112 non-null    object
 7   label_132                     143 non-null    object
 8   label_134                     143 non-null    object
 9   submit                        143 non-null    int64 
 10  file_name                     73 non-null     object
 11  language                      143 non-null    object
 12  text                          143 non-null    object
 13  text_clean          

In [3]:
# pdf submissions only
df = df[df['submit']==1].copy() # copy so that new columns can be added without a SettingWithCopyWarning

In [4]:
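# encode both label columns as integers; one encoder per column keeps each mapping recoverable
le_132 = LabelEncoder()
le_134 = LabelEncoder()
df['label.132'] = le_132.fit_transform(df['label_132'])
df['label.134'] = le_134.fit_transform(df['label_134'])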

#### B. Prepare input tensor

In [293]:
# split data into train and test sets
train_text, test_text, train_labels, test_labels = train_test_split(df['text_clean'], df['label.132'], 
                                                                    random_state=2018, 
                                                                    test_size=0.3, 
                                                                    stratify=df['label.132'])

In [294]:
# create train and test dataset 
train_dataset = list(zip(train_labels, train_text))
test_dataset = list(zip(test_labels, test_text))

# convert pd series to list
train_text = train_text.tolist()
test_text = test_text.tolist()

After splitting the data into training and test sets, build the corpus vocabulary by tokenizing all texts and assigning each token a unique index. Note that the vocabulary here is built from both the training and the test texts.

In [296]:
tokenizer = get_tokenizer("basic_english")

def tokenize(datasets):
    for dataset in datasets:
        for text in dataset:
            yield tokenizer(text)

vocab = build_vocab_from_iterator(tokenize([train_text, test_text]), min_freq=1, specials=["<UNK>"])
vocab.set_default_index(vocab["<UNK>"])

In [297]:
len(vocab)

12540

In [298]:
# example
tokens = tokenizer("This is an example.")
index = vocab(tokens)
index

[0, 1273, 691, 5528, 1]
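
As a rough picture of what the vocabulary step does, here is a minimal pure-Python sketch. The whitespace tokenizer, the helper names `simple_tokenize`, `build_vocab` and `encode`, and the toy sentences are all assumptions for illustration; the sketch does not reproduce the exact behaviour of `build_vocab_from_iterator`. Unlike the cell above, it builds the vocabulary from training texts only, so a word that appears only in new text falls back to `<UNK>`.

```python
from collections import Counter


def simple_tokenize(text):
    """Hypothetical stand-in for the tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_vocab(texts, min_freq=1, unk_token="<UNK>"):
    """Map each token seen at least `min_freq` times to an integer index."""
    counts = Counter(tok for text in texts for tok in simple_tokenize(text))
    # Index 0 is reserved for unknown tokens; the rest are ordered by
    # descending frequency, then alphabetically.
    kept = sorted((tok for tok, c in counts.items() if c >= min_freq),
                  key=lambda tok: (-counts[tok], tok))
    return {unk_token: 0, **{tok: i + 1 for i, tok in enumerate(kept)}}


def encode(text, vocab, unk_token="<UNK>"):
    """Replace each token by its index, falling back to the unknown index."""
    return [vocab.get(tok, vocab[unk_token]) for tok in simple_tokenize(text)]


toy_train = ["the proposal is clear", "the proposal is too strict"]
toy_vocab = build_vocab(toy_train)
encoded = encode("the directive is clear", toy_vocab)

print(toy_vocab)
print(encoded)  # [3, 0, 1, 4]: "directive" is not in the vocabulary and maps to 0
```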

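Once the vocabulary is built, we create batches of text sequences and map the tokens to indices. Sequences shorter than `max_words` are padded and longer ones are truncated, so every sequence has the same length. Each batch is returned as a tensor of shape (batch size, sequence length), together with a tensor of labels. Padding uses index 0, which in this vocabulary is also the index of `<UNK>`.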

In [299]:
target_classes = ["0", "1"]
max_words = 100

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [vocab(tokenizer(text)) for text in X] # map tokens to indices using vocab
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] # pad with 0 or truncate to max_words

    return torch.tensor(X, dtype=torch.int32), torch.tensor(Y)

In [300]:
train_loader = DataLoader(train_dataset, batch_size=100, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=100, collate_fn=vectorize_batch)

In [301]:
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break

torch.Size([51, 100]) torch.Size([51])
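
The padding rule can also be written as a small standalone function. The sketch below is one possible variant, not the notebook's code: it assumes a dedicated padding index (here a hypothetical `PAD_INDEX = 1`, as would result from building the vocabulary with `specials=["<UNK>", "<PAD>"]`), so that padding and unknown words no longer share index 0.

```python
def pad_or_truncate(token_ids, max_len, pad_index):
    """Return exactly `max_len` ids: cut long sequences, right-pad short ones."""
    clipped = token_ids[:max_len]
    return clipped + [pad_index] * (max_len - len(clipped))


PAD_INDEX = 1   # hypothetical: assumes "<PAD>" was added as a second special token

short = pad_or_truncate([7, 8, 9], max_len=5, pad_index=PAD_INDEX)
long = pad_or_truncate([7, 8, 9, 10, 11, 12], max_len=5, pad_index=PAD_INDEX)

print(short)  # [7, 8, 9, 1, 1]
print(long)   # [7, 8, 9, 10, 11]
```

With a separate padding index, `nn.Embedding` can be given `padding_idx=PAD_INDEX`, which keeps the padding vector at zero and excludes it from gradient updates.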


#### C. LSTM Classifier

In [311]:
# define hyperparameters
embed_len = 50
hidden_dim = 20
n_layers=1

class LSTMClassifier(nn.Module):
    def __init__(self):
        super(LSTMClassifier, self).__init__()
        self.hidden_dim = hidden_dim
        self.embed_len = embed_len
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(target_classes))

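    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        # start from zero states so that the same input always gives the same output
        hidden, carry = self.init_hidden(len(X_batch))
        output, (hidden, carry) = self.lstm(embeddings, (hidden, carry))
        # classify from the LSTM output at the last time step
        return self.linear(output[:,-1])

    def init_hidden(self, batch_size):
        # zero initial hidden and cell states, created on the same device as the model weights
        device = self.linear.weight.device
        return (
            torch.zeros(n_layers, batch_size, self.hidden_dim, device=device),
            torch.zeros(n_layers, batch_size, self.hidden_dim, device=device)
        )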

In [312]:
lstm_classifier = LSTMClassifier()
lstm_classifier

LSTMClassifier(
  (embedding_layer): Embedding(12540, 50)
  (lstm): LSTM(50, 20, batch_first=True)
  (linear): Linear(in_features=20, out_features=2, bias=True)
)
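
To see how a batch flows through the three layers, the sketch below traces the tensor shapes on a toy model. The sizes are assumptions chosen to keep the example small; the layer types and the `output[:, -1]` readout follow the classifier above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for illustration; the notebook uses len(vocab), embed_len=50, hidden_dim=20.
vocab_size, embed_len, hidden_dim, n_classes = 30, 8, 5, 2
batch_size, seq_len = 4, 6

embedding = nn.Embedding(vocab_size, embed_len)
lstm = nn.LSTM(embed_len, hidden_dim, num_layers=1, batch_first=True)
linear = nn.Linear(hidden_dim, n_classes)

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

with torch.no_grad():
    embedded = embedding(token_ids)        # (batch, seq_len, embed_len)
    output, (h_n, c_n) = lstm(embedded)    # initial states default to zeros when omitted
    logits = linear(output[:, -1])         # (batch, n_classes)

print(embedded.shape, output.shape, h_n.shape, logits.shape)
# For a single-layer, unidirectional LSTM, the last time step of `output`
# is the final hidden state.
print(torch.allclose(output[:, -1], h_n[-1]))
```

Because the batches are right-padded, the last time step of a short document is a padding position, so the classifier reads the state after the padding has been processed. Left-padding or packed sequences are common ways around this, though neither is used in this notebook.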

In [313]:
for layer in lstm_classifier.children():
    print("Layer : {}".format(layer))
    print("Parameters : ")
    for param in layer.parameters():
        print(param.shape)
    print()

Layer : Embedding(12540, 50)
Parameters : 
torch.Size([12540, 50])

Layer : LSTM(50, 20, batch_first=True)
Parameters : 
torch.Size([80, 50])
torch.Size([80, 20])
torch.Size([80])
torch.Size([80])

Layer : Linear(in_features=20, out_features=2, bias=True)
Parameters : 
torch.Size([2, 20])
torch.Size([2])
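
The leading dimension of 80 in the LSTM parameters is 4 × `hidden_dim`: the weights of the input gate, forget gate, cell candidate, and output gate are stacked into one matrix. The sketch below checks this and counts the LSTM parameters, using the hyperparameters from the notebook.

```python
import torch.nn as nn

embed_len, hidden_dim = 50, 20   # values from the notebook

lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim,
               num_layers=1, batch_first=True)

# Four blocks (input gate, forget gate, cell candidate, output gate)
# are stacked along the first dimension of every parameter.
expected = {
    "weight_ih_l0": (4 * hidden_dim, embed_len),   # input-to-hidden weights
    "weight_hh_l0": (4 * hidden_dim, hidden_dim),  # hidden-to-hidden weights
    "bias_ih_l0": (4 * hidden_dim,),
    "bias_hh_l0": (4 * hidden_dim,),
}
actual = {name: tuple(p.shape) for name, p in lstm.named_parameters()}
n_lstm_params = sum(p.numel() for p in lstm.parameters())

print(actual == expected)
print(n_lstm_params)   # 4000 + 1600 + 80 + 80 = 5760
```

For comparison, the embedding layer alone holds 12540 × 50 = 627,000 weights, while a training batch here contains 51 documents, so the model has far more parameters than examples.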



In [314]:
out = lstm_classifier(torch.randint(0, len(vocab), (1024, max_words)))
out.shape

torch.Size([1024, 2])

In [324]:
def evaluate(model, loss_fn, val_loader):
    model.eval() # switch off training-only behaviour such as dropout
    with torch.no_grad():
        Y_true, Y_preds, losses = [],[],[]
        for X, Y in val_loader:
            preds = model(X)
            loss = loss_fn(preds, Y)
            losses.append(loss.item())

            Y_true.append(Y)
            Y_preds.append(preds.argmax(dim=-1)) # predicted class = largest logit

        Y_true = torch.cat(Y_true)
        Y_preds = torch.cat(Y_preds)

        print("Valid Loss : {:.3f}".format(torch.tensor(losses).mean()))
        print("Valid Acc  : {:.3f}".format(accuracy_score(Y_true.numpy(), Y_preds.numpy())))
        print("Confusion matrix:")
        print(confusion_matrix(Y_true.numpy(), Y_preds.numpy()))

def train(model, loss_fn, optimizer, train_loader, val_loader, epochs=10):
    for i in range(1, epochs+1):
        model.train() # evaluate() leaves the model in eval mode, so switch back every epoch
        losses = []
        for X, Y in tqdm(train_loader):
            Y_preds = model(X)

            loss = loss_fn(Y_preds, Y)
            losses.append(loss.item())

            ## back propagation
            optimizer.zero_grad() # clear gradients from the previous step
            loss.backward() # compute gradients of the loss
            optimizer.step() # update the weights

        print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
        evaluate(model, loss_fn, val_loader)

In [321]:
from torch.optim import Adam

epochs = 20
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
lstm_classifier = LSTMClassifier()
optimizer = Adam(lstm_classifier.parameters(), lr=learning_rate)

train(lstm_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)

100%|██████████| 1/1 [00:00<00:00,  6.34it/s]


Train Loss : 0.706
Valid Loss : 0.733
Valid Acc  : 0.318
Confusion matrix:
[[ 6  4]
 [11  1]]


100%|██████████| 1/1 [00:00<00:00,  9.10it/s]


Train Loss : 0.700
Valid Loss : 0.732
Valid Acc  : 0.318
Confusion matrix:
[[ 6  4]
 [11  1]]


100%|██████████| 1/1 [00:00<00:00,  8.79it/s]


Train Loss : 0.693
Valid Loss : 0.732
Valid Acc  : 0.409
Confusion matrix:
[[6 4]
 [9 3]]


100%|██████████| 1/1 [00:00<00:00, 10.06it/s]


Train Loss : 0.687
Valid Loss : 0.731
Valid Acc  : 0.455
Confusion matrix:
[[6 4]
 [8 4]]


100%|██████████| 1/1 [00:00<00:00,  9.50it/s]


Train Loss : 0.681
Valid Loss : 0.730
Valid Acc  : 0.455
Confusion matrix:
[[6 4]
 [8 4]]


100%|██████████| 1/1 [00:00<00:00, 10.01it/s]


Train Loss : 0.674
Valid Loss : 0.730
Valid Acc  : 0.455
Confusion matrix:
[[6 4]
 [8 4]]


100%|██████████| 1/1 [00:00<00:00,  9.55it/s]


Train Loss : 0.668
Valid Loss : 0.730
Valid Acc  : 0.455
Confusion matrix:
[[6 4]
 [8 4]]


100%|██████████| 1/1 [00:00<00:00,  9.41it/s]


Train Loss : 0.662
Valid Loss : 0.729
Valid Acc  : 0.455
Confusion matrix:
[[6 4]
 [8 4]]


100%|██████████| 1/1 [00:00<00:00,  8.66it/s]


Train Loss : 0.656
Valid Loss : 0.729
Valid Acc  : 0.455
Confusion matrix:
[[6 4]
 [8 4]]


100%|██████████| 1/1 [00:00<00:00,  9.88it/s]


Train Loss : 0.649
Valid Loss : 0.728
Valid Acc  : 0.455
Confusion matrix:
[[6 4]
 [8 4]]


100%|██████████| 1/1 [00:00<00:00,  9.49it/s]


Train Loss : 0.643
Valid Loss : 0.728
Valid Acc  : 0.455
Confusion matrix:
[[6 4]
 [8 4]]


100%|██████████| 1/1 [00:00<00:00, 10.09it/s]


Train Loss : 0.637
Valid Loss : 0.727
Valid Acc  : 0.455
Confusion matrix:
[[5 5]
 [7 5]]


100%|██████████| 1/1 [00:00<00:00,  9.54it/s]


Train Loss : 0.630
Valid Loss : 0.727
Valid Acc  : 0.455
Confusion matrix:
[[5 5]
 [7 5]]


100%|██████████| 1/1 [00:00<00:00, 10.17it/s]


Train Loss : 0.624
Valid Loss : 0.727
Valid Acc  : 0.455
Confusion matrix:
[[5 5]
 [7 5]]


100%|██████████| 1/1 [00:00<00:00, 10.14it/s]


Train Loss : 0.618
Valid Loss : 0.726
Valid Acc  : 0.455
Confusion matrix:
[[5 5]
 [7 5]]


100%|██████████| 1/1 [00:00<00:00, 10.15it/s]


Train Loss : 0.611
Valid Loss : 0.726
Valid Acc  : 0.409
Confusion matrix:
[[4 6]
 [7 5]]


100%|██████████| 1/1 [00:00<00:00, 10.09it/s]


Train Loss : 0.605
Valid Loss : 0.726
Valid Acc  : 0.409
Confusion matrix:
[[4 6]
 [7 5]]


100%|██████████| 1/1 [00:00<00:00,  8.85it/s]


Train Loss : 0.598
Valid Loss : 0.725
Valid Acc  : 0.409
Confusion matrix:
[[4 6]
 [7 5]]


100%|██████████| 1/1 [00:00<00:00,  9.85it/s]


Train Loss : 0.592
Valid Loss : 0.725
Valid Acc  : 0.409
Confusion matrix:
[[4 6]
 [7 5]]


100%|██████████| 1/1 [00:00<00:00, 10.27it/s]


Train Loss : 0.585
Valid Loss : 0.725
Valid Acc  : 0.409
Confusion matrix:
[[4 6]
 [7 5]]
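
To help read the log, the sketch below recomputes the final accuracy from the last confusion matrix and compares the losses with two simple reference points. It assumes the matrices follow scikit-learn's convention (rows are true classes, columns are predicted classes), which is consistent with the row sums staying at 10 and 12 in every epoch.

```python
import math

# Final confusion matrix from the log (assumed: rows = true class, columns = predicted class).
final_cm = [[4, 6],
            [7, 5]]

correct = final_cm[0][0] + final_cm[1][1]      # diagonal = correct predictions
total = sum(sum(row) for row in final_cm)
accuracy = correct / total

# Cross-entropy of a classifier that always outputs 50/50 for two classes.
chance_loss = math.log(2)

# Accuracy obtained by always predicting the larger true class.
majority_baseline = max(sum(row) for row in final_cm) / total

print(f"accuracy          : {accuracy:.3f}")
print(f"chance-level loss : {chance_loss:.3f}")
print(f"majority baseline : {majority_baseline:.3f}")
```

The training loss falls from 0.706 to 0.585, but the validation loss stays near 0.73, above the chance level of ln 2 ≈ 0.693, and the final validation accuracy of 0.409 is below the majority-class baseline of 12/22 ≈ 0.545. On these 22 test documents the model appears to fit the training set without generalising.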


### References

* https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
* https://www.kaggle.com/code/mehmetlaudatekman/lstm-text-classification-pytorch/notebook