# Sentiment Analysis with LSTMs and Text Preprocessing (Chapter 15 Application) 🎬

---

This notebook implements a complete sequence processing solution for **Sentiment Analysis** (classifying movie reviews as positive or negative) on the **IMDB Large Movie Review Dataset**. It applies the advanced recurrent architectures introduced in **Chapter 15: Processing Sequential Data with Recurrent Neural Networks (RNNs)**.

### 1. Advanced Text Data Preprocessing and Pipelining 📝

Handling text data requires several specialized steps:

* **Dataset Loading:** Uses **`torchtext.datasets.IMDB`** for efficient loading of the large review corpus.
* **Tokenization and Vocabulary:**
    * **Tokenization:** Reviews are broken down into discrete units (here, lowercased words plus a few emoticons).
    * **Vocabulary:** A mapping is created from every unique word to a numerical index. This is necessary because neural networks only process numbers.
* **Padding and Masking:** Since reviews have varying lengths, all input sequences must be standardized:
    * **Padding:** Shorter sequences are extended with a special padding token (index 0) to match the longest sequence in the batch.
    * **`pack_padded_sequence`:** A critical PyTorch utility is used to tell the RNN (specifically the LSTM) to **ignore the padding tokens** during computation, saving time and preventing the padding from corrupting the final hidden state.
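
The padding and packing step can be sketched on toy data as follows. The token indices, layer sizes, and variable names are illustrative assumptions, not values from the notebook. The check at the end compares the final hidden state of the shorter sequence with and without packing against running that sequence on its own.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

torch.manual_seed(0)

# Toy "reviews" already mapped to integer indices; 0 is reserved for padding.
sequences = [torch.tensor([5, 3, 8, 2]), torch.tensor([7, 4])]
lengths = torch.tensor([len(s) for s in sequences])

# Pad to the longest sequence in the batch: shape (2, 4), zero-padded.
padded = pad_sequence(sequences, batch_first=True)

embedding = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=0)
lstm = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)

# With packing, the LSTM stops at each sequence's true length.
packed = pack_padded_sequence(
    embedding(padded), lengths, batch_first=True, enforce_sorted=False
)
_, (hidden_packed, _) = lstm(packed)

# Reference: the short sequence run alone, with no padding at all.
_, (hidden_alone, _) = lstm(embedding(sequences[1].unsqueeze(0)))

# Without packing, the LSTM keeps stepping through the padding positions.
_, (hidden_padded, _) = lstm(embedding(padded))

print(padded)
print(torch.allclose(hidden_packed[0, 1], hidden_alone[0, 0], atol=1e-6))
print(torch.allclose(hidden_padded[0, 1], hidden_alone[0, 0], atol=1e-6))
```

The second comparison fails because a zero input vector still updates the hidden state through the biases and recurrent weights, which is why the final state is read from the packed run.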

### 2. Deep Recurrent Architecture: LSTM and Bidirectionality 🧠

The model moves beyond the simple RNN of the first notebook to use a more powerful variant:

* **Embedding Layer (`nn.Embedding`):** The first layer converts the sparse integer index of each word into a dense, low-dimensional **word vector**. This vector captures semantic relationships between words (e.g., 'good' and 'great' should have similar vectors).
* **Long Short-Term Memory (LSTM):** The core recurrent layer. LSTMs are used instead of simple RNNs because their internal **gates (Input, Forget, Output)** and **Cell State ($C_t$)** give gradients a largely additive path through time. This mitigates the **vanishing gradient problem** and lets the network learn much longer-range dependencies in the review text than a simple RNN can.
* **Bidirectional RNN (`bidirectional=True`):** The LSTM is run over the sequence in both the forward and backward directions, so the representation of each word can draw on context from both earlier and later words. This often improves accuracy on text classification tasks.
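
To make the gate mechanism concrete, one step of `nn.LSTMCell` can be recomputed by hand. This is a sketch with arbitrary small sizes; it relies on PyTorch's documented ordering of the stacked weights (input gate, forget gate, cell candidate, output gate), and the variable names are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)

cell = nn.LSTMCell(input_size=3, hidden_size=2)
x = torch.randn(1, 3)        # one time step for a batch of one
h_prev = torch.randn(1, 2)   # previous hidden state
c_prev = torch.randn(1, 2)   # previous cell state C_{t-1}

# All four gate pre-activations are computed in one stacked matrix product.
gates = (x @ cell.weight_ih.T + cell.bias_ih
         + h_prev @ cell.weight_hh.T + cell.bias_hh)
i, f, g, o = gates.chunk(4, dim=1)

i = torch.sigmoid(i)   # input gate: how much of the candidate to write
f = torch.sigmoid(f)   # forget gate: how much of the old cell state to keep
g = torch.tanh(g)      # candidate values
o = torch.sigmoid(o)   # output gate: how much of the cell state to expose

c_manual = f * c_prev + i * g          # additive cell-state update
h_manual = o * torch.tanh(c_manual)    # new hidden state

h_ref, c_ref = cell(x, (h_prev, c_prev))
print(torch.allclose(h_manual, h_ref, atol=1e-6))
print(torch.allclose(c_manual, c_ref, atol=1e-6))
```

The line `c_manual = f * c_prev + i * g` is the additive update that lets information, and gradients, pass across many time steps when the forget gate stays close to 1.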

### 3. Training and Sentiment Classification

* **Final Feature Combination:** The final hidden states from the forward and backward LSTMs are **concatenated** (`torch.cat`), providing a summary of the entire review's context.
* **Classification Head:** This combined feature vector is passed through one or more fully connected layers (`nn.Linear`) to output a single prediction.
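* **Loss Function:** **`nn.BCEWithLogitsLoss`** is used, as this is a **Binary Classification** task (positive vs. negative sentiment). It applies the sigmoid internally, so the model outputs raw logits, and a logit of 0 corresponds to a predicted probability of 0.5.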
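
The three steps above can be sketched end to end on random inputs. The sizes and labels here are made up for illustration, and the single linear layer stands in for whatever classification head is used.

```python
import torch
from torch import nn

torch.manual_seed(0)

batch, seq_len, emb_dim, hidden_size = 4, 6, 5, 8
lstm = nn.LSTM(emb_dim, hidden_size, batch_first=True, bidirectional=True)
head = nn.Linear(2 * hidden_size, 1)

x = torch.randn(batch, seq_len, emb_dim)   # stand-in for embedded reviews
output, (hidden, cell) = lstm(x)

# hidden has shape (num_layers * num_directions, batch, hidden_size).
# For the last layer, index -2 is the forward direction and -1 the backward one.
features = torch.cat((hidden[-2], hidden[-1]), dim=1)   # (batch, 2 * hidden_size)
logits = head(features)[:, 0]                           # (batch,)

labels = torch.tensor([1., 0., 1., 0.])                 # hypothetical targets
loss_from_logits = nn.BCEWithLogitsLoss()(logits, labels)
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), labels)

print(features.shape)
print(torch.allclose(loss_from_logits, loss_from_probs, atol=1e-6))
```

The forward state summarizes the review after reading its last token, while the backward state summarizes it after reading its first token, so the concatenation covers the sequence from both ends.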

This notebook provides a complete blueprint for sequence modeling, demonstrating the necessary data preparation and the use of the LSTM/Bidirectional architecture for text classification.

In [3]:
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchtext.datasets import IMDB
from torch.utils.data.dataset import random_split

In [111]:
def seed_everything(seed= 1):
    # Seed PyTorch's global RNG so random_split and weight initialization are reproducible
    torch.manual_seed(seed)

In [112]:
seed_everything()

In [113]:
train_ds = IMDB(split= 'train')
test_ds = IMDB(split= 'test')

In [114]:
# step 1 create train and valid dataset
train_dataset, valid_dataset = random_split(
    list(train_ds), [20000, 5000]
)

In [115]:
# find unique tokens(words)
from collections import Counter
import re

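def tokenizer(text):
    # Strip HTML markup such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =) from the lowercased text
    emoticons = re.findall(
        r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower()
    )
    # Replace runs of non-word characters with a space,
    # then append the emoticons with their "nose" (-) removed
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    tokenized = text.split()
    return tokenized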

In [116]:
tokens_count = Counter()
for label, line in train_dataset:
    tokens = tokenizer(line)
    tokens_count.update(tokens)
print(f'Vocab size(Tokens Count): {len(tokens_count)}')

Vocab size(Tokens Count): 69023


In [117]:
from collections import OrderedDict

sorted_by_freq_tuples = sorted(
    tokens_count.items(), key= lambda x: x[1], reverse= True
)

ordered_tokens = OrderedDict(sorted_by_freq_tuples)

In [118]:
from torchtext.vocab import vocab as build_vocab  # alias so the factory function is not shadowed below

vocab = build_vocab(ordered_dict= ordered_tokens)
vocab.insert_token('<pad>', 0)
vocab.insert_token('<unk>', 1)
vocab.set_default_index(1)

In [119]:
print([vocab[token] for token in ['this', 'is', 'an', 'example']])

[11, 7, 35, 457]
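
For readers without `torchtext` installed, the same frequency-ordered mapping can be sketched in plain Python. The helper name `build_vocab_dict` and the toy sentence are hypothetical, chosen only for illustration; the index convention (`<pad>` at 0, `<unk>` at 1) mirrors the cell above.

```python
from collections import Counter

def build_vocab_dict(token_counts, specials=('<pad>', '<unk>')):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    ordered = sorted(token_counts.items(), key=lambda kv: kv[1], reverse=True)
    index_to_token = list(specials) + [token for token, _ in ordered]
    return {token: index for index, token in enumerate(index_to_token)}

counts = Counter('the movie was good the acting was great the end'.split())
token_to_index = build_vocab_dict(counts)

def encode(tokens):
    # Unknown tokens fall back to the <unk> index, like set_default_index(1)
    return [token_to_index.get(token, token_to_index['<unk>']) for token in tokens]

print(encode(['the', 'movie', 'was', 'terrible']))  # -> [2, 4, 3, 1]
```

Frequent tokens receive small indices, and `'terrible'`, which never occurs in the toy corpus, maps to the `<unk>` index 1.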


In [120]:
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
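# Older torchtext releases yield 'pos'/'neg' strings; newer ones yield integers (1 = negative, 2 = positive)
label_pipeline = lambda x: 1. if x in ('pos', 2) else 0.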
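
A quick sanity check on the label mapping is worth running before training. The sketch below assumes that the raw labels are either `'pos'`/`'neg'` strings or the integers 1 (negative) and 2 (positive), depending on the `torchtext` version; that assumption should be verified against the dataset actually loaded. The helper `mapped_label_counts` is hypothetical.

```python
from collections import Counter

def label_pipeline(label):
    # Assumption: positives are 'pos' (string labels) or 2 (integer labels)
    return 1.0 if label in ('pos', 2) else 0.0

def mapped_label_counts(raw_labels, pipeline=label_pipeline):
    """Count how many examples end up in each target class."""
    return Counter(pipeline(label) for label in raw_labels)

string_counts = mapped_label_counts(['pos', 'neg', 'pos', 'neg'])
integer_counts = mapped_label_counts([2, 1, 2, 1])

# A string-only rule silently maps every integer label to 0.0
collapsed_counts = mapped_label_counts(
    [2, 1, 2, 1], pipeline=lambda label: 1.0 if label == 'pos' else 0.0
)

print(string_counts)
print(integer_counts)
print(collapsed_counts)
```

If the counts show a single class, the model can reach 100% accuracy by always predicting that class, so the training log would look perfect while the model has learned nothing.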

In [121]:
def collate_batch(batch):
    text_list, label_list, lengths = [], [], []
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype= torch.int64)
        text_list.append(processed_text)
        lengths.append(processed_text.size(0))
    label_list = torch.tensor(label_list)
    lengths = torch.tensor(lengths)
    padded_text_list = nn.utils.rnn.pad_sequence(
            text_list, batch_first= True
    )
        
    return padded_text_list, label_list, lengths

In [122]:
dataloader = DataLoader(train_dataset, batch_size= 4, shuffle= False, collate_fn= collate_batch)

In [123]:
text_batch, label_batch, lengths_batch = next(iter(dataloader))
print(text_batch)
print(label_batch)
print(lengths_batch)
print(text_batch.shape)

tensor([[   35,  1739,     7,   449,   721,     6,   301,     4,   787,     9,
             4,    18,    44,     2,  1705,  2460,   186,    25,     7,    24,
           100,  1874,  1739,    25,     7, 34415,  3568,  1103,  7517,   787,
             5,     2,  4991, 12401,    36,     7,   148,   111,   939,     6,
         11598,     2,   172,   135,    62,    25,  3199,  1602,     3,   928,
          1500,     9,     6,  4601,     2,   155,    36,    14,   274,     4,
         42945,     9,  4991,     3,    14, 10296,    34,  3568,     8,    51,
           148,    30,     2,    58,    16,    11,  1893,   125,     6,   420,
          1214,    27, 14542,   940,    11,     7,    29,   951,    18,    17,
         15994,   459,    34,  2480, 15211,  3713,     2,   840,  3200,     9,
          3568,    13,   107,     9,   175,    94,    25,    51, 10297,  1796,
            27,   712,    16,     2,   220,    17,     4,    54,   722,   238,
           395,     2,   787,    32,    27,  5236,  

In [124]:
batch_size = 32
train_dl = DataLoader(train_dataset, batch_size, shuffle= True, collate_fn= collate_batch)
val_dl = DataLoader(valid_dataset, batch_size, shuffle= False, collate_fn= collate_batch)
test_dl = DataLoader(test_ds, batch_size, shuffle= False, collate_fn= collate_batch)

In [125]:
embedding = nn.Embedding(
    num_embeddings= 10,
    embedding_dim= 3,
    padding_idx= 0
)
print(embedding(torch.LongTensor([[1, 2, 5, 2], [2, 3, 4, 0]])))

tensor([[[ 0.7039, -0.8321, -0.4651],
         [-0.3203,  2.2408,  0.5566],
         [-0.7106, -0.2959,  0.8356],
         [-0.3203,  2.2408,  0.5566]],

        [[-0.3203,  2.2408,  0.5566],
         [ 0.0946, -0.3531,  0.9124],
         [-0.4643,  0.3046,  0.7046],
         [ 0.0000,  0.0000,  0.0000]]], grad_fn=<EmbeddingBackward0>)
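
The all-zero last row in the output above comes from `padding_idx=0`. A small sketch, with illustrative sizes, shows that this row also receives no gradient, so the padding vector stays at zero during training.

```python
import torch
from torch import nn

torch.manual_seed(0)

embedding = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=0)
batch = torch.tensor([[1, 2, 0, 0]])   # two real tokens followed by two pads

# Sum all embedded values so every looked-up entry gets a gradient of 1
embedding(batch).sum().backward()

pad_vector = embedding.weight[0].detach()
pad_grad = embedding.weight.grad[0]
token_grad = embedding.weight.grad[1]

print(pad_vector)   # zeros: the padding row is initialized to zero
print(pad_grad)     # zeros: the padding row is excluded from gradient updates
print(token_grad)   # ones: token 1 appears once in the batch
```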


In [126]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, num_layers= 2, batch_first= True)
        self.fc = nn.Linear(hidden_size, 1)
        
    def forward(self, x):
        _, hidden = self.rnn(x)
        out = hidden[-1, :, :]
        out = self.fc(out)
        
        return out

In [127]:
model = RNN(64, 32)
print(model)

RNN(
  (rnn): RNN(64, 32, num_layers=2, batch_first=True)
  (fc): Linear(in_features=32, out_features=1, bias=True)
)


In [128]:
print(model(torch.randn(5, 3, 64)))

tensor([[ 0.3183],
        [ 0.1230],
        [ 0.1772],
        [-0.1052],
        [-0.1259]], grad_fn=<AddmmBackward0>)
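
The indexing `hidden[-1, :, :]` in the model above selects the final hidden state of the top layer. A short sketch with the same layer sizes, on random input, makes the shapes explicit.

```python
import torch
from torch import nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=64, hidden_size=32, num_layers=2, batch_first=True)
x = torch.randn(5, 3, 64)   # (batch, sequence length, features)

output, hidden = rnn(x)

# output: top-layer hidden state at every time step -> (batch, seq_len, hidden_size)
# hidden: final hidden state of every layer         -> (num_layers, batch, hidden_size)
print(output.shape)
print(hidden.shape)

# For a unidirectional RNN, the top layer's final state is the last output step
print(torch.allclose(hidden[-1], output[:, -1, :], atol=1e-6))
```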


In [137]:
class RNNModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, rnn_hidden_size, fc_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings= vocab_size, embedding_dim= emb_dim, padding_idx= 0)
        self.rnn = nn.LSTM(emb_dim, rnn_hidden_size, num_layers= 2, batch_first= True)
        self.fc1 = nn.Linear(rnn_hidden_size, fc_hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(fc_hidden_size, 1)
        
    def forward(self, text, lengths):
        out = self.embedding(text)
        out = nn.utils.rnn.pack_padded_sequence(out, lengths.cpu(), enforce_sorted= False, batch_first= True)
        out , (hidden, cell) = self.rnn(out)
        out = hidden[-1, :, :]
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        
        return out

In [138]:
vocab_size = len(vocab)
emb_dim = 20
rnn_hidden_size = 64
fc_hidden_size = 64

model = RNNModel(vocab_size, emb_dim, rnn_hidden_size, fc_hidden_size)
print(model)

RNNModel(
  (embedding): Embedding(69025, 20, padding_idx=0)
  (rnn): LSTM(20, 64, num_layers=2, batch_first=True)
  (fc1): Linear(in_features=64, out_features=64, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=64, out_features=1, bias=True)
)


In [139]:
loss_fn = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr= 0.001)

In [140]:
def train(dataloader):
    model.train()
    train_acc, train_loss = 0., 0.
    for text_batch, label_batch, lengths in dataloader:
        optimizer.zero_grad()
        pred = model(text_batch, lengths)[:, 0]
        loss = loss_fn(pred, label_batch)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * label_batch.size(0)
        # pred holds raw logits, so a probability of 0.5 corresponds to a logit of 0
        is_correct = ((pred >= 0.).float() == label_batch).float()
        train_acc += is_correct.sum().item()
    train_loss /= len(dataloader.dataset)
    train_acc /= len(dataloader.dataset)
    
    return train_loss, train_acc
        

In [141]:
def evaluate(dataloader):
    model.eval()
    val_loss, val_acc = 0., 0.
    with torch.no_grad():
        for text_batch, label_batch, lengths in dataloader:
            pred = model(text_batch, lengths)[:, 0]
            loss = loss_fn(pred, label_batch)
            val_loss += loss.item() * label_batch.size(0)
            # pred holds raw logits, so a probability of 0.5 corresponds to a logit of 0
            is_correct = ((pred >= 0.).float() == label_batch).float()
            val_acc += is_correct.sum().item()
        val_loss /= len(dataloader.dataset)
        val_acc /= len(dataloader.dataset)
        
        return val_loss, val_acc
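
Because the model is trained with `nn.BCEWithLogitsLoss`, its outputs are logits rather than probabilities, and the decision threshold has to match. The toy logits and labels below are made up to illustrate the difference between thresholding at 0 and at 0.5.

```python
import torch

logits = torch.tensor([-2.0, -0.1, 0.2, 0.4, 3.0])   # hypothetical model outputs
labels = torch.tensor([0., 0., 1., 1., 1.])

# Equivalent decisions: sigmoid(z) >= 0.5 exactly when z >= 0
pred_from_probs = (torch.sigmoid(logits) >= 0.5).float()
pred_from_logits = (logits >= 0.0).float()

# Thresholding logits at 0.5 misclassifies weakly positive examples
pred_shifted = (logits >= 0.5).float()

acc_correct = (pred_from_logits == labels).float().mean().item()
acc_shifted = (pred_shifted == labels).float().mean().item()

print(torch.equal(pred_from_probs, pred_from_logits))
print(acc_correct, acc_shifted)
```

Here the two examples with logits 0.2 and 0.4 are positive under the sigmoid rule but are counted as negative when the logits are compared against 0.5.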

In [142]:
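num_epochs = 10
# Sanity check when reading the log: an accuracy of exactly 1.0 from the first epoch
# typically points to a data problem, such as every label mapping to the same class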
for epoch in range(num_epochs):
    train_loss, train_acc = train(train_dl)
    val_loss, val_acc = evaluate(val_dl)
    
    print(f'Epochs {epoch + 1}\tTrain:\tAccuracy: {train_acc} Loss: {train_loss}')
    print(f'                  \tValid:\tAccuracy: {val_acc}   Loss: {val_loss}')

Epochs 1	Train:	Accuracy: 1.0 Loss: 0.018987190446181922
                  	Valid:	Accuracy: 1.0   Loss: 2.0846029452513904e-05
Epochs 2	Train:	Accuracy: 1.0 Loss: 1.0463686278671957e-05
                  	Valid:	Accuracy: 1.0   Loss: 5.122671836579684e-06
Epochs 3	Train:	Accuracy: 1.0 Loss: 3.283095588631113e-06
                  	Valid:	Accuracy: 1.0   Loss: 2.0333269636466865e-06
Epochs 4	Train:	Accuracy: 1.0 Loss: 1.4813292138569522e-06
                  	Valid:	Accuracy: 1.0   Loss: 1.0728830375228426e-06


KeyboardInterrupt: 

In [None]:
class RNNModelBidirectional(nn.Module):
    def __init__(self, vocab_size, emb_size, rnn_hidden_size, fc_hidden_size):
        super().__init__()
        # Index 0 is the <pad> token; its embedding stays at zero
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx= 0)
        self.rnn = nn.LSTM(emb_size, rnn_hidden_size, batch_first= True, bidirectional= True)
        # Forward and backward hidden states are concatenated, hence 2 * rnn_hidden_size
        self.fc1 = nn.Linear(2* rnn_hidden_size, fc_hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(fc_hidden_size, 1)
        
    def forward(self, text, lengths):
        out = self.embedding(text)
        out = nn.utils.rnn.pack_padded_sequence(
            out, lengths.cpu(), enforce_sorted= False, batch_first= True
        )
        # The LSTM returns (output, (hidden, cell)); only the final hidden states are used
        _, (hidden, cell) = self.rnn(out)
        # hidden[-2] is the final forward state, hidden[-1] the final backward state
        out = torch.cat((hidden[-2, :, :],
                         hidden[-1, :, :]), dim= 1)
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        return out