# Day 34 – LSTM Text Classification
### Build a Deep Learning Sentiment Classifier Using an LSTM

Today we will:
- Tokenize text and build a vocabulary
- Convert text to padded index sequences
- Train an Embedding + LSTM model
- Perform binary sentiment classification

This is the deep learning continuation of Day 32.

In [2]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using device:', device)

Using device: cuda


In [37]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?dataset_version_number=1...


100%|██████████| 25.7M/25.7M [01:18<00:00, 344kB/s]

Extracting files...





Path to dataset files: C:\Users\vedav\.cache\kagglehub\datasets\lakshmi25npathi\imdb-dataset-of-50k-movie-reviews\versions\1


## 1. Review Dataset

In [3]:
# Path printed by kagglehub.dataset_download above (machine-specific; adjust for your system)
path = r"C:\Users\vedav\.cache\kagglehub\datasets\lakshmi25npathi\imdb-dataset-of-50k-movie-reviews\versions\1"

# Load the dataset
df = pd.read_csv(f"{path}/IMDB Dataset.csv")

# Map string labels to integers
df['label'] = df['sentiment'].map({'positive': 1, 'negative': 0})

# Drop the original sentiment column
df = df.drop('sentiment', axis=1)

# Balanced 10k subset: the first 5,000 positive and 5,000 negative reviews, shuffled
df = pd.concat([
    df[df['label'] == 1].head(5000),
    df[df['label'] == 0].head(5000)
]).sample(frac=1, random_state=42).reset_index(drop=True)

df


| | review | label |
|---|---|---|
| 0 | If another Hitler ever arises, it will be than... | 0 |
| 1 | This is the movie I've seen more times than an... | 1 |
| 2 | Debbie Vickers (Nell Schofield) and Sue Knight... | 1 |
| 3 | For pure gothic vampire cheese nothing can com... | 1 |
| 4 | The many comments made by others have been ver... | 1 |
| ... | ... | ... |
| 9995 | Yes, bad acting isn't only one thing to mentio... | 0 |
| 9996 | This movie features Charlie Spradling dancing ... | 0 |
| 9997 | I think my summary says it all. This MTV-ish a... | 0 |
| 9998 | It's unbelievable but the fourth is better tha... | 1 |


## 2. Cleaning Function

In [4]:
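def clean_text(text):
    # Lowercase, then delete every character that is not a letter or a space.
    # Characters are removed, not replaced by a space, so "I've" becomes "ive"
    # and "MTV-ish" becomes "mtvish".
    text = text.lower()
    text = re.sub(r'[^a-z ]', '', text)
    return text.strip()

df['clean'] = df['review'].apply(clean_text)
df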

| | review | label | clean |
|---|---|---|---|
| 0 | If another Hitler ever arises, it will be than... | 0 | if another hitler ever arises it will be thank... |
| 1 | This is the movie I've seen more times than an... | 1 | this is the movie ive seen more times than any... |
| 2 | Debbie Vickers (Nell Schofield) and Sue Knight... | 1 | debbie vickers nell schofield and sue knight j... |
| 3 | For pure gothic vampire cheese nothing can com... | 1 | for pure gothic vampire cheese nothing can com... |
| 4 | The many comments made by others have been ver... | 1 | the many comments made by others have been ver... |
| ... | ... | ... | ... |
| 9995 | Yes, bad acting isn't only one thing to mentio... | 0 | yes bad acting isnt only one thing to mention ... |
| 9996 | This movie features Charlie Spradling dancing ... | 0 | this movie features charlie spradling dancing ... |
| 9997 | I think my summary says it all. This MTV-ish a... | 0 | i think my summary says it all this mtvish ans... |
| 9998 | It's unbelievable but the fourth is better tha... | 1 | its unbelievable but the fourth is better than... |
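
The cleaning step deletes punctuation rather than replacing it with a space, which is why "MTV-ish" turns into "mtvish" in the output above. IMDB reviews also commonly contain `<br />` line-break tags, and the same rule reduces those to stray `br` tokens. The sketch below compares the notebook's function with a hypothetical variant, `clean_text_v2` (an assumption for illustration, not part of the original notebook), that strips tags and turns punctuation into spaces. Switching to it would change the vocabulary size and every result reported later, so treat it as an experiment rather than a drop-in fix.

```python
import re

def clean_text(text):
    # The notebook's rule: delete everything except letters and spaces
    text = text.lower()
    text = re.sub(r'[^a-z ]', '', text)
    return text.strip()

def clean_text_v2(text):
    # Hypothetical variant (not from the notebook)
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)   # replace HTML tags such as <br /> with a space
    text = re.sub(r'[^a-z ]', ' ', text)   # replace other non-letters with a space
    text = re.sub(r' +', ' ', text)        # collapse runs of spaces
    return text.strip()

sample = "Great!<br /><br />MTV-ish film."
original_result = clean_text(sample)
variant_result = clean_text_v2(sample)
print(repr(original_result))
print(repr(variant_result))
```

The trade-off is that contractions get split ("I've" becomes "i ve"), so neither rule is strictly better; it depends on which artifacts hurt the model more.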


## 3. Tokenization + Vocabulary Building

In [5]:
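# Whitespace tokenization of the cleaned reviews
tokenized = df['clean'].apply(lambda x: x.split())

# Index 0 is reserved for padding, index 1 for unknown words
word_to_idx = {'<PAD>': 0, '<UNK>': 1}

# Give each new word the next free index, in order of first appearance.
# Note: the vocabulary is built from all 10k reviews, before the train/test split.
for tokens in tokenized:
    for word in tokens:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)

vocab_size = len(word_to_idx)
vocab_size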

73245
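
A vocabulary of 73,245 entries from 10,000 reviews is large, and part of it is likely to be one-off tokens (typos, names, merged words such as "mtvish"). A common alternative is to keep only words that occur at least a few times and let the rest map to `<UNK>`. The sketch below is one way to do that with `collections.Counter`; the function name `build_vocab` and the `min_freq` threshold are assumptions for illustration, and the toy data stands in for `tokenized`.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=2):
    # Count every token across all reviews
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {'<PAD>': 0, '<UNK>': 1}
    # most_common() yields words from most to least frequent
    for word, freq in counts.most_common():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

# Toy stand-in for the notebook's `tokenized` series
toy_reviews = [["good", "movie"], ["bad", "movie"], ["good", "plot", "twist"]]
toy_vocab = build_vocab(toy_reviews, min_freq=2)
print(toy_vocab)
```

In the notebook this would be called on the training split only, for example `build_vocab(train_tokens, min_freq=2)` with a hypothetical `train_tokens` list, so that test-set words do not leak into the vocabulary.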

## 4. Convert Tokens → Indexed Sequences

In [6]:
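max_len = 128  # every review is padded or truncated to this many tokens

def encode(tokens):
    # Map words to indices; unseen words fall back to <UNK>
    seq = [word_to_idx.get(word, word_to_idx['<UNK>']) for word in tokens]
    if len(seq) < max_len:
        # Right-pad short reviews with <PAD>
        seq += [word_to_idx['<PAD>']] * (max_len - len(seq))
    else:
        # Keep only the first max_len tokens of long reviews
        seq = seq[:max_len]
    return seq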

df['seq'] = tokenized.apply(encode)
df[['clean','seq']]

| | clean | seq |
|---|---|---|
| 0 | if another hitler ever arises it will be thank... | [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1... |
| 1 | this is the movie ive seen more times than any... | [16, 167, 20, 172, 173, 174, 149, 87, 152, 175... |
| 2 | debbie vickers nell schofield and sue knight j... | [230, 231, 232, 233, 36, 234, 235, 236, 237, 1... |
| 3 | for pure gothic vampire cheese nothing can com... | [67, 330, 331, 332, 333, 158, 125, 334, 13, 20... |
| 4 | the many comments made by others have been ver... | [20, 340, 341, 90, 70, 342, 48, 343, 199, 344,... |
| ... | ... | ... |
| 9995 | yes bad acting isnt only one thing to mention ... | [844, 628, 424, 785, 365, 130, 162, 13, 2220, ... |
| 9996 | this movie features charlie spradling dancing ... | [16, 172, 1884, 4786, 69013, 3747, 11, 26, 281... |
| 9997 | i think my summary says it all this mtvish ans... | [155, 374, 351, 10383, 2033, 7, 266, 16, 73235... |
| 9998 | its unbelievable but the fourth is better than... | [453, 430, 159, 20, 6864, 167, 817, 152, 20, 7... |
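
Because the table truncates every sequence, the padding and truncation behaviour is hard to see. The small example below reproduces the same encoding rule on a toy vocabulary with a short length, so each case fits on one line. The function `encode_demo` and its explicit `vocab` and `max_len` parameters are assumptions made to keep the example self-contained; the notebook's `encode` reads both from globals.

```python
def encode_demo(tokens, vocab, max_len):
    # Map words to indices, falling back to <UNK>, and cut to max_len
    seq = [vocab.get(w, vocab['<UNK>']) for w in tokens][:max_len]
    # Right-pad with <PAD> up to max_len
    return seq + [vocab['<PAD>']] * (max_len - len(seq))

toy_vocab = {'<PAD>': 0, '<UNK>': 1, 'good': 2, 'movie': 3}

short = encode_demo(['good', 'movie'], toy_vocab, max_len=4)
long_ = encode_demo(['a', 'good', 'good', 'movie', 'movie'], toy_vocab, max_len=4)
print(short)   # short input: padded on the right with 0
print(long_)   # long input: truncated, unknown word 'a' mapped to 1
```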


## 5. Prepare Dataset + DataLoader

In [7]:
class ReviewDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = torch.tensor(sequences, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

X_train, X_test, y_train, y_test = train_test_split(
    df['seq'].tolist(),
    df['label'].tolist(),
    test_size=0.2,
    random_state=42,
    stratify=df['label'],
)

train_ds = ReviewDataset(X_train, y_train)
test_ds = ReviewDataset(X_test, y_test)

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=len(test_ds), shuffle=False)

len(train_loader), len(test_loader)

(125, 1)

## 6. Build LSTM Model for Classification

In [8]:
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(hidden_dim, 1)
    
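    def forward(self, x):
        x = self.embedding(x)            # (batch, seq_len, embed_dim)
        out, (h, c) = self.lstm(x)       # h: (num_layers, batch, hidden_dim)
        # h[-1] is the hidden state after the last time step. For reviews
        # shorter than max_len, that step is a <PAD> token.
        h = self.dropout(h[-1])
        return self.fc(h).squeeze(1)     # raw logits, shape (batch,)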

model = LSTMClassifier(vocab_size).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
model

LSTMClassifier(
  (embedding): Embedding(73245, 64)
  (lstm): LSTM(64, 128, batch_first=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=128, out_features=1, bias=True)
)
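
The model above reads the final hidden state after all 128 time steps, so for a short input the LSTM has stepped through a long run of `<PAD>` tokens before the classifier sees anything. One common way to avoid this in PyTorch is to pack the batch with `pack_padded_sequence`, so the LSTM stops at each review's true length. The class below is a sketch of that idea, not the notebook's model: `PackedLSTMClassifier` and its `pad_idx` argument are assumed names, and whether it improves accuracy on this dataset is untested here.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        # padding_idx keeps the <PAD> embedding fixed at zero
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # True length of each sequence = number of non-pad tokens (at least 1).
        # pack_padded_sequence expects the lengths on the CPU.
        lengths = (x != self.pad_idx).sum(dim=1).clamp(min=1).cpu()
        packed = pack_padded_sequence(
            self.embedding(x), lengths, batch_first=True, enforce_sorted=False
        )
        # h holds the state at each sequence's last real token
        _, (h, _) = self.lstm(packed)
        return self.fc(self.dropout(h[-1])).squeeze(1)

# Quick check: extra right-padding should not change the output
torch.manual_seed(0)
demo_model = PackedLSTMClassifier(vocab_size=10).eval()
with torch.no_grad():
    logits_a = demo_model(torch.tensor([[2, 3, 0, 0]]))
    logits_b = demo_model(torch.tensor([[2, 3, 0, 0, 0, 0, 0]]))
print(torch.allclose(logits_a, logits_b, atol=1e-6))
```

This assumes right-padding with index 0, as produced by `encode`. A simpler alternative with a similar effect is to pad on the left so the last time step is always a real token.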

## 7. Train the Model

In [9]:
epochs = 25
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        preds = model(X)
        loss = criterion(preds, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{epochs} - Loss: {avg_loss:.4f}")

Epoch 1/25 - Loss: 0.6939
Epoch 2/25 - Loss: 0.6883
Epoch 3/25 - Loss: 0.6820
Epoch 4/25 - Loss: 0.6736
Epoch 5/25 - Loss: 0.6793
Epoch 6/25 - Loss: 0.6789
Epoch 7/25 - Loss: 0.6708
Epoch 8/25 - Loss: 0.6442
Epoch 9/25 - Loss: 0.6260
Epoch 10/25 - Loss: 0.6624
Epoch 11/25 - Loss: 0.6692
Epoch 12/25 - Loss: 0.6683
Epoch 13/25 - Loss: 0.6228
Epoch 14/25 - Loss: 0.6500
Epoch 15/25 - Loss: 0.6188
Epoch 16/25 - Loss: 0.5194
Epoch 17/25 - Loss: 0.4218
Epoch 18/25 - Loss: 0.3227
Epoch 19/25 - Loss: 0.2392
Epoch 20/25 - Loss: 0.1625
Epoch 21/25 - Loss: 0.1103
Epoch 22/25 - Loss: 0.0757
Epoch 23/25 - Loss: 0.0541
Epoch 24/25 - Loss: 0.0537
Epoch 25/25 - Loss: 0.0408
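
The loop above tracks training loss only, so it cannot tell when the model starts memorizing the training set. A common remedy is to measure loss on a held-out validation split after each epoch and stop once it stops improving. The helper below sketches just the stopping rule; the class name `EarlyStopping`, the `patience` and `min_delta` parameters, and the loss values are all assumptions for illustration, not results from this notebook.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, val_loss):
        # An epoch counts as an improvement only if it beats the best loss by min_delta
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses, for illustration only
val_losses = [0.69, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print("Stopped at epoch:", stopped_at, "best loss:", stopper.best)
```

In the training loop, `stopper.step(val_loss)` would be called once per epoch after evaluating on the validation split, with a `break` when it returns `True`.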


## 8. Evaluate Model Accuracy

In [10]:
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for X, y in test_loader:
        X, y = X.to(device), y.to(device)
        preds = torch.sigmoid(model(X))
        predicted = (preds > 0.5).float()
        correct += (predicted == y).sum().item()
        total += y.size(0)

print("Accuracy:", correct/total)

Accuracy: 0.785
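
Accuracy alone does not show whether the errors lean toward one class. Counting true and false positives and negatives gives precision and recall as well. The function below is a minimal pure-Python sketch; `binary_metrics` is an assumed name and the labels are toy values, not the notebook's predictions.

```python
def binary_metrics(y_true, y_pred):
    # Confusion-matrix counts for labels in {0, 1}
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
        'accuracy': (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy labels for illustration
toy_true = [1, 1, 0, 0, 1, 0]
toy_pred = [1, 1, 1, 0, 0, 1]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

With the notebook's tensors, the inputs could be obtained with `y.int().tolist()` and `predicted.int().tolist()` inside the evaluation loop.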


## 9. Predict on Custom Inputs

In [13]:
def predict_sentiment(text):
    tokens = clean_text(text).split()
    seq = encode(tokens)
    seq = torch.tensor([seq], dtype=torch.long).to(device)
    with torch.no_grad():
        pred = torch.sigmoid(model(seq)).item()
    return "Positive" if pred > 0.5 else "Negative"

samples = [
    "Fantastic update, great experience!",
    "This app is full of bugs and crashes",
    "Works smoothly now, thank you team",
]

for s in samples:
    print(s, "-->", predict_sentiment(s))

Fantastic update, great experience! --> Positive
This app is full of bugs and crashes --> Positive
Works smoothly now, thank you team --> Positive


## Summary
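- Cleaned and tokenized 10,000 IMDB reviews and built a 73,245-word vocabulary
- Encoded each review as a padded or truncated sequence of 128 indices
- Trained an Embedding + LSTM classifier for 25 epochs (final training loss 0.0408)
- Reached 78.5% accuracy on the 2,000-review test split; the gap to the near-zero training loss suggests overfitting
- Tested custom predictions: all three app-review samples came back Positive, including the clearly negative "This app is full of bugs and crashes"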

**Deliverable:** `day34_lstm_text_classifier.ipynb`