### Text classification using LSTM

In this coding exercise, you will create a simple LSTM model using PyTorch to perform text classification on a dataset of short phrases. Your task is to fill in the missing parts of the code marked with `# TODO`.

You need to:

- Create a vocabulary to represent words as indices.
- Tokenize, encode, and pad the phrases.
- Convert the phrases and categories to PyTorch tensors.
- Instantiate the LSTM model with the vocabulary size, embedding dimensions, hidden dimensions, and output dimensions.
- Define the loss function and optimizer.
- Train the model for a number of epochs.
- Test the model on new phrases and print the category predictions.
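
The first two steps in the list can be sketched in plain Python before any tensors are involved. The helper names (`build_vocab`, `encode`, `pad`) and the reserved `<pad>` and `<unk>` indices are assumptions made for illustration; they are not part of the exercise's starter code.

```python
# Hypothetical helpers: build a word-to-index vocabulary, then encode and pad phrases.
PAD, UNK = 0, 1  # assumed reserved indices for padding and unknown words


def build_vocab(phrases):
    """Assign each distinct word an index, starting after the reserved ones."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for phrase in phrases:
        for word in phrase.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(phrase, vocab):
    """Map each word to its index; unseen words fall back to <unk>."""
    return [vocab.get(word, UNK) for word in phrase.lower().split()]


def pad(sequence, length):
    """Right-pad with PAD (or truncate) so every sequence has the same length."""
    return sequence[:length] + [PAD] * max(0, length - len(sequence))


sample = ["great goal scored", "amazing touchdown"]
vocab = build_vocab(sample)
max_len = max(len(p.split()) for p in sample)
encoded = [pad(encode(p, vocab), max_len) for p in sample]
print(encoded)  # [[2, 3, 4], [5, 6, 0]]
```

Unlike a bag-of-words vector, this representation keeps word order, which is the information an LSTM is designed to use.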

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
# Phrases (textual data) and their category labels (0 for sports, 1 for technology, 2 for food)
# Note: this dataset is far too small to train an LSTM realistically. Feel free to use
# a relevant data source or create your own dummy data for this exercise.
phrases = ["great goal scored", "amazing touchdown", "new phone release", "latest laptop model", "tasty pizza", "delicious burger"]
categories = [0, 0, 1, 1, 2, 2]

In [12]:
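# Note: max_features is derived from the longest phrase in characters (19), not words.
# The vocabulary has only 15 distinct words, so the cap has no effect here.
vectorizer = CountVectorizer(max_features=max(len(sentence) for sentence in phrases))

# TODO: Create a vocabulary to represent words as indices
# TODO: Tokenize, encode, and pad phrases
# CountVectorizer builds the vocabulary and returns one row of word counts per phrase.
# This is a bag-of-words encoding: word order is lost, and the rows are not padded
# sequences of word indices.
phrases_encod = vectorizer.fit_transform(phrases).toarray()
phrases_encod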

array([[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
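
To read the array above, it helps to know that `CountVectorizer` orders its columns alphabetically by word. The plain-Python sketch below rebuilds the same count matrix by hand. It assumes simple whitespace tokenization, which agrees with `CountVectorizer`'s default tokenizer here only because every word is lowercase and longer than one character.

```python
phrases = ["great goal scored", "amazing touchdown", "new phone release",
           "latest laptop model", "tasty pizza", "delicious burger"]

# Columns are the distinct words in alphabetical order.
columns = sorted({word for phrase in phrases for word in phrase.split()})
column_of = {word: i for i, word in enumerate(columns)}

# One row per phrase, counting how often each vocabulary word appears.
rows = []
for phrase in phrases:
    row = [0] * len(columns)
    for word in phrase.split():
        row[column_of[word]] += 1
    rows.append(row)

print(len(columns))  # 15
print(rows[0])       # [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print([columns[i] for i, count in enumerate(rows[0]) if count])  # ['goal', 'great', 'scored']
```

The first row marks "goal", "great", and "scored" in alphabetical column order, so the original word order of "great goal scored" cannot be recovered from it.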

In [18]:
# Convert phrases and categories to PyTorch tensors
inputs = torch.LongTensor(phrases_encod)
labels = torch.LongTensor(categories)
inputs

tensor([[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
        [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [14]:
# Define LSTM model
class PhraseClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # batch_first defaults to False, so the LSTM expects (seq_len, batch, features)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x: (seq_len, batch) tensor of integer indices
        embedded = self.embedding(x)
        # hidden is the final hidden state, shape (1, batch, hidden_dim)
        output, (hidden, _) = self.lstm(embedded)
        logits = self.fc(hidden.squeeze(0))  # (batch, output_dim)
        return logits
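
A quick way to check the `forward` pass is to trace tensor shapes through the same three layers. The sketch below uses standalone layers with made-up sizes (10 possible indices, sequences of length 4, a batch of 2), all of which are assumptions for illustration. Because `nn.LSTM` defaults to `batch_first=False`, the input is laid out as `(seq_len, batch)`.

```python
import torch
import torch.nn as nn

# Assumed toy sizes, chosen only to make the shapes easy to follow.
vocab_size, embedding_dim, hidden_dim, output_dim = 10, 8, 16, 3
seq_len, batch_size = 4, 2

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)  # batch_first=False by default
fc = nn.Linear(hidden_dim, output_dim)

# Random integer indices laid out as (seq_len, batch).
x = torch.randint(0, vocab_size, (seq_len, batch_size))

embedded = embedding(x)                  # (seq_len, batch, embedding_dim)
output, (hidden, cell) = lstm(embedded)  # hidden: (1, batch, hidden_dim)
logits = fc(hidden.squeeze(0))           # (batch, output_dim)

print(embedded.shape, hidden.shape, logits.shape)
# torch.Size([4, 2, 8]) torch.Size([1, 2, 16]) torch.Size([2, 3])
```

For a single-layer, unidirectional LSTM, `hidden[0]` is the same tensor as the last time step of `output`, which is why the classifier can use either one as its summary of the sequence.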

In [17]:
# TODO: Instantiate model and define loss and optimizer
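# vocab_size is the number of distinct words; output_dim is the number of categories (3)
model = PhraseClassifier(len(vectorizer.vocabulary_), 20, 50, len(set(categories)))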
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
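
`nn.CrossEntropyLoss` takes raw logits and integer class labels, applies a log-softmax internally, and by default averages the negative log-likelihood over the batch. That is why the model's `forward` returns logits without a softmax. The plain-Python sketch below reproduces the computation for a single example; the logit values are made up for illustration.

```python
import math


def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the target class."""
    # Subtract the max logit first for numerical stability.
    largest = max(logits)
    shifted = [z - largest for z in logits]
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return log_sum_exp - shifted[target]


# Made-up logits for the three categories (sports, technology, food).
confident = cross_entropy([4.0, 0.0, 0.0], target=0)
uniform = cross_entropy([0.0, 0.0, 0.0], target=0)

print(round(uniform, 4))    # 1.0986, i.e. ln(3): the loss of a uniform guess over 3 classes
print(confident < uniform)  # True
```

A loss near ln(3) therefore means the model is doing no better than guessing among the three categories, while values approaching zero mean it assigns nearly all probability to the correct class.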

In [24]:
# TODO: Train the model for a number of epochs
epochs = 100

for epoch in range(epochs):
    optimizer.zero_grad()
    output = model(inputs.t())  # transpose to (seq_len, batch), the layout nn.LSTM expects by default
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch: {epoch + 1}, Loss: {loss.item()}")

Epoch: 10, Loss: 0.3281322717666626
Epoch: 20, Loss: 0.24036552011966705
Epoch: 30, Loss: 0.17729516327381134
Epoch: 40, Loss: 0.12993194162845612
Epoch: 50, Loss: 0.0953078642487526
Epoch: 60, Loss: 0.06965262442827225
Epoch: 70, Loss: 0.0513526014983654
Epoch: 80, Loss: 0.039165105670690536
Epoch: 90, Loss: 0.031118815764784813
Epoch: 100, Loss: 0.025434084236621857


In [26]:
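# Test the model on new phrases
test_phrases = ["incredible match", "newest gadget", "yummy cake"]
# Caution: fit_transform refits the vectorizer on the test phrases, so these columns
# no longer correspond to the training vocabulary. vectorizer.transform(test_phrases)
# would reuse the training vocabulary, but none of these words appear in it, so
# every encoded row would be all zeros.
encoded_test_phrases = vectorizer.fit_transform(test_phrases).toarray()

test_inputs = torch.LongTensor(encoded_test_phrases)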

In [28]:
with torch.no_grad():
    test_predictions = torch.argmax(model(test_inputs.t()), dim=1)
    print("Test predictions:", test_predictions)

Test predictions: tensor([1, 2, 0])
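
One caveat when reading these predictions: none of the words in the test phrases occur in the training phrases, so a vocabulary built from the training data alone carries no information about them. The plain-Python check below makes that explicit (whitespace tokenization is an assumption here).

```python
train_phrases = ["great goal scored", "amazing touchdown", "new phone release",
                 "latest laptop model", "tasty pizza", "delicious burger"]
test_phrases = ["incredible match", "newest gadget", "yummy cake"]

train_vocab = {word for phrase in train_phrases for word in phrase.split()}

# For each test phrase, list the words the training vocabulary has never seen.
unseen = {phrase: [w for w in phrase.split() if w not in train_vocab]
          for phrase in test_phrases}

for phrase, words in unseen.items():
    print(f"{phrase!r}: {len(words)} of {len(phrase.split())} words unseen")
```

Every word in every test phrase is unseen, so these predictions say little about what the model learned. Test phrases that share at least some words with the training data give a more informative check.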
