# NLP with PyTorch + torchtext — example

This notebook contains an independent implementation inspired by the DataCamp article "NLP with PyTorch: A comprehensive guide" (https://www.datacamp.com/tutorial/nlp-with-pytorch-a-comprehensive-guide). The code is an original implementation that uses torch and torchtext to train a simple text classifier on the AG_NEWS dataset. Although some of the PyTorch tooling it relies on is now deprecated, the notebook gives a practical introduction to the basics of training a model with PyTorch.

Attribution: concepts and high-level approach are inspired by the DataCamp article, but all code in this notebook is independently written and adapted for this lesson.

In [1]:
# Optional: try to import torch and torchtext; if missing, attempt to install minimal packages
# Note: installing torch via pip can be large and may require a specific wheel for CUDA/OS.
# If pip install fails, install PyTorch following official instructions for your environment.

import sys
try:
    import torch, torchtext
    print("torch version:", torch.__version__)
    print("torchtext version:", torchtext.__version__)
except Exception as e:
    print("torch or torchtext import failed:\n", e)
    print("Attempting to install torchtext (and torch if feasible). This may take a while.")
    import subprocess
    cmd = [sys.executable, "-m", "pip", "install", "--upgrade", "pip"]
    subprocess.run(cmd, check=False)
    # Try to install torchtext and a CPU-compatible torch if not present
    packages = ["torchtext==0.18.0", "torch==2.3.0", "torchdata==0.9.0", "portalocker>=2.0.0"]
    subprocess.run([sys.executable, "-m", "pip", "install", *packages], check=False)
    print("Install attempts finished. Restart the kernel or re-run this cell and imports.")

torch version: 2.3.0+cu121
torchtext version: 0.18.0+cpu


This cell imports required Python libraries and sets the compute device (CPU or CUDA). It includes torch, torchtext, and utilities used later for building the dataset, model, and training loop.

In [None]:
# Imports (only the modules that later cells actually use)
import time
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device:', device)

Device: cpu




This cell prepares tokenization and builds the vocabulary from the training split of the AG_NEWS dataset. It also defines small helper pipelines to convert raw text and labels into numerical tensors used by the model.

In [3]:
# Prepare tokenizer, vocabulary and label mapping

tokenizer = get_tokenizer('basic_english')  # small, effective tokenizer

# Build vocabulary from the training split

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

# Use streaming iterator to avoid loading everything in memory
train_iter = AG_NEWS(split='train')
# build vocab
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

num_tokens = len(vocab)
print('Vocab size:', num_tokens)

# Helper to map raw text -> tensor of token IDs
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1  # AG_NEWS labels are 1..4, convert to 0..3

################################################################################
The 'datapipes', 'dataloader2' modules are deprecated and will be removed in a
future torchdata release! Please see https://github.com/pytorch/data/issues/1196
to learn more and leave feedback.
################################################################################



Vocab size: 95811
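
To make the vocabulary step concrete, here is a minimal torch-free sketch of what a vocabulary with a default index does: known tokens map to their own IDs and unseen tokens fall back to the <unk> ID. The helper names (build_toy_vocab, encode) and the toy corpus are hypothetical, and the frequency-based ordering is an assumption made for this illustration rather than a description of torchtext internals.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map each token to an integer ID: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by frequency (high to low), breaking ties alphabetically for determinism.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + [tok for tok in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(tokens, stoi, unk_token="<unk>"):
    """Look up token IDs, falling back to the <unk> ID for unseen tokens."""
    unk_id = stoi[unk_token]
    return [stoi.get(tok, unk_id) for tok in tokens]

# Hypothetical two-sentence corpus, already tokenized by whitespace.
toy_corpus = [
    "stocks fall as oil prices rise".split(),
    "oil prices hit record high".split(),
]
toy_stoi = build_toy_vocab(toy_corpus)
toy_ids = encode("oil prices surprise markets".split(), toy_stoi)
print(toy_ids)  # [1, 2, 0, 0]
```

The two words that never appeared in the toy corpus both map to ID 0, which is the role vocab.set_default_index(vocab["<unk>"]) plays in the cell above.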


This cell defines the collate function used by the PyTorch DataLoader. The collate function converts a batch of raw examples into a packed format (flattened token tensor + offsets) suitable for an EmbeddingBag-based model, and converts labels to tensors.

In [4]:
# Collate function for DataLoader using EmbeddingBag-style batching

def collate_batch(batch):
    label_list = []
    text_list = []
    offsets = [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed = torch.tensor(text_pipeline(_text), dtype=torch.long)
        text_list.append(processed)
        offsets.append(processed.size(0))
    labels = torch.tensor(label_list, dtype=torch.long)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text_list)
    return text.to(device), offsets.to(device), labels.to(device)

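# Quick test of the collate function on a tiny subset.
# islice stops after 8 examples instead of materializing the whole training split.
from itertools import islice
small_train = list(islice(AG_NEWS(split='train'), 8))
text, offsets, labels = collate_batch(small_train)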
print('text size', text.size())
print('offsets', offsets)
print('labels', labels)

text size torch.Size([338])
offsets tensor([  0,  29,  71, 111, 151, 194, 242, 289])
labels tensor([2, 2, 2, 2, 2, 2, 2, 2])
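
The flattened-tokens-plus-offsets layout can be sketched without torch. The hypothetical helpers below (pack_bags, unpack_bags) use plain lists and assume at least one sequence per batch; they mirror the bookkeeping in collate_batch, where each offset marks the position in the flat token list at which an example starts.

```python
from itertools import accumulate

def pack_bags(sequences):
    """Flatten variable-length ID lists and record where each one starts."""
    flat = [tok for seq in sequences for tok in seq]
    lengths = [len(seq) for seq in sequences]
    # Start offsets: 0, len0, len0+len1, ... (the last length is not needed).
    offsets = list(accumulate([0] + lengths[:-1]))
    return flat, offsets

def unpack_bags(flat, offsets):
    """Recover the original sequences from the flat list and start offsets."""
    ends = offsets[1:] + [len(flat)]
    return [flat[start:end] for start, end in zip(offsets, ends)]

toy_sequences = [[5, 8, 2], [7], [4, 4, 9, 1]]
toy_flat, toy_offsets = pack_bags(toy_sequences)
print(toy_flat)     # [5, 8, 2, 7, 4, 4, 9, 1]
print(toy_offsets)  # [0, 3, 4]
```

Read the same way, the output above says the first example has 29 tokens (offsets 0 to 29), the second has 42, and the last has 338 - 289 = 49.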


This cell shows how the dataset is materialized and wrapped into DataLoader objects for training and testing. It converts streaming iterators to lists and configures batching and shuffling.

In [5]:
# Create DataLoaders for train and test
batch_size = 64
train_dataset = AG_NEWS(split='train')
test_dataset = AG_NEWS(split='test')

# DataLoaders: wrap the iterators with list to avoid streaming behavior in DataLoader
train_list = list(train_dataset)
test_list = list(test_dataset)

train_dataloader = DataLoader(train_list, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_list, batch_size=batch_size, shuffle=False, collate_fn=collate_batch)

print('Train batches:', len(train_dataloader), 'Test batches:', len(test_dataloader))

Train batches: 1875 Test batches: 119
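
The batch counts printed above follow from simple arithmetic on the split sizes. The small helper below (num_batches, a hypothetical name) is a sketch of that calculation for a map-style dataset, with drop_last mirroring the DataLoader option of the same name.

```python
import math

def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields over a list of this size."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)

# AG_NEWS split sizes: 120,000 training and 7,600 test examples.
print(num_batches(120_000, 64))  # 1875
print(num_batches(7_600, 64))    # 119
# The final test batch is smaller than the others.
print(7_600 - 118 * 64)          # 48
```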


This cell defines a compact text classification model using an EmbeddingBag layer followed by a linear classifier. EmbeddingBag looks up the embedding of every token in an example and aggregates them into a single vector (the mean, in the default mode used here), which works well for short text classification tasks.

In [6]:
# Define a simple text classifier using EmbeddingBag

class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)


num_class = 4
embed_dim = 64
model = TextClassificationModel(num_tokens, embed_dim, num_class).to(device)
print(model)

TextClassificationModel(
  (embedding): EmbeddingBag(95811, 64, mode='mean')
  (fc): Linear(in_features=64, out_features=4, bias=True)
)
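
As a rough illustration of what mean-mode pooling computes, the sketch below averages embedding rows per bag using plain Python lists. It is not how torch implements EmbeddingBag; the function name mean_pool_bags and the hand-picked table values are hypothetical, and the sketch assumes no bag is empty.

```python
def mean_pool_bags(embedding_table, flat_ids, offsets):
    """Average the embedding rows of each bag (assumes every bag is non-empty)."""
    dim = len(embedding_table[0])
    ends = offsets[1:] + [len(flat_ids)]
    pooled = []
    for start, end in zip(offsets, ends):
        rows = [embedding_table[i] for i in flat_ids[start:end]]
        pooled.append([sum(row[d] for row in rows) / len(rows) for d in range(dim)])
    return pooled

# Toy table: 4 tokens, 2 dimensions each (values chosen by hand).
toy_table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_pooled = mean_pool_bags(toy_table, flat_ids=[1, 2, 3, 3], offsets=[0, 2])
print(toy_pooled)  # [[2.0, 3.0], [5.0, 6.0]]

# Parameter count of the model printed above: embedding table + linear weights + bias.
n_params = 95811 * 64 + 64 * 4 + 4
print(n_params)  # 6132164
```

Almost all of the roughly 6.1 million parameters sit in the embedding table, which is why the model stays cheap to evaluate: each example only touches the rows for its own tokens.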


This cell contains the loss function, optimizer, and the training/validation loops. The training loop iterates over batches, computes loss and backpropagates, while the evaluation loop measures accuracy on a held-out set.

In [7]:
# Training and evaluation utilities

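# Cross-entropy loss for the 4-way classification task; plain SGD accepts the
# sparse gradients produced by EmbeddingBag(sparse=True).
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=5.0)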

def train_epoch(model, dataloader, optimizer, criterion):
    """Run one pass over dataloader, updating the model, and return training accuracy."""
    model.train()
    total_acc, total_count = 0, 0

    for idx, (text, offsets, labels) in enumerate(dataloader):
        optimizer.zero_grad()
        output = model(text, offsets)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

        pred = output.argmax(1)
        total_acc += (pred == labels).sum().item()
        total_count += labels.size(0)

    return total_acc / total_count


def evaluate(model, dataloader, criterion):
    model.eval()
    total_acc, total_count = 0, 0
    with torch.no_grad():
        for idx, (text, offsets, labels) in enumerate(dataloader):
            output = model(text, offsets)
            loss = criterion(output, labels)
            pred = output.argmax(1)
            total_acc += (pred == labels).sum().item()
            total_count += labels.size(0)
    return total_acc / total_count
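
To show what the loss and accuracy bookkeeping amounts to, here is a torch-free sketch. It computes softmax cross-entropy per example, averages it over the batch (the default reduction of nn.CrossEntropyLoss), and counts argmax matches. The function names and the toy logits are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax over the logits."""
    # Subtract the max logit for numerical stability; the result is unchanged.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def batch_loss_and_accuracy(batch_logits, labels):
    """Mean cross-entropy and fraction of argmax predictions that match the labels."""
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    preds = [max(range(len(z)), key=z.__getitem__) for z in batch_logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return sum(losses) / len(losses), correct / len(labels)

# Hand-picked logits for two examples and four classes.
toy_logits = [[2.0, 0.5, -1.0, 0.0], [0.1, 0.2, 3.0, -0.5]]
confident_loss, acc = batch_loss_and_accuracy(toy_logits, labels=[0, 2])
wrong_loss, wrong_acc = batch_loss_and_accuracy(toy_logits, labels=[1, 3])
print(acc, wrong_acc)  # 1.0 0.0
print(confident_loss < wrong_loss)  # True
```

A useful reference point: with four classes and uniform logits the loss is ln 4, about 1.386, so a loss well below that indicates the model has learned something.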

Run a short training cycle using the training and evaluation utilities defined above. This cell performs a quick sanity run for a small number of epochs to verify the pipeline is working end-to-end.

In [9]:
# Run a short training loop (1-3 epochs) to verify everything works

n_epochs = 2
for epoch in range(n_epochs):
    train_acc = train_epoch(model, train_dataloader, optimizer, criterion)
    test_acc = evaluate(model, test_dataloader, criterion)
    print(f'Epoch: {epoch+1}, Train acc: {train_acc:.4f}, Test acc: {test_acc:.4f}')

print('Done training (short run).')

Epoch: 1, Train acc: 0.9056, Test acc: 0.8964
Epoch: 2, Train acc: 0.9173, Test acc: 0.9071
Done training (short run).


## Notes / next steps

- This example builds a vocabulary from the AG_NEWS training split (95,811 entries, including the <unk> token) and trains a simple EmbeddingBag classifier on it.
- Suggested exercise: modify the training loop to run for 10 epochs and plot the training and test accuracy versus epoch.
- For better performance:
  - Use pretrained embeddings or larger embed_dim.
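  - Use regularization, LR scheduling, or a different optimizer. Note that torch.optim.Adam rejects the sparse gradients produced by EmbeddingBag(sparse=True), so either set sparse=False or use torch.optim.SparseAdam for the embedding parameters.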
  - Extend the text preprocessing: the basic_english tokenizer already lowercases, so try stopword filtering or subword tokenization.
  - Explore transformer-based models with Hugging Face for state-of-the-art performance.
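
For the suggested plotting exercise, one possible starting point is to collect the per-epoch accuracies in a dictionary. The sketch below is an assumption about how that could be organized, not part of the original notebook: run_with_history is a hypothetical helper, and the stand-in callables return placeholder numbers (not measurements) so the block runs without a model.

```python
def run_with_history(train_fn, eval_fn, n_epochs):
    """Call train_fn and eval_fn once per epoch and collect the returned accuracies."""
    history = {"epoch": [], "train_acc": [], "test_acc": []}
    for epoch in range(1, n_epochs + 1):
        history["epoch"].append(epoch)
        history["train_acc"].append(train_fn())
        history["test_acc"].append(eval_fn())
    return history

# Placeholder values standing in for real accuracies. In the notebook the callables would be
#   lambda: train_epoch(model, train_dataloader, optimizer, criterion)
#   lambda: evaluate(model, test_dataloader, criterion)
fake_train = iter([0.5, 0.6, 0.7])
fake_test = iter([0.4, 0.5, 0.6])
history = run_with_history(lambda: next(fake_train), lambda: next(fake_test), n_epochs=3)
print(history["epoch"])     # [1, 2, 3]
print(history["test_acc"])  # [0.4, 0.5, 0.6]
```

With matplotlib installed, the two curves can then be drawn with plt.plot(history["epoch"], history["train_acc"]) and plt.plot(history["epoch"], history["test_acc"]).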

