# Dataset and DataLoader

**Month 2, Week 1** — Sequence Models

Training is only as fast as the pipeline that feeds it, so data loading deserves some care. PyTorch splits the job between two classes:

- `Dataset`: defines how to access data
- `DataLoader`: batches, shuffles, and parallelizes loading

## What You'll Learn

1. Wrapping tensors with the built-in `TensorDataset`
2. Batching and shuffling with `DataLoader`
3. Writing a custom `Dataset` class
4. Loading IMDB for sentiment classification

In [1]:
import torch
from torch.utils.data import Dataset, DataLoader, TensorDataset
print(f"PyTorch {torch.__version__}")

PyTorch 2.9.1


---

## 1. TensorDataset — Quick and Simple

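`TensorDataset` wraps one or more tensors that share the same first dimension. Indexing it returns the matching row of each tensor as a tuple.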

In [2]:
# Create data
X = torch.randn(100, 10)  # 100 samples, 10 features
y = torch.randint(0, 2, (100, 1)).float()  # binary labels

# Wrap in TensorDataset
dataset = TensorDataset(X, y)

print(f"Dataset size: {len(dataset)}")
print(f"First sample: X shape={dataset[0][0].shape}, y={dataset[0][1].item()}")

Dataset size: 100
First sample: X shape=torch.Size([10]), y=0.0
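
To make the wrapping concrete, here is a minimal sketch of a hand-written class that behaves like `TensorDataset(X, y)` for two tensors. The name `PairDataset` is hypothetical and exists only for this illustration; it is not part of PyTorch.

```python
import torch
from torch.utils.data import Dataset, TensorDataset


class PairDataset(Dataset):
    """Hypothetical hand-written stand-in for TensorDataset(X, y)."""

    def __init__(self, X, y):
        # Both tensors must index the same samples along dimension 0.
        assert X.size(0) == y.size(0), "X and y need the same number of rows"
        self.X, self.y = X, y

    def __len__(self):
        return self.X.size(0)

    def __getitem__(self, idx):
        # One sample = the idx-th row of every wrapped tensor.
        return self.X[idx], self.y[idx]


X = torch.randn(100, 10)
y = torch.randint(0, 2, (100, 1)).float()

manual = PairDataset(X, y)
wrapped = TensorDataset(X, y)

# The two datasets report the same length and return the same sample.
same = (torch.equal(manual[3][0], wrapped[3][0])
        and torch.equal(manual[3][1], wrapped[3][1]))
print(len(manual) == len(wrapped), same)
```

For plain in-memory tensors the built-in class is the simpler choice; a custom class earns its keep once each sample needs processing.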


---

## 2. DataLoader — Batching & Shuffling

In [3]:
# Create DataLoader
loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,       # Shuffle for training
    num_workers=0,      # 0 = load in the main process; >0 uses worker subprocesses
)

print(f"Number of batches: {len(loader)}")

# Iterate through batches
for i, (X_batch, y_batch) in enumerate(loader):
    print(f"Batch {i}: X={X_batch.shape}, y={y_batch.shape}")
    if i >= 2:
        print("...")
        break

Number of batches: 7
Batch 0: X=torch.Size([16, 10]), y=torch.Size([16, 1])
Batch 1: X=torch.Size([16, 10]), y=torch.Size([16, 1])
Batch 2: X=torch.Size([16, 10]), y=torch.Size([16, 1])
...
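
The batch count of 7 follows from ceil(100 / 16): six full batches plus a final batch of 4 samples. The sketch below checks that arithmetic and shows `drop_last=True`, which discards the short final batch. The dataset here is a stand-in built with placeholder labels, and shuffling is turned off so the sizes are easy to read.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 100 samples, 10 features, placeholder labels.
dataset = TensorDataset(torch.randn(100, 10), torch.zeros(100, 1))

keep_last = DataLoader(dataset, batch_size=16, shuffle=False)
drop_last = DataLoader(dataset, batch_size=16, shuffle=False, drop_last=True)

# Collect the size of every batch the default loader yields.
sizes = [xb.shape[0] for xb, _ in keep_last]

print(len(keep_last), sizes)  # 7 [16, 16, 16, 16, 16, 16, 4]
print(len(drop_last))         # 6
```

Dropping the last batch can help when a model or loss is sensitive to very small batches, at the cost of skipping a few samples each epoch.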


---

## 3. Custom Dataset Class

When each sample needs processing (tokenizing, padding, reading from disk), subclass `Dataset` and implement `__len__` and `__getitem__`.

In [4]:
class TextDataset(Dataset):
    """Custom dataset for text classification."""
    
    def __init__(self, texts, labels, vocab, max_len=100):
        self.texts = texts
        self.labels = labels
        self.vocab = vocab
        self.max_len = max_len
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        
        # Tokenize and convert to indices
        tokens = text.lower().split()
        indices = [self.vocab.get(t, 0) for t in tokens]  # 0 = unknown
        
        # Pad or truncate
        if len(indices) < self.max_len:
            indices += [1] * (self.max_len - len(indices))  # 1 = padding
        else:
            indices = indices[:self.max_len]
        
        # Token indices as int64 (what nn.Embedding expects), label as float32
        return torch.tensor(indices, dtype=torch.long), torch.tensor(label, dtype=torch.float32)

# Example usage
texts = ["this movie is great", "terrible film", "loved it"]
labels = [1, 0, 1]
vocab = {"<unk>": 0, "<pad>": 1, "this": 2, "movie": 3, "is": 4, "great": 5, 
         "terrible": 6, "film": 7, "loved": 8, "it": 9}

dataset = TextDataset(texts, labels, vocab, max_len=10)
print(f"Sample: {dataset[0]}")

Sample: (tensor([2, 3, 4, 5, 1, 1, 1, 1, 1, 1]), tensor(1.))
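
Padding every sample to a fixed `max_len` inside `__getitem__` is the simplest approach. An alternative, sketched here under assumptions, is to leave sequences unpadded and pad each batch to its own longest sequence in a custom `collate_fn`. The helper names `encode` and `collate_pad` are hypothetical; `pad_sequence` and the `collate_fn` argument are standard PyTorch.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_IDX = 1  # matches "<pad>" in the toy vocabulary
vocab = {"<unk>": 0, "<pad>": 1, "this": 2, "movie": 3, "is": 4, "great": 5,
         "terrible": 6, "film": 7, "loved": 8, "it": 9}
texts = ["this movie is great", "terrible film", "loved it"]
labels = [1, 0, 1]


def encode(text):
    """Hypothetical helper: whitespace-tokenize and map to indices, no padding."""
    return torch.tensor([vocab.get(t, 0) for t in text.lower().split()],
                        dtype=torch.long)


# A plain list of (sequence, label) pairs works as a map-style dataset.
samples = [(encode(t), torch.tensor(l, dtype=torch.float32))
           for t, l in zip(texts, labels)]


def collate_pad(batch):
    """Pad the sequences in one batch to that batch's longest sequence."""
    seqs, ys = zip(*batch)
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=PAD_IDX)
    return padded, torch.stack(ys)


loader = DataLoader(samples, batch_size=3, shuffle=False, collate_fn=collate_pad)
X_batch, y_batch = next(iter(loader))
print(X_batch.tolist())  # [[2, 3, 4, 5], [6, 7, 1, 1], [8, 9, 1, 1]]
print(y_batch.tolist())  # [1.0, 0.0, 1.0]
```

Per-batch padding wastes less computation on padding tokens when lengths vary widely, but batch shapes then differ from step to step.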


---

## 4. Loading IMDB Dataset

We load IMDB with our own data module instead of the `datasets` library, which had compatibility issues with Python 3.14 when this notebook was written.

In [5]:
# Use our custom data module
from sequence_models.data import load_imdb, build_vocab, IMDBDataset

# Load IMDB (use relative path from notebooks/ directory)
data = load_imdb("../data/aclImdb")
print(f"Train: {len(data['train']['texts']):,} samples")
print(f"Test: {len(data['test']['texts']):,} samples")

Train: 25,000 samples
Test: 25,000 samples
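
The source of `sequence_models.data.load_imdb` is not shown in this notebook. As a rough idea of what such a loader could do, the sketch below reads the standard `aclImdb` layout (`train/` and `test/`, each with `pos/` and `neg/` folders of `.txt` reviews) into the same nested dictionary shape used above. The function name `load_imdb_sketch` is hypothetical, and the real module may differ in details such as cleaning or ordering.

```python
from pathlib import Path


def load_imdb_sketch(root):
    """Hypothetical reader for the aclImdb folder layout.

    Returns {"train": {"texts": [...], "labels": [...]}, "test": {...}}
    with label 1 for reviews under pos/ and 0 for reviews under neg/.
    """
    root = Path(root)
    data = {}
    for split in ("train", "test"):
        texts, labels = [], []
        for folder, label in (("pos", 1), ("neg", 0)):
            # Sort for a deterministic order across runs and platforms.
            for path in sorted((root / split / folder).glob("*.txt")):
                texts.append(path.read_text(encoding="utf-8"))
                labels.append(label)
        data[split] = {"texts": texts, "labels": labels}
    return data
```

Called as `load_imdb_sketch("../data/aclImdb")`, it would return a dictionary indexed the same way as `data` in the cell above.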


In [6]:
# Build vocabulary and create dataset
vocab = build_vocab(data["train"]["texts"], max_vocab_size=10000)
print(f"Vocab size: {len(vocab):,}")

# Create PyTorch dataset
train_dataset = IMDBDataset(
    data["train"]["texts"], 
    data["train"]["labels"], 
    vocab, 
    max_len=256
)

# Sample
x, y = train_dataset[0]
print("\nSample review:")
print(f"  Label: {int(y.item())} ({'positive' if y.item() == 1 else 'negative'})")
print(f"  Sequence length: {x.shape[0]}")
print(f"  First 10 token indices: {x[:10].tolist()}")

# Create DataLoader
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
X_batch, y_batch = next(iter(loader))
print(f"\nBatch shapes: X={X_batch.shape}, y={y_batch.shape}")

Vocab size: 10,000

Sample review:
  Label: 1 (positive)
  Sequence length: 256
  First 10 token indices: [16, 3, 21, 12, 199, 61, 1379, 42, 276, 27]

Batch shapes: X=torch.Size([32, 256]), y=torch.Size([32])
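
`build_vocab` also comes from the custom module, so its internals are not visible here. A common way to cap a vocabulary is to keep the most frequent tokens and reserve indices for unknown and padding tokens. The sketch below does that, assuming the same conventions as the toy example earlier (`<unk>` = 0, `<pad>` = 1, lowercase whitespace tokenization). The name `build_vocab_sketch` is hypothetical and the real function may behave differently.

```python
from collections import Counter


def build_vocab_sketch(texts, max_vocab_size=10000):
    """Hypothetical frequency-based vocabulary builder.

    Keeps the most frequent tokens so that the total size, including the
    two special tokens, does not exceed max_vocab_size.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = {"<unk>": 0, "<pad>": 1}
    for token, _ in counts.most_common(max_vocab_size - len(vocab)):
        if token not in vocab:  # never overwrite a special token
            vocab[token] = len(vocab)
    return vocab


toy = build_vocab_sketch(["a b a", "a c b"], max_vocab_size=4)
print(toy)  # {'<unk>': 0, '<pad>': 1, 'a': 2, 'b': 3}
```

Tokens outside the kept set, such as `c` in the toy call, map to `<unk>` at lookup time via `vocab.get(token, 0)`.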
