# Datasets and DataLoaders

In this notebook, you'll learn how PyTorch organizes data for training. Instead of manually slicing tensors into batches, you'll use `Dataset` and `DataLoader`, the standard abstractions in most PyTorch projects.

**What you'll do:**
- Build a custom `Dataset` class from scratch
- Load a real dataset (MNIST) with torchvision
- Integrate `DataLoader` into a training loop
- Experiment with batch sizes and observe the effects
- Write a complete data pipeline for a CSV dataset

**For each exercise, PREDICT the output before running the cell.** Wrong predictions are more valuable than correct ones: they reveal gaps in your mental model.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
import numpy as np
import io
import csv

# Reproducibility
torch.manual_seed(42)
np.random.seed(42)

# For nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

---

## Exercise 1: Custom Dataset (Guided)

A PyTorch `Dataset` is a class that subclasses `torch.utils.data.Dataset` and implements three methods:
- `__init__`: Generate or load data
- `__len__`: Return the number of samples
- `__getitem__`: Return a single sample by index

You'll create a `LinearDataset` that generates data from `y = 2x + 1 + noise`.
Then wrap it in a `DataLoader` and iterate through one epoch.

**Before running, predict:** If you create a `DataLoader` with `batch_size=16` over a dataset of 200 samples, how many batches will you get per epoch? Will the last batch be the same size as the others?

In [None]:
class LinearDataset(Dataset):
    """Dataset for y = 2x + 1 + noise."""

    def __init__(self, n_samples=200, noise_std=0.5):
        # Reseeds the global RNG, so every instance holds identical data
        torch.manual_seed(42)
        self.x = torch.randn(n_samples, 1)
        self.y = 2 * self.x + 1 + torch.randn(n_samples, 1) * noise_std

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


# Create the dataset
dataset = LinearDataset(n_samples=200)
print(f'Dataset size: {len(dataset)}')
print(f'Single sample: x={dataset[0][0].item():.4f}, y={dataset[0][1].item():.4f}')

In [None]:
# Wrap in a DataLoader
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Iterate one epoch and print batch shapes
print(f'Number of batches per epoch: {len(loader)}')
print(f'Batch size: 16, Dataset size: {len(dataset)}\n')

for i, (x_batch, y_batch) in enumerate(loader):
    print(f'Batch {i+1:2d}: x shape = {x_batch.shape}, y shape = {y_batch.shape}')

print(f'\nTotal batches iterated: {i + 1}')
print(f'Last batch size: {x_batch.shape[0]} (may be smaller if dataset size is not divisible by batch_size)')

**What just happened:**
- `DataLoader` automatically split our 200 samples into 13 batches: 12 of size 16 and a final one of size 8
- Each batch is a tuple of `(x_batch, y_batch)` tensors
- `shuffle=True` randomizes the order each epoch
- The last batch is smaller whenever `len(dataset) % batch_size != 0`; here `200 % 16 = 8`
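The batch count follows from simple arithmetic, which the sketch below reproduces without PyTorch. The helper `batches_per_epoch` is a hypothetical name for illustration, not a PyTorch function; `drop_last` mirrors the real `DataLoader` argument of the same name.

```python
import math


def batches_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of batches one pass over the data yields (hypothetical helper)."""
    if drop_last:
        return n_samples // batch_size        # incomplete final batch is discarded
    return math.ceil(n_samples / batch_size)  # incomplete final batch is kept


n_samples, batch_size = 200, 16
n_batches = batches_per_epoch(n_samples, batch_size)
# Size of the final batch: the remainder, or a full batch if evenly divisible
last_batch = n_samples % batch_size or batch_size
n_batches_dropped = batches_per_epoch(n_samples, batch_size, drop_last=True)

print(n_batches)          # 13
print(last_batch)         # 8
print(n_batches_dropped)  # 12
```

Passing `drop_last=True` to `DataLoader` is one way to keep every batch the same size, at the cost of skipping a few samples each epoch.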

---

## Exercise 2: Loading MNIST (Guided)

PyTorch's `torchvision.datasets` provides pre-built `Dataset` classes for common datasets. You don't need to write `__len__` or `__getitem__`; they are already implemented.

Load MNIST, apply standard transforms, and inspect what comes out.

**Before running, predict:** After applying `ToTensor()` (scales to [0, 1]) and then `Normalize((0.1307,), (0.3081,))`, will the pixel values still be in [0, 1]? What range do you expect?

In [None]:
# Load MNIST with standard transforms
transform = transforms.Compose([
    transforms.ToTensor(),                          # PIL Image -> Tensor, scales to [0, 1]
    transforms.Normalize((0.1307,), (0.3081,))      # Normalize with MNIST mean and std
])

mnist_train = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

print(f'Dataset size: {len(mnist_train)}')
print(f'Single sample type: image={type(mnist_train[0][0])}, label={type(mnist_train[0][1])}')

In [None]:
# Create a DataLoader and inspect one batch
mnist_loader = DataLoader(mnist_train, batch_size=64, shuffle=True)

# Grab one batch
images, labels = next(iter(mnist_loader))

print(f'Batch image shape: {images.shape}')     # [64, 1, 28, 28]
print(f'Batch label shape: {labels.shape}')      # [64]
print(f'Image dtype: {images.dtype}')
print(f'Label dtype: {labels.dtype}')
print(f'Value range: min={images.min():.4f}, max={images.max():.4f}')
print(f'\nAfter normalization, values are centered around 0 (not [0, 1]).')
print(f'Mean of batch: {images.mean():.4f}')

In [None]:
# Visualize a few samples
fig, axes = plt.subplots(1, 8, figsize=(14, 2))
for i in range(8):
    axes[i].imshow(images[i].squeeze(), cmap='gray')
    axes[i].set_title(f'Label: {labels[i].item()}')
    axes[i].axis('off')
plt.suptitle('MNIST Batch Samples', fontsize=14)
plt.tight_layout()
plt.show()

**What just happened:**
- `ToTensor()` converted PIL images to float tensors in `[0, 1]`
- `Normalize((0.1307,), (0.3081,))` shifted and rescaled the values to roughly `[-0.42, 2.82]`
- The DataLoader collated 64 individual `(image, label)` pairs into batched tensors
- Image shape is `[batch, channels, height, width]`, PyTorch's standard format
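The post-normalization range can be checked by hand, since `Normalize` computes `(x - mean) / std` for each channel. A minimal sketch in plain Python, applied to the two extreme pixel values (the function name `normalize` is only for illustration):

```python
mean, std = 0.1307, 0.3081  # MNIST statistics used in the transform above


def normalize(pixel):
    """Same per-pixel arithmetic that transforms.Normalize applies."""
    return (pixel - mean) / std


lo = normalize(0.0)  # darkest possible pixel after ToTensor()
hi = normalize(1.0)  # brightest possible pixel after ToTensor()

print(f'min = {lo:.4f}, max = {hi:.4f}')  # min = -0.4242, max = 2.8215
```

The range is asymmetric around 0 because most MNIST pixels are background (value 0), which pulls the mean far below 0.5.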

---

## Exercise 3: DataLoader in a Training Loop (Supported)

Now integrate the `DataLoader` into a real training loop. Instead of passing the entire dataset as one tensor, you iterate over batches.

The template below trains `nn.Linear(1, 1)` on your `LinearDataset`. Fill in the parts marked `TODO`.

<details>
<summary>💡 Solution</summary>

The key insight is that the inner loop iterates over batches from the DataLoader, and each batch gets its own forward-backward-update cycle. This is different from Exercise 3 in the training loop notebook where the entire dataset was one tensor.

```python
for x_batch, y_batch in train_loader:
    predictions = model(x_batch)
    loss = criterion(predictions, y_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The DataLoader handles all the batching and shuffling. Your training loop code is the same 5-step pattern regardless of how the data is organized.

</details>

In [None]:
# Create dataset and dataloader
train_dataset = LinearDataset(n_samples=200, noise_std=0.5)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Model, loss, optimizer
model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
n_epochs = 20
epoch_losses = []

for epoch in range(n_epochs):
    running_loss = 0.0
    n_batches = 0

    for x_batch, y_batch in train_loader:  # <-- DataLoader handles batching
        # TODO: Forward pass
        predictions = model(x_batch)

        # TODO: Compute loss
        loss = criterion(predictions, y_batch)

        # TODO: Backward pass and update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        n_batches += 1

    avg_loss = running_loss / n_batches
    epoch_losses.append(avg_loss)

    if (epoch + 1) % 5 == 0:
        print(f'Epoch {epoch+1:2d}/{n_epochs} | Avg Loss: {avg_loss:.4f}')

# Check learned parameters
w = model.weight.item()
b = model.bias.item()
print(f'\nLearned: y = {w:.4f}x + {b:.4f}')
print(f'Target:  y = 2.0000x + 1.0000')

In [None]:
# Plot the loss curve
plt.plot(range(1, n_epochs + 1), epoch_losses, 'o-', linewidth=2, markersize=4)
plt.xlabel('Epoch')
plt.ylabel('Average Loss (MSE)')
plt.title('Training Loss with DataLoader')
plt.grid(alpha=0.3)
plt.show()

**Key difference from raw tensor training:**
- The inner `for x_batch, y_batch in train_loader` replaces manual slicing
- Each epoch sees the data in a different random order (because `shuffle=True`)
- The model updates multiple times per epoch (once per batch), not once
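For comparison, here is a rough sketch of the manual slicing that the `DataLoader` loop replaces. It assumes stand-in tensors `x` and `y` rather than the `LinearDataset` above, and it approximates `shuffle=True` with `torch.randperm`; it is not how `DataLoader` is implemented internally.

```python
import torch

torch.manual_seed(0)
x = torch.randn(200, 1)  # stand-in inputs (assumption, not the notebook's dataset)
y = 2 * x + 1            # stand-in targets

batch_size = 32
perm = torch.randperm(len(x))  # a fresh random order, drawn once per epoch

manual_batches = []
for start in range(0, len(x), batch_size):
    idx = perm[start:start + batch_size]  # indices for this batch
    manual_batches.append((x[idx], y[idx]))

sizes = [xb.shape[0] for xb, _ in manual_batches]
print(len(manual_batches), sizes[-1])  # 7 8
```

Each of the 7 batches would get its own forward-backward-update cycle, so 20 epochs amount to 140 parameter updates.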

---

## Exercise 4: Batch Size Experiments (Supported)

How does batch size affect training? You'll train the same model 4 times with different batch sizes:
- **Batch size 1** (stochastic gradient descent)
- **Batch size 32** (mini-batch, a common default)
- **Batch size 256** (larger than the 200-sample dataset, so it behaves exactly like full batch)
- **Full batch** (the entire dataset as one batch)

For each, record:
1. How many iterations (gradient updates) per epoch
2. The loss curve over 20 epochs

<details>
<summary>💡 Solution</summary>

The insight is that batch size controls a fundamental trade-off: smaller batches mean noisier gradient estimates but more parameter updates per epoch, while larger batches mean cleaner gradients but fewer updates.

```python
for bs in batch_sizes:
    loader = DataLoader(dataset, batch_size=bs, shuffle=True)
    # Same training loop as before; only the DataLoader changes
```

With batch_size=1 you get 200 updates per epoch (one per sample), while full-batch gives you 1 update per epoch. Mini-batch (32) is the standard compromise: enough noise for exploration, enough signal for stable progress.

</details>

In [None]:
experiment_dataset = LinearDataset(n_samples=200, noise_std=0.5)

batch_sizes = [1, 32, 256, len(experiment_dataset)]  # SGD, mini-batch, large-batch, full-batch
results = {}

for bs in batch_sizes:
    # Fresh model for each experiment
    torch.manual_seed(0)
    model = nn.Linear(1, 1)
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    loader = DataLoader(experiment_dataset, batch_size=bs, shuffle=True)

    epoch_losses = []
    iters_per_epoch = len(loader)

    for epoch in range(20):
        running_loss = 0.0
        n_batches = 0

        for x_batch, y_batch in loader:
            predictions = model(x_batch)
            loss = criterion(predictions, y_batch)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            n_batches += 1

        epoch_losses.append(running_loss / n_batches)

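    # Only the exact dataset size is labelled 'full-batch'; otherwise bs=256 would share the key and be overwritten
    label = 'full-batch' if bs == len(experiment_dataset) else f'bs={bs}'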
    results[label] = {
        'losses': epoch_losses,
        'iters_per_epoch': iters_per_epoch,
        'final_loss': epoch_losses[-1],
    }

    print(f'Batch size {bs:>4d} | Iters/epoch: {iters_per_epoch:>4d} | Final loss: {epoch_losses[-1]:.4f}')

In [None]:
# Plot all loss curves on one plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curves
for label, data in results.items():
    axes[0].plot(range(1, 21), data['losses'], 'o-', label=label, linewidth=2, markersize=3)

axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Average Loss (MSE)')
axes[0].set_title('Loss Curves by Batch Size')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Iterations per epoch comparison
labels = list(results.keys())
iters = [results[l]['iters_per_epoch'] for l in labels]
colors = plt.cm.Set2(np.linspace(0, 1, len(labels)))
bars = axes[1].bar(labels, iters, color=colors, edgecolor='white', linewidth=0.5)
axes[1].set_ylabel('Iterations per Epoch')
axes[1].set_title('Gradient Updates per Epoch')
axes[1].grid(alpha=0.3, axis='y')

for bar, val in zip(bars, iters):
    axes[1].text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
                 str(val), ha='center', va='bottom', fontsize=11)

plt.tight_layout()
plt.show()

**What you should observe:**

| Batch Size | Iters/Epoch | Noise in Gradient | Convergence |
|------------|-------------|-------------------|-------------|
| 1 (SGD) | 200 | Very high | Fast early, noisy |
| 32 | 7 | Moderate | Good balance |
| 256 | 1 | None (exceeds the 200 samples, so identical to full batch) | Smooth but slow |
| Full batch (200) | 1 | None | Smooth but slow: only 20 updates in 20 epochs |

- **Small batches** = noisy gradients = irregular loss curves, but more updates per epoch
- **Large batches** = clean gradients = smooth loss curves, but fewer updates per epoch
- **Mini-batch (32)** is the standard trade-off: enough noise for exploration, enough signal for progress
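The gradient-noise claim can be checked numerically. The sketch below is plain Python under stated assumptions: the data is regenerated with `random` (so it is not the same sample as `LinearDataset`), the model sits at `w = 0, b = 0`, and the names `grad_w` and `grad_std` are hypothetical. It draws many random batches of each size and measures how much the estimate of `dL/dw` varies.

```python
import random
import statistics

random.seed(0)
n = 200
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [2 * x + 1 + random.gauss(0, 0.5) for x in xs]


def grad_w(indices, w=0.0, b=0.0):
    """d(MSE)/dw over the given samples for the model y_hat = w * x + b."""
    total = sum(-2 * xs[i] * (ys[i] - (w * xs[i] + b)) for i in indices)
    return total / len(indices)


grad_std = {}
for bs in (1, 32, 200):
    # 500 random batches (sampled without replacement) per batch size
    estimates = [grad_w(random.sample(range(n), bs)) for _ in range(500)]
    grad_std[bs] = statistics.pstdev(estimates)
    print(f'bs={bs:>3d} | std of dL/dw estimate: {grad_std[bs]:.4f}')
```

The spread shrinks as the batch grows and is zero (up to floating-point rounding) for the full batch, because every full batch contains the same 200 samples.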

---

## Exercise 5: Custom Dataset for CSV Data (Independent)

Real data often comes from CSV files. Here you'll:
1. Create a small CSV dataset (inline using `io.StringIO`)
2. Write a custom `Dataset` that reads it and converts to tensors
3. Train a model with `DataLoader`

The dataset has 3 feature columns and 1 target column.

<details>
<summary>💡 Solution</summary>

The key insight is that a custom Dataset just needs to parse the data in `__init__` and return tensors from `__getitem__`. The CSV parsing is standard Python; the PyTorch-specific part is converting to tensors and implementing the three required methods.

```python
class CSVDataset(Dataset):
    def __init__(self, csv_string):
        reader = csv.reader(io.StringIO(csv_string))
        header = next(reader)  # skip header
        rows = [[float(val) for val in row] for row in reader]
        data = torch.tensor(rows, dtype=torch.float32)
        self.features = data[:, :-1]  # all columns except last
        self.targets = data[:, -1:]   # last column

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]
```

Once you have the Dataset, the DataLoader and training loop are identical to what you've already written. The Dataset abstraction separates "how to load one sample" from "how to batch and iterate."

</details>
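One detail in the solution is easy to miss: the target is sliced with `-1:` rather than `-1`. The sketch below, using a small made-up tensor, shows why the trailing colon matters when predictions have shape `[N, 1]`.

```python
import torch

# Made-up table: 5 rows, 3 feature columns + 1 target column
data = torch.arange(20, dtype=torch.float32).reshape(5, 4)

features = data[:, :-1]    # shape [5, 3]
targets_2d = data[:, -1:]  # shape [5, 1], keeps the column dimension
targets_1d = data[:, -1]   # shape [5], column dimension dropped

preds = torch.zeros(5, 1)  # same shape nn.Linear(3, 1) would produce

diff_ok = preds - targets_2d   # elementwise, shape [5, 1]
diff_bad = preds - targets_1d  # broadcasts to shape [5, 5]

print(diff_ok.shape)   # torch.Size([5, 1])
print(diff_bad.shape)  # torch.Size([5, 5])
```

With the 1-D target, a loss such as `nn.MSELoss` would average over a broadcast `[N, N]` grid instead of `N` matched pairs, so keeping the `[N, 1]` shape avoids a silent bug.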

In [None]:
# Create a small CSV dataset inline
# Target = 0.5 * feature_1 + 1.2 * feature_2 - 0.8 * feature_3 + 3.0 + noise

np.random.seed(42)
n_samples = 300
features = np.random.randn(n_samples, 3)
target = 0.5 * features[:, 0] + 1.2 * features[:, 1] - 0.8 * features[:, 2] + 3.0
target += np.random.randn(n_samples) * 0.3  # noise

# Build CSV string
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
writer.writerow(['feature_1', 'feature_2', 'feature_3', 'target'])
for i in range(n_samples):
    writer.writerow([f'{features[i, 0]:.4f}', f'{features[i, 1]:.4f}',
                     f'{features[i, 2]:.4f}', f'{target[i]:.4f}'])

csv_string = csv_buffer.getvalue()
print('Header plus first 5 rows of CSV:')
print('\n'.join(csv_string.splitlines()[:6]))

In [None]:
class CSVDataset(Dataset):
    """Custom Dataset that reads from a CSV string."""

    def __init__(self, csv_string):
        reader = csv.reader(io.StringIO(csv_string))
        header = next(reader)  # skip header

        rows = []
        for row in reader:
            rows.append([float(val) for val in row])

        data = torch.tensor(rows, dtype=torch.float32)
        self.features = data[:, :-1]  # all columns except last
        self.targets = data[:, -1:]   # last column

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]


# Create dataset and verify
csv_dataset = CSVDataset(csv_string)
print(f'Dataset size: {len(csv_dataset)}')
print(f'Feature shape: {csv_dataset[0][0].shape}')
print(f'Target shape:  {csv_dataset[0][1].shape}')

# Create DataLoader
csv_loader = DataLoader(csv_dataset, batch_size=32, shuffle=True)
x_batch, y_batch = next(iter(csv_loader))
print(f'\nBatch feature shape: {x_batch.shape}')  # [32, 3]
print(f'Batch target shape:  {y_batch.shape}')     # [32, 1]

In [None]:
# Train a small model on the CSV data
torch.manual_seed(42)
model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

n_epochs = 30
epoch_losses = []

for epoch in range(n_epochs):
    running_loss = 0.0
    n_batches = 0

    for x_batch, y_batch in csv_loader:
        predictions = model(x_batch)
        loss = criterion(predictions, y_batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        n_batches += 1

    avg_loss = running_loss / n_batches
    epoch_losses.append(avg_loss)

    if (epoch + 1) % 10 == 0:
        print(f'Epoch {epoch+1:2d}/{n_epochs} | Avg Loss: {avg_loss:.4f}')

# Check learned parameters vs true
w = model.weight.detach().squeeze().tolist()
b = model.bias.item()
print(f'\nLearned weights: [{w[0]:.4f}, {w[1]:.4f}, {w[2]:.4f}], bias: {b:.4f}')
print(f'True weights:    [0.5000, 1.2000, -0.8000], bias: 3.0000')

In [None]:
# Plot the loss curve
plt.plot(range(1, n_epochs + 1), epoch_losses, 'o-', linewidth=2, markersize=4)
plt.xlabel('Epoch')
plt.ylabel('Average Loss (MSE)')
plt.title('Training on CSV Dataset')
plt.grid(alpha=0.3)
plt.show()

---

## Key Takeaways

1. **`Dataset`** defines *what* your data is: implement `__len__` and `__getitem__`
2. **`DataLoader`** defines *how* to serve it: batching, shuffling, parallelism
3. **The training loop iterates over `DataLoader`**, not raw tensors; this is the standard PyTorch pattern
4. **Batch size** controls the noise/speed trade-off: small batches = noisy but frequent updates, large batches = smooth but few updates
5. **`torchvision.datasets`** provides ready-made `Dataset` classes for common benchmarks (MNIST, CIFAR, ImageNet, etc.)
6. **The same `Dataset`/`DataLoader` pattern works** whether your data is synthetic, from a CSV, from images on disk, or from a database
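
To illustrate the last takeaway, here is a minimal sketch of the same pattern over a database. Everything specific to it is an assumption made for the example: the `SQLiteDataset` class, the `samples` table, and its column names are hypothetical, and the data lives in an in-memory SQLite database.

```python
import sqlite3

import torch
from torch.utils.data import Dataset, DataLoader


class SQLiteDataset(Dataset):
    """Hypothetical Dataset that reads one row per sample from SQLite."""

    def __init__(self, conn):
        self.conn = conn
        # Cache primary keys so that index i always maps to the same row
        self.ids = [row[0] for row in conn.execute('SELECT id FROM samples ORDER BY id')]

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        x1, x2, y = self.conn.execute(
            'SELECT x1, x2, y FROM samples WHERE id = ?', (self.ids[idx],)
        ).fetchone()
        features = torch.tensor([x1, x2], dtype=torch.float32)
        target = torch.tensor([y], dtype=torch.float32)
        return features, target


# Build a tiny made-up table with 10 rows
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE samples (id INTEGER PRIMARY KEY, x1 REAL, x2 REAL, y REAL)')
conn.executemany(
    'INSERT INTO samples (x1, x2, y) VALUES (?, ?, ?)',
    [(float(i), float(2 * i), float(3 * i)) for i in range(10)],
)

db_dataset = SQLiteDataset(conn)
db_loader = DataLoader(db_dataset, batch_size=4, shuffle=False)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in db_loader]
print(batch_shapes)  # [((4, 2), (4, 1)), ((4, 2), (4, 1)), ((2, 2), (2, 1))]
```

The training loop would be unchanged from the earlier exercises. Sharing a single connection like this is only a reasonable choice with the default `num_workers=0`; worker processes would likely each need their own connection.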