Training large-scale machine learning models efficiently across multiple GPUs or machines relies on data parallelism and model parallelism. This notebook shows how to implement both in PyTorch.
**Data Parallelism**

Data parallelism splits each batch across multiple GPUs and runs a full copy of the same model on every GPU; the gradients are then combined so that all copies stay in sync. PyTorch provides two wrappers for this: torch.nn.DataParallel and torch.nn.parallel.DistributedDataParallel.
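
To see why splitting the data does not change the result, the following framework-free sketch compares a full-batch gradient with the average of per-shard gradients on a toy one-parameter model. The helper names (`mse_grad`, `split_evenly`, `data_parallel_grad`) and the data are made up for illustration, and the sketch assumes every worker receives a shard of the same size.

```python
# Framework-free sketch: data-parallel gradient averaging on a 1-D linear model.
# Model: y_hat = w * x, loss = mean((w * x - y)^2) over the samples.

def mse_grad(w, xs, ys):
    """Gradient of the mean squared error with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def split_evenly(items, num_workers):
    """Split a list into num_workers contiguous shards of equal size."""
    assert len(items) % num_workers == 0, "this sketch assumes equal shards"
    shard = len(items) // num_workers
    return [items[i * shard:(i + 1) * shard] for i in range(num_workers)]

def data_parallel_grad(w, xs, ys, num_workers):
    """Each hypothetical worker computes a gradient on its shard; then average."""
    x_shards = split_evenly(xs, num_workers)
    y_shards = split_evenly(ys, num_workers)
    local_grads = [mse_grad(w, xs_i, ys_i) for xs_i, ys_i in zip(x_shards, y_shards)]
    return sum(local_grads) / num_workers

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

full_grad = mse_grad(w, xs, ys)
parallel_grad = data_parallel_grad(w, xs, ys, num_workers=4)
print(full_grad, parallel_grad)  # The two values agree
```

With equal shard sizes and a loss that is averaged over samples, the mean of the per-shard gradients is the full-batch gradient, so every model copy can apply the same update.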

**1. Using torch.nn.DataParallel**

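DataParallel is the simpler option: it needs a one-line change and runs in a single process with one thread per GPU. That design restricts it to single-node multi-GPU setups and limits its performance, because the threads contend for the Python GIL (Global Interpreter Lock) and the model is copied to every GPU again on each forward pass.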

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
from torch.utils.data import DataLoader, Dataset

# Define a simple dataset and model
class SimpleDataset(Dataset):
    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        return torch.randn(3, 224, 224), torch.tensor(0)

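# Use the GPU(s) if available; DataParallel expects the model's parameters
# to be on the first visible GPU before the forward pass
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = models.resnet18()  # Randomly initialized, no pretrained weights
model = nn.DataParallel(model.to(device))  # Wrap the model with DataParallel

# Define a simple DataLoader
dataset = SimpleDataset(size=100)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Define optimizer and loss function
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(5):
    for inputs, targets in dataloader:
        # Outputs are gathered on `device`, so the targets must be there too
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)  # The batch is split along dim 0 across the GPUs
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
    # Reports the loss of the last batch in the epoch
    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")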
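
How a batch is divided matters when the batch size is not a multiple of the GPU count. The sketch below reproduces the splitting rule of `torch.chunk` in plain Python, on the assumption that DataParallel scatters inputs along the batch dimension with the same rule. The helper `chunk_sizes` is hypothetical and only computes sizes; it moves no data.

```python
import math

def chunk_sizes(batch_size, num_gpus):
    """Sizes produced by splitting batch_size items the way torch.chunk does."""
    per_chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(per_chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

# A dataset of 100 samples with batch_size=32 yields batches of 32, 32, 32 and 4
for batch in (32, 4):
    for gpus in (2, 3, 4):
        print(f"batch={batch}, gpus={gpus} -> {chunk_sizes(batch, gpus)}")
```

Under this rule a small final batch can leave some GPUs with no work at all, which is one reason to pick a batch size that divides evenly or to set `drop_last=True` on the DataLoader.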

**2. Using torch.nn.parallel.DistributedDataParallel**

DistributedDataParallel (DDP) runs one process per GPU and synchronizes gradients with all-reduce during the backward pass. It scales from a single node to multi-node, multi-GPU setups, and it is preferred over DataParallel for large-scale training.

**Setting Up Distributed Training**

To use DDP, you need to initialize a process group and spawn multiple processes. Here’s an example:

In [None]:
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
from torch.utils.data import DataLoader, Dataset, DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
import os

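# Same toy dataset as in the DataParallel example, repeated so the script is standalone
class SimpleDataset(Dataset):
    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        return torch.randn(3, 224, 224), torch.tensor(0)

# Initialize the distributed process group
def init_process_group(backend='nccl'):
    # 'env://' reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment
    dist.init_process_group(backend=backend, init_method='env://')

def main():
    # Initialize distributed process group
    init_process_group()

    # Bind this process to its own GPU; torchrun sets LOCAL_RANK for each process.
    # Without this, every process would default to GPU 0.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Define model, dataset, and dataloader
    model = models.resnet18().cuda()  # Randomly initialized, no pretrained weights
    model = DDP(model, device_ids=[local_rank])

    dataset = SimpleDataset(size=100)
    sampler = DistributedSampler(dataset)  # Gives each process a distinct shard
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)  # 32 per process

    # Define optimizer and loss function
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    # Training loop
    for epoch in range(5):
        sampler.set_epoch(epoch)  # Ensure data is shuffled differently for each epoch
        for inputs, targets in dataloader:
            inputs, targets = inputs.cuda(), targets.cuda()
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()  # Gradients are all-reduced across processes here
            optimizer.step()
        if dist.get_rank() == 0:
            print(f"Epoch {epoch + 1}, Loss: {loss.item()}")

    # Release the process group's resources before exiting
    dist.destroy_process_group()

if __name__ == "__main__":
    main()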
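
The sampler is what keeps the processes from training on the same samples. As a rough illustration, the plain-Python sketch below mimics how DistributedSampler assigns indices when shuffling is off and `drop_last` is false: the index list is padded to a multiple of the world size, then each rank takes every `world_size`-th index. The function `shard_indices` is a hypothetical stand-in, not part of PyTorch.

```python
import math

def shard_indices(dataset_size, world_size, rank):
    """Indices one rank would see, without shuffling and without drop_last."""
    num_samples = math.ceil(dataset_size / world_size)  # samples per rank
    total_size = num_samples * world_size
    # Pad by repeating indices from the start so every rank gets the same count
    repeats = math.ceil(total_size / dataset_size)
    indices = (list(range(dataset_size)) * repeats)[:total_size]
    # Each rank takes every world_size-th index, starting at its own rank
    return indices[rank:total_size:world_size]

# The 100-sample dataset above on 4 processes: 25 samples per rank, no overlap
for rank in range(4):
    shard = shard_indices(100, 4, rank)
    print(rank, len(shard), shard[:5])

# An uneven case: 10 samples on 4 processes are padded to 12
print([shard_indices(10, 4, rank) for rank in range(4)])
```

With shuffling enabled, the index list is permuted first using a seed that depends on the epoch, which is why the training loop calls `sampler.set_epoch(epoch)`.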


**Launching the Distributed Training**

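Use torchrun to start the training script with one process per GPU. The older torch.distributed.launch module is deprecated in favor of torchrun; if you still use it, pass --use_env so that the local rank is exported as the LOCAL_RANK environment variable rather than passed as a command-line argument.

In [None]:
# Using torch.distributed.launch (deprecated)
! python -m torch.distributed.launch --use_env --nproc_per_node=4 script.py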

# Using torchrun (recommended)
! torchrun --nproc_per_node=4 script.py
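
The launcher communicates with each process through environment variables such as RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. A small helper like the one below can make a script easier to debug. The function name `read_dist_env` and the fallback values are assumptions chosen so the script also runs as a single process without a launcher.

```python
import os

def read_dist_env(environ=os.environ):
    """Collect the variables the launcher exports for each process.

    The fallback values are assumptions for a single-process run.
    """
    return {
        "rank": int(environ.get("RANK", 0)),               # global process index
        "local_rank": int(environ.get("LOCAL_RANK", 0)),   # index within this node
        "world_size": int(environ.get("WORLD_SIZE", 1)),   # total number of processes
        "master_addr": environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(environ.get("MASTER_PORT", 29500)),
    }

print(read_dist_env())
```

Printing this dictionary at the start of `main()` is a quick way to confirm that every process received a distinct rank.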


**Model Parallelism**

Model parallelism splits a model across multiple GPUs. This is useful for very large models that cannot fit into the memory of a single GPU.

**Example: Simple Model Parallelism**

Here’s a simple example where different layers of a model are placed on different GPUs:

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

class ModelParallel(nn.Module):
    def __init__(self):
        super(ModelParallel, self).__init__()
        self.part1 = nn.Linear(1024, 2048).cuda(0)
        self.part2 = nn.Linear(2048, 1024).cuda(1)

    def forward(self, x):
        x = x.cuda(0)
        x = self.part1(x)
        x = x.cuda(1)
        x = self.part2(x)
        return x

# Initialize model, optimizer, and loss function
model = ModelParallel()
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Dummy data
inputs = torch.randn(64, 1024)
targets = torch.randn(64, 1024)

# Training loop
for epoch in range(5):
    optimizer.zero_grad()
    outputs = model(inputs)  # Outputs live on GPU 1, where part2 ran
    loss = criterion(outputs, targets.to(outputs.device))  # Move targets to the same GPU
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")
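
Whether a model needs to be split at all comes down to a memory estimate. The sketch below does that arithmetic for the two linear layers above, assuming 32-bit floats and counting only weights and gradients; activations, optimizer state, and allocator overhead are ignored. The helpers `linear_param_count` and `training_bytes` are hypothetical names for this illustration.

```python
def linear_param_count(in_features, out_features, bias=True):
    """Number of parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

def training_bytes(num_params, bytes_per_param=4, copies=2):
    """Bytes needed during training; copies=2 counts weights plus gradients."""
    return num_params * bytes_per_param * copies

# Layer shapes taken from the ModelParallel example above
layers = {"part1": (1024, 2048), "part2": (2048, 1024)}
for name, (in_f, out_f) in layers.items():
    n = linear_param_count(in_f, out_f)
    mib = training_bytes(n) / 2**20
    print(f"{name}: {n:,} parameters, about {mib:.1f} MiB for weights and gradients")
```

These layers are far too small to need splitting; the same accounting applied to a model with billions of parameters shows when a single GPU's memory is no longer enough.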
