Let us start with a toy model that contains two linear layers.
```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = torch.nn.Linear(10, 10)
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5)

    def forward(self, x):
        x = self.relu(self.net1(x))
        return self.net2(x)

model = ToyModel()
```
To run this model on 2 GPUs, we need to convert the model to torch.nn.Sequential and then wrap it with fairscale.nn.Pipe.
This will run the first two layers on cuda:0 and the last layer on cuda:1. To learn more, visit the Pipe documentation.
You can then define any optimizer and loss function.
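For instance, with SGD and mean-squared-error loss (both arbitrary choices; any optimizer/loss pair works the same way — a plain Sequential stands in for the Pipe-wrapped model here, since model.parameters() behaves identically):

```python
import torch
import torch.nn as nn

# Stand-in for the Pipe-wrapped model from above.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 5))

# Illustrative choices; swap in any torch.optim optimizer and nn loss.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()
```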
Finally, to run the model and compute the loss, make sure that the outputs and the target are on the same device.
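A single training step could then be sketched like this (again using a plain Sequential as a stand-in; with Pipe, the outputs live on the last stage's device, e.g. cuda:1, which is why the target is moved to outputs.device before the loss is computed):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 5))
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

data = torch.rand(20, 10)
target = torch.rand(20, 5)

optimizer.zero_grad()
outputs = model(data)
# With Pipe, outputs sit on the last pipeline device; move the target
# there so the loss is computed on a single device.
target = target.to(outputs.device)
loss = loss_fn(outputs, target)
loss.backward()
optimizer.step()
```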
You can find a complete example under the examples folder in the fairscale repo.