Petals communication protocol

Max Ryabinin edited this page Nov 27, 2023 · 1 revision

This page briefly summarizes the client-server interactions that happen in Petals during inference.

From the model's point of view (e.g., LLaMA in petals.models.llama), activations are sent to remote servers via the RemoteSequential class. This class forwards inputs through RemoteSequentialAutogradFunction, which implements the forward and backward passes over the servers.
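To make the idea concrete, here is a toy, stdlib-only sketch of what RemoteSequential does at a high level: the model treats a span of transformer blocks hosted on remote servers as a single sequential module. All names below (ToyRemoteSequential, fake_servers) are hypothetical; the real class dispatches network RPCs through RemoteSequentialAutogradFunction rather than calling local functions.

```python
# Toy, stdlib-only sketch of the RemoteSequential idea (hypothetical names).
class ToyRemoteSequential:
    """Applies a chain of 'remote' layer functions to the input activations."""

    def __init__(self, remote_layers):
        # In Petals, each entry would be an RPC stub for a span of blocks
        # hosted on some server; here they are plain callables.
        self.remote_layers = remote_layers

    def __call__(self, hidden_states):
        # The real class invokes RemoteSequentialAutogradFunction.apply,
        # which performs this loop over the network and records state
        # needed for the backward pass.
        for layer in self.remote_layers:
            hidden_states = layer(hidden_states)
        return hidden_states


# Each "server" here is just a function transforming a list of floats.
fake_servers = [
    lambda h: [x + 1.0 for x in h],   # stand-in for block 0
    lambda h: [x * 2.0 for x in h],   # stand-in for block 1
]

remote_blocks = ToyRemoteSequential(fake_servers)
print(remote_blocks([1.0, 2.0]))  # → [4.0, 6.0]
```

From the surrounding model code, `remote_blocks` looks like any other sequential submodule, which is what lets RemoteSequential slot into an otherwise ordinary Transformer implementation.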

For the forward pass, the key function is sequential_forward (https://github.com/bigscience-workshop/petals/blob/main/src/petals/client/sequential_autograd.py#L26), which sends the model inputs (a torch.Tensor) through a sequence of remote layers. Tensors are encoded for these remote forward/backward calls with hivemind.compression.serialize_torch_tensor and decoded with deserialize_torch_tensor (https://github.com/learning-at-home/hivemind/blob/master/hivemind/compression/serialization.py#L30-L47).
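The control flow of sequential_forward can be sketched as follows. This is a minimal stdlib-only approximation under the assumption that, for each block, the client holds a list of candidate peers and falls back to another peer when a call fails (Petals is designed to tolerate servers leaving); `toy_sequential_forward`, `spans`, and `flaky` are hypothetical names, and the real retry/rerouting logic in sequential_autograd.py is considerably more involved.

```python
# Stdlib-only sketch of the sequential_forward control flow (hypothetical).
def toy_sequential_forward(inputs, spans):
    """Run `inputs` through every block span in order.

    `spans` maps each block index to a list of candidate 'servers'
    (plain callables here; RPC stubs over serialized tensors in Petals).
    """
    hidden = inputs
    for block_idx, candidates in enumerate(spans):
        for server in candidates:
            try:
                hidden = server(hidden)
                break  # this peer succeeded, move to the next block
            except Exception:
                continue  # try the next peer serving this block
        else:
            raise RuntimeError(f"no server could run block {block_idx}")
    return hidden


def flaky(_):
    raise ConnectionError("peer went offline")


spans = [
    [flaky, lambda h: [x + 1 for x in h]],  # first peer fails, second works
    [lambda h: [x * 3 for x in h]],
]
print(toy_sequential_forward([1.0], spans))  # → [6.0]
```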

All intermediate messages between the client and the servers are exchanged as ExpertRequest and ExpertResponse Protobuf messages; their schemas are defined in https://github.com/learning-at-home/hivemind/blob/master/hivemind/proto/runtime.proto#L12-L21
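Roughly speaking, a request names the target module and carries serialized tensors, and a response carries serialized result tensors back. The stdlib-only mock below illustrates that shape; the dataclass fields only loosely mirror the Protobuf schema (see runtime.proto for the authoritative definition), and `pack_tensor`/`unpack_tensor` are crude stand-ins for serialize_torch_tensor/deserialize_torch_tensor, which additionally handle dtypes, shapes, and compression.

```python
# Stdlib-only mock of the wire messages (hypothetical, simplified shapes).
import struct
from dataclasses import dataclass, field
from typing import List


def pack_tensor(values: List[float]) -> bytes:
    """Toy stand-in for serialize_torch_tensor: length-prefixed float32 buffer."""
    return struct.pack(f"<I{len(values)}f", len(values), *values)


def unpack_tensor(blob: bytes) -> List[float]:
    """Toy stand-in for deserialize_torch_tensor."""
    (n,) = struct.unpack_from("<I", blob)
    return list(struct.unpack_from(f"<{n}f", blob, offset=4))


@dataclass
class ToyExpertRequest:  # cf. ExpertRequest in runtime.proto
    uid: str             # identifies which module / block span to run
    tensors: List[bytes] = field(default_factory=list)


@dataclass
class ToyExpertResponse:  # cf. ExpertResponse in runtime.proto
    tensors: List[bytes] = field(default_factory=list)


req = ToyExpertRequest(uid="block.0", tensors=[pack_tensor([0.5, -1.0])])
resp = ToyExpertResponse(tensors=req.tensors)  # echo "server" for illustration
print(unpack_tensor(resp.tensors[0]))  # → [0.5, -1.0]
```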