Make inference, forward, and backward fully fault-tolerant #91

borzunov · 2022-11-26T00:34:29Z

In this PR:

- The inference session is made fully fault-tolerant. If a server fails (or doesn't respond within the timeout), it is replaced by one or more servers, and the lost attention caches are regenerated. See the screenshot of how it works below.

  Here, the 24-layer model is spread across 3 servers, then the intermediate one (holding blocks 8-14) leaves, then a new large server joins and decides to host layers 6-24 (thus closing the gap). The inference session is able to recover the span 8-14 through the new server and successfully continues inference. The results are identical to the ones without failures.
- Forward and backward are made "gap-tolerant". If some servers leave and a gap arises in the swarm (i.e., some blocks are not handled by anyone), the forward and backward passes no longer fail. Instead, they keep retrying until a new path through the model is available. I have checked that the activations/gradients calculated after failure recovery are equal to the ones calculated without failures.
- Fixed an important bug in forward/backward. Sometimes, make_sequence() returns a sequence that is longer than requested (since the last server hosts blocks beyond end_index). Before this PR, the code actually ran the inputs through these extra blocks, leading to incorrect results. This affected partial forward passes and partial/full backward passes (since they re-run partial forward passes in case of failures).
- Minor: renamed classes:
  - RemoteTransformerBlockInferenceSession -> _ServerInferenceSession (since (a) it actually handles the whole span instead of a single block and (b) it is not designed to be used outside Petals internals)
  - RemoteSequentialInferenceSession -> InferenceSession (since the previous name is too long and hardly readable)
- While working on this PR, I also discovered a bug in hivemind that leads to an error in Petals. As soon as it is fixed, we should update requirements.txt to point to a new hivemind version. See "Fix MPFuture failing outside inference mode" (learning-at-home/hivemind#521).
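The gap-tolerant retry and the make_sequence() overshoot fix described above can be illustrated with a toy sketch. This is not the Petals code: `Span`, `MissingBlocksError`, `toy_make_sequence`, `run_blocks`, `retry_delays`, and `forward_with_retries` are hypothetical names invented for illustration, servers are simulated locally, and the exponential backoff after the first (immediate) retry is an assumption. The sketch only models a gap found while building the route, not a server failing mid-request.

```python
import itertools
import time
from dataclasses import dataclass
from typing import Callable, Iterator, List


class MissingBlocksError(Exception):
    """Some block in the requested range has no server right now (a gap)."""


@dataclass(frozen=True)
class Span:
    start: int  # first hosted block, inclusive
    end: int  # last hosted block, exclusive


def toy_make_sequence(servers: List[Span], start: int, end: int) -> List[Span]:
    """Chain spans that cover [start, end). The last span may overshoot `end`."""
    chain: List[Span] = []
    pos = start
    while pos < end:
        hosts = [s for s in servers if s.start <= pos < s.end]
        if not hosts:
            raise MissingBlocksError(f"no server hosts block {pos}")
        best = max(hosts, key=lambda s: s.end)
        chain.append(Span(pos, best.end))  # best.end can be larger than `end`
        pos = best.end
    return chain


def run_blocks(x: int, start: int, end: int) -> int:
    """Toy stand-in for a remote call: block i adds i to the input."""
    return x + sum(range(start, end))


def retry_delays(base: float = 0.01, cap: float = 0.1) -> Iterator[float]:
    """First retry is immediate; later delays back off (assumed schedule)."""
    yield 0.0
    for attempt in itertools.count():
        yield min(base * 2**attempt, cap)


def forward_with_retries(
    x: int,
    start: int,
    end: int,
    get_servers: Callable[[], List[Span]],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run blocks [start, end), retrying while the swarm has a gap."""
    pos = start
    delays = retry_delays()
    while pos < end:
        try:
            chain = toy_make_sequence(get_servers(), pos, end)
        except MissingBlocksError:
            sleep(next(delays))  # wait, then look for a new path
            continue
        for span in chain:
            span_end = min(span.end, end)  # clip: never run blocks past `end`
            x = run_blocks(x, span.start, span_end)
            pos = span_end
    return x


# Usage: the first lookup sees a gap at blocks 8-13, the second sees a new
# server hosting blocks 6-23 that closes it.
snapshots = iter([[Span(0, 8), Span(14, 24)], [Span(0, 8), Span(6, 24)]])
slept: List[float] = []
result = forward_with_retries(0, 0, 20, lambda: next(snapshots), sleep=slept.append)
```

Without the `min(span.end, end)` clip, the last span would also apply blocks 20-23, which is the kind of incorrect result the bug fix addresses.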

borzunov · 2022-11-26T05:25:51Z

src/client/inference_session.py

        max_length: int,
        points: int = 0,
    ):
        self.uid, self.rpc_info = uid, rpc_info
        self.num_blocks = uid.count(CHAIN_DELIMITER) + 1
-        # warning: this code manages async objects that are only usable inside RemoteExpertWorker's background thread;
-        # using them in any other EventLoop may cause side-effects including, headaches, diarrhea, and loss of sleep


This important message is now conveyed by starting the class name with an underscore (_).

justheuristic

Great work!

borzunov added 2 commits November 25, 2022 21:44

Rename Remote{TransformerBlock => Server}InferenceSession

bd10d15

Implement fault-tolerant inference

f6622bc

borzunov requested a review from justheuristic November 26, 2022 00:34

borzunov added 8 commits November 26, 2022 01:06

Regenerate attn caches when necessary

55bea82

Make inference session fields private

b6316a5

Rename RemoteTransformerBlockInferenceSession => _ServerInferenceSession

a232f13

Rename RemoteSequentialInferenceSession => InferenceSession

8d47e38

black

3a7b8a4

Log disconnect errors with DEBUG level

b1b1947

Make forward more fault-tolerant

87fd00e

Make backward more fault-tolerant

a58a8b9

borzunov force-pushed the fault-tolerant-inference branch from ea91607 to a58a8b9 Compare November 26, 2022 02:18

borzunov added 4 commits November 26, 2022 02:21

Fix sequential_forward()

756e277

Fix sequential_backward()

a59facc

Fix bug with make_sequence() returning longer sequences

fb47655

InferenceSession: Replace only a segment of spans instead of everything until the end

3bc06f0

borzunov mentioned this pull request Nov 26, 2022

Fix MPFuture failing outside inference mode learning-at-home/hivemind#521

Merged

Fix timeout on next token

2fafbaa

borzunov changed the title ~~Make inference fault-tolerant~~ Make inference, foward, and backward more fault-tolerant Nov 26, 2022

borzunov changed the title ~~Make inference, foward, and backward more fault-tolerant~~ Make inference, foward, and backward ultimately fault-tolerant Nov 26, 2022

Fix max_length

01cffeb

borzunov changed the title ~~Make inference, foward, and backward ultimately fault-tolerant~~ Make inference, foward, and backward fully fault-tolerant Nov 26, 2022

borzunov commented Nov 26, 2022

borzunov marked this pull request as ready for review November 26, 2022 16:15

justheuristic reviewed Nov 26, 2022

src/client/inference_session.py (resolved)

justheuristic reviewed Nov 26, 2022

src/client/inference_session.py (outdated, resolved)

borzunov changed the title ~~Make inference, foward, and backward fully fault-tolerant~~ Make inference, forward, and backward fully fault-tolerant Nov 26, 2022

justheuristic approved these changes Nov 26, 2022

borzunov added 2 commits November 26, 2022 17:55

InferenceSession: Fix the case when failure happens while recovering from another failure

226fe91

Make the first retry delay be zero

b278a8d

borzunov added 4 commits November 26, 2022 23:40

Fix InferenceSession edge cases

292e359

Improve InferenceSession typing

90654da

Require hivemind with MPFuture in inference mode fixed

0ef1d15

black

8c50f65

borzunov merged commit 11d6ba6 into main Nov 27, 2022

justheuristic mentioned this pull request Dec 21, 2022

[CODE] fault-tolerant inference (client side) #7

Closed
