Implement shortest-path routing for inference #362
Conversation
# This is a pessimistic estimate that assumes that we'll use all blocks hosted by this server,
# which is not always true. This is okay since false positives are more costly than false negatives here.
return cache_tokens_needed * 2 * span.length <= span.server_info.cache_tokens_left
Perhaps the servers should report cache tokens*layers to the DHT so we can avoid estimates? In other words, multiply whatever is reported by the number of blocks hosted.
IIRC, it is a relatively new feature that can be modified without much pain.
It's already reported this way. I've updated the comment to improve clarity.
@@ -48,6 +48,7 @@ install_requires =
    sentencepiece>=0.1.99
    peft@git+https://github.com/huggingface/peft@5884bdbea49e5e71e2cd06ecfa484bb635063735
    safetensors>=0.3.1
    Dijkstar>=2.6.0
nit: perhaps we should set <3.0.0 to protect against interface changes
Here, I use only the simplest graph interface, so I think it's unlikely to change, and our code can be quickly updated if it does.
end_index: int,
*,
cache_tokens_needed: Optional[int],
default_inference_rps: float = 300,  # If inference RPS unknown
Nit: consider moving these defaults to sequence manager config so that they are accessible without editing the source code
These are really low-level constants that are unlikely to be changed by a user, so I wouldn't pollute the config with them (it already has lots of knobs).
raise MissingBlocksError(missing_blocks)

client_server_rtts = self.ping_aggregator.to_dict()
logger.info(f"Client-server RTTs: {client_server_rtts}")
Nit: do we want to print it to every client? If not, consider logger.debug
Sure! I use them for debugging; I'll convert all logger.info() calls to logger.debug() before merging.
inference_rps = span.server_info.inference_rps
if inference_rps is None:
    inference_rps = default_inference_rps
graph.add_edge((span.peer_id, block_idx), (span.peer_id, block_idx + 1), 1 / inference_rps)
nit: do you, by chance, have some mockup performance numbers for graph construction & pathfinding for simulated larger graphs? If no, please remind me to run them.
If we allow switching to another server before reaching the end of the current one, we get an O(N^2) graph over servers. For N = 50 servers holding blocks 0..80, building the graph and finding the shortest path takes ~0.3 sec and scales as O(N^2).
I'll forbid such switching: this makes the graph much smaller (it takes ~0.03 sec now) and scales much better.
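For a sense of scale, here is a minimal sketch (not the actual Petals code; all peer IDs, RTTs, and RPS values are made up) of how a per-block routing graph can be built and solved with Dijkstar's `Graph` and `find_path`. Handover edges are added only at span boundaries, mirroring the "no mid-span switching" decision above.

```python
# A minimal sketch of per-block routing with Dijkstar; numbers below are illustrative.
from dijkstar import Graph, find_path

# Hypothetical spans: peer_id -> (first hosted block, last hosted block + 1, inference RPS, client RTT in sec)
spans = {
    "serverA": (0, 4, 300.0, 0.05),
    "serverB": (0, 2, 800.0, 0.01),
    "serverC": (2, 4, 800.0, 0.01),
}
server_to_server_rtt = 0.002  # hypothetical latency for handing over between servers

graph = Graph()
for peer_id, (start, end, rps, client_rtt) in spans.items():
    # The client can enter a server at its first hosted block and leave after its last one
    graph.add_edge("client", (peer_id, start), client_rtt / 2)
    graph.add_edge((peer_id, end), "exit", client_rtt / 2)
    # Running one block on this server costs 1 / inference_rps seconds
    for block_idx in range(start, end):
        graph.add_edge((peer_id, block_idx), (peer_id, block_idx + 1), 1 / rps)

# Handover edges only at span boundaries (no mid-span switching), which keeps the graph small
for a, (_, end_a, _, _) in spans.items():
    for b, (start_b, _, _, _) in spans.items():
        if a != b and end_a == start_b:
            graph.add_edge((a, end_a), (b, start_b), server_to_server_rtt)

path = find_path(graph, "client", "exit")  # edge weights are used directly as costs
print(path.nodes)       # e.g., ['client', ('serverB', 0), ..., ('serverC', 4), 'exit']
print(path.total_cost)  # estimated time in seconds for this route
```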
This PR:

- Adds shortest-path routing for inference. We build a graph with client-server and server-server latencies and compute costs, as well as empirically measured overheads. For client-server latencies, we ping the possible first and last servers of a sequence in `SequenceManager.update()`. We penalize servers that may not have enough cache for our request (a toy version of this check is sketched after this list). This uses info added to the DHT in "Share more info about a server in DHT" #355, "Make a server ping next servers" #356, and "Report inference, forward, and network RPS separately" #358.
- Makes a server ping neighboring servers in addition to the next ones. This gives us an opportunity to switch servers even before we use all blocks of the current one (e.g., because a neighboring server is faster). This feature is not enabled yet, since it increases the graph size for N servers to O(N^2), but we may enable it if needed.
- Fixes a `SequenceManager` bug with the first `update()`. Previously, this update was likely to produce incorrect information and cause `MissingBlocksError`s until the next update happened.
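For intuition, here is a toy version of the cache-capacity check mentioned above. The dataclasses and the handling of missing values are my simplifications for illustration; the actual check is the one shown in the diff earlier in this thread and uses Petals' real span/server info structures.

```python
# Toy illustration of the pessimistic cache check from this PR; the dataclasses below
# are simplified stand-ins, not the actual Petals data structures.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ServerInfo:
    cache_tokens_left: Optional[int]  # reported to the DHT, already multiplied by the number of hosted blocks

@dataclass
class Span:
    start: int  # first hosted block
    end: int    # last hosted block + 1
    server_info: ServerInfo

    @property
    def length(self) -> int:
        return self.end - self.start

def span_has_enough_cache(span: Span, cache_tokens_needed: Optional[int]) -> bool:
    if cache_tokens_needed is None or span.server_info.cache_tokens_left is None:
        return True  # not enough info to penalize this server (assumption for this sketch)
    # Pessimistic estimate that assumes we'll use all blocks hosted by this server,
    # which is not always true. False positives are more costly than false negatives here.
    return cache_tokens_needed * 2 * span.length <= span.server_info.cache_tokens_left

span = Span(start=0, end=8, server_info=ServerInfo(cache_tokens_left=20_000))
print(span_has_enough_cache(span, cache_tokens_needed=2048))  # False: 2048 * 2 * 8 = 32768 > 20000
```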