
Conversation

rgerganov (Collaborator)

Allow rpc-server to expose multiple devices from a single endpoint. Change RPC protocol to include device identifier where needed. Add new API to get the device count from an RPC endpoint.

closes: #15210

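For context, a minimal sketch of the wire-level idea (struct names are illustrative, not the actual ggml-rpc definitions): each endpoint reports a device count, and every device-specific request carries a small integer device id.

```cpp
// Hedged sketch of the protocol change only; these are not the real ggml-rpc structs.
#include <cstdint>

struct sketch_device_count_rsp {
    uint32_t device_count;   // number of devices exposed behind this endpoint
};

struct sketch_get_device_memory_req {
    uint32_t device;         // which device of the endpoint the query refers to
};
```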
rgerganov (Collaborator, Author)

@slaren I am still working on this but I'd appreciate some early feedback on the API changes in ggml-rpc.h

github-actions bot added the examples and ggml labels on Sep 26, 2025
rgerganov marked this pull request as ready for review on September 30, 2025
rgerganov (Collaborator, Author) commented Sep 30, 2025

I am also considering changing the naming scheme for RPC devices. Currently I am using RPC<X>[<host>:<port>] for both the name and the description, which doesn't look nice in various logs.

I am thinking of switching to RPC<X> for the device name and [<host>:<port>] for the device description. X would be a global monotonic counter, making RPC devices appear like this:

RPC0 ([localhost:50052])
RPC1 ([localhost:50053])
RPC2 ([localhost:18053])
RPC3 ([localhost:18054])
...

Thoughts?
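A minimal sketch of the proposed scheme (illustrative only, not the actual backend code):

```cpp
// A global monotonic counter produces the device name; the endpoint becomes the description.
#include <cstdio>
#include <string>

static std::string rpc_device_name() {
    static int counter = 0;                  // global monotonic counter
    return "RPC" + std::to_string(counter++);
}

static std::string rpc_device_description(const std::string & host, int port) {
    return "[" + host + ":" + std::to_string(port) + "]";
}

int main() {
    std::printf("%s (%s)\n", rpc_device_name().c_str(), rpc_device_description("localhost", 50052).c_str());
    std::printf("%s (%s)\n", rpc_device_name().c_str(), rpc_device_description("localhost", 50053).c_str());
}
```

This prints RPC0 ([localhost:50052]) and RPC1 ([localhost:50053]), matching the listing above.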

};

struct rpc_msg_get_device_memory_req {
    uint32_t device;


I've actually done the same optimization a long time ago locally, but haven't had a chance to properly submit it. I think uint8_t is enough for the device ID; I don't think we will exceed 256 devices per endpoint. Using uint8_t would reduce the amount of transferred data.

rgerganov (Collaborator, Author)

I thought about this, but I was afraid of running into alignment issues if the payload for GRAPH_COMPUTE is not a multiple of 4. I will do some tests and may reconsider.
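A minimal sketch of the alignment concern (hypothetical one-byte layout, not the actual wire format):

```cpp
// With a 1-byte device id, everything after that byte is unaligned, so it must be
// read via memcpy or the payload needs explicit padding to a multiple of 4.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    // hypothetical payload: | device (1 byte) | n_nodes (4 bytes) | ...
    std::vector<uint8_t> payload(1 + sizeof(uint32_t));
    payload[0] = 3;                                       // device id
    uint32_t n_nodes = 42;
    std::memcpy(payload.data() + 1, &n_nodes, sizeof(n_nodes));

    uint32_t out;
    // memcpy avoids dereferencing a misaligned uint32_t pointer (undefined behavior)
    std::memcpy(&out, payload.data() + 1, sizeof(out));
    std::printf("device=%u n_nodes=%u\n", (unsigned) payload[0], out);
}
```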

ggerganov (Member) left a comment

I am doing some testing and it seems that the change in #15966 broke Metal compatibility with the RPC backend. Since that change, running any model over RPC to a Metal machine with FA enabled results in garbage.

I think the reason is that the RPC backend uses ggml_nbytes() to get the required memory for a tensor, but the Metal backend can now allocate extra size for fleeting data:

static size_t ggml_backend_metal_buffer_type_get_alloc_size(ggml_backend_buffer_type_t buft, const ggml_tensor * tensor) {
    size_t res = ggml_nbytes(tensor);

    // some operations require additional memory for fleeting data:
    switch (tensor->op) {
        case GGML_OP_MUL_MAT_ID:
            {
                res += ggml_metal_op_mul_mat_id_extra_tpe(tensor);
                res += ggml_metal_op_mul_mat_id_extra_ids(tensor);
            } break;
        case GGML_OP_FLASH_ATTN_EXT:
            {
                if (ggml_metal_op_flash_attn_ext_use_vec(tensor)) {
                    res += ggml_metal_op_flash_attn_ext_extra_tmp(tensor);
                }
            } break;
        default:
            break;
    }

    return res;

    GGML_UNUSED(buft);
}

This is not directly related to the changes in this PR, but we should look into resolving it. I think the issue is similar to the padding that the CUDA backend does for quantized tensors.
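For illustration only, a toy example of the size mismatch (stand-in types, not the ggml API). One plausible direction, similar to the CUDA padding case, would be for the RPC client to ask the remote backend for its allocation size instead of computing the size locally from ggml_nbytes():

```cpp
// The RPC client sizes buffers from the plain tensor byte count, while the Metal
// backend reserves extra scratch for some ops, so the remote allocation comes up short.
#include <cstddef>
#include <cstdio>

struct toy_tensor {
    size_t nbytes;         // what ggml_nbytes() would report
    size_t extra_scratch;  // op-specific fleeting data the backend adds on top
};

static size_t client_size(const toy_tensor & t)  { return t.nbytes; }                    // what the RPC client assumes
static size_t backend_size(const toy_tensor & t) { return t.nbytes + t.extra_scratch; }  // what the backend actually needs

int main() {
    toy_tensor fa = { 1 << 20, 4096 };  // e.g. a FLASH_ATTN_EXT tensor with extra temp data
    std::printf("client allocates %zu, backend needs %zu -> short by %zu bytes\n",
                client_size(fa), backend_size(fa), backend_size(fa) - client_size(fa));
}
```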

    auto * reg = ggml_backend_reg_get(i);
    std::string name = ggml_backend_reg_name(reg);
-   if (name != "CPU") {
+   if (name != "CPU" && !string_starts_with(name, "RPC")) {
ggerganov (Member)

I wonder if, instead of ad hoc string matching, the RPC backend could expose a ggml_backend_rpc_is_name_valid(const char * name) and move the entire naming logic inside the backend implementation. As it is now, it seems a bit of a hack that assumes a specific name syntax, and there is no way to keep it in sync with the actual naming syntax used in the backend.

Just an idea - @slaren can confirm if this is needed or we keep it as it is.
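A minimal sketch of what such a helper could look like (hypothetical, not part of the current ggml-rpc API; the check shown merely mirrors today's naming convention):

```cpp
// The RPC backend would own the naming convention and callers would ask it
// whether a registry name belongs to it, instead of pattern-matching strings.
#include <cstring>

static bool ggml_backend_rpc_is_name_valid(const char * name) {
    // assumes the current "RPC..." naming convention stays internal to the backend
    return name != nullptr && std::strncmp(name, "RPC", 3) == 0;
}
```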

rgerganov (Collaborator, Author)

We create a separate ggml_backend_reg for each RPC server, which has the name RPC[host:port]. The tricky part is that we don't want to list all of these names in the backend column of the report but just say "RPC". I guess we could expose an is_rpc() function from ggml_backend_reg, but I think it'd be overkill.

I have slightly improved this logic with the current APIs.

rgerganov (Collaborator, Author)

> I am doing some testing and it seems that the change in #15966 broke Metal compatibility with the RPC backend. [...] I think the issue is similar to the padding that the CUDA backend does for quantized tensors.

Thanks for catching this. Let's fix it in a separate PR.

ggerganov (Member) left a comment

Wait for slaren's review before merging.

    // serialization format:
-   // | n_nodes (4 bytes) | nodes (n_nodes * sizeof(uint64_t) | n_tensors (4 bytes) | tensors (n_tensors * sizeof(rpc_tensor)) |
-   if (input.size() < sizeof(uint32_t)) {
+   // | device (4 bytes) | n_nodes (4 bytes) | nodes (n_nodes * sizeof(uint64_t) | n_tensors (4 bytes) | tensors (n_tensors * sizeof(rpc_tensor)) |
ggerganov (Member)

This could be refactored into a header struct with the fixed fields (device, n_nodes, n_tensors), followed by the fields with variable size (nodes and tensors).
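A sketch of the suggested layout (names are illustrative, not the actual code):

```cpp
// Fixed-size fields live in a header struct; the variable-size arrays follow it in the payload.
#include <cstdint>

struct rpc_graph_compute_header_sketch {
    uint32_t device;     // which device of the endpoint runs the graph
    uint32_t n_nodes;    // followed by n_nodes * sizeof(uint64_t) node ids
    uint32_t n_tensors;  // followed by n_tensors * sizeof(rpc_tensor) descriptors
};
```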

rgerganov (Collaborator, Author)

I did some benchmarks with two hosts connected on a local gigabit network. The first host has no accelerators and offloads everything to the second host which has two NVIDIA cards (RTX 5090, GTX 1660).

master

$ ./llama-bench -m ~/llamacpp/models/gemma-3-4b-it-q4_0.gguf -ngl 99 --rpc 192.168.88.34:50052,192.168.88.34:50053 --list-devices
load_backend: loaded RPC backend from /home/deck/bench/master/libggml-rpc.so
load_backend: loaded CPU backend from /home/deck/bench/master/libggml-cpu-haswell.so
Available devices:
  RPC[192.168.88.34:50052]: RPC[192.168.88.34:50052] (32109 MiB, 31588 MiB free)
  RPC[192.168.88.34:50053]: RPC[192.168.88.34:50053] (5748 MiB, 5603 MiB free)
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | pp512 | 1039.83 ± 0.44 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | tg128 | 33.18 ± 0.05 |

PR

$ ./llama-bench -m ~/llamacpp/models/gemma-3-4b-it-q4_0.gguf -ngl 99 --rpc 192.168.88.34:50052 --list-devices
load_backend: loaded RPC backend from /home/deck/bench/multidev/libggml-rpc.so
load_backend: loaded CPU backend from /home/deck/bench/multidev/libggml-cpu-haswell.so
Available devices:
  RPC0: 192.168.88.34:50052 (32109 MiB, 31588 MiB free)
  RPC1: 192.168.88.34:50052 (5748 MiB, 5671 MiB free)
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | pp512 | 1266.96 ± 2.77 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | tg128 | 33.68 ± 0.10 |

Exposing the same card (RTX 5090) twice

master

$ ./llama-bench -m ~/llamacpp/models/gemma-3-4b-it-q4_0.gguf -ngl 99 --rpc 192.168.88.34:50052,192.168.88.34:50053 --list-devices
load_backend: loaded RPC backend from /home/deck/bench/master/libggml-rpc.so
load_backend: loaded CPU backend from /home/deck/bench/master/libggml-cpu-haswell.so
Available devices:
  RPC[192.168.88.34:50052]: RPC[192.168.88.34:50052] (32109 MiB, 31588 MiB free)
  RPC[192.168.88.34:50053]: RPC[192.168.88.34:50053] (32109 MiB, 31084 MiB free)
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | pp512 | 1424.32 ± 2.49 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | tg128 | 47.10 ± 0.02 |

PR

$ bin/rpc-server -c -d CUDA0,CUDA0 -H 0.0.0.0 -p 50052
$ ./llama-bench -m ~/llamacpp/models/gemma-3-4b-it-q4_0.gguf -ngl 99 --rpc 192.168.88.34:50052 --list-devices
load_backend: loaded RPC backend from /home/deck/bench/multidev/libggml-rpc.so
load_backend: loaded CPU backend from /home/deck/bench/multidev/libggml-cpu-haswell.so
Available devices:
  RPC0: 192.168.88.34:50052 (32109 MiB, 31588 MiB free)
  RPC1: 192.168.88.34:50052 (32109 MiB, 31588 MiB free)
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | pp512 | 1896.91 ± 5.12 |
| gemma3 4B Q4_0 | 2.93 GiB | 3.88 B | RPC | 99 | tg128 | 47.65 ± 0.49 |

In summary, this patch brings massive improvements in PP and some small improvements in TG when multiple devices are used with RPC.
I also don't see any regressions when using RPC with a single device.

However, I noticed that I made the mistake of putting RPC_CMD_DEVICE_COUNT first, thus changing the number for RPC_CMD_HELLO and resulting in a bad user experience when the client and server versions mismatch. I will fix this and then merge.
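A sketch of why the ordering matters (illustrative enum only, not the actual rpc_cmd definition):

```cpp
// The numeric command values are part of the wire protocol, so new commands must be
// appended at the end. Inserting a new command before RPC_CMD_HELLO shifts HELLO's
// value, and a mismatched client/server pair can no longer even report the version error.
enum rpc_cmd_sketch {
    RPC_CMD_HELLO_SKETCH = 0,     // must keep value 0 so a version mismatch is detected cleanly
    // ... existing commands keep their existing values ...
    RPC_CMD_DEVICE_COUNT_SKETCH,  // new commands go here, at the end
    RPC_CMD_COUNT_SKETCH,
};
```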

rgerganov merged commit 898acba into ggml-org:master on Oct 4, 2025 (66 of 68 checks passed).
jukofyork (Collaborator) commented Oct 4, 2025

So if I have 3 machines, each with 2 cards, and assuming I number the RPC clients as {{0, 1}, {2, 3}, {4, 5}}:

  • The old behaviour was to send the hidden state(s) from the host to card 0, then back to the host, then to card 1, and then back to the host, and so on...
  • The new behaviour will send from the host to card 0, then directly to card 1, and then (back to the host?).

I'm away from home, but can test this exact setup when I get back with deepseek-v3 and will post the results.

rgerganov (Collaborator, Author) commented Oct 4, 2025

The old behavior was to copy tensors from device 0 to the main host and then from the main host to device 1, and so on. The new behavior is that the main host sends RPC_CMD_COPY_TENSOR to copy a tensor from dev 0 to dev 1, which saves a lot of network traffic (the tensor data is not sent). You can see the commands being received and executed by setting GGML_RPC_DEBUG=1 before starting rpc-server. I will also update the documentation soon.
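A rough sketch of why this saves traffic (stand-in structs, not the actual rpc_tensor/rpc_msg definitions):

```cpp
// A device-to-device copy request carries only tensor metadata; the tensor data
// itself never leaves the server.
#include <cstdint>

struct toy_tensor_desc {   // stand-in for a serialized tensor descriptor
    uint64_t id;
    uint64_t nbytes;
};

struct toy_copy_tensor_req {
    toy_tensor_desc src;   // lives on one device of the endpoint
    toy_tensor_desc dst;   // lives on another device of the same endpoint
    // no data payload: the server performs the copy locally between its devices
};
```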

> I'm away from home, but can test this exact setup when I get back with deepseek-v3 and will post the results.

That would be very much appreciated, thanks.

slaren (Member) commented Oct 4, 2025

There is a possible issue here:

static bool ggml_backend_rpc_buffer_cpy_tensor(ggml_backend_buffer_t buffer, const ggml_tensor * src, ggml_tensor * dst) {
    // check if src and dst are on the same server
    ggml_backend_buffer_t src_buffer = src->buffer;
    ggml_backend_rpc_buffer_context * src_ctx = (ggml_backend_rpc_buffer_context *)src_buffer->context;
    ggml_backend_buffer_t dst_buffer = dst->buffer;
    ggml_backend_rpc_buffer_context * dst_ctx = (ggml_backend_rpc_buffer_context *)dst_buffer->context;
    if (src_ctx->sock != dst_ctx->sock) {
        return false;
    }

Only the dst tensor is guaranteed to be on an RPC buffer; the src tensor could be on any other buffer type. Before casting the src context to ggml_backend_rpc_buffer_context, it is necessary to check that it is an RPC buffer.
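A sketch of the guard being described (stand-in types, not the ggml structs):

```cpp
// Verify that the src buffer really is an RPC buffer before reinterpreting its
// context; otherwise return false so the generic copy path is used instead.
#include <cstring>

struct toy_buffer {
    const char * type_name;   // stand-in for the buffer type / interface check
    void *       context;     // only meaningful for RPC buffers here
};

static bool toy_buffer_is_rpc(const toy_buffer * buf) {
    return buf != nullptr && std::strcmp(buf->type_name, "RPC") == 0;
}

static bool toy_rpc_cpy_tensor(const toy_buffer * src_buf, const toy_buffer * dst_buf) {
    // dst is guaranteed to be an RPC buffer by the caller; src is not
    if (!toy_buffer_is_rpc(src_buf)) {
        return false;   // fall back to going through the host
    }
    // only now is it safe to interpret src_buf->context as the RPC buffer context
    return src_buf->context == dst_buf->context;   // e.g. a "same server" check
}
```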

rgerganov (Collaborator, Author)

> Only the dst tensor is guaranteed to be on an RPC buffer; the src tensor could be on any other buffer type. Before casting the src context to ggml_backend_rpc_buffer_context, it is necessary to check that it is an RPC buffer.

Submitted #16421

jukofyork (Collaborator)

Just trying to find some definitive information about RPC latency:

https://stackoverflow.com/questions/962436/what-are-the-disadvantages-of-rpc-with-respect-to-message-passing

https://web.archive.org/web/20121216045743/http://www-scf.usc.edu/~shailesn/csci-555/mp_vs_rpc.html

Sadly all the links are dead, but I did manage to get a copy of the paper (but not the 'silcock-programmerfriendly-1998.pdf' thesis):

Message Passing, Remote Procedure Calls and Distributed Shared Memory as Communication Paradigms for Distributed Systems.pdf

Is it still correct that RPC has ~2x the latency of MPI?

> The message passing implementation requires only one message to be passed between the communicating processes while RPC requires two messages.


slaren (Member) commented Oct 8, 2025

I don't think the comparison makes much sense. If we want to improve the efficiency of the RPC backend, we need to look at what operations introduce latency. I suspect that one of the biggest sources of latency is that every command blocks until a response is received from the server, but this may not be strictly necessary in every case. For instance, async operations should not require waiting for a response from the server in most cases. Implementing async support, while also moving the network to a different thread (such that async operations return immediately), could reduce the number of roundtrips and reduce latency.
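A minimal sketch of that direction (illustrative only, not the actual ggml-rpc code):

```cpp
// Async operations serialize a command, push it to a queue serviced by a network
// thread and return immediately; only explicit synchronization waits for the queue
// to drain, which removes one round trip per async call.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class rpc_command_pipeline {
public:
    rpc_command_pipeline() : worker([this] { run(); }) {}

    ~rpc_command_pipeline() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv.notify_all();
        worker.join();
    }

    // async path: enqueue the serialized command and return without waiting
    void submit(std::vector<uint8_t> cmd) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            pending.push(std::move(cmd));
        }
        cv.notify_all();
    }

    // synchronization point: block until everything queued so far has been sent
    void synchronize() {
        std::unique_lock<std::mutex> lock(mtx);
        cv_done.wait(lock, [this] { return pending.empty() && !busy; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mtx);
        while (true) {
            cv.wait(lock, [this] { return stop || !pending.empty(); });
            if (pending.empty()) {
                if (stop) break;
                continue;
            }
            std::vector<uint8_t> cmd = std::move(pending.front());
            pending.pop();
            busy = true;
            lock.unlock();
            send_over_socket(cmd);   // placeholder for the actual blocking network write
            lock.lock();
            busy = false;
            cv_done.notify_all();
        }
    }

    static void send_over_socket(const std::vector<uint8_t> & /*cmd*/) {
        // network I/O would go here; errors would surface at the next sync point
    }

    std::mutex                        mtx;
    std::condition_variable           cv;
    std::condition_variable           cv_done;
    std::queue<std::vector<uint8_t>>  pending;
    bool                              stop = false;
    bool                              busy = false;
    std::thread                       worker;   // declared last so it starts after the state above
};
```

With this kind of pipelining, errors from the server surface at the next synchronization point rather than per call, which is the usual trade-off.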

jukofyork (Collaborator)

> I don't think the comparison makes much sense. If we want to improve the efficiency of the RPC backend, we need to look at what operations introduce latency. [...]

Yeah, I'm just looking at the code now and it all looks to be done with socket servers underneath:

https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-rpc/ggml-rpc.cpp

I haven't tried this PR's improvements yet and probably won't be able to until the weekend now, but it will be interesting to see if the single hidden states for token generation are still the biggest bottleneck for me.

jukofyork (Collaborator) commented Oct 8, 2025

> I haven't tried this PR's improvements yet and probably won't be able to until the weekend now, but it will be interesting to see if the single hidden states for token generation are still the biggest bottleneck for me.

I can't test deepseek or run any proper tests until I get back, but I just did a quick test of an existing GLM-4.6 GGUF I have (Q6_K apart from the MoE tensors, which are in Q4_K --> ~200GB file size) running over 3 machines with 2x A6000 in each (i.e.: 2x RPC servers running 2x A6000 each + the main server running 2x A6000 as CUDA0 and CUDA1):

  • It seems to be working much better than the last time I tried RPC (where the gain from each GPU I added was cancelled out by the added latency, etc).
  • It also seems to benefit much more from using a draft model (I assume because sending a small batch costs almost the same network latency as sending a single hidden state).

Labels: examples, ggml (changes relating to the ggml tensor library for machine learning)

Successfully merging this pull request may close: Feature Request: Support multiple devices on a single rpc-server