rpc : add support for multiple devices #16276
Conversation
Allow rpc-server to expose multiple devices from a single endpoint. Change the RPC protocol to include a device identifier where needed. Add a new API to get the device count from an RPC endpoint. closes: ggml-org#15210
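For orientation, here is one possible shape for the new device-count call and a device-aware backend constructor. The names and signatures below are illustrative guesses, not necessarily what the PR actually adds:

```cpp
// Illustrative sketch only: hypothetical shapes for the new calls, not
// necessarily the PR's actual names or signatures.
#include "ggml-backend.h"

// number of devices exposed by an rpc-server at "host:port"
uint32_t       ggml_backend_rpc_get_device_count(const char * endpoint);
// initialize a backend bound to one device on that endpoint
ggml_backend_t ggml_backend_rpc_init(const char * endpoint, uint32_t device);
```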
@slaren I am still working on this but I'd appreciate some early feedback on the API changes in …
I am also considering changing the naming scheme for RPC devices. Now I am using … and I am thinking of switching to …
Thoughts?
```cpp
};

struct rpc_msg_get_device_memory_req {
    uint32_t device;
    // ...
```
I've actually done the same optimization a while ago locally, but haven't had a chance to properly submit it. I think uint8_t is enough for the device ID; I don't think we can exceed 256 devices per endpoint. Using uint8_t would reduce the amount of transferred data.
I thought about this, but I was afraid of running into alignment issues if the payload for GRAPH_COMPUTE is not a multiple of 4. I will do some tests and may reconsider.
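To make the alignment concern concrete, here is a minimal sketch (hypothetical struct names, not the PR's wire format) of how a 1-byte device field shifts the offsets of everything serialized after it:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical message headers, for illustration only.
struct hdr_u32 { uint32_t device; }; // sizeof == 4: payload after it stays 4-byte aligned
struct hdr_u8  { uint8_t  device; }; // sizeof == 1: payload after it starts at offset 1

int main() {
    // If the GRAPH_COMPUTE payload (n_nodes, 8-byte node ids, ...) is
    // serialized directly after the header, a 1-byte header leaves the
    // 4- and 8-byte fields unaligned unless padding is added explicitly.
    printf("u32 header: %zu bytes, u8 header: %zu bytes\n", sizeof(hdr_u32), sizeof(hdr_u8));
    return 0;
}
```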
I am doing some testing and it seems that the change in #15966 broke Metal compatibility with the RPC backend. Since that change, running any model over RPC to a Metal machine with FA enabled results in garbage. I think the reason is that the RPC backend uses ggml_nbytes() to get the required memory for a tensor, but the Metal backend can now allocate extra space for fleeting data:
llama.cpp/ggml/src/ggml-metal/ggml-metal.cpp, lines 184 to 208 at bf6f3b3:
```cpp
static size_t ggml_backend_metal_buffer_type_get_alloc_size(ggml_backend_buffer_type_t buft, const ggml_tensor * tensor) {
    size_t res = ggml_nbytes(tensor);

    // some operations require additional memory for fleeting data:
    switch (tensor->op) {
        case GGML_OP_MUL_MAT_ID:
            {
                res += ggml_metal_op_mul_mat_id_extra_tpe(tensor);
                res += ggml_metal_op_mul_mat_id_extra_ids(tensor);
            } break;
        case GGML_OP_FLASH_ATTN_EXT:
            {
                if (ggml_metal_op_flash_attn_ext_use_vec(tensor)) {
                    res += ggml_metal_op_flash_attn_ext_extra_tmp(tensor);
                }
            } break;
        default:
            break;
    }

    return res;

    GGML_UNUSED(buft);
}
```
This is not directly related to the changes in this PR, but we should look into resolving it. I think the issue is similar to the padding that the CUDA backend does for quantized tensors.
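One possible direction, assuming the fix lands on the rpc-server side (a sketch, not the agreed resolution): size RPC allocations with the backend's own alloc-size hook rather than the raw tensor size, so backends that reserve extra room for fleeting data are respected. The wrapper name below is hypothetical:

```cpp
#include "ggml-backend.h"

// Sketch: prefer the buffer type's alloc-size hook over ggml_nbytes().
// ggml_backend_buft_get_alloc_size() falls back to ggml_nbytes() when the
// backend has no special requirements, so this generalizes the current
// behavior rather than changing it for simple backends.
static size_t rpc_required_size(ggml_backend_buffer_type_t buft, ggml_tensor * tensor) {
    return ggml_backend_buft_get_alloc_size(buft, tensor); // >= ggml_nbytes(tensor)
}
```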
tools/llama-bench/llama-bench.cpp (outdated)
```diff
 auto * reg = ggml_backend_reg_get(i);
 std::string name = ggml_backend_reg_name(reg);
-if (name != "CPU") {
+if (name != "CPU" && !string_starts_with(name, "RPC")) {
```
I wonder if, instead of ad-hoc string matching, the RPC backend could expose a ggml_backend_rpc_is_name_valid(const char * name) and move the entire naming logic inside the backend implementation. As it is now, it seems a bit of a hack: it assumes a specific name syntax and there is no way to keep it in sync with the actual naming syntax used in the backend.
Just an idea - @slaren can confirm whether this is needed or we should keep it as it is.
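A minimal sketch of what such a helper could look like, assuming the current RPC[host:port] naming described below (the function itself is hypothetical, as suggested above):

```cpp
#include <string>

// Hypothetical helper: keeps knowledge of the "RPC[host:port]" naming
// scheme inside the RPC backend instead of ad-hoc prefix checks in tools.
bool ggml_backend_rpc_is_name_valid(const char * name) {
    const std::string s = name ? name : "";
    // minimal shape check: "RPC[" + at least one character + "]"
    return s.size() > 5 && s.compare(0, 4, "RPC[") == 0 && s.back() == ']';
}
```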
We create a separate ggml_backend_reg for each RPC server, which has the name RPC[host:port]. The tricky part is that we don't want to list all of these names in the backend column of the report but just say "RPC". I guess we could expose an is_rpc() function from ggml_backend_reg, but I think that'd be overkill.
I have slightly improved this logic with the current APIs.
Thanks for catching this. Let's fix it in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait for slaren's review before merging.
```diff
 // serialization format:
-// | n_nodes (4 bytes) | nodes (n_nodes * sizeof(uint64_t)) | n_tensors (4 bytes) | tensors (n_tensors * sizeof(rpc_tensor)) |
+// | device (4 bytes) | n_nodes (4 bytes) | nodes (n_nodes * sizeof(uint64_t)) | n_tensors (4 bytes) | tensors (n_tensors * sizeof(rpc_tensor)) |
 if (input.size() < sizeof(uint32_t)) {
```
This could be refactored into a header struct with the fixed fields (device, n_nodes, n_tensors), followed by the fields with variable size (nodes and tensors).
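A sketch of the suggested refactor, with hypothetical names: the fixed fields go up front in one struct, and the variable-size arrays follow on the wire:

```cpp
#include <cstdint>

// Hypothetical header for the GRAPH_COMPUTE message; names are illustrative.
struct rpc_msg_graph_compute_hdr {
    uint32_t device;    // target device on the endpoint
    uint32_t n_nodes;   // number of graph nodes
    uint32_t n_tensors; // number of serialized tensors
};
// wire layout after the header:
// | nodes (n_nodes * sizeof(uint64_t)) | tensors (n_tensors * sizeof(rpc_tensor)) |
```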
I did some benchmarks with two hosts connected on a local gigabit network. The first host has no accelerators and offloads everything to the second host, which has two NVIDIA cards (RTX 5090, GTX 1660).

[llama-bench results: master]

[llama-bench results: PR]

Exposing the same card (RTX 5090) twice:

[llama-bench results: master]

[llama-bench results: PR]
In summary, this patch brings massive improvements in PP and some small improvements in TG when multiple devices are used with RPC. However, I noticed that I made the mistake of putting …
So if I have 3 machines, each with 2 cards, and assuming I number the RPC clients as …
I'm away from home, but can test this exact setup when I get back with …
The old behavior was to copy tensors from device 0 to the main host and then from the main host to device 1, and so on. The new behavior is that the main host sends …
That would be very much appreciated, thanks.
There is a possible issue here:

```cpp
static bool ggml_backend_rpc_buffer_cpy_tensor(ggml_backend_buffer_t buffer, const ggml_tensor * src, ggml_tensor * dst) {
    // check if src and dst are on the same server
    ggml_backend_buffer_t src_buffer = src->buffer;
    ggml_backend_rpc_buffer_context * src_ctx = (ggml_backend_rpc_buffer_context *)src_buffer->context;
    ggml_backend_buffer_t dst_buffer = dst->buffer;
    ggml_backend_rpc_buffer_context * dst_ctx = (ggml_backend_rpc_buffer_context *)dst_buffer->context;
    if (src_ctx->sock != dst_ctx->sock) {
        return false;
    }
```

Only the …
Submitted #16421
Just trying to find some definitive information about RPC latency: https://web.archive.org/web/20121216045743/http://www-scf.usc.edu/~shailesn/csci-555/mp_vs_rpc.html. Sadly all the links are dead, but I did manage to get a copy of the paper (but not the …). Is it still correct that RPC has ~2x the latency of MPI?
[screenshot from the paper comparing MPI and RPC latency]
I don't think the comparison makes much sense. If we want to improve the efficiency of the RPC backend, we need to look at what operations introduce latency. I suspect that one of the biggest sources of latency is that every command blocks until a response is received from the server, but this may not be strictly necessary in every case. For instance, async operations should not require waiting for a response from the server in most cases. Implementing async support, while also moving the network to a different thread (such that async operations return immediately), could reduce the number of roundtrips and reduce latency. |
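As a rough illustration of that idea, here is a minimal sketch, with invented types and no relation to the actual RPC backend code, of a command queue drained by a dedicated network thread: async operations return immediately, and the caller pays a single blocking roundtrip only at an explicit flush:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical serialized command; in the real backend this would be an
// rpc_cmd id plus its payload.
struct cmd {
    uint8_t id = 0;
    std::vector<uint8_t> payload;
};

class pipelined_link {
    std::queue<cmd>         q;
    std::mutex              m;
    std::condition_variable cv;
    bool                    stop = false;
    bool                    busy = false;
    std::thread             worker{[this] { run(); }};

    void run() {
        for (;;) {
            cmd c;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return !q.empty() || stop; });
                if (q.empty()) {
                    return; // stop requested and nothing left to send
                }
                c = std::move(q.front());
                q.pop();
                busy = true;
            }
            // here: serialize c onto the socket; fire-and-forget commands
            // (e.g. async set_tensor) do not wait for a server reply
            {
                std::lock_guard<std::mutex> lk(m);
                busy = false;
            }
            cv.notify_all();
        }
    }

public:
    void send_async(cmd c) { // returns immediately
        {
            std::lock_guard<std::mutex> lk(m);
            q.push(std::move(c));
        }
        cv.notify_all();
    }

    void flush() { // the only blocking point: wait until the queue drains
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return q.empty() && !busy; });
        // here: one request/response roundtrip to confirm completion
    }

    ~pipelined_link() {
        {
            std::lock_guard<std::mutex> lk(m);
            stop = true;
        }
        cv.notify_all();
        worker.join();
    }
};
```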
Yeah, I'm just looking at the code now and it all looks to be done with socket servers underneath: https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-rpc/ggml-rpc.cpp I haven't tried this PR's improvements yet and probably won't be able to until the weekend now, but it will be interesting to see if the single hidden states for token generation are still the biggest bottleneck for me.
I can't test …