RPC issues and comments #7293

Closed
steampunque opened this issue May 15, 2024 · 8 comments

@steampunque

Local LAN setup: 1x 1070, 1x 4070, 1x 4070, configured with the new RPC backend and a server patched to use RPC.

I did a run fully offloading Mixtral Q4_K_M onto the 3 GPUs with RPC, and everything looked good:

llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: RPC buffer size = 7043.34 MiB (1070)
llm_load_tensors: RPC buffer size = 9391.12 MiB (4070)
llm_load_tensors: RPC buffer size = 8711.09 MiB (4070)

All layers were offloaded, and the timings I am getting are:

pp 105.99 tokens per second
tg 25.68 tokens per second

This compares to around 5 t/s generation with CPU + 1x 4070, so the more-than-5x speedup is nice, and it seems to be working OK. Some issues I have found so far:

The RPC servers are spamming rpc_get_tensor and rpc_set_tensor messages to the console; this should be turned off unless debugging.

I initially tried a partial offload to two machines (8G + 12G) but got an out-of-memory crash on one of the servers, so I am guessing that RPC mode currently does not support mixed CPU and GPU offload, i.e. it is GPU offload only, and if the model doesn't fit in GPU memory there is no way to pick up the rest of the layers with the CPU? This is more of a question. It should be possible to pick up the remaining layers that won't fit into the RPC GPUs on the host running the server (my host has 128G RAM).

When the RPC servers crash, they cannot be restarted without doing a hard restart of the RPC subsystem (restarting rpcbind, etc.). Something is not being cleaned up correctly when the RPC servers crash with a SEGV.

Also, great work on this feature; it is extremely useful! It would be very good to support mixed CPU and GPU with this mode, though, so that the huge models such as DBRX, Command R+, Falcon 180B, and the 70G Llama 3 monster could be run if desired.

@ggerganov
Owner

I am guessing that RPC mode currently does not support mixed CPU and GPU offload, i.e. it is GPU offload only, and if the model doesn't fit in GPU memory there is no way to pick up the rest of the layers with the CPU?

Might be possible to start 2 RPC servers on the same machine - one CPU and one GPU?

@rgerganov
Collaborator

rgerganov commented May 15, 2024

Thanks for the feedback!

The RPC servers are spamming rpc_get_tensor and rpc_set_tensor messages to the console; this should be turned off unless debugging.

Sure, will turn this off by default.
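
Something along these lines, with a hypothetical GGML_RPC_DEBUG flag (a sketch, not the actual llama.cpp logging code), would keep the messages available for debugging while silencing them by default:

```cpp
#include <cstdio>

// Gate per-tensor RPC logging behind a compile-time flag so the
// rpc_get_tensor / rpc_set_tensor traffic is silent by default.
#ifdef GGML_RPC_DEBUG
#define RPC_DEBUG_LOG(...) fprintf(stderr, __VA_ARGS__)
#else
#define RPC_DEBUG_LOG(...) do { } while (0)
#endif

// hypothetical call site:
// RPC_DEBUG_LOG("rpc_get_tensor: %zu bytes\n", (size_t) size);
```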

I am guessing that RPC mode currently does not support mixed CPU and GPU offload

The problem is that we don't report the available memory for the CPU and Metal backends, so it defaults to 1MB:
https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/rpc-server.cpp#L41
As a stopgap solution, I will add command-line arguments for free_mem and total_mem until we implement this for all platforms.
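
For illustration, a rough sketch of what reporting backend memory could look like, using the real cudaMemGetInfo call for CUDA and falling back to an override elsewhere (the function name and the override parameter are made up for the sketch, not the actual rpc-server code):

```cpp
#include <cstddef>
#ifdef GGML_USE_CUDA
#include <cuda_runtime.h>
#endif

// Report free/total backend memory. CUDA can query the device directly; for
// backends without a query yet (CPU, Metal), use the command-line override
// when given, otherwise keep the current 1 MB default.
static void get_backend_memory(size_t * free_mem, size_t * total_mem, size_t override_bytes) {
#ifdef GGML_USE_CUDA
    (void) override_bytes;
    cudaMemGetInfo(free_mem, total_mem);
#else
    *free_mem  = override_bytes ? override_bytes : 1024 * 1024;
    *total_mem = *free_mem;
#endif
}
```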

It should be possible to pick up the remaining layers that won't fit into the RPC GPUs on the host running the server (my host has 128G RAM).

Agree, will be working on this.

When the RPC servers crash, they cannot be restarted without doing a hard restart of the RPC subsystem (restarting rpcbind, etc.). Something is not being cleaned up correctly when the RPC servers crash with a SEGV.

I guess this is because rpc-server fails to start on the same port and it needs some time for the old socket to expire. Does changing the port help?

@steampunque
Author

I am guessing that RPC mode currently does not support mixed CPU and GPU offload, i.e. it is GPU offload only, and if the model doesn't fit in GPU memory there is no way to pick up the rest of the layers with the CPU?

Might be possible to start 2 RPC servers on the same machine - one CPU and one GPU?

Good idea, I did not think of that. For a big model like Mixtral 8x22B, loading would be very slow though, pushing 50-60 GB through a local RPC socket instead of moving the layers straight from file into memory ... and inference would also be slower, though I'm guessing that would not be a huge overhead hit since the CPU is not fast anyway.

@steampunque
Author

I am guessing that RPC mode currently does not support mixed CPU and GPU offload

The problem is that we don't report the available memory for the CPU and Metal backends, so it defaults to 1MB: https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/rpc-server.cpp#L41 As a stopgap solution, I will add command-line arguments for free_mem and total_mem until we implement this for all platforms.

That would be a perfectly fine stopgap solution, thanks!

When the RPC servers crash, they cannot be restarted without doing a hard restart of the RPC subsystem (restarting rpcbind, etc.). Something is not being cleaned up correctly when the RPC servers crash with a SEGV.

I guess this is because rpc-server fails to start on the same port and it needs some time for the old socket to expire. Does changing the port help?

Most likely it would work, but then the host (serverrpc in my case) would also need to be restarted with the new port address for the machine that crashed. To handle this cleanly, it would most likely be necessary to install a signal handler trapping ideally any signal that can abort the RPC server (SIGHUP, SIGTERM) as well as the errors (SIGSEGV), and clean up the port before exit; a rough sketch of the idea is below. I think SEGVs are extremely tricky to handle this way though (https://stackoverflow.com/questions/2663456/how-to-write-a-signal-handler-to-catch-sigsegv), so it might be better to find the source of the SEGV when it runs out of memory and just go to a soft-reset state when whatever would have caused the SEGV happens (running out of GPU memory did it here). When I stopped the program with Ctrl-C on the command line it was able to restart on the same port with no problem, but I didn't try stopping it using kill.
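
A rough sketch of that handler (g_listen_fd is a hypothetical handle to the listening socket, not the actual rpc-server code):

```cpp
#include <csignal>
#include <unistd.h>

static int g_listen_fd = -1;  // hypothetical handle to the listening socket

static void on_signal(int) {
    if (g_listen_fd >= 0) {
        close(g_listen_fd);   // release the port before exiting
    }
    _exit(1);                 // async-signal-safe exit
}

static void install_handlers() {
    std::signal(SIGHUP,  on_signal);
    std::signal(SIGTERM, on_signal);
    // SIGSEGV is intentionally left alone: handling it safely is very hard,
    // so fixing the out-of-memory SEGV itself is the better path.
}
```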

@rgerganov
Collaborator

I think that setting SO_REUSEADDR on the socket might help here. Will give it a try and submit a PR if it works.
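
For reference, the idea in generic POSIX socket code (a sketch, not the actual rpc-server implementation): SO_REUSEADDR lets a new server bind to a port that is still in TIME_WAIT after a crash.

```cpp
#include <cstdint>
#include <cstring>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int create_listen_socket(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    // allow re-binding to a port left in TIME_WAIT by a crashed instance
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(port);

    if (bind(fd, (sockaddr *)&addr, sizeof(addr)) != 0 || listen(fd, 1) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```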

rgerganov added a commit to rgerganov/llama.cpp that referenced this issue May 16, 2024
This can be overridden with the -m command line option

ref: ggerganov#7293
rgerganov added a commit that referenced this issue May 16, 2024
This can be overridden with the -m command line option

ref: #7293
rgerganov added a commit to rgerganov/llama.cpp that referenced this issue May 16, 2024
@rgerganov
Collaborator

rgerganov commented May 16, 2024

When the rpc servers crash, they cannot be restarted without doing a hard restart of the RPC subsystem (restart rpcbind, etc.)

I believe I fixed this with #7320. Let me know if it works for you.

@steampunque
Author

When the RPC servers crash, they cannot be restarted without doing a hard restart of the RPC subsystem (restarting rpcbind, etc.)

I believe I fixed this with #7320. Let me know if it works for you.

NO PATCH:

bash-5.1$ ll_startrpc
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 7918 MB
Accepted client connection, free_mem=8303280128, total_mem=8500477952
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6144.00 MiB on device 0: cudaMalloc failed: out of memory
/usr/local/bin/ll_startrpc: line 1: 27019 Segmentation fault rpc-server 0.0.0.0 50052
bash-5.1$
bash-5.1$
bash-5.1$ rpc-server 0.0.0.0 50052
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 7918 MB
Failed to create server socket

WITH PATCH:

bash-5.1$ ll_startrpc
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 7822 MB
Accepted client connection, free_mem=8202092544, total_mem=8500477952
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6144.00 MiB on device 0: cudaMalloc failed: out of memory
/usr/local/bin/ll_startrpc: line 1: 28747 Segmentation fault rpc-server -H 0.0.0.0 -p 50052
bash-5.1$
bash-5.1$
bash-5.1$
bash-5.1$ ll_startrpc
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 7822 MB

I applied the patch to b2901 and it looks like it fixes the problem; I never reset the RPC subsystem after the first crash.

I'd argue the SEGV is a bug which should be fixed. It should be debuggable by starting a debug build of rpc-server under gdb and seeing what code is doing the illegal memory access when it runs out of memory. You can force the crash just by allocating a huge context; here I use 16384 with Phi-3 on an 8G GPU, and it should reliably run out of memory and then SEGV:

FATTN=0 RPC=1 C4k=16384 ll_start phi-3 1
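
A generic way to capture the backtrace would be to run the server under gdb (illustrative commands, adjust the rpc-server arguments to your setup):

```
gdb --args rpc-server -H 0.0.0.0 -p 50052
(gdb) run
# reproduce the out-of-memory condition from the client, then after the SIGSEGV:
(gdb) bt
```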

Thanks for fixing this problem!

On an RPC-related note, I was testing Mixtral fully offloaded to 3 GPUs on 3 separate machines yesterday, and on one of my test prompts it just stopped generating shortly after beginning to generate the answer. Testing with CPU + 1 GPU local offload (no RPC), the prompt worked fine. I don't know if this is an RPC issue or a multiple-GPU issue related to splitting the KV cache across the 3 GPUs. If I find out anything more about it, I will create another issue about this newly discovered problem.

b2901 seems to be working fine; Phi-3 scored identically on lambada and had nearly identical prompt and generation speeds with both RPC and local CPU/GPU modes. Thank you also for getting rid of all that messaging; it runs much cleaner now.

Q9650 CPU + 1070 GPU

GPU RPC (fully offloaded)
566.47 tokens per second
41.30 tokens per second

GPU (fully offloaded)
601.75 tokens per second
49.17 tokens per second

teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this issue May 17, 2024
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this issue May 17, 2024
This can be overridden with the -m command line option

ref: ggerganov#7293
@steampunque steampunque reopened this May 17, 2024
@steampunque
Author

FYI update: with the latest release b2910, the aborted response with the 3-RPC Mixtral full offload is no longer happening (perhaps the KV-related patch was the issue). However, I did more investigation and found that responses can differ based on the order in which the RPC GPUs are loaded. I suspect this is due to the distributed KV cache, but --main-gpu and --split-mode don't work with RPC. It might be nice to add support for a --main-rpc option and to support --split-mode and --tensor-split with RPC mode so the KV cache can be made contiguous within one RPC GPU.
