RPC issues and comments #7293

Closed
steampunque opened this issue May 15, 2024 · 8 comments

@steampunque

Local LAN setup: 1x 1070, 1x 4070, 1x 4070, configured with the new RPC backend and a server patched to use RPC.

I did a run fully offloading Mixtral Q4_K_M onto the 3 GPUs with RPC, and everything looked good:

llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: RPC buffer size = 7043.34 MiB (1070)
llm_load_tensors: RPC buffer size = 9391.12 MiB (4070)
llm_load_tensors: RPC buffer size = 8711.09 MiB (4070)

All layers were offloaded, and the timings I am getting are:

pp 105.99 tokens per second
tg 25.68 tokens per second

This compares to around 5 t/s generation with CPU + 1x 4070, so the more-than-5x speedup is nice, and it seems to be working OK. Some issues I have found so far:

The RPC servers are spamming rpc_get_tensor and rpc_set_tensor messages to the console; this should be turned off unless debugging.

I initially tried a partial offload to two machines (8G + 12G) but got an out-of-memory crash on one of the servers, so I am guessing that RPC mode currently does not support mixed CPU and GPU offload, i.e. it is GPU offload only, and if the model doesn't fit in GPU memory there is no way to pick up the rest of the layers with the CPU? This is more of a question. It should be possible to pick up the remaining layers that won't fit into the RPC GPUs on the host running the server (my host has 128G RAM).

When the RPC servers crash, they cannot be restarted without doing a hard restart of the RPC subsystem (restarting rpcbind, etc.). Something is not being cleaned up correctly when the RPC servers crash with a SEGV.

Also, great work on this feature; it is extremely useful! It would be very good to support mixed CPU and GPU with this mode, though, so that the huge models such as DBRX, Command R+, Falcon 180B, and the 70G Llama 3 monster could be run if desired.

@ggerganov
Owner

I am guessing that RPC mode currently does not support mixed CPU and GPU offload, i.e. it is GPU offload only, and if the model doesn't fit in GPU memory there is no way to pick up the rest of the layers with the CPU?

Might be possible to start 2 RPC servers on the same machine - one CPU and one GPU?

@rgerganov
Collaborator

rgerganov commented May 15, 2024

Thanks for the feedback!

The RPC servers are spamming rpc_get_tensor and rpc_set_tensor messages to the console; this should be turned off unless debugging.

Sure, will turn this off by default.
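
Something along these lines, with a hypothetical GGML_RPC_DEBUG flag (a sketch, not the actual llama.cpp logging code), would keep the messages available for debugging while silencing them by default:

```cpp
#include <cstdio>

// Gate per-tensor RPC logging behind a compile-time flag so the
// rpc_get_tensor / rpc_set_tensor traffic is silent by default.
#ifdef GGML_RPC_DEBUG
#define RPC_DEBUG_LOG(...) fprintf(stderr, __VA_ARGS__)
#else
#define RPC_DEBUG_LOG(...) do { } while (0)
#endif

// hypothetical call site:
// RPC_DEBUG_LOG("rpc_get_tensor: %zu bytes\n", (size_t) size);
```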

I am guessing that RPC mode currently does not support mixed CPU and GPU offload

The problem is that we don't report the available memory for the CPU and Metal backends, so it defaults to 1MB:
https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/rpc-server.cpp#L41
As a stopgap solution, I will add command-line arguments for free_mem and total_mem until we implement this for all platforms.
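
For illustration, a rough sketch of what reporting backend memory could look like, using the real cudaMemGetInfo call for CUDA and falling back to an override elsewhere (the function name and the override parameter are made up for the sketch, not the actual rpc-server code):

```cpp
#include <cstddef>
#ifdef GGML_USE_CUDA
#include <cuda_runtime.h>
#endif

// Report free/total backend memory. CUDA can query the device directly; for
// backends without a query yet (CPU, Metal), use the command-line override
// when given, otherwise keep the current 1 MB default.
static void get_backend_memory(size_t * free_mem, size_t * total_mem, size_t override_bytes) {
#ifdef GGML_USE_CUDA
    (void) override_bytes;
    cudaMemGetInfo(free_mem, total_mem);
#else
    *free_mem  = override_bytes ? override_bytes : 1024 * 1024;
    *total_mem = *free_mem;
#endif
}
```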

It should be possible to pick up the remaining layers that won't fit into the RPC GPUs on the host running the server (my host has 128G RAM).

Agree, will be working on this.

When the RPC servers crash, they cannot be restarted without doing a hard restart of the RPC subsystem (restarting rpcbind, etc.). Something is not being cleaned up correctly when the RPC servers crash with a SEGV.

I guess this is because rpc-server fails to start on the same port and it needs some time for the old socket to expire. Does changing the port help?

@steampunque
Author

I am guessing that RPC mode currently does not support mixed CPU and GPU offload, i.e. it is GPU offload only, and if the model doesn't fit in GPU memory there is no way to pick up the rest of the layers with the CPU?

Might be possible to start 2 RPC servers on the same machine - one CPU and one GPU?

Good idea, I did not think of that. For a big model like Mixtral 8x22B, loading would be very slow though, pushing 50-60 GB through a local RPC socket instead of moving the layers straight from file into memory ... and inference would also be slower, though I'm guessing that would not be a huge overhead hit since the CPU is not fast anyway.

@steampunque
Author

I am guessing that RPC mode currently does not support mixed CPU and GPU offload

The problem is that we don't report the available memory for the CPU and Metal backends, so it defaults to 1MB: https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/rpc-server.cpp#L41 As a stopgap solution, I will add command-line arguments for free_mem and total_mem until we implement this for all platforms.

That would be a perfectly fine stopgap solution, thanks!

When the RPC servers crash, they cannot be restarted without doing a hard restart of the RPC subsystem (restarting rpcbind, etc.). Something is not being cleaned up correctly when the RPC servers crash with a SEGV.

I guess this is because rpc-server fails to start on the same port and it needs some time for the old socket to expire. Does changing the port help?

Most likely it would work, but then the host (serverrpc in my case) would also need to be restarted with the new port address for the machine that crashed. To handle this cleanly, it would most likely be necessary to install a signal handler trapping ideally any signal that can abort the RPC server (SIGHUP, SIGTERM) as well as the errors (SIGSEGV), and clean up the port before exit; a rough sketch of the idea is below. I think SEGVs are extremely tricky to handle this way though (https://stackoverflow.com/questions/2663456/how-to-write-a-signal-handler-to-catch-sigsegv), so it might be better to find the source of the SEGV when it runs out of memory and just go to a soft-reset state when whatever would have caused the SEGV happens (running out of GPU memory did it here). When I stopped the program with Ctrl-C on the command line it was able to restart on the same port with no problem, but I didn't try stopping it using kill.
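
A rough sketch of that handler (g_listen_fd is a hypothetical handle to the listening socket, not the actual rpc-server code):

```cpp
#include <csignal>
#include <unistd.h>

static int g_listen_fd = -1;  // hypothetical handle to the listening socket

static void on_signal(int) {
    if (g_listen_fd >= 0) {
        close(g_listen_fd);   // release the port before exiting
    }
    _exit(1);                 // async-signal-safe exit
}

static void install_handlers() {
    std::signal(SIGHUP,  on_signal);
    std::signal(SIGTERM, on_signal);
    // SIGSEGV is intentionally left alone: handling it safely is very hard,
    // so fixing the out-of-memory SEGV itself is the better path.
}
```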

@rgerganov
Collaborator

I think that setting SO_REUSEADDR on the socket might help here. Will give it a try and submit a PR if it works.
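
For reference, the idea in generic POSIX socket code (a sketch, not the actual rpc-server implementation): SO_REUSEADDR lets a new server bind to a port that is still in TIME_WAIT after a crash.

```cpp
#include <cstdint>
#include <cstring>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int create_listen_socket(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    // allow re-binding to a port left in TIME_WAIT by a crashed instance
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(port);

    if (bind(fd, (sockaddr *)&addr, sizeof(addr)) != 0 || listen(fd, 1) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```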

rgerganov added a commit to rgerganov/llama.cpp that referenced this issue May 16, 2024
This can be overridden with the -m command line option

ref: ggerganov#7293
rgerganov added a commit that referenced this issue May 16, 2024
This can be overridden with the -m command line option

ref: #7293
rgerganov added a commit to rgerganov/llama.cpp that referenced this issue May 16, 2024
@rgerganov
Collaborator

rgerganov commented May 16, 2024

When the rpc servers crash, they cannot be restarted without doing a hard restart of the RPC subsystem (restart rpcbind, etc.)

I believe I fixed this with #7320. Let me know if it works for you.

@steampunque
Author

When the RPC servers crash, they cannot be restarted without doing a hard restart of the RPC subsystem (restarting rpcbind, etc.)

I believe I fixed this with #7320. Let me know if it works for you.

NO PATCH:

bash-5.1$ ll_startrpc
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 7918 MB
Accepted client connection, free_mem=8303280128, total_mem=8500477952
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6144.00 MiB on device 0: cudaMalloc failed: out of memory
/usr/local/bin/ll_startrpc: line 1: 27019 Segmentation fault rpc-server 0.0.0.0 50052
bash-5.1$
bash-5.1$
bash-5.1$ rpc-server 0.0.0.0 50052
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 7918 MB
Failed to create server socket

WITH PATCH:

bash-5.1$ ll_startrpc
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 7822 MB
Accepted client connection, free_mem=8202092544, total_mem=8500477952
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6144.00 MiB on device 0: cudaMalloc failed: out of memory
/usr/local/bin/ll_startrpc: line 1: 28747 Segmentation fault rpc-server -H 0.0.0.0 -p 50052
bash-5.1$
bash-5.1$
bash-5.1$
bash-5.1$ ll_startrpc
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 7822 MB

I applied the patch to b2901 and it looks like it fixes the problem; I never reset the RPC subsystem after the first crash.

I'd argue the SEGV is a bug which should be fixed. It should be debuggable by starting a debug build of rpc-server under gdb and seeing what code is doing the illegal memory access when it runs out of memory. You can force the crash just by allocating a huge context; here I use 16384 with Phi-3 on an 8G GPU, and it should reliably run out of memory and then SEGV:

FATTN=0 RPC=1 C4k=16384 ll_start phi-3 1
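
A generic way to capture the backtrace would be to run the server under gdb (illustrative commands, adjust the rpc-server arguments to your setup):

```
gdb --args rpc-server -H 0.0.0.0 -p 50052
(gdb) run
# reproduce the out-of-memory condition from the client, then after the SIGSEGV:
(gdb) bt
```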

Thanks for fixing this problem!

On an RPC-related note, I was testing Mixtral fully offloaded to 3 GPUs on 3 separate machines yesterday, and on one of my test prompts it just stopped generating shortly after beginning to generate the answer. Testing with CPU + 1 GPU local offload (no RPC), the prompt worked fine. I don't know if this is an RPC issue or a multiple-GPU issue related to splitting the KV cache across the 3 GPUs. If I find out anything more about it, I will create another issue about this newly discovered problem.

b2901 seems to be working fine; Phi-3 scored identically on lambada and had nearly identical prompt and generation speeds with both RPC and local CPU/GPU modes. Thank you also for getting rid of all that messaging; it runs much cleaner now.

Q9650 CPU + 1070 GPU

GPU RPC (fully offloaded)
566.47 tokens per second
41.30 tokens per second

GPU (fully offloaded)
601.75 tokens per second
49.17 tokens per second

teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this issue May 17, 2024
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this issue May 17, 2024
This can be overridden with the -m command line option

ref: ggerganov#7293
@steampunque steampunque reopened this May 17, 2024
@steampunque
Author

FYI update: with the latest release b2910, the aborted response with the 3-RPC Mixtral full offload is no longer happening (perhaps the KV-related patch was the issue). However, I did more investigation and found that responses can differ based on the order in which the RPC GPUs are loaded. I suspect this is due to the distributed KV cache, but --main-gpu and --split-mode don't work with RPC. It might be nice to add support for a --main-rpc option and to support --split-mode and --tensor-split with RPC mode so the KV cache can be made contiguous within one RPC GPU.
