Name and Version
llama-server --version
version: 7054 (becc481)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux, Mac
GGML backends
CUDA, Metal, Vulkan
Hardware
- MacBook Pro M3 Pro, 18 GB (14 GB allocated to GPU)
- AMD 9900X, 128 GB, with Radeon AI Pro R9700 (32 GB)
- AMD 7950X, 128 GB, with NVIDIA RTX 3090 Ti (24 GB)
Models
unsloth gpt-oss-120b-GGUF Q4_K_M
Problem description & steps to reproduce
I'm trying to run llama-server across 3 machines, 2 of which act as RPC nodes sharing only their GPUs. When I include the Mac, so I can load larger models, I get frequent crashes and mid-stream output corruption after receiving a partial response to my prompt. Without the Mac the setup is somewhat more stable.
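For reference, this is roughly how I launch the cluster (host names, ports, and model paths below are placeholders, not my exact values):

```
# On each RPC node (the Mac and the 3090 box), expose the local GPU:
./rpc-server -H 0.0.0.0 -p 50052

# On the main host, point llama-server at both RPC nodes:
./llama-server --model gpt-oss-120b-Q4_K_M.gguf \
    --rpc mac.local:50052,rtx3090.local:50052 \
    --n-gpu-layers 99 --host 0.0.0.0 --port 26100
```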
However, I also noticed that I always get the error below as long as I use rpc-server. Even if I choose a model that easily fits on either node by itself, for example Gemma 27B, llama-server with RPC still loads some of it on the CPU:

load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
Obviously this slows down processing, making the GPU clustering useless. I've synced my checkout to the latest release tag and tried different quantizations of gpt-oss-120b as well as a few other models, with the same results.
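If it helps with triage, I can also try pinning that tensor to a device buffer explicitly. A sketch of what I have in mind, assuming I'm reading the --override-tensor help correctly (the buffer name CUDA0 is my guess for the 3090 node, and this is untested):

```
# Untested idea: override the fallback and force token_embd onto a GPU buffer.
./llama-server --model gemma-2-27b-it-Q5_0.gguf \
    --rpc rtx3090.local:50052 --n-gpu-layers 99 \
    --override-tensor "token_embd.weight=CUDA0"
```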
I understand rpc-server is experimental, so if these are known issues I can hang back. Thanks.
First Bad Commit
No response
Relevant log output
1. Frequent segmentation fault on startup
Segmentation fault (core dumped) ./llama-server --model $modelPath --host ${RPC_BIND} --port 26100 --ctx-size $contextSize -t $numProcs --n-gpu-layers 99 $rpcArg --no-warmup --verbose
2. Tensors load on the CPU despite more than enough GPU memory
load_tensors: tensor 'token_embd.weight' (q5_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
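Happy to capture a backtrace from the core dump if that's useful. Roughly what I'd run (assuming core dumps are enabled and the binary has debug symbols):

```
ulimit -c unlimited                       # allow core dumps in this shell
./llama-server --model $modelPath ...     # reproduce the segfault
gdb -batch -ex bt ./llama-server core     # print the backtrace from the core file
```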