Eval bug: problem with llama serve and rpc #17274

@d-shehu

Description

Name and Version

llama-server --version
version: 7054 (becc481)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux, Mac

GGML backends

CUDA, Metal, Vulkan

Hardware

  1. MacBook Pro M3 Pro - 18 GB (14 GB allocated to GPU)
  2. AMD 9900X, 128 GB RAM, Radeon AI Pro R9700 - 32 GB
  3. AMD 7950X, 128 GB RAM, NVIDIA RTX 3090 Ti - 24 GB

Models

unsloth gpt-oss-120b-GGUF Q4_K_M

Problem description & steps to reproduce

I'm trying to run llama-server across 3 machines, 2 of which run as RPC nodes sharing only their GPUs. When I include the Mac, so I can load larger models, I get frequent crashes and corrupted output midway through a partial response to my prompt. Without the Mac the setup is somewhat more stable.
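For reference, this is roughly how I'm launching the cluster (a sketch: hostnames, ports, and the model path are placeholders, and the flags are as I understand them from the llama.cpp RPC docs):

```shell
# On each worker node (the Mac and one of the GPU boxes):
# start the RPC backend server. -H/-p set the bind address and port;
# the local GPU backend (Metal/CUDA/Vulkan) is used automatically.
./rpc-server -H 0.0.0.0 -p 50052

# On the main node: point llama-server at the workers via --rpc
# and request full GPU offload with --n-gpu-layers 99.
./llama-server \
  --model ./gpt-oss-120b-Q4_K_M.gguf \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 26100
```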

However, I also noticed that I always get the error below whenever I use rpc-server. Even if I choose a model, for example Gemma 27B, that easily fits on either node by itself, llama-server with RPC still loads part of it on the CPU.

"load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead"

Obviously this slows down processing, making the GPU clustering useless. I've synced my checkout to the latest release tag and tried different quantizations of gpt-oss-120b and a few other models, with the same results.

I understand rpc-server is experimental, so if these are known issues I can hang back. Thanks.

First Bad Commit

No response

Relevant log output

1. Frequent segmentation fault on startup

 30 Segmentation fault      (core dumped) ./llama-server --model $modelPath --host ${RPC_BIND} --port 26100 --ctx-size $contextSize -t $numProcs --n-gpu-layers 99 $rpcArg --no-warmup --verbose

2. Tensors load on the CPU despite more than enough GPU memory

load_tensors: tensor 'token_embd.weight' (q5_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
