Name and Version
llama-server --version
version: 7054 (becc481)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux, Mac
GGML backends
CUDA, Metal, Vulkan
Hardware
- MacBook Pro M3 Pro, 18 GB (14 GB allocated to GPU)
- AMD 9900X, 128 GB, with Radeon AI Pro R9700 (32 GB)
- AMD 7950X, 128 GB, with NVIDIA RTX 3090 Ti (24 GB)
Models
unsloth gpt-oss-120b-GGUF Q4_K_M
Problem description & steps to reproduce
I'm trying to run llama-server across 3 machines, 2 of which act as RPC nodes sharing only their GPUs. When I include the Mac, so I can load larger models, I get frequent crashes and mid-stream output corruption after receiving a partial response to my prompt. Without the Mac the setup is somewhat more stable.
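For reference, this is roughly how I launch the cluster (host names, ports, and model paths below are placeholders, not my exact values):

```
# On each RPC node (the Mac and the 3090 box), expose the local GPU:
./rpc-server -H 0.0.0.0 -p 50052

# On the main host, point llama-server at both RPC nodes:
./llama-server --model gpt-oss-120b-Q4_K_M.gguf \
    --rpc mac.local:50052,rtx3090.local:50052 \
    --n-gpu-layers 99 --host 0.0.0.0 --port 26100
```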
However, I also noticed that I always get the error below as long as I use rpc-server. Even if I choose a model that easily fits on either node by itself, for example Gemma 27B, llama-server with RPC still loads some of it on the CPU:

load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
Obviously this slows down processing, making the GPU clustering useless. I've synced my checkout to the latest release tag and tried different quantizations of gpt-oss-120b as well as a few other models, with the same results.
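If it helps with triage, I can also try pinning that tensor to a device buffer explicitly. A sketch of what I have in mind, assuming I'm reading the --override-tensor help correctly (the buffer name CUDA0 is my guess for the 3090 node, and this is untested):

```
# Untested idea: override the fallback and force token_embd onto a GPU buffer.
./llama-server --model gemma-2-27b-it-Q5_0.gguf \
    --rpc rtx3090.local:50052 --n-gpu-layers 99 \
    --override-tensor "token_embd.weight=CUDA0"
```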
I understand rpc-server is experimental, so if these are known issues I can hang back. Thanks.
First Bad Commit
No response
Relevant log output
1. Frequent segmentation fault on startup
Segmentation fault (core dumped) ./llama-server --model $modelPath --host ${RPC_BIND} --port 26100 --ctx-size $contextSize -t $numProcs --n-gpu-layers 99 $rpcArg --no-warmup --verbose
2. Tensors load on the CPU despite more than enough GPU memory
load_tensors: tensor 'token_embd.weight' (q5_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
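Happy to capture a backtrace from the core dump if that's useful. Roughly what I'd run (assuming core dumps are enabled and the binary has debug symbols):

```
ulimit -c unlimited                       # allow core dumps in this shell
./llama-server --model $modelPath ...     # reproduce the segfault
gdb -batch -ex bt ./llama-server core     # print the backtrace from the core file
```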