The GGUF and Metal files are loaded relative to the current working directory, not relative to the ds4-server binary. As a result, llama-swap and other external programs can't simply execute the binary from a different directory:
username@MacStudio ~ % /Users/username/git/ds4/ds4-server --port 5555 --ctx 262144 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 32768
ds4: cannot open model 'ds4flash.gguf': No such file or directory
Alright, let's give it the full path to the model:
username@MacStudio ~ % /Users/username/git/ds4/ds4-server -m /Users/username/git/ds4/gguf/DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf --port 5555 --ctx 262144 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 32768
ds4: Metal source metal/flash_attn.metal not found (set DS4_METAL_FLASH_ATTN_SOURCE to override)
ds4: Metal backend unavailable; aborting startup
A simple workaround is to create a shell script (saved here as start-ds4.sh in the ds4 directory) and launch that instead:
#!/bin/sh
set -e
cd /Users/username/git/ds4
exec ./ds4-server --port 5555 --ctx 262144 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 32768
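If hardcoding the checkout path is undesirable, the same idea can be sketched so that the wrapper finds its own directory. This is only an illustration: the helper names `dir_of` and `run_ds4` are made up here, and it assumes the script is stored in the same directory as ds4-server.

```shell
#!/bin/sh
# Sketch of a location-independent wrapper. Assumption: this script sits in
# the same directory as the ds4-server binary (not something ds4 requires).

# Print the absolute directory that contains the given path.
# Intended to be called inside $(...), so the cd does not affect the caller.
dir_of() {
    CDPATH= cd -- "$(dirname -- "$1")" && pwd
}

# Change into the directory of the given script path, then replace this
# shell with ds4-server, passing the remaining arguments through unchanged.
run_ds4() {
    ds4_dir=$(dir_of "$1") || return 1
    shift
    cd "$ds4_dir" || return 1
    exec ./ds4-server "$@"
}
```

Ending the script with `run_ds4 "$0" "$@"` would let the flags (`--port`, `--ctx`, and so on) be supplied on the llama-swap `cmd` line instead of inside the script.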
llama-swap also performs a health check before forwarding requests (requesting /health by default). If that endpoint doesn't respond, requests are queued indefinitely, waiting for a successful health check that never arrives. /health could therefore be a good candidate for a future endpoint, especially if other software checks it too.
This is also easy to work around by pointing the health check at one of the existing endpoints instead. Here's a working config for llama-swap:
models:
  "deepseek-v4-flash":
    cmd: /Users/username/git/ds4/start-ds4.sh
    proxy: http://127.0.0.1:5555
    checkEndpoint: /v1/models
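Before relying on a given `checkEndpoint`, it may help to confirm by hand that the server answers on it. The following is a rough sketch, not part of ds4 or llama-swap: `wait_for_endpoint` is a hypothetical helper, and it assumes `curl` is installed and the server is listening on port 5555 as in the config above.

```shell
#!/bin/sh
# Poll a URL once per second until it returns an HTTP success status or the
# attempts run out. Returns 0 on success, 1 otherwise.
wait_for_endpoint() {
    url=$1
    attempts=${2:-30}   # default: try for roughly 30 seconds
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: treat HTTP errors (4xx/5xx) as failure; -sS: quiet, but show errors
        if curl -fsS -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

For example, `wait_for_endpoint http://127.0.0.1:5555/v1/models 10 && echo ready` should print `ready` once the server is up, while the same call against `/health` would be expected to time out.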
This project is amazing. Thank you to all contributors.