Automatically unload models in router mode based on free VRAM, use something like llama-swap matrix mode to determine which model #24666

sebovzeoueb · 2026-06-15T18:17:50Z

As far as I understand, the only way to have models unloaded automatically in router mode right now is to set the --models-max parameter to the maximum number of models that should stay loaded.

However, in my use case I would like the unloading to be more "intelligent", based on the user's hardware and the size of the models. A user with a lot of VRAM should be able to load either several smaller models or one larger model.

In addition to this, I would want to keep the embeddings model online at all times if possible and prioritise swapping the chat models. I've seen that llama-swap has a matrix config with a DSL that allows you to define weightings and formulae to select which models should be run together if possible. Maybe that's a little overkill for what I'm asking; I think simple groupings would work, e.g.

[groups]
chat = ["mistral7b", "qwen3"]
embeddings = ["paraphrase-multilingual"]

So in the above example, suppose I have enough VRAM to run mistral7b and paraphrase-multilingual, or qwen3 and paraphrase-multilingual, but not all 3. If I'm chatting with mistral7b and then switch to qwen3, the router would unload mistral7b but keep paraphrase-multilingual active. If there's only enough VRAM for 1 model at a time, it would load that 1 model and unload all the others. Someone with more VRAM would just end up with all 3 models loaded and ready to go.
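To make the behaviour I'm describing concrete, here is a rough Python sketch of the eviction policy. It is not llama.cpp or llama-swap code: the ToyRouter class, its fields, and the VRAM figures (in MiB) are all made up for illustration, and a real implementation would need actual per-model memory estimates. The idea is just that, when a requested model doesn't fit, models from the same group are unloaded first (least recently used first) and other groups are only touched if that still isn't enough.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRouter:
    """Hypothetical model of group-aware unloading; not a real llama.cpp API."""
    vram_total: int                # VRAM budget in MiB (made-up figure)
    sizes: dict[str, int]          # model name -> estimated VRAM use in MiB
    groups: dict[str, list[str]]   # group name -> member models
    loaded: list[str] = field(default_factory=list)  # least recently used first

    def group_of(self, model: str) -> str:
        for name, members in self.groups.items():
            if model in members:
                return name
        raise KeyError(f"{model} is not in any group")

    def used(self) -> int:
        return sum(self.sizes[m] for m in self.loaded)

    def request(self, model: str) -> list[str]:
        """Make `model` resident and return the models unloaded to make room."""
        if model in self.loaded:
            # Already resident: just mark it as most recently used.
            self.loaded.remove(model)
            self.loaded.append(model)
            return []
        if self.sizes[model] > self.vram_total:
            raise MemoryError(f"{model} does not fit in VRAM at all")
        group = self.group_of(model)
        # Candidates: same-group models first, then everything else (LRU order).
        same = [m for m in self.loaded if self.group_of(m) == group]
        other = [m for m in self.loaded if self.group_of(m) != group]
        evicted = []
        for victim in same + other:
            if self.used() + self.sizes[model] <= self.vram_total:
                break
            self.loaded.remove(victim)
            evicted.append(victim)
        self.loaded.append(model)
        return evicted


GROUPS = {"chat": ["mistral7b", "qwen3"], "embeddings": ["paraphrase-multilingual"]}
SIZES = {"mistral7b": 6000, "qwen3": 6000, "paraphrase-multilingual": 1000}

# Enough VRAM for one chat model plus the embeddings model, but not all 3.
router = ToyRouter(vram_total=8000, sizes=SIZES, groups=GROUPS)
router.request("paraphrase-multilingual")
router.request("mistral7b")
swapped_out = router.request("qwen3")  # unloads only mistral7b
```

With these made-up sizes, a budget of 6000 would leave only one model loaded at a time, and a budget of 16000 would keep all 3 resident with nothing ever unloaded.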

I was thinking about just using llama-swap instead, but I don't think it actually has the VRAM management feature either? I really think some kind of configurable smart unloading would put llama.cpp above the other options.

Replies: 0 comments
