Automatically unload models in router mode based on free VRAM, use something like llama-swap matrix mode to determine which model #24666
sebovzeoueb
started this conversation in
Ideas
Replies: 0 comments
As far as I understand, the only way to automatically unload models in router mode so far is to set the `--models-max` parameter to the desired number of models. However, in my use case I would like the unloading to be more "intelligent", based on the user's hardware and the size of the models. A user with a lot of VRAM should be able to load either several smaller models or one larger model.
In addition to this, I would want to keep the embeddings model online at all times if possible and prioritise swapping the chat models. I've seen that llama-swap has a matrix config with a DSL that lets you define weightings and formulae to select which models should run together if possible. That may be a little overkill for what I'm asking; I think simple groupings would work, e.g.
So in the above example, suppose I have enough VRAM to run mistral7b and paraphrase-multilingual together, or qwen3 and paraphrase-multilingual together, but not all 3. If I'm chatting with mistral7b and then switch to qwen3, the router would unload mistral7b but keep paraphrase-multilingual active. If there's only enough VRAM for 1 model at a time, it would load the requested model and unload all others. Someone with more VRAM would just end up with all 3 models loaded and ready to go.
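To make the behaviour concrete, here is a minimal Python sketch of the eviction logic described above. It is not llama.cpp or llama-swap code: `Model`, the `keep` hint, `plan_load`, and the VRAM figures are all hypothetical, and it assumes the router already knows each model's footprint and the total VRAM budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    vram_mib: int       # assumed footprint; a real router would need to estimate this
    keep: bool = False  # hypothetical hint: evict this model only as a last resort


def plan_load(requested, loaded, budget_mib):
    """Return (to_unload, resident_after) so that `requested` fits in the budget."""
    if requested.vram_mib > budget_mib:
        raise ValueError(f"{requested.name} does not fit in {budget_mib} MiB")
    # Models currently loaded, other than the one being requested.
    others = [m for m in loaded if m.name != requested.name]
    used = requested.vram_mib + sum(m.vram_mib for m in others)
    # Evict ordinary models first and `keep` models last; the sort is stable,
    # so within each group the earliest-loaded model goes first.
    to_unload = []
    for m in sorted(others, key=lambda m: m.keep):
        if used <= budget_mib:
            break
        to_unload.append(m)
        used -= m.vram_mib
    resident = [m for m in others if m not in to_unload] + [requested]
    return to_unload, resident


# Made-up sizes, purely for illustration.
mistral7b = Model("mistral7b", 5000)
qwen3 = Model("qwen3", 6000)
embed = Model("paraphrase-multilingual", 1000, keep=True)

# Enough VRAM for one chat model plus the embeddings model.
unloaded, resident = plan_load(qwen3, [mistral7b, embed], budget_mib=8000)
print([m.name for m in unloaded], [m.name for m in resident])
```

With these numbers, switching from mistral7b to qwen3 unloads only mistral7b. With a budget of 6000 MiB both other models are unloaded, and with 16000 MiB nothing is unloaded.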
I was thinking about just using llama-swap instead, but I don't think it actually has the VRAM management feature either? I really think some kind of configurable smart unloading would put llama.cpp above the other options.