Allow server to load multiple models at the same time #1249

@eburnette

Description

I'm building a multi-user server where different people can reference different models. Currently the server lets me specify multiple models in the config file, but whenever I switch the model I'm using, it unloads the old model and loads the new one. This takes too long for my use case.

I'd like the server to load multiple models and rely on virtual memory to page them out when it runs out of room. My GPU is big enough for more than one; in fact, I've tested this with multiple instances of llama_cpp where the offload memory needed exceeded my VRAM, and it worked fine (graceful degradation).

I could write another program to juggle multiple processes, with each process loading one model, but I'd rather not. The config file already allows me to specify multiple models and it mostly works (except for the load/unload problem), so I think it would be natural to support multiple models at once.
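
For reference, this is roughly the config I have in mind (the paths, aliases, and exact field names here are from memory and may not match the server's settings exactly):

```json
{
  "host": "0.0.0.0",
  "port": 8000,
  "models": [
    {
      "model": "models/llama-2-7b-chat.Q4_K_M.gguf",
      "model_alias": "chat-model",
      "n_gpu_layers": -1,
      "n_ctx": 4096
    },
    {
      "model": "models/bge-small-en.Q8_0.gguf",
      "model_alias": "embedding-model",
      "embedding": true,
      "n_gpu_layers": -1
    }
  ]
}
```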

One simple use case for this is having one language model and one embedding model and going back and forth between them (like in a RAG app).
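
Here is a rough sketch of that back-and-forth against the server's OpenAI-compatible endpoints (the model aliases and the retrieval step are placeholders, not code from my actual app):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server.
# "embedding-model" and "chat-model" are hypothetical aliases from the config above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-required")

question = "How do I enable GPU offload?"

# Step 1: embed the question (embedding model).
emb = client.embeddings.create(model="embedding-model", input=question)
query_vector = emb.data[0].embedding

# Step 2: ...retrieve the most relevant documents with query_vector...
context = "GPU offload is controlled by the n_gpu_layers setting."

# Step 3: answer using the retrieved context (language model).
chat = client.chat.completions.create(
    model="chat-model",
    messages=[
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(chat.choices[0].message.content)
```

Today, every switch between "embedding-model" and "chat-model" pays the full model load time; with both models resident, the round trip would only cost inference.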

I saw code in llama_cpp to implement 'slots'; perhaps that will help here.

I see there's another issue open to support multiple completions at once, and while that will be useful, it's not the same as what I'm requesting. This issue asks that the old model not be unloaded unless necessary, so that I can request model A, then model B, then model A again without having to reload A.
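
To make that concrete, what I'm imagining on the server side is something like a small cache of loaded models keyed by alias, rather than unload-then-reload on every switch. This is only a sketch of the idea under my assumptions, not a proposal for the actual implementation:

```python
from llama_cpp import Llama

# Hypothetical sketch: keep every requested model resident and let the
# driver/OS page memory if VRAM runs short, instead of unloading model A
# just because model B was requested.
_loaded: dict[str, Llama] = {}

def get_model(alias: str, settings: dict) -> Llama:
    """Return an already-loaded model, loading it only on first use."""
    if alias not in _loaded:
        _loaded[alias] = Llama(
            model_path=settings["model"],
            n_gpu_layers=settings.get("n_gpu_layers", 0),
            embedding=settings.get("embedding", False),
        )
    return _loaded[alias]

# Requesting A, then B, then A again would pay the load cost only twice:
#   get_model("chat-model", chat_settings)        # load A
#   get_model("embedding-model", embed_settings)  # load B
#   get_model("chat-model", chat_settings)        # A is still loaded
```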
