Allow server to load multiple models at the same time #1249

@eburnette

Description

I'm building a multi-user server where different people can reference different models. Currently the server lets me specify multiple models in the config file, but whenever I switch the model I'm using, it unloads the old model and loads the new one. This takes too long for my use case.

I'd like the server to load multiple models and rely on virtual memory to page them out when it runs out of room. My GPU is big enough for more than one; in fact, I've tested this with multiple instances of llama_cpp where the offload memory needed exceeded my VRAM, and it worked fine (graceful degradation).

I could write another program to juggle multiple processes, with each process loading one model, but I'd rather not. The config file already allows me to specify multiple models and it mostly works (except for the load/unload problem), so I think it would be natural to support multiple models at once.
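
For reference, this is roughly the config I have in mind (the paths, aliases, and exact field names here are from memory and may not match the server's settings exactly):

```json
{
  "host": "0.0.0.0",
  "port": 8000,
  "models": [
    {
      "model": "models/llama-2-7b-chat.Q4_K_M.gguf",
      "model_alias": "chat-model",
      "n_gpu_layers": -1,
      "n_ctx": 4096
    },
    {
      "model": "models/bge-small-en.Q8_0.gguf",
      "model_alias": "embedding-model",
      "embedding": true,
      "n_gpu_layers": -1
    }
  ]
}
```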

One simple use case for this is having one language model and one embedding model and going back and forth between them (like in a RAG app).
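
Here is a rough sketch of that back-and-forth against the server's OpenAI-compatible endpoints (the model aliases and the retrieval step are placeholders, not code from my actual app):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server.
# "embedding-model" and "chat-model" are hypothetical aliases from the config above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-required")

question = "How do I enable GPU offload?"

# Step 1: embed the question (embedding model).
emb = client.embeddings.create(model="embedding-model", input=question)
query_vector = emb.data[0].embedding

# Step 2: ...retrieve the most relevant documents with query_vector...
context = "GPU offload is controlled by the n_gpu_layers setting."

# Step 3: answer using the retrieved context (language model).
chat = client.chat.completions.create(
    model="chat-model",
    messages=[
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(chat.choices[0].message.content)
```

Today, every switch between "embedding-model" and "chat-model" pays the full model load time; with both models resident, the round trip would only cost inference.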

I saw code in llama_cpp to implement 'slots'; perhaps that will help here.

I see there's another issue open to support multiple completions at once, and while that will be useful, it's not the same as what I'm requesting. This issue asks that the old model not be unloaded unless necessary, so that I can request model A, then model B, then model A again without having to reload A.
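
To make that concrete, what I'm imagining on the server side is something like a small cache of loaded models keyed by alias, rather than unload-then-reload on every switch. This is only a sketch of the idea under my assumptions, not a proposal for the actual implementation:

```python
from llama_cpp import Llama

# Hypothetical sketch: keep every requested model resident and let the
# driver/OS page memory if VRAM runs short, instead of unloading model A
# just because model B was requested.
_loaded: dict[str, Llama] = {}

def get_model(alias: str, settings: dict) -> Llama:
    """Return an already-loaded model, loading it only on first use."""
    if alias not in _loaded:
        _loaded[alias] = Llama(
            model_path=settings["model"],
            n_gpu_layers=settings.get("n_gpu_layers", 0),
            embedding=settings.get("embedding", False),
        )
    return _loaded[alias]

# Requesting A, then B, then A again would pay the load cost only twice:
#   get_model("chat-model", chat_settings)        # load A
#   get_model("embedding-model", embed_settings)  # load B
#   get_model("chat-model", chat_settings)        # A is still loaded
```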
