Replies: 2 comments
@RonnyPfannschmidt FYI, it would be good to capture your requirements here.
My needs can be captured on multiple levels:
Shared entrypoint even when running multiple instances with different models
It's an utter pain to manage dozens of endpoints to connect to, each serving only a single model or two. Some kind of router is absolutely required.
Shared/linked routers
I have multiple computers, and now even an Olares One AI home server. It's very painful to manage multiple models on multiple computers. It would be great if there were some kind of linkable ingress, so I can run models on multiple systems, and each of my computers just goes to localhost and selects a model name.
RFE: Model Swapping
Related: #2080
It would be great if RamaLama supported Model Swapping: starting and stopping models ad hoc, based on the requested URLs and a swap configuration, similar to llama-swap. For this, however, RamaLama would need some kind of daemon acting as a model router with a reverse proxy.
The daemon/model router would be a new component, (mostly) independent from RamaLama.
Current state in RamaLama
Currently, there is a daemon implementation inside RamaLama that (sort of) enables model routing. However, it shares a large part of its code with the RamaLama core implementation and is designed to be deployed right next to the inference engine, such as llama.cpp, which requires it to be part of the container. Since this is not desirable (and there are plans to remove RamaLama from the container images), the current daemon implementation is not a viable solution.
Proposals
There are various options on how to enable model swapping.
llama-swap
RamaLama could leverage existing tools, such as llama-swap, to quickly implement model swapping. In the case of llama-swap, RamaLama could generate configurations on the fly for the simplest case (only one model at a time) and provide a simplified interface for further configuration:
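As a rough illustration of generating such a configuration on the fly, the sketch below renders one entry per model, each started in its own container. The key names (`models`, `cmd`, `cmdStop`, `ttl`) and the `${PORT}` macro follow my understanding of llama-swap's YAML layout and should be checked against its documentation. `ModelEntry`, `render_swap_config`, the image name, and the in-container model path are hypothetical names made up for this example, not part of RamaLama or llama-swap.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ModelEntry:
    name: str        # model name clients send in the request's "model" field
    model_path: str  # host path to the model file (hypothetical)
    ttl: int = 300   # idle seconds before the model is unloaded


def render_swap_config(
    entries: Iterable[ModelEntry],
    image: str = "example.invalid/inference:latest",  # placeholder image name
) -> str:
    """Render a llama-swap style YAML document, one container per model."""
    lines = ["models:"]
    for entry in entries:
        container = f"swap-{entry.name}"
        # "${{PORT}}" in the f-string becomes the literal "${PORT}" macro,
        # which is assumed to be filled in by llama-swap at start time.
        cmd = (
            f"podman run --rm --name {container} "
            f"-p ${{PORT}}:8080 "
            f"-v {entry.model_path}:/mnt/models/model.file:ro "
            f"{image} llama-server --host 0.0.0.0 --port 8080 "
            f"-m /mnt/models/model.file"
        )
        lines.append(f'  "{entry.name}":')
        lines.append("    cmd: |")  # block scalar avoids YAML quoting issues
        lines.append(f"      {cmd}")
        lines.append(f"    cmdStop: podman stop {container}")
        lines.append(f"    ttl: {entry.ttl}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_swap_config([ModelEntry("tiny", "/models/tiny.gguf", ttl=60)]))
```

The sketch assumes simple model names without quotes or whitespace; a real integration would need proper YAML escaping and the actual image and mount layout used by RamaLama.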
However, this requires integration with the external tool, e.g. llama-swap, and the user to also install it locally in order to have one model per container (podman run required). If multiple models can/should run in one container, it can be used containerized. Additionally, it also (potentially) limits future features or special cases.
llama.cpp
llama-server also provides a model router out of the box (see here) and has the proper API to load/unload models. In order to support this, RamaLama would just need to mount the model store into the inference container in the required structure. For the simple cases this should then work out of the box (or require minimal changes); more advanced cases require more work.
However, this is not possible with --nocontainer (having the inference engine on the host, not in the container) due to the specific folder structure required by llama-server. It also enables swapping only for llama.cpp.
Custom, thin model router implementation
Another option would be a model router implementation, similar to the currently existing daemon in RamaLama. However, its responsibilities would only include:
Similar to the other options, it can be implemented as a simple server providing a REST (or any other) API and the proxy, e.g. using Flask (Python) or Go or ...
However, this would require the ramalama-daemon to run directly on the host (similar to RamaLama) since it needs to be able to start/stop containers.
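To make the idea of a thin router more concrete, here is a minimal sketch of its core decision logic only: given a requested model name, stop whatever is currently running, start the requested model, and return the upstream URL to proxy to. Everything here is an assumption for illustration: `ModelSwapper`, the one-model-at-a-time policy, and the injected `start`/`stop` callbacks are hypothetical. In a real daemon the callbacks would presumably wrap container start/stop calls, and an HTTP layer would do the actual proxying.

```python
import threading
from typing import Callable, Dict, Optional


class ModelSwapper:
    """Keep at most one model backend running and report where to proxy."""

    def __init__(
        self,
        upstreams: Dict[str, str],     # model name -> base URL of its server
        start: Callable[[str], None],  # hypothetical hook, e.g. start a container
        stop: Callable[[str], None],   # hypothetical hook, e.g. stop a container
    ) -> None:
        self._upstreams = upstreams
        self._start = start
        self._stop = stop
        self._active: Optional[str] = None
        self._lock = threading.Lock()  # serialize swaps across request threads

    def route(self, model: str, path: str) -> str:
        """Ensure `model` is the running one and return the URL to forward to."""
        if model not in self._upstreams:
            raise KeyError(f"unknown model: {model}")
        with self._lock:
            if model != self._active:
                if self._active is not None:
                    self._stop(self._active)
                    self._active = None  # nothing runs if start() fails below
                self._start(model)
                self._active = model
        return self._upstreams[model].rstrip("/") + "/" + path.lstrip("/")


if __name__ == "__main__":
    swapper = ModelSwapper(
        {"a": "http://127.0.0.1:8081", "b": "http://127.0.0.1:8082"},
        start=lambda m: print("start", m),
        stop=lambda m: print("stop", m),
    )
    print(swapper.route("a", "/v1/chat/completions"))
    print(swapper.route("b", "/v1/chat/completions"))
```

Injecting the start/stop hooks keeps the swap policy independent of how models are actually launched, which fits the goal of a router that is (mostly) independent from RamaLama.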
Others
There are many more options for achieving this.