RFE: Model Swapping #2282

engelmi · 2026-01-07T16:38:31Z

engelmi
Jan 7, 2026
Maintainer

RFE: Model Swapping

Related: #2080

It would be great if RamaLama supported model swapping, i.e. starting and stopping models ad hoc based on the requested URLs and a swap configuration, similar to llama-swap. For this, however, RamaLama would need some kind of daemon acting as a model router with a reverse proxy.

The daemon/model router would be a new component, (mostly) independent from RamaLama.

Current state in RamaLama

Currently, there is a daemon implementation inside RamaLama that (sort of) enables model routing. However, it shares a large part of its code with the RamaLama core implementation and is designed to be deployed right next to the inference engine (such as llama.cpp), which requires it to be part of the container. Since this is not desirable (and there are plans to remove RamaLama from the container images), the current daemon implementation is not a viable solution.

Proposals

There are various options on how to enable model swapping.

llama-swap

RamaLama could leverage existing tools, such as llama-swap, to quickly implement model swapping. In the case of llama-swap, RamaLama could generate configurations on the fly for the simplest case (only one model at a time) and provide a simplified interface for further configuration.
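As a rough illustration of "generating configurations on the fly", here is a minimal Python sketch. It rests on several assumptions that are not part of this proposal: the `models`, `cmd`, `proxy` and `ttl` keys and the `${PORT}` macro reflect my reading of the llama-swap configuration format and should be checked against its documentation; the image name, the in-container model path, and the helper names `build_swap_config` and `render` are placeholders.

```python
import json
import shlex


def build_swap_config(models, image="quay.io/ramalama/ramalama:latest", ttl=300):
    """Build a llama-swap style config dict from {model_name: host_model_path}.

    Assumption: each model runs in its own container and serves on
    container port 8080, published on whatever port replaces ${PORT}.
    """
    entries = {}
    for name, model_path in models.items():
        # Quote only the user-controlled parts so the ${PORT} macro stays intact.
        volume = shlex.quote(f"{model_path}:/mnt/models/model.file:ro")
        cmd = (
            "podman run --rm "
            "-p ${PORT}:8080 "
            f"-v {volume} "
            f"{shlex.quote(image)} "
            "llama-server --host 0.0.0.0 --port 8080 "
            "-m /mnt/models/model.file"
        )
        entries[name] = {
            "cmd": cmd,                           # command that starts the model
            "proxy": "http://127.0.0.1:${PORT}",  # where requests get forwarded
            "ttl": ttl,                           # idle seconds before unloading
        }
    return {"models": entries}


def render(config):
    """Serialize the config as JSON, which YAML 1.2 parsers also accept."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(render(build_swap_config({"tiny": "/models/tiny.gguf"})))
```

The point of the sketch is only that RamaLama already knows how to assemble the inference command, so emitting one entry per model is mostly string assembly. Stopping the container cleanly and exposing further llama-swap options are left out.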

However, this requires integrating with the external tool (e.g. llama-swap), and the user must also install it locally in order to have one model per container (podman run required). If multiple models can or should run in one container, the tool can be used containerized. Additionally, it (potentially) limits future features or special cases.

llama.cpp

llama-server also provides a model router out of the box (see here) and has the proper API to load/unload models. In order to support this, RamaLama would just need to mount the model store into the inference container in the required structure. For the simple cases this should then work out of the box (or require minimal changes); more advanced cases require more work.
However, this is not possible with --nocontainer (having the inference engine on the host, not in the container) due to the specific folder structure required by llama-server. It also enables swapping only for llama.cpp.

Custom, thin model router implementation

Another option would be a model router implementation, similar to the currently existing daemon in RamaLama. However, its responsibilities would only include:

receive an already assembled inference command (by RamaLama)
provide the reverse-proxy (to monitor the traffic and enable swapping models)
start/stop models based on a configuration

Similar to the other options, it can be implemented as a simple server providing a REST (or any other) API and the proxy, e.g. using Flask (Python) or Go or ...

However, this would require the ramalama-daemon to run directly on the host (similar to RamaLama) since it needs to be able to start/stop containers.
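The start/stop responsibility listed above could look roughly like the following Python sketch. Everything here is hypothetical: `ModelSwapper`, `ensure_running` and the shape of the command table are illustrative names, not existing RamaLama code, and the sketch assumes the simplest case of exactly one active model. The reverse proxy is omitted; it would call `ensure_running` with the model name taken from the incoming request before forwarding it.

```python
import subprocess
from typing import Dict, List, Optional


class ModelSwapper:
    """Keep at most one model process alive and swap it on demand."""

    def __init__(self, commands: Dict[str, List[str]], launch=subprocess.Popen):
        # commands maps a model name to the already assembled inference
        # command (as produced by RamaLama, e.g. a `podman run ...` argv).
        self._commands = commands
        self._launch = launch  # injectable so the logic can be tested
        self._active_name: Optional[str] = None
        self._active_proc = None

    @property
    def active(self) -> Optional[str]:
        return self._active_name

    def ensure_running(self, name: str) -> bool:
        """Start `name` if needed. Returns True when a swap happened."""
        if name not in self._commands:
            raise KeyError(f"unknown model: {name}")
        if name == self._active_name and self._active_proc.poll() is None:
            return False  # requested model is already up
        self.stop()
        self._active_proc = self._launch(self._commands[name])
        self._active_name = name
        return True

    def stop(self) -> None:
        """Terminate the active model process, if any."""
        proc = self._active_proc
        if proc is not None and proc.poll() is None:
            proc.terminate()
            try:
                proc.wait(timeout=30)
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate if the process ignores SIGTERM
        self._active_proc = None
        self._active_name = None
```

A real router would additionally wait for the newly started model to pass a health check before proxying the first request, and would serialize concurrent swaps with a lock.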

Others

There are many more options for achieving this.

olliewalsh · 2026-02-25T23:36:16Z

olliewalsh
Feb 25, 2026
Maintainer

@RonnyPfannschmidt FYI, it would be good to capture your requirements here


RonnyPfannschmidt · 2026-03-04T08:51:45Z

RonnyPfannschmidt
Mar 4, 2026

My needs can be captured on multiple levels.

shared entrypoint even when running multiple instances with different models

It's an utter pain to manage dozens of endpoints to connect to, each serving only one or two models. Some kind of router is absolutely required.

shared/linked routers

I have multiple computers, and now even an Olares One AI home server. It's very painful to manage multiple models on multiple computers.

It would be great if there were some kind of linkable ingress, so I can run models on multiple systems, and each of my computers just goes to localhost and selects a model name.
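One way to picture such a linkable ingress is a per-machine routing table that merges the local models with the models advertised by linked routers. The Python sketch below is purely illustrative: `build_routes`, the `LOCAL` marker and the peer URLs are made-up names, and how peers advertise their models is left open.

```python
from typing import Dict, Iterable

LOCAL = "local"  # marker meaning "serve from this machine"


def build_routes(
    local_models: Iterable[str],
    peers: Dict[str, Iterable[str]],
) -> Dict[str, str]:
    """Map each model name to the place that should serve it.

    peers maps a linked router's base URL to the model names it offers.
    Assumed policy: a local model always wins, and among peers the first
    one listed wins.
    """
    routes: Dict[str, str] = {}
    for peer_url, names in peers.items():
        for name in names:
            routes.setdefault(name, peer_url)
    for name in local_models:
        routes[name] = LOCAL
    return routes


if __name__ == "__main__":
    table = build_routes(
        ["small"],
        {
            "http://gpu-box:8080": ["big", "small"],
            "http://homeserver:8080": ["big", "vision"],
        },
    )
    for model, target in sorted(table.items()):
        print(f"{model} -> {target}")
```

With a table like this, a client only ever talks to localhost and names a model; the local router either serves it or forwards the request to the linked router that has it.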
