Feature Request: Apply LoRA adapters per-request #10377

@ngxson

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

The server now supports hot-swapping LoRA adapters via the /lora-adapters endpoint, which changes the global adapter config for all slots.
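For context, a minimal sketch of what a hot-swap call looks like from a client. The payload shape (a JSON list of adapter id and scale pairs) is my reading of the server README and should be treated as an assumption, as should the host and port; build_lora_request is a hypothetical helper name. The request is only built here, not sent.

```python
import json
import urllib.request


def build_lora_request(base_url, scales):
    """Build (but do not send) a POST /lora-adapters request.

    `scales` maps adapter id -> scale. The payload shape is an assumption.
    """
    payload = [
        {"id": adapter_id, "scale": scale}
        for adapter_id, scale in sorted(scales.items())
    ]
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/lora-adapters",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local server address; adjust to your setup.
req = build_lora_request("http://localhost:8080", {0: 0.2, 1: 0.8})
# urllib.request.urlopen(req) would send it and change the adapter
# config globally, i.e. for every slot at once.
```

Because this call is global, every in-flight request is affected by it, which is the problem described below.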

With this, the only "safe" moment to apply LoRA changes is when all slots are idle.

However, this is not practical when the server handles a high number of requests (ref: #10374). With continuous batching, all slots are rarely idle at the same time.

Motivation

Possible Implementation

  1. Group requests that use the same LoRA config into the same batch, so that each batch runs with a single adapter configuration.
  2. Call common_lora_adapters_apply before processing each batch (remember to clear the KV cache if needed).
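The two steps above can be sketched roughly as follows. This is Python pseudologic for the scheduling idea only, not the server's actual C++ code: SlotRequest, lora_key, group_by_lora and run_batches are hypothetical names, and apply_lora stands in for the call to common_lora_adapters_apply (plus any KV cache clearing).

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotRequest:
    req_id: int
    lora: tuple  # hypothetical per-request config: ((adapter_id, scale), ...)


def lora_key(scales):
    """Turn {adapter_id: scale} into a hashable, order-independent key.

    Zero-scale adapters are dropped so equivalent configs compare equal.
    """
    return tuple(sorted((i, s) for i, s in scales.items() if s != 0.0))


def group_by_lora(requests):
    """Step 1: group requests sharing a LoRA config, keeping arrival order."""
    groups = OrderedDict()
    for r in requests:
        groups.setdefault(r.lora, []).append(r)
    return list(groups.items())


def run_batches(requests, apply_lora, process_batch):
    """Step 2: apply the adapter config once per group, then run the batch."""
    current = None
    for key, batch in group_by_lora(requests):
        if key != current:
            # Stand-in for common_lora_adapters_apply; clearing the KV
            # cache, if needed, would also happen here.
            apply_lora(key)
            current = key
        process_batch(batch)


applied, processed = [], []
reqs = [
    SlotRequest(1, lora_key({0: 1.0})),
    SlotRequest(2, lora_key({})),                # no adapter
    SlotRequest(3, lora_key({0: 1.0, 1: 0.0})),  # same as request 1
]
run_batches(reqs, applied.append,
            lambda batch: processed.append([r.req_id for r in batch]))
```

Here requests 1 and 3 end up in one batch and request 2 in another, so the adapter config is applied twice in total. A real scheduler would also have to bound how long a group can wait so that a rarely used config is not starved.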
