Labels: enhancement (New feature or request)
If needed, we should optimize Hugging Face generation to be faster. It is currently synchronous because loading an adapter modifies the underlying model, which means only one "type" of request can run at a given time (i.e., the base model or a single adapter).
Improvements:
- We could acquire the lock only when adapters are actually added to the model; this would keep non-adapter (base model) generation concurrent.
- A conditional "semaphore" that lets requests of the current type through: see PR #233 ("feat: add lock to hf backend to prevent concurrent generation with conflicting weights").
- Note: the code from that PR should be refactored to use clearer names.
- A modified "conditional semaphore" approach where multiple models are loaded into memory. Those models can then run concurrently, and we can route requests to each model based on the request type. Some additional thoughts on this approach:
- Keep multiple copies of the model (each one modified by an adapter) in CPU memory, assuming main memory is sufficient.
- if CPU inference:
- just use each model, no lock required.
- if GPU inference:
- Compute the amount of GPU memory each huggingface model (= just an instance of nn.Module) requires.
- Divide it by some coarse unit, e.g., 1 GB, and take the ceiling (not the floor).
- max_units = [available GPU memory] / 1GB.
- Create a global semaphore with max_units capacity for each GPU
- Each model acquires ceil([model memory] / 1 GB) units of the semaphore to reside on a GPU.
- Iterate over the semaphores/GPUs and acquire the units when available.
- Move the adapted model to GPU before running the inference. This should be a no-op if the model is already on the GPU.
- Implementation note:
- Releasing the GPU memory should be lazy. (otherwise it gets released every time the semaphore is released)
- Pros: Can run different adapters (and the base model) concurrently.
- Pros: maximizes the GPU usage.
- Cons: affects loading multiple m.session instances (though I assume this is rare / not common yet); with a global semaphore based on 1 GB units, you can mix multiple sessions.
- Cons: still does not address the case with multiple GPUs; multiple semaphores (one per GPU) would address this.
- Room for improvement: could be slow if consecutive queries require different adapters.
- A scheduler that groups queries to the same adapter together would address this.
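The conditional "semaphore" idea from PR #233 could be sketched roughly as follows (a minimal illustration assuming a threading-based backend; `TypeLock` and its method names are hypothetical, not the actual code from the PR):

```python
import threading
from contextlib import contextmanager

class TypeLock:
    """Hypothetical sketch: lets any number of requests of the *same* type
    (one adapter name, or the base model) run concurrently, while requests
    of a different type wait until the in-flight ones drain."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_type = None   # type currently allowed through
        self._active_count = 0     # in-flight requests of that type

    @contextmanager
    def acquire(self, request_type):
        with self._cond:
            # Wait until no conflicting request type is running.
            while self._active_count > 0 and self._active_type != request_type:
                self._cond.wait()
            self._active_type = request_type
            self._active_count += 1
        try:
            yield
        finally:
            with self._cond:
                self._active_count -= 1
                if self._active_count == 0:
                    self._active_type = None
                self._cond.notify_all()
```

Usage would look like `with lock.acquire("adapter-A"): model.generate(...)` — two requests for "adapter-A" proceed in parallel, while a base-model request blocks until both finish.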
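The per-GPU unit bookkeeping described above might look like this (a sketch under the issue's assumptions; `GpuUnitSemaphore` is a hypothetical name, and measuring free memory per GPU, e.g. via `torch.cuda.mem_get_info()`, is left to the caller):

```python
import math
import threading

UNIT_BYTES = 1 << 30  # 1 GB coarse unit, as proposed in the issue

class GpuUnitSemaphore:
    """Hypothetical sketch: one counting pool per GPU, with capacity
    floor(free GPU memory / 1 GB). A model acquires
    ceil(model_bytes / 1 GB) units on some GPU before being moved there."""

    def __init__(self, gpu_free_bytes):
        # gpu_free_bytes: free memory per GPU, measured by the caller.
        self._free = [b // UNIT_BYTES for b in gpu_free_bytes]
        self._cond = threading.Condition()

    def units_for(self, model_bytes):
        return math.ceil(model_bytes / UNIT_BYTES)  # ceiling, not floor

    def acquire(self, model_bytes):
        """Block until some GPU has enough free units; return its index."""
        units = self.units_for(model_bytes)
        with self._cond:
            while True:
                # Iterate over the GPUs and take units where available.
                for gpu, free in enumerate(self._free):
                    if free >= units:
                        self._free[gpu] -= units
                        return gpu
                self._cond.wait()

    def release(self, gpu, model_bytes):
        # Note: per the implementation note above, the actual freeing of
        # GPU memory should be lazy; this only returns the accounting units.
        with self._cond:
            self._free[gpu] += self.units_for(model_bytes)
            self._cond.notify_all()
```

After `acquire` returns a GPU index, the adapted model would be moved to that device before inference (a no-op if it is already there).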
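The adapter-grouping scheduler could be as simple as a per-adapter queue that hands out all pending queries for one adapter at a time, so the model is re-adapted once per group instead of once per query (names are illustrative, not from the codebase):

```python
import collections
import threading

class AdapterScheduler:
    """Hypothetical sketch: group queued queries by adapter to avoid
    swapping adapters between every pair of consecutive queries."""

    def __init__(self):
        self._queues = collections.OrderedDict()  # adapter -> pending queries
        self._lock = threading.Lock()

    def submit(self, adapter, query):
        with self._lock:
            self._queues.setdefault(adapter, []).append(query)

    def next_batch(self):
        """Pop every pending query for the oldest adapter, to be run
        back to back under a single adapter load."""
        with self._lock:
            if not self._queues:
                return None, []
            return self._queues.popitem(last=False)
```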