Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Currently, as far as I can tell, the only way to check whether the backend aborted is:

    if (llama_decode(ctx, batch) == 2) { // ...

This is very hacky: checking the abort status requires an active batch, which may already have been discarded if inference was aborted. The way abort currently works also means that the backend graph only ever attempts to abort when decode is called.
My proposal is to decouple decoding from status checking, if possible, allowing:

    if (llama_check_backend_abort_status(ctx) == 2) { // ...

This would have to prod the backends into checking their compute status and reporting it, but without being tied to decode, so the check works without a real batch.
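To make the intended call pattern concrete, here is a minimal caller-side sketch. Everything in it is a stand-in: `mock_context`, `mock_check_backend_abort_status`, and `admit_pending` are hypothetical names, not part of llama.cpp, and the return value 0 for "no abort pending" is my assumption (only the value 2 comes from the text above). It shows a scheduler that admits queued requests into the next continuous batch only while no abort is pending, without needing a batch to ask.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for llama_context; holds only what this sketch needs.
struct mock_context {
    std::atomic<bool> abort_requested{false};
};

// Hypothetical stand-in for the proposed llama_check_backend_abort_status(ctx).
// Returns 2 when an abort is pending (value taken from the issue text) and
// 0 otherwise (an assumption). No batch is required to call it.
int mock_check_backend_abort_status(const mock_context & ctx) {
    return ctx.abort_requested.load() ? 2 : 0;
}

// Move queued request ids into the next batch, stopping as soon as an abort
// is reported. Returns the number of requests admitted.
std::size_t admit_pending(const mock_context & ctx,
                          std::deque<int> & pending,
                          std::vector<int> & batch_ids) {
    std::size_t admitted = 0;
    while (!pending.empty()) {
        if (mock_check_backend_abort_status(ctx) == 2) {
            break; // leave the remaining work queued until the abort resolves
        }
        batch_ids.push_back(pending.front());
        pending.pop_front();
        ++admitted;
    }
    return admitted;
}
```

The point of the design is that the admission check is a cheap query, so new work stays queued instead of being pulled into a decode step that is about to fail.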
Motivation
Aborting work is common in server development for a variety of reasons. With continuous batching, a server does not want to add work to its processing queue while adjacent work is attempting to abort the decode, because the abort fails the entire continuous-batch decode step. Having llama_decode as the only way to check the backend status is likely to cause fragility in future development. Because new work can be added after the abort was signaled, that new work can hit a batch decode error. So to abort work, we have to trap the abort signal and then run fake decodes until llama_decode returns status code 2. This creates many edge cases that are difficult to deal with, and it can deadlock server threads with no clear way to know whether it is possible or allowed to resume.
Possible Implementation
The backend interface needs a separate path for querying the status that does not involve computing anything. At a cursory glance, this means modifying every backend and adding an entry to the backend interface (iface). A status for "no processing" or similar would also be very welcome.
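As a rough illustration of the shape this could take, the sketch below adds a status query to a simplified function-pointer interface table. All names here (`mock_backend_iface`, `get_status`, `mock_check_status`, the `MOCK_STATUS_*` values) are hypothetical and are not taken from ggml or llama.cpp; only the value 2 for "aborted" comes from the text above, and the idle and busy values are assumptions. The aggregation rule (any aborted backend wins, then any busy backend, otherwise idle) is likewise just one possible choice.

```cpp
#include <vector>

// Hypothetical status values. Only 2 ("aborted") is taken from the issue text.
enum mock_backend_status {
    MOCK_STATUS_IDLE    = 0, // the "no processing" status requested above
    MOCK_STATUS_BUSY    = 1, // assumption: compute is in flight
    MOCK_STATUS_ABORTED = 2,
};

struct mock_backend;

// Simplified stand-in for a backend interface: a table of function pointers.
struct mock_backend_iface {
    const char * (*get_name)(const mock_backend * backend);
    // Proposed addition: report status without computing anything.
    mock_backend_status (*get_status)(const mock_backend * backend);
};

struct mock_backend {
    mock_backend_iface  iface;
    mock_backend_status status;
};

static const char * mock_cpu_get_name(const mock_backend *) { return "mock-cpu"; }
static mock_backend_status mock_cpu_get_status(const mock_backend * b) { return b->status; }

// Build a mock backend that reports a fixed status.
mock_backend make_mock_backend(mock_backend_status s) {
    return mock_backend{ { mock_cpu_get_name, mock_cpu_get_status }, s };
}

// Context-level check: poll every backend through the interface and combine.
mock_backend_status mock_check_status(const std::vector<mock_backend> & backends) {
    mock_backend_status result = MOCK_STATUS_IDLE;
    for (const auto & b : backends) {
        const mock_backend_status s = b.iface.get_status(&b);
        if (s == MOCK_STATUS_ABORTED) {
            return MOCK_STATUS_ABORTED; // one aborted backend is enough
        }
        if (s == MOCK_STATUS_BUSY) {
            result = MOCK_STATUS_BUSY;
        }
    }
    return result;
}
```

A real implementation would presumably need each backend to track its own compute state; this sketch sidesteps that by storing a fixed status per mock backend.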