
Feature Request: Direct way to check the status of the abort mechanism. #12525

@CoffeeVampir3

Description


Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Currently, as far as I can tell, the only way to check whether the backend aborted is:

if (llama_decode(ctx, batch) == 2) {
    // aborted ...
}

This is very hacky: it means that checking the abort status requires an active batch, which, if inference was aborted, may already have been discarded. In addition, because of the way abort currently works, the backend graph only ever attempts to abort when decode is called.
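To make the coupling concrete, here is a small self-contained model of the current behaviour. It does not link against llama.cpp: `mock_ctx`, `mock_decode` and `set_abort_flag` are hypothetical stand-ins. The callback is assumed to have the same shape as `ggml_abort_callback` (`bool (*)(void *)`), the return codes follow the `llama_decode` convention quoted above (0 = success, 2 = aborted), and treating an empty batch as invalid input (-1) is an assumption of the model. What it shows is that the abort flag is only consulted inside decode, so without a non-empty batch there is no way to observe it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Same shape as ggml_abort_callback (assumption for this model). */
typedef bool (*abort_cb)(void *data);

/* Hypothetical stand-in for a context; not the real llama_context. */
typedef struct {
    abort_cb cb;       /* polled by the "backend" while computing */
    void    *cb_data;
} mock_ctx;

/* Flag another thread (e.g. a request handler) sets to request an abort. */
static atomic_bool g_abort_requested = false;

static bool abort_if_requested(void *data) {
    (void) data;
    return atomic_load(&g_abort_requested);
}

void set_abort_flag(bool value) {
    atomic_store(&g_abort_requested, value);
}

void mock_ctx_init(mock_ctx *ctx) {
    assert(ctx != NULL);
    ctx->cb = abort_if_requested;
    ctx->cb_data = NULL;
}

/* Return codes: 0 = ok, 2 = aborted, -1 = invalid (empty) batch.
 * The abort callback is only consulted once there is real work to compute. */
int mock_decode(mock_ctx *ctx, int n_tokens) {
    assert(ctx != NULL);
    if (n_tokens <= 0) {
        return -1;            /* no batch: the abort status is never checked */
    }
    for (int i = 0; i < n_tokens; ++i) {
        if (ctx->cb != NULL && ctx->cb(ctx->cb_data)) {
            return 2;         /* aborted mid-compute */
        }
        /* ... compute token i ... */
    }
    return 0;
}
```

With the flag set, `mock_decode(&ctx, 0)` still returns -1 rather than 2: the abort only becomes visible when a real batch is submitted.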

My proposal is to decouple decoding from status checking, if possible, allowing something like:

if (llama_check_backend_abort_status(ctx) == 2) {
    // aborted ...
}

This would have to prompt the backends to check and report their compute status, but in a way that is not tied to decode, so that it can be done without a real batch.
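One possible shape for the proposed query is sketched below, under the assumption that the abort request and the outcome of the last decode are stored in the context itself. `status_ctx`, `check_abort_status`, `status_decode`, `request_abort` and `clear_abort` are hypothetical names for illustration only; they model the proposed `llama_check_backend_abort_status`, which does not exist yet. Code 2 is reused to mean "aborted", matching `llama_decode`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical context: the abort request and the last outcome live in
 * the context, so they can be read without submitting a batch. */
typedef struct {
    atomic_bool abort_requested;  /* set by any thread that wants to abort */
    atomic_int  last_status;      /* 0 = ok, 2 = aborted (llama_decode codes) */
} status_ctx;

void status_ctx_init(status_ctx *ctx) {
    assert(ctx != NULL);
    atomic_init(&ctx->abort_requested, false);
    atomic_init(&ctx->last_status, 0);
}

void request_abort(status_ctx *ctx) {
    assert(ctx != NULL);
    atomic_store(&ctx->abort_requested, true);
}

/* Sketch of the proposed query: no batch, no compute.
 * Reports 2 if an abort is pending or the last decode was aborted. */
int check_abort_status(status_ctx *ctx) {
    assert(ctx != NULL);
    if (atomic_load(&ctx->abort_requested)) {
        return 2;
    }
    return atomic_load(&ctx->last_status);
}

/* Decode still honours the flag and records its outcome for later queries. */
int status_decode(status_ctx *ctx, int n_tokens) {
    assert(ctx != NULL);
    if (n_tokens <= 0) {
        return -1;
    }
    int rc = atomic_load(&ctx->abort_requested) ? 2 : 0;
    atomic_store(&ctx->last_status, rc);
    return rc;
}

/* Acknowledge the abort so new work can be scheduled again. */
void clear_abort(status_ctx *ctx) {
    assert(ctx != NULL);
    atomic_store(&ctx->abort_requested, false);
    atomic_store(&ctx->last_status, 0);
}
```

A server thread could call `check_abort_status` before adding a request to the continuous batch and hold the request back while it returns 2, then call `clear_abort` once the aborted work has been cleaned up.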

Motivation

Aborting work is common in server development, for a variety of reasons. A server using continuous batching does not want to add new work to its processing queue while adjacent work is trying to abort the decode, because the abort fails the entire continuous-batch decode step. Having llama_decode as the only way to check the backend status is likely to make future development fragile. Because of how continuous batching interacts with new work added after the abort was signaled, that new work can hit a batch decode error. So if we want to abort work, we have to trap the abort signal and then run fake decodes until llama_decode returns status code 2. This creates a multitude of edge cases that are difficult to deal with, and it is a potential source of deadlocked server threads, with no clear way to know whether it is possible or allowed to resume.
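The fake-decode workaround can be modelled as a bounded drain loop. This is a sketch under stated assumptions: `decode_fn`, `drain_abort` and `mock_backend` are hypothetical, the decode signature is simplified to `(ctx, n_tokens)`, and the mock only reports the abort after a configurable number of further calls, to mimic work already in flight. The `max_attempts` bound is there so a backend that never reports the abort cannot hang the thread indefinitely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical decode signature: 0 = ok, 2 = aborted, < 0 = error. */
typedef int (*decode_fn)(void *ctx, int n_tokens);

/* Workaround: after an abort was requested, keep submitting a throwaway
 * one-token batch until decode reports 2, giving up after max_attempts. */
bool drain_abort(decode_fn decode, void *ctx, int max_attempts) {
    assert(decode != NULL);
    for (int i = 0; i < max_attempts; ++i) {
        int rc = decode(ctx, 1);   /* fake decode: its output is discarded */
        if (rc == 2) {
            return true;           /* abort observed; new work may be scheduled */
        }
        if (rc < 0) {
            return false;          /* hard error: give up */
        }
    }
    return false;                  /* abort never surfaced within the budget */
}

/* Mock backend for illustration: reports the abort only after `delay`
 * further decode calls, mimicking work already in flight. */
typedef struct {
    bool abort_requested;
    int  delay;
} mock_backend;

int mock_backend_decode(void *p, int n_tokens) {
    mock_backend *b = (mock_backend *) p;
    if (n_tokens <= 0) {
        return -1;
    }
    if (!b->abort_requested) {
        return 0;
    }
    if (b->delay > 0) {
        b->delay--;
        return 0;
    }
    b->abort_requested = false;    /* abort acknowledged */
    return 2;
}
```

When no abort is pending, `drain_abort` simply runs out of attempts and returns false, which is the ambiguity described above: the loop cannot tell "no abort pending" apart from "abort not surfaced yet", and in a real server each of those fake decodes also costs compute.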

Possible Implementation

This requires modifying the backend interface to add a separate path for querying the status that does not involve computing anything. At a cursory glance, that means an addition to the backend iface and a modification to every backend. A status meaning "no processing in progress" (or similar) would also be very welcome.
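A rough sketch of what the iface addition might look like follows. Everything here is hypothetical: `backend_iface`, `get_status`, the `backend_status` enum (including the `IDLE` "no processing" value) and the aggregation rule are illustrative assumptions, not the actual ggml backend interface. The new entry is optional and NULL-checked, and it must never launch work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status values (not an existing ggml/llama.cpp enum).
 * IDLE is the "no processing" state requested above. */
typedef enum {
    BACKEND_STATUS_UNKNOWN = -1, /* backend does not implement the query */
    BACKEND_STATUS_IDLE    =  0, /* nothing is being computed */
    BACKEND_STATUS_RUNNING =  1, /* a graph is currently being computed */
    BACKEND_STATUS_ABORTED =  2, /* current or last compute was aborted */
} backend_status;

/* Hypothetical iface: graph_compute stands for the existing compute path;
 * get_status is the new, optional entry and must not launch any work. */
typedef struct {
    const char *name;
    int (*graph_compute)(void *backend_ctx);
    backend_status (*get_status)(void *backend_ctx); /* may be NULL */
} backend_iface;

typedef struct {
    backend_iface iface;
    void *ctx;
} backend;

/* Query a single backend without a batch. Backends that leave get_status
 * NULL report UNKNOWN instead of failing. */
backend_status backend_get_status(const backend *b) {
    assert(b != NULL);
    if (b->iface.get_status == NULL) {
        return BACKEND_STATUS_UNKNOWN;
    }
    return b->iface.get_status(b->ctx);
}

/* Combine the status of all backends used by a context:
 * ABORTED beats RUNNING, which beats UNKNOWN, which beats IDLE. */
backend_status backends_get_status(const backend *bs, size_t n) {
    bool any_running = false;
    bool any_unknown = false;
    for (size_t i = 0; i < n; ++i) {
        backend_status s = backend_get_status(&bs[i]);
        if (s == BACKEND_STATUS_ABORTED) {
            return BACKEND_STATUS_ABORTED;
        }
        if (s == BACKEND_STATUS_RUNNING) {
            any_running = true;
        } else if (s == BACKEND_STATUS_UNKNOWN) {
            any_unknown = true;
        }
    }
    if (any_running) {
        return BACKEND_STATUS_RUNNING;
    }
    return any_unknown ? BACKEND_STATUS_UNKNOWN : BACKEND_STATUS_IDLE;
}

/* Illustrative get_status implementation: reads a status stored in the
 * backend context. A real backend would inspect its own compute state. */
backend_status fixed_status(void *backend_ctx) {
    return *(const backend_status *) backend_ctx;
}
```

Making the entry optional would let backends adopt it incrementally instead of all at once, although every backend would still eventually need a real implementation for the aggregate status to be reliable.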

Metadata


Assignees

No one assigned

    Labels

    enhancement (New feature or request)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
