feat(runs): async abort reconciler with retry and crash recovery#7080

Merged
AdilFayyaz merged 5 commits into v2 from adil/run-abort-db-watcher on Mar 25, 2026

@AdilFayyaz AdilFayyaz commented Mar 23, 2026

Why are the changes needed?

AbortRun/AbortAction previously called actionsClient.Abort() synchronously after updating the DB. If the call failed, the DB showed ABORTED but the pod kept running with no retry path. Abort needed to be made reliable without blocking the user-facing response.

What changes were proposed in this pull request?

  • Add three columns to actions (abort_requested_at, abort_attempt_count, abort_reason); DB update is immediate so the UI sees ABORTED optimistically
  • Introduce AbortReconciler — a background worker pool that drives pod termination to completion with exponential backoff and a configurable max-attempts cap; uses a deduplicated in-memory queue so concurrent pushes for the same action are no-ops
  • AbortRun/AbortAction push directly to the reconciler queue after the DB update; startup scan re-enqueues any pending aborts left over from a process crash
  • Scope all abort-related DB queries with run_name to correctly target individual actions when node names (e.g. a0) repeat across runs
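
To illustrate the deduplicated queue and the backoff schedule described above, here is a minimal sketch. It is not the code from this PR: `actionKey`, `dedupQueue`, `Push`, `TryPop`, `Done`, and `backoffDelay` are hypothetical names, and the doubling schedule with a cap is an assumption about what "exponential backoff" means here. The key includes the run name because node names such as a0 repeat across runs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// actionKey identifies one action. RunName is part of the key because
// node names such as "a0" repeat across runs. (Hypothetical type.)
type actionKey struct {
	RunName    string
	ActionName string
}

// dedupQueue is a hypothetical in-memory queue that ignores a push for a
// key that is already pending (queued or still being processed).
type dedupQueue struct {
	mu      sync.Mutex
	pending map[actionKey]struct{}
	items   chan actionKey
}

func newDedupQueue(capacity int) *dedupQueue {
	return &dedupQueue{
		pending: make(map[actionKey]struct{}),
		items:   make(chan actionKey, capacity),
	}
}

// Push enqueues key and reports whether it was added. It returns false
// when the key is already pending or the buffer is full.
func (q *dedupQueue) Push(key actionKey) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, ok := q.pending[key]; ok {
		return false
	}
	select {
	case q.items <- key:
		q.pending[key] = struct{}{}
		return true
	default:
		return false
	}
}

// TryPop returns the next queued key without blocking.
func (q *dedupQueue) TryPop() (actionKey, bool) {
	select {
	case k := <-q.items:
		return k, true
	default:
		return actionKey{}, false
	}
}

// Done marks key as finished so that a later Push for it is accepted again.
func (q *dedupQueue) Done(key actionKey) {
	q.mu.Lock()
	defer q.mu.Unlock()
	delete(q.pending, key)
}

// backoffDelay doubles base for every attempt after the first and caps the
// result at max. attempt is 1-based.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	q := newDedupQueue(8)
	key := actionKey{RunName: "run-1", ActionName: "a0"}
	fmt.Println(q.Push(key)) // first push is accepted
	fmt.Println(q.Push(key)) // concurrent duplicate is a no-op
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, time.Second, 10*time.Second))
	}
}
```

In this sketch a key stays in `pending` until the worker calls `Done`, so a push that arrives while an abort is in flight is also dropped. Whether the actual reconciler releases the key at pop time or at completion is not stated in the description.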

How was this patch tested?

  • go test ./runs/... passes
  • TestAbortReconciler_SuccessOnFirstAttempt — happy path: worker calls actionsClient.Abort, clears DB flag
  • TestAbortReconciler_RetriesOnFailure — fails N−1 times, succeeds on Nth attempt
  • TestAbortReconciler_GivesUpAtMaxAttempts — always fails; flag cleared after max attempts with no further calls
  • TestAbortReconciler_DeduplicatesQueue — same action pushed twice; actionsClient.Abort called exactly once
  • TestAbortReconciler_StartupScanPicksUpPending — pending aborts from before startup are all processed
  • TestAbortReconciler_NotFoundTreatedAsSuccess — NotFound from actions service clears the flag without retry
  • Manually ran an example from the SDK and verified that abort works end to end
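
The outcomes these tests exercise (success, retry, give up at the cap, NotFound treated as success) can be sketched as a single decision function. This is an illustration under assumptions, not the PR's implementation: `abortFunc`, `attemptAbort`, `reconcile`, and the `outcome` values are hypothetical, and the sentinel `errNotFound` stands in for whatever NotFound error the actions service really returns. The loop retries immediately; a real worker would wait for the backoff delay between attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for the actions service's
// NotFound response and for any retryable failure.
var (
	errNotFound  = errors.New("action not found")
	errTransient = errors.New("transient failure")
)

// abortFunc is a stand-in for the call to actionsClient.Abort.
type abortFunc func(runName, actionName string) error

type outcome int

const (
	outcomeRetry   outcome = iota // keep the flag, try again later
	outcomeCleared                // abort done (or nothing to abort): clear the flag
	outcomeGaveUp                 // max attempts reached: clear the flag, stop calling
)

// attemptAbort performs one abort call and decides what happens next.
// It returns the outcome and the updated attempt count.
func attemptAbort(abort abortFunc, runName, actionName string, attemptsSoFar, maxAttempts int) (outcome, int) {
	err := abort(runName, actionName)
	attempts := attemptsSoFar + 1
	switch {
	case err == nil, errors.Is(err, errNotFound):
		// NotFound means there is nothing left to terminate.
		return outcomeCleared, attempts
	case attempts >= maxAttempts:
		return outcomeGaveUp, attempts
	default:
		return outcomeRetry, attempts
	}
}

// reconcile drives one action to a final outcome. The backoff wait between
// attempts is omitted to keep the sketch short.
func reconcile(abort abortFunc, runName, actionName string, maxAttempts int) (outcome, int) {
	attempts := 0
	for {
		var out outcome
		out, attempts = attemptAbort(abort, runName, actionName, attempts, maxAttempts)
		if out != outcomeRetry {
			return out, attempts
		}
	}
}

func main() {
	calls := 0
	// Fails twice, then succeeds on the third attempt.
	flaky := func(runName, actionName string) error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}
	out, attempts := reconcile(flaky, "run-1", "a0", 5)
	fmt.Println(out == outcomeCleared, attempts)
}
```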

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

  • main
    • Flyte 2 WIP #6583
      • feat(runs): async abort reconciler with retry and crash recovery 👈

@AdilFayyaz AdilFayyaz self-assigned this Mar 23, 2026
@AdilFayyaz AdilFayyaz added the "added" (merged changes that add new functionality) and "flyte2" labels Mar 23, 2026
@github-actions github-actions bot mentioned this pull request Mar 23, 2026
3 tasks
@AdilFayyaz AdilFayyaz requested a review from pingsutw March 23, 2026 22:46
Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
@AdilFayyaz AdilFayyaz force-pushed the adil/run-abort-db-watcher branch from e772d6b to 1750cbc Compare March 24, 2026 22:13
@AdilFayyaz AdilFayyaz requested a review from pingsutw March 24, 2026 22:25
Member

@pingsutw pingsutw left a comment

LGTM, some nits

Signed-off-by: M. Adil Fayyaz <62440954+AdilFayyaz@users.noreply.github.com>
@AdilFayyaz AdilFayyaz merged commit 3bb0827 into v2 Mar 25, 2026
17 checks passed
@AdilFayyaz AdilFayyaz deleted the adil/run-abort-db-watcher branch March 25, 2026 22:43
Labels

"added" (merged changes that add new functionality), "flyte2"

2 participants