
Expand docs for aborting workflows #12800

Open · benjimin opened this issue Mar 14, 2024 · 1 comment
Labels
  • area/artifacts: S3/GCP/OSS/Git/HDFS etc
  • area/cli: The `argo` CLI
  • area/controller: Controller issues, panics
  • area/docs: Incorrect, missing, or mistakes in docs
  • area/executor
  • area/gc: Garbage collection, such as TTLs, retentionPolicy, delays, and more
  • area/retry-manual: Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries
  • problem/more information needed: Not enough information has been provided to diagnose this issue.
  • type/feature: Feature request
  • type/support: User support issue - likely not a bug

Comments


benjimin commented Mar 14, 2024

Docs request

I'd like a page prominently added to the user-guide that explains in detail the process for aborting workflows.

(The page should also cover salvaging interrupted workflows, re-running workflows from scratch, and cleaning up after workflows regardless of how they ended.)

Use-case

It is commonplace that, while developing a large workflow, a user mistakenly spawns an expensive processing job that does not function as intended, realises the error, and urgently wants to relinquish all the compute resources. Next they will want to restore the environment to a clean state from which the job can be rerun successfully, or alternatively to salvage any check-pointed work.

There are various relevant commands, e.g. stop, terminate, and suspend (also resubmit, resume, and retry). The docs should give advice for choosing between these commands, understanding their consequences, and cleaning up afterward.
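
For concreteness, a rough sketch of the invocations in question (my-wf is a placeholder name, and the one-line glosses are my current understanding, not authoritative):

```bash
argo stop my-wf        # stop scheduling new steps; exit handlers still run
argo terminate my-wf   # as above, but exit handlers are skipped
argo suspend my-wf     # pause: no new steps are scheduled until resumed
argo resume my-wf      # continue a suspended workflow
argo retry my-wf       # re-run a completed (e.g. failed) workflow from its failed steps
argo resubmit my-wf    # create a brand-new workflow based on my-wf
```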

The guidance should also help users develop workflows (and container images) that are robust, shut down gracefully, save progress at check-points, and do not need fiddly clean-up. It should also give users more confidence to kick off large workflows, knowing how to monitor them and reliably abort them if any problem arises.

Currently the docs for the CLI commands are very terse, and several users have asked for more clarification via discussion in issues (e.g. #4454, #11511, #2742, etc).

Outline of proposed content

Explain what happens to already-running subtasks (e.g. which signals the controller will cause to be sent to running pods, and what governs this sequence). Will the existing pods be interrupted immediately, or will they run until a checkpoint?

Explain what happens to stored inputs, outputs, artifacts, etc. In which cases are they retained vs expunged? If retained, are there any issues to be aware of when picking up again where the run left off? What steps are needed to force purging of stored artifacts? By default, will this partially generated data be retained indefinitely, or expire after some period? Are there any issues in ensuring consecutive runs of a workflow cannot interfere with each other (or conversely, in enabling them to leverage previously generated data)?

Explain the circumstances where an aborted workflow can be salvaged (either fully salvaging the progress and completing the workflow, or just salvaging some intermediate data).

Explain where the archive fits in. (Do different methods of aborting and replacing a workflow alter the archive lifecycle? Can the archive also be used to recover data from, and to resume, previously forcefully aborted runs?)

Disambiguate all related argo commands (see the sketch after this list):

  • stop/terminate (differs only in whether the shutdown handlers specified in the workflow are invoked),
  • terminate/delete/kubectl delete (is there any difference for still-running pods, artefacts, etc?)
  • stop/suspend (??)
  • resubmit/re-submit/re-kubectl apply (differs only in parameter inheritance?),
  • retry/resubmit (differs in whether steps that previously completed successfully are rerun)
  • retry/resume (??)
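
For concreteness, the delete-flavoured variants (my-wf is a placeholder; the glosses are my understanding, not authoritative):

```bash
argo terminate my-wf            # shuts the run down; the Workflow resource remains
argo delete my-wf               # removes the Workflow resource itself
kubectl delete workflow my-wf   # presumably identical to argo delete
```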

Generally, impart context illuminating the workflow lifecycle and the interacting components that govern it. For example, are these CLI commands just updating status metadata fields on the k8s workflow resource? What actions will the workflow controller take in response? Will it delete the worker pods, leaving it up to other k8s control components to manage a gracefully staged shutdown? How does the workflow controller ascertain whether a subtask was finished? Is an abort liable to disrupt the capture and preservation of intermediate outputs (e.g. as artifacts) and the invocation of workflow-specified handlers (e.g., is this implemented with sidecar pods, and how do they tolerate shutdown signals)? It should also mention cluster autoscaling delays, which may limit how rapidly the physical compute resources can be relinquished (e.g. back to a cloud provider) when a user wishes to abort their run.

This should also amount to guidance on how to package container images for argo workflows, to ensure that the worker processes actually shut down gracefully (listening for the expected signals and responding within expected timeframes) and do whatever is necessary to preserve intermediate work to the maximum possible extent, while discouraging any checkpointing model that would be incongruent with argo workflows' lifecycle model.
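
For illustration, a minimal sketch of the kind of entrypoint this guidance might recommend, assuming the workload can be chunked into resumable units (checkpoint and do_unit_of_work are hypothetical stand-ins for workload-specific logic):

```bash
#!/bin/sh
# hypothetical: persist partial results somewhere durable
checkpoint() {
  cp -r /tmp/work /mnt/checkpoints/ 2>/dev/null || true
}
# on SIGTERM, save progress and exit 143 (128+15)
trap 'checkpoint; exit 143' TERM
while :; do
  do_unit_of_work   # hypothetical resumable unit of work
  checkpoint        # note: plain sh delivers the trap between commands,
done                # so each unit must fit within the grace period
```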


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritize the proposals with the most 👍.

@benjimin benjimin added the type/feature Feature request label Mar 14, 2024
@agilgur5 agilgur5 added the area/docs Incorrect, missing, or mistakes in docs label Mar 15, 2024
agilgur5 commented Mar 15, 2024

disambiguation

Currently the docs for the CLI commands are very terse, and several users have asked for more clarification via discussion in issues (e.g. #4454, #11511, #2742, etc).

Disambiguate all related argo commands:

That is exactly why I explicitly disambiguated these in the CLI docs in #11624, #11625, and #11626.
The UI still needs implementation work in #9541.

Anecdotally, I haven't seen the same level of confusion since those improvements; I actually haven't needed to refer back to them since I wrote them.
I can also say that first-hand: I was confused by these myself as a user, and I made those PRs before I was a maintainer, out of a very self-service need.

  • stop/suspend (??)

this was the topic of #11624 / #11511.
they are quite different (although the naming is similar from a non-developer perspective): a suspend is a temporary pause, whereas a stop is a variation of a terminate, as you correctly wrote.

  • retry/resume (??)

these are also quite different. resume is the opposite of suspend, as its docs say. retry only applies to completed Workflows.
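
e.g. (my-wf is a placeholder):

```bash
argo suspend my-wf   # temporary pause; no new steps are scheduled
argo resume my-wf    # the opposite: scheduling continues
argo retry my-wf     # only valid once my-wf has completed (e.g. Failed);
                     # re-runs from the failed steps onward
```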

  • resubmit/re-submit/re-kubectl apply (differs only in parameter inheritance?),

effectively, each of these is just a shorthand for the others.
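
i.e., roughly (names are placeholders):

```bash
argo resubmit my-wf            # new run derived from the stored my-wf resource
argo submit my-wf.yaml         # new run from the original manifest
kubectl create -f my-wf.yaml   # same via kubectl; use create rather than apply
                               # if the manifest uses generateName
```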

  • terminate/delete/kubectl delete (is there any difference for still-running pods, artefacts, etc?) [sic]

The latter two are identical.
For deleting a currently running Workflow: good question, I'm not totally sure off the top of my head. There are a bunch of race conditions there with GC (e.g. using ownerReferences or not). If you're deleting something while it's still running, I would generally think you don't care about its state, as you're effectively forcing indeterminate behavior.

low-level details

For a large chunk of the rest, those are pretty low-level questions. They're not bad questions, but they are not the most pressing ones for most users; this is actually my first time seeing a lot of them.
Their low-level nature also raises the question of where they would even go in the docs. As an upstream example, the built-in k8s Job docs don't really go into this level of granularity (and a good chunk of the more recent docs on edge cases come from users who literally wrote features for those edges). In most projects, people who know this level of low-level detail are almost exclusively contributors.

Several of your questions are also combinations of features. For the most part, features are made to work independently of each other.

Some of them are also implementation details, and some of those are k8s implementation details not specific to Argo. They may be good to know, but some may be more suitable as plain API docs (similar to the Fields Reference) and others are out of scope.

Others are pretty nuanced race conditions, and I don't know if I've ever seen detailed docs about what race conditions to expect when your code is forcibly stopped into an errored / cancelled state. It's an uncommon scenario, and races can result in indeterminate behavior in most programming languages.
It can certainly be good to have, but even explaining some of the race conditions themselves is non-trivial (these are sometimes the subject of multi-line comments, after all -- those are not user-facing, so converting them into something easily consumable by users is a substantive task).

Related, writing idempotent workflows is not specific to Argo but is a general best practice so that you can reconstruct state whenever needed.

independent features

What steps are needed to force purging of stored artifacts etc? By default will this partially generated data be retained indefinitely, or expire after some period?

ArtifactGC is a more recent feature with its own docs.
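
A minimal sketch per those docs, assuming an artifact repository is already configured (names and images are placeholders):

```bash
kubectl create -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-gc-example-
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion   # or OnWorkflowCompletion
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "echo hello > /tmp/out.txt"]
      outputs:
        artifacts:
          - name: out
            path: /tmp/out.txt
EOF
```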

Will it delete the worker pods, leaving it up to other k8s control components to manage a gracefully staged shutdown?

Similarly this is the topic of PodGC.
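
e.g. a minimal sketch per the PodGC docs (placeholders again):

```bash
kubectl create -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pod-gc-example-
spec:
  entrypoint: main
  podGC:
    # also: OnPodCompletion, OnPodSuccess, OnWorkflowSuccess
    strategy: OnWorkflowCompletion
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, done]
EOF
```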

Explain where the archive fits in. (Do different methods of aborting and replacing a workflow alter the archive lifecycle? Can the archive also be used to recover data from, and to resume, previously forcefully aborted runs?)

The Workflow Archive just takes a completed Workflow resource and puts it into a DB. There is an archiveTTL that is independent of the in-cluster TTL.
The rest is intentionally identical to a Workflow in-cluster.

There are of course temporary race conditions when a Workflow is in both (that is unavoidable); the one in-cluster is always preferred to the one in the DB in those cases.
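
For concreteness: archiveTTL is configured controller-side, in the persistence section of the workflow-controller ConfigMap, whereas the in-cluster TTL is per-Workflow via spec.ttlStrategy. An illustrative fragment (not a complete config):

```bash
kubectl -n argo get configmap workflow-controller-configmap -o yaml
# data:
#   persistence: |
#     archive: true
#     archiveTTL: 7d   # DB-side retention, independent of spec.ttlStrategy
```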

Executor dependent

Explain what happens to already-running subtasks (e.g. what signals the controller will cause to be sent to running pods, and what governs this sequence). Will the existing pods get interrupted immediately or will they run until a checkpoint?

(e.g., is this implemented with sidecar pods and how do they tolerate shutdown signals)

in order to ensure the worker processes actually will shut down gracefully (listening for expected signals and responding in expected timeframes)

Yea, these do not have much documentation. Although it largely follows k8s, which largely follows Unix.
Arguably the best explanation I've found is actually in the Sidecar Injection docs.

Note that I also did not need this knowledge until quite literally earlier this week as I was responding to some quite nuanced sidecar and executor issues/bugs and did an hours-long deep dive.

These are also heavily dependent on your choice of Executor. Nowadays it's all Emissary, but Executors have gone through some iterations and will likely go through more. Part of these iterations is also due to k8s itself evolving its runtime and security model, etc. Emissary is probably the least intrusive Executor so far.
Emissary (and future executors) could probably have an entire docs page dedicated to them, depending on the granularity you want to get into (for example, zombie reaping in containers has been the topic of blog posts and entire projects for almost as long as Docker has existed, but that is a pretty low-level detail and more the scope of init systems than Docker).
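
(for a taste of that rabbit hole: outside Argo, zombie reaping is usually delegated to an init shim as PID 1, e.g. docker's built-in one)

```bash
docker run --init my-image   # --init runs a tini-style init as PID 1;
                             # my-image is a placeholder
```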

but discourage attempting to implement any checkpointing model that would be incongruent with the argo workflows' lifecycle model.

also note that Argo does not know what containers you're running or how they work. inputs and outputs are at the beginning and end of a task; the middle is effectively a black box. The upshot is that you don't need to write your container programs with knowledge of Argo.


That's about all I've got in me for now. Note that your questions and my limited answers here together run well over a page, all without a single Workflow manifest as an example -- that is a very significant amount of detail. It also suggests that a single docs page would likely not be sufficient.

I would encourage splitting some of this up into more digestible pieces. That would also help with organizing any docs that come out of this.
For instance, you titled this "aborting workflows", yet a good chunk of your questions are about things that happen after a Workflow completes: GC, Archive, retry, resubmit, "salvaging" or using partial data, etc. Even within "aborting workflows" there are at least two logical separations: task-level processing (which would include Executor/Unix-level signaling, and could itself be a separate page) and Controller-level resource management (which is largely documented per feature, per the notes above).

@agilgur5 agilgur5 added area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more area/controller Controller issues, panics area/executor area/cli The `argo` CLI area/artifacts S3/GCP/OSS/Git/HDFS etc labels Mar 15, 2024
@agilgur5 agilgur5 added problem/more information needed Not enough information has been provided to diagnose this issue. type/support User support issue - likely not a bug labels Apr 18, 2024