Support progress and/or rate indication #1658

Closed
brabster opened this issue Oct 8, 2019 · 25 comments · Fixed by #6714
Labels
type/feature Feature request

Comments

@brabster
Contributor

brabster commented Oct 8, 2019

Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request

What happened:
It would be helpful if it were possible to see how much progress is being made by a long-running task

What you expected to happen:
A progress indication of some kind - ideally something that can be watched at the CLI and rendered in argo-ui.

For example I have a job that runs in the middle of a workflow over a few million lines and it takes a while. I'm trying to tune k8s autoscaling to scale out a service it uses and it would be helpful to know how fast it is going and how far it is through the work. Rather than implement my own solution to log this or produce metrics it would be neat if there were a way to publish this information straight into Argo.

How to reproduce it (as minimally and precisely as possible):
N/A from here down...

Anything else we need to know?:

Environment:

  • Argo version:
$ argo version
  • Kubernetes version :
$ kubectl version -o yaml

Other debugging information (if applicable):

  • workflow result:
$ argo get <workflowname>
  • executor logs:
$ kubectl logs <failedpodname> -c init
$ kubectl logs <failedpodname> -c wait
  • workflow-controller logs:
$ kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name)
@simster7 simster7 added the type/feature Feature request label Oct 21, 2019
@ddseapy
Contributor

ddseapy commented Jan 28, 2020

We are also writing a custom solution for this, where workflows report progress to a REST endpoint. If this were built into argo, that would be nice.

@alexec
Contributor

alexec commented Jul 28, 2020

See #3557

@alexec
Contributor

alexec commented Jul 28, 2020

We've an ask for this in the user interface, both at a node and workflow level.

This could be done in the UI using existing backend code by getting the last successful execution of the workflow (determined by the workflows.argoproj.io/workflow-template label, i.e. it only works for templates and only if there was a last successful run). It is a straightforward lookup on the workflow.

This is not a very popular issue, so warrants a low cost solution.

@alexec alexec self-assigned this Sep 15, 2020
@alexec alexec added this to the v2.13 milestone Sep 15, 2020
@alexec alexec linked a pull request Sep 18, 2020 that will close this issue
6 tasks
@alexec
Contributor

alexec commented Sep 19, 2020

I've linked the PR, which provides a coarse way to track your workflow. It estimates how long it will take to complete and displays progress towards that time. This is purely time-based.

This is not as fine-grained or nuanced as some of the ideas in this PR. For example, if a workflow had 10 steps, another way to do this would be to base progress on the number of steps complete.

@brabster @ddseapy please take a look and make suggestions or comments.
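
For readers following along, the purely time-based estimate described in this comment can be sketched roughly as below. This is an illustration only, not the code in the linked PR: the function name `time_based_progress` and its parameters are hypothetical, and the estimated duration is assumed to come from an earlier successful run.

```python
from datetime import datetime, timedelta, timezone


def time_based_progress(started_at, estimated_duration, now=None):
    """Return progress in [0.0, 1.0] as elapsed time over an estimated duration.

    Hypothetical helper: `estimated_duration` is assumed to be taken from a
    previous successful run of the same workflow template.
    """
    now = now or datetime.now(timezone.utc)
    estimate_seconds = estimated_duration.total_seconds()
    if estimate_seconds <= 0:
        return 0.0  # no usable estimate, so report no progress
    elapsed_seconds = (now - started_at).total_seconds()
    # Clamp so a run that overshoots its estimate never reports more than 100%.
    return max(0.0, min(1.0, elapsed_seconds / estimate_seconds))
```

Because the result depends only on the clock, it says nothing about how much work is actually done, which is the limitation noted above.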

@dseapy

dseapy commented Sep 19, 2020

Unfortunately, most of our current workflow templates yield workflows whose duration varies drastically, from a couple of minutes to several hours. So, as the PR mentions in the docs, it's not ideal for our current use case, but we will definitely keep this in mind for future wftmpls.

@alexec
Contributor

alexec commented Sep 19, 2020

vary drastically from a couple minutes

Any thought on other ways to estimate this that would work for you?

@dseapy

dseapy commented Sep 19, 2020

Currently for most workflows we have a specific parameter (hardcoded name we look for), whose value is the amount that workflow progress should be incremented for once that node is complete. This parameter is passed from a previous step that knows how many items/nodes there will be. Progress is computed by looking at the workflow in a shared informer and summing the parameter values up across all nodes.

For workflows with just a few nodes, we also have a rest service that allows nodes to increment the progress themselves. This progress is stored in a separate postgres table.

Both of these have the downside that the workflow needs to know about and report its own progress. What is in the PR clearly doesn't have that restriction, so I'm not sure I have a better generalized suggestion. At least not a performant one that doesn't involve periodically analysing lots of workflows from the archive.
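
A minimal sketch of the parameter-summing approach described above, for illustration only. The parameter name `progress-increment` is hypothetical (the comment only says a hardcoded name is used), the node dictionaries are a simplified stand-in for what an informer would return, and treating the sum over all nodes as the total is an assumption.

```python
# Hypothetical parameter name; the thread does not say which name is used.
PROGRESS_PARAM = "progress-increment"


def node_increment(node):
    """Return the progress increment declared on a node, or 0 if it has none."""
    for param in node.get("inputs", {}).get("parameters", []):
        if param.get("name") == PROGRESS_PARAM:
            return int(param.get("value", 0))
    return 0


def summed_progress(nodes):
    """Return (done, total) by summing increments across all nodes.

    Assumption: only nodes in the "Succeeded" phase count towards `done`.
    """
    done = 0
    total = 0
    for node in nodes.values():
        increment = node_increment(node)
        total += increment
        if node.get("phase") == "Succeeded":
            done += increment
    return done, total
```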

@alexec
Contributor

alexec commented Sep 19, 2020

@dseapy that's really interesting. Let me play that back so I can be sure I understand. In effect, you have a method for nodes to report their progress? That is just another way to report progress.

Proposal:

Currently, nodes report status by annotating their pod. What if there were an annotation that we recognise as progress? Your nodes could update it, and we could report it back via the CLI and UI.

If the annotation were absent, we would default to a computed metric:

  • Pod/resource template - annotation or estimated duration
  • DAG/Steps template - number of tasks complete/total number of tasks
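
The DAG/Steps fallback in the list above (tasks complete over total tasks) could look something like the following. This is a sketch under stated assumptions rather than the controller's logic: it counts only nodes whose `type` is `Pod`, and the set of phases treated as complete is a guess at a reasonable definition.

```python
# Assumption: these phases are treated as "complete" for counting purposes.
COMPLETED_PHASES = {"Succeeded", "Failed", "Error", "Skipped", "Omitted"}


def task_progress(nodes):
    """Return progress as an "N/M" string over the Pod nodes of a workflow.

    `nodes` is a simplified mapping of node ID to a dict with "type" and
    "phase" keys; non-Pod nodes (DAG, Steps, ...) are ignored.
    """
    pods = [node for node in nodes.values() if node.get("type") == "Pod"]
    done = sum(1 for node in pods if node.get("phase") in COMPLETED_PHASES)
    return f"{done}/{len(pods)}"
```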

@dseapy

dseapy commented Sep 19, 2020

Yeah, I believe that would work for our wftmpls.

@alexec
Contributor

alexec commented Sep 20, 2020

@dseapy I've tweaked my POC. Setting annotations works today, but (a) it requires the workflow role to have pod patch permission, which is a security issue we want to remove, and (b) it exposes implementation details. Instead, what about just recognising a log line:

#argo progress=25/100

@alexec
Contributor

alexec commented Sep 20, 2020

Better still:

#argo progress=25/100 message=custom message
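
To make the proposed log-line format concrete, here is a rough sketch of how such a line might be recognised. It is not taken from the POC: the regular expression, the function name `parse_progress_line`, and the handling of optional surrounding quotes on the message are all assumptions.

```python
import re

# Assumed grammar: "#argo progress=N/M" with an optional " message=..." suffix.
PROGRESS_LINE = re.compile(r"^#argo progress=(\d+)/(\d+)(?: message=(.*))?$")


def parse_progress_line(line):
    """Return (done, total, message) for a progress line, or None otherwise."""
    match = PROGRESS_LINE.match(line.strip())
    if match is None:
        return None  # an ordinary log line
    done, total, message = match.groups()
    if message is not None:
        # Accept both message=text and message="text".
        message = message.strip('"')
    return int(done), int(total), message
```

Anchoring the pattern to the start of the line keeps the check cheap for ordinary log output, since most lines fail on the first few characters.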

@dseapy

dseapy commented Sep 20, 2020

That does indeed sound much better for security and friendlier to the container. I'm guessing there is not too much of a performance hit to do the log parsing/matching for each pod?

@brabster
Contributor Author

I think this approach works for me too. I had long-running tasks within a workflow that I tracked via logging.

@andyleap

I really like the progress stuff, definitely solves some UX issues I have! One thought is: The ability to have multiple independent progress meters would be nice (think things like monitoring rollouts of multiple different kubernetes workloads). Obviously can be handled currently by just breaking those out into distinct leaf nodes, so not really an issue.

@andyleap

Another thought would be to have the node config include a regex that would yield the progress information?

@mweibel
Contributor

mweibel commented Oct 5, 2020

I think this way of parsing the stdout log works for me too 👍

@alexec
Contributor

alexec commented Oct 8, 2020

@brabster @andyleap @dseapy I think we'll be including basic N/M progress in v2.12, but not workflows self-reporting their progress. I think this is a cool feature, and I want to know if one of you would like to take it on based on the design in #4015?

@salanki
Contributor

salanki commented Oct 8, 2020

The implementation in #4015, reporting progress via #argo progress=N/M message="m" would work for our use cases!

@alexec
Contributor

alexec commented Oct 8, 2020

@salanki thank you. I think what I'm saying is that I don't plan to work on this anymore. But if someone wants to take it on - that'd be great!

@alexec alexec removed their assignment Oct 14, 2020
@alexec alexec removed this from the v2.13 milestone Oct 22, 2020
@mweibel
Contributor

mweibel commented Sep 3, 2021

FYI, I started working on this in https://github.com/helio/argo-workflows/tree/custom-workflow-progress
I ported the code @alexec wrote to the latest master state and started testing it locally (and with tests). I haven't run it with "real" workloads yet and there is much to improve. It's more of a prototype at this point.

But in case anyone wants to have an early look at it, more than welcome :).

I only ported the custom N/M reporting, not yet the message part, as the way pods report status messages has changed quite a bit and I'm not sure if it's even needed TBH.

compare link: https://github.com/argoproj/argo-workflows/compare/master...helio:custom-workflow-progress?expand=1

mweibel added a commit to helio/argo-workflows that referenced this issue Nov 3, 2021
original code from: https://github.com/argoproj/argo-workflows/pull/4015/files
closes argoproj#1658, argoproj#4245

Signed-off-by: Michael Weibel <michael@helio.exchange>
@mweibel
Contributor

mweibel commented Nov 3, 2021

in case anyone wants to start using this, docs are here: https://github.com/argoproj/argo-workflows/blob/master/docs/progress.md#self-reporting-progress. Let me know if they're clear enough.
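
As a rough illustration of self-reporting from inside a container, the sketch below writes progress to the file named by the `ARGO_PROGRESS_FILE` environment variable mentioned later in this thread. It assumes the file holds a single `N/M` line, as the linked docs describe; the helper name `report_progress` is hypothetical, and the linked docs remain the authoritative reference.

```python
import os


def report_progress(done, total, path=None):
    """Write "done/total" to the progress file; return False if none is set.

    Assumption: the executor reads a single N/M line from the file whose
    path is given by the ARGO_PROGRESS_FILE environment variable.
    """
    path = path or os.environ.get("ARGO_PROGRESS_FILE")
    if not path:
        return False  # not running with progress reporting enabled
    with open(path, "w") as progress_file:
        progress_file.write(f"{done}/{total}\n")
    return True
```

A long-running task would call something like `report_progress(i, len(items))` every so often while looping over its work items.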

@hatrg

hatrg commented Jul 12, 2023

Hi, can I get this information from inside a workflow manifest using a workflow variable, something like {{workflow.progress}}?

@mweibel
Contributor

mweibel commented Jul 12, 2023

As far as I'm aware, not using a variable (all workflow variables are documented here). However, you might be able to read the file exposed via ARGO_PROGRESS_FILE, though I'm not sure about that.

@hatrg

hatrg commented Jul 12, 2023

Thanks, just tried using the https://github.com/argoproj/argo-workflows/blob/729d0a7b35825ff47254bad3fbea3d571ea621c8/examples/dag-coinflip.yaml example, on L50 I replaced echo \"it was heads\" with cat $ARGO_PROGRESS_FILE, the step failed and I got this in the logs:

cat: can't open '/var/run/argo/progress': No such file or directory

looks like it's not two-way binding.

@hatrg

hatrg commented Jul 12, 2023

Is it possible to get the global total number of tasks in the workflow so that I can calculate it myself? I spent some time checking the docs but couldn't find any way to do it.
