Proposal: system level workflow parallelism limits & priorities #740

Closed
jessesuen opened this Issue Feb 14, 2018 · 0 comments

@jessesuen
Contributor

jessesuen commented Feb 14, 2018

Capturing a discussion from the google group into a GitHub issue. Read from top to bottom:

On Thursday, February 1, 2018 at 8:14:36 AM UTC-8, Rob Oxspring wrote:

I might have misunderstood but that issue seems to be primarily concerned about parallelism of steps within a workflow, which is interesting but not especially useful for our purpose as we're using very limited parallelism within a workflow anyway.

We typically run workflows of 20-30 steps of which a few might consume large amounts of memory. We size our cluster with the expectation of a maximum of N workflows running at a time. What I'd like is to be able to enforce this limit of N active workflow resources at any time, or perhaps limit count(pods selecting label memoryhog=true)<2N, but I can't see ways of doing that. Admittedly the latter could be useful as a general Kubernetes feature request, but the former would seem reasonable as an Argo feature request?

In the absence of those features, I guess the options would be to implement a ValidatingAdmissionWebhook, or to implement queuing in front of Argo submissions. Any other obvious options I missed?

Thanks,

Rob


Hi Rob,

You are correct, the issue which I linked to is primarily concerned with limiting parallelism of pods within a workflow. It does not cover the limiting of workflows in the system at a time.

Controlling the number of concurrent workflows at the controller level isn't something we've thought about, but I do think it's something that warrants some thought and discussion. The following are the considerations that an implementation would need to handle:

  • We would first need to introduce a Pending state at the workflow level. Pending is not currently part of the workflow state machine (it is an inferred state shown only in the CLI).
  • If you permit Pending workflows, you have essentially introduced a priority system in the controller, because when deciding which workflow to run next, the basic requirement would be to choose it in FIFO fashion. The problem is that the Kubernetes informers framework (on top of which controllers are built) does not lend itself well to this mode of resource selection.
  • The workflow-controller is expected to eventually scale horizontally (e.g. it might run with N replicas). When multiple controllers manage the same set of workflows, the bookkeeping/shared knowledge of workflow counts would be an issue. But this is a theoretical problem we can deal with at a later time.
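
To make the first two considerations concrete, here is a minimal in-memory sketch of a Pending phase combined with FIFO admission under a limit of N running workflows. It is not Argo's implementation: the `Workflow` dataclass, the phase constants, and `admit_pending` are hypothetical names invented for illustration, and the sketch assumes a single controller that can see every workflow at once.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical phase names; "Pending" is the new state the bullets describe.
PENDING, RUNNING, DONE = "Pending", "Running", "Succeeded"


@dataclass
class Workflow:
    name: str
    created_at: float  # creation timestamp, used for FIFO ordering
    phase: str = PENDING


def admit_pending(workflows: List[Workflow], limit: int) -> List[Workflow]:
    """Promote the oldest Pending workflows to Running without exceeding limit."""
    running = sum(1 for wf in workflows if wf.phase == RUNNING)
    # FIFO: consider Pending workflows oldest first.
    pending = sorted(
        (wf for wf in workflows if wf.phase == PENDING),
        key=lambda wf: wf.created_at,
    )
    admitted = []
    for wf in pending:
        if running >= limit:
            break
        wf.phase = RUNNING
        running += 1
        admitted.append(wf)
    return admitted
```

Note that the function needs the full, ordered set of Pending workflows and an accurate running count in one place. That is the part the bullets flag as awkward for an informer-driven controller, and the shared count is what becomes hard once several controller replicas exist.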

In terms of options today, Kubernetes admission webhooks might work, but I think this would not permit queuing, so it doesn't seem like a good fit. Also, I'm not sure if that mechanism supports CRDs. Your last suggestion, external queuing on top of Argo, is probably the best and most flexible option. Admission control typically wants to take a lot of other things into consideration (e.g. priority, permissions, stop-the-world, etc.). Since we view the workflow-controller as a building block for other applications, my preference is to implement the queuing in that application layer.
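
As a rough illustration of external queuing in front of Argo, the sketch below holds submissions in a priority queue and releases them only while fewer than N workflows are active. Everything here is an assumption for illustration: `SubmissionQueue`, `count_active`, and `submit` are hypothetical names, and the two callbacks stand in for whatever a real deployment would use (for example, listing Workflow resources and wrapping `argo submit`).

```python
import heapq
import itertools
from typing import Callable, List, Tuple


class SubmissionQueue:
    """Hypothetical queue sitting in front of workflow submission."""

    def __init__(
        self,
        limit: int,
        count_active: Callable[[], int],  # assumed: returns number of active workflows
        submit: Callable[[str], None],    # assumed: submits one workflow manifest
    ):
        self.limit = limit
        self.count_active = count_active
        self.submit = submit
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def enqueue(self, manifest: str, priority: int = 0) -> None:
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), manifest))

    def pump(self) -> int:
        """Submit queued workflows while free slots remain; return how many were submitted."""
        free = self.limit - self.count_active()
        submitted = 0
        while self._heap and submitted < free:
            _, _, manifest = heapq.heappop(self._heap)
            self.submit(manifest)
            submitted += 1
        return submitted
```

A caller would invoke `pump()` periodically or whenever a workflow completes. Free slots are computed once per call because the active count may lag behind submissions that were just made.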

On a related note, Kubernetes 1.9 has an alpha feature for pod prioritization. I would want to understand the implementation of that because workflow prioritization should be implemented in the same manner.

BTW, I will file a new issue for this on GitHub to facilitate further discussion (in case other people request the same or a similar feature).

-Jesse

@jessesuen jessesuen added the proposal label Feb 14, 2018

alexmt added a commit to alexmt/argo that referenced this issue Oct 31, 2018

@alexmt alexmt self-assigned this Oct 31, 2018

alexmt added a commit to alexmt/argo that referenced this issue Nov 1, 2018

alexmt added a commit to alexmt/argo that referenced this issue Nov 6, 2018

@alexmt alexmt closed this in #1065 Nov 7, 2018

alexmt added a commit that referenced this issue Nov 7, 2018

Issue #740 - System level workflow parallelism limits & priorities (#1065)

* Issue #740 - System level workflow parallelism limits & priorities

* Apply reviewer notes