Proposal: system level workflow parallelism limits & priorities #740
Capturing a discussion from the google group into a GitHub issue. Read from top to bottom:
On Thursday, February 1, 2018 at 8:14:36 AM UTC-8, Rob Oxspring wrote:
I might have misunderstood but that issue seems to be primarily concerned about parallelism of steps within a workflow, which is interesting but not especially useful for our purpose as we're using very limited parallelism within a workflow anyway.
We typically run workflows of 20-30 steps of which a few might consume large amounts of memory. We size our cluster with the expectation of a maximum of N workflows running at a time. What I'd like is to be able to enforce this limit of N active workflow resources at any time, or perhaps limit count(pods selecting label memoryhog=true)<2N, but I can't see ways of doing that. Admittedly the latter could be useful as a general Kubernetes feature request, but the former would seem reasonable as an Argo feature request?
In the absence of those features, I guess the options would be to implement a ValidatingAdmissionWebhook or to implement queuing in front of Argo submissions. Any other obvious options I missed?
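For illustration only, the label-based limit described above, count(pods selecting label memoryhog=true) < 2N, can be sketched as a plain check over pod labels. This is not part of Argo or Kubernetes: the function names `count_matching` and `within_limit` are hypothetical, and pods are represented as simple label dictionaries rather than real API objects.

```python
from typing import Iterable, Mapping


def count_matching(pods: Iterable[Mapping[str, str]], selector: Mapping[str, str]) -> int:
    """Count pods whose labels contain every key/value pair in the selector."""
    return sum(
        1
        for labels in pods
        if all(labels.get(key) == value for key, value in selector.items())
    )


def within_limit(pods: Iterable[Mapping[str, str]], selector: Mapping[str, str], n: int) -> bool:
    """Return True while count(pods matching selector) < 2N holds (hypothetical policy)."""
    return count_matching(pods, selector) < 2 * n


# Example: with N = 2, three memory-hungry pods are within the limit.
pods = [
    {"memoryhog": "true"},
    {"memoryhog": "true"},
    {"memoryhog": "true"},
    {"app": "light-step"},
]
ok = within_limit(pods, {"memoryhog": "true"}, n=2)
```

A webhook or an external submitter could evaluate a check like this before letting another labelled pod through, but how the pod list is obtained is left open here.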
You are correct, the issue which I linked to is primarily concerned with limiting parallelism of pods within a workflow. It does not cover the limiting of workflows in the system at a time.
Controlling the number of concurrent workflows at the controller level isn't something we've thought about, but I do think it's something that warrants some thought and discussion. The following are the considerations in the implementation that would need to be handled:
In terms of options today, Kubernetes admission webhooks might work, but I think they would reject rather than queue submissions, so they don't seem like a good fit. Also, I'm not sure if that mechanism supports CRDs. Your last suggestion, external queuing on top of Argo, is probably the best and most flexible option. Admission control is something that typically wants to take a lot of other things into consideration (e.g. priority, permissions, stop-the-world, etc.). Since we view the workflow-controller as a building block for other applications, my preference is to implement the queuing at that layer.
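As a rough sketch of what external queuing in front of Argo submissions could look like, the following holds pending workflows in a priority queue and releases at most N at a time. Everything here is an assumption made for illustration: `WorkflowGate`, `enqueue`, `finished`, and the `submit` callback are hypothetical names, and a real version would need to submit to the cluster and learn about completions somehow (for example by watching Workflow resources).

```python
import heapq
import itertools
from typing import Callable, List, Set, Tuple


class WorkflowGate:
    """Hypothetical queue that releases at most max_active workflows at a time."""

    def __init__(self, max_active: int, submit: Callable[[str], None]) -> None:
        self.max_active = max_active
        self.submit = submit  # assumed callback that actually submits the workflow
        self.active: Set[str] = set()
        self._pending: List[Tuple[int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker: FIFO within the same priority

    def enqueue(self, name: str, priority: int = 0) -> None:
        """Queue a workflow; higher priority values are released first."""
        heapq.heappush(self._pending, (-priority, next(self._seq), name))
        self._drain()

    def finished(self, name: str) -> None:
        """Record that a workflow completed, freeing a slot for the next one."""
        self.active.discard(name)
        self._drain()

    def _drain(self) -> None:
        # Release pending workflows while there is spare capacity.
        while self._pending and len(self.active) < self.max_active:
            _, _, name = heapq.heappop(self._pending)
            self.active.add(name)
            self.submit(name)


# Usage: with N = 2, the third and fourth submissions wait for a free slot.
submitted: List[str] = []
gate = WorkflowGate(max_active=2, submit=submitted.append)
gate.enqueue("wf-a")
gate.enqueue("wf-b")
gate.enqueue("wf-c", priority=0)
gate.enqueue("wf-d", priority=5)
gate.finished("wf-a")  # frees one slot; wf-d goes first because of its priority
```

Keeping the queue outside the controller means priority and permission policy can change without touching Argo itself, at the cost of needing a reliable signal for when workflows finish.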
On a related note, Kubernetes 1.9 has an alpha feature for pod prioritization. I would want to understand the implementation of that because workflow prioritization should be implemented in the same manner.
BTW, I will file a new issue for this on GitHub to facilitate further discussion (in case other people request the same or a similar feature).