feat: Set a max recursion depth limit #11646
Conversation
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com> Signed-off-by: Alan Clucas <alan@clucas.org>
Generally looks good. I have a nit with os.Getenv, but this is how the rest of the codebase does it, so it's not a big deal.
The nit: os.Getenv is a syscall, so this value may differ throughout the execution of the operator function. I can see both sides of this; perhaps this is the desired behaviour. The functional programmer in me does not like it one bit, though.
Regardless, I've approved; grepping shows that Argo commonly does this, so changing it just here wouldn't change anything.
@isubasinghe if these env vars were used for anything 'non-experimental' I'd suggest we read them at startup and provide them via a service, but that's a bit overkill for the current intended use.
@Joibel yeah, that is fair enough. I think my previous comment was largely motivated by my subjective sense of "niceness" rather than practicality.
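As a minimal sketch of the startup-read approach mentioned above, here is what reading the value once at controller start could look like in Go, rather than calling os.Getenv during each operation. The environment variable name, default, and structure below are hypothetical illustrations, not the actual Argo Workflows configuration surface:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// controllerConfig holds values read once at startup so later code does not
// need to call os.Getenv while operating workflows.
type controllerConfig struct {
	maxStackDepth int
}

// loadConfig reads the (hypothetical) MAX_STACK_DEPTH variable once and falls
// back to a compiled-in default when it is unset or invalid.
func loadConfig() controllerConfig {
	cfg := controllerConfig{maxStackDepth: 150}
	if v, ok := os.LookupEnv("MAX_STACK_DEPTH"); ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			cfg.maxStackDepth = n
		}
	}
	return cfg
}

func main() {
	cfg := loadConfig() // read once; the operator would use cfg from here on
	fmt.Println("max stack depth:", cfg.maxStackDepth)
}
```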
workflow/controller/controller.go
Outdated
@@ -67,6 +67,8 @@ import (
	plugin "github.com/argoproj/argo-workflows/v3/workflow/util/plugins"
)

const defaultMaxStackDepth = 150
Would this be a reasonable default?
This number comes from #4193, where it started off as 500, and was amended to 150. #10785 removed 10 as the limit, but that feels unreasonably low and likely to cause issues with existing workflows.
Note, this is not actually a default because it's uncontrollable (again, #4193 started with it configurable via the environment, but switched to a disable instead). I will change its name.
My opinion is that 150 is too big; if I were plucking a number out of the air I'd set it lower, at around 50. What's your guess at a good number, @terrytangyuan?
Is there a number at which the controller starts having trouble that we could use as the limit?
I think by the time the controller is having trouble we've gone too far.
Without status offloading you'll probably have wasted a lot of resources and hit a failure to store the status before you get the controller into trouble. I don't think we should just be aiming to stop the controller getting into trouble.
I think this number should aim for a value which is reasonable for workflows already in existence, so that as few users as possible consider this a regression, but which is otherwise low enough to avoid doing work that will be thrown away because we're actually in an infinite recursion loop.
I'm going to propose reducing the 150 to 100 and seeing what happens in the real world. What do you think?
Sure. Also we might want to change the default to disabled to avoid breaking existing workflows.
My preference would be to leave it enabled as a saner default. Switching it 'later' doesn't really help with not breaking stuff, and I'm expecting this to go in a major, not minor, release (hence flagged as feat).
I really like this kind of recursion prevention. We produced infinite recursion due to a step and an entrypoint having the same name, and spent a lot of time stopping/terminating/deleting the workflow that kept calling its own entrypoint before we spent even more time finding out why this occurred.
This prevention would have saved us a lot of time, so I think it should be merged to save time for others in the future.
Signed-off-by: Alan Clucas <alan@clucas.org>
Head branch was pushed to by a user without write access
Co-authored-by: Yuan (Terry) Tang <terrytangyuan@gmail.com> Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Yuan (Terry) Tang <terrytangyuan@gmail.com> Signed-off-by: Alan Clucas <alan@clucas.org>
This is a resurrection of #4193 to address #11499
Motivation
Intentionally or unintentionally, it is possible to write templates that recurse indefinitely. This change prevents that by monitoring the stack depth of templates and stopping when it reaches 150, which seems big enough for anyone.
It can be disabled globally with a controller environment variable in case anyone runs into problems; we can then address those problems with new fixes as they are reported.
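To illustrate the general mechanism described above (not the exact implementation in this PR), here is a minimal Go sketch of tracking template recursion depth, failing once a fixed limit is exceeded, and skipping the check when an environment variable disables it. The function and variable names are hypothetical:

```go
package main

import (
	"fmt"
	"os"
)

// Limit discussed in this PR: not configurable, only disableable.
const maxStackDepth = 150

// depthCheckDisabled reports whether the operator should skip the limit check,
// based on a (hypothetical) controller environment variable.
func depthCheckDisabled() bool {
	return os.Getenv("DISABLE_MAX_RECURSION") == "true"
}

// resolveTemplate simulates recursive template resolution, carrying the
// current stack depth and failing once the limit is exceeded.
func resolveTemplate(name string, depth int) error {
	if !depthCheckDisabled() && depth > maxStackDepth {
		return fmt.Errorf("maximum recursion depth exceeded while resolving %q", name)
	}
	// A template that (directly or indirectly) references itself would
	// otherwise recurse here forever.
	return resolveTemplate(name, depth+1)
}

func main() {
	if err := resolveTemplate("self-referencing-entrypoint", 0); err != nil {
		fmt.Println(err) // the workflow would be marked failed with this message
	}
}
```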