Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Set a max recursion depth limit #11646

Merged
merged 8 commits into from
Sep 5, 2023
Merged

Conversation

Joibel
Copy link
Member

@Joibel Joibel commented Aug 22, 2023

This is a resurrection of #4193 to address #11499

Motivation

Intentional or unintentionally it is possible to write templates that indefinitely recurse. This change prevents that by monitoring the stack depth of templates, and stopping when it reaches 150 which seems big enough for anyone.

It can be disabled globally with a controller environment variable in case anyone comes across problems and we can address those problems with new fixes if they are reported.

Signed-off-by: Alan Clucas <alan@clucas.org>
docs/scaling.md Outdated Show resolved Hide resolved
Joibel and others added 2 commits August 23, 2023 21:40
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Alan Clucas <alan@clucas.org>
Copy link
Member

@isubasinghe isubasinghe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good, I have a nit with the os.Getenv, but this is how the rest of the codebase does it so not a big deal.

The nit: os.Getenv is a syscall, this value may differ throughout the execution of the operator function. I can see both sides to this, perhaps this is desired behaviour. The functional programmer in me does not like it one bit though.

Regardless I've approved, grepping shows that argo commonly does this, so changing it just here won't change a thing.

@Joibel
Copy link
Member Author

Joibel commented Aug 29, 2023

@isubasinghe if these env vars were used for anything 'non experimental' I'd suggest we read them at startup and provide them via a service, but that's a bit overkill for the current intended use.

@isubasinghe
Copy link
Member

@Joibel yeah that is fair enough, I think my previous comment was largely motivated by my subjective sense of "niceness" rather than practicality.
I've approved it anyway.

@@ -67,6 +67,8 @@ import (
plugin "github.com/argoproj/argo-workflows/v3/workflow/util/plugins"
)

const defaultMaxStackDepth = 150
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be a reasonable default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This number comes from #4193, where it started off as 500, and was amended to 150. #10785 removed 10 as the limit, but that feels unreasonably low and likely to cause issues with existing workflows.
Note, this is not actually a default because it's uncontrollable (again #4193 started with it configurable via the environment, but switched to a disable instead). I will change it's name.
My opinion is that 150 is too big, if I was plucking a number out of the air I'd set it lower at around 50. What's your guess at a good number @terrytangyuan?

Copy link
Member

@terrytangyuan terrytangyuan Sep 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a number we can use where the controller is having trouble once the limit is reached?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm think by the time the controller is having trouble we've gone too far.
Without status offloading you'll probably have wasted a lot of resources but hit failure to store status before you get the controller in trouble. I don't think we should just be aiming to stop the controller getting into trouble.
I think this number should aim for a value which is reasonable for workflows already in existence so as few users as possible consider this a regression, but is otherwise low enough to prevent wasting time doing work which will be wasted because we're actually in an infinite recursion loop.
I'm going to propose reducing the 150 to 100 and see what happens in the real world, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Also we might want to change the default to disabled to avoid breaking existing workflows.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would be to leave it enabled as a saner default. Switching it 'later' doesn't really help with not breaking stuff, and I'm expecting this to go in a major not minor release (hence flagged as feat).

Copy link

@deHaar deHaar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this kind of recursion prevention, we produced infinite recursion due to equal names of a step and an entrypoint. Spent a lot of time to stop/terminate/delete the workflow that kept calling its own entrypoint before we spent even more time finding out why this occurred.
This prevention would have saved us much time, so I think it should be merged to save time for others in future.

docs/environment-variables.md Outdated Show resolved Hide resolved
docs/scaling.md Outdated Show resolved Hide resolved
@Joibel Joibel changed the title feat: Set a configurable max recursion depth limit feat: Set a max recursion depth limit Sep 5, 2023
auto-merge was automatically disabled September 5, 2023 21:21

Head branch was pushed to by a user without write access

Joibel and others added 2 commits September 5, 2023 22:21
Co-authored-by: Yuan (Terry) Tang <terrytangyuan@gmail.com>
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Yuan (Terry) Tang <terrytangyuan@gmail.com>
Signed-off-by: Alan Clucas <alan@clucas.org>
@terrytangyuan terrytangyuan enabled auto-merge (squash) September 5, 2023 21:37
@terrytangyuan terrytangyuan merged commit 633c5e9 into argoproj:master Sep 5, 2023
22 checks passed
@Joibel Joibel deleted the max-depth branch September 6, 2023 08:26
qudtjs0753 pushed a commit to qudtjs0753/argo-workflows that referenced this pull request Sep 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants