
Option to clear node state on any termination #4596

Merged

Conversation

@Tom-Newton (Contributor) commented Dec 13, 2023

Tracking issue

Implements the first part of #4569

Why are the changes needed?

Reduce unneeded information stored in etcd. This allows Flyte to scale to larger workflows before hitting etcd size limits.

What changes were proposed in this pull request?

  • Creates a config option node-config.enable-cr-debug-metadata.
    • When enabled, behaviour is the same as before.
    • Disabled by default.
    • When disabled, all terminal states are handled the same way as the success and skipped states: all fields of the NodeStatus are cleared except for Phase, StoppedAt and Error (see the sketch after this list).
    • Behaviour for the success and skipped states should be equivalent to before.
    • I was quite uncertain about how to implement the feature flag, but I wanted to avoid making it a setting stored in the NodeStatus, which would then need to be written to etcd.
  • It's not strictly related to this PR, but I added a test case TestNodeStatus_UpdatePhase -> non-terminal-timing-out to fix a code coverage failure.
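
A hedged sketch of that clearing behaviour, referenced from the list above. The type names, fields and helper here are illustrative stand-ins, not the actual flytepropeller NodeStatus or the code in this PR:

```go
package main

import (
	"fmt"
	"time"
)

// Minimal stand-in for the real NodeStatus, which carries many more fields
// (task/dynamic node status, timestamps, attempt counters, ...).
type NodePhase int

const (
	NodePhaseRunning NodePhase = iota
	NodePhaseSucceeded
	NodePhaseFailed
	NodePhaseSkipped
	NodePhaseTimedOut
)

// isTerminal reports whether the node has reached any terminal phase.
func (p NodePhase) isTerminal() bool {
	switch p {
	case NodePhaseSucceeded, NodePhaseFailed, NodePhaseSkipped, NodePhaseTimedOut:
		return true
	}
	return false
}

type NodeStatus struct {
	Phase     NodePhase
	StartedAt *time.Time
	StoppedAt *time.Time
	Error     *string
	Attempts  uint32
}

// clearStateOnAnyTermination mimics the behaviour described above: with
// enable-cr-debug-metadata disabled, any terminal phase keeps only Phase,
// StoppedAt and Error, the same way success/skipped nodes were already handled.
func clearStateOnAnyTermination(s *NodeStatus, enableCRDebugMetadata bool) {
	if enableCRDebugMetadata || !s.Phase.isTerminal() {
		return // legacy behaviour, or the node is still running
	}
	*s = NodeStatus{Phase: s.Phase, StoppedAt: s.StoppedAt, Error: s.Error}
}

func main() {
	now := time.Now()
	msg := "task failed"
	s := NodeStatus{Phase: NodePhaseFailed, StartedAt: &now, StoppedAt: &now, Error: &msg, Attempts: 3}
	clearStateOnAnyTermination(&s, false)
	fmt.Printf("%+v\n", s) // StartedAt and Attempts are zeroed; Phase, StoppedAt, Error remain
}
```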

How was this patch tested?

Updated the unit tests.
We have also been using very similar changes in our production Flyte deployment for a few weeks.


@Tom-Newton changed the title from "Tomnewton/collapse sub nodes on failures" to "Option to clear node state on any termination" on Dec 13, 2023
@Tom-Newton force-pushed the tomnewton/collapse_sub_nodes_on_failures branch from 3e55c74 to 625ab3d on December 13, 2023 at 18:41
@Tom-Newton (Contributor, Author)

There are still a couple of TODO comments that I want to address, but if anyone has any early feedback, that would be appreciated. I was rather uncertain about how to put this behind a feature flag.


codecov bot commented Dec 14, 2023

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (3f4ba2e) 58.98% compared to head (64d7fa2) 58.15%.
Report is 44 commits behind head on master.

Files                                              Patch %    Lines
flytepropeller/pkg/controller/nodes/executor.go    92.85%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4596      +/-   ##
==========================================
- Coverage   58.98%   58.15%   -0.83%     
==========================================
  Files         621      626       +5     
  Lines       52483    53790    +1307     
==========================================
+ Hits        30957    31282     +325     
- Misses      19059    20000     +941     
- Partials     2467     2508      +41     
Flag         Coverage Δ
unittests    58.15% <97.61%> (-0.83%) ⬇️

Flags with carried forward coverage won't be shown.

@Tom-Newton force-pushed the tomnewton/collapse_sub_nodes_on_failures branch from e50ebb8 to 42fb049 on December 14, 2023 at 13:08
@Tom-Newton force-pushed the tomnewton/collapse_sub_nodes_on_failures branch from 5e07651 to 453a69f on December 15, 2023 at 14:51
@Tom-Newton force-pushed the tomnewton/collapse_sub_nodes_on_failures branch from b812954 to fb9bd90 on December 15, 2023 at 14:56
@Tom-Newton (Contributor, Author)

I'm still not super confident about how I implemented the config option, but I think it's time to get a review on this.

@Tom-Newton marked this pull request as ready for review on December 15, 2023 at 14:59
@dosubot (bot) added the size:L label (This PR changes 100-499 lines, ignoring generated files) on Dec 15, 2023
@dosubot (bot) added the enhancement label (New feature or request) on Dec 15, 2023
@hamersaw (Contributor)

@Tom-Newton this looks pretty good to me. Do you know how big part 2 (i.e. leaving the error when the workflow failure policy is set) is? It feels to me like it makes sense to submit these in the same PR to make tracking the changes easier. Thoughts?

Also, in the original PRs I thought we needed to keep the error around for one round (i.e. the transition of the node from FAILING to FAILED) because the actual admin event is sent during that transition, while the error is recorded when transitioning to FAILING. Am I remembering this incorrectly?

@Tom-Newton (Contributor, Author) commented Dec 18, 2023

Thanks for taking a look

> Do you know how big part 2 is?

The actual change is pretty small but I haven't written any tests for it yet.

> Also, in the original PRs I thought we needed to keep the error around for one round

This PR actually doesn't change the error field. I think you previously looked at the sum of part 1 and part 2. I had always intended to split it like this, but I made a bit of a mess with rebasing. Part 2 just clears the previous error whenever a new one occurs.

I'll make the second part a stacked PR on top of this one, and we can decide whether to combine it into this PR before or after this one is merged.
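
For concreteness, a minimal sketch of the part-2 behaviour described above, under made-up names (the real code stores a structured execution error on the NodeStatus rather than a plain string):

```go
package main

import "fmt"

// nodeErrorState is an illustrative stand-in for the error-carrying part of
// NodeStatus; it is not the flytepropeller type.
type nodeErrorState struct {
	Error *string
}

// recordError replaces whatever error was stored previously, so at most one
// error payload per node survives in the workflow CR.
func (n *nodeErrorState) recordError(message string) {
	n.Error = &message
}

func main() {
	var n nodeErrorState
	n.recordError("first attempt failed")
	n.recordError("second attempt failed") // the first error is discarded
	fmt.Println(*n.Error)
}
```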

@Tom-Newton (Contributor, Author) commented Dec 21, 2023

@hamersaw Tom-Newton#6 implements part 2 (apparently I can't create a PR against the upstream repo that is stacked on a branch from my fork).

@hamersaw (Contributor) commented Dec 22, 2023

@Tom-Newton thanks so much for these! They look great. I'll have to run some tests locally just to verify everything is working the way I think it should.

In the 2nd PR here, is there a reason we only strip if the workflow failure policy says to keep executing? It seems to me that we can always delete the error after it's been reported to flyteadmin, right?

In the meantime, I was thinking that leaving the additional metadata in the CRD is really only useful for manual debugging purposes, and very few users do that. I'm not sure the fine-grained control of managing specific metadata is necessary (i.e. having separate flags for deletion on all terminal states and for removing errors after they have been reported). Do you think it makes sense to put both of these things behind a flag like disableCRDDebugMetadata (or an enable... variant) and just make this all or nothing? If you want all the metadata, we'll leave it; if you want us to strip it bare-bones to reduce etcd size, we'll do that. And then it seems to me like stripping all the metadata should be the default, and users can enable the CRD debug metadata in certain scenarios. VERY interested in what you think here.

@Tom-Newton (Contributor, Author)

> In the 2nd PR here, is there a reason we only strip if the workflow failure policy says to keep executing? It seems to me that we can always delete the error after it's been reported to flyteadmin, right?

Good question. I don't think it makes a lot of difference, but that would give a slightly smaller workflow state. I did actually try that initially, but I couldn't find the right place to clear the error so that it happened after the last usage. Personally, I'm quite happy with the current solution because I think it's relatively easy to understand.

> Do you think it makes sense to put both of these things behind a flag like disableCRDDebugMetadata (or an enable... variant) and just make this all or nothing?

I'm not sure. While it's very likely that users of one of these options will also want to use the other, I also like having lots of granularity in configuration. I think it's possible that there will be more options similar to this, which may be more controversial, such that some users will want more granular control than a single config option.

Do you have any thoughts on whether you want to review and merge both parts in one PR? Personally, I would go for two subsequent PRs, as it is now.

@Tom-Newton force-pushed the tomnewton/collapse_sub_nodes_on_failures branch from 865f63f to 05208ec on January 12, 2024 at 18:10
@Tom-Newton (Contributor, Author)

We agreed to rename the config option to enable-cr-debug-metadata and use the same config option for Tom-Newton#6.
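
For reference, a hedged sketch of how an option like this might be declared in FlytePropeller's node config. The package, struct, field name and tag text below are assumptions based on this thread, not a copy of the merged code:

```go
package config

// NodeConfig sketches where a node-config.enable-cr-debug-metadata style
// option could live. FlytePropeller config structs are typically bound to
// YAML keys via json tags and to CLI flags via pflag tags; the exact wiring
// here is illustrative only.
type NodeConfig struct {
	// EnableCRDebugMetadata keeps the full NodeStatus metadata in the
	// FlyteWorkflow CR after a node reaches a terminal state. When left at the
	// default (false), that metadata is cleared to reduce etcd usage.
	EnableCRDebugMetadata bool `json:"enable-cr-debug-metadata" pflag:",Keep full node state in the CR for debugging instead of clearing it on termination."`
}
```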

@Tom-Newton force-pushed the tomnewton/collapse_sub_nodes_on_failures branch from 05208ec to 64d7fa2 on January 12, 2024 at 18:12
@dosubot (bot) added the lgtm label (This PR has been approved by a maintainer) on Jan 16, 2024
@hamersaw merged commit d208ed3 into flyteorg:master on Jan 16, 2024
44 of 45 checks passed