Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix transient secret sync error handling #4310

Conversation

Tom-Newton
Copy link
Contributor

@Tom-Newton Tom-Newton commented Oct 26, 2023

Tracking issue

Closes #4309

Describe your changes

  • Add a new grace period configuration create-container-config-error-grace-period for CreateContainerConfigErrors similar to create-container-error-grace-period for CreateContainerError.
  • I was a bit torn on whether to return a retry-able error and whether to enable cleanup.
    • Without cleanup I think I observed some issues where it was trying to make an invalid state change from FAILED -> SUCCEEDED when the grace period was exceeded but the pod continued to completion.
    • Ultimately I decided to keep it as a permanent failure without cleanup. This means non-flaky failures can still be debugged by looking at the pod events. If we enabled cleanup I think it would delete the pod and make that debugging harder.
    • I can solve my problem just by setting a long enough timeout so that all the flaky failures recover before flyte decides they have failed.
  • I set the default grace period to 0 so the default behaviour should be unchanged.
  • Adjust the error messages so they mention the grace period if there are failures after the grace period expires.

Check all the applicable boxes

  • I updated the documentation accordingly. - The new config has the same level of documentation as create-container-error-grace-period I don't think there is anything else I need to do.
  • All new and existing tests passed. - I think so. Presumably the CI checks will be the definitive answer
  • All commits are signed-off.

Screenshots

Note to reviewers

  • I've never used golang before so I'm just pattern matching with the existing code

@welcome
Copy link

welcome bot commented Oct 26, 2023

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@Tom-Newton Tom-Newton force-pushed the tomnewton/fix_transient_secret_sync_error_handling branch 2 times, most recently from bfcaae3 to 9aff13b Compare October 27, 2023 09:31
@Tom-Newton Tom-Newton force-pushed the tomnewton/fix_transient_secret_sync_error_handling branch 2 times, most recently from c6d51d6 to 30875f5 Compare November 29, 2023 19:07
@Tom-Newton Tom-Newton marked this pull request as ready for review November 29, 2023 19:51
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Nov 29, 2023
Copy link

codecov bot commented Dec 4, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c049865) 58.89% compared to head (823711a) 58.90%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4310      +/-   ##
==========================================
+ Coverage   58.89%   58.90%   +0.01%     
==========================================
  Files         620      620              
  Lines       52440    52458      +18     
==========================================
+ Hits        30883    30901      +18     
  Misses      19090    19090              
  Partials     2467     2467              
Flag Coverage Δ
unittests 58.90% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hamersaw
Copy link
Contributor

hamersaw commented Dec 4, 2023

@Tom-Newton this looks great! Could you fix the unit test (should be a string match update) and then I think we're good to merge.

You can run these from the root of the repo using make -C flyteplugins test_unit for the flyteplugins unit tests as an example.

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
@Tom-Newton Tom-Newton force-pushed the tomnewton/fix_transient_secret_sync_error_handling branch from 74ab23a to cc1a119 Compare December 4, 2023 23:40
Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
@Tom-Newton Tom-Newton force-pushed the tomnewton/fix_transient_secret_sync_error_handling branch from cc1a119 to 7a40d98 Compare December 4, 2023 23:41
@Tom-Newton
Copy link
Contributor Author

Could you fix the unit test (should be a string match update)

Done

You can run these from the root of the repo using make -C flyteplugins test_unit for the flyteplugins unit tests as an example.

Thanks. I hadn't worked out how to run all tests.

@hamersaw hamersaw merged commit 76114aa into flyteorg:master Dec 6, 2023
41 checks passed
pvditt pushed a commit that referenced this pull request Dec 13, 2023
* Extra logging

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Hopefully fix it

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Remove debug prints

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Reduce duplication

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Tidy

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Add a test

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Use a CreateContainerConfigErrorGracePeriod

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Add grace period to error message

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Retryable failure with cleanup

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Longer grace period

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Fix test

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Set default grace period to 0 to avoid changing default behaviour

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Update comment

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Use permanent error with clean up

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Don't use cleanup

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

* Fix test

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>

---------

Signed-off-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Dan Rammer <daniel@union.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:M This PR changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Flyte task permanently marked as failed after transient failed to sync secret error
2 participants