Implement task retry with delay #5263

Open
wants to merge 2 commits into
base: master

mucahitkantepe
Contributor

@mucahitkantepe mucahitkantepe commented Apr 22, 2024

Tracking issue

Closes #2333
Related to flyteorg/flytekit#2368

@mucahitkantepe mucahitkantepe changed the title add retry with delay Implement task retry with delay Apr 22, 2024
Mücahit Kantepe and others added 2 commits April 22, 2024 12:58
Signed-off-by: mucahitkantepe <mucahitkantepe@gmail.com>
Signed-off-by: mucahitkantepe <mucahitkantepe@gmail.com>

codecov bot commented Apr 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.44%. Comparing base (7287470) to head (1fe65cc).
Report is 399 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5263      +/-   ##
==========================================
+ Coverage   58.60%   59.44%   +0.84%     
==========================================
  Files         568      336     -232     
  Lines       51121    25289   -25832     
==========================================
- Hits        29958    15033   -14925     
+ Misses      18748     8741   -10007     
+ Partials     2415     1515     -900     
Flag Coverage Δ
unittests ?


@eapolinario
Contributor

eapolinario commented May 7, 2024

@mucahitkantepe, can we help you move this PR out of draft so that we can get it properly reviewed?

@mucahitkantepe mucahitkantepe marked this pull request as ready for review May 7, 2024 18:23
@mucahitkantepe
Contributor Author

@eapolinario Sure. This is not the complete implementation yet; it still lacks tests, docs, etc. Since the codebase is new to me, I wanted to make sure the logic I added is in the right place.

sleepDuration := time.Until(nodeStatus.GetTaskNodeStatus().GetLastPhaseUpdatedAt().Add(currentNode.GetRetryStrategy().RetryDelay.Duration))
if sleepDuration > 0 {
	logger.Infof(currentNodeCtx, "Sleeping for [%v] before retrying", sleepDuration)
	time.Sleep(sleepDuration)
Contributor


Oh, we do not want to add a sleep to the event loop. We just need to return and not actually retry until the delay has elapsed.
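
To illustrate the suggestion above, here is a minimal sketch of a non-blocking check that reports whether the retry may start and otherwise returns immediately. The `retryState` type and its `remaining` method are hypothetical names made up for this example; they are not FlytePropeller types, and the real handler would read the timestamp and delay from the node status and retry strategy.

```go
package main

import (
	"fmt"
	"time"
)

// retryState is a hypothetical stand-in for the fields the PR reads from the
// node status and retry strategy. It is not a FlytePropeller type.
type retryState struct {
	lastAttempt time.Time     // when the previous attempt's phase was last updated
	delay       time.Duration // configured delay between retries
}

// remaining reports how much of the delay is left at time now and whether the
// retry may start. It never blocks, so the caller can simply return when the
// retry is not yet due.
func (s retryState) remaining(now time.Time) (time.Duration, bool) {
	wait := s.lastAttempt.Add(s.delay).Sub(now)
	if wait > 0 {
		return wait, false
	}
	return 0, true
}

func main() {
	start := time.Date(2024, time.May, 7, 12, 0, 0, 0, time.UTC)
	s := retryState{lastAttempt: start, delay: 5 * time.Second}

	// Evaluate the check at two points in time instead of sleeping.
	for _, elapsed := range []time.Duration{2 * time.Second, 5 * time.Second} {
		wait, ready := s.remaining(start.Add(elapsed))
		fmt.Printf("elapsed=%v ready=%v wait=%v\n", elapsed, ready, wait)
	}
}
```

Passing `now` in as a parameter, rather than calling `time.Now()` inside the method, keeps the check easy to test without real waiting.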

Contributor Author


Ah OK, I thought this was part of a green thread.

If we just return and the delay is configured to be, say, 1 hour, will propeller keep receiving the same event in a loop for that hour and get overloaded?

Contributor


It won't overload propeller; the workflow won't be checked again until the next refresh interval. Another way to do this is to implement a timer wheel that checks for active timers and fires the necessary workflow event. We can do that on a separate goroutine.
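
As a rough illustration of the timer wheel idea, the sketch below implements a minimal single-level wheel. All names (`timerWheel`, `schedule`, `tick`) are hypothetical and not part of FlytePropeller. In a real controller a separate goroutine would presumably call `tick` on a `time.Ticker` and enqueue the returned workflow keys; here the wheel is driven by hand so the behavior is deterministic.

```go
package main

import "fmt"

// entry is one pending timer: the key to fire and the number of full wheel
// revolutions that must still pass before it is due.
type entry struct {
	key    string
	rounds int
}

// timerWheel is a minimal single-level timer wheel (illustrative only).
type timerWheel struct {
	slots [][]entry
	pos   int
}

func newTimerWheel(size int) *timerWheel {
	return &timerWheel{slots: make([][]entry, size)}
}

// schedule registers key to fire after the given number of ticks (at least 1).
func (w *timerWheel) schedule(key string, ticks int) {
	if ticks < 1 {
		ticks = 1
	}
	slot := (w.pos + ticks) % len(w.slots)
	rounds := (ticks - 1) / len(w.slots)
	w.slots[slot] = append(w.slots[slot], entry{key, rounds})
}

// tick advances the wheel by one slot and returns the keys that are now due.
func (w *timerWheel) tick() []string {
	w.pos = (w.pos + 1) % len(w.slots)
	var fired []string
	kept := w.slots[w.pos][:0] // filter the slot in place
	for _, e := range w.slots[w.pos] {
		if e.rounds == 0 {
			fired = append(fired, e.key)
			continue
		}
		e.rounds--
		kept = append(kept, e)
	}
	w.slots[w.pos] = kept
	return fired
}

func main() {
	w := newTimerWheel(4)
	w.schedule("workflow-a", 1)
	w.schedule("workflow-b", 5) // wraps around the wheel once
	for i := 1; i <= 5; i++ {
		fmt.Printf("tick %d fired %v\n", i, w.tick())
	}
}
```

Each tick only touches one slot, so the cost of checking timers does not grow with the total number of pending retries.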

@@ -180,4 +180,7 @@ message RetryStrategy {
   // Number of retries. Retries will be consumed when the job fails with a recoverable error.
   // The number of retries must be less than or equals to 10.
   uint32 retries = 5;
+
+  // Delay between retries.
+  google.protobuf.Duration retry_delay = 6;
Contributor


Shall we make this more elaborate?

min_delay
max_delay
exponent

We can simply implement an exponent of 1 for now and add other exponents later. We can help.
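
One possible reading of the three proposed fields is sketched below. It assumes the delay for a retry is `min_delay * exponent^attempt`, clamped to the range between `min_delay` and `max_delay`, with a zero-based attempt counter. That formula and the `backoffDelay` helper are assumptions for illustration; the thread does not define the exact semantics. Under this reading an exponent of 1 gives a constant delay equal to `min_delay`.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// backoffDelay returns minDelay * exponent^attempt, clamped to
// [minDelay, maxDelay]. attempt is zero-based (0 is the first retry).
// This is a hypothetical helper, not an existing Flyte function.
func backoffDelay(minDelay, maxDelay time.Duration, exponent float64, attempt uint32) time.Duration {
	d := float64(minDelay) * math.Pow(exponent, float64(attempt))
	switch {
	case math.IsNaN(d) || d > float64(maxDelay):
		return maxDelay
	case d < float64(minDelay):
		return minDelay
	}
	return time.Duration(d)
}

func main() {
	minDelay, maxDelay := time.Second, time.Minute
	for attempt := uint32(0); attempt < 8; attempt++ {
		fmt.Printf("attempt %d: exponent 1 -> %v, exponent 2 -> %v\n",
			attempt,
			backoffDelay(minDelay, maxDelay, 1, attempt),
			backoffDelay(minDelay, maxDelay, 2, attempt))
	}
}
```

Capping at `max_delay` keeps a large retry count from producing an unbounded wait, which matters because the computation runs in floating point and can overflow for large exponents.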

Successfully merging this pull request may close these issues.

[Core feature] Exponential backoff retry
3 participants