Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Finish simplify retry behavior PR #623

Merged

Conversation

fellhorn
Copy link

TL;DR

As discussed with @hamersaw this brings the last finishing touches and fixes some one off bugs, see comments below.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>
Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>
Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>
Signed-off-by: Dennis Keck <26092524+fellhorn@users.noreply.github.com>
maxAttempts = uint32(*nCtx.Node().GetRetryStrategy().MinAttempts)
}
isEligible = currentAttempt < maxAttempts
currentAttempt = nodeStatus.GetAttempts() + 1
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We removed nodeStatus.GetSystemFailures(). As you can see in

currentAttempt = (nodeStatus.GetAttempts() + 1) - nodeStatus.GetSystemFailures()

(line 813)

GetAttempts() is the sum of system failures and user failures

When creating the test table we did not notice this as we always only used a single retry for system failures. For our final tests we now did some more extensive search for one off errors.

}
maxAttempts = uint32(config.GetConfig().NodeConfig.DefaultMaxAttempts)
if nCtx.Node().GetRetryStrategy() != nil && nCtx.Node().GetRetryStrategy().MinAttempts != nil && *nCtx.Node().GetRetryStrategy().MinAttempts != 1 {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*nCtx.Node().GetRetryStrategy().MinAttempts != 1 # Changed this 0 -> 1

If a user sets 0 retries, MinAttempts will be 1:
https://github.com/fellhorn/flytepropeller/blob/feature/simplify-retries/pkg/compiler/transformers/k8s/utils.go#L24-L25

Comment on lines -2783 to +2787
{"IneligibleSystemRetries", false, 0, 1, 1, 1, core.ExecutionError_SYSTEM, false},
{"IneligibleSystemRetries", false, 1, 1, 1, 1, core.ExecutionError_SYSTEM, false},
{"IgnoreCauseEligibleUserRetries", true, 0, 0, 2, 0, core.ExecutionError_USER, true},
{"IgnoreCauseIneligibleUserRetries", true, 1, 0, 2, 0, core.ExecutionError_USER, false},
{"IgnoreCauseEligibleSystemRetries", true, 0, 0, 2, 0, core.ExecutionError_SYSTEM, true},
{"IgnoreCauseIneligibleSystemRetries", true, 0, 1, 2, 0, core.ExecutionError_SYSTEM, false},
{"IgnoreCauseIneligibleSystemRetries", true, 1, 1, 2, 0, core.ExecutionError_SYSTEM, false},
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests were incorrect as attempts must be >= systemFailures (attempts is sum of user failures and system retries)

interruptible = false
c.metrics.InterruptedThresholdHit.Inc(ctx)
}
} else {
// Else a node is not considered interruptible if the system failures have exceeded the configured threshold
if interruptible && ((c.interruptibleFailureThreshold > 0 && int32(s.GetSystemFailures()) >= c.interruptibleFailureThreshold) ||
(c.interruptibleFailureThreshold <= 0 && int32(s.GetSystemFailures()) > int32(c.maxNodeRetriesForSystemFailures)+c.interruptibleFailureThreshold)) {
if interruptible && isAboveInterruptibleFailureThreshold(s.GetSystemFailures(), c.maxNodeRetriesForSystemFailures+1, c.interruptibleFailureThreshold) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 because we use >= and use a shared function isAboveInterruptibleFailureThreshold which takes maxAttempts as argument

@fellhorn fellhorn marked this pull request as ready for review September 27, 2023 12:47
@hamersaw hamersaw merged commit a9b52af into flyteorg:feature/simplify-retries Sep 29, 2023
2 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants