-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-32895][Scheduler] Introduce the max attempts for Exponential Delay Restart Strategy #23247
[FLINK-32895][Scheduler] Introduce the max attempts for Exponential Delay Restart Strategy #23247
Conversation
7272d54
to
92401d9
Compare
92401d9
to
c0de3c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @RocMarshal @qinf , would you mind helping take a look this PR in your free time? thanks~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@1996fanrui Thank you for your PR. I have a few comments, please take a look.
final ExponentialDelayRestartBackoffTimeStrategy restartStrategy = | ||
new ExponentialDelayRestartBackoffTimeStrategy( | ||
new ManualClock(), 1L, 3L, 2.0, 4L, 0.25); | ||
new ManualClock(), 1L, 3L, 2.0, 4L, 0.25, maxAttempts); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value of ConfigOption
restart-strategy.exponential-delay.backoff-multiplier
has been set to 1.2
. Is it necessary to update backoffMultiplier to 1.2
?
final long initialBackoffMS = 1L; | ||
final long resetBackoffThresholdMS = 8L; | ||
final ManualClock clock = new ManualClock(); | ||
|
||
final ExponentialDelayRestartBackoffTimeStrategy restartStrategy = | ||
new ExponentialDelayRestartBackoffTimeStrategy( | ||
clock, initialBackoffMS, 5L, 2.0, resetBackoffThresholdMS, 0.25); | ||
clock, initialBackoffMS, 5L, 2.0, resetBackoffThresholdMS, 0.25, 100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you help explain why set the attemptsBeforeResetBackoff
to 100 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't have strong reason, I just set it to a big value to ensure this test doesn't reach this limitation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update attemptsBeforeResetBackoff
to Integer.MAX_VALUE
for all tests not related to maxAttempts
.
Integer.MAX_VALUE
makes it easier for other developers to know that attemptsBeforeResetBackoff does not take effect.
exponentialDelayConfig.getJitterFactor(), | ||
RestartStrategyOptions.RESTART_STRATEGY_EXPONENTIAL_DELAY_ATTEMPTS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ExponentialDelayRestartStrategyConfiguration
contains other parameters needed to create ExponentialDelayRestartBackoffTimeStrategyFactory
. Is it necessary to add a parameter attemptsBeforeResetBackoff
to ExponentialDelayRestartStrategyConfiguration
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't do it because ExponentialDelayRestartStrategyConfiguration
is deprecated since 1.19. The RestartStrategies
has been deprecated, and ExponentialDelayRestartStrategyConfiguration
is the inner class of it.
So the new option won't be supportted in the deprecated api.
c0de3c8
to
5374606
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @zhuzhurk , would you mind helping double check this PR in your free time? Thanks~
This PR is the subtask1 and subtask2 of improving FLIP-364: Exponential Delay Restart Strategy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @1996fanrui for the contribution~
LGTM +1.
Description.builder() | ||
.text( | ||
"The number of times that Flink retries the execution before the job is declared as failed " | ||
+ "before reset the backoff to its initial value if %s has been set to %s.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reset
-> resetting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @RocMarshal for the review! Updated~
@@ -184,7 +184,7 @@ public class RestartStrategyOptions { | |||
public static final ConfigOption<Double> RESTART_STRATEGY_EXPONENTIAL_DELAY_BACKOFF_MULTIPLIER = | |||
ConfigOptions.key("restart-strategy.exponential-delay.backoff-multiplier") | |||
.doubleType() | |||
.defaultValue(2.0) | |||
.defaultValue(1.2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we set this to 1.5
to make it a bit less aggressive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mxm for the feedback!
1.5
is fine for me, and I'd like to cc @zhuzhurk who propose change these 2 default value as well!
Max and Mason have a little feedback after voting in the user mail list, and Max and I had a offline discussion yesterday. Max think the 1.2 is a little small or aggressive(delay time is too short). Here is the reason:
- Every time the job restarts, it will make a bunch of calls to the Kubernetes API, e.g. read/write to config maps, create task managers.
- When his producation had the default fixed-delay(1s) restart strategy turned on. A Kubernetes cluster became instable.
Following is the relationship between restart-attempts and retry-delay-time:
- The
delay-time
will reach 1 min after 12 attempts whenbackoff-multiplier
is 1.5 - The
delay-time
will reach 1 min after 24 attempts whenbackoff-multiplier
is 1.2
Hey @zhuzhurk , what do you think about setting1.5
as the default value? If you agree it, I will update the FLIP and feedback it to the user mail list, and go ahead this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.5 sounds good to me. Yet I think we need to send the updates to the FLIP discussion ML and open another vote for it, in case it turned out to be a surprise for those who have reviewed and voted on the FLIP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zhuzhurk for the quick feedback!
I have updated it in the dev&user
mail list. Let us wait for more feedback for a while. Also, I pause FLINK-33736 first. Therefore, I will remove the first commit from this PR and continue working on the rest of the JIRA for FLIP-364.
[1] https://lists.apache.org/thread/6glz0d57r8gtpzq4f71vf9066c5x6nyw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value related commit is migrated to a new separate PR, we can follow it there, thanks~
Also, if this PR(FLINK-32895) no other comments within 72 hours, I will merge it, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for considering our suggestion!
d6bdb4a
to
3189fc0
Compare
…g for ExponentialDelayRestartBackoffTimeStrategy due to some information will be updated
3189fc0
to
f91f422
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating this PR! @1996fanrui
I just have one minor comment for it.
Description.builder() | ||
.text( | ||
"The number of times that Flink retries the execution before the job is declared as failed " | ||
+ "before resetting the backoff to its initial value if %s has been set to %s.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe separate the long statement to make it easier for reading? e.g.
The number of times that Flink retries the execution before failing the job if %s has been set to %s. The number will be reset once the backoff is reset to its initial value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zhuzhurk for the great suggestion! It's easier to understand~
I updated the original commit directly due to this change is easy enough, and the change can be viewed by this link[1].
f91f422
to
08efd3f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thanks everyone for the review, merging~ |
What is the purpose of the change
See FLIP-364: Improve the exponential-delay restart-strategy . This PR includes the subtask2 of FLIP-364.
Brief change log
currentBackoffMS
andlastFailureTimestamp
restart-strategy.exponential-delay.attempts-before-reset-backoff
Verifying this change
ExponentialDelayRestartBackoffTimeStrategyTest
.testAlwaysRestart
totestMaxAttempts
inExponentialDelayRestartBackoffTimeStrategyTest
.Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: yesDocumentation
restart-strategy.exponential-delay.attempts-before-reset-backoff
.