-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-12669][runtime] Implement FixedDelayRestartBackoffTimeStrategy #8698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-12669][runtime] Implement FixedDelayRestartBackoffTimeStrategy #8698
Conversation
|
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. DetailsThe Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
tillrohrmann
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening this PR @eaglewatcherwb. I had three minor comments which we should address before merging this PR.
| @PublicEvolving | ||
| public static final ConfigOption<String> RESTART_BACKOFF_TIME_STRATEGY_FIXED_DELAY_BACKOFF_TIME = ConfigOptions | ||
| .key("restart-backoff-time-strategy.fixed-delay.backoff-time") | ||
| .defaultValue("1 min") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be in favor of specifying the backoff time in milli seconds in order to get eventually rid of Scala's FiniteDuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
| RESTART_BACKOFF_TIME_STRATEGY_FIXED_DELAY_ATTEMPTS); | ||
| this.backoffTimeMS = getIntervalMSFromConfiguration( | ||
| configuration, | ||
| RESTART_BACKOFF_TIME_STRATEGY_FIXED_DELAY_BACKOFF_TIME); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to move the parsing of the configuration out of the factory into the factory method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
| private long getIntervalMSFromConfiguration(Configuration configuration, ConfigOption<String> configOption) { | ||
| Duration interval = Duration.apply(configuration.getString(configOption)); | ||
| return Time.milliseconds(interval.toMillis()).toMilliseconds(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate logic wrt the FailureRateRestartBackoffTimeStrategy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
zhuzhurk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Bo for the PR. I have a few minor comments.
| @PublicEvolving | ||
| public static final ConfigOption<Integer> RESTART_BACKOFF_TIME_STRATEGY_FIXED_DELAY_ATTEMPTS = ConfigOptions | ||
| .key("restart-backoff-time-strategy.fixed-delay.attempts") | ||
| .defaultValue(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, a large number would be better here, e.g. Int.Max.
Most of our Flink streaming customers would like the job to try more times before it fails. Especially when the cluster is not very stable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The legacy default value of restart-strategy.fixed-delay.attempts is 1, or Integer.MAX_VALUE if activated by checkpointing, let's make it Integer.MAX_VALUE
| @PublicEvolving | ||
| public static final ConfigOption<String> RESTART_BACKOFF_TIME_STRATEGY_FIXED_DELAY_BACKOFF_TIME = ConfigOptions | ||
| .key("restart-backoff-time-strategy.fixed-delay.backoff-time") | ||
| .defaultValue("1 min") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 min seems to be a bit too long here.
The legacy FixedDelayRestartStrategy take 0 as the default value.
In our production practice we usually set it to be 10s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, 10s is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually be in favor of 0 because a delay is not strictly needed anymore with proper support for queued scheduling.
tillrohrmann
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my comments @eaglewatcherwb and the review @zhuzhurk. I had two last comments and then we can merge this PR.
| <td><h5>restart-backoff-time-strategy.fixed-delay.backoff-time</h5></td> | ||
| <td style="word-wrap: break-word;">"1 min"</td> | ||
| <td>Backoff time between two consecutive restart attempts.</td> | ||
| </tr> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be regenerated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
| @PublicEvolving | ||
| public static final ConfigOption<Long> RESTART_BACKOFF_TIME_STRATEGY_FIXED_DELAY_BACKOFF_TIME = ConfigOptions | ||
| .key("restart-backoff-time-strategy.fixed-delay.backoff-time") | ||
| .defaultValue(10_000L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's set the default value to 0. With queued scheduling there is no longer a strict need for a delay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
* use ManualClock in unit test and add configuration description This closes apache#8573.
tillrohrmann
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my comments @eaglewatcherwb. LGTM. Merging this PR.
5b922dc to
2d993b2
Compare
What is the purpose of the change
Implement FixedDelayRestartBackoffTimeStrategy.
Brief change log
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)Documentation