[FLINK-12670][runtime] Implement FailureRateRestartBackoffTimeStrategy #8573
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community review your pull request.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The bot tracks the review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:
flink-core/src/main/java/org/apache/flink/configuration/ConfigConstants.java
...pache/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategy.java
...e/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategyTest.java
Thanks Bo for this PR. The change looks good overall. Just a few minor comments.
For the config options, let's keep them in RestartBackoffTimeStrategyOptions at the moment. Future customized restart strategies are supposed to keep their config options by themselves.
private final String strategyString;

public FailureRateRestartBackoffTimeStrategy(int maxFailuresPerInterval, long failuresInterval, long backoffTimeMS) {
The constructor can be private.
OK
@Override
public long getBackoffTime() {
    return backoffTimeMS;
Should we throw an exception if the backoff time is requested while restarting is suppressed?
getBackoffTime is called only when canRestart returns true; under this contract, restarting will not be suppressed.
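The contract described in this reply can be sketched from the caller's side. The interface below only reuses the three method names that appear in this thread (canRestart, getBackoffTime, notifyFailure); the interface itself, the RestartDecider helper, and the order of the calls are assumptions made for illustration, not the PR's code.

```java
import java.util.OptionalLong;

// Simplified stand-in; only the method names come from this review thread.
interface BackoffStrategySketch {
    boolean canRestart();

    long getBackoffTime();

    void notifyFailure(Throwable cause);
}

// Hypothetical caller that honours the contract: getBackoffTime() is only
// consulted after canRestart() has returned true.
final class RestartDecider {
    static OptionalLong restartDelay(BackoffStrategySketch strategy, Throwable cause) {
        strategy.notifyFailure(cause);
        if (!strategy.canRestart()) {
            // Restarting is suppressed, so the backoff time is never requested.
            return OptionalLong.empty();
        }
        return OptionalLong.of(strategy.getBackoffTime());
    }
}
```

With a caller shaped like this, a strategy never needs to answer getBackoffTime() while restarts are suppressed, which is why no exception is required there.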
for (int failuresLeft = numFailures; failuresLeft > 0; failuresLeft--) {
    assertTrue(restartStrategy.canRestart());
    restartStrategy.notifyFailure(failure);
    Thread.sleep(2 * intervalMS);
Thread.sleep time may not be very stable.
I think you can refer to FailureRateRestartStrategyTest.sleepGuaranteed(long) implementations to guarantee the sleep interval.
Imo sleep is not acceptable in unit tests. I think we need to be able to control the time, as such:
Lines 190 to 214 in 3508465
clock.setEpochMilli(0);
harness.open();
harness.processElement("42", 0);
final OperatorSubtaskState snapshot = harness.snapshot(0, 1);
harness.notifyOfCompletedCheckpoint(1);
throwException.set(true);
closeTestHarness();
setUpTestHarness();
final long transactionTimeout = 1000;
sinkFunction.setTransactionTimeout(transactionTimeout);
sinkFunction.ignoreFailuresAfterTransactionTimeout();
try {
    harness.initializeState(snapshot);
    fail("Expected exception not thrown");
} catch (RuntimeException e) {
    assertEquals("Expected exception", e.getMessage());
}
clock.setEpochMilli(transactionTimeout + 1);
You can use Flink's clock interface:
Thanks a lot for the example, I will do it that way.
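The idea of controlling time in the test, instead of sleeping, could look roughly like the sketch below. SketchClock and SketchManualClock are hypothetical stand-ins modelled on the absoluteTimeMillis() and advanceTime(...) calls quoted elsewhere in this review; FailureRateSketch is a simplified strategy whose queue handling is an assumption, not the code under review.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.TimeUnit;

// Hypothetical clock abstraction; the strategy reads time only through it.
interface SketchClock {
    long absoluteTimeMillis();
}

// Hypothetical manually advanced clock: time moves only when the test says so.
final class SketchManualClock implements SketchClock {
    private long nowMs;

    @Override
    public long absoluteTimeMillis() {
        return nowMs;
    }

    void advanceTime(long duration, TimeUnit unit) {
        nowMs += unit.toMillis(duration);
    }
}

// Simplified failure-rate check (assumed queue handling, for illustration only).
final class FailureRateSketch {
    private final SketchClock clock;
    private final int maxFailuresPerInterval;
    private final long failuresIntervalMs;
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateSketch(SketchClock clock, int maxFailuresPerInterval, long failuresIntervalMs) {
        this.clock = clock;
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.failuresIntervalMs = failuresIntervalMs;
    }

    void notifyFailure() {
        if (failureTimestamps.size() >= maxFailuresPerInterval) {
            failureTimestamps.remove(); // drop the oldest recorded failure
        }
        failureTimestamps.add(clock.absoluteTimeMillis());
    }

    boolean canRestart() {
        if (failureTimestamps.size() < maxFailuresPerInterval) {
            return true;
        }
        // Restart only once the oldest recorded failure is older than the interval.
        return (clock.absoluteTimeMillis() - failureTimestamps.peek()) > failuresIntervalMs;
    }
}
```

A test can then record failures, call advanceTime on the manual clock, and assert on canRestart() without any real waiting, so the result does not depend on scheduler timing.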
Thanks for the PR @eaglewatcherwb, I left some comments.
private final long backoffTimeMS;

public FailureRateRestartBackoffTimeStrategyFactory(final Configuration configuration) {
Do we take the legacy options into account somewhere? They are used in FailureRateRestartStrategy.createFactory.
We should also plan when we are going to deprecate the legacy options, probably later when this code is active.
The backward-compatibility work is handled in this Jira ticket.
@@ -0,0 +1,26 @@
<table class="table table-bordered">
Is this file created manually or generated with https://github.com/apache/flink/tree/master/flink-docs#configuration-documentation?
It is generated using the method described in the doc.
 * Configuration options for the RestartBackoffTimeStrategy.
 */
@PublicEvolving
public class RestartBackoffTimeStrategyOptions {
Maybe we do not want a config class for each restart strategy (each one basically becomes a separate high-level chapter in the generated user docs).
The config options of both FixedDelayRestartBackoffTimeStrategy and FailureRateRestartBackoffTimeStrategy, which are strategies for the new scheduler, will be put in this file.
 */
@PublicEvolving
public static final ConfigOption<Integer> RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_MAX_FAILURES_PER_INTERVAL =
    key("restart-backoff-time-strategy.failure-rate.max-failures-per-interval").defaultValue(1);
It would also be nice to add a description for the user doc table.
OK
final long intervalMS = 10_000L;

final FailureRateRestartBackoffTimeStrategy restartStrategy =
    new FailureRateRestartBackoffTimeStrategy(SystemClock.getInstance(), numFailures, intervalMS, 0);
It's better to use ManualClock here, since relying on elapsed system time makes this test unstable.
OK
final long backoffTimeMS = 10_000L;

final FailureRateRestartBackoffTimeStrategy restartStrategy =
    new FailureRateRestartBackoffTimeStrategy(SystemClock.getInstance(), 1, 1, backoffTimeMS);
Same as above. It's better to use ManualClock here.
OK
private final Clock clock;

FailureRateRestartBackoffTimeStrategy(Clock clock, int maxFailuresPerInterval, long failuresInterval, long backoffTimeMS) {
failuresIntervalMS would be a better name, to show that the value is in milliseconds.
OK
Thanks for creating the FailureRateRestartBackoffTimeStrategy @eaglewatcherwb. I had some comments and questions which we should resolve before moving forward.
@PublicEvolving
public static final ConfigOption<String> RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_INTERVAL = ConfigOptions
    .key("restart-backoff-time-strategy.failure-rate.failure-rate-interval")
    .defaultValue("1 min")
I'd be in favor of specifying the interval in milliseconds because in the long run we want to get rid of Scala in the runtime. Thus, at some point we must no longer rely on Scala's FiniteDuration for the parsing.
OK
@PublicEvolving
public static final ConfigOption<String> RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_BACKOFF_TIME = ConfigOptions
    .key("restart-backoff-time-strategy.failure-rate.backoff-time")
    .defaultValue("1 min")
I'd be in favor of specifying the backoff time in milliseconds because in the long run we want to get rid of Scala in the runtime. Thus, at some point we must no longer rely on Scala's FiniteDuration for the parsing.
OK
    RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_INTERVAL);
this.backoffTimeMS = getIntervalMSFromConfiguration(
    configuration,
    RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_BACKOFF_TIME);
I would move the configuration parsing logic into the createFactory method because it now mixes concerns.
OK
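One way to read this suggestion is sketched below: a static createFactory method does all the parsing, and the constructor only receives already-parsed values. The class name FailureRateFactorySketch is hypothetical, java.util.Properties stands in for Flink's Configuration, and the fallback defaults are illustrative assumptions; only the three option keys are taken from this review.

```java
import java.util.Properties;

// Hypothetical factory: parsing is kept out of the constructor.
final class FailureRateFactorySketch {
    // Option keys as quoted in this review.
    static final String MAX_FAILURES_KEY =
        "restart-backoff-time-strategy.failure-rate.max-failures-per-interval";
    static final String INTERVAL_KEY =
        "restart-backoff-time-strategy.failure-rate.failure-rate-interval";
    static final String BACKOFF_KEY =
        "restart-backoff-time-strategy.failure-rate.backoff-time";

    final int maxFailuresPerInterval;
    final long failuresIntervalMs;
    final long backoffTimeMs;

    // The constructor only stores values; it knows nothing about configuration.
    FailureRateFactorySketch(int maxFailuresPerInterval, long failuresIntervalMs, long backoffTimeMs) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.failuresIntervalMs = failuresIntervalMs;
        this.backoffTimeMs = backoffTimeMs;
    }

    // All parsing lives here. Values are plain milliseconds, so no duration
    // parser is needed. Properties is a stand-in for Flink's Configuration.
    static FailureRateFactorySketch createFactory(Properties config) {
        int maxFailures = Integer.parseInt(config.getProperty(MAX_FAILURES_KEY, "1"));
        long intervalMs = Long.parseLong(config.getProperty(INTERVAL_KEY, "60000"));
        long backoffMs = Long.parseLong(config.getProperty(BACKOFF_KEY, "0"));
        return new FailureRateFactorySketch(maxFailures, intervalMs, backoffMs);
    }
}
```

Keeping the constructor free of parsing also makes it straightforward to construct the factory directly in tests with fixed values.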
Long now = clock.absoluteTimeMillis();
Long earliestFailure = failureTimestamps.peek();

return (now - earliestFailure) > failuresIntervalMS;
How exactly is the interval defined? Assuming my interval is 2L and I have two failures T(0), T(2) occurring at time 0 and 2, then it would not restart. Thus, the boundaries of the interval [0, 2] are inclusive. An alternative could be to make the right boundary exclusive [0, 2).
Yes, I think both are OK. Let's keep it as it is now, since it is also consistent with the definition of the legacy FailureRateRestartStrategy.
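The boundary question can be checked with the numbers from the comment above. The helper below isolates the comparison from the reviewed snippet; the class and method names are hypothetical and exist only for this illustration.

```java
final class IntervalBoundarySketch {
    // Same comparison as in the reviewed snippet: strictly greater than the interval.
    static boolean earliestFailureExpired(long nowMs, long earliestFailureMs, long failuresIntervalMs) {
        return (nowMs - earliestFailureMs) > failuresIntervalMs;
    }

    public static void main(String[] args) {
        long intervalMs = 2L;
        // Failures at 0 and 2: 2 - 0 = 2 is not > 2, so both fall into [0, 2].
        System.out.println(earliestFailureExpired(2L, 0L, intervalMs)); // false
        // At time 3 the failure at 0 has left the window.
        System.out.println(earliestFailureExpired(3L, 0L, intervalMs)); // true
    }
}
```

Switching the comparison to >= would give the exclusive right boundary [0, 2) that the reviewer mentions as an alternative.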
for (int failuresLeft = numFailures; failuresLeft > 0; failuresLeft--) {
    assertTrue(restartStrategy.canRestart());
    restartStrategy.notifyFailure(failure);
    clock.advanceTime(intervalMS + 1, TimeUnit.MILLISECONDS);
I guess the reason why we increment by intervalMS + 1 is that the right boundary of the failure interval is inclusive, right?
Yes, you are right.
Thanks for addressing my comments @eaglewatcherwb. I had two last comments which we need to address.
@PublicEvolving
public static final ConfigOption<Long> RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_BACKOFF_TIME = ConfigOptions
    .key("restart-backoff-time-strategy.failure-rate.backoff-time")
    .defaultValue(10_000L)
Let's set the default value to 0 because with queued scheduling there is no strict need for a delay between restarts.
OK
<td>Maximum number of failures in given time interval before failing a job.</td>
</tr>
</tbody>
</table>
This needs to be regenerated.
OK
Thanks for addressing my comments @eaglewatcherwb. LGTM. Merging this PR.
* use ManualClock in unit test and add configuration description This closes apache#8573.
5038ff2 to 45ddafd
Change-Id: I57b9ae57ea2ac3c501e96a4c84d39346e46ab8bc
What is the purpose of the change
Implement FailureRateRestartBackoffTimeStrategy
Brief change log
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)
Documentation