@eaglewatcherwb eaglewatcherwb commented May 30, 2019

Change-Id: I57b9ae57ea2ac3c501e96a4c84d39346e46ab8bc

What is the purpose of the change

Implement FailureRateRestartBackoffTimeStrategy

Brief change log

  • Implement FailureRateRestartBackoffTimeStrategy

Verifying this change

This change added tests and can be verified as follows:

  • Added unit tests in FailureRateRestartBackoffTimeStrategyTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not documented)

@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The bot tracks the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

Contributor

@zhuzhurk zhuzhurk left a comment

Thanks Bo for this PR. The change looks good overall. Just a few minor comments.

For the config options, let's keep them in RestartBackoffTimeStrategyOptions at the moment. Future customized restart strategies are supposed to keep their config options by themselves.


private final String strategyString;

public FailureRateRestartBackoffTimeStrategy(int maxFailuresPerInterval, long failuresInterval, long backoffTimeMS) {
Contributor

The constructor can be private.

Contributor Author

OK


@Override
public long getBackoffTime() {
    return backoffTimeMS;
Contributor

Should we throw an exception if the backoff time is requested while restarting is suppressed?

Contributor Author

getBackoffTime is called only when canRestart returns true. Under this contract, restarting is never suppressed at the point where the backoff time is requested.

for (int failuresLeft = numFailures; failuresLeft > 0; failuresLeft--) {
    assertTrue(restartStrategy.canRestart());
    restartStrategy.notifyFailure(failure);
    Thread.sleep(2 * intervalMS);
Contributor

The actual Thread.sleep duration may not be very stable.
I think you can refer to the FailureRateRestartStrategyTest.sleepGuaranteed(long) implementation to guarantee the sleep interval.

Member

Imo sleep is not acceptable in unit tests. I think we need to be able to control the time, as such:

clock.setEpochMilli(0);
harness.open();
harness.processElement("42", 0);
final OperatorSubtaskState snapshot = harness.snapshot(0, 1);
harness.notifyOfCompletedCheckpoint(1);
throwException.set(true);
closeTestHarness();
setUpTestHarness();
final long transactionTimeout = 1000;
sinkFunction.setTransactionTimeout(transactionTimeout);
sinkFunction.ignoreFailuresAfterTransactionTimeout();
try {
    harness.initializeState(snapshot);
    fail("Expected exception not thrown");
} catch (RuntimeException e) {
    assertEquals("Expected exception", e.getMessage());
}
clock.setEpochMilli(transactionTimeout + 1);

You can use Flink's clock interface:

https://github.com/apache/flink/blob/350846507450ba7afa0159a2c574bd2f44bacaac/flink-runtime/src/main/java/org/apache/flink/runtime/util/clock/Clock.java

https://github.com/apache/flink/blob/350846507450ba7afa0159a2c574bd2f44bacaac/flink-runtime/src/test/java/org/apache/flink/runtime/util/clock/ManualClock.java
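
The clock-injection idea above can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: `SketchClock`, `SketchManualClock`, and `FailureRateSketch` are hypothetical stand-ins (Flink's real `Clock` and `ManualClock` are at the links above), and the strategy logic is an assumption based on the snippets quoted in this review.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a clock abstraction (not Flink's Clock). */
interface SketchClock {
    long absoluteTimeMillis();
}

/** Hypothetical test clock that only moves when the test advances it. */
final class SketchManualClock implements SketchClock {
    private long nowMillis;

    @Override
    public long absoluteTimeMillis() {
        return nowMillis;
    }

    void advanceMillis(long deltaMillis) {
        nowMillis += deltaMillis;
    }
}

/** Simplified failure-rate check; an assumption, not the PR's implementation. */
final class FailureRateSketch {
    private final SketchClock clock;
    private final int maxFailuresPerInterval;
    private final long failuresIntervalMS;
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateSketch(SketchClock clock, int maxFailuresPerInterval, long failuresIntervalMS) {
        if (maxFailuresPerInterval < 1) {
            throw new IllegalArgumentException("maxFailuresPerInterval must be at least 1");
        }
        this.clock = clock;
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.failuresIntervalMS = failuresIntervalMS;
    }

    void notifyFailure() {
        // Keep only the most recent maxFailuresPerInterval timestamps.
        if (failureTimestamps.size() >= maxFailuresPerInterval) {
            failureTimestamps.remove();
        }
        failureTimestamps.add(clock.absoluteTimeMillis());
    }

    boolean canRestart() {
        if (failureTimestamps.size() < maxFailuresPerInterval) {
            return true;
        }
        long earliestFailure = failureTimestamps.peek();
        return (clock.absoluteTimeMillis() - earliestFailure) > failuresIntervalMS;
    }
}

final class ManualClockDemo {
    public static void main(String[] args) {
        SketchManualClock clock = new SketchManualClock();
        FailureRateSketch strategy = new FailureRateSketch(clock, 1, 10_000L);

        strategy.notifyFailure();
        System.out.println(strategy.canRestart()); // false: the failure is still inside the interval
        clock.advanceMillis(10_001L);              // time moves without Thread.sleep
        System.out.println(strategy.canRestart()); // true: the interval has elapsed
    }
}
```

Because the test owns the clock, the outcome does not depend on scheduler jitter or on how long the test machine actually takes to run.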

Contributor Author

Thanks a lot for the example, I will do it that way.

Contributor

@azagrebin azagrebin left a comment

Thanks for the PR, @eaglewatcherwb. I left some comments.


private final long backoffTimeMS;

public FailureRateRestartBackoffTimeStrategyFactory(final Configuration configuration) {
Contributor

@azagrebin azagrebin Jun 6, 2019

Do we take the legacy options (used in FailureRateRestartStrategy.createFactory) into account somewhere?

We should also plan when to deprecate the legacy options, probably later, once this code is active.

Contributor Author

The backward-compatibility work is done in this Jira ticket

@@ -0,0 +1,26 @@
<table class="table table-bordered">
Contributor

Contributor Author

It is generated by the method in the doc.

* Configuration options for the RestartBackoffTimeStrategy.
*/
@PublicEvolving
public class RestartBackoffTimeStrategyOptions {
Contributor

Maybe we do not want a config class for each restart strategy (basically a separate high-level chapter in the generated user docs).

Contributor Author

The config options of both FixedDelayRestartBackoffTimeStrategy and FailureRateRestartBackoffTimeStrategy will be put in this file. Both are strategies for the new scheduler.

*/
@PublicEvolving
public static final ConfigOption<Integer> RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_MAX_FAILURES_PER_INTERVAL =
    key("restart-backoff-time-strategy.failure-rate.max-failures-per-interval").defaultValue(1);
Contributor

It would also be nice to add a description for the user doc table.

Contributor Author

OK

final long intervalMS = 10_000L;

final FailureRateRestartBackoffTimeStrategy restartStrategy =
    new FailureRateRestartBackoffTimeStrategy(SystemClock.getInstance(), numFailures, intervalMS, 0);
Contributor

It's better to use ManualClock here, because relying on the system time elapsing makes this test unstable.

Contributor Author

OK

final long backoffTimeMS = 10_000L;

final FailureRateRestartBackoffTimeStrategy restartStrategy =
    new FailureRateRestartBackoffTimeStrategy(SystemClock.getInstance(), 1, 1, backoffTimeMS);
Contributor

Same as above. It's better to use ManualClock here.

Contributor Author

OK

private final Clock clock;

FailureRateRestartBackoffTimeStrategy(Clock clock, int maxFailuresPerInterval, long failuresInterval, long backoffTimeMS) {

Contributor

failuresIntervalMS would be a better name, to show that the value is in milliseconds.

Contributor Author

OK

Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for creating the FailureRateRestartBackoffTimeStrategy @eaglewatcherwb. I had some comments and questions which we should resolve before moving forward.

@PublicEvolving
public static final ConfigOption<String> RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_INTERVAL = ConfigOptions
    .key("restart-backoff-time-strategy.failure-rate.failure-rate-interval")
    .defaultValue("1 min")
Contributor

I'd be in favor of specifying the interval in milliseconds because in the long run we want to get rid of Scala in the runtime. Thus, at some point we must no longer rely on Scala's FiniteDuration for the parsing.

Contributor Author

OK

@PublicEvolving
public static final ConfigOption<String> RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_BACKOFF_TIME = ConfigOptions
    .key("restart-backoff-time-strategy.failure-rate.backoff-time")
    .defaultValue("1 min")
Contributor

I'd be in favor of specifying the backoff time in milliseconds because in the long run we want to get rid of Scala in the runtime. Thus, at some point we must no longer rely on Scala's FiniteDuration for the parsing.

Contributor Author

OK

    RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_INTERVAL);
this.backoffTimeMS = getIntervalMSFromConfiguration(
    configuration,
    RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_BACKOFF_TIME);
Contributor

I would move the configuration parsing logic into the createFactory method because it now mixes concerns.

Contributor Author

OK
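
One way to read the suggestion above is sketched here: the static `createFactory` method does all the configuration parsing, and the constructor only receives already-parsed values. This is an assumption about the intended shape, not the merged code. `FailureRateFactorySketch` is a hypothetical name, a plain `Map<String, String>` stands in for Flink's `Configuration`, and the interval and backoff defaults are placeholders for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical factory sketch: the instance holds parsed values only. */
final class FailureRateFactorySketch {
    // Keys as quoted in this review.
    static final String MAX_FAILURES_KEY =
        "restart-backoff-time-strategy.failure-rate.max-failures-per-interval";
    static final String INTERVAL_KEY =
        "restart-backoff-time-strategy.failure-rate.failure-rate-interval";
    static final String BACKOFF_KEY =
        "restart-backoff-time-strategy.failure-rate.backoff-time";

    final int maxFailuresPerInterval;
    final long failuresIntervalMS;
    final long backoffTimeMS;

    // The constructor knows nothing about configuration keys or parsing.
    FailureRateFactorySketch(int maxFailuresPerInterval, long failuresIntervalMS, long backoffTimeMS) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.failuresIntervalMS = failuresIntervalMS;
        this.backoffTimeMS = backoffTimeMS;
    }

    /** All parsing lives here; the Map is a stand-in for Flink's Configuration. */
    static FailureRateFactorySketch createFactory(Map<String, String> configuration) {
        int maxFailures = Integer.parseInt(configuration.getOrDefault(MAX_FAILURES_KEY, "1"));
        // Placeholder defaults, in milliseconds (assumptions for this sketch).
        long intervalMS = Long.parseLong(configuration.getOrDefault(INTERVAL_KEY, "60000"));
        long backoffMS = Long.parseLong(configuration.getOrDefault(BACKOFF_KEY, "0"));
        return new FailureRateFactorySketch(maxFailures, intervalMS, backoffMS);
    }

    public static void main(String[] args) {
        Map<String, String> configuration = new HashMap<>();
        configuration.put(BACKOFF_KEY, "500");
        FailureRateFactorySketch factory = createFactory(configuration);
        System.out.println(factory.backoffTimeMS); // 500
    }
}
```

With this split, tests can call the constructor with plain numbers and never need to build a configuration object.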

Long now = clock.absoluteTimeMillis();
Long earliestFailure = failureTimestamps.peek();

return (now - earliestFailure) > failuresIntervalMS;
Contributor

How exactly is the interval defined? Assuming my interval is 2L and I have two failures T(0), T(2) occurring at time 0 and 2, then it would not restart. Thus, the boundaries of the interval [0, 2] are inclusive. An alternative could be to make the right boundary exclusive [0, 2).

Contributor Author

Yes, I think both are OK. Let's keep it as it is now, since it is also consistent with the definition in the legacy FailureRateRestartStrategy.
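
To make the two boundary definitions discussed above concrete, here is a small sketch. The class and method names are hypothetical. Only the `>` comparison comes from the snippet under review; the `>=` variant is the alternative raised in the comment.

```java
/** Hypothetical helper contrasting the two interval definitions. */
final class IntervalBoundarySketch {

    /** Inclusive right boundary [0, interval]: the comparison used in the reviewed snippet. */
    static boolean canRestartInclusive(long now, long earliestFailure, long failuresIntervalMS) {
        return (now - earliestFailure) > failuresIntervalMS;
    }

    /** Exclusive right boundary [0, interval): the alternative from the review. */
    static boolean canRestartExclusive(long now, long earliestFailure, long failuresIntervalMS) {
        return (now - earliestFailure) >= failuresIntervalMS;
    }

    public static void main(String[] args) {
        long interval = 2L;
        // Earliest failure at T(0), current time 2.
        System.out.println(canRestartInclusive(2L, 0L, interval)); // false
        System.out.println(canRestartExclusive(2L, 0L, interval)); // true
        // One millisecond later the inclusive definition also allows a restart.
        System.out.println(canRestartInclusive(3L, 0L, interval)); // true
    }
}
```

Under the inclusive definition, a test has to advance its clock by the interval plus one millisecond before a restart is allowed again.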

for (int failuresLeft = numFailures; failuresLeft > 0; failuresLeft--) {
    assertTrue(restartStrategy.canRestart());
    restartStrategy.notifyFailure(failure);
    clock.advanceTime(intervalMS + 1, TimeUnit.MILLISECONDS);
Contributor

I guess the reason why we increment by intervalMS + 1 is that the right boundary of the failure interval is inclusive, right?

Contributor Author

Yes, you are right.

Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for addressing my comments @eaglewatcherwb. I had two last comments which we need to address.

@PublicEvolving
public static final ConfigOption<Long> RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_BACKOFF_TIME = ConfigOptions
    .key("restart-backoff-time-strategy.failure-rate.backoff-time")
    .defaultValue(10_000L)
Contributor

Let's set the default value to 0 because with queued scheduling there is no strict need for a delay between restarts.

Contributor Author

OK

<td>Maximum number of failures in given time interval before failing a job.</td>
</tr>
</tbody>
</table>
Contributor

This needs to be regenerated.

Contributor Author

OK

Contributor

@tillrohrmann tillrohrmann left a comment

Thanks for addressing my comments @eaglewatcherwb. LGTM. Merging this PR.

* use ManualClock in unit test and add configuration description

This closes apache#8573.
@tillrohrmann tillrohrmann force-pushed the FLINK-12670-failure-rate-strategy branch from 5038ff2 to 45ddafd Compare June 13, 2019 14:57
@eaglewatcherwb eaglewatcherwb deleted the FLINK-12670-failure-rate-strategy branch June 14, 2019 07:40
sjwiesman pushed a commit to sjwiesman/flink that referenced this pull request Jun 26, 2019