[FLINK-12670][runtime] Implement FailureRateRestartBackoffTimeStrategy #8573

eaglewatcherwb · 2019-05-30T09:50:07Z

Change-Id: I57b9ae57ea2ac3c501e96a4c84d39346e46ab8bc

What is the purpose of the change

Implement FailureRateRestartBackoffTimeStrategy

Brief change log

Implement FailureRateRestartBackoffTimeStrategy

Verifying this change

This change added tests and can be verified as follows:

Added unit tests in FailureRateRestartBackoffTimeStrategyTest

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (no)
If yes, how is the feature documented? (not documented)

flinkbot · 2019-05-30T09:52:23Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flink-core/src/main/java/org/apache/flink/configuration/ConfigConstants.java

...pache/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategy.java

...e/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategyTest.java

zhuzhurk

Thanks Bo for this PR. The change looks good overall. Just a few minor comments.

For the config options, let's keep them in RestartBackoffTimeStrategyOptions at the moment. Future customized restart strategies are supposed to keep their config options by themselves.

zhuzhurk · 2019-06-06T03:32:15Z

...pache/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategy.java

+
+	private final String strategyString;
+
+	public FailureRateRestartBackoffTimeStrategy(int maxFailuresPerInterval, long failuresInterval, long backoffTimeMS) {


The constructor can be private.

zhuzhurk · 2019-06-06T07:01:33Z

...pache/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategy.java

+
+	@Override
+	public long getBackoffTime() {
+		return backoffTimeMS;


Should getBackoffTime throw an exception if it is called while restarting is suppressed?

getBackoffTime is called only when canRestart returns true. Under this contract, it is never called while restarting is suppressed.
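
The contract described in this reply can be sketched from the caller's side. This is a minimal sketch and not code from this PR: the names RestartDecider and FailureHandlingSketch are hypothetical, and the call order (record the failure, ask canRestart, then read the backoff) is an assumption about how the scheduler would use the strategy. Only the three method names come from the diff snippets in this review.

```java
import java.util.OptionalLong;

// Hypothetical stand-in for the strategy interface; only the three method
// names are taken from the diff snippets in this review.
interface RestartDecider {
    boolean canRestart();

    long getBackoffTime();

    void notifyFailure(Throwable cause);
}

final class FailureHandlingSketch {
    private FailureHandlingSketch() {}

    // Returns the delay before restarting, or empty if restarting is suppressed.
    // getBackoffTime() is only reached on the canRestart() == true branch,
    // so the strategy never has to answer it in the suppressed state.
    static OptionalLong decide(RestartDecider strategy, Throwable cause) {
        strategy.notifyFailure(cause);
        if (strategy.canRestart()) {
            return OptionalLong.of(strategy.getBackoffTime());
        }
        return OptionalLong.empty();
    }
}
```

With a caller shaped like this, a strategy could throw from getBackoffTime in the suppressed state without affecting normal operation, but the guard in the caller already makes that case unreachable.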

zhuzhurk · 2019-06-06T07:58:35Z

...e/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategyTest.java

+		for (int failuresLeft = numFailures; failuresLeft > 0; failuresLeft--) {
+			assertTrue(restartStrategy.canRestart());
+			restartStrategy.notifyFailure(failure);
+			Thread.sleep(2 * intervalMS);


Thread.sleep time may not be very stable.
I think you can refer to FailureRateRestartStrategyTest.sleepGuaranteed(long) implementations to guarantee the sleep interval.

Imo sleep is not acceptable in unit tests. I think we need to be able to control the time, as such:

flink/flink-streaming-java/src/test/java/org/apache/flink/streaming/api/functions/sink/TwoPhaseCommitSinkFunctionTest.java

Lines 190 to 214 in 3508465

	clock.setEpochMilli(0);

	harness.open();
	harness.processElement("42", 0);

	final OperatorSubtaskState snapshot = harness.snapshot(0, 1);
	harness.notifyOfCompletedCheckpoint(1);

	throwException.set(true);

	closeTestHarness();
	setUpTestHarness();

	final long transactionTimeout = 1000;
	sinkFunction.setTransactionTimeout(transactionTimeout);
	sinkFunction.ignoreFailuresAfterTransactionTimeout();

	try {
		harness.initializeState(snapshot);
		fail("Expected exception not thrown");
	} catch (RuntimeException e) {
		assertEquals("Expected exception", e.getMessage());
	}

	clock.setEpochMilli(transactionTimeout + 1);

You can use Flink's clock interface:

https://github.com/apache/flink/blob/350846507450ba7afa0159a2c574bd2f44bacaac/flink-runtime/src/main/java/org/apache/flink/runtime/util/clock/Clock.java

https://github.com/apache/flink/blob/350846507450ba7afa0159a2c574bd2f44bacaac/flink-runtime/src/test/java/org/apache/flink/runtime/util/clock/ManualClock.java

Thanks a lot for the example, I will do it that way.
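
To illustrate the suggestion of controlling time instead of sleeping, here is a minimal sketch. TimeSource, SettableTimeSource, and IntervalGate are hypothetical names invented for this example. They only mimic the idea behind Flink's Clock and ManualClock and are not their actual API.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical minimal clock abstraction (not Flink's Clock interface).
interface TimeSource {
    long absoluteTimeMillis();
}

// Test-only clock: time moves only when the test advances it.
final class SettableTimeSource implements TimeSource {
    private long currentMillis;

    @Override
    public long absoluteTimeMillis() {
        return currentMillis;
    }

    void advanceTime(long duration, TimeUnit unit) {
        currentMillis += unit.toMillis(duration);
    }
}

// Example time-dependent component: reports whether strictly more than
// intervalMS has passed since it was created.
final class IntervalGate {
    private final TimeSource clock;
    private final long startMillis;
    private final long intervalMS;

    IntervalGate(TimeSource clock, long intervalMS) {
        this.clock = clock;
        this.startMillis = clock.absoluteTimeMillis();
        this.intervalMS = intervalMS;
    }

    boolean hasElapsed() {
        return clock.absoluteTimeMillis() - startMillis > intervalMS;
    }
}
```

In a test, the component receives a SettableTimeSource and the test calls advanceTime where it would otherwise sleep, so the result does not depend on scheduler jitter. Production code could pass a wall-clock implementation such as `System::currentTimeMillis`, since TimeSource has a single abstract method.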

azagrebin

Thanks for the PR @eaglewatcherwb, I left some comments.

azagrebin · 2019-06-06T09:29:25Z

...pache/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategy.java

+
+		private final long backoffTimeMS;
+
+		public FailureRateRestartBackoffTimeStrategyFactory(final Configuration configuration) {


Do we take the legacy options into account somewhere? They are used in FailureRateRestartStrategy.createFactory.

We should also plan when we are going to deprecate the legacy options, probably later when this code is active.

The backward-compatibility work is done in this Jira ticket.

azagrebin · 2019-06-06T09:30:17Z

docs/_includes/generated/restart_backoff_time_strategy_configuration.html

@@ -0,0 +1,26 @@
+<table class="table table-bordered">


Is this a manually created file, or was it generated with https://github.com/apache/flink/tree/master/flink-docs#configuration-documentation?

It is generated by the method described in that doc.

azagrebin · 2019-06-06T09:50:36Z

flink-core/src/main/java/org/apache/flink/configuration/RestartBackoffTimeStrategyOptions.java

+ * Configuration options for the RestartBackoffTimeStrategy.
+ */
+@PublicEvolving
+public class RestartBackoffTimeStrategyOptions {


Maybe we do not want a config class for each restart strategy (basically a separate high-level chapter in the generated user docs).

The config options of both FixedDelayRestartBackoffTimeStrategy and FailureRateRestartBackoffTimeStrategy, which are strategies for the new scheduler, will be put in this file.

azagrebin · 2019-06-06T09:51:15Z

flink-core/src/main/java/org/apache/flink/configuration/RestartBackoffTimeStrategyOptions.java

+	 */
+	@PublicEvolving
+	public static final ConfigOption<Integer> RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_MAX_FAILURES_PER_INTERVAL =
+		key("restart-backoff-time-strategy.failure-rate.max-failures-per-interval").defaultValue(1);


It would also be nice to add a description for the user doc table.

zhuzhurk · 2019-06-11T04:10:56Z

...e/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategyTest.java

+		final long intervalMS = 10_000L;
+
+		final FailureRateRestartBackoffTimeStrategy restartStrategy =
+			new FailureRateRestartBackoffTimeStrategy(SystemClock.getInstance(), numFailures, intervalMS, 0);


It's better to use ManualClock here, as the system time elapsing is considered unstable in this case.

zhuzhurk · 2019-06-11T04:12:21Z

...e/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategyTest.java

+		final long backoffTimeMS = 10_000L;
+
+		final FailureRateRestartBackoffTimeStrategy restartStrategy =
+			new FailureRateRestartBackoffTimeStrategy(SystemClock.getInstance(), 1, 1, backoffTimeMS);


Same as above. It's better to use ManualClock here.

zhuzhurk · 2019-06-11T04:14:51Z

...pache/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategy.java

+	private final Clock clock;
+
+	FailureRateRestartBackoffTimeStrategy(Clock clock, int maxFailuresPerInterval, long failuresInterval, long backoffTimeMS) {
+


Naming this parameter failuresIntervalMS would make it clearer that the value is in milliseconds.

tillrohrmann

Thanks for creating the FailureRateRestartBackoffTimeStrategy @eaglewatcherwb. I had some comments and questions which we should resolve before moving forward.

tillrohrmann · 2019-06-13T09:14:59Z

flink-core/src/main/java/org/apache/flink/configuration/RestartBackoffTimeStrategyOptions.java

+	@PublicEvolving
+	public static final ConfigOption<String> RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_INTERVAL = ConfigOptions
+		.key("restart-backoff-time-strategy.failure-rate.failure-rate-interval")
+		.defaultValue("1 min")


I'd be in favor of specifying the interval in milliseconds because in the long run we want to get rid of Scala in the runtime. Thus, at some point we must no longer rely on Scala's FiniteDuration for the parsing.

tillrohrmann · 2019-06-13T09:15:11Z

flink-core/src/main/java/org/apache/flink/configuration/RestartBackoffTimeStrategyOptions.java

+	@PublicEvolving
+	public static final ConfigOption<String> RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_BACKOFF_TIME = ConfigOptions
+		.key("restart-backoff-time-strategy.failure-rate.backoff-time")
+		.defaultValue("1 min")


I'd be in favor of specifying the backoff time in milliseconds because in the long run we want to get rid of Scala in the runtime. Thus, at some point we must no longer rely on Scala's FiniteDuration for the parsing.

tillrohrmann · 2019-06-13T09:23:18Z

...pache/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategy.java

+				RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_INTERVAL);
+			this.backoffTimeMS = getIntervalMSFromConfiguration(
+				configuration,
+				RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_BACKOFF_TIME);


I would move the configuration parsing logic into the createFactory method because it now mixes concerns.
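
The separation suggested here can be sketched as follows. This is a sketch under stated assumptions and not the code that was merged: FailureRateFactorySketch is a hypothetical name, a plain `Map<String, String>` stands in for Flink's Configuration, all values are assumed to be plain millisecond numbers, and the fallback values for the interval and the backoff time are illustrative only. The three keys are copied from the diff under review.

```java
import java.util.Map;

// Hypothetical factory: it holds already-parsed values and its constructor
// knows nothing about configuration objects.
final class FailureRateFactorySketch {
    // Keys as they appear in the diff under review.
    static final String MAX_FAILURES_KEY =
        "restart-backoff-time-strategy.failure-rate.max-failures-per-interval";
    static final String INTERVAL_KEY =
        "restart-backoff-time-strategy.failure-rate.failure-rate-interval";
    static final String BACKOFF_KEY =
        "restart-backoff-time-strategy.failure-rate.backoff-time";

    final int maxFailuresPerInterval;
    final long failuresIntervalMS;
    final long backoffTimeMS;

    FailureRateFactorySketch(int maxFailuresPerInterval, long failuresIntervalMS, long backoffTimeMS) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.failuresIntervalMS = failuresIntervalMS;
        this.backoffTimeMS = backoffTimeMS;
    }

    // All configuration parsing lives in this one static method.
    // The Map is an assumption standing in for Flink's Configuration,
    // and the fallback values "60000" and "0" are illustrative.
    static FailureRateFactorySketch createFactory(Map<String, String> configuration) {
        int maxFailures = Integer.parseInt(configuration.getOrDefault(MAX_FAILURES_KEY, "1"));
        long intervalMS = Long.parseLong(configuration.getOrDefault(INTERVAL_KEY, "60000"));
        long backoffMS = Long.parseLong(configuration.getOrDefault(BACKOFF_KEY, "0"));
        return new FailureRateFactorySketch(maxFailures, intervalMS, backoffMS);
    }
}
```

Keeping the constructor free of parsing means tests can build the factory directly from numbers, while createFactory remains the single place that depends on key names and value formats.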

tillrohrmann · 2019-06-13T09:32:25Z

...pache/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategy.java

+			Long now = clock.absoluteTimeMillis();
+			Long earliestFailure = failureTimestamps.peek();
+
+			return (now - earliestFailure) > failuresIntervalMS;


How exactly is the interval defined? Assuming my interval is 2L and I have two failures T(0), T(2) occurring at time 0 and 2, then it would not restart. Thus, the boundaries of the interval [0, 2] are inclusive. An alternative could be to make the right boundary exclusive [0, 2).

Yes, I think both are OK. Let’s keep it as now, since it is also consistent with the definition of legacy FailureRateRestartStrategy.
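
The boundary question can be made concrete with a small sketch. Only the comparison `(now - earliestFailure) > failuresIntervalMS` is taken from the diff above. The class name FailureRateSketch, the bounded queue of timestamps, and the use of a LongSupplier as the clock are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Hypothetical sketch of a failure-rate check; not the class from this PR.
final class FailureRateSketch {
    private final LongSupplier clockMillis; // stand-in for a Clock
    private final int maxFailuresPerInterval;
    private final long failuresIntervalMS;
    // Holds at most maxFailuresPerInterval timestamps, oldest first.
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateSketch(LongSupplier clockMillis, int maxFailuresPerInterval, long failuresIntervalMS) {
        if (maxFailuresPerInterval < 1) {
            throw new IllegalArgumentException("maxFailuresPerInterval must be at least 1");
        }
        this.clockMillis = clockMillis;
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.failuresIntervalMS = failuresIntervalMS;
    }

    // Records a failure, dropping the oldest timestamp once the queue is full.
    void notifyFailure() {
        if (failureTimestamps.size() >= maxFailuresPerInterval) {
            failureTimestamps.remove();
        }
        failureTimestamps.add(clockMillis.getAsLong());
    }

    // Restart is allowed unless the queue is full and its oldest entry is
    // still within the interval. The strict '>' makes the right boundary
    // inclusive: a gap of exactly failuresIntervalMS still suppresses.
    boolean canRestart() {
        if (failureTimestamps.size() < maxFailuresPerInterval) {
            return true;
        }
        long now = clockMillis.getAsLong();
        long earliestFailure = failureTimestamps.peek();
        return (now - earliestFailure) > failuresIntervalMS;
    }
}
```

Assuming a maximum of two failures per interval and an interval of 2, failures at times 0 and 2 fill the queue, and canRestart evaluated at time 2 computes 2 - 0 = 2, which is not greater than 2, so the restart is suppressed. Evaluated at time 3 it would be allowed again. Switching to `>=` would give the exclusive right boundary [0, 2) mentioned as the alternative.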

tillrohrmann · 2019-06-13T09:32:58Z

...e/flink/runtime/executiongraph/failover/flip1/FailureRateRestartBackoffTimeStrategyTest.java

+		for (int failuresLeft = numFailures; failuresLeft > 0; failuresLeft--) {
+			assertTrue(restartStrategy.canRestart());
+			restartStrategy.notifyFailure(failure);
+			clock.advanceTime(intervalMS + 1, TimeUnit.MILLISECONDS);


I guess the reason why we increment by intervalMS + 1 is that the right boundary of the failure interval is inclusive, right?

Yes, you are right.

tillrohrmann

Thanks for addressing my comments @eaglewatcherwb. I had two last comments which we need to address.

tillrohrmann · 2019-06-13T13:07:09Z

flink-core/src/main/java/org/apache/flink/configuration/RestartBackoffTimeStrategyOptions.java

+	@PublicEvolving
+	public static final ConfigOption<Long> RESTART_BACKOFF_TIME_STRATEGY_FAILURE_RATE_FAILURE_RATE_BACKOFF_TIME = ConfigOptions
+		.key("restart-backoff-time-strategy.failure-rate.backoff-time")
+		.defaultValue(10_000L)


Let's set the default value to 0 because with queued scheduling there is no strict need for a delay between restarts.

tillrohrmann · 2019-06-13T13:07:20Z

docs/_includes/generated/restart_backoff_time_strategy_configuration.html

+            <td>Maximum number of failures in given time interval before failing a job.</td>
+        </tr>
+    </tbody>
+</table>


This needs to be regenerated.

tillrohrmann

Thanks for addressing my comments @eaglewatcherwb. LGTM. Merging this PR.

* use ManualClock in unit test and add configuration description

This closes apache#8573.

rmetzger added review=description? component=Runtime/Coordination labels May 30, 2019

zhuzhurk reviewed May 30, 2019

klion26 reviewed May 31, 2019

zhuzhurk reviewed Jun 6, 2019

azagrebin reviewed Jun 6, 2019

zhuzhurk reviewed Jun 11, 2019

tillrohrmann requested changes Jun 13, 2019

tillrohrmann approved these changes Jun 13, 2019

tillrohrmann force-pushed the FLINK-12670-failure-rate-strategy branch from 5038ff2 to 45ddafd Compare June 13, 2019 14:57

tillrohrmann closed this in 668172e Jun 14, 2019

eaglewatcherwb deleted the FLINK-12670-failure-rate-strategy branch June 14, 2019 07:40

sjwiesman pushed a commit (1166837) to sjwiesman/flink that referenced this pull request Jun 26, 2019
