
[FLINK-7216] [distr. coordination] Guard against concurrent global failover #4364

Closed

Conversation

StephanEwen (Contributor)

This is one of the blocker issues for the 1.3.2 release.

What is the purpose of the change

This fixes the bug FLINK-7216, where race conditions can trigger concurrent failovers, leading to a restart storm.

The heart of the bug is that we allow initiating another restart while already being in state RESTARTING. That behavior was introduced as a safety net to catch exceptions (implementation bugs) that are reported in that state and need a full recovery to ensure consistency.

However, this means that multiple restarts may accidentally be triggered/queued and then execute one after another. While one attempt is executing the failover, the next one interferes or aborts (detecting a conflict) and schedules another recovery, leading to the above-mentioned restart storm. The restart storm subsides once one restart attempt makes enough progress (before another interferes) to actually finish the scheduling phase.
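To make the failure mode concrete, here is a hypothetical interleaving (illustrative only, not actual Flink code or log output):

// t0: a task failure triggers a global failover; the graph enters RESTARTING and restart #1 is queued
// t1: a second exception is reported while RESTARTING; the safety net queues restart #2
// t2: restart #1 resets the vertices and starts the scheduling phase
// t3: restart #2 runs concurrently, detects the conflict, fails the job globally again and queues restart #3
// ... the cycle repeats until one attempt finishes its scheduling phase before the next one interferes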

Brief change log

This PR covers three issues, because the first two were needed to prepare the fix.

  • FLINK-6665 and FLINK-6667 introduce an indirection so that the RestartStrategy no longer calls restart() on the ExecutionGraph directly. Instead, it calls a callback to initiate the restart.
  • The actual fix makes sure that the globalModVersion (which tracks global changes such as full restarts in the ExecutionGraph) is unchanged between triggering the restart and executing it. When multiple restart requests are scheduled, only one will actually take effect, while the others detect that they have been subsumed (see the sketch below).
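For illustration, here is a minimal sketch of the guarded restart callback, pieced together from the fields visible in the diff excerpts further down (the ExecutionGraph reference, the atomic "used" flag, and the expected globalModVersion). The constructor and method signatures are assumptions and may differ from the actual code in this PR.

import java.util.concurrent.atomic.AtomicBoolean;

public class ExecutionGraphRestartCallback implements RestartCallback {

    /** The ExecutionGraph to restart. */
    private final ExecutionGraph execGraph;

    /** Atomic flag to make sure this is used only once. */
    private final AtomicBoolean used = new AtomicBoolean(false);

    /** The globalModVersion that the ExecutionGraph needs to have for the restart to go through. */
    private final long expectedGlobalModVersion;

    public ExecutionGraphRestartCallback(ExecutionGraph execGraph, long expectedGlobalModVersion) {
        this.execGraph = execGraph;
        this.expectedGlobalModVersion = expectedGlobalModVersion;
    }

    @Override
    public void triggerFullRecovery() {
        // trigger at most one restart per callback; the ExecutionGraph additionally
        // ignores the call if its globalModVersion has changed in the meantime
        if (used.compareAndSet(false, true)) {
            execGraph.restart(expectedGlobalModVersion);
        }
    }
}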

Verifying this change

This change added the following tests:

  • ExecutionGraphRestartTest#testConcurrentGlobalFailAndRestarts() explicitly tests the behavior under concurrent global failures and restarts.
  • ExecutionGraphRestartTest#testConcurrentLocalFailAndRestart() tests a similar setup for the local failover path.

The general working of that mechanism is also covered by various existing tests in org.apache.flink.runtime.executiongraph.restart.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: yes:

The change affects the restart logic on the JobManager.

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

…nd a ScheduledExecutor for ExecutionGraph restarts

Initial work by zjureel@gmail.com, improved by sewen@apache.org.
@zentol (Contributor) left a comment

Had some checkstyle-related comments, to reduce the number of changes necessary once we introduce checkstyle to this part of flink-runtime.

/** The ExecutionGraph to restart */
private final ExecutionGraph execGraph;

/** Atomic flag to make sure this is used only once */
Contributor

Please add a period here.

*/
public class ExecutionGraphRestartCallback implements RestartCallback {

/** The ExecutionGraph to restart */
Contributor

Please add a period here.


/**
* A {@link RestartCallback} that abstracts restart calls on an {@link ExecutionGraph}.
*
Contributor

Please remove the trailing space.

* Called by the ExecutionGraph to eventually trigger a full recovery.
* The recovery must be triggered on the given callback object, and may be delayed
* with the help of the given scheduled executor.
*
Contributor

Please remove the trailing space.

@@ -37,15 +37,26 @@
/** Atomic flag to make sure this is used only once */
private final AtomicBoolean used;

public ExecutionGraphRestartCallback(ExecutionGraph execGraph) {
/** The globalModVersion that the ExecutionGraph needs to have for the restart to go through */
Contributor

Please add a period.

@StephanEwen (Contributor Author)

Concerning the 'period' check style rule:

I think that the common language rules (not JavaDoc specific) are to add a period after complete sentences. That would mean that parameter descriptions, when not complete sentences, are not terminated by a period.

Are we rolling out a rule that every text line has to be terminated with a period/full stop?

@zentol (Contributor)

zentol commented Jul 19, 2017

With the current rules, the first sentence of any javadoc must end in a period.

So, this is invalid:

/** some parameter */
private final int myParameter ...

But, this is fine:

// some parameter
private final int myParameter ...
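For reference, the compliant Javadoc form under that rule would simply be:

/** Some parameter. */
private final int myParameter ...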

@StephanEwen (Contributor Author)

Okay, will update the periods. The linguist in my heart cries a bit, but I guess it makes sense that we cannot expect checkstyle to figure out if a sentence is a complete sentence or not...

@aljoscha (Contributor) left a comment

I had a few nitpicks about comments and questions about some parts and the tests.

@@ -159,34 +161,6 @@ public void testRestartAutomatically() throws Exception {
}

@Test
public void taskShouldFailWhenFailureRateLimitExceeded() throws Exception {
Contributor

These tests are superseded by the newly added tests in FailureRateRestartStrategyTest?

Contributor Author

Yes, as part of introducing the "callback" indirection, we can now also test the restart strategies much better, without always setting up a full ExecutionGraph. I added it to the refactoring.

// ------------------------------------------------------------------------

/**
* This method makes sure that the actual interval and is not spuriously waking up.
Contributor

"This method makes sure to sleep for the required interval and that we don't spuriously wake up."?

Also, what happens if Thread.sleep() is interrupted?

Contributor Author

Then the whole method and test abort exceptionally anyway.

Contributor

Perfect

try {
synchronized (progressLock) {
JobStatus current = state;
// check and increment the global version to move this recovery up
Contributor

"check the current global version to determine whether our recovery attempt is still current"?

It's not incrementing the global version here.
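As a hypothetical illustration of the suggested wording and semantics (the helper name is an assumption for illustration, not code in this PR; progressLock, state, and globalModVersion are the fields referenced above):

// inside the ExecutionGraph, under the progress lock
private boolean recoveryStillCurrent(long expectedGlobalVersion) {
    synchronized (progressLock) {
        // check the current global version to determine whether this recovery
        // attempt is still current; a newer global failover bumps globalModVersion
        return state == JobStatus.RESTARTING && globalModVersion == expectedGlobalVersion;
    }
}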

}
}

private static final class TriggeredRestartStrategy implements RestartStrategy {
Contributor

"A {@link RestartStrategy} that blocks restarting on a given {@link OneShotLatch}."?

// from the TaskManager, which comes strictly after that. For tests, we use mock TaskManagers
// which cannot easily tell us when that condition has happened, unfortunately.
try {
Thread.sleep(2);
Contributor

😢 but it seems there's no way around it. Could this lead to flaky tests?

Contributor Author

In very rare cases, it might. I want to change the Execution a bit on the master to make this unnecessary.

However, that is too much surgery in a critical part for a bugfix release, so I decided to be conservative in the runtime code and rather pay this price in the tests.

@@ -581,6 +565,106 @@ public void testSuspendWhileRestarting() throws Exception {
assertEquals(JobStatus.SUSPENDED, eg.getState());
}

@Test
public void testConcurrentLocalFailAndRestart() throws Exception {
Contributor

This only verifies that we don't break the existing and working local failover, right? This test should also succeed on the current master and I checked and it indeed does.

Contributor Author

Right, this one was a test that should have been there in the first place and I took this chance to add it.

failTrigger.trigger();

waitUntilJobStatus(eg, JobStatus.FAILING, 1000);
completeCancellingForAllVertices(eg);
Contributor

By the way, I noticed that completeCancellingForAllVertices() and finishAllVertices() have slightly misleading Javadoc. That threw me off a bit when reviewing.

Contributor Author

True, those docs are copy/paste wrong ;-) I fixed them...

}

@Test
public void testConcurrentGlobalFailAndRestarts() throws Exception {
Contributor

I tried running this on current master and the test failed but I didn't see a "storm of restarts"

Contributor Author

From the offline chat: I think you are missing the asynchrony in the restarting, leading to a lock in the cherry-picked code.

Contributor

Jip, I think so too.

@StephanEwen (Contributor Author)

Thanks for the reviews. Addressing the comments, rerunning tests, and merging...

@aljoscha (Contributor)

+1 for merging!
