
[FLINK-10255] Only react to onAddedJobGraph signal when being leader #6678

Closed

tillrohrmann wants to merge 7 commits from the fixJobRecovery branch
Conversation

tillrohrmann (Contributor):

What is the purpose of the change

The Dispatcher should only react to the onAddedJobGraph signal if it is the leader.
In all other cases the signal should be ignored since the jobs will be recovered once
the Dispatcher becomes the leader.

In order to still support non-blocking job recoveries, this commit serializes all
recovery operations by introducing a recoveryOperation future which first needs to
complete before a subsequent operation is started. That way we can avoid race conditions
between granting and revoking leadership as well as the onAddedJobGraph signals. This is
important since we can only lock each JobGraph once and, thus, need to make sure that
we don't release a lock of a properly recovered job in a concurrent operation.
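The serialization idea described above can be sketched with plain CompletableFuture chaining. This is an illustrative, self-contained example under assumed names (RecoverySerializer, runSerialized, and the field recoveryOperation here are hypothetical stand-ins, not the Dispatcher's actual code): each new recovery operation is composed onto the previously stored future, so operations run strictly one after another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch of serializing async recovery operations by chaining
// each new operation onto the previously stored future.
public class RecoverySerializer {

    // a completed future means "no recovery operation in flight"
    private CompletableFuture<Void> recoveryOperation =
        CompletableFuture.completedFuture(null);

    public synchronized CompletableFuture<Void> runSerialized(
            Supplier<CompletableFuture<Void>> operation) {
        // start the next operation only after the previous one has finished
        CompletableFuture<Void> next =
            recoveryOperation.thenCompose(ignored -> operation.get());
        recoveryOperation = next;
        return next;
    }

    public static void main(String[] args) {
        RecoverySerializer serializer = new RecoverySerializer();
        List<String> order = new ArrayList<>();
        serializer.runSerialized(
            () -> CompletableFuture.runAsync(() -> order.add("recoverJobs")));
        serializer.runSerialized(
            () -> CompletableFuture.runAsync(() -> order.add("onAddedJobGraph")))
            .join();
        // operations complete in submission order, never concurrently
        System.out.println(order); // prints [recoverJobs, onAddedJobGraph]
    }
}
```

Because every caller chains onto the same stored future, a leadership revocation and an onAddedJobGraph recovery can never release or lock the same JobGraph concurrently.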

cc @GJL

Brief change log

  • Only react to SubmittedJobGraphListener#onAddedJobGraph when being the leader
  • Serialize recovery operations by introducing a recoveryOperation future in order to avoid wrong unlocking of guarded resources

Verifying this change

  • Added ZooKeeperHADispatcherTest#testStandbyDispatcherJobExecution and ZooKeeperHADispatcherTest#testStandbyDispatcherJobRecovery

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@GJL GJL self-assigned this Sep 10, 2018
@@ -85,5 +87,13 @@
return FutureUtils.completeAll(terminationFutures);
}

public static void stopActor(AkkaActorGateway akkaActorGateway) {
Member:

This PR introduces stopActor, which is used in one place. After checking the whole project, I found that we define stopActor in many places. Most usages are in TestingUtils, but there are also some in MesosResourceManager and FlinkUntypedActorTest. Some use PoisonPill and some use Kill.
Apart from this PR, since everything that interacts with Akka depends on flink-runtime, let's unify the stopActor utilities.
I think ActorUtils is the best place for them.

tillrohrmann (Contributor, Author):

Good point. I will clean this up as a follow up.

@tisonkun (Member):

Travis shows relevant failures; I will take a closer look later.

testGrantingRevokingLeadership(org.apache.flink.runtime.dispatcher.DispatcherHATest)  Time elapsed: 0.024 sec  <<< ERROR!
org.apache.flink.runtime.util.TestingFatalErrorHandler$TestingException: java.lang.UnsupportedOperationException: Should not be called.
	at org.apache.flink.runtime.util.TestingFatalErrorHandler.rethrowError(TestingFatalErrorHandler.java:51)
	at org.apache.flink.runtime.dispatcher.DispatcherHATest.teardown(DispatcherHATest.java:98)
Caused by: java.lang.UnsupportedOperationException: Should not be called.
	at org.apache.flink.runtime.dispatcher.DispatcherHATest$BlockingSubmittedJobGraphStore.releaseJobGraph(DispatcherHATest.java:306)
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:809)
	at org.apache.flink.util.function.BiFunctionWithException.apply(BiFunctionWithException.java:49)
	at java.util.concurrent.CompletableFuture.biApply(CompletableFuture.java:1105)
	at java.util.concurrent.CompletableFuture$BiApply.tryFire(CompletableFuture.java:1070)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.45 sec <<< FAILURE! - in org.apache.flink.runtime.dispatcher.DispatcherHATest
A TaskManager should go into a clean state in case of a JobManager failure(org.apache.flink.api.scala.runtime.jobmanager.JobManagerFailsITCase)  Time elapsed: 121.247 sec  <<< FAILURE!
java.lang.AssertionError: assertion failed: timeout (119585594930 nanoseconds) during expectMsg while waiting for Acknowledge
	at scala.Predef$.assert(Predef.scala:170)
	at akka.testkit.TestKitBase$class.expectMsg_internal(TestKit.scala:387)
	at akka.testkit.TestKitBase$class.expectMsg(TestKit.scala:364)
	at akka.testkit.TestKit.expectMsg(TestKit.scala:814)
	at org.apache.flink.api.scala.runtime.jobmanager.JobManagerFailsITCase$$anonfun$1$$anonfun$apply$mcV$sp$3$$anonfun$apply$mcV$sp$4.apply$mcV$sp(JobManagerFailsITCase.scala:118)
	at org.apache.flink.api.scala.runtime.jobmanager.JobManagerFailsITCase$$anonfun$1$$anonfun$apply$mcV$sp$3$$anonfun$apply$mcV$sp$4.apply(JobManagerFailsITCase.scala:104)
	at org.apache.flink.api.scala.runtime.jobmanager.JobManagerFailsITCase$$anonfun$1$$anonfun$apply$mcV$sp$3$$anonfun$apply$mcV$sp$4.apply(JobManagerFailsITCase.scala:104)
	at akka.testkit.TestKitBase$class.within(TestKit.scala:345)
	at akka.testkit.TestKit.within(TestKit.scala:814)
	at akka.testkit.TestKitBase$class.within(TestKit.scala:359)
	at akka.testkit.TestKit.within(TestKit.scala:814)
	at org.apache.flink.api.scala.runtime.jobmanager.JobManagerFailsITCase$$anonfun$1$$anonfun$apply$mcV$sp$3.apply$mcV$sp(JobManagerFailsITCase.scala:104)
	at org.apache.flink.api.scala.runtime.jobmanager.JobManagerFailsITCase$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(JobManagerFailsITCase.scala:85)
	at org.apache.flink.api.scala.runtime.jobmanager.JobManagerFailsITCase$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(JobManagerFailsITCase.scala:85)
	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.WordSpecLike$$anon$1.apply(WordSpecLike.scala:953)
	at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
	at org.apache.flink.api.scala.runtime.jobmanager.JobManagerFailsITCase.withFixture(JobManagerFailsITCase.scala:37)

@tillrohrmann (Contributor, Author):

Thanks for the comments @tisonkun. I've fixed the failing DispatcherTest#testOnAddedJobGraphWithFinishedJob.

@@ -0,0 +1,48 @@
/*
@GJL (Member), Sep 12, 2018:

I think adding this file should be in a commit that is before [FLINK-10255] Only react to onAddedJobGraph signal when being leader, or it should be squashed. Without this class your previous commits would not work.

tillrohrmann (Contributor, Author):

Yes, I will rearrange the commits before merging. Locally 3a07ee8 is before 868c7dd which makes it work.

@GJL (Member) left a comment:

This is not a review.

return callAsyncWithoutFencing(
() -> getJobTerminationFuture(jobId),
timeout).thenCompose(Function.identity());
}

@VisibleForTesting
Member:

Not sure about @VisibleForTesting here. This class is already a test utility. It is even in the test sources directory.

tillrohrmann (Contributor, Author):

You're right. Will change it.

@@ -861,14 +861,18 @@ public void grantLeadership(final UUID newLeaderSessionID) {
getMainThreadExecutor());
}

protected CompletableFuture<Void> getJobTerminationFuture(JobID jobId) {
CompletableFuture<Void> getJobTerminationFuture(JobID jobId) {
Member:

Should be private.

@tillrohrmann (Contributor, Author), Sep 12, 2018:

It cannot be private since the TestingDispatcher needs to access it.

Member:

True, I missed it.

if (jobManagerRunners.containsKey(jobId)) {
return FutureUtils.completedExceptionally(new DispatcherException(String.format("Job with job id %s is still running.", jobId)));
} else {
return jobManagerTerminationFutures.getOrDefault(jobId, CompletableFuture.completedFuture(null));
}
}

CompletableFuture<Void> getRecoveryOperation() {
Member:

This method has wider visibility scope than necessary, and is part of production code. I think @VisibleForTesting should be added.

tillrohrmann (Contributor, Author):

Good point. Will add it.

final CompletableFuture<Void> submissionFuture = recoveredJob.thenComposeAsync(
(FunctionWithThrowable<JobGraph, CompletableFuture<Void>, Exception>) (JobGraph jobGraph) -> tryRunRecoveredJobGraph(jobGraph, dispatcherId)
.thenAcceptAsync(
(ConsumerWithException<Boolean, Exception>) (Boolean isRecoveredJobRunning) -> {
@GJL (Member), Sep 12, 2018:

Imo we are doing this wrong. The code would be much more readable with static factory methods:


/**
 * {@link Consumer} that can throw checked exceptions.
 */
@FunctionalInterface
public interface CheckedConsumer<T> {

	void checkedAccept(T t) throws Exception;

	static <T> Consumer<T> unchecked(CheckedConsumer<T> checkedConsumer) {
		return (t) -> {
			try {
				checkedConsumer.checkedAccept(t);
			} catch (Exception e) {
				ExceptionUtils.rethrow(e);
			}
		};
	}
}

This allows for:

.thenAcceptAsync(CheckedConsumer.unchecked(isRecoveredJobRunning -> {
     ...
}));
...

No casts are required. Also, when interacting with the Java API, it does not matter what exact type of exception can be thrown; what matters is that the checked exception becomes unchecked. We do not need to generify the exception type in ConsumerWithException.

tillrohrmann (Contributor, Author):

Good point. I like this approach better. Will adapt the existing interfaces.


final DispatcherId dispatcherId = getFencingToken();
final CompletableFuture<Void> submissionFuture = recoveredJob.thenComposeAsync(
(FunctionWithThrowable<JobGraph, CompletableFuture<Void>, Exception>) (JobGraph jobGraph) -> tryRunRecoveredJobGraph(jobGraph, dispatcherId)
Member:

See my comment regarding the ConsumerWithException.

@@ -167,7 +181,7 @@ public void testSubmittedJobGraphRelease() throws Exception {
// recover the job
final SubmittedJobGraph submittedJobGraph = otherSubmittedJobGraphStore.recoverJobGraph(jobId);

assertThat(submittedJobGraph, Matchers.is(Matchers.notNullValue()));
assertThat(submittedJobGraph, is(Matchers.notNullValue()));
Member:

You added a static import for is but not for notNullValue. I think this should be consistent.

tillrohrmann (Contributor, Author):

Good catch. Will change it.

@tillrohrmann (Contributor, Author):

Thanks for the review @GJL. I've addressed your comments and after Travis gives green light, I'll merge it.

@tillrohrmann tillrohrmann force-pushed the fixJobRecovery branch 3 times, most recently from f255bb2 to 3d94a81 Compare September 13, 2018 16:26
tillrohrmann added a commit to tillrohrmann/flink that referenced this pull request Sep 13, 2018
The Dispatcher should only react to the onAddedJobGraph signal if it is the leader.
In all other cases the signal should be ignored since the jobs will be recovered once
the Dispatcher becomes the leader.

In order to still support non-blocking job recoveries, this commit serializes all
recovery operations by introducing a recoveryOperation future which first needs to
complete before a subsequent operation is started. That way we can avoid race conditions
between granting and revoking leadership as well as the onAddedJobGraph signals. This is
important since we can only lock each JobGraph once and, thus, need to make sure that
we don't release a lock of a properly recovered job in a concurrent operation.

This closes apache#6678.
…nsumer

ThrowingRunnable#unchecked converts a ThrowingRunnable into a Runnable which throws checked
exceptions as unchecked ones. FunctionUtils#uncheckedConsumer(ThrowingConsumer) converts a
ThrowingConsumer into a Consumer which throws checked exceptions as unchecked ones. This is
necessary because ThrowingConsumer is public and we cannot add new methods to the interface.
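The conversion this commit message describes can be sketched in a few lines. This is a simplified, self-contained stand-in (the ThrowingConsumer interface and the rethrow strategy here are local illustrations; Flink's actual versions live in org.apache.flink.util.function and use ExceptionUtils.rethrow):

```java
import java.util.function.Consumer;

// Sketch: adapt a consumer that throws checked exceptions into a plain
// java.util.function.Consumer that rethrows them as unchecked ones.
public class Unchecked {

    /** Local stand-in for a {@code Consumer} that may throw checked exceptions. */
    @FunctionalInterface
    public interface ThrowingConsumer<T> {
        void accept(T t) throws Exception;
    }

    public static <T> Consumer<T> uncheckedConsumer(ThrowingConsumer<T> consumer) {
        return t -> {
            try {
                consumer.accept(t);
            } catch (Exception e) {
                // rethrow checked exceptions as unchecked ones
                if (e instanceof RuntimeException) {
                    throw (RuntimeException) e;
                }
                throw new RuntimeException(e);
            }
        };
    }

    public static void main(String[] args) {
        Unchecked.uncheckedConsumer((String s) -> {
            if (s.isEmpty()) {
                throw new Exception("empty input");
            }
            System.out.println("accepted: " + s);
        }).accept("job-graph"); // prints accepted: job-graph
    }
}
```

The caller can then pass the result straight to APIs like CompletableFuture#thenAcceptAsync without casts, which is exactly the readability benefit discussed in the review above.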
…point

This is necessary to support the command line syntax used by the multi master
standalone start-up scripts.
@asfgit asfgit closed this in 3e5d07c Sep 14, 2018
@tillrohrmann tillrohrmann deleted the fixJobRecovery branch September 14, 2018 13:21
@Clarkkkkk (Contributor):

Hi @tillrohrmann, is it possible for two async operations to modify the same recoveryOperation at the same time? Would the operations still be serialized in that case?

@tillrohrmann (Contributor, Author):

I think it should not be possible to have two async recovery operations ongoing, since one of them will have to wait for the other to complete. That was the idea of the fix.

@Clarkkkkk (Contributor):

Thanks for the reply, that makes sense.

5 participants