-
Notifications
You must be signed in to change notification settings - Fork 13.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-10255] Only react to onAddedJobGraph signal when being leader #6678
Conversation
@@ -85,5 +87,13 @@ | |||
return FutureUtils.completeAll(terminationFutures); | |||
} | |||
|
|||
public static void stopActor(AkkaActorGateway akkaActorGateway) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this PR we introduce stopActor
which is used at one place. After checking all our project, we have define many stopActor
here and there. Most usages of them are from TestingUtils
but there are also some from MesosResourceManager
and FlinkUntypedActorTest
. Sometimes use PoisonPill
and sometimes use Kill
.
Apart from this PR, since all stuff interact with Akka would depend on flink-runtime
, let's unify stopActor Utils.
I think here, ActorUtils
is the best place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I will clean this up as a follow up.
Travis show relevant failures, will take a close look later.
|
Thanks for the comments @tisonkun. I've fixed the failing |
@@ -0,0 +1,48 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding this file should be in a commit that is before [FLINK-10255] Only react to onAddedJobGraph signal when being leader
, or it should be squashed. Without this class your previous commits would not work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a review.
return callAsyncWithoutFencing( | ||
() -> getJobTerminationFuture(jobId), | ||
timeout).thenCompose(Function.identity()); | ||
} | ||
|
||
@VisibleForTesting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about @VisibleForTesting
here. This class is already a test utility. It is even in the test
sources directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Will change it.
@@ -861,14 +861,18 @@ public void grantLeadership(final UUID newLeaderSessionID) { | |||
getMainThreadExecutor()); | |||
} | |||
|
|||
protected CompletableFuture<Void> getJobTerminationFuture(JobID jobId) { | |||
CompletableFuture<Void> getJobTerminationFuture(JobID jobId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be private
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It cannot be private since the TestingDispatcher
needs to access it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, I missed it.
if (jobManagerRunners.containsKey(jobId)) { | ||
return FutureUtils.completedExceptionally(new DispatcherException(String.format("Job with job id %s is still running.", jobId))); | ||
} else { | ||
return jobManagerTerminationFutures.getOrDefault(jobId, CompletableFuture.completedFuture(null)); | ||
} | ||
} | ||
|
||
CompletableFuture<Void> getRecoveryOperation() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method has wider visibility scope than necessary, and is part of production code. I think @VisibleForTesting
should be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Will add it.
868c7dd
to
d8c544f
Compare
final CompletableFuture<Void> submissionFuture = recoveredJob.thenComposeAsync( | ||
(FunctionWithThrowable<JobGraph, CompletableFuture<Void>, Exception>) (JobGraph jobGraph) -> tryRunRecoveredJobGraph(jobGraph, dispatcherId) | ||
.thenAcceptAsync( | ||
(ConsumerWithException<Boolean, Exception>) (Boolean isRecoveredJobRunning) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imo we are doing this wrong. The code would be much more readible with static factory methods:
/**
* {@link Consumer} that can throw checked exceptions.
*/
@FunctionalInterface
public interface CheckedConsumer<T> {
void checkedAccept(T t) throws Exception;
static <T> Consumer<T> unchecked(CheckedConsumer<T> checkedConsumer) {
return (t) -> {
try {
checkedConsumer.checkedAccept(t);
} catch (Exception e) {
ExceptionUtils.rethrow(e);
}
};
}
}
This allows for:
.thenAcceptAsync(CheckedConsumer.unchecked(isRecoveredJobRunning -> {
...
}));
...
No casts are required. Also when interacting with the Java API, it does not matter what exact type of exception can be thrown – what matters is that the checked exception becomes unchecked. We do not need to generify the exception type in ConsumerWithException
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I like this approach better. Will adapt the existing interfaces.
|
||
final DispatcherId dispatcherId = getFencingToken(); | ||
final CompletableFuture<Void> submissionFuture = recoveredJob.thenComposeAsync( | ||
(FunctionWithThrowable<JobGraph, CompletableFuture<Void>, Exception>) (JobGraph jobGraph) -> tryRunRecoveredJobGraph(jobGraph, dispatcherId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment regarding the ConsumerWithException
.
@@ -167,7 +181,7 @@ public void testSubmittedJobGraphRelease() throws Exception { | |||
// recover the job | |||
final SubmittedJobGraph submittedJobGraph = otherSubmittedJobGraphStore.recoverJobGraph(jobId); | |||
|
|||
assertThat(submittedJobGraph, Matchers.is(Matchers.notNullValue())); | |||
assertThat(submittedJobGraph, is(Matchers.notNullValue())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You added a static import for is
but not for notNullValue
. I think this should be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Will change it.
d8c544f
to
0788a54
Compare
Thanks for the review @GJL. I've addressed your comments and after Travis gives green light, I'll merge it. |
f255bb2
to
3d94a81
Compare
The Dispatcher should only react to the onAddedJobGraph signal if it is the leader. In all other cases the signal should be ignored since the jobs will be recovered once the Dispatcher becomes the leader. In order to still support non-blocking job recoveries, this commit serializes all recovery operations by introducing a recoveryOperation future which first needs to complete before a subsequent operation is started. That way we can avoid race conditions between granting and revoking leadership as well as the onAddedJobGraph signals. This is important since we can only lock each JobGraph once and, thus, need to make sure that we don't release a lock of a properly recovered job in a concurrent operation. This closes apache#6678.
The Dispatcher should only react to the onAddedJobGraph signal if it is the leader. In all other cases the signal should be ignored since the jobs will be recovered once the Dispatcher becomes the leader. In order to still support non-blocking job recoveries, this commit serializes all recovery operations by introducing a recoveryOperation future which first needs to complete before a subsequent operation is started. That way we can avoid race conditions between granting and revoking leadership as well as the onAddedJobGraph signals. This is important since we can only lock each JobGraph once and, thus, need to make sure that we don't release a lock of a properly recovered job in a concurrent operation. This closes apache#6678.
3d94a81
to
6fb8bc2
Compare
The Dispatcher should only react to the onAddedJobGraph signal if it is the leader. In all other cases the signal should be ignored since the jobs will be recovered once the Dispatcher becomes the leader. In order to still support non-blocking job recoveries, this commit serializes all recovery operations by introducing a recoveryOperation future which first needs to complete before a subsequent operation is started. That way we can avoid race conditions between granting and revoking leadership as well as the onAddedJobGraph signals. This is important since we can only lock each JobGraph once and, thus, need to make sure that we don't release a lock of a properly recovered job in a concurrent operation. This closes apache#6678.
The Dispatcher should only react to the onAddedJobGraph signal if it is the leader. In all other cases the signal should be ignored since the jobs will be recovered once the Dispatcher becomes the leader. In order to still support non-blocking job recoveries, this commit serializes all recovery operations by introducing a recoveryOperation future which first needs to complete before a subsequent operation is started. That way we can avoid race conditions between granting and revoking leadership as well as the onAddedJobGraph signals. This is important since we can only lock each JobGraph once and, thus, need to make sure that we don't release a lock of a properly recovered job in a concurrent operation. This closes apache#6678.
The Dispatcher should only react to the onAddedJobGraph signal if it is the leader. In all other cases the signal should be ignored since the jobs will be recovered once the Dispatcher becomes the leader. In order to still support non-blocking job recoveries, this commit serializes all recovery operations by introducing a recoveryOperation future which first needs to complete before a subsequent operation is started. That way we can avoid race conditions between granting and revoking leadership as well as the onAddedJobGraph signals. This is important since we can only lock each JobGraph once and, thus, need to make sure that we don't release a lock of a properly recovered job in a concurrent operation. This closes apache#6678.
The Dispatcher should only react to the onAddedJobGraph signal if it is the leader. In all other cases the signal should be ignored since the jobs will be recovered once the Dispatcher becomes the leader. In order to still support non-blocking job recoveries, this commit serializes all recovery operations by introducing a recoveryOperation future which first needs to complete before a subsequent operation is started. That way we can avoid race conditions between granting and revoking leadership as well as the onAddedJobGraph signals. This is important since we can only lock each JobGraph once and, thus, need to make sure that we don't release a lock of a properly recovered job in a concurrent operation. This closes apache#6678.
…xcpetion into Function
The Dispatcher should only react to the onAddedJobGraph signal if it is the leader. In all other cases the signal should be ignored since the jobs will be recovered once the Dispatcher becomes the leader. In order to still support non-blocking job recoveries, this commit serializes all recovery operations by introducing a recoveryOperation future which first needs to complete before a subsequent operation is started. That way we can avoid race conditions between granting and revoking leadership as well as the onAddedJobGraph signals. This is important since we can only lock each JobGraph once and, thus, need to make sure that we don't release a lock of a properly recovered job in a concurrent operation. This closes apache#6678.
…nsumer ThrowingRunnable#unchecked converts a ThrowingRunnable into a Runnable which throws checked exceptions as unchecked ones. FunctionUtils#uncheckedConsmer(ThrowingConsumer) converts a ThrowingConsumer into a Consumer which throws checked exceptions as unchecked ones. This is necessary because ThrowingConsumer is public and we cannot add new methods to the interface.
…point This is necessary to support the command line syntax used by the multi master standalone start-up scripts.
34e3b73
to
740443e
Compare
The Dispatcher should only react to the onAddedJobGraph signal if it is the leader. In all other cases the signal should be ignored since the jobs will be recovered once the Dispatcher becomes the leader. In order to still support non-blocking job recoveries, this commit serializes all recovery operations by introducing a recoveryOperation future which first needs to complete before a subsequent operation is started. That way we can avoid race conditions between granting and revoking leadership as well as the onAddedJobGraph signals. This is important since we can only lock each JobGraph once and, thus, need to make sure that we don't release a lock of a properly recovered job in a concurrent operation. This closes apache#6678.
Hi @tillrohrmann , is it possible that two async operation that modifies the same recoveryOperation at the same time? Would that be serializable in that case? |
I think it should not be possible to have two async recovery operations ongoing since either of them will have to wait for the other to complete. That was the idea of the fix. |
Thanks for the reply, that'll make sense. |
What is the purpose of the change
The Dispatcher should only react to the onAddedJobGraph signal if it is the leader.
In all other cases the signal should be ignored since the jobs will be recovered once
the Dispatcher becomes the leader.
In order to still support non-blocking job recoveries, this commit serializes all
recovery operations by introducing a recoveryOperation future which first needs to
complete before a subsequent operation is started. That way we can avoid race conditions
between granting and revoking leadership as well as the onAddedJobGraph signals. This is
important since we can only lock each JobGraph once and, thus, need to make sure that
we don't release a lock of a properly recovered job in a concurrent operation.
cc @GJL
Brief change log
SubmittedJobGraphListener#onAddedJobGraph
when being the leaderrecoveryOperation
future in order to avoid wrong unlocking of guarded resourcesVerifying this change
ZooKeeperHADispatcherTest#testStandbyDispatcherJobExecution
andZooKeeperHADispatcherTest#testStandbyDispatcherJobRecovery
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (no)Documentation