[FLINK-11665] Wait for job termination before recovery in Dispatcher #7889
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community track the review progress.
Please see the Pull Request Review Guide for a full explanation of the review process. The bot tracks the review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands: the @flinkbot bot supports the following commands:
@@ -313,8 +318,6 @@ private void stopDispatcherServices() throws Exception {
final CompletableFuture<JobManagerRunner> jobManagerRunnerFuture = createJobManagerRunner(jobGraph);
jobManagerRunnerFutures.put(jobGraph.getJobID(), jobManagerRunnerFuture);
Why this change? Please revert it if it is unrelated. Separating bugfixes and code cleanups into their own commits helps review.
My PR probably changed the timing of events in some unit tests and triggered a concurrency issue that this change fixes. The PR would have test instabilities without it. I think it is small enough to be part of this PR, though I can factor it out into a separate commit.
@@ -351,7 +356,8 @@ private JobManagerRunner startJobManagerRunner(JobManagerRunner jobManagerRunner
(ArchivedExecutionGraph archivedExecutionGraph, Throwable throwable) -> {
// check if we are still the active JobManagerRunner by checking the identity
//noinspection ObjectEquality
if (jobManagerRunner == jobManagerRunnerFutures.get(jobId).getNow(null)) {
Why this change? Please revert it if it is unrelated.
Agreed, please revert. This change only introduces `r`, which is not necessary.
Is it possible to wait on the previous termination future per job, one job at a time? In your diff it seems that before any job can be recovered, we have to wait for all jobs to terminate.
@tisonkun
@azagrebin I agree that we don't gain much speed-up. My concern is that per-job waiting is the more natural dependency; after all, it sounds a bit weird that a job cannot be recovered because an unrelated job hasn't terminated.
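To make the contrast concrete, here is a rough sketch with plain Java futures and illustrative names; this is not the actual Dispatcher code, just the two waiting strategies side by side:

import java.util.Collection;
import java.util.concurrent.CompletableFuture;

// Simplified contrast of the two waiting strategies discussed above.
class WaitGranularitySketch {

    // Option A: gate every recovery on the termination of all previously running jobs.
    static CompletableFuture<Void> recoverAfterAllTerminated(
            Collection<CompletableFuture<Void>> allTerminations, Runnable recoverAllJobs) {
        return CompletableFuture
            .allOf(allTerminations.toArray(new CompletableFuture[0]))
            .thenRun(recoverAllJobs);
    }

    // Option B: gate each job's recovery only on the termination of the previous
    // run of that same job id.
    static CompletableFuture<Void> recoverAfterSameJobTerminated(
            CompletableFuture<Void> previousRunOfSameJob, Runnable recoverThisJob) {
        return previousRunOfSameJob.thenRun(recoverThisJob);
    }
}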
@flinkbot attention @tillrohrmann
Thanks for opening this PR @azagrebin. I think your solution can work. However, I would rather revisit how Dispatchers interact with the SubmittedJobGraphStore, and also the lifecycle of Dispatchers with respect to leader sessions. We recently discovered another problem with keeping one Dispatcher alive across multiple leader sessions, which can cause the Dispatcher to deadlock (see FLINK-11843). I would like the Dispatcher not to be reused across leader sessions and instead to create a new Dispatcher for every new leader session. By not sharing the SubmittedJobGraphStore between different Dispatchers, we should be able to solve this issue here as well. Therefore I would be in favor of closing this PR for the moment.
new DispatcherException(
String.format("Termination of previous JobManager for job %s failed. Cannot submit job under the same job id.", jobId),
throwable)); })
.thenApply(ignored -> jobId);
The thenApply can be removed if CompletableFuture.completedFuture(jobId) is used instead.
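For illustration, a simplified sketch of the two shapes with generic futures; these are not the actual Dispatcher types, just the pattern the comment refers to:

import java.util.concurrent.CompletableFuture;

class ThenApplySketch {

    // Shape in the PR: a trailing thenApply exists only to produce the job id.
    static CompletableFuture<String> withThenApply(CompletableFuture<Void> termination, String jobId) {
        return termination
            .handle((ignored, throwable) -> (Void) null) // error translation elided
            .thenApply(ignored -> jobId);
    }

    // Suggested shape: compose onto an already completed future carrying the id,
    // so the trailing thenApply is unnecessary.
    static CompletableFuture<String> withCompletedFuture(CompletableFuture<Void> termination, String jobId) {
        return termination
            .thenCompose(ignored -> CompletableFuture.completedFuture(jobId));
    }
}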
final CompletableFuture<Acknowledge> persistAndRunFuture = getJobTerminationFuture(jobId)
.thenComposeAsync(
FunctionUtils.uncheckedFunction((ignored) -> {
jobManagerTerminationFutures.remove(jobId);
why do we have to spread the removal of the job manager termination futures all over the place?
@@ -176,8 +183,7 @@ public void testSubmittedJobGraphRelease() throws Exception {
leaderElectionService.notLeader();

// wait for the job to properly terminate
final CompletableFuture<Void> jobTerminationFuture = dispatcher.getJobTerminationFuture(jobId, TIMEOUT);
jobTerminationFuture.get();
dispatcher.getJobTerminationFuture(jobId, TIMEOUT).get();
refactorings should go into separate commits.
@@ -279,6 +285,7 @@ public void testStandbyDispatcherJobRecovery() throws Exception {
Dispatcher dispatcher1 = null;
Dispatcher dispatcher2 = null;

//noinspection TryFinallyCanBeTryWithResources
same here.
}

private static class WaitingSubmittedJobGraphStore implements SubmittedJobGraphStore {
private static final long WAIT_RECOVER_MILLI = 1000L;
The waiting time is too long. It should be ok to set it to something like 10-50 milliseconds.
recoverJobGraphLatch.await(WAIT_RECOVER_MILLI, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
// ignore
}
I would do the TimeoutException handling in the test code, because otherwise one might think that we actually recover if one only looks at the test code.
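For example, the timeout interpretation could live in the test itself, roughly like this; the names are illustrative and a plain CountDownLatch stands in for the test's own latch:

import static org.junit.Assert.assertFalse;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.junit.Test;

public class RecoveryNotTriggeredTestSketch {

    // In the real test this latch would be exposed by the waiting job graph store stub.
    private final CountDownLatch recoverJobGraphLatch = new CountDownLatch(1);

    @Test
    public void recoveryWaitsForPreviousJobTermination() throws InterruptedException {
        // The test states the expectation explicitly: within the short waiting
        // period, job graph recovery must NOT have been triggered.
        final boolean recoveryTriggered = recoverJobGraphLatch.await(50L, TimeUnit.MILLISECONDS);
        assertFalse("job graph recovery should not start before the previous run terminated", recoveryTriggered);
    }
}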
dispatcher.getJobTerminationFuture(jobGraph.getJobID(), TIMEOUT).get();

assertTrue(submittedJobGraphStore.isJobRemoved());
assertFalse(submittedJobGraphStore.getJobIds().contains(jobGraph.getJobID()));
I would suggest using hamcrest assertions in the future because they give better error messages and are more expressive.
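For instance, the two assertions in the diff above could be written with hamcrest roughly as follows (drop-in replacements for those two lines, same identifiers as in the test):

import static org.hamcrest.CoreMatchers.hasItem;
import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.CoreMatchers.not;
import static org.hamcrest.MatcherAssert.assertThat;

// On failure these print the actual value/collection, not just "expected true but was false".
assertThat(submittedJobGraphStore.isJobRemoved(), is(true));
assertThat(submittedJobGraphStore.getJobIds(), not(hasItem(jobGraph.getJobID())));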
Sounds good to me. We might not keep a failing contender (the leader) alive, but just crash it and restart a new contender. Following this pattern we can exit when the dispatcher loses leadership, but it looks like we need a cluster manager to take care of creating a new Dispatcher.
Thanks for the reviews @tillrohrmann @tisonkun
What is the purpose of the change
This PR addresses a concurrency problem when the dispatcher loses leadership and immediately regains it. When the dispatcher recovers a job, it has to wait for the termination of the previous run of that job. Otherwise, the recovery for the next run can add the execution graph to the HA store while the concurrent termination of the previous run subsequently removes it, which leaves the storage in an inconsistent state.
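To make the mechanism concrete, here is a simplified model of the ordering the PR enforces; the class and method names are illustrative and not actual Flink classes:

import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model: the "HA store" is just a set of job ids. The new leader
// session chains the recovery of a job onto the termination future of the
// previous run, so the old run's cleanup (remove) can no longer race with the
// new run's persistence (add) and leave the store inconsistent.
class RecoveryOrderingSketch {

    private final Set<String> haStore = ConcurrentHashMap.newKeySet();

    CompletableFuture<Void> recoverJob(String jobId, CompletableFuture<Void> previousRunTermination) {
        // Without this waiting step, add(jobId) could run first and then be undone
        // by the previous run's still in-flight cleanup.
        return previousRunTermination.thenRun(() -> haStore.add(jobId));
    }

    void cleanUpPreviousRun(String jobId) {
        // Cleanup of the terminated previous run removes the job graph from the store.
        haStore.remove(jobId);
    }
}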
Brief change log
Verifying this change
run unit tests
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)
Documentation