-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-9427] Fix registration and request slot race condition in TaskExecutor #6067
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-9427] Fix registration and request slot race condition in TaskExecutor #6067
Conversation
…Executor This commit fixes a race condition between the TaskExecutor and the ResourceManager. Before, it could happen that the ResourceManager sends requestSlots message before the TaskExecutor registration was completed. Due to this, the TaskExecutor did not have all information it needed to accept task submissions. The problem was that the TaskExecutor sent the SlotReport at registration time. Due to this, t he SlotManager could already assign these slots to pending slot requests. With this commit, the registration protocol changes such that the TaskExecutor first registers at the ResourceManager and only after completing this step, it will announce the available slots to the SlotManager.
…Executor This commit fixes a race condition between the TaskExecutor and the ResourceManager. Before, it could happen that the ResourceManager sends requestSlots message before the TaskExecutor registration was completed. Due to this, the TaskExecutor did not have all information it needed to accept task submissions. The problem was that the TaskExecutor sent the SlotReport at registration time. Due to this, t he SlotManager could already assign these slots to pending slot requests. With this commit, the registration protocol changes such that the TaskExecutor first registers at the ResourceManager and only after completing this step, it will announce the available slots to the SlotManager. This closes apache#6067.
…Executor This commit fixes a race condition between the TaskExecutor and the ResourceManager. Before, it could happen that the ResourceManager sends requestSlots message before the TaskExecutor registration was completed. Due to this, the TaskExecutor did not have all information it needed to accept task submissions. The problem was that the TaskExecutor sent the SlotReport at registration time. Due to this, t he SlotManager could already assign these slots to pending slot requests. With this commit, the registration protocol changes such that the TaskExecutor first registers at the ResourceManager and only after completing this step, it will announce the available slots to the SlotManager. This closes apache#6067.
|
|
||
| log.debug(message); | ||
| throw new SlotAllocationException(message); | ||
| throw new TaskManagerException(message); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not return an exceptional future here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already talked about it offline.
| ResourceManagerGateway resourceManagerGateway = resourceManagerConnection.getTargetGateway(); | ||
| resourceManagerGateway.disconnectTaskManager(getResourceID(), cause); | ||
| if (log.isDebugEnabled()) { | ||
| log.debug("Close ResourceManager connection {}.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering whether cause can get logged twice:
log.debug("Close ResourceManager connection {}.",
resourceManagerResourceId, cause);
log.debug("Terminating registration attempts towards ResourceManager {}.",
resourceManagerConnection.getTargetAddress(), cause);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already talked about it offline.
|
|
||
| final TaskExecutorGateway taskExecutorGateway = taskExecutor.getSelfGateway(TaskExecutorGateway.class); | ||
|
|
||
| assertThat(registrationQueue.isEmpty(), is(true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: assertThat(registrationQueue, is(empty())); gives better failure messages. Now it's just a fancy way of writing assertTrue().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. Will change it.
|
|
||
| final TestingTaskExecutorGateway testingTaskExecutorGateway = new TestingTaskExecutorGatewayBuilder() | ||
| .setRequestSlotFunction(slotIDJobIDAllocationIDStringResourceManagerIdTuple5 -> { | ||
| requestSlotQueue.offer(slotIDJobIDAllocationIDStringResourceManagerIdTuple5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are using spaces for indentation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrg, spaces.... How did they make it into this PR? Will get rid of them.
| final BlockingQueue<ResourceID> registrationQueue = new ArrayBlockingQueue<>(1); | ||
|
|
||
| testingResourceManagerGateway.setRegisterTaskExecutorFunction(stringResourceIDSlotReportIntegerHardwareDescriptionTuple5 -> { | ||
| registrationQueue.offer(stringResourceIDSlotReportIntegerHardwareDescriptionTuple5.f1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: spaces are used for indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrg, the same.
…Executor This commit fixes a race condition between the TaskExecutor and the ResourceManager. Before, it could happen that the ResourceManager sends requestSlots message before the TaskExecutor registration was completed. Due to this, the TaskExecutor did not have all information it needed to accept task submissions. The problem was that the TaskExecutor sent the SlotReport at registration time. Due to this, t he SlotManager could already assign these slots to pending slot requests. With this commit, the registration protocol changes such that the TaskExecutor first registers at the ResourceManager and only after completing this step, it will announce the available slots to the SlotManager. This closes apache#6067.
…Executor This commit fixes a race condition between the TaskExecutor and the ResourceManager. Before, it could happen that the ResourceManager sends requestSlots message before the TaskExecutor registration was completed. Due to this, the TaskExecutor did not have all information it needed to accept task submissions. The problem was that the TaskExecutor sent the SlotReport at registration time. Due to this, t he SlotManager could already assign these slots to pending slot requests. With this commit, the registration protocol changes such that the TaskExecutor first registers at the ResourceManager and only after completing this step, it will announce the available slots to the SlotManager. This closes #6067.
…Executor This commit fixes a race condition between the TaskExecutor and the ResourceManager. Before, it could happen that the ResourceManager sends requestSlots message before the TaskExecutor registration was completed. Due to this, the TaskExecutor did not have all information it needed to accept task submissions. The problem was that the TaskExecutor sent the SlotReport at registration time. Due to this, t he SlotManager could already assign these slots to pending slot requests. With this commit, the registration protocol changes such that the TaskExecutor first registers at the ResourceManager and only after completing this step, it will announce the available slots to the SlotManager. This closes apache#6067.
What is the purpose of the change
This commit fixes a race condition between the TaskExecutor and the ResourceManager. Before,
it could happen that the ResourceManager sends requestSlots message before the TaskExecutor
registration was completed. Due to this, the TaskExecutor did not have all information it needed
to accept task submissions.
The problem was that the TaskExecutor sent the SlotReport at registration time. Due to this, t
he SlotManager could already assign these slots to pending slot requests. With this commit, the
registration protocol changes such that the TaskExecutor first registers at the ResourceManager
and only after completing this step, it will announce the available slots to the SlotManager.
cc @GJL
Brief change log
TaskExecutorResourceManagerregistration protocol to announce the available slots after the completion of the registrationTaskExecutor#requestSlotto only accept the call if there is an established connection to aResourceManagerVerifying this change
SlotManagerTest#testSlotRequestFailureTaskExecutorTest#testIgnoringSlotRequestsIfNotRegistered,testReconnectionAttemptIfExplicitlyDisconnected,testInitialSlotReportandtestInitialSlotReportFailureDoes this pull request potentially affect one of the following parts:
@Public(Evolving): (no)Documentation