[FLINK-13166] Add support for batch slot requests to SlotPoolImpl #9058
Conversation
Hi @tillrohrmann, thanks for opening this PR. I have one concern about the way we handle batch slot requests.
From the above description, it seems to me that you are assuming a batch request that is not fulfilled in the first place can only wait for another slot in the slot pool to be freed. I'm not sure about this. I think we need to consider the possibility that a pending slot request in the slot pool can also be satisfied by the resource manager after the initial slot request to the RM failed.
Therefore, I would suggest making the slot pool retry requesting slots from the resource manager for batch slot requests before they time out.
Hi @xintongsong, you are right that we don't retry slot requests from within the SlotPool. I'm actually not so sure whether this retry logic is strictly necessary. Moreover, I think that it would be an optimization of the existing logic and, thus, not strictly release critical. Since the batch request is anyway only a band aid/temporary bridge, I would like to avoid adding even more logic which we need to rework later on just before the feature freeze. Please object if you think that this feature is a release blocker and needs to be added.
I agree with Till here. The logic is not yet perfect, but it should be an improvement over the current state.
Under fine-grained recovery, the current state would lead to failure of a task and individual recovery, re-triggering a request to the RM. That is good, but the downside is that it uses up recovery attempts. I think it is tricky for users to understand that we rely on failure/recovery to re-request resources. It makes retry attempts meaningless and leads users to debug jobs (because they see unexpected failures) when really nothing is wrong.
With this change, we no longer rely on failure/recovery, but we do not re-trigger timed-out requests within a stage. It may hence be that a stage does not optimally use its resources. Requests come again in the next stage.
Like Till suggested, for 1.10 we should consider a different model: requests from the SlotPool to the RM should not time out (unless there is an actual failure), and resources that appear at the RM make it to the SlotPool. Letting the SlotPool periodically request resources seems like a workaround to me.
Another thing which deserves discussion is how we handle failed slot requests to the ResourceManager. However, with #8740, we fail the slot request if it cannot be fulfilled at the moment. If we decide to change this, then we should do it as a follow-up to this PR.
Sorry to bring it back here.
I'm worried that it is not only a matter of whether we use the resources optimally; it may cause jobs to fail even when the total resources of the cluster are actually sufficient. Consider two vertices A (parallelism 2, 100MB managed memory) and B (parallelism 1, 200MB managed memory), and a cluster with one TM with 200MB managed memory in total. Ideally, we would expect the two tasks of vertex A to run on the TM concurrently, followed by the one task of vertex B. However, if we first request two 100MB slots, the RM cannot allocate any further 200MB slot for B, and there is no 200MB slot in the slot pool either. As a result, the slot request for B fails, leading to the job failure. Please correct me if I'm wrong. I'm not very sure whether the failure of the slot request for B would cause the job to fail, or whether this can be resolved by fine-grained recovery.
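To make the arithmetic concrete, here is a minimal toy simulation of that fragmentation scenario (all names are hypothetical; this is not Flink code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the fragmentation scenario above; not Flink code.
public class SlotFragmentationExample {

    public static void main(String[] args) {
        final int tmManagedMemoryMb = 200; // one TM with 200MB managed memory
        List<Integer> allocatedSlotsMb = new ArrayList<>();

        // Vertex A requests its two 100MB slots first and fills the TM.
        allocate(allocatedSlotsMb, 100, tmManagedMemoryMb);
        allocate(allocatedSlotsMb, 100, tmManagedMemoryMb);

        // Vertex B's 200MB request now fails: the TM has 0MB left and the
        // slot pool contains no free 200MB slot.
        allocate(allocatedSlotsMb, 200, tmManagedMemoryMb);
    }

    private static void allocate(List<Integer> slots, int requestMb, int totalMb) {
        int usedMb = slots.stream().mapToInt(Integer::intValue).sum();
        if (usedMb + requestMb <= totalMb) {
            slots.add(requestMb);
            System.out.println("Allocated " + requestMb + "MB slot, "
                + (totalMb - usedMb - requestMb) + "MB left");
        } else {
            System.out.println("Cannot allocate " + requestMb + "MB slot, only "
                + (totalMb - usedMb) + "MB left");
        }
    }
}
```

Unless one of A's slots is released back to the RM, which is what fine-grained recovery effectively forces, B's request can never be satisfied.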
Hi Xintong, I think you are right that this improvement cannot handle the case you describe. However, fine-grained recovery can work as a fallback. It uses re-scheduling as a retry for resources. In this way, B will finally get assigned the resources that are released from A and returned to the RM.
Thank you for the clarification, @zhuzhurk. If fine-grained recovery can work as a fallback, I think we can eventually get jobs in the scenario I described above to work. As long as we don't fail jobs that should be able to run, this is an optimization that we don't necessarily need for this version.
And @tillrohrmann, I want to make one more clarification. I think currently, with #8740, we are failing the ResourceManager slot request if it cannot be fulfilled, and not only at the moment the request arrives. To be more specific, the logic is to fail a slot request that can be fulfilled by neither a registered slot (after the startup period, for standalone) nor a pending slot that can be allocated (for Yarn/Mesos). In other words, we are failing slot requests that can, to the best of our knowledge, never be fulfilled. I would prefer the current way (fail requests that can never be fulfilled), while @StephanEwen said we could also consider the other way (fail requests that cannot be fulfilled at the moment). What do you think?
Yes, I think the fine-grained recovery should effectively re-trigger the request as part of its failover. This is not nice, but it should do the trick.
For the problem with differently sized slots which originate from the same TM, this might actually be an argument to not include the change which allows sharing managed memory between slots in this release, @StephanEwen. It seems to me that not all implications are clear at the moment. Being able to dynamically size the slots complicates the slot allocation protocol further because, with this change, requests might become fulfillable depending on what one releases.
StephanEwen left a comment:
Looks very good all in all.
Some minor comments inline.
As a follow-up, it might make sense to start using Duration/Deadline in the SlotPool instead of long values. That makes things more explicit and avoids the danger of accidentally confusing millis/nanos. It is orthogonal to this change, though.
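As a sketch of what that follow-up could look like (plain java.time, nothing Flink-specific):

```java
import java.time.Duration;

// Sketch: carrying timeouts as Duration makes the unit part of the type,
// instead of an implicit millis/nanos convention attached to a raw long.
public class DurationExample {
    public static void main(String[] args) {
        Duration batchSlotTimeout = Duration.ofMinutes(5);
        // Convert explicitly only at the boundary where a raw long is required.
        long timeoutMillis = batchSlotTimeout.toMillis();
        System.out.println("Batch slot timeout: " + timeoutMillis + " ms");
    }
}
```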
releaseTaskManager(slotPool, directMainThreadExecutor, taskManagerResourceId);

clock.advanceTime(1L, TimeUnit.MILLISECONDS);
This line can probably be removed from the test.
Good point, will remove.
try {
    slotFuture.get();
    fail("Expected TimeoutFuture.");
That error message seems a bit off.
True, will correct it.
+1 to merge this
This commit makes sure that slot requests enqueued at the SlotPoolImpl are completed in the order in which they were requested. Respecting the original request order prevents deadlock situations where dependent requests get completed first. This closes apache#9043.
Move testFailingAllocationFailsPendingSlotRequests from SlotPoolImplTest to SlotPoolPendingRequestFailureTest.
…Request to SlotPoolPendingRequestFailureTest
This commit adds a new type of slot request which can be issued to the SlotPoolImpl. The batch slot request is intended for batch jobs which can be executed with a single slot (having at least one slot for every requested resource profile equivalence class). Usually, a job which fulfills this criterion must not contain a pipelined shuffle.
The new batch slot request behaves differently from the normal slot request in the following aspects:
* Batch slot requests don't time out if the SlotPool contains at least one allocated slot which can fulfill the pending slot request
* Batch slot requests don't react to the failAllocation signal from the ResourceManager
* Batch slot requests don't fail if the slot request to the resource manager fails
In order to time out batch slot requests which cannot be fulfilled with an allocated slot, the SlotPoolImpl schedules a periodic task which checks for this condition. If a slot request cannot be fulfilled, it is marked as unfulfillable and the current timestamp is recorded. If the slot request cannot be marked as fulfillable again before the batch slot timeout is exceeded, the slot request will be timed out.
Batch slot requests are issued by calling SlotPool#requestNewAllocatedBatchSlot.
This closes apache#9058.
In order to not clutter the production implementation with testing methods, this commit introduces the TestingSlotPoolImpl and moves the trigger timeout methods and the convenience constructor to this class.
Thanks for the review @StephanEwen. I've addressed your comments. Merging once Travis gives green light.
What is the purpose of the change
This PR is based on #9043.
This commit adds a new type of slot request which can be issued to the SlotPoolImpl.
The batch slot request is intended for batch jobs which can be executed with a single
slot (having at least one slot for every requested resource profile equivalence class).
Usually, a job which fulfills this criterion must not contain a pipelined shuffle.
The new batch slot request behaves differently from the normal slot request in the
following aspects:
* Batch slot requests don't time out if the SlotPool contains at least one allocated slot
which can fulfill the pending slot request
* Batch slot requests don't react to the failAllocation signal from the ResourceManager
* Batch slot requests don't fail if the slot request to the resource manager fails
In order to time out batch slot requests which cannot be fulfilled with an allocated slot,
the SlotPoolImpl schedules a periodic task which checks for this condition. If a slot request
cannot be fulfilled, it is marked as unfulfillable and the current timestamp is recorded. If
the slot request cannot be marked as fulfillable again before the batch slot timeout is
exceeded, the slot request will be timed out.
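A rough, self-contained sketch of what such a periodic check could look like (names and structure are hypothetical and simplified, not the actual SlotPoolImpl code):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical, heavily simplified model of the periodic check described
// above; names and structure do not match the actual SlotPoolImpl.
class BatchSlotTimeoutCheck {

    static final class PendingBatchRequest {
        final String resourceProfile; // stand-in for a ResourceProfile
        long unfulfillableSince = Long.MAX_VALUE; // MAX_VALUE == fulfillable

        PendingBatchRequest(String resourceProfile) {
            this.resourceProfile = resourceProfile;
        }
    }

    final Map<String, PendingBatchRequest> pendingBatchRequests = new HashMap<>();
    final long batchSlotTimeoutMillis = 300_000L;

    // Invoked periodically by a scheduled task.
    void checkBatchSlotTimeout(long nowMillis) {
        Iterator<PendingBatchRequest> it = pendingBatchRequests.values().iterator();
        while (it.hasNext()) {
            PendingBatchRequest request = it.next();
            if (hasAllocatedSlotFor(request.resourceProfile)) {
                // An allocated slot can fulfill the request: reset to fulfillable.
                request.unfulfillableSince = Long.MAX_VALUE;
            } else {
                // Record when the request first became unfulfillable ...
                if (request.unfulfillableSince == Long.MAX_VALUE) {
                    request.unfulfillableSince = nowMillis;
                }
                // ... and time it out once it stayed unfulfillable for too long.
                if (nowMillis - request.unfulfillableSince >= batchSlotTimeoutMillis) {
                    it.remove();
                    failRequestWithTimeout(request);
                }
            }
        }
    }

    boolean hasAllocatedSlotFor(String resourceProfile) { return false; } // stub
    void failRequestWithTimeout(PendingBatchRequest request) {}           // stub
}
```

The TestingSlotPoolImpl introduced in one of the commits above presumably exposes the trigger timeout methods so that tests can run this check deterministically instead of waiting for the scheduled task.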
A batch slot request is issued by calling SlotPool#requestNewAllocatedBatchSlot.
cc @xintongsong @StephanEwen
Verifying this change
Added SlotPoolBatchSlotRequestTest.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)
Documentation