[FLINK-19552][runtime] Get preferred locations of an ExecutionSlotSharingGroup only for its executions from the scheduled bulk #13628
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. Last check on commit 6187d12 (Wed Oct 14 07:28:15 UTC 2020).
…t allow immediate failed logical slot request: a newly created logical slot request should be either pending, or fulfilled successfully with an available physical slot from the slot pool.
…ringGroup only for its executions from the scheduled bulk: otherwise the preferred location futures of un-scheduled executions can stay pending for long, or even be canceled/failed, which results in unexpected errors.
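The idea of the second commit can be sketched as follows. This is a minimal, self-contained illustration under assumed names (`preferredLocationsOf`, plain `String` vertex IDs), not Flink's actual API: only executions that belong to the currently scheduled bulk contribute preferred location futures, so pending or canceled futures of un-scheduled executions are never waited on.

```java
import java.util.*;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical sketch (not Flink's actual API): when collecting preferred
// locations for an ExecutionSlotSharingGroup, only consult executions that
// are part of the currently scheduled bulk.
public class PreferredLocationsSketch {

    static List<CompletableFuture<String>> preferredLocationsOf(
            Map<String, CompletableFuture<String>> locationFutures,
            Collection<String> groupExecutions,
            Set<String> scheduledBulk) {
        // Executions outside the bulk may have location futures that stay
        // pending indefinitely or get canceled; skipping them avoids waiting
        // on futures that will never complete normally.
        return groupExecutions.stream()
                .filter(scheduledBulk::contains)
                .map(locationFutures::get)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, CompletableFuture<String>> locationFutures = new HashMap<>();
        locationFutures.put("v1", CompletableFuture.completedFuture("host-a"));
        locationFutures.put("v3", new CompletableFuture<>()); // not scheduled: stays pending

        List<CompletableFuture<String>> locations =
                preferredLocationsOf(locationFutures, List.of("v1", "v3"), Set.of("v1"));

        System.out.println(locations.size());        // prints 1
        System.out.println(locations.get(0).join()); // prints host-a
    }
}
```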
6187d12 to 4150a04
Thanks for creating this fix @zhuzhurk. If I understand the problem correctly, then I think the PR fixes the problem. However, I have a couple of related questions:

- `PipelinedRegionSchedulingConcurrentFailureTest.testNoImmediateSlotAllocationFailureOnConcurrentFailure` seems to be able to reproduce the problem because the `RestartPipelinedRegionFailoverStrategy` also cancels not yet started downstream consumers. Why is this the case? Shouldn't it only be necessary to cancel `Executions` which are not `CREATED`?
- `ExecutionGraphToInputsLocationsRetrieverAdapter` seems to solve the problem of deadlocking the `MergingSharedSlotProfileRetriever` by returning an input future of an execution which is about to be scheduled, by checking on the `CREATED` state. However, it returns the input location future if the `Execution` has a different state. To me this seems an odd responsibility for an `InputLocationsRetriever`. I think it would make more sense to filter such dependencies out at a different level.
- Why does `MergingSharedSlotProfileRetriever.getSlotProfileFuture` return a future? At the time of scheduling a pipelined region, shouldn't all its inputs and their locations be known?
- The fix limits the set of input locations to the specified bulk. However, the bulk does not need to contain all `Executions` which can share the slot. Hence, by doing this filtering, we give up potential information about other input locations for tasks which can also run in the same slot. Shouldn't we try to obtain their input locations and only ignore them if we cannot get them?
- In `PipelinedRegionSchedulingConcurrentFailureTest`, it seems as if the `Executions` of `v3` and `v4` are in the same `ExecutionSlotSharingGroup` as the `Executions` of `v1` and `v2`. I think this is also the reason why these `ExecutionVertices` are actually asked for their input locations. Given that `v2` and `v3` are separated by a blocking data exchange, how can `Executions` of these vertices share a slot? Wouldn't it also mean that we are asking for a too large slot, because `SlotSharingExecutionSlotAllocator.getPhysicalSlotResourceProfile` includes `v1, v2, v3, v4` in one slot?

If there is no strong need for waiting for a future input location, I would say that it would probably be easiest if the `MergingSharedSlotProfileRetriever` checks all `Executions` of its `ExecutionSlotSharingGroup` and collects the available input locations. Based on this information it can then make an approximate decision. That way, it would also ignore currently failed tasks. Moreover, we should check how the `ExecutionSlotSharingGroups` are calculated. There might be a small inaccuracy in it.
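The collect-only-available-locations idea could look roughly like this. It is a hypothetical sketch with made-up names, not the real `MergingSharedSlotProfileRetriever`: instead of waiting on input location futures, keep only the locations that are already known, ignoring pending, canceled, or failed ones.

```java
import java.util.*;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical sketch (not Flink's actual API): look at all executions of
// the ExecutionSlotSharingGroup and keep only input locations that are
// already available, making an approximate placement decision from those.
public class AvailableLocationsSketch {

    static List<String> availableLocations(
            Collection<CompletableFuture<String>> locationFutures) {
        return locationFutures.stream()
                // keep futures that completed normally; drop pending,
                // canceled, and exceptionally completed ones
                .filter(f -> f.isDone() && !f.isCompletedExceptionally())
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        CompletableFuture<String> done = CompletableFuture.completedFuture("host-a");
        CompletableFuture<String> pending = new CompletableFuture<>();
        CompletableFuture<String> canceled = new CompletableFuture<>();
        canceled.cancel(false); // cancellation counts as exceptional completion

        List<String> locations = availableLocations(List.of(done, pending, canceled));
        System.out.println(locations); // prints [host-a]
    }
}
```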
cc @azagrebin
A somewhat related question concerning this change is posted here.
Regarding whether we should put v1 and v3 in the same `SlotSharingGroup`:
Thanks for the thoughtful comments, @tillrohrmann. Below are replies to each of the listed questions:
In summary, here are the actions to take:

WDYT?
Hi @zhuzhurk, thanks for answering my questions so swiftly. I agree with your proposal to fix the problem with the 3 steps.
I generally agree with the proposed steps. We should try to take into account the locations of all available producers for a given `ExecutionSlotSharingGroup`. I have a note about the slot sharing assignment: although this assignment simplifies things, the v11 and v31 executions cannot run at the same time if there is only one slot.
@azagrebin I feel that it is not really necessary to put v1 and v3 in the same `SlotSharingGroup`. So I think assigning each logical pipelined region to a different `SlotSharingGroup` is enough and simpler.
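The simpler assignment described above, one `SlotSharingGroup` per logical pipelined region, can be illustrated with a small sketch (hypothetical names, not Flink's actual code): vertices separated by a blocking exchange fall into different regions and therefore never share a slot.

```java
import java.util.*;

// Hypothetical sketch (not Flink's actual code): assign each logical
// pipelined region its own slot sharing group, rather than merging regions
// connected by blocking edges into one group.
public class RegionSlotSharingSketch {
    public static void main(String[] args) {
        // v1 -> v2 (pipelined), v2 -> v3 (blocking), v3 -> v4 (pipelined)
        // yields two logical pipelined regions: {v1, v2} and {v3, v4}.
        List<Set<String>> pipelinedRegions =
                List.of(Set.of("v1", "v2"), Set.of("v3", "v4"));

        // Give every vertex the group id of its region.
        Map<String, Integer> vertexToGroup = new HashMap<>();
        int groupId = 0;
        for (Set<String> region : pipelinedRegions) {
            for (String vertex : region) {
                vertexToGroup.put(vertex, groupId);
            }
            groupId++;
        }

        // v1 and v3 end up in different groups, so they never share a slot.
        System.out.println(!vertexToGroup.get("v1").equals(vertexToGroup.get("v3"))); // prints true
    }
}
```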
I meant the execution speed-up from running v1 and v3 at the same time in the same slot (this makes sense without dynamic slot allocation, which is mostly the case now).
This is indeed not an issue when we have dynamic slot allocation.
Agreed, it is generally hard for Flink to decide which regions can run together in batch.
@zhuzhurk and @azagrebin, have you already created the follow-up issues?
@tillrohrmann I just opened FLINK-19712 and FLINK-19714 to track them.
Closing in favour of #13730.
What is the purpose of the change

This PR fixes the issue reported in FLINK-19552: the JM can crash if concurrent failures happen. The root cause is that `MergingSharedSlotProfileRetrieverFactory` incorrectly retrieves preferred locations for vertices which are not scheduled yet, while some of their producer vertices are canceled, and so are their location futures.

Verifying this change

This change added tests and can be verified as follows:

Does this pull request potentially affect one of the following parts:

`@Public(Evolving)`: (yes / no)

Documentation