[FLINK-23806][runtime] Avoid StackOverflowException when a large scale job failed to acquire enough slots in time #16842
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
Automated Checks
Last check on commit bd4e67d (Mon Aug 16 09:40:12 UTC 2021)
Warnings:
Mention the bot in a comment to re-run the automated checks.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process.
Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:
@flinkbot run azure
The failing test case in the second test run is most likely unrelated: https://issues.apache.org/jira/browse/FLINK-23829.
tillrohrmann left a comment
Thanks for creating this PR @zhuzhurk. LGTM. +1 for merging into master, 1.13 and 1.12
Thanks for reviewing! @tillrohrmann
I will try to get a green CI run before merging.
@flinkbot run azure
…e job failed to acquire enough slots in time
bd4e67d to 1951f91
…e job failed to acquire enough slots in time This closes apache#16842.
What is the purpose of the change
This fix avoids a stack overflow (StackOverflowError) that leads to a JobManager (JM) crash.
When requested slots are not fulfilled in time, a task failure is triggered and all related tasks are canceled and restarted. During this process, if a task has already been assigned a slot, that slot is returned to the slot pool and is immediately used to fulfill the pending slot requests of tasks that are about to be canceled. The execution versions of those tasks have already been bumped in DefaultScheduler#restartTasksWithDelay(...), so the assignment fails immediately, the slot is returned to the slot pool, and it is again used to fulfill the next pending slot request. Because each of these steps happens inside the previous one, the call stack grows with the number of pending requests. With many vertices this can overflow the stack, which is a fatal error and crashes the JM.
To fix the problem, this PR cancels the pending requests of all tasks that will soon be canceled (i.e. tasks whose version has been bumped) before canceling the tasks themselves.
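The loop described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class SlotReuseSketch and all of its members are hypothetical and are not Flink's scheduler or slot pool API. It assumes that returning a slot offers it to the next pending request synchronously, and it counts the call depth instead of actually overflowing the stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model only: all names here are hypothetical, not Flink's actual classes. */
final class SlotReuseSketch {

    /** A pending slot request, tagged with the task version it was created for. */
    static final class PendingRequest {
        final int taskId;
        final long version;

        PendingRequest(int taskId, long version) {
            this.taskId = taskId;
            this.version = version;
        }
    }

    private final Deque<PendingRequest> pending = new ArrayDeque<>();
    private final long[] currentVersion;
    private int depth = 0;
    private int maxDepth = 0;

    SlotReuseSketch(int tasks) {
        currentVersion = new long[tasks];
        for (int i = 0; i < tasks; i++) {
            pending.add(new PendingRequest(i, 0L));
        }
    }

    /** Models the version bump that precedes a restart; old requests become stale. */
    void bumpAllVersions() {
        for (int i = 0; i < currentVersion.length; i++) {
            currentVersion[i]++;
        }
    }

    /** Models the fix: drop pending requests of tasks whose version was bumped. */
    void cancelStalePendingRequests() {
        pending.removeIf(r -> r.version != currentVersion[r.taskId]);
    }

    /** Returning a slot offers it to the next pending request in the same call stack. */
    void returnSlot() {
        depth++;
        maxDepth = Math.max(maxDepth, depth);
        PendingRequest next = pending.poll();
        if (next != null && next.version != currentVersion[next.taskId]) {
            // The request is stale, so the assignment fails and the slot
            // is returned again, one stack frame deeper.
            returnSlot();
        }
        depth--;
    }

    /** Runs the scenario and reports the deepest nesting of returnSlot() calls. */
    static int maxDepth(int tasks, boolean cancelFirst) {
        SlotReuseSketch pool = new SlotReuseSketch(tasks);
        pool.bumpAllVersions();
        if (cancelFirst) {
            pool.cancelStalePendingRequests();
        }
        pool.returnSlot();
        return pool.maxDepth;
    }

    public static void main(String[] args) {
        System.out.println("without cancel-first: depth " + maxDepth(1000, false));
        System.out.println("with cancel-first:    depth " + maxDepth(1000, true));
    }
}
```

In this model the nesting depth grows linearly with the number of stale pending requests, while canceling them first keeps the depth constant. That is the intuition behind canceling pending requests before canceling the tasks.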
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation