-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-15936] Harden TaskExecutorTest#testSlotAcceptance #11667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The test called taskSlotTable.allocateSlot from the test main thread, concurrently with taskSlotTable.createSlotReport while trying to register RM in the main TM thread. This silently failed the RM registration in TM.runAsync. As a result, RM.notifySlotAvailable was not called in TM. The commmit refactors the test to wait properly for RM registration and allocate slots through gateway in TM thread. The commit also uses proper testing RM/JM instead of mocks. To verify, the test has been looped locally 30k times.
|
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit 475a7a5 (Wed Apr 08 07:45:21 UTC 2020) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. DetailsThe Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
zentol
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the calls to taskSlotTable#tryMarkSlotActive? Similar checks were removed in FLINK-14742.
flink-runtime/src/test/java/org/apache/flink/runtime/taskexecutor/TaskExecutorTest.java
Outdated
Show resolved
Hide resolved
|
Other calls technically happen when the state is synced with the test thread. |
| // submit the task without having acknowledge the offered slots | ||
| return Either.Left(tmGateway.submitTask(tdd, jobMasterGateway.getFencingToken(), timeout).join()); | ||
| } catch (CompletionException e) { | ||
| return Either.Right(e.getCause()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just let the cause bubble up, possibly enriching it?
You could then use assertThrows for checking the exception.
| .build(); | ||
|
|
||
| try { | ||
| // submit the task without having acknowledge the offered slots |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't necessarily true is it?
zentol
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
The test called taskSlotTable.allocateSlot from the test main thread, concurrently with taskSlotTable.createSlotReport while trying to register RM in the main TM thread. This silently failed the RM registration in TM.runAsync. As a result, RM.notifySlotAvailable was not called in TM. The commmit refactors the test to wait properly for RM registration and allocate slots through gateway in TM thread. The commit also uses proper testing RM/JM instead of mocks. To verify, the test has been looped locally 30k times. This closes #11667.
What is the purpose of the change
Concurrent
TaskSlotTableaccess in the unstable testThe test called
taskSlotTable.allocateSlotfrom the test main thread,concurrently with
taskSlotTable.createSlotReportwhile trying to registerRM in the main TM thread. This silently failed the RM registration in
TM.runAsync.As a result, RM.notifySlotAvailable was not called in TM.
The taskSlotTable is not thread-safe and must be accessed only from the main RPC thread of TM.
Failure in
establishResourceManagerConnectionFailure of
TaskExecutor#establishResourceManagerConnectionis not expected.It completely breaks the connection mechanism to RM in TM.
As a hotfix, the PR suggests to log it on error level at least.
Alternatively, we can consider calling
TM.onFatalErroras a follow-up.Brief change log
The PR refactors the test to wait properly for RM registration
and allocate slots through gateway in TM thread.
The PR also uses proper testing RM/JM instead of mocks.
Verifying this change
To verify, the test has been looped locally 30k times.