[FLINK-23417][tests] Harden MiniClusterITCase.testHandleBatchJobsWhenNotEnoughSlot #16543
…ndalone.start-up-time
…NotEnoughSlot This commit hardens the MiniClusterITCase.testHandleBatchJobsWhenNotEnoughSlot by configuring a higher ResourceManagerOptions.STANDALONE_CLUSTER_STARTUP_PERIOD_TIME than JobManagerOptions.SLOT_REQUEST_TIMEOUT. This ensures that the slot request times out instead of being failed by the SlotManager.
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
Automated Checks
Last check on commit 012ae4a (Sat Aug 28 12:22:38 UTC 2021)
Warnings:
Mention the bot in a comment to re-run the automated checks.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process.
Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:
nicoweidner
left a comment
The change looks good, I think. However, I may need some explanations...
- I couldn't reproduce the failure without the change. I ran 3k executions locally - is there some way to reproduce?
- I changed the config to the following, expecting the test to fail because the other exception should be thrown, but it didn't. It still saw the slot request timeout exception after 10s:
configuration.setLong(JobManagerOptions.SLOT_REQUEST_TIMEOUT, 10000L);
configuration.setLong(ResourceManagerOptions.STANDALONE_CLUSTER_STARTUP_PERIOD_TIME, 50L);
- I even increased parallelism and added vertices, to no effect
- I believe during testing I once got the UnfulfillableSlotRequestException from https://github.com/apache/flink/blob/release-1.12/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/SlotManagerImpl.java#L970-L972, but I couldn't reproduce it and it still passed the test (maybe because some other slot requests had produced the slot timeout exception)
Am I lacking some fundamental understanding here?
configuration.setLong(
        ResourceManagerOptions.STANDALONE_CLUSTER_STARTUP_PERIOD_TIME, 10000L);
The premise of the test confuses me: We run a job with parallelism 2 on a cluster that has only 1 slot available. Shouldn't we expect to get the "request not fulfillable" failure instead of the slot request timeout? To test the timeout, wouldn't it make more sense to have a request that would be fulfillable, but slots are blocked with another task?
The test verifies that batch slot requests are timed out on the JobMaster side via the slot request timeout. Since they are batch slot requests and there is at least one slot that can fulfill all slot requests, they won't be failed on the ResourceManager side. This, however, only works if the TaskExecutor registers within the STANDALONE_CLUSTER_STARTUP_PERIOD_TIME, because otherwise there is no slot registered on the RM that can fulfill the slot requests.
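The race described here can be sketched as a small decision function. This is a deliberately simplified toy model, not Flink code: the class `SlotRequestRaceModel`, its `Outcome` enum, and the `registrationDelayMs` parameter are hypothetical names introduced only for illustration, and the real SlotManager logic has more states than this.

```java
/** Hypothetical toy model of the race discussed in this thread; not Flink code. */
final class SlotRequestRaceModel {

    /** The two failure modes a batch slot request can end in, per the discussion. */
    enum Outcome {
        SLOT_REQUEST_TIMEOUT,     // timed out on the JobMaster side (what the test expects)
        FAILED_AS_UNFULFILLABLE   // failed by the SlotManager on the ResourceManager side
    }

    /**
     * @param slotRequestTimeoutMs assumed value of JobManagerOptions.SLOT_REQUEST_TIMEOUT
     * @param startupPeriodMs      assumed value of ResourceManagerOptions.STANDALONE_CLUSTER_STARTUP_PERIOD_TIME
     * @param registrationDelayMs  hypothetical time until the TaskExecutor registers its slot
     */
    static Outcome outcome(long slotRequestTimeoutMs, long startupPeriodMs, long registrationDelayMs) {
        // If the TaskExecutor registers inside the startup period, the RM knows a slot
        // that could fulfill the batch requests, so it does not fail them; the
        // JobMaster-side timeout is what eventually fires.
        if (registrationDelayMs < startupPeriodMs) {
            return Outcome.SLOT_REQUEST_TIMEOUT;
        }
        // Otherwise no slot is registered when the startup period ends. Whichever
        // timer elapses first decides which failure the job sees.
        return startupPeriodMs < slotRequestTimeoutMs
                ? Outcome.FAILED_AS_UNFULFILLABLE
                : Outcome.SLOT_REQUEST_TIMEOUT;
    }
}
```

Under this model, a startup period of 50 ms with a 10 s slot request timeout still yields the timeout as long as the TaskExecutor registers within 50 ms, which would be consistent with the failure being hard to reproduce locally. Choosing a startup period that is not shorter than the slot request timeout removes the dependency on registration speed.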
Thanks for the explanations! I got confused by:
- Not understanding properly how slot requests are handled for batch jobs
- Spending too much time in breakpoints while debugging, causing a different exception to be thrown :D (because timeout had elapsed in the meantime...)
One further comment: Seeing as the real failure reason is not visible in the test output, maybe it would be a good idea to add the actual exception (including the cause chain) to the log output in case the test fails?
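One way to surface the cause chain is to flatten it into a single string that can go into an assertion message or a log line. The following is only a sketch of that idea using plain JDK calls; `CauseChainFormatter` is a hypothetical helper name and is not part of the Flink test utilities or of this PR.

```java
/** Hypothetical helper that renders an exception and its causes on one line. */
final class CauseChainFormatter {

    // Upper bound on the number of links followed, to guard against cyclic cause chains.
    private static final int MAX_DEPTH = 50;

    static String format(Throwable failure) {
        StringBuilder sb = new StringBuilder();
        Throwable current = failure;
        int depth = 0;
        while (current != null && depth < MAX_DEPTH) {
            if (depth > 0) {
                sb.append(" <- caused by: ");
            }
            // Simple class name plus message is usually enough to tell a slot request
            // timeout apart from an unfulfillable-request failure.
            sb.append(current.getClass().getSimpleName())
              .append(": ")
              .append(current.getMessage());
            current = current.getCause();
            depth++;
        }
        return sb.toString();
    }
}
```

A failing assertion could then include `CauseChainFormatter.format(exception)` in its message, so the underlying failure reason shows up directly in the test output.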
In order to reproduce the problem you have to delay the registration of the
Irrelevant now - thanks for explaining to me where to find the interesting logs @tillrohrmann