
Conversation

@tillrohrmann
Contributor

This commit hardens the MiniClusterITCase.testHandleBatchJobsWhenNotEnoughSlot by configuring a higher
ResourceManagerOptions.STANDALONE_CLUSTER_STARTUP_PERIOD_TIME than JobManagerOptions.SLOT_REQUEST_TIMEOUT.
This ensures that the slot request times out instead of being failed by the SlotManager.
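
For illustration, a minimal sketch of the resulting test configuration (the SLOT_REQUEST_TIMEOUT value below is an assumption for the example; the startup period of 10000L matches the diff quoted further down):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.JobManagerOptions;
import org.apache.flink.configuration.ResourceManagerOptions;

// Sketch only: the startup period must exceed the slot request timeout so the
// request times out on the JobMaster before the SlotManager can fail it.
Configuration configuration = new Configuration();
configuration.setLong(JobManagerOptions.SLOT_REQUEST_TIMEOUT, 100L);
configuration.setLong(
        ResourceManagerOptions.STANDALONE_CLUSTER_STARTUP_PERIOD_TIME, 10000L);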

@flinkbot
Collaborator

flinkbot commented Jul 20, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 012ae4a (Sat Aug 28 12:22:38 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Jul 20, 2021

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build


@nicoweidner left a comment


The change looks good, I think. However, I may need some explanations...

  • I couldn't reproduce the failure without the change. I ran 3k executions locally - is there some way to reproduce?
  • I changed the config to the following, expecting the test to fail because the other exception should be thrown, but it didn't. It still saw the slot request timeout exception after 10s:
configuration.setLong(JobManagerOptions.SLOT_REQUEST_TIMEOUT, 10000L);
configuration.setLong(ResourceManagerOptions.STANDALONE_CLUSTER_STARTUP_PERIOD_TIME, 50L);

Am I lacking some fundamental understanding here?

Comment on lines +163 to +164
configuration.setLong(
ResourceManagerOptions.STANDALONE_CLUSTER_STARTUP_PERIOD_TIME, 10000L);


The premise of the test confuses me: We run a job with parallelism 2 on a cluster that has only 1 slot available. Shouldn't we expect to get the "request not fulfillable" failure instead of the slot request timeout? To test the timeout, wouldn't it make more sense to have a request that would be fulfillable, but slots are blocked with another task?

@tillrohrmann
Contributor Author


The test verifies that batch slot requests are timed out on the JobMaster side via the slot request timeout. Since they are batch slot requests and there is at least one slot that can fulfill all slot requests, they won't be failed on the ResourceManager side. This, however, only works if the TaskExecutor registers within the STANDALONE_CLUSTER_STARTUP_PERIOD_TIME because otherwise there is no slot registered at the RM that can fulfill the slot requests.

@nicoweidner

Thanks for the explanations! I got confused by:

  • Not understanding properly how slot requests are handled for batch jobs
  • Spending too much time in breakpoints while debugging, causing a different exception to be thrown :D (because the timeout had elapsed in the meantime...)

@nicoweidner

One further comment: Seeing as the real failure reason is not visible in the test output, maybe it would be a good idea to add the actual exception (including the cause chain) to the log output in case the test fails?
This may apply to (many many...) other tests as well though :-D
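
As a sketch of that suggestion (the logger, the job submission call, and the fixtures miniCluster/jobGraph are assumptions here, not the actual test code), the cause chain could be surfaced with Flink's ExceptionUtils:

import org.apache.flink.runtime.client.JobExecutionException;
import org.apache.flink.util.ExceptionUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static org.junit.Assert.fail;

private static final Logger LOG = LoggerFactory.getLogger(MiniClusterITCase.class);

// Hypothetical failure handling inside the test method; miniCluster and
// jobGraph are fixtures assumed to exist in the surrounding test.
try {
    miniCluster.executeJobBlocking(jobGraph);
    fail("Expected the job to fail with a slot request timeout.");
} catch (JobExecutionException e) {
    // Log the complete cause chain so the real failure reason shows up in the test output.
    LOG.error("Job failed:\n{}", ExceptionUtils.stringifyException(e));
    throw e;
}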

@tillrohrmann
Contributor Author

In order to reproduce the problem you have to delay the registration of the TaskExecutor a bit. You could do this by adding Thread.sleep(1000L) to the TaskExecutor.onStart() method. That way you can reproduce the race condition between the TaskExecutor registration and the arrival of the slot request that would fail after the startup period is over. I hope this makes the problem a bit clearer.
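
For example, a sketch of that kind of delay (the rest of the onStart() body is elided; this is a debugging aid to reproduce the race, not part of the fix):

// Hypothetical modification inside org.apache.flink.runtime.taskexecutor.TaskExecutor:
@Override
public void onStart() throws Exception {
    // Artificial delay so that the TaskExecutor registration races with the
    // incoming slot request and the expiry of the startup period.
    Thread.sleep(1000L);

    // ... original onStart() logic follows here ...
}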

@nicoweidner

> One further comment: Seeing as the real failure reason is not visible in the test output, maybe it would be a good idea to add the actual exception (including the cause chain) to the log output in case the test fails?
> This may apply to (many many...) other tests as well though :-D

Irrelevant now - thanks for explaining to me where to find the interesting logs @tillrohrmann

@zentol zentol self-assigned this Jul 26, 2021
@zentol zentol merged commit f07bfa6 into apache:release-1.12 Jul 26, 2021