
Conversation

@tillrohrmann
Contributor

This commit hardens the MiniClusterITCase.testHandleBatchJobsWhenNotEnoughSlot by configuring a higher
ResourceManagerOptions.STANDALONE_CLUSTER_STARTUP_PERIOD_TIME than JobManagerOptions.SLOT_REQUEST_TIMEOUT.
This ensures that the slot request times out instead of being failed by the SlotManager.
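
For illustration, a minimal sketch of the resulting test configuration (the SLOT_REQUEST_TIMEOUT value below is an assumption for the example; the startup period of 10000L matches the diff quoted further down):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.JobManagerOptions;
import org.apache.flink.configuration.ResourceManagerOptions;

// Sketch only: the startup period must exceed the slot request timeout so the
// request times out on the JobMaster before the SlotManager can fail it.
Configuration configuration = new Configuration();
configuration.setLong(JobManagerOptions.SLOT_REQUEST_TIMEOUT, 100L);
configuration.setLong(
        ResourceManagerOptions.STANDALONE_CLUSTER_STARTUP_PERIOD_TIME, 10000L);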

@flinkbot
Collaborator

flinkbot commented Jul 20, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 012ae4a (Sat Aug 28 12:22:38 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Jul 20, 2021

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build


@nicoweidner left a comment


The change looks good, I think. However, I may need some explanations...

  • I couldn't reproduce the failure without the change. I ran 3k executions locally - is there some way to reproduce?
  • I changed the config to the following, expecting the test to fail because the other exception should be thrown, but it didn't. It still saw the slot request timeout exception after 10s:
configuration.setLong(JobManagerOptions.SLOT_REQUEST_TIMEOUT, 10000L);
configuration.setLong(ResourceManagerOptions.STANDALONE_CLUSTER_STARTUP_PERIOD_TIME, 50L);

Am I lacking some fundamental understanding here?

Comment on lines +163 to +164
configuration.setLong(
ResourceManagerOptions.STANDALONE_CLUSTER_STARTUP_PERIOD_TIME, 10000L);


The premise of the test confuses me: We run a job with parallelism 2 on a cluster that has only 1 slot available. Shouldn't we expect to get the "request not fulfillable" failure instead of the slot request timeout? To test the timeout, wouldn't it make more sense to have a request that would be fulfillable, but slots are blocked with another task?

@tillrohrmann
Contributor Author


The test verifies that batch slot requests are timed out on the JobMaster side via the slot request timeout. Since they are batch slot requests and there is at least one slot that can fulfill all slot requests, they won't be failed on the ResourceManager side. This, however, only works if the TaskExecutor registers within the STANDALONE_CLUSTER_STARTUP_PERIOD_TIME because otherwise there is no slot registered at the RM that can fulfill the slot requests.

@nicoweidner

Thanks for the explanations! I got confused by:

  • Not understanding properly how slot requests are handled for batch jobs
  • Spending too much time in breakpoints while debugging, causing a different exception to be thrown :D (because the timeout had elapsed in the meantime...)

@nicoweidner

One further comment: Seeing as the real failure reason is not visible in the test output, maybe it would be a good idea to add the actual exception (including the cause chain) to the log output in case the test fails?
This may apply to (many many...) other tests as well though :-D
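
As a sketch of that suggestion (the logger, the job submission call, and the fixtures miniCluster/jobGraph are assumptions here, not the actual test code), the cause chain could be surfaced with Flink's ExceptionUtils:

import org.apache.flink.runtime.client.JobExecutionException;
import org.apache.flink.util.ExceptionUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static org.junit.Assert.fail;

private static final Logger LOG = LoggerFactory.getLogger(MiniClusterITCase.class);

// Hypothetical failure handling inside the test method; miniCluster and
// jobGraph are fixtures assumed to exist in the surrounding test.
try {
    miniCluster.executeJobBlocking(jobGraph);
    fail("Expected the job to fail with a slot request timeout.");
} catch (JobExecutionException e) {
    // Log the complete cause chain so the real failure reason shows up in the test output.
    LOG.error("Job failed:\n{}", ExceptionUtils.stringifyException(e));
    throw e;
}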

@tillrohrmann
Contributor Author

In order to reproduce the problem you have to delay the registration of the TaskExecutor a bit. You could do this by adding Thread.sleep(1000L) to the TaskExecutor.onStart() method. That way you can reproduce the race condition between the TaskExecutor registration and the arrival of the slot request that would fail after the startup period is over. I hope this makes the problem a bit clearer.
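
For example, a sketch of that kind of delay (the rest of the onStart() body is elided; this is a debugging aid to reproduce the race, not part of the fix):

// Hypothetical modification inside org.apache.flink.runtime.taskexecutor.TaskExecutor:
@Override
public void onStart() throws Exception {
    // Artificial delay so that the TaskExecutor registration races with the
    // incoming slot request and the expiry of the startup period.
    Thread.sleep(1000L);

    // ... original onStart() logic follows here ...
}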

@nicoweidner

> One further comment: Seeing as the real failure reason is not visible in the test output, maybe it would be a good idea to add the actual exception (including the cause chain) to the log output in case the test fails?
> This may apply to (many many...) other tests as well though :-D

Irrelevant now - thanks for explaining to me where to find the interesting logs @tillrohrmann

@zentol zentol self-assigned this Jul 26, 2021
@zentol zentol merged commit f07bfa6 into apache:release-1.12 Jul 26, 2021