Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-12015] Fix TaskManagerRunnerTest instability #8053

Closed

Conversation

aljoscha
Copy link
Contributor

Before, the was a race condition between the termination future in
TaskManagerRunner completing and the asynchronous shutdown part here:
https://github.com/apache/flink/blob/70107c4647ecac3df9b2b8c7920e7cb99ad550f1/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L258

The test would go out of the block that was waiting on the future but
the shutdown code that is executed after the future completes is
executed asynchronously, so is not guaranteed to have run at that point.

This also refactors the code a bit to make it more obvious what is
happening and removes the SecurityManagerContext because it was
obscuring the problem.

Verifying this change

This change is already covered by existing tests, such as (please describe tests).

Before, the was a race condition between the termination future in
TaskManagerRunner completing and the asynchronous shutdown part here:
https://github.com/apache/flink/blob/70107c4647ecac3df9b2b8c7920e7cb99ad550f1/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L258

The test would go out of the block that was waiting on the future but
the shutdown code that is executed after the future completes is
executed asynchronously, so is not guaranteed to have run at that point.

This also refactors the code a bit to make it more obvious what is
happening and removes the SecurityManagerContext because it was
obscuring the problem.

This was analyzed by Igal and me, and mostly fixed by Igal.
@aljoscha aljoscha force-pushed the flink-12015-fix-taskmanagerrunnertest branch from 04783d9 to 19d56f2 Compare March 26, 2019 16:51
@flinkbot
Copy link
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for hardening the test @aljoscha. This is a good fix. I would try to avoid busy looping by letting the SystemExitTrackingSecurityManager return a future which is completed once System.exit is called.


eventually(() -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of busy looping, I would prefer to change the SystemExitTrackingSecurityManager to return a CompletableFuture<Integer> which is completed with the exit code with which System.exit(exitCode) is called.

@aljoscha
Copy link
Contributor Author

@tillrohrmann I updated the PR. Btw, this test only ensures that we have at least one call to System.exit(). This is also true for the current version on master but this fact is hidden.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me. Thanks for the hardening of this test @aljoscha. +1 for merging.

@aljoscha aljoscha closed this Mar 30, 2019
@aljoscha aljoscha deleted the flink-12015-fix-taskmanagerrunnertest branch March 30, 2019 12:27
@aljoscha
Copy link
Contributor Author

Thanks! I'll merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants