[FLINK-27382][runtime] Moves cluster shutdown to when the job cleanup is done in job mode #19567

Merged: 2 commits, Apr 28, 2022

XComp (Contributor) commented Apr 25, 2022

What is the purpose of the change

In job mode, we triggered the cluster shutdown as soon as the job reached a globally terminal state. This was fine in 1.14 and earlier because we didn't make any promises about the cleanup anyway. With 1.15, we introduced retries for the cleanup, which results in the final termination taking longer. During cluster shutdown, the ResourceManager is informed that the cluster is being deregistered, which results in the workers being shut down in the case of active RMs (i.e. k8s and YARN). This PR therefore moves the cluster shutdown to the point where the job cleanup is done. See FLINK-26772 (the parent issue of this issue) for further details.

Brief change log

  • Removed the override of Dispatcher#jobReachedTerminalState in MiniDispatcher
  • Introduced a new method that is called when the job has reached a globally terminal state; MiniDispatcher implements this method
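
The change log above can be illustrated with a minimal sketch. It is not the code from this PR: `SketchDispatcher`, `SketchMiniDispatcher`, `cleanUpJob` and `onJobCompletelyFinished` are hypothetical names, and the assumption (taken from the PR title) is that the new hook only fires once the cleanup future has completed.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for a dispatcher base class; not Flink's actual Dispatcher. */
abstract class SketchDispatcher {

    /**
     * Called when a job reaches a globally terminal state. The cleanup is started
     * first; the hook runs only after the cleanup future has completed.
     */
    final CompletableFuture<Void> jobReachedTerminalState(String jobId) {
        return cleanUpJob(jobId).thenRun(() -> onJobCompletelyFinished(jobId));
    }

    /** Hypothetical cleanup step that may take a while, e.g. because of retries. */
    abstract CompletableFuture<Void> cleanUpJob(String jobId);

    /** Hypothetical hook; a no-op unless a subclass overrides it. */
    void onJobCompletelyFinished(String jobId) {}
}

/** Hypothetical job-mode dispatcher that triggers the shutdown through the hook. */
class SketchMiniDispatcher extends SketchDispatcher {
    // Completed from the outside to simulate a slow cleanup.
    final CompletableFuture<Void> cleanup = new CompletableFuture<>();
    // Completed with the job ID once the shutdown has been triggered.
    final CompletableFuture<String> shutdown = new CompletableFuture<>();

    @Override
    CompletableFuture<Void> cleanUpJob(String jobId) {
        return cleanup;
    }

    @Override
    void onJobCompletelyFinished(String jobId) {
        shutdown.complete(jobId);
    }
}
```

In this sketch the subclass no longer overrides `jobReachedTerminalState` itself, so it cannot bypass the cleanup: `shutdown` stays incomplete until `cleanup.complete(null)` is called.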

Verifying this change

  • I extended existing tests to verify that the shutdown happens after the cleanup
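
As a rough idea of what such an ordering check can look like, here is a self-contained sketch. It does not reproduce the tests extended in this PR; `ShutdownOrderCheck` and the event names are made up for illustration, and plain `CompletableFuture` chaining stands in for the dispatcher logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical test helper; none of these names come from Flink's test suite. */
class ShutdownOrderCheck {

    /** Simulates a job termination and returns the events in the order they occurred. */
    static List<String> runScenario() {
        List<String> events = new ArrayList<>();
        CompletableFuture<Void> cleanupDone = new CompletableFuture<>();

        // The shutdown is chained onto the cleanup future, so it cannot run earlier.
        CompletableFuture<Void> shutdownDone =
                cleanupDone.thenRun(() -> events.add("shutdown"));

        events.add("job-terminated");
        // The cleanup is still pending, so the shutdown must not have been triggered.
        if (shutdownDone.isDone()) {
            throw new IllegalStateException("shutdown ran before cleanup");
        }

        events.add("cleanup-finished");
        cleanupDone.complete(null); // runs the chained shutdown in this thread
        return events;
    }

    public static void main(String[] args) {
        // prints [job-terminated, cleanup-finished, shutdown]
        System.out.println(runScenario());
    }
}
```

The check has two parts: the shutdown future is still pending while the cleanup is outstanding, and it completes once the cleanup future completes.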

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

flinkbot (Collaborator) commented Apr 25, 2022

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

XComp (Contributor, Author) commented Apr 26, 2022

The force-push also included a rebase to resolve conflicts with master.

XComp (Contributor, Author) commented Apr 26, 2022

I added a hotfix commit in addition to the actual change. PTAL.

XComp (Contributor, Author) commented Apr 27, 2022

Force-pushed to fix a Spotless error.

XComp force-pushed the FLINK-27382 branch 2 times, most recently from b0874a8 to 568d79f (April 28, 2022 12:10)
XComp (Contributor, Author) commented Apr 28, 2022

Fixed compilation error and force-pushed...

…unner result completes exceptionally

Additionally, I moved the test's documentation into the
production code because it makes more sense to have the
reasoning over there.
XComp merged commit d940af6 into apache:master on Apr 28, 2022