forked from apache/flink
-
Notifications
You must be signed in to change notification settings - Fork 1
[FLINK-25432] Introduces ResourceCleaner that gets triggered for finished jobs that are not cleaned up, yet, according to the JobResultStore #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
df7c6be
to
1c5508a
Compare
71c4ff4
to
1ee0e26
Compare
e307bea
to
1712b69
Compare
c591b43
to
c98cf4a
Compare
7a2dce0
to
95ea415
Compare
003c378
to
d9dd04d
Compare
133e036
to
4827de5
Compare
4787b66
to
34089ff
Compare
… GloballyCleanableResource
…leResource and GloballyCleanableResource
…esource and GloballyCleanableResource
Moves JobManagerRunnerRegistry instantiation into separate constructor. The purpose of this change is to improve the testability of the code.
…egrates it into Dispatcher Additionally, testHABlobsAreNotRemovedIfHAJobGraphRemovalFails is deleted because this case doesn't apply anymore. The job graph deletion and the HA file deletion happen independently from each other now: The JobResult is already added to the JobResultStore as a dirty entry and all data can be deleted. There is a dependency between cleaning up the JobManagerRegistry (i.e. closing the JobMaster) and cleaning up the HighAvailabilityServices for the job. This issue is solved by a JobManager-wide leader election (FLINK-24038). But we have to consider closing the JobMaster before cleaning the HA job data until the legacy leader election functionality is removed entirely.
…eric We need to extend the TestingCompletedCheckpointStore to enable triggering an Exception in the shutdown process on execution. This selective requirement could have been made with less changes. Instead, I decided to provide the most generic Testing* implementation offering a proper TestingCompletedCheckpointStore implementation which makes it more future-proof.
We need to extend the TestingCheckpointIDCounter to enable triggering an Exception in the shutdown process on execution. This selective requirement could have been made with less changes. Instead, I decided to provide the most generic Testing* implementation offering a proper TestingCheckpointIDCounter implementation which makes it more future-proof.
…neric createSparseArchivedExecutionGraph We will need to create the ArchivedExecutionGraph skeleton also in the cleanup JobManagerRunner. The renaming is necessary to align with the usage of this method.
…xtraction into utility method The number of retained checkpoints needs to be extracted in the Scheduler and the cleanup process. Therefore, the extraction logic is moved from SchedulerUtils into DefaultCompletedCheckpointStoreUtils. Additionally, DefaultCompletedCheckpointStoreUtils is renamed into CompletedCheckpointStoreUtils since its utility methods are related to the interface, not the implementation.
We need to provide a reverse search to retrieve the JobStatus from the ApplicationStatus that is provided by the JobResult. This change is straight-forward because we can expect jobs being finished that end up in the JobResultStore. Therefore, we only have to provide a transition from ApplicationStatus to JobStatus for values we there's a symmetric mapping possible.
The intention is to unify the TestingDispatcherBuilder logic used in AbstractDispatcherTest and DispatcherResourceCleaner DispatcherResourceCleanupTest
Adds CleanupRunnerFactory and an implementing class. Efforts about adding the job name are out-sourced into FLINK-25632.
…is created properly by the Dispatcher
…CleanupTest Initially, the BlobServer cleanup with implicitly tested in DispatcherResourceCleanupTest. I moved the tests into BlobServerCleanupTest and added additional test cases.
…spatcherResourceCleanupTest
Close this PR in favor of Flink PR #18536 |
XComp
pushed a commit
that referenced
this pull request
Oct 21, 2022
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What is the purpose of the change
Integrates
ResourceCleaner
intoDispatcher
Brief change log
GloballyCleanableResource
andLocallyCleanableResource
interfacesJobManagerRunnerRegistry
to implement the*CleanableResource
interfaceJobGraphWriter
,BlobServer
,JobManagerMetricGroup
andHighAvailabilityServices
implement/extend the*CleanableReource
interfacesResourceCleaner
andResourceCleanerFactory
that provides aResourceCleaner
for local and one for global cleanupCheckpointResourcesCleanupRunner
implementingJobManagerRunner
CheckpointResourcesCleanupRunner
along theJobMasterLeadershipRunner
in theDispatcher
FileSystemBlobStore#delete
that didn't return the right value if the resource that shall be deleted doesn't existSee individual commits for change logs
Verifying this change
FileSystemBlobStoreTest
ResourceCleanerTest
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: noDocumentation