Skip to content

Conversation

XComp
Copy link
Owner

@XComp XComp commented Dec 23, 2021

What is the purpose of the change

Integrates ResourceCleaner into Dispatcher

Brief change log

  • Introduces GloballyCleanableResource and LocallyCleanableResource interfaces
  • Introduces JobManagerRunnerRegistry to implement the *CleanableResource interface
  • Make JobGraphWriter, BlobServer, JobManagerMetricGroup and HighAvailabilityServices implement/extend the *CleanableReource interfaces
  • Introduces ResourceCleaner and ResourceCleanerFactory that provides a ResourceCleaner for local and one for global cleanup
  • Introduces CheckpointResourcesCleanupRunner implementing JobManagerRunner
  • Integrates CheckpointResourcesCleanupRunner along the JobMasterLeadershipRunner in the Dispatcher
  • Fixes inconsistent implementation of FileSystemBlobStore#delete that didn't return the right value if the resource that shall be deleted doesn't exist

See individual commits for change logs

Verifying this change

  • Adds FileSystemBlobStoreTest
  • Adds ResourceCleanerTest
  • TODO: [...]

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? TODO (not applicable / docs / JavaDocs / not documented)

@XComp XComp changed the title [FLINK-25432] [FLINK-25432] Introduces ResourceCleaner that gets triggered for finished jobs that are not cleaned up, yet, according to the JobResultStore Dec 23, 2021
@XComp XComp force-pushed the FLINK-25432 branch 2 times, most recently from df7c6be to 1c5508a Compare January 3, 2022 17:40
@XComp XComp force-pushed the FLINK-25432 branch 2 times, most recently from 71c4ff4 to 1ee0e26 Compare January 4, 2022 10:23
@XComp XComp force-pushed the FLINK-25432 branch 4 times, most recently from e307bea to 1712b69 Compare January 7, 2022 08:51
@XComp XComp force-pushed the FLINK-25430 branch 2 times, most recently from c591b43 to c98cf4a Compare January 11, 2022 15:41
@XComp XComp force-pushed the FLINK-25432 branch 5 times, most recently from 7a2dce0 to 95ea415 Compare January 13, 2022 19:01
@XComp XComp force-pushed the FLINK-25432 branch 2 times, most recently from 003c378 to d9dd04d Compare January 17, 2022 08:57
@XComp XComp force-pushed the FLINK-25432 branch 2 times, most recently from 133e036 to 4827de5 Compare January 21, 2022 07:05
@XComp XComp force-pushed the FLINK-25430 branch 2 times, most recently from 4787b66 to 34089ff Compare January 21, 2022 16:50
XComp added 5 commits January 27, 2022 17:50
Moves JobManagerRunnerRegistry instantiation into separate
constructor. The purpose of this change is to improve the
testability of the code.
XComp added 11 commits January 28, 2022 17:47
…egrates it into Dispatcher

Additionally, testHABlobsAreNotRemovedIfHAJobGraphRemovalFails is deleted
because this case doesn't apply anymore. The job graph deletion and the
HA file deletion happen independently from each other now: The JobResult
is already added to the JobResultStore as a dirty entry and all data can
be deleted.

There is a dependency between cleaning up the JobManagerRegistry (i.e.
closing the JobMaster) and cleaning up the HighAvailabilityServices for
the job. This issue is solved by a JobManager-wide leader election
(FLINK-24038). But we have to consider closing the JobMaster before
cleaning the HA job data until the legacy leader election functionality
is removed entirely.
…eric

We need to extend the TestingCompletedCheckpointStore to enable
triggering an Exception in the shutdown process on execution.
This selective requirement could have been made with less changes.
Instead, I decided to provide the most generic Testing*
implementation offering a proper TestingCompletedCheckpointStore
implementation which makes it more future-proof.
We need to extend the TestingCheckpointIDCounter to enable
triggering an Exception in the shutdown process on execution.
This selective requirement could have been made with less changes.
Instead, I decided to provide the most generic Testing*
implementation offering a proper TestingCheckpointIDCounter
implementation which makes it more future-proof.
…neric createSparseArchivedExecutionGraph

We will need to create the ArchivedExecutionGraph skeleton also
in the cleanup JobManagerRunner. The renaming is necessary to align
with the usage of this method.
…xtraction into utility method

The number of retained checkpoints needs to be extracted
in the Scheduler and the cleanup process. Therefore, the
extraction logic is moved from SchedulerUtils into
DefaultCompletedCheckpointStoreUtils.

Additionally, DefaultCompletedCheckpointStoreUtils is renamed
into CompletedCheckpointStoreUtils since its utility
methods are related to the interface, not the implementation.
We need to provide a reverse search to retrieve the JobStatus
from the ApplicationStatus that is provided by the JobResult.

This change is straight-forward because we can expect jobs
being finished that end up in the JobResultStore. Therefore,
we only have to provide a transition from ApplicationStatus
to JobStatus for values we there's a symmetric mapping
possible.
The intention is to unify the TestingDispatcherBuilder
logic used in AbstractDispatcherTest and
DispatcherResourceCleaner DispatcherResourceCleanupTest
Adds CleanupRunnerFactory and an implementing class.
Efforts about adding the job name are out-sourced into FLINK-25632.
XComp added 5 commits January 31, 2022 09:09
…CleanupTest

Initially, the BlobServer cleanup with implicitly tested in
DispatcherResourceCleanupTest. I moved the tests into
BlobServerCleanupTest and added additional test cases.
@XComp
Copy link
Owner Author

XComp commented Feb 18, 2022

Close this PR in favor of Flink PR #18536

@XComp XComp closed this Feb 18, 2022
XComp pushed a commit that referenced this pull request Oct 21, 2022
Signed-off-by: Matthias Pohl <matthias.pohl@aiven.io>
XComp added a commit that referenced this pull request Oct 21, 2022
Updated commit message
* foo #4

Signed-off-by: Matthias Pohl <matthias.pohl@aiven.io>

* foo #5

Signed-off-by: Matthias Pohl <matthias.pohl@aiven.io>

Signed-off-by: Matthias Pohl <matthias.pohl@aiven.io>
Co-authored-by: Sergey <snuyanzin@gmail.com>
Co-authored-by: Ryan Skraba <ryan.skraba@aiven.io>
XComp added a commit that referenced this pull request Mar 22, 2024
XComp added a commit that referenced this pull request Apr 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants