
Conversation

XComp (Owner) commented Dec 2, 2021

What is the purpose of the change

This PR introduces the cleanup of dirty job results in the Dispatcher. For now, the cleanup is only triggered once via the ioExecutor.

Brief change log

See the individual commit messages for further details on each change.

Verifying this change

TODO: No extensive tests have been added yet.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? TODO (not applicable / docs / JavaDocs / not documented)

XComp added 22 commits December 1, 2021 22:16
… PartialDispatcherServicesWithJobPersistenceComponents
…overyFactory.createRecoveredCompletedCheckpointStore
… JobManager configuration into the interface
XComp force-pushed the FLINK-11813-cleanup branch from 5ac5028 to 30ac5c1 on December 2, 2021 14:03
XComp force-pushed the FLINK-11813-cleanup branch from 30ac5c1 to d8b8782 on December 3, 2021 08:03
return false;
}

private void cleanupDirtyJobs() {

XComp (Owner, Author) commented

It might be worth moving all the Dispatcher's cleanup logic into its own class, e.g. DispatcherCleanup.
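
For illustration, the suggested extraction might look roughly like this (a minimal sketch; the class name DispatcherCleanup comes from the comment above, while its members are only illustrative):

    import java.util.concurrent.Executor;

    final class DispatcherCleanup {
        private final Executor ioExecutor;

        DispatcherCleanup(Executor ioExecutor) {
            this.ioExecutor = ioExecutor;
        }

        // The cleanup methods currently living in the Dispatcher
        // (dirty-job cleanup, BlobServer, HA services, ...) would move here.
        void cleanupDirtyJobs() {
        }
    }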


XComp commented Dec 3, 2021

I realized that it's not smart to use the same CheckpointsCleaner instance everywhere. It implements AutoCloseable, which we should make use of. Hence, providing a CheckpointsCleanerFactory in the JobManagerSharedServices would be the better approach.
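
A factory along those lines could look like this (a sketch; CheckpointsCleanerFactory does not exist yet, and the wiring through JobManagerSharedServices as well as the no-argument constructor usage are assumptions):

    import org.apache.flink.runtime.checkpoint.CheckpointsCleaner;

    @FunctionalInterface
    interface CheckpointsCleanerFactory {
        CheckpointsCleaner create();
    }

    // JobManagerSharedServices would expose the factory so that each component
    // creates (and later closes) its own CheckpointsCleaner instead of sharing one:
    CheckpointsCleanerFactory factory = CheckpointsCleaner::new;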

}));

cleanupTaskResults.add(
CompletableFuture.supplyAsync(

Reviewer commented

We should never use async future operations without explicitly providing an executor. Otherwise they run on the JVM-wide common pool, which can lead to subtle memory leaks (e.g. via thread locals) and other unexpected behavior.

XComp (Owner, Author) commented

Good point, this should run on the ioExecutor.
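
For illustration, the difference boils down to the following (a sketch based on the snippet above; ioExecutor is the Dispatcher's I/O executor):

    // Implicit executor: runs on the JVM-wide ForkJoinPool.commonPool().
    CompletableFuture<Boolean> onCommonPool =
            CompletableFuture.supplyAsync(() -> blobServer.cleanupJob(jobId, jobGraphRemoved));

    // Explicit executor: runs on the Dispatcher's ioExecutor as suggested.
    CompletableFuture<Boolean> onIoExecutor =
            CompletableFuture.supplyAsync(
                    () -> blobServer.cleanupJob(jobId, jobGraphRemoved), ioExecutor);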

return blobServer.cleanupJob(jobId, jobGraphRemoved);
}

private boolean cleanupHighAvailabilityServices(JobID jobId) {

Reviewer commented

All of these cleanup methods look alike. Would it be possible to extract them behind a common interface and simplify this (e.g. something along the lines of a List<CleanupStage>)?
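
Such an interface might look like this (a sketch; CleanupStage is just the name suggested above, and the listed stages are placeholders):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.flink.api.common.JobID;

    @FunctionalInterface
    interface CleanupStage {
        boolean cleanup(JobID jobId);
    }

    // The individual cleanup methods would then collapse into a single list:
    List<CleanupStage> cleanupStages =
            Arrays.asList(
                    this::cleanupHighAvailabilityServices
                    /* ... further stages for the BlobServer, JobResultStore, ... */);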

XComp (Owner, Author) commented

Having an interface was my initial approach. I backed off from it because I realized that the retry mechanism in FutureUtils only relies on passing in a callback. The interface didn't bring any value, so I left it out for now.
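
For reference, the callback-based retry looks roughly like this (a sketch; the FutureUtils package and exact signature differ between Flink versions, and numRetries is a placeholder):

    // org.apache.flink.runtime.concurrent.FutureUtils in 1.14-era code;
    // moved to org.apache.flink.util.concurrent in later versions.
    CompletableFuture<Boolean> haCleanup =
            FutureUtils.retry(
                    () -> CompletableFuture.supplyAsync(
                            () -> cleanupHighAvailabilityServices(jobId), ioExecutor),
                    numRetries,
                    ioExecutor);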

try {
jobResultStore.markResultAsClean(jobId);
} catch (IOException e) {
log.warn("Could not properly mark job {} result as clean.", jobId, e);

Reviewer commented

What are the consequences of ignoring the failure here?

XComp (Owner, Author) commented

In case of failure, the job should be picked up again for cleanup.

}

startRecoveredJobs();
cleanupDirtyJobs();

Reviewer commented

I'm not sure about having a separate code path for already terminated jobs, as the Dispatcher is already fairly complex.

What do you think about unifying these and having a custom JobManagerRunner implementation that only performs the checkpoint cleanup?

XComp (Owner, Author) commented

I guess that's a good point. It should be doable; it just didn't cross my mind. The JobManagerRunner could be in charge of handling the job's lifecycle entirely, i.e. including the cleanup. I will work on it.

XComp (Owner, Author) commented Dec 8, 2021

I discussed the JobManagerRunner approach with Chesnay. We looked at the issue from a user's point of view: we somehow want to visualize the running jobs as well as the jobs that are completed but still in the cleanup phase. Having everything in Dispatcher.runningJobs makes sense if we want the cleanup phase to still be part of the general job lifecycle. Dirty jobs that need to be cleaned up would be listed as not-finally-completed jobs until the cleanup succeeds. The JobManagerRunner could be responsible for maintaining this lifecycle (even after the JobMaster has finished). That would mean moving the cleanup logic into the JobManagerRunner and having it used by both the JobMasterServiceLeadershipRunner and the new JobManagerRunner implementation that is in charge of cleaning up the checkpoint-related artefacts.

The flaw of this approach is that running jobs are meant to have an ExecutionGraphInfo, which is provided by the JobMaster. For the cleanup, we don't have this information. It is accessible through the ExecutionGraphInfoStore in the Dispatcher, though, so we could use that store in the new implementation. But the ExecutionGraphInfoStore is only persisted locally; the data would be gone in case of a failover.

One workaround would be moving the ExecutionGraphInfo into the JobResultStore as additional metadata, essentially merging the ExecutionGraphInfoStore and the JobResultStore into a single store. The issue with that approach is that the ExecutionGraphInfoStore currently has no backwards-compatibility requirements. Persisting the ExecutionGraph in the JobResultStore would change that, which is not what we want, I guess. Having a JobResultStore that stores only a limited amount of metadata per job makes it easier to maintain backwards compatibility.

Another approach is treating the two phases separately: Dispatcher.runningJobs would only list jobs with a JobMaster providing the ExecutionGraph, while the dirty jobs would be handled through a separate member (and exposed through a dedicated REST API endpoint) as currently implemented in this prototype PR.

We can move the cleanup logic into its own class to remove complexity from the Dispatcher.

XComp (Owner, Author) commented

How is the user then informed about problems? There's going to be a dedicated REST endpoint (and a new section in the Flink UI) listing the jobs from the JobResultStore; any dirty jobs can be labeled as "cleanup pending". The Dispatcher retries the cleanup indefinitely, and the user is informed about issues through the logs. We could also think about adding some kind of retry counter that is persisted in the JobResultStore.

XComp (Owner, Author) commented Dec 8, 2021

How does the user handle failures? The cluster shutdown could either block until all jobs are cleaned up, or fail with a non-zero exit code if some jobs cannot be cleaned up. The latter would trigger a restart in the HA case, which would pick up the dirty jobs again for cleanup. Here, it might make sense to set some limit on the retries once a shutdown has been triggered.

XComp (Owner, Author) commented Dec 8, 2021

Thinking about it once more, we could actually follow your approach by extracting a new interface out of the JobManagerRunner interface that only contains getResultFuture(), start() and getJobID(). The JobManagerRunner could extend this interface. A new implementation of the extracted interface could be used to implement the checkpoint-related cleanup; after that is done, the common cleanup of the other components can be triggered. We could add a method cancelCleanup() to the new interface to enable explicit cancelling of the cleanup phase. Chesnay voted against reusing JobManagerRunner.cancel() because of the different semantics (cancelling the job results in a cancelled job, whereas cancelling the job cleanup still results in a globally-terminated job).

That would enable us to cancel the cleanup of a single job (through a new REST endpoint) without shutting down the cluster. We still have to decide how to handle a running job for which the cancellation of the cleanup is called.
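
A rough sketch of the extracted interface described above (the name JobLifecycleRunner and the cancelCleanup() return type are placeholders; the other methods mirror the existing JobManagerRunner, whose exact return types may differ between versions):

    import java.util.concurrent.CompletableFuture;
    import org.apache.flink.api.common.JobID;
    import org.apache.flink.runtime.jobmaster.JobManagerRunnerResult;

    interface JobLifecycleRunner {
        JobID getJobID();

        void start() throws Exception;

        CompletableFuture<JobManagerRunnerResult> getResultFuture();

        // Explicitly cancels the cleanup phase; deliberately separate from
        // JobManagerRunner#cancel because of the different semantics discussed above.
        CompletableFuture<Void> cancelCleanup();
    }

    // JobManagerRunner (and thus the JobMasterServiceLeadershipRunner) would extend
    // this interface, while a new implementation handles only the checkpoint-related cleanup.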

}

@VisibleForTesting
static CompletedCheckpointStore createCompletedCheckpointStore(

Reviewer commented

What is the intuition behind moving this into CheckpointRecoveryFactory?

XComp (Owner, Author) commented

The intuition was that I wanted to reuse the CheckpointsCleaner instance in the cleanup of the CompletedCheckpointStore. But that's not possible as is, because the CheckpointsCleaner is closed on the JobMaster. Instead, we should use a CheckpointsCleanerFactory that is passed into the different components that need a CheckpointsCleaner instance.

XComp force-pushed the FLINK-11813-move-jrs-init branch 2 times, most recently from e6037a0 to 59ac153 on December 9, 2021 17:07

XComp commented Dec 10, 2021

Closing this draft in favor of draft PR #3, which implements the JobManagerRunner interface.

XComp closed this on Dec 10, 2021
XComp added a commit that referenced this pull request Jan 21, 2022
XComp added a commit that referenced this pull request Jan 21, 2022
…dirty JobResults

We don't want to retrigger jobs that finished already based on the JobResultStore.
XComp added a commit that referenced this pull request Jan 21, 2022
XComp added a commit that referenced this pull request Jan 21, 2022
…dirty JobResults

We don't want to retrigger jobs that finished already based on the JobResultStore.
XComp added a commit that referenced this pull request Jan 22, 2022
…erProcess

This change now covers all cases for recovery.
XComp added a commit that referenced this pull request Jan 23, 2022
XComp added a commit that referenced this pull request Feb 3, 2022
…o be used by both the local and global cleanup
XComp added a commit that referenced this pull request Feb 3, 2022
XComp added a commit that referenced this pull request Feb 3, 2022
XComp added a commit that referenced this pull request Feb 3, 2022
XComp added a commit that referenced this pull request Feb 3, 2022
XComp added a commit that referenced this pull request Oct 21, 2022
XComp pushed a commit that referenced this pull request Nov 2, 2023
XComp added a commit that referenced this pull request Mar 22, 2024
XComp added a commit that referenced this pull request Apr 2, 2024
XComp added a commit that referenced this pull request Apr 2, 2024