@XComp XComp commented Dec 9, 2021

What is the purpose of the change

Two goals are covered by this PR:

  • Consolidates the cleanup phase (moving cleanup logic into GlobalCleanupStage) and provides a localCleanup method in the Dispatcher. No retry mechanism is involved yet.
  • Implements a basic JobManagerRunner class that only instantiates the checkpoint-related components and implements close logic for them. No retry mechanism is applied yet.

Follow-ups:

  • RetryStrategy needs to be integrated.
  • The two classes implementing JobManagerRunner (i.e. JobMasterServiceLeadershipRunner and CheckpointJobDataCleanupRunner) have to be consolidated. General logic (the leader election) should be moved into an abstract class.
  • We might want to integrate the ExecutionGraphInfoStore in the CheckpointJobDataCleanupRunner as a fallback.

Brief change log

See individual commits and their messages for descriptions.

Verifying this change

[TODO] This PR contains only a prototype. Further work (such as implementing tests and adding documentation) is still needed.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@XComp XComp force-pushed the FLINK-11813-cleanup-v2 branch 2 times, most recently from c91336e to 0dbdeff Compare December 13, 2021 11:43
private boolean delete(final File f) throws IOException {
    if (!f.exists()) {
        return true;
    }

Do we have a failing test case for this? Ideally one that covers all filesystems (I'm not sure whether we have such an abstract test in place).

Owner Author

Yes, I still have to add that 👍

} catch (IOException e) {
log.warn("Could not properly mark job {} result as clean.", jobId, e);
}
private void cleanUpJobResult(JobID jobId) {

commitJobResult?


private void initializeCompletedCheckpointStore() {
    try {
        this.completedCheckpointStore =

The cleanup executor uses a fixed thread pool [1], so we're not guaranteed to always run on the same thread here.

[1] https://github.com/apache/flink/blob/release-1.14.0/flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java#L314
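The point about thread identity can be illustrated outside of Flink. The following is a minimal sketch, assuming only the JDK's `Executors` factories; the class and method names are made up for this illustration and are not part of the PR. It submits tasks strictly one after another and counts how many distinct worker threads ran them.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class; not part of the Flink code base.
public class FixedPoolThreadIdentitySketch {

    /** Runs the tasks one after another and counts the distinct worker threads that executed them. */
    public static int distinctWorkerThreads(ExecutorService executor, int tasks) {
        Callable<String> reportThread = () -> Thread.currentThread().getName();
        Set<String> threadNames = new HashSet<>();
        try {
            for (int i = 0; i < tasks; i++) {
                // Wait for each task before submitting the next one, so tasks never overlap.
                threadNames.add(executor.submit(reportThread).get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return threadNames.size();
    }

    public static void main(String[] args) {
        System.out.println("single-thread executor: "
                + distinctWorkerThreads(Executors.newSingleThreadExecutor(), 4));
        System.out.println("fixed pool of 2: "
                + distinctWorkerThreads(Executors.newFixedThreadPool(2), 4));
    }
}
```

Even though the tasks never run concurrently, the fixed pool spreads them over more than one worker, so any state that relies on thread confinement would need a single-threaded executor or explicit synchronization.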

public CompletableFuture<Acknowledge> cancel(Time timeout) {
    Preconditions.checkState(
            resultFuture != null, "The CheckpointJobDataCleanupRunner was not started, yet.");
    if (resultFuture.cancel(true)) {

What is the use-case for cancelling here? This will fail the future, but wouldn't the cleanup still keep executing?

Owner Author

You're right. I have to look into it more closely.
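The concern can be reproduced with plain JDK classes. The sketch below is not the runner from this PR; the class name, latches, and flag are assumptions made for the illustration. It starts a task through `CompletableFuture.runAsync`, cancels the returned future while the task is blocked, and then checks whether the task body still ran to the end.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; not part of the Flink code base.
public class CancelDoesNotStopWorkSketch {

    /** Returns true if the future was cancelled and the "cleanup" task still ran to completion. */
    public static boolean cleanupFinishedDespiteCancel() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch mayFinish = new CountDownLatch(1);
        AtomicBoolean cleanupFinished = new AtomicBoolean(false);
        try {
            CompletableFuture<Void> resultFuture = CompletableFuture.runAsync(() -> {
                started.countDown();
                try {
                    mayFinish.await(); // stands in for long-running cleanup work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return; // only reached if something interrupts the worker thread
                }
                cleanupFinished.set(true);
            }, executor);

            started.await();
            // CompletableFuture.cancel completes the future with a CancellationException;
            // the mayInterruptIfRunning flag has no effect on the running task.
            boolean cancelled = resultFuture.cancel(true);
            mayFinish.countDown();

            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
            return cancelled && resultFuture.isCancelled() && cleanupFinished.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("cleanup finished despite cancel: " + cleanupFinishedDespiteCancel());
    }
}
```

If real cancellation is wanted, the task itself would have to observe a flag or be run through an API that supports interruption, such as `ExecutorService.submit` with `Future.cancel(true)`.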

return new ExecutionGraphInfo(
        ArchivedExecutionGraph.createSparseArchivedExecutionGraph(
                jobResult.getJobId(),
                "we might want to move the job name into the job result",

do we also need this for the web ui?

Owner Author

yes, that was my understanding. we need to provide some dummy response for now

@Override
public void localCleanup(JobID jobId) throws IOException {
    checkNotNull(jobId);
    deleteLocalData(jobId);

Should we also obtain a RW lock here?

Owner Author

Thanks for the pointer, I will verify it.
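Whether the lock is actually required here is left open in the thread. Purely as an illustration of what guarding a local cleanup with a read/write lock could look like, here is a minimal sketch that uses an in-memory map instead of files; the class, its fields, and the `String` job ID are all hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; an in-memory stand-in for a store with local per-job data.
public class LockedLocalCleanupSketch {

    private final ReadWriteLock readWriteLock = new ReentrantReadWriteLock();
    private final Map<String, byte[]> localData = new HashMap<>();

    public void put(String jobId, byte[] data) {
        readWriteLock.writeLock().lock();
        try {
            localData.put(jobId, data);
        } finally {
            readWriteLock.writeLock().unlock();
        }
    }

    public boolean contains(String jobId) {
        // Readers may run concurrently with each other, but not with a cleanup.
        readWriteLock.readLock().lock();
        try {
            return localData.containsKey(jobId);
        } finally {
            readWriteLock.readLock().unlock();
        }
    }

    public void localCleanup(String jobId) {
        Objects.requireNonNull(jobId, "jobId");
        // Deleting is a mutation, so it takes the write lock.
        readWriteLock.writeLock().lock();
        try {
            localData.remove(jobId);
        } finally {
            readWriteLock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockedLocalCleanupSketch store = new LockedLocalCleanupSketch();
        store.put("job-1", new byte[] {42});
        store.localCleanup("job-1");
        System.out.println("still present: " + store.contains("job-1"));
    }
}
```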

    deleteLocalData(jobId);
}

private void deleteLocalData(JobID jobId) throws IOException {

do we need an extra method? can the globalCleanup simply call the localCleanup instead?

Owner Author

We could do that. I just thought that it makes the code easier to read...
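The alternative raised in this thread, where the global cleanup simply calls the local cleanup instead of sharing a private helper, could look roughly like the sketch below. The two interfaces and the in-memory store are hypothetical and only mirror the method names used in the discussion.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; interface and store names are illustrative only.
public class CleanupDelegationSketch {

    interface LocalCleanup {
        void localCleanup(String jobId);
    }

    interface GlobalCleanup {
        void globalCleanup(String jobId);
    }

    /** In-memory stand-in for a store that keeps both local and remote job data. */
    public static class InMemoryStore implements LocalCleanup, GlobalCleanup {
        public final Set<String> localJobs = new HashSet<>();
        public final Set<String> remoteJobs = new HashSet<>();

        @Override
        public void localCleanup(String jobId) {
            localJobs.remove(jobId);
        }

        @Override
        public void globalCleanup(String jobId) {
            // Reuse the local cleanup rather than a separate private helper.
            localCleanup(jobId);
            remoteJobs.remove(jobId);
        }
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.localJobs.add("job-1");
        store.remoteJobs.add("job-1");
        store.globalCleanup("job-1");
        System.out.println("local empty: " + store.localJobs.isEmpty()
                + ", remote empty: " + store.remoteJobs.isEmpty());
    }
}
```

The trade-off is the one named in the reply: delegation removes a method, while a dedicated helper makes it explicit which state each cleanup stage touches.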


@Override
public void releaseJobGraph(JobID jobId) {}
public void localCleanup(JobID jobId) {}

seems unnecessary

try {
    return createCompletedCheckpointStore(
            configuration, userCodeLoader, checkpointRecoveryFactory, log, jobId);
    return checkpointRecoveryFactory.createRecoveredCompletedCheckpointStore(

I'm not a big fan of moving this into the CheckpointRecoveryFactory interface. What is the intuition / benefit behind it?

Owner Author

We need to instantiate the CompletedCheckpointStore in different locations (i.e. in SchedulerBase and in CheckpointJobDataCleanupRunner). I didn't want to duplicate the logic for accessing the configuration to get the number of retained checkpoints.

Alternatively, we could move the static method from SchedulerUtils into some other utils class.
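As a rough illustration of the utility-class option, the sketch below puts the "read the number of retained checkpoints from the configuration" step into one static method that both call sites could share. It is not Flink code: the configuration is modeled as a plain `Map`, and the key, the default value, and the fallback behavior for invalid values are assumptions made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical utility class; key name, default, and fallback behavior are assumptions.
public final class RetainedCheckpointsSketch {

    static final String NUM_RETAINED_KEY = "checkpoints.num-retained"; // placeholder key
    static final int DEFAULT_NUM_RETAINED = 1; // placeholder default

    private RetainedCheckpointsSketch() {}

    /** Reads the number of retained checkpoints, falling back to the default for bad values. */
    public static int maxRetainedCheckpoints(Map<String, String> configuration) {
        String raw = configuration.get(NUM_RETAINED_KEY);
        if (raw == null) {
            return DEFAULT_NUM_RETAINED;
        }
        try {
            int parsed = Integer.parseInt(raw.trim());
            // A store that retains no checkpoints makes no sense, so reject non-positive values.
            return parsed > 0 ? parsed : DEFAULT_NUM_RETAINED;
        } catch (NumberFormatException e) {
            return DEFAULT_NUM_RETAINED;
        }
    }

    public static void main(String[] args) {
        Map<String, String> configuration = new HashMap<>();
        configuration.put(NUM_RETAINED_KEY, "3");
        System.out.println("retained checkpoints: " + maxRetainedCheckpoints(configuration));
    }
}
```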

 *
 * @param jobId specifying the job to release the locks for
 * @throws Exception if the locks cannot be released
 */

Do we intentionally keep the old javadoc?

Owner Author

Yes, I thought it would be more specific to the actual class.

XComp added 21 commits December 15, 2021 16:42
…CleanableResource and GloballyCleanableResource
…overyFactory.createRecoveredCompletedCheckpointStore
…b into more generic ArchivedExecutionGraph.createSparseArchivedExecutionGraph
… JobManager configuration into the interface
@XComp XComp force-pushed the FLINK-11813-move-jrs-init branch from 59ac153 to ba31513 Compare December 15, 2021 15:45
…lobStore interface

Previously, FileSystemBlobStore's delete methods didn't return true if the
corresponding file didn't exist and, therefore, was not deleted. BlobStore.delete and
BlobStore.deleteAll have a different contract which states that, in that case, the
method should return true.

The internal FileSystemBlobStore.delete method got updated accordingly. Additionally, a
test class covering the FileSystemBlobStore was added.
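The contract described in this commit message (deleting something that does not exist counts as success) can be sketched with `java.io.File`. This is an illustration on the local filesystem only, not the FileSystemBlobStore implementation; the class name and the `demo` helper are made up for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical demo class; not the FileSystemBlobStore implementation.
public class IdempotentDeleteSketch {

    /** Returns true if nothing is left at the path afterwards, including when it never existed. */
    public static boolean delete(File f) {
        if (!f.exists()) {
            return true; // nothing to delete counts as success
        }
        File[] children = f.listFiles(); // null for regular files
        if (children != null) {
            for (File child : children) {
                if (!delete(child)) {
                    return false;
                }
            }
        }
        return f.delete();
    }

    /** Deletes a temporary directory twice and reports whether both calls succeeded. */
    public static boolean demo() {
        try {
            File dir = Files.createTempDirectory("blob-sketch").toFile();
            Files.write(new File(dir, "blob").toPath(), new byte[] {1, 2, 3});
            boolean first = delete(dir);
            boolean second = delete(dir); // the directory is already gone at this point
            return first && second && !dir.exists();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("both deletes succeeded: " + demo());
    }
}
```

Making the delete idempotent in this way matters for retried cleanups: a second attempt after a partial success should not be reported as a failure.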
@XComp XComp force-pushed the FLINK-11813-cleanup-v2 branch from 05155b9 to c513b0d Compare December 21, 2021 16:57

XComp commented Jan 28, 2022

Closed in favor of #4

@XComp XComp closed this Jan 28, 2022
XComp added a commit that referenced this pull request Oct 21, 2022
XComp added a commit that referenced this pull request Mar 22, 2024
XComp added a commit that referenced this pull request Apr 2, 2024
XComp added a commit that referenced this pull request Apr 2, 2024
2 participants