@liming30 liming30 commented Sep 7, 2021

[FLINK-24086][checkpoint] Rebuilding the SharedStateRegistry only when the restore method is called for the first time.

Since FLINK-22483, the CompletedCheckpointStore no longer changes during task failover, so the SharedStateRegistry only needs to be rebuilt once, which reduces the recovery time during failover.

What is the purpose of the change

Move SharedStateRegistry into CompletedCheckpointStore so that their life cycles are consistent. The shared state then only needs to be re-registered once, when the CompletedCheckpointStore is restored, which reduces the time of task failover.

Brief change log

  1. Add SharedStateRegistry to the createRecoveredCompletedCheckpointStore method of CheckpointRecoveryFactory, and let CompletedCheckpointStore manage SharedStateRegistry by itself. CheckpointCoordinator no longer manages SharedStateRegistry.
  2. When the CompletedCheckpointStore is recovering, it will also register the shared state of the recovered CompletedCheckpoint.
  3. CompletedCheckpointStore adds the registerSharedState method, which is used to provide an interface for the CheckpointCoordinator to register the shared state.
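
The ownership change described in the list above can be sketched roughly as follows. This is a simplified model, not the code in this PR: `Registry`, `Checkpoint`, `Store`, and `latestForRestore` are hypothetical stand-ins for SharedStateRegistry, CompletedCheckpoint, CompletedCheckpointStore, and the restore path, and it assumes the registry simply counts references per shared-state key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model; none of these types are Flink's real classes. */
final class RegistryOwnershipSketch {

    /** Stand-in for SharedStateRegistry: counts references per shared-state key. */
    static final class Registry {
        private final Map<String, Integer> refCounts = new HashMap<>();

        void register(String stateKey) {
            refCounts.merge(stateKey, 1, Integer::sum);
        }

        int references(String stateKey) {
            return refCounts.getOrDefault(stateKey, 0);
        }
    }

    /** Stand-in for a completed checkpoint: an id plus the shared-state keys it uses. */
    static final class Checkpoint {
        final long id;
        final List<String> sharedStateKeys;

        Checkpoint(long id, List<String> sharedStateKeys) {
            this.id = id;
            this.sharedStateKeys = sharedStateKeys;
        }
    }

    /** Stand-in for a checkpoint store that owns its registry. */
    static final class Store {
        private final Registry registry = new Registry();
        private final List<Checkpoint> checkpoints = new ArrayList<>();

        /** Runs once, when the store is recovered: registers all recovered shared state. */
        Store(List<Checkpoint> recovered) {
            for (Checkpoint checkpoint : recovered) {
                add(checkpoint);
            }
        }

        /** Entry point for the coordinator to add a newly completed checkpoint. */
        void add(Checkpoint checkpoint) {
            checkpoints.add(checkpoint);
            for (String key : checkpoint.sharedStateKeys) {
                registry.register(key);
            }
        }

        /** Task failover: store and registry are reused, nothing is re-registered. */
        Checkpoint latestForRestore() {
            return checkpoints.isEmpty() ? null : checkpoints.get(checkpoints.size() - 1);
        }

        Registry registry() {
            return registry;
        }
    }

    public static void main(String[] args) {
        Store store = new Store(Arrays.asList(
                new Checkpoint(1, Collections.singletonList("sst-a")),
                new Checkpoint(2, Arrays.asList("sst-a", "sst-b"))));
        store.latestForRestore(); // first failover
        store.latestForRestore(); // second failover
        System.out.println(store.registry().references("sst-a")); // 2
        System.out.println(store.registry().references("sst-b")); // 1
    }
}
```

In this model the reference counts stay the same however often a failover reads from the store, because registration is tied to the store's recovery rather than to each restore call.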

Verifying this change

This change added tests and can be verified as follows:

  • Added a test to verify that the shared state reference is correct and can be deleted correctly when the SharedStateRegistry is not rebuilt during failover.

org.apache.flink.runtime.checkpoint.CheckpointCoordinatorTest#testSharedStateRegistrationWithoutRebuildSharedStateRegistry

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@liming30 liming30 changed the title [FLINK-24086][checkpoint] Rebuilding the SharedStateRegistry only whe… [FLINK-24086][checkpoint] Rebuilding the SharedStateRegistry only when the restore method is called for the first time. Sep 7, 2021
@flinkbot
Collaborator

flinkbot commented Sep 7, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 12f5359 (Tue Sep 07 09:22:58 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Sep 7, 2021

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@liming30
Contributor Author

Hi, @dawidwys, I'm sorry it took me so long to update this; I have been busy with other things recently.

JobID jobId,
int maxNumberOfCheckpointsToRetain,
ClassLoader userClassLoader,
SharedStateRegistry sharedStateRegistry)
Contributor

I don't think it is a good idea. It does not define a clear contract for the SharedStateRegistry. Is it empty? Does it have entries? What should we do about it if it is not empty?

It should be up to the CheckpointRecoveryFactory to decide where the SharedStateRegistry comes from.

*/
boolean requiresExternalizedCheckpoints();

void registerSharedState(Map<OperatorID, OperatorState> operatorStates);
Contributor

This mixes the responsibilities of the two classes/interfaces (CompletedCheckpointStore & SharedStateRegistry). I am not against coupling those two (as their lifecycles are coupled already), but not in this way.

Maybe it would make sense to add getSharedStateRegistry(). I guess we would need to extract an interface from SharedStateRegistry then.
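
A minimal sketch of that suggestion, assuming a narrow interface is extracted from the registry and exposed through an accessor on the store. `StateRegistry`, `CheckpointStore`, `CountingRegistry`, `InMemoryStore`, and `registerFromCoordinator` are hypothetical names used only for illustration; they are not Flink's interfaces.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; these are not Flink's real interfaces. */
final class RegistryAccessorSketch {

    /** Narrow interface extracted from the registry: only what callers need. */
    interface StateRegistry {
        void register(String stateKey);

        int references(String stateKey);
    }

    /** The store exposes its registry instead of duplicating registry methods. */
    interface CheckpointStore {
        StateRegistry getSharedStateRegistry();
    }

    /** Simple reference-counting implementation of the extracted interface. */
    static final class CountingRegistry implements StateRegistry {
        private final Map<String, Integer> refCounts = new HashMap<>();

        @Override
        public void register(String stateKey) {
            refCounts.merge(stateKey, 1, Integer::sum);
        }

        @Override
        public int references(String stateKey) {
            return refCounts.getOrDefault(stateKey, 0);
        }
    }

    /** Store implementation that creates and owns exactly one registry. */
    static final class InMemoryStore implements CheckpointStore {
        private final StateRegistry registry = new CountingRegistry();

        @Override
        public StateRegistry getSharedStateRegistry() {
            return registry;
        }
    }

    /** Coordinator-side code registers through the accessor, not through the store API. */
    static void registerFromCoordinator(CheckpointStore store, String stateKey) {
        store.getSharedStateRegistry().register(stateKey);
    }

    public static void main(String[] args) {
        CheckpointStore store = new InMemoryStore();
        registerFromCoordinator(store, "sst-a");
        System.out.println(store.getSharedStateRegistry().references("sst-a")); // 1
    }
}
```

With this shape the store's own interface does not grow registry methods such as registerSharedState; the lifecycles stay coupled while the responsibilities remain separate.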

@dawidwys
Contributor

@dmvk Do you have any opinion about the direction we're going for here?

@dmvk
Member

dmvk commented Oct 19, 2021

@dawidwys I agree with the sentiment of this PR. It is basically a follow-up to https://issues.apache.org/jira/browse/FLINK-22483, which could further reduce the time needed for restoring the state.

I'll try to look at the PR later, but from a quick look, I agree with your comments. We should try to keep the CompletedCheckpointStore and SharedStateRegistry decoupled if possible.

@dmvk
Member

dmvk commented Oct 19, 2021

btw, good job @liming30, this is a great improvement

@liming30
Contributor Author

liming30 commented Nov 2, 2021

Hi, @dawidwys, I checked the registration of the shared state again. Maybe the shared state in CompletedCheckpointStore only needs to be re-registered in the CheckpointCoordinator constructor. What do you think?

Contributor

@dawidwys dawidwys left a comment

Now I must apologize for the delay.

Generally speaking I quite like the structure, that the SharedStateRegistry comes out of CompletedCheckpointStore. I had some inline comments that need to be addressed.

Lastly, I'd really like to hear from @dmvk and/or @rkhachatryan before we merge it.

}

@Test
public void testSharedStateRegistrationWithoutRebuildSharedStateRegistry() throws Exception {
Contributor

This test covers far more than it claims to. Moreover, I don't think it tests the no-rebuild behavior at all. None of the methods from CheckpointRecoveryFactory are used here.

Contributor

I agree with Dawid: it tests the existing functionality but not the one that was changed.

I think CheckpointRecoveryFactory implementations should be unit-tested. And ideally, their use by schedulers too (maybe using TestingCheckpointRecoveryFactory).
WDYT?

JobID jobId,
int maxNumberOfCheckpointsToRetain,
ClassLoader userClassLoader,
SharedStateRegistryFactory sharedStateRegistryFactory,
Contributor

I am wondering if we need the SharedStateRegistryFactory parameter. I'd say it should be up to the CompletedCheckpointStore/CheckpointRecoveryFactory to decide about the implementation.

Contributor

I think it could also be passed as a constructor parameter to CheckpointRecoveryFactory implementations, but such a factory reduces coupling (as opposed to directly constructing the registry inside CheckpointRecoveryFactory).

this.maxRetainedCheckpoints = maxRetainedCheckpoints;
this.checkpoints.addAll(initialCheckpoints);

for (CompletedCheckpoint completedCheckpoint : this.checkpoints) {
Contributor

That looks wrong. If the sharedStateRegistry is restored, we will register checkpoints twice. If the SharedStateRegistry was passed from the outside we should assume it's properly populated already.
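
The double-registration concern can be illustrated with a small sketch. It assumes a purely reference-counting registry in which shared state may only be deleted once its count drops to zero; `Registry`, `registerAll`, and the key `sst-a` are hypothetical and do not reproduce Flink's SharedStateRegistry.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the double-registration problem; not Flink code. */
final class DoubleRegistrationSketch {

    /** Reference-counting registry: state may be deleted once its count reaches zero. */
    static final class Registry {
        private final Map<String, Integer> refCounts = new HashMap<>();

        void register(String key) {
            refCounts.merge(key, 1, Integer::sum);
        }

        /** Drops one reference; returns true if the state is now unreferenced. */
        boolean unregister(String key) {
            int remaining = refCounts.merge(key, -1, Integer::sum);
            if (remaining <= 0) {
                refCounts.remove(key);
                return true;
            }
            return false;
        }
    }

    /** Mirrors the questioned constructor loop: registers every initial checkpoint's state. */
    static void registerAll(Registry registry, List<String> keysOfInitialCheckpoints) {
        for (String key : keysOfInitialCheckpoints) {
            registry.register(key);
        }
    }

    public static void main(String[] args) {
        List<String> keys = Collections.singletonList("sst-a");
        Registry reused = new Registry();
        registerAll(reused, keys); // first recovery populates the registry
        registerAll(reused, keys); // store re-created with the same registry: registered again
        // The single checkpoint is discarded, but one stale reference remains.
        System.out.println(reused.unregister("sst-a")); // false
    }
}
```

Under these assumptions the state is never reported as deletable after its only checkpoint is discarded, which is why a registry passed in from the outside should be treated as already populated.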

  CheckpointRecoveryFactory withoutCheckpointStoreRecovery(IntFunction<T> storeFn) {
  return new PerJobCheckpointRecoveryFactory<>(
- (maxCheckpoints, previous) -> {
+ (maxCheckpoints, previous, sharedStateRegistry) -> {
Contributor

Why do you need this extra parameter? You have the sharedStateRegistry inside of the previous already.

Contributor

I guess previous is optional during the first invocation?

Contributor

@rkhachatryan rkhachatryan left a comment

Thanks for the PR @liming30 .
LGTM in general, I've left some remarks, PTAL.

// We create a new shared state registry object, so that all pending async disposal
// requests from previous runs will go against the old object (were they can do no
// harm). This must happen under the checkpoint lock.
sharedStateRegistry.close();
Contributor

Is this call now missing and the old registry isn't closed?
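
Why the missing close() matters can be sketched as follows, based only on the quoted code comment: disposal requests from a previous run should hit the old, closed object, where they can do no harm. The sketch assumes that a closed registry ignores disposal requests; `Registry`, `dispose`, and `deleted` are hypothetical and do not reproduce the behavior of Flink's SharedStateRegistry.close().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of closing the old registry on restore; not Flink code. */
final class RegistryCloseSketch {

    static final class Registry implements AutoCloseable {
        private final List<String> deleted = new ArrayList<>();
        private boolean open = true;

        /** A disposal request; ignored once the registry has been closed. */
        synchronized boolean dispose(String key) {
            if (!open) {
                return false;
            }
            deleted.add(key);
            return true;
        }

        @Override
        public synchronized void close() {
            open = false;
        }

        /** Returns a snapshot of the keys disposed through this registry. */
        synchronized List<String> deleted() {
            return new ArrayList<>(deleted);
        }
    }

    public static void main(String[] args) {
        Registry old = new Registry();
        // A disposal scheduled during the previous run still references the old object.
        Runnable pendingDisposal = () -> old.dispose("sst-a");

        old.close();                   // restore: retire the old registry first
        Registry fresh = new Registry();

        pendingDisposal.run();         // late request hits the closed registry
        System.out.println(old.deleted().isEmpty() && fresh.deleted().isEmpty()); // true
    }
}
```

If the old registry were left open in this model, the late request would still delete `sst-a`, even though a restored checkpoint might reference it.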

@rkhachatryan
Contributor

Hi @liming30 , are you planning to merge this PR in 1.15?
I'd like to merge #17774 in 1.15; and it depends on this PR (mostly for moving SharedStateRegistry closer to CompletedCheckpointStore).

If you aim at a different release then I could implement those changes in my PR (or take over this PR if you don't mind).

@dmvk
Member

dmvk commented Nov 25, 2021

We're also preparing some follow-up PRs that build on top of this one. It would be great if we could finish it soon ;)

rkhachatryan added a commit to rkhachatryan/flink that referenced this pull request Dec 2, 2021
pushing SharedStateRegistry creation down the stack and passing checkpoints to it
rkhachatryan added a commit to rkhachatryan/flink that referenced this pull request Dec 2, 2021
by pushing SharedStateRegistry creation down the stack and passing checkpoints to it
rkhachatryan added a commit to rkhachatryan/flink that referenced this pull request Dec 2, 2021
by closing SharedStateRegistry on CompletedCheckpointStore.shutdown
@rkhachatryan
Contributor

I've opened #18001 to address the feedback above, rebase and merge it.
Please feel free to review.

rkhachatryan added a commit to rkhachatryan/flink that referenced this pull request Dec 3, 2021
by pushing SharedStateRegistry creation down the stack and passing checkpoints to it
rkhachatryan added a commit to rkhachatryan/flink that referenced this pull request Dec 3, 2021
by closing SharedStateRegistry on CompletedCheckpointStore.shutdown
@pnowojski
Contributor

superseded by #18001

@pnowojski pnowojski closed this Dec 10, 2021