[FLINK-24086][checkpoint] Rebuilding the SharedStateRegistry only when the restore method is called for the first time. #17179

liming30 · 2021-09-07T09:18:42Z

[FLINK-24086][checkpoint] Rebuilding the SharedStateRegistry only when the restore method is called for the first time.

Since FLINK-22483, the CompletedCheckpointStore does not change during task failover, so we only need to rebuild the SharedStateRegistry once, which can reduce the recovery time during failover.

What is the purpose of the change

Move SharedStateRegistry to CompletedCheckpointStore to make their life cycles consistent, so that we only need to re-register the shared state once, when the CompletedCheckpointStore is restored. This can reduce the time of task failover.

Brief change log

Add SharedStateRegistry to the createRecoveredCompletedCheckpointStore method of CheckpointRecoveryFactory, and let CompletedCheckpointStore manage SharedStateRegistry by itself. CheckpointCoordinator no longer manages SharedStateRegistry.
When the CompletedCheckpointStore is recovering, it will also register the shared state of the recovered CompletedCheckpoint.
CompletedCheckpointStore adds the registerSharedState method, which is used to provide an interface for the CheckpointCoordinator to register the shared state.
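
As a rough illustration of the change log above, here is a minimal sketch that assumes simplified stand-in types. `Registry`, `Checkpoint`, `Store`, and `latestCheckpointId` are hypothetical names for this example only and are not the Flink classes or their real signatures. It shows the store registering shared state once at recovery and offering `registerSharedState` for later checkpoints, while a simulated failover leaves the registry untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StoreOwnedRegistrySketch {

    /** Hypothetical stand-in for SharedStateRegistry: a reference count per shared handle. */
    static class Registry {
        final Map<String, Integer> refCounts = new HashMap<>();

        void register(String key) {
            refCounts.merge(key, 1, Integer::sum);
        }
    }

    /** Hypothetical stand-in for CompletedCheckpoint: an id plus the shared handles it uses. */
    static class Checkpoint {
        final long id;
        final List<String> sharedKeys;

        Checkpoint(long id, List<String> sharedKeys) {
            this.id = id;
            this.sharedKeys = sharedKeys;
        }
    }

    /** Hypothetical stand-in for CompletedCheckpointStore that manages its own registry. */
    static class Store {
        private final Registry registry;
        private final List<Checkpoint> checkpoints = new ArrayList<>();

        // Recovery runs once; shared state of the recovered checkpoints is registered here.
        Store(Registry registry, List<Checkpoint> recovered) {
            this.registry = registry;
            for (Checkpoint cp : recovered) {
                checkpoints.add(cp);
                registerSharedState(cp);
            }
        }

        // Entry point a coordinator would call for newly completed checkpoints.
        void registerSharedState(Checkpoint cp) {
            for (String key : cp.sharedKeys) {
                registry.register(key);
            }
        }

        // Simulated task failover: reads the latest checkpoint and leaves the registry alone.
        long latestCheckpointId() {
            return checkpoints.get(checkpoints.size() - 1).id;
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        Store store = new Store(registry, Arrays.asList(
                new Checkpoint(1L, Arrays.asList("sst-1")),
                new Checkpoint(2L, Arrays.asList("sst-1", "sst-2"))));
        store.latestCheckpointId(); // first failover
        store.latestCheckpointId(); // second failover, no re-registration
        System.out.println(registry.refCounts);
    }
}
```

In this sketch the reference counts are fixed by the constructor, so repeated failovers cost nothing in registration work, which is the saving the PR description is aiming for.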

Verifying this change

This change added tests and can be verified as follows:

Added a test to verify that the shared state reference is correct and can be deleted correctly when the SharedStateRegistry is not rebuilt during failover.

org.apache.flink.runtime.checkpoint.CheckpointCoordinatorTest#testSharedStateRegistrationWithoutRebuildSharedStateRegistry

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

flinkbot · 2021-09-07T09:22:58Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 12f5359 (Tue Sep 07 09:22:58 UTC 2021)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2021-09-07T10:00:29Z

CI report:

de26dc2 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

liming30 · 2021-10-17T13:56:33Z

Hi, @dawidwys, I'm sorry it took me so long to update this; I have been busy with other things recently.

dawidwys · 2021-10-19T09:36:47Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRecoveryFactory.java

+            JobID jobId,
+            int maxNumberOfCheckpointsToRetain,
+            ClassLoader userClassLoader,
+            SharedStateRegistry sharedStateRegistry)


I don't think it is a good idea. It does not define a clear contract for the SharedStateRegistry. Is it empty? Does it have entries? What should we do about it if it is not empty?

It should be up to the CheckpointRecoveryFactory to determine where the SharedStateRegistry comes from.

dawidwys · 2021-10-19T09:40:01Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CompletedCheckpointStore.java

     */
    boolean requiresExternalizedCheckpoints();

+    void registerSharedState(Map<OperatorID, OperatorState> operatorStates);


This mixes the responsibilities of the two classes/interfaces (CompletedCheckpointStore & SharedStateRegistry). I am not against coupling those two (as their lifecycles are coupled already), but not in this way.

Maybe it would make sense to add getSharedStateRegistry(). I guess we would need to extract an interface from SharedStateRegistry then.
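
To make the suggestion concrete, a minimal sketch follows, assuming an interface has been extracted from the registry. `RegistryView`, `CountingRegistry`, `CheckpointStore`, and `InMemoryStore` are hypothetical names for illustration and not Flink's actual types. The store only hands out the registry, and callers register state on it directly, so the store does not need a `registerSharedState` method of its own.

```java
import java.util.HashMap;
import java.util.Map;

class RegistryAccessorSketch {

    /** Hypothetical narrow interface extracted from SharedStateRegistry. */
    interface RegistryView {
        void registerReference(String key);

        int referenceCount(String key);
    }

    /** Hypothetical default implementation backed by a reference-count map. */
    static class CountingRegistry implements RegistryView {
        private final Map<String, Integer> refCounts = new HashMap<>();

        @Override
        public void registerReference(String key) {
            refCounts.merge(key, 1, Integer::sum);
        }

        @Override
        public int referenceCount(String key) {
            return refCounts.getOrDefault(key, 0);
        }
    }

    /** Hypothetical slice of CompletedCheckpointStore: exposes the registry, does not wrap it. */
    interface CheckpointStore {
        RegistryView getSharedStateRegistry();
    }

    static class InMemoryStore implements CheckpointStore {
        private final RegistryView registry = new CountingRegistry();

        @Override
        public RegistryView getSharedStateRegistry() {
            return registry;
        }
    }

    public static void main(String[] args) {
        CheckpointStore store = new InMemoryStore();
        // The coordinator would talk to the registry through the store's accessor.
        store.getSharedStateRegistry().registerReference("sst-1");
        System.out.println(store.getSharedStateRegistry().referenceCount("sst-1"));
    }
}
```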

dawidwys · 2021-10-19T09:48:10Z

@dmvk Do you have any opinion about the direction we're going for here?

dmvk · 2021-10-19T10:52:58Z

@dawidwys I agree with the sentiment of this PR. It is basically a follow-up to https://issues.apache.org/jira/browse/FLINK-22483, which could further reduce the time needed for restoring the state.

I'll try to look at the PR later, but from a quick look, I agree with your comments. We should try to keep the CompletedCheckpointStore and SharedStateRegistry decoupled if possible.

dmvk · 2021-10-19T10:53:59Z

btw, good job @liming30, this is a great improvement

liming30 · 2021-11-02T05:35:26Z

Hi, @dawidwys, I checked the registration of the shared state again. Maybe the shared state in CompletedCheckpointStore only needs to be re-registered in the CheckpointCoordinator constructor. What do you think?

dawidwys

Now I must apologize for the delay.

Generally speaking I quite like the structure, that the SharedStateRegistry comes out of CompletedCheckpointStore. I had some inline comments that need to be addressed.

Lastly, I'd really like to hear from @dmvk and/or @rkhachatryan before we merge it.

dawidwys · 2021-11-12T14:46:42Z

flink-runtime/src/test/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinatorTest.java

    }

+    @Test
+    public void testSharedStateRegistrationWithoutRebuildSharedStateRegistry() throws Exception {


This test covers far more than it claims to. Moreover, I don't think it tests the no-rebuilding behavior at all. None of the methods from CheckpointRecoveryFactory are used here.

dawidwys · 2021-11-12T14:47:49Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRecoveryFactory.java

+            JobID jobId,
+            int maxNumberOfCheckpointsToRetain,
+            ClassLoader userClassLoader,
+            SharedStateRegistryFactory sharedStateRegistryFactory,


I am wondering if we need the SharedStateRegistryFactory parameter. I'd say it should be up to the CompletedCheckpointStore/CheckpointRecoveryFactory to decide about the implementation.

dawidwys · 2021-11-12T14:51:23Z

...time/src/main/java/org/apache/flink/runtime/checkpoint/EmbeddedCompletedCheckpointStore.java

        this.maxRetainedCheckpoints = maxRetainedCheckpoints;
        this.checkpoints.addAll(initialCheckpoints);
+
+        for (CompletedCheckpoint completedCheckpoint : this.checkpoints) {


That looks wrong. If the sharedStateRegistry is restored, we will register checkpoints twice. If the SharedStateRegistry was passed from the outside we should assume it's properly populated already.
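
The double-registration concern can be reproduced with a small sketch, assuming a reference-counting registry. `Registry`, `initStore`, and the `reRegister` flag are hypothetical and only mimic the behavior under discussion, not Flink's actual SharedStateRegistry. When an already populated registry is passed in and the store registers the recovered checkpoints again, each shared handle ends up with one reference too many.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DoubleRegistrationSketch {

    /** Hypothetical reference-counting registry. */
    static class Registry {
        final Map<String, Integer> refCounts = new HashMap<>();

        void register(String key) {
            refCounts.merge(key, 1, Integer::sum);
        }

        /** Drops one reference; returns true once no references remain. */
        boolean unregister(String key) {
            return refCounts.computeIfPresent(
                    key, (k, count) -> count > 1 ? count - 1 : null) == null;
        }
    }

    /**
     * Hypothetical store construction. With reRegister=true it mimics the loop in the diff
     * above; with reRegister=false it trusts a registry that was handed in from outside.
     */
    static void initStore(Registry registry, List<String> recoveredSharedKeys, boolean reRegister) {
        if (reRegister) {
            for (String key : recoveredSharedKeys) {
                registry.register(key);
            }
        }
    }

    public static void main(String[] args) {
        Registry populated = new Registry();
        populated.register("sst-1"); // registry restored elsewhere, already holds the reference

        initStore(populated, Arrays.asList("sst-1"), true);
        System.out.println("count after re-registering: " + populated.refCounts.get("sst-1"));
        System.out.println("discardable after one unregister: " + populated.unregister("sst-1"));
    }
}
```

With the extra registration, a single unregister no longer brings the count to zero, so in this model the shared state would never become eligible for deletion.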

dawidwys · 2021-11-12T14:52:37Z

...ntime/src/main/java/org/apache/flink/runtime/checkpoint/PerJobCheckpointRecoveryFactory.java

            CheckpointRecoveryFactory withoutCheckpointStoreRecovery(IntFunction<T> storeFn) {
        return new PerJobCheckpointRecoveryFactory<>(
-                (maxCheckpoints, previous) -> {
+                (maxCheckpoints, previous, sharedStateRegistry) -> {


Why do you need this extra parameter? You have the sharedStateRegistry inside of the previous already.

rkhachatryan

Thanks for the PR @liming30 .
LGTM in general, I've left some remarks, PTAL.

rkhachatryan · 2021-11-16T19:53:19Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRecoveryFactory.java

+            JobID jobId,
+            int maxNumberOfCheckpointsToRetain,
+            ClassLoader userClassLoader,
+            SharedStateRegistryFactory sharedStateRegistryFactory,


I think it could also be passed as a constructor parameter to CheckpointRecoveryFactory implementations, but such a factory reduces coupling (as opposed to directly constructing the registry inside CheckpointRecoveryFactory).
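
A minimal sketch of the constructor-parameter variant, assuming simplified types: `Registry`, `RegistryFactory`, `Store`, and `RecoveryFactory` are hypothetical stand-ins, and the real SharedStateRegistryFactory signature may differ. The recovery factory receives the registry factory once, so the per-call method needs no registry-related parameter, yet the recovery factory still does not construct the registry itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RegistryFactorySketch {

    /** Hypothetical registry that just remembers which checkpoints it was built from. */
    static class Registry {
        final Set<String> knownCheckpoints = new HashSet<>();
    }

    /** Hypothetical factory interface; not the real SharedStateRegistryFactory signature. */
    interface RegistryFactory {
        Registry create(List<String> restoredCheckpoints);
    }

    /** Hypothetical store holding the registry it was created with. */
    static class Store {
        final Registry registry;

        Store(Registry registry) {
            this.registry = registry;
        }
    }

    /** Hypothetical CheckpointRecoveryFactory implementation that gets the factory once. */
    static class RecoveryFactory {
        private final RegistryFactory registryFactory;

        RecoveryFactory(RegistryFactory registryFactory) {
            this.registryFactory = registryFactory;
        }

        // No registry-related parameter here: the injected factory decides the implementation.
        Store createRecoveredStore(List<String> restoredCheckpoints) {
            return new Store(registryFactory.create(restoredCheckpoints));
        }
    }

    public static void main(String[] args) {
        RecoveryFactory recoveryFactory = new RecoveryFactory(restored -> {
            Registry registry = new Registry();
            registry.knownCheckpoints.addAll(restored);
            return registry;
        });
        Store store = recoveryFactory.createRecoveredStore(Arrays.asList("chk-1", "chk-2"));
        System.out.println(store.registry.knownCheckpoints.size());
    }
}
```

Tests could inject a different `RegistryFactory` without touching the recovery factory, which is the reduced coupling the comment refers to.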

rkhachatryan · 2021-11-16T20:01:44Z

...ntime/src/main/java/org/apache/flink/runtime/checkpoint/PerJobCheckpointRecoveryFactory.java

            CheckpointRecoveryFactory withoutCheckpointStoreRecovery(IntFunction<T> storeFn) {
        return new PerJobCheckpointRecoveryFactory<>(
-                (maxCheckpoints, previous) -> {
+                (maxCheckpoints, previous, sharedStateRegistry) -> {


I guess previous is optional during the 1st invocation?

rkhachatryan · 2021-11-16T20:35:43Z

flink-runtime/src/test/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinatorTest.java

    }

+    @Test
+    public void testSharedStateRegistrationWithoutRebuildSharedStateRegistry() throws Exception {


I agree with Dawid, it tests the existing functionality but not the one that was changed.

I think CheckpointRecoveryFactory implementations should be unit-tested. And ideally, their use by schedulers too (maybe using TestingCheckpointRecoveryFactory).
WDYT?

rkhachatryan · 2021-11-16T20:41:55Z

flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java

-            // We create a new shared state registry object, so that all pending async disposal
-            // requests from previous runs will go against the old object (were they can do no
-            // harm). This must happen under the checkpoint lock.
-            sharedStateRegistry.close();


Is this call now missing and the old registry isn't closed?
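
One way the missing close could be handled is to tie it to the store's own shutdown, which matches the later commit message in this thread ("closing SharedStateRegistry on CompletedCheckpointStore.shutdown"). The sketch below assumes simplified types: `Registry`, `Store`, and `isOpen` are hypothetical and not Flink's actual classes or methods.

```java
import java.util.HashMap;
import java.util.Map;

class CloseOnShutdownSketch {

    /** Hypothetical registry that rejects registrations once it has been closed. */
    static class Registry implements AutoCloseable {
        private final Map<String, Integer> refCounts = new HashMap<>();
        private boolean open = true;

        /** Returns false if the registry is already closed and the request was ignored. */
        synchronized boolean register(String key) {
            if (!open) {
                return false;
            }
            refCounts.merge(key, 1, Integer::sum);
            return true;
        }

        synchronized boolean isOpen() {
            return open;
        }

        @Override
        public synchronized void close() {
            open = false;
        }
    }

    /** Hypothetical store that owns the registry and closes it together with itself. */
    static class Store {
        private final Registry registry;

        Store(Registry registry) {
            this.registry = registry;
        }

        Registry getRegistry() {
            return registry;
        }

        void shutdown() {
            // The store owns the registry, so the two lifecycles end at the same point.
            registry.close();
        }
    }

    public static void main(String[] args) {
        Store store = new Store(new Registry());
        System.out.println(store.getRegistry().register("sst-1"));
        store.shutdown();
        System.out.println(store.getRegistry().register("sst-2"));
    }
}
```

In this model, requests that arrive after shutdown hit a closed registry and are ignored, which is the same guarantee the removed comment describes for pending disposal requests going against the old object.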

rkhachatryan · 2021-11-25T09:25:37Z

Hi @liming30 , are you planning to merge this PR in 1.15?
I'd like to merge #17774 in 1.15; and it depends on this PR (mostly for moving SharedStateRegistry closer to CompletedCheckpointStore).

If you aim at a different release then I could implement those changes in my PR (or take over this PR if you don't mind).

dmvk · 2021-11-25T10:02:27Z

We're also preparing some follow-up PRs that build on top of this one. It would be great if we could finish it soon ;)

rkhachatryan · 2021-12-02T22:22:58Z

I've opened #18001 to address the feedback above, rebase and merge it.
Please feel free to review.

pnowojski · 2021-12-10T12:18:54Z

superseded by #18001

liming30 changed the title ~~[FLINK-24086][checkpoint] Rebuilding the SharedStateRegistry only whe…~~ [FLINK-24086][checkpoint] Rebuilding the SharedStateRegistry only when the restore method is called for the first time. Sep 7, 2021

rmetzger added the review=description?, component=Runtime/Coordination, and component=Runtime/Checkpointing labels Sep 7, 2021

liming30 force-pushed the FLINK-24086 branch from 12f5359 to 203711c September 9, 2021 07:15

dawidwys self-assigned this Sep 9, 2021

liming30 force-pushed the FLINK-24086 branch from 203711c to 32c72cb October 17, 2021 13:39

liming30 force-pushed the FLINK-24086 branch from 32c72cb to 04ae4ec October 18, 2021 13:43

dawidwys requested changes Oct 19, 2021

fix comments

de26dc2

rkhachatryan mentioned this pull request Nov 12, 2021

[FLINK-24611] Prevent JM from discarding state on checkpoint abortion #17774

Merged

dawidwys requested changes Nov 12, 2021

rkhachatryan reviewed Nov 16, 2021

rkhachatryan added a commit to rkhachatryan/flink that referenced this pull request Dec 2, 2021

address apache#17179 (comment) by

927c87e

pushing SharedStateRegistry creation down the stack and passing checkpoints to it

rkhachatryan mentioned this pull request Dec 2, 2021

[FLINK-24086][runtime] Rebuilding the SharedStateRegistry only when the restore method is called for the first time #18001

Merged

rkhachatryan added a commit to rkhachatryan/flink that referenced this pull request Dec 2, 2021

address apache#17179 (comment)

fd0be1b

by pushing SharedStateRegistry creation down the stack and passing checkpoints to it

rkhachatryan added a commit to rkhachatryan/flink that referenced this pull request Dec 2, 2021

address apache#17179 (comment)

102b5c3

by closing SharedStateRegistry on CompletedCheckpointStore.shutdown

rkhachatryan added a commit to rkhachatryan/flink that referenced this pull request Dec 2, 2021

address apache#17179 (comment)

ba11841

by pushing SharedStateRegistry creation down the stack and passing checkpoints to it

rkhachatryan added a commit to rkhachatryan/flink that referenced this pull request Dec 2, 2021

address apache#17179 (comment)

7fc4a42

by closing SharedStateRegistry on CompletedCheckpointStore.shutdown

rkhachatryan added a commit to rkhachatryan/flink that referenced this pull request Dec 3, 2021

address apache#17179 (comment)

aadbc31

by pushing SharedStateRegistry creation down the stack and passing checkpoints to it

rkhachatryan added a commit to rkhachatryan/flink that referenced this pull request Dec 3, 2021

address apache#17179 (comment)

6ec26d9

by closing SharedStateRegistry on CompletedCheckpointStore.shutdown

pnowojski closed this Dec 10, 2021

7 participants
