Skip to content

[FLINK-7268] [checkpoints] Scope SharedStateRegistry objects per (re)start#4410

Merged
asfgit merged 3 commits intoapache:masterfrom
StefanRRichter:ImprovedIncreemntalCPDebug
Aug 15, 2017
Merged

[FLINK-7268] [checkpoints] Scope SharedStateRegistry objects per (re)start#4410
asfgit merged 3 commits intoapache:masterfrom
StefanRRichter:ImprovedIncreemntalCPDebug

Conversation

@StefanRRichter
Copy link
Copy Markdown
Contributor

@StefanRRichter StefanRRichter commented Jul 27, 2017

What is the purpose of the change

This PR fixes FLINK-7268. The problem was that ZookeeperCompletedCheckpointStore deletes checkpoints asynchronously. When this happens parallel to a restart, it could happen that the async delete performed shared state de-registration.

Before this PR, the old SharedStateRegistry was kept after restart and the counts where updated from the completed checkpoint store. In the described race, a checkpoint that has a pending delete will not contribute to the new count, but it can still decrement the count once the shared state is unregistered in the async deletion thread. This can accidentally drop counts below 1 and lead to data loss.

The core idea behind the PR is to scope the SharedStateRegistry per (re-)start, so that old pending deletes cannot influence the current count.

SharedStateRegistry is now created via a factory that is passed into the CheckpointCoordinator to simplify testing.

The PR also introduces additional tests and improves the debug/trace logging of incremental checkpointing.

Verifying this change

This change added tests and can be verified as follows:

Run a job with keyed state, using incremental checkpoints and HA mode. Kill TMs to trigger recovery. After a couple of attempts, the problematic condition should be triggered, leading to an infinite recovery loop due to FileNotFoundException. This should be fixed in the PR.

Additional tests:

HAIncrementalRocksDbBackendEventTimeWindowCheckpointingITCase
CheckpointCoordinatortest::testSharedStateRegistrationOnRestore
IncrementalKeyedStateHandleTest::testSharedStateReRegistration

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (YES)

Documentation

  • Does this pull request introduce a new feature? (no)

@StefanRRichter
Copy link
Copy Markdown
Contributor Author

CC @aljoscha

@aljoscha
Copy link
Copy Markdown
Contributor

I also reviewed CheckpointCoordinatorTest.testSharedStateRegistrationOnRestore(). Looks good 👍 The rest is the same as in #4410 so let's wait until #4353 is merged and then we can merge this one as well.

@StefanRRichter StefanRRichter force-pushed the ImprovedIncreemntalCPDebug branch 7 times, most recently from fdc84c0 to 10adb9d Compare August 15, 2017 12:54
StefanRRichter and others added 3 commits August 15, 2017 14:56
…Case

This helps tease out races, for example the recently discovered one in
cleanup of incremental state handles at the SharedStateRegistry.

(cherry picked from commit d7683cc)
@StefanRRichter StefanRRichter force-pushed the ImprovedIncreemntalCPDebug branch from 10adb9d to d29bed3 Compare August 15, 2017 12:57
@asfgit asfgit merged commit d29bed3 into apache:master Aug 15, 2017
sharedStateRegistry.close();
sharedStateRegistry = sharedStateRegistryFactory.create(executor);

// Recover the checkpoints, TODO this could be done only when there is a new leader, not on each recovery
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use highAvailabilityServices.getJobManagerLeaderRetriever(), Job Id is required.
Can Job Id be obtained from JobVertexID ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants