[FLINK-20978] Implement HeapSavepointRestoreOperation #14648

Closed
wants to merge 10 commits into from

Contributor

@dawidwys dawidwys commented Jan 14, 2021

What is the purpose of the change

The PR implements the logic of restoring a heap keyed state backend from a savepoint in a unified binary format.

Brief change log

  • Extract common logic for restoring from a savepoint
  • Introduce the HeapSavepointRestoreOperation for restoring from savepoint
  • Introduce SavepointKeyedStateHandle as a marker interface for differentiating a savepoint from a checkpoint
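
The marker-interface idea from the change log can be illustrated with a minimal sketch. This is not the code from the PR: apart from the name SavepointKeyedStateHandle, every type and method below (the one-method KeyedStateHandle stand-in, CheckpointHandle, SavepointHandle, chooseRestoreOperation) is hypothetical, and the rule "any savepoint handle selects the savepoint path" is an assumption made for illustration.

```java
import java.util.List;

/** Hypothetical sketch; these are NOT Flink's actual classes. */
final class RestoreStrategySketch {

    /** Simplified stand-in for a keyed state handle. */
    interface KeyedStateHandle {
        String location();
    }

    /** Marker interface: adds no methods, it only tags a handle as a savepoint. */
    interface SavepointKeyedStateHandle extends KeyedStateHandle {}

    /** Hypothetical handle describing a checkpoint. */
    static final class CheckpointHandle implements KeyedStateHandle {
        private final String location;

        CheckpointHandle(String location) {
            this.location = location;
        }

        @Override
        public String location() {
            return location;
        }
    }

    /** Hypothetical handle describing a savepoint. */
    static final class SavepointHandle implements SavepointKeyedStateHandle {
        private final String location;

        SavepointHandle(String location) {
            this.location = location;
        }

        @Override
        public String location() {
            return location;
        }
    }

    /** Assumed rule: if any handle is tagged as a savepoint, use the savepoint path. */
    static String chooseRestoreOperation(List<? extends KeyedStateHandle> handles) {
        boolean anySavepoint =
                handles.stream().anyMatch(h -> h instanceof SavepointKeyedStateHandle);
        return anySavepoint ? "savepoint-restore" : "checkpoint-restore";
    }
}
```

Because the marker carries no behavior, the restore code can branch on the kind of handle without changing how existing handles are implemented.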

For more info see the commit messages.

Verifying this change

All current tests for state backends should succeed.
Added a test for verifying the restore from the new savepoint format: SavepointStateBackendSwitchTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no): test dependency in flink-tests module
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented) : will be documented after the snapshotting part is ready

@flinkbot
Collaborator

flinkbot commented Jan 14, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 61ff07a (Fri May 28 08:57:22 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Jan 14, 2021

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@dawidwys dawidwys force-pushed the restore-savepoint-cleanups branch 2 times, most recently from 3cdc18c to 29862de Compare February 1, 2021 08:50
@dawidwys dawidwys marked this pull request as ready for review February 1, 2021 08:59
Contributor

@aljoscha aljoscha left a comment

The changes seem good to me, though there currently seem to be test failures. I didn't review the diff of extracting the restore logic to an iterator; instead, I reviewed it as a new piece of code and checked that the Rocks backend uses the iterator correctly. Also, I'm trusting the snapshot/restore tests quite a bit because they are decently thorough.

I had some inline questions.

@dawidwys dawidwys force-pushed the restore-savepoint-cleanups branch 3 times, most recently from 6698ad9 to 2fd191a Compare February 1, 2021 15:29
@dawidwys dawidwys force-pushed the restore-savepoint-cleanups branch 6 times, most recently from d448a45 to 5f2e115 Compare February 3, 2021 11:43
…Info

Some of the methods are annotated with @Nullable even though they
forward to methods annotated with @Nonnull.
This commit implements the logic of restoring a heap keyed state backend
from a savepoint in a unified binary format. It eagerly deserializes all
states and populates the in memory structures.
Introduce a marker SavepointKeyedStateHandle interface for state handles that describe savepoints. Based on the interface we can later decide which strategy to use when restoring from the handle.
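
The eager restore described in the commit message above can be sketched roughly as follows. This is a toy illustration, not the PR's implementation and not Flink's savepoint format: the binary layout (an entry count followed by key group, key, value triples), the class EagerRestoreSketch, and both method names are assumptions. The point it shows is that every entry is deserialized up front and placed into in-memory tables before any state is accessed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of an eager restore; the format below is invented. */
final class EagerRestoreSketch {

    /** Writes a toy savepoint: entry count, then (keyGroup, key, value) triples. */
    static byte[] writeToySavepoint(Map<String, Long> state, int numKeyGroups) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(state.size());
            for (Map.Entry<String, Long> entry : state.entrySet()) {
                // Assign each key to a key group by hashing (an assumption of this sketch).
                out.writeInt(Math.floorMod(entry.getKey().hashCode(), numKeyGroups));
                out.writeUTF(entry.getKey());
                out.writeLong(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Eagerly reads every entry and populates one in-memory table per key group. */
    static Map<Integer, Map<String, Long>> restore(byte[] savepoint) {
        Map<Integer, Map<String, Long>> tables = new HashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(savepoint))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                int keyGroup = in.readInt();
                String key = in.readUTF();
                long value = in.readLong();
                tables.computeIfAbsent(keyGroup, g -> new HashMap<>()).put(key, value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tables;
    }
}
```

The trade-off of an eager approach like this is that restore time and memory use grow with the total state size, which fits a backend that keeps all state on the heap anyway.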
@dawidwys
Contributor Author

dawidwys commented Feb 4, 2021

@flinkbot run azure

Contributor

@aljoscha aljoscha left a comment

The changes look very good! I had a comment about making the switching test generic for all state backends.

* Tests for the unified savepoint format. They verify you can switch a state backend through a
* savepoint.
*/
public class SavepointStateBackendSwitchTest {
Contributor

Instead of a specific test, could we add something like this to StateBackendTestBase to ensure that all backends that use this test (which should be all of them) can restore from a savepoint? That test is already quite long, so the code should probably be completely in a util but it should still be enforced by the base test.

Contributor Author

@dawidwys dawidwys Feb 5, 2021

What do you think about the version of this test from the next PR: 922be55#diff-73a96e160508de0055bc3b19ae1f425e14f8640ae112762cc64498b811d87c9b ?

I reworked it a bit there and made it more declarative. If that's something you like I could move it over to this PR already.

Contributor

I like that one, yes. But it still has the list of backends hardcoded. In the end it's probably not too big of a deal but I like forcing implementers of new backends to just have to support this if they use the test base. 😅
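
The idea of enforcing the savepoint check through the base test, rather than through a hardcoded list of backends, could look roughly like the sketch below. It is only an illustration of the pattern: ToyBackend, ToyBackendTestBase, HeapToyBackend and HeapToyBackendTest are hypothetical names, the contract is far smaller than a real state backend, and no test framework is used so that the example stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a base test that every backend test inherits. */
final class BackendTestBaseSketch {

    /** Toy backend contract, invented for this example. */
    interface ToyBackend {
        void put(String key, long value);

        Long get(String key);

        /** Produces a backend-independent snapshot. */
        Map<String, Long> savepoint();

        void restore(Map<String, Long> savepoint);
    }

    /** Subclasses only supply the backend; the restore check is inherited. */
    abstract static class ToyBackendTestBase {
        protected abstract ToyBackend createBackend();

        /** Declared final so a new backend's test class cannot opt out. */
        public final void testRestoreFromSavepoint() {
            ToyBackend source = createBackend();
            source.put("k", 42L);
            Map<String, Long> savepoint = source.savepoint();

            ToyBackend target = createBackend();
            target.restore(savepoint);
            if (!Long.valueOf(42L).equals(target.get("k"))) {
                throw new AssertionError("state lost across savepoint restore");
            }
        }
    }

    /** Minimal heap-based toy backend. */
    static final class HeapToyBackend implements ToyBackend {
        private final Map<String, Long> state = new HashMap<>();

        @Override
        public void put(String key, long value) {
            state.put(key, value);
        }

        @Override
        public Long get(String key) {
            return state.get(key);
        }

        @Override
        public Map<String, Long> savepoint() {
            return new HashMap<>(state); // defensive copy
        }

        @Override
        public void restore(Map<String, Long> savepoint) {
            state.clear();
            state.putAll(savepoint);
        }
    }

    /** A backend author writes only this; the savepoint test comes with the base class. */
    static final class HeapToyBackendTest extends ToyBackendTestBase {
        @Override
        protected ToyBackend createBackend() {
            return new HeapToyBackend();
        }
    }
}
```

With this shape, adding a backend means adding one subclass, and the savepoint restore check runs for it without anyone having to extend a central list.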

@aljoscha aljoscha self-assigned this Feb 5, 2021
Contributor

@aljoscha aljoscha left a comment

Thanks, the changes look very good!

@dawidwys dawidwys closed this in f56021c Feb 5, 2021