Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
[FLINK-8360] Implement task-local state recovery #5239
What is the purpose of the change
This changes introduces the task-local recovery feature. The main idea is to have a secondary, local copy of the checkpointed state, while there is still a primary copy in DFS that we report to the checkpoint coordinator.
Recovery can attempt to restore from the secondary local copy, if available, to save network bandwidth. This requires that the assignment from tasks to slots is as sticky is possible.
For starters, we will implement this feature for all managed keyed states and can easily enhance it to all other state types (e.g. operator state) later, because the basic infrastructure is already in place. This PR is on top of #4745.
Brief change log
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
Today I only managed to review up to
StateInitializationContextImpl.java from second commit. Will continue on Monday.
I did a peer review and walk through the code.
Now, since this PR is already complicated and needs a heavy rebase, I would be okay with doing that in another PR, if there is commitment to do this soon (before the 1.5 release branch is cut).
Slightly off topic: This code has a very distinct style of using many
Thanks for going through the general design @StephanEwen ! As we discussed, I agree with your first point. For the second point about RocksDB, this PR already contains an optimized way to deal with incremental local checkpoints that we did not discuss in our review, because I thought it is too much of a low level detail.
Full snapshots, work with duplicated streams.
I have finished looking through the second commit. Thanks for the patient and sorry for late response but it took me quite some time to understand what's this change is about and I was already kind of overloaded with other reviews :(