Skip to content

Conversation

@uce
Copy link
Contributor

@uce uce commented Oct 28, 2016

When restoring from a checkpoint/savepoint, state for each operator has to be restored. For savepoints, this means that the user cannot remove an operator from her topology and still use the savepoint.

With this change, we will allow to ignore state that cannot be mapped back to the job being restored. The default behaviour does not change.

Changes

  • I've removed the allOrNothingState flag as it was only effecting non-partitioned operator state and never set to true anyways (except tests). The flag controlled whether each non-partitioned operator state was restored.
  • Moved the savepoint path from the JobSnapshottingSettings to the JobGraph
  • Added the --ignoreUnmappedState (short -i) flag to the run command: bin/flink run -s <savepointPath> -i ...

I've tested this manually by triggering a savepoint for a job, adjusting the job (removing an operator), and then trying to resume from the savepoint. By default, restoring fails, but with the flag everything works.

Copy link
Contributor

@StefanRRichter StefanRRichter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, the proposed changes look good to me.

@uce
Copy link
Contributor Author

uce commented Oct 31, 2016

Thanks for the review. Going to merge this.

@StephanEwen
Copy link
Contributor

Looks good to me, except the name ignoreUnmappedState ;-)
As usual, the three most difficult things in computer science are (1) finding good names, and (2) off-by-one errors.

What do you think about calling something like allowUnresumedState or allowNonRestoredState? The "allow" to me implies that this is a valid scenario.

@uce
Copy link
Contributor Author

uce commented Oct 31, 2016

Very much agree Stephan! I don't know what sounds better to native speakers and more intuitive to users though... unresumed state or non restored state? ;-) @greghogan @jgrier do you have any input on this?

The internal behaviour is the following: The checkpoint/savepoint stores state for each operator of the original job graph (from which the checkpoint/savepoint was triggered) keyed by the operator ID. When a user resumes from this checkpoint/savepoint, the checkpoint coordinator tries to map each state (keyed by operator ID) to the operators of the new job. This PR allows ;-) that some of this state is not restored. Any ideas on how to call this?

uce added 2 commits November 1, 2016 17:51
Allow to specify whether a checkpoint restore should ignore
checkpoint state that it cannot map to the program. This is
exposed via the CLI in the run command:

bin/flink run -s <savepointPath> -i ...

Furthermore, the savepoint restore settings are moved out of
the snapshotting settings.
… state

Allows to ignore checkpoint state that cannot be mapped to a job vertex when
restoring.
@uce uce force-pushed the 4445-unmatched_state branch from 57621d3 to 5c44d6a Compare November 1, 2016 17:20
@uce uce force-pushed the 4445-unmatched_state branch from 5c44d6a to 98a1362 Compare November 1, 2016 17:45
@asfgit asfgit closed this in c0e620f Nov 2, 2016
@uce uce deleted the 4445-unmatched_state branch November 2, 2016 09:15
liuyuzhong pushed a commit to liuyuzhong/flink that referenced this pull request Dec 5, 2016
…int state

Allows to skip checkpoint state that cannot be mapped to a job vertex when
restoring.

This closes apache#2712.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants