[FLINK-10712] Support to restore state when using RestartPipelinedRegionStrategy #7009

Closed

Conversation

@Myasuka (Member) commented Nov 2, 2018

What is the purpose of the change

Currently, RestartPipelinedRegionStrategy does not perform any state restore. This is a big problem because all restarted regions come back with empty state. This PR adds support for restoring state when using RestartPipelinedRegionStrategy.

Brief change log

  • Implement a new restoreLatestCheckpointedState API for region-based failover in CheckpointCoordinator (see the sketch after this list).
  • Reload the checkpointed state when FailoverRegion calls its restart method.
  • Let StateAssignmentOperation assign state for the given executionVertices only.
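
A minimal sketch of the new restore entry point's rough shape, assuming the parameter names executionVertices and errorIfNoCheckpoint that appear in the review diffs further down; the actual signature in the commits may differ:

import java.util.List;

import org.apache.flink.runtime.executiongraph.ExecutionVertex;

// Sketch only: a region-scoped variant of restoreLatestCheckpointedState that restores the
// latest checkpointed state for the given execution vertices (e.g. the vertices of one
// failover region) instead of for the whole job. Names are taken from the review diffs
// below; the actual signature in the commits may differ.
interface RegionScopedRestore {

    boolean restoreLatestCheckpointedState(
        List<ExecutionVertex> executionVertices,
        boolean errorIfNoCheckpoint) throws Exception;
}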

Verifying this change

This change added tests and can be verified as follows:

  • Added unit tests for FailoverRegion to ensure the failover region calls the new restoreLatestCheckpointedState API of the CheckpointCoordinator.
  • Added unit tests to CheckpointCoordinatorTest to ensure the CheckpointCoordinator can restore state with RestartPipelinedRegionStrategy.
  • Added unit tests to CheckpointStateRestoreTest to ensure RestartPipelinedRegionStrategy handles restoring state from a checkpoint to the task executions correctly.
  • Added a new integration test, RegionFailoverITCase, to verify that state is restored properly when the job consists of multiple regions.
  • Refactored StreamFaultToleranceTestBase so that all subclass ITCases can fail over with state using RestartPipelinedRegionStrategy.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): don't know
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs

@StefanRRichter (Contributor) left a comment

I am not completely done with the review, but so far I already have some change requests. Mostly the problem is code duplication; we should try to come up with a single implementation that handles both cases.

@@ -201,31 +261,33 @@ private void assignTaskStateToExecutionJobVertices(

for (int subTaskIndex = 0; subTaskIndex < newParallelism; subTaskIndex++) {

Execution currentExecutionAttempt = executionJobVertex.getTaskVertices()[subTaskIndex]
.getCurrentExecutionAttempt();
if (subTaskIndices.contains(subTaskIndex)) {
Contributor:

Instead of `for i in (0 .. newParallelism) -> contains(i)`, why not supply an `Iterable<Integer> subtaskIDs` instead and then use `for (int subtask : subtaskIDs)`? The old code path would just pass in an iterable from 0 to the new parallelism.

* that restores <i>non-partitioned</i> state from this
* checkpoint.
*/
public boolean restoreLatestCheckpointedState(
Contributor:

Again, this is almost a complete duplication of the original method. We should unify both methods to keep this maintainable.

}

return true;
}

private void assignAttemptState(ExecutionJobVertex executionJobVertex, List<OperatorState> operatorStates) {
private void assignAttemptState(ExecutionJobVertex executionJobVertex, List<OperatorState> operatorStates, Set<Integer> subTaskIndices) {
Contributor:

I doubt that Set<Integer> is the best representation of the subtask indexes. At least at the interface level, an Iterable<Integer> could do the job if we rewrite the loop as I suggested. Furthermore, we can back this with a more memory-friendly implementation, for example a boolean[] or a BitSet.

@Myasuka (Member, author) commented Nov 30, 2018

@StefanRRichter Thanks for your comments, I will refactor this PR.
BTW, I found that region failover without letting the checkpoint coordinator restart its checkpointScheduler would not guarantee the EXACTLY_ONCE mechanism. I'll include this part of the modification in the next commits.

@StefanRRichter (Contributor):

Ok, sounds good, looking forward to the new changes!

@Myasuka (Member, author) commented Dec 21, 2018

@StefanRRichter Would you please take a look at the new commit? I really appreciate any help you can provide.

@StefanRRichter (Contributor) left a comment

Thanks for the update! I have a few suggestions for improvements in the detailed comments. More importantly, I was wondering whether we even need this approach in all its complexity, or whether we can get away with a slightly modified call to the existing code (details in the inline comments). This is an important point to figure out before we proceed. Please also try to avoid unrelated changes, in particular to indentation, because they make the diff bigger and harder to review.

@@ -77,8 +77,8 @@

private static String outPath;

@BeforeClass
public static void createHDFS() throws IOException {
@Before
Contributor:

All the changes to the file sink tests look unrelated. Do they fix some problem? Is there any reason why we should combine them with this PR instead of putting them in a separate PR? If you agree, I would suggest reverting such changes.

Member Author:

This is because the parent class StreamFaultToleranceTestBase introduced a parameterized test over two different failover strategies, with which I want to verify whether streaming programs can still be fault tolerant under RestartPipelinedRegionStrategy. However, files left on HDFS by RollingSinkFaultToleranceITCase after the test has run once would cause it to fail in the subsequent run with the other strategy. That's why I had to modify this test.
I think that after this PR, FLINK-10713 should also add RestartIndividualStrategy to StreamFaultToleranceTestBase.

boolean errorIfNoCheckpoint) throws Exception {

Map<JobVertexID, ExecutionJobVertex> tasks = new HashMap<>(executionVertices.size());
Map<JobVertexID, BitSet> executionJobVertexIndices = new HashMap<>(executionVertices.size());
Contributor:

I think it could make more sense to combine the maps tasks and executionJobVertexIndices, for example into a Map<JobVertexID, ExecutionJobVertexWithSelectedSubtasks>, where ExecutionJobVertexWithSelectedSubtasks is a POJO combining an ExecutionJobVertex with its corresponding BitSet / Iterable<Integer>. This saves lookups and the memory for unnecessary data structures.
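
A minimal sketch of the POJO suggested here, assuming the name ExecutionJobVertexWithSelectedSubtasks from the comment and a BitSet as the backing structure; it is illustrative only, not code from the PR:

import java.util.BitSet;

import org.apache.flink.runtime.executiongraph.ExecutionJobVertex;

// Illustrative sketch of the suggested POJO: bundles an ExecutionJobVertex with the subtasks
// selected for state assignment, so state assignment needs a single map lookup instead of two.
final class ExecutionJobVertexWithSelectedSubtasks {

    private final ExecutionJobVertex jobVertex;
    private final BitSet selectedSubtasks; // bit i set => subtask i gets state assigned

    ExecutionJobVertexWithSelectedSubtasks(ExecutionJobVertex jobVertex, BitSet selectedSubtasks) {
        this.jobVertex = jobVertex;
        this.selectedSubtasks = selectedSubtasks;
    }

    ExecutionJobVertex getJobVertex() {
        return jobVertex;
    }

    BitSet getSelectedSubtasks() {
        return selectedSubtasks;
    }
}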

@@ -54,6 +55,7 @@
private static final Logger LOG = LoggerFactory.getLogger(StateAssignmentOperation.class);

private final Map<JobVertexID, ExecutionJobVertex> tasks;
private final Map<JobVertexID, BitSet> taskIndices;
Contributor:

In my previous suggestion, what I really meant was using something like the interface Iterable<Integer>, not the BitSet implementation directly. Then we could have two implementations of the iterable: one that delegates to a BitSet internally, and one for the general case that just takes a number (the parallelism) and generates the sequence 0..parallelism. This keeps the general code path more memory friendly and does not tie us to BitSet. Instead of Iterable, Collection could be considered if you find that knowing the size opens up optimizations.
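
A self-contained sketch of the two backing implementations described here, in plain Java without Flink types; all names are made up for illustration:

import java.util.BitSet;
import java.util.stream.IntStream;

// Sketch of the two Iterable<Integer> implementations described above: one backed by a BitSet
// for region failover, and one that lazily produces the full range 0..parallelism-1 for the
// general restore path.
public class SubtaskIndexIterables {

    // Region failover: iterate only the subtask indices whose bit is set.
    static Iterable<Integer> fromBitSet(BitSet selectedSubtasks) {
        return () -> selectedSubtasks.stream().iterator();
    }

    // Full restore: iterate 0..parallelism-1 without materializing a collection.
    static Iterable<Integer> fullRange(int parallelism) {
        return () -> IntStream.range(0, parallelism).iterator();
    }

    public static void main(String[] args) {
        BitSet failedSubtasks = new BitSet();
        failedSubtasks.set(1);
        failedSubtasks.set(3);

        for (int subtask : fromBitSet(failedSubtasks)) {
            System.out.println("assign state to subtask " + subtask); // prints 1, 3
        }
        for (int subtask : fullRange(4)) {
            System.out.println("assign state to subtask " + subtask); // prints 0..3
        }
    }
}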

Map<OperatorInstanceID, List<KeyedStateHandle>> subManagedKeyedState,
Map<OperatorInstanceID, List<KeyedStateHandle>> subRawKeyedState,
int newParallelism) {
ExecutionJobVertex executionJobVertex,
Contributor:

There are a lot of unrelated formatting changes to method parameters. Can you please revert all the indentation changes, because they make the diff bigger and therefore reviews harder?

executionGraph.getCheckpointCoordinator().restoreLatestCheckpointedState(
connectedExecutionVertexes, false, false);
connectedExecutionVertexes, false);
Contributor:

I have a general high-level question about this whole approach that I forgot to ask in my previous review: do we even need to introduce this additional restore method? It seems to me that all it does is optimize the state assignment a little for regional failover, at slightly increased complexity for the full-recovery case.

Instead, couldn't we simply use the existing restoreLatestCheckpointedState method, without providing any indexes, and just call it from here with executionGraph.allVertices()? State assignment is a simple metadata operation and should run very fast in general. As this will only modify the restoreState variable, only the vertices that are actually restarted for the failed region see the effect; the remaining vertices are not restarted and do not care about the change. We might need to change setInitialState for this, by removing the precondition that checks for CREATED or by only assigning to instances in that state.
If this is just an optimization for this special case, dropping it could reduce the amount of changes and potential bugs by a lot. What do you think? Did you find any other reason why this whole index handling is required?
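
A minimal sketch of this alternative, assuming the pre-existing three-argument restoreLatestCheckpointedState signature and ExecutionGraph#getAllVertices(); the flag values and helper name are illustrative:

import org.apache.flink.runtime.checkpoint.CheckpointCoordinator;
import org.apache.flink.runtime.executiongraph.ExecutionGraph;

// Sketch of the suggested alternative, assuming the pre-existing
// restoreLatestCheckpointedState(Map<JobVertexID, ExecutionJobVertex>, boolean, boolean)
// signature; the boolean flags are illustrative choices, not the PR's actual call.
final class RegionRestoreSketch {

    static void restoreForFailedRegion(ExecutionGraph executionGraph) throws Exception {
        CheckpointCoordinator coordinator = executionGraph.getCheckpointCoordinator();
        if (coordinator != null) {
            // Restore against all vertices: only the vertices that are actually restarted
            // pick up the new assignment; the ones that keep running ignore it.
            coordinator.restoreLatestCheckpointedState(
                executionGraph.getAllVertices(),
                false /* errorIfNoCheckpoint */,
                true /* allowNonRestoredState */);
        }
    }

    private RegionRestoreSketch() {}
}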

Member Author:

From my point of view, we should not depend on an execution's current status to determine whether to assign state.
With the given indices, we can figure out the logical plan containing exactly the executions that need state assigned, whereas trying to assign state to all executions and using the execution's status to decide whether to assign the initial state could indeed cause potential bugs.
Moreover, the current implementation already uses one base restoreLatestCheckpointedState method to handle both situations, to reduce the chance of bugs. On the other hand, I think FLINK-10713 could also reuse the new restoreLatestCheckpointedState(List<ExecutionVertex>, boolean) method.
I'm not sure whether I have expressed my thoughts clearly enough to convince you, but please leave any thoughts or concerns if further discussion is still needed.

Contributor:

For all cases, as I see it, the indexes work as an optimization that runs the repartitioning only for some tasks. I think it would be OK to drop the precondition check, because the assignment is only done once at the very beginning, so the check seems a bit overcautious at this point. If you think the optimization is helpful, we can introduce it with the small changes I proposed. However, I doubt that we will see real performance benefits from the reassignment, while the code is getting more complex, which makes bugs more likely. Maybe you could just try out whether there is an observable performance difference between using the indexes and simply reassigning to all tasks?

@@ -985,6 +986,67 @@ int getNumScheduledTasks() {
* restore from.
* @param allowNonRestoredState Allow checkpoint state that cannot be mapped
* to any job vertex in tasks.
*/
public boolean restoreLatestCheckpointedState(
Contributor:

If you want, it seems to me like we can remove the return value from this method and change it to void.

@StefanRRichter (Contributor) commented Jan 9, 2019

I have one more concern that might lead to bugs in a certain corner case: what will happen with your change if the task is using operator state, union state in particular? In applyRepartitioner(),

if (OperatorStateHandle.Mode.UNION.equals(metaInfo.getDistributionMode())) {

you can see that all operator state is repartitioned if there is one union state. This can already be a problem with a union state alone, but even more so if there is a union state and some partitionable state: the partitioning of the partitionable state for the tasks that are restarted could differ from the partitioning used in the original run, so some partitions could be dropped or assigned twice. I think that means we need to change the method to only redistribute the union states, and I wonder whether distributing the union state only for the failed task even makes sense. I think we need a test case for this scenario (an operator with one union and one partitionable operator state), and I think it might fail as described above when we check how the operator state was reassigned after a partial recovery. What do you think?

@Myasuka (Member, author) commented Jan 13, 2019

@StefanRRichter It seems that union operator state somehow conflicts with partial recovery. However, since a region is the minimal pipelined connected subgraph, why could it have union state across different regions? Otherwise, we might need to introduce some limitations for this scenario.

Would you please kindly clarify this corner case in more detail?

@StefanRRichter (Contributor) commented Jan 14, 2019

@Myasuka Yes, in the current implementation union state is a problem, and unfortunately it is also used in some popular operators. For example, the KafkaConsumer "abuses" union state to have a rescaling protocol that can support partition discovery. In a nutshell, during restore all parallel instances see the Kafka offsets of all partitions, and every instance cherry-picks Kafka partitions through a protocol that all instances follow. So every partition goes to exactly one operator and there is no need for communication between instances.

The contract of union state is that on recovery, each operator instance sees the union of the states from all instances. The question for partial recovery is now: will recovering instances see (i) all states, (ii) all states from other recovering instances, or (iii) only their old state? I think most likely we should go for option (i), but this also means that all states would have to go into the operation and cannot be excluded via the index.

Then there is another problem in the current implementation that can lead to bugs with your code in the following way: if there is at least one union state, all other operator states go through round-robin reassignment as well. So we round-robin reassign some state, but only the restarting operators load the reassigned version of it; the instances that keep running will run with the old assignment. This can lead to some partitions being assigned twice or not being assigned at all.

One way to solve this problem would be to separate union states from the other operator states and only round-robin assign operator state if the parallelism did not change (which it never does for recoveries, only for restarts).

@Myasuka (Member, author) commented Jan 14, 2019

@StefanRRichter Thanks for your explanation. I still have two questions:

  1. Even if we assign the repartitioned operator state to all operator instances, the current taskRestore within Execution can only be shipped to the TaskManagers for executions located in the failed region, and the instances that keep running would not know that the operator state has changed. The possible bug you mentioned ("this can lead to some partitions being assigned twice or not being assigned at all") is therefore more on the execution-graph side; how do we define the 'disunity' among tasks?

  2. Your last suggestion is a bit confusing to me; please correct me if I am wrong: did you actually mean to only round-robin assign operator state if the parallelism did change? The parallelism can only change when the job is restarted; it does not change during job failover.

@StefanRRichter (Contributor):

@Myasuka

  1. Yes, that is correct. We need to double check, for operators that use union state, whether their protocol also works for partial restores - for example, whether under the same parallelism they will always pick the same state from the union they checkpointed and will not produce conflicts with the operators that are still running.

  2. Again correct, it was a typo. We should do no redistribution for those states if the parallelism did not change. That already avoids the problem for all cases except 1 (see the sketch after this list).
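
A purely illustrative sketch of the rule just clarified, using made-up types rather than Flink's operator state classes: with unchanged parallelism only union state is re-broadcast, while partitionable state keeps its previous per-subtask assignment:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; the types here are made up and are not Flink's
// OperatorStateHandle machinery. With unchanged parallelism (the recovery case),
// partitionable state keeps its old per-subtask assignment and only UNION state is
// re-broadcast to every subtask.
final class OperatorStateReassignmentSketch {

    enum DistributionMode { SPLIT_DISTRIBUTE, UNION }

    /** One operator-state partition, as written by some subtask into the checkpoint. */
    static final class StatePartition {
        final int writerSubtask;
        final DistributionMode mode;

        StatePartition(int writerSubtask, DistributionMode mode) {
            this.writerSubtask = writerSubtask;
            this.mode = mode;
        }
    }

    static void assign(
            List<StatePartition> checkpointed,
            int oldParallelism,
            int newParallelism,
            Map<Integer, List<StatePartition>> newAssignment) {

        boolean parallelismChanged = oldParallelism != newParallelism;

        for (StatePartition partition : checkpointed) {
            if (partition.mode == DistributionMode.UNION) {
                // Union contract: every restored subtask sees the union of all partitions.
                for (int subtask = 0; subtask < newParallelism; subtask++) {
                    newAssignment.computeIfAbsent(subtask, k -> new ArrayList<>()).add(partition);
                }
            } else if (!parallelismChanged) {
                // Recovery: each subtask gets back exactly what it checkpointed, no round robin.
                newAssignment.computeIfAbsent(partition.writerSubtask, k -> new ArrayList<>()).add(partition);
            }
            // else: a restart with changed parallelism; round-robin/range repartitioning
            // of SPLIT_DISTRIBUTE state would go here.
        }
    }

    private OperatorStateReassignmentSketch() {}
}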

@Myasuka (Member, author) commented Jan 18, 2019

@StefanRRichter I have gone through all the production code using union state, as follows:

For ArtificalOperatorStateMapper, SimpleEndlessSourceWithBloatedState, and StateCreatingFlatMap, they use union state only for end-to-end test verification.

For FlinkKafkaProducer and FlinkKafkaProducer011, the union nextTransactionalIdHintState is the same for all subtasks.

For the KafkaConsumer, since Kafka does not support decreasing the partition count, the Kafka partitions are sticky to their subtasks as long as the parallelism has not changed.

I think the operators above would not hit the conflict case.

For SequenceGeneratorSource, it initializes its state with the max event time. However, monotonousEventTime only increases by the fixed step eventTimeClockProgressPerEvent, which should be the same for all subtasks.

For StreamingFileSink, it also takes the max counter, but by the definition of StreamingFileSink, when restoring from a checkpoint the restored files in pending state are moved to finished state, while in-progress files are rolled back. In other words, from my point of view, partial recovery is not suitable for StreamingFileSink.

From my point of view, all operators in production code with union state, except StreamingFileSink, should work fine for partial restore. However, since the getUnionListState API is public to users, we cannot control users' behavior. In a nutshell, if we support restoring state when using RestartPipelinedRegionStrategy, we should add a limitation for union state.

I plan to add another parameter, possibly a RecoveryMode, to the restoreLatestCheckpointedState method. For RestartAllStrategy, and for RestartPipelinedRegionStrategy when the whole job is a single region, it would be RecoveryMode.ALL; for the other failover strategies it would be RecoveryMode.PARTIAL. With this, when assigning state, an unsupported exception can be thrown if we find RecoveryMode.PARTIAL and union state exists. What do you think?
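
A minimal sketch of this RecoveryMode proposal; the enum and helper are hypothetical and only OperatorStateHandle.Mode is an existing Flink type:

import org.apache.flink.runtime.state.OperatorStateHandle;

// Sketch of the proposal above; RecoveryMode and the helper are illustrative, not the PR's
// actual code. The idea: fail fast when union state is encountered while only part of the
// job is being restored.
final class RecoveryModeSketch {

    enum RecoveryMode {
        ALL,     // RestartAllStrategy, or region failover when the whole job is a single region
        PARTIAL  // region-based failover that restores only a subset of the tasks
    }

    static void checkUnionStateSupported(RecoveryMode recoveryMode, OperatorStateHandle.Mode distributionMode) {
        if (recoveryMode == RecoveryMode.PARTIAL && distributionMode == OperatorStateHandle.Mode.UNION) {
            throw new UnsupportedOperationException(
                "Union operator state cannot be restored by a partial (region-based) recovery.");
        }
    }

    private RecoveryModeSketch() {}
}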

@StefanRRichter (Contributor):

I think I agree with the assessment of the existing operators.

About adding a RecoveryMode: would that mean that we prevent all jobs that use union state from working with partial recovery? If we just consider a few popular operators like the KafkaConsumer, that would already prevent a lot of jobs from using the other recovery modes.

I can see that this comes from the concern about existing code that uses union state. However, strictly speaking it would not be a regression, because those recovery modes previously did not support state recovery at all. We also cannot prevent users from making wrong implementations, so I feel a good thing to do is to document what to take care of with union state when using such recovery modes.

@Myasuka (Member, author) commented Feb 23, 2019

A new PR #7813 has been created to replace this one, since this code is outdated.

@tillrohrmann (Contributor):

Closing PR because it has been subsumed by #7813.
