[FLINK-10712] Support to restore state when using RestartPipelinedRegionStrategy #7009
Conversation
I am not completely done with the review, but so far I already have some change requests. The main problem is code duplication. We should try to come up with a single implementation that handles both cases.
@@ -201,31 +261,33 @@ private void assignTaskStateToExecutionJobVertices(

    for (int subTaskIndex = 0; subTaskIndex < newParallelism; subTaskIndex++) {
        Execution currentExecutionAttempt = executionJobVertex.getTaskVertices()[subTaskIndex]
            .getCurrentExecutionAttempt();
        if (subTaskIndices.contains(subTaskIndex)) {
Instead of `for i in (0..newParallelism) -> contains(i)`, why not supply an `Iterable<Integer> subtaskIDs` instead and then `for (int subtask : subtaskIDs)`? The old code path would just pass in an iterable from 0 to the new parallelism.
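The suggested rewrite can be sketched in isolation (plain `java.util` types stand in for the Flink classes here; `assignSelected` and its body are hypothetical placeholders for the per-subtask state assignment):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SubtaskLoopSketch {

    // Suggested code path: iterate only the subtasks that are actually restored.
    static List<Integer> assignSelected(Iterable<Integer> subtaskIDs) {
        List<Integer> assigned = new ArrayList<>();
        for (int subtask : subtaskIDs) {
            assigned.add(subtask); // stand-in for the per-subtask state assignment
        }
        return assigned;
    }

    public static void main(String[] args) {
        // Regional failover: only subtasks 1 and 3 need their state restored.
        System.out.println(assignSelected(Arrays.asList(1, 3)));

        // Old code path: pass an iterable covering 0..newParallelism.
        int newParallelism = 4;
        System.out.println(assignSelected(
            IntStream.range(0, newParallelism).boxed().collect(Collectors.toList())));
    }
}
```

This keeps a single loop body for both the regional and the full-restore case; only the iterable differs.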
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/StateAssignmentOperation.java
 * that restores <i>non-partitioned</i> state from this
 * checkpoint.
 */
public boolean restoreLatestCheckpointedState(
Again, this is almost a complete duplication of the original method. We should unify both methods to keep this maintainable.
    }

    return true;
}

-private void assignAttemptState(ExecutionJobVertex executionJobVertex, List<OperatorState> operatorStates) {
+private void assignAttemptState(ExecutionJobVertex executionJobVertex, List<OperatorState> operatorStates, Set<Integer> subTaskIndices) {
I doubt that `Set<Integer>` is the best representation of subtask indexes. At least at the interface level, an `Iterable<Integer>` could do the job if we rewrite the loop as I suggested. Furthermore, we can have a more memory-friendly implementation to back this, for example `boolean[]` or `BitSet`.
@StefanRRichter Thanks for your comments, I will refactor this PR.
Ok, sounds good, looking forward to the new changes!
Force-pushed the branch (…ng RestartPipelinedRegionStrategy) from 84d3487 to d102a51.
@StefanRRichter Would you please take a look at the new commit? I really appreciate any help you can provide.
Thanks for the update! I have a few suggestions for improvements in the detailed comments. More importantly, I was wondering whether we even need this approach in all its complexity, or whether we can get away with a slightly modified call to existing code (details in the inline comments). This is an important point to figure out before we proceed. Please also try to avoid unrelated changes, in particular to indentation, because they make the diff bigger and harder to review.
...k-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/FailoverRegion.java
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
@@ -77,8 +77,8 @@

    private static String outPath;

-   @BeforeClass
-   public static void createHDFS() throws IOException {
+   @Before
All the changes to the file sink tests look unrelated. Do they fix some problem? Is there any reason why we should combine them with this PR instead of having a separate PR? If you agree, I would suggest reverting those changes.
This is because its parent `StreamFaultToleranceTestBase` introduced a parameterized test over two different failover strategies, with which I want to verify whether streaming programs can still be fault tolerant with `RestartPipelinedRegionStrategy`. However, files left on HDFS by `RollingSinkFaultToleranceITCase` after the test has run once would cause it to fail in the next run with the other strategy. That's why I had to modify this test.

I think after this PR, FLINK-10713 should also add `RestartIndividualStrategy` to `StreamFaultToleranceTestBase`.
        boolean errorIfNoCheckpoint) throws Exception {

    Map<JobVertexID, ExecutionJobVertex> tasks = new HashMap<>(executionVertices.size());
    Map<JobVertexID, BitSet> executionJobVertexIndices = new HashMap<>(executionVertices.size());
I think it could make more sense to combine the maps `tasks` and `executionJobVertexIndices`, for example into a `Map<JobVertexID, ExecutionJobVertexWithSelectedSubtasks>`, where `ExecutionJobVertexWithSelectedSubtasks` is a POJO that combines an `ExecutionJobVertex` with its corresponding `BitSet` / `Iterable<Integer>`. This can save lookups and memory for unnecessary data structures.
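A minimal sketch of that POJO idea (the class and field names are illustrative, and a trivial stand-in replaces Flink's `ExecutionJobVertex`):

```java
import java.util.BitSet;

public class VertexWithSubtasksSketch {

    // Stand-in for Flink's ExecutionJobVertex; only carries a parallelism here.
    static class ExecutionJobVertex {
        final int parallelism;
        ExecutionJobVertex(int parallelism) { this.parallelism = parallelism; }
    }

    // The suggested POJO: one map value carries both the vertex and the subtasks
    // selected for restore, so a single lookup yields both pieces of information.
    static class ExecutionJobVertexWithSelectedSubtasks {
        final ExecutionJobVertex vertex;
        final BitSet selectedSubtasks;

        ExecutionJobVertexWithSelectedSubtasks(ExecutionJobVertex vertex, BitSet selectedSubtasks) {
            this.vertex = vertex;
            this.selectedSubtasks = selectedSubtasks;
        }
    }

    public static void main(String[] args) {
        ExecutionJobVertex vertex = new ExecutionJobVertex(4);
        BitSet selected = new BitSet(vertex.parallelism);
        selected.set(1);
        selected.set(3);
        ExecutionJobVertexWithSelectedSubtasks entry =
            new ExecutionJobVertexWithSelectedSubtasks(vertex, selected);
        System.out.println(entry.selectedSubtasks);
    }
}
```

With this shape, `Map<JobVertexID, ExecutionJobVertexWithSelectedSubtasks>` replaces the two parallel maps.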
@@ -54,6 +55,7 @@

    private static final Logger LOG = LoggerFactory.getLogger(StateAssignmentOperation.class);

    private final Map<JobVertexID, ExecutionJobVertex> tasks;
+   private final Map<JobVertexID, BitSet> taskIndices;
In my previous suggestion, what I meant was really using something like the interface `Iterable<Integer>`, not directly the implementation `BitSet`. Then we could have two implementations of the iterable: one that delegates to a `BitSet` internally, and one for the general case that just takes a number (the parallelism) and generates a sequence from 0..parallelism. This would keep the general code path more memory-friendly and does not tie us to `BitSet`. Instead of `Iterable`, a `Collection` could be considered if you find that knowing the size opens up optimizations.
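One way to realize this with two `Iterable<Integer>` implementations, a `BitSet`-backed one for regional failover and a lazy range for the general path (a sketch only; the factory names `fromBitSet` and `range` are invented here):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Iterator;
import java.util.List;

public class SubtaskIterables {

    // Regional-failover case: iterate only the subtask indexes set in the BitSet.
    static Iterable<Integer> fromBitSet(BitSet bits) {
        return () -> new Iterator<Integer>() {
            private int next = bits.nextSetBit(0);

            public boolean hasNext() { return next >= 0; }

            public Integer next() {
                int current = next;
                next = bits.nextSetBit(current + 1);
                return current;
            }
        };
    }

    // General case: a lazy 0..parallelism sequence with no per-index storage.
    static Iterable<Integer> range(int parallelism) {
        return () -> new Iterator<Integer>() {
            private int next = 0;

            public boolean hasNext() { return next < parallelism; }

            public Integer next() { return next++; }
        };
    }

    static List<Integer> collect(Iterable<Integer> subtasks) {
        List<Integer> out = new ArrayList<>();
        for (int subtask : subtasks) {
            out.add(subtask);
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(2);
        bits.set(5);
        System.out.println(collect(fromBitSet(bits)));
        System.out.println(collect(range(3)));
    }
}
```

Callers would only see `Iterable<Integer>`, so the loop in `StateAssignmentOperation` stays identical for both paths.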
        Map<OperatorInstanceID, List<KeyedStateHandle>> subManagedKeyedState,
        Map<OperatorInstanceID, List<KeyedStateHandle>> subRawKeyedState,
-       int newParallelism) {
+       ExecutionJobVertex executionJobVertex,
There are a lot of unrelated formatting changes to method parameters. Can you please revert all the indentation changes, because they make the diff bigger and therefore reviews harder.
    executionGraph.getCheckpointCoordinator().restoreLatestCheckpointedState(
-       connectedExecutionVertexes, false, false);
+       connectedExecutionVertexes, false);
I have a general, high-level question about this whole approach that I forgot to ask in my previous review: do we even need to introduce this additional restore method? It seems to me that all it does is optimize the state assignment a little bit for regional failover (at slightly increased complexity for the full recovery cases).

Instead, couldn't we simply use the existing `restoreLatestCheckpointedState` method, without providing any indexes, and just call it from here with `executionGraph.allVertices()`? State assignment is a simple metadata operation and should run very fast in general. As this will only modify the `restoreState` variable, only the vertices that are actually restarted for the failed region see the effect. The remaining vertices are not restarted and do not care about the change. We might need to change `setInitialState` for this by removing the precondition that checks for `CREATED`, or by only assigning to instances in this state.

If this is just an optimization attempt for this special case, dropping it could reduce the amount of changes and potential bugs by a lot. What do you think? Did you find any other reason why this whole index handling is required?
In my point of view, we should not depend on an execution's current status to determine whether to assign state. With the given indices, we can figure out the logical plan containing exactly the executions that need to be assigned. Trying to assign state to all executions and using the execution's status to decide whether to assign initial state could indeed cause potential bugs.

Moreover, the current implementation already uses one base `restoreLatestCheckpointedState` method to handle both situations, which reduces the chance of bugs. On the other hand, I think FLINK-10713 could also reuse the new `restoreLatestCheckpointedState(List<ExecutionVertex>, boolean)` method.

I'm not sure whether I have expressed my thoughts clearly enough to convince you, but please leave any thoughts or concerns if further discussion is still needed.
For all cases, as I see it, the indexes work as an optimization to run the repartitioning only for some tasks. I think it would be OK to drop the precondition check, because the assignment is only done once at the very beginning, so it seems a bit overcautious at this point. If you think that the optimization is helpful, we can introduce it with the small changes I proposed. However, I doubt that we would see real performance benefits in the reassignment, while the code is getting more complex, which can make bugs more likely. Maybe you could just try out whether there is an observable performance difference between using the indexes and simply reassigning to all tasks?
@@ -985,6 +986,67 @@ int getNumScheduledTasks() {
 * restore from.
 * @param allowNonRestoredState Allow checkpoint state that cannot be mapped
 *     to any job vertex in tasks.
 */
public boolean restoreLatestCheckpointedState(
If you want, it seems to me that we can remove the return value from this method and change it to `void`.

I have one more concern that might lead to bugs in a certain corner case: what will happen in your change if the task uses operator state, union state in particular? In flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/StateAssignmentOperation.java (line 636 in 1e2aa8e) you can see that all operator state is repartitioned if there is one union state. This can already be a problem with a union state alone, but even more so if there is a union state together with some partitionable state: the partitioning of the partitionable state for the tasks that are restarted could differ from the partitioning used in the original run, so some partitions could be dropped or assigned twice. I think that means we need to change the method to only redistribute the union states, and I wonder whether distributing the union state only for the failed task even makes sense. I think we need a test case for this scenario (an operator with one union and one partitionable operator state), and I think it might fail as described when we check how the operator state was reassigned after a partial recovery. What do you think?
@StefanRRichter It seems union operator state somehow conflicts with partial recovery. Would you please kindly help to clarify this corner case further?
@Myasuka Yes, in the current implementation union state is a problem, and unfortunately it is also used in some popular operators. For example, the KafkaConsumer "abuses" union state to have a rescaling protocol that can support partition discovery. In a nutshell, during restore all parallel instances see the Kafka offsets of all partitions, and every instance will cherry-pick Kafka partitions through a protocol that all instances follow. So every partition will go to exactly one operator, and there is no need for communication between instances for that. The contract of union state is that in recovery, each operator instance sees the union of the states from all instances. The question for partial recovery is now: will recovering instances see i) all states, ii) all states from other recovering instances, or iii) only their old state? I think most likely we should go for option i), but this also means that all states would have to go into the operation and cannot be excluded via the index. Then there is another problem in the current implementation that can lead to bugs with your code in the following way: if there is at least one union state, all other operator states will go through round-robin reassignment as well. So we round-robin reassign some state, but only restarting operators load the reassigned version of the state. Instances that we keep running will run with the old assignment of the state. This can lead to some partitions being assigned twice or not being assigned at all. One way to solve this problem would be to separate union states from other operator states and only round-robin assign operator state if the parallelism changed (which it never does for recoveries, only for restarts).
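The separation described above can be illustrated with a toy repartitioner (purely illustrative; the real logic lives in Flink's `StateAssignmentOperation`, and these method names are invented):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnionVsRoundRobinSketch {

    // Union state contract: every restored instance receives the union of all
    // checkpointed partitions and then cherry-picks what it needs.
    static List<List<String>> assignUnion(List<String> partitions, int parallelism) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            result.add(new ArrayList<>(partitions));
        }
        return result;
    }

    // Plain partitionable state: round-robin redistribution. If only a subset of
    // instances reloads this new layout while the rest keep running with the old
    // one, partitions can end up dropped or duplicated. Hence the suggestion to
    // skip round-robin reassignment when the parallelism is unchanged.
    static List<List<String>> assignRoundRobin(List<String> partitions, int parallelism) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            result.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions.size(); p++) {
            result.get(p % parallelism).add(partitions.get(p));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("p0", "p1", "p2", "p3");
        System.out.println(assignUnion(partitions, 2));
        System.out.println(assignRoundRobin(partitions, 2));
    }
}
```

In the union case every instance sees all four partitions; in the round-robin case each partition lands on exactly one instance, which is why a partial reload of the round-robin layout can drop or duplicate partitions.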
@StefanRRichter Thanks for your explanation. I still have two questions below:

@StefanRRichter I have gone through all production code using union state as below: For … For … For … I think the above operators would not meet the conflict case. For … For … From my point of view, all operators in production code with union state, except … I plan to add another parameter, which might be …
I think I agree with the assessment of the existing operators. About adding a … , I can see that this comes from the concern about existing code that uses union state. However, strictly speaking it should not be a regression, because those recovery modes previously did not support state recovery at all. We also cannot prevent users from making wrong implementations, so I feel a good thing to do is document what to take care of for union state when using such recovery modes.
A new PR #7813 was created to replace this one due to outdated code.
Closing PR because it has been subsumed by #7813.
What is the purpose of the change
Currently, RestartPipelinedRegionStrategy does not perform any state restore. This is a big problem because all restarted regions start with empty state. This PR adds support for restoring state when using RestartPipelinedRegionStrategy.
Brief change log
- Added a new `restoreLatestCheckpointedState` API for region-based failover in `CheckpointCoordinator`.
- `FailoverRegion` calls the new API in its `restart` method.
- `StateAssignmentOperation` can assign state for the given execution vertices.
This change added tests and can be verified as follows:
- Added tests for `FailoverRegion` to ensure the failover region calls the new `restoreLatestCheckpointedState` API within `CheckpointCoordinator`.
- Extended `CheckpointCoordinatorTest` to ensure `CheckpointCoordinator` can restore with `RestartPipelinedRegionStrategy`.
- Extended `CheckpointStateRestoreTest` to ensure `RestartPipelinedRegionStrategy` handles restoring state from a checkpoint to the task executions.
- Added `RegionFailoverITCase` to verify state can be restored properly when the job consists of multiple regions.
- Modified `StreamFaultToleranceTestBase` so that all subclass ITs can failover with state using `RestartPipelinedRegionStrategy`.

Does this pull request potentially affect one of the following parts:
- `@Public(Evolving)`: no

Documentation