[FLINK-24807][iteration] Fix the issues in checkpoints with iteration #25

gaoyunhaii · 2021-11-06T05:12:43Z

We have run into a few more issues in supporting checkpoints with iteration. This PR addresses them:

Support snapshotting the ReplayOperator and the per-round wrapper.
Fix the issue that, after the head tasks have finished, the coordinator keeps emitting CoordinatorCheckpointEvent, which makes the checkpoint fail because the events cannot be sent.
Support raw operator state inside the iteration.

Regarding the second point: currently the HeadCoordinator emits a CoordinatorCheckpointEvent to the tasks so that the GloballyAlignedEvent is not interleaved with the checkpoint barrier. However, if the tasks have finished and we continue emitting the event, the checkpoint fails because sending the operator events fails. To address this, we stop emitting CoordinatorCheckpointEvent once the head operator is terminating, namely once it has received the GloballyAlignedEvent that marks termination.
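
To illustrate the intended behavior, here is a minimal sketch in plain Java. It is not the actual Flink ML code: `HeadCoordinatorSketch`, its method names, and the string-based event log are all hypothetical stand-ins, assumed only to show the guard that suppresses the checkpoint event once termination has been signaled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the coordinator described above; not the real HeadCoordinator. */
class HeadCoordinatorSketch {
    /** Events "sent" to the head tasks, recorded as strings for inspection. */
    final List<String> sentEvents = new ArrayList<>();

    /** Set once the terminating GloballyAlignedEvent has been emitted. */
    private boolean terminating = false;

    /** Called when a round is globally aligned; isTerminated marks the final round. */
    void onGloballyAligned(int round, boolean isTerminated) {
        sentEvents.add("GloballyAlignedEvent(round=" + round + ", terminated=" + isTerminated + ")");
        if (isTerminated) {
            terminating = true;
        }
    }

    /**
     * Called when a checkpoint is triggered on the coordinator.
     * Returns true if a CoordinatorCheckpointEvent was emitted.
     */
    boolean onCheckpoint(long checkpointId) {
        if (terminating) {
            // The head tasks may already have finished, so sending would fail
            // and take the checkpoint down with it. Skip the event instead.
            return false;
        }
        sentEvents.add("CoordinatorCheckpointEvent(" + checkpointId + ")");
        return true;
    }
}
```

In this sketch, checkpoints triggered before the terminating event still emit a CoordinatorCheckpointEvent, while any checkpoint triggered afterwards is a no-op on the event channel.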

The third point is required because operators like withBroadcast might rely on raw operator state to snapshot the cached records. This is necessary because normal operator state always resides in memory, which might not be enough.
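
The idea behind raw state can be sketched roughly as follows: cached records are written one at a time to a checkpoint stream, and read back from it on restore, rather than being collected into an in-memory state object first. This is plain Java for illustration only. `RawStateSketch`, its methods, and the count-prefixed layout are assumptions, and the `OutputStream` merely stands in for the raw operator state stream that Flink provides.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration of streaming cached records into a checkpoint stream. */
class RawStateSketch {

    /** Writes a count prefix, then each record, directly to the given stream. */
    static void snapshot(Iterator<Integer> cachedRecords, int count, OutputStream checkpointStream) {
        try {
            DataOutputStream out = new DataOutputStream(checkpointStream);
            out.writeInt(count);
            while (cachedRecords.hasNext()) {
                // Only one record is handled at a time while writing.
                out.writeInt(cachedRecords.next());
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the records back in the order they were written. */
    static List<Integer> restore(byte[] rawState) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(rawState))) {
            int count = in.readInt();
            List<Integer> records = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                records.add(in.readInt());
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For brevity, `restore` here returns a list; a real operator with a large cache would more likely replay the records from the stream into its own on-disk cache.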

guoweiM

Thanks @gaoyunhaii for opening this PR. I have some minor comments below.
BTW, I think we cannot support Batch execution mode, so it would be nice to tell users when they set the execution mode to Batch. I am sorry that I noticed this so late, but I think you could open a separate PR for it.
Thanks

flink-ml-iteration/src/main/java/org/apache/flink/iteration/operator/ReplayOperator.java

flink-ml-iteration/src/test/java/org/apache/flink/iteration/operator/ReplayOperatorTest.java

.../main/java/org/apache/flink/iteration/operator/perround/AbstractPerRoundWrapperOperator.java

...-ml-tests/src/test/java/org/apache/flink/test/iteration/BoundedPerRoundCheckpointITCase.java

.../src/test/java/org/apache/flink/test/iteration/operators/TwoInputReducePerRoundOperator.java

gaoyunhaii force-pushed the i10_add_checkpoints_per_round branch from f14e74c to 29ce569 Compare November 10, 2021 07:40

guoweiM reviewed Nov 10, 2021

gaoyunhaii added 7 commits November 10, 2021 16:43

[FLINK-24807][iteration] Support snapshot the ReplayOperator

a2294e6

[FLINK-24807][iteration] Stores the state for per-round wrapper

7a05cf8

[hotfix][iteration] Rename the all-round checkpoint test to be it case

68c5945

[FLINK-24807][iteration] Add per-round checkpoint IT case

0b033df

[FLINK-24807][iteration] Support raw operator state

8d3f2ad

[FLINK-24807][iteration] Not start logging at the head operator if th…

00ec475

…e barrier feed back first

gaoyunhaii force-pushed the i10_add_checkpoints_per_round branch from 29ce569 to 00ec475 Compare November 10, 2021 14:32

guoweiM approved these changes Nov 10, 2021

gaoyunhaii closed this in acbf4b9 Nov 10, 2021
