[FLINK-24807][iteration] Fix the issues in checkpoints with iteration #25

Closed


gaoyunhaii
Contributor

We ran into several more issues while supporting checkpoints with iteration. This PR makes the following changes:

  1. Support snapshotting the Replayer operator and the per-round wrapper.
  2. Fix the issue that, after the head tasks have finished, the coordinator continues emitting CoordinatorCheckpointEvent, which makes the checkpoint fail because the events cannot be sent.
  3. Support raw operator state inside the iteration.

Regarding the second point: currently the HeadCoordinator emits CoordinatorCheckpointEvent to the tasks so that the GloballyAlignedEvent does not interleave with the checkpoint barrier. However, if the tasks have finished and we continue emitting the event, the checkpoint fails because of the failed operator events. To address this issue, we stop emitting CoordinatorCheckpointEvent once the head operator is terminating, namely once it has received the GloballyAlignedEvent that marks termination.
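The guard described above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `HeadCoordinatorSketch`, its methods, and the string-based events are hypothetical stand-ins, and it assumes the coordinator can mark itself as terminating at the moment it emits the terminating GloballyAlignedEvent.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the head coordinator; not the actual class in this PR. */
class HeadCoordinatorSketch {
    // Events "sent" to the head tasks, recorded as strings for illustration only.
    private final List<String> sentEvents = new ArrayList<>();

    // Set once the GloballyAlignedEvent marking termination has been emitted (assumption).
    private boolean terminating = false;

    /** Called when all head tasks are aligned for the given round. */
    void onGloballyAligned(int round, boolean terminated) {
        sentEvents.add("GloballyAlignedEvent(round=" + round + ", terminated=" + terminated + ")");
        if (terminated) {
            terminating = true;
        }
    }

    /**
     * Called when a checkpoint is triggered on the coordinator.
     * Returns true if a CoordinatorCheckpointEvent was emitted.
     */
    boolean onCheckpoint(long checkpointId) {
        if (terminating) {
            // The head tasks may already have finished; sending an event to them
            // would fail and take the checkpoint down with it, so skip it.
            return false;
        }
        sentEvents.add("CoordinatorCheckpointEvent(id=" + checkpointId + ")");
        return true;
    }

    List<String> getSentEvents() {
        return sentEvents;
    }

    public static void main(String[] args) {
        HeadCoordinatorSketch coordinator = new HeadCoordinatorSketch();
        coordinator.onCheckpoint(1L);              // emitted
        coordinator.onGloballyAligned(0, false);   // iteration continues
        coordinator.onCheckpoint(2L);              // emitted
        coordinator.onGloballyAligned(1, true);    // iteration is terminating
        coordinator.onCheckpoint(3L);              // skipped
        coordinator.getSentEvents().forEach(System.out::println);
    }
}
```

In this sketch, checkpoints triggered before the terminating GloballyAlignedEvent still produce a CoordinatorCheckpointEvent, while any checkpoint triggered afterwards produces none, so no event is addressed to a task that may already have finished.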

The third point is required because operators like withBroadcast might rely on raw operator state to snapshot the cached records. This is necessary because normal operator state always resides in memory, which might not be large enough.


@guoweiM guoweiM left a comment


Thanks @gaoyunhaii for opening this PR. I have some minor comments below.
BTW, I think we cannot support Batch execution mode, so it would be nice to tell users when they set the execution mode to Batch. Sorry that I found this so late, but I think you could open a separate PR for it.
Thanks

…rminating

Currently the HeadCoordinator emits CoordinatorCheckpointEvent to
the tasks so that the GloballyAlignedEvent does not interleave with
the checkpoint barrier. However, if the tasks have finished and we
continue emitting the event, the checkpoint fails because of the
failed operator events. To address this issue, we stop emitting
CoordinatorCheckpointEvent once the head operator is terminating,
namely once it has received the GloballyAlignedEvent that marks
termination.