[FLINK-24655][iteration] Support the checkpoints for the iteration #17

Closed
wants to merge 8 commits into from


gaoyunhaii
Contributor

This PR implements the checkpoint mechanism for iterations. The goal of the mechanism is to ensure that:

  1. Record processing is consistent with the state, the same as for normal checkpoints without feedback edges.
  2. The notification that the epoch has been incremented is delivered exactly once.

The checkpoints rely on a reference-count mechanism to include the feedback records in snapshots. They also take care of the state of the controller operator, the all-round wrappers, and the per-round wrappers. This version introduces a limitation: none of the operators inside the iteration may change parallelism after restarting from a checkpoint. In the long run, this condition could be relaxed so that:

  1. Unbounded iterations can rescale freely.
  2. Bounded iterations can rescale all-round / per-round operators if they are restored from a savepoint / external checkpoint taken after round 0 has fully finished. However, the controller operators (Head & Tail) would still not support rescaling.
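
To illustrate why feedback records have to be part of a snapshot, here is a minimal sketch of the general idea: records that arrive over the feedback edge after a checkpoint has started, but before that checkpoint's barrier comes back over the same edge, are logged and stored with the checkpoint. The class and method names (`FeedbackCheckpointSketch`, `startCheckpoint`, `onFeedbackRecord`, `onFeedbackBarrier`) are hypothetical and are not taken from this PR. The sketch does not model the reference-count mechanism mentioned above, and it is not the actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of logging in-flight feedback records; not the PR's classes. */
final class FeedbackCheckpointSketch {
    // Feedback records logged so far for each checkpoint that is still pending.
    private final Map<Long, List<String>> pending = new HashMap<>();

    /** The head starts checkpoint {@code id}: begin logging feedback records for it. */
    void startCheckpoint(long id) {
        pending.put(id, new ArrayList<>());
    }

    /** A record arrives over the feedback edge. */
    void onFeedbackRecord(String record) {
        // The record precedes the barrier of every pending checkpoint on the
        // feedback edge, so each of those checkpoints has to contain it.
        for (List<String> log : pending.values()) {
            log.add(record);
        }
    }

    /** The barrier of checkpoint {@code id} returns over the feedback edge. */
    List<String> onFeedbackBarrier(long id) {
        List<String> log = pending.remove(id);
        if (log == null) {
            throw new IllegalStateException("Unknown checkpoint " + id);
        }
        // These records would be written into the snapshot of checkpoint id.
        return log;
    }

    public static void main(String[] args) {
        FeedbackCheckpointSketch head = new FeedbackCheckpointSketch();
        head.onFeedbackRecord("a"); // arrives before the checkpoint, not logged
        head.startCheckpoint(1L);
        head.onFeedbackRecord("b"); // in flight while checkpoint 1 is pending
        head.onFeedbackRecord("c");
        System.out.println(head.onFeedbackBarrier(1L)); // prints [b, c]
    }
}
```

On restore, the logged records would be replayed into the iteration before any new feedback records, which is what keeps record processing consistent with the restored state.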

@gaoyunhaii gaoyunhaii force-pushed the i10_add_checkpoints branch 2 times, most recently from ea8599d to cb50626 Compare October 25, 2021 15:59
@gaoyunhaii gaoyunhaii changed the title [FLINK-10][iteration] Support the checkpoints for the iteration. [FLINK-24655][iteration] Support the checkpoints for the iteration Nov 3, 2021
…s back before terminating.

This is the basis for checkpointing: for checkpoints
with feedback edges, we also need to include the
feedback records in the snapshot, so if we want to make
sure that all checkpoints before the terminated globally
aligned event are completed, we have to wait for one more round.
… epoch watermark

If it has finished and a failover happens, it is skipped and does not
insert the epoch watermark again.
@gaoyunhaii
Contributor Author

Many thanks to @yunfengzhou-hub and @guoweiM for the review! Will merge~

@gaoyunhaii gaoyunhaii closed this in b9ee412 Nov 3, 2021
zhipeng93 pushed a commit to zhipeng93/flink-ml that referenced this pull request Nov 9, 2021
3 participants