[FLINK-18064][docs] Added unaligned checkpointing to docs. #12722
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
Automated Checks
Last check on commit d162c92 (Fri Jun 19 12:47:13 UTC 2020)
Warnings:
Mention the bot in a comment to re-run the automated checks.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process.
Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:
For ease of reviewing, I made a screenshot of the page with changed graphics.
morsapaes
left a comment
Left some minor comments. Thanks for sharing, @AHeise , I'll "align" the release blogpost with these docs.
rkhachatryan
left a comment
Nice graphics :)
I left some comments. Most of them are nits, except for mentioning feature status
concepts/stateful-stream-processing
| See [Restart Strategies]({% link dev/task_failure_recovery.md | ||
| %}#restart-strategies) for more information. | ||
|
|
||
| ### Unaligned Checkpointing |
Should we mention that this is an experimental feature?
I think it should be a separate statement at the end of the section.
My understanding was that this document is rather an expanded glossary and talks about the concept and not the implementation. Thus, I'd leave the implementation state out of this place. The ops link will directly say that it's experimental in 1.11.
The page already goes quite deep into the details so I don't see why it shouldn't be mentioned here. Some users could benefit by ruling out the feature earlier if they are considering Flink or its configuration.
Since we decided at a different point to drop the experimental label, I'd leave this section as is.
rkhachatryan
left a comment
I'm adding some more comments. Sorry for the intermittent review.
Looking at the configuration reference (ops/config.md) I couldn't find any relation between execution.checkpointing.unaligned and execution.checkpointing.max-concurrent-checkpoints. I think we should mention explicitly that when the former is enabled, the latter must be 1.
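To make the requested constraint concrete, here is a minimal sketch of the check the docs would be describing. It is plain Python and not Flink code: `check_unaligned_config` is a hypothetical helper, and the fallback values used when an option is missing are assumptions of this sketch. Only the two option keys come from the comment above.

```python
# Hypothetical helper: illustrates the constraint discussed above; not Flink code.
UNALIGNED = "execution.checkpointing.unaligned"
MAX_CONCURRENT = "execution.checkpointing.max-concurrent-checkpoints"


def check_unaligned_config(conf):
    """Return a list of problems; an empty list means the options are consistent."""
    # Assumed fallbacks for this sketch: unaligned off, one concurrent checkpoint.
    unaligned = conf.get(UNALIGNED, False)
    max_concurrent = conf.get(MAX_CONCURRENT, 1)
    problems = []
    if unaligned and max_concurrent != 1:
        problems.append(
            f"{UNALIGNED}=true requires {MAX_CONCURRENT}=1, got {max_concurrent}"
        )
    return problems
```

A configuration that enables unaligned checkpoints together with two concurrent checkpoints would be reported as a problem, while either option on its own passes.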
docs/ops/state/checkpoints.md
Outdated
| - You cannot rescale from unaligned checkpoints. You have to take a savepoint | ||
| before rescaling. Savepoints are always aligned independent of the alignment |
| - You cannot rescale from unaligned checkpoints. You have to take a savepoint | |
| before rescaling. Savepoints are always aligned independent of the alignment | |
| - You cannot rescale or change job graph with unaligned checkpoints. You have to take a savepoint | |
| before rescaling. Savepoints are always aligned independent of the alignment |
Can you change the job graph with current checkpoints? I was always assuming that you need savepoints.
Hmm... the current Flink docs say that retained checkpoints
"do not support Flink specific features like rescaling"
...and nothing about the job graph.
Besides that, UC doesn't currently support local recovery.
Edit: The local recovery limitation should probably be described in the Tuning Checkpoints section.
Yes, technically, but it's incidental. The community hasn't made any backward-compatibility guarantees around that behavior.
I see, thanks.
I think we can leave it as is then.
AHeise
left a comment
Thank you for the feedback @morsapaes and @rkhachatryan . I followed most of your suggestions, but I have some unresolved issues coming from @rkhachatryan .
pnowojski
left a comment
Thanks for writing this down @AHeise, mostly LGTM :)
|
|
||
| /** | ||
| * Enables unaligned checkpoints, which greatly reduce checkpointing times under backpressure. | ||
| * Enables unaligned checkpoints, which greatly reduce checkpointing times under backpressure (experimental). |
As it's stable on our builds, maybe we could label it more production ready?
Then just leave out experimental and just link to limitations?
| operations can asynchronously snapshot their state. | ||
|
|
||
| Since Flink 1.11, checkpoints can be taken with or without alignment. In the | ||
| following, we describe aligned checkpoints first. |
following section?
| Unaligned checkpointing ensures that barriers are arriving at the sink as fast | ||
| as possible. It's especially suited for applications with at least one slow | ||
| moving data path, where alignment times can reach hours. However, since it's | ||
| adding additional I/O pressure to state backends, it doesn't help when the I/O |
I/O pressure to state backends -> I/O pressure, as it's not using state backends per se.
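As a rough illustration of why a slow-moving data path matters in the quoted paragraph, the following toy model computes how long an aligned checkpoint is blocked at a single operator. It is a simplified sketch, not Flink's implementation: `alignment_time` and the arrival times are hypothetical, and the model ignores the extra I/O that unaligned checkpoints incur by persisting in-flight data.

```python
def alignment_time(barrier_arrivals):
    """Time an aligned checkpoint spends blocked at one operator.

    barrier_arrivals maps an input channel name to the time (in seconds)
    at which the checkpoint barrier arrives on that channel.
    """
    arrivals = list(barrier_arrivals.values())
    # Alignment starts with the first barrier and ends with the last one.
    return max(arrivals) - min(arrivals)


# Hypothetical numbers: one fast channel and one slow-moving data path.
arrivals = {"fast": 2.0, "slow": 3600.0}
aligned_wait = alignment_time(arrivals)
# In this model an unaligned checkpoint reacts to the first barrier right away.
unaligned_wait = 0.0
```

In this model the alignment time is driven entirely by the slowest channel, which is the case the paragraph singles out.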
docs/ops/state/checkpoints.md
Outdated
| - Flink currently does not support concurrent unaligned checkpoints. However, | ||
| due to the more predictable and shorter checkpointing times, concurrent | ||
| checkpoints might not be needed at all. |
We should mention here that aligned savepoints also cannot happen concurrently with an unaligned checkpoint.
docs/ops/state/checkpoints.md
Outdated
| Currently, Flink generates the watermark as a first step of recovery instead of | ||
| storing the latest watermark in the operators to ease rescaling. In unaligned | ||
| checkpoints, that means on recovery, **Flink generates watermarks after it | ||
| restores in-flight data**. If your pipeline uses an **operator that applies the | ||
| latest watermark on each record**, it will produce **incorrect results** during | ||
| recovery if the watermark is not directly or indirectly part of the operator | ||
| state. Thus, **SQL OVER operator should not be used with unaligned | ||
| checkpoints**, while window operators are safe to use. The workaround is to | ||
| store the watermark in the operator state. If rescaling may occur, watermarks | ||
| should be stored per key-group in a union-state. We mostly likely will | ||
| implement this approach as a general solution (didn't make it into Flink | ||
| 1.11.0). |
I think this paragraph is a bit too strong. As far as I understand, it's not that UC will produce incorrect results, just that some records during the reprocessing might not be counted as late data, right?
I can tone it down, but basically we are breaking with the old assumption that watermarks don't need to be stored at the operator because they are sent first.
I'm especially referring to the OverITCases, which use a weird way to inject watermarks and logically should persist them. But now that I'm thinking about it, it's more a matter of the test setup itself.
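The effect under discussion can be reproduced with a toy model. The sketch below is plain Python, not Flink code: `LatenessOperator`, `replay`, and the timestamps are hypothetical, and the only behavior taken from the thread is the recovery order (in-flight data first, watermark afterwards) and the workaround of keeping the watermark in operator state.

```python
import math


class LatenessOperator:
    """Toy operator that compares every record against the latest watermark."""

    def __init__(self, watermark=-math.inf):
        self.watermark = watermark

    def on_watermark(self, timestamp):
        self.watermark = max(self.watermark, timestamp)

    def on_record(self, timestamp):
        # A record is late when its timestamp is not ahead of the watermark.
        return "late" if timestamp <= self.watermark else "on-time"


def replay(operator, in_flight, regenerated_watermark):
    """Recovery order from the thread: in-flight records first, then the watermark."""
    results = [operator.on_record(ts) for ts in in_flight]
    operator.on_watermark(regenerated_watermark)
    return results


in_flight = [5, 12]  # hypothetical record timestamps captured in the checkpoint

# Watermark 10 was seen before the failure but was not part of operator state.
without_state = replay(LatenessOperator(), in_flight, regenerated_watermark=10)
# Workaround: the watermark is restored together with the operator state.
with_state = replay(LatenessOperator(watermark=10), in_flight, regenerated_watermark=10)
```

In this model the record with timestamp 5 is treated as on-time when the watermark is not restored and as late when it is, which matches the reviewer's reading that the difference shows up in late-data handling during reprocessing.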
docs/ops/state/checkpoints.md
Outdated
| We flagged unaligned checkpoints as experimental as it currently has the | ||
| following limitations: |
I would also mention that flatMap operators can lead to unbounded spilled data.
It's split into 3 parts to simulate the description of aligned checkpointing:
- It's added on conceptual level in stateful-stream-processing.md with new/revised pics. It's written in a way that it could survive 1.12 without change.
- A small change to dev/stream/state/checkpointing.md to show how it is enabled programmatically in Java/Scala/Python. Might need to be extended for 1.12 when new options become available (depending whether they can be programmatically changed or not).
- A larger discussion in ops/state/checkpoints.md which includes the current limitations and a small glimpse into the next steps (will be in much more detail in blog post). This part needs to be largely rewritten for 1.12+ to reflect the new options.
Since there are no pending requests and @pnowojski and @rkhachatryan have already approved, I will merge it now.
What is the purpose of the change
Adds documentation about unaligned checkpoints.
Brief change log
Doc is split into 3 parts to simulate the description of aligned checkpointing:
Verifying this change
For Python API, added case in test_check_point_config.py.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation