[FLINK-18064][docs] Added unaligned checkpointing to docs. #12722

AHeise · 2020-06-19T12:45:03Z

What is the purpose of the change

Adds documentation about unaligned checkpoints.

Brief change log

Added Python API to enable unaligned checkpoints.

The documentation is split into 3 parts to mirror the description of aligned checkpointing:

- It's added on a conceptual level in stateful-stream-processing.md with new/revised pics. It's written in a way that it could survive 1.12 without change.
- A small change to dev/stream/state/checkpointing.md shows how it is enabled programmatically in Java/Scala/Python. It might need to be extended for 1.12 when new options become available (depending on whether they can be changed programmatically or not).
- A larger discussion in ops/state/checkpoints.md includes the current limitations and a small glimpse into the next steps (which will be covered in much more detail in a blog post). This part needs to be largely rewritten for 1.12+ to reflect the new options.

Verifying this change

For Python API, added case in test_check_point_config.py.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

flinkbot · 2020-06-19T12:47:13Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit d162c92 (Fri Jun 19 12:47:13 UTC 2020)

Warnings:

Documentation files were touched, but no .zh.md files: Update Chinese documentation or file Jira ticket.

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

AHeise · 2020-06-19T12:52:16Z

For ease of reviewing, I made a screenshot of the page with changed graphics.

flinkbot · 2020-06-19T13:08:30Z

CI report:

91dfe90 UNKNOWN
f818134 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

morsapaes

Left some minor comments. Thanks for sharing, @AHeise , I'll "align" the release blogpost with these docs.

docs/concepts/stateful-stream-processing.md

docs/ops/state/checkpoints.md

rkhachatryan

Nice graphics :)

I left some comments. Most of them are nits, except for mentioning feature status
concepts/stateful-stream-processing

docs/concepts/stateful-stream-processing.md

rkhachatryan · 2020-06-22T08:27:04Z

docs/concepts/stateful-stream-processing.md

 See [Restart Strategies]({% link dev/task_failure_recovery.md
 %}#restart-strategies) for more information.

+### Unaligned Checkpointing


Should we mention that this is an experimental feature?
I think it should be a separate statement at the end of the section.

My understanding was that this document is rather an expanded glossary and talks about the concept and not the implementation. Thus, I'd leave the implementation state out of this place. The ops link will directly say that it's experimental in 1.11.

The page already goes quite deep into the details so I don't see why it shouldn't be mentioned here. Some users could benefit by ruling out the feature earlier if they are considering Flink or its configuration.

Since we decided at a different point to drop the experimental label, I'd leave this section as is.

rkhachatryan

I'm adding some more comments. Sorry for the intermittent review.

Looking at the configuration reference (ops/config.md) I couldn't find any relation between execution.checkpointing.unaligned and execution.checkpointing.max-concurrent-checkpoints. I think we should mention explicitly that when the former is enabled, the latter must be 1.
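
The constraint raised in this comment can be sketched as a small pre-flight check. This is only an illustration in plain Python, not part of Flink: the helper name `check_unaligned_constraints` is hypothetical, the two option keys are the ones quoted above, and treating an unset `max-concurrent-checkpoints` as 1 is an assumption of the sketch.

```python
from typing import Dict, List

UNALIGNED_KEY = "execution.checkpointing.unaligned"
MAX_CONCURRENT_KEY = "execution.checkpointing.max-concurrent-checkpoints"


def check_unaligned_constraints(conf: Dict[str, str]) -> List[str]:
    """Hypothetical helper: return a list of problems, empty if the options agree."""
    problems = []
    # Option values are read as strings, the way they appear in a config file.
    unaligned = str(conf.get(UNALIGNED_KEY, "false")).strip().lower() == "true"
    # Assumption of this sketch: an unset value counts as 1.
    max_concurrent = int(conf.get(MAX_CONCURRENT_KEY, 1))
    if unaligned and max_concurrent != 1:
        problems.append(
            f"{MAX_CONCURRENT_KEY} must be 1 when {UNALIGNED_KEY} is enabled "
            f"(found {max_concurrent})"
        )
    return problems
```

A configuration with unaligned checkpoints enabled and `max-concurrent-checkpoints` set to 2 yields one problem, while the same value with unaligned checkpoints disabled yields none.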

rkhachatryan · 2020-06-22T08:50:49Z

docs/ops/state/checkpoints.md

+- You cannot rescale from unaligned checkpoints. You have to take a savepoint 
+before rescaling. Savepoints are always aligned independent of the alignment


Suggested change

-- You cannot rescale from unaligned checkpoints. You have to take a savepoint
-before rescaling. Savepoints are always aligned independent of the alignment
+- You cannot rescale or change job graph with unaligned checkpoints. You have to take a savepoint
+before rescaling. Savepoints are always aligned independent of the alignment

Can you change the job graph with current checkpoints? I was always assuming that you need savepoints.

Hmm... the current Flink docs say that retained checkpoints:

do not support Flink specific features like rescaling.

...and nothing about the job graph.

Besides that, UC doesn't currently support Local recovery.

Edit:
Local recovery limitation should probably be described in
Tuning Checkpoints section.

Yes, technically, but it's incidental. The community hasn't made any backward-compatibility guarantees around that behavior.

I see, thanks.
I think we can leave it as is then.

docs/ops/state/checkpoints.md

AHeise

Thank you for the feedback @morsapaes and @rkhachatryan . I followed most of your suggestions, but I have some unresolved issues coming from @rkhachatryan .

docs/concepts/stateful-stream-processing.md

AHeise · 2020-06-22T11:49:21Z

docs/concepts/stateful-stream-processing.md

 See [Restart Strategies]({% link dev/task_failure_recovery.md
 %}#restart-strategies) for more information.

+### Unaligned Checkpointing


My understanding was that this document is rather an expanded glossary and talks about the concept and not the implementation. Thus, I'd leave the implementation state out of this place. The ops link will directly say that it's experimental in 1.11.

docs/concepts/stateful-stream-processing.md

AHeise · 2020-06-22T11:53:25Z

docs/ops/state/checkpoints.md

+- You cannot rescale from unaligned checkpoints. You have to take a savepoint 
+before rescaling. Savepoints are always aligned independent of the alignment


Can you change the job graph with current checkpoints? I was always assuming that you need savepoints.

pnowojski

Thanks for writing this down @AHeise, mostly LGTM :)

pnowojski · 2020-06-22T16:04:03Z

...treaming-java/src/main/java/org/apache/flink/streaming/api/environment/CheckpointConfig.java


 	/**
-	 * Enables unaligned checkpoints, which greatly reduce checkpointing times under backpressure.
+	 * Enables unaligned checkpoints, which greatly reduce checkpointing times under backpressure (experimental).


As it's stable on our builds, maybe we could label it more production ready?

Then just leave out experimental and just link to limitations?

pnowojski · 2020-06-22T16:05:15Z

docs/concepts/stateful-stream-processing.md

 operations can asynchronously snapshot their state.

+Since Flink 1.11, checkpoints can be taken with or without alignment. In the 
+following, we describe aligned checkpoints first.


following section?

pnowojski · 2020-06-22T16:07:06Z

docs/concepts/stateful-stream-processing.md

+Unaligned checkpointing ensures that barriers are arriving at the sink as fast 
+as possible. It's especially suited for applications with at least one slow 
+moving data path, where alignment times can reach hours. However, since it's
+adding additional I/O pressure to state backends, it doesn't help when the I/O


I/O pressure to state backends -> I/O pressure, as it's not using state backends per se.
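
The trade-off described in this hunk (faster barriers at the cost of extra I/O) can be illustrated with a toy model. It is a simplified sketch, not Flink's implementation: channels are plain lists, every channel is assumed to already contain exactly one barrier, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

BARRIER = "BARRIER"


def aligned_snapshot(channels: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Aligned: records queued ahead of each barrier are processed before the snapshot."""
    processed_first: List[str] = []
    for queue in channels.values():
        processed_first.extend(queue[: queue.index(BARRIER)])
    # Nothing is in flight at snapshot time, so no channel data is persisted.
    return processed_first, []


def unaligned_snapshot(channels: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Unaligned: the barrier overtakes queued records, which are persisted instead."""
    in_flight: List[str] = []
    for queue in channels.values():
        in_flight.extend(queue[: queue.index(BARRIER)])
    # No record has to be processed first, but all overtaken records are written out.
    return [], in_flight


channels = {
    "fast": ["a1", BARRIER, "a2"],
    "slow": ["b1", "b2", "b3", BARRIER],
}
aligned_result = aligned_snapshot(channels)
unaligned_result = unaligned_snapshot(channels)
```

In this model the slow channel decides how long the aligned variant waits, whereas the unaligned variant moves the same four records into the checkpoint, which is the additional I/O pressure the text refers to.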

pnowojski · 2020-06-22T16:09:05Z

docs/ops/state/checkpoints.md

+- Flink currently does not support concurrent unaligned checkpoints. However, 
+due to the more predictable and shorter checkpointing times, concurrent 
+checkpoints might not be needed at all.


We should mention here that aligned savepoints also cannot happen concurrently with unaligned checkpoints.

pnowojski · 2020-06-22T16:10:55Z

docs/ops/state/checkpoints.md

+Currently, Flink generates the watermark as a first step of recovery instead of 
+storing the latest watermark in the operators to ease rescaling. In unaligned 
+checkpoints, that means on recovery, **Flink generates watermarks after it 
+restores in-flight data**. If your pipeline uses an **operator that applies the
+latest watermark on each record**, it will produce **incorrect results** during 
+recovery if the watermark is not directly or indirectly part of the operator 
+state. Thus, **SQL OVER operator should not be used with unaligned
+checkpoints**, while window operators are safe to use. The workaround is to
+store the watermark in the operator state. If rescaling may occur, watermarks
+should be stored per key-group in a union-state. We mostly likely will
+implement this approach as a general solution (didn't make it into Flink 
+1.11.0).


I think this paragraph is a bit too strong. As far as I understand, it's not that the UC will produce incorrect result, just that some records during the reprocessing might not be accounted as late data, right?

I can tone it down, but basically we are breaking with the old assumption that watermarks don't need to be stored at the operator because they are sent first.
I'm especially referring to the OverITCases, which use a weird way to inject watermarks and logically should persist them. But now that I'm thinking about it, it's more a matter of the test setup itself.
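
The effect discussed in this thread can be shown with a toy operator that labels records relative to the watermark it knows at that moment. This is a simplified sketch under stated assumptions: the `classify` function, the event tuples, and the rule "late means timestamp below the watermark" are all hypothetical and do not reflect Flink's operator code.

```python
from typing import List, Optional, Tuple


def classify(events: List[Tuple[str, int]], initial_watermark: Optional[int]) -> List[str]:
    """Label each record as late or on-time against the currently known watermark."""
    watermark = initial_watermark
    labels: List[str] = []
    for kind, ts in events:
        if kind == "watermark":
            watermark = ts if watermark is None else max(watermark, ts)
        else:
            # Toy rule: a record is late if its timestamp is below the watermark.
            late = watermark is not None and ts < watermark
            labels.append("late" if late else "on-time")
    return labels


# Before the failure the operator had already seen watermark 10.
before_failure = classify([("record", 5), ("record", 12)], initial_watermark=10)

# Recovery: in-flight records are restored first, the watermark is regenerated afterwards.
after_recovery = classify(
    [("record", 5), ("record", 12), ("watermark", 10)], initial_watermark=None
)

# Workaround from the text: the watermark is part of the restored operator state.
with_stored_watermark = classify([("record", 5), ("record", 12)], initial_watermark=10)
```

In this model the record with timestamp 5 is no longer counted as late after recovery, which matches the milder reading above; restoring the watermark from operator state brings back the original labels.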

pnowojski · 2020-06-22T16:12:17Z

docs/ops/state/checkpoints.md

+We flagged unaligned checkpoints as experimental as it currently has the
+following limitations:


I would also mention that flatMap operators can lead to unbounded spilled data.


zhijiangW · 2020-06-29T08:43:50Z

Since there are no pending requests and @pnowojski and @rkhachatryan have already approved, I will merge it now.

rmetzger added the review=description? label Jun 19, 2020

rmetzger added component=Documentation component=Runtime/Checkpointing labels Jun 19, 2020

morsapaes reviewed Jun 22, 2020

rkhachatryan requested changes Jun 22, 2020

AHeise commented Jun 22, 2020

pnowojski reviewed Jun 22, 2020

rkhachatryan approved these changes Jun 24, 2020

Arvid Heise added 5 commits June 24, 2020 20:49

[FLINK-18064][python] Adding unaligned checkpoint config options.

1b9e224

[hotfix][conf] Fix javadoc of CheckpointConfig#isUnalignedCheckpointsEnabled.

351a5f9

[hotfix][docs] Fix broken link in metrics.md.

f0799d0

[hotfix][docs] Replace/fix links in checkpointing documents.

f818134

pnowojski mentioned this pull request Jun 26, 2020

Add 1.11 Release announcement. apache/flink-web#352

Closed

AHeise mentioned this pull request Jun 29, 2020

[FLINK-18064][docs] Added unaligned checkpointing to docs. (1.11) #12787

Merged

zhijiangW merged commit 58c2047 into apache:master Jun 29, 2020
