[FLINK-20153] Add documentation for BATCH execution mode #14114
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
Automated Checks
Last check on commit 70975eb (Wed Nov 18 09:16:21 UTC 2020)
Warnings:
Mention the bot in a comment to re-run the automated checks.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:
Thanks for the speedy review! And I really appreciate the suggestions that I can just apply right here in GitHub. I pushed a commit that should address most comments.
Thanks for creating this documentation PR @aljoscha. I had a couple of minor comments.
I addressed more comments and also added the |
I now also added sections by Klou and a section about state backends. Now all the content is theoretically in.
Thanks for the work @aljoscha , I left some comments in the PR. Feel free to integrate whichever you agree with.
+1 to @kl0u's comments, plus a few others of my own.
In the batch world though, we believe that such use-cases do not make much
sense, as the input (both the elements and the control stream) are static and
known in advance.
So what's the recommendation, to load the dataset in open?
I'm afraid so. But it's not that nice. We do want to add proper support for broadcast input in the next release, though.
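For readers following this thread, here is a minimal sketch of the "load the dataset in open" workaround discussed above. It is plain Java with no Flink dependency: the `RichMapper` interface is a hypothetical stand-in for a Flink rich function (such as a `RichMapFunction`), and the hard-coded lookup table stands in for whatever static dataset a real job would read. Treat it as an illustration of the pattern under those assumptions, not as the recommended Flink implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LoadInOpenSketch {

    /** Hypothetical stand-in for a rich function: open() runs once before any map() call. */
    interface RichMapper<I, O> {
        void open();
        O map(I value);
    }

    /** Enriches country codes using a static lookup table that is loaded in open(). */
    static class EnrichingMapper implements RichMapper<String, String> {
        private Map<String, String> lookup; // populated once in open()

        @Override
        public void open() {
            // Assumption: a real job would read a file or table here.
            // The entries are hard-coded purely for illustration.
            lookup = new HashMap<>();
            lookup.put("DE", "Germany");
            lookup.put("FR", "France");
        }

        @Override
        public String map(String code) {
            // The static side input is already in memory, so no control stream is needed.
            return code + "=" + lookup.getOrDefault(code, "unknown");
        }
    }

    /** Drives the mapper the way a runtime would: open() once, then map() per element. */
    static List<String> run(List<String> input) {
        EnrichingMapper mapper = new EnrichingMapper();
        mapper.open();
        return input.stream().map(mapper::map).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("DE", "FR", "US")));
    }
}
```

The trade-off is the one noted in the reply above: every parallel instance loads its own copy of the dataset, which is why this is a stopgap rather than proper broadcast-input support.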
I had one more comment about choosing the `BATCH` vs. the `STREAMING` execution mode.
As a rule of thumb, you should be using `BATCH` execution mode when your program
is bounded because this will be more efficient. You have to use `STREAMING`
execution mode when your program is unbounded because only this mode is general
enough to be able to deal with continuous data streams.
Here it sounds as if it does not really matter whether to choose `BATCH` or `STREAMING` for a bounded job from a correctness perspective. However, the `FileSink` won't commit the in-progress files at the end of the program when using the `STREAMING` execution mode. It might be worthwhile to document this behaviour somewhere.
Yes, this is unfortunate. That said, we cannot do checkpoints once at least one task has finished, which in turn means that we can't get a "final" checkpoint, and this has been a feature/bug of DataStream execution since the beginning. I wouldn't document it here but we can think about adding this to a general "caveats" section. I'm sure there would be other corner cases that are worth documenting 😅
This is possible because inputs are bounded. This pushes the cost more towards
the recovery, but makes the regular processing cheaper, because it avoids
checkpoints.
The last paragraph of "failure recovery" reads as if `BATCH` execution improves the overall execution time of jobs, but here it reads a bit differently: concretely, that `BATCH` recoveries are more costly than `STREAMING` recoveries.
Good catch! @dawidwys what was the original intention here?
My intention was to briefly remind readers of the batch failure recovery model. For that I actually reused the description from: https://ci.apache.org/projects/flink/flink-docs-master/concepts/stateful-stream-processing.html#state-and-fault-tolerance-in-batch-programs
With the description in the "failure recovery" section, we can probably drop the first paragraph and start with the second one:
It is important to remember that because there are no checkpoints, as described above, certain ...
BTW shall we update it in #stateful-stream-processing.html#state-and-fault-tolerance-in-batch-programs? It is not easy to tell which model recovers "faster" as it very much depends on the state size, number of records to replay, number of tasks to recover etc.
Oh boy, this text was first added in 2016: https://github.com/apache/flink/blame/b04d51a129a3341887e7a0866557c9871f58e94c/docs/concepts/concepts.md. I copied it to the current concepts section from there. That's not at all up-to-date anymore.
That section needs an overhaul or should be removed because it's also misleading for `DataSet` programs or Table/SQL batch programs.
For this documentation here I think we can go with @dawidwys's suggestion and just drop that paragraph because I added some text about that above.
+1 from my side for these changes
I believe I addressed all comments. Please take another look. If there's no objection I would merge this by tomorrow because this PR/discussion is growing a bit unwieldy. Anything else we want to add we can still add later.
This adds documentation for the new `BATCH` execution mode. We also explain `STREAMING` execution mode because there is no central page that explains the basic behavior, so far.
4d1d0d1 to f363079
Thanks for the reviews! I merged this now.
Does this pull request potentially affect one of the following parts:
`@Public(Evolving)`: no
Documentation
Documentation only.