Skip to content

Conversation

aljoscha
Copy link
Contributor

This adds documentation for the new BATCH execution mode. We also explain STREAMING execution mode because there is no central page that explains the basic behavior, so far.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

Documentation only.

@flinkbot
Copy link
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 70975eb (Wed Nov 18 09:16:21 UTC 2020)

Warnings:

  • Documentation files were touched, but no .zh.md files: Update Chinese documentation or file Jira ticket.
  • This pull request references an unassigned Jira ticket. According to the code contribution guide, tickets need to be assigned before starting with the implementation work.

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Nov 18, 2020

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@aljoscha
Copy link
Contributor Author

Thanks for the speedy review! And I really appreciate the suggestions that I can just apply right here in Github.

I pushed a commit that should address most comments.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating this documentation PR @aljoscha. I had a couple of minor comments.

@aljoscha
Copy link
Contributor Author

I addressed more comments and also added the important considerations section by Dawid. Could you please take another look?

@aljoscha
Copy link
Contributor Author

I now also added sections by Klou and a section about state backends. Now all the content is theoretically in.

Copy link
Contributor

@kl0u kl0u left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work @aljoscha , I left some comments in the PR. Feel free to integrate whichever you agree with.

Copy link
Contributor

@sjwiesman sjwiesman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @kl0u 's comments and few others

Comment on lines +357 to +364
In the batch world though, we believe that such use-cases do not make much
sense, as the input (both the elements and the control stream) are static and
known in advance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what's the recommendation, to load the dataset in open?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid so. But it's not that nice. We do want to add proper support for broadcast input in the next release, though.

Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one more comment about choosing the BATCH vs. the STREAMING execution mode.

Comment on lines +62 to +65
As a rule of thumb, you should be using `BATCH` execution mode when your program
is bounded because this will be more efficient. You have to use `STREAMING`
execution mode when your program is unbounded because only this mode is general
enough to be able to deal with continuous data streams.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it sounds as if it does not really matter whether to choose BATCH or STREAMING for a bounded job from a correctness perspective. However, the FileSink won't commit the in-progress files at the end of the program when using the STREAMING execution mode. It might be worthwhile to document this behaviour somewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is unfortunate. Though the fact that we cannot do checkpoints as soon as at least one task has finished, which in turn means that we can't get a "final" checkpoint has been a feature/bug of DataStream execution since the beginning. I wouldn't document it here but we can think about adding this to a general "caveats" section. I'm sure there would be other corner cases that are worth documenting 😅

Comment on lines 326 to 328
This is possible because inputs are bounded. This pushes the cost more towards
the recovery, but makes the regular processing cheaper, because it avoids
checkpoints.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last paragraph of failure recovery reads as if BATCH execution improves the overall execution time of jobs but here it reads a bit differently. Concretely that BATCH recoveries are more costly than STREAMING recoveries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! @dawidwys what was the original intention here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was to briefly remind the batch failure recovery model. For that I actually reused the description from: https://ci.apache.org/projects/flink/flink-docs-master/concepts/stateful-stream-processing.html#state-and-fault-tolerance-in-batch-programs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the description in the failure recovery section, we can probably drop the first paragraph and start with the second one:

It is important to remember that because there are no checkpoints, as described above, certain ...

BTW shall we update it in #stateful-stream-processing.html#state-and-fault-tolerance-in-batch-programs? It is not easy to tell which model recovers "faster" as it very much depends on the state size, number of records to replay, number of tasks to recover etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh boy, this text was first added in 2016: https://github.com/apache/flink/blame/b04d51a129a3341887e7a0866557c9871f58e94c/docs/concepts/concepts.md. I copied it to the current concepts section from there. That's not at all up-to-date anymore.

That section needs an overhaul or should be removed because it's also misleading for DataSet programs or Table/SQL batch programs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this documentation here I think we can go with @dawidwys's suggestion and just drop that paragraph because I added some text about that above.

Copy link
Contributor

@dawidwys dawidwys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from my side for these changes

@aljoscha
Copy link
Contributor Author

I believe I addressed all comments. Please take another look. If there's no objection I would merge this by tomorrow because this PR/discussion is growing a bit unwieldy. Anything else we want to add we can still add later.

@aljoscha aljoscha force-pushed the flink-20153-batch-documentation branch from 4d1d0d1 to f363079 Compare November 24, 2020 16:11
@aljoscha
Copy link
Contributor Author

Thanks for the reviews! I merged this now.

@aljoscha aljoscha closed this Nov 25, 2020
@aljoscha aljoscha deleted the flink-20153-batch-documentation branch November 26, 2020 12:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants