Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-13027][streaming]: StreamingFileSink bulk-encoded writer supports customized checkpoint policy #10653

Merged
merged 1 commit into from
Jan 4, 2020

Conversation

yxu-valleytider
Copy link
Contributor

@yxu-valleytider yxu-valleytider commented Dec 21, 2019

What is the purpose of the change

This PR allows bulk-encoded StreamingFileSink to instantiate from a generic family of rolling policies that roll files at checkpoint time. A base CheckpointRollingPolicy class is defined, which is extended by the existing OnCheckpointRollingPolicy and a new rolling policy FSizeCheckpointRollingPolicy. The latter policy rolls file not only at the checkpoint time, but also possibly before file size reaches a certain limit, which is useful for preventing file sizes from growing too big. Recurrent builder pattern described in [1] and [2] are used to instantiate the rolling policies whenever appropriate, making individual rolling policy also extensible.

Brief change log

CheckpointRollingPolicy

  • An abstract class implementing the base rolling policy which rolls file at every checkpoint.

FSizeCheckpointRollingPolicy

  • A new rolling policy implementation which rolls part file both when size exceeds a limit, in addition to during a checkpoint event.

StreamingFileSink

  • Bulk-encoded sink writer (forBulkFormat()) takes a generic CheckpointRollingPolicy during instantiation. OnCheckpointRollingPolicy is still the default, but won't be the only option.

Verifying this change

This change is an interface change and already covered by existing tests, such as BulkWriterTest. A new test case on the new FSizeCheckpointRollingPolicy has also been added to the RollingPolicyTest.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (yes) minor interface change to StreamingFileSink.

Documentation

  • Does this pull request introduce a new feature? (no)

@flinkbot
Copy link
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit b36738f (Sat Dec 21 00:12:28 UTC 2019)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Dec 21, 2019

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@yxu-valleytider
Copy link
Contributor Author

@flinkbot run travis

@yxu-valleytider
Copy link
Contributor Author

PTAL @kl0u @tweise

@kl0u
Copy link
Contributor

kl0u commented Dec 23, 2019

Hi @yxu-valleytider , thanks for the PR.

I am on holidays, so I do not think I will have time to give it a thorough review but from a first look, I think that the FSizeCheckpointRollingPolicy is a specific implementation of a CheckpointRollingPolicy . So, given that we add the latter, I do not think we should add the FSizeCheckpointRollingPolicy to the master. I do not think we need to provide that many pre-implemented rolling policies and the interface is pretty simple.

In addition, given that this PR introduces a new feature, you should also update the documentation section of the StreamingFileSink.

What do you think?

@yxu-valleytider
Copy link
Contributor Author

yxu-valleytider commented Dec 24, 2019

@kl0u thanks for the quick comments. Yes I agree the main intention of the PR is to allow forBulkFormat() to specify a customized set of rolling policies. Hence I removed FSizeCheckpointRollingPolicy and related test which makes the PR shorter.

As for the documentation, there's a significant amount of changes in master. Compared with the stable version, there is no mentioning of wording such as: Bulk-encoding formats can only be combined with the OnCheckpointRollingPolicy. But I can certainly add a few sentences to strengthen that BulkFormatBuilder allows specifying customized policies derived from CheckpointRollingPolicy. Does that sound good ?

@kl0u
Copy link
Contributor

kl0u commented Dec 24, 2019

@yxu-valleytider I think this is an improvement compared to what we have now. As I said, I am on holidays and I will not be able to make a thorough pass on the PR but I hope that someone else will, or I will do it when I am back.

@yxu-valleytider
Copy link
Contributor Author

@kl0u not an issue. Not urgent and please take time.

Copy link
Contributor

@tweise tweise left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for the contribution!

@tweise
Copy link
Contributor

tweise commented Jan 2, 2020

@yxu-valleytider are you going to submit the documentation changes as a follow-up PR?

@yxu-valleytider
Copy link
Contributor Author

@tweise Thanks for the review. I can follow up with a separate PR addressing the documentation issue.

@yxu-valleytider
Copy link
Contributor Author

Created FLINK-15476

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants