[FLINK-3637] Refactor rolling sink writer #1826
Conversation
Travis appears to have failed on one of the five targets because of a flaky YARN test case. Do I need to re-submit the pull request for Travis to rebuild so the test shows as passing?
@dalegaard no, you don't have to re-submit your PR. We are aware of the instability of some tests ;)
Okay, great :) Thanks!
Rebased onto the newest master and force pushed.
Hi @dalegaard, this must have slipped my mind. I'll review it tomorrow and merge if possible.
The changes look good. One thing I would like to have changed is to rename
@aljoscha okay, I'll fix that and rebase onto master. Naming things is one of the only two hard problems in computer science, after all :)
The Writer interface now deals directly with the filesystem and path rather than the raw output stream. Since RollingSink no longer has access to the raw output stream, it cannot directly determine the current size of the file. A getPos() method has been added to the Writer interface so that RollingSink can retrieve the current file size. Finally, flush() has been extended to return the offset to which the file must be truncated on recovery.
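Based on the description above, the revised contract might look roughly like the sketch below. The names, signatures, and the in-memory implementation are illustrative assumptions for this thread, not the actual Flink API; a real writer would open a stream on HDFS via the filesystem and path it is given.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the refactored Writer contract: the writer owns
// its output stream, so RollingSink only ever sees a path and offsets.
interface Writer<T> {
    // The writer opens its own stream from a path instead of being
    // handed a pre-opened stream by RollingSink.
    void open(String path) throws IOException;

    void write(T element) throws IOException;

    // Lets RollingSink query the current file size without access
    // to the underlying stream.
    long getPos() throws IOException;

    // Returns the offset the file must be truncated to on recovery.
    long flush() throws IOException;

    void close() throws IOException;
}

// Toy implementation backed by an in-memory buffer, just to show the
// getPos()/flush() semantics.
class InMemoryWriter implements Writer<String> {
    private ByteArrayOutputStream out;

    public void open(String path) { out = new ByteArrayOutputStream(); }

    public void write(String element) throws IOException {
        out.write((element + "\n").getBytes(StandardCharsets.UTF_8));
    }

    public long getPos() { return out.size(); }

    public long flush() { return out.size(); }

    public void close() throws IOException { out.close(); }
}

public class WriterSketch {
    public static void main(String[] args) throws IOException {
        InMemoryWriter w = new InMemoryWriter();
        w.open("part-0");
        w.write("hello");
        w.write("world");
        System.out.println(w.getPos()); // 12: "hello\n" + "world\n"
        System.out.println(w.flush());  // 12: truncation offset on recovery
        w.close();
    }
}
```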
SequenceFileWriter and StringWriter are both simple outputs that work directly on a file in HDFS, using the logic that used to reside in RollingSink. This logic has been moved into a new class, SimpleBaseWriter, which both StringWriter and SequenceFileWriter extend.
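The extraction of shared logic into a base class might be sketched like this. The class names SimpleBaseWriter, StringWriter, and SequenceFileWriter come from the PR description; the bodies here are an illustrative guess (backed by an in-memory buffer rather than HDFS) showing only the template shape, where the base class owns the stream and subclasses decide how a record is serialized.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical base class holding the open/position/close logic that,
// per the PR, used to live in RollingSink.
abstract class SimpleBaseWriterSketch {
    protected ByteArrayOutputStream stream;

    public void open(String path) { stream = new ByteArrayOutputStream(); }

    public long getPos() { return stream.size(); }

    public void close() throws IOException { stream.close(); }

    // Subclasses (e.g. a StringWriter or SequenceFileWriter analogue)
    // only implement record serialization.
    public abstract void write(String record) throws IOException;
}

// Analogue of StringWriter: newline-delimited text records.
class LineStringWriter extends SimpleBaseWriterSketch {
    @Override
    public void write(String record) throws IOException {
        stream.write((record + "\n").getBytes(StandardCharsets.UTF_8));
    }
}

public class BaseWriterDemo {
    public static void main(String[] args) throws IOException {
        LineStringWriter w = new LineStringWriter();
        w.open("part-0");
        w.write("a");
        w.write("bc");
        System.out.println(w.getPos()); // 5: "a\n" + "bc\n"
        w.close();
    }
}
```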
True, true... 😄 Very nice work! I'm merging. Are you planning to also work on an ORC writer for this?
@aljoscha Yes, I'll be making ORC and possibly Parquet writers to use this functionality. I'm also thinking about calling the Bucketer per item, because the partition currently can't depend on the records passing through, but this is a bit close to windowing, so I'm not sure how to proceed. I'll open JIRAs for all of these soon :)
Great to hear! Could you please close this PR? GitHub didn't close it automatically.
Closing!
@dalegaard have you created a Parquet writer for this, or can you give me an idea how I can sink JSON -> Parquet -> HDFS from a DataStream or a streaming Table?
Implements FLINK-3637 by changing the Writer interface so that the Writer handles the output stream itself, instead of RollingSink handling it. This makes it possible to write outputs for ORC and Parquet.