Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-3637] Refactor rolling sink writer #1826

Closed
wants to merge 2 commits into from
Closed

[FLINK-3637] Refactor rolling sink writer #1826

wants to merge 2 commits into from

Conversation

dalegaard
Copy link
Contributor

Implements FLINK-3637 by changing the Writer interface such that the Writer must handle the output stream, instead of having the RollingSink handling it. This makes it possible to write outputs for ORC and Parquet.

@dalegaard
Copy link
Contributor Author

Travis appears to have failed on one of the five targets because of a flaky YARN test case. Do I need to re-submit the pull request for travis to rebuild so the test shows as passing?

@zentol
Copy link
Contributor

zentol commented Mar 22, 2016

@dalegaard no, you don't have to re-submit your PR. We are aware of the instability of some tests ;)

@dalegaard
Copy link
Contributor Author

Okay, great :) Thanks!

@dalegaard
Copy link
Contributor Author

Rebased on to newest master and force pushed.

@dalegaard dalegaard changed the title Refactor rolling sink writer [FLINK-3637] Refactor rolling sink writer Apr 1, 2016
@aljoscha
Copy link
Contributor

aljoscha commented Apr 5, 2016

hi @dalegaard this must have slipped my mind. I'll review it tomorrow and merge if possible.

@aljoscha
Copy link
Contributor

aljoscha commented Apr 5, 2016

The changes look good. One thing I would like to have changed is to rename SimpleWriterBase to StreamWriterBase or StreamWriter based to reflect the fact that it is used for Stream based writers.

@dalegaard
Copy link
Contributor Author

@aljoscha okay, I'll fix that and rebase onto master. Naming things is one of the only two hard problems in computer science after all :)

The Writer interface now deals directly with filesystem and path, rather
than the raw output stream.

Since the RollingSink no longer has access to the raw output stream, it
cannot directly determine the current size of the file. A getPos()
method has been added to the Writer interface, so the RollingSink can.
retrieve the current file size.

Finally, flush() has been extended to return the offset that the file
must be truncated to at recovery.
SequenceFileWriter and StringWriter are both simple outputs that work
directly on a file in HDFS, using the logic that used to reside in
RollingSink. This logic has been moved into a new class,
SimpleBaseWriter, which both StringWriter and SequenceFileWriter extend.
@aljoscha
Copy link
Contributor

aljoscha commented Apr 6, 2016

True, true... 😄

Very nice work! I'm merging.

Are you planning to also work on an ORC writer for this?

@dalegaard
Copy link
Contributor Author

@aljoscha Yes I'll be making ORC and possibly Parquet writers to use this functionality. I'm also thinking about calling the Bucketer per item, because the partition currently can't depend on the records passing through, but this is a bit close to windowing so not sure how to proceed. I'll open JIRAs for all of these soon :)

@aljoscha
Copy link
Contributor

aljoscha commented Apr 6, 2016

Great to hear! Could you please close this PR, github didn't close automatically.

@dalegaard
Copy link
Contributor Author

Closing!

@dalegaard dalegaard closed this Apr 6, 2016
@shashank734
Copy link

@dalegaard have you created Parquet writer for the same or can you give me idea how i can sink json ->parquet ->HDFS from datastream or streaming Table ??

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants