[BEAM-7268] make sorter extension Hadoop-free #8552
R: @kanterov
Great to reduce dependencies. If dependency conflicts are the only concern, an alternative could be shading the Hadoop dependencies. Without benchmarks, it's hard to say whether this code is going to perform better or not. I would suggest having it as an alternative implementation for sorting and keeping the previous one. If I'm not mistaken, your work for What do you think about rather investing in having
Seems like a fine idea to me. It is a good point that you could just add it as a parallel extension. Or do fancy build tricks. But probably it is just easy and clean to have a separate library. It will be a little while before I can give this a good review, and I think @reuvenlax is the better choice anyhow.
Agree with both. It's still worth having a "lean" option though. I'll work on some benchmarks and on adding it as a parallel implementation, e.g.
Split
I've made the change backwards compatible and added the native impl as an option. @kanterov @reuvenlax PTAL?
I don't have time to review before the 10th of June.
@kanterov Were you planning to take a look at this?
Sorry for the delay, I was looking into a couple of things:
- license for the original code
- performance
- compatibility
The code is public domain and can be used without attribution. In fact, it's already used in Apache, so I assume that this is fine. https://github.com/apache/jackrabbit-oak/blob/trunk/oak-commons/src/main/java/org/apache/jackrabbit/oak/commons/sort/ExternalSort.java#L44-L46
For performance, I'm running a couple of Dataflow jobs that sort 1 billion integers, and so far it looks good.
For compatibility, I checked the source code, and it is source compatible: the default behavior is still Hadoop, as it was before, with the possibility to switch to native sorting.
@nevillelyh @clairemcginty @lukecwik let me know if you want to add anything, and I can merge this tomorrow
.../extensions/sorter/src/main/java/org/apache/beam/sdk/extensions/sorter/NativeFileSorter.java
Run Java_Examples_Dataflow PreCommit
cc @lemire
@kanterov Thanks for pinging me. How can I help?
@lemire thanks, no help is needed, just thought that you might want to know
Indeed. Glad to see our code being used.
Right now the Java sorter extension depends on Hadoop SequenceFile for external sort. It would be nice to re-implement it without that dependency to avoid conflicts.
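For readers unfamiliar with the technique, here is a rough sketch of what an external sort without Hadoop can look like: sort fixed-size chunks in memory, spill each one to a temp file, then do a k-way merge of those files. This is not the `NativeFileSorter` code in this PR. The class name `ExternalSortSketch`, the `chunkSize` parameter, and the one-integer-per-line file format are all made up for illustration, and only the JDK is assumed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration only; not the NativeFileSorter from this PR. */
class ExternalSortSketch {

  /** One open sorted run plus the value currently at its head. */
  private static final class Run {
    final BufferedReader reader;
    int head;

    Run(BufferedReader reader) {
      this.reader = reader;
    }

    /** Loads the next value into head; returns false once the run is exhausted. */
    boolean advance() throws IOException {
      String line = reader.readLine();
      if (line == null) {
        return false;
      }
      head = Integer.parseInt(line);
      return true;
    }
  }

  /** Sorts input while sorting at most chunkSize values in memory at a time. */
  static List<Integer> sort(List<Integer> input, int chunkSize) {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    List<Path> runFiles = new ArrayList<>();
    try {
      // Phase 1: sort each chunk in memory and spill it to its own temp file.
      for (int start = 0; start < input.size(); start += chunkSize) {
        int end = Math.min(start + chunkSize, input.size());
        List<Integer> chunk = new ArrayList<>(input.subList(start, end));
        Collections.sort(chunk);
        Path file = Files.createTempFile("sort-run-", ".txt");
        runFiles.add(file);
        try (BufferedWriter writer = Files.newBufferedWriter(file)) {
          for (int value : chunk) {
            writer.write(Integer.toString(value));
            writer.newLine();
          }
        }
      }

      // Phase 2: k-way merge; the heap always yields the run with the smallest head.
      PriorityQueue<Run> heap =
          new PriorityQueue<>((a, b) -> Integer.compare(a.head, b.head));
      for (Path file : runFiles) {
        Run run = new Run(Files.newBufferedReader(file));
        if (run.advance()) {
          heap.add(run);
        } else {
          run.reader.close();
        }
      }
      // A real sorter would stream the output; a list keeps the sketch short.
      List<Integer> output = new ArrayList<>(input.size());
      while (!heap.isEmpty()) {
        Run run = heap.poll();
        output.add(run.head);
        if (run.advance()) {
          heap.add(run);
        } else {
          run.reader.close();
        }
      }
      return output;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      // Best-effort cleanup of the spilled runs.
      for (Path file : runFiles) {
        try {
          Files.deleteIfExists(file);
        } catch (IOException ignored) {
          // Ignored in this sketch.
        }
      }
    }
  }
}
```

Calling `ExternalSortSketch.sort(values, 3)` on seven integers would spill three runs and merge them. A production sorter would size chunks by bytes rather than element count and would use a binary record format instead of text lines.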