Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-7268] make sorter extension Hadoop-free #8552

Merged
merged 5 commits into from Jun 13, 2019

Conversation

nevillelyh
Copy link
Contributor

@nevillelyh nevillelyh commented May 10, 2019

Right now the Java sorter extension depends on Hadoop SequenceFile for external sort. It'll be nice to re-implement it without the dependency to avoid conflicts.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- --- --- --- ---
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Python Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@nevillelyh
Copy link
Contributor Author

R: @kanterov

@kanterov
Copy link
Member

kanterov commented May 13, 2019

Great to reduce dependencies. If dependency conflicts are the only concern, an alternative could be shading Hadoop dependencies. It's hard to say without having benchmarks if this code is going to perform better or not. I would suggest having it as an alternative implementation for sorting and keep the previous one.

If I'm not mistaken, your work for SortValues is needed to express GroupByKeyAndSortValues as GroupByKey+SortValues. There is a thread on a mailing list on adding GroupByKeyAndSortValues as a built-in transform in Beam gbk-sort-values. I would rather go that way, runners such as Dataflow, Spark, and Flink would override it with efficient implementation because they have a concept of secondary sorting, that would be way more efficient than anything else.

What do you think about rather investing in having GroupByKeyAndSortValues, because, in the end, that's what we need? It might be better to move this discussion into the mailing list.

cc @kennknowles @reuvenlax

@kennknowles
Copy link
Member

Seems like a fine idea to me. It is a good point that you could just add it as a parallel extension. Or do fancy build tricks. But probably it is just easy and clean to have a separate library. It will be a little while before I can give this a good review, and I think @reuvenlax is the better choice anyhow.

@nevillelyh
Copy link
Contributor Author

Agree with both. It's still worth having a "lean" option though. I'll work on some benchmark & adding it as a parallel implement, e.g. {Native,Hadoop}ExternalSorter?

@nevillelyh
Copy link
Contributor Author

nevillelyh commented May 19, 2019

Split ExternalSorter into Native & Hadoop. Also did some basic benchmark, with 5 million UUID KV pairs (~360MB raw bytes), 64MB memory, Hadoop vs native impl is ~51.4s vs ~17.9s.

@nevillelyh
Copy link
Contributor Author

I've made the change backwards compatible and added the native impl as an option. @kanterov @reuvenlax PTAL?

@kanterov
Copy link
Member

I don't have time to review before the 10th of June.

@lukecwik
Copy link
Member

@kanterov Were you planning to take a look at this?

Copy link
Member

@kanterov kanterov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, I was looking into a couple of things:

  • license for the original code
  • performance
  • compatibility

The code is public domain and can to be used without attribution, in fact, it's already used in Apache, so I assume that this is fine. https://github.com/apache/jackrabbit-oak/blob/trunk/oak-commons/src/main/java/org/apache/jackrabbit/oak/commons/sort/ExternalSort.java#L44-L46

For performance, I'm running a couple of Dataflow jobs that sort 1 billion integers, and so far it looks good.

For compatibility, I checked the source code, and it is source compatible, the default behavior is Hadoop, as it was before, with the possibility to switch to native sorting.

@nevillelyh @clairemcginty @lukecwik let me know if you want to add anything, and I can merge this tomorrow

@kanterov
Copy link
Member

Run Java_Examples_Dataflow PreCommit

@kanterov
Copy link
Member

сс @lemire

@kanterov kanterov merged commit 999bc5a into apache:master Jun 13, 2019
@lemire
Copy link

lemire commented Jun 13, 2019

@kanterov Thanks for pinging me. How can I help?

@kanterov
Copy link
Member

@lemire thanks, no help is needed, just thought that you might want to know

@lemire
Copy link

lemire commented Jun 13, 2019

Indeed. Glad to see our code being used.

@nevillelyh nevillelyh deleted the neville/sort branch July 3, 2019 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants