[BEAM-7268] make sorter extension Hadoop-free #8552

nevillelyh · 2019-05-10T18:07:38Z

Right now the Java sorter extension depends on Hadoop SequenceFile for external sort. It would be nice to re-implement it without that dependency to avoid conflicts.

nevillelyh · 2019-05-10T18:10:39Z

R: @kanterov

kanterov · 2019-05-13T11:42:38Z

Great to reduce dependencies. If dependency conflicts are the only concern, an alternative could be shading the Hadoop dependencies. Without benchmarks, it's hard to say whether this code will perform better. I would suggest adding it as an alternative sorting implementation and keeping the previous one.

If I'm not mistaken, your work on SortValues is needed to express GroupByKeyAndSortValues as GroupByKey + SortValues. There is a mailing-list thread (gbk-sort-values) on adding GroupByKeyAndSortValues as a built-in transform in Beam. I would rather go that way: runners such as Dataflow, Spark, and Flink would override it with an efficient implementation because they have a concept of secondary sorting, which would be far more efficient than anything else.
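To make the GroupByKey + SortValues composition concrete, here is a minimal plain-Java sketch of what the two steps compute together: group records by a primary key, then sort each group by a secondary key. It deliberately uses only `java.util` collections rather than the Beam API, and the class and method names (`GroupThenSortSketch`, `groupThenSort`, `row`) are hypothetical, chosen for illustration only.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: plain collections, not the Beam SDK. All names are hypothetical.
class GroupThenSortSketch {

  // Builds one (primaryKey, (secondaryKey, value)) record.
  static Map.Entry<String, Map.Entry<String, Integer>> row(
      String primaryKey, String secondaryKey, int value) {
    Map.Entry<String, Integer> inner = new SimpleImmutableEntry<>(secondaryKey, value);
    return new SimpleImmutableEntry<>(primaryKey, inner);
  }

  static Map<String, List<Map.Entry<String, Integer>>> groupThenSort(
      List<Map.Entry<String, Map.Entry<String, Integer>>> input) {
    // Step 1 (the "GroupByKey" part): bucket every record under its primary key.
    Map<String, List<Map.Entry<String, Integer>>> grouped = new TreeMap<>();
    for (Map.Entry<String, Map.Entry<String, Integer>> kv : input) {
      grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
    }
    // Step 2 (the "SortValues" part): order each bucket by its secondary key.
    for (List<Map.Entry<String, Integer>> values : grouped.values()) {
      values.sort(Map.Entry.comparingByKey());
    }
    return grouped;
  }

  public static void main(String[] args) {
    System.out.println(groupThenSort(Arrays.asList(
        row("b", "z", 1), row("a", "y", 2), row("a", "x", 3), row("b", "w", 4))));
  }
}
```

In this sketch every group is sorted in memory. A runner with native secondary sorting could instead deliver the values already ordered, which is the efficiency argument made above.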

What do you think about investing in GroupByKeyAndSortValues instead, since in the end that's what we need? It might be better to move this discussion to the mailing list.

cc @kennknowles @reuvenlax

kennknowles · 2019-05-14T02:51:39Z

Seems like a fine idea to me. It is a good point that you could just add it as a parallel extension. Or do fancy build tricks. But probably it is just easy and clean to have a separate library. It will be a little while before I can give this a good review, and I think @reuvenlax is the better choice anyhow.

nevillelyh · 2019-05-18T10:29:07Z

Agree with both. It's still worth having a "lean" option though. I'll work on some benchmarks and on adding it as a parallel implementation, e.g. {Native,Hadoop}ExternalSorter?

nevillelyh · 2019-05-19T08:19:49Z

Split ExternalSorter into Native and Hadoop implementations. I also ran a basic benchmark: with 5 million UUID KV pairs (~360MB raw bytes) and 64MB memory, the Hadoop vs native impl takes ~51.4s vs ~17.9s.
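For readers unfamiliar with the technique being benchmarked, the following is a generic sketch of an external merge sort: records are buffered up to a memory limit, each full buffer is sorted and spilled to a temporary file as a run, and the runs are then combined with a k-way merge. It is not the code in this PR. The class name `ExternalSortSketch`, the record-count limit `maxInMemory` (standing in for a byte budget such as 64MB), and the assumption that records are newline-free strings are all illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Generic external merge sort sketch; hypothetical names, not the PR's implementation.
class ExternalSortSketch {

  // An open run file plus the record currently at its head.
  private static final class RunCursor implements Comparable<RunCursor> {
    final BufferedReader reader;
    String head;

    RunCursor(BufferedReader reader) throws IOException {
      this.reader = reader;
      this.head = reader.readLine();
    }

    @Override
    public int compareTo(RunCursor other) {
      return head.compareTo(other.head);
    }
  }

  // Sorts newline-free records, buffering at most maxInMemory of them before spilling.
  // The result is returned as a list for brevity; a real sorter would stream it.
  static List<String> sort(Iterable<String> records, int maxInMemory) {
    List<Path> runs = new ArrayList<>();
    try {
      List<String> buffer = new ArrayList<>();
      for (String record : records) {
        buffer.add(record);
        if (buffer.size() >= maxInMemory) {
          runs.add(spill(buffer));
          buffer.clear();
        }
      }
      if (!buffer.isEmpty()) {
        runs.add(spill(buffer));
      }
      return merge(runs);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Sorts the buffer in memory and writes it out as one sorted run.
  private static Path spill(List<String> buffer) throws IOException {
    Collections.sort(buffer);
    Path run = Files.createTempFile("sort-run-", ".txt");
    Files.write(run, buffer, StandardCharsets.UTF_8);
    return run;
  }

  // K-way merge: the heap always exposes the run whose head record is smallest.
  private static List<String> merge(List<Path> runs) throws IOException {
    PriorityQueue<RunCursor> heap = new PriorityQueue<>();
    List<String> out = new ArrayList<>();
    try {
      for (Path run : runs) {
        RunCursor cursor = new RunCursor(Files.newBufferedReader(run, StandardCharsets.UTF_8));
        if (cursor.head != null) heap.add(cursor); else cursor.reader.close();
      }
      while (!heap.isEmpty()) {
        RunCursor cursor = heap.poll();
        out.add(cursor.head);
        cursor.head = cursor.reader.readLine();
        if (cursor.head != null) heap.add(cursor); else cursor.reader.close();
      }
    } finally {
      for (RunCursor cursor : heap) cursor.reader.close();
      for (Path run : runs) Files.deleteIfExists(run);
    }
    return out;
  }
}
```

With a limit of 3, for example, `ExternalSortSketch.sort(records, 3)` on seven records spills three runs and merges them. The memory limit trades the number of runs against the size of each in-memory sort, which is one reason timings like those above depend on the configured memory.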

nevillelyh · 2019-05-26T20:38:48Z

I've made the change backwards compatible and added the native impl as an option. @kanterov @reuvenlax PTAL?

kanterov · 2019-05-29T18:52:03Z

I don't have time to review before the 10th of June.

lukecwik · 2019-06-11T21:24:09Z

@kanterov Were you planning to take a look at this?

kanterov

Sorry for the delay, I was looking into a couple of things:

license for the original code
performance
compatibility

The code is public domain and can be used without attribution; in fact, it's already used in Apache, so I assume that this is fine. https://github.com/apache/jackrabbit-oak/blob/trunk/oak-commons/src/main/java/org/apache/jackrabbit/oak/commons/sort/ExternalSort.java#L44-L46

For performance, I'm running a couple of Dataflow jobs that sort 1 billion integers, and so far it looks good.

For compatibility, I checked the source code and it is source compatible: the default behavior is still Hadoop, as it was before, with the possibility to switch to native sorting.
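The compatibility property described here (old call sites keep the Hadoop default, new call sites can opt in to native) can be sketched as an options object with a defaulted enum. This is an illustration of the pattern only; `SorterOptionsSketch`, `create`, `withSorterType`, and `getSorterType` are hypothetical names and are not claimed to match the sorter extension's actual API.

```java
import java.util.Objects;

// Hypothetical options class illustrating a source-compatible opt-in switch.
final class SorterOptionsSketch {
  enum SorterType { HADOOP, NATIVE }

  private final SorterType sorterType;

  private SorterOptionsSketch(SorterType sorterType) {
    this.sorterType = sorterType;
  }

  // Existing entry point: callers that never mention a sorter type still get HADOOP.
  static SorterOptionsSketch create() {
    return new SorterOptionsSketch(SorterType.HADOOP);
  }

  // New opt-in: returns a copy, so code holding the old instance is unaffected.
  SorterOptionsSketch withSorterType(SorterType type) {
    return new SorterOptionsSketch(Objects.requireNonNull(type));
  }

  SorterType getSorterType() {
    return sorterType;
  }
}
```

Under these assumptions, `SorterOptionsSketch.create()` behaves exactly as before the change, while `SorterOptionsSketch.create().withSorterType(SorterOptionsSketch.SorterType.NATIVE)` selects the new implementation.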

@nevillelyh @clairemcginty @lukecwik let me know if you want to add anything, and I can merge this tomorrow

.../extensions/sorter/src/main/java/org/apache/beam/sdk/extensions/sorter/NativeFileSorter.java

kanterov · 2019-06-12T21:44:54Z

Run Java_Examples_Dataflow PreCommit

kanterov · 2019-06-13T09:07:32Z

cc @lemire

lemire · 2019-06-13T15:14:35Z

@kanterov Thanks for pinging me. How can I help?

kanterov · 2019-06-13T15:15:45Z

@lemire thanks, no help is needed, just thought that you might want to know

lemire · 2019-06-13T15:18:10Z

Indeed. Glad to see our code being used.

kennknowles requested a review from reuvenlax May 14, 2019 02:49

nevillelyh force-pushed the neville/sort branch from dfec479 to 94e5171 Compare May 24, 2019 05:02

nevillelyh added 4 commits June 11, 2019 13:49

b3d7fc5 [BEAM-7268] make sorter extension Hadoop-free
a826052 split native vs hadoop external sorter
806888d fix source compat
784b5d4 parameterize test

nevillelyh force-pushed the neville/sort branch from 6cb37db to 784b5d4 Compare June 11, 2019 17:50

kanterov approved these changes Jun 12, 2019

70f0c82 make NativeFileSorter package private

kanterov merged commit 999bc5a into apache:master Jun 13, 2019

nevillelyh deleted the neville/sort branch July 3, 2019 17:29
