Heap-based global top. by robertwb · Pull Request #5551 · apache/beam

robertwb · 2018-06-04T20:02:15Z

This adds a specialized implementation for global top that greatly reduces the number of comparisons required in the (single) reducer. It also uses heapq rather than repeatedly buffering, sorting, and truncating. The new implementation doesn't accept side arguments or keyword arguments, but these (if any) are passed much more efficiently via a closure over the comparison operator than by supplying them again for every single element processed.
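As a rough illustration of the idea described above (not the code in this PR), here is a minimal sketch of a bounded-heap top-n in plain Python. The names `make_wrapper`, `top_n`, and `by_distance_from` are hypothetical; the wrapper class only stands in for something like `_ComparableValue`, and the "side argument" (`target`) is captured once in a closure rather than passed per element.

```python
import heapq


def make_wrapper(less_than):
    """Build a wrapper class whose ordering is defined by `less_than`.

    Hypothetical helper: heapq only needs `__lt__` on the items it stores.
    """

    class Wrapped(object):
        __slots__ = ('value',)

        def __init__(self, value):
            self.value = value

        def __lt__(self, other):
            return less_than(self.value, other.value)

    return Wrapped


def top_n(elements, n, less_than):
    """Return the n largest elements under `less_than`, largest first."""
    if n <= 0:
        return []
    wrap = make_wrapper(less_than)
    heap = []  # min-heap holding the n largest elements seen so far
    for element in elements:
        item = wrap(element)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif heap[0] < item:
            # The new element beats the smallest survivor, so swap it in.
            heapq.heapreplace(heap, item)
    heap.sort(reverse=True)
    return [item.value for item in heap]


def by_distance_from(target):
    """Side argument captured in a closure: closer to `target` ranks higher."""
    return lambda a, b: abs(a - target) > abs(b - target)
```

For example, `top_n([1, 5, 9, 12, 20], 2, by_distance_from(10))` picks the two values closest to 10. Most elements cost a single comparison against `heap[0]`, which is where the savings over buffer + sort + truncate would come from.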

katsiapis · 2018-07-03T17:48:49Z

sdks/python/apache_beam/transforms/combiners.py

+
+@with_input_types(T)
+@with_output_types(KV[None, List[T]])
+class _TopPerShard(core.DoFn):


Perhaps _TopPerBundle (and MergeTopPerBundle respectively)?

katsiapis · 2018-07-03T17:57:50Z

sdks/python/apache_beam/transforms/combiners.py

+          original_compare = compare
+          compare = lambda a, b: original_compare(b, a)
+      # This is a more efficient global algorithm.
+      return (


This implementation seems to prevent the equivalent of "combiner lifting" for a future where one might take advantage of Multi Shard Combining. Could that be problematic?

robertwb · 2018-11-09

That is correct. There are two issues here that lead to this structure:

1. I want to accumulate into a heap, but then sort the accumulator before emitting it, to shift work from the (single) reducer to the (many) mappers. It's unclear how to express this as a CombineFn. Perhaps https://issues.apache.org/jira/browse/BEAM-4030 could help (if we ensure that it is called on all runners).

2. I want to avoid encoding _ComparableValue objects, which would be done via pickling. This could possibly be worked around if we could set and enforce custom coders.

Perhaps a single accumulator object with a custom __reduce__ would suffice.
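The two-stage structure discussed in this thread can be sketched in plain Python, outside of Beam. This is only an illustration under stated assumptions: `top_per_bundle` and `merge_top_per_bundle` are hypothetical names (not the `_TopPerShard` code from the diff), elements are compared with their natural ordering, and each bundle is modeled as a simple list.

```python
import heapq
import itertools


def top_per_bundle(bundle, n):
    """Mapper side: keep a bounded min-heap, then emit it sorted, largest first."""
    heap = []
    for element in bundle:
        if len(heap) < n:
            heapq.heappush(heap, element)
        elif element > heap[0]:
            heapq.heapreplace(heap, element)
    # Sorting here, once per bundle, moves work off the single reducer.
    # Plain values (not wrapper objects) are what gets emitted.
    return sorted(heap, reverse=True)


def merge_top_per_bundle(sorted_bundles, n):
    """Reducer side: lazily merge descending lists and stop after n items."""
    merged = heapq.merge(*sorted_bundles, reverse=True)
    return list(itertools.islice(merged, n))
```

Because every per-bundle list arrives already sorted, the reducer in this sketch pulls only the first n items from the lazy merge instead of re-sorting everything it receives.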

stale · 2018-09-01T19:00:43Z

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

stale · 2018-09-08T19:46:17Z

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

robertwb · 2018-11-09

I was unable to re-open this pull request after rebasing; see #6997.

robertwb force-pushed the fast-global-top branch from d3be226 to 34d2dfa on June 4, 2018 20:04

robertwb force-pushed the fast-global-top branch from 34d2dfa to 7d42d6c on June 8, 2018 21:49

katsiapis reviewed Jul 3, 2018

stale bot added the stale label Sep 1, 2018

stale bot closed this Sep 8, 2018

robertwb commented Nov 9, 2018

robertwb wants to merge 1 commit into apache:master from robertwb:fast-global-top