This adds a specialized implementation for global top that greatly reduces the number of comparisons required in the (single) reducer. It also uses heapq rather than repeatedly buffering, sorting, and truncating. The new implementation doesn't accept side keywords and arguments, but these (if any) are passed much more efficiently by closing over them in the comparison operator than by handling them again for every single element processed.
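For illustration only, here is a minimal sketch of the two ideas in the description: a bounded heap in place of buffer + sort + truncate, and side arguments bound once in a closure. It is plain Python, not the code in this PR; `top_n`, `_Keyed`, and `closer_to` are hypothetical names chosen for the example.

```python
import heapq


class _Keyed:
    """Wraps a value so heapq and sorted() order it with a custom less_than."""
    __slots__ = ("value", "less_than")

    def __init__(self, value, less_than):
        self.value = value
        self.less_than = less_than

    def __lt__(self, other):
        return self.less_than(self.value, other.value)


def top_n(elements, n, less_than):
    """Return the n largest elements under less_than, largest first."""
    heap = []  # min-heap: heap[0] is the smallest element currently kept
    for element in elements:
        keyed = _Keyed(element, less_than)
        if len(heap) < n:
            heapq.heappush(heap, keyed)
        elif heap[0] < keyed:
            # One comparison rejects most elements; replace only when larger.
            heapq.heapreplace(heap, keyed)
    return [k.value for k in sorted(heap, reverse=True)]


def closer_to(target):
    """Bind the side argument once; the comparator closes over it."""
    def less_than(a, b):
        # a ranks below b when a is farther from the target.
        return abs(a - target) > abs(b - target)
    return less_than


print(top_n([3, 1, 4, 1, 5, 9, 2, 6], 3, lambda a, b: a < b))
print(top_n([1, 5, 9, 12, 20], 2, closer_to(10)))
```

Once the heap is full, most elements cost a single comparison against `heap[0]`, and the side argument (`target` here) is captured once rather than threaded through each call.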
@with_input_types(T)
@with_output_types(KV[None, List[T]])
class _TopPerShard(core.DoFn):
Perhaps _TopPerBundle (and MergeTopPerBundle respectively)?
original_compare = compare
compare = lambda a, b: original_compare(b, a)
# This is a more efficient global algorithm.
return (
This implementation seems to prevent the equivalent of "combiner lifting" for a future where one might take advantage of Multi Shard Combining. Could that be problematic?
That is correct. There are two issues here that lead to this structure:
- I want to accumulate into a heap, but then sort the accumulator before emitting it to shift work from the (single) reducer to the (many) mappers. It's unclear how to express this as a CombineFn. Perhaps https://issues.apache.org/jira/browse/BEAM-4030 could help (if we ensure that it is called on all runners).
- I want to avoid encoding _ComparableValue objects, which will be done via pickling. This could possibly be worked around if we could set and enforce custom coders.
Perhaps a single accumulator object with a custom __reduce__ would suffice.
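As a rough sketch of the structure described in the first point (not the PR's actual DoFns), each bundle can keep its top n in a heap and emit them already sorted, so the single reducer only has to merge sorted lists. The function names below are hypothetical and borrow the reviewer's suggested naming; natural ordering is assumed in place of a custom comparator, and the coder question from the second point is not addressed.

```python
import heapq
import itertools


def top_per_bundle(bundle, n):
    """Mapper side: bounded-heap selection, emitted sorted largest first."""
    # heapq.nlargest returns its result in descending order.
    return heapq.nlargest(n, bundle)


def merge_top_per_bundle(sorted_lists, n):
    """Reducer side: lazily merge pre-sorted lists and stop after n items."""
    # With reverse=True, each input must already be sorted largest first.
    merged = heapq.merge(*sorted_lists, reverse=True)
    return list(itertools.islice(merged, n))


bundles = [[5, 1, 9], [7, 3], [8, 2, 6, 4]]
per_bundle = [top_per_bundle(b, 3) for b in bundles]
print(merge_top_per_bundle(per_bundle, 3))
```

Because the merge is lazy and stops after n items, the reducer never has to sort or re-heapify the full set of per-bundle results.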
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.