[SPARK-XXXXX][SQL] Add streaming heap operator for NearestByJoin#56101
Draft
yadavay-amzn wants to merge 1 commit into
Implements StreamingNearestByJoinExec, which uses a broadcast right side plus a k-sized heap per left row, avoiding the N*M cross-product materialization.

Memory benchmark results (30K x 30K, k=5):
- Streaming Heap: 31s, ~208 MB memory delta
- Cross-product: 404s, ~1733 MB memory delta
- Memory ratio: 8.3x less memory for streaming heap
- Time ratio: 12.9x faster

At constrained heap sizes (<=1GB), cross-product OOMs while streaming heap completes with ~200MB.
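
For reviewers who want the core idea in isolation, here is a minimal sketch of the bounded-heap approach in plain Python. It is not the Spark operator and does not reflect the actual StreamingNearestByJoinExec code. The names `nearest_by_join` and `distance` are hypothetical, the in-memory `right` sequence stands in for the broadcast side, and the tie-breaking rule (earlier right rows win) is an assumption made only for this illustration.

```python
import heapq
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple, TypeVar

L = TypeVar("L")
R = TypeVar("R")


def nearest_by_join(
    left: Iterable[L],
    right: Sequence[R],
    distance: Callable[[L, R], float],
    k: int,
) -> Iterator[Tuple[L, List[Tuple[R, float]]]]:
    """Yield (left_row, [(right_row, distance), ...]) with the k closest
    right rows per left row, closest first.

    Only k candidates are held per left row, so the N*M candidate pairs
    are never materialized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    for l_row in left:
        # heapq is a min-heap; storing negated distances makes heap[0]
        # the current worst (farthest) of the kept candidates.
        # The negated index breaks ties without comparing the rows themselves.
        heap: List[Tuple[float, int, R]] = []
        for idx, r_row in enumerate(right):
            d = distance(l_row, r_row)
            if len(heap) < k:
                heapq.heappush(heap, (-d, -idx, r_row))
            elif -d > heap[0][0]:
                # Strictly closer than the current worst: swap it in.
                heapq.heapreplace(heap, (-d, -idx, r_row))
        # Descending order on (-d, -idx) is ascending distance,
        # with earlier right rows first among equal distances.
        ordered = sorted(heap, reverse=True)
        yield l_row, [(r_row, -neg_d) for neg_d, _, r_row in ordered]


if __name__ == "__main__":
    points = [1, 9, 4, 12, -2]
    for row, nearest in nearest_by_join([0, 10], points, lambda a, b: abs(a - b), k=2):
        print(row, nearest)
```

Per left row the sketch does one pass over the right side and keeps at most k entries, so extra memory is O(k) rather than O(M) per left row, at a cost of O(M log k) comparisons.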
What changes were proposed in this pull request?
Add StreamingNearestByJoinExec, a physical operator for NearestByJoin that avoids materializing the full N×M cross product. Instead of rewriting to cross-join + aggregate, the operator broadcasts the right side and iterates per left row with a bounded priority queue of size k.
Why are the changes needed?
The current RewriteNearestByJoin implementation materializes all N×M candidate pairs, shuffles them, and aggregates. At 30K×30K scale this takes 400s and uses 1.7GB. The streaming heap completes in 31s using 208MB, which is 13x faster and uses 8.3x less memory.
Does this PR introduce any user-facing change?
No. The feature is opt-in via spark.sql.join.nearestBy.streamingHeap.enabled (default false).
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
Yes.
Note: This is a draft/prototype for discussion. Design doc: https://quip-amazon.com/IeZPAZPA9PF4