feat: Pinot-style colocated-join optimizer for hash-bucketed tables#1676
Draft
wirybeaver wants to merge 1 commit into
Draft
feat: Pinot-style colocated-join optimizer for hash-bucketed tables#1676wirybeaver wants to merge 1 commit into
wirybeaver wants to merge 1 commit into
Conversation
7aa326d to
590a8d8
Compare
Brings three Pinot V2 Physical Optimizer ideas to Ballista so queries
against pre-bucketed tables can avoid the network shuffle on every join.
Architecture (all changes additive; no DataFusion fork required):
DataFusion physical planner
→ DataFusion optimizer rules (incl. EnforceDistribution, JoinSelection)
→ ColocatedJoinRule ← new: strip redundant RepartitionExec
(also handles divisor sub-partitioning)
→ BroadcastSmallSideRule ← new: replace partitioned join with CollectLeft
→ DistributedExchangeRule ← existing: maps remaining repartitions to ExchangeExec
→ DefaultDistributedPlanner ← existing: cuts stages at ExchangeExec
Metadata layer (ballista-core):
- BallistaPartitionMetadata trait — optional contract any TableProvider
can implement to declare on-disk hash bucketing (keys + hash function +
bucket count).
- HashDistribution / HashFn — the declared layout.
- PartitionedTableProvider wrapper — attaches a HashDistribution to any
existing TableProvider without modifying it.
- HashDistributedScanExec adapter — re-advertises the wrapped scan's
output_partitioning() as Partitioning::Hash so optimizer rules can
read it.
- BucketSubPartitionExec — chains input partitions per output partition
via stream::iter + flatten (pure local concat, no network) for the
divisor case.
Optimizer rules (ballista-scheduler):
- ColocatedJoinRule — for each HashJoinExec, when both inputs declare
matching keys + hash function and either equal or divisor-related
bucket counts, strips the RepartitionExec above each input or wraps
the larger side in BucketSubPartitionExec. Divisor case relies on
(hash(k) % 16) % 8 == hash(k) % 8.
- BroadcastSmallSideRule — promotes Partitioned HashJoinExec to
CollectLeft when one side's total_byte_size is below the configured
threshold; uses HashJoinExec::swap_inputs when the small side is on
the right. Restricted to JoinType::Inner pending issue apache#1055.
- default_optimizers now takes the SessionConfig so the broadcast
threshold flows through. ColocatedJoinRule runs before
BroadcastSmallSideRule so colocation wins when both could apply.
Both features are opt-in. Tables only get colocation behavior if the
user wraps them with PartitionedTableProvider; broadcast only fires when
ballista.optimizer.broadcast_threshold_bytes > 0 (default 0). All
existing snapshot tests are unchanged.
Tests:
- 86 ballista-core tests pass (trait, wrapper, scan adapter,
BucketSubPartitionExec correctness + non-divisor rejection).
- 84 ballista-scheduler tests pass: 5 ColocatedJoinRule, 4 divisor,
5 BroadcastSmallSideRule, 3 end-to-end plan-snapshot tests
(matching → no exchange; 8/4 divisor → BucketSubPartitionExec; plain
MemTable → ExchangeExec retained), plus all pre-existing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
590a8d8 to
59c0a52
Compare
Contributor
|
hi @wirybeaver , I am working on coalesce partition into target binary sizes, your work looks good, just can you verify against sf10 or sf100, let's check if they compile well and there is no big regression? It d help reviewers to assess. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #1677. Follow-up tracked in #1679.
Summary
Implements the colocated-join optimizer described in #1677: a small metadata trait in
ballista-corelets anyTableProviderdeclare on-disk hash bucketing, and two newPhysicalOptimizerRules slot into the existingAdaptivePlannerrule chain just beforeDistributedExchangeRule. Together they cover three cases — colocated joins (skip the shuffle), divisor sub-partitioning (16 vs 8 buckets, locally coalesce instead of shuffle), and small-side broadcast in the AQE code path.Metadata layer (
ballista-core):BallistaPartitionMetadatais an optional contract anyTableProvidercan implement to declare on-disk hash bucketing (keys + hash function + bucket count).PartitionedTableProvideris a wrapper that attaches aHashDistributionto any existing provider without modifying it, andHashDistributedScanExecre-advertises the wrapped scan'soutput_partitioning()asPartitioning::Hashso optimizer rules can read it.BucketSubPartitionExecchains input partitions per output partition viastream::iter+flatten— pure local concat, no network — for the divisor case (relies on the identity(hash(k) % 16) % 8 == hash(k) % 8).Optimizer rules (
ballista-scheduler):ColocatedJoinRulewalks eachHashJoinExecand, when both inputs declare matching keys + hash function and either equal or divisor-related bucket counts, strips theRepartitionExecabove each input or wraps the larger side inBucketSubPartitionExec.BroadcastSmallSideRulepromotes aPartitionedHashJoinExectoCollectLeftwhen one side'stotal_byte_sizeis below the configured threshold, usingHashJoinExec::swap_inputswhen the small side is on the right. It is restricted toJoinType::Innerpending #1055.default_optimizersnow takes theSessionConfigso the broadcast threshold flows through, andColocatedJoinRuleruns beforeBroadcastSmallSideRuleso colocation wins when both could apply.Relationship to PR #1647: that PR added small-side broadcast lowering to the non-AQE
DefaultDistributedPlanner::plan_query_stages_internal, and the implementation explicitly noted (TODO atstate/aqe/mod.rs) that the same lowering does not fire in the AQE path.BroadcastSmallSideRuleis the AQE-side counterpart that resolves that TODO; this PR removes the resolved comment. The two rules cover disjoint code paths (AQE is opt-in viaballista.planner.adaptive.enabled, default false), share the sameBALLISTA_BROADCAST_JOIN_THRESHOLD_BYTESconfig key, and behave identically.Both features are opt-in. Tables only get colocation behavior when the user wraps them with
PartitionedTableProvider.BroadcastSmallSideRule::from_session_configreturns a no-op rule (threshold = 0) when theBallistaConfigextension is not registered on theSessionConfig, so existing tests using plainSessionConfig::new()are unaffected.Test plan
cargo test -p ballista-core— 92 passed (trait, wrapper, scan adapter,BucketSubPartitionExeccorrectness + rejection of non-divisor inputs)cargo test -p ballista-scheduler— 108 passed (5ColocatedJoinRulecases, 4 divisor cases, 5BroadcastSmallSideRulecases, 3 end-to-end plan-snapshot tests, plus all pre-existing)cargo check -p ballista-scheduler --tests— cleanExchangeExec; 8/4 divisor →BucketSubPartitionExec(out=4, factor=2), no shuffle; plainMemTable→ExchangeExecretained, optimizer is silentCaveats / risks
HashJoinExeconly, so under default settings the rules don't fire — users opt back into hash join viaSET datafusion.optimizer.prefer_hash_join = true. Extending the rules toSortMergeJoinExecis tracked in Extend ColocatedJoinRule and BroadcastSmallSideRule to SortMergeJoinExec #1679.PartitionedTableProvider; documented but no manifest format yet — a follow-up could add one.BUCKETS=Nbut data isn't actually bucketed that way, results may be wrong. We chose trust over verify to keep the metadata layer lightweight.