feat(cubestore): Fold IS NOT DISTINCT FROM join predicates into HashJoin keys#10923
Conversation
…oin keys DataFusion's extract_equijoin_predicate only lifts `=` predicates, so a LEFT JOIN whose ON clause uses IS NOT DISTINCT FROM (typical for FULL OUTER emulation via a driver + chained LEFT JOINs on nullable keys) ends up as NestedLoopJoinExec — O(N*M) instead of hashed. Adds an OptimizerRule that, when a Join has no existing equi-keys, lifts all `expr IS NOT DISTINCT FROM expr` conjuncts whose sides resolve to opposite inputs into `on` and sets null_equals_null=true. DF's physical planner then propagates the flag to HashJoinExec/SortMergeJoinExec, which already handle null-safe key comparison. The rule is conservative: when `on` is already populated by DF's own extraction, it bails out to avoid silently changing the null semantics of existing `=` keys via the shared null_equals_null flag.
|
Claude finished @waralexrom's task in 2m 7s —— View job Code Review:
|
| Test | What it covers |
|---|---|
equality_join_uses_hash_join_and_excludes_null_keys |
Baseline: = joins work as before, NULLs don't match |
is_not_distinct_from_folds_into_hash_join_keys |
Core: IS NOT DISTINCT FROM lifted, NULLs now match |
is_not_distinct_from_lifts_many_predicates |
5-key lift, verifying multi-predicate handling |
full_outer_via_driver_chain_uses_hash_joins |
Real-world motivating pattern with chained LEFT JOINs |
mixed_eq_and_is_not_distinct_from_is_left_alone |
Conservative scope: mixed predicates not touched |
Each test validates three layers: logical plan structure, physical plan choice (HashJoinExec vs NestedLoopJoinExec), and result correctness. The tag column pattern in the driver-chain test is a clever way to distinguish joined NULLs from LEFT-padded NULLs.
Missing coverage (nice-to-have)
- INNER JOIN with
IS NOT DISTINCT FROM - A true single-predicate lift
- Non-hashable type (e.g., a struct column) to confirm the
can_hashguard leaves the filter intact IS NOT DISTINCT FROMwith expressions (e.g.,a.x + 1 IS NOT DISTINCT FROM b.y) rather than bare columns
None of these are blocking — the current test suite covers all the critical paths for the stated use case.
Codecov Report✅ All modified and coverable lines are covered by tests.
Additional details and impacted files@@ Coverage Diff @@
## master #10923 +/- ##
===========================================
- Coverage 78.93% 58.43% -20.51%
===========================================
Files 470 216 -254
Lines 92862 16993 -75869
Branches 3449 3449
===========================================
- Hits 73304 9930 -63374
+ Misses 19054 6559 -12495
Partials 504 504
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary
LEFT JOIN ... ON a IS NOT DISTINCT FROM b(used to emulate FULL OUTER JOIN via a driver table + chained LEFT JOINs on nullable keys) was planned asNestedLoopJoinExec— O(N×M) — because DataFusion'sextract_equijoin_predicateonly lifts plain=predicates. This PR adds a CubeStore logical optimizer rule that liftsIS NOT DISTINCT FROMconjuncts into the join's equi-keys withnull_equals_null = true, letting DF's physical planner pickHashJoinExec(orSortMergeJoinExec) — both already support null-safe key comparison internally.Changes
optimizations/is_not_distinct_from_join_keys.rs: newOptimizerRule(ApplyOrder::TopDown). For eachLogicalPlan::Joinwith emptyonand a filter that is a conjunction ofIS NOT DISTINCT FROMpredicates, the rule routes left/right-resolved hashable pairs intoon, leaves the rest infilter, and flipsnull_equals_null = true.RollingOptimizerRuleinQueryPlannerImpl::make_execution_context.onis already populated by DF's own extraction, sincenull_equals_nullis a single flag on the whole join and would silently change semantics of pre-extracted=keys.=join, single-pair lift, 5-predicate lift, FULL-OUTER-via-driver + 3 chained LEFT JOINs, and a mixed=+IS NOT DISTINCT FROMcase documenting the conservative no-op.Testing
cargo test -p cubestore is_not_distinct_from_join_test::— 5 passed.IS NOT DISTINCT FROMchain producesNestedLoopJoinExecwith the predicate infilter=; after the rule, the same chain producesHashJoinExecwithon=[...]for every step.