Skip to content

chore(tesseract): resolve join trees once per query#10982

Merged
waralexrom merged 3 commits into
masterfrom
tesseract-multi-fact-groups-refactor
Jun 1, 2026
Merged

chore(tesseract): resolve join trees once per query#10982
waralexrom merged 3 commits into
masterfrom
tesseract-multi-fact-groups-refactor

Conversation

@waralexrom
Copy link
Copy Markdown
Member

Summary

Resolve a query's join graph once instead of repeatedly during planning. Introduces a JoinTree (resolved join with compiled ON SQL) that MultiFactJoinGroups caches, plus a per-query JoinTreeCache so identical join hints are not re-resolved across the many group builds in a single query.

Changes

  • Add JoinTree (BaseCube + compiled on_sql + multiplication_factor), built once by JoinTreeBuilder; MultiFactJoinGroups now stores Rc<JoinTree> instead of Rc<dyn JoinDefinition>, so ON SQL is compiled once at group-build time rather than recompiled in make_join_logical_plan / collect_sub_query_dimensions.
  • LogicalJoin is unchanged and assembled cheaply from a JoinTree.
  • Move compile_join_condition into CommonUtils (shared by JoinTreeBuilder and resolve_join_members); drop the lifetime from MultipliedMeasuresCollector in favor of Rc<JoinTree>.
  • Add JoinTreeCache (owned by QueryTools, keyed by JoinHints, storing (JoinKey, Rc<JoinTree>)); build_groups resolves through it. Eliminates repeated identical resolution — notably the per-pre-aggregation-candidate matching loop and the per-measure passes in full_key_aggregate_measures, which otherwise re-cross the JS bridge and recompile ON SQL. No invalidation needed: the join graph is immutable for a QueryTools lifetime.

Testing

  • cargo check clean, no warnings.
  • 973 unit tests pass.
  • 450 integration tests pass against Postgres (testcontainers), including multi_fact, subquery_dim_with_multi_fact, subquery_dim_with_multiplied_measure, transitive_joins, view_multi_fact.
  • cargo fmt --check clean.

Introduce JoinTree: a resolved join tree (BaseCube + compiled on_sql +
multiplication_factor) built once by JoinTreeBuilder. MultiFactJoinGroups
now caches Rc<JoinTree> instead of Rc<dyn JoinDefinition>, so join ON SQL
is compiled a single time at group-build time rather than recompiled on
every use during planning (make_join_logical_plan and
collect_sub_query_dimensions now reuse the compiled on_sql).

LogicalJoin is unchanged and assembled cheaply from a JoinTree. Move
compile_join_condition into CommonUtils so both JoinTreeBuilder and
resolve_join_members share it. Drop the lifetime from
MultipliedMeasuresCollector in favor of Rc<JoinTree>.
Add JoinTreeCache, owned by QueryTools, keyed by JoinHints and storing
(JoinKey, Rc<JoinTree>). build_groups resolves joins through it, so the
same hints are not re-resolved across the many group builds in a single
query — notably the per-pre-aggregation-candidate matching loop and the
per-measure passes in full_key_aggregate_measures, which otherwise
re-cross the JS bridge and recompile ON SQL for identical hints.

get_or_build takes the build closure per call, so the cache holds no
Rc<QueryTools> back-reference and forms no reference cycle. No
invalidation: the join graph is immutable for a QueryTools lifetime.
@github-actions github-actions Bot added the rust Pull requests that update Rust code label Jun 1, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Jun 1, 2026

Claude finished @waralexrom's task in 1m 39s —— View job


Code review

Overall this is a clean, well-scoped refactor. The split between JoinTree (resolved + compiled) and LogicalJoin (the plan node) is a nice separation, and pushing compile_join_condition down to CommonUtils removes the awkward JoinPlanner dependency from the sub-query collectors. The cache is correctly per-QueryTools and trivially safe given the immutable join graph.

Notes & minor suggestions
  • JoinTreeCache key granularity (join_tree_cache.rs:14): caching by JoinHints rather than by the canonical JoinKey will miss hits when two different hint sets resolve to the same join (the join_for_hints call is what canonicalizes). Not a correctness issue, just a missed reuse opportunity — if profiles still show repeated join_for_hints calls for equivalent hints, consider a two-level cache (JoinHints → JoinKey, then JoinKey → JoinTree). For the current hot paths (per-pre-agg candidate, per-measure passes) the hints are likely identical, so this is probably fine as is.

  • JoinTreeBuilder not routed through the cache in JoinPlanner (join_planner.rs:62-64): make_join_node_with_join_hints builds a fresh JoinTreeBuilder and resolves directly, bypassing query_tools.join_tree_cache(). If this path is ever called with hints that MultiFactJoinGroups has already resolved (or vice-versa), the ON-SQL gets recompiled. Worth verifying whether the join-planner path overlaps with the multi-fact-groups path in practice; if it does, routing both through the same cache would close that gap.

  • JoinTree::joins() return type (join_tree.rs:65): &Vec<JoinTreeItem> could be &[JoinTreeItem] — slightly more idiomatic and lets callers pass it to slice-accepting APIs without an extra deref. Pure nit.

  • multiplication_factor: HashMap<String, bool> (join_tree.rs:43): the bool value is always meaningful only when true (you unwrap_or(false)), so a HashSet<String> of multiplied cubes would express intent more directly. Kept as-is presumably to mirror static_data().multiplication_factor — fine either way.

  • JoinTreeCache::get_or_build re-entrancy (join_tree_cache.rs:25-34): the borrow() is dropped before build() runs and re-borrowed on insert, so recursive calls into the cache from build are safe. Worth a one-line comment if anyone is tempted to fold the lookup and insert into a single entry() call (which would deadlock if build recursed).

  • Repeated JoinTreeBuilder::new(query_tools.clone()) at call sites: trivial allocation cost; could be a &self method on JoinPlanner / live on QueryTools if you want one less call-site clone. Definitely not blocking.

No correctness, security, or testing concerns. Test coverage (multi_fact, subquery_dim_with_multi_fact, subquery_dim_with_multiplied_measure, transitive_joins, view_multi_fact) covers the touched paths well.

· Branch: `tesseract-multi-fact-groups-refactor`

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.82%. Comparing base (8265a64) to head (f2e5dc9).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10982      +/-   ##
==========================================
- Coverage   78.84%   78.82%   -0.03%     
==========================================
  Files         470      470              
  Lines       93390    93438      +48     
  Branches     3466     3466              
==========================================
+ Hits        73637    73649      +12     
- Misses      19250    19286      +36     
  Partials      503      503              
Flag Coverage Δ
cube-backend 58.43% <ø> (ø)
cubesql 83.37% <ø> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Explain why the lookup borrow is dropped before build() runs and must
not be folded into a single entry() call.
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Jun 1, 2026

Claude finished @waralexrom's task in 42s —— View job


Code review (re-run on synchronize)

Verdict: LGTM — no blocking issues. 0 high, 0 medium, 2 low (carried over from prior review, non-blocking).

Details
  • Re-fetch diff vs origin/master
  • Review new commit f2e5dc9 (re-entrancy doc)
  • Re-check prior findings against current diff

What changed since last review

f2e5dc9 adds a 4-line doc comment on JoinTreeCache::get_or_build explaining why the lookup borrow is dropped before build() runs and why this must not be folded into a single entry() call. This directly addresses the re-entrancy nit from the previous review — comment reads cleanly and captures both the rationale ("build may itself call back into the cache") and the failure mode ("would hold the borrow across build and panic").

Still-open (non-blocking) notes carried over from prior review

  • JoinTreeCache keyed by JoinHints, not canonical JoinKey (join_tree_cache.rs:16) — two hint sets that canonicalize to the same join still resolve twice. Likely fine for the current hot paths where hints are identical; a two-level cache (JoinHints → JoinKey → JoinTree) would close the residual gap if profiling shows it.
  • JoinTreeBuilder bypasses the cache in JoinPlanner::make_join_node_with_join_hints (join_planner.rs:62-64) — recompiles ON SQL if this path overlaps with the MultiFactJoinGroups path for the same hints. Worth checking once whether the overlap exists in practice.

Pure nits (&[JoinTreeItem] instead of &Vec<JoinTreeItem>, HashSet<String> instead of HashMap<String, bool>) remain as-is; not worth chasing.

Quality / security / perf / tests / docs

  • Quality: Refactor is well-scoped; JoinTree (resolved + compiled) vs LogicalJoin (plan node) separation is clean. compile_join_condition move to CommonUtils removes the awkward JoinPlanner dep from sub-query collectors.
  • Security: No new attack surface — cache is in-memory, per-QueryTools, no user input reaches keys beyond what already flowed through JoinHints.
  • Perf: Cache is the whole point and the gain (per-pre-agg candidate + per-measure pass deduplication) is real. Rc<JoinTree> swap-in for Rc<dyn JoinDefinition> in MultiFactJoinGroups is a strict win — ON SQL compiled once at group-build time.
  • Tests: 973 unit + 450 integration green per PR body; multi_fact, subquery_dim_with_multi_fact, subquery_dim_with_multiplied_measure, transitive_joins, view_multi_fact cover the touched paths. No new tests requested for the cache itself, which is reasonable given existing integration coverage exercises both hit and miss paths transitively.
  • Docs: Inline rustdoc on JoinTreeCache and get_or_build is sufficient; no user-facing docs needed for an internal planner cache.
· Branch: `tesseract-multi-fact-groups-refactor`

@waralexrom waralexrom merged commit d86ab6f into master Jun 1, 2026
249 of 257 checks passed
@waralexrom waralexrom deleted the tesseract-multi-fact-groups-refactor branch June 1, 2026 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants