Skip to content

feat(tesseract): Support separate pre-aggregations for different multi-fact subqueries#10718

Merged
waralexrom merged 18 commits intomasterfrom
tesseract-multi-fact-separate-pre-aggregations
Apr 28, 2026
Merged

feat(tesseract): Support separate pre-aggregations for different multi-fact subqueries#10718
waralexrom merged 18 commits intomasterfrom
tesseract-multi-fact-separate-pre-aggregations

Conversation

@waralexrom
Copy link
Copy Markdown
Member

Check List

  • Tests have been run in packages where changes have been made if available
  • Linter has been run for changed code
  • Tests for the changes have been added if not covered yet
  • Docs have been added / updated if required

@waralexrom waralexrom requested a review from a team as a code owner April 20, 2026 12:38
@github-actions github-actions Bot added rust Pull requests that update Rust code javascript Pull requests that update Javascript code labels Apr 20, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 20, 2026

Claude finished @waralexrom's task in 2m 37s —— View job


Code Review: Separate Pre-aggregations for Multi-fact Subqueries

  • Gather context and understand PR changes
  • Review code quality and architecture
  • Review security considerations
  • Review performance implications
  • Review test coverage
  • Post comprehensive review

Summary

This PR refactors how multiplied (multi-fact) measures are planned and optimized in Tesseract. The key architectural change is unifying the CTE management for both multi-stage members and multiplied measures into a single CteState, eliminating the separate ResolveMultipliedMeasures logical node. This enables pre-aggregation matching to work independently on each multi-fact subquery (regular or multiplied) rather than requiring the entire multiplied-measures subtree to match or fail as a unit.

What changed:

  1. Deleted ResolveMultipliedMeasures (logical + physical) — multiplied measures are now planned as individual LogicalMultiStageMember CTEs
  2. New CteState — shared counter/accumulator for both MultiStageQueryPlanner and MultipliedMeasuresQueryPlanner
  3. MultiStageLeafMeasure.measure.measures (Vec) — supports multi-measure CTEs from regular measure subqueries
  4. pre_aggregation_override on AggregateMultipliedSubquery — allows the pre-agg optimizer to short-circuit individual multiplied subqueries
  5. Pre-agg optimizer refactoredtry_rewrite_query now handles simple rewrite + multi-stage fallback in one place; new try_rewrite_schema_and_filter for AggregateMultipliedSubquery matching
  6. New bucketing integration tests — 8 tests covering multi-stage dimensions with bucketing logic

The approach is sound and the refactoring materially simplifies the codebase (~200 net lines removed despite adding tests and new functionality).


Code Quality

Positive:

  • Good separation of concerns: CteState is a clean, focused abstraction
  • The ResolveMultipliedMeasures + ResolvedMultipliedMeasures enum removal significantly simplifies FullKeyAggregate and all three aggregate strategy processors
  • The pre-aggregation optimizer flows more linearly now

Issues:

  1. try_rewrite_schema_and_filter always uses TimeShiftState::default()rust/cubesqlplanner/cubesqlplanner/src/logical_plan/optimizers/pre_aggregation/optimizer.rs:139-143

    When matching a AggregateMultipliedSubquery against pre-aggregations, try_rewrite_schema_and_filter hardcodes TimeShiftState::default() for date range extraction. This is likely correct since multiplied measure subqueries don't have time shifts, but the same hardcoding also appears in the PlanNode::AggregateMultipliedSubquery arm at line 218. Consider whether the time shifts from the enclosing context should be threaded through here.

    Investigate this →

  2. root_filter used for all multiplied subquery matchingoptimizer.rs:189-191

    The comment says "multiplied-measure CTEs don't carry their own filter — logically they apply the same filter as the root query." This is a key semantic assumption. If a future change adds filter pushdown into multiplied measure subqueries, this would silently become wrong. The comment is helpful but consider adding a debug assertion or at minimum ensuring this invariant is documented at the AggregateMultipliedSubquery struct level.

  3. Duplicate code in try_rewrite_simple_query vs try_rewrite_schema_and_filteroptimizer.rs:117-131 vs optimizer.rs:139-170

    These two methods share nearly identical logic (schema/filter matching → make source → build Query). The only differences are: (a) try_rewrite_simple_query copies the existing query's modifiers while try_rewrite_schema_and_filter creates empty modifiers, and (b) the date range extraction. Consider extracting a shared helper.

    Fix this →

  4. Empty data_queries left behind in strategy files — The full_join_aggregate_strategy.rs:103, inner_join_aggregate_strategy.rs:29, and keys_aggregate_strategy.rs:33 files still declare let mut data_queries = vec![]; or let mut keys_queries = vec![]; immediately before the remaining loop. Not a bug, but now the code looks like it was hastily trimmed rather than cleanly restructured. No need to fix now, but worth noting.


Security

No security concerns. This is internal query planning logic with no user-facing input parsing, no SQL injection vectors (queries are built via typed builders), and no auth/authz changes.


Performance

  1. CTE naming is globally sequential now — With CteState shared across multi-stage and multiplied planners, CTE names are cte_0, cte_1, ... in order of planning, not grouped by type. This is fine functionally but means CTE numbering is now sensitive to planning order. If the order of multi_stage_query_planner.plan_queries() vs multiplied_measures_query_planner.plan_queries() ever changes, CTE names will shift, potentially breaking snapshot tests. This is acceptable but worth being aware of.

  2. No performance regression expected — The refactoring doesn't add extra passes or copies. The pre_aggregation_override field adds one Option<Rc<Query>> per AggregateMultipliedSubquery which is negligible.

  3. Pre-aggregation matching improvement — The new approach tries simple rewrite first, then falls back to multi-stage rewriting (line 95-100). This is more efficient than the old code which checked !plan.multistage_members().is_empty() upfront and went straight to the multi-stage path, potentially missing a simpler match.


Testing

Strengths:

  • 8 new integration tests covering bucketing scenarios: simple, multistage measures, complex dimensions, cross-cube references, joins, multi-dimension, and concatenated dimensions
  • Tests exercise both SQL generation (build_sql) and actual execution against Postgres (try_execute_pg)
  • Good seed data with diverse customer revenue patterns enabling meaningful assertions

Concerns:

  1. No test for the pre-aggregation optimization path — The new PlanNode::AggregateMultipliedSubquery arm in the optimizer (line 216-233) and try_rewrite_schema_and_filter are not directly tested. The bucketing tests validate the planning pipeline but don't exercise the pre-aggregation rewrite of multiplied subqueries. Consider adding a test that provides a pre-aggregation definition matching a multiplied measure subquery.

  2. No negative/error tests — All tests are happy-path. Consider adding a test where pre-aggregation matching partially fails (some subqueries match, some don't) to verify the rollback behavior (line 243-250) works correctly.

  3. cumulative measure rejection — The new guard in measure_matcher.rs:34-38 that rejects cumulative measures from pre-aggregation matching is important but not directly tested.


Minor Notes

  • Pre-existing typo: only_addictive should be only_additive in measure_matcher.rs — not introduced by this PR.
  • modifers typo in Query builder (line 125, 161) — also pre-existing, not from this PR.
  • The regex update in pre-aggregation-utils.ts to handle WITH in addition to SELECT is correct and necessary since multiplied measure CTEs may now produce WITH ... SELECT patterns.

Verdict

This is a well-designed architectural refactoring that simplifies the codebase while enabling more granular pre-aggregation matching. The main actionable feedback is around code duplication between the two rewrite methods and the missing test coverage for the pre-aggregation optimization of multiplied subqueries. The rest is solid.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.78%. Comparing base (46b7106) to head (48526d0).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #10718   +/-   ##
=======================================
  Coverage   78.78%   78.78%           
=======================================
  Files         466      466           
  Lines       92050    92050           
  Branches     3378     3378           
=======================================
+ Hits        72525    72526    +1     
+ Misses      19033    19032    -1     
  Partials      492      492           
Flag Coverage Δ
cube-backend 58.03% <ø> (ø)
cubesql 83.40% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@waralexrom waralexrom force-pushed the tesseract-multi-stage-pre-aggregations-full-support branch from a884e18 to 64242fe Compare April 24, 2026 10:53
@waralexrom waralexrom requested review from a team as code owners April 24, 2026 10:53
Base automatically changed from tesseract-multi-stage-pre-aggregations-full-support to master April 27, 2026 09:02
@waralexrom waralexrom force-pushed the tesseract-multi-fact-separate-pre-aggregations branch from 8a7bcd9 to 48526d0 Compare April 27, 2026 10:30
@waralexrom waralexrom closed this Apr 28, 2026
@waralexrom waralexrom reopened this Apr 28, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 28, 2026

Claude finished @waralexrom's task in 2m 14s —— View job


Code Review: Separate Pre-aggregations for Multi-fact Subqueries

  • Gather context and understand PR changes
  • Review code quality and architecture
  • Review security considerations
  • Review performance implications
  • Review test coverage
  • Post comprehensive review

Summary

This PR makes a significant architectural improvement to Tesseract's query planning for multiplied (multi-fact) measures. The key change is unifying CTE management for both multi-stage members and multiplied measures into a single CteState, and removing the separate ResolveMultipliedMeasures logical/physical node. This enables the pre-aggregation optimizer to independently match each multi-fact subquery (regular or multiplied) against pre-aggregations, rather than requiring the entire multiplied-measures subtree to match or fail as a unit.

Core architectural changes:

  1. Deleted ResolveMultipliedMeasures (logical + physical) — multiplied measures now planned as individual LogicalMultiStageMember CTEs
  2. New CteState — shared counter/accumulator for both planners
  3. MultiStageLeafMeasure.measure.measures (Vec) — supports multi-measure CTEs from regular measure subqueries
  4. pre_aggregation_override on AggregateMultipliedSubquery — lets the pre-agg optimizer short-circuit individual multiplied subqueries
  5. Pre-agg optimizer refactoredtry_rewrite_query now handles simple + multi-stage fallback; new try_rewrite_schema_and_filter for AggregateMultipliedSubquery matching
  6. Comprehensive test coverage — 8 bucketing tests, 12+ multi-fact pre-agg tests, 3 multi-stage pre-agg tests

The refactoring is well-designed — it reduces net code (~200 lines removed despite significant new tests) while enabling genuinely new functionality.


Code Quality

Strengths:

  • CteState is a clean, focused abstraction that eliminates dual tracking between MultiStageQueryPlanner and MultipliedMeasuresQueryPlanner
  • The removal of ResolveMultipliedMeasures + the FullKeyAggregateContent::ResolvedMultipliedMeasures variant significantly simplifies all three aggregate strategy processors (full join, inner join, keys)
  • The pre-aggregation optimizer flows more linearly — try simple match first, then fall back to multi-stage rewriting

Issues:

  1. Code duplication between try_rewrite_simple_query and try_rewrite_schema_and_filteroptimizer.rs:115-178

    These two methods share nearly identical logic: iterate pre-aggregations, call is_schema_and_filters_match, call make_pre_aggregation_source, build a Query. The only differences are: (a) try_rewrite_simple_query copies existing query modifiers while try_rewrite_schema_and_filter creates empty modifiers, and (b) try_rewrite_schema_and_filter loops over all pre-aggregations internally while try_rewrite_simple_query is called per pre-aggregation.

    A shared helper that takes schema, filter, pre-aggregation, and modifiers would reduce this duplication to a few lines.

    Fix this →

  2. TimeShiftState::default() hardcoded for multiplied subquery matchingoptimizer.rs:149-153 and optimizer.rs:224-226

    When matching AggregateMultipliedSubquery against pre-aggregations, try_rewrite_schema_and_filter always uses TimeShiftState::default() for date range extraction. This is likely correct since multiplied measure subqueries don't have time shifts at this level (time shifts live on the enclosing multi-stage member). However, the assumption is implicit. Consider adding a brief comment at the call site in try_rewrite_query_with_multistages (line 224) noting that this is intentional, or threading the time shift state through for correctness.

  3. root_filter used for all multiplied subquery matchingoptimizer.rs:194

    The comment at line 192 explains that multiplied-measure CTEs logically apply the same filter as the root query. This is a key semantic invariant. If a future change adds filter pushdown into multiplied measure subqueries, this would silently become wrong. Consider adding a debug_assert! or documenting this invariant at the AggregateMultipliedSubquery struct level (aggregate_multiplied_subquery.rs).

  4. Mixed external/non-external rejection could fail silentlyoptimizer.rs:268-277

    The check for mixed external/non-external pre-aggregation usages at line 270-276 rolls back all usages if there's a mix. This is correct, but only covers usages added during the multi-stage rewrite (from saved_usages_len). If a pre-aggregation from a different optimization pass were somehow present, it wouldn't be checked. This is fine today but worth noting as a fragile boundary.

  5. Pre-existing typos (not introduced by this PR):

    • only_addictive should be only_additive (measure_matcher.rs:8,14,41)
    • modifers should be modifiers (optimizer.rs:129,166)

Security

No security concerns. All changes are internal query planning logic. No user-facing input parsing, no SQL injection vectors (queries built via typed builders), and no auth/authz changes.


Performance

  1. No performance regression expected. The refactoring doesn't add extra passes or copies. The pre_aggregation_override field adds one Option<Rc<Query>> per AggregateMultipliedSubquery, which is negligible.

  2. Pre-aggregation matching improvement. The new approach tries simple rewrite first, then falls back to multi-stage rewriting (optimizer.rs:96-112). This is more efficient than the old code which checked !plan.multistage_members().is_empty() upfront and went straight to the multi-stage path, potentially missing a simpler match.

  3. CTE naming is globally sequential now. With CteState shared across multi-stage and multiplied planners, CTE names are cte_0, cte_1, ... in order of planning, not grouped by type. This is fine functionally but means CTE numbering is sensitive to planning order. Snapshot tests already account for this.


Testing

Strengths:

  • 12+ new multi-fact pre-aggregation tests (multi_fact.rs) covering: separate pre-aggs, totals, single rollup match, mixed regular + multiplied, partial match rollback, filtered variants — all with both SQL generation and Postgres execution
  • 3 new multi-stage pre-aggregation tests (multi_stage.rs) covering time shift with leaf measures, time shift with multi-stage measures, and multiplied leaf time shift
  • 8 new bucketing integration tests (bucketing.rs) covering simple, multi-stage measures, complex dimensions, cross-cube references, joins, multi-dimension, and concatenated dimensions
  • Tests exercise both build_sql_with_used_pre_aggregations (assertions on pre-agg matching) and actual execution against Postgres (try_execute_pg)
  • Good seed data with diverse patterns enabling meaningful assertions
  • The test_multi_fact_partial_match_rolls_back test (line 454) specifically validates the rollback behavior when one of multiple subqueries cannot match — this was a gap identified in the prior review and is now addressed

One observation:

  • The cumulative measure rejection guard in measure_matcher.rs:34-38 is not directly unit-tested (there's no test that creates a cumulative measure and verifies it returns false). The existing MeasureMatcher unit tests only cover non-cumulative measures. This is a minor gap — the behavior is covered indirectly by the integration tests.

Verdict

This is a well-executed architectural refactoring. It simplifies the codebase materially while enabling more granular pre-aggregation matching for multi-fact queries. The test coverage is thorough and addresses gaps from the prior review (rollback behavior, filtered variants). The main actionable feedback is around code duplication between the two rewrite methods and documenting the root_filter invariant. The rest is solid work.

@waralexrom waralexrom merged commit 187ee17 into master Apr 28, 2026
231 of 234 checks passed
@waralexrom waralexrom deleted the tesseract-multi-fact-separate-pre-aggregations branch April 28, 2026 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

javascript Pull requests that update Javascript code rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants