Skip to content

fix(tesseract): make pre-aggregations tests work under native SQL planner#10992

Merged
waralexrom merged 13 commits into
masterfrom
tesseract-enable-all-pre-aggrs-test
Jun 2, 2026
Merged

fix(tesseract): make pre-aggregations tests work under native SQL planner#10992
waralexrom merged 13 commits into
masterfrom
tesseract-enable-all-pre-aggrs-test

Conversation

@waralexrom
Copy link
Copy Markdown
Member

Summary

Fixes the bugs that surface when running pre-aggregation integration/smoke tests under the Tesseract native SQL planner (CUBEJS_TESSERACT_PRE_AGGREGATIONS=1), so pre-aggregations are safe to enable on Tesseract. Each fix restores parity with the legacy query builder. Also turns the flag on for the Tesseract leg of CI.

Changes

  • HLL count_distinct_approx: render as an HLL state on build (hll_init) and merge on read — cardinality at top level (hll_cardinality_merge), merged state when feeding a further aggregation (hll_merge, e.g. rolling-window over the rollup).
  • KSQL dialect mirrored into sqlTemplates (consumed by the Rust renderer): backtick identifier quoting, CONCAT(...) instead of the unsupported ||, and GROUP BY by full expressions instead of column ordinals.
  • rollupLambda external inherited from its member rollups, so the union read renders with the external store's dialect/placeholders (?) — fixes CubeStore No field named "$1".
  • preAggregationId forwarded in buildSqlAndParamsRust, so the pre-aggregation partitions API / scheduled refresh isolate a single pre-agg instead of matching a same-reference sibling.
  • No default ORDER BY for pre-aggregation/total build queries (matches legacy defaultOrder()); the stray Sort made CubeStore reject streaming pre-aggregation builds.
  • Skip rollupJoin with transitive joins under the native planner (genuinely unsupported), keeping the legacy assertion for the non-Tesseract path.
  • CI: enable CUBEJS_TESSERACT_PRE_AGGREGATIONS on the Tesseract matrix leg (integration + smoke).

Testing

  • Rust: full cubesqlplanner lib green (977 tests); new sql_generation tests cover HLL build/read, multi-stage-over-rollup, and rolling-window-over-rollup state path.
  • JS: pre-aggregations.test.ts (PreAggregations) under Tesseract; smoke-lambda 7/7 green with CUBEJS_TESSERACT_SQL_PLANNER=1 CUBEJS_TESSERACT_PRE_AGGREGATIONS=1 (Postgres rollups, lambda + unionWithSourceData, and KSQL streaming pre-aggregations).

waralexrom added 10 commits June 1, 2026 18:24
…re-aggregations

Pre-aggregation build now serializes count_distinct_approx as an HLL
state (hll_init) instead of computing an approximate distinct, and
queries reading the rollup merge the stored states (hll_merge /
hll_cardinality_merge) instead of re-hashing the state column.

Build side: render_measure_as_state is set for pre-aggregation queries.
Read side: FinalPreAggregationMeasureSqlNode wraps the matched column
with a merge, keeping the merged state when the result feeds a further
aggregation and taking its cardinality otherwise.
The Postgres-style mock rendered hll_merge and hll_cardinality_merge
identically (both as a cardinality), so tests could not tell which
branch a count_distinct_approx pre-aggregation read used. Switch the
mock to CubeStore-style forms where the two differ (merge vs
cardinality(merge)).

Update the approx pre-agg read/build tests to the new forms and add a
multi-stage case (sum over count_distinct_approx) asserting the rollup
read finalizes to a cardinality, matching the non-pre-agg shape.
…gation

Add a rolling-window count_distinct_approx whose pre-aggregation stores
the rolling measure itself (mirrors the JS partitionedRolling case). The
leaf reads the rollup's HLL state and keeps it merged (hll_merge), while
the rolling-window stage finalizes the merged states to a cardinality.

This exercises the state branch of FinalPreAggregationMeasureSqlNode that
the other approx tests do not reach.
Tesseract renders SQL from sqlTemplates() in Rust, while KsqlQuery only
overrode the JS dialect methods. Identifiers came out double-quoted and
string concatenation / LIKE wildcards used the unsupported `||` operator.

Override sqlTemplates() so KSQL pre-aggregations and queries built by the
native planner use backtick quoting and CONCAT(...) instead of `||`.
…lanner

Tesseract's rollupJoin key resolution requires each join's ON condition to
reference only the two joined cubes so it can extract the stitch keys. The
test_facts -> merchant_and_product_dims join is transitive (its condition
references intermediate cubes), which the binary resolver rejects. This is an
unsupported feature, not a regression — skip it under nativeSqlPlanner and keep
the legacy assertion for the non-Tesseract path.
A rollupLambda has no `external` default in the data model, so it reached
the planner as external=None -> false. The lambda-union read then rendered
with the source cube's templates (Postgres $N placeholders) instead of the
external store's (CubeStore ?), and CubeStore could not bind the filter
parameter ("No field named $1").

Derive the lambda's external from its member rollups (external only when all
members are), mirroring the legacy R.all(p => p.external, descriptions) check.
This selects the CubeStore dialect/placeholders for the union read.
…dParamsRust

The Rust build path dropped preAggregationId, so getSql() (used by the
pre-aggregation partitions API and scheduled refresh) could not isolate a
single pre-aggregation. When several rollups share identical references —
e.g. a rollupLambda member and a real-time rollup — Tesseract matched the
wrong one, attributing one rollup's partitions to another.

Forward preAggregationId (as findPreAggregationForQuery already does) so the
optimizer filters to the requested pre-aggregation.
Tesseract always applied a default ORDER BY when no explicit order was set,
including for pre-aggregation build and total queries. Legacy defaultOrder()
returns [] for those. The stray ORDER BY is pointless for materialization and
breaks streaming pre-aggregations: CubeStore's Kafka target only accepts an
unsorted projection/filter over the source, so a top-level Sort is rejected.

Skip default order when pre_aggregation_query or total_query, matching legacy.
…lates

ksqlDB does not support positional GROUP BY. The native planner renders
GROUP BY from the shared sqlTemplates, which default to column ordinals, so a
KSQL streaming pre-aggregation build (CREATE TABLE AS SELECT ... GROUP BY 1,2)
was rejected: 'Non-aggregate SELECT expression(s) not part of GROUP BY'.

Override statements.group_by_exprs in KsqlQuery to emit the full expressions.
Run integration and integration-smoke jobs with Tesseract pre-aggregations on
whenever the Tesseract SQL planner is on, so the full matrix exercises the
native pre-aggregation path.
@waralexrom waralexrom requested review from a team as code owners June 2, 2026 09:52
@github-actions github-actions Bot added rust Pull requests that update Rust code javascript Pull requests that update Javascript code data source driver labels Jun 2, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Jun 2, 2026

Claude finished @waralexrom's task in 2m 30s —— View job


Code Review — fix(tesseract): make pre-aggregations work under native SQL planner

Verdict: LGTM. Surgical, well-motivated fixes that move Tesseract toward parity with the legacy builder for pre-aggregations. Code-level findings: 0 high, 0 medium, 3 low / nits.

Details (todos + findings)
  • Checked existing review threads (none)
  • Read diff (git diff 35bed42..HEAD)
  • Reviewed JS/TS changes
  • Reviewed Rust changes
  • Reviewed CI workflow change

Per-area notes

rust/.../pre_aggregations_compiler.rs (rollupLambda external inheritance)
The new rule — external if every member rollup is external — is correct for routing, but two edge cases are worth noting:

  • If pre_aggrs_for_lambda were ever empty, .iter().all(...) returns true, incorrectly granting Some(true). In practice a lambda always has members, but a defensive !is_empty() && ...all(...) would make this robust.
  • When members mix Some(false) / None, the result falls back to static_data.external (which is None for a lambda that omits external). Today downstream code already treats None as "not external," so behaviorally fine, but worth keeping in mind if external semantics ever tighten.

rust/.../final_pre_aggregation_measure.rs + physical_plan_builder/builder.rs
The count_approx_as_state plumbing is consistent with the existing render_measure_as_state flag used elsewhere (multistage leaf, final_measure). The cardinality-vs-state branch in FinalPreAggregationMeasureSqlNode::process_node (lines ~64–80) is the correct place for the decision and the comment explains why well. Tests pin both paths (test_count_distinct_approx_pre_agg_read_merges_state, ..._rolling_pre_agg_keeps_state, ..._multistage_pre_agg_reads_cardinality) — good coverage.

rust/.../query_properties.rs (no default ORDER BY for build/total)
Correct condition (pre_aggregation_query || total_query) and the comment explains the streaming-pre-agg motivation. No regression risk to ordinary queries since the branch is gated on these flags.

packages/cubejs-ksql-driver/src/KsqlQuery.ts (sqlTemplates override)

  • BaseQuery.sqlTemplates() returns a fresh object each call, so mutating templates.quotes, expressions.concat_strings, etc. is safe.
  • The three mirrored bits (backtick quoting, CONCAT, full-expr GROUP BY) match what KsqlQuery already does in its non-template overrides (quoteIdentifier, concatStringsSql, KsqlFilter). Nit: no JS-level unit test exercises sqlTemplates() for KSQL — drift between the two code paths would be silent. Consider a small assertion in a future change.

packages/cubejs-schema-compiler/src/adapter/BaseQuery.js
preAggregationId: this.options.preAggregationId || null — straightforward forwarding; matches the PR description's isolation goal. Nit: every other passthrough on these lines preserves undefined (e.g., joinHints, ungrouped); coercing to null is harmless but slightly inconsistent.

packages/cubejs-schema-compiler/test/integration/postgres/pre-aggregations.test.ts
Wrapping the existing tests in if (getEnv('nativeSqlPlanner')) { it.skip(...) } else { it(...) } keeps coverage on the legacy path and documents the gap on Tesseract. Verbose but fine.

.github/workflows/push.yml
Setting CUBEJS_TESSERACT_PRE_AGGREGATIONS from matrix.use_tesseract_sql_planner only — i.e. the flag is enabled exactly when the planner is — is the right shape. No further matrix expansion needed for this PR.

Suggestions (low priority)

  1. Defensive !is_empty() && pre_aggrs_for_lambda.iter().all(...) in pre_aggregations_compiler.rs lambda-external block.
  2. Add a JS unit test asserting KsqlQuery#sqlTemplates() returns the three overrides (quotes, concat_strings, group_by_exprs).
  3. Consider preserving undefined (or documenting the null coercion) for preAggregationId in BaseQuery.js:957 for consistency with sibling fields.
· branch: `tesseract-enable-all-pre-aggrs-test`

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 2, 2026

Codecov Report

❌ Patch coverage is 33.33333% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.81%. Comparing base (35bed42) to head (ef20929).
⚠️ Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
...bejs-schema-compiler/src/adapter/CubeStoreQuery.ts 0.00% 2 Missing ⚠️
...s/cubejs-schema-compiler/src/adapter/MysqlQuery.ts 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10992      +/-   ##
==========================================
- Coverage   78.81%   78.81%   -0.01%     
==========================================
  Files         470      471       +1     
  Lines       93438    94328     +890     
  Branches     3466     3494      +28     
==========================================
+ Hits        73644    74345     +701     
- Misses      19291    19474     +183     
- Partials      503      509       +6     
Flag Coverage Δ
cube-backend 58.50% <33.33%> (+0.07%) ⬆️
cubesql 83.32% <ø> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

A subQuery dimension's inner query is aggregated per primary key and joined
back by it, so its order is meaningless. The default order it received
rendered as ORDER BY + OFFSET/FETCH inside a derived table, which MSSQL
rejects (Invalid usage of the option NEXT in the FETCH statement) when the
subquery sits inside a pre-aggregation build (SELECT ... INTO). Legacy emits
no order there.

Set an empty order on the subquery's QueryProperties.
A segment projected as a pre-aggregation dimension is a boolean expression.
The native planner emitted it raw, which MSSQL rejects (no boolean type). Mark
such members (segment resolved via add_dimension_evaluator) and render them
through a new wrap_segment_select template: MSSQL CAST(... AS BIT),
CubeStore/MySQL IF(...,1,0), default unchanged.
…e-aggregation

A segment is stored as a boolean column in a rollup. On read MSSQL renders it
as a BIT column, which can't be used as a bare WHERE predicate. Add a
wrap_segment_filter template (default passthrough, MSSQL/MySQL/CubeStore
`{{ expr }} = 1`) and apply it in BaseSegment::to_sql when the filters context
is reading a pre-aggregation.
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Jun 2, 2026

Claude finished @waralexrom's task in 1m 4s —— View job


Code Review — fix(tesseract): make pre-aggregations work under native SQL planner (delta since last review)

Verdict: LGTM. The three new commits (51c6b85, b2a8f24, ef20929) extend Tesseract→legacy parity for segments-as-dimensions and dimension subquery ordering. Findings on the new delta: 0 high, 0 medium, 2 nits. Earlier-review findings still apply.

Details — new commits, file-by-file

What's new since the last review (6e0e31c..HEAD)

  • 51c6b85 — no default ORDER BY in dimension subquery (dimension_subquery_planner.rs)
  • b2a8f24 — wrap segment-as-dimension per dialect (new SegmentDimensionSqlNode, wrap_segment_select/wrap_segment_filter templates, MemberExpressionSymbol::with_is_segment, MSSQL/MySQL/CubeStore overrides)
  • ef20929 — when reading a pre-aggregation, compare segment to its truthy value (BaseSegment::to_sql + reading_pre_aggregation flag in FiltersContext)

Per-area notes

dimension_subquery_planner.rs:128.order_by(Some(vec![]))
Correct. The subquery is grouped by primary key and joined back by it, so any ordering is wasted work; the explicit empty list also prevents MSSQL choking when the derived table grows an OFFSET/FETCH. Comment captures the why well.

SegmentDimensionSqlNode (physical_plan/sql_nodes/segment_dimension.rs)
Wraps only MemberSymbol::MemberExpression whose is_segment() is true; everything else passes through. The placement in factory.rs:194 is right — only on the build path; the read path keeps RenderReferencesSqlNode because the segment is already a stored column.

MemberExpressionSymbol::with_is_segment (member_expression_symbol.rs)
Mirrors the existing with_parenthesized builder. Default false for is_segment in new(...) preserves existing behavior for all other call sites. Compiler::add_dimension_evaluator_by_path is the only producer — chained with with_parenthesized() so a segment-as-dimension is always both. Good.

BaseSegment::to_sql + FiltersContext.reading_pre_aggregation
Correctly gates on filters_ctx.reading_pre_aggregation (derived from nodes_factory.reading_pre_aggregation() = !pre_aggregation_dimensions_references.is_empty()), so live-base-table filters still emit a bare boolean. Both VisitorContext::new and ::new_with_cube_alias_prefix are updated. Nit below.

JS dialect templates

  • BaseQuery.js: defaults to identity ({{ expr }}) for both — passthrough for Postgres/BigQuery/Snowflake/etc. ✓
  • CubeStoreQuery.ts: IF({{ expr }}, 1, 0) / {{ expr }} = 1. ✓
  • MysqlQuery.ts: same as CubeStore. ✓
  • MssqlQuery.ts: CAST((CASE WHEN ... THEN 1 ELSE 0 END) AS BIT) / {{ expr }} = 1. Comments call out the BIT rationale clearly.

Nits (low priority)

  1. BaseSegment::to_sql only checks reading_pre_aggregation — not the dialect's wrapper. For Postgres/BigQuery the wrapper renders identity, so it's a no-op string copy, but you still pay a template render per segment under a rollup read. Probably negligible; mentioning only because the symmetric build-side SegmentDimensionSqlNode likewise unconditionally calls wrap_segment_select for any segment. If template render shows up in profiles, gate on a dialect-level "needs wrap" boolean.

  2. SegmentDimensionSqlNode activation is broad. It wraps every segment projected via the evaluate_sql_processor path, even though the bug it fixes only requires wrapping when the segment is selected/grouped (not when it appears inside a larger expression that already coerces it). For Postgres this is moot; for MSSQL the CAST(... AS BIT) becomes visible inside derived expressions where the user might have expected the raw boolean. Worth a follow-up only if someone reports it — the legacy builder behaves similarly here.

Sanity checks I ran

  • All new templates are defined in the mock renderer (mock_sql_templates_render.rs) so Rust tests don't NPE on lookup.
  • is_segment defaults to false in MemberExpressionSymbol::new, so no other producer is affected.
  • reading_pre_aggregation flows through both VisitorContext constructors — no path bypasses it.
  • No new override needed on OracleQuery/SqliteQuery/PrestodbQuery — they don't currently support pre-aggregations with segments-as-dimensions; the BaseQuery passthrough is safe.
· branch: `tesseract-enable-all-pre-aggrs-test`

@waralexrom waralexrom changed the title fix(tesseract): make pre-aggregations work under native SQL planner fix(tesseract): make pre-aggregations tests work under native SQL planner Jun 2, 2026
@waralexrom waralexrom merged commit 33ca484 into master Jun 2, 2026
187 of 191 checks passed
@waralexrom waralexrom deleted the tesseract-enable-all-pre-aggrs-test branch June 2, 2026 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data source driver javascript Pull requests that update Javascript code rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants