fix(tesseract): multi-stage pre-aggregation usage substitution and IS NOT DISTINCT FROM#10925
Conversation
…e-aggregations Tesseract emits SQL with `<base>__usage_N` suffixes when a multi-stage query needs multiple distinct scans of the same pre-aggregation (e.g. a base measure plus a multi-stage measure over it in one query). The orchestrator only populated `usageTargetTableNames` inside the partitioned branch of `PreAggregationPartitionRangeLoader.loadPreAggregations`, so for non-partitioned rollups the suffix-aware replacement never ran; `QueryCache.replacePreAggregationTableNames` then matched the bare base name inside `<base>__usage_0` and left the suffix glued onto the substituted full table name, producing `prod_pre_aggregations.<full>__usage_0 was not found` against Cubestore. Mirror the partitioned branch for the non-partitioned path: when `usageMapping` is present on the description, map every suffix to the single `result.targetTableName`. Coverage: - Integration test in schema-compiler (`two multi-stage branches sharing one pre-aggregation`) that exercises the multi-usage planner path. - Testing-drivers test `querying BigECommerce: base measure plus multi-stage over non-partitioned pre-aggregation` reproducing the production failure end-to-end against real Postgres + Cubestore. - New `CategoryFlat` non-partitioned pre-aggregation in driver fixtures to host the scenario; test enabled in Tesseract mode for the drivers that handle Tesseract cleanly (postgres, bigquery, databricks-jdbc, athena, mysql, redshift, snowflake) and skipped for mssql/clickhouse.
Without the operator registered, Tesseract falls back to `a = b OR (a IS NULL AND b IS NULL)` for every null-safe equality in multi-stage join conditions. For a query with N dimensions this produces a JOIN ON with N such triplets — Cubestore handles it, but the SQL balloons and is unreadable in logs. Cubestore's SQL frontend supports `IS NOT DISTINCT FROM` natively, so register the operator template the same way Postgres/BigQuery/Snowflake do. Verified on the testing-drivers `base measure plus multi-stage over non-partitioned pre-aggregation` scenario: the generated JOIN ON collapses from per-dim `(= ... OR (IS NULL AND IS NULL))` clauses to single `IS NOT DISTINCT FROM` predicates and the test still passes end-to-end against real Cubestore.
|
Claude finished @waralexrom's task in 7m 7s —— View job Code Review: Multi-stage pre-aggregation fix
Overall AssessmentThis is a well-targeted bug fix with good test coverage. The core change is minimal and correct — filling in the non-partitioned branch that was missing Core Fix —
|
| Area | Status | Notes |
|---|---|---|
| Core bug fix | ✅ Correct | Non-partitioned branch now populates usageTargetTableNames |
| IS NOT DISTINCT FROM | ✅ Correct | Follows established pattern |
| Integration test | ✅ Good coverage | Validates two-branch multi-usage scenario |
| Driver test - base measure | ✅ Complete | Snapshots and skips for all drivers |
| Driver test - two branches | Only postgres has skip entry and snapshot | |
| Snapshot changes | ✅ Clean | Only additions + one benign reorder |
The core fix is solid. The main item to address is the missing skip/tesseractSkip entries (or snapshots) for the "two multi-stage branches" test across non-postgres drivers.
…ithout snapshots The earlier-added `querying BigECommerce: two multi-stage branches sharing one pre-aggregation` test only has a snapshot in postgres-full, but was not present in the skip lists of the other driver fixtures. In Tesseract mode Jest then tries to run it against those drivers (databricks-jdbc, athena, mysql, redshift, snowflake, bigquery, clickhouse, mssql) and fails with `New snapshot was not written. The update flag must be explicitly passed to write a new snapshot.` Add the test name to both `skip` and `tesseractSkip` for every non- postgres driver fixture (it was already in `tesseractSkip` for clickhouse/mssql via the previous commit pattern), matching how `base measure plus multi-stage over non-partitioned pre-aggregation` is handled where snapshots are missing.
…acts BigQuery returns sums with trailing FP precision noise (e.g. `459.7526` → `459.75260000000003`, `-0.2355` → `-0.23549999999999993`) that the postgres NUMERIC representation does not exhibit. The snapshot for `querying BigECommerce: base measure plus multi-stage over non-partitioned pre-aggregation` was seeded from the postgres values when the test was tiraged across drivers; update it to the values actually produced by BigQuery so CI passes.
Athena returns the same FP precision noise as BigQuery for the `base measure plus multi-stage over non-partitioned pre-aggregation` test. Apply the same set of value adjustments as the bigquery snap.
…ckhouse ClickHouse cannot build the non-partitioned `BigECommerce.CategoryFlat` rollup. The corresponding test (`base measure plus multi-stage over non-partitioned pre-aggregation`) is already skipped in both regular and tesseract skip lists for clickhouse, so the rollup is never used — skip the build call too to avoid failing the `must built pre-aggregations` step on clickhouse-full.
…se fixture ClickHouse cannot build the non-partitioned `BigECommerce.CategoryFlat` rollup that was added for the multi-usage repro test. The test itself is already skipped in both regular and tesseract skip lists for clickhouse, so the rollup is dead config on this driver — remove it so the cubestore build step does not have to materialize it.
…ranches test A stray edit replaced one of the multi-stage measures with the regular `revenue` measure, which broke the assertion. Restore the original `revenue_no_id_sum` + `revenue_no_id_pct` pair the test was designed to exercise.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #10925 +/- ##
===========================================
- Coverage 78.93% 58.41% -20.53%
===========================================
Files 470 216 -254
Lines 92862 17000 -75862
Branches 3449 3450 +1
===========================================
- Hits 73304 9931 -63373
+ Misses 19054 6564 -12490
- Partials 504 505 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary
When a multi-stage query needs several distinct scans of the same pre-aggregation, Tesseract emits SQL with
<base>__usage_Nsuffixes (e.g.payer_360_lives_cube_julio__usage_0). The orchestrator's suffix-aware substitution lived only inside the partitioned branch ofPreAggregationPartitionRangeLoader.loadPreAggregations, so for non-partitioned rollups the suffix was never mapped to a real table name. Combined with the substitution's lack of word boundaries, this left__usage_Nglued onto the substituted full table name and Cubestore failed withTable prod_pre_aggregations.<full>__usage_0 was not found.This PR fixes the substitution path for non-partitioned rollups and, while at it, registers
IS NOT DISTINCT FROMin the Cubestore SQL templates so multi-stage JOIN ON clauses don't balloon intoOR (a IS NULL AND b IS NULL)triplets per dimension.Changes
PreAggregationPartitionRangeLoader.ts— populateusageTargetTableNamesin the non-partitionedelsebranch too. For non-partitioned rollups every__usage_Nmaps to the singleresult.targetTableName.CubeStoreQuery.ts— registeroperators.is_not_distinct_from = 'IS NOT DISTINCT FROM'(Cubestore's SQL frontend supports it natively).cubejs-schema-compiler(two multi-stage branches sharing one pre-aggregation) exercising the multi-usage planner path.cubejs-testing-drivers(querying BigECommerce: base measure plus multi-stage over non-partitioned pre-aggregation) reproducing the production failure end-to-end against real Postgres + Cubestore. New non-partitionedCategoryFlatpre-aggregation added to driver fixtures to host the scenario.*-fullplusexport-bucket-*/encrypted-pkvariants). Test is skipped for mssql (integer encoding of decimals) and clickhouse (FP artifacts) — they keep it intesseractSkip. Non-Tesseract mode skipped everywhere (multi-stage requires Tesseract).Testing
Error during planning: Table prod_pre_aggregations.big_e_commerce__category_flat_external_<hashes>__usage_1 was not found.--mode=local).IS NOT DISTINCT FROMtemplate — JOIN ON clauses collapse from per-dim(a = b OR (a IS NULL AND b IS NULL))to single(a IS NOT DISTINCT FROM b)predicates; values in the snapshot unchanged.Test plan
PreAggregationsMultiStagedescribe.-uto catch any per-driver decimal-format discrepancies.