perf: only alias schema members referenced by the query#252
Conversation
When a dimension SQL references a struct field like `stage.stage_id`, the aliaser previously skipped it because `column_names.length !== 1`. After joining tables that have a column with the same name as the struct root (e.g. `devusers.stage` alongside `issue.stage`), DuckDB's binder raises an ambiguous column reference error before the query can run. Make the aliaser schema-aware: introspect each table's physical columns via `DESCRIBE` in the browser and node wrappers, then rewrite a multi-part column ref as `<tableName>.<root>.<rest...>` only when the leading identifier is a known column on the current table. Cross-table references (`customers.id`), already-qualified refs, lambda-bound identifiers, and unknown multi-part refs are left untouched. Falls back to the legacy length-1 behavior when schema information is not supplied, preserving backwards compatibility.
ensureTableSchemasAlias was collecting every measure and dimension across every table schema into a single batch and sending them through ensureColumnAliasBatch — two DuckDB roundtrips (parseExpressions and serializeExpressions) that scale linearly with the number of members in the batch. For tenants with ~8k fields, a single call took ~16.5s even when the active query referenced only a handful of them. Make ensureTableSchemaAliasSql take the Query as a required arg and only alias members the query actually touches — walking measures, dimensions, order, and filters (recursive and/or tree). joinPaths is intentionally excluded: its left/right values are table names, not member references, and the actual join SQL lives in TableSchema.joins[].sql which is already fully qualified. Members not referenced by the query are passed through untouched on the cloned schema, so DuckDB never parses them. For a widget referencing ~20 members out of 8k, that is roughly a 400x reduction in parse/serialize work per call. BREAKING CHANGE: - @devrev/meerkat-core: EnsureTableSchemaAliasSqlParams now requires `query: Query`. - @devrev/meerkat-browser: EnsureTableSchemasAliasParams now requires `query: Query`; the curried ensureTableSchemaAlias(connection) factory returns (tableSchemas, query) => .... - @devrev/meerkat-node: mirrored changes. Bumped meerkat-core, meerkat-browser, meerkat-node to 0.0.128. Unit tests cover query-driven filtering, member preservation for unreferenced members, filter-only refs, order-only refs, and joinPaths exclusion. All pre-existing tests updated to pass a query. Issue: ISS-299745
|
semantic-release (dev dep) pins lodash-es@^4.17.21, which npm resolves to 4.17.23 — flagged by Snyk as SNYK-JS-LODASH-15869625 (high) and SNYK-JS-LODASH-15869619 (medium). Latest lodash-es 4.18.1 carries the patches. Adds a root npm override for lodash-es >=4.18.1 so every nested semantic-release dep de-dupes to the patched version. Follows the same pattern as the earlier qs/picomatch/@tootallnate/once overrides from #244. Issues: ISS-279504, ISS-279506
|
…ration tests The two meerkat-node integration tests for ensureTableSchemaAliasSql invoke the core helper directly (rather than through the node wrapper), so they need the new required 'query' arg that drives referenced-member filtering. Reference every fixture member so the aliaser behaves the same as before. Issue: ISS-299745
| const isAliasableIdentifier = (identifier: string | undefined): boolean => { | ||
| if (!identifier) { | ||
| return false; | ||
| } | ||
| if (/\s/.test(identifier)) { | ||
| return false; | ||
| } | ||
| return ALIASABLE_IDENTIFIER_REGEX.test(identifier); | ||
| }; |
There was a problem hiding this comment.
what are we doing here?
| const candidate = filter as { | ||
| and?: unknown[]; | ||
| or?: unknown[]; | ||
| member?: string; | ||
| }; |
There was a problem hiding this comment.
is this type correct?
| const collectReferencedMembers = (query: Query): Set<string> => { | ||
| const referenced = new Set<string>(); | ||
|
|
||
| query.measures.forEach((m) => referenced.add(m)); | ||
| query.dimensions?.forEach((d) => referenced.add(d)); | ||
|
|
||
| if (query.order) { | ||
| Object.keys(query.order).forEach((m) => referenced.add(m)); | ||
| } | ||
|
|
||
| query.filters?.forEach((filter) => collectFilterMembers(filter, referenced)); | ||
|
|
||
| return referenced; | ||
| }; |
| const collectReferencedMembers = (query: Query): Set<string> => { | ||
| const referenced = new Set<string>(); | ||
|
|
||
| query.measures.forEach((m) => referenced.add(m)); | ||
| query.dimensions?.forEach((d) => referenced.add(d)); | ||
|
|
||
| if (query.order) { | ||
| Object.keys(query.order).forEach((m) => referenced.add(m)); | ||
| } | ||
|
|
||
| query.filters?.forEach((filter) => collectFilterMembers(filter, referenced)); | ||
|
|
||
| return referenced; | ||
| }; | ||
|
|
There was a problem hiding this comment.
we have similar functionality for tableName, getUsedTables, this util can be parallel to that
| scopedIdentifiers: Set<string> | ||
| scopedIdentifiers: Set<string>, | ||
| tableName: string, | ||
| knownTableNames?: Set<string> |
| scopedIdentifiers: Set<string>, | ||
| tableName?: string | ||
| tableName?: string, | ||
| knownTableNames?: Set<string> |
There was a problem hiding this comment.
why do we need to maintain a set of knownTables?
Addresses review feedback on #252: - Move the filter/order/measures/dimensions walker out of ensure-table-schema-alias-sql.ts into src/get-used-members/, mirroring the structure of src/get-used-table-schema/. - Reuse traverseMeerkatQueryFilter for the filter traversal so we share a single walker with getUsedTableSchema and get a properly typed path through MeerkatQueryFilter (drops the prior `unknown`-cast workaround). - Drop duplicate tests from ensure-table-schema-alias-sql.spec.ts that are now owned by get-used-members.spec.ts; keep the one integration test that asserts the aliaser filters descriptors end-to-end. - Add brief doc comments to ensureTableSchemaAliasSql explaining why TableSchema.joins[].sql is preserved as-is, and to the struct-field helpers clarifying what `knownTableNames` does (optional for the legacy fallback path). Issue: ISS-299745
Addresses review feedback on #252: - Move the filter/order/measures/dimensions walker out of ensure-table-schema-alias-sql.ts into src/get-used-members/, mirroring the structure of src/get-used-table-schema/. - Reuse traverseMeerkatQueryFilter for the filter traversal so we share a single walker with getUsedTableSchema and get a properly typed path through MeerkatQueryFilter (drops the prior `unknown`-cast workaround). - Drop duplicate tests from ensure-table-schema-alias-sql.spec.ts that are now owned by get-used-members.spec.ts; keep the one integration test that asserts the aliaser filters descriptors end-to-end. - Add brief doc comments to ensureTableSchemaAliasSql explaining why TableSchema.joins[].sql is preserved as-is, and to the struct-field helpers clarifying what `knownTableNames` does (optional for the legacy fallback path). Issue: ISS-299745
Summary
ensureTableSchemasAliaswas taking ~16.5s for tenants with ~8k fields even when the active query referenced only a handful of them:Every measure and dimension across every table schema was collected into one flat batch and pushed through
ensureColumnAliasBatch, which does two DuckDB roundtrips (parseExpressions+serializeExpressions) that scale linearly with the batch size. Most of that work was wasted on members the query never touches.What changed
ensureTableSchemaAliasSql(meerkat-core) now takes theQueryas a required arg and filters descriptors before batching. It walks:query.measuresquery.dimensionsquery.orderkeysquery.filters(recursiveand/ortree)joinPathsis intentionally excluded:left/rightare table names, not dotted member refs. The actual join SQL lives inTableSchema.joins[].sqlwhich is already fully qualified and bypassed by the aliaser.Members not referenced by the query are passed through untouched on the cloned schema — no parse/serialize roundtrip for them.
meerkat-browserandmeerkat-nodethread the new `query` arg through their thin wrappers.Expected impact
For a widget referencing ~20 members out of 8k, DuckDB sees 20 SQL strings per batch instead of 8000 — roughly a 400x reduction in parse/serialize work per call.
BREAKING CHANGE
API surface change on `@devrev/meerkat-core`, `@devrev/meerkat-browser`, `@devrev/meerkat-node`:
All three packages bumped to `0.0.128`. A companion devrev-web PR updates the 4 call sites.
Notes for reviewers
This PR currently includes the commit from #247 (`fix: qualify struct field access to avoid binder ambiguity post-join`) as a base, because it touches the same `ensureTableSchemaAliasSql` surface. Once #247 merges, this branch will rebase cleanly to only the perf commit. Please focus review on the top commit (`perf: only alias schema members referenced by the query`).
Test plan
Related