[improvement](fe) TopN lazy materialization support struct/variant nested column pruning #63736
englefly wants to merge 3 commits into
…sted column pruning

Core changes:
- LazyMaterializeTopN: find projects below TopN with PreferPushDownProject or variant subColPath expressions, simplify them to expose base columns as lazy candidates. Pull up expressions above Materialize for lazy mat.
- LazySlotPruning: extract shouldPruneChild() for safe override when logical properties are stale after plan restructuring
- OperativeColumnDerive: skip PreferPushDownProject input slots from operative propagation so struct/variant columns can be lazy
- PhysicalLazyMaterialize: propagate access paths to lazy output slots
- MaterializationNode: use materializeTupleDescriptor for nested column info display (was outputTupleDesc, never set)
- PlanNode: add subColLables to printNestedColumns for variant

Tests cover struct (struct_element + pruned type) and variant (element_at + sub path) lazy materialization with nested column pruning.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
I found one blocking correctness issue that should be fixed before merge.
Critical checkpoint conclusions:
- Goal/test coverage: The PR aims to enable TopN lazy materialization for nested/PPD expressions. It adds regression coverage for several `id, nested_expr` shapes, but misses output-order permutations where the pulled-up expression appears before other selected columns.
- Scope/focus: The feature is generally focused, but the final TopN projection reconstruction changes user-visible column order in the new pulled-up expression path.
- Concurrency/lifecycle: No new shared mutable concurrency or non-obvious lifecycle issue found in the reviewed FE paths.
- Config/compatibility: No new config or incompatible storage/protocol format change found.
- Parallel paths: Normal and `using_index` TopN lazy materialization cases were reviewed; the reported issue is specific to the new nested/PPD pull-up path.
- Special checks/error handling: No ignored Status or FE exception-boundary issue found in the actual PR diff.
- Tests/results: Regression tests are deterministic, but they do not cover pulled-up nested expressions in non-tail output positions.
- Observability/performance/transactions: No additional observability, performance, transaction, or persistence blocker found for this PR.
User focus: No additional user-provided review focus was specified.
            }
        }
        outputExprs.addAll(pulledUpExprs);
        result = new PhysicalProject(ImmutableList.copyOf(outputExprs), null, result);
This rebuilds the final projection as all non-pulled slots followed by all pulled-up expressions, which changes the user-visible column order whenever a pulled-up nested expression is not already the last selected column. For example select substring(struct_element(struct_col, 'city'), 1) as city, id ... order by id limit 3 has userVisibleOutput = [city, id], but this code produces [id, city], so the result schema and row values are swapped relative to the SQL. Please preserve the original userVisibleOutput order by replacing each pulled slot in-place with its corresponding pulled-up expression, and add a regression case with the nested expression before another selected column.
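A minimal sketch of the in-place replacement suggested above. It models slots and expressions as plain strings rather than Nereids `Slot`/`NamedExpression` objects, and the names `OrderPreservingProjection`, `rebuildOutput`, and `pulledUpBySlot` are hypothetical, not taken from the PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: strings model slots and expressions; this is not the Doris API.
final class OrderPreservingProjection {
    // Walks the user-visible output in its original order and substitutes each
    // pulled slot with its pulled-up expression in place. Slots that were not
    // pulled up are passed through unchanged.
    static List<String> rebuildOutput(List<String> userVisibleOutput,
                                      Map<String, String> pulledUpBySlot) {
        List<String> outputExprs = new ArrayList<>(userVisibleOutput.size());
        for (String slot : userVisibleOutput) {
            outputExprs.add(pulledUpBySlot.getOrDefault(slot, slot));
        }
        return outputExprs;
    }

    public static void main(String[] args) {
        Map<String, String> pulledUp = new LinkedHashMap<>();
        pulledUp.put("city", "substring(struct_element(struct_col, 'city'), 1) AS city");
        // The nested expression stays in the first position, ahead of id.
        System.out.println(rebuildOutput(List.of("city", "id"), pulledUp));
    }
}
```

Compared with appending all pulled-up expressions at the tail, this keeps one output entry per entry of `userVisibleOutput`, so the result schema cannot be reordered by the rewrite.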
TPC-H: Total hot run time: 32544 ms
TPC-DS: Total hot run time: 172873 ms
Previously, TopN lazy materialization only worked for top-level scalar
columns. Complex-type projection expressions like struct_element() and
element_at() remained eagerly evaluated at the scan, losing the benefit
of reading the expensive base column for only the TopN-selected rows.
This PR extends TopN lazy materialization to recognize complex-type
projection expressions (struct_element, element_at, map_keys, etc.)
and defer reading their base columns (struct, variant, map, array)
until after the TopN filter step.
Motivation
For queries like:
The struct column is very wide (many fields). Previously, struct_col
was read for ALL rows before sorting, even though only 10 rows survive
TopN. Now struct_col becomes a lazy slot — the scan outputs only the
columns needed for sorting (id) + rowId; struct_col is fetched remotely
only for the 10 winning rows, with nested column pruning further
limiting the read to just the 'city' sub-field.
The same applies to variant (element_at + subColPath), map/array
(element_at subscript), and map functions (map_keys, map_values, etc.).
Core Changes
1. LazyMaterializeTopN — restructure plan to pull up PPD/subPath exprs
- `findLeafProject()` walks the standard TopN physical plan shape (MERGE_SORT → Distribute → LOCAL_SORT → Project) to locate the leaf project where complex-type expressions reside.
- `isNestedLazyExpression()` identifies lazy candidates via two paths:
  a) `containsType(PreferPushDownProject.class)`: catches StructElement, ElementAt (map/array/variant subscript), MapKeys, MapValues, MapContainsKey, MapContainsValue, and other functions
  b) `(SlotReference) slot.hasSubColPath()`: catches variant sub-path slots generated by VariantSubPathPruning
- The leaf project is split: PPD/subPath expressions are pulled up into a new project above Materialize, while their input slots (base columns like struct_col, payload, map_col, arr_col) become lazy candidates.
- `replaceLeafProject()` rebuilds the plan chain bottom-up after removing pulled-up expressions. New PhysicalProject nodes use null LogicalProperties because outputs have changed.
- `createLazySlotPruning()` returns an anonymous subclass that overrides `shouldPruneChild()` to always return true. This is necessary because intermediate nodes (localTopN, distribute) retain stale logical properties after plan restructuring that don't include the new lazy slots; the default `containsAll` check would skip the subtree.
- `computeMaterializeSource()` gains a fallback: when the lazy candidate is not directly in the TopN output (it's a hidden base column only referenced by a pulled-up PPD expression), probing starts from the leaf project's child (the scan) instead of from the TopN.
- Base slot access paths (`subColPath` / `allAccessPaths`) are carried from the baseSlot to the materialize slot so the BE receives the correct nested column / sub-path pruning metadata.
- When `moveRowIdsToTail` returns null (stale logical properties), `correctInput` is built from `materializedSlots + rowIds` directly instead of using `result.getOutput()`.
2. LazySlotPruning — extract shouldPruneChild() for override
The `child.getOutput().containsAll(context.lazySlots)` guard is extracted from `visit()` into a protected `shouldPruneChild()` method. This allows the caller to override it when logical properties are stale, without duplicating the entire traversal logic.
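The extraction described above is essentially a hook method. The following is a simplified sketch of that shape, not the actual `LazySlotPruning` class: `PlanStub` and `PruningVisitorSketch` are hypothetical stand-ins, slots are modeled as strings, and `visit()` only records which nodes it descends into.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a physical plan node: a name, output slot names, children.
final class PlanStub {
    final String name;
    final List<String> output;
    final List<PlanStub> children;

    PlanStub(String name, List<String> output, PlanStub... children) {
        this.name = name;
        this.output = output;
        this.children = Arrays.asList(children);
    }
}

class PruningVisitorSketch {
    // The guard lives in its own protected method so a caller can override it
    // without copying the traversal below.
    protected boolean shouldPruneChild(PlanStub child, Set<String> lazySlots) {
        return child.output.containsAll(lazySlots);
    }

    // Returns the names of the nodes the traversal descends into, root first.
    List<String> visit(PlanStub node, Set<String> lazySlots) {
        List<String> visited = new ArrayList<>();
        visited.add(node.name);
        for (PlanStub child : node.children) {
            if (shouldPruneChild(child, lazySlots)) {
                visited.addAll(visit(child, lazySlots));
            }
        }
        return visited;
    }
}
```

With a stale intermediate node whose recorded output lacks the lazy slot, the default guard stops at the root, while an anonymous subclass that returns `true` from `shouldPruneChild()` reaches the scan.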
3. OperativeColumnDerive — skip PreferPushDownProject input slots
- `visitLogicalProject()` is refactored: input slots are only propagated to operative when the output slot is actually needed upstream. Previously all input slots of complex expressions were unconditionally marked operative, blocking them from lazy materialization.
- New `addOperativeSlotsSkipPreferPushDownProject()` traverses the expression tree adding Slot nodes to operative, but skips entire subtrees rooted at PreferPushDownProject nodes (StructElement, ElementAt, MapKeys, etc.). This prevents base columns of complex types from being marked operative when they only appear as inputs to PPD functions, allowing them to be lazy.
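The subtree-skipping traversal can be illustrated with a small sketch. It assumes a toy expression model (`ExprStub` with a `Kind` enum) in place of the Nereids expression hierarchy, and `OperativeSlotSketch.collect` is a hypothetical name rather than the method added by this PR.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical expression model; not the Nereids Expression classes.
final class ExprStub {
    enum Kind { SLOT, PREFER_PUSH_DOWN, OTHER }

    final Kind kind;
    final String name;
    final List<ExprStub> children;

    ExprStub(Kind kind, String name, ExprStub... children) {
        this.kind = kind;
        this.name = name;
        this.children = Arrays.asList(children);
    }
}

final class OperativeSlotSketch {
    // Collects slot names that should be treated as operative for one expression.
    static Set<String> collect(ExprStub root) {
        Set<String> operative = new LinkedHashSet<>();
        addSkippingPreferPushDown(root, operative);
        return operative;
    }

    private static void addSkippingPreferPushDown(ExprStub expr, Set<String> operative) {
        if (expr.kind == ExprStub.Kind.PREFER_PUSH_DOWN) {
            // Skip the whole subtree, so the base column underneath is not marked operative.
            return;
        }
        if (expr.kind == ExprStub.Kind.SLOT) {
            operative.add(expr.name);
        }
        for (ExprStub child : expr.children) {
            addSkippingPreferPushDown(child, operative);
        }
    }
}
```

For an expression shaped like `concat(id, struct_element(struct_col, 'city'))`, only `id` is collected; `struct_col` stays out of the operative set and remains eligible for lazy materialization.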
4. PhysicalLazyMaterialize — propagate access paths to lazy outputs
Access paths (`allAccessPaths`, `predicateAccessPaths`, display variants) are copied to the lazy output slot. This ensures nested column pruning metadata survives through the materialization pipeline and reaches the BE.
5. MaterializationNode — fix nested column info display
`printNestedColumns()` uses `materializeTupleDescriptor` instead of `outputTupleDesc`. The latter was never set, so nested column info was silently missing from EXPLAIN output.
6. PlanNode — add subColLables to nested columns display
`printNestedColumns()` now prints `sub path: [a.b.c]` for variant sub-column paths, making them visible in EXPLAIN output.
Type Coverage
Note: map/array subscript uses ACCESS_ALL (entire column read) because
individual key/element lookup requires scanning the whole map/array.
The lazy materialization benefit comes from deferring this read to only
the TopN-surviving rows. Struct and variant additionally benefit from
sub-field/key pruning within the lazy read.
Constraints
- Assumes standard OLAP single-table TopN physical plan shape: MERGE_SORT → Distribute → LOCAL_SORT → Project. If the plan shape differs, lazy mat is skipped gracefully.
- Aggregate key tables are excluded (BE limitation).
- `topn_lazy_materialization_threshold` must be >= query LIMIT.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None