Summary
For a selection query ... ORDER BY <identifier> [DESC] LIMIT n, Pinot's single-stage engine uses MinMaxValueBasedSelectionOrderByCombineOperator, which skips a segment when its column min/max can't beat the current top-n boundary (so an unfiltered "latest n" query touches only a couple of segments). Adding any filter predicate — even a trivially-true one on an unrelated column — silently defeats this optimization, and the query falls back to scanning all segments.
Environment
- Apache Pinot
1.6.0-SNAPSHOT (commit dd6520c726761a6988107b96bfa41daf6617e522)
- OFFLINE table, single-stage engine (
SET useMultistageEngine=false). The multi-stage engine shows the same behavior at the leaf stage.
Reproduction
Any multi-segment OFFLINE table works. Here: TPC-H lineitem, 600,037,902 rows across 300 OFFLINE segments, time column l_ship_ts (TIMESTAMP), data laid out so each segment covers a contiguous l_ship_ts range (tight per-segment min/max). Numbers are from the broker response.
| # |
query |
numSegmentsProcessed |
numDocsScanned |
time |
| 1 |
SELECT * FROM lineitem ORDER BY l_ship_ts DESC LIMIT 10 |
2 |
281,125 |
143 ms |
| 2 |
SELECT * FROM lineitem WHERE l_orderkey > 0 ORDER BY l_ship_ts DESC LIMIT 10 |
300 |
26,930,581 |
2,864 ms |
| 3 |
SELECT * FROM lineitem WHERE l_ship_ts < CAST('1998-11-22 00:00:00' AS TIMESTAMP) ORDER BY l_ship_ts DESC LIMIT 10 |
299 |
28,812,078 |
2,569 ms |
Expected behavior
The min/max segment-skip should remain valid in the presence of a filter. For descending order a segment is skipped when segment.max(orderByCol) < boundary; since the filtered rows are a subset of all rows, if the segment's (unfiltered) max is already below the current boundary then no row — filtered or not — can enter the top-n. So #2 and #3 should also process ~2 segments.
Pointers
CombinePlanNode#getCombineOperator selects MinMaxValueBasedSelectionOrderByCombineOperator for isSelectionQuery && limit > 0 && orderBy[0] is IDENTIFIER — with no filter guard — so the operator is used for filtered queries.
MinMaxValueBasedSelectionOrderByCombineOperator#processSegments reads each segment's DataSourceMetadata.getMin/MaxValue() and skips via segment.max < boundary (DESC) once a segment returns ≥ limit + offset rows. This skip logic does not reference the filter.
So the operator is selected and the skip would be safe, yet a filter prevents the skip from engaging (the boundary is established — in #2 the newest segment returns 10 rows — but no segments are pruned). I have not pinned the exact root-cause line (boundary propagation / how the filtered per-segment results feed the combine); the behavior is fully reproducible via the trivial-filter case (#2).
Impact
Any "latest n matching rows" or keyset-pagination query with a predicate scans the whole table instead of a couple of segments (here 281K → 27M docs, 143 ms → 2.9 s). Notably, deep OFFSET without a filter stays cheap, so adding a WHERE counter-intuitively makes a top-n query dramatically slower.
Summary
For a selection query
... ORDER BY <identifier> [DESC] LIMIT n, Pinot's single-stage engine usesMinMaxValueBasedSelectionOrderByCombineOperator, which skips a segment when its column min/max can't beat the current top-nboundary (so an unfiltered "latest n" query touches only a couple of segments). Adding any filter predicate — even a trivially-true one on an unrelated column — silently defeats this optimization, and the query falls back to scanning all segments.Environment
1.6.0-SNAPSHOT(commitdd6520c726761a6988107b96bfa41daf6617e522)SET useMultistageEngine=false). The multi-stage engine shows the same behavior at the leaf stage.Reproduction
Any multi-segment OFFLINE table works. Here: TPC-H
lineitem, 600,037,902 rows across 300 OFFLINE segments, time columnl_ship_ts(TIMESTAMP), data laid out so each segment covers a contiguousl_ship_tsrange (tight per-segment min/max). Numbers are from the broker response.numSegmentsProcessednumDocsScannedSELECT * FROM lineitem ORDER BY l_ship_ts DESC LIMIT 10SELECT * FROM lineitem WHERE l_orderkey > 0 ORDER BY l_ship_ts DESC LIMIT 10SELECT * FROM lineitem WHERE l_ship_ts < CAST('1998-11-22 00:00:00' AS TIMESTAMP) ORDER BY l_ship_ts DESC LIMIT 10Expected behavior
The min/max segment-skip should remain valid in the presence of a filter. For descending order a segment is skipped when
segment.max(orderByCol) < boundary; since the filtered rows are a subset of all rows, if the segment's (unfiltered) max is already below the current boundary then no row — filtered or not — can enter the top-n. So #2 and #3 should also process ~2 segments.Pointers
CombinePlanNode#getCombineOperatorselectsMinMaxValueBasedSelectionOrderByCombineOperatorforisSelectionQuery && limit > 0 && orderBy[0] is IDENTIFIER— with no filter guard — so the operator is used for filtered queries.MinMaxValueBasedSelectionOrderByCombineOperator#processSegmentsreads each segment'sDataSourceMetadata.getMin/MaxValue()and skips viasegment.max < boundary(DESC) once a segment returns≥ limit + offsetrows. This skip logic does not reference the filter.So the operator is selected and the skip would be safe, yet a filter prevents the skip from engaging (the boundary is established — in #2 the newest segment returns 10 rows — but no segments are pruned). I have not pinned the exact root-cause line (boundary propagation / how the filtered per-segment results feed the combine); the behavior is fully reproducible via the trivial-filter case (#2).
Impact
Any "latest n matching rows" or keyset-pagination query with a predicate scans the whole table instead of a couple of segments (here 281K → 27M docs, 143 ms → 2.9 s). Notably, deep
OFFSETwithout a filter stays cheap, so adding aWHEREcounter-intuitively makes a top-nquery dramatically slower.