Skip to content

Selection ORDER BY <col> LIMIT n does a full segment scan when any filter is present (min/max segment-skip disabled) #18685

@gortiz

Description

@gortiz

Summary

For a selection query ... ORDER BY <identifier> [DESC] LIMIT n, Pinot's single-stage engine uses MinMaxValueBasedSelectionOrderByCombineOperator, which skips a segment when its column min/max can't beat the current top-n boundary (so an unfiltered "latest n" query touches only a couple of segments). Adding any filter predicate — even a trivially-true one on an unrelated column — silently defeats this optimization, and the query falls back to scanning all segments.

Environment

  • Apache Pinot 1.6.0-SNAPSHOT (commit dd6520c726761a6988107b96bfa41daf6617e522)
  • OFFLINE table, single-stage engine (SET useMultistageEngine=false). The multi-stage engine shows the same behavior at the leaf stage.

Reproduction

Any multi-segment OFFLINE table works. Here: TPC-H lineitem, 600,037,902 rows across 300 OFFLINE segments, time column l_ship_ts (TIMESTAMP), data laid out so each segment covers a contiguous l_ship_ts range (tight per-segment min/max). Numbers are from the broker response.

# query numSegmentsProcessed numDocsScanned time
1 SELECT * FROM lineitem ORDER BY l_ship_ts DESC LIMIT 10 2 281,125 143 ms
2 SELECT * FROM lineitem WHERE l_orderkey > 0 ORDER BY l_ship_ts DESC LIMIT 10 300 26,930,581 2,864 ms
3 SELECT * FROM lineitem WHERE l_ship_ts < CAST('1998-11-22 00:00:00' AS TIMESTAMP) ORDER BY l_ship_ts DESC LIMIT 10 299 28,812,078 2,569 ms

Expected behavior

The min/max segment-skip should remain valid in the presence of a filter. For descending order a segment is skipped when segment.max(orderByCol) < boundary; since the filtered rows are a subset of all rows, if the segment's (unfiltered) max is already below the current boundary then no row — filtered or not — can enter the top-n. So #2 and #3 should also process ~2 segments.

Pointers

  • CombinePlanNode#getCombineOperator selects MinMaxValueBasedSelectionOrderByCombineOperator for isSelectionQuery && limit > 0 && orderBy[0] is IDENTIFIER — with no filter guard — so the operator is used for filtered queries.
  • MinMaxValueBasedSelectionOrderByCombineOperator#processSegments reads each segment's DataSourceMetadata.getMin/MaxValue() and skips via segment.max < boundary (DESC) once a segment returns ≥ limit + offset rows. This skip logic does not reference the filter.

So the operator is selected and the skip would be safe, yet a filter prevents the skip from engaging (the boundary is established — in #2 the newest segment returns 10 rows — but no segments are pruned). I have not pinned the exact root-cause line (boundary propagation / how the filtered per-segment results feed the combine); the behavior is fully reproducible via the trivial-filter case (#2).

Impact

Any "latest n matching rows" or keyset-pagination query with a predicate scans the whole table instead of a couple of segments (here 281K → 27M docs, 143 ms → 2.9 s). Notably, deep OFFSET without a filter stays cheap, so adding a WHERE counter-intuitively makes a top-n query dramatically slower.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions