Skip to content

[Bug](nereids) distributionPrune doesn't slice nereidsPrunedTabletIds per partition #63854

@Larborator

Description

@Larborator

Search before asking

  • I had searched in the issues and found no similar issues.

Version

4.x

What's Wrong?

OlapScanNode.distributionPrune returns the entire nereidsPrunedTabletIds set whenever the query goes through Nereids. The caller OlapScanNode.computeTabletInfo invokes it inside a per-partition loop:

// fe-core/src/main/java/org/apache/doris/planner/OlapScanNode.java
for (Long partitionId : selectedPartitionIds) {
    final MaterializedIndex selectedTable = partition.getIndex(selectedIndexId);
    Collection<Long> prunedTabletIds = distributionPrune(..., true);
    //   ↑ returns the GLOBAL nereidsPrunedTabletIds set, copied into a new ArrayList,
    //     not just tablets belonging to this partition

    if (prunedTabletIds != null) {
        for (Long id : prunedTabletIds) {
            if (selectedTable.getTablet(id) != null) {       // hash lookup
                tablets.add(selectedTable.getTablet(id));    // hash lookup again
                scanTabletIds.add(id);
            }
            // else: id belongs to another partition; result is null and discarded
        }
    }
}

Because nereidsPrunedTabletIds is the union of all selected tablets across all partitions (set in PhysicalPlanTranslator.visitPhysicalOlapScan):

olapScanNode.setNereidsPrunedTabletIds(
    new LinkedHashSet<>(olapScan.getSelectedTabletIds()));

each per-partition iteration walks the entire global set and does a MaterializedIndex.getTablet(id) HashMap lookup on ids that belong to other partitions. The vast majority of these lookups return null and are discarded, but the hash work is already paid.

Complexity

Let P = selectedPartitionIds.size() and M = nereidsPrunedTabletIds.size().

  • getTablet calls: O(P × M × 2) instead of O(M × 2)
  • HashSet → ArrayList copies inside distributionPrune: O(P × M) elements

The regression is most pronounced on tables whose distribution layout is many partitions × few buckets per partition (e.g. fine-grained hourly partitioning with 2 buckets per partition), because then MP × bucketsPerPartition, and the total getTablet calls grow as O(P² × bucketsPerPartition).

Concrete example

A range-partitioned table with hourly partitions and 2 buckets per partition, queried with a one-year time range and a bucket-key equality predicate that prunes one of the two buckets per partition:

  • P ≈ 24 × 365 ≈ 8760 partitions
  • M ≈ 8760 (one tablet per partition selected after bucket pruning)
  • Total getTablet lookups: ≈ 76 million
  • HashSet copies: ≈ 76M element copies

CPU profiling of the FE plan thread:

OlapScanNode.computeTabletInfo                      83%
  ├─ MaterializedIndex.getTablet -> HashMap.getNode 54%
  ├─ ArrayList.<init> -> HashSet.toArray            17%
  └─ OlapScanNode.addScanRangeLocations              2%

In practice this means a query that planned in ~0.1 s before #53403 now spends ~2.7 s in Nereids Translate Time for the same SQL on the same data.

What You Expected?

distributionPrune under the Nereids path should only return tablets belonging to the partition currently being processed, matching the semantics of the non-Nereids HashDistributionPruner path. Plan time should not depend quadratically on partition count.

How to Reproduce?

  1. Create a range-partitioned table with many partitions and few buckets per partition, e.g.:

    CREATE TABLE t (
        event_time DATETIME,
        user_id    BIGINT,
        v          BIGINT
    )
    DUPLICATE KEY(event_time, user_id)
    PARTITION BY RANGE(event_time) (...)              -- hourly, ~8000 partitions
    DISTRIBUTED BY HASH(user_id) BUCKETS 2;
  2. Run a query that selects most partitions plus a bucket-key predicate that prunes most of the buckets per partition:

    SET enable_profile = true;
    EXPLAIN SELECT * FROM t
      WHERE event_time >= '2025-01-01' AND event_time < '2026-01-01'
        AND user_id = 12345;
  3. SHOW QUERY PROFILE "/<query_id>" and look at Nereids Translate Time. Compare against a build prior to [fix](nereids) fix prune tablets twice #53403, or against an equivalent table whose partition count is small.

  4. Optional — attach an async-profiler CPU profile to the FE process during a stress run; the hottest stack will be MaterializedIndex.getTablet -> HashMap.getNode under OlapScanNode.computeTabletInfo.

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions