You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I had searched in the issues and found no similar issues.
Version
4.x
What's Wrong?
OlapScanNode.distributionPrune returns the entire nereidsPrunedTabletIds set whenever the query goes through Nereids. The caller OlapScanNode.computeTabletInfo invokes it inside a per-partition loop:
// fe-core/src/main/java/org/apache/doris/planner/OlapScanNode.javafor (LongpartitionId : selectedPartitionIds) {
finalMaterializedIndexselectedTable = partition.getIndex(selectedIndexId);
Collection<Long> prunedTabletIds = distributionPrune(..., true);
// ↑ returns the GLOBAL nereidsPrunedTabletIds set, copied into a new ArrayList,// not just tablets belonging to this partitionif (prunedTabletIds != null) {
for (Longid : prunedTabletIds) {
if (selectedTable.getTablet(id) != null) { // hash lookuptablets.add(selectedTable.getTablet(id)); // hash lookup againscanTabletIds.add(id);
}
// else: id belongs to another partition; result is null and discarded
}
}
}
Because nereidsPrunedTabletIds is the union of all selected tablets across all partitions (set in PhysicalPlanTranslator.visitPhysicalOlapScan):
each per-partition iteration walks the entire global set and does a MaterializedIndex.getTablet(id) HashMap lookup on ids that belong to other partitions. The vast majority of these lookups return null and are discarded, but the hash work is already paid.
Complexity
Let P = selectedPartitionIds.size() and M = nereidsPrunedTabletIds.size().
getTablet calls: O(P × M × 2) instead of O(M × 2)
HashSet → ArrayList copies inside distributionPrune: O(P × M) elements
The regression is most pronounced on tables whose distribution layout is many partitions × few buckets per partition (e.g. fine-grained hourly partitioning with 2 buckets per partition), because then M ≈ P × bucketsPerPartition, and the total getTablet calls grow as O(P² × bucketsPerPartition).
Concrete example
A range-partitioned table with hourly partitions and 2 buckets per partition, queried with a one-year time range and a bucket-key equality predicate that prunes one of the two buckets per partition:
P ≈ 24 × 365 ≈ 8760 partitions
M ≈ 8760 (one tablet per partition selected after bucket pruning)
In practice this means a query that planned in ~0.1 s before #53403 now spends ~2.7 s in Nereids Translate Time for the same SQL on the same data.
What You Expected?
distributionPrune under the Nereids path should only return tablets belonging to the partition currently being processed, matching the semantics of the non-Nereids HashDistributionPruner path. Plan time should not depend quadratically on partition count.
How to Reproduce?
Create a range-partitioned table with many partitions and few buckets per partition, e.g.:
CREATETABLEt (
event_time DATETIME,
user_id BIGINT,
v BIGINT
)
DUPLICATE KEY(event_time, user_id)
PARTITION BY RANGE(event_time) (...) -- hourly, ~8000 partitions
DISTRIBUTED BY HASH(user_id) BUCKETS 2;
Run a query that selects most partitions plus a bucket-key predicate that prunes most of the buckets per partition:
SET enable_profile = true;
EXPLAIN SELECT*FROM t
WHERE event_time >='2025-01-01'AND event_time <'2026-01-01'AND user_id =12345;
SHOW QUERY PROFILE "/<query_id>" and look at Nereids Translate Time. Compare against a build prior to [fix](nereids) fix prune tablets twice #53403, or against an equivalent table whose partition count is small.
Optional — attach an async-profiler CPU profile to the FE process during a stress run; the hottest stack will be MaterializedIndex.getTablet -> HashMap.getNode under OlapScanNode.computeTabletInfo.
Search before asking
Version
4.x
What's Wrong?
OlapScanNode.distributionPrunereturns the entirenereidsPrunedTabletIdsset whenever the query goes through Nereids. The callerOlapScanNode.computeTabletInfoinvokes it inside a per-partition loop:Because
nereidsPrunedTabletIdsis the union of all selected tablets across all partitions (set inPhysicalPlanTranslator.visitPhysicalOlapScan):each per-partition iteration walks the entire global set and does a
MaterializedIndex.getTablet(id)HashMap lookup on ids that belong to other partitions. The vast majority of these lookups returnnulland are discarded, but the hash work is already paid.Complexity
Let
P = selectedPartitionIds.size()andM = nereidsPrunedTabletIds.size().getTabletcalls: O(P × M × 2) instead of O(M × 2)distributionPrune: O(P × M) elementsThe regression is most pronounced on tables whose distribution layout is many partitions × few buckets per partition (e.g. fine-grained hourly partitioning with 2 buckets per partition), because then
M≈P × bucketsPerPartition, and the totalgetTabletcalls grow as O(P² × bucketsPerPartition).Concrete example
A range-partitioned table with hourly partitions and 2 buckets per partition, queried with a one-year time range and a bucket-key equality predicate that prunes one of the two buckets per partition:
P≈ 24 × 365 ≈ 8760 partitionsM≈ 8760 (one tablet per partition selected after bucket pruning)getTabletlookups: ≈ 76 millionCPU profiling of the FE plan thread:
In practice this means a query that planned in ~0.1 s before #53403 now spends ~2.7 s in
Nereids Translate Timefor the same SQL on the same data.What You Expected?
distributionPruneunder the Nereids path should only return tablets belonging to the partition currently being processed, matching the semantics of the non-NereidsHashDistributionPrunerpath. Plan time should not depend quadratically on partition count.How to Reproduce?
Create a range-partitioned table with many partitions and few buckets per partition, e.g.:
Run a query that selects most partitions plus a bucket-key predicate that prunes most of the buckets per partition:
SHOW QUERY PROFILE "/<query_id>"and look atNereids Translate Time. Compare against a build prior to [fix](nereids) fix prune tablets twice #53403, or against an equivalent table whose partition count is small.Optional — attach an async-profiler CPU profile to the FE process during a stress run; the hottest stack will be
MaterializedIndex.getTablet -> HashMap.getNodeunderOlapScanNode.computeTabletInfo.Anything Else?
No response
Are you willing to submit PR?
Code of Conduct