Closed
Labels
enhancement (New feature or request), performance (Make DataFusion faster)
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I was running benchmarks this morning and noticed this fragment of the query plan when running TPC-H query 5.
CoalesceBatchesExec: target_batch_size=4096
  RepartitionExec: partitioning=Hash([Column { name: "n_nationkey" }], 24)
    RepartitionExec: partitioning=RoundRobinBatch(24)
      ParquetExec: batch_size=8192,
It seems redundant to repartition with round-robin and then immediately repartition again using a hash.
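To make the redundancy concrete, here is a minimal sketch of the rewrite on a toy plan tree. The `Plan` enum and `remove_redundant_repartition` are hypothetical names invented for illustration; they are not DataFusion's `ExecutionPlan` API, and a real rule would need to account for more operators.

```rust
/// Toy stand-in for a physical plan node (hypothetical, not DataFusion's API).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    ParquetScan,
    RoundRobin { partitions: usize, input: Box<Plan> },
    Hash { column: String, partitions: usize, input: Box<Plan> },
    CoalesceBatches { target_batch_size: usize, input: Box<Plan> },
}

/// Drops a round-robin repartition that feeds directly into a hash repartition.
fn remove_redundant_repartition(plan: Plan) -> Plan {
    match plan {
        Plan::Hash { column, partitions, input } => {
            // Rewrite the child first so nested cases are handled bottom-up.
            let child = remove_redundant_repartition(*input);
            // If the child is a round-robin repartition, skip over it.
            let child = match child {
                Plan::RoundRobin { input: inner, .. } => *inner,
                other => other,
            };
            Plan::Hash { column, partitions, input: Box::new(child) }
        }
        Plan::RoundRobin { partitions, input } => Plan::RoundRobin {
            partitions,
            input: Box::new(remove_redundant_repartition(*input)),
        },
        Plan::CoalesceBatches { target_batch_size, input } => Plan::CoalesceBatches {
            target_batch_size,
            input: Box::new(remove_redundant_repartition(*input)),
        },
        Plan::ParquetScan => Plan::ParquetScan,
    }
}
```

Applied to the fragment above, this would leave the hash repartition on n_nationkey reading straight from the scan. One thing a real rule may need to weigh: without the round-robin step, the hash repartition consumes only as many input partitions as the scan produces.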
A more complex example:
RepartitionExec: partitioning=Hash([Column { name: "r_regionkey" }], 24)
  CoalesceBatchesExec: target_batch_size=4096
    FilterExec: r_name = ASIA
      RepartitionExec: partitioning=RoundRobinBatch(24)
        ParquetExec: batch_size=8192,
In this example, we could push the repartition on r_regionkey down to the scan.
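A rough sketch of that push-down on a toy plan tree follows. The `Plan` enum, `push_down_hash`, and `replace_round_robin` are hypothetical names for illustration only, not DataFusion's API. The sketch assumes that the filter and batch-coalescing operators process each partition independently, so a hash repartition can be moved beneath them.

```rust
/// Toy stand-in for a physical plan node (hypothetical, not DataFusion's API).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    ParquetScan,
    RoundRobin { partitions: usize, input: Box<Plan> },
    Hash { column: String, partitions: usize, input: Box<Plan> },
    Filter { predicate: String, input: Box<Plan> },
    CoalesceBatches { target_batch_size: usize, input: Box<Plan> },
}

/// If `plan` is a hash repartition sitting above a chain of
/// partition-preserving operators that ends in a round-robin repartition,
/// replace the round-robin with the hash and drop the outer hash.
fn push_down_hash(plan: Plan) -> Plan {
    match plan {
        Plan::Hash { column, partitions, input } => {
            match replace_round_robin(*input, &column, partitions) {
                Ok(rewritten) => rewritten,
                // No round-robin found: rebuild the plan unchanged.
                Err(original) => Plan::Hash { column, partitions, input: Box::new(original) },
            }
        }
        other => other,
    }
}

/// Walks down through Filter and CoalesceBatches nodes. Returns Ok with the
/// rewritten subtree when a round-robin repartition was replaced, or Err
/// with the untouched subtree otherwise.
fn replace_round_robin(plan: Plan, column: &str, partitions: usize) -> Result<Plan, Plan> {
    match plan {
        Plan::RoundRobin { input, .. } => Ok(Plan::Hash {
            column: column.to_string(),
            partitions,
            input,
        }),
        Plan::Filter { predicate, input } => {
            match replace_round_robin(*input, column, partitions) {
                Ok(i) => Ok(Plan::Filter { predicate, input: Box::new(i) }),
                Err(i) => Err(Plan::Filter { predicate, input: Box::new(i) }),
            }
        }
        Plan::CoalesceBatches { target_batch_size, input } => {
            match replace_round_robin(*input, column, partitions) {
                Ok(i) => Ok(Plan::CoalesceBatches { target_batch_size, input: Box::new(i) }),
                Err(i) => Err(Plan::CoalesceBatches { target_batch_size, input: Box::new(i) }),
            }
        }
        // Any other operator stops the push-down.
        other => Err(other),
    }
}
```

On the plan above, this would yield the coalesce and filter operators over a single hash repartition on r_regionkey directly above the scan. Note that the hash is then computed on rows the filter would later discard, so a real rule might want to weigh that cost.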
Describe the solution you'd like
Implement a new optimizer rule that removes redundant repartitions and/or pushes repartitions down toward the scan.
Describe alternatives you've considered
None
Additional context
None