Remove redundant RepartitionExec from plan #384

@andygrove

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I was running benchmarks this morning and noticed this fragment of the query plan when running TPC-H query 5.

CoalesceBatchesExec: target_batch_size=4096
  RepartitionExec: partitioning=Hash([Column { name: "n_nationkey" }], 24)
    RepartitionExec: partitioning=RoundRobinBatch(24)
      ParquetExec: batch_size=8192,

It seems redundant to repartition with round-robin and then immediately repartition again using a hash.
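As a rough illustration of the rewrite this suggests, here is a minimal sketch on a toy plan tree. The `Plan` and `Partitioning` enums and the `remove_redundant_repartition` function are hypothetical stand-ins invented for this example, not DataFusion's actual `ExecutionPlan` API. It assumes that a repartition whose direct input is another repartition can always drop the inner one; whether that is a net win in practice (for example, for parallelism of the hash computation) would need benchmarking.

```rust
// Toy model of a physical plan. These types are illustrative only and are
// NOT DataFusion's real API.
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    RoundRobinBatch(usize),
    Hash(Vec<String>, usize),
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { batch_size: usize },
    CoalesceBatches { target_batch_size: usize, input: Box<Plan> },
    Repartition { partitioning: Partitioning, input: Box<Plan> },
}

/// Bottom-up rewrite: when a Repartition's direct input is another
/// Repartition, keep only the outer one, since it redistributes every
/// row regardless of how the inner one arranged them.
fn remove_redundant_repartition(plan: Plan) -> Plan {
    match plan {
        scan @ Plan::Scan { .. } => scan,
        Plan::CoalesceBatches { target_batch_size, input } => Plan::CoalesceBatches {
            target_batch_size,
            input: Box::new(remove_redundant_repartition(*input)),
        },
        Plan::Repartition { partitioning, input } => {
            // Rewrite the child first, then check whether it is a Repartition.
            match remove_redundant_repartition(*input) {
                // Skip the inner repartition and attach its input directly.
                Plan::Repartition { input: inner, .. } => {
                    Plan::Repartition { partitioning, input: inner }
                }
                other => Plan::Repartition { partitioning, input: Box::new(other) },
            }
        }
    }
}
```

Applied to the fragment above, this sketch would leave the hash repartition on n_nationkey sitting directly on top of the scan.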

A more complex example:

RepartitionExec: partitioning=Hash([Column { name: "r_regionkey" }], 24)
  CoalesceBatchesExec: target_batch_size=4096
    FilterExec: r_name = ASIA
      RepartitionExec: partitioning=RoundRobinBatch(24)
        ParquetExec: batch_size=8192,

In this example, we could push the hash repartition on r_regionkey down to the scan, where it would take the place of the round-robin repartition.
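The push-down could be sketched as follows, again on a toy plan tree rather than DataFusion's real types. Everything here (`Plan`, `Partitioning`, `replace_inner_repartition`, `push_down_repartition`) is hypothetical and named only for illustration. The sketch assumes that `Filter` and `CoalesceBatches` process rows within a partition without changing the schema, so the hash column is still available beneath them, and it only rewrites a repartition at the root of the fragment.

```rust
// Toy model of a physical plan. These types are illustrative only and are
// NOT DataFusion's real API.
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    RoundRobinBatch(usize),
    Hash(Vec<String>, usize),
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { batch_size: usize },
    Filter { predicate: String, input: Box<Plan> },
    CoalesceBatches { target_batch_size: usize, input: Box<Plan> },
    Repartition { partitioning: Partitioning, input: Box<Plan> },
}

/// Walk down through Filter / CoalesceBatches looking for a Repartition.
/// If one is found, swap its partitioning for `target` and return Ok;
/// otherwise hand the plan back unchanged as Err.
fn replace_inner_repartition(plan: Plan, target: &Partitioning) -> Result<Plan, Plan> {
    match plan {
        Plan::Repartition { input, .. } => {
            Ok(Plan::Repartition { partitioning: target.clone(), input })
        }
        Plan::Filter { predicate, input } => match replace_inner_repartition(*input, target) {
            Ok(new) => Ok(Plan::Filter { predicate, input: Box::new(new) }),
            Err(old) => Err(Plan::Filter { predicate, input: Box::new(old) }),
        },
        Plan::CoalesceBatches { target_batch_size, input } => {
            match replace_inner_repartition(*input, target) {
                Ok(new) => Ok(Plan::CoalesceBatches { target_batch_size, input: Box::new(new) }),
                Err(old) => Err(Plan::CoalesceBatches { target_batch_size, input: Box::new(old) }),
            }
        }
        scan @ Plan::Scan { .. } => Err(scan),
    }
}

/// If the root is a Repartition and another Repartition sits below it,
/// move the root's partitioning down and drop the root node.
fn push_down_repartition(plan: Plan) -> Plan {
    match plan {
        Plan::Repartition { partitioning, input } => {
            match replace_inner_repartition(*input, &partitioning) {
                // The outer repartition is now redundant.
                Ok(rewritten) => rewritten,
                // Nothing to replace below: keep the plan as it was.
                Err(unchanged) => Plan::Repartition { partitioning, input: Box::new(unchanged) },
            }
        }
        other => other,
    }
}
```

On the fragment above, this sketch would yield CoalesceBatches over Filter over a single hash repartition on r_regionkey directly above the scan.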

Describe the solution you'd like
Implement a new optimizer rule that removes redundant repartitions and/or pushes repartitions down the plan.

Describe alternatives you've considered
None

Additional context
None

Labels: enhancement, performance
