Skip to content

Reduce channel fanout overhead in RepartitionExec #22202

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

RepartitionExec emits one small RecordBatch per (input batch × non-empty output partition), then coalesces them back to target size on the consumer side. The channel layer (memory accounting, sender gate, await suspensions) therefore does work proportional to num_output_partitions per input batch, even though each sub-batch only carries ~batch_size / num_output_partitions rows.

This becomes a real cost at high fanout. In datafusion-distributed, RepartitionExec is the backbone for network shuffles and is scaled to P * W partitions (P ≈ 12–24, W up to thousands), where per-batch channel overhead dominates.

Additional context

A candidate implementation is in #22010.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions