Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a way to preserve partitioning through UnionExec without losing ordering #10314

Open
Tracked by #10313
alamb opened this issue Apr 30, 2024 · 3 comments
Open
Tracked by #10313
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Apr 30, 2024

Is your feature request related to a problem or challenge?

The EnforceDistribution physical optimizer pass in DataFusion in some cases will introduce InterleaveExec to increase partitioning when data passes through a UnionExec:

plan = if plan.as_any().is::<UnionExec>()
&& !config.optimizer.prefer_existing_union
&& can_interleave(children_plans.iter())
{
// Add a special case for [`UnionExec`] since we want to "bubble up"
// hash-partitioned data. So instead of
//
// Agg:
// Repartition (hash):
// Union:
// - Agg:
// Repartition (hash):
// Data
// - Agg:
// Repartition (hash):
// Data
//
// we can use:
//
// Agg:
// Interleave:
// - Agg:
// Repartition (hash):
// Data
// - Agg:
// Repartition (hash):
// Data
Arc::new(InterleaveExec::try_new(children_plans)?)
} else {
plan.with_new_children(children_plans)?
};

Here is what InterleaveExec does:

/// Combines multiple input streams by interleaving them.
///
/// This only works if all inputs have the same hash-partitioning.
///
/// # Data Flow
/// ```text
/// +---------+
/// | |---+
/// | Input 1 | |
/// | |-------------+
/// +---------+ | |
/// | | +---------+
/// +------------------>| |
/// +---------------->| Combine |-->
/// | +-------------->| |
/// | | | +---------+
/// +---------+ | | |
/// | |-----+ | |
/// | Input 2 | | |
/// | |---------------+
/// +---------+ | | |
/// | | | +---------+
/// | +-------->| |
/// | +------>| Combine |-->
/// | +---->| |
/// | | +---------+
/// +---------+ | |
/// | |-------+ |
/// | Input 3 | |
/// | |-----------------+
/// +---------+
/// ```

However, this has the potential downside of destroying and pre-existing ordering which is sometimes preferable than increasing / improving partitionining (e.g. see #10257 and datafusion.optimizer.prefer_existing_sort setting)

Describe the solution you'd like

I would like there to be some way to preserve the partitioning after a UnionExec without losing the ordering information and then remove the prefer_existing_union flag

Describe alternatives you've considered

One possibility is to add a preserve_order flag to InterleaveExec the same way as RepartitionExec has a preserve_order flag:

/// Maps `N` input partitions to `M` output partitions based on a
/// [`Partitioning`] scheme.
///
/// # Background
///
/// DataFusion, like most other commercial systems, with the
/// notable exception of DuckDB, uses the "Exchange Operator" based
/// approach to parallelism which works well in practice given
/// sufficient care in implementation.
///
/// DataFusion's planner picks the target number of partitions and
/// then `RepartionExec` redistributes [`RecordBatch`]es to that number
/// of output partitions.
///
/// For example, given `target_partitions=3` (trying to use 3 cores)
/// but scanning an input with 2 partitions, `RepartitionExec` can be
/// used to get 3 even streams of `RecordBatch`es
///
///
///```text
/// ▲ ▲ ▲
/// │ │ │
/// │ │ │
/// │ │ │
///┌───────────────┐ ┌───────────────┐ ┌───────────────┐
///│ GroupBy │ │ GroupBy │ │ GroupBy │
///│ (Partial) │ │ (Partial) │ │ (Partial) │
///└───────────────┘ └───────────────┘ └───────────────┘
/// ▲ ▲ ▲
/// └──────────────────┼──────────────────┘
/// │
/// ┌─────────────────────────┐
/// │ RepartitionExec │
/// │ (hash/round robin) │
/// └─────────────────────────┘
/// ▲ ▲
/// ┌───────────┘ └───────────┐
/// │ │
/// │ │
/// .─────────. .─────────.
/// ,─' '─. ,─' '─.
/// ; Input : ; Input :
/// : Partition 0 ; : Partition 1 ;
/// ╲ ╱ ╲ ╱
/// '─. ,─' '─. ,─'
/// `───────' `───────'
///```
///
/// # Output Ordering
///
/// If more than one stream is being repartitioned, the output will be some
/// arbitrary interleaving (and thus unordered) unless
/// [`Self::with_preserve_order`] specifies otherwise.
///
/// # Footnote
///
/// The "Exchange Operator" was first described in the 1989 paper
/// [Encapsulation of parallelism in the Volcano query processing
/// system
/// Paper](https://w6113.github.io/files/papers/volcanoparallelism-89.pdf)
/// which uses the term "Exchange" for the concept of repartitioning
/// data across threads.
#[derive(Debug)]
pub struct RepartitionExec {
/// Input execution plan
input: Arc<dyn ExecutionPlan>,
/// Partitioning scheme to use
partitioning: Partitioning,
/// Inner state that is initialized when the first output stream is created.
state: LazyState,
/// Execution metrics
metrics: ExecutionPlanMetricsSet,
/// Boolean flag to decide whether to preserve ordering. If true means
/// `SortPreservingRepartitionExec`, false means `RepartitionExec`.
preserve_order: bool,
/// Cache holding plan properties like equivalences, output partitioning etc.
cache: PlanProperties,
}
#[derive(Debug, Clone)]
struct RepartitionMetrics {
/// Time in nanos to execute child operator and fetch batches
fetch_time: metrics::Time,
/// Time in nanos to perform repartitioning
repartition_time: metrics::Time,
/// Time in nanos for sending resulting batches to channels.
///
/// One metric per output partition.
send_time: Vec<metrics::Time>,
}

Additional context

We encountered this while working on #10259 @mustafasrepo and @phillipleblanc pointed out that config flag prefer_existing_union was effectively the same as prefer_existing_sort

@alamb alamb added the enhancement New feature or request label Apr 30, 2024
@alamb alamb changed the title Implement order preserving variant of InterleaveExec Implement a way to preserve partitioning through UnionExec Apr 30, 2024
@alamb alamb changed the title Implement a way to preserve partitioning through UnionExec Implement a way to preserve partitioning through UnionExec without losing ordering Apr 30, 2024
@xinlifoobar
Copy link
Contributor

Hi @alamb, I am trying to work on this.

I am not very familiar on the InterleaveExec in the optimizer. As initial thought, the interleaveExec is acting as a Repartition with equal number of input partitions and output partitions and thus a nature idea is to reuse streaming_merge with respect to the input size. Wdyt?

@alamb
Copy link
Contributor Author

alamb commented May 21, 2024

Hi @alamb, I am trying to work on this.

I am not very familiar on the InterleaveExec in the optimizer. As initial thought, the interleaveExec is acting as a Repartition with equal number of input partitions and output partitions and thus a nature idea is to reuse streaming_merge with respect to the input size. Wdyt?

Hi @xinlifoobar -- this sounds like it is on the right track

@xinlifoobar
Copy link
Contributor

xinlifoobar commented May 21, 2024

Hi @alamb, found another interesting case while testing. I am not very sure, do you think this could apply InterleaveExec with same order by sets?

 explain select count(*) from ((select distinct c1, c2 from t3 order by c1 ) union all (select distinct c1, c2 from t4 order by c1)) group by cube(c1,c2);
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                   |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: COUNT(*)                                                                                                                                                   |
|               |   Aggregate: groupBy=[[CUBE (t3.c1, t3.c2)]], aggr=[[COUNT(Int64(1)) AS COUNT(*)]]                                                                                     |
|               |     Union                                                                                                                                                              |
|               |       Sort: t3.c1 ASC NULLS LAST                                                                                                                                       |
|               |         Aggregate: groupBy=[[t3.c1, t3.c2]], aggr=[[]]                                                                                                                 |
|               |           TableScan: t3 projection=[c1, c2]                                                                                                                            |
|               |       Sort: t4.c1 ASC NULLS LAST                                                                                                                                       |
|               |         Aggregate: groupBy=[[t4.c1, t4.c2]], aggr=[[]]                                                                                                                 |
|               |           TableScan: t4 projection=[c1, c2]                                                                                                                            |
| physical_plan | ProjectionExec: expr=[COUNT(*)@2 as COUNT(*)]                                                                                                                          |
|               |   AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, c2@1 as c2], aggr=[COUNT(*)], ordering_mode=PartiallySorted([0])                                              |
|               |     SortExec: expr=[c1@0 ASC NULLS LAST], preserve_partitioning=[true]                                                                                                 |
|               |       CoalesceBatchesExec: target_batch_size=8192                                                                                                                      |
|               |         RepartitionExec: partitioning=Hash([c1@0, c2@1], 14), input_partitions=14                                                                                      |
|               |           RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=2                                                                                        |
|               |             AggregateExec: mode=Partial, gby=[(c1@0 as c1, c2@1 as c2), (NULL as c1, c2@1 as c2), (c1@0 as c1, NULL as c2), (NULL as c1, NULL as c2)], aggr=[COUNT(*)] |
|               |               UnionExec                                                                                                                                                |
|               |                 CoalescePartitionsExec                                                                                                                                 |
|               |                   AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, c2@1 as c2], aggr=[]                                                                          |
|               |                     CoalesceBatchesExec: target_batch_size=8192                                                                                                        |
|               |                       RepartitionExec: partitioning=Hash([c1@0, c2@1], 14), input_partitions=1                                                                         |
|               |                         AggregateExec: mode=Partial, gby=[c1@0 as c1, c2@1 as c2], aggr=[]                                                                             |
|               |                           MemoryExec: partitions=1, partition_sizes=[0]                                                                                                |
|               |                 CoalescePartitionsExec                                                                                                                                 |
|               |                   AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, c2@1 as c2], aggr=[]                                                                          |
|               |                     CoalesceBatchesExec: target_batch_size=8192                                                                                                        |
|               |                       RepartitionExec: partitioning=Hash([c1@0, c2@1], 14), input_partitions=1                                                                         |
|               |                         AggregateExec: mode=Partial, gby=[c1@0 as c1, c2@1 as c2], aggr=[]                                                                             |
|               |                           MemoryExec: partitions=1, partition_sizes=[0]                                                                                                |
|               |                                                                                                                                                                        |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched. 

With InterleaveExec:

 ProjectionExec: 
   AggregateExec:
    InterleaveExec: 
      SortExec:
         AggregateExec:
      SortExec:
         AggregateExec:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants