Implement a way to preserve partitioning through `UnionExec` without losing ordering #10314

alamb · 2024-04-30T16:29:07Z

Is your feature request related to a problem or challenge?

The EnforceDistribution physical optimizer pass in DataFusion in some cases will introduce InterleaveExec to increase partitioning when data passes through a UnionExec:

datafusion/datafusion/core/src/physical_optimizer/enforce_distribution.rs

Lines 1196 to 1226 in 2231183

    
           plan = if plan.as_any().is::<UnionExec>() 
        
               && !config.optimizer.prefer_existing_union 
        
               && can_interleave(children_plans.iter()) 
        
           { 
        
               // Add a special case for [`UnionExec`] since we want to "bubble up" 
        
               // hash-partitioned data. So instead of 
        
               // 
        
               // Agg: 
        
               //   Repartition (hash): 
        
               //     Union: 
        
               //       - Agg: 
        
               //           Repartition (hash): 
        
               //             Data 
        
               //       - Agg: 
        
               //           Repartition (hash): 
        
               //             Data 
        
               // 
        
               // we can use: 
        
               // 
        
               // Agg: 
        
               //   Interleave: 
        
               //     - Agg: 
        
               //         Repartition (hash): 
        
               //           Data 
        
               //     - Agg: 
        
               //         Repartition (hash): 
        
               //           Data 
        
               Arc::new(InterleaveExec::try_new(children_plans)?) 
        
           } else { 
        
               plan.with_new_children(children_plans)? 
        
           };

Here is what InterleaveExec does:

datafusion/datafusion/physical-plan/src/union.rs

Lines 286 to 317 in 4edbdd7

    
           /// Combines multiple input streams by interleaving them. 
        
           /// 
        
           /// This only works if all inputs have the same hash-partitioning. 
        
           /// 
        
           /// # Data Flow 
        
           /// ```text 
        
           /// +---------+ 
        
           /// |         |---+ 
        
           /// | Input 1 |   | 
        
           /// |         |-------------+ 
        
           /// +---------+   |         | 
        
           ///               |         |         +---------+ 
        
           ///               +------------------>|         | 
        
           ///                 +---------------->| Combine |--> 
        
           ///                 | +-------------->|         | 
        
           ///                 | |     |         +---------+ 
        
           /// +---------+     | |     | 
        
           /// |         |-----+ |     | 
        
           /// | Input 2 |       |     | 
        
           /// |         |---------------+ 
        
           /// +---------+       |     | | 
        
           ///                   |     | |       +---------+ 
        
           ///                   |     +-------->|         | 
        
           ///                   |       +------>| Combine |--> 
        
           ///                   |         +---->|         | 
        
           ///                   |         |     +---------+ 
        
           /// +---------+       |         | 
        
           /// |         |-------+         | 
        
           /// | Input 3 |                 | 
        
           /// |         |-----------------+ 
        
           /// +---------+ 
        
           /// ```

However, this has the potential downside of destroying and pre-existing ordering which is sometimes preferable than increasing / improving partitionining (e.g. see #10257 and datafusion.optimizer.prefer_existing_sort setting)

Describe the solution you'd like

I would like there to be some way to preserve the partitioning after a UnionExec without losing the ordering information and then remove the prefer_existing_union flag

Describe alternatives you've considered

One possibility is to add a preserve_order flag to InterleaveExec the same way as RepartitionExec has a preserve_order flag:

datafusion/datafusion/physical-plan/src/repartition/mod.rs

Lines 328 to 417 in 4edbdd7

    
           /// Maps `N` input partitions to `M` output partitions based on a 
        
           /// [`Partitioning`] scheme. 
        
           /// 
        
           /// # Background 
        
           /// 
        
           /// DataFusion, like most other commercial systems, with the 
        
           /// notable exception of DuckDB, uses the "Exchange Operator" based 
        
           /// approach to parallelism which works well in practice given 
        
           /// sufficient care in implementation. 
        
           /// 
        
           /// DataFusion's planner picks the target number of partitions and 
        
           /// then `RepartionExec` redistributes [`RecordBatch`]es to that number 
        
           /// of output partitions. 
        
           /// 
        
           /// For example, given `target_partitions=3` (trying to use 3 cores) 
        
           /// but scanning an input with 2 partitions, `RepartitionExec` can be 
        
           /// used to get 3 even streams of `RecordBatch`es 
        
           /// 
        
           /// 
        
           ///```text 
        
           ///        ▲                  ▲                  ▲ 
        
           ///        │                  │                  │ 
        
           ///        │                  │                  │ 
        
           ///        │                  │                  │ 
        
           ///┌───────────────┐  ┌───────────────┐  ┌───────────────┐ 
        
           ///│    GroupBy    │  │    GroupBy    │  │    GroupBy    │ 
        
           ///│   (Partial)   │  │   (Partial)   │  │   (Partial)   │ 
        
           ///└───────────────┘  └───────────────┘  └───────────────┘ 
        
           ///        ▲                  ▲                  ▲ 
        
           ///        └──────────────────┼──────────────────┘ 
        
           ///                           │ 
        
           ///              ┌─────────────────────────┐ 
        
           ///              │     RepartitionExec     │ 
        
           ///              │   (hash/round robin)    │ 
        
           ///              └─────────────────────────┘ 
        
           ///                         ▲   ▲ 
        
           ///             ┌───────────┘   └───────────┐ 
        
           ///             │                           │ 
        
           ///             │                           │ 
        
           ///        .─────────.                 .─────────. 
        
           ///     ,─'           '─.           ,─'           '─. 
        
           ///    ;      Input      :         ;      Input      : 
        
           ///    :   Partition 0   ;         :   Partition 1   ; 
        
           ///     ╲               ╱           ╲               ╱ 
        
           ///      '─.         ,─'             '─.         ,─' 
        
           ///         `───────'                   `───────' 
        
           ///``` 
        
           /// 
        
           /// # Output Ordering 
        
           /// 
        
           /// If more than one stream is being repartitioned, the output will be some 
        
           /// arbitrary interleaving (and thus unordered) unless 
        
           /// [`Self::with_preserve_order`] specifies otherwise. 
        
           /// 
        
           /// # Footnote 
        
           /// 
        
           /// The "Exchange Operator" was first described in the 1989 paper 
        
           /// [Encapsulation of parallelism in the Volcano query processing 
        
           /// system 
        
           /// Paper](https://w6113.github.io/files/papers/volcanoparallelism-89.pdf) 
        
           /// which uses the term "Exchange" for the concept of repartitioning 
        
           /// data across threads. 
        
           #[derive(Debug)] 
        
           pub struct RepartitionExec { 
        
               /// Input execution plan 
        
               input: Arc<dyn ExecutionPlan>, 
        
               /// Partitioning scheme to use 
        
               partitioning: Partitioning, 
        
               /// Inner state that is initialized when the first output stream is created. 
        
               state: LazyState, 
        
               /// Execution metrics 
        
               metrics: ExecutionPlanMetricsSet, 
        
               /// Boolean flag to decide whether to preserve ordering. If true means 
        
               /// `SortPreservingRepartitionExec`, false means `RepartitionExec`. 
        
               preserve_order: bool, 
        
               /// Cache holding plan properties like equivalences, output partitioning etc. 
        
               cache: PlanProperties, 
        
           } 
        
           #[derive(Debug, Clone)] 
        
           struct RepartitionMetrics { 
        
               /// Time in nanos to execute child operator and fetch batches 
        
               fetch_time: metrics::Time, 
        
               /// Time in nanos to perform repartitioning 
        
               repartition_time: metrics::Time, 
        
               /// Time in nanos for sending resulting batches to channels. 
        
               /// 
        
               /// One metric per output partition. 
        
               send_time: Vec<metrics::Time>, 
        
           }

Additional context

We encountered this while working on #10259 @mustafasrepo and @phillipleblanc pointed out that config flag prefer_existing_union was effectively the same as prefer_existing_sort

The text was updated successfully, but these errors were encountered:

xinlifoobar · 2024-05-18T13:25:17Z

Hi @alamb, I am trying to work on this.

I am not very familiar on the InterleaveExec in the optimizer. As initial thought, the interleaveExec is acting as a Repartition with equal number of input partitions and output partitions and thus a nature idea is to reuse streaming_merge with respect to the input size. Wdyt?

alamb · 2024-05-21T00:13:49Z

Hi @alamb, I am trying to work on this.

I am not very familiar on the InterleaveExec in the optimizer. As initial thought, the interleaveExec is acting as a Repartition with equal number of input partitions and output partitions and thus a nature idea is to reuse streaming_merge with respect to the input size. Wdyt?

Hi @xinlifoobar -- this sounds like it is on the right track

xinlifoobar · 2024-05-21T09:36:35Z

Hi @alamb, found another interesting case while testing. I am not very sure, do you think this could apply InterleaveExec with same order by sets?

 explain select count(*) from ((select distinct c1, c2 from t3 order by c1 ) union all (select distinct c1, c2 from t4 order by c1)) group by cube(c1,c2);
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                   |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: COUNT(*)                                                                                                                                                   |
|               |   Aggregate: groupBy=[[CUBE (t3.c1, t3.c2)]], aggr=[[COUNT(Int64(1)) AS COUNT(*)]]                                                                                     |
|               |     Union                                                                                                                                                              |
|               |       Sort: t3.c1 ASC NULLS LAST                                                                                                                                       |
|               |         Aggregate: groupBy=[[t3.c1, t3.c2]], aggr=[[]]                                                                                                                 |
|               |           TableScan: t3 projection=[c1, c2]                                                                                                                            |
|               |       Sort: t4.c1 ASC NULLS LAST                                                                                                                                       |
|               |         Aggregate: groupBy=[[t4.c1, t4.c2]], aggr=[[]]                                                                                                                 |
|               |           TableScan: t4 projection=[c1, c2]                                                                                                                            |
| physical_plan | ProjectionExec: expr=[COUNT(*)@2 as COUNT(*)]                                                                                                                          |
|               |   AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, c2@1 as c2], aggr=[COUNT(*)], ordering_mode=PartiallySorted([0])                                              |
|               |     SortExec: expr=[c1@0 ASC NULLS LAST], preserve_partitioning=[true]                                                                                                 |
|               |       CoalesceBatchesExec: target_batch_size=8192                                                                                                                      |
|               |         RepartitionExec: partitioning=Hash([c1@0, c2@1], 14), input_partitions=14                                                                                      |
|               |           RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=2                                                                                        |
|               |             AggregateExec: mode=Partial, gby=[(c1@0 as c1, c2@1 as c2), (NULL as c1, c2@1 as c2), (c1@0 as c1, NULL as c2), (NULL as c1, NULL as c2)], aggr=[COUNT(*)] |
|               |               UnionExec                                                                                                                                                |
|               |                 CoalescePartitionsExec                                                                                                                                 |
|               |                   AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, c2@1 as c2], aggr=[]                                                                          |
|               |                     CoalesceBatchesExec: target_batch_size=8192                                                                                                        |
|               |                       RepartitionExec: partitioning=Hash([c1@0, c2@1], 14), input_partitions=1                                                                         |
|               |                         AggregateExec: mode=Partial, gby=[c1@0 as c1, c2@1 as c2], aggr=[]                                                                             |
|               |                           MemoryExec: partitions=1, partition_sizes=[0]                                                                                                |
|               |                 CoalescePartitionsExec                                                                                                                                 |
|               |                   AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1, c2@1 as c2], aggr=[]                                                                          |
|               |                     CoalesceBatchesExec: target_batch_size=8192                                                                                                        |
|               |                       RepartitionExec: partitioning=Hash([c1@0, c2@1], 14), input_partitions=1                                                                         |
|               |                         AggregateExec: mode=Partial, gby=[c1@0 as c1, c2@1 as c2], aggr=[]                                                                             |
|               |                           MemoryExec: partitions=1, partition_sizes=[0]                                                                                                |
|               |                                                                                                                                                                        |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.

With InterleaveExec:

 ProjectionExec: 
   AggregateExec:
    InterleaveExec: 
      SortExec:
         AggregateExec:
      SortExec:
         AggregateExec:

alamb added the enhancement New feature or request label Apr 30, 2024

alamb changed the title ~~Implement order preserving variant of InterleaveExec~~ Implement a way to preserve partitioning through UnionExec Apr 30, 2024

alamb changed the title ~~Implement a way to preserve partitioning through UnionExec~~ Implement a way to preserve partitioning through UnionExec without losing ordering Apr 30, 2024

This was referenced Apr 30, 2024

[Epic] A Collection of Sort Based Optimizations #10313

Open

feat: add optimizer config param to avoid grouping partitions prefer_existing_union #10259

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement a way to preserve partitioning through `UnionExec` without losing ordering #10314

Implement a way to preserve partitioning through `UnionExec` without losing ordering #10314

alamb commented Apr 30, 2024 •

edited

xinlifoobar commented May 18, 2024

alamb commented May 21, 2024

xinlifoobar commented May 21, 2024 •

edited

Implement a way to preserve partitioning through UnionExec without losing ordering #10314

Implement a way to preserve partitioning through UnionExec without losing ordering #10314

Comments

alamb commented Apr 30, 2024 • edited

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

xinlifoobar commented May 18, 2024

alamb commented May 21, 2024

xinlifoobar commented May 21, 2024 • edited

Implement a way to preserve partitioning through `UnionExec` without losing ordering #10314

Implement a way to preserve partitioning through `UnionExec` without losing ordering #10314

alamb commented Apr 30, 2024 •

edited

xinlifoobar commented May 21, 2024 •

edited