Issues with repartition #9701

matt-martin · 2024-03-19T23:02:50Z

matt-martin
Mar 19, 2024

Hello,

I'm fairly new to Rust and Datafusion so please excuse the basic question. I'm trying to understand why repartition does not always produce the desired number of partitions. Here's a very basic test I constructed:

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn repartition_test() {
        let data_vec = (1..200).map(|x| x.to_string()).collect::<Vec<_>>();

        let batch = RecordBatch::try_new(
            Arc::new(Schema::new(vec![Field::new("foo", DataType::Utf8, false)])),
            vec![Arc::new(StringArray::from(data_vec))]
        ).unwrap();

        let result =  SessionContext::new().read_batch(batch)
            .unwrap()
            .repartition(Partitioning::Hash(vec![col("foo")], 3))
            .unwrap()
            .collect_partitioned()
            .await
            .unwrap();

        print!("RESULTS look like: {:?}", result);
        assert_eq!(result.len(), 3);
    }
}

If I run the test:

cargo test -- tests::repartition_test --exact

I see the following output:

running 1 test
test tests::repartition_test ... FAILED

failures:

---- tests::repartition_test stdout ----
RESULTS look like: [[RecordBatch { schema: Schema { fields: [Field { name: "foo", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [StringArray
[
  "1",
  "2",
  "3",
  "4",
  "5",
  "6",
  "7",
  "8",
  "9",
  "10",
  ...179 elements...,
  "190",
  "191",
  "192",
  "193",
  "194",
  "195",
  "196",
  "197",
  "198",
  "199",
]], row_count: 199 }]]thread 'tests::repartition_test' panicked at src/main.rs:808:9:
assertion `left == right` failed
  left: 1
 right: 3
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    tests::repartition_test

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.02s

I'm probably missing something obvious, but shouldn't the top level vector returned by collect_partitioned have 3 elements (i.e. 1 for each partition)?

Answered by JasonLi-cn

Mar 22, 2024

AboutQ2:
I believe the purpose of the optimization rule EnforceDistribution is to speed up the query while ensuring the correctness of the results. Although it replaces Hash with RoundRobinBatch, the final result is correct (by that I mean the output row data, not the result data with partitions information that you wanted).
In other words, if there is an Aggregation operator downstream of Repartition(Hash), then it would not be replaced at that time.
Therefore, if you want to achieve the desired result, the solution is to remove EnforceDistribution as I mentioned earlier.

View full answer

JasonLi-cn · 2024-03-20T08:47:32Z

JasonLi-cn
Mar 20, 2024

First, let's take a look at the query plan:

Logical Plan
Repartition: Hash(foo) partition_count=3
  TableScan: ?table?

Optimized Logical Plan
Repartition: Hash(foo) partition_count=3
  TableScan: ?table? projection=[foo]

Physical Plan
MemoryExec: partitions=1, partition_sizes=[1]

As we can see, during the Logical Plan and Optimized Logical Plan stages, the query plan appears normal, but by the Physical Plan stage, the Repartition operator has disappeared.

Reason:
I tracked the optimization process of the Physical Plan and found:

...

[2024-03-20T08:38:22Z TRACE datafusion::physical_planner] Optimized physical plan by LimitedDistinctAggregation:
    OutputRequirementExec
      RepartitionExec: partitioning=Hash([foo@0], 3), input_partitions=1
        MemoryExec: partitions=1, partition_sizes=[1]
    
    
[2024-03-20T08:38:22Z TRACE datafusion::physical_planner] Optimized physical plan by EnforceDistribution:
    OutputRequirementExec
      MemoryExec: partitions=1, partition_sizes=[1]

...

After the EnforceDistribution optimization rule, the RepartitionExec was optimized away. As for why it was optimized away, we need to look further into the implementation of EnforceDistribution.

0 replies

JasonLi-cn · 2024-03-20T09:33:16Z

JasonLi-cn
Mar 20, 2024

Change your code to look like this:

async fn repartition_test() {
    let data_vec = (1..200).map(|x| x.to_string()).collect::<Vec<_>>();

    let batch = RecordBatch::try_new(
        Arc::new(Schema::new(vec![Field::new("foo", DataType::Utf8, false)])),
        vec![Arc::new(StringArray::from(data_vec))],
    )
    .unwrap();

    let state = SessionState::new_with_config_rt(
        SessionConfig::new(),
        Arc::new(RuntimeEnv::default()),
    )
    .with_physical_optimizer_rules(vec![]); // without physical optimizers

    let result = SessionContext::new_with_state(state)
        .read_batch(batch)
        .unwrap()
        .repartition(Partitioning::Hash(vec![col("foo")], 3))
        .unwrap()
        .collect_partitioned()
        .await
        .unwrap();

    print!("RESULTS look like: {:?}", result);
    assert_eq!(result.len(), 3);
}

1 reply

matt-martin Mar 20, 2024
Author

Thanks for the reply! I tested out some other variations and have some more questions about the EnforceDistribution optimizer. I think, for example, that if I try to repartition by the hash of a column, this rule will sometimes replaces it with a roundrobin repartition. That seems like it could potentially lead to some incorrect results (i.e. if I have code that depends on rows with the same hash all end up in the same partition, then replacing a hash partitioning with a round robin partition violates that assumption?). I'll try to write up a more concrete example later tonight or tomorrow morning.

matt-martin · 2024-03-21T16:09:28Z

matt-martin
Mar 21, 2024
Author

Here's an expanded test:

mod tests {
    use datafusion::datasource::MemTable;

    use super::*;

    #[tokio::test]
    async fn repartition_test() {
        let data_vec = (0..100).map(|x| x.to_string()).collect::<Vec<_>>();

        let schema = Arc::new(Schema::new(vec![Field::new("foo", DataType::Utf8, false)]));
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(StringArray::from(data_vec))]
        ).unwrap();

        let n_batches = 4;
        let mut batches = vec![batch.clone(); n_batches];
        for _ in 1..n_batches {
            batches.push(batch.clone())
        }

        // setting a small batch size helps trigger EnforceDistribution
        let config = SessionConfig::from_env().unwrap()
            .with_batch_size(1);

        let ctx = SessionContext::new_with_config(config);

        /* TEST SQL "DISTRIBUTED BY" */
        let mem_table = MemTable::try_new(
            schema.clone(),
            vec![batches.clone()]
        ).unwrap();
        ctx
            .register_table(
                TableReference::bare("test_table"),
                Arc::new(mem_table)
            )
            .unwrap();

        // I naively assumed that "DISTRIBUTED BY foo" would be similar to `.repartition(Partitioning::Hash(vec![col("foo")], ...))`
        // but without the ability to specify the number of partitions.
        // However, it seems that "DISTRIBUTED BY foo" is not enforced at all?
        let query = "SELECT trim(foo) FROM test_table DISTRIBUTED BY foo";
        let explain_sql = ctx.sql(format!("EXPLAIN VERBOSE {query}").as_str()).await.unwrap().collect().await.unwrap();
        let explain_str = datafusion::arrow::util::pretty::pretty_format_batches(&explain_sql).unwrap().to_string();
        println!("===Explain  SQl output===\n{explain_str}");
        let result = ctx.sql(query).await.unwrap().collect_partitioned().await.unwrap();
        println!("===SQL result len:{}", result.len());



        /* TEST DATAFRAME REPARTITION */
        let num_partitions_expected = 20;
        let df = ctx
            .read_batches(batches.clone())
            .unwrap()
            .repartition(Partitioning::Hash(vec![col("foo")], num_partitions_expected))
            .unwrap()
            .select(vec![trim(col("foo"))]) // add some non-trivial select (i.e. trim) to trigger EnforceDistribution
            .unwrap();
        

        let explain_unoptimized = df.clone().explain(true, false).unwrap().collect().await.unwrap();
        let explain_unoptimized_str = datafusion::arrow::util::pretty::pretty_format_batches(&explain_unoptimized).unwrap().to_string();
        println!("===Explain (unoptimized)===\n{explain_unoptimized_str}");
        // let explain_optimized = df.clone().explain(true, true).unwrap().collect().await.unwrap();
        // let explain_optimized_str = datafusion::arrow::util::pretty::pretty_format_batches(&explain_optimized).unwrap().to_string();
        // println!("===Explain (optimized)===\n{explain_optimized_str}");
        // println!("===Physical plan===\n{:?}", df.clone().create_physical_plan().await.unwrap());

        let result = df
            .collect_partitioned()
            .await
            .unwrap();

        println!("===RESULTS len: {}", result.len());
        assert_eq!(result.len(), num_partitions_expected);
    }
}

When I run cargo test -- tests::repartition_test --exact, I see the following (note: I've omitted some lines of the explain output and replaced it with "..." when there were a lot of rows with "SAME TEXT AS ABOVE"):

---- tests::repartition_test stdout ----
===Explain  SQl output===
+------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type                                                  | plan                                                                                                                                                |
+------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| initial_logical_plan                                       | Projection: trim(distributed.foo)                                                                                                                   |
|                                                            |   SubqueryAlias: distributed                                                                                                                        |
|                                                            |     TableScan: test_table                                                                                                                           |
| logical_plan after inline_table_scan                       | SAME TEXT AS ABOVE                                                                                                                                  |
...
| logical_plan after common_sub_expression_eliminate         | SAME TEXT AS ABOVE                                                                                                                                  |
| logical_plan after optimize_projections                    | Projection: trim(distributed.foo)                                                                                                                   |
|                                                            |   SubqueryAlias: distributed                                                                                                                        |
|                                                            |     TableScan: test_table projection=[foo]                                                                                                          |
| logical_plan after eliminate_nested_union                  | SAME TEXT AS ABOVE                                                                                                                                  |
...
| logical_plan after optimize_projections                    | SAME TEXT AS ABOVE                                                                                                                                  |
| logical_plan                                               | Projection: trim(distributed.foo)                                                                                                                   |
|                                                            |   SubqueryAlias: distributed                                                                                                                        |
|                                                            |     TableScan: test_table projection=[foo]                                                                                                          |
| initial_physical_plan                                      | ProjectionExec: expr=[trim(foo@0) as trim(distributed.foo)]                                                                                         |
|                                                            |   MemoryExec: partitions=1, partition_sizes=[7]                                                                                                     |
|                                                            |                                                                                                                                                     |
| initial_physical_plan_with_stats                           | ProjectionExec: expr=[trim(foo@0) as trim(distributed.foo)], statistics=[Rows=Exact(700), Bytes=Exact(5768), [(Col[0]:)]]                           |
|                                                            |   MemoryExec: partitions=1, partition_sizes=[7], statistics=[Rows=Exact(700), Bytes=Exact(5768), [(Col[0]: Null=Exact(0))]]                         |
|                                                            |                                                                                                                                                     |
| physical_plan after OutputRequirements                     | OutputRequirementExec                                                                                                                               |
|                                                            |   ProjectionExec: expr=[trim(foo@0) as trim(distributed.foo)]                                                                                       |
|                                                            |     MemoryExec: partitions=1, partition_sizes=[7]                                                                                                   |
|                                                            |                                                                                                                                                     |
| physical_plan after aggregate_statistics                   | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after join_selection                         | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after LimitedDistinctAggregation             | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after EnforceDistribution                    | OutputRequirementExec                                                                                                                               |
|                                                            |   ProjectionExec: expr=[trim(foo@0) as trim(distributed.foo)]                                                                                       |
|                                                            |     RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1                                                                           |
|                                                            |       MemoryExec: partitions=1, partition_sizes=[7]                                                                                                 |
|                                                            |                                                                                                                                                     |
| physical_plan after CombinePartialFinalAggregate           | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after EnforceSorting                         | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after coalesce_batches                       | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after OutputRequirements                     | ProjectionExec: expr=[trim(foo@0) as trim(distributed.foo)]                                                                                         |
|                                                            |   RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1                                                                             |
|                                                            |     MemoryExec: partitions=1, partition_sizes=[7]                                                                                                   |
|                                                            |                                                                                                                                                     |
| physical_plan after PipelineChecker                        | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after LimitAggregation                       | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after ProjectionPushdown                     | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan                                              | ProjectionExec: expr=[trim(foo@0) as trim(distributed.foo)]                                                                                         |
|                                                            |   RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1                                                                             |
|                                                            |     MemoryExec: partitions=1, partition_sizes=[7]                                                                                                   |
|                                                            |                                                                                                                                                     |
| physical_plan_with_stats                                   | ProjectionExec: expr=[trim(foo@0) as trim(distributed.foo)], statistics=[Rows=Exact(700), Bytes=Exact(5768), [(Col[0]:)]]                           |
|                                                            |   RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1, statistics=[Rows=Exact(700), Bytes=Exact(5768), [(Col[0]: Null=Exact(0))]] |
|                                                            |     MemoryExec: partitions=1, partition_sizes=[7], statistics=[Rows=Exact(700), Bytes=Exact(5768), [(Col[0]: Null=Exact(0))]]                       |
|                                                            |                                                                                                                                                     |
+------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
===SQL result len:16
===Explain (unoptimized)===
+------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type                                                  | plan                                                                                                                                                |
+------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| initial_logical_plan                                       | Projection: trim(?table?.foo)                                                                                                                       |
|                                                            |   Repartition: Hash(foo) partition_count=20                                                                                                         |
|                                                            |     TableScan: ?table?                                                                                                                              |
| logical_plan after inline_table_scan                       | SAME TEXT AS ABOVE                                                                                                                                  |
| logical_plan after common_sub_expression_eliminate         | SAME TEXT AS ABOVE                                                                                                                                  |
| logical_plan after optimize_projections                    | Projection: trim(?table?.foo)                                                                                                                       |
|                                                            |   Repartition: Hash(foo) partition_count=20                                                                                                         |
|                                                            |     TableScan: ?table? projection=[foo]                                                                                                             |
| logical_plan after eliminate_nested_union                  | SAME TEXT AS ABOVE                                                                                                                                  |
...
| logical_plan after optimize_projections                    | SAME TEXT AS ABOVE                                                                                                                                  |
| logical_plan                                               | Projection: trim(?table?.foo)                                                                                                                       |
|                                                            |   Repartition: Hash(foo) partition_count=20                                                                                                         |
|                                                            |     TableScan: ?table? projection=[foo]                                                                                                             |
| initial_physical_plan                                      | ProjectionExec: expr=[trim(foo@0) as trim(?table?.foo)]                                                                                             |
|                                                            |   RepartitionExec: partitioning=Hash([foo@0], 20), input_partitions=1                                                                               |
|                                                            |     MemoryExec: partitions=1, partition_sizes=[7]                                                                                                   |
|                                                            |                                                                                                                                                     |
| initial_physical_plan_with_stats                           | ProjectionExec: expr=[trim(foo@0) as trim(?table?.foo)], statistics=[Rows=Exact(700), Bytes=Exact(5768), [(Col[0]:)]]                               |
|                                                            |   RepartitionExec: partitioning=Hash([foo@0], 20), input_partitions=1, statistics=[Rows=Exact(700), Bytes=Exact(5768), [(Col[0]: Null=Exact(0))]]   |
|                                                            |     MemoryExec: partitions=1, partition_sizes=[7], statistics=[Rows=Exact(700), Bytes=Exact(5768), [(Col[0]: Null=Exact(0))]]                       |
|                                                            |                                                                                                                                                     |
| physical_plan after OutputRequirements                     | OutputRequirementExec                                                                                                                               |
|                                                            |   ProjectionExec: expr=[trim(foo@0) as trim(?table?.foo)]                                                                                           |
|                                                            |     RepartitionExec: partitioning=Hash([foo@0], 20), input_partitions=1                                                                             |
|                                                            |       MemoryExec: partitions=1, partition_sizes=[7]                                                                                                 |
|                                                            |                                                                                                                                                     |
| physical_plan after aggregate_statistics                   | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after join_selection                         | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after LimitedDistinctAggregation             | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after EnforceDistribution                    | OutputRequirementExec                                                                                                                               |
|                                                            |   ProjectionExec: expr=[trim(foo@0) as trim(?table?.foo)]                                                                                           |
|                                                            |     RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1                                                                           |
|                                                            |       MemoryExec: partitions=1, partition_sizes=[7]                                                                                                 |
|                                                            |                                                                                                                                                     |
| physical_plan after CombinePartialFinalAggregate           | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after EnforceSorting                         | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after coalesce_batches                       | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after OutputRequirements                     | ProjectionExec: expr=[trim(foo@0) as trim(?table?.foo)]                                                                                             |
|                                                            |   RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1                                                                             |
|                                                            |     MemoryExec: partitions=1, partition_sizes=[7]                                                                                                   |
|                                                            |                                                                                                                                                     |
| physical_plan after PipelineChecker                        | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after LimitAggregation                       | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan after ProjectionPushdown                     | SAME TEXT AS ABOVE                                                                                                                                  |
| physical_plan                                              | ProjectionExec: expr=[trim(foo@0) as trim(?table?.foo)]                                                                                             |
|                                                            |   RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1                                                                             |
|                                                            |     MemoryExec: partitions=1, partition_sizes=[7]                                                                                                   |
|                                                            |                                                                                                                                                     |
| physical_plan_with_stats                                   | ProjectionExec: expr=[trim(foo@0) as trim(?table?.foo)], statistics=[Rows=Exact(700), Bytes=Exact(5768), [(Col[0]:)]]                               |
|                                                            |   RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1, statistics=[Rows=Exact(700), Bytes=Exact(5768), [(Col[0]: Null=Exact(0))]] |
|                                                            |     MemoryExec: partitions=1, partition_sizes=[7], statistics=[Rows=Exact(700), Bytes=Exact(5768), [(Col[0]: Null=Exact(0))]]                       |
|                                                            |                                                                                                                                                     |
+------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
===RESULTS len: 16
thread 'tests::repartition_test' panicked at src/main.rs:865:9:
assertion `left == right` failed
  left: 16
 right: 20

Two things stand out to me:

The "DISTRIBUTED BY foo" in the SQL seems to not be reflected in the initial logical plan at all?
The use of ".repartition(..., ...)" does appear in the initial logical plan (i.e. Repartition: Hash(foo) partition_count=20), but the physical plan at the end has a round robin partition instead (i.e. RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1). This code seems to be too aggressive in its removal of "unnecessary" repartitions? The Repartition: Hash(foo) partition_count=20 is lost as a result of that code and then the round robin partition is added here.

2 replies

JasonLi-cn Mar 22, 2024

About Q1:
Change DISTRIBUTED BY into DISTRIBUTE BY.
However, datafusion currently does not support DistributeBy partitioning in physical plan, and the error message is as follows:

called `Result::unwrap()` on an `Err` value: NotImplemented("Physical plan does not support DistributeBy partitioning")

😅

JasonLi-cn Mar 22, 2024

AboutQ2:
I believe the purpose of the optimization rule EnforceDistribution is to speed up the query while ensuring the correctness of the results. Although it replaces Hash with RoundRobinBatch, the final result is correct (by that I mean the output row data, not the result data with partitions information that you wanted).
In other words, if there is an Aggregation operator downstream of Repartition(Hash), then it would not be replaced at that time.
Therefore, if you want to achieve the desired result, the solution is to remove EnforceDistribution as I mentioned earlier.

Answer selected by alamb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues with repartition #9701

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 3 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Issues with repartition #9701

matt-martin Mar 19, 2024

Replies: 3 comments · 3 replies

JasonLi-cn Mar 20, 2024

JasonLi-cn Mar 20, 2024

matt-martin Mar 20, 2024 Author

matt-martin Mar 21, 2024 Author

JasonLi-cn Mar 22, 2024

JasonLi-cn Mar 22, 2024

matt-martin
Mar 19, 2024

Replies: 3 comments 3 replies

JasonLi-cn
Mar 20, 2024

JasonLi-cn
Mar 20, 2024

matt-martin Mar 20, 2024
Author

matt-martin
Mar 21, 2024
Author