feat: Determine ordering of file groups #9593

suremarc · 2024-03-13T08:17:32Z

Which issue does this PR close?

Closes #7490.

Rationale for this change

See details in #7490 - this feature helps DataFusion eliminate sorts when files can be shown to be non-overlapping in terms of min/max statistics.

What changes are included in this PR?

Add a new FileScanConfig::sort_file_groups method that distributes files via a bin packing algorithm, ensuring that no two files in the same file group have overlapping statistics
Make FileScanConfig::project check if file groups are sorted when determining projected output orderings
Add a new internal MinMaxStatistics struct that uses the Arrow Row API to efficiently sort & compare file statistics.
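
The bin packing step described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the PR's implementation: `FileStats` and `split_groups_by_statistics` are hypothetical names, statistics cover a single integer sort column, and a first-fit pass over files sorted by min is assumed.

```rust
/// Hypothetical stand-in for a file with min/max statistics on one sort column.
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    name: &'static str,
    min: i64,
    max: i64,
}

/// Greedily distribute files into groups so that, within each group, every
/// file's min is strictly greater than the previous file's max.
fn split_groups_by_statistics(mut files: Vec<FileStats>) -> Vec<Vec<FileStats>> {
    // Sorting by min lets a first-fit pass keep each group ordered.
    files.sort_by_key(|f| f.min);

    let mut groups: Vec<Vec<FileStats>> = Vec::new();
    for file in files {
        // Find the first group whose last file ends before this file starts.
        let slot = groups
            .iter_mut()
            .find(|g| g.last().map_or(true, |last| last.max < file.min));
        match slot {
            Some(group) => group.push(file),
            None => groups.push(vec![file]),
        }
    }
    groups
}

fn main() {
    let files = vec![
        FileStats { name: "2", min: 10, max: 19 },
        FileStats { name: "0", min: 0, max: 9 },
        FileStats { name: "1", min: 5, max: 14 },
    ];
    for (i, group) in split_groups_by_statistics(files).iter().enumerate() {
        let names: Vec<_> = group.iter().map(|f| f.name).collect();
        println!("group {i}: {names:?}");
    }
}
```

In this sketch, files "0" and "2" share a group because their ranges do not overlap, while "1" overlaps both and gets a group of its own.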

Are these changes tested?

Yes - there is a unit test and a sqllogictest.

Are there any user-facing changes?

Yes - there is a new optional statistics field in PartitionedFile, which is part of the proposal in #7490.

There is also the new FileScanConfig::sort_file_groups API.

Dandandan · 2024-03-15T09:33:03Z

/benchmark

Dandandan · 2024-03-16T01:13:42Z

/benchmark

github-actions · 2024-03-16T01:24:12Z

Benchmark results

Benchmarks comparing 3d13091 (main) and cca5f0f (PR)

Comparing 3d13091 and cca5f0f
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/3d13091/tpch.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/cca5f0f/tpch.json does not exist

Dandandan · 2024-03-16T08:48:16Z

/benchmark

Dandandan · 2024-03-16T08:48:43Z

@suremarc sorry for the noise, just trying to run the benchmark command!

github-actions · 2024-03-16T09:00:21Z

Benchmark results

Benchmarks comparing 6e90f01 (main) and cca5f0f (PR)

Comparing 6e90f01 and cca5f0f
Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/6e90f01/tpch.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/cca5f0f/tpch.json does not exist

alamb · 2024-04-29T15:47:06Z

@NGA-TRAN do you have time to review this PR as well?

alamb · 2024-04-29T15:57:04Z

No worries @suremarc -- I am very excited about this PR. I plan to review it sometime this week (hopefully later today)

NGA-TRAN · 2024-04-29T16:17:57Z

I will review this either today or tomorrow

NGA-TRAN

Thanks @suremarc for adding tests and for all the refactoring work. The PR looks good to me.

I have some suggestions for the tests to keep them deterministic. Since I went through the PR commit by commit, you might have moved the files/tests around since, but I think you will get the idea.

NGA-TRAN · 2024-04-30T18:17:23Z

datafusion/core/src/datasource/listing/mod.rs

+    /// DataFusion relies on these statistics for planning so if they are incorrect
+    /// incorrect answers may result.


Suggested change

-    /// DataFusion relies on these statistics for planning so if they are incorrect
-    /// incorrect answers may result.
+    /// DataFusion relies on these statistics for planning so if they are incorrect,
+    /// incorrect answers may result.

I am guessing you use the column min and max statistics to determine whether data overlaps or not, right? And if the files do not overlap, we do not need to merge them before sorting. Maybe add that, to make clear how incorrect statistics can lead to incorrect results.
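
To illustrate the concern, here is a minimal sketch of how a wrong min value could produce wrongly ordered output. Everything in it is hypothetical (the `File` struct, the helper names, and the single integer column); it only assumes that the planner trusts the claimed min/max when deciding that two files can be read back to back.

```rust
/// Hypothetical file: its actual (sorted) values plus the min/max it claims.
struct File {
    values: Vec<i64>,
    claimed_min: i64,
    claimed_max: i64,
}

/// Two files can be read back to back without a merge only if the claimed
/// ranges do not overlap.
fn claims_non_overlapping(first: &File, second: &File) -> bool {
    first.claimed_max < second.claimed_min
}

/// Read the files one after another, as a scan of one file group would.
fn concat(first: &File, second: &File) -> Vec<i64> {
    first.values.iter().chain(&second.values).copied().collect()
}

/// Check that a slice is in non-decreasing order.
fn is_sorted(values: &[i64]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}

fn main() {
    let a = File { values: vec![1, 2, 3], claimed_min: 1, claimed_max: 3 };
    // Wrong statistics: the file really contains 2, but claims min = 4.
    let b = File { values: vec![2, 5, 6], claimed_min: 4, claimed_max: 6 };

    assert!(claims_non_overlapping(&a, &b));
    // A planner trusting the claim would skip the sort, yet the
    // concatenated output is not actually ordered.
    assert!(!is_sorted(&concat(&a, &b)));
}
```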

NGA-TRAN · 2024-04-30T18:44:19Z

datafusion/sqllogictest/test_files/parquet.slt

--SortExec: expr=[string_col@1 ASC NULLS LAST,int_col@0 ASC NULLS LAST]
----ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet]]}, projection=[int_col, string_col]
-
+--ParquetExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet]]}, projection=[int_col, string_col], output_ordering=[string_col@1 ASC NULLS LAST, int_col@0 ASC NULLS LAST]


So nice 🎉

NGA-TRAN · 2024-04-30T18:50:46Z

datafusion/core/src/datasource/physical_plan/file_scan_config.rs

+                ],
+                sort: vec![col("value").sort(true, false)],
+                expected_result: Err("construct min/max statistics\ncaused by\ncollect min/max values\ncaused by\nError during planning: statistics not found"),
+            },


Nice newly added tests

NGA-TRAN · 2024-04-30T18:54:05Z

datafusion/core/src/datasource/listing/table.rs

+                    partitioned_file_lists = new_groups;
+                } else {
+                    log::debug!("attempted to split file groups by statistics, but there were more file groups than target_partitions; falling back to unordered")
+                }
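
The fallback in the snippet above can be sketched in isolation. This is a simplified illustration, not the code in table.rs: `choose_groups` is a hypothetical helper, and the returned flag stands in for whether an output ordering could still be reported.

```rust
/// Hypothetical helper: accept the statistics-ordered groups only when they
/// fit in the available partitions, otherwise keep the original grouping.
fn choose_groups<T>(
    original: Vec<Vec<T>>,
    split_by_statistics: Vec<Vec<T>>,
    target_partitions: usize,
) -> (Vec<Vec<T>>, bool) {
    if split_by_statistics.len() <= target_partitions {
        (split_by_statistics, true)
    } else {
        // More groups than partitions: fall back to the unordered grouping.
        (original, false)
    }
}

fn main() {
    let original = vec![vec!["0", "1"], vec!["2"]];
    let split = vec![vec!["0"], vec!["1"], vec!["2"]];

    // Three ordered groups do not fit into two partitions: fall back.
    let (groups, ordered) = choose_groups(original.clone(), split.clone(), 2);
    assert_eq!(groups, original);
    assert!(!ordered);

    // With three partitions the ordered grouping can be used.
    let (groups, ordered) = choose_groups(original, split.clone(), 3);
    assert_eq!(groups, split);
    assert!(ordered);
}
```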


NGA-TRAN · 2024-04-30T19:04:31Z

datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt

+
+# File 1:
+query ITID
+COPY (SELECT * FROM src_table LIMIT 3)


I am a bit surprised that this query always returns the first 3 rows and that its data is sorted. Maybe you want to make it deterministic, in case this behavior no longer holds in the future, by using SELECT * FROM src_table where int_col <=3 order by int_col

I think since you make your columns have corresponding increasing data, the order by can be on any column and your data is always sorted on any column

I'm not sure what happened here but I think you were looking at an old copy of the code -- GH says this is outdated. The latest version has a sort in this command

Right. See my comment here #9593 (review)

Ah, ok that makes sense

NGA-TRAN · 2024-04-30T19:05:06Z

datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt

+
+# File 2:
+query ITID
+COPY (SELECT * FROM src_table WHERE int_col > 3 LIMIT 3)


Similar to the above: SELECT * FROM src_table where int_col >=4 and int_col <=6 order by int_col

NGA-TRAN · 2024-04-30T19:17:01Z

datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt

+
+# Add another file to the directory underlying test_table
+query ITID
+COPY (SELECT * FROM src_table WHERE int_col > 6 LIMIT 3)


Same as above: you want a filter that gives a deterministic result with sorted data

NGA-TRAN · 2024-04-30T19:23:09Z

datafusion/sqllogictest/test_files/parquet.slt

-# Check output plan again, expect an "output_ordering" clause in the physical_plan -> ParquetExec:
-# After https://github.com/apache/arrow-datafusion/pull/9593 this should not require a sort.
+# Check output plan again, expect no "output_ordering" clause in the physical_plan -> ParquetExec,
+# due to there being more files than partitions:


Nice negative test

Co-authored-by: Nga Tran <nga-tran@live.com>

suremarc · 2024-04-30T20:47:45Z

I'm not sure why changing a comment caused the tests to start failing.... oof.

alamb · 2024-04-30T20:48:13Z

Added API change label as it adds a new field to PartitionedFile

alamb

Thank you so much @suremarc -- this PR is pretty epic

I am a little worried about the lack of coverage in MinMaxStatistics, especially around some of the tricky edge cases with projected schemas and projections. However, this PR is already quite large and has been open for quite a while

Here is how I suggest we proceed:

Add a config value for this feature and default it to false (not enabled by default)
Merge this PR
File additional tickets / tests before enabling it
Add additional test coverage as follow on PRs and do some more testing.

Once we feel good about that we can make a PR to turn it on by default.

Thank you very much @NGA-TRAN for your reviews

alamb · 2024-04-30T20:45:59Z

datafusion/core/src/datasource/listing/mod.rs

+    ///
+    /// DataFusion relies on these statistics for planning (in particular to sort file groups),
+    /// so if they are incorrect, incorrect answers may result.
+    pub statistics: Option<Statistics>,


this is actually a nice API to potentially provide pre-known statistics 👍

alamb · 2024-04-30T20:54:49Z

datafusion/core/src/datasource/physical_plan/mod.rs

    }
    all_orderings
 }

+/// A normalized representation of file min/max statistics that allows for efficient sorting & comparison.


Could you please put this structure into its own module (e.g. datafusion/core/src/datasource/physical_plan/statistics.rs) so that it is easier to find

alamb · 2024-04-30T20:56:43Z

datafusion/core/src/datasource/physical_plan/mod.rs

+    fn new_from_files<'a>(
+        projected_sort_order: &[PhysicalSortExpr], // Sort order with respect to projected schema
+        projected_schema: &SchemaRef,              // Projected schema
+        projection: Option<&[usize]>, // Indices of projection in full table schema (None = all columns)


I didn't see any tests that covered a non None projection and I am a little confused about how it could be correct if the projection was in terms of another schema 🤔

I am not sure what you mean, but projection is what was used to produce projected_schema. It tells us what position the columns of projected_schema would be in the full schema. Does that make it more clear?
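
A small sketch of that mapping may help. It is illustrative only: `full_schema_index`, the column names, and the min values are hypothetical, and it assumes per-file statistics are laid out in table-schema order.

```rust
/// Map a column index in the projected schema back to its index in the full
/// table schema. `None` means all columns are selected, so indices are unchanged.
fn full_schema_index(projection: Option<&[usize]>, projected_idx: usize) -> Option<usize> {
    match projection {
        Some(indices) => indices.get(projected_idx).copied(),
        None => Some(projected_idx),
    }
}

fn main() {
    // Hypothetical table schema and per-column minimums, in table-schema order.
    let table_columns = ["date", "symbol", "time", "value"];
    let column_mins = [20240101_i64, 0, 900, 5];

    // Projected schema is (value, date), i.e. table columns 3 and 0.
    let projection: &[usize] = &[3, 0];

    // Column 0 of the projected schema is "value" in the full schema.
    let idx = full_schema_index(Some(projection), 0).unwrap();
    println!("{} has min {}", table_columns[idx], column_mins[idx]);

    // Without a projection, indices pass through unchanged.
    assert_eq!(full_schema_index(None, 2), Some(2));
}
```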

I guess I was thinking of subtle bugs related to when:

1. The schema of the files is different but compatible (e.g. one file has (time, date, symbol) but the other file has (date, symbol, time))

2. The query orders by a subset of the columns (e.g. ORDER BY time)

3. The query orders by a subset of the columns that is not the sort order (ORDER BY date)

Oh... I didn't even think about option 1. But I was assuming that the layout of the file statistics should match the table schema and not the individual file's schema. It seems that that's what DataFusion does currently.

alamb · 2024-04-30T21:00:05Z

datafusion/core/src/datasource/physical_plan/file_scan_config.rs

+            e.context("construct min/max statistics for split_groups_by_statistics")
+        })?;
+
+        let indices_sorted_by_min = {


I wonder if we could move this into statistics itself, something like

let indices_sorted_by_min = statistics.indices_sorted_by_min()
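
A rough sketch of what such a method might look like, using a deliberately simplified stand-in: the `MinMax` struct below is hypothetical and stores plain integer (min, max) pairs for one column, whereas the PR's MinMaxStatistics is built on the Arrow Row API.

```rust
/// Hypothetical, simplified stand-in for MinMaxStatistics: one (min, max)
/// pair per file for a single sort column.
struct MinMax {
    ranges: Vec<(i64, i64)>,
}

impl MinMax {
    /// Return file indices ordered by each file's min value.
    fn indices_sorted_by_min(&self) -> Vec<usize> {
        let mut indices: Vec<usize> = (0..self.ranges.len()).collect();
        // Stable sort keeps the input order for files with equal mins.
        indices.sort_by_key(|&i| self.ranges[i].0);
        indices
    }
}

fn main() {
    let statistics = MinMax {
        ranges: vec![(10, 19), (0, 9), (5, 14)],
    };
    let indices_sorted_by_min = statistics.indices_sorted_by_min();
    println!("{indices_sorted_by_min:?}");
}
```

Returning indices rather than reordered statistics lets the caller permute the file list itself without cloning the statistics.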

alamb · 2024-04-30T21:02:23Z

datafusion/core/src/datasource/physical_plan/file_scan_config.rs

+                sort: vec![col("value").sort(true, false)],
+                expected_result: Ok(vec![vec!["0"], vec!["1"], vec!["2"]]),
+            },
+            TestCase {


can we please add a test for a single input file too?

alamb · 2024-04-30T21:04:03Z

datafusion/core/src/datasource/physical_plan/file_scan_config.rs

+                    false,
+                )]),
+                files: vec![
+                    File::new("0", "2023-01-01", vec![Some((0.00, 0.49))]),


I think all these tests also always have the first file with the minimum statistics value -- can you possibly also test what happens when it is not (i.e. add a test that runs this case with file ids 2, 1, 0)?

suremarc · 2024-05-01T15:53:48Z

@alamb I added a config value, and I moved MinMaxStatistics to its own module as requested. I wasn't sure if I should delay addressing your feedback on tests to the next PR, since it seems like the suggested plan is to merge this PR first.

alamb · 2024-05-01T20:35:54Z

@alamb I added a config value, and I moved MinMaxStatistics to its own module as requested. I wasn't sure if I should delay addressing your feedback on tests to the next PR, since it seems like the suggested plan is to merge this PR first.

Sorry -- sounds good. I am going to give this PR another look and file some follow on tickets.

alamb

Thank you @suremarc -- epic work. Let's merge this one in and keep iterating on main

Thanks again for sticking with this -- this is very exciting

alamb · 2024-05-01T20:52:16Z

Filed #10336 to track enabling this flag by default

suremarc added 11 commits March 12, 2024 22:49

add statistics to PartitionedFile

7587a07

just dump work for now

1e380b2

working test case

263453f

fix jumbled rebase

5634bd7

forgot to annotate #[test]

7428fe0

more refactoring

4816343

add a link

c7be9e0

refactor again

fc1a668

whitespace

1c42e00

format debug log

3446fed

remove useless itertools

3fe8558

github-actions bot added core Core datafusion crate substrait labels Mar 13, 2024

suremarc added 10 commits March 13, 2024 10:59

refactor test

8ba4001

fix bug

9c8729a

use sort_file_groups in ListingTable

6df9832

move check into a better place

f855a8a

refactor test a bit

3e5263b

more testing

5b7b307

more testing

4761096

better error message

a95dffa

fix log msg

1a66604

fix again

cca5f0f

Dandandan mentioned this pull request Mar 15, 2024

/benchmark command seems not to work #9620

Closed

alamb mentioned this pull request Apr 29, 2024

DataFusion weekly project plan (Andrew Lamb) - April 29, 2024 #10283

Closed

8 tasks

alamb mentioned this pull request Apr 30, 2024

Optimized version of SortPreservingMerge that doesn't actually compare sort keys of the key ranges are ordered #10316

Open

NGA-TRAN approved these changes Apr 30, 2024


suremarc and others added 2 commits April 30, 2024 15:28

Update datafusion/core/src/datasource/listing/mod.rs

9bc29cf

Co-authored-by: Nga Tran <nga-tran@live.com>

update comment on in

aa89433

alamb added the api change Changes the API exposed to users of the crate label Apr 30, 2024

suremarc added 2 commits April 30, 2024 15:50

fix test?

d7fc78a

un-fix?

f3a69e5

alamb reviewed Apr 30, 2024


suremarc added 6 commits April 30, 2024 16:14

add fix back in?

1a010b7

move indices_sorted_by_min to MinMaxStatistics

f41d1c9

move MinMaxStatistics to its own module

15e1339

fix license

a2c9b4e

add feature flag

d7c9af6

update config

82166fd

suremarc requested a review from alamb May 1, 2024 15:53

alamb approved these changes May 1, 2024


alamb merged commit 7c1c794 into apache:main May 1, 2024
24 checks passed

This was referenced May 1, 2024

Enable split_file_groups_by_statistics by default #10336

Open

[Epic] A Collection of Sort Based Optimizations #10313

Open

yyy1000 mentioned this pull request May 4, 2024

Add more sqllogictests for parquet_sorted_statistics #10381

Closed

alamb mentioned this pull request May 6, 2024

DataFusion weekly project plan (Andrew Lamb) - May 6, 2024 #10395

Closed

7 tasks
