
Upgrade to DataFusion 14.0.0 #903

Merged (21 commits), Nov 15, 2022
Conversation

andygrove (Contributor) commented Nov 3, 2022

Changes in this PR:

  • Use DataFusion 14.0.0
  • Added a copy of the filter_push_down rule from DataFusion 13.0.0, because changes in the DataFusion 14.0.0 version cause regressions for us. We should revert to using DataFusion's version at some point; I filed [ENH] Use filter_push_down rule from DataFusion #908 for this.

ayushdg (Collaborator) commented Nov 3, 2022

Some of the failures here come from the fact that with DataFusion 14.0 the generated plans are slightly different, introducing an extra ddf.a=1 projection step before the filters are applied:

SELECT a FROM parquet_ddf WHERE (b > 5 AND b < 10) OR a = 1

# dask-sql main (datafusion rev)
Projection: parquet_ddf.a, parquet_ddf.b, parquet_ddf.c, parquet_ddf.d
  Filter: parquet_ddf.b > Int64(5) AND parquet_ddf.b < Int64(10) OR parquet_ddf.a = Int64(1)
    TableScan: parquet_ddf projection=[a, b, c, d]

# df 14.0
Projection: parquet_ddf.a, parquet_ddf.b, parquet_ddf.c, parquet_ddf.d
  Filter: (parquet_ddf.b > Int64(5) OR parquet_ddf.a = Int64(1)Int64(1)parquet_ddf.a) AND (parquet_ddf.b < Int64(10) OR parquet_ddf.a = Int64(1)Int64(1)parquet_ddf.a)
    Projection: parquet_ddf.a = Int64(1) AS parquet_ddf.a = Int64(1)Int64(1)parquet_ddf.a, parquet_ddf.a, parquet_ddf.b, parquet_ddf.c, parquet_ddf.d
      TableScan: parquet_ddf projection=[a, b, c, d]

In this case it's safe to push the filter down to the IO, since the df.a=1 op is selecting the same value as the filter. In a more arbitrary case, if we happened to have df.a=other_val, it would not be safe to push the df.a=1 filter down to the IO.

The way Dask handles this today is not by looking at the value, but by generally allowing a subset of operations (irrespective of the values involved) to appear between the IO and filter stages when pushing predicates down.
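
As a purely illustrative sketch (not dask's or dask-sql's actual implementation, and with a made-up allow-list), the kind of check described above could look roughly like this:

# Sketch only: push predicates into the IO layer only if every layer
# sitting between the IO and the filter is one of a small set of
# allow-listed operation types, regardless of the values involved.
from dask.blockwise import Blockwise
from dask.layers import DataFrameIOLayer

# Hypothetical allow-list of "safe" intermediate layer types.
ALLOWED_INTERMEDIATE_LAYERS = (Blockwise,)

def safe_to_push_down(ddf) -> bool:
    """Return True if predicates can safely be pushed into the IO layer."""
    for name, layer in ddf.dask.layers.items():
        if isinstance(layer, DataFrameIOLayer):
            continue  # the IO layer itself is where the filters would land
        if not isinstance(layer, ALLOWED_INTERMEDIATE_LAYERS):
            return False
    return True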

ayushdg (Collaborator) commented Nov 4, 2022

Looking into this a bit more, the step that introduces this additional projection after the table scan comes from the common_sub_expression_eliminate rule. I'm not sure this is really a case where a = 1 is a common sub-expression, so the rule might be being applied erroneously.

Also, while looking through this I realized that the filter_push_down optimizer rule might be able to push these filters down to the table scan for us, allowing us to pass them into the IO. I don't recall if there was a specific reason this rule was not added at the time, but it might be worth exploring re-adding it.

cc: @andygrove

andygrove (Contributor, Author) commented:

Thanks @ayushdg. I am going to work on this today. I have updated this PR to use the official 14.0.0 release of DataFusion now.

andygrove marked this pull request as ready for review on November 8, 2022 at 17:06.
codecov-commenter commented Nov 8, 2022

Codecov Report

Merging #903 (b9dfc08) into main (c7017a7) will decrease coverage by 2.18%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #903      +/-   ##
==========================================
- Coverage   75.18%   72.99%   -2.19%     
==========================================
  Files          73       73              
  Lines        3985     3985              
  Branches      713      713              
==========================================
- Hits         2996     2909      -87     
- Misses        829      912      +83     
- Partials      160      164       +4     
Impacted Files Coverage Δ
dask_sql/physical/rel/logical/join.py 80.67% <ø> (ø)
dask_sql/physical/utils/filter.py 77.84% <ø> (ø)
dask_sql/input_utils/hive.py 18.25% <0.00%> (-81.75%) ⬇️
dask_sql/physical/rex/core/literal.py 60.95% <0.00%> (+2.85%) ⬆️
dask_sql/physical/rel/logical/filter.py 84.84% <0.00%> (+3.03%) ⬆️
dask_sql/_version.py 34.74% <0.00%> (+3.38%) ⬆️


andygrove (Contributor, Author) commented:

I don't understand this failure with Python 3.8 / mac:

FAILED tests/integration/test_fugue.py::test_fsql - AssertionError: DataFrame are different

DataFrame shape mismatch
[left]:  (0, 1)
[right]: (1, 1)

@ayushdg any ideas?

ayushdg (Collaborator) commented Nov 8, 2022

> I don't understand this failure with Python 3.8 / mac:
>
> FAILED tests/integration/test_fugue.py::test_fsql - AssertionError: DataFrame are different
>
> DataFrame shape mismatch
> [left]:  (0, 1)
> [right]: (1, 1)
>
> @ayushdg any ideas?

This is a known flaky test that appears occasionally on macOS and Windows with Python 3.8.

It should be safe to ignore for now.

ayushdg (Collaborator) commented Nov 8, 2022

A bunch of GPU tests seem to be failing though. @charlesbluca, could you take a look when you get the chance?

charlesbluca (Collaborator) commented Nov 9, 2022

Yeah, I can take a look into this.

EDIT:

At first glance, it looks like all the failures are query regressions; I will dig into them individually, but I opened #911 to track adding a CI check that makes changes to the logical plan more prominent, so it's easier to pinpoint where the regressions are coming from.

charlesbluca (Collaborator) commented:

From a quick glance, it looks like q4, q11, and q74 are failing because we are trying to concatenate a dask-cudf dataframe and a CPU Dask dataframe; the underlying cause is that the new optimizations mean we are beginning to use the EmptyRelation plugin more prominently, and it always creates a CPU dataframe:

return DataContainer(
    dd.from_pandas(pd.DataFrame(data, columns=col_names), npartitions=1),
    ColumnContainer(col_names),
)
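
As an illustration only (not the fix adopted in this PR), a GPU-aware variant of that construction could look roughly like this, assuming a hypothetical gpu flag indicating whether the context expects GPU dataframes:

import dask.dataframe as dd
import pandas as pd

def empty_relation_frame(data, col_names, gpu=False):
    # Hypothetical helper: build the empty-relation dataframe on GPU when
    # requested, so a downstream concat does not mix CPU and GPU partitions.
    pdf = pd.DataFrame(data, columns=col_names)
    if gpu:
        import cudf
        import dask_cudf

        return dask_cudf.from_cudf(cudf.from_pandas(pdf), npartitions=1)
    return dd.from_pandas(pdf, npartitions=1)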

q33, q56, q60, and q83 all seem to be failing when attempting to regenerate the HLG with predicate pushdown, which is a little harder to diagnose; I will look into those queries more closely.

@@ -91,7 +91,7 @@ def attempt_predicate_pushdown(ddf: dd.DataFrame) -> dd.DataFrame:
     try:
         return dsk.layers[name]._regenerate_collection(
             dsk,
-            new_kwargs={io_layer: {"filters": filters}},
+            new_kwargs={io_layer: {"filters": filters, "index": False}},
Review comment on this diff (Collaborator):

Looks like the issues with predicate pushdown were stemming from read_parquet automatically setting an index by default, which this kwarg override should prevent.

Chatting with @rjzamora, we agreed that this shouldn't be the default behavior, so we may be able to remove this override later on when changes are made upstream.
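
For illustration only (the path and predicate below are made up), the behavior this override relies on is essentially:

import dask.dataframe as dd

# With the default index handling, read_parquet may add an index-setting
# step to the graph; passing index=False keeps the IO layer "flat", which
# is what the kwarg override above depends on when regenerating the
# collection with filters attached.
ddf = dd.read_parquet("path/to/parquet_ddf", index=False)
filtered = ddf[((ddf.b > 5) & (ddf.b < 10)) | (ddf.a == 1)]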

charlesbluca (Collaborator) left a review:

Thanks @andygrove! We can iterate on the optimizer changes once #908 and #914 are resolved.
