Update DataFusion and change order of optimization rules #825

andygrove · 2022-09-30T14:38:28Z

There have been a lot of improvements to the optimization rules in DataFusion recently, particularly related to type coercion. This PR picks up those improvements.

Closes #844

andygrove · 2022-09-30T15:41:48Z

dask_sql/physical/rex/core/literal.py

            "TimestampMicrosecond",
            "TimestampNanosecond",
        }:
            unit_mapping = {


It looks like we were not previously exercising this code path since there were some obvious bugs in here

andygrove · 2022-09-30T15:50:57Z

test_literals is failing with a timezone-related issue ... seems to be off by 6 hours

E           AssertionError: numpy array are different
E           
E           numpy array values are different (100.0 %)
E           [index]: [0]
E           [left]:  [1649288001000000000]
E           [right]: [1649266401000000000]
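
The two values in the assertion are nanosecond epoch timestamps, so the size of the discrepancy can be checked directly. A minimal sketch (the variable names are illustrative and not taken from the test suite):

```python
from datetime import timedelta

# Values copied from the failing assertion (nanoseconds since the Unix epoch).
left_ns = 1649288001000000000
right_ns = 1649266401000000000

# Integer-divide down to whole seconds, then wrap in a timedelta for readability.
offset = timedelta(seconds=(left_ns - right_ns) // 10**9)
offset_hours = offset.total_seconds() / 3600

print(offset_hours)  # 6.0
```

A whole-hour difference like this is what you would expect if the same wall-clock value were interpreted as local time on one side and as UTC on the other.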

codecov-commenter · 2022-09-30T16:19:37Z

Codecov Report

Merging #825 (48fdec8) into main (c31a6eb) will increase coverage by 0.39%.
The diff coverage is 60.86%.

@@            Coverage Diff             @@
##             main     #825      +/-   ##
==========================================
+ Coverage   77.04%   77.43%   +0.39%     
==========================================
  Files          71       71              
  Lines        3594     3599       +5     
  Branches      632      634       +2     
==========================================
+ Hits         2769     2787      +18     
+ Misses        696      679      -17     
- Partials      129      133       +4

Impacted Files	Coverage Δ
dask_sql/physical/rex/core/call.py	`81.03% <33.33%> (-0.72%)`	⬇️
dask_sql/physical/rex/core/literal.py	`58.09% <83.33%> (+12.09%)`	⬆️
dask_sql/mappings.py	`84.31% <100.00%> (+2.13%)`	⬆️
dask_sql/physical/rel/logical/filter.py	`81.81% <0.00%> (-3.04%)`	⬇️
dask_sql/_version.py	`34.00% <0.00%> (+1.44%)`	⬆️

sarahyurick · 2022-09-30T18:26:23Z

dask_sql/physical/rex/core/literal.py

            literal_value, timezone = rex.getTimestampValue()
            if timezone and timezone != "UTC":
                raise ValueError("Non UTC timezones not supported")
-            literal_type = SqlTypeName.TIMESTAMP


after this if block, you can have:

    elif timezone is None:
        literal_value = datetime.fromtimestamp(literal_value // 10**9)

(make sure to do a from datetime import datetime)

btw, I chose 10**9 to convert to seconds, as suggested by https://stackoverflow.com/questions/45423917/pandas-errno-75-value-too-large-for-defined-data-type, but maybe milliseconds (or something else) would be generally better.

Thanks @sarahyurick that fixes the regression locally for me. I have pushed this change.
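
For reference, here is a small sketch of what the suggested conversion does. The helper name `ns_to_datetime` and its `tz` parameter are assumptions made for illustration, not part of dask-sql. Note that `datetime.fromtimestamp` without a `tz` argument returns a naive datetime in the host's local time zone, so the result varies between machines; passing `timezone.utc` makes it reproducible.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 10**9


def ns_to_datetime(ns, tz=None):
    """Convert nanoseconds since the epoch to a datetime (second precision).

    Hypothetical helper. With tz=None the result is a naive datetime in the
    machine's local time zone, so the output differs from host to host.
    """
    return datetime.fromtimestamp(ns // NANOS_PER_SECOND, tz=tz)


value_ns = 1649288001000000000  # left-hand value from the failing assertion

local_naive = ns_to_datetime(value_ns)              # depends on host time zone
utc_aware = ns_to_datetime(value_ns, timezone.utc)  # same on every host

print(utc_aware.isoformat())  # 2022-04-06T23:33:21+00:00
```

The integer division discards any sub-second part, which is one reason a finer unit such as milliseconds might be preferable in general.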

andygrove · 2022-10-03T15:13:18Z

There is a GPU CI Failure:

TypeError: can only concatenate str (not "datetime.timedelta") to str

@sarahyurick Do you know what we need to do to fix this?

sarahyurick · 2022-10-03T16:00:57Z

> There is a GPU CI Failure:
> TypeError: can only concatenate str (not "datetime.timedelta") to str
> @sarahyurick Do you know what we need to do to fix this?

Hmm, maybe you could try adding a literal_value = str(literal_value) at the end of the elif block?

andygrove · 2022-10-05T18:49:42Z

tests/integration/test_join.py

    select count(*) from (
-    select * from df_simple
+    select a, b from df_simple
    intersect
-    select * from df_simple
+    select a, b from df_simple
    intersect
-    select * from df_wide
+    select a, b from df_wide
    ) hot_item
    limit 100


This query was invalid. intersect requires that all relations have the same number of columns. For example, postgres would fail with ERROR: each INTERSECT query must have the same number of columns. DataFusion now has a check for this.
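
The column-count rule can be illustrated with any SQL engine. The sketch below uses Python's built-in `sqlite3` module rather than DataFusion or dask-sql, and the table contents are made up for the example; only the table names come from the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE df_simple (a INTEGER, b INTEGER);
    CREATE TABLE df_wide (a INTEGER, b INTEGER, c INTEGER);
    INSERT INTO df_simple VALUES (1, 2), (3, 4);
    INSERT INTO df_wide VALUES (1, 2, 9), (5, 6, 9);
    """
)

# SELECT * yields 2 columns on one side and 3 on the other, so SQLite rejects it.
try:
    conn.execute("SELECT * FROM df_simple INTERSECT SELECT * FROM df_wide")
    mismatch_rejected = False
except sqlite3.OperationalError:
    mismatch_rejected = True

# Naming the columns explicitly gives both sides the same shape.
(matching_rows,) = conn.execute(
    """
    SELECT count(*) FROM (
        SELECT a, b FROM df_simple
        INTERSECT
        SELECT a, b FROM df_wide
    ) hot_item
    """
).fetchone()

print(mismatch_rejected, matching_rows)  # True 1
```

Listing the columns explicitly, as the updated test does, keeps the query valid regardless of how many columns each table has.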

sarahyurick · 2022-10-05T19:52:35Z

dask_sql/physical/rex/core/literal.py

            if timezone and timezone != "UTC":
                raise ValueError("Non UTC timezones not supported")
+            elif timezone is None:
+                literal_value = datetime.fromtimestamp(literal_value // 10**9)


If gpuCI still fails, you can try adding literal_value = str(literal_value) after this line (but still in the elif block).

> If gpuCI still fails, you can try adding literal_value = str(literal_value) after this line (but still in the elif block).

Thanks Sarah. I tried that but this still fails. Here is more info on the failure:

    self = <dask_sql.physical.rex.core.call.ReduceOperation object at 0x7f4717a097c0>
    operands = ('2001-03-09', datetime.timedelta(days=90)), kwargs = {}

        def reduce(self, *operands, **kwargs):
            if len(operands) > 1:
                if any(
                    map(
                        lambda op: is_frame(op) & pd.api.types.is_datetime64_dtype(op),
                        operands,
                    )
                ):
                    operands = tuple(map(as_timelike, operands))

    >           return reduce(partial(self.operation, **kwargs), operands)
    E           TypeError: can only concatenate str (not "datetime.timedelta") to str

    dask_sql/physical/rex/core/call.py:137: TypeError

Perhaps def reduce or def as_timelike need updating to support the interval type that the Rust code is returning?

It looks like @charlesbluca may be familiar with these methods and maybe has some insight here.
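
The failure in the traceback can be reproduced outside dask-sql with the same operands. The sketch below is only an illustration of the idea of casting string operands to datetimes before reducing: `to_timelike` is a hypothetical helper, not dask-sql's `as_timelike`, and it uses the standard library instead of the `np.timedelta64` type that a later commit in this PR mentions.

```python
from datetime import datetime, timedelta
from functools import reduce
import operator

operands = ("2001-03-09", timedelta(days=90))  # as shown in the traceback

# Reproduce the failure: adding a timedelta to a str is not defined.
try:
    reduce(operator.add, operands)
    raised = False
except TypeError:
    raised = True


def to_timelike(op):
    """Hypothetical helper: parse ISO date strings, pass other values through."""
    if isinstance(op, str):
        return datetime.fromisoformat(op)
    return op


# Once the string is parsed, datetime + timedelta is well defined.
result = reduce(operator.add, map(to_timelike, operands))
print(raised, result.date())  # True 2001-06-07
```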

> Curious if this change is still needed

I don't think we need the string casting, but it seems like the datetime.fromtimestamp is necessary to get the proper timezone for the datetime

charlesbluca

LGTM

ayushdg

Minor question but otherwise lgtm!

ayushdg · 2022-10-10T15:05:52Z

dask_sql/physical/rex/core/literal.py

            if timezone and timezone != "UTC":
                raise ValueError("Non UTC timezones not supported")
+            elif timezone is None:
+                literal_value = datetime.fromtimestamp(literal_value // 10**9)


Curious if this change is still needed

andygrove added 2 commits September 30, 2022 08:37

Update DataFusion and change order of optimization rules

906e70d

partial fix to Python code

c9211aa

andygrove commented Sep 30, 2022

add comment

196479d

andygrove added 3 commits September 30, 2022 10:44

save progress

44a62de

revert change

59d6916

Revert chrono

e5d52ee

sarahyurick reviewed Sep 30, 2022

add suggestion

b51c75d

andygrove marked this pull request as ready for review September 30, 2022 20:10

andygrove requested review from ayushdg, charlesbluca, galipremsagar and jdye64 as code owners September 30, 2022 20:10

python lint

7f5d7e6

sarahyurick mentioned this pull request Oct 4, 2022

Resolve test_literals() #812

Merged

andygrove added 2 commits October 5, 2022 12:47

bump datafusion version again

5b21585

fmt

48b962b

andygrove commented Oct 5, 2022

andygrove added 3 commits October 5, 2022 12:53

upmerge

8fe5a34

add additional test

3d000c6

lint

aec3cb7

sarahyurick reviewed Oct 5, 2022

andygrove added 3 commits October 5, 2022 14:02

fix GPU CI?

4926796

bump to 13.0.0-rc1

02ed556

stop building aggregate schema and let DataFusion do that

1d04626

charlesbluca added 5 commits October 7, 2022 07:17

Merge remote-tracking branch 'origin/main' into bump-df-0930

62f20f8

Add handling for string to datetime casting, switch to np.timedelta64

b2cf5ce

Un-xfail passing tests

8639e5d

Resolve style failures

7328664

timedelta64 doesn't accept floats

9f078c7

andygrove mentioned this pull request Oct 7, 2022

[BUG] DataFusion regression in optimizer related to casting #844

Closed

charlesbluca and others added 5 commits October 7, 2022 15:03

Merge branch 'main' into bump-df-0930

7735f67

upmerge:

ee72003

fix regression by running SimplifyExpressions again after converting subqueries to joins

82264fa

merge

c2237ac

use official DataFusion 13.0.0 release

48fdec8

charlesbluca approved these changes Oct 10, 2022

ayushdg approved these changes Oct 10, 2022

galipremsagar approved these changes Oct 10, 2022

ayushdg merged commit 6e21be2 into dask-contrib:main Oct 10, 2022
