Skip to content

Conversation

@andygrove
Copy link
Contributor

@andygrove andygrove commented Sep 30, 2022

There have been a lot of improvements to the optimization rules in DataFusion recently, particularly related to type coercion. This PR picks up those improvements.

Closes #844

"TimestampMicrosecond",
"TimestampNanosecond",
}:
unit_mapping = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we were not previously exercising this code path since there were some obvious bugs in here

@andygrove
Copy link
Contributor Author

test_literals is failing with a timezone-related issue ... seems to be off by 6 hours

E           AssertionError: numpy array are different
E           
E           numpy array values are different (100.0 %)
E           [index]: [0]
E           [left]:  [1649288001000000000]
E           [right]: [1649266401000000000]

@codecov-commenter
Copy link

codecov-commenter commented Sep 30, 2022

Codecov Report

Merging #825 (48fdec8) into main (c31a6eb) will increase coverage by 0.39%.
The diff coverage is 60.86%.

@@            Coverage Diff             @@
##             main     #825      +/-   ##
==========================================
+ Coverage   77.04%   77.43%   +0.39%     
==========================================
  Files          71       71              
  Lines        3594     3599       +5     
  Branches      632      634       +2     
==========================================
+ Hits         2769     2787      +18     
+ Misses        696      679      -17     
- Partials      129      133       +4     
Impacted Files Coverage Δ
dask_sql/physical/rex/core/call.py 81.03% <33.33%> (-0.72%) ⬇️
dask_sql/physical/rex/core/literal.py 58.09% <83.33%> (+12.09%) ⬆️
dask_sql/mappings.py 84.31% <100.00%> (+2.13%) ⬆️
dask_sql/physical/rel/logical/filter.py 81.81% <0.00%> (-3.04%) ⬇️
dask_sql/_version.py 34.00% <0.00%> (+1.44%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

literal_value, timezone = rex.getTimestampValue()
if timezone and timezone != "UTC":
raise ValueError("Non UTC timezones not supported")
literal_type = SqlTypeName.TIMESTAMP
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after this if block, you can have:

elif timezone is None:
    literal_value = datetime.fromtimestamp(literal_value // 10**9)

(make sure to do a from datetime import datetime)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, I chose 10**9 to convert to seconds, as suggested by https://stackoverflow.com/questions/45423917/pandas-errno-75-value-too-large-for-defined-data-type, but maybe milliseconds (or something else) would be generally better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sarahyurick that fixes the regression locally for me. I have pushed this change.

@andygrove andygrove marked this pull request as ready for review September 30, 2022 20:10
@andygrove
Copy link
Contributor Author

There is a GPU CI Failure:

TypeError: can only concatenate str (not "datetime.timedelta") to str

@sarahyurick Do you know what we need to do to fix this?

@sarahyurick
Copy link
Collaborator

There is a GPU CI Failure:

TypeError: can only concatenate str (not "datetime.timedelta") to str

@sarahyurick Do you know what we need to do to fix this?

Hmm, maybe you could try adding a literal_value = str(literal_value) at the end of the elif block?

Comment on lines 331 to 338
select count(*) from (
select * from df_simple
select a, b from df_simple
intersect
select * from df_simple
select a, b from df_simple
intersect
select * from df_wide
select a, b from df_wide
) hot_item
limit 100
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query was invalid. intersect requires that all relations have the same number of columns. For example, postgres would fail with ERROR: each INTERSECT query must have the same number of columns. DataFusion now has a check for this.

if timezone and timezone != "UTC":
raise ValueError("Non UTC timezones not supported")
elif timezone is None:
literal_value = datetime.fromtimestamp(literal_value // 10**9)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If gpuCI still fails, you can try adding literal_value = str(literal_value) after this line (but still in the elif block).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If gpuCI still fails, you can try adding literal_value = str(literal_value) after this line (but still in the elif block).

Thanks Sarah. I tried that but this still fails. Here is more info on the failure:

self = <dask_sql.physical.rex.core.call.ReduceOperation object at 0x7f4717a097c0>
operands = ('2001-03-09', datetime.timedelta(days=90)), kwargs = {}

    def reduce(self, *operands, **kwargs):
        if len(operands) > 1:
            if any(
                map(
                    lambda op: is_frame(op) & pd.api.types.is_datetime64_dtype(op),
                    operands,
                )
            ):
                operands = tuple(map(as_timelike, operands))
>           return reduce(partial(self.operation, **kwargs), operands)
E           TypeError: can only concatenate str (not "datetime.timedelta") to str

dask_sql/physical/rex/core/call.py:137: TypeError

Perhaps def reduce or def as_timelike need updating to support the interval type that the Rust code is returning?

It looks like @charlesbluca may be familiar with these methods and maybe has some insight here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if this change is still needed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need the string casting, but it seems like the datetime.fromtimestamp is necessary to get the proper timezone for the datetime

Copy link
Collaborator

@charlesbluca charlesbluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor question but otherwise lgtm!

if timezone and timezone != "UTC":
raise ValueError("Non UTC timezones not supported")
elif timezone is None:
literal_value = datetime.fromtimestamp(literal_value // 10**9)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if this change is still needed

@ayushdg ayushdg merged commit 6e21be2 into dask-contrib:main Oct 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] DataFusion regression in optimizer related to casting

6 participants