Update DataFusion and change order of optimization rules #825
Conversation
| "TimestampMicrosecond", | ||
| "TimestampNanosecond", | ||
| }: | ||
| unit_mapping = { |
It looks like we were not previously exercising this code path, since there were some obvious bugs in here.
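For context, here is a hypothetical sketch of the kind of timestamp-unit mapping the hunk above touches; the actual keys, values, and structure in dask-sql may differ.

```python
# Hypothetical illustration only; not the actual dask-sql mapping.
unit_mapping = {
    "TimestampSecond": "s",
    "TimestampMillisecond": "ms",
    "TimestampMicrosecond": "us",
    "TimestampNanosecond": "ns",
}


def to_numpy_timestamp_dtype(sql_type_name: str) -> str:
    # e.g. "TimestampNanosecond" -> "datetime64[ns]"
    return f"datetime64[{unit_mapping[sql_type_name]}]"


print(to_numpy_timestamp_dtype("TimestampNanosecond"))  # datetime64[ns]
```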
Codecov Report
@@ Coverage Diff @@
## main #825 +/- ##
==========================================
+ Coverage 77.04% 77.43% +0.39%
==========================================
Files 71 71
Lines 3594 3599 +5
Branches 632 634 +2
==========================================
+ Hits 2769 2787 +18
+ Misses 696 679 -17
- Partials 129 133 +4
literal_value, timezone = rex.getTimestampValue()
if timezone and timezone != "UTC":
    raise ValueError("Non UTC timezones not supported")
literal_type = SqlTypeName.TIMESTAMP
after this if block, you can have:

    elif timezone is None:
        literal_value = datetime.fromtimestamp(literal_value // 10**9)

(make sure to do a `from datetime import datetime`)
By the way, I chose `10**9` to convert from nanoseconds to seconds, as suggested by https://stackoverflow.com/questions/45423917/pandas-errno-75-value-too-large-for-defined-data-type, but maybe milliseconds (or something else) would be generally better.
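Putting the two comments together, a minimal sketch of the suggested handling, assuming `literal_value` arrives as an epoch value in nanoseconds and `timezone` comes from `rex.getTimestampValue()`. The wrapper function and the sample value below are illustrative, not the actual dask-sql code:

```python
from datetime import datetime


def convert_timestamp_literal(literal_value, timezone):
    # Illustrative wrapper; in dask-sql this logic sits inline in the literal conversion code.
    if timezone and timezone != "UTC":
        raise ValueError("Non UTC timezones not supported")
    elif timezone is None:
        # 10**9 converts an epoch value in nanoseconds to seconds
        literal_value = datetime.fromtimestamp(literal_value // 10**9)
    return literal_value


ns = 984_096_000_000_000_000  # 2001-03-09 00:00:00 UTC, expressed in nanoseconds
print(convert_timestamp_literal(ns, None))  # naive datetime in local time for that instant
```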
Thanks @sarahyurick, that fixes the regression locally for me. I have pushed this change.
There is a GPU CI failure. @sarahyurick, do you know what we need to do to fix this?
Hmm, maybe you could try adding a …
      select count(*) from (
    -     select * from df_simple
    +     select a, b from df_simple
          intersect
    -     select * from df_simple
    +     select a, b from df_simple
          intersect
    -     select * from df_wide
    +     select a, b from df_wide
      ) hot_item
      limit 100
This query was invalid: `intersect` requires that all relations have the same number of columns. For example, Postgres would fail with `ERROR: each INTERSECT query must have the same number of columns`. DataFusion now has a check for this.
if timezone and timezone != "UTC":
    raise ValueError("Non UTC timezones not supported")
elif timezone is None:
    literal_value = datetime.fromtimestamp(literal_value // 10**9)
If gpuCI still fails, you can try adding `literal_value = str(literal_value)` after this line (but still in the `elif` block).
Thanks, Sarah. I tried that, but this still fails. Here is more info on the failure:
self = <dask_sql.physical.rex.core.call.ReduceOperation object at 0x7f4717a097c0>
operands = ('2001-03-09', datetime.timedelta(days=90)), kwargs = {}

    def reduce(self, *operands, **kwargs):
        if len(operands) > 1:
            if any(
                map(
                    lambda op: is_frame(op) & pd.api.types.is_datetime64_dtype(op),
                    operands,
                )
            ):
                operands = tuple(map(as_timelike, operands))
>       return reduce(partial(self.operation, **kwargs), operands)
E       TypeError: can only concatenate str (not "datetime.timedelta") to str

dask_sql/physical/rex/core/call.py:137: TypeError
Perhaps `def reduce` or `def as_timelike` need updating to support the interval type that the Rust code is returning?
It looks like @charlesbluca may be familiar with these methods and maybe has some insight here.
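For reference, a minimal standalone repro of the error above (hypothetical, outside of dask-sql): reducing with `+` over a plain string and a `timedelta` raises the same `TypeError`, and parsing the date literal first makes the addition well-defined.

```python
import datetime

import pandas as pd

operands = ("2001-03-09", datetime.timedelta(days=90))

try:
    operands[0] + operands[1]  # str + timedelta, as in the failing reduce() call
except TypeError as exc:
    print(exc)  # can only concatenate str (not "datetime.timedelta") to str

# Converting the date literal to a timestamp-like value first fixes the addition,
# which is roughly what as_timelike is meant to do for datetime operands.
print(pd.Timestamp(operands[0]) + operands[1])  # 2001-06-07 00:00:00
```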
Curious if this change is still needed
I don't think we need the string casting, but it seems like `datetime.fromtimestamp` is necessary to get the proper timezone for the datetime.
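To illustrate the `datetime.fromtimestamp` behaviour being discussed (the nanosecond value below is hypothetical): without an explicit `tz` it returns a naive local-time datetime, while passing `tz=timezone.utc` returns an aware UTC datetime.

```python
from datetime import datetime, timezone

ns = 984_096_000_000_000_000  # 2001-03-09 00:00:00 UTC, in nanoseconds
seconds = ns // 10**9

print(datetime.fromtimestamp(seconds))                   # naive datetime in local time
print(datetime.fromtimestamp(seconds, tz=timezone.utc))  # 2001-03-09 00:00:00+00:00
```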
charlesbluca left a comment:
LGTM
ayushdg left a comment:
Minor question but otherwise lgtm!
if timezone and timezone != "UTC":
    raise ValueError("Non UTC timezones not supported")
elif timezone is None:
    literal_value = datetime.fromtimestamp(literal_value // 10**9)
Curious if this change is still needed
There have been a lot of improvements to the optimization rules in DataFusion recently, particularly related to type coercion. This PR picks up those improvements.
Closes #844