-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-11426: [Rust][DataFusion] EXTRACT support #9359
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9359 +/- ##
==========================================
+ Coverage 82.27% 82.29% +0.01%
==========================================
Files 244 244
Lines 55555 55616 +61
==========================================
+ Hits 45708 45767 +59
- Misses 9847 9849 +2
Continue to review full report at Codecov.
|
@@ -169,6 +170,13 @@ pub enum Expr { | |||
}, | |||
/// Represents a reference to all fields in a schema. | |||
Wildcard, | |||
/// Extract date parts (day, hour, minute) from a date / time expression | |||
Extract { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, this could use ScalarFunction
e.g. using "date_part" (which is also supported by PostgreSQL) to avoid an extra enum option, and convert the extract sql syntax to this ScalarFunction
. I am not sure which is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been trying to move them to the ScalarFunction
to avoid having many variants around, mostly because adding a new item to an enum is backward incompatible and thus it may be beneficial to reserve that for operations that cannot be described by ScalarFunction
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just looked into this. The downside currently is w.r.t. performance and being able to utilize Arrow kernels. The ScalarFunction
implementation repeats scalar values, the date part, e.g. 'HOUR' for date_part('HOUR', dt)
will be for repeated for each row.
In PostgreSQL, expressions are not allowed for date_part
/ extract
, date_trunc
etc. :
https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
Note that here the field parameter needs to be a string value, not a name. The valid field names for date_part are the same as for extract.
This is also happening with the existing date_trunc
function where currently the date part (as string) is repeated / matched against for each row and also evaluated per row (see below). That won't work with the hour
kernel for obvious reasons.
let result = range
.map(|i| {
if array.is_null(i) {
Ok(0_i64)
} else {
let date_time = match granularity_array.value(i) {
"second" => array
.value_as_datetime(i)
.and_then(|d| d.with_nanosecond(0)),
"minute" => array
.value_as_datetime(i)
.and_then(|d| d.with_nanosecond(0))
.and_then(|d| d.with_second(0)),
[...]
So I think here we have a few options:
- Refactor/Optimize
ScalarFunction
to also allow for scalar values, be able to check on them + support literals (I guess it should useColumnarValue
instead of justArray
s). - Have a similar (inefficient) implementation for extract from / date_part to compute as currently
date_trunc
. I think that - Refactor/Optimize
ScalarFunction
later and keep the Extract as is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hold my beer: #9376 :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🍺 awesome! 😎
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also agree (belatedly) that having a more efficient implementation of constant function arguments (e.g. #9376) is the way to go!
In terms of adding variants to Expr
I think it will need be done when the semantics of whatever expression is being added can't realistically be expressed as a function (e.g. CASE).
So in this case, given @jorgecarleitao is cranking along with #9376 it seems like perhaps this PR should perhaps try and translate EXTRACT
into a function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it is worth it to wait for the other PR to land and convert it to use the scalar function. Then we could also add the function alias date_part
easily next to the extract
syntax, which are both supported by PostgreSQL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through this and it looks 💯 !
I am unsure about whether adding a new entry to the enum is ideal, as those have major impact to everyone that uses Expr
. @andygrove and @alamb ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Dandandan -- this is looking really nice. I think some additional coverage for the non supported date_parts
would be good.
@@ -169,6 +170,13 @@ pub enum Expr { | |||
}, | |||
/// Represents a reference to all fields in a schema. | |||
Wildcard, | |||
/// Extract date parts (day, hour, minute) from a date / time expression | |||
Extract { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also agree (belatedly) that having a more efficient implementation of constant function arguments (e.g. #9376) is the way to go!
In terms of adding variants to Expr
I think it will need be done when the semantics of whatever expression is being added can't realistically be expressed as a function (e.g. CASE).
So in this case, given @jorgecarleitao is cranking along with #9376 it seems like perhaps this PR should perhaps try and translate EXTRACT
into a function.
match data_type { | ||
DataType::Date32 => { | ||
let array = array.as_any().downcast_ref::<Date32Array>().unwrap(); | ||
Ok(ColumnarValue::Array(Arc::new(hour(array)?))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it confusing that date_part
is passed all the way down in the Exprs / trees only to be ignored in the actual implementation which directly calls hour
. I can see that the expr seems to always be made with DatePart::Hour
but I am not 100% sure.
I am fine with not supporting all the various date parts initially, but I would recommend the following as a way of documenting through code / safe guard against future bugs:
- Add a test for
EXTRACT DAY from timestamp
and show that it generates a useful error - Add a check in this function for
date_part != DataPart::Hour
and throw an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no other datepart for now in the enum, so I guess that might generate some clippy warnings. But we could match directly on the datepart. The errorshould be generated elsewhere already (when building the logical plan), agree makes sense to add a test for that 👍.
rust/datafusion/tests/sql.rs
Outdated
let mut ctx = ExecutionContext::new(); | ||
let sql = "SELECT | ||
EXTRACT(HOUR FROM CAST('2020-01-01' AS DATE)), | ||
EXTRACT(HOUR FROM to_timestamp('2020-09-08T12:00:00+00:00')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding coverage for the other date parts would be valuable here (even if they error)
This adds year support to the temporal module. Year support is something needed for some TCPH queries. Together with `extract` support #9359 we should be able to add `EXTRACT (YEAR FROM dt)` support to DataFusion. Other changes in the PR: * Adding some more tests to `hour` * Removing datatype check from inner loop (there is still one more check in `value_as_datetime` and `value_as_time`) but I leave that for a future PR, as well as further performance improvements (e.g. avoiding the `Int32Builder`, avoiding null checks, adding some microbenchmarks etc.). * Returning an error message on unsupported datatypes (instead of returning an array with nulls). This is backwards incompatible, but I think this is reasonable. Closes #9374 from Dandandan/year Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This adds year support to the temporal module. Year support is something needed for some TCPH queries. Together with `extract` support apache#9359 we should be able to add `EXTRACT (YEAR FROM dt)` support to DataFusion. Other changes in the PR: * Adding some more tests to `hour` * Removing datatype check from inner loop (there is still one more check in `value_as_datetime` and `value_as_time`) but I leave that for a future PR, as well as further performance improvements (e.g. avoiding the `Int32Builder`, avoiding null checks, adding some microbenchmarks etc.). * Returning an error message on unsupported datatypes (instead of returning an array with nulls). This is backwards incompatible, but I think this is reasonable. Closes apache#9374 from Dandandan/year Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
d4608a9
to
356c300
Compare
Most of it converted to use scalar functions now. There is some missing support to support multiple types for one argument but not for the other in scalar functions, will have a look at that later. |
This is now ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is looking great. Thanks @Dandandan -- @jorgecarleitao do you want to take another look or shall we merge this?
|
||
let is_scalar = matches!(array, ColumnarValue::Scalar(_)); | ||
|
||
let array = match array { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the longer term plan will be to handle the Scalar
case more efficiently. This (converting to an array) is fine for now I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed. For now we can use this approach to avoid reimplementing hours/years etc, with a bit of overhead.
Maybe longer term would be nice to have something like Datum
in Arrow in order to both gain some performance and avoid reimplementing things for the scalar case.
@@ -71,6 +71,8 @@ pub enum Signature { | |||
Exact(Vec<DataType>), | |||
/// fixed number of arguments of arbitrary types | |||
Any(usize), | |||
/// One of a list of signatures | |||
OneOf(Vec<Signature>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @seddonm1 I am not sure how this affects your string functions / other postgres function plans
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I missed this but all good. This is actually better :D
@@ -68,6 +68,29 @@ pub fn data_types( | |||
current_types: &[DataType], | |||
signature: &Signature, | |||
) -> Result<Vec<DataType>> { | |||
let valid_types = get_valid_types(signature, current_types)?; | |||
|
|||
if valid_types.contains(¤t_types.to_owned()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't this be ¤t_types
(aka why does it need a call to to_owned
just to immediately borrow from it?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will have a look... I think it was auto generated by the new "extract function" functionality in rust-analyzer (which doesn't work 100% reliably, but still is very useful).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems ¤t_types
isn't possible with contains
, made it use any
instead.
#[tokio::test] | ||
async fn extract_date_part() -> Result<()> { | ||
let mut ctx = ExecutionContext::new(); | ||
let sql = "SELECT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Integration test failure looks like https://issues.apache.org/jira/browse/ARROW-11717. I am retriggering it on this PR |
Thanks @Dandandan -- I am about to run out of time for today but I will plan to merge this in tomorrow if someone doesn't beat me to it |
@Dandandan sadly this PR has merge conflicts (probably from #9509) -- can you possibly rebase it? |
@alamb conflict solved 👍 |
Integration failure looks like https://issues.apache.org/jira/browse/ARROW-11717 and is related to this PR |
This adds year support to the temporal module. Year support is something needed for some TCPH queries. Together with `extract` support apache/arrow#9359 we should be able to add `EXTRACT (YEAR FROM dt)` support to DataFusion. Other changes in the PR: * Adding some more tests to `hour` * Removing datatype check from inner loop (there is still one more check in `value_as_datetime` and `value_as_time`) but I leave that for a future PR, as well as further performance improvements (e.g. avoiding the `Int32Builder`, avoiding null checks, adding some microbenchmarks etc.). * Returning an error message on unsupported datatypes (instead of returning an array with nulls). This is backwards incompatible, but I think this is reasonable. Closes #9374 from Dandandan/year Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This adds year support to the temporal module. Year support is something needed for some TCPH queries. Together with `extract` support apache#9359 we should be able to add `EXTRACT (YEAR FROM dt)` support to DataFusion. Other changes in the PR: * Adding some more tests to `hour` * Removing datatype check from inner loop (there is still one more check in `value_as_datetime` and `value_as_time`) but I leave that for a future PR, as well as further performance improvements (e.g. avoiding the `Int32Builder`, avoiding null checks, adding some microbenchmarks etc.). * Returning an error message on unsupported datatypes (instead of returning an array with nulls). This is backwards incompatible, but I think this is reasonable. Closes apache#9374 from Dandandan/year Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This PR starts implementing support for the `EXTRACT` syntax / execution, to retrieve date parts (hours, minutes, days, etc.) from temporal data types, with the following syntax: `EXTRACT (HOUR FROM dt)` See https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT for reference This is just a first implementation, in following PRs we can extend the support to different date parts, time zones, etc. Closes apache#9359 from Dandandan/temporal_sql Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This adds year support to the temporal module. Year support is something needed for some TCPH queries. Together with `extract` support apache#9359 we should be able to add `EXTRACT (YEAR FROM dt)` support to DataFusion. Other changes in the PR: * Adding some more tests to `hour` * Removing datatype check from inner loop (there is still one more check in `value_as_datetime` and `value_as_time`) but I leave that for a future PR, as well as further performance improvements (e.g. avoiding the `Int32Builder`, avoiding null checks, adding some microbenchmarks etc.). * Returning an error message on unsupported datatypes (instead of returning an array with nulls). This is backwards incompatible, but I think this is reasonable. Closes apache#9374 from Dandandan/year Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
This PR starts implementing support for the `EXTRACT` syntax / execution, to retrieve date parts (hours, minutes, days, etc.) from temporal data types, with the following syntax: `EXTRACT (HOUR FROM dt)` See https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT for reference This is just a first implementation, in following PRs we can extend the support to different date parts, time zones, etc. Closes apache#9359 from Dandandan/temporal_sql Authored-by: Heres, Daniel <danielheres@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #9359 +/- ##
==========================================
+ Coverage 82.27% 82.29% +0.01%
==========================================
Files 244 244
Lines 55555 55616 +61
==========================================
+ Hits 45708 45767 +59
- Misses 9847 9849 +2 ☔ View full report in Codecov by Sentry. |
This PR starts implementing support for the
EXTRACT
syntax / execution, to retrieve date parts (hours, minutes, days, etc.) from temporal data types, with the following syntax:EXTRACT (HOUR FROM dt)
See https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT for reference
This is just a first implementation, in following PRs we can extend the support to different date parts, time zones, etc.