Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11426: [Rust][DataFusion] EXTRACT support #9359

Closed
wants to merge 29 commits into from

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Jan 29, 2021

This PR starts implementing support for the EXTRACT syntax / execution, to retrieve date parts (hours, minutes, days, etc.) from temporal data types, with the following syntax:

EXTRACT (HOUR FROM dt)

See https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT for reference

This is just a first implementation, in following PRs we can extend the support to different date parts, time zones, etc.

@github-actions
Copy link

@Dandandan Dandandan changed the title ARROW-11426: [Rust][DataFusion] Start of EXTRACT support for DataFusion [WIP] ARROW-11426: [Rust][DataFusion] EXTRACT support for DataFusion [WIP] Jan 29, 2021
@Dandandan Dandandan changed the title ARROW-11426: [Rust][DataFusion] EXTRACT support for DataFusion [WIP] ARROW-11426: [Rust][DataFusion] EXTRACT support [WIP] Jan 29, 2021
@codecov-io
Copy link

codecov-io commented Jan 29, 2021

Codecov Report

Merging #9359 (40e184b) into master (aebabca) will increase coverage by 0.01%.
The diff coverage is 87.14%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9359      +/-   ##
==========================================
+ Coverage   82.27%   82.29%   +0.01%     
==========================================
  Files         244      244              
  Lines       55555    55616      +61     
==========================================
+ Hits        45708    45767      +59     
- Misses       9847     9849       +2     
Impacted Files Coverage Δ
...st/datafusion/src/physical_plan/expressions/mod.rs 71.42% <ø> (ø)
rust/datafusion/src/logical_plan/expr.rs 81.13% <50.00%> (ø)
...tafusion/src/physical_plan/datetime_expressions.rs 68.83% <66.66%> (-0.35%) ⬇️
rust/datafusion/src/sql/planner.rs 83.22% <80.00%> (-0.02%) ⬇️
rust/datafusion/src/physical_plan/functions.rs 73.82% <100.00%> (+1.46%) ⬆️
rust/datafusion/src/physical_plan/type_coercion.rs 98.62% <100.00%> (+0.09%) ⬆️
rust/datafusion/tests/sql.rs 99.92% <100.00%> (+<0.01%) ⬆️
rust/datafusion/src/scalar.rs 53.27% <0.00%> (+1.63%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aebabca...40e184b. Read the comment docs.

@Dandandan Dandandan changed the title ARROW-11426: [Rust][DataFusion] EXTRACT support [WIP] ARROW-11426: [Rust][DataFusion] EXTRACT support Jan 29, 2021
@@ -169,6 +170,13 @@ pub enum Expr {
},
/// Represents a reference to all fields in a schema.
Wildcard,
/// Extract date parts (day, hour, minute) from a date / time expression
Extract {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, this could use ScalarFunction e.g. using "date_part" (which is also supported by PostgreSQL) to avoid an extra enum option, and convert the extract sql syntax to this ScalarFunction. I am not sure which is better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been trying to move them to the ScalarFunction to avoid having many variants around, mostly because adding a new item to an enum is backward incompatible and thus it may be beneficial to reserve that for operations that cannot be described by ScalarFunction.

Copy link
Contributor Author

@Dandandan Dandandan Jan 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just looked into this. The downside currently is w.r.t. performance and being able to utilize Arrow kernels. The ScalarFunction implementation repeats scalar values, the date part, e.g. 'HOUR' for date_part('HOUR', dt) will be for repeated for each row.
In PostgreSQL, expressions are not allowed for date_part / extract, date_trunc etc. :

https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT

Note that here the field parameter needs to be a string value, not a name. The valid field names for date_part are the same as for extract.

This is also happening with the existing date_trunc function where currently the date part (as string) is repeated / matched against for each row and also evaluated per row (see below). That won't work with the hour kernel for obvious reasons.

    let result = range
        .map(|i| {
            if array.is_null(i) {
                Ok(0_i64)
            } else {
                let date_time = match granularity_array.value(i) {
                    "second" => array
                        .value_as_datetime(i)
                        .and_then(|d| d.with_nanosecond(0)),
                    "minute" => array
                        .value_as_datetime(i)
                        .and_then(|d| d.with_nanosecond(0))
                        .and_then(|d| d.with_second(0)),
[...]

So I think here we have a few options:

  • Refactor/Optimize ScalarFunction to also allow for scalar values, be able to check on them + support literals (I guess it should use ColumnarValue instead of just Arrays).
  • Have a similar (inefficient) implementation for extract from / date_part to compute as currently date_trunc. I think that
  • Refactor/Optimize ScalarFunction later and keep the Extract as is for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hold my beer: #9376 :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🍺 awesome! 😎

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also agree (belatedly) that having a more efficient implementation of constant function arguments (e.g. #9376) is the way to go!

In terms of adding variants to Expr I think it will need be done when the semantics of whatever expression is being added can't realistically be expressed as a function (e.g. CASE).

So in this case, given @jorgecarleitao is cranking along with #9376 it seems like perhaps this PR should perhaps try and translate EXTRACT into a function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it is worth it to wait for the other PR to land and convert it to use the scalar function. Then we could also add the function alias date_part easily next to the extract syntax, which are both supported by PostgreSQL.

@Dandandan
Copy link
Contributor Author

This is ready for review now. @alamb @nevi-me tagging you, because I think you would be interested in more temporal support.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this and it looks 💯 !

I am unsure about whether adding a new entry to the enum is ideal, as those have major impact to everyone that uses Expr. @andygrove and @alamb ?

rust/datafusion/src/logical_plan/expr.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Dandandan -- this is looking really nice. I think some additional coverage for the non supported date_parts would be good.

@@ -169,6 +170,13 @@ pub enum Expr {
},
/// Represents a reference to all fields in a schema.
Wildcard,
/// Extract date parts (day, hour, minute) from a date / time expression
Extract {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also agree (belatedly) that having a more efficient implementation of constant function arguments (e.g. #9376) is the way to go!

In terms of adding variants to Expr I think it will need be done when the semantics of whatever expression is being added can't realistically be expressed as a function (e.g. CASE).

So in this case, given @jorgecarleitao is cranking along with #9376 it seems like perhaps this PR should perhaps try and translate EXTRACT into a function.

match data_type {
DataType::Date32 => {
let array = array.as_any().downcast_ref::<Date32Array>().unwrap();
Ok(ColumnarValue::Array(Arc::new(hour(array)?)))
Copy link
Contributor

@alamb alamb Jan 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it confusing that date_part is passed all the way down in the Exprs / trees only to be ignored in the actual implementation which directly calls hour. I can see that the expr seems to always be made with DatePart::Hour but I am not 100% sure.

I am fine with not supporting all the various date parts initially, but I would recommend the following as a way of documenting through code / safe guard against future bugs:

  1. Add a test for EXTRACT DAY from timestamp and show that it generates a useful error
  2. Add a check in this function for date_part != DataPart::Hour and throw an error

Copy link
Contributor Author

@Dandandan Dandandan Jan 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no other datepart for now in the enum, so I guess that might generate some clippy warnings. But we could match directly on the datepart. The errorshould be generated elsewhere already (when building the logical plan), agree makes sense to add a test for that 👍.

let mut ctx = ExecutionContext::new();
let sql = "SELECT
EXTRACT(HOUR FROM CAST('2020-01-01' AS DATE)),
EXTRACT(HOUR FROM to_timestamp('2020-09-08T12:00:00+00:00'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding coverage for the other date parts would be valuable here (even if they error)

alamb pushed a commit that referenced this pull request Feb 3, 2021
This adds year support to the temporal module. Year support is something needed for some TCPH queries.
Together with `extract` support
#9359

we should be able to add `EXTRACT (YEAR FROM dt)` support to DataFusion.

Other changes in the PR:
* Adding some more tests to `hour`
* Removing datatype check from inner loop (there is still one more check in `value_as_datetime` and `value_as_time`) but I leave that for a future PR, as well as further performance improvements (e.g. avoiding the `Int32Builder`, avoiding null checks, adding some microbenchmarks etc.).
* Returning an error message on unsupported datatypes (instead of returning an array with nulls). This is backwards incompatible, but I think this is reasonable.

Closes #9374 from Dandandan/year

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
nevi-me pushed a commit to nevi-me/arrow that referenced this pull request Feb 13, 2021
This adds year support to the temporal module. Year support is something needed for some TCPH queries.
Together with `extract` support
apache#9359

we should be able to add `EXTRACT (YEAR FROM dt)` support to DataFusion.

Other changes in the PR:
* Adding some more tests to `hour`
* Removing datatype check from inner loop (there is still one more check in `value_as_datetime` and `value_as_time`) but I leave that for a future PR, as well as further performance improvements (e.g. avoiding the `Int32Builder`, avoiding null checks, adding some microbenchmarks etc.).
* Returning an error message on unsupported datatypes (instead of returning an array with nulls). This is backwards incompatible, but I think this is reasonable.

Closes apache#9374 from Dandandan/year

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
@Dandandan
Copy link
Contributor Author

Most of it converted to use scalar functions now. There is some missing support to support multiple types for one argument but not for the other in scalar functions, will have a look at that later.

@Dandandan
Copy link
Contributor Author

@jorgecarleitao @alamb

This is now ready for review

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is looking great. Thanks @Dandandan -- @jorgecarleitao do you want to take another look or shall we merge this?


let is_scalar = matches!(array, ColumnarValue::Scalar(_));

let array = match array {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the longer term plan will be to handle the Scalar case more efficiently. This (converting to an array) is fine for now I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, indeed. For now we can use this approach to avoid reimplementing hours/years etc, with a bit of overhead.
Maybe longer term would be nice to have something like Datum in Arrow in order to both gain some performance and avoid reimplementing things for the scalar case.

@@ -71,6 +71,8 @@ pub enum Signature {
Exact(Vec<DataType>),
/// fixed number of arguments of arbitrary types
Any(usize),
/// One of a list of signatures
OneOf(Vec<Signature>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @seddonm1 I am not sure how this affects your string functions / other postgres function plans

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I missed this but all good. This is actually better :D

@@ -68,6 +68,29 @@ pub fn data_types(
current_types: &[DataType],
signature: &Signature,
) -> Result<Vec<DataType>> {
let valid_types = get_valid_types(signature, current_types)?;

if valid_types.contains(&current_types.to_owned()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't this be &current_types (aka why does it need a call to to_owned just to immediately borrow from it?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will have a look... I think it was auto generated by the new "extract function" functionality in rust-analyzer (which doesn't work 100% reliably, but still is very useful).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems &current_types isn't possible with contains, made it use any instead.

#[tokio::test]
async fn extract_date_part() -> Result<()> {
let mut ctx = ExecutionContext::new();
let sql = "SELECT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Feb 20, 2021

Integration test failure looks like https://issues.apache.org/jira/browse/ARROW-11717. I am retriggering it on this PR

@alamb
Copy link
Contributor

alamb commented Feb 20, 2021

Thanks @Dandandan -- I am about to run out of time for today but I will plan to merge this in tomorrow if someone doesn't beat me to it

@alamb
Copy link
Contributor

alamb commented Feb 21, 2021

@Dandandan sadly this PR has merge conflicts (probably from #9509) -- can you possibly rebase it?

@alamb alamb added the needs-rebase A PR that needs to be rebased by the author label Feb 21, 2021
@Dandandan
Copy link
Contributor Author

@alamb conflict solved 👍

@alamb
Copy link
Contributor

alamb commented Feb 21, 2021

Integration failure looks like https://issues.apache.org/jira/browse/ARROW-11717 and is related to this PR

@alamb alamb closed this in 924449e Feb 21, 2021
@Dandandan Dandandan deleted the temporal_sql branch February 21, 2021 15:48
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
This adds year support to the temporal module. Year support is something needed for some TCPH queries.
Together with `extract` support
apache/arrow#9359

we should be able to add `EXTRACT (YEAR FROM dt)` support to DataFusion.

Other changes in the PR:
* Adding some more tests to `hour`
* Removing datatype check from inner loop (there is still one more check in `value_as_datetime` and `value_as_time`) but I leave that for a future PR, as well as further performance improvements (e.g. avoiding the `Int32Builder`, avoiding null checks, adding some microbenchmarks etc.).
* Returning an error message on unsupported datatypes (instead of returning an array with nulls). This is backwards incompatible, but I think this is reasonable.

Closes #9374 from Dandandan/year

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This adds year support to the temporal module. Year support is something needed for some TCPH queries.
Together with `extract` support
apache#9359

we should be able to add `EXTRACT (YEAR FROM dt)` support to DataFusion.

Other changes in the PR:
* Adding some more tests to `hour`
* Removing datatype check from inner loop (there is still one more check in `value_as_datetime` and `value_as_time`) but I leave that for a future PR, as well as further performance improvements (e.g. avoiding the `Int32Builder`, avoiding null checks, adding some microbenchmarks etc.).
* Returning an error message on unsupported datatypes (instead of returning an array with nulls). This is backwards incompatible, but I think this is reasonable.

Closes apache#9374 from Dandandan/year

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This PR starts implementing support for the `EXTRACT` syntax / execution, to retrieve date parts (hours, minutes, days, etc.) from temporal data types, with the following syntax:

`EXTRACT (HOUR FROM dt)`

See https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT for reference

This is just a first implementation, in following PRs we can extend the support to different date parts, time zones, etc.

Closes apache#9359 from Dandandan/temporal_sql

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
This adds year support to the temporal module. Year support is something needed for some TCPH queries.
Together with `extract` support
apache#9359

we should be able to add `EXTRACT (YEAR FROM dt)` support to DataFusion.

Other changes in the PR:
* Adding some more tests to `hour`
* Removing datatype check from inner loop (there is still one more check in `value_as_datetime` and `value_as_time`) but I leave that for a future PR, as well as further performance improvements (e.g. avoiding the `Int32Builder`, avoiding null checks, adding some microbenchmarks etc.).
* Returning an error message on unsupported datatypes (instead of returning an array with nulls). This is backwards incompatible, but I think this is reasonable.

Closes apache#9374 from Dandandan/year

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
This PR starts implementing support for the `EXTRACT` syntax / execution, to retrieve date parts (hours, minutes, days, etc.) from temporal data types, with the following syntax:

`EXTRACT (HOUR FROM dt)`

See https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT for reference

This is just a first implementation, in following PRs we can extend the support to different date parts, time zones, etc.

Closes apache#9359 from Dandandan/temporal_sql

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 87.14286% with 9 lines in your changes missing coverage. Please review.

Project coverage is 82.29%. Comparing base (aebabca) to head (40e184b).

Files Patch % Lines
...tafusion/src/physical_plan/datetime_expressions.rs 66.66% 7 Missing ⚠️
rust/datafusion/src/logical_plan/expr.rs 50.00% 1 Missing ⚠️
rust/datafusion/src/sql/planner.rs 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9359      +/-   ##
==========================================
+ Coverage   82.27%   82.29%   +0.01%     
==========================================
  Files         244      244              
  Lines       55555    55616      +61     
==========================================
+ Hits        45708    45767      +59     
- Misses       9847     9849       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Rust - DataFusion Component: Rust needs-rebase A PR that needs to be rebased by the author
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants