ARROW-11426: [Rust][DataFusion] EXTRACT support #9359

Dandandan · 2021-01-29T10:47:04Z

This PR starts implementing support for the EXTRACT syntax / execution, to retrieve date parts (hours, minutes, days, etc.) from temporal data types, with the following syntax:

EXTRACT (HOUR FROM dt)

See https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT for reference

This is just a first implementation; follow-up PRs can extend support to other date parts, time zones, etc.
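As a rough illustration of what `EXTRACT (HOUR FROM dt)` computes, here is a minimal sketch that assumes the timestamp is stored as seconds since the Unix epoch in UTC. The helper name `extract_hour` is hypothetical and is not part of DataFusion or Arrow; the actual PR delegates to the Arrow `hour` kernel.

```rust
/// Hour of day (0-23) for a UTC timestamp given as seconds since the Unix epoch.
/// Hypothetical helper for illustration only; not a DataFusion or Arrow function.
fn extract_hour(epoch_seconds: i64) -> i64 {
    const SECONDS_PER_DAY: i64 = 86_400;
    const SECONDS_PER_HOUR: i64 = 3_600;
    // rem_euclid keeps the remainder non-negative for timestamps before 1970.
    epoch_seconds.rem_euclid(SECONDS_PER_DAY) / SECONDS_PER_HOUR
}
```

For example, `extract_hour(1_599_566_400)` corresponds to `2020-09-08T12:00:00+00:00`, the timestamp used in the PR's SQL test.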

github-actions · 2021-01-29T10:47:30Z

https://issues.apache.org/jira/browse/ARROW-11426

codecov-io · 2021-01-29T14:47:13Z

Codecov Report

Merging #9359 (40e184b) into master (aebabca) will increase coverage by 0.01%.
The diff coverage is 87.14%.

@@            Coverage Diff             @@
##           master    #9359      +/-   ##
==========================================
+ Coverage   82.27%   82.29%   +0.01%     
==========================================
  Files         244      244              
  Lines       55555    55616      +61     
==========================================
+ Hits        45708    45767      +59     
- Misses       9847     9849       +2

Impacted Files	Coverage Δ
...st/datafusion/src/physical_plan/expressions/mod.rs	`71.42% <ø> (ø)`
rust/datafusion/src/logical_plan/expr.rs	`81.13% <50.00%> (ø)`
...tafusion/src/physical_plan/datetime_expressions.rs	`68.83% <66.66%> (-0.35%)`	⬇️
rust/datafusion/src/sql/planner.rs	`83.22% <80.00%> (-0.02%)`	⬇️
rust/datafusion/src/physical_plan/functions.rs	`73.82% <100.00%> (+1.46%)`	⬆️
rust/datafusion/src/physical_plan/type_coercion.rs	`98.62% <100.00%> (+0.09%)`	⬆️
rust/datafusion/tests/sql.rs	`99.92% <100.00%> (+<0.01%)`	⬆️
rust/datafusion/src/scalar.rs	`53.27% <0.00%> (+1.63%)`	⬆️

Dandandan · 2021-01-29T16:43:56Z

rust/datafusion/src/logical_plan/expr.rs

@@ -169,6 +170,13 @@ pub enum Expr {
    },
    /// Represents a reference to all fields in a schema.
    Wildcard,
+    /// Extract date parts (day, hour, minute) from a date / time expression
+    Extract {


Alternatively, this could use ScalarFunction e.g. using "date_part" (which is also supported by PostgreSQL) to avoid an extra enum option, and convert the extract sql syntax to this ScalarFunction. I am not sure which is better?

I have been trying to move them to the ScalarFunction to avoid having many variants around, mostly because adding a new item to an enum is backward incompatible and thus it may be beneficial to reserve that for operations that cannot be described by ScalarFunction.

I just looked into this. The current downside concerns performance and the ability to use Arrow kernels. The ScalarFunction implementation repeats scalar values: the date part, e.g. 'HOUR' in date_part('HOUR', dt), is repeated for each row.
In PostgreSQL, expressions are not allowed for date_part / extract, date_trunc etc.:

https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT

Note that here the field parameter needs to be a string value, not a name. The valid field names for date_part are the same as for extract.

This is also happening with the existing date_trunc function where currently the date part (as string) is repeated / matched against for each row and also evaluated per row (see below). That won't work with the hour kernel for obvious reasons.

    let result = range
        .map(|i| {
            if array.is_null(i) {
                Ok(0_i64)
            } else {
                let date_time = match granularity_array.value(i) {
                    "second" => array
                        .value_as_datetime(i)
                        .and_then(|d| d.with_nanosecond(0)),
                    "minute" => array
                        .value_as_datetime(i)
                        .and_then(|d| d.with_nanosecond(0))
                        .and_then(|d| d.with_second(0)),
                    [...]

So I think here we have a few options:

Refactor/Optimize ScalarFunction to also allow for scalar values, be able to check on them + support literals (I guess it should use ColumnarValue instead of just Arrays).

Have a similar (inefficient) implementation for extract from / date_part to compute as currently date_trunc. I think that

Refactor/Optimize ScalarFunction later and keep the Extract as is for now.
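To make the per-row cost discussed above concrete, the following sketch parses the granularity string once and then applies a plain integer operation to every row, instead of matching on the string inside the loop. It is only an illustration under simplifying assumptions: timestamps are nanoseconds in an `Option<i64>` slice, nulls stay null, and the names `Granularity`, `parse_granularity` and `truncate_all` are hypothetical rather than DataFusion APIs.

```rust
/// Hypothetical, simplified granularity type (not a DataFusion type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Granularity {
    Second,
    Minute,
}

/// Parse the literal once, outside of the per-row loop.
fn parse_granularity(s: &str) -> Result<Granularity, String> {
    match s {
        "second" => Ok(Granularity::Second),
        "minute" => Ok(Granularity::Minute),
        other => Err(format!("unsupported granularity: {}", other)),
    }
}

/// Truncate nanosecond timestamps to the given granularity.
/// The string match happens once; each row only pays for integer arithmetic.
fn truncate_all(
    granularity: &str,
    values: &[Option<i64>],
) -> Result<Vec<Option<i64>>, String> {
    let unit: i64 = match parse_granularity(granularity)? {
        Granularity::Second => 1_000_000_000,
        Granularity::Minute => 60 * 1_000_000_000,
    };
    Ok(values
        .iter()
        .map(|v| v.map(|ns| ns - ns.rem_euclid(unit)))
        .collect())
}
```

This only works when the granularity is a constant for the whole batch, which is the case the thread is concerned with.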

hold my beer: #9376 :)

🍺 awesome! 😎

I also agree (belatedly) that having a more efficient implementation of constant function arguments (e.g. #9376) is the way to go!

In terms of adding variants to Expr, I think it will need to be done when the semantics of whatever expression is being added can't realistically be expressed as a function (e.g. CASE).

So in this case, given @jorgecarleitao is cranking along with #9376, it seems like this PR should perhaps try to translate EXTRACT into a function.

Yes, I think it is worth it to wait for the other PR to land and convert it to use the scalar function. Then we could also add the function alias date_part easily next to the extract syntax, which are both supported by PostgreSQL.
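A minimal sketch of the translation discussed here, assuming a heavily simplified expression tree. The `Expr` enum and `plan_extract` function below are hypothetical stand-ins for illustration and do not match DataFusion's real `Expr` or SQL planner.

```rust
/// Hypothetical, simplified expression tree (not DataFusion's `Expr`).
#[derive(Debug, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    ScalarFunction { name: String, args: Vec<Expr> },
}

/// Plan `EXTRACT(<field> FROM <source>)` as `date_part('<FIELD>', <source>)`,
/// so that no dedicated `Extract` variant is needed in the expression enum.
fn plan_extract(field: &str, source: Expr) -> Expr {
    Expr::ScalarFunction {
        name: "date_part".to_string(),
        // The date part travels as a literal first argument.
        args: vec![Expr::Literal(field.to_uppercase()), source],
    }
}
```

With this shape, `date_part('HOUR', dt)` and `EXTRACT (HOUR FROM dt)` end up as the same function call, which is why adding the `date_part` alias becomes cheap.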

Dandandan · 2021-01-29T16:46:38Z

This is ready for review now. @alamb @nevi-me tagging you, because I think you would be interested in more temporal support.

jorgecarleitao

I went through this and it looks 💯 !

I am unsure whether adding a new entry to the enum is ideal, as that has a major impact on everyone who uses Expr. @andygrove and @alamb ?

rust/datafusion/src/logical_plan/expr.rs

alamb

Thanks @Dandandan -- this is looking really nice. I think some additional coverage for the unsupported date_parts would be good.

alamb · 2021-01-31T11:22:26Z

rust/datafusion/src/logical_plan/expr.rs

@@ -169,6 +170,13 @@ pub enum Expr {
    },
    /// Represents a reference to all fields in a schema.
    Wildcard,
+    /// Extract date parts (day, hour, minute) from a date / time expression
+    Extract {


I also agree (belatedly) that having a more efficient implementation of constant function arguments (e.g. #9376) is the way to go!

In terms of adding variants to Expr, I think it will need to be done when the semantics of whatever expression is being added can't realistically be expressed as a function (e.g. CASE).

So in this case, given @jorgecarleitao is cranking along with #9376, it seems like this PR should perhaps try to translate EXTRACT into a function.

alamb · 2021-01-31T11:29:22Z

rust/datafusion/src/physical_plan/expressions/extract.rs

+        match data_type {
+            DataType::Date32 => {
+                let array = array.as_any().downcast_ref::<Date32Array>().unwrap();
+                Ok(ColumnarValue::Array(Arc::new(hour(array)?)))


I find it confusing that date_part is passed all the way down in the Exprs / trees only to be ignored in the actual implementation which directly calls hour. I can see that the expr seems to always be made with DatePart::Hour but I am not 100% sure.

I am fine with not supporting all the various date parts initially, but I would recommend the following as a way of documenting through code / safeguarding against future bugs:

Add a test for EXTRACT DAY from timestamp and show that it generates a useful error

Add a check in this function for date_part != DatePart::Hour and throw an error

There is no other date part in the enum for now, so I guess that might generate some clippy warnings. But we could match directly on the date part. The error should already be generated elsewhere (when building the logical plan); agreed, it makes sense to add a test for that 👍.
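The suggestion to match directly on the date part could look roughly like the sketch below. It assumes a hypothetical `DatePart` enum with an extra `Day` variant (the PR's enum only had the hour variant at this point) and uses plain epoch seconds instead of Arrow arrays; none of these names are the PR's actual code.

```rust
/// Hypothetical date part enum; `Day` is added here only to show the error path.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DatePart {
    Hour,
    Day,
}

/// Evaluate the requested date part over epoch-second timestamps,
/// returning an error instead of silently computing the hour.
fn extract(
    part: DatePart,
    epoch_seconds: &[Option<i64>],
) -> Result<Vec<Option<i64>>, String> {
    match part {
        DatePart::Hour => Ok(epoch_seconds
            .iter()
            .map(|v| v.map(|s| s.rem_euclid(86_400) / 3_600))
            .collect()),
        // Any part without an implementation is rejected explicitly.
        other => Err(format!("EXTRACT({:?} FROM ...) is not supported yet", other)),
    }
}
```

Matching on the enum means a newly added variant has to be handled or rejected on purpose, rather than falling through to the hour computation.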

alamb · 2021-01-31T11:29:40Z

rust/datafusion/tests/sql.rs

+    let mut ctx = ExecutionContext::new();
+    let sql = "SELECT
+        EXTRACT(HOUR FROM CAST('2020-01-01' AS DATE)),
+        EXTRACT(HOUR FROM to_timestamp('2020-09-08T12:00:00+00:00'))


I think adding coverage for the other date parts would be valuable here (even if they error)

This adds year support to the temporal module. Year support is something needed for some TPC-H queries. Together with `extract` support #9359 we should be able to add `EXTRACT (YEAR FROM dt)` support to DataFusion.

Other changes in the PR:
* Adding some more tests to `hour`
* Removing datatype check from inner loop (there is still one more check in `value_as_datetime` and `value_as_time`) but I leave that for a future PR, as well as further performance improvements (e.g. avoiding the `Int32Builder`, avoiding null checks, adding some microbenchmarks etc.).
* Returning an error message on unsupported datatypes (instead of returning an array with nulls). This is backwards incompatible, but I think this is reasonable.

Closes #9374 from Dandandan/year

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

Dandandan · 2021-02-17T20:06:22Z

Most of it is converted to use scalar functions now. Scalar functions still lack support for accepting multiple types for one argument but not for the other; I will have a look at that later.

Dandandan · 2021-02-20T10:45:19Z

@jorgecarleitao @alamb

This is now ready for review

alamb

I think it is looking great. Thanks @Dandandan -- @jorgecarleitao do you want to take another look or shall we merge this?

alamb · 2021-02-20T12:43:10Z

rust/datafusion/src/physical_plan/datetime_expressions.rs

+
+    let is_scalar = matches!(array, ColumnarValue::Scalar(_));
+
+    let array = match array {


I assume the longer term plan will be to handle the Scalar case more efficiently. This (converting to an array) is fine for now I think

Yes, indeed. For now we can use this approach to avoid reimplementing hours/years etc., with a bit of overhead.
Longer term, it would be nice to have something like Datum in Arrow, both to gain some performance and to avoid reimplementing things for the scalar case.
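The approach described in this thread (turn a scalar into a one-element array, run the array kernel, then turn the result back into a scalar) can be sketched as follows. The `ColumnarValue` enum here is a simplified hypothetical version over `Option<i64>`, and `apply_kernel` and `hour` are illustrative names, not the DataFusion or Arrow implementations.

```rust
/// Simplified stand-in for a value that is either a column or a single scalar.
#[derive(Debug, Clone, PartialEq)]
enum ColumnarValue {
    Array(Vec<Option<i64>>),
    Scalar(Option<i64>),
}

/// Run an array kernel on either representation, preserving the input's shape.
fn apply_kernel(
    input: ColumnarValue,
    kernel: fn(&[Option<i64>]) -> Vec<Option<i64>>,
) -> ColumnarValue {
    let is_scalar = matches!(input, ColumnarValue::Scalar(_));
    // A scalar is widened to a one-element array so the kernel can be reused.
    let array = match input {
        ColumnarValue::Array(values) => values,
        ColumnarValue::Scalar(value) => vec![value],
    };
    let result = kernel(&array);
    if is_scalar {
        // Narrow the one-element result back to a scalar.
        ColumnarValue::Scalar(result.into_iter().next().flatten())
    } else {
        ColumnarValue::Array(result)
    }
}

/// Toy "hour" kernel over epoch seconds, used only to exercise `apply_kernel`.
fn hour(seconds: &[Option<i64>]) -> Vec<Option<i64>> {
    seconds
        .iter()
        .map(|v| v.map(|s| s.rem_euclid(86_400) / 3_600))
        .collect()
}
```

The overhead mentioned above is the temporary one-element array; a `Datum`-style abstraction would let the kernel accept the scalar directly.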

alamb · 2021-02-20T12:43:49Z

rust/datafusion/src/physical_plan/functions.rs

@@ -71,6 +71,8 @@ pub enum Signature {
    Exact(Vec<DataType>),
    /// fixed number of arguments of arbitrary types
    Any(usize),
+    /// One of a list of signatures
+    OneOf(Vec<Signature>),


FYI @seddonm1 I am not sure how this affects your string functions / other postgres function plans

Yes, I missed this but all good. This is actually better :D
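One way to read the new `OneOf` variant is as a union of the type lists accepted by its inner signatures. The sketch below assumes a cut-down `DataType` and `Signature` with only the variants needed for the example; the function `valid_types` is hypothetical and simpler than the PR's `get_valid_types`, which also takes the current argument types and returns a `Result`.

```rust
/// Cut-down, hypothetical data types for the example.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Date32,
    TimestampNanosecond,
}

/// Cut-down, hypothetical signature type with only two variants.
enum Signature {
    Exact(Vec<DataType>),
    OneOf(Vec<Signature>),
}

/// Collect every argument-type list the signature accepts.
fn valid_types(signature: &Signature) -> Vec<Vec<DataType>> {
    match signature {
        Signature::Exact(types) => vec![types.clone()],
        // `OneOf` accepts whatever any of its alternatives accepts.
        Signature::OneOf(options) => options.iter().flat_map(valid_types).collect(),
    }
}
```

This lets a function such as `date_part` require a string for the first argument while accepting either a date or a timestamp for the second.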

alamb · 2021-02-20T12:45:37Z

rust/datafusion/src/physical_plan/type_coercion.rs

@@ -68,6 +68,29 @@ pub fn data_types(
    current_types: &[DataType],
    signature: &Signature,
 ) -> Result<Vec<DataType>> {
+    let valid_types = get_valid_types(signature, current_types)?;
+
+    if valid_types.contains(&current_types.to_owned()) {


Why can't this be &current_types (aka why does it need a call to to_owned just to immediately borrow from it?)

Will have a look... I think it was auto generated by the new "extract function" functionality in rust-analyzer (which doesn't work 100% reliably, but still is very useful).

It seems &current_types isn't possible with contains, so I made it use any instead.
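The reason, as far as I can tell, is that `contains` on a slice of `Vec<DataType>` expects a `&Vec<DataType>`, while `current_types` is a slice, so the types do not line up without an allocation. A sketch of the `any` alternative, using a hypothetical two-variant `DataType` and a hypothetical helper name `is_valid`:

```rust
/// Hypothetical two-variant data type for the example.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Date32,
}

/// Check whether `current_types` matches one of the valid type lists
/// without copying `current_types` into a new `Vec`.
fn is_valid(valid_types: &[Vec<DataType>], current_types: &[DataType]) -> bool {
    // `valid_types.contains(current_types)` would not type-check here:
    // the element type is `Vec<DataType>`, but `current_types` is a slice.
    valid_types
        .iter()
        .any(|candidate| candidate.as_slice() == current_types)
}
```

Comparing slices element-wise avoids the `to_owned()` call that was only there to satisfy the type checker.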

alamb · 2021-02-20T12:46:49Z

rust/datafusion/tests/sql.rs

+#[tokio::test]
+async fn extract_date_part() -> Result<()> {
+    let mut ctx = ExecutionContext::new();
+    let sql = "SELECT


alamb · 2021-02-20T13:53:46Z

Integration test failure looks like https://issues.apache.org/jira/browse/ARROW-11717. I am retriggering it on this PR

alamb · 2021-02-20T13:59:48Z

Thanks @Dandandan -- I am about to run out of time for today but I will plan to merge this in tomorrow if someone doesn't beat me to it

alamb · 2021-02-21T10:37:44Z

@Dandandan sadly this PR has merge conflicts (probably from #9509) -- can you possibly rebase it?

Dandandan · 2021-02-21T11:10:54Z

@alamb conflict solved 👍

alamb · 2021-02-21T12:05:20Z

Integration failure looks like https://issues.apache.org/jira/browse/ARROW-11717 and is related to this PR

This PR starts implementing support for the `EXTRACT` syntax / execution, to retrieve date parts (hours, minutes, days, etc.) from temporal data types, with the following syntax:

`EXTRACT (HOUR FROM dt)`

See https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT for reference

This is just a first implementation, in following PRs we can extend the support to different date parts, time zones, etc.

Closes apache#9359 from Dandandan/temporal_sql

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

codecov-commenter · 2024-08-17T18:45:06Z

Codecov Report

Attention: Patch coverage is 87.14286% with 9 lines in your changes missing coverage. Please review.

Project coverage is 82.29%. Comparing base (aebabca) to head (40e184b).

Files	Patch %	Lines
...tafusion/src/physical_plan/datetime_expressions.rs	66.66%	7 Missing ⚠️
rust/datafusion/src/logical_plan/expr.rs	50.00%	1 Missing ⚠️
rust/datafusion/src/sql/planner.rs	80.00%	1 Missing ⚠️

Dandandan added 2 commits January 29, 2021 11:41

Start of EXTRACT support for DataFusion

ae0d3a3

fmt

b91632d

github-actions bot added Component: Rust - DataFusion Component: Rust labels Jan 29, 2021

Add temporal module

304008b

Dandandan changed the title ~~ARROW-11426: [Rust][DataFusion] Start of EXTRACT support for DataFusion [WIP]~~ ARROW-11426: [Rust][DataFusion] EXTRACT support for DataFusion [WIP] Jan 29, 2021

Dandandan added 2 commits January 29, 2021 12:57

Test, boilerplate

3eb552b

Add todo!()

8284f1c

Dandandan changed the title ~~ARROW-11426: [Rust][DataFusion] EXTRACT support for DataFusion [WIP]~~ ARROW-11426: [Rust][DataFusion] EXTRACT support [WIP] Jan 29, 2021

Extract implementation

dbbb647

Add support for timestamps without time zone

a471f57

Dandandan changed the title ~~ARROW-11426: [Rust][DataFusion] EXTRACT support [WIP]~~ ARROW-11426: [Rust][DataFusion] EXTRACT support Jan 29, 2021

Dandandan added 5 commits January 29, 2021 16:48

Small test changes

5de73f6

Clippy

7ce35ab

Remove remaining todo!()

1e96898

Improve naming

bda3cab

Undo whitespace changes

0cad90f

Dandandan commented Jan 29, 2021

jorgecarleitao approved these changes Jan 30, 2021

Dandandan added 3 commits January 30, 2021 11:20

Merge changes

0254e45

Comment fixes

ac62940

Support all time units

dc67999

Dandandan mentioned this pull request Jan 30, 2021

ARROW-11439: [Rust] Add year support to temporal kernels #9374

Closed

alamb reviewed Jan 31, 2021

jorgecarleitao force-pushed the master branch from d4608a9 to 356c300 Compare February 14, 2021 12:09

Dandandan added 4 commits February 17, 2021 20:48

WIP date_part 2

4c0dac6

WIP

6311917

WIP

9b97b47

Test fix

a52e074

Dandandan added 3 commits February 20, 2021 10:42

Merge remote-tracking branch 'upstream/master' into temporal_sql

31f8d28

Add support for more complex function types

af8792f

Support both year and hour based on argument

e538726

Dandandan added 2 commits February 20, 2021 11:56

Fmt

ca771de

Return scalar values

a23e4c9

alamb approved these changes Feb 20, 2021

Avoid copy

aa149e7

alamb added the needs-rebase A PR that needs to be rebased by the author label Feb 21, 2021

Dandandan added 2 commits February 21, 2021 12:06

Merge remote-tracking branch 'upstream/master' into temporal_sql

85bb7b0

Fix conflict

40e184b

alamb closed this in 924449e Feb 21, 2021

Dandandan deleted the temporal_sql branch February 21, 2021 15:48

asfimport mentioned this pull request Feb 22, 2021

[Rust][DataFusion] EXTRACT support #27313

Closed
