Skip to content

Conversation

@mustafasrepo
Copy link
Contributor

Which issue does this PR close?

Closes #4076.

Rationale for this change

With this change, we can now run built-in window functions with custom window frames such as queries in the form

SELECT
     FIRST_VALUE(c4) OVER(ORDER BY c9) as first_value1,
     FIRST_VALUE(c4) OVER(ORDER BY c9) as first_value2,
     LAST_VALUE(c4) OVER(ORDER BY c9) as last_value1,
     LAST_VALUE(c4) OVER(ORDER BY c9) as last_value2
     FROM aggregate_test_100
     ORDER BY c9
     LIMIT 5

What changes are included in this PR?

  • We added range calculation support for Builtin-window functions FIRST_VALUE, LAST_VALUE, NTH_VALUE (For other window functions what is inside window frame isn’t important https://www.postgresql.org/docs/current/functions-window.html). To be able to use range calculation code for both window functions and aggregate functions, range calculation code is moved from aggregate.rs to window_expr.rs.
  • Added tests for Built-in Window Functions

Are there any user-facing changes?

N.A

@github-actions github-actions bot added core Core DataFusion crate physical-expr Changes to the physical-expr crates labels Nov 2, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mustafasrepo -- I don't have time to review this one today but I will put it on my queue for tomorrow

cc @jimexist

@alamb
Copy link
Contributor

alamb commented Nov 3, 2022

I again ran out of time today but it is on my list for first thing tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @mustafasrepo and @ozankabak

I went through this code quite carefully and it is really nice and easy to follow 🏅

In case it is not obvious to others reading this PR, this PR adds support for the ROWS BETWEEN 1 PRECEDING and 2 FOLLOWING window clauses.

I tried it out using some simple data and postgres:

postgres=# create table foo as values (1,2), (3,4), (3,2), (2,1), (null, 0);
SELECT 5
postgres=# select first_value(column1) over (order by column2), last_value(column1) over (order by column2) from foo;
 first_value | last_value 
-------------+------------
             |           
             |          2
             |          3
             |          3
             |          3
(5 rows)

Before this change:

select first_value(column1) over (order by column2 ROWS BETWEEN 1 PRECEDING and 2 FOLLOWING) from foo;
+--------------------------+
| FIRST_VALUE(foo.column1) |
+--------------------------+
|                          |
|                          |
|                          |
|                          |
|                          |
+--------------------------+
5 rows in set. Query took 0.003 seconds.

After this PR it gets the same answer as postgres 👍

select first_value(column1) over (order by column2 ROWS BETWEEN 1 PRECEDING and 2 FOLLOWING) from foo;
+--------------------------+
| FIRST_VALUE(foo.column1) |
+--------------------------+
|                          |
|                          |
| 2                        |
| 1                        |
| 3                        |
+--------------------------+

self.window_frame.clone()
};
let mut row_wise_results: Vec<ScalarValue> = vec![];
for partition_range in &partition_points {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reorganization is very nice and makes the code much easier to read . Very nice 👍


/// We use start and end bounds to calculate current row's starting and ending range.
/// This function supports different modes, but we currently do not support window calculation for GROUPS inside window frames.
fn calculate_range(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this logic to be very straightforward and easy to follow 👍

Ok((start, end))
}
WindowFrameUnits::Groups => Err(DataFusionError::NotImplemented(
"Window frame for groups is not implemented".to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Nov 4, 2022

I'll plan to merge this PR tomorrow unless there are other comments raised.

@alamb alamb merged commit 238e179 into apache:master Nov 5, 2022
@alamb
Copy link
Contributor

alamb commented Nov 5, 2022

Again, really nice work @mustafasrepo and @ozankabak -- thank you very much

@ursabot
Copy link

ursabot commented Nov 5, 2022

Benchmark runs are scheduled for baseline = 6d00bd9 and contender = 238e179. 238e179 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Dandandan pushed a commit to yuuch/arrow-datafusion that referenced this pull request Nov 5, 2022
…ache#4078)

* refactor running window

* remove unnecessary changes

* implement suggested changes

* Minor refactors to improve readability

* Refactor according to reviews

* minor changes

* Remove unnecessary into/collect calls

* convert evaluate_inside_range result to ScalarValue

* Simplify evaluate function of BuiltInWindowExpr

Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>
Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Nov 5, 2022
…ache#4078)

* refactor running window

* remove unnecessary changes

* implement suggested changes

* Minor refactors to improve readability

* Refactor according to reviews

* minor changes

* Remove unnecessary into/collect calls

* convert evaluate_inside_range result to ScalarValue

* Simplify evaluate function of BuiltInWindowExpr

Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wrong result for FIRST_VALUE AND LAST_VALUE window functions

4 participants