TAJO-1415: Window frame support by sirpkt · Pull Request #454 · apache/tajo

sirpkt · 2015-03-23T08:39:09Z

This patch supports all ROWS and RANGE window frames.

Cases when the window frame is not applied

- no ORDER BY clause is used
- built-in window functions that do not support a window frame: row_number, rank, dense_rank, percent_rank, cume_dist, tile, lag, lead

Cases when the window frame should be applied

- other built-in window functions: first_value, last_value, nth_value
- normal aggregation functions

Based on the above, this patch distinguishes the following three window function types:

1. built-in window functions without window frame support
2. built-in window functions with window frame support
3. normal aggregation functions used as window functions; in this case, the window frame should be supported
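The three function types above can be sketched as a small classifier. This is only an illustration in Python, not the patch's Java code: the enum and function names (`FunctionType`, `classify_function`, `frame_applies`) are hypothetical, and the function lists are copied from the description above.

```python
from enum import Enum


class FunctionType(Enum):
    # Hypothetical names; the enum in the patch may be named differently.
    NON_FRAMABLE_WINDOW = 1  # built-in window function, frame not supported
    FRAMABLE_WINDOW = 2      # built-in window function, frame supported
    AGGREGATION = 3          # normal aggregate used as a window function


# Function names as listed in the PR description.
NON_FRAMABLE = {"row_number", "rank", "dense_rank", "percent_rank",
                "cume_dist", "tile", "lag", "lead"}
FRAMABLE = {"first_value", "last_value", "nth_value"}


def classify_function(name: str) -> FunctionType:
    """Map a function name to one of the three types."""
    name = name.lower()
    if name in NON_FRAMABLE:
        return FunctionType.NON_FRAMABLE_WINDOW
    if name in FRAMABLE:
        return FunctionType.FRAMABLE_WINDOW
    # Anything else is assumed to be a normal aggregation function.
    return FunctionType.AGGREGATION


def frame_applies(name: str, has_order_by: bool) -> bool:
    """Per the description: no frame without ORDER BY or for type 1."""
    if not has_order_by:
        return False
    return classify_function(name) is not FunctionType.NON_FRAMABLE_WINDOW
```

For example, `frame_applies("sum", True)` is true, while `frame_applies("rank", True)` and `frame_applies("sum", False)` are both false.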

It further distinguishes the following four window frame types:

1. the entire partition
2. from the start of the partition to a moving end point relative to the current row
3. from a moving start point relative to the current row to the end of the partition
4. a sliding frame that moves as the current row position varies
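One way to read the four frame types is as a function of whether each bound is fixed to the partition edge. The sketch below assumes bounds are given as plain strings such as "UNBOUNDED PRECEDING" or "5 PRECEDING"; the enum and helper names are hypothetical and are not taken from the patch.

```python
from enum import Enum


class FrameType(Enum):
    # Hypothetical names for the four cases listed above.
    ENTIRE_PARTITION = 1
    START_TO_MOVING_END = 2
    MOVING_START_TO_END = 3
    SLIDING = 4


def classify_frame(start: str, end: str) -> FrameType:
    """Classify a frame by which of its bounds are pinned to the partition."""
    fixed_start = start.strip().upper() == "UNBOUNDED PRECEDING"
    fixed_end = end.strip().upper() == "UNBOUNDED FOLLOWING"
    if fixed_start and fixed_end:
        return FrameType.ENTIRE_PARTITION
    if fixed_start:
        return FrameType.START_TO_MOVING_END
    if fixed_end:
        return FrameType.MOVING_START_TO_END
    return FrameType.SLIDING
```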

Case 1 is the same as the previous handling of window functions.
Case 2 is handled by incrementally terminating the aggregation function: for every row, merge() and terminate() of the given function are called.
Case 3 is handled almost the same way as case 2, except that rows are fed to the function from the end of the partition toward the start of the frame, i.e., in reverse order.
Case 4 is handled with a two-pass approach: for each row, a small loop feeds the rows of its frame to the function. I think this is inevitable, since aggregation functions do not support sliding-window aggregation.
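Cases 2 to 4 can be illustrated for ROWS frames with a toy aggregate. This is a rough Python sketch, not the WindowAggExec code: `SumAgg` is a stand-in with the merge()/terminate() shape mentioned above, the three function names are hypothetical, and offsets are assumed to be signed row counts (negative for PRECEDING, positive for FOLLOWING, 0 for CURRENT ROW).

```python
class SumAgg:
    """Toy aggregate exposing merge() and terminate()."""

    def __init__(self):
        self.total, self.count = 0, 0

    def merge(self, value):
        self.total += value
        self.count += 1

    def terminate(self):
        # SUM over an empty frame is NULL in SQL, modeled here as None.
        return self.total if self.count else None


def growing_from_start(values, end_offset):
    """Case 2: frame is [partition start, current row + end_offset]."""
    agg, fed, out = SumAgg(), 0, []
    for i in range(len(values)):
        end = min(i + end_offset, len(values) - 1)
        while fed <= end:  # feed only rows that newly enter the frame
            agg.merge(values[fed])
            fed += 1
        out.append(agg.terminate())
    return out


def shrinking_to_end(values, start_offset):
    """Case 3: frame is [current row + start_offset, partition end]."""
    n = len(values)
    agg, fed, out = SumAgg(), n - 1, [None] * n
    for i in range(n - 1, -1, -1):  # walk the partition in reverse order
        start = max(i + start_offset, 0)
        while fed >= start:
            agg.merge(values[fed])
            fed -= 1
        out[i] = agg.terminate()
    return out


def sliding(values, start_offset, end_offset):
    """Case 4: both bounds move, so the frame is re-aggregated per row."""
    n, out = len(values), []
    for i in range(n):
        lo, hi = max(i + start_offset, 0), min(i + end_offset, n - 1)
        agg = SumAgg()
        for j in range(lo, hi + 1):  # small inner loop over the frame
            agg.merge(values[j])
        out.append(agg.terminate())
    return out
```

Cases 2 and 3 touch each row once per partition, while case 4 costs roughly the frame width per row, which matches the trade-off described above.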

All of the above is implemented for ROWS first
and then extended to support RANGE by including the rows that have the same ORDER BY value as the current row in the computation of the window function.
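The difference between ROWS and RANGE described above shows up when several rows share an ORDER BY value. The following Python sketch assumes a frame of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW over rows already sorted by the order key; `range_running_sum` is a hypothetical name, and the code only illustrates the peer-inclusion idea, not the patch's implementation.

```python
def range_running_sum(order_keys, values):
    """Running sum where all rows with the same order key share one result.

    Assumes order_keys is already sorted, as it would be after ORDER BY.
    """
    n = len(values)
    out = [None] * n
    total = 0
    i = 0
    while i < n:
        # Find the group of peers: rows whose order key equals row i's key.
        j = i
        while j < n and order_keys[j] == order_keys[i]:
            total += values[j]
            j += 1
        # Every peer sees the sum up to and including the whole peer group.
        for k in range(i, j):
            out[k] = total
        i = j
    return out
```

With order keys `[1, 1, 2, 3, 3]` and values `[10, 20, 30, 40, 50]`, this gives `[30, 30, 60, 150, 150]`, whereas the corresponding ROWS frame would give `[10, 30, 60, 100, 150]`.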

This patch includes the following changes:

- The parser can handle integer offsets for PRECEDING and FOLLOWING.
- ExprAnnotator reflects window frame information in WindowFunctionEval, including default value handling.
- WindowAggExec handles ROWS and RANGE with window frame support.
- Parameter checking is included in the parser and ExprAnnotator.
- last_value is re-implemented as a WindowAggFunc, and the first_value implementation becomes simpler.
- Window-related classes in tajo-plan have the new prefix 'Logical' to distinguish them from the same-named classes in tajo-algebra.
- plan.proto is modified to support the data structures that distinguish function types and frame types.
- Test cases for window frames are added.

sirpkt · 2015-03-23T08:40:09Z

I checked that 'mvn clean install' passes on my laptop.

jihoonson · 2015-04-05T16:34:35Z

Thanks. I'll review today.

jihoonson · 2015-04-06T15:50:51Z

Sorry for the late review. It seems the patch has gone stale.
Would you mind rebasing it?

sirpkt · 2015-04-07T07:35:23Z

Thank you for the review, @jihoonson
I just rebased and resolved some conflicts from recent patches.

jihoonson · 2015-04-07T07:50:00Z

Thanks. I'll finish my review soon.

jihoonson · 2015-04-12T14:40:32Z

tajo-algebra/src/main/java/org/apache/tajo/algebra/WindowSpec.java

We don't need to separately maintain the start bound type and end bound type. Please see the discussion at #13 (comment).

Hmm... I think the discussion you linked is not complete,
because UNBOUNDED_PRECEDING is only available as a start bound type while UNBOUNDED_FOLLOWING is only available as an end bound type.
I tested this in psql as follows

mydb=# select id, max(value) over (partition by id rows between unbounded following and unbounded following) from test3;
ERROR:  frame start cannot be UNBOUNDED FOLLOWING
LINE 1: ...id, max(value) over (partition by id rows between unbounded ...
                                                             ^

and saw the error I expected.

So, I think the start bound type and the end bound type need to be separated.

Right. I hope that you read a couple of comments in the thread of the link. As I commented there, there is only one rule: the end bound cannot precede the start bound. So, naturally, the start bound cannot be unbounded following.
I think that we can make some code simpler with the unified bound type.

I don't think one rule is enough, because the start bound and the end bound can be the same.
For example, we can use rows between current row and current row as follows

mydb=# select id, max(value) over (partition by id rows between current row and current row) from test3;
 id | max
----+-----
  1 |   1
  1 |   2
  2 |   5
  2 |   6
  2 |   7
(5 rows)

So, actually there are two rules.

- The end bound cannot precede the start bound, but it can be the same as the start bound.
- UNBOUNDED PRECEDING cannot be an end bound, and UNBOUNDED FOLLOWING cannot be a start bound.

However, I totally agree with you that the code becomes much simpler if we use the unified bound type. So, I'll try to unify the bound type.
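The two rules can be checked against a single, unified bound type roughly as follows. This is a Python sketch under assumptions, not the patch's code: `BoundType`, `WindowBound`, and `validate_frame` are hypothetical names, and a real engine may apply additional or looser checks than the two rules stated in this thread.

```python
import math
from dataclasses import dataclass
from enum import Enum


class BoundType(Enum):
    UNBOUNDED_PRECEDING = 1
    PRECEDING = 2
    CURRENT_ROW = 3
    FOLLOWING = 4
    UNBOUNDED_FOLLOWING = 5


@dataclass
class WindowBound:
    """One bound type shared by the start and the end of a frame."""
    type: BoundType
    offset: int = 0  # only meaningful for PRECEDING and FOLLOWING


def _position(bound: WindowBound) -> float:
    """Position of a bound relative to the current row (row 0)."""
    if bound.type is BoundType.UNBOUNDED_PRECEDING:
        return -math.inf
    if bound.type is BoundType.PRECEDING:
        return -bound.offset
    if bound.type is BoundType.CURRENT_ROW:
        return 0
    if bound.type is BoundType.FOLLOWING:
        return bound.offset
    return math.inf


def validate_frame(start: WindowBound, end: WindowBound) -> None:
    """Raise ValueError if the frame breaks either rule from the thread."""
    # Rule 2: restrictions on where the UNBOUNDED bounds may appear.
    if start.type is BoundType.UNBOUNDED_FOLLOWING:
        raise ValueError("frame start cannot be UNBOUNDED FOLLOWING")
    if end.type is BoundType.UNBOUNDED_PRECEDING:
        raise ValueError("frame end cannot be UNBOUNDED PRECEDING")
    # Rule 1: the end may equal the start but must not precede it.
    if _position(end) < _position(start):
        raise ValueError("frame end cannot precede frame start")
```

Note that UNBOUNDED FOLLOWING as both start and end passes the ordering check alone, which is why the second rule is needed even with a unified bound type.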

Oh, you are right. Thanks.

jihoonson · 2015-04-12T15:56:01Z

@sirpkt, thanks for the great patch!
Thanks to your nice patch overview and plentiful comments, I was able to review it easily.
Overall, your patch looks good to me. I left some comments.
Thanks!

sirpkt · 2015-04-12T23:40:05Z

Thank you for the review, @jihoonson !
I'll reflect your comments soon.

sirpkt · 2015-04-16T01:33:55Z

I updated the patch, and it passed 'mvn clean install' on my laptop.

Reflecting review comments

- Refactor WindowAggExec code
- Unify window frame start and end bound type
- Remove unnecessary 'static'
- Add parameter test in CurrentValue

NOT CHANGED YET:
- row_number() with DISTINCT still disabled
- LEAD() still AggFunction not WindowAggFunc

jihoonson · 2015-04-17T10:20:49Z

tajo-plan/src/main/java/org/apache/tajo/plan/ExprAnnotator.java

How about moving these comments to the code where the enum is defined?

It looks better!
I moved the comments about Function Type and Frame Type to the code where the enum is defined.

jihoonson · 2015-04-17T10:25:52Z

Thanks @sirpkt. Overall, the updated patch looks good to me.
I'm still reviewing this patch. I'll finish soon.

jihoonson · 2015-04-17T11:04:53Z

tajo-algebra/src/main/java/org/apache/tajo/algebra/WindowSpec.java

The WindowStartBound and WindowEndBound classes appear to contain the same information.
Do we need to keep them separate?

You're right, @jihoonson.
We don't need to keep them separately.
I'll merge them into WindowBound.

jihoonson · 2015-04-18T10:57:17Z

@sirpkt thanks for your nice work. Since this is a new feature that will be very useful and widely used, I've reviewed it carefully. Please consider my comments. Thanks.

sirpkt · 2015-04-27T09:03:31Z

I have addressed the thoughtful and detailed comments from @jihoonson.

jihoonson · 2015-04-27T11:06:37Z

tajo-catalog/tajo-catalog-drivers/pom.xml

This appears to have been removed already in https://issues.apache.org/jira/browse/TAJO-1442.

You're right, @jihoonson.
I'll remove the line.

jihoonson · 2015-04-27T11:55:34Z

@sirpkt thanks for updating your patch.
I've tested the following query and found that the result of Tajo is different from that of pgsql.

Tajo

default> select l_partkey, first_value(l_linestatus) over (partition by l_partkey order by l_tax rows between 5 preceding and 8 following) from lineitem;
...
l_partkey,  ?windowfunction
-------------------------------
1,  
1,  
1,  
1,  
1,  
1,  F
1,  O
1,  O
1,  F
1,  F
1,  O
1,  F
1,  F
1,  F
1,  F
1,  O
1,  O
1,  F
1,  F
1,  F
1,  O
1,  O
1,  F
1,  F
1,  O
1,  F
1,  F
1,  F
1,  O
1,  F
1,  O
2,  
2,  
2,  
2,  
2,  
2,  O
...

PostgreSQL

jihoonson=# select l_partkey, first_value(l_linestatus) over (partition by l_partkey order by l_tax rows between 5 preceding and 8 following) from lineitem;
 l_partkey | first_value 
-----------+-------------
         1 | O
         1 | O
         1 | O
         1 | O
         1 | O
         1 | O
         1 | F
         1 | O
         1 | F
         1 | F
         1 | F
         1 | F
         1 | O
         1 | F
         1 | F
         1 | O
         1 | O
         1 | F
         1 | F
         1 | F
         1 | F
         1 | F
         1 | O
         1 | O
         1 | O
         1 | F
         1 | O
         1 | F
         1 | F
         1 | F
         1 | O
         2 | O
...

As you can see, some values are null in Tajo.
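For reference, the PostgreSQL output above is consistent with the usual ROWS semantics, where a frame that would start before the partition is clamped to the partition's first row, so first_value is non-null even for the first five rows. A minimal Python sketch of that expectation follows; `first_value_rows` is a hypothetical helper, and the sample data is made up rather than taken from lineitem.

```python
def first_value_rows(values, preceding, following):
    """first_value over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    `values` is one partition, already sorted by the ORDER BY key.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo = max(i - preceding, 0)       # clamp the start to the partition
        hi = min(i + following, n - 1)   # clamp the end to the partition
        out.append(values[lo] if lo <= hi else None)
    return out


# Made-up partition; with 5 PRECEDING, rows 0..5 all see the first row.
sample = ["O", "F", "O", "F", "F", "O", "F"]
result = first_value_rows(sample, 5, 8)
```

Under these assumptions, the first six results all equal the partition's first value, and only the seventh row starts to see the second value.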

sirpkt · 2015-04-28T10:57:40Z

Thank you for finding this, @jihoonson.
The difference between the patch's result and PostgreSQL's result is caused by a bug.
I'll fix the bug and add more tests for window frames.

sirpkt · 2015-04-28T11:20:49Z

I rebased and updated the patch.

- bug fix for the handling of built-in window functions with frame support
- add more tests for PRECEDING and FOLLOWING with constant offsets

jihoonson · 2015-04-29T14:40:47Z

Thanks @sirpkt, but query results are still different.
Would you check it again?

Tajo

default> select l_partkey, l_tax, first_value(l_linestatus) over (partition by l_partkey order by l_tax rows between 5 preceding and 8 following) from lineitem;
l_partkey,  l_tax,  ?windowfunction
-------------------------------
1,  0.0,  F
1,  0.0,  F
1,  0.01,  F
1,  0.01,  F
1,  0.01,  F
1,  0.02,  F
1,  0.02,  O
1,  0.02,  F
1,  0.03,  O
1,  0.03,  F
1,  0.04,  F
1,  0.04,  O
1,  0.05,  F
1,  0.05,  F
1,  0.06,  F
1,  0.06,  O
1,  0.06,  O
1,  0.06,  F
1,  0.06,  F
1,  0.07,  F
1,  0.07,  O
1,  0.07,  F
1,  0.07,  F
1,  0.07,  O
1,  0.07,  O
1,  0.08,  F
1,  0.08,  O
1,  0.08,  F
1,  0.08,  F
1,  0.08,  F
1,  0.08,  F
2,  0.0,  O
2,  0.0,  O
2,  0.0,  O
2,  0.01,  O
2,  0.01,  O
...

PostgreSQL

jihoonson=# select l_partkey, l_tax, first_value(l_linestatus) over (partition by l_partkey order by l_tax rows between 5 preceding and 8 following) from lineitem;
 l_partkey | l_tax | first_value 
-----------+-------+-------------
         1 |     0 | O
         1 |     0 | O
         1 |  0.01 | O
         1 |  0.01 | O
         1 |  0.01 | O
         1 |  0.02 | O
         1 |  0.02 | F
         1 |  0.02 | O
         1 |  0.03 | F
         1 |  0.03 | F
         1 |  0.04 | F
         1 |  0.04 | F
         1 |  0.05 | O
         1 |  0.05 | F
         1 |  0.06 | F
         1 |  0.06 | O
         1 |  0.06 | O
         1 |  0.06 | F
         1 |  0.06 | F
         1 |  0.07 | F
         1 |  0.07 | F
         1 |  0.07 | F
         1 |  0.07 | O
         1 |  0.07 | O
         1 |  0.07 | O
         1 |  0.08 | F
         1 |  0.08 | O
         1 |  0.08 | F
         1 |  0.08 | F
         1 |  0.08 | F
         1 |  0.08 | O
         2 |     0 | O
         2 |     0 | O
         2 |     0 | O
         2 |  0.01 | O
         2 |  0.01 | O
         2 |  0.01 | O
         2 |  0.02 | O
         2 |  0.02 | O
         2 |  0.02 | O
...

sirpkt added 6 commits March 22, 2015 22:38

window frame ROWS support is added, RANGE is not supported yet

3e2cfab

Merge remote-tracking branch 'upstream/master' into TAJO-1415

ddd6797

bug fix during master merge

ddc7c1b

support for RANGE window frame

aa97dbb

Merge remote-tracking branch 'upstream/master' into TAJO-1415

7b21415

Fix bug for no order by case, where window function SHOULD work on th…

973d99f

…e entire partition

sirpkt added 4 commits March 24, 2015 17:45

Merge branch 'master' into TAJO-1415

53edd69

fix typo

883ec99

Merge remote-tracking branch 'upstream/master' into TAJO-1415

ff21b1e

remove comments

7895593

sirpkt added 2 commits April 7, 2015 15:41

Merge remote-tracking branch 'upstream/master' into TAJO-1415

5ece3a4

Conflicts are resolved: tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/WindowAggExec.java tajo-plan/src/main/java/org/apache/tajo/plan/expr/WindowFunctionEval.java

fix query result according to default window frame setting for aggreg…

35e9870

…ation functions

jihoonson reviewed Apr 12, 2015

sirpkt added 2 commits April 13, 2015 09:18

Merge remote-tracking branch 'upstream/master' into TAJO-1415

ac0cc75

Reflecting review comments

76caa39

- Refactor WindowAggExec code
- Unify window frame start and end bound type
- Remove unnecessary 'static'
- Add parameter test in CurrentValue

NOT CHANGED YET:
- row_number() with DISTINCT still disabled
- LEAD() still AggFunction not WindowAggFunc

sirpkt added 2 commits April 13, 2015 19:56

Merge remote-tracking branch 'upstream/master' into TAJO-1415

3231f9a

Merge remote-tracking branch 'upstream/master' into TAJO-1415

72a4519

jihoonson reviewed Apr 17, 2015

sirpkt added 3 commits April 20, 2015 10:39

Merge branch 'master' into TAJO-1415

7f45445

Merge remote-tracking branch 'upstream/master' into TAJO-1415

6dd02ac

rebase and reflect comments from review

114f4c5

- remove unused import, variables and unnecessary 'static' keywords
- refactor WindowAggExec with finer functions and more intuitive names
- unify bound types for start and end bound
- refactor testWindowQuery to follow Tajo test code convention

jihoonson reviewed Apr 27, 2015

Merge remote-tracking branch 'upstream/master' into TAJO-1415

a2d4def

reflect recent comments on the patch

4d5916e

- bug fix for the handling of built-in window functions with frame support
- add more test codes for the preceding and following with constants

Merge remote-tracking branch 'upstream/master' into TAJO-1415

c634c4b
