This repository was archived by the owner on May 12, 2021. It is now read-only.

TAJO-1415: Window frame support #454

Open
sirpkt wants to merge 22 commits into apache:master from sirpkt:TAJO-1415

@sirpkt
Contributor

sirpkt commented Mar 23, 2015

This patch supports all ROWS and RANGE window frames.

Cases where a window frame is not applied

  • no ORDER BY clause is used
  • built-in window functions that do not support window frames: row_number, rank, dense_rank, percent_rank, cume_dist, ntile, lag, lead

Cases where a window frame should be applied

  • other built-in window functions: first_value, last_value, nth_value
  • normal aggregation functions

Based on the above, this patch distinguishes the following three window function types:

  1. built-in window functions without window frame support
  2. built-in window functions with window frame support
  3. normal aggregation functions used as window functions, for which window frames should be supported

It further distinguishes the following four window frame types:

  1. the entire partition
  2. from the start of the partition to a moving end point relative to the current row
  3. from a moving start point relative to the current row to the end of the partition
  4. a sliding frame whose start and end both move with the current row

Case 1 is handled the same way window functions were handled before this patch.
Case 2 is handled by incremental termination of the aggregation function: for every row, merge() and terminate() of the given function are called.
Case 3 is handled almost the same as case 2, except that rows are fed to the function from the end of the partition toward the start of the frame, i.e., in reverse order.
Case 4 is handled with a two-pass approach: for each row, a small loop feeds the rows of its frame to the function to compute that row's value. I think this is inevitable, since aggregation functions do not support sliding window aggregation.
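
A minimal sketch of the four cases above, assuming a ROWS frame and a plain sum as the aggregate. The class and method names are hypothetical and are not taken from the patch; the running `acc` stands in for the merge()/terminate() calls on a real aggregation function, and an empty frame yields 0 here rather than NULL.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the WindowAggExec code from the patch.
final class FrameCasesSketch {
    // Case 1: entire partition. Aggregate once and reuse the value for every row.
    static long[] entirePartition(long[] v) {
        long total = 0;
        for (long x : v) total += x;
        long[] out = new long[v.length];
        Arrays.fill(out, total);
        return out;
    }

    // Case 2: UNBOUNDED PRECEDING .. (current row + endOffset).
    // Rows are fed forward once; the running state is read out for each row.
    static long[] growingFromStart(long[] v, int endOffset) {
        long[] out = new long[v.length];
        long acc = 0;
        int fed = 0; // rows with index < fed are already merged into acc
        for (int i = 0; i < v.length; i++) {
            int end = Math.min(v.length - 1, i + endOffset);
            while (fed <= end) acc += v[fed++];
            out[i] = acc;
        }
        return out;
    }

    // Case 3: (current row + startOffset) .. UNBOUNDED FOLLOWING.
    // Same idea as case 2, but rows are fed in reverse order.
    static long[] shrinkingToEnd(long[] v, int startOffset) {
        long[] out = new long[v.length];
        long acc = 0;
        int fed = v.length; // rows with index >= fed are already merged into acc
        for (int i = v.length - 1; i >= 0; i--) {
            int start = Math.max(0, i + startOffset);
            while (fed > start) acc += v[--fed];
            out[i] = acc;
        }
        return out;
    }

    // Case 4: sliding frame. The aggregate is recomputed with a small loop per row.
    static long[] sliding(long[] v, int startOffset, int endOffset) {
        long[] out = new long[v.length];
        for (int i = 0; i < v.length; i++) {
            int start = Math.max(0, i + startOffset);
            int end = Math.min(v.length - 1, i + endOffset);
            long acc = 0;
            for (int j = start; j <= end; j++) acc += v[j];
            out[i] = acc;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] v = {1, 2, 3, 4};
        System.out.println(Arrays.toString(entirePartition(v)));     // [10, 10, 10, 10]
        System.out.println(Arrays.toString(growingFromStart(v, 0))); // [1, 3, 6, 10]
        System.out.println(Arrays.toString(shrinkingToEnd(v, 0)));   // [10, 9, 7, 4]
        System.out.println(Arrays.toString(sliding(v, -1, 1)));      // [3, 6, 9, 7]
    }
}
```

Offsets are signed in this sketch: -1 means "1 PRECEDING" and +1 means "1 FOLLOWING". Cases 2 and 3 touch each row once, while case 4 revisits rows for every output value, which is the cost the description calls inevitable.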

All of the above are implemented for ROWS first,
and then extended to support RANGE by including rows that have the same ORDER BY value as the current row in the computation of the window function.
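
As a rough illustration of the RANGE extension described above, the sketch below widens a frame that ends at the current row so that it also covers the current row's peers (rows with the same ORDER BY value). The names are hypothetical, the keys are assumed to be already sorted within the partition, and the aggregate is recomputed per row for brevity rather than incrementally.

```java
import java.util.Arrays;

// Hypothetical illustration only; not code from the patch.
final class RangePeersSketch {
    // Index of the last row that has the same ORDER BY key as row i.
    static int lastPeer(int[] keys, int i) {
        int p = i;
        while (p < keys.length - 1 && keys[p + 1] == keys[i]) p++;
        return p;
    }

    // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: the frame ends at row i.
    static long[] runningSumRows(long[] v) {
        long[] out = new long[v.length];
        long acc = 0;
        for (int i = 0; i < v.length; i++) {
            acc += v[i];
            out[i] = acc;
        }
        return out;
    }

    // RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: the frame ends at the
    // last peer of row i, so rows with equal keys get the same value.
    static long[] runningSumRange(int[] keys, long[] v) {
        long[] out = new long[v.length];
        for (int i = 0; i < v.length; i++) {
            int end = lastPeer(keys, i);
            long acc = 0;
            for (int j = 0; j <= end; j++) acc += v[j];
            out[i] = acc;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] keys = {1, 1, 2, 3, 3};
        long[] v = {10, 20, 30, 40, 50};
        System.out.println(Arrays.toString(runningSumRows(v)));        // [10, 30, 60, 100, 150]
        System.out.println(Arrays.toString(runningSumRange(keys, v))); // [30, 30, 60, 150, 150]
    }
}
```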

This patch includes the following changes

  • the parser can handle integer offsets for PRECEDING and FOLLOWING
  • ExprAnnotator reflects window frame information in WindowFunctionEval, including default value handling
  • WindowAggExec can handle ROWS and RANGE with window frame support
  • parameter checking is included in the parser and ExprAnnotator
  • last_value is re-implemented as a WindowAggFunc, and the first_value implementation becomes simpler
  • window-related classes in tajo-plan have a new prefix 'Logical' to distinguish them from the classes with the same names in tajo-algebra
  • plan.proto is modified to support data structures that distinguish function types and frame types
  • test cases for window frames are added

@sirpkt
Contributor Author

sirpkt commented Mar 23, 2015

I confirmed that 'mvn clean install' passes on my laptop.

@jihoonson
Contributor

Thanks. I'll review today.

@jihoonson
Contributor

Sorry for the late review. It seems the patch has gone stale.
Would you mind rebasing it?

sirpkt added 2 commits April 7, 2015 15:41
Conflicts are resolved:
	tajo-core/src/main/java/org/apache/tajo/engine/planner/physical/WindowAggExec.java
	tajo-plan/src/main/java/org/apache/tajo/plan/expr/WindowFunctionEval.java
@sirpkt
Contributor Author

sirpkt commented Apr 7, 2015

Thank you for the review, @jihoonson
I just rebased and resolved some conflicts from recent patches.

@jihoonson
Contributor

Thanks. I'll finish my review soon.

Contributor

We don't need to separately maintain the start bound type and end bound type. Please see the discussion at #13 (comment).

Contributor Author

Hmm... I think the discussion you linked is not complete,
because UNBOUNDED_PRECEDING is only available as a start bound type while UNBOUNDED_FOLLOWING is only available as an end bound type.
I tested this in psql as follows:

mydb=# select id, max(value) over (partition by id rows between unbounded following and unbounded following) from test3;
ERROR:  frame start cannot be UNBOUNDED FOLLOWING
LINE 1: ...id, max(value) over (partition by id rows between unbounded ...
                                                             ^

and saw the error I expected.

So, I think the start bound type and the end bound type need to be kept separate.

Contributor

Right. I hope you can read a couple of the comments in the linked thread. As I commented there, there is only one rule: the end bound cannot precede the start bound. So, naturally, the start bound cannot be UNBOUNDED FOLLOWING.
I think we can make some of the code simpler with a unified bound type.

Contributor Author

I don't think a single rule is enough, because the start bound and the end bound can be the same.
For example, we can use rows between current row and current row as follows:

mydb=# select id, max(value) over (partition by id rows between current row and current row) from test3;
 id | max 
----+-----
  1 |   1
  1 |   2
  2 |   5
  2 |   6
  2 |   7
(5 rows)

So, actually there are two rules.

  1. end bound cannot precede the start bound, but can be the same as the start bound
  2. UNBOUNDED PRECEDING cannot be an end bound and UNBOUNDED FOLLOWING cannot be a start bound

However, I totally agree with you that the code becomes much simpler if we use the unified bound type. So, I'll try to unify the bound type.
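
To make the two rules concrete, here is a minimal sketch of a unified bound kind with one validity check. The enum and method names are hypothetical and are not the names used in the patch; the check only compares bound kinds and deliberately ignores numeric offsets, which is an assumption of this sketch. The first message mirrors the psql error quoted earlier in this thread, and the other two are made up for illustration.

```java
// Hypothetical illustration only; not the WindowBound code from the patch.
final class FrameBoundCheckSketch {
    // Unified bound kind, declared in frame order from earliest to latest.
    enum BoundKind { UNBOUNDED_PRECEDING, PRECEDING, CURRENT_ROW, FOLLOWING, UNBOUNDED_FOLLOWING }

    // Returns null when the (start, end) pair is acceptable, otherwise a reason.
    static String check(BoundKind start, BoundKind end) {
        // Rule 2: the unbounded kinds are each valid on one side only.
        if (start == BoundKind.UNBOUNDED_FOLLOWING) {
            return "frame start cannot be UNBOUNDED FOLLOWING";
        }
        if (end == BoundKind.UNBOUNDED_PRECEDING) {
            return "frame end cannot be UNBOUNDED PRECEDING";
        }
        // Rule 1: the end may equal the start, but may not precede it.
        if (end.ordinal() < start.ordinal()) {
            return "frame end cannot precede frame start";
        }
        return null;
    }

    public static void main(String[] args) {
        // prints "frame start cannot be UNBOUNDED FOLLOWING"
        System.out.println(check(BoundKind.UNBOUNDED_FOLLOWING, BoundKind.UNBOUNDED_FOLLOWING));
        // prints "null" (the pair is accepted)
        System.out.println(check(BoundKind.CURRENT_ROW, BoundKind.CURRENT_ROW));
    }
}
```

The first call in main shows why rule 1 alone is not sufficient: both bounds have the same kind, so the ordering test passes, and only rule 2 rejects the pair.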

Contributor

Oh, you are right. Thanks.

@jihoonson
Contributor

@sirpkt, thanks for the great patch!
Thanks to your nice patch overview and plentiful comments, I was able to review it easily.
Overall, your patch looks good to me. I left some comments.
Thanks!

@sirpkt
Contributor Author

sirpkt commented Apr 12, 2015

Thank you for the review, @jihoonson !
I'll reflect your comments soon.

sirpkt added 2 commits April 13, 2015 09:18
- Refactor WindowAggExec code
- Unify window frame start and end bound type
- Remove unnecessary 'static'
- Add parameter test in CurrentValue

NOT CHANGED YET:
- row_number() with DISTINCT still disabled
- LEAD() still AggFunction not WindowAggFunc
@sirpkt
Contributor Author

sirpkt commented Apr 16, 2015

I updated the patch and it passed 'mvn clean install' on my laptop.

Reflecting review comments

- Refactor WindowAggExec code
- Unify window frame start and end bound type
- Remove unnecessary 'static'
- Add parameter test in CurrentValue

NOT CHANGED YET:
- row_number() with DISTINCT still disabled
- LEAD() still AggFunction not WindowAggFunc

Contributor

How about moving these comments to the code where the enum is defined?

Contributor Author

It looks better!
I moved the comments about Function Type and Frame Type to the code where the enum is defined.

@jihoonson
Contributor

Thanks @sirpkt. Overall, the updated patch looks good to me.
I'm still reviewing this patch. I'll finish soon.

Contributor

The WindowStartBound and WindowEndBound classes appear to contain the same information.
Do we need to keep them separate?

Contributor Author

You're right, @jihoonson.
We don't need to keep them separate.
I'll merge them into WindowBound.

@jihoonson
Contributor

@sirpkt thanks for your nice work. Since this is a new feature that will be very useful and widely used, I've reviewed it carefully. Please consider my comments. Thanks.

sirpkt added 3 commits April 20, 2015 10:39
- remove unused import, variables and unnecessary 'static' keywords
- refactor WindowAggExec with finer functions and more intuitive names
- unify bound types for start and end bound
- refactor testWindowQuery to follow Tajo test code convention
@sirpkt
Contributor Author

sirpkt commented Apr 27, 2015

I addressed the considerate and detailed comments from @jihoonson.

Contributor

This appears to have already been removed in https://issues.apache.org/jira/browse/TAJO-1442.

Contributor Author

You're right, @jihoonson.
I'll remove the line.

@jihoonson
Contributor

@sirpkt thanks for updating your patch.
I've tested the following query and found that Tajo's result differs from PostgreSQL's.

Tajo

default> select l_partkey, first_value(l_linestatus) over (partition by l_partkey order by l_tax rows between 5 preceding and 8 following) from lineitem;
...
l_partkey,  ?windowfunction
-------------------------------
1,  
1,  
1,  
1,  
1,  
1,  F
1,  O
1,  O
1,  F
1,  F
1,  O
1,  F
1,  F
1,  F
1,  F
1,  O
1,  O
1,  F
1,  F
1,  F
1,  O
1,  O
1,  F
1,  F
1,  O
1,  F
1,  F
1,  F
1,  O
1,  F
1,  O
2,  
2,  
2,  
2,  
2,  
2,  O
...

PostgreSQL

jihoonson=# select l_partkey, first_value(l_linestatus) over (partition by l_partkey order by l_tax rows between 5 preceding and 8 following) from lineitem;
 l_partkey | first_value 
-----------+-------------
         1 | O
         1 | O
         1 | O
         1 | O
         1 | O
         1 | O
         1 | F
         1 | O
         1 | F
         1 | F
         1 | F
         1 | F
         1 | O
         1 | F
         1 | F
         1 | O
         1 | O
         1 | F
         1 | F
         1 | F
         1 | F
         1 | F
         1 | O
         1 | O
         1 | O
         1 | F
         1 | O
         1 | F
         1 | F
         1 | F
         1 | O
         2 | O
...

As you can see, some values are null in Tajo.
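
For reference, a minimal sketch of the semantics that the PostgreSQL output above reflects for first_value over ROWS BETWEEN n PRECEDING AND m FOLLOWING: when fewer than n rows precede the current row, the frame start is clamped to the first row of the partition, so the result is not null. The class and method names are hypothetical. The five leading nulls in the Tajo output would be consistent with a frame start that is not clamped, but that cause is an assumption; the thread does not confirm it.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the first_value code from the patch.
final class FirstValueFrameSketch {
    // first_value over ROWS BETWEEN `preceding` PRECEDING AND `following` FOLLOWING
    // for one partition that is already sorted by the ORDER BY key.
    static String[] firstValue(String[] partition, int preceding, int following) {
        String[] out = new String[partition.length];
        for (int i = 0; i < partition.length; i++) {
            // Clamp both ends of the frame to the partition boundaries.
            int start = Math.max(0, i - preceding);
            int end = Math.min(partition.length - 1, i + following);
            out[i] = (start <= end) ? partition[start] : null;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] partition = {"a", "b", "c", "d"};
        // With 2 PRECEDING, the first three rows all see row 0 as the frame start.
        System.out.println(Arrays.toString(firstValue(partition, 2, 1))); // [a, a, a, b]
    }
}
```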

@sirpkt
Contributor Author

sirpkt commented Apr 28, 2015

Thank you for finding this, @jihoonson.
The difference between the patch's result and PostgreSQL's result is caused by a bug.
I'll fix the bug and add more test code for window frames.

- bug fix for the handling of built-in window functions with frame support
- add more test codes for the preceding and following with constants
@sirpkt
Copy link
Contributor Author

sirpkt commented Apr 28, 2015

I rebased and updated the patch.

- bug fix for the handling of built-in window functions with frame support
- add more test codes for the preceding and following with constant cases

@jihoonson
Contributor

Thanks @sirpkt, but the query results are still different.
Would you check again?

Tajo

default> select l_partkey, l_tax, first_value(l_linestatus) over (partition by l_partkey order by l_tax rows between 5 preceding and 8 following) from lineitem;
l_partkey,  l_tax,  ?windowfunction
-------------------------------
1,  0.0,  F
1,  0.0,  F
1,  0.01,  F
1,  0.01,  F
1,  0.01,  F
1,  0.02,  F
1,  0.02,  O
1,  0.02,  F
1,  0.03,  O
1,  0.03,  F
1,  0.04,  F
1,  0.04,  O
1,  0.05,  F
1,  0.05,  F
1,  0.06,  F
1,  0.06,  O
1,  0.06,  O
1,  0.06,  F
1,  0.06,  F
1,  0.07,  F
1,  0.07,  O
1,  0.07,  F
1,  0.07,  F
1,  0.07,  O
1,  0.07,  O
1,  0.08,  F
1,  0.08,  O
1,  0.08,  F
1,  0.08,  F
1,  0.08,  F
1,  0.08,  F
2,  0.0,  O
2,  0.0,  O
2,  0.0,  O
2,  0.01,  O
2,  0.01,  O
...

PostgreSQL

jihoonson=# select l_partkey, l_tax, first_value(l_linestatus) over (partition by l_partkey order by l_tax rows between 5 preceding and 8 following) from lineitem;
 l_partkey | l_tax | first_value 
-----------+-------+-------------
         1 |     0 | O
         1 |     0 | O
         1 |  0.01 | O
         1 |  0.01 | O
         1 |  0.01 | O
         1 |  0.02 | O
         1 |  0.02 | F
         1 |  0.02 | O
         1 |  0.03 | F
         1 |  0.03 | F
         1 |  0.04 | F
         1 |  0.04 | F
         1 |  0.05 | O
         1 |  0.05 | F
         1 |  0.06 | F
         1 |  0.06 | O
         1 |  0.06 | O
         1 |  0.06 | F
         1 |  0.06 | F
         1 |  0.07 | F
         1 |  0.07 | F
         1 |  0.07 | F
         1 |  0.07 | O
         1 |  0.07 | O
         1 |  0.07 | O
         1 |  0.08 | F
         1 |  0.08 | O
         1 |  0.08 | F
         1 |  0.08 | F
         1 |  0.08 | F
         1 |  0.08 | O
         2 |     0 | O
         2 |     0 | O
         2 |     0 | O
         2 |  0.01 | O
         2 |  0.01 | O
         2 |  0.01 | O
         2 |  0.02 | O
         2 |  0.02 | O
         2 |  0.02 | O
...
