Conversation
vtlim
left a comment
There was a problem hiding this comment.
Great examples! Left comments on style and areas that could use more explanation.
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
2bethere
left a comment
There was a problem hiding this comment.
Overall I think this looks good.
I'd like to suggest adding a section around known issues and the limitations that we have here. For example, if the user sees a Null point exception during a query, they might be using an unsupported query shape.
Similar to how you called out Group by queries are required.
e026fdc to
093a1c7
Compare
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
| The following are known issues with window functions: | ||
|
|
||
| - Descending order, DESC, in the ORDER BY clause in the window definition causes a Null Pointer Exception. | ||
| - Druid throws an exception if you do not include the offset and default for LEAD() and LAG() functions. |
There was a problem hiding this comment.
What would the alternative be? That is, if the offset is not defined, what should be returned? The default can be null, so that makes sense, but I'm unclear what a meaning default value for offset in LEAD/LAG would be?
There was a problem hiding this comment.
I only know by comparison to Postgres which sets the default offset to 1 if not included:
If omitted, offset defaults to 1 and default to NULL.
I have also observed that the default value is not respected when set. For example for the first row in the window a LEAD(dim, 1, default) should apply the default value and that does not appear to be the case.
Also, it would be better to catch the exception throw a user-readable message so they know to set the offset.
There was a problem hiding this comment.
this default offset stuff was fixed recently
rm unwanted file
|
|
||
| There are known issues where ORDER BY only works on ascending order and certain options may cause errors. | ||
|
|
||
| Set the context parameter `windowsAreForClosers: true` to use window functions. |
There was a problem hiding this comment.
we should rename this to something else :)
| You can use the OVER clause to treat other Druid aggregation functions as window functions. For example, the sum of a value for rows within a window. | ||
|
|
||
| When working with window functions, consider the following: | ||
| - Window functions only work on GROUP BY queries. |
There was a problem hiding this comment.
I don't think we need this anymore - it should work
There was a problem hiding this comment.
I tried a couple of queries like
SELECT
RANK() OVER (PARTITION BY m1 ORDER BY m1 DESC)
AS ranking,
m1
FROM foo
Does not give any results but if I add a GROUP BY m1 it does give results. As for this I would like to include GROUP BY in the docs as the window functions are still experimental and once we are confident that the cases run without the group by we can update it in the next release
| |`PERCENT_RANK()`| Returns the rank of the row calculated as a percentage according to the formula: `(rank - 1) / (total window rows - 1)` | | ||
| |`CUME_DIST()`| Returns the cumulative distribution of the current row within the window calculated as `number of window rows at the same rank or higher than current row` / `total window rows` | | ||
| |`NTILE(tiles)`| Divides the rows within a window as evenly as possible into the number of tiles, also called buckets, and returns the value of the tile that the row falls into |None | | ||
| |`LAG(expr, offset, default)`| Returns the value evaluated at the row that precedes the current row by the offset number within the window; if there is no such row, returns the given default value | |
There was a problem hiding this comment.
I don't think we have LAG(expr, offset, default) ; to my best knowledge we have this
for which we could write:
LAG(expr, [offset])offsetdefaults to1if omitted- remove the part about default value
similar commment notes for LEAD
| PERCENT_RANK() OVER w AS pct_rank, | ||
| CUME_DIST() OVER w AS cumulative_dist, | ||
| NTILE(4) OVER w AS ntile_val, | ||
| LAG(ABS(delta), 1, 0) OVER w AS lag_val, |
There was a problem hiding this comment.
I don't think this default work - I see null in the output; and not 0
I assume this query can be run so I think we accept it ; but doesn't live up to working accordingly....
There was a problem hiding this comment.
@kgyrtkirk , don't know if this is a bug in the function or if there is not an intention to be a default value (I was basing my docs off pg)
It's kinda odd to me that the function signature accepts the value, but maybe it is doing something else?
There was a problem hiding this comment.
its a bug in the LEAD/LAG implementation ; it doesn't take the default into consideration; however Calcite accepts it - we could either reject these ; or implement it - either way I've opened this ticket for it
| |`2016-06-27T04:20:00.000Z`|`#kk.wikipedia`|56|56| | ||
| |`2016-06-27T04:35:00.000Z`|`#kk.wikipedia`|2440|2496| |
There was a problem hiding this comment.
I might have made a mistake I can't see right now - but I'm pretty confident that this should be 2496 for both of these rows
I think we have:
OVER(... ORDER BY FLOOR( __time TO HOUR)=> should be ordered by the floored value- running aggregate of
SUM()should be calculated for up-to-all current values of orderby value... - these 2 rows should floor to 4am but they have different running-sum values
I was playing with pg in this sqlfiddle
There was a problem hiding this comment.
The query floors to MINUTE not hour in line 177:
FLOOR(__time TO MINUTE)
For consistency I changed line 180 to floor to MINUTE too -- it keeps the same results as I have here.
I wonder if there is a bug with Druid in the case that you pointed out. Where it looks like it honored the MINUTE and not the HOUR?
There was a problem hiding this comment.
for the resultset line 180 mattered more :)
Thank you for changing it now the resultset seem to be more accurate! 🗡️
| The following are known issues with window functions: | ||
|
|
||
| - Descending order, DESC, in the ORDER BY clause in the window definition causes a Null Pointer Exception. | ||
| - Druid throws an exception if you do not include the offset and default for LEAD() and LAG() functions. |
There was a problem hiding this comment.
this default offset stuff was fixed recently
| ] | ||
| } | ||
| } | ||
| } No newline at end of file |
| Apache Druid supports two query languages: [Druid SQL](sql.md) and [native queries](querying.md). | ||
| This document describes the SQL language. | ||
|
|
||
| Window functions are an [experimental](../development/experimental.md) feature. Development and testing are still at early stage. Feel free to try window functions and provide your feedback. |
There was a problem hiding this comment.
| Window functions are an [experimental](../development/experimental.md) feature. Development and testing are still at early stage. Feel free to try window functions and provide your feedback. | |
| Window functions are an [experimental](../development/experimental.md) feature. Development and testing are still at early stage. Feel free to try window functions and provide your feedback. Windows functions are not supported by multi-stage-query engine and thus cannot be used in SQL-based Ingestion as of now. |
|
|
||
| The OVER clause defines the query windows for window functions as follows: | ||
| - PARTITION BY indicates the dimension that defines the rows within the window | ||
| - ORDER BY specifies the order of the rows within the windows. Currently only ascending order, ASC, works. |
There was a problem hiding this comment.
we now accept DESC as well now (and nulls X in case it aligns with the current behaviour)
so Currently only ascending order, ASC, works. could be removed
| PERCENT_RANK() OVER w AS pct_rank, | ||
| CUME_DIST() OVER w AS cumulative_dist, | ||
| NTILE(4) OVER w AS ntile_val, | ||
| LAG(ABS(delta), 1, 0) OVER w AS lag_val, |
There was a problem hiding this comment.
its a bug in the LEAD/LAG implementation ; it doesn't take the default into consideration; however Calcite accepts it - we could either reject these ; or implement it - either way I've opened this ticket for it
| |`2016-06-27T04:20:00.000Z`|`#kk.wikipedia`|56|56| | ||
| |`2016-06-27T04:35:00.000Z`|`#kk.wikipedia`|2440|2496| |
There was a problem hiding this comment.
for the resultset line 180 mattered more :)
Thank you for changing it now the resultset seem to be more accurate! 🗡️
|
|
||
| The following are known issues with window functions: | ||
|
|
||
| - Descending order, DESC, in the ORDER BY clause in the window definition causes a Null Pointer Exception. No newline at end of file |
There was a problem hiding this comment.
no; this is not there anymore - however the following is known:
- aggregates which have
ORDER BYspecified are processed in the window:ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW- note: other database seem to use the default of
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWin this case (not sure if it is specified or not in the standard - have to dig it out) - note: in case the order column is unique there is no difference between
RANGE/ROWS
- note: other database seem to use the default of
- windows with
RANGEspecifications are handled asROWS LEAD/LAGignores the default valueLAST_VALUEreturns the last value of the window even in the presence of anorder by
here is an example query which could illustrate these (not sure if needed)
SELECT
FLOOR(__time TO DAY) t,
cnt,
FLOOR(m1/3),
COUNT(1) OVER (ORDER BY FLOOR(m1/3) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), --ok
COUNT(1) OVER (ORDER BY FLOOR(m1/3) RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), -- bad
LAST_VALUE(cnt) OVER (ORDER BY __time), -- last value of the window
LAG(cnt) OVER (), -- ok
LAG(cnt,1,999) OVER () -- default value ignored
FROM foo
|
|
||
| Window functions are an [experimental](../development/experimental.md) feature. Development and testing are still at early stage. Feel free to try window functions and provide your feedback. | ||
|
|
||
| There are known issues where ORDER BY only works on ascending order and certain options may cause errors. |
There was a problem hiding this comment.
This has been fixed recently. Now both the orders work fine. The known issue mentioned here has been resolved
soumyava
left a comment
There was a problem hiding this comment.
Some nits. Mostly LGTM !
|
|
||
| ## Examples | ||
|
|
||
| The following example illustrates all of the built-in window functions to compare the number of characters changed per event for a channel in the Wikipedia data set. |
There was a problem hiding this comment.
Probably sometime later when these are not experimental we could run an example with a use case and show how an empty over() works followed by introducing partition by and order by similar to how Drill does in https://drill.apache.org/docs/sql-window-functions-introduction/. Not now, but some food for thought for future
somu-imply
left a comment
There was a problem hiding this comment.
Thanks @techdocsmith for the docs. LGTM
* draft window functions * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * address comments * remove default column * Update docs/querying/sql-window-functions.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/querying/sql-window-functions.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * fix ntile * remove default header column * code tics to remove spelling errors * add known issues, add SUM example * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * address spelling * remove extra chars * add to sidebar, fix admonition * Update sql-window-functions.md accept suggestion, change admonition style * update sidebar * Delete Untitled.ipynb rm unwanted file * Update docs/querying/sql-window-functions.md * Update docs/querying/sql-window-functions.md * update context param, accept suggestions * accept suggestions * Apply suggestions from code review * Fix known issues * require GROUP BY, explain order of operation * accept suggestions * fix spelling --------- Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* draft window functions * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * address comments * remove default column * Update docs/querying/sql-window-functions.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Update docs/querying/sql-window-functions.md Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * fix ntile * remove default header column * code tics to remove spelling errors * add known issues, add SUM example * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com> * address spelling * remove extra chars * add to sidebar, fix admonition * Update sql-window-functions.md accept suggestion, change admonition style * update sidebar * Delete Untitled.ipynb rm unwanted file * Update docs/querying/sql-window-functions.md * Update docs/querying/sql-window-functions.md * update context param, accept suggestions * accept suggestions * Apply suggestions from code review * Fix known issues * require GROUP BY, explain order of operation * accept suggestions * fix spelling --------- Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Adds docs for SQL Window Functions
This PR has: