Skip to content

Comments

window functions docs#14739

Merged
soumyava merged 31 commits intoapache:masterfrom
techdocsmith:wf-docs
Nov 6, 2023
Merged

window functions docs#14739
soumyava merged 31 commits intoapache:masterfrom
techdocsmith:wf-docs

Conversation

@techdocsmith
Copy link
Contributor

Adds docs for SQL Window Functions

This PR has:

  • [x ] been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • [ x] been tested in a test Druid cluster.

Copy link
Member

@vtlim vtlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great examples! Left comments on style and areas that could use more explanation.

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
techdocsmith and others added 5 commits August 2, 2023 14:01
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Copy link
Member

@vtlim vtlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs LGTM!

Copy link
Contributor

@2bethere 2bethere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this looks good.
I'd like to suggest adding a section around known issues and the limitations that we have here. For example, if the user sees a Null point exception during a query, they might be using an unsupported query shape.
Similar to how you called out Group by queries are required.

@techdocsmith techdocsmith force-pushed the wf-docs branch 2 times, most recently from e026fdc to 093a1c7 Compare August 3, 2023 20:25
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
@techdocsmith techdocsmith changed the title draft window functions window functions docs Aug 4, 2023
The following are known issues with window functions:

- Descending order, DESC, in the ORDER BY clause in the window definition causes a Null Pointer Exception.
- Druid throws an exception if you do not include the offset and default for LEAD() and LAG() functions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the alternative be? That is, if the offset is not defined, what should be returned? The default can be null, so that makes sense, but I'm unclear what a meaning default value for offset in LEAD/LAG would be?

Copy link
Contributor Author

@techdocsmith techdocsmith Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only know by comparison to Postgres which sets the default offset to 1 if not included:

If omitted, offset defaults to 1 and default to NULL.

I have also observed that the default value is not respected when set. For example for the first row in the window a LEAD(dim, 1, default) should apply the default value and that does not appear to be the case.

Also, it would be better to catch the exception throw a user-readable message so they know to set the offset.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this default offset stuff was fixed recently

rm unwanted file

There are known issues where ORDER BY only works on ascending order and certain options may cause errors.

Set the context parameter `windowsAreForClosers: true` to use window functions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should rename this to something else :)

You can use the OVER clause to treat other Druid aggregation functions as window functions. For example, the sum of a value for rows within a window.

When working with window functions, consider the following:
- Window functions only work on GROUP BY queries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this anymore - it should work

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a couple of queries like

SELECT	
		RANK() OVER (PARTITION BY m1 ORDER BY m1 DESC) 
			AS ranking,
		m1
FROM foo 

Does not give any results but if I add a GROUP BY m1 it does give results. As for this I would like to include GROUP BY in the docs as the window functions are still experimental and once we are confident that the cases run without the group by we can update it in the next release

|`PERCENT_RANK()`| Returns the rank of the row calculated as a percentage according to the formula: `(rank - 1) / (total window rows - 1)` |
|`CUME_DIST()`| Returns the cumulative distribution of the current row within the window calculated as `number of window rows at the same rank or higher than current row` / `total window rows` |
|`NTILE(tiles)`| Divides the rows within a window as evenly as possible into the number of tiles, also called buckets, and returns the value of the tile that the row falls into |None |
|`LAG(expr, offset, default)`| Returns the value evaluated at the row that precedes the current row by the offset number within the window; if there is no such row, returns the given default value |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have LAG(expr, offset, default) ; to my best knowledge we have this

for which we could write:

  • LAG(expr, [offset])
  • offset defaults to 1 if omitted
  • remove the part about default value

similar commment notes for LEAD

PERCENT_RANK() OVER w AS pct_rank,
CUME_DIST() OVER w AS cumulative_dist,
NTILE(4) OVER w AS ntile_val,
LAG(ABS(delta), 1, 0) OVER w AS lag_val,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this default work - I see null in the output; and not 0
I assume this query can be run so I think we accept it ; but doesn't live up to working accordingly....

Copy link
Contributor Author

@techdocsmith techdocsmith Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgyrtkirk , don't know if this is a bug in the function or if there is not an intention to be a default value (I was basing my docs off pg)

It's kinda odd to me that the function signature accepts the value, but maybe it is doing something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its a bug in the LEAD/LAG implementation ; it doesn't take the default into consideration; however Calcite accepts it - we could either reject these ; or implement it - either way I've opened this ticket for it

Comment on lines +193 to +194
|`2016-06-27T04:20:00.000Z`|`#kk.wikipedia`|56|56|
|`2016-06-27T04:35:00.000Z`|`#kk.wikipedia`|2440|2496|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have made a mistake I can't see right now - but I'm pretty confident that this should be 2496 for both of these rows

I think we have:

  • OVER(... ORDER BY FLOOR( __time TO HOUR) => should be ordered by the floored value
  • running aggregate of SUM() should be calculated for up-to-all current values of orderby value...
  • these 2 rows should floor to 4am but they have different running-sum values

I was playing with pg in this sqlfiddle

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgyrtkirk

The query floors to MINUTE not hour in line 177:
FLOOR(__time TO MINUTE)

For consistency I changed line 180 to floor to MINUTE too -- it keeps the same results as I have here.

I wonder if there is a bug with Druid in the case that you pointed out. Where it looks like it honored the MINUTE and not the HOUR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the resultset line 180 mattered more :)

Thank you for changing it now the resultset seem to be more accurate! 🗡️

The following are known issues with window functions:

- Descending order, DESC, in the ORDER BY clause in the window definition causes a Null Pointer Exception.
- Druid throws an exception if you do not include the offset and default for LEAD() and LAG() functions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this default offset stuff was fixed recently

]
}
}
} No newline at end of file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: newline removal?

@LakshSingla LakshSingla added this to the 28.0 milestone Oct 17, 2023
Apache Druid supports two query languages: [Druid SQL](sql.md) and [native queries](querying.md).
This document describes the SQL language.

Window functions are an [experimental](../development/experimental.md) feature. Development and testing are still at early stage. Feel free to try window functions and provide your feedback.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Window functions are an [experimental](../development/experimental.md) feature. Development and testing are still at early stage. Feel free to try window functions and provide your feedback.
Window functions are an [experimental](../development/experimental.md) feature. Development and testing are still at early stage. Feel free to try window functions and provide your feedback. Windows functions are not supported by multi-stage-query engine and thus cannot be used in SQL-based Ingestion as of now.


The OVER clause defines the query windows for window functions as follows:
- PARTITION BY indicates the dimension that defines the rows within the window
- ORDER BY specifies the order of the rows within the windows. Currently only ascending order, ASC, works.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we now accept DESC as well now (and nulls X in case it aligns with the current behaviour)
so Currently only ascending order, ASC, works. could be removed

PERCENT_RANK() OVER w AS pct_rank,
CUME_DIST() OVER w AS cumulative_dist,
NTILE(4) OVER w AS ntile_val,
LAG(ABS(delta), 1, 0) OVER w AS lag_val,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its a bug in the LEAD/LAG implementation ; it doesn't take the default into consideration; however Calcite accepts it - we could either reject these ; or implement it - either way I've opened this ticket for it

Comment on lines +193 to +194
|`2016-06-27T04:20:00.000Z`|`#kk.wikipedia`|56|56|
|`2016-06-27T04:35:00.000Z`|`#kk.wikipedia`|2440|2496|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the resultset line 180 mattered more :)

Thank you for changing it now the resultset seem to be more accurate! 🗡️


The following are known issues with window functions:

- Descending order, DESC, in the ORDER BY clause in the window definition causes a Null Pointer Exception. No newline at end of file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no; this is not there anymore - however the following is known:

  • aggregates which have ORDER BY specified are processed in the window: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    • note: other database seem to use the default of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW in this case (not sure if it is specified or not in the standard - have to dig it out)
    • note: in case the order column is unique there is no difference between RANGE / ROWS
  • windows with RANGE specifications are handled as ROWS
  • LEAD/LAG ignores the default value
  • LAST_VALUE returns the last value of the window even in the presence of an order by

here is an example query which could illustrate these (not sure if needed)

  SELECT
    FLOOR(__time TO DAY) t,
    cnt,
    FLOOR(m1/3),
    COUNT(1) OVER (ORDER BY FLOOR(m1/3) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), --ok
    COUNT(1) OVER (ORDER BY FLOOR(m1/3) RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), -- bad
    LAST_VALUE(cnt) OVER (ORDER BY __time), -- last value of the window
    LAG(cnt) OVER (),  -- ok
    LAG(cnt,1,999) OVER () -- default value ignored
  FROM foo


Window functions are an [experimental](../development/experimental.md) feature. Development and testing are still at early stage. Feel free to try window functions and provide your feedback.

There are known issues where ORDER BY only works on ascending order and certain options may cause errors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been fixed recently. Now both the orders work fine. The known issue mentioned here has been resolved

Copy link
Contributor

@soumyava soumyava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits. Mostly LGTM !


## Examples

The following example illustrates all of the built-in window functions to compare the number of characters changed per event for a channel in the Wikipedia data set.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably sometime later when these are not experimental we could run an example with a use case and show how an empty over() works followed by introducing partition by and order by similar to how Drill does in https://drill.apache.org/docs/sql-window-functions-introduction/. Not now, but some food for thought for future

Copy link
Contributor

@somu-imply somu-imply left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @techdocsmith for the docs. LGTM

@soumyava soumyava merged commit 0403e48 into apache:master Nov 6, 2023
techdocsmith added a commit to techdocsmith/druid that referenced this pull request Nov 6, 2023
* draft window functions

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* address comments

* remove default column

* Update docs/querying/sql-window-functions.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql-window-functions.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* fix ntile

* remove default header column

* code tics to remove spelling errors

* add known issues, add SUM example

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* address spelling

* remove extra chars

* add to sidebar, fix admonition

* Update sql-window-functions.md

accept suggestion, change admonition style

* update sidebar

* Delete Untitled.ipynb

rm unwanted file

* Update docs/querying/sql-window-functions.md

* Update docs/querying/sql-window-functions.md

* update context param, accept suggestions

* accept suggestions

* Apply suggestions from code review

* Fix known issues

* require GROUP BY, explain order of operation

* accept suggestions

* fix spelling

---------

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
* draft window functions

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* address comments

* remove default column

* Update docs/querying/sql-window-functions.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Update docs/querying/sql-window-functions.md

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* fix ntile

* remove default header column

* code tics to remove spelling errors

* add known issues, add SUM example

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>

* address spelling

* remove extra chars

* add to sidebar, fix admonition

* Update sql-window-functions.md

accept suggestion, change admonition style

* update sidebar

* Delete Untitled.ipynb

rm unwanted file

* Update docs/querying/sql-window-functions.md

* Update docs/querying/sql-window-functions.md

* update context param, accept suggestions

* accept suggestions

* Apply suggestions from code review

* Fix known issues

* require GROUP BY, explain order of operation

* accept suggestions

* fix spelling

---------

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants