Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-38219][SQL] Support ANSI aggregation function percentile_cont as window function #35531

Closed
wants to merge 9 commits into from

Conversation

beliefer
Copy link
Contributor

@beliefer beliefer commented Feb 15, 2022

What changes were proposed in this pull request?

#34936 support ANSI Aggregation Function: percentile_cont.
Syntax:
PERCENTILE_CONT( <percentile> ) WITHIN GROUP (ORDER BY <order_by_expr>) OVER ( [ PARTITION BY <expr3> ] )
Arguments:
percentile:The percentile of the value that you want to find. The percentile must be a constant between 0.0 and 1.0. For example, if you want to find the value at the 90th percentile, specify 0.9.

order_by_expr:The expression (typically a column name) by which to order the values.

expr3:This is the optional expression used to group rows into partitions.

Examples

select k, percentile_cont(0.25) within group (order by v) 
  from aggr 
  group by k
  order by k;
+---+-------------------------------------------------+
| K | PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY V) |
|---+-------------------------------------------------|
| 0 |                                        10.00000 |
| 1 |                                        12.50000 |
| 2 |                                        17.50000 |
| 3 |                                        60.00000 |
| 4 |                                            NULL |
+---+-------------------------------------------------+

Note: specify order by or frame for window frame of percentile_cont is not allowed.

Why are the changes needed?

To improve migration from other systems to Spark SQL. Some mainstream database supports percentile_cont as window function show below:
Snowflake
https://docs.snowflake.com/en/sql-reference/functions/percentile_cont.html
Oracle
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/PERCENTILE_CONT.html#GUID-CA259452-A565-41B3-A4F4-DD74B66CEDE0
H2
http://www.h2database.com/html/functions-aggregate.html#percentile_cont
Sybase
https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc01776.1601/doc/html/san1278453109663.html
Exasol
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/percentile_cont.htm
Yellowbrick
https://www.yellowbrick.com/docs/2.2/ybd_sqlref/percentile_cont_window.html
Mariadb
https://mariadb.com/kb/en/percentile_cont/
Singlestore
https://docs.singlestore.com/db/v7.6/en/reference/sql-reference/window-functions/percentile_cont-and-median.html
SqlServer
https://docs.microsoft.com/en-us/sql/t-sql/functions/percentile-cont-transact-sql?view=sql-server-ver15

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Feb 15, 2022
@amaliujia
Copy link
Contributor

I know it might be painful that in the past we don't do it but now why we are doing it, but having the function specification documented, and then discussed, before reviewing the code, would be a really helpful process for changing a user-facing SQL feature (either adding a new one or changing existing ones). Many SQL related project does it (e.g. Apache Calcite)

It is not that difficult: just write down function signature, argument types, behaviors, examples. It is more like document what you are expressing by your code. People can better know all key decisions and argue on specific ones, or catch what has been missed.

@beliefer
Copy link
Contributor Author

I know it might be painful that in the past we don't do it but now why we are doing it, but having the function specification documented, and then discussed, before reviewing the code, would be a really helpful process for changing a user-facing SQL feature (either adding a new one or changing existing ones). Many SQL related project does it (e.g. Apache Calcite)

It is not that difficult: just write down function signature, argument types, behaviors, examples. It is more like document what you are expressing by your code. People can better know all key decisions and argue on specific ones, or catch what has been missed.

Updated

@beliefer
Copy link
Contributor Author

ping @MaxGekk cc @cloud-fan @gengliangwang

@@ -1835,11 +1835,18 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
override def visitPercentile(ctx: PercentileContext): Expression = withOrigin(ctx) {
val percentage = expression(ctx.percentage)
val sortOrder = visitSortItem(ctx.sortItem)
val percentile = sortOrder.direction match {
case Ascending => new Percentile(sortOrder.child, percentage)
case Descending => new Percentile(sortOrder.child, Subtract(Literal(1), percentage))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we directly use Percentile? It's an aggregate function and can be used in Window.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because cannot specify order by or frame of window for 'PERCENTILE_CONT'.

              case AggregateExpression(_: PercentileCont, _, _, _, _)
                 if w.windowSpec.orderSpec.nonEmpty || w.windowSpec.frameSpecification !=
                     SpecifiedWindowFrame(RowFrame, UnboundedPreceding, UnboundedFollowing) =>
                 failAnalysis("Cannot specify order by or frame for 'PERCENTILE_CONT'.")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a real limitation or it's just other databases do not support it?

Copy link
Contributor Author

@beliefer beliefer Mar 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. According the document, only could specify partition by clause in window.
I tested in Oracle

SELECT 
percentile_cont(0.25) WITHIN GROUP (ORDER BY x) OVER (ORDER BY x ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
percentile_cont(0.25) WITHIN GROUP (ORDER BY x DESC) OVER (PARTITION BY k)
FROM SYSTEM.testRegression
ORDER BY k;

, it throws SQL 错误 [30487] [99999]: ORA-30487: ORDER BY 在此禁用.

@@ -890,7 +890,7 @@ primaryExpression
| OVERLAY '(' input=valueExpression PLACING replace=valueExpression
FROM position=valueExpression (FOR length=valueExpression)? ')' #overlay
| PERCENTILE_CONT '(' percentage=valueExpression ')'
WITHIN GROUP '(' ORDER BY sortItem ')' #percentile
WITHIN GROUP '(' ORDER BY sortItem ')' ( OVER windowSpec)? #percentile
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to understand the rationale more. This function has its own syntax to specify ORDER BY, so I get that it should not co-exists with the ORDER BY in the window frame. But doesn't percentile have the same problem when being used as a window function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And why can't users specify the window frame boundaries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the document of these database, Almost all databases tell me the window specification of PERCENTILE_CONT only have partition by.
I can't find any other words about root cause or rationale.

@beliefer beliefer force-pushed the SPARK-38219 branch 2 times, most recently from f1d2d2a to 732c5c8 Compare March 13, 2022 05:52
@cloud-fan
Copy link
Contributor

@MaxGekk

5.0
""",
group = "agg_funcs",
since = "3.3.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

* Return a percentile value based on a continuous distribution of
* the input column (specified in ORDER BY clause).
*/
@ExpressionDescription(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the expression is not bound to a function in FunctionRegistry, ExpressionDescription is useless. What's the purpose of the annotation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had removed it.

@MaxGekk
Copy link
Member

MaxGekk commented Mar 25, 2022

Waiting for https://github.com/beliefer/spark/runs/5691324331 completes.

@MaxGekk
Copy link
Member

MaxGekk commented Mar 25, 2022

The test failure is not related to the changes:

[info] - SPARK-37555: spark-sql should pass last unclosed comment to backend *** FAILED *** (2 minutes, 11 seconds)
...
[info]   Exception: java.util.concurrent.TimeoutException: Futures timed out after [2 minutes]

@MaxGekk MaxGekk closed this in 8262a7b Mar 25, 2022
@MaxGekk
Copy link
Member

MaxGekk commented Mar 25, 2022

+1, LGTM. Merged to master.
Thank you, @beliefer and @cloud-fan for review.

martin-g pushed a commit to martin-g/spark that referenced this pull request Mar 25, 2022
…` as window function

### What changes were proposed in this pull request?
apache#34936 support ANSI Aggregation Function: `percentile_cont`.
**Syntax:**
`PERCENTILE_CONT( <percentile> ) WITHIN GROUP (ORDER BY <order_by_expr>) OVER ( [ PARTITION BY <expr3> ] )`
**Arguments:**
percentile:The percentile of the value that you want to find. The percentile must be a constant between 0.0 and 1.0. For example, if you want to find the value at the 90th percentile, specify 0.9.

order_by_expr:The expression (typically a column name) by which to order the values.

expr3:This is the optional expression used to group rows into partitions.

**Examples**:
```
select k, percentile_cont(0.25) within group (order by v)
  from aggr
  group by k
  order by k;
+---+-------------------------------------------------+
| K | PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY V) |
|---+-------------------------------------------------|
| 0 |                                        10.00000 |
| 1 |                                        12.50000 |
| 2 |                                        17.50000 |
| 3 |                                        60.00000 |
| 4 |                                            NULL |
+---+-------------------------------------------------+
```

**Note: specify order by or frame for window frame of `percentile_cont` is not allowed.**
### Why are the changes needed?
To improve migration from other systems to Spark SQL. Some mainstream database supports `percentile_cont` as window function show below:
**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/percentile_cont.html
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/PERCENTILE_CONT.html#GUID-CA259452-A565-41B3-A4F4-DD74B66CEDE0
**H2**
http://www.h2database.com/html/functions-aggregate.html#percentile_cont
**Sybase**
https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc01776.1601/doc/html/san1278453109663.html
**Exasol**
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/percentile_cont.htm
**Yellowbrick**
https://www.yellowbrick.com/docs/2.2/ybd_sqlref/percentile_cont_window.html
**Mariadb**
https://mariadb.com/kb/en/percentile_cont/
**Singlestore**
https://docs.singlestore.com/db/v7.6/en/reference/sql-reference/window-functions/percentile_cont-and-median.html
**SqlServer**
https://docs.microsoft.com/en-us/sql/t-sql/functions/percentile-cont-transact-sql?view=sql-server-ver15

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New tests.

Closes apache#35531 from beliefer/SPARK-38219.

Lead-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@beliefer
Copy link
Contributor Author

@MaxGekk @cloud-fan Thank you for the help.

case class PercentileCont(left: Expression, right: Expression)
extends AggregateFunction
with RuntimeReplaceableAggregate
with ImplicitCastInputTypes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we only use PercentileCont as a marker to indicate that window frame ORDER BY should not be used, I think InheritAnalysisRules is better here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will use tree node tag to improve the implement of PercentileCont based discussion offline.

cloud-fan pushed a commit that referenced this pull request Apr 20, 2022
…rcentile_cont and percentile_disc

### What changes were proposed in this pull request?
This PR backport #35531 and #35041 to branch-3.3

### Why are the changes needed?
`percentile_cont` and `percentile_disc` in Spark3.3 release.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes #36277 from beliefer/SPARK-38219_SPARK-37691_backport_3.3.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants