[SPARK-46009][SQL][CONNECT] Merge the parse rule of PercentileCont and PercentileDisc into functionCall #43910

beliefer · 2023-11-20T12:12:31Z

What changes were proposed in this pull request?

The Spark SQL parser has a special rule to parse [percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v).
We should merge this rule into the functionCall.

Why are the changes needed?

Merge the parse rule of PercentileCont and PercentileDisc into functionCall.

Does this PR introduce any user-facing change?

'No'.

How was this patch tested?

New test cases.

Was this patch authored or co-authored using generative AI tooling?

'No'.

cloud-fan · 2023-11-21T06:47:54Z

common/utils/src/main/resources/error/error-classes.json

@@ -1846,6 +1846,34 @@
    },
    "sqlState" : "42000"
  },
+  "INVALID_INVERSE_DISTRIBUTION_FUNCTION" : {
+    "message" : [
+      "Invalid inverse distribution function <funcName>."


is inverse distribution a common terminology?

I referred to H2, SQL Server and Postgres; they call these functions inverse distribution functions.
In addition, some databases call these functions ordered-set functions.

cloud-fan · 2023-11-21T06:49:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+          case u @ UnresolvedFunction(nameParts, arguments, _, _, _, sortOrder) => withPosition(u) {
+            val args = sortOrder match {
+              case Some(s) if nameParts.length == 1 &&
+                (nameParts.head == "percentile_cont" || nameParts.head == "percentile_disc") =>


can we lookup the function first then check the expression? it's hacky to check the name here.

Yes. This code is just a temporary implementation, meant to let you give a first review.

cloud-fan · 2023-11-21T06:58:45Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/percentiles.scala

+object PercentileContBuilder extends ExpressionBuilder {
+  override def build(funcName: String, expressions: Seq[Expression]): Expression = {
+    val numArgs = expressions.length
+    if (numArgs == 2) {


For now, we implement WITHIN GROUP within the expression itself, instead of a general approach. I think this will not be changed in the near future. However, from the API perspective, WITHIN GROUP does look like a general feature for aggregate functions.

We should make the framework more flexible. My idea is

function lookup should only look at the function name and inputs, not WITHIN GROUP (it's consistent with other features like DISTINCT, FILTER, etc.)

after function lookup, we get an expression, and the expression should implement a trait to indicate that it supports WITHIN GROUP. Otherwise we fail.

For example, we can have a

trait SupportsWithinGroup { def withOrderings(orderings: Seq[SortOrder]) }

The two percentile expressions should implement this trait so that they can get the orderings in a post-hoc way.

Because the two percentile expressions are quite complex, I recommend the factory pattern for inverse distribution functions.

trait InverseDistributionFactory extends AggregateFunction { def createInverseDistributionFunction(sortOrder: SortOrder): AggregateFunction }
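To make the two-step idea above concrete, here is a minimal, self-contained sketch. It does not use Catalyst: `SortOrder`, `Expression`, `Sum`, `lookup` and `resolve` are simplified, hypothetical stand-ins, and only the trait name and `withOrderings` come from the comment above. The sketch also does not model the separate error for a percentile call that omits WITHIN GROUP.

```scala
object WithinGroupSketch {
  // Hypothetical, simplified stand-ins for Catalyst's SortOrder and Expression.
  final case class SortOrder(column: String, ascending: Boolean = true)
  sealed trait Expression
  final case class Sum(child: String) extends Expression

  // Trait from the comment above: the expression accepts orderings after lookup.
  trait SupportsWithinGroup {
    def withOrderings(orderings: Seq[SortOrder]): Expression
  }

  final case class PercentileCont(percentage: Double, orderings: Seq[SortOrder] = Nil)
      extends Expression with SupportsWithinGroup {
    def withOrderings(orderings: Seq[SortOrder]): Expression = copy(orderings = orderings)
  }

  // Step 1: lookup sees only the function name and its inputs, never WITHIN GROUP.
  def lookup(name: String, args: Seq[String]): Expression = name match {
    case "percentile_cont" => PercentileCont(args.head.toDouble)
    case "sum"             => Sum(args.head)
    case other             => throw new IllegalArgumentException(s"Unknown function: $other")
  }

  // Step 2: WITHIN GROUP is accepted only if the resolved expression opts in via the trait.
  def resolve(
      name: String,
      args: Seq[String],
      withinGroup: Option[Seq[SortOrder]]): Expression =
    (lookup(name, args), withinGroup) match {
      case (e: SupportsWithinGroup, Some(orderings)) => e.withOrderings(orderings)
      case (e, None)                                 => e
      case (_, Some(_)) =>
        throw new IllegalArgumentException(
          s"$name does not support WITHIN GROUP (ORDER BY ...)")
    }
}
```

With this shape, `resolve("sum", Seq("x"), Some(...))` fails after lookup rather than in the parser, which is consistent with how the comment describes DISTINCT and FILTER being handled.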

…d PercentileDisc into functionCall

cloud-fan · 2023-11-23T13:10:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+              throw QueryCompilationErrors.inverseDistributionFunctionMissingWithinGroupError(
+                factory.prettyName)
+            case other if u.sortOrder.isDefined =>
+              throw QueryCompilationErrors.unsupportedInverseDistributionFunctionError(


It only covers unsupported agg func. We should cover all cases in validateFunction

maybe we can add an extra case match at the beginning

case _ if !func.isInstanceOf[InverseDistributionFactory] && u.sortOrder.isDefined => fail

BTW, I think the existing functionWithUnsupportedSyntaxError is fine here. We don't need to create a new error.

It only covers unsupported agg func. We should cover all cases in validateFunction

Thank you for the reminder.

BTW, I think the existing functionWithUnsupportedSyntaxError is fine here. We don't need to create a new error.

I tried to find a suitable error, but there are too many. Thank you for the reminder.

cloud-fan · 2023-11-23T13:14:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala

@@ -298,11 +298,12 @@ case class UnresolvedFunction(
    arguments: Seq[Expression],
    isDistinct: Boolean,
    filter: Option[Expression] = None,
-    ignoreNulls: Boolean = false)
+    ignoreNulls: Boolean = false,
+    sortOrder: Option[SortOrder] = None)


maybe orderingWithinGroup?

cloud-fan · 2023-11-23T13:14:45Z

sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4

@@ -975,6 +975,7 @@ primaryExpression
    | LEFT_PAREN query RIGHT_PAREN                                                             #subqueryExpression
    | functionName LEFT_PAREN (setQuantifier? argument+=functionArgument
       (COMMA argument+=functionArgument)*)? RIGHT_PAREN
+       (WITHIN GROUP LEFT_PAREN ORDER BY sortItem RIGHT_PAREN)?


can we check other databases? Is the sort ordering always one item?

I checked the inverse distribution functions in Postgres and Oracle
percentile_cont(0.25) WITHIN GROUP (ORDER BY salary, bonus)

Postgres throws

SQL Error [42883]: ERROR: function percentile_cont(numeric, numeric, double precision) does not exist Hint: No function matches the given name and argument types. You might need to add explicit type casts. Position: 52

Oracle throws
SQL Error [909] [42000]: ORA-00909: invalid number of arguments

https://www.postgresql.org/docs/9.4/functions-aggregate.html

So the WITHIN GROUP can have multiple order by items. We should allow it in the parser and can fail for certain functions that only allow one sort order.

cloud-fan · 2023-11-23T13:15:35Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/percentiles.scala

+ * A factory used to create the inverse distribution function during the analysis phase.
+ * This factory is also an aggregate function that cannot be evaluated.
+ */
+trait InverseDistributionFactory extends AggregateFunction {


can we move it to a new file?

instead of having a factory, shall we create a fake PercentileCont first and then update its left expression later in validateFunction?

e.g. we can put a null literal as a placeholder for PercentileCont.left, and replace it later

We still need a base trait

trait SupportsOrderingWithinGroup { self: AggregateFunction => def withOrderingWithinGroup(orderingWithGroup: Seq[SortOrder]): Expression }

If a null literal is too hacky, we can create a placeholder expression.

I think using the factory model or using SupportsOrderingWithinGroup is just a difference in specific implementation methods, and I have no preference for the two.
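The placeholder variant discussed above could look roughly like the sketch below. It is a toy model, not the Catalyst code: `Column`, `Literal`, `build` and the plain `IllegalArgumentException` are hypothetical simplifications, while `UnresolvedWithinGroup`, `SupportsOrderingWithinGroup` and `withOrderingWithinGroup` are names that appear in this thread.

```scala
object PlaceholderSketch {
  sealed trait Expression
  final case class Column(name: String) extends Expression
  final case class Literal(value: Double) extends Expression
  // Placeholder standing in for the ORDER BY expression that is not known at lookup time.
  case object UnresolvedWithinGroup extends Expression
  final case class SortOrder(child: Expression, ascending: Boolean = true)

  trait SupportsOrderingWithinGroup {
    def withOrderingWithinGroup(orderingWithinGroup: Seq[SortOrder]): Expression
  }

  final case class PercentileCont(left: Expression, right: Expression)
      extends Expression with SupportsOrderingWithinGroup {
    // Swap the placeholder for the real ordering expression; exactly one ordering is accepted.
    def withOrderingWithinGroup(orderingWithinGroup: Seq[SortOrder]): Expression =
      orderingWithinGroup match {
        case Seq(single) => copy(left = single.child)
        case other =>
          throw new IllegalArgumentException(
            s"Expected exactly one ordering in WITHIN GROUP but got ${other.length}")
      }
  }

  // Function lookup builds the expression from the percentage alone.
  def build(percentage: Double): PercentileCont =
    PercentileCont(UnresolvedWithinGroup, Literal(percentage))
}
```

Compared with a factory, the expression type is the same before and after the ordering is attached; the cost is that a half-built expression exists for a short time during analysis.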

beliefer · 2023-11-30T09:16:24Z

@cloud-fan Please take a look? Thanks.

cloud-fan · 2023-12-01T11:15:06Z

common/utils/src/main/resources/error/error-classes.json

+      },
+      "PERCENTILE_PERCENTAGE_MISSING" : {
+        "message" : [
+          "Missing percentage."


This doesn't seem to be a general property of inverse distribution functions. It is specific to certain functions and shouldn't be in this error class.

cloud-fan · 2023-12-01T13:02:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -2367,6 +2367,27 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
        case agg: AggregateFunction =>
          // Note: PythonUDAF does not support these advanced clauses.
          if (agg.isInstanceOf[PythonUDAF]) checkUnsupportedAggregateClause(agg, u)
+          val newAgg = agg match {
+            case idf: SupportsOrderingWithinGroup if u.orderingWithinGroup.isDefined =>


This is not future-proof. AggregateWindowFunction extends AggregateFunction, but we are not performing the DISTINCT check for it.

Good catch!

cloud-fan · 2023-12-01T13:02:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+            case _ =>
+              if (u.orderingWithinGroup.isDefined) {
+                throw QueryCompilationErrors.functionWithUnsupportedSyntaxError(
+                  func.prettyName, "WITHIN GROUP (ORDER BY clause)")


Suggested change

- func.prettyName, "WITHIN GROUP (ORDER BY clause)")
+ func.prettyName, "WITHIN GROUP (ORDER BY ...)")

cloud-fan · 2023-12-01T13:09:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala

-      copy(arguments = newChildren.dropRight(1), filter = Some(newChildren.last))
+      if (orderingWithinGroup.isDefined) {
+        val newSortOrder = Some(newChildren.last.asInstanceOf[SortOrder])
+        val args = newChildren.dropRight(1)


nit: dropRight creates a new collection, let's be more efficient here

val newSortOrder = Some(newChildren.last.asInstanceOf[SortOrder])
val newFilter = Some(newChildren(newChildren.length - 2).asInstanceOf[SortOrder])
...
arguments = newChildren.dropRight(2)
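As a rough illustration of the nit, the sketch below splits a child list that is laid out as arguments, then the optional filter, then the optional ordering, which is the layout the snippet above implies. `Parts`, `split` and the use of `String` children are hypothetical simplifications; the point is only that the tail elements are read by index and the arguments are produced by one slice.

```scala
object ChildrenSketch {
  // Hypothetical stand-in for the pieces of an UnresolvedFunction rebuilt from its children.
  final case class Parts(
      arguments: Seq[String],
      filter: Option[String],
      ordering: Option[String])

  def split(newChildren: IndexedSeq[String], hasFilter: Boolean, hasOrdering: Boolean): Parts = {
    val n = newChildren.length
    // Optional children sit at the tail, so index them directly from the end.
    val ordering = if (hasOrdering) Some(newChildren(n - 1)) else None
    val filterIdx = if (hasOrdering) n - 2 else n - 1
    val filter = if (hasFilter) Some(newChildren(filterIdx)) else None
    val trailing = (if (hasFilter) 1 else 0) + (if (hasOrdering) 1 else 0)
    // One take produces the arguments, instead of chaining several dropRight calls.
    Parts(newChildren.take(n - trailing), filter, ordering)
  }
}
```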

cloud-fan · 2023-12-01T13:10:35Z

.../scala/org/apache/spark/sql/catalyst/expressions/aggregate/SupportsOrderingWithinGroup.scala

+ * The trait used to set the [[SortOrder]] after inverse distribution functions parsed.
+ */
+trait SupportsOrderingWithinGroup { self: AggregateFunction =>
+  def isFake: Boolean


do we need this flag? I think we can always call withOrderingWithinGroup, as it only happens once we turn UnresolvedFunction into resolved ones.

cloud-fan · 2023-12-01T13:11:53Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/percentiles.scala

+    if (numArgs == 1) {
+      PercentileCont(UnresolvedWithinGroup, expressions(0))
+    } else if (numArgs == 0) {
+      throw QueryCompilationErrors.inverseDistributionFunctionMissingPercentageError(funcName)


why do we need to throw a different error here? It's just QueryCompilationErrors.wrongNumArgsError

cloud-fan · 2023-12-01T13:14:33Z

sql/core/src/test/resources/sql-tests/analyzer-results/percentiles.sql.out

+-- !query analysis
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "_LEGACY_ERROR_TEMP_1023",


since we are here, shall we create error classes for them?

Let me make a separate PR for this, because this mistake is very common and can be easily corrected. It's not easy to review, and I'm also not good at resolving conflicts.

cloud-fan · 2023-12-04T06:53:13Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/percentiles.scala

+  override def withOrderingWithinGroup(orderingWithinGroup: Seq[SortOrder]): AggregateFunction = {
+    if (orderingWithinGroup.length != 1) {
+      throw QueryCompilationErrors.wrongNumArgsError(
+        nodeName, Seq(2), orderingWithinGroup.length + 1)


Can we create a new sub-error-class under INVALID_INVERSE_DISTRIBUTION_FUNCTION? like WRONG_NUM_ORDERINGS.

cloud-fan · 2023-12-04T06:53:25Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/percentiles.scala

@@ -417,6 +429,17 @@ case class PercentileDisc(
    s"$prettyName($distinct${right.sql}) WITHIN GROUP (ORDER BY ${left.sql}$direction)"
  }

+  override def withOrderingWithinGroup(orderingWithinGroup: Seq[SortOrder]): AggregateFunction = {
+    if (orderingWithinGroup.length != 1) {
+      throw QueryCompilationErrors.wrongNumArgsError(


cloud-fan · 2023-12-04T11:38:56Z

common/utils/src/main/resources/error/error-classes.json

+      },
+      "WRONG_NUM_ORDERINGS" : {
+        "message" : [
+          "WITHIN GROUP requires <expectedNum> orderings but the actual number is <actualNum>."


can we mention the function name?

The parent error class contains it.
"Invalid inverse distribution function <funcName>."

oh, let's make it a bit clearer then.

Requires <expectedNum> orderings in WITHIN GROUP but got <actualNum>
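To see how the function name reaches the user even though the sub-class text omits it, here is a small sketch that renders the parent and sub-class templates quoted in this thread. Joining the two templates with a single space is an assumption about how the messages are combined, and `render` and `wrongNumOrderingsMessage` are hypothetical helpers, not Spark APIs.

```scala
object ErrorMessageSketch {
  // Templates quoted from this thread.
  val parent = "Invalid inverse distribution function <funcName>."
  val wrongNumOrderings = "Requires <expectedNum> orderings in WITHIN GROUP but got <actualNum>"

  // Replace each <key> placeholder with its value.
  def render(template: String, params: Map[String, String]): String =
    params.foldLeft(template) { case (msg, (key, value)) => msg.replace(s"<$key>", value) }

  // Assumption: the final message is the parent template, a space, then the sub-class template.
  def wrongNumOrderingsMessage(funcName: String, expected: Int, actual: Int): String =
    render(
      s"$parent $wrongNumOrderings",
      Map(
        "funcName" -> funcName,
        "expectedNum" -> expected.toString,
        "actualNum" -> actual.toString))
}
```

Under that assumption, `wrongNumOrderingsMessage("percentile_cont", 1, 2)` yields `Invalid inverse distribution function percentile_cont. Requires 1 orderings in WITHIN GROUP but got 2`.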

beliefer · 2023-12-04T12:47:58Z

The GA failure is unrelated.

beliefer · 2023-12-05T01:29:50Z

The GA failure is unrelated.
Merged to master
@cloud-fan Thank you very much!

…d PercentileDisc into functionCall

The Spark SQL parser has a special rule to parse `[percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v)`. We should merge this rule into the `functionCall`.

Merge the parse rule of `PercentileCont` and `PercentileDisc` into `functionCall`.

'No'.

New test cases.

'No'.

Closes apache#43910 from beliefer/SPARK-46009.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Jiaan Geng <beliefer@163.com>

…P_1023

### What changes were proposed in this pull request?
Based on the suggestion at #43910 (comment), this PR assigns a name to the error class `_LEGACY_ERROR_TEMP_1023`.

### Why are the changes needed?
Assign a name to the error class `_LEGACY_ERROR_TEMP_1023`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44355 from beliefer/SPARK-46406.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Jiaan Geng <beliefer@163.com>

cloud-fan · 2024-01-20T01:49:24Z

common/utils/src/main/resources/error/README.md

@@ -1309,6 +1309,7 @@ The following SQLSTATEs are collated from:
 |HZ320    |HZ   |RDA-specific condition                            |320     |version not supported                                       |RDA/SQL        |Y       |RDA/SQL                                                                     |
 |HZ321    |HZ   |RDA-specific condition                            |321     |TCP/IP error                                                |RDA/SQL        |Y       |RDA/SQL                                                                     |
 |HZ322    |HZ   |RDA-specific condition                            |322     |TLS alert                                                   |RDA/SQL        |Y       |RDA/SQL                                                                     |
+|ID001    |IM   |Invalid inverse distribution function             |001     |Invalid inverse distribution function                       |SQL/Foundation |N       |SQL/Foundation PostgreSQL Oracle Snowflake Redshift H2                      |


I can't find it in https://www.postgresql.org/docs/current/errcodes-appendix.html , did you make it up yourself? @beliefer

Do you mean I should reference the error code from PostgreSQL?
If so, I will pick it up.

I mean we should be honest here. If it's not from PostgreSQL or other system, and we do want a new category here, I'd suggest 42K0K

I selected ID001 because ID is an abbreviation of "Invalid inverse distribution function".
42K0K seems good too.

OK, so you invented it. Let's not do it again. SQLSTATE is a standard and the prefix has meanings. Can you open a followup PR to change it to 42K0K? thanks!
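For readers unfamiliar with the point about prefixes: a SQLSTATE is five characters, where the first two form the class and the last three the subclass, and class 42 is the standard "syntax error or access rule violation" class. The sketch below is illustrative only; `SqlState`, `parse` and `rowIsConsistent` are hypothetical helpers, and the README row check simply compares the class column against the code's own prefix.

```scala
object SqlStateSketch {
  // A SQLSTATE is five characters: a two-character class and a three-character subclass.
  final case class SqlState(cls: String, subclass: String)

  // Accept only five characters drawn from digits and upper-case letters.
  def parse(code: String): Option[SqlState] =
    if (code.length == 5 && code.forall(c => c.isDigit || (c >= 'A' && c <= 'Z')))
      Some(SqlState(code.take(2), code.drop(2)))
    else None

  // Check that a README row's class column agrees with the code's own prefix.
  def rowIsConsistent(code: String, classColumn: String): Boolean =
    parse(code).exists(_.cls == classColumn)
}
```

Here `parse("42K0K")` gives class `42` and subclass `K0K`, so the suggested code sits in an existing standard class. The added README row in the diff above pairs `ID001` with the class column `IM`, which a check like `rowIsConsistent` would flag.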

…inverse distribution function

### What changes were proposed in this pull request?
This PR follows up #43910 and proposes to change the error code for invalid inverse distribution function.

### Why are the changes needed?
Based on the discussion at #43910 (comment)

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA tests.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44811 from beliefer/SPARK-46009_followup.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>

…TILE_DISC in g4

### What changes were proposed in this pull request?
This PR proposes to remove the unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4.

### Why are the changes needed?
#43910 merged the parse rule of `PercentileCont` and `PercentileDisc` into `functionCall`, but forgot to remove the unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #46272 from beliefer/SPARK-46009_followup2.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

github-actions bot added SQL DOCS CONNECT labels Nov 20, 2023

beliefer force-pushed the SPARK-46009 branch from 42f760b to 4dca856 Compare November 21, 2023 06:04

beliefer requested review from cloud-fan and MaxGekk November 21, 2023 06:13

cloud-fan reviewed Nov 21, 2023

beliefer force-pushed the SPARK-46009 branch 2 times, most recently from 65273f3 to 9622179 Compare November 21, 2023 13:45

[SPARK-46009][SQL][CONNECT] Merge the parse rule of PercentileCont an…

800b100

…d PercentileDisc into functionCall

beliefer force-pushed the SPARK-46009 branch from 9622179 to 800b100 Compare November 22, 2023 07:09

cloud-fan reviewed Nov 23, 2023

Replace InverseDistributionFactory with SupportsOrderingWithinGroup

fa06acc

beliefer force-pushed the SPARK-46009 branch from 4be5762 to fa06acc Compare November 25, 2023 04:18

beliefer requested a review from cloud-fan November 28, 2023 12:33

cloud-fan reviewed Dec 1, 2023

Update code

0721c33

beliefer force-pushed the SPARK-46009 branch from d604ef1 to 0721c33 Compare December 2, 2023 14:39

cloud-fan reviewed Dec 4, 2023

cloud-fan approved these changes Dec 4, 2023

beliefer added 2 commits December 4, 2023 17:25

Update code

76f03d5

Update code

c8c9fac

cloud-fan reviewed Dec 4, 2023

Update code

ff40177

beliefer closed this in f1283c1 Dec 5, 2023

beliefer mentioned this pull request Dec 14, 2023

[SPARK-46406][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1023 #44355

Closed

cloud-fan reviewed Jan 20, 2024

beliefer mentioned this pull request Jan 20, 2024

[SPARK-46009][SQL][DOCS][FOLLOWUP] Change the error code for invalid inverse distribution function #44811

Closed

beliefer mentioned this pull request Apr 29, 2024

[SPARK-46009][SQL][FOLLOWUP] Remove unused PERCENTILE_CONT and PERCENTILE_DISC in g4 #46272

Closed
