Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46009][SQL][CONNECT] Merge the parse rule of PercentileCont and PercentileDisc into functionCall #43910

Closed
wants to merge 6 commits into from

Conversation

beliefer
Copy link
Contributor

@beliefer beliefer commented Nov 20, 2023

What changes were proposed in this pull request?

Spark SQL parser have a special rule to parse [percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v).
We should merge this rule into the functionCall.

Why are the changes needed?

Merge the parse rule of PercentileCont and PercentileDisc into functionCall.

Does this PR introduce any user-facing change?

'No'.

How was this patch tested?

New test cases.

Was this patch authored or co-authored using generative AI tooling?

'No'.

@@ -1846,6 +1846,34 @@
},
"sqlState" : "42000"
},
"INVALID_INVERSE_DISTRIBUTION_FUNCTION" : {
"message" : [
"Invalid inverse distribution function <funcName>."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is inverse distribution a common terminology?

Copy link
Contributor Author

@beliefer beliefer Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I referenced H2, SQLServer and Postgres, they call these function inverse distribution function.
Addition, some databases call these functions as ordered set function.

case u @ UnresolvedFunction(nameParts, arguments, _, _, _, sortOrder) => withPosition(u) {
val args = sortOrder match {
case Some(s) if nameParts.length == 1 &&
(nameParts.head == "percentile_cont" || nameParts.head == "percentile_disc") =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we lookup the function first then check the expression? it's hacky to check the name here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. These code just is a temp implementation. It's used let you give a first review.

object PercentileContBuilder extends ExpressionBuilder {
override def build(funcName: String, expressions: Seq[Expression]): Expression = {
val numArgs = expressions.length
if (numArgs == 2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, we implement WITHIN GROUP within the expression itself, instead of a general approach. I think this will not be changed in the near future. However, from the API perspective, WITHIN GROUP does like a general feature for aggregate functions.

We should make the framework more flexible. My idea is

  1. function lookup should only look at the function name and inputs, not WITHIN GROUP (it's consistent with other features like DISTINCT, FILTER, etc.)
  2. after function lookup, we get an expression, and the expression should implement a trait to indicate that it supports WITHIN GROUP. Otherwise we fail.

For example, we can have a

trait SupportsWithinGroup {
  def withOrderings(orderings: Seq[SortOrder])
}

The two percentile expressions should implement this trait so that they can get the orderings in a post-hoc way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the two percentile expressions have a lot of complexity. I recommend the factory mode for inverse distribution function.

trait InverseDistributionFactory extends AggregateFunction {
  def createInverseDistributionFunction(sortOrder: SortOrder): AggregateFunction
}

@beliefer beliefer force-pushed the SPARK-46009 branch 2 times, most recently from 65273f3 to 9622179 Compare November 21, 2023 13:45
throw QueryCompilationErrors.inverseDistributionFunctionMissingWithinGroupError(
factory.prettyName)
case other if u.sortOrder.isDefined =>
throw QueryCompilationErrors.unsupportedInverseDistributionFunctionError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only covers unsupported agg func. We should cover all cases in validateFunction

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add an extra case match at the beginning

case _ if !func.isInstanceOf[InverseDistributionFactory] && u.sortOrder.isDefined => fail

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I think the existing functionWithUnsupportedSyntaxError is fine here. We don't need to create a new error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only covers unsupported agg func. We should cover all cases in validateFunction

Thank you for the reminder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I think the existing functionWithUnsupportedSyntaxError is fine here. We don't need to create a new error.

I tried to find out the suitable error, but it too many. Thank you for the reminder.

@@ -298,11 +298,12 @@ case class UnresolvedFunction(
arguments: Seq[Expression],
isDistinct: Boolean,
filter: Option[Expression] = None,
ignoreNulls: Boolean = false)
ignoreNulls: Boolean = false,
sortOrder: Option[SortOrder] = None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe orderingWithinGroup?

@@ -975,6 +975,7 @@ primaryExpression
| LEFT_PAREN query RIGHT_PAREN #subqueryExpression
| functionName LEFT_PAREN (setQuantifier? argument+=functionArgument
(COMMA argument+=functionArgument)*)? RIGHT_PAREN
(WITHIN GROUP LEFT_PAREN ORDER BY sortItem RIGHT_PAREN)?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we check other databases? Is the sort ordering always one item?

Copy link
Contributor Author

@beliefer beliefer Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the inverse distribution functions in Postgres and Oracle
percentile_cont(0.25) WITHIN GROUP (ORDER BY salary, bonus)

Postgres throws

SQL 错误 [42883]: ERROR: function percentile_cont(numeric, numeric, double precision) does not exist
  建议:No function matches the given name and argument types. You might need to add explicit type casts.
  位置:52

Oracle throws
SQL 错误 [909] [42000]: ORA-00909: 参数个数无效

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.postgresql.org/docs/9.4/functions-aggregate.html

So the WITHIN GROUP can have multiple order by items. We should allow it in the parser and can fail for certain functions that only allow one sort order.
image

* A factory used to create the inverse distribution function during the analysis phase.
* This factory is also an aggregate function that cannot be evaluated.
*/
trait InverseDistributionFactory extends AggregateFunction {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move it to a new file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of having a factory, shall we create a fake PercentileCont first and then update its left expression later in validateFunction?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. we can put a null literal as a placeholder for PercentileCont.left, and replace it later

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need a base trait

trait SupportsOrderingWithinGroup { self: AggregateFunction =>
  def withOrderingWithinGroup(orderingWithGroup: Seq[SortOrder]): Expression
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if null literal is too hack, we can create a placeholder expression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using the factory model or using SupportsOrderingWithinGroup is just a difference in specific implementation methods, and I have no preference for the two.

@beliefer
Copy link
Contributor Author

@cloud-fan Please take a look? Thanks.

},
"PERCENTILE_PERCENTAGE_MISSING" : {
"message" : [
"Missing percentage."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem like a thing to inverse distribution function. It should be specific to certain functions and shouldn't be in this error class.

@@ -2367,6 +2367,27 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
case agg: AggregateFunction =>
// Note: PythonUDAF does not support these advanced clauses.
if (agg.isInstanceOf[PythonUDAF]) checkUnsupportedAggregateClause(agg, u)
val newAgg = agg match {
case idf: SupportsOrderingWithinGroup if u.orderingWithinGroup.isDefined =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not future-proof. AggregateWindowFunction extends AggregateFunction, but we are not performing the DISTINCT check for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

case _ =>
if (u.orderingWithinGroup.isDefined) {
throw QueryCompilationErrors.functionWithUnsupportedSyntaxError(
func.prettyName, "WITHIN GROUP (ORDER BY clause)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func.prettyName, "WITHIN GROUP (ORDER BY clause)")
func.prettyName, "WITHIN GROUP (ORDER BY ...)")

copy(arguments = newChildren.dropRight(1), filter = Some(newChildren.last))
if (orderingWithinGroup.isDefined) {
val newSortOrder = Some(newChildren.last.asInstanceOf[SortOrder])
val args = newChildren.dropRight(1)
Copy link
Contributor

@cloud-fan cloud-fan Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: dropRight creates a new collection, let's be more efficient here

val newSortOrder = Some(newChildren.last.asInstanceOf[SortOrder])
val newFilter = Some(newChildren(newChildren.length - 2).asInstanceOf[SortOrder])
... arguments = newChildren.dropRight(2)

* The trait used to set the [[SortOrder]] after inverse distribution functions parsed.
*/
trait SupportsOrderingWithinGroup { self: AggregateFunction =>
def isFake: Boolean
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this flag? I think we can always call withOrderingWithinGroup, as it only happens once we turn UnresolvedFunction into resolved ones.

if (numArgs == 1) {
PercentileCont(UnresolvedWithinGroup, expressions(0))
} else if (numArgs == 0) {
throw QueryCompilationErrors.inverseDistributionFunctionMissingPercentageError(funcName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to throw a different error here? It's just QueryCompilationErrors.wrongNumArgsError

-- !query analysis
org.apache.spark.sql.AnalysisException
{
"errorClass" : "_LEGACY_ERROR_TEMP_1023",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are here, shall we create error classes for them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me make a separate PR for this, because this mistake is very common and can be easily corrected. It's not easy to review, and I'm also not good at resolving conflicts.

override def withOrderingWithinGroup(orderingWithinGroup: Seq[SortOrder]): AggregateFunction = {
if (orderingWithinGroup.length != 1) {
throw QueryCompilationErrors.wrongNumArgsError(
nodeName, Seq(2), orderingWithinGroup.length + 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we create a new sub-error-class under INVALID_INVERSE_DISTRIBUTION_FUNCTION? like WRONG_NUM_ORDERINGS.

@@ -417,6 +429,17 @@ case class PercentileDisc(
s"$prettyName($distinct${right.sql}) WITHIN GROUP (ORDER BY ${left.sql}$direction)"
}

override def withOrderingWithinGroup(orderingWithinGroup: Seq[SortOrder]): AggregateFunction = {
if (orderingWithinGroup.length != 1) {
throw QueryCompilationErrors.wrongNumArgsError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

},
"WRONG_NUM_ORDERINGS" : {
"message" : [
"WITHIN GROUP requires <expectedNum> orderings but the actual number is <actualNum>."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we mention the function name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent error class contains it.
"Invalid inverse distribution function <funcName>."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, let's make it a bit clearer then.

Requires <expectedNum> orderings in WITHIN GROUP but got <actualNum>

@beliefer
Copy link
Contributor Author

beliefer commented Dec 4, 2023

The GA failure is unrelated.

@beliefer beliefer closed this in f1283c1 Dec 5, 2023
@beliefer
Copy link
Contributor Author

beliefer commented Dec 5, 2023

The GA failure is unrelated.
Merged to master
@cloud-fan Thank you very much!

dbatomic pushed a commit to dbatomic/spark that referenced this pull request Dec 11, 2023
…d PercentileDisc into functionCall

Spark SQL parser have a special rule to parse `[percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v)`.
We should merge this rule into the `functionCall`.

Merge the parse rule of `PercentileCont` and `PercentileDisc` into `functionCall`.

'No'.

New test cases.

'No'.

Closes apache#43910 from beliefer/SPARK-46009.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Jiaan Geng <beliefer@163.com>
beliefer added a commit that referenced this pull request Dec 17, 2023
…P_1023

### What changes were proposed in this pull request?
Based on the suggestion at #43910 (comment), this PR want assign a name to the error class `_LEGACY_ERROR_TEMP_1023`.

### Why are the changes needed?
Assign a name to the error class `_LEGACY_ERROR_TEMP_1023`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44355 from beliefer/SPARK-46406.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Jiaan Geng <beliefer@163.com>
@@ -1309,6 +1309,7 @@ The following SQLSTATEs are collated from:
|HZ320 |HZ |RDA-specific condition |320 |version not supported |RDA/SQL |Y |RDA/SQL |
|HZ321 |HZ |RDA-specific condition |321 |TCP/IP error |RDA/SQL |Y |RDA/SQL |
|HZ322 |HZ |RDA-specific condition |322 |TLS alert |RDA/SQL |Y |RDA/SQL |
|ID001 |IM |Invalid inverse distribution function |001 |Invalid inverse distribution function |SQL/Foundation |N |SQL/Foundation PostgreSQL Oracle Snowflake Redshift H2 |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find it in https://www.postgresql.org/docs/current/errcodes-appendix.html , did you make it up yourself? @beliefer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You means reference the error code from postgresql?
If so, I will pick it up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean we should be honest here. If it's not from PostgreSQL or other system, and we do want a new category here, I'd suggest 42K0K

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I selected ID001 due to ID is the abbreviation of Invalid inverse distribution function.
42K0K seems good too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so you invented it. Let's not do it again. SQLSTATE is a standard and the prefix has meanings. Can you open a followup PR to change it to 42K0K? thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

MaxGekk pushed a commit that referenced this pull request Jan 21, 2024
…inverse distribution function

### What changes were proposed in this pull request?
This PR follows up #43910 and propose to change the error code for invalid inverse distribution function.

### Why are the changes needed?
Based on the discussion at #43910 (comment)

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA tests.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44811 from beliefer/SPARK-46009_followup.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
dongjoon-hyun pushed a commit that referenced this pull request May 1, 2024
…TILE_DISC in g4

### What changes were proposed in this pull request?
This PR propose to remove unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4

### Why are the changes needed?
#43910 merged the parse rule of `PercentileCont` and `PercentileDisc` into `functionCall`, but forgot to remove unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #46272 from beliefer/SPARK-46009_followup2.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
…TILE_DISC in g4

### What changes were proposed in this pull request?
This PR propose to remove unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4

### Why are the changes needed?
apache#43910 merged the parse rule of `PercentileCont` and `PercentileDisc` into `functionCall`, but forgot to remove unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
GA.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes apache#46272 from beliefer/SPARK-46009_followup2.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants