-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-46009][SQL][CONNECT] Merge the parse rule of PercentileCont and PercentileDisc into functionCall #43910
Conversation
42f760b
to
4dca856
Compare
@@ -1846,6 +1846,34 @@ | |||
}, | |||
"sqlState" : "42000" | |||
}, | |||
"INVALID_INVERSE_DISTRIBUTION_FUNCTION" : { | |||
"message" : [ | |||
"Invalid inverse distribution function <funcName>." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is inverse distribution
a common terminology?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I referenced H2, SQLServer and Postgres, they call these function inverse distribution function
.
Addition, some databases call these functions as ordered set function
.
case u @ UnresolvedFunction(nameParts, arguments, _, _, _, sortOrder) => withPosition(u) { | ||
val args = sortOrder match { | ||
case Some(s) if nameParts.length == 1 && | ||
(nameParts.head == "percentile_cont" || nameParts.head == "percentile_disc") => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we lookup the function first then check the expression? it's hacky to check the name here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. These code just is a temp implementation. It's used let you give a first review.
object PercentileContBuilder extends ExpressionBuilder { | ||
override def build(funcName: String, expressions: Seq[Expression]): Expression = { | ||
val numArgs = expressions.length | ||
if (numArgs == 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, we implement WITHIN GROUP within the expression itself, instead of a general approach. I think this will not be changed in the near future. However, from the API perspective, WITHIN GROUP does like a general feature for aggregate functions.
We should make the framework more flexible. My idea is
- function lookup should only look at the function name and inputs, not WITHIN GROUP (it's consistent with other features like DISTINCT, FILTER, etc.)
- after function lookup, we get an expression, and the expression should implement a trait to indicate that it supports
WITHIN GROUP
. Otherwise we fail.
For example, we can have a
trait SupportsWithinGroup {
def withOrderings(orderings: Seq[SortOrder])
}
The two percentile expressions should implement this trait so that they can get the orderings in a post-hoc way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the two percentile expressions have a lot of complexity. I recommend the factory mode for inverse distribution function.
trait InverseDistributionFactory extends AggregateFunction {
def createInverseDistributionFunction(sortOrder: SortOrder): AggregateFunction
}
65273f3
to
9622179
Compare
…d PercentileDisc into functionCall
9622179
to
800b100
Compare
throw QueryCompilationErrors.inverseDistributionFunctionMissingWithinGroupError( | ||
factory.prettyName) | ||
case other if u.sortOrder.isDefined => | ||
throw QueryCompilationErrors.unsupportedInverseDistributionFunctionError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only covers unsupported agg func. We should cover all cases in validateFunction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can add an extra case match at the beginning
case _ if !func.isInstanceOf[InverseDistributionFactory] && u.sortOrder.isDefined => fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I think the existing functionWithUnsupportedSyntaxError
is fine here. We don't need to create a new error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only covers unsupported agg func. We should cover all cases in
validateFunction
Thank you for the reminder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I think the existing
functionWithUnsupportedSyntaxError
is fine here. We don't need to create a new error.
I tried to find out the suitable error, but it too many. Thank you for the reminder.
@@ -298,11 +298,12 @@ case class UnresolvedFunction( | |||
arguments: Seq[Expression], | |||
isDistinct: Boolean, | |||
filter: Option[Expression] = None, | |||
ignoreNulls: Boolean = false) | |||
ignoreNulls: Boolean = false, | |||
sortOrder: Option[SortOrder] = None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe orderingWithinGroup
?
@@ -975,6 +975,7 @@ primaryExpression | |||
| LEFT_PAREN query RIGHT_PAREN #subqueryExpression | |||
| functionName LEFT_PAREN (setQuantifier? argument+=functionArgument | |||
(COMMA argument+=functionArgument)*)? RIGHT_PAREN | |||
(WITHIN GROUP LEFT_PAREN ORDER BY sortItem RIGHT_PAREN)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we check other databases? Is the sort ordering always one item?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the inverse distribution functions in Postgres and Oracle
percentile_cont(0.25) WITHIN GROUP (ORDER BY salary, bonus)
Postgres throws
SQL 错误 [42883]: ERROR: function percentile_cont(numeric, numeric, double precision) does not exist
建议:No function matches the given name and argument types. You might need to add explicit type casts.
位置:52
Oracle throws
SQL 错误 [909] [42000]: ORA-00909: 参数个数无效
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://www.postgresql.org/docs/9.4/functions-aggregate.html
So the WITHIN GROUP can have multiple order by items. We should allow it in the parser and can fail for certain functions that only allow one sort order.
* A factory used to create the inverse distribution function during the analysis phase. | ||
* This factory is also an aggregate function that cannot be evaluated. | ||
*/ | ||
trait InverseDistributionFactory extends AggregateFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move it to a new file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of having a factory, shall we create a fake PercentileCont
first and then update its left
expression later in validateFunction
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. we can put a null literal as a placeholder for PercentileCont.left
, and replace it later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need a base trait
trait SupportsOrderingWithinGroup { self: AggregateFunction =>
def withOrderingWithinGroup(orderingWithGroup: Seq[SortOrder]): Expression
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if null literal is too hack, we can create a placeholder expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using the factory model or using SupportsOrderingWithinGroup
is just a difference in specific implementation methods, and I have no preference for the two.
4be5762
to
fa06acc
Compare
@cloud-fan Please take a look? Thanks. |
}, | ||
"PERCENTILE_PERCENTAGE_MISSING" : { | ||
"message" : [ | ||
"Missing percentage." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem like a thing to inverse distribution function. It should be specific to certain functions and shouldn't be in this error class.
@@ -2367,6 +2367,27 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor | |||
case agg: AggregateFunction => | |||
// Note: PythonUDAF does not support these advanced clauses. | |||
if (agg.isInstanceOf[PythonUDAF]) checkUnsupportedAggregateClause(agg, u) | |||
val newAgg = agg match { | |||
case idf: SupportsOrderingWithinGroup if u.orderingWithinGroup.isDefined => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not future-proof. AggregateWindowFunction
extends AggregateFunction
, but we are not performing the DISTINCT check for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
case _ => | ||
if (u.orderingWithinGroup.isDefined) { | ||
throw QueryCompilationErrors.functionWithUnsupportedSyntaxError( | ||
func.prettyName, "WITHIN GROUP (ORDER BY clause)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func.prettyName, "WITHIN GROUP (ORDER BY clause)") | |
func.prettyName, "WITHIN GROUP (ORDER BY ...)") |
copy(arguments = newChildren.dropRight(1), filter = Some(newChildren.last)) | ||
if (orderingWithinGroup.isDefined) { | ||
val newSortOrder = Some(newChildren.last.asInstanceOf[SortOrder]) | ||
val args = newChildren.dropRight(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: dropRight
creates a new collection, let's be more efficient here
val newSortOrder = Some(newChildren.last.asInstanceOf[SortOrder])
val newFilter = Some(newChildren(newChildren.length - 2).asInstanceOf[SortOrder])
... arguments = newChildren.dropRight(2)
* The trait used to set the [[SortOrder]] after inverse distribution functions parsed. | ||
*/ | ||
trait SupportsOrderingWithinGroup { self: AggregateFunction => | ||
def isFake: Boolean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this flag? I think we can always call withOrderingWithinGroup
, as it only happens once we turn UnresolvedFunction
into resolved ones.
if (numArgs == 1) { | ||
PercentileCont(UnresolvedWithinGroup, expressions(0)) | ||
} else if (numArgs == 0) { | ||
throw QueryCompilationErrors.inverseDistributionFunctionMissingPercentageError(funcName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to throw a different error here? It's just QueryCompilationErrors.wrongNumArgsError
-- !query analysis | ||
org.apache.spark.sql.AnalysisException | ||
{ | ||
"errorClass" : "_LEGACY_ERROR_TEMP_1023", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we are here, shall we create error classes for them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me make a separate PR for this, because this mistake is very common and can be easily corrected. It's not easy to review, and I'm also not good at resolving conflicts.
override def withOrderingWithinGroup(orderingWithinGroup: Seq[SortOrder]): AggregateFunction = { | ||
if (orderingWithinGroup.length != 1) { | ||
throw QueryCompilationErrors.wrongNumArgsError( | ||
nodeName, Seq(2), orderingWithinGroup.length + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we create a new sub-error-class under INVALID_INVERSE_DISTRIBUTION_FUNCTION
? like WRONG_NUM_ORDERINGS
.
@@ -417,6 +429,17 @@ case class PercentileDisc( | |||
s"$prettyName($distinct${right.sql}) WITHIN GROUP (ORDER BY ${left.sql}$direction)" | |||
} | |||
|
|||
override def withOrderingWithinGroup(orderingWithinGroup: Seq[SortOrder]): AggregateFunction = { | |||
if (orderingWithinGroup.length != 1) { | |||
throw QueryCompilationErrors.wrongNumArgsError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
}, | ||
"WRONG_NUM_ORDERINGS" : { | ||
"message" : [ | ||
"WITHIN GROUP requires <expectedNum> orderings but the actual number is <actualNum>." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we mention the function name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parent error class contains it.
"Invalid inverse distribution function <funcName>."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, let's make it a bit clearer then.
Requires <expectedNum> orderings in WITHIN GROUP but got <actualNum>
The GA failure is unrelated. |
The GA failure is unrelated. |
…d PercentileDisc into functionCall Spark SQL parser have a special rule to parse `[percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v)`. We should merge this rule into the `functionCall`. Merge the parse rule of `PercentileCont` and `PercentileDisc` into `functionCall`. 'No'. New test cases. 'No'. Closes apache#43910 from beliefer/SPARK-46009. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Jiaan Geng <beliefer@163.com>
…P_1023 ### What changes were proposed in this pull request? Based on the suggestion at #43910 (comment), this PR want assign a name to the error class `_LEGACY_ERROR_TEMP_1023`. ### Why are the changes needed? Assign a name to the error class `_LEGACY_ERROR_TEMP_1023`. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #44355 from beliefer/SPARK-46406. Authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Jiaan Geng <beliefer@163.com>
@@ -1309,6 +1309,7 @@ The following SQLSTATEs are collated from: | |||
|HZ320 |HZ |RDA-specific condition |320 |version not supported |RDA/SQL |Y |RDA/SQL | | |||
|HZ321 |HZ |RDA-specific condition |321 |TCP/IP error |RDA/SQL |Y |RDA/SQL | | |||
|HZ322 |HZ |RDA-specific condition |322 |TLS alert |RDA/SQL |Y |RDA/SQL | | |||
|ID001 |IM |Invalid inverse distribution function |001 |Invalid inverse distribution function |SQL/Foundation |N |SQL/Foundation PostgreSQL Oracle Snowflake Redshift H2 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find it in https://www.postgresql.org/docs/current/errcodes-appendix.html , did you make it up yourself? @beliefer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You means reference the error code from postgresql?
If so, I will pick it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean we should be honest here. If it's not from PostgreSQL or other system, and we do want a new category here, I'd suggest 42K0K
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I selected ID001
due to ID is the abbreviation of Invalid inverse distribution function
.
42K0K
seems good too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so you invented it. Let's not do it again. SQLSTATE is a standard and the prefix has meanings. Can you open a followup PR to change it to 42K0K? thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
…inverse distribution function ### What changes were proposed in this pull request? This PR follows up #43910 and propose to change the error code for invalid inverse distribution function. ### Why are the changes needed? Based on the discussion at #43910 (comment) ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? GA tests. ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #44811 from beliefer/SPARK-46009_followup. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
…TILE_DISC in g4 ### What changes were proposed in this pull request? This PR propose to remove unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4 ### Why are the changes needed? #43910 merged the parse rule of `PercentileCont` and `PercentileDisc` into `functionCall`, but forgot to remove unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? GA. ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #46272 from beliefer/SPARK-46009_followup2. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…TILE_DISC in g4 ### What changes were proposed in this pull request? This PR propose to remove unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4 ### Why are the changes needed? apache#43910 merged the parse rule of `PercentileCont` and `PercentileDisc` into `functionCall`, but forgot to remove unused `PERCENTILE_CONT` and `PERCENTILE_DISC` in g4. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? GA. ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes apache#46272 from beliefer/SPARK-46009_followup2. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Spark SQL parser have a special rule to parse
[percentile_cont|percentile_disc](percentage) WITHIN GROUP (ORDER BY v)
.We should merge this rule into the
functionCall
.Why are the changes needed?
Merge the parse rule of
PercentileCont
andPercentileDisc
intofunctionCall
.Does this PR introduce any user-facing change?
'No'.
How was this patch tested?
New test cases.
Was this patch authored or co-authored using generative AI tooling?
'No'.