[SPARK-30858][SQL] Make IntegralDivide's dataType independent from SQL config changes #27628
Test build #118654 has finished for PR 27628 at commit
```scala
case class IntegralDivide(
    left: Expression,
    right: Expression,
    returnLong: Boolean) extends DivModLike {
```
Can we just add `private val returnLong = SQLConf.get.integralDivideReturnLong` in the class body? Then the config value is fixed when the expression is created, and it can be serialized to executors. The Spark `Expression` constructor is kind of exposed to end users when they call functions in SQL. BTW, `Cast` already uses a `val` to store config values.
That can potentially change value every time you transform the tree, since `copy` re-runs the class body.
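This concern can be illustrated with a minimal, self-contained sketch (the names `ConfStub`, `Div`, etc. are hypothetical stand-ins, not Spark code): `copy` on a case class re-runs the primary constructor, so a `val` in the class body is re-captured from whatever the config holds at the moment the tree is transformed.

```scala
// Stand-in for SQLConf.get: a mutable global, purely for illustration.
object ConfStub { var integralDivideReturnLong: Boolean = false }

trait Expr
case class Lit(v: Int) extends Expr

// Hypothetical expression that reads the config in its body, as suggested.
case class Div(left: Expr, right: Expr) extends Expr {
  private val returnLong: Boolean = ConfStub.integralDivideReturnLong // captured per instance
  def dataType: String = if (returnLong) "LongType" else "IntegerType"
}

val d1 = Div(Lit(6), Lit(2))             // body runs: captures false
ConfStub.integralDivideReturnLong = true // config flips mid-planning
val d2 = d1.copy(right = Lit(3))         // copy() re-runs the body: captures true
assert(d1.dataType == "IntegerType")
assert(d2.dataType == "LongType")        // same expression shape, new dataType
```

So a transformed copy of the same node can silently change its `dataType`, which is exactly the kind of mid-planning drift this PR is trying to rule out.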
Or how about we create 2 expressions, `IntegralDivide` and `IntegralDivideReturnLong`? I'm just worried that we allow end users to specify the `returnLong` parameter, which then becomes an API.
> I'm just worried that we allow end users to specify the `returnLong` parameter, which then becomes an API.
We don't allow that:
```sql
SELECT div(3, 2, false);
```
fails with:
```
org.apache.spark.sql.AnalysisException: Invalid number of arguments for function div. Expected: 2; Found: 3; line 1 pos 7
  at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$7(FunctionRegistry.scala:618)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:602)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1418)
```
I ran the command with the PR changes applied.
so the non-expression parameter doesn't count? Then I'm fine with it.
Yes, we count only expressions, see spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala, lines 601 to 602 at 919d551:

```scala
val params = Seq.fill(expressions.size)(classOf[Expression])
val f = constructors.find(_.getParameterTypes.toSeq == params).getOrElse {
```
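A hedged sketch of what that lookup does (stub types, not the real registry code): it builds a list of N copies of `classOf[Expression]` and searches the class's constructors for an exact parameter-type match, so a non-`Expression` parameter such as `returnLong: Boolean` can never be bound from SQL arguments.

```scala
trait Expression

// Hypothetical shape of the expression: a 3-parameter primary constructor
// plus a public 2-argument auxiliary constructor, mirroring the PR.
case class IntegralDivide(left: Expression, right: Expression, returnLong: Boolean)
    extends Expression {
  def this(left: Expression, right: Expression) = this(left, right, false)
}

// Simplified version of the registry lookup quoted above.
def findCtor(argCount: Int): Option[java.lang.reflect.Constructor[_]] = {
  val params = Seq.fill(argCount)(classOf[Expression])
  classOf[IntegralDivide].getConstructors.find(_.getParameterTypes.toSeq == params)
}

assert(findCtor(2).isDefined) // div(3, 2) resolves via the auxiliary constructor
assert(findCtor(3).isEmpty)   // div(3, 2, false): Boolean is not an Expression
```

The failed 3-argument lookup is what produces the "Invalid number of arguments" error shown earlier in the thread.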
```scala
def apply(left: Expression, right: Expression): IntegralDivide = {
  new IntegralDivide(left, right)
}
```
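For contrast with the class-body `val` idea, here is a sketch of the snapshot behavior the PR's parameter approach gives (stub names `ConfStub2`/`IntDiv` are illustrative): the config is consulted once, in the two-argument creation path, and after that `returnLong` is an ordinary constructor parameter that survives `copy` and serialization unchanged.

```scala
// Stand-in for the SQL config, for illustration only.
object ConfStub2 { var integralDivideReturnLong: Boolean = false }

trait Expr2
case class Num(v: Int) extends Expr2

case class IntDiv(left: Expr2, right: Expr2, returnLong: Boolean) extends Expr2 {
  def dataType: String = if (returnLong) "LongType" else "IntegerType"
}

object IntDiv {
  // The config is read only here, at creation time.
  def apply(left: Expr2, right: Expr2): IntDiv =
    IntDiv(left, right, ConfStub2.integralDivideReturnLong)
}

val e1 = IntDiv(Num(6), Num(2))           // snapshot: returnLong = false
ConfStub2.integralDivideReturnLong = true // later config change is irrelevant
val e2 = e1.copy(right = Num(3))          // copy keeps the captured parameter
assert(e1.dataType == "IntegerType")
assert(e2.dataType == "IntegerType")
```

This is why the node's `dataType` stays stable across planning phases even if the config flips in between.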
Shall we define `unapply` as well? Most of the time we don't care about the `returnLong` parameter. When we do, we should write `case e @ IntegralDivide(left, right) if e.returnLong`.
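A sketch of the suggested pattern (using a separate extractor object so it cannot clash with the case class's synthesized three-field `unapply`; the names `IDiv`/`IDivOf` are illustrative, not from the PR):

```scala
trait Ex
case class Cst(v: Int) extends Ex
case class IDiv(left: Ex, right: Ex, returnLong: Boolean) extends Ex

// Two-field extractor: patterns can ignore returnLong entirely,
// and reach for it via a binder + guard only when it matters.
object IDivOf {
  def unapply(d: IDiv): Option[(Ex, Ex)] = Some((d.left, d.right))
}

val expr: Ex = IDiv(Cst(3), Cst(2), returnLong = true)
val desc = expr match {
  case e @ IDivOf(l, r) if e.returnLong => s"long div of $l by $r"
  case IDivOf(l, r)                     => s"int div of $l by $r"
  case _                                => "other"
}
assert(desc == "long div of Cst(3) by Cst(2)")
```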
There is only one place where we unapply `IntegralDivide`. It is here: https://github.com/apache/spark/pull/27628/files#diff-8e1575bb706d6f7e8b5ea0b175eaeafcR178 . And `returnLong` is not checked there; it is just extracted and copied over as-is.
If I define `unapply`, it will not be used. Not sure that it is needed.
@MaxGekk not related to this PR, can you check other places that call
There are at least a few more places:
@MaxGekk do all of them influence the dataType or the nullability?
@hvanhovell Some of them; others (except for the Json/CSV exprs) influence `defaultElementType`.
@MaxGekk thanks for composing the list! I've created an umbrella JIRA for them: https://issues.apache.org/jira/browse/SPARK-30893
thanks, merging to master/3.0!
Closes #27628 from MaxGekk/integral-divide-conf. Authored-by: Maxim Gekk &lt;max.gekk@gmail.com&gt;. Signed-off-by: Wenchen Fan &lt;wenchen@databricks.com&gt;. (cherry picked from commit 4248b7f)
What changes were proposed in this pull request?
In the PR, I propose to add the `returnLong` parameter to `IntegralDivide`, and pass the value of `spark.sql.legacy.integralDivide.returnBigint` if `returnLong` is not provided on creation of `IntegralDivide`.

Why are the changes needed?
This avoids the issue of the configuration changing between different phases of planning, which can silently break a query plan and lead to crashes or data corruption.

Does this PR introduce any user-facing change?
No

How was this patch tested?
By `ArithmeticExpressionSuite`.