[SPARK-30858][SQL] Make IntegralDivide's dataType independent from SQL config changes #27628
Test build #118654 has finished for PR 27628 at commit
```scala
case class IntegralDivide(
    left: Expression,
    right: Expression,
    returnLong: Boolean) extends DivModLike {
```
Can we just add `private val returnLong = SQLConf.get.integralDivideReturnLong` in the class body? Then the config value is fixed when the expression is created, and it can be serialized to executors. The Spark `Expression` constructor is kind of exposed to end users when they call functions in SQL. BTW, `Cast` already uses a `val` to store config values.
That can potentially change value every time you transform the tree, since `copy` re-runs the class body.
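This concern can be illustrated with a minimal, self-contained sketch (the names `ConfStub`, `Div`, etc. are hypothetical stand-ins, not Spark code): `copy` on a case class re-runs the primary constructor, so a `val` in the class body is re-captured from whatever the config holds at the moment the tree is transformed.

```scala
// Stand-in for SQLConf.get: a mutable global, purely for illustration.
object ConfStub { var integralDivideReturnLong: Boolean = false }

trait Expr
case class Lit(v: Int) extends Expr

// Hypothetical expression that reads the config in its body, as suggested.
case class Div(left: Expr, right: Expr) extends Expr {
  private val returnLong: Boolean = ConfStub.integralDivideReturnLong // captured per instance
  def dataType: String = if (returnLong) "LongType" else "IntegerType"
}

val d1 = Div(Lit(6), Lit(2))             // body runs: captures false
ConfStub.integralDivideReturnLong = true // config flips mid-planning
val d2 = d1.copy(right = Lit(3))         // copy() re-runs the body: captures true
assert(d1.dataType == "IntegerType")
assert(d2.dataType == "LongType")        // same expression shape, new dataType
```

So a transformed copy of the same node can silently change its `dataType`, which is exactly the kind of mid-planning drift this PR is trying to rule out.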
Or how about we create 2 expressions, `IntegralDivide` and `IntegralDivideReturnLong`? I'm just worried that we allow end users to specify the `returnLong` parameter, which then becomes an API.
> I'm just worried that we allow end users to specify the `returnLong` parameter, which then becomes an API.
We don't allow that:
```sql
SELECT div(3, 2, false);
```
fails with:
```
org.apache.spark.sql.AnalysisException: Invalid number of arguments for function div. Expected: 2; Found: 3; line 1 pos 7
  at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$7(FunctionRegistry.scala:618)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:602)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1418)
```
I ran the command with the PR changes applied.
so the non-expression parameter doesn't count? Then I'm fine with it.
Yes, we count only expressions, see spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala, lines 601 to 602 at 919d551:

```scala
val params = Seq.fill(expressions.size)(classOf[Expression])
val f = constructors.find(_.getParameterTypes.toSeq == params).getOrElse {
```
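A hedged sketch of what that lookup does (stub types, not the real registry code): it builds a list of N copies of `classOf[Expression]` and searches the class's constructors for an exact parameter-type match, so a non-`Expression` parameter such as `returnLong: Boolean` can never be bound from SQL arguments.

```scala
trait Expression

// Hypothetical shape of the expression: a 3-parameter primary constructor
// plus a public 2-argument auxiliary constructor, mirroring the PR.
case class IntegralDivide(left: Expression, right: Expression, returnLong: Boolean)
    extends Expression {
  def this(left: Expression, right: Expression) = this(left, right, false)
}

// Simplified version of the registry lookup quoted above.
def findCtor(argCount: Int): Option[java.lang.reflect.Constructor[_]] = {
  val params = Seq.fill(argCount)(classOf[Expression])
  classOf[IntegralDivide].getConstructors.find(_.getParameterTypes.toSeq == params)
}

assert(findCtor(2).isDefined) // div(3, 2) resolves via the auxiliary constructor
assert(findCtor(3).isEmpty)   // div(3, 2, false): Boolean is not an Expression
```

The failed 3-argument lookup is what produces the "Invalid number of arguments" error shown earlier in the thread.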
```scala
def apply(left: Expression, right: Expression): IntegralDivide = {
  new IntegralDivide(left, right)
}
```
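For contrast with the class-body `val` idea, here is a sketch of the snapshot behavior the PR's parameter approach gives (stub names `ConfStub2`/`IntDiv` are illustrative): the config is consulted once, in the two-argument creation path, and after that `returnLong` is an ordinary constructor parameter that survives `copy` and serialization unchanged.

```scala
// Stand-in for the SQL config, for illustration only.
object ConfStub2 { var integralDivideReturnLong: Boolean = false }

trait Expr2
case class Num(v: Int) extends Expr2

case class IntDiv(left: Expr2, right: Expr2, returnLong: Boolean) extends Expr2 {
  def dataType: String = if (returnLong) "LongType" else "IntegerType"
}

object IntDiv {
  // The config is read only here, at creation time.
  def apply(left: Expr2, right: Expr2): IntDiv =
    IntDiv(left, right, ConfStub2.integralDivideReturnLong)
}

val e1 = IntDiv(Num(6), Num(2))           // snapshot: returnLong = false
ConfStub2.integralDivideReturnLong = true // later config change is irrelevant
val e2 = e1.copy(right = Num(3))          // copy keeps the captured parameter
assert(e1.dataType == "IntegerType")
assert(e2.dataType == "IntegerType")
```

This is why the node's `dataType` stays stable across planning phases even if the config flips in between.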
Shall we define `unapply` as well? Most of the time we don't care about the `returnLong` parameter. When we do, we should write `case e @ IntegralDivide(left, right) if e.returnLong`.
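A sketch of the suggested pattern (using a separate extractor object so it cannot clash with the case class's synthesized three-field `unapply`; the names `IDiv`/`IDivOf` are illustrative, not from the PR):

```scala
trait Ex
case class Cst(v: Int) extends Ex
case class IDiv(left: Ex, right: Ex, returnLong: Boolean) extends Ex

// Two-field extractor: patterns can ignore returnLong entirely,
// and reach for it via a binder + guard only when it matters.
object IDivOf {
  def unapply(d: IDiv): Option[(Ex, Ex)] = Some((d.left, d.right))
}

val expr: Ex = IDiv(Cst(3), Cst(2), returnLong = true)
val desc = expr match {
  case e @ IDivOf(l, r) if e.returnLong => s"long div of $l by $r"
  case IDivOf(l, r)                     => s"int div of $l by $r"
  case _                                => "other"
}
assert(desc == "long div of Cst(3) by Cst(2)")
```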
There is only one place where we unapply `IntegralDivide`. It is here: https://github.com/apache/spark/pull/27628/files#diff-8e1575bb706d6f7e8b5ea0b175eaeafcR178 . And `returnLong` is not checked there; it is just extracted and copied over as-is.
If I define `unapply`, it will not be used. Not sure that it is needed.
@MaxGekk not related to this PR, can you check other places that call
There are at least a few more places:
@MaxGekk do all of them influence the dataType or the nullability?
@hvanhovell Some of them; others (except for the Json/CSV exprs) influence `defaultElementType`.
@MaxGekk thanks for composing the list! I've created an umbrella JIRA for them: https://issues.apache.org/jira/browse/SPARK-30893
thanks, merging to master/3.0!
Closes #27628 from MaxGekk/integral-divide-conf. Authored-by: Maxim Gekk &lt;max.gekk@gmail.com&gt;. Signed-off-by: Wenchen Fan &lt;wenchen@databricks.com&gt;. (cherry picked from commit 4248b7f)
What changes were proposed in this pull request?
In the PR, I propose to add the `returnLong` parameter to `IntegralDivide`, and pass the value of `spark.sql.legacy.integralDivide.returnBigint` if `returnLong` is not provided on creation of `IntegralDivide`.

Why are the changes needed?
This avoids the issue of the configuration changing between different phases of planning, which can silently break a query plan and lead to crashes or data corruption.

Does this PR introduce any user-facing change?
No

How was this patch tested?
By `ArithmeticExpressionSuite`.