Skip to content

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Feb 18, 2020

What changes were proposed in this pull request?

In the PR, I propose to add the returnLong parameter to IntegralDivide, and pass the value of spark.sql.legacy.integralDivide.returnBigint if returnLong is not provided on creation of IntegralDivide.

Why are the changes needed?

This allows to avoid the issue when the configuration change between different phases of planning, and this can silently break a query plan which can lead to crashes or data corruption.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By ArithmeticExpressionSuite.

@SparkQA
Copy link

SparkQA commented Feb 19, 2020

Test build #118654 has finished for PR 27628 at commit ebeec38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class IntegralDivide(

case class IntegralDivide(
left: Expression,
right: Expression,
returnLong: Boolean) extends DivModLike {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just add a private val returnLong = SQLConf.get.integralDivideReturnLong in the class body? Then the config value is fixed when the expression is created. And it can be serialized to executors.

The spark Expression constructor is kind of exposed to end users when they call functions in SQL. BTW Cast already use a val to store config values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That can potentially change value every time you transform the tree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or how about we create 2 expressions IntegralDivide and IntegralDivideReturnLong? I'm just worried about we allow end users to specify the returnLong parameter which becomes an API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just worried about we allow end users to specify the returnLong parameter which becomes an API.

We don't allow that:

SELECT div(3, 2, false);

fails with:

Invalid number of arguments for function div. Expected: 2; Found: 3; line 1 pos 7
org.apache.spark.sql.AnalysisException: Invalid number of arguments for function div. Expected: 2; Found: 3; line 1 pos 7
	at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$7(FunctionRegistry.scala:618)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:602)
	at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1418)

I ran the command on the PR changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the non-expression parameter doesn't count? Then I'm fine with it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we count only expressions, see

val params = Seq.fill(expressions.size)(classOf[Expression])
val f = constructors.find(_.getParameterTypes.toSeq == params).getOrElse {

def apply(left: Expression, right: Expression): IntegralDivide = {
new IntegralDivide(left, right)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we define unapply as well? Most of the time we don't care about the returnLong paramter. When we do, we should write case e @ IntegralDivide(left, right) if e.returnLong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only one place where we unapply IntegralDivide. It is here https://github.com/apache/spark/pull/27628/files#diff-8e1575bb706d6f7e8b5ea0b175eaeafcR178 . And returnLong is not checked, it is just extracted and copy-pasted.

Copy link
Member Author

@MaxGekk MaxGekk Feb 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I define unapply, it will be not used. Not sure that is is needed.

@cloud-fan
Copy link
Contributor

@MaxGekk not related to this PR, can you check other places that call SQLConf.get inside expressions? We should try to make them consistent. One exception is Cast, we decided to create AnsiCast expression.

@MaxGekk
Copy link
Member Author

MaxGekk commented Feb 19, 2020

There are at least a few more places:

  1. val nameOfCorruptRecord = SQLConf.get.getConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD)
  2. val nameOfCorruptRecord = SQLConf.get.getConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD)
  3. private val followThreeValuedLogic =
    SQLConf.get.getConf(SQLConf.LEGACY_ARRAY_EXISTS_FOLLOWS_THREE_VALUED_LOGIC)
  4. if (SQLConf.get.getConf(SQLConf.LEGACY_CREATE_EMPTY_COLLECTION_USING_STRING_TYPE)) {
  5. if (SQLConf.get.getConf(SQLConf.LEGACY_CREATE_EMPTY_COLLECTION_USING_STRING_TYPE)) {

@hvanhovell
Copy link
Contributor

@MaxGekk do all of them influence the dataType or the nullability?

@MaxGekk
Copy link
Member Author

MaxGekk commented Feb 19, 2020

@hvanhovell Some of them, others (except of Json/CSV exprs) influence defaultElementType

@cloud-fan
Copy link
Contributor

@MaxGekk thanks for composing the list! I've created an umbrella JIRA for them: https://issues.apache.org/jira/browse/SPARK-30893

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 4248b7f Feb 20, 2020
cloud-fan pushed a commit that referenced this pull request Feb 20, 2020
…L config changes

### What changes were proposed in this pull request?
In the PR, I propose to add the `returnLong` parameter to `IntegralDivide`, and pass the value of `spark.sql.legacy.integralDivide.returnBigint` if `returnLong` is not provided on creation of `IntegralDivide`.

### Why are the changes needed?
This allows to avoid the issue when the configuration change between different phases of planning, and this can silently break a query plan which can lead to crashes or data corruption.

OptionsAttachments

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `ArithmeticExpressionSuite`.

Closes #27628 from MaxGekk/integral-divide-conf.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4248b7f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…L config changes

### What changes were proposed in this pull request?
In the PR, I propose to add the `returnLong` parameter to `IntegralDivide`, and pass the value of `spark.sql.legacy.integralDivide.returnBigint` if `returnLong` is not provided on creation of `IntegralDivide`.

### Why are the changes needed?
This allows to avoid the issue when the configuration change between different phases of planning, and this can silently break a query plan which can lead to crashes or data corruption.

OptionsAttachments

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `ArithmeticExpressionSuite`.

Closes apache#27628 from MaxGekk/integral-divide-conf.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@MaxGekk MaxGekk deleted the integral-divide-conf branch June 5, 2020 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants