
[SPARK-31553][SQL] Fix isInCollection for collection sizes above the optimisation threshold #28328

Closed
wants to merge 9 commits

Conversation

@MaxGekk (Member) commented Apr 24, 2020

What changes were proposed in this pull request?

The InSet expression expects input collections of internal Catalyst types; for example, hset must contain elements of UTF8String for a child of string type. This means isInCollection must convert user values to internal Catalyst values, but currently it doesn't perform the conversion. That leads to incorrect results for collection sizes above the threshold spark.sql.optimizer.inSetConversionThreshold.

The bug was introduced by #25754.
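
For illustration, here is a minimal sketch of the missing conversion step, assuming Spark's internal CatalystTypeConverters helper (the actual fix in this PR goes through lit(_).expr.eval() instead, see below):

```scala
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.unsafe.types.UTF8String

// User-facing Scala strings must become UTF8String values before they
// can be matched against InSet's hset of internal Catalyst values.
val userValues: Set[Any] = Set("0", "1", "20")
val catalystValues: Set[Any] = userValues.map(v => CatalystTypeConverters.convertToCatalyst(v))
assert(catalystValues.forall(_.isInstanceOf[UTF8String]))
```

Without this conversion, InSet compares UTF8String column values against plain String set elements and never finds a match.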

Why are the changes needed?

The changes fix incorrect behaviour of isInCollection. For example, if the SQL config spark.sql.optimizer.inSetConversionThreshold is set to 10 (the default):

```scala
val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
data.select($"x".isInCollection(set).as("isInCollection")).show()
```

The function must return true because "1" is in the set of "0" ... "20", but it returns false:

```
+--------------+
|isInCollection|
+--------------+
|         false|
+--------------+
```

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

```scala
def testExplainTime(collectionSize: Int) = {
  val df = spark.range(10).withColumn("id2", col("id") + 1)
  val list = Range(0, collectionSize).toList
  val startTime = System.currentTimeMillis()
  df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
  val elapsedTime = System.currentTimeMillis() - startTime
  println(s"cost time: ${elapsedTime}ms")
}
```

Then, testing with collection sizes 5, 10, 100, 1000, and 10000 gives the following results:

| collection size | explain time (before) | explain time (after) | w/o optimization |
|----------------:|----------------------:|---------------------:|-----------------:|
| 5               | 64ms                  | 65ms                 | 62ms             |
| 10              | 68ms                  | 64ms                 | 88ms             |
| 100             | 41ms                  | 162ms                | 227ms            |
| 1000            | 98ms                  | 406ms                | 652ms            |
| 10000           | 654ms                 | 2579ms               | 4504ms           |

@MaxGekk (Member, Author) commented Apr 24, 2020

@cloud-fan @HyukjinKwon Please review the PR.

@@ -519,7 +519,9 @@ case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with

override def sql: String = {
val valueSQL = child.sql
val listSQL = hset.toSeq.map(Literal(_).sql).mkString(", ")
Contributor:

what's wrong with Literal.sql?

Member Author:

It doesn't accept UTF8String at

Member:

This seems to be an orthogonal fix, @cloud-fan and @MaxGekk. We need this in branch-2.4 because SPARK-12593 (converting resolved logical plans back to SQL) has been there since Apache Spark 2.0.0, don't we?

If you don't mind, please file a separate JIRA issue with a separate test case. We need to merge this separately.
cc @holdenk

Contributor:

Please let me know the JIRA when you file it and I'll add it to my tracking for the 2.4.6 release.

Member Author:

> If you don't mind, please file a separate JIRA issue with a separate test case. We need to merge this separately.

So far, I have not found how to trigger the issue in the sql method without this fix. I will think about it and try tomorrow, but if you have any ideas, you are welcome.

What I have already tried is to build a dataset with IsIn; the optimizer converts it to InSet, but I wasn't able to call sql() on the replaced expression.

Member Author:

Here is the PR #28343

@cloud-fan (Contributor) left a comment:

good catch!

@SparkQA

SparkQA commented Apr 24, 2020

Test build #121756 has finished for PR 28328 at commit 47a1e44.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Apr 24, 2020

The build (#28328 (comment)) fails on org.apache.spark.sql.hive.thriftserver.CliSuite, "SPARK-11188 Analysis error reporting". I don't think it is related to the changes.

@MaxGekk (Member, Author) commented Apr 24, 2020

jenkins, retest this, please

@@ -869,4 +869,15 @@ class ColumnExpressionSuite extends QueryTest with SharedSparkSession {
df.select(typedLit(("a", 2, 1.0))),
Row(Row("a", 2, 1.0)) :: Nil)
}

test("SPARK-31553: isInCollection for collection sizes above a threshold") {
Member:

Thank you, @MaxGekk .

cc @aokolnychyi and @dbtsai

```diff
 } else {
-  In(expr, values.toSeq.map(lit(_).expr))
+  In(expr, exprValues)
```
Member:

So, this is caused by SPARK-29048 (Improve performance on Column.isInCollection() with a large size collection, #25754 ) and only affects 3.0.0, right?

Member Author:

Correct

Member:

Thanks for confirming, @MaxGekk .
cc @WeichenXu123 and @gatorsmile

@dongjoon-hyun (Member)

This is a nice catch, @MaxGekk. As I wrote in the comment, it would be great if we could proceed with the two PRs separately. Thanks!

@SparkQA

SparkQA commented Apr 24, 2020

Test build #121762 has finished for PR 28328 at commit 47a1e44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

I merged #28343 first. Could you rebase this PR, @MaxGekk ? Thanks.

…lection

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
#	sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
@MaxGekk (Member, Author) commented Apr 25, 2020

@dongjoon-hyun I have rebased but the test from this PR starts failing. The function convertToScala(elem, child.dataType) doesn't convert UTF8String to String because child.dataType is NullType when InSet is created from isInCollection.

The test passed here because I didn't wrap the result of convertToScala in Literal.
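
A minimal illustration of the behavior described above, assuming Spark's CatalystTypeConverters API (the NullType pass-through is the crux of the failure):

```scala
import org.apache.spark.sql.catalyst.CatalystTypeConverters.convertToScala
import org.apache.spark.sql.types.{NullType, StringType}
import org.apache.spark.unsafe.types.UTF8String

val v = UTF8String.fromString("1")
// With the real element type, the internal value converts back to a Scala String.
convertToScala(v, StringType)
// With NullType (what the child reports here), no string converter applies,
// so the UTF8String passes through unconverted.
convertToScala(v, NullType)
```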

Probably we need to revert 7d8216a. Sorry, my bad.

@MaxGekk (Member, Author) commented Apr 25, 2020

@dongjoon-hyun @cloud-fan Regarding this PR, WDYT about reverting the optimization #25754 instead of fixing it in this PR?

@dongjoon-hyun (Member)

It's possible. I guess we need @gatorsmile's opinion since he merged that one.

@MaxGekk (Member, Author) commented Apr 25, 2020

I have fixed the test failure after rebasing on #28343 by passing the element type from the place where the type is known: 67f34a1. I could open a follow-up PR for #28343. @dongjoon-hyun Let me know if you are ok with that.

@SparkQA

SparkQA commented Apr 25, 2020

Test build #121815 has finished for PR 28328 at commit 67f34a1.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class InSet(

@SparkQA

SparkQA commented Apr 26, 2020

Test build #121818 has finished for PR 28328 at commit dd69aa6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Apr 26, 2020

retest this please

@SparkQA

SparkQA commented Apr 26, 2020

Test build #121826 has finished for PR 28328 at commit dd69aa6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Apr 26, 2020

jenkins, retest this, please

```scala
case class InSet(
    child: Expression,
    hset: Set[Any],
    hsetElemType: DataType) extends UnaryExpression with Predicate {
```
Member Author:

Matching internal Catalyst types to external types is ambiguous. For example:
Long -> Long
Long -> Timestamp

Also, the type of child can be unknown at the point where InSet has to know the Catalyst type of hset's elements.

hsetElemType is needed to eliminate the ambiguity.
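
Here is a minimal sketch of that ambiguity using Literal (illustrative only, not the PR's code):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.{LongType, TimestampType}

// The same internal Catalyst value (a Long) backs two different external
// types, so the external type cannot be recovered from the value alone.
val asLong      = Literal(1588000000000000L, LongType)
val asTimestamp = Literal(1588000000000000L, TimestampType)
println(asLong.sql)      // rendered as a BIGINT literal
println(asTimestamp.sql) // rendered as a TIMESTAMP literal
```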

Member:

Do you think we can make this Option[DataType] because only a few things are ambiguous?

Member Author:

We can, but if a caller passes None, InSet will not be able to infer the element type when child.dataType is NullType, as in this case; dataType returns NullType if child is PrettyAttribute.
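
A minimal sketch of why the fallback cannot work, assuming PrettyAttribute's default dataType:

```scala
import org.apache.spark.sql.catalyst.expressions.PrettyAttribute
import org.apache.spark.sql.types.NullType

// PrettyAttribute defaults its dataType to NullType, so an InSet whose
// child is a PrettyAttribute cannot infer the set's element type from it.
val pretty = PrettyAttribute("x")
assert(pretty.dataType == NullType)
```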

Contributor:

When can hsetElemType be different from child.dataType?

Member Author:

When InSet is created from isInCollection; in that case child.dataType is NullType. For example, it is NullType in the test https://github.com/apache/spark/pull/28328/files#diff-aa655ba249e00d2591b21cf6a360cf82R886 because child is a PrettyAttribute when the sql method is called.

Member Author:

And InSet.sql() is called from Dataset.select via _.named:

Project(untypedCols.map(_.named), logicalPlan)

The named method calls toPrettySQL(expr):

case expr: Expression => Alias(expr, toPrettySQL(expr))()

The toPrettySQL method calls sql:

def toPrettySQL(e: Expression): String = usePrettyExpression(e).sql

@SparkQA

SparkQA commented Apr 26, 2020

Test build #121840 has finished for PR 28328 at commit 05ce50a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Apr 26, 2020

Failures of CliSuite are quite annoying (#28328 (comment)). @dongjoon-hyun @cloud-fan @gatorsmile This is the PR that repeats tests from CliSuite: #28329

@MaxGekk (Member, Author) commented Apr 26, 2020

jenkins, retest this, please

@SparkQA

SparkQA commented Apr 26, 2020

Test build #121835 has finished for PR 28328 at commit dd69aa6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 26, 2020

Test build #121844 has finished for PR 28328 at commit 05ce50a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

cc @viirya, this is another instance where merging In and InSet could fix the issue.

@cloud-fan (Contributor)

Actually this PR shows we still need InSet, to make the analyzer fast...

```scala
val exprValues = values.toSeq.map(lit(_).expr)
if (exprValues.size > SQLConf.get.optimizerInSetConversionThreshold) {
  val elemType = exprValues.headOption.map(_.dataType).getOrElse(NullType)
  InSet(expr, exprValues.map(_.eval()).toSet, elemType)
```
Contributor:

How can we make sure the expr has the same data type as exprValues? Do we have a type coercion rule for it?

Member Author:

To make sure, we need something similar to In.checkInputDataTypes() in InSet:

```scala
override def checkInputDataTypes(): TypeCheckResult = {
  val mismatchOpt = list.find(l => !DataType.equalsStructurally(l.dataType, value.dataType,
    ignoreNullability = true))
  if (mismatchOpt.isDefined) {
    TypeCheckResult.TypeCheckFailure(s"Arguments must be same type but were: " +
      s"${value.dataType.catalogString} != ${mismatchOpt.get.dataType.catalogString}")
  } else {
    TypeUtils.checkForOrderingExpr(value.dataType, s"function $prettyName")
  }
}
```

I could add such a check in this PR if you don't mind.

Member Author:

I added a similar check to InSet.

@viirya (Member) commented Apr 27, 2020

> Actually this PR shows we still need InSet, to make the analyzer fast...

What does that mean? We optimize In with InSet in the optimizer, right?

@cloud-fan (Contributor)

An In with many values is slow to analyze, as the type coercion rules or In.resolved are very slow.
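
A simplified illustration of the cost (standalone sketch, not Spark source; the constants are made up):

```scala
// Resolution checks such as In.checkInputDataTypes scan every child,
// and analyzer rules re-ask `resolved` many times, so analysis cost
// grows with the size of the value list.
final case class FakeIn(valueType: String, listTypes: Seq[String]) {
  def resolved: Boolean = listTypes.forall(_ == valueType) // O(list size) per call
}

val in = FakeIn("int", Seq.fill(100000)("int"))
(1 to 50).foreach(_ => in.resolved) // ~50 rule invocations x 100000 comparisons
```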

@MaxGekk (Member, Author) commented Apr 27, 2020

I have updated the PR's description and added a column w/o optimization. I got the numbers by running this code:

  test("isInCollection benchmark") {
    def testExplainTime(collectionSize: Int) = {
      val df = spark.range(10).withColumn("id2", col("id") + 1)
      val list = Range(0, collectionSize).toList
      val startTime = System.currentTimeMillis()
      df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
      val elapsedTime = System.currentTimeMillis() - startTime
      println(s"cost time: ${elapsedTime}ms")
    }
    withSQLConf(SQLConf.OPTIMIZER_INSET_CONVERSION_THRESHOLD.key -> "100000000") {
      testExplainTime(1)
      testExplainTime(5)
      testExplainTime(10)
      testExplainTime(100)
      testExplainTime(1000)
      testExplainTime(10000)
    }
  }

@SparkQA

SparkQA commented Apr 27, 2020

Test build #121899 has finished for PR 28328 at commit 4bc0e26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Apr 27, 2020

> An In with many values is slow to analyze, as the type coercion rules or In.resolved are very slow.

That's a pain point. But when we merge In and InSet, can we have a constructor similar to the current InSet? We could have both list and hset in In. For example, in isInCollection, we could still construct an In which has empty children but only a hset (something like the sketch below)?
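
A standalone, hypothetical sketch of that shape (not Spark code), where one In node carries either expression children for small lists or a pre-built set above the threshold:

```scala
sealed trait Expr
final case class Col(name: String) extends Expr

// Merged node: `list` drives analysis for small collections; `hset`
// bypasses per-element children for large ones.
final case class In(
    value: Expr,
    list: Seq[Expr] = Seq.empty,
    hset: Option[Set[Any]] = None) extends Expr

// isInCollection could then build the set variant directly:
val large = In(Col("x"), hset = Some((0 to 20).map(_.toString).toSet))
```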

@HyukjinKwon (Member)

Okay, the more I look, the more it makes me think we should revert #25754 rather than adding band-aid fixes. Shall we revert?

@MaxGekk (Member, Author) commented Apr 28, 2020

After offline discussion with @gatorsmile @cloud-fan @HyukjinKwon, we decided to revert #25754. I will open a PR for that and close this PR.

@cloud-fan cloud-fan closed this in b7cabc8 Apr 28, 2020
cloud-fan pushed a commit that referenced this pull request Apr 28, 2020
…n.isInCollection() with a large size collection"

### What changes were proposed in this pull request?
This reverts commit 5631a96.

Closes #28328

### Why are the changes needed?
The PR #25754 introduced a bug in `isInCollection`. For example, if the SQL config `spark.sql.optimizer.inSetConversionThreshold` is set to 10 (by default):
```scala
val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
data.select($"x".isInCollection(set).as("isInCollection")).show()
```
The function must return **'true'** because "1" is in the set of "0" ... "20" but it returns "false":
```
+--------------+
|isInCollection|
+--------------+
|         false|
+--------------+
```

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
```
$ ./build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #28388 from MaxGekk/fix-isInCollection-revert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b7cabc8)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@MaxGekk MaxGekk deleted the fix-isInCollection branch June 5, 2020 19:48