[SPARK-31553][SQL] Fix isInCollection for collection sizes above the optimisation threshold #28328
Conversation
@cloud-fan @HyukjinKwon Please review the PR.
```diff
@@ -519,7 +519,9 @@ case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with

   override def sql: String = {
     val valueSQL = child.sql
     val listSQL = hset.toSeq.map(Literal(_).sql).mkString(", ")
```
What's wrong with `Literal.sql`?
It doesn't accept `UTF8String`, see `Literal.validateLiteralValue(value, dataType)` at sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala, line 292 in ff39c92.
This seems to be an orthogonal fix, @cloud-fan and @MaxGekk. We need this in branch-2.4 because this is SPARK-12593 (Converts resolved logical plan back to SQL), which has existed since Apache Spark 2.0.0, don't we?
If you don't mind, please file a separate JIRA issue with a separate test case. We need to merge this separately.
cc @holdenk
Please let me know the JIRA when you file it and I'll add it to my tracking for the 2.4.6 release.
> If you don't mind, please file a separate JIRA issue with a separate test case. We need to merge this separately.

So far, I have not found how to trigger the issue in the `sql` method without this fix. I will think about that and try it tomorrow, but if you have some ideas, you are welcome to share them. What I have already tried: building a dataset with `IsIn` so that the optimizer converts it to `InSet`, but I wasn't able to call `sql()` on the replaced expression.
Here is the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-31563
Here is the PR #28343
good catch!
Test build #121756 has finished for PR 28328 at commit
The build #28328 (comment) fails on
jenkins, retest this, please
```diff
@@ -869,4 +869,15 @@ class ColumnExpressionSuite extends QueryTest with SharedSparkSession {
       df.select(typedLit(("a", 2, 1.0))),
       Row(Row("a", 2, 1.0)) :: Nil)
   }

+  test("SPARK-31553: isInCollection for collection sizes above a threshold") {
```
Thank you, @MaxGekk .
cc @aokolnychyi and @dbtsai
```diff
     } else {
-      In(expr, values.toSeq.map(lit(_).expr))
+      In(expr, exprValues)
```
So, this is caused by SPARK-29048 (Improve performance on Column.isInCollection() with a large size collection, #25754 ) and only affects 3.0.0, right?
Correct
Thanks for confirming, @MaxGekk .
cc @WeichenXu123 and @gatorsmile
This is a nice catch, @MaxGekk. As I wrote in the comment, it would be great if we can proceed with the two PRs separately. Thanks!
Test build #121762 has finished for PR 28328 at commit
…lection

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
#	sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
@dongjoon-hyun I have rebased, but the test from this PR started failing. The test passed here before because I don't wrap the result. Probably we need to revert 7d8216a. Sorry, my bad.
@dongjoon-hyun @cloud-fan Regarding this PR, WDYT of reverting the optimization #25754 instead of fixing it in this PR?
It's possible. I guess we need @gatorsmile's opinion since he merged that one.
I have fixed the test failure after rebasing on #28343 by passing the element type from the place where the type is known: 67f34a1. I could open a follow-up PR for #28343. @dongjoon-hyun Let me know if you are ok with that.
Test build #121815 has finished for PR 28328 at commit
Test build #121818 has finished for PR 28328 at commit
retest this please
Test build #121826 has finished for PR 28328 at commit
jenkins, retest this, please
```scala
case class InSet(
    child: Expression,
    hset: Set[Any],
    hsetElemType: DataType) extends UnaryExpression with Predicate {
```
Matching internal Catalyst types to external types is ambiguous. For example, an internal Long can map back to an external Long (`Long -> Long`) or to a Timestamp (`Long -> Timestamp`). Also, the type of `child` can be unknown at the moment when `InSet` has to know the Catalyst type of the `hset` elements. `hsetElemType` is needed to eliminate the ambiguity.
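That ambiguity can be shown without Spark at all. Below is a minimal, hypothetical sketch (the `toInternal` helper and its conversion rules are mine, not Spark source): both an external `Long` and an external `Timestamp` collapse into a plain JVM `Long` internally, so the internal value alone cannot identify the external type:

```scala
import java.sql.Timestamp

// Stand-in for Catalyst's internal storage: LongType and TimestampType
// values are both kept as a JVM Long (timestamps as microseconds).
def toInternal(v: Any): Long = v match {
  case l: Long      => l
  case t: Timestamp => t.getTime * 1000L // millis -> micros
}

val fromLong      = toInternal(42000L)             // external Long
val fromTimestamp = toInternal(new Timestamp(42L)) // external Timestamp

// Both internal values are the same plain Long, so mapping back to the
// external type needs extra information such as hsetElemType.
println(fromLong == fromTimestamp) // true
```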
Do you think we can make this `Option[DataType]` because only a few things are ambiguous?
We can, but if a caller passes `None`, `InSet` will not be able to infer the element type when `child.dataType` is `NullType`, like in this case. `dataType` returns `NullType` if `child` is a `PrettyAttribute`.
When can `hsetElemType` be different from `child.dataType`?
When `InSet` is created from `isInCollection`; in that case `child.dataType` is `NullType`. For example, it is `NullType` in the test https://github.com/apache/spark/pull/28328/files#diff-aa655ba249e00d2591b21cf6a360cf82R886 because `child` is a `PrettyAttribute` when the `sql` method is called.
And `InSet.sql()` is called from `Dataset.select` via `_.named`:

```scala
Project(untypedCols.map(_.named), logicalPlan)
```

The `named` method calls `toPrettySQL(expr)`:

```scala
case expr: Expression => Alias(expr, toPrettySQL(expr))()
```

The `toPrettySQL` method calls `sql`:

```scala
def toPrettySQL(e: Expression): String = usePrettyExpression(e).sql
```
Test build #121840 has finished for PR 28328 at commit
Failures of `CliSuite` annoy a lot, see #28328 (comment). @dongjoon-hyun @cloud-fan @gatorsmile This is the PR to repeat the tests from `CliSuite`: #28329
jenkins, retest this, please
Test build #121835 has finished for PR 28328 at commit
Test build #121844 has finished for PR 28328 at commit
cc @viirya, this is another instance that merging
Actually this PR shows we still need
```scala
val exprValues = values.toSeq.map(lit(_).expr)
if (exprValues.size > SQLConf.get.optimizerInSetConversionThreshold) {
  val elemType = exprValues.headOption.map(_.dataType).getOrElse(NullType)
  InSet(expr, exprValues.map(_.eval()).toSet, elemType)
```
How can we make sure the `expr` has the same data type as `exprValues`? Do we have a type coercion rule for it?
To make sure, we need something similar to `In.checkInputDataTypes()` in `InSet`:

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala, lines 313 to 322 in 7d8216a:

```scala
override def checkInputDataTypes(): TypeCheckResult = {
  val mismatchOpt = list.find(l => !DataType.equalsStructurally(l.dataType, value.dataType,
    ignoreNullability = true))
  if (mismatchOpt.isDefined) {
    TypeCheckResult.TypeCheckFailure(s"Arguments must be same type but were: " +
      s"${value.dataType.catalogString} != ${mismatchOpt.get.dataType.catalogString}")
  } else {
    TypeUtils.checkForOrderingExpr(value.dataType, s"function $prettyName")
  }
}
```
I could add such a check in this PR if you don't mind.
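As an illustration, the shape of such a check can be sketched outside Spark with simplified stand-in types (every name below is hypothetical, not Spark's API):

```scala
// Simplified stand-ins for Catalyst's DataType and Expression.
sealed trait DataType
case object StringType extends DataType
case object IntType    extends DataType
final case class Expr(dataType: DataType)

// Same idea as In.checkInputDataTypes(): report the first element of the
// list whose type differs from the value's type.
def checkSameType(value: Expr, list: Seq[Expr]): Either[String, Unit] =
  list.find(_.dataType != value.dataType) match {
    case Some(bad) =>
      Left(s"Arguments must be same type but were: ${value.dataType} != ${bad.dataType}")
    case None => Right(())
  }

println(checkSameType(Expr(IntType), Seq(Expr(IntType), Expr(StringType))))
```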
I added a similar check to `InSet`.
What does that mean? We optimize
I have updated the PR's description and added a column w/o optimization. I got the numbers by running this code:

```scala
test("isInCollection benchmark") {
  def testExplainTime(collectionSize: Int) = {
    val df = spark.range(10).withColumn("id2", col("id") + 1)
    val list = Range(0, collectionSize).toList
    val startTime = System.currentTimeMillis()
    df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
    val elapsedTime = System.currentTimeMillis() - startTime
    println(s"cost time: ${elapsedTime}ms")
  }
  withSQLConf(SQLConf.OPTIMIZER_INSET_CONVERSION_THRESHOLD.key -> "100000000") {
    testExplainTime(1)
    testExplainTime(5)
    testExplainTime(10)
    testExplainTime(100)
    testExplainTime(1000)
    testExplainTime(10000)
  }
}
```
Test build #121899 has finished for PR 28328 at commit
That's a pain point. But when we merge
Okay, the more I look, the more it makes me think we should revert #25754 rather than adding band-aid fixes. Shall we revert?
After offline discussion with @gatorsmile @cloud-fan @HyukjinKwon, we decided to revert #25754. I will open a PR for that and close this PR.
…n.isInCollection() with a large size collection"

### What changes were proposed in this pull request?
This reverts commit 5631a96. Closes #28328

### Why are the changes needed?
The PR #25754 introduced a bug in `isInCollection`. For example, if the SQL config `spark.sql.optimizer.inSetConversionThreshold` is set to 10 (by default):

```scala
val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
data.select($"x".isInCollection(set).as("isInCollection")).show()
```

The function must return **'true'** because "1" is in the set of "0" ... "20" but it returns "false":

```
+--------------+
|isInCollection|
+--------------+
|         false|
+--------------+
```

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
```
$ ./build/sbt "test:testOnly *ColumnExpressionSuite"
```

Closes #28388 from MaxGekk/fix-isInCollection-revert.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b7cabc8)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

The `InSet` expression expects input collections of internal Catalyst types; for example, `hset` must contain elements of `UTF8String` for a `child` of string type. So `isInCollection` must convert user values to internal Catalyst values, but currently it doesn't perform the conversion. That leads to incorrect results for collection sizes above the threshold `spark.sql.optimizer.inSetConversionThreshold`. The bug was introduced by #25754.
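The essence of the missing conversion can be reproduced without Spark. In the hedged sketch below, `UTF8Str` is a made-up stand-in for Catalyst's `UTF8String` (not the real class): a set built from external `String` values never contains the internal representation that the child expression evaluates to:

```scala
// Made-up stand-in for Catalyst's internal UTF8String type.
final case class UTF8Str(bytes: Vector[Byte])
def toInternal(s: String): UTF8Str = UTF8Str(s.getBytes("UTF-8").toVector)

// The buggy path kept the user's external Strings in the set unconverted...
val hset: Set[Any] = (0 to 20).map(_.toString).toSet

// ...while at evaluation time the child produces an internal value:
val internalChild: Any = toInternal("1")

println(hset.contains(internalChild)) // false: a String set never contains a UTF8Str
println(hset.contains("1"))           // true only for the unconverted external form
```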
### Why are the changes needed?

The changes fix the incorrect behaviour of `isInCollection`. For example, if the SQL config `spark.sql.optimizer.inSetConversionThreshold` is set to 10 (by default), the function must return 'true' because "1" is in the set of "0" ... "20", but it returns "false".
### Does this PR introduce any user-facing change?

Yes
### How was this patch tested?

By `ColumnExpressionSuite`. Then tested on collection sizes 5, 10, 100, 1000, and 10000; the test result is: