[SPARK-31553][SQL] Fix isInCollection for collection sizes above the optimisation threshold #28328
Changes to `Column.isInCollection`:

```diff
@@ -828,11 +828,12 @@ class Column(val expr: Expression) extends Logging {
    * @since 2.4.0
    */
   def isInCollection(values: scala.collection.Iterable[_]): Column = withExpr {
-    val hSet = values.toSet[Any]
-    if (hSet.size > SQLConf.get.optimizerInSetConversionThreshold) {
-      InSet(expr, hSet)
+    val exprValues = values.toSeq.map(lit(_).expr)
+    if (exprValues.size > SQLConf.get.optimizerInSetConversionThreshold) {
+      val elemType = exprValues.headOption.map(_.dataType).getOrElse(NullType)
+      InSet(expr, exprValues.map(_.eval()).toSet, elemType)
     } else {
-      In(expr, values.toSeq.map(lit(_).expr))
+      In(expr, exprValues)
     }
   }
```

Review comments on the `InSet` branch:

> How can we make sure the …

> To make sure, we need something similar to `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala`, lines 313 to 322 at 7d8216a. I could add such a check in this PR if you don't mind.

> I added a similar check to `InSet`.

Review comments on the `In` branch:

> So, this is caused by SPARK-29048 (Improve performance on Column.isInCollection() with a large size collection, #25754) and only affects 3.0.0, right?

> Correct.

> Thanks for confirming, @MaxGekk.
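The patched `isInCollection` converts every value through `lit(_)` once, then picks between `In` and `InSet` based on the optimizer threshold. A minimal, Spark-free sketch of that dispatch (the `Pred`, `In`, and `InSet` types below are simplified stand-ins, not Spark's classes, and the element type is approximated by a class name):

```scala
// Simplified stand-ins for Catalyst's In and InSet (hypothetical, not Spark's API).
sealed trait Pred
case class In(values: Seq[Any]) extends Pred                    // one equality check per value
case class InSet(hset: Set[Any], elemType: String) extends Pred // hash-set lookup plus element type

object IsInCollection {
  // Mirrors the fixed logic: convert the collection once, then choose the
  // representation based on the conversion threshold.
  def build(values: Iterable[Any], threshold: Int): Pred = {
    val seq = values.toSeq
    if (seq.size > threshold) {
      // Key point of the fix: the element type is derived from the values
      // themselves, because the child's type may still be unresolved.
      val elemType = seq.headOption.map(_.getClass.getSimpleName).getOrElse("NullType")
      InSet(seq.toSet, elemType)
    } else {
      In(seq)
    }
  }
}
```

For example, three values stay an `In`, while a collection larger than the threshold becomes an `InSet` carrying the derived element type.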
Changes to `ColumnExpressionSuite`:

```diff
@@ -483,6 +483,10 @@ class ColumnExpressionSuite extends QueryTest with SharedSparkSession {
           "due to data type mismatch: Arguments must be same type but were").foreach { s =>
           assert(e.getMessage.toLowerCase(Locale.ROOT).contains(s.toLowerCase(Locale.ROOT)))
         }
+        val errMsg = intercept[AnalysisException] {
+          df.select($"a".isInCollection(Seq(0, 1).map(new java.sql.Timestamp(_)))).collect()
+        }.getMessage
+        assert(errMsg.contains("Arguments must be same type"))
       }
     }
   }
```
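The added test asserts that a collection whose element types do not match the column's type fails analysis with "Arguments must be same type". A hedged sketch of such a check, modeled loosely on `In.checkInputDataTypes` in Catalyst but simplified to plain strings (the function name and signature here are hypothetical):

```scala
// Hypothetical sketch of a same-type check like the one this PR adds to InSet.
// Types are represented as plain strings instead of Catalyst DataType objects.
def checkSameType(childType: String, elemTypes: Seq[String]): Either[String, Unit] = {
  val mismatched = elemTypes.filterNot(_ == childType)
  if (mismatched.nonEmpty) {
    // Mirrors the analyzer's error message shape.
    Left(s"Arguments must be same type but were: $childType != ${mismatched.head}")
  } else {
    Right(())
  }
}
```

A mixed `IntegerType`/`TimestampType` collection would be rejected, while a homogeneous one passes.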
```diff
@@ -872,7 +876,18 @@ class ColumnExpressionSuite extends QueryTest with SharedSparkSession {
   }

   test("SPARK-31563: sql of InSet for UTF8String collection") {
-    val inSet = InSet(Literal("a"), Set("a", "b").map(UTF8String.fromString))
+    val inSet = InSet(Literal("a"), Set("a", "b").map(UTF8String.fromString), StringType)
     assert(inSet.sql === "('a' IN ('a', 'b'))")
   }

+  test("SPARK-31553: isInCollection for collection sizes above a threshold") {
+    val threshold = 100
+    withSQLConf(SQLConf.OPTIMIZER_INSET_CONVERSION_THRESHOLD.key -> threshold.toString) {
+      val set = (0 until 2 * threshold).map(_.toString).toSet
+      val elem = "10"
+      val data = Seq(elem).toDF("x")
+      assert(set.contains(elem))
+      checkAnswer(data.select($"x".isInCollection(set)), Row(true))
+    }
+  }
 }
```

Review comment on the new test:

> Thank you, @MaxGekk. cc @aokolnychyi and @dbtsai
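The regression the new test guards against came from passing external Scala values straight into `InSet`, which compares against Catalyst's internal row values (for example, `UTF8String` rather than `java.lang.String`), so membership checks above the threshold silently returned false. A hypothetical mock of that representation mismatch (the `Utf8` class and `toInternal` helper are stand-ins, not Spark code):

```scala
// Stand-in for Catalyst's internal UTF8String representation (hypothetical).
final case class Utf8(bytes: Seq[Byte])
def toInternal(s: String): Utf8 = Utf8(s.getBytes("UTF-8").toSeq)

// What the old code built: a set of external String values.
val externalSet = Set[Any]("10", "20")
// What Spark evaluates a row's column to: an internal value.
val internalValue: Any = toInternal("10")
// Membership never matches: String vs Utf8, so the result is always false.
val broken = externalSet.contains(internalValue)

// The fix evaluates each lit(_) expression, so the set holds internal values too.
val fixedSet = Set[Any](toInternal("10"), toInternal("20"))
val fixed = fixedSet.contains(internalValue)
```

This is why the test's `checkAnswer(..., Row(true))` only passes after the fix: both sides of the lookup must use the same representation.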
Review discussion on the new `hsetElemType` parameter of `InSet`:

> Matching internal Catalyst types to external types is ambiguous. For example:
>
> - Long -> Long
> - Long -> Timestamp
>
> Also, the type of `child` can be unknown at the point where `InSet` has to know the Catalyst type of the `hset` elements. `hsetElemType` is needed to eliminate the ambiguity.

> Do you think we can make this an `Option[DataType]`, because only a few things are ambiguous?

> We can, but if a caller passes `None`, `InSet` will not be able to infer element types when `child.dataType` is `NullType`, like in this case. `dataType` returns `NullType` if `child` is `PrettyAttribute`.

> When can `hsetElemType` be different from `child.dataType`?

> When `InSet` is created from `isInCollection`; in that case `child.dataType` is `NullType`. For example, it is `NullType` in the test https://github.com/apache/spark/pull/28328/files#diff-aa655ba249e00d2591b21cf6a360cf82R886 because `child` is `PrettyAttribute` when the `sql` method is called.

> And `InSet.sql()` is called from `Dataset.select` via `_.named`: `Project(untypedCols.map(_.named), logicalPlan)`. The `named` method calls `toPrettySQL(expr)`, and `toPrettySQL` calls `sql`.
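The ambiguity described above can be sketched concretely: several Catalyst logical types share one internal physical representation, so a raw value alone cannot determine the element type. A hedged illustration (the helper below is hypothetical; the type names are Spark's, but the mapping is simplified):

```scala
// Hypothetical helper illustrating why an internal value alone is ambiguous:
// TimestampType is stored as a Long (microseconds since epoch) and DateType
// as an Int (days since epoch), so primitives map to multiple logical types.
def candidateCatalystTypes(v: Any): Seq[String] = v match {
  case _: Long   => Seq("LongType", "TimestampType") // both backed by Long
  case _: Int    => Seq("IntegerType", "DateType")   // both backed by Int
  case _: String => Seq("StringType")
  case _         => Seq("unknown")
}
```

Because a `Long` yields more than one candidate, `InSet` needs the explicit `hsetElemType` parameter rather than inferring the type from the values, especially when `child.dataType` is `NullType`.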