[SPARK-33938][SQL] Optimize Like Any/All by LikeSimplification #30975
beliefer wants to merge 55 commits into apache:master
Conversation
```scala
      new AnalysisException(s"Column $colName does not exist")
    }

    def cannotSimplifyMultiLikeError(multi: MultiLikeBase): Throwable = {
```
|
The optimization was already there before we added the …
It has conflicts with 3.1, @beliefer can you create a backport PR?

OK.

@cloud-fan Thanks for your work!
```scala
    multi
  } else {
    multi match {
      case l: LikeAll => And(replacements.reduceLeft(And), l.copy(patterns = remainPatterns))
```
It may cause a StackOverflowError:
scala> spark.sql("drop table SPARK_33938")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.sql("create table SPARK_33938(id string) using parquet")
res7: org.apache.spark.sql.DataFrame = []
scala> val values = Range(1, 10000)
values: scala.collection.immutable.Range = Range 1 until 10000
scala> spark.sql(s"select * from SPARK_33938 where id like all (${values.map(s => s"'$s'").mkString(", ")})").show
java.lang.StackOverflowError
at java.lang.ThreadLocal.set(ThreadLocal.java:201)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.set(TreeNode.scala:62)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
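The overflow comes from the shape of the rewritten predicate: `reduceLeft(And)` over thousands of per-pattern predicates builds a left-deep `And` tree whose depth grows linearly with the pattern count, so the optimizer's recursive `transformDown` exhausts the stack. A minimal, self-contained sketch of the tree shape (the `Expr`/`Pred`/`And` types here are stand-ins, not Spark's classes):

```scala
sealed trait Expr
case class Pred(name: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr

// The tree is left-deep, so its depth can be measured tail-recursively by
// walking the left spine; a naive recursive traversal of a tree this deep
// is exactly the kind of code that overflows the stack.
@annotation.tailrec
def leftDepth(e: Expr, acc: Int = 1): Int = e match {
  case And(l, _) => leftDepth(l, acc + 1)
  case _         => acc
}

val preds = (1 to 10000).map(i => Pred(s"p$i"))
val combined = preds.reduceLeft[Expr](And(_, _))
println(leftDepth(combined)) // prints 10000: depth grows linearly with the pattern count
```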
For example, take patterns a, b, c, d, e, and f, where a, b, c, and d can be rewritten with startsWith. Under the current logic the result is startsWith(a) && startsWith(b) && startsWith(c) && startsWith(d) && LikeAll(e, f) (the nesting of the And tree is not shown here).
We could use a threshold to cap how many patterns get rewritten; if, say, only two patterns may be rewritten, the result becomes startsWith(a) && startsWith(b) && LikeAll(c, d, e, f), as sketched below.
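A minimal sketch of that threshold idea (`splitPatterns`, `canRewriteAsStartsWith`, and `threshold` are hypothetical names for illustration, not Spark's actual API):

```scala
// Split LIKE ALL patterns into a bounded set to rewrite as StartsWith
// and a remainder that stays inside LikeAll.
def splitPatterns(patterns: Seq[String], threshold: Int): (Seq[String], Seq[String]) = {
  // "abc%" can become StartsWith("abc") when the prefix has no wildcards
  def canRewriteAsStartsWith(p: String): Boolean =
    p.endsWith("%") && !p.dropRight(1).exists(c => c == '%' || c == '_')

  val (simple, rest) = patterns.partition(canRewriteAsStartsWith)
  val (rewritten, keep) = simple.splitAt(threshold)
  (rewritten, keep ++ rest) // patterns beyond the threshold stay in LikeAll
}

// splitPatterns(Seq("a%", "b%", "c%", "d%", "e%", "f%"), threshold = 2)
//   => (Seq("a%", "b%"), Seq("c%", "d%", "e%", "f%"))
//   i.e. startsWith(a) && startsWith(b) && LikeAll(c%, d%, e%, f%)
```

Bounding the number of extracted predicates also caps the depth of the generated And chain, which sidesteps the StackOverflowError shown above.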
What changes were proposed in this pull request?
We should optimize LIKE ANY/ALL with LikeSimplification to improve performance.
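As a conceptual illustration (not verified optimizer output): LikeSimplification already rewrites single-pattern LIKE into cheap string predicates such as StartsWith, and this change extends the rewrite to the multi-pattern LIKE ANY/ALL forms:

```scala
// Conceptual illustration; the exact optimized plan may differ.
// col LIKE ALL ('abc%', '%def') can be simplified to
//   StartsWith(col, 'abc') AND EndsWith(col, 'def')
// avoiding per-row pattern matching.
spark.sql("SELECT * FROM SPARK_33938 WHERE id LIKE ALL ('abc%', '%def')").explain(true)
```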
Why are the changes needed?
Optimizing LIKE ANY/ALL with LikeSimplification improves query performance.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Jenkins tests.