[SPARK-49506][SQL] Optimize ArrayBinarySearch for foldable array #47984
panbingkun wants to merge 30 commits into apache:master
Thanks @panbingkun for working on this!
  @transient private lazy val isPrimitiveType: Boolean = CodeGenerator.isPrimitiveType(elementType)
  @transient private lazy val canPerformFastBinarySearch: Boolean = isPrimitiveType &&
    elementType != BooleanType && !resultArrayElementNullable
  @transient private lazy val arrayIsFoldable: Boolean = array.foldable
This isn't worth a lazy val. Can a simple def work?
  // boolean
  // foldable optimize
  public static int binarySearchNullSafe(Boolean[] data, Boolean value) {
I think it's better to take ArrayData here to simplify the expression implementation. We can call arrayData.toBooleanArray or .toObjectArray for non-nullable and nullable arrays.
For nullable arrays, I don't think concrete types can help the performance. Object[] should be OK.
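The split suggested in the two comments above can be sketched in plain Java. This is only an illustration, not the PR's code: the class and method names are hypothetical, and ordering nulls first is an assumption made for the example rather than something the thread states. The non-nullable path searches a primitive array directly, while the nullable path searches an Object[] with a null-aware comparator.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: one search path per nullability case.
final class BinarySearchSketch {
    private BinarySearchSketch() {}

    // Non-nullable elements: search the primitive array, no boxing involved.
    static int searchPrimitive(int[] sorted, int value) {
        return Arrays.binarySearch(sorted, value);
    }

    // Nullable elements: a plain Object[] plus a comparator that tolerates nulls.
    // Assumption for this sketch: nulls sort before all non-null values.
    @SuppressWarnings("unchecked")
    static int searchNullable(Object[] sorted, Object value) {
        Comparator<Object> natural = (a, b) -> ((Comparable<Object>) a).compareTo(b);
        Comparator<Object> nullsFirst = Comparator.nullsFirst(natural);
        return Arrays.binarySearch(sorted, value, nullsFirst);
    }
}
```

Both methods follow the java.util.Arrays.binarySearch contract: a non-negative index when the value is found, otherwise -(insertion point) - 1.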
      "binarySearch",
      Seq(array, value),
      inputTypes)
    if (arrayIsFoldable) {
I'm a bit confused. What's the difference between foldable and non-foldable arrays regarding this optimization?
When the array is foldable, the optimization converts it with toArray only once, instead of calling toArray on every evaluation.
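A minimal Java sketch of that idea, under stated assumptions: the class name, the Supplier standing in for the ArrayData-to-array conversion, and the foldable flag are all hypothetical and are not taken from the PR. When the input is constant, the conversion runs once in the constructor; otherwise it runs on every call.

```java
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical sketch: cache the converted array when the input is constant (foldable).
final class FoldableSearchSketch {
    private final Supplier<int[]> toArray; // stands in for the per-row conversion
    private final int[] cached;            // non-null only in the foldable case

    FoldableSearchSketch(Supplier<int[]> toArray, boolean foldable) {
        this.toArray = toArray;
        // Foldable: convert once up front. Non-foldable: defer to each call.
        this.cached = foldable ? toArray.get() : null;
    }

    int search(int value) {
        int[] data = (cached != null) ? cached : toArray.get();
        return Arrays.binarySearch(data, value);
    }
}
```

With a conversion that counts its own invocations, two searches on a foldable instance trigger one conversion, while two searches on a non-foldable instance trigger two.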
Had an offline discussion with @panbingkun; another approach may be: the bottleneck is
@zhengruifeng This is a good idea. I think
Great suggestion, let me give it a try, thanks!
@zhengruifeng @cloud-fan
ArrayBinarySearchBenchmark
 * Results will be written to "benchmarks/ArrayBinarySearchBenchmark-results.txt".
 * }}}
 */
object ArrayBinarySearchBenchmark extends SqlBasedBenchmark {
If we don't need ArrayBinarySearchBenchmark, I can delete it.
I think we don't need it; we can delete it before merging.
I have already deleted it.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
...st/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -90,7 +90,24 @@ trait InvokeLike extends Expression with NonSQLExpression with ImplicitCastInput
  // serializability, because the type-level info with java.io.Serializable and
  // java.io.Externalizable marker interfaces are not strong guarantees.
  // This restriction can be relaxed in the future to expose more optimizations.
We should update the comment here. We do not block all ObjectType now.
This reverts commit 63b91e4.
@zhengruifeng @cloud-fan
I will close it.



What changes were proposed in this pull request?
The PR aims to optimize ArrayBinarySearch for foldable array.
Why are the changes needed?
The changes improve performance of the array_binary_search() function. When the array parameter is foldable, the {DataType}ArrayData is prepared only once at initialization (avoiding frequent calls to ArrayData.to{DataType}Array()) and reused inside replacement.
Before:
After:
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No.