[SPARK-49506][SQL] Optimize ArrayBinarySearch for foldable array by panbingkun · Pull Request #47984 · apache/spark

panbingkun · 2024-09-04T06:28:43Z

What changes were proposed in this pull request?

This PR aims to:

optimize ArrayBinarySearch for foldable arrays;
fix a bug in the original implementation.

Why are the changes needed?

The changes improve the performance of the array_binary_search() function.

When the array argument is foldable, the foldable{DataType}ArrayData instance is created only once, at initialization, and reused in the replacement expression. This avoids repeated calls to ArrayData.to{DataType}Array().
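The idea can be illustrated with a minimal Java sketch. It is not the PR's code: `FoldableSearchSketch`, `toDoubleArray`, and `CachedSearch` are hypothetical names, and a `List<Double>` stands in for an array value whose primitive view requires a copy. A counter makes the number of conversions observable.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; none of these names come from Spark.
final class FoldableSearchSketch {
    // Counts conversions so the two strategies can be compared.
    static int conversions = 0;

    // Mimics an ArrayData-style conversion: every call copies into a fresh double[].
    static double[] toDoubleArray(List<Double> data) {
        conversions++;
        double[] out = new double[data.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = data.get(i);
        }
        return out;
    }

    // Per-row strategy: the array is converted again on every lookup.
    static int searchPerRow(List<Double> data, double value) {
        return Arrays.binarySearch(toDoubleArray(data), value);
    }

    // Foldable strategy: the array is constant, so convert once and reuse it.
    static final class CachedSearch {
        private final double[] cached;

        CachedSearch(List<Double> constantData) {
            this.cached = toDoubleArray(constantData);
        }

        int search(double value) {
            return Arrays.binarySearch(cached, value);
        }
    }
}
```

With the cached variant, the conversion cost is paid once per query rather than once per row, which is the effect the benchmark is meant to measure.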

Before:

Running benchmark: array binary search
  Running case: no foldable optimize
  Stopped after 100 iterations, 93668 ms

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 14.6.1
Apple M2
array binary search:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
no foldable optimize                                916            937          24         10.9          91.6       1.0X

After:

Running benchmark: array binary search
  Running case: has foldable optimize
  Stopped after 100 iterations, 17206 ms

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 14.6.1
Apple M2
array binary search:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
has foldable optimize                               164            172          22         61.1          16.4       1.0X
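As a rough cross-check of the two tables, the derived columns can be recomputed from the best times. The sketch below assumes a row count of N = 10,000,000 (the value used in the benchmark code posted in this thread); the class and method names are made up for illustration. Because the tables print rounded best times, recomputed rates may differ from the printed ones in the last digit.

```java
// Hypothetical helper for reading the benchmark tables; not part of Spark.
final class BenchmarkMath {
    // Nanoseconds per row, given the best time in milliseconds and the row count.
    static double perRowNs(double bestMs, long rows) {
        return bestMs * 1_000_000.0 / rows;
    }

    // Throughput in millions of rows per second.
    static double rateMillionsPerSec(double bestMs, long rows) {
        return rows / (bestMs / 1000.0) / 1_000_000.0;
    }

    // How many times faster the "after" run is than the "before" run.
    static double speedup(double beforeBestMs, double afterBestMs) {
        return beforeBestMs / afterBestMs;
    }

    public static void main(String[] args) {
        long n = 10_000_000L; // assumed row count per iteration
        System.out.printf("before: %.1f ns/row, %.1f M/s%n",
            perRowNs(916, n), rateMillionsPerSec(916, n));
        System.out.printf("after:  %.1f ns/row, %.1f M/s%n",
            perRowNs(164, n), rateMillionsPerSec(164, n));
        System.out.printf("speedup: %.2fx%n", speedup(916, 164));
    }
}
```

On these numbers the best time drops from 916 ms to 164 ms, a speedup of about 5.6x.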

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Updated the existing unit tests.
Passed GitHub Actions (GA).

Was this patch authored or co-authored using generative AI tooling?

No.

panbingkun · 2024-09-04T07:49:59Z

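package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark

// Benchmarks array_binary_search() against a constant (foldable) sorted array.
object ArrayBinarySearchBenchmark extends SqlBasedBenchmark {
  private val N = 10000000 // rows per iteration
  private val M = 100      // array length; also passed as the iteration count below

  // SQL literal of the form array(0,1,...,99)
  private val arrayData = (0 until M).mkString("array(", ",", ")")
  // Each row looks up value % M, so every lookup hits an existing element
  private val exprs = s"array_binary_search($arrayData, value % $M)"
  private val df = spark.range(N).toDF("value")

  // Evaluates the expression for every row without collecting results
  private def doBenchmark(): Unit = {
    df.selectExpr(exprs).noop()
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    runBenchmark("array binary search") {
      val benchmark = new Benchmark("array binary search", N, output = output)
      benchmark.addCase("no foldable optimize", M) { _ =>
        doBenchmark()
      }
      benchmark.run()
    }
  }
}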

Before:

Running benchmark: array binary search
  Running case: no foldable optimize
  Stopped after 100 iterations, 93668 ms

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 14.6.1
Apple M2
array binary search:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
no foldable optimize                                916            937          24         10.9          91.6       1.0X

After:

Running benchmark: array binary search
  Running case: has foldable optimize
  Stopped after 100 iterations, 17206 ms

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 14.6.1
Apple M2
array binary search:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
has foldable optimize                               164            172          22         61.1          16.4       1.0X

zhengruifeng · 2024-09-04T11:42:38Z

Thanks @panbingkun for working on this!
Existing usages (PySpark and ML) both apply binary search to a literal double array, so this optimization will improve their performance.

cloud-fan · 2024-09-09T09:05:16Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

  @transient private lazy val isPrimitiveType: Boolean = CodeGenerator.isPrimitiveType(elementType)
  @transient private lazy val canPerformFastBinarySearch: Boolean = isPrimitiveType &&
    elementType != BooleanType && !resultArrayElementNullable
+  @transient private lazy val arrayIsFoldable: Boolean = array.foldable


This isn't worth a lazy val. Can a simple def work?

cloud-fan · 2024-09-09T09:08:07Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ArrayExpressionUtils.java


+  // boolean
+  // foldable optimize
+  public static int binarySearchNullSafe(Boolean[] data, Boolean value) {


I think it's better to take ArrayData here to simplify the expression implementation. We can call arrayData.toBooleanArray or .toObjectArray for non-nullable and nullable arrays, respectively.

For nullable arrays, I don't think concrete types can help the performance. Object[] should be OK.
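A minimal sketch of that split, under stated assumptions: the non-nullable path searches a primitive array directly, while the nullable path searches an `Object[]` with a comparator. The null ordering (nulls before all other values) is an assumption made for this example, and `NullSafeSearchSketch` and its methods are hypothetical names rather than anything in ArrayExpressionUtils.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration of the two search paths; not Spark's implementation.
final class NullSafeSearchSketch {
    // Assumption for this sketch: nulls sort before every non-null value.
    static final Comparator<Object> NULLS_FIRST = (a, b) -> {
        if (a == null && b == null) return 0;
        if (a == null) return -1;
        if (b == null) return 1;
        @SuppressWarnings("unchecked")
        Comparable<Object> left = (Comparable<Object>) a;
        return left.compareTo(b);
    };

    // Non-nullable path: primitive array, no boxing and no null checks.
    static int search(double[] data, double value) {
        return Arrays.binarySearch(data, value);
    }

    // Nullable path: a plain Object[] with a null-aware comparator.
    static int searchNullSafe(Object[] data, Object value) {
        return Arrays.binarySearch(data, value, NULLS_FIRST);
    }
}
```

Since every element on the nullable path is boxed and compared through the comparator anyway, a single `Object[]` overload covers all element types, which is the simplification suggested above.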

cloud-fan · 2024-09-09T09:15:13Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

-        "binarySearch",
-        Seq(array, value),
-        inputTypes)
+      if (arrayIsFoldable) {


I'm a bit confused. What's the difference between foldable and non-foldable arrays regarding this optimization?

Take the following case as an example:

val a6_0 = Literal.create(Seq(1.0d, 2.0d, 3.0d), ArrayType(DoubleType, containsNull = false))
checkEvaluation(ArrayBinarySearch(a6_0, Literal(1.0d)), 0)

When the array is foldable, the optimized code converts it to a Java array only once instead of converting it on every evaluation.

zhengruifeng · 2024-09-09T12:03:44Z

Had an offline discussion with @panbingkun; another approach might be:

The bottleneck is ArrayData.toXXXArray, which requires a deep copy of the whole array. If ConstantFolding can convert all foldable arrays to literals (with a GenericArrayData value), then maybe we can optimize it by overriding the toXXXArray methods to return the underlying array directly in some way.
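To make the copy-versus-share distinction concrete, here is a small Java sketch. `CopyingArray` and `SharedArray` are hypothetical classes invented for this example; they are not Spark's ArrayData or GenericArrayData, and the sketch assumes callers never mutate the returned array.

```java
// Hypothetical illustration; these classes do not exist in Spark.
final class BackingArraySketch {
    static class CopyingArray {
        protected final double[] values;

        CopyingArray(double[] values) {
            this.values = values;
        }

        // Default behaviour: a defensive copy on every call, O(n) per call.
        double[] toDoubleArray() {
            return values.clone();
        }
    }

    static final class SharedArray extends CopyingArray {
        SharedArray(double[] values) {
            super(values);
        }

        // Override: hand out the backing array itself, O(1) per call.
        // Only safe if no caller writes to the returned array.
        @Override
        double[] toDoubleArray() {
            return values;
        }
    }
}
```

The trade-off is aliasing: sharing removes the per-call copy, but any caller that writes to the result would corrupt the constant.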

cloud-fan · 2024-09-09T12:59:32Z

@zhengruifeng This is a good idea. I think ArrayBinarySearch should be replaced by invoke_binary_search_function(ToJavaArray(array_expr)), and ConstantFolding should do the optimization automatically. For the new expression ToJavaArray, it can create primitive java array if the element type is primitive type and not nullable.
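A toy model of this proposal, as a sketch only: a tiny expression tree with a `ToJavaArray` node and a folding pass that replaces any foldable subtree with a literal holding its value. All classes here are hypothetical stand-ins written for this example; they mirror the names in the discussion but are not Catalyst's `Expression`, `Literal`, or `ConstantFolding`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical mini expression tree; not Catalyst.
final class ConstantFoldingSketch {
    interface Expr {
        boolean foldable();
        Object eval(Object row); // row is the per-row input value
    }

    static final class Literal implements Expr {
        final Object value;
        Literal(Object value) { this.value = value; }
        public boolean foldable() { return true; }
        public Object eval(Object row) { return value; }
    }

    // Stands for a column reference: its value changes per row.
    static final class Input implements Expr {
        public boolean foldable() { return false; }
        public Object eval(Object row) { return row; }
    }

    // Converts a List<Double> child into a primitive double[].
    static final class ToJavaArray implements Expr {
        static int conversions = 0; // counts how often the copy happens
        final Expr child;
        ToJavaArray(Expr child) { this.child = child; }
        public boolean foldable() { return child.foldable(); }
        public Object eval(Object row) {
            @SuppressWarnings("unchecked")
            List<Double> list = (List<Double>) child.eval(row);
            conversions++;
            double[] out = new double[list.size()];
            for (int i = 0; i < out.length; i++) out[i] = list.get(i);
            return out;
        }
    }

    static final class BinarySearch implements Expr {
        final Expr array;
        final Expr value;
        BinarySearch(Expr array, Expr value) { this.array = array; this.value = value; }
        public boolean foldable() { return array.foldable() && value.foldable(); }
        public Object eval(Object row) {
            return Arrays.binarySearch((double[]) array.eval(row), (Double) value.eval(row));
        }
    }

    // Constant folding: evaluate each foldable subtree once and keep the result.
    static Expr fold(Expr e) {
        if (e instanceof Literal) return e;
        if (e.foldable()) return new Literal(e.eval(null));
        if (e instanceof BinarySearch) {
            BinarySearch b = (BinarySearch) e;
            return new BinarySearch(fold(b.array), fold(b.value));
        }
        if (e instanceof ToJavaArray) {
            return new ToJavaArray(fold(((ToJavaArray) e).child));
        }
        return e;
    }
}
```

In this model the search expression itself needs no special case for constant arrays: the conversion node is foldable whenever its child is, so the generic folding pass removes the per-row copy.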

panbingkun · 2024-09-09T13:51:40Z

Great suggestion, let me give it a try, thanks!

panbingkun · 2024-09-11T06:26:30Z

@zhengruifeng This is a good idea. I think ArrayBinarySearch should be replaced by invoke_binary_search_function(ToJavaArray(array_expr)), and ConstantFolding should do the optimization automatically. For the new expression ToJavaArray, it can create primitive java array if the element type is primitive type and not nullable.

@zhengruifeng @cloud-fan
The latest logic, based on ToJavaArray, has been submitted.
Please help review it when you have free time, thanks!

panbingkun · 2024-09-11T06:39:10Z

ArrayBinarySearchBenchmark

Before:

Running benchmark: array binary search
  Running case: has foldable optimize
  Stopped after 100 iterations, 371269 ms

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 14.6.1
Apple M2
array binary search:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
has foldable optimize                              3585           3713         100          2.8         358.5       1.0X

After:

Running benchmark: array binary search
  Running case: has foldable optimize
  Stopped after 100 iterations, 21097 ms

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 14.6.1
Apple M2
array binary search:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
has foldable optimize                               201            211          23         49.9          20.1       1.0X

panbingkun · 2024-09-11T06:41:37Z

...ore/src/test/scala/org/apache/spark/sql/execution/benchmark/ArrayBinarySearchBenchmark.scala

+ *      Results will be written to "benchmarks/ArrayBinarySearchBenchmark-results.txt".
+ * }}}
+ */
+object ArrayBinarySearchBenchmark extends SqlBasedBenchmark {


If we don't need ArrayBinarySearchBenchmark, I can delete it.

I think we don't need it, we can delete it before merge

I have already deleted it.

cloud-fan · 2024-09-23T13:10:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

@@ -90,7 +90,24 @@ trait InvokeLike extends Expression with NonSQLExpression with ImplicitCastInput
    // serializability, because the type-level info with java.io.Serializable and
    // java.io.Externalizable marker interfaces are not strong guarantees.
    // This restriction can be relaxed in the future to expose more optimizations.


We should update the comment here. We do not block all ObjectType now.

panbingkun · 2024-09-24T11:49:40Z

@zhengruifeng @cloud-fan
I'm very sorry that I broke this PR and couldn't restore it, so I opened a new one
#48225

panbingkun · 2024-09-24T11:54:58Z

I will close it.

[SPARK-49506][SQL] Optimize ArrayBinarySearch for foldable array

bdf34de

github-actions bot added the SQL label Sep 4, 2024

panbingkun marked this pull request as ready for review September 4, 2024 07:54

zhengruifeng requested a review from cloud-fan September 4, 2024 11:40

panbingkun added 4 commits September 4, 2024 19:45

Merge branch 'master' into SPARK-49506

50cef64

add ArrayBinarySearchBenchmark

f82f75c

add some case

dd21230

fix code style

4d42925

cloud-fan reviewed Sep 9, 2024

panbingkun added 4 commits September 10, 2024 11:18

Merge branch 'master' into SPARK-49506

b677ec5

update

74fa5fa

use ToJavaArray

6d73eaa

Merge branch 'master' into SPARK-49506

0b77cf1

panbingkun commented Sep 11, 2024

panbingkun requested a review from cloud-fan September 11, 2024 06:48

zhengruifeng reviewed Sep 11, 2024

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (outdated, resolved)

zhengruifeng reviewed Sep 11, 2024

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (outdated, resolved)

panbingkun added 2 commits September 11, 2024 19:16

update

17a1df9

Merge branch 'master' into SPARK-49506

7e75b5e

zhengruifeng reviewed Sep 11, 2024

...st/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (resolved)

remove register

83facef

cloud-fan reviewed Sep 23, 2024

panbingkun added 2 commits September 24, 2024 09:25

Merge branch 'SPARK-49506' of https://github.com/panbingkun/spark int…

d1fd838

…o SPARK-49506

new file ToJavaArrayUtils.java & ToJavaArray.scala

86bad41

github-actions bot added BUILD CORE PYTHON CONNECT labels Sep 24, 2024

panbingkun added 4 commits September 24, 2024 10:26

Revert "new file ToJavaArrayUtils.java & ToJavaArray.scala"

63b91e4

This reverts commit 86bad41.

Revert "Merge branch 'SPARK-49506' of https://github.com/panbingkun/s…

42b1376

…park into SPARK-49506" This reverts commit d1fd838, reversing changes made to 94d288e.

Reapply "Merge branch 'SPARK-49506' of https://github.com/panbingkun/…

d0e3e35

…spark into SPARK-49506" This reverts commit 42b1376.

Revert "Merge branch 'SPARK-49506' of https://github.com/panbingkun/s…

dfd2f39

…park into SPARK-49506" This reverts commit d1fd838, reversing changes made to 7f62093.

github-actions bot removed BUILD CORE PYTHON CONNECT labels Sep 24, 2024

Reapply "new file ToJavaArrayUtils.java & ToJavaArray.scala"

277d12e

This reverts commit 63b91e4.

panbingkun requested a review from cloud-fan September 24, 2024 02:50

resolve file conflict

728fdad

github-actions bot added BUILD CORE PYTHON CONNECT labels Sep 24, 2024

fix

3ac12d1

github-actions bot added the STRUCTURED STREAMING label Sep 24, 2024

panbingkun mentioned this pull request Sep 24, 2024

[SPARK-49506][SQL] Optimize ArrayBinarySearch for foldable array #48225

Closed

panbingkun closed this Sep 24, 2024

panbingkun deleted the SPARK-49506 branch September 24, 2024 12:00