[SPARK-48735][SQL] Performance Improvement for BIN function #47119

Closed
wants to merge 7 commits into from

Conversation

yaooqinn
Member

@yaooqinn yaooqinn commented Jun 27, 2024

What changes were proposed in this pull request?

This PR implements a method that converts a long directly to its binary form as a UTF8String, which improves the performance of the BIN function. It avoids the encoding/decoding step and the array copying.
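To illustrate the idea described above, here is a minimal sketch of writing the binary digits of a long straight into a byte array. It is not the code merged in this PR: the class name `BinSketch` and the helper `toBinaryBytes` are hypothetical, and the sketch only assumes that the digits `0` and `1` are single-byte ASCII, so no charset encoding pass is needed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the method added to UTF8String by this PR.
class BinSketch {

    // Writes the unsigned binary digits of `val` directly as ASCII bytes.
    static byte[] toBinaryBytes(long val) {
        int zeros = Long.numberOfLeadingZeros(val);
        if (zeros == Long.SIZE) {
            // val == 0: the result is the single digit "0"
            return new byte[] {'0'};
        }
        int length = Long.SIZE - zeros;   // number of significant bits
        byte[] bytes = new byte[length];  // allocated once at the final size
        for (int i = length - 1; i >= 0; i--) {
            bytes[i] = (byte) ('0' + (val & 1)); // lowest bit becomes the last digit
            val >>>= 1;                          // unsigned shift, so negatives terminate
        }
        return bytes;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 13L, -1L, Long.MIN_VALUE, Long.MAX_VALUE};
        for (long v : samples) {
            String direct = new String(toBinaryBytes(v), StandardCharsets.US_ASCII);
            boolean same = direct.equals(Long.toBinaryString(v));
            System.out.println(v + " -> " + direct + " (matches JDK: " + same + ")");
        }
    }
}
```

The `main` method compares the sketch against `java.lang.Long.toBinaryString`, which is what the old code path called before wrapping the result with `UTF8String.fromString`.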

Why are the changes needed?

performance improvement

Does this PR introduce any user-facing change?

no

How was this patch tested?

  • new unit tests
  • offline benchmarking (~2x faster)

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Jun 27, 2024
@LuciferYang
Contributor

LuciferYang commented Jun 27, 2024

I ran the benchmark provided in the PR description on both Java 17 and Java 21, but I did not observe the same effect.

Java 17

before

[info] OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Mac OS X 14.5
[info] Apple M2 Max
[info] encode:                                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] BIN                                                2704           2748          62          3.7         270.4       1.0X

after

[info] OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Mac OS X 14.5
[info] Apple M2 Max
[info] encode:                                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] BIN                                                2863           2905          59          3.5         286.3       1.0X

Java 21

before

[info] OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Mac OS X 14.5
[info] Apple M2 Max
[info] encode:                                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] BIN                                                2824           2828           7          3.5         282.4       1.0X

after

[info] OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Mac OS X 14.5
[info] Apple M2 Max
[info] encode:                                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] BIN                                                2814           2825          15          3.6         281.4       1.0X

@yaooqinn
Member Author

yaooqinn commented Jun 27, 2024

Sorry @LuciferYang, the previous benchmark code might have been influenced by I/O.

package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{Bin, Expression, ImplicitCastInputTypes, NullIntolerant, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{DataType, LongType}
import org.apache.spark.unsafe.types.UTF8String

object MathFunctionBenchmark extends SqlBasedBenchmark {
  private val N = 100L * 1000 * 1000

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val benchmark = new Benchmark("BIN", N, output = output)
    benchmark.addCase("BIN") { _ =>
      spark.range(-N, N).select(Column(Bin(Column("id").expr))).noop()
    }

    benchmark.addCase("BIN OLD") { _ =>
      spark.range(-N, N).select(Column(BinOld(Column("id").expr))).noop()
    }
    benchmark.run()
  }
}

case class BinOld(child: Expression)
  extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant with Serializable {

  override def inputTypes: Seq[DataType] = Seq(LongType)
  override def dataType: DataType = SQLConf.get.defaultStringType

  protected override def nullSafeEval(input: Any): Any =
    UTF8String.fromString(java.lang.Long.toBinaryString(input.asInstanceOf[Long]))

  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    defineCodeGen(ctx, ev, (c) =>
      s"UTF8String.fromString(java.lang.Long.toBinaryString($c))")
  }
  override protected def withNewChildInternal(newChild: Expression): BinOld =
    copy(child = newChild)
}

[info] Running benchmark: BIN
[info]   Running case: BIN
[info]   Stopped after 2 iterations, 12111 ms
[info]   Running case: BIN OLD
[info]   Stopped after 2 iterations, 25052 ms
[info] OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
[info] Apple M2 Max
[info] BIN:                                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] BIN                                                6047           6056          13         16.5          60.5       1.0X
[info] BIN OLD                                           12459          12526          96          8.0         124.6       0.5X
[success] Total time: 114 s (01:54), completed Jun 27, 2024, 5:24:10 PM

@yaooqinn
Member Author

OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
AMD EPYC 7763 64-Core Processor
BIN:                                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
BIN                                               11425          11429           6          8.8         114.2       1.0X
BIN OLD                                           16381          16387           9          6.1         163.8       0.7X

The GA environment does not produce a result as significant as my local laptop, but it is still positive.
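For readers comparing the two tables above, the following sketch recomputes the `Relative` column from the `Best Time(ms)` values. It assumes that `Relative` is the baseline (first) case's best time divided by each case's best time; the class name `RelativeColumn` and its methods are hypothetical, and the numbers are copied from the tables in this thread.

```java
// Hypothetical helper for reading the benchmark tables; not part of Spark.
class RelativeColumn {

    // Relative speed of a case versus the baseline: baselineBestMs / caseBestMs.
    static double relative(double baselineBestMs, double caseBestMs) {
        return baselineBestMs / caseBestMs;
    }

    // How many times faster the new code is than the old code.
    static double speedup(double newBestMs, double oldBestMs) {
        return oldBestMs / newBestMs;
    }

    public static void main(String[] args) {
        // Apple M2 Max, JDK 17: BIN = 6047 ms, BIN OLD = 12459 ms
        System.out.printf("M2 Max:    BIN OLD relative = %.2fX, speedup = %.2fx%n",
            relative(6047, 12459), speedup(6047, 12459));
        // AMD EPYC 7763, JDK 21: BIN = 11425 ms, BIN OLD = 16381 ms
        System.out.printf("EPYC 7763: BIN OLD relative = %.2fX, speedup = %.2fx%n",
            relative(11425, 16381), speedup(11425, 16381));
    }
}
```

Under that assumption the ratios round to the 0.5X and 0.7X shown in the tables, which corresponds to roughly a 2.1x speedup on the M2 Max and roughly 1.4x on the GA runner.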

This reverts commit 364537d.
This reverts commit d43789f.
@LuciferYang
Contributor

OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Mac OS X 14.5
Apple M2 Max
BIN:                                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
BIN                                               12165          12271         151          8.2         121.6       1.0X
BIN OLD                                           11976          12020          62          8.3         119.8       1.0X

The new benchmark also fails to show performance differences on my Mac. But it's ok if the new code's advantages can be demonstrated on GA, given that there are differences in CPU architecture.

@yaooqinn yaooqinn closed this in df13ca0 Jun 27, 2024
@yaooqinn yaooqinn deleted the SPARK-48735 branch June 27, 2024 13:42
@yaooqinn
Member Author

Thanks, @LuciferYang. Merged to master
