-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48735][SQL] Performance Improvement for BIN function #47119
Conversation
I executed the benchmark provided in the PR description using both Java 17 and 21, but it seems I did not observe the same effect Java 17 before
after
Java 21 before
after
|
Sorry @LuciferYang , the previous benchmark code might have been influenced by IO package org.apache.spark.sql.execution.benchmark
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{Bin, Expression, ImplicitCastInputTypes, NullIntolerant, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{DataType, LongType}
import org.apache.spark.unsafe.types.UTF8String
object MathFunctionBenchmark extends SqlBasedBenchmark {
private val N = 100L * 1000 * 1000
override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
val benchmark = new Benchmark("BIN", N, output = output)
benchmark.addCase("BIN") { _ =>
spark.range(-N, N).select(Column(Bin(Column("id").expr))).noop()
}
benchmark.addCase("BIN OLD") { _ =>
spark.range(-N, N).select(Column(BinOld(Column("id").expr))).noop()
}
benchmark.run()
}
}
case class BinOld(child: Expression)
extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant with Serializable {
override def inputTypes: Seq[DataType] = Seq(LongType)
override def dataType: DataType = SQLConf.get.defaultStringType
protected override def nullSafeEval(input: Any): Any =
UTF8String.fromString(java.lang.Long.toBinaryString(input.asInstanceOf[Long]))
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
defineCodeGen(ctx, ev, (c) =>
s"UTF8String.fromString(java.lang.Long.toBinaryString($c))")
}
override protected def withNewChildInternal(newChild: Expression): BinOld =
copy(child = newChild)
}
|
The GA environment does produce the result as sufficient as My local labtop but still positive |
The new benchmark also fails to show performance differences on my Mac. But it's ok if the new code's advantages can be demonstrated on GA, given that there are differences in CPU architecture. |
Thanks, @LuciferYang. Merged to master |
What changes were proposed in this pull request?
This PR implemented a long-to-binary form UTF8String method directly to improve the performance of the BIN function. It omits the procedure of encoding/decoding and array copying.
Why are the changes needed?
performance improvement
Does this PR introduce any user-facing change?
no
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
no