-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19008][SQL] Improve performance of Dataset.map by eliminating boxing/unboxing #17172
Conversation
Test build #73971 has finished for PR 17172 at commit
|
@@ -17,6 +17,8 @@ | |||
|
|||
package org.apache.spark.sql.execution | |||
|
|||
import com.sun.org.apache.xalan.internal.xsltc.compiler.util.VoidType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a mistaken import? I don't see it used in the change and can't imagine we'd be invoking Xalan here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is why the first try was failed. It was unintentionally imported during my debugging.
@@ -217,9 +219,33 @@ case class MapElementsExec( | |||
} | |||
|
|||
override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = { | |||
val inType = if (child.output.length == 1) child.output(0).dataType else NullType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two are only needed inside the case _
block right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I simplified their scope.
Test build #73987 has finished for PR 17172 at commit
|
Test build #73990 has finished for PR 17172 at commit
|
cc @cloud-fan |
@@ -219,7 +219,30 @@ case class MapElementsExec( | |||
override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = { | |||
val (funcClass, methodName) = func match { | |||
case m: MapFunction[_, _] => classOf[MapFunction[_, _]] -> "call" | |||
case _ => classOf[Any => Any] -> "apply" | |||
case _ => | |||
(if (child.output.length == 1) child.output(0).dataType else NullType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the if
is not needed, see the assert
in ObjectConsumerExec
outputObjAttr.dataType) match { | ||
// if a pair of an argument and return types is one of specific types | ||
// whose specialized method (apply$mc..$sp) is generated by scalac, | ||
// Catalyst generated a direct method call to the specialized method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you link to some official document or blogpost?
This is cool! Can you also update the benchmark result in |
The latest Do we add a new benchmark with a function |
yea let's add a new case in the benchmark |
Test build #74201 has finished for PR 17172 at commit
|
// SpecializeTypes.scala | ||
// http://www.cakesolutions.net/teamblogs/scala-dissection-functions | ||
// http://axel22.github.io/2013/11/03/specialization-quirks.html | ||
case (IntegerType, IntegerType) => classOf[Int => Int] -> "apply$mcII$sp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about boolean type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I overlooked boolean type for return type.
// http://www.cakesolutions.net/teamblogs/scala-dissection-functions | ||
// http://axel22.github.io/2013/11/03/specialization-quirks.html | ||
case (IntegerType, IntegerType) => classOf[Int => Int] -> "apply$mcII$sp" | ||
case (IntegerType, LongType) => classOf[Int => Long] -> "apply$mcJI$sp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible do it in a composable way instead of enumerating all combinations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I found a composable way.
Use compositional approach instead of enumeration approach
Test build #74249 has finished for PR 17172 at commit
|
Test build #74257 has finished for PR 17172 at commit
|
@@ -216,10 +217,39 @@ case class MapElementsExec( | |||
child.asInstanceOf[CodegenSupport].produce(ctx, this) | |||
} | |||
|
|||
private def getMethodType(dt: DataType, isOutput: Boolean): String = { | |||
dt match { | |||
case BooleanType if isOutput => "Z" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so boolean type can't be a parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, this code specializes boolean type only for return type.
@@ -216,10 +217,39 @@ case class MapElementsExec( | |||
child.asInstanceOf[CodegenSupport].produce(ctx, this) | |||
} | |||
|
|||
private def getMethodType(dt: DataType, isOutput: Boolean): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's return a Option[String]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will do this
@@ -165,11 +208,23 @@ object DatasetBenchmark { | |||
val numRows = 100000000 | |||
val numChains = 10 | |||
|
|||
val benchmark = backToBackMap(spark, numRows, numChains) | |||
val benchmark0 = backToBackMapLong(spark, numRows, numChains) | |||
val benchmark1 = backToBackMap(spark, numRows, numChains) | |||
val benchmark2 = backToBackFilter(spark, numRows, numChains) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can also add a new case for backToBackFilterLong
, as we handle boolean type now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filter()
is handled by FilterExec()
. Should this PR handle filter()
, too? Or, do I open another PR for filter()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction: FilterExec()
generates code. TypedFilter
generated code piece for a method invocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a new case backToBackFilterLong
, but made a mistake to put different result.
Let me correct it soon.
case _ => null | ||
} | ||
} | ||
|
||
override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = { | ||
val (funcClass, methodName) = func match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's put this thing in a util so that FilterExec
can also use it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Now, can generate a call to a specialized method for Dataset.filter()
.
Test build #74296 has finished for PR 17172 at commit
|
Test build #74297 has finished for PR 17172 at commit
|
cool! merging to master! |
What changes were proposed in this pull request?
This PR improve performance of Dataset.map() for primitive types by removing boxing/unbox operations. This is based on the discussion with @cloud-fan.
Current Catalyst generates a method call to a
apply()
method of an anonymous function written in Scala. The types of an argument and return value arejava.lang.Object
. As a result, each method call for a primitive value involves a pair of unboxing and boxing for calling thisapply()
method and a pair of boxing and unboxing for returning from thisapply()
method.This PR directly calls a specialized version of a
apply()
method without boxing and unboxing. For example, if types of an arguments ant return value isint
, this PR generates a method call toapply$mcII$sp
. This PR supports any combination ofInt
,Long
,Float
, andDouble
.The following is a benchmark result using this program with 4.7x. Here is a Dataset part of this program.
Without this PR
With this PR
A motivating example
Generated code without this PR
Generated code with this PR (lines 48-56 are changed)
Java bytecode for methods for
i => i * 7
How was this patch tested?
Added new test suites to
DatasetPrimitiveSuite
.