[SPARK-16331] [SQL] Reduce code generation time #14000

inouehrs · 2016-06-30T17:33:12Z

What changes were proposed in this pull request?

During code generation, a LocalRelation often holds a huge Vector object as its data. In the simple example below, the LocalRelation has a Vector with 1000000 UnsafeRow elements.

val numRows = 1000000
val ds = (1 to numRows).toDS().persist()
benchmark.addCase("filter+reduce") { iter =>
  ds.filter(a => (a & 1) == 0).reduce(_ + _)
}

In TreeNode.transformChildren, every element of the vector is unnecessarily iterated to check whether the vector contains any children, because Vector is Traversable. This iteration significantly increases code generation time.

This patch avoids this overhead by checking the number of children before iterating all elements; LocalRelation does not have children since it extends LeafNode.
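The idea can be sketched outside Spark as follows. This is a minimal illustration, not the actual TreeNode code: `Node`, `Leaf`, `Unary`, `args`, `withArgs`, `scanArgs`, and the `inspected` counter are hypothetical names invented for this sketch, and `Leaf` only loosely stands in for LocalRelation.

```scala
// Hypothetical, simplified stand-ins for Catalyst tree nodes (not Spark's real classes).
object TransformChildrenSketch {
  // Counts how many collection elements were inspected; for illustration only.
  var inspected: Long = 0L

  sealed trait Node {
    def children: Seq[Node]
    def args: Seq[Any]                      // constructor arguments of the node
    def withArgs(newArgs: Seq[Any]): Node   // rebuild the node from new arguments
  }

  // A leaf that carries a large data vector and has no children.
  final case class Leaf(data: Vector[Int]) extends Node {
    def children: Seq[Node] = Nil
    def args: Seq[Any] = Seq(data)
    def withArgs(newArgs: Seq[Any]): Node =
      Leaf(newArgs.head.asInstanceOf[Vector[Int]])
  }

  // A node with exactly one child.
  final case class Unary(child: Node) extends Node {
    def children: Seq[Node] = Seq(child)
    def args: Seq[Any] = Seq(child)
    def withArgs(newArgs: Seq[Any]): Node =
      Unary(newArgs.head.asInstanceOf[Node])
  }

  // Walks every argument, including every element of any collection argument,
  // looking for children to which the rule should be applied.
  private def scanArgs(node: Node, rule: Node => Node): Node = {
    var changed = false
    def isChild(n: Node): Boolean = node.children.exists(_ eq n)
    def applyTo(n: Node): Node = {
      val result = rule(n)
      if (!(result eq n)) changed = true
      result
    }
    val newArgs = node.args.map {
      case n: Node if isChild(n) => applyTo(n)
      case xs: Iterable[_] =>
        xs.map {
          case n: Node if isChild(n) => inspected += 1; applyTo(n)
          case other                 => inspected += 1; other
        }
      case other => other
    }
    if (changed) node.withArgs(newArgs) else node
  }

  // Without the check: the data vector of a leaf is scanned element by element.
  def transformChildrenNaive(node: Node, rule: Node => Node): Node =
    scanArgs(node, rule)

  // With the check: a node without children is returned immediately.
  def transformChildrenGuarded(node: Node, rule: Node => Node): Node =
    if (node.children.isEmpty) node else scanArgs(node, rule)
}
```

In this sketch, calling `transformChildrenNaive` on a `Leaf` visits every element of its vector even though nothing can change, while `transformChildrenGuarded` returns the same node without touching the data. Nodes that do have children follow the same path in both versions.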

The performance of the above example

without this patch
Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.11.5
Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
compilationTime:                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
filter+reduce                                 4426 / 4533          0.2        4426.0       1.0X

with this patch
compilationTime:                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
filter+reduce                                 3117 / 3391          0.3        3116.6       1.0X

How was this patch tested?

Using existing unit tests.

marmbrus · 2016-06-30T20:32:37Z

ok to test

rxin · 2016-06-30T22:13:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala

-      case nonChild: AnyRef => nonChild
-      case null => null
+      if (changed) makeCopy(newArgs) else this
+    }


a small style nit:

} else { this }

rxin · 2016-06-30T22:13:32Z

LGTM other than the small nit.

The actual diff is very small: https://github.com/apache/spark/pull/14000/files?w=1

SparkQA · 2016-06-30T22:31:18Z

Test build #61567 has finished for PR 14000 at commit 153e170.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

inouehrs · 2016-07-01T01:10:11Z

I fixed the coding style issue.

SparkQA · 2016-07-01T03:06:03Z

Test build #61586 has finished for PR 14000 at commit 639090d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-07-01T03:52:05Z

LGTM. good catch!

rxin · 2016-07-01T04:47:22Z

Merging in master. Thanks.

inouehrs added 2 commits July 1, 2016 00:52

Merge branch 'apache/master'

5afd079

add check of # children to avoid redundant iteration in transformChil…

153e170

…dren

rxin reviewed Jun 30, 2016

coding style fix

639090d

asfgit closed this in 14cf61e Jul 1, 2016
