
[SPARK-15742][SQL] Reduce temp collections allocations in TreeNode transform methods #13484

Closed

Conversation


@JoshRosen JoshRosen commented Jun 3, 2016

In Catalyst's TreeNode transform methods we end up calling `productIterator.map(...).toArray` in a number of places, which is slightly inefficient because it needs to allocate an `ArrayBuilder` and grow a temporary array. Since we already know the size of the final output (`productArity`), we can simply allocate an array up-front and use a while loop to consume the iterator and populate the array.

For most workloads, this performance difference is negligible but it does make a measurable difference in optimizer performance for queries that operate over very wide schemas (such as the benchmark queries in #13456).
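
For illustration, here is a minimal, self-contained sketch of the before/after pattern described above (the actual change lives inside `TreeNode`'s product-mapping helper; the object and method names below are illustrative, not Spark code):

```scala
// Sketch only: mapping over a Product's elements with and without a temporary collection.
object ProductMapSketch {
  // Before: productIterator.map(...).toArray allocates an ArrayBuilder and grows a
  // temporary buffer before producing the final array.
  def mapViaIterator(p: Product, f: Any => Any): Array[Any] =
    p.productIterator.map(f).toArray

  // After: the output size is already known (productArity), so allocate the array
  // up-front and fill it with a while loop, skipping the intermediate builder.
  def mapPreallocated(p: Product, f: Any => Any): Array[Any] = {
    val arr = new Array[Any](p.productArity)
    var i = 0
    while (i < arr.length) {
      arr(i) = f(p.productElement(i))
      i += 1
    }
    arr
  }

  def main(args: Array[String]): Unit = {
    val node = ("a", 1, true) // any case class or tuple is a Product
    println(mapPreallocated(node, _.toString).mkString(", ")) // prints: a, 1, true
  }
}
```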

Perf results (from #13456 benchmarks)

Before

Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz

parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
1 select expressions                            19 /   22          0.0    19119858.0       1.0X
10 select expressions                           23 /   25          0.0    23208774.0       0.8X
100 select expressions                          55 /   73          0.0    54768402.0       0.3X
1000 select expressions                        229 /  259          0.0   228606373.0       0.1X
2500 select expressions                        530 /  554          0.0   529938178.0       0.0X

After

parsing large select:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
1 select expressions                            15 /   21          0.0    14978203.0       1.0X
10 select expressions                           22 /   27          0.0    22492262.0       0.7X
100 select expressions                          48 /   64          0.0    48449834.0       0.3X
1000 select expressions                        189 /  208          0.0   189346428.0       0.1X
2500 select expressions                        429 /  449          0.0   428943897.0       0.0X


marmbrus commented Jun 3, 2016

LGTM

How big of a difference in the benchmarks?

@JoshRosen
Contributor Author

About 20%, give or take.


rxin commented Jun 3, 2016

Can you include some perf data in the description?

@JoshRosen
Contributor Author

Updated.


SparkQA commented Jun 3, 2016

Test build #59901 has finished for PR 13484 at commit da79ec6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Want to do more benchmarking to investigate something so don't merge yet.
On Thu, Jun 2, 2016 at 7:27 PM UCB AMPLab notifications@github.com wrote:

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59901/
Test PASSed.



@JoshRosen JoshRosen force-pushed the treenode-productiterator-map branch from da79ec6 to b3e63bb Compare June 3, 2016 18:12
val arr = Array.ofDim[B](productArity)
var i = 0
while (i < arr.length) {
  arr(i) = f(productElement(i))
  i += 1
}
arr
Contributor Author


It turns out that ProductIterator was maintaining its own internal counter and was calling productElement(i):

 /** An iterator over all the elements of this product.
   *  @return     in the default implementation, an `Iterator[Any]`
   */
  def productIterator: Iterator[Any] = new scala.collection.AbstractIterator[Any] {
    private var c: Int = 0
    private val cmax = productArity
    def hasNext = c < cmax
    def next() = { val result = productElement(c); c += 1; result }
  }

Calling productElement ourselves here avoids another layer of object allocation and buys a small additional performance boost.


ericl commented Jun 3, 2016

LGTM too, probably this will be even more of an improvement once we get the other bottlenecks fixed


SparkQA commented Jun 3, 2016

Test build #59946 has finished for PR 13484 at commit b3e63bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

LGTM2

@JoshRosen
Contributor Author

Alright, I'm going to merge this into master and 2.0.

asfgit pushed a commit that referenced this pull request Jun 3, 2016
Author: Josh Rosen <joshrosen@databricks.com>

Closes #13484 from JoshRosen/treenode-productiterator-map.

(cherry picked from commit e526913)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@asfgit asfgit closed this in e526913 Jun 3, 2016
@JoshRosen JoshRosen deleted the treenode-productiterator-map branch June 3, 2016 21:05