
[SPARK-13790] Speed up ColumnVector's getDecimal #11624

Closed

wants to merge 3 commits into master from nongli:spark-13790

Conversation

@nongli (Contributor) commented Mar 10, 2016

What changes were proposed in this pull request?

We should reuse an object similar to the other non-primitive type getters. For
a query that computes averages over decimal columns, this shows a 10% speedup
on overall query times.
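
The idea above can be sketched as follows. This is not Spark's actual ColumnVector API; `MutableDecimal` and `DecimalColumn` are hypothetical stand-ins that model a getter reusing one mutable holder across calls instead of allocating a new object per row, the way the other non-primitive getters already do:

```java
// Sketch only: a minimal model of reusing one mutable holder across
// getDecimal() calls instead of allocating a fresh object per row.
public class ReuseDemo {
    // Hypothetical mutable decimal holder (Spark's Decimal plays this role).
    static final class MutableDecimal {
        long unscaled;
        int scale;
        void set(long unscaled, int scale) { this.unscaled = unscaled; this.scale = scale; }
        @Override public String toString() {
            return java.math.BigDecimal.valueOf(unscaled, scale).toPlainString();
        }
    }

    // Hypothetical decimal column backed by unscaled longs.
    static final class DecimalColumn {
        private final long[] unscaled;                               // per-row unscaled values
        private final int scale;
        private final MutableDecimal cached = new MutableDecimal();  // reused holder

        DecimalColumn(long[] unscaled, int scale) {
            this.unscaled = unscaled;
            this.scale = scale;
        }

        // Repoints the cached holder at rowId and returns it: zero allocation
        // per call, but every call returns the same object.
        MutableDecimal getDecimal(int rowId) {
            cached.set(unscaled[rowId], scale);
            return cached;
        }
    }

    public static void main(String[] args) {
        DecimalColumn col = new DecimalColumn(new long[]{1234, 9900}, 2);
        MutableDecimal a = col.getDecimal(0);
        System.out.println(a);        // prints 12.34
        MutableDecimal b = col.getDecimal(1);
        System.out.println(b);        // prints 99.00
        System.out.println(a == b);   // prints true: same reused instance
    }
}
```

The saving is one allocation (and the resulting GC pressure) per row accessed, which is where the benchmark's per-row gain comes from.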

How was this patch tested?

Existing tests and this benchmark:

```
TPCDS Snappy:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------
q27-agg (master)                       10627 / 11057         10.8          92.3
q27-agg (this patch)                     9722 / 9832         11.8          84.4
```

@SparkQA commented Mar 10, 2016

Test build #52800 has finished for PR 11624 at commit 3b5ad12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Mar 10, 2016

cc @davies

@davies (Contributor) commented Mar 10, 2016

Right now, it's not safe to re-use the objects from the reader or from UnsafeRow, because some expressions may hold on to the object (for example, aggregation without a grouping key, and some string functions). That's why we accept the cost of creating a new object every time we access a UTF8String/Decimal/Array/Map/Struct, and have not optimized it yet.

I tried this patch locally: I generated a Parquet file with one decimal column, then read it back and aggregated with max(d) and min(d). The min(d) returns the wrong result:

```
>>> sqlContext.sql("select min(d), max(d) from t").show()
+------+------+
|min(d)|max(d)|
+------+------+
|  0.00| 99.00|
+------+------+
>>> sqlContext.sql("select min(d), max(d) from t2").show()
+------+------+
|min(d)|max(d)|
+------+------+
| 24.00| 99.00|
+------+------+
```

t1 is the table before saving as a Parquet file; t2 is the table loaded back from the Parquet file.

In order to have these optimizations, we need to prove that we always make a copy before holding a reference to an object that could be re-used. There are still some places where we use MutableGenericInternalRow; we should also make the copy when updating it.

If we only re-use the object for the new Parquet reader but make copies everywhere else, this may cause a performance regression for other data sources.
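
The hazard described above can be reproduced with a small sketch (hypothetical `MutableDecimal`/`getDecimal`, not Spark's classes). A `min` that holds the getter's returned reference without copying ends up aliasing the shared holder, so it reports the last row rather than the true minimum; the column values below are chosen to mirror the 24.00-instead-of-0.00 result in the session above:

```java
// Sketch only: demonstrates why holding a reference to a reused object
// (instead of copying it) corrupts an aggregate like min().
public class AliasingDemo {
    static final class MutableDecimal {
        long unscaled;
        final int scale;
        MutableDecimal(long unscaled, int scale) { this.unscaled = unscaled; this.scale = scale; }
        MutableDecimal copy() { return new MutableDecimal(unscaled, scale); }
    }

    // One shared holder, repointed at each row by the getter.
    static final MutableDecimal cached = new MutableDecimal(0, 2);

    static MutableDecimal getDecimal(long[] col, int rowId) {
        cached.unscaled = col[rowId];
        return cached;
    }

    public static void main(String[] args) {
        long[] col = {1200, 9900, 2400};  // 12.00, 99.00, 24.00

        // Buggy min: retains the returned reference without copying.
        MutableDecimal min = null;
        for (int i = 0; i < col.length; i++) {
            MutableDecimal d = getDecimal(col, i);
            if (min == null || d.unscaled < min.unscaled) min = d;  // aliases `cached`
        }
        // `min` IS `cached`, so it always holds the last row read.
        System.out.println("buggy min = " + min.unscaled);   // prints 2400 (i.e. 24.00)

        // Correct min: copies before holding the reference.
        MutableDecimal safeMin = null;
        for (int i = 0; i < col.length; i++) {
            MutableDecimal d = getDecimal(col, i);
            if (safeMin == null || d.unscaled < safeMin.unscaled) safeMin = d.copy();
        }
        System.out.println("safe min = " + safeMin.unscaled); // prints 1200 (i.e. 12.00)
    }
}
```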

@nongli (Contributor, Author) commented Mar 10, 2016

Noted. The object reuse was not the slow part. Here's a variant that doesn't do the expensive checking, and the performance improvement is the same.
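
The exact variant is not shown in this thread, so the following is only an illustrative sketch of the general idea: when the reader has already enforced the column's declared precision, a fast constructor can skip the per-call overflow check (Spark's `Decimal.createUnsafe` takes this approach for long-backed decimals). The method names here are hypothetical:

```java
// Sketch only: a checked decimal constructor vs. an unchecked fast path
// that trusts the reader to have already validated precision.
public class FastPathDemo {
    static final long[] POW10 = {1L, 10L, 100L, 1_000L, 10_000L, 100_000L};

    // Checked path: validates that the unscaled value fits the precision.
    static long fromUnscaledChecked(long unscaled, int precision) {
        if (Math.abs(unscaled) >= POW10[precision]) {
            throw new ArithmeticException("does not fit precision " + precision);
        }
        return unscaled;
    }

    // Unchecked fast path: no branch, no exception setup; safe only because
    // the on-disk data was written within the column's precision.
    static long fromUnscaledUnsafe(long unscaled) {
        return unscaled;
    }

    public static void main(String[] args) {
        long u = 9234;  // 92.34 at scale 2, precision 4
        System.out.println(fromUnscaledChecked(u, 4) == fromUnscaledUnsafe(u));  // prints true
    }
}
```

The point is that the per-row validation, not the allocation itself, was the dominant cost, so the unchecked path recovers the benchmark gain without the aliasing hazard of object reuse.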

@davies (Contributor) commented Mar 10, 2016

Could you also update the UnsafeRow to use this new API?

@SparkQA commented Mar 10, 2016

Test build #52847 has finished for PR 11624 at commit 50e1244.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor) commented Mar 10, 2016

LGTM, merging this into master, thanks!

@SparkQA commented Mar 10, 2016

Test build #52849 has finished for PR 11624 at commit af7ca2d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in 747d2f5 on Mar 10, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016

Author: Nong Li <nong@databricks.com>

Closes apache#11624 from nongli/spark-13790.
@nongli deleted the spark-13790 branch on March 23, 2016