
[SPARK-13790] Speed up ColumnVector's getDecimal #11624

Closed

wants to merge 3 commits into master from nongli:spark-13790

Conversation

@nongli (Contributor) commented Mar 10, 2016

What changes were proposed in this pull request?

We should reuse an object similar to the other non-primitive type getters. For
a query that computes averages over decimal columns, this shows a 10% speedup
on overall query times.
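
The idea above can be sketched as follows. This is not Spark's actual ColumnVector API; `MutableDecimal` and `DecimalColumn` are hypothetical stand-ins that model a getter reusing one mutable holder across calls instead of allocating a new object per row, the way the other non-primitive getters already do:

```java
// Sketch only: a minimal model of reusing one mutable holder across
// getDecimal() calls instead of allocating a fresh object per row.
public class ReuseDemo {
    // Hypothetical mutable decimal holder (Spark's Decimal plays this role).
    static final class MutableDecimal {
        long unscaled;
        int scale;
        void set(long unscaled, int scale) { this.unscaled = unscaled; this.scale = scale; }
        @Override public String toString() {
            return java.math.BigDecimal.valueOf(unscaled, scale).toPlainString();
        }
    }

    // Hypothetical decimal column backed by unscaled longs.
    static final class DecimalColumn {
        private final long[] unscaled;                               // per-row unscaled values
        private final int scale;
        private final MutableDecimal cached = new MutableDecimal();  // reused holder

        DecimalColumn(long[] unscaled, int scale) {
            this.unscaled = unscaled;
            this.scale = scale;
        }

        // Repoints the cached holder at rowId and returns it: zero allocation
        // per call, but every call returns the same object.
        MutableDecimal getDecimal(int rowId) {
            cached.set(unscaled[rowId], scale);
            return cached;
        }
    }

    public static void main(String[] args) {
        DecimalColumn col = new DecimalColumn(new long[]{1234, 9900}, 2);
        MutableDecimal a = col.getDecimal(0);
        System.out.println(a);        // prints 12.34
        MutableDecimal b = col.getDecimal(1);
        System.out.println(b);        // prints 99.00
        System.out.println(a == b);   // prints true: same reused instance
    }
}
```

The saving is one allocation (and the resulting GC pressure) per row accessed, which is where the benchmark's per-row gain comes from.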

How was this patch tested?

Existing tests and this benchmark:

```
TPCDS Snappy:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
--------------------------------------------------------------------------------
q27-agg (master)                       10627 / 11057         10.8          92.3
q27-agg (this patch)                     9722 / 9832         11.8          84.4
```

@SparkQA commented Mar 10, 2016

Test build #52800 has finished for PR 11624 at commit 3b5ad12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Mar 10, 2016

cc @davies

@davies (Contributor) commented Mar 10, 2016

Right now, it's not safe to re-use the objects from the reader or from UnsafeRow, because some expressions may hold on to the object (for example, aggregation without a grouping key, and some string functions). That's why we accept the cost of creating a new object every time we access a UTF8String/Decimal/Array/Map/Struct, and have not optimized it yet.

I tried this patch locally: I generated a Parquet file with one decimal column, then read it back and aggregated with max(d) and min(d). The min(d) returns the wrong result:

```
>>> sqlContext.sql("select min(d), max(d) from t").show()
+------+------+
|min(d)|max(d)|
+------+------+
|  0.00| 99.00|
+------+------+
>>> sqlContext.sql("select min(d), max(d) from t2").show()
+------+------+
|min(d)|max(d)|
+------+------+
| 24.00| 99.00|
+------+------+
```

t1 is the table before saving as a Parquet file; t2 is the table loaded back from the Parquet file.

In order to have these optimizations, we need to prove that we always make a copy before holding a reference to an object that could be re-used. There are still some places where we use MutableGenericInternalRow; we should also make the copy when updating it.

If we only re-use the object for the new Parquet reader but make copies everywhere else, this may cause a performance regression for other data sources.
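
The hazard described above can be reproduced with a small sketch (hypothetical `MutableDecimal`/`getDecimal`, not Spark's classes). A `min` that holds the getter's returned reference without copying ends up aliasing the shared holder, so it reports the last row rather than the true minimum; the column values below are chosen to mirror the 24.00-instead-of-0.00 result in the session above:

```java
// Sketch only: demonstrates why holding a reference to a reused object
// (instead of copying it) corrupts an aggregate like min().
public class AliasingDemo {
    static final class MutableDecimal {
        long unscaled;
        final int scale;
        MutableDecimal(long unscaled, int scale) { this.unscaled = unscaled; this.scale = scale; }
        MutableDecimal copy() { return new MutableDecimal(unscaled, scale); }
    }

    // One shared holder, repointed at each row by the getter.
    static final MutableDecimal cached = new MutableDecimal(0, 2);

    static MutableDecimal getDecimal(long[] col, int rowId) {
        cached.unscaled = col[rowId];
        return cached;
    }

    public static void main(String[] args) {
        long[] col = {1200, 9900, 2400};  // 12.00, 99.00, 24.00

        // Buggy min: retains the returned reference without copying.
        MutableDecimal min = null;
        for (int i = 0; i < col.length; i++) {
            MutableDecimal d = getDecimal(col, i);
            if (min == null || d.unscaled < min.unscaled) min = d;  // aliases `cached`
        }
        // `min` IS `cached`, so it always holds the last row read.
        System.out.println("buggy min = " + min.unscaled);   // prints 2400 (i.e. 24.00)

        // Correct min: copies before holding the reference.
        MutableDecimal safeMin = null;
        for (int i = 0; i < col.length; i++) {
            MutableDecimal d = getDecimal(col, i);
            if (safeMin == null || d.unscaled < safeMin.unscaled) safeMin = d.copy();
        }
        System.out.println("safe min = " + safeMin.unscaled); // prints 1200 (i.e. 12.00)
    }
}
```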

@nongli (Contributor, Author) commented Mar 10, 2016

Noted. The object reuse was not the slow part. Here's a variant that doesn't do the expensive checking, and the performance improvement is the same.
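
The exact variant is not shown in this thread, so the following is only an illustrative sketch of the general idea: when the reader has already enforced the column's declared precision, a fast constructor can skip the per-call overflow check (Spark's `Decimal.createUnsafe` takes this approach for long-backed decimals). The method names here are hypothetical:

```java
// Sketch only: a checked decimal constructor vs. an unchecked fast path
// that trusts the reader to have already validated precision.
public class FastPathDemo {
    static final long[] POW10 = {1L, 10L, 100L, 1_000L, 10_000L, 100_000L};

    // Checked path: validates that the unscaled value fits the precision.
    static long fromUnscaledChecked(long unscaled, int precision) {
        if (Math.abs(unscaled) >= POW10[precision]) {
            throw new ArithmeticException("does not fit precision " + precision);
        }
        return unscaled;
    }

    // Unchecked fast path: no branch, no exception setup; safe only because
    // the on-disk data was written within the column's precision.
    static long fromUnscaledUnsafe(long unscaled) {
        return unscaled;
    }

    public static void main(String[] args) {
        long u = 9234;  // 92.34 at scale 2, precision 4
        System.out.println(fromUnscaledChecked(u, 4) == fromUnscaledUnsafe(u));  // prints true
    }
}
```

The point is that the per-row validation, not the allocation itself, was the dominant cost, so the unchecked path recovers the benchmark gain without the aliasing hazard of object reuse.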

@davies (Contributor) commented Mar 10, 2016

Could you also update the UnsafeRow to use this new API?

@SparkQA commented Mar 10, 2016

Test build #52847 has finished for PR 11624 at commit 50e1244.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor) commented Mar 10, 2016

LGTM, merging this into master, thanks!

@SparkQA commented Mar 10, 2016

Test build #52849 has finished for PR 11624 at commit af7ca2d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in 747d2f5 on Mar 10, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016

Author: Nong Li <nong@databricks.com>

Closes apache#11624 from nongli/spark-13790.
@nongli deleted the spark-13790 branch on March 23, 2016