[SPARK-10917] [SQL] improve performance of complex type in columnar cache by davies · Pull Request #8971 · apache/spark

davies · 2015-10-03T00:54:43Z

This PR improves the performance of complex types in columnar cache by using UnsafeProjection instead of KryoSerializer.

A simple benchmark shows that this PR could improve the performance of scanning a cached table with complex columns by 15x (compared to Spark 1.5).

Here is the code used to benchmark:

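# Assumes the PySpark shell, where sc and sqlContext are predefined.
import time
from pyspark.sql import Row

# Build 1 << 23 (about 8.4M) rows, each with a struct, an array, and a map column.
df = sc.range(1<<23).map(lambda i: Row(a=Row(b=i, c=str(i)), d=list(range(10)), e=dict(zip(range(10), [str(j) for j in range(10)])))).toDF()
df.write.parquet("table")

df = sqlContext.read.parquet("table")
df.cache()
df.count()  # materialize the columnar cache before timing
# Time a full scan of the cached table at the internal-row level.
t = time.time()
print(df.select("*")._jdf.queryExecution().toRdd().count())
print(time.time() - t)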

SparkQA · 2015-10-03T01:18:42Z

Test build #43203 has finished for PR 8971 at commit 713c884.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-10-05T17:48:39Z

Test build #43241 has finished for PR 8971 at commit 59bb2f9.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-10-05T20:41:30Z

Test build #1843 has finished for PR 8971 at commit 59bb2f9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2015-10-05T21:26:39Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java

The sizeInBytes is not aligned to words.
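
For readers unfamiliar with the term, here is a minimal sketch of what "aligned to words" means for a byte size. It assumes 8-byte words; `WORD_SIZE`, `round_up_to_word`, and `is_word_aligned` are hypothetical names used for illustration, not code from this PR.

```python
WORD_SIZE = 8  # bytes per word; an assumption (64-bit words)


def round_up_to_word(num_bytes):
    """Round a byte count up to the next multiple of WORD_SIZE."""
    return (num_bytes + WORD_SIZE - 1) // WORD_SIZE * WORD_SIZE


def is_word_aligned(num_bytes):
    """True if the byte count is an exact multiple of WORD_SIZE."""
    return num_bytes % WORD_SIZE == 0


# A 13-byte payload is not word-aligned; padding it would take 16 bytes.
payload_size = 13
padded_size = round_up_to_word(payload_size)
```

Code that copies or compares data one word at a time can only do so safely when the size is a multiple of the word size, which is why an unaligned `sizeInBytes` is worth calling out.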

SparkQA · 2015-10-05T21:50:30Z

Test build #43249 has finished for PR 8971 at commit f9a3502.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class BufferHolder
- public class UnsafeArrayWriter
- public class UnsafeRowWriter

davies · 2015-10-05T21:56:45Z

@liancheng Could you help to review this?

SparkQA · 2015-10-05T23:59:17Z

Test build #43253 has finished for PR 8971 at commit 3ad1e9c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class BufferHolder
- public class UnsafeArrayWriter
- public class UnsafeRowWriter

SparkQA · 2015-10-06T01:32:04Z

Test build #43257 has finished for PR 8971 at commit 3b9f59e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

Conflicts:
- sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java
- sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala
- sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnBuilder.scala
- sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
- sql/core/src/main/scala/org/apache/spark/sql/columnar/compression/CompressionScheme.scala
- sql/core/src/test/scala/org/apache/spark/sql/columnar/ColumnTypeSuite.scala
- sql/core/src/test/scala/org/apache/spark/sql/columnar/ColumnarTestUtils.scala
- sql/core/src/test/scala/org/apache/spark/sql/columnar/NullableColumnAccessorSuite.scala
- sql/core/src/test/scala/org/apache/spark/sql/columnar/NullableColumnBuilderSuite.scala

SparkQA · 2015-10-06T18:02:31Z

Test build #43276 has finished for PR 8971 at commit 54479f1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-10-06T19:21:02Z

Test build #43281 has finished for PR 8971 at commit d0be9e4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2015-10-07T05:09:04Z

sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala

can we save the length header for decimal? i.e. always write 16 bytes.

The constructor of BigInteger needs to know the number of bytes, so it would become even more complicated. And most Decimals will be smaller than 8 bytes, even with a precision of 38.
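
The trade-off in this exchange can be sketched in a few lines of Python. This is only an illustration of the two layouts being discussed, not the encoding used in ColumnType.scala: the 4-byte length header and all function names are assumptions, and `min_signed_bytes` mirrors the byte count that Java's `BigInteger.toByteArray` would produce.

```python
import struct


def min_signed_bytes(unscaled):
    """Bytes needed for `unscaled` in two's complement, including the sign bit."""
    bits = (unscaled if unscaled >= 0 else ~unscaled).bit_length()
    return bits // 8 + 1


def encode_with_header(unscaled):
    """Variable-length layout: assumed 4-byte length header, then the bytes."""
    n = min_signed_bytes(unscaled)
    return struct.pack(">i", n) + unscaled.to_bytes(n, "big", signed=True)


def decode_with_header(buf):
    """Read the length header first, then exactly that many value bytes."""
    (n,) = struct.unpack_from(">i", buf, 0)
    return int.from_bytes(buf[4:4 + n], "big", signed=True)


def encode_fixed16(unscaled):
    """Fixed layout: always 16 bytes, no header, sign-extended padding."""
    return unscaled.to_bytes(16, "big", signed=True)


small = 12345            # a typical small unscaled value
largest = 10**38 - 1     # largest unscaled value at precision 38
```

With these definitions, `small` takes 6 bytes with a header versus 16 in the fixed layout, while `largest` takes 20 bytes with a header versus 16 fixed. The header only loses on values that actually need all 16 bytes, which matches the point that most Decimals are small.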

liancheng · 2015-10-07T18:16:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala

This seems to be a bug that is worth a separate JIRA ticket.

https://issues.apache.org/jira/browse/SPARK-10980

liancheng · 2015-10-07T18:58:03Z

LGTM except for a few minor issues.

cloud-fan · 2015-10-07T18:58:10Z

sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnBuilder.scala

For non-compact decimal, we use ByteArrayColumnType, so maybe use ObjectColumnStats here?

No, we may want to have min/max of Decimal
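
To make the reasoning concrete, here is a rough sketch of why min/max statistics are worth keeping for a column. `MinMaxStats` and its methods are hypothetical and are not Spark's ColumnStats classes; the assumption is only that per-batch bounds let a scan skip cached batches that cannot match a predicate.

```python
from decimal import Decimal


class MinMaxStats:
    """Hypothetical per-batch stats that track bounds as well as counts."""

    def __init__(self):
        self.lower = None
        self.upper = None
        self.null_count = 0
        self.count = 0

    def gather(self, value):
        """Update counts and bounds with one column value (None means null)."""
        self.count += 1
        if value is None:
            self.null_count += 1
            return
        if self.lower is None or value < self.lower:
            self.lower = value
        if self.upper is None or value > self.upper:
            self.upper = value

    def may_contain_greater_than(self, threshold):
        """False means the batch can be skipped for `col > threshold`."""
        return self.upper is not None and self.upper > threshold


stats = MinMaxStats()
for v in [Decimal("1.50"), None, Decimal("99.99"), Decimal("-3")]:
    stats.gather(v)
```

A stats collector that records only counts and sizes could never answer `may_contain_greater_than`, so every batch would have to be scanned.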

SparkQA · 2015-10-07T19:43:37Z

Test build #43329 has finished for PR 8971 at commit 9c5718d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2015-10-07T20:30:15Z

sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala

maybe LargeDecimalColumnAccessor according to the renaming change?

SparkQA · 2015-10-07T22:30:24Z

Test build #43341 has finished for PR 8971 at commit 4f0a94e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2015-10-07T22:59:02Z

Merged into master. Other nit comments will be addressed in a follow-up PR, thanks!

improve performance of complex type in columnar cache

713c884

fix mima

59bb2f9

Davies Liu added 2 commits October 5, 2015 14:24

fix tests

a791ba3

Merge branch 'master' of github.com:apache/spark into complex

f9a3502

davies reviewed Oct 5, 2015

cleanup

3ad1e9c

fix typeId

3b9f59e

davies force-pushed the complex branch from 69fe7d3 to 54479f1 on October 6, 2015 15:56

cleanup

d0be9e4

cloud-fan reviewed Oct 7, 2015

address comments

9c5718d

liancheng reviewed Oct 7, 2015

cloud-fan reviewed Oct 7, 2015

address comments

4f0a94e

cloud-fan reviewed Oct 7, 2015

asfgit closed this in 075a0b6 Oct 7, 2015

4 participants