[SPARK-11243] [SQL] output UnsafeRow from columnar cache #9203

davies · 2015-10-21T19:13:11Z

This PR changes InMemoryTableScan to output UnsafeRow, and optimizes unrolling and scanning by copying the bytes of var-length types directly between UnsafeRow and ByteBuffer, without creating wrapper objects. When scanning the decimals in the TPC-DS store_sales table, it's 80% faster (each decimal is copied as a long, without creating Decimal objects).
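To illustrate the "copy it as long" idea, here is a minimal sketch and not the code from this patch. It assumes the decimal's unscaled value fits in a 64-bit long and is stored as such in the column buffer. `DecimalCopySketch`, `FixedWidthRow`, `copyViaObject` and `copyAsLong` are hypothetical names; `java.math.BigDecimal` stands in for Spark's `Decimal`, and a plain `long[]` stands in for the row's fixed-width region.

```java
import java.math.BigDecimal;
import java.nio.ByteBuffer;

final class DecimalCopySketch {
    /** Hypothetical fixed-width row: one 8-byte slot per field. */
    static final class FixedWidthRow {
        final long[] slots;

        FixedWidthRow(int numFields) {
            this.slots = new long[numFields];
        }
    }

    /** Object path: builds a BigDecimal per value, then unwraps it again. */
    static void copyViaObject(ByteBuffer column, FixedWidthRow row, int ordinal, int scale) {
        BigDecimal decimal = BigDecimal.valueOf(column.getLong(), scale);
        row.slots[ordinal] = decimal.unscaledValue().longValueExact();
    }

    /** Direct path: moves the unscaled long with no intermediate object. */
    static void copyAsLong(ByteBuffer column, FixedWidthRow row, int ordinal) {
        row.slots[ordinal] = column.getLong();
    }

    public static void main(String[] args) {
        // Two decimals with scale 2, stored as unscaled longs: 123.45 and -0.07.
        ByteBuffer column = ByteBuffer.allocate(16).putLong(12345L).putLong(-7L);
        column.flip();

        FixedWidthRow row = new FixedWidthRow(2);
        copyAsLong(column, row, 0);
        copyAsLong(column, row, 1);
        System.out.println(row.slots[0] + " " + row.slots[1]);
    }
}
```

Both paths leave the same 8 bytes in the row; the direct path simply skips one allocation per value, which is the kind of saving the description attributes the speedup to.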

SparkQA · 2015-10-21T19:23:33Z

Test build #44083 has finished for PR 9203 at commit 15ebb3b.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * abstract class ColumnarIterator extends Iterator[InternalRow]
 * class MutableUnsafeRow(val writer: UnsafeRowWriter) extends GenericMutableRow(null)
 * class SpecificColumnarIterator extends $

rxin · 2015-10-21T21:27:41Z

...atalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java

As discussed offline, maybe change this to an assert?

Will do in a separate PR.

JoshRosen · 2015-10-21T21:47:07Z

I'm really excited to see how much of a performance difference this makes when scanning string columns, since this could potentially provide a big perf. boost to the sc.textFile replacement that I showed the other day.

rxin · 2015-10-21T21:48:06Z

...atalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java

Do we ever start from the middle of an existing buffer? If not, can we move startingOffset into holder class?

Yes. In UnsafeProjection, multiple UnsafeRowWriters can share a single BufferHolder.
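A minimal sketch of why the starting offset lives in the writer rather than in the holder, assuming a simplified layout with one 8-byte slot per field and no null bits. `SharedBufferSketch`, `Holder` and `RowWriter` are hypothetical stand-ins loosely modeled on the classes named above, not their real APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SharedBufferSketch {
    /** Hypothetical growable buffer shared by several writers. */
    static final class Holder {
        byte[] buffer = new byte[32];
        int cursor = 0;

        void ensureCapacity(int extra) {
            if (cursor + extra > buffer.length) {
                buffer = Arrays.copyOf(buffer, Math.max(buffer.length * 2, cursor + extra));
            }
        }
    }

    /** Hypothetical row writer: remembers where its own row starts in the shared buffer. */
    static final class RowWriter {
        private final Holder holder;
        private final int startingOffset;

        RowWriter(Holder holder, int numFields) {
            this.holder = holder;
            this.startingOffset = holder.cursor; // may be in the middle of the buffer
            holder.ensureCapacity(8 * numFields);
            holder.cursor += 8 * numFields;      // reserve the fixed-width slots
        }

        int startingOffset() {
            return startingOffset;
        }

        /** Writes a little-endian long into this row's slot for the given field. */
        void writeLong(int ordinal, long value) {
            int pos = startingOffset + 8 * ordinal;
            for (int i = 0; i < 8; i++) {
                holder.buffer[pos + i] = (byte) (value >>> (8 * i));
            }
        }

        long readLong(int ordinal) {
            int pos = startingOffset + 8 * ordinal;
            long value = 0L;
            for (int i = 0; i < 8; i++) {
                value |= (holder.buffer[pos + i] & 0xFFL) << (8 * i);
            }
            return value;
        }

        /** Appends var-length bytes at the cursor and records (relative offset, length) in the slot. */
        void writeBytes(int ordinal, byte[] data) {
            holder.ensureCapacity(data.length);
            System.arraycopy(data, 0, holder.buffer, holder.cursor, data.length);
            long relativeOffset = holder.cursor - startingOffset;
            writeLong(ordinal, (relativeOffset << 32) | data.length);
            holder.cursor += data.length;
        }
    }

    public static void main(String[] args) {
        Holder holder = new Holder();
        RowWriter outer = new RowWriter(holder, 2);
        RowWriter inner = new RowWriter(holder, 1); // starts after outer's slots
        inner.writeLong(0, 42L);
        outer.writeBytes(1, "spark".getBytes(StandardCharsets.UTF_8));
        System.out.println(outer.startingOffset() + " " + inner.startingOffset());
    }
}
```

Because each writer keeps its own starting offset, the second writer can begin mid-buffer while the holder only tracks the shared cursor.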

rxin · 2015-10-21T22:06:43Z

LGTM otherwise.

SparkQA · 2015-10-21T22:24:51Z

Test build #44089 has finished for PR 9203 at commit 81e3fad.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class MutableUnsafeRow(val writer: UnsafeRowWriter) extends GenericMutableRow(null)

davies · 2015-10-21T22:38:51Z

@JoshRosen If we only scan the cached strings and access them (without adding a ConvertToUnsafe), this could be a little slower, because we now copy the bytes into the UnsafeRow. The UTF8String object is still created either way: before this patch it was created in ColumnAccessor; after this patch it is created in UnsafeRow.getUTF8String() (we could actually re-use that).

SparkQA · 2015-10-21T22:39:50Z

Test build #44092 has finished for PR 9203 at commit e33170d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class MutableUnsafeRow(val writer: UnsafeRowWriter) extends GenericMutableRow(null)

cloud-fan · 2015-10-22T00:38:49Z

...alyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java

Why implement these primitive write methods in the writer class rather than in codegen?

Those in UnsafeRowWriter are needed for ColumnAccessor. The purpose of these *Writer classes is to simplify the generated code (which is already hard to understand); these methods also go in this direction.

In terms of performance, I think there will be no difference: they can all be JITed and inlined.

I think it is simpler to put them there rather than in generated code.
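As a rough illustration of that trade-off, the sketch below contrasts what a code generator might emit for one field write in each style. Everything here is hypothetical: `WriterCodegenSketch`, `emitInline`, `emitViaWriter`, and the emitted identifiers (`buffer`, `startingOffset`, `nullBitsSize`, `rowWriter`) are made up for the example and are not Spark's generated code.

```java
final class WriterCodegenSketch {
    /** Emits the offset arithmetic inline for every field write. */
    static String emitInline(int ordinal, String valueExpr) {
        return "buffer.putLong(startingOffset + nullBitsSize + " + (8 * ordinal)
            + ", " + valueExpr + ");";
    }

    /** Emits a single call and leaves the offset arithmetic to the writer class. */
    static String emitViaWriter(int ordinal, String valueExpr) {
        return "rowWriter.write(" + ordinal + ", " + valueExpr + ");";
    }

    public static void main(String[] args) {
        System.out.println(emitInline(2, "value_2"));
        System.out.println(emitViaWriter(2, "value_2"));
    }
}
```

The second form keeps layout details out of the generated source, and the same method can be called from hand-written code as well.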

SparkQA · 2015-10-22T01:22:45Z

Test build #44098 has finished for PR 9203 at commit 1421bce.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class MutableUnsafeRow(val writer: UnsafeRowWriter) extends GenericMutableRow(null)

SparkQA · 2015-10-22T01:43:56Z

Test build #44100 has finished for PR 9203 at commit cab9286.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * class MutableUnsafeRow(val writer: UnsafeRowWriter) extends GenericMutableRow(null)

rxin · 2015-10-22T02:20:11Z

I'm going to merge this first. @davies please take a look at @cloud-fan's comments, and address them in follow-up PRs if necessary.

output UnsafeRow from columnar cache

15ebb3b

Davies Liu added 2 commits October 21, 2015 13:28

fix style and refactor

81e3fad

fix code style

e33170d

rxin reviewed Oct 21, 2015

Davies Liu added 3 commits October 21, 2015 16:22

address comments

c76c759

Revert "address comments"

f0eb10c

This reverts commit c76c759.

address comments

cab9286

davies force-pushed the unsafe_cache branch from 1421bce to cab9286 Compare October 21, 2015 23:30

cloud-fan reviewed Oct 22, 2015

asfgit closed this in 1d97332 Oct 22, 2015

5 participants