[SPARK-11149] [SQL] Improve cache performance for primitive types #9145

davies · 2015-10-16T00:57:25Z

This PR improves cache performance by:

- Generating an Iterator that takes an Iterator[CachedBatch] as input and calls the accessors directly (with the per-column loop unrolled), which avoids the expensive Iterator.flatMap.
- Using Unsafe.getInt/getLong/getFloat/getDouble instead of ByteBuffer.getInt/getLong/getFloat/getDouble; the latter actually reads byte by byte.
- Removing the unnecessary copy() in Coalesce(). This is unrelated to the memory cache, but was found during benchmarking.
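To make the second point concrete, here is a rough Python sketch of the difference between assembling a 32-bit int one byte at a time and reading it as a single word. It is not the PR's code (which uses Unsafe on the JVM); the function names are hypothetical and the buffer is assumed to be little-endian.

```python
import struct


def read_int_bytewise(buf, offset):
    """Assemble a little-endian signed 32-bit int one byte at a time."""
    value = (buf[offset]
             | (buf[offset + 1] << 8)
             | (buf[offset + 2] << 16)
             | (buf[offset + 3] << 24))
    # Reinterpret the unsigned result as a signed 32-bit value.
    return value - (1 << 32) if value & 0x80000000 else value


def read_int_word(buf, offset):
    """Read the same value with a single 4-byte access."""
    return struct.unpack_from("<i", buf, offset)[0]


# A tiny "column" of four ints packed back to back (illustrative data only).
column = struct.pack("<4i", 1, -2, 300000, 2**31 - 1)
for i in range(4):
    assert read_int_bytewise(column, 4 * i) == read_int_word(column, 4 * i)
```

Both functions return the same values; the byte-wise version simply does four loads, three shifts, and a sign fix-up per value where the word read does one access.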

The following benchmark shows that this speeds up scanning the columnar cache for int columns by 2x.

import time

path = '/opt/tpcds/store_sales/'
int_cols = ['ss_sold_date_sk', 'ss_sold_time_sk', 'ss_item_sk', 'ss_customer_sk']
df = sqlContext.read.parquet(path).select(int_cols).cache()
df.count()  # materialize the cache before timing

# time a full scan of the cached columns
t = time.time()
print(df.select("*")._jdf.queryExecution().toRdd().count())
print(time.time() - t)

SparkQA · 2015-10-16T01:35:43Z

Test build #43824 has finished for PR 9145 at commit 7ee54a9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class ColumnarIterator extends Iterator[InternalRow]
- class SpecificColumnarIterator extends $

SparkQA · 2015-10-16T07:12:57Z

Test build #43834 has finished for PR 9145 at commit 8a49887.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class ColumnarIterator extends Iterator[InternalRow]
- class SpecificColumnarIterator extends $

SparkQA · 2015-10-16T09:42:21Z

Test build #43836 has finished for PR 9145 at commit 1ef3e18.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class ColumnarIterator extends Iterator[InternalRow]
- class SpecificColumnarIterator extends $

rxin · 2015-10-16T18:38:31Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala

@@ -44,11 +45,13 @@ private class CodeFormatter {
    } else {
      indentString
    }
+    code.append(f"${currentLine}%03d ")


As discussed offline, we can wrap the line number in /* ... */ so the generated code can still be pasted into an IDE.
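A minimal Python sketch of that suggestion, assuming the same three-digit zero-padded numbering as the `%03d` in the diff above. The helper name `add_line_numbers` is hypothetical and this is not the CodeFormatter implementation.

```python
def add_line_numbers(code):
    """Prefix each line with its 1-based number inside a /* ... */ comment."""
    lines = code.split("\n")
    return "\n".join(
        "/* %03d */ %s" % (number, line)
        for number, line in enumerate(lines, start=1)
    )


generated = "public int f() {\n  return 1;\n}"
print(add_line_numbers(generated))
# /* 001 */ public int f() {
# /* 002 */   return 1;
# /* 003 */ }
```

Because the numbers sit inside block comments, the numbered output is still valid Java and can be compiled or pasted into an IDE unchanged.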


rxin · 2015-10-16T18:57:42Z

sql/core/src/main/scala/org/apache/spark/sql/columnar/GenerateColumnAccessor.scala

+
+  protected def create(columnTypes: Seq[DataType]): ColumnarIterator = {
+    val ctx = newCodeGenContext()
+    val (creaters, accesses) = columnTypes.zipWithIndex.map { case (dt, index) =>


Actually, it would probably be clearer to call these initializeAccessors and extractors.
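For readers unfamiliar with the pattern under review: the snippet maps over the column types and produces, per column, one piece of code that sets up an accessor and one that extracts a value, so the generated iterator needs no per-column loop at runtime. A toy Python sketch of that unrolling idea follows; the function name, the Java-like templates, and the identifiers inside them are illustrative assumptions, not the PR's generated code.

```python
def generate_accessor_code(column_types):
    """Return (initialize_accessors, extractors) source text, one line per column."""
    initialize_accessors = []
    extractors = []
    for index, type_name in enumerate(column_types):
        accessor = "accessor%d" % index
        # Hypothetical template: one accessor field initialized per column.
        initialize_accessors.append(
            "%s = new %sColumnAccessor(buffers[%d]);" % (accessor, type_name, index))
        # Hypothetical template: one straight-line extract call per column.
        extractors.append("%s.extractTo(mutableRow, %d);" % (accessor, index))
    return "\n".join(initialize_accessors), "\n".join(extractors)


init_code, extract_code = generate_accessor_code(["Int", "Long"])
print(init_code)
print(extract_code)
```

The point of the sketch is only the shape of the output: the loop over columns runs once at generation time, and the emitted code is a flat sequence of statements.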

SparkQA · 2015-10-16T21:22:34Z

Test build #43844 has finished for PR 9145 at commit 4511781.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class ColumnarIterator extends Iterator[InternalRow]
- class SpecificColumnarIterator extends $

rxin · 2015-10-19T20:31:59Z

LGTM - although I didn't look super closely so might be good for an extra pair of eyes too.

SparkQA · 2015-10-19T22:23:25Z

Test build #43936 has finished for PR 9145 at commit f9151cc.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class ColumnarIterator extends Iterator[InternalRow]
- class SpecificColumnarIterator extends $

davies · 2015-10-20T21:01:28Z

I'm going to merge this first to unblock my work on outputting UnsafeRow from the columnar cache, because having the Unsafe format inside a MutableRow could result in unexpected behavior. Any new comments will be addressed in a follow-up PR.

tedyu · 2015-10-20T22:47:59Z

sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala

  }

  override def extract(buffer: ByteBuffer, row: MutableRow, ordinal: Int): Unit = {
-    row.setDouble(ordinal, buffer.getDouble())
+    row.setDouble(ordinal, ByteBufferHelper.getDouble(buffer))
  }

  override def setField(row: MutableRow, ordinal: Int, value: Double): Unit = {


Around line 332, there is a call to buffer.getShort().

Is it worth adding a corresponding method to ByteBufferHelper?

If so, I can send a PR.

Thanks

davies · 2015-10-20

I don't think it's worth it.

cloud-fan · 2015-10-21T05:39:31Z

LGTM

Davies Liu added 2 commits October 15, 2015 17:01

speedup reading from ByteBuffer

cea0e33

codegen

7ee54a9

fix empty partition

8a49887

davies force-pushed the byte_buffer branch from df7b725 to 8a49887 on October 16, 2015 06:24

fix udt

1ef3e18

fix test

9610766

davies changed the title from "[WIP] Improve cache performance for primitive types" to "[SPARK-11149] [SQL] Improve cache performance for primitive types" on Oct 16, 2015

rxin reviewed Oct 16, 2015

put the line numbers into comments so the code could be compiled directly

4511781

rxin reviewed Oct 16, 2015

address comments

f9151cc

asfgit closed this in 06e6b76 Oct 20, 2015

tedyu reviewed Oct 20, 2015