
[SPARK-13805][SQL] Generate code that get a value in each column from ColumnVector when ColumnarBatch is used #11636

Closed
wants to merge 20 commits

Conversation

@kiszk (Member) commented Mar 10, 2016

What changes were proposed in this pull request?

This PR generates code that gets a value in each column from ColumnVector instead of creating an InternalRow when a ColumnarBatch is accessed. This PR improves the benchmark program's performance by up to 15%.
This PR consists of two parts:

  1. Get a ColumnVector by using the ColumnarBatch.column() method
  2. Get the value of each column by using rdd_col${COLIDX}.getInt(ROWIDX) instead of rdd_row.getInt(COLIDX)

This is a motivating example.

    sqlContext.conf.setConfString(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
    sqlContext.conf.setConfString(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    val values = 10
    withTempPath { dir =>
      withTempTable("t1", "tempTable") {
        sqlContext.range(values).registerTempTable("t1")
        sqlContext.sql("select id % 2 as p, cast(id as INT) as id from t1")
          .write.partitionBy("p").parquet(dir.getCanonicalPath)
        sqlContext.read.parquet(dir.getCanonicalPath).registerTempTable("tempTable")
        sqlContext.sql("select sum(p) from tempTable").collect
      }
    }

The original code

    ...
    /* 072 */       while (!shouldStop() && rdd_batchIdx < numRows) {
    /* 073 */         InternalRow rdd_row = rdd_batch.getRow(rdd_batchIdx++);
    /* 074 */         /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false)], output=[sum#10L]) */
    /* 075 */         /* input[0, int] */
    /* 076 */         boolean rdd_isNull = rdd_row.isNullAt(0);
    /* 077 */         int rdd_value = rdd_isNull ? -1 : (rdd_row.getInt(0));
    ...

The code generated by this PR

    /* 072 */       while (!shouldStop() && rdd_batchIdx < numRows) {
    /* 073 */         org.apache.spark.sql.execution.vectorized.ColumnVector rdd_col0 = rdd_batch.column(0);
    /* 074 */         /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false)], output=[sum#10L]) */
    /* 075 */         /* input[0, int] */
    /* 076 */         boolean rdd_isNull = rdd_col0.getIsNull(rdd_batchIdx);
    /* 077 */         int rdd_value = rdd_isNull ? -1 : (rdd_col0.getInt(rdd_batchIdx));
    ...
    /* 128 */         rdd_batchIdx++;
    /* 129 */       }
    /* 130 */       if (shouldStop()) return;

Performance
Without this PR

model name  : Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz
Partitioned Table:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
Read data column                          434 /  488         36.3          27.6       1.0X
Read partition column                     302 /  346         52.1          19.2       1.4X
Read both columns                         588 /  643         26.8          37.4       0.7X

With this PR

model name  : Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz
Partitioned Table:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
Read data column                          392 /  516         40.1          24.9       1.0X
Read partition column                     256 /  318         61.4          16.3       1.5X
Read both columns                         523 /  539         30.1          33.3       0.7X

How was this patch tested?

Tested with existing test suites and the benchmark.

@SparkQA commented Mar 10, 2016

Test build #52846 has finished for PR 11636 at commit 0db679f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli (Contributor) commented Mar 11, 2016

This is great. Can you include the generated code snippet with a few columns?

@kiszk (Member, Author) commented Mar 11, 2016

@nongli, here is another example.

Spark code with two columns

    sqlContext.conf.setConfString(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
    sqlContext.conf.setConfString(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    val values = 10
    withTempPath { dir =>
      withTempTable("t1", "tempTable") {
        sqlContext.range(values).registerTempTable("t1")
        sqlContext.sql("select id % 2 as p, cast(id as INT) as id from t1")
          .write.partitionBy("p").parquet(dir.getCanonicalPath)
        sqlContext.read.parquet(dir.getCanonicalPath).registerTempTable("tempTable")
        sqlContext.sql("select sum(p), sum(id) from tempTable").collect
      }
    }

Code snippet generated by this PR

...
    /* 073 */   private void rdd_processBatches() throws java.io.IOException {
    /* 074 */     while (true) {
    /* 075 */       int numRows = rdd_batch.numRows();
    /* 076 */       if (rdd_batchIdx == 0) rdd_metricValue.add(numRows);
    /* 077 */
    /* 078 */       while (!shouldStop() && rdd_batchIdx < numRows) {
    /* 079 */         org.apache.spark.sql.execution.vectorized.ColumnVector rdd_col0 = rdd_batch.column(0);org.apache.spark.sql.execution.vectorized.ColumnVector rdd_col1 = rdd_batch.column(1);
    /* 080 */         /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false),(sum(cast(id#3 as bigint)),mode=Partial,isDistinct=false)], output=[sum#13L,sum#14L]) */
    /* 081 */         /* input[0, int] */
    /* 082 */         boolean rdd_isNull = rdd_col0.getIsNull(rdd_batchIdx);
    /* 083 */         int rdd_value = rdd_isNull ? -1 : (rdd_col0.getInt(rdd_batchIdx));
    /* 084 */         /* input[1, int] */
    /* 085 */         boolean rdd_isNull1 = rdd_col1.getIsNull(rdd_batchIdx);
    /* 086 */         int rdd_value1 = rdd_isNull1 ? -1 : (rdd_col1.getInt(rdd_batchIdx));
    /* 087 */
    /* 088 */         // do aggregate
...

@kiszk changed the title from "[SPARK-13805][SQL] Generate code that get a value in each column from ColumnVector when ColumnarVector is used" to "[SPARK-13805][SQL] Generate code that get a value in each column from ColumnVector when ColumnarBatch is used" on Mar 11, 2016
@SparkQA commented Mar 12, 2016

Test build #52991 has finished for PR 11636 at commit 141f2d6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 12, 2016

Test build #52992 has finished for PR 11636 at commit aa09964.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Mar 12, 2016

cc @davies

@@ -158,9 +158,13 @@ class CodegenContext {
/** The variable name of the input row in generated code. */
final var INPUT_ROW = "i"

/** The variable name of the input col in generated code. */
var INPUT_COLORDINAL = "idx"
Contributor:

INPUT_COL_ORDINAL

@kiszk (Member, Author) commented Mar 16, 2016

@nongli, is it possible to change the API name from ColumnVector.getIsNull() to ColumnVector.isNullAt()? If so, we can remove this PR's change at lines 74-84 in BoundAttribute.scala.
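For illustration, a minimal sketch (toy string templates, not Spark's actual codegen) of why the rename helps: InternalRow already exposes isNullAt(int), so if ColumnVector uses the same name, one generated form covers both input kinds.

    // Toy sketch: with a shared isNullAt(int) method, the generated null check
    // has one shape; only the index expression differs between the two paths.
    def nullCheck(input: String, index: String): String =
      s"boolean isNull = $input.isNullAt($index);"

    nullCheck("rdd_row", "0")             // row-wise: index is the column ordinal
    nullCheck("rdd_col0", "rdd_batchIdx") // columnar: index is the row position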

@nongli (Contributor) commented Mar 16, 2016

@kiszk Feel free to change that API

val value = if (!ctx.isColumnarType(ctx.INPUT_ROW)) {
  ctx.getValue(ctx.INPUT_ROW, dataType, ordinal.toString)
} else {
  ctx.getValue(ctx.INPUT_ROW, dataType, ctx.INPUT_COLORDINAL)
Contributor:

Can we move these into DataSourceScan? (They're only used there.)

Member (Author):

One possible idea is to provide a getValue() function that can be overridden, as follows.
In my opinion, the following code is not easy to read. What do you think?

In BoundReference:

def getValue(...): String = { ctx.getValue(ctx.INPUT_ROW, dataType, ordinal.toString) }

override def genCode(...): String = {
  ...
  val value = getValue(...)
  ...
}

In DataSourceScan, we pass a BoundReference that has its own getValue(), as follows:

...
val exprs = output.zipWithIndex.map(x => new BoundReference(x._2, x._1.dataType, true) {
  override def getValue(...): String = {
    if (!ctx.isColumnarType(ctx.INPUT_ROW)) {
      ctx.getValue(ctx.INPUT_ROW, dataType, ordinal.toString)
    } else {
      ctx.getValue(ctx.INPUT_ROW, dataType, ctx.INPUT_COLORDINAL)
    }
  }
})
...

Contributor:

In DataSourceScan, we do NOT need BoundReference to generate the code that accesses ColumnarBatch; we can generate the code directly.
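For illustration, a toy sketch (assumed shapes, not the PR's actual code) of generating the per-column access strings directly from the schema, with no expression tree in between:

    // Toy sketch: emit one accessor line per column from (javaType, ordinal).
    // `batchVar` and `rowIdxVar` are the generated variable names for the
    // batch prefix and the loop index; both names are illustrative.
    def directColumnAccess(javaTypes: Seq[String], batchVar: String, rowIdxVar: String): Seq[String] =
      javaTypes.zipWithIndex.map { case (jt, i) =>
        s"$jt ${batchVar}_value$i = ${batchVar}_col$i.get${jt.capitalize}($rowIdxVar);"
      }

    directColumnAccess(Seq("int", "int"), "rdd", "rdd_batchIdx")
    // int rdd_value0 = rdd_col0.getInt(rdd_batchIdx);
    // int rdd_value1 = rdd_col1.getInt(rdd_batchIdx);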

Member (Author):

Would it be good to introduce a new InputReference, similar to BoundReference, used only for the code that accesses a ColumnarBatch?
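A rough sketch of what such an InputReference might look like (hypothetical; a later test build below reports the merged signature as case class InputReference(ordinal: Int, dataType: DataType, nullable: Boolean, isColumn: Boolean)):

    // Toy model, not Spark's Catalyst classes: like BoundReference, but the
    // generated code indexes a ColumnVector by row rather than a row by ordinal.
    case class InputReference(ordinal: Int, javaType: String, nullable: Boolean) {
      def genCode(rowIdxVar: String): String = {
        val colVar = s"rdd_col$ordinal"
        val get = s"$colVar.get${javaType.capitalize}($rowIdxVar)"
        if (nullable)
          s"boolean isNull$ordinal = $colVar.isNullAt($rowIdxVar);\n" +
            s"$javaType value$ordinal = isNull$ordinal ? -1 : ($get);"
        else
          s"$javaType value$ordinal = $get;"
      }
    }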

Contributor:

Since it's only used in one place, I'd like to narrow down the changes; that makes them easier to maintain.

@SparkQA commented Mar 16, 2016

Test build #53324 has finished for PR 11636 at commit c522a68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -199,7 +210,8 @@ class CodegenContext {
case StringType => s"$input.getUTF8String($ordinal)"
case BinaryType => s"$input.getBinary($ordinal)"
case CalendarIntervalType => s"$input.getInterval($ordinal)"
case t: StructType => s"$input.getStruct($ordinal, ${t.size})"
case t: StructType => if (!isColumnarType(input)) { s"$input.getStruct($ordinal, ${t.size})" }
Contributor:

Right now, the Parquet reader does not support nested types (Array, Map, Struct), so it's fine not to have this special case in this PR.

Member (Author):

While this getStruct() for ColumnVector may not be called at runtime now, the code generator always produces two versions of the code, one for InternalRow and one for ColumnVector.
Thus, this code is necessary for now to avoid a compilation error; in the future, it will also be needed to handle nested types correctly.

Contributor:

Why not make them have the same APIs?

Member (Author):

Sure, I provided the same API: ColumnVector.getStruct(int, int).
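With the same signature on both sides, the StructType special case shown in the diff above can presumably collapse back to a single template; a minimal sketch of that idea:

    // Sketch (assumed, not the merged diff): once ColumnVector exposes
    // getStruct(int, int) like InternalRow, one template compiles for both.
    def structAccess(input: String, ordinal: String, numFields: Int): String =
      s"$input.getStruct($ordinal, $numFields)"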

@SparkQA commented Mar 17, 2016

Test build #53466 has finished for PR 11636 at commit 5544c96.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class InputReference(ordinal: Int, dataType: DataType, nullable: Boolean, isColumn: Boolean)

@SparkQA commented Mar 17, 2016

Test build #53468 has finished for PR 11636 at commit 5efadf3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InputReference(ordinal: Int, dataType: DataType, nullable: Boolean, isColumn: Boolean)

@SparkQA commented Mar 17, 2016

Test build #53470 has finished for PR 11636 at commit 8c9d054.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 18, 2016

Test build #53500 has finished for PR 11636 at commit e08472b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InputReference(

@SparkQA commented Mar 18, 2016

Test build #53537 has finished for PR 11636 at commit a7ac8fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val exprCols = output.zipWithIndex.map(
  x => new InputReference(x._2, x._1.dataType, x._1.nullable, rowidx))
val exprRows = output.zipWithIndex.map(
  x => new InputReference(x._2, x._1.dataType, x._1.nullable))
Contributor:

Should we use BoundReference here?

Member (Author):

Used BoundReference for exprRows.
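The revised mapping would then look roughly like this (a sketch assuming BoundReference's (ordinal, dataType, nullable) constructor; the InputReference signature is carried over from the diff above):

    val exprRows = output.zipWithIndex.map(
      x => BoundReference(x._2, x._1.dataType, x._1.nullable))
    val exprCols = output.zipWithIndex.map(
      x => new InputReference(x._2, x._1.dataType, x._1.nullable, rowidx))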

@SparkQA commented Mar 19, 2016

Test build #53584 has finished for PR 11636 at commit cdd3078.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 21, 2016

Test build #53674 has finished for PR 11636 at commit fb693d2.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli (Contributor) commented Mar 21, 2016

Have you rerun the benchmark with these changes?

@SparkQA commented Mar 21, 2016

Test build #53678 has finished for PR 11636 at commit 9ec61ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|
| ${columnLIVAssigns.mkString("", "\n", "\n")}
Contributor:

Do we still need this?

Member (Author):

For now, I will not use this. In the future, I may revisit this Scala replacement for all possible variables in another PR.

@kiszk (Member, Author) commented Mar 21, 2016

Here are the benchmark results with the latest code.

Without this PR

model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
Partitioned Table:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
Read data column                          484 /  529         32.5          30.7       1.0X
Read partition column                     323 /  368         48.7          20.5       1.5X
Read both columns                         617 /  654         25.5          39.2       0.8X

With this PR

model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
Partitioned Table:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
Read data column                          412 /  444         38.2          26.2       1.0X
Read partition column                     290 /  327         54.2          18.4       1.4X
Read both columns                         516 /  613         30.5          32.8       0.8X

@davies (Contributor) commented Mar 21, 2016

LGTM

@SparkQA commented Mar 21, 2016

Test build #53706 has finished for PR 11636 at commit 6b07e69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor) commented Mar 21, 2016

Merged into master, thanks!

@asfgit closed this in f35df7d on Mar 21, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
…m ColumnVector when ColumnarBatch is used

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes apache#11636 from kiszk/SPARK-13805.