[SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector. #18958

ueshin · 2017-08-16T06:13:57Z

What changes were proposed in this pull request?

This is a refactoring of ColumnVector hierarchy and related classes.

- make ColumnVector read-only
- introduce WritableColumnVector with write interface
- remove ReadOnlyColumnVector

How was this patch tested?

Existing tests.

ueshin · 2017-08-16T06:15:30Z

cc @cloud-fan @BryanCutler

SparkQA · 2017-08-16T07:04:50Z

Test build #80720 has finished for PR 18958 at commit cd0de39.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2017-08-16T07:15:14Z

Jenkins, retest this please.

SparkQA · 2017-08-16T09:57:14Z

Test build #80723 has finished for PR 18958 at commit cd0de39.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-08-16T11:02:28Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java

      assert (!columns[ordinal].isConstant);
-      columns[ordinal].putNotNull(rowId);
-      columns[ordinal].putBoolean(rowId, value);
+      ((MutableColumnVector) columns[ordinal]).putNotNull(rowId);


Maybe move the assertion and the cast into a private getter?

Sure, I'll add a private getter and update these.
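
For illustration, the suggestion could look roughly like the sketch below. The classes are simplified stand-ins, not the code in this PR: `ReadableVector`, `WritableVector`, `BatchRow`, and the helper name `writableColumn` are all hypothetical.

```java
// Simplified stand-ins for illustration only; these are not Spark's real classes.
abstract class ReadableVector {
  abstract boolean getBoolean(int rowId);
}

class WritableVector extends ReadableVector {
  boolean isConstant = false;
  private final boolean[] nulls;
  private final boolean[] values;

  WritableVector(int capacity) {
    nulls = new boolean[capacity];
    values = new boolean[capacity];
  }

  void putNotNull(int rowId) { nulls[rowId] = false; }
  void putBoolean(int rowId, boolean value) { values[rowId] = value; }

  @Override
  boolean getBoolean(int rowId) { return values[rowId]; }
}

class BatchRow {
  private final ReadableVector[] columns;
  private final int rowId;

  BatchRow(ReadableVector[] columns, int rowId) {
    this.columns = columns;
    this.rowId = rowId;
  }

  // The cast and the constant check live in one place instead of in every setter.
  private WritableVector writableColumn(int ordinal) {
    WritableVector column = (WritableVector) columns[ordinal];
    assert !column.isConstant;
    return column;
  }

  void setBoolean(int ordinal, boolean value) {
    WritableVector column = writableColumn(ordinal);
    column.putNotNull(rowId);
    column.putBoolean(rowId, value);
  }
}
```

Each additional setter (`setByte`, `setInt`, and so on) would then start with the same one-line lookup.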

hvanhovell · 2017-08-16T11:06:09Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnVector.java

+        " to false.";
+
+    if (cause != null) {
+      throw new RuntimeException(message, cause);


You are allowed to pass null as a cause to the RuntimeException constructor.

Thanks. I'll update it.
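
A minimal sketch of the point above: the two-argument `RuntimeException` constructor accepts a null cause, so the `if (cause != null)` branch is unnecessary. The class `CauseDemo` and the helper `wrap` are hypothetical names used only for this example.

```java
class CauseDemo {
  // Hypothetical helper: builds the exception without branching on `cause`.
  static RuntimeException wrap(String message, Throwable cause) {
    // A null cause is permitted and means the cause is nonexistent or unknown.
    return new RuntimeException(message, cause);
  }
}
```

With a null cause, `getCause()` on the resulting exception simply returns null.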

hvanhovell · 2017-08-16T11:13:44Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java

+  public OffHeapColumnVector(int capacity, DataType type) {
+    super(capacity, type);
+
+    if (type instanceof ArrayType || type instanceof BinaryType || type instanceof StringType


Can you try to move this initialization logic into the parent class? We should be able to factor out the on/off-heap specific initialization logic into a separate method.

Sure, I'll try it.
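
One way to read this suggestion is as a template-method arrangement: the parent constructor owns the type-dependent setup and delegates only the allocation to the subclass. The sketch below assumes a much simplified type model; `Kind`, `BaseVector`, `HeapVector`, `DirectVector`, and `newChild` are hypothetical names, not the classes in this PR.

```java
// Simplified stand-ins for illustration only; these are not Spark's real classes.
enum Kind { INT, ARRAY }

abstract class BaseVector {
  final Kind kind;
  final BaseVector[] children;

  BaseVector(int capacity, Kind kind) {
    this.kind = kind;
    // Shared, type-dependent setup lives in the parent class.
    if (kind == Kind.ARRAY) {
      children = new BaseVector[] { newChild(capacity, Kind.INT) };
    } else {
      children = new BaseVector[0];
    }
  }

  // Only the allocation strategy is left to subclasses. This is called from the
  // constructor, so implementations must not read subclass fields.
  abstract BaseVector newChild(int capacity, Kind kind);

  abstract String memoryMode();
}

class HeapVector extends BaseVector {
  HeapVector(int capacity, Kind kind) { super(capacity, kind); }

  @Override
  BaseVector newChild(int capacity, Kind kind) { return new HeapVector(capacity, kind); }

  @Override
  String memoryMode() { return "ON_HEAP"; }
}

class DirectVector extends BaseVector {
  DirectVector(int capacity, Kind kind) { super(capacity, kind); }

  @Override
  BaseVector newChild(int capacity, Kind kind) { return new DirectVector(capacity, kind); }

  @Override
  String memoryMode() { return "OFF_HEAP"; }
}
```

The caveat with this shape is that the hook runs before the subclass constructor body, which is why the factory here depends only on its arguments.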

hvanhovell · 2017-08-16T11:14:07Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java

+   * Reserve a integer column for ids of dictionary.
+   */
+  @Override
+  public OffHeapColumnVector reserveDictionaryIds(int capacity) {


Same comment as in the constructor.

Sure, I'll try it.

hvanhovell · 2017-08-16T11:20:05Z

...ore/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala

-       |      aggregateBufferSchema, org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
-       |    for (int i = 0 ; i < aggregateBufferBatch.numCols(); i++) {
-       |       aggregateBufferBatch.setColumn(i, batch.column(i+${groupingKeys.length}));
+       |    batchVectors = new org.apache.spark.sql.execution.vectorized


This happens quite a few times. It might be better to create a static util method that creates the vectors for you.

Sure, I'll try it.
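
As a rough sketch of such a helper, assuming a schema reduced to an array of type names: `TypedVector` and `allocateColumns` below are illustrative names and a simplified model, not the code in this PR.

```java
// Simplified stand-in for illustration only; this is not Spark's real class.
class TypedVector {
  final String typeName;
  final int capacity;

  TypedVector(int capacity, String typeName) {
    this.capacity = capacity;
    this.typeName = typeName;
  }

  // Hypothetical factory: one vector per field, all with the same capacity,
  // so call sites do not repeat the allocation loop.
  static TypedVector[] allocateColumns(int capacity, String[] fieldTypes) {
    TypedVector[] vectors = new TypedVector[fieldTypes.length];
    for (int i = 0; i < fieldTypes.length; i++) {
      vectors[i] = new TypedVector(capacity, fieldTypes[i]);
    }
    return vectors;
  }
}
```

Generated code would then need a single call per batch instead of an inline loop.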

hvanhovell · 2017-08-16T11:20:33Z

...ore/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala

       |    }
+       |    // TODO: Possibly generate this projection in HashAggregate directly


Can you elaborate?

I'm sorry, but I'm not sure; this TODO comes from the original code.

hvanhovell · 2017-08-16T12:32:03Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnVector.java

+ * elements. This means that the put() APIs do not check as in common cases (i.e. flat schemas),
+ * the lengths are known up front.
+ *
+ * A ColumnVector should be considered immutable once originally created. In other words, it is not


This contradicts the name of this class. Maybe reusable is a better way of describing what is going on here. Also cc @michal-databricks

How about WritableColumnVector?

hvanhovell · 2017-08-16T12:35:49Z

On a more general level, we could also choose to make ColumnVectors immutable and create builder classes to create (reusable) instances; this would create a better separation between the APIs and make sure that the mutable vectors are not used incorrectly.
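
To make the builder alternative concrete, here is a minimal sketch under stated assumptions: an int-only vector, and hypothetical names `ImmutableIntVector` and `IntVectorBuilder`. It is not something this PR implements.

```java
// Simplified stand-ins for illustration only; these are not Spark's real classes.
final class ImmutableIntVector {
  private final int[] values;

  ImmutableIntVector(int[] source, int length) {
    // Copy so that later reuse of the builder's buffer cannot change this vector.
    this.values = java.util.Arrays.copyOf(source, length);
  }

  int length() { return values.length; }

  int getInt(int rowId) { return values[rowId]; }
}

final class IntVectorBuilder {
  private int[] buffer;
  private int size = 0;

  IntVectorBuilder(int initialCapacity) {
    buffer = new int[Math.max(1, initialCapacity)];
  }

  IntVectorBuilder append(int value) {
    if (size == buffer.length) {
      buffer = java.util.Arrays.copyOf(buffer, buffer.length * 2);
    }
    buffer[size++] = value;
    return this;
  }

  ImmutableIntVector build() {
    return new ImmutableIntVector(buffer, size);
  }

  // The builder, not the vector, is the reusable part.
  IntVectorBuilder reset() {
    size = 0;
    return this;
  }
}
```

The trade-off in this sketch is the copy in `build()`, which buys the guarantee that a built vector never changes.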

BryanCutler

Thanks for doing this @ueshin! From the perspective of using this for ArrowColumnVector batches, LGTM. I just had one question about removing the capacity var from ColumnarBatch: I think we can get away with just using numRows.

BryanCutler · 2017-08-16T22:04:57Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java

-    this.capacity = maxRows;
-    this.columns = new ColumnVector[schema.size()];
+    this.columns = columns;
+    this.capacity = capacity;


Does capacity really mean anything in here anymore, since the ColumnVectors are now allocated and populated outside? Could we just initialize this.numRows = 0 and delay initializing this.filteredRows until setNumRows() is called?

I found some places that refer to ColumnarBatch.capacity(), so I'd rather be conservative and not do that for now.

ueshin · 2017-08-17T05:41:17Z

also cc @kiszk for another column vector pr.

SparkQA · 2017-08-17T07:04:49Z

Test build #80766 has finished for PR 18958 at commit b6ab633.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2017-08-17T07:15:09Z

Jenkins, retest this please.

SparkQA · 2017-08-17T09:50:48Z

Test build #80775 has finished for PR 18958 at commit b6ab633.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-08-20T12:44:08Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

@@ -450,14 +450,13 @@ class CodegenContext {
  /**
   * Returns the specialized code to set a given value in a column vector for a given `DataType`.
   */
-  def setValue(batch: String, row: String, dataType: DataType, ordinal: Int,
-      value: String): String = {
+  def setValue(vector: String, row: String, dataType: DataType, value: String): String = {


nit: row -> rowId

cloud-fan · 2017-08-20T12:45:18Z

...src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java

@@ -433,7 +434,8 @@ private void readBinaryBatch(int rowId, int num, ColumnVector column) throws IOE
  }

  private void readFixedLenByteArrayBatch(int rowId, int num,
-                                          ColumnVector column, int arrayLen) throws IOException {
+                                          MutableColumnVector column,
+                                          int arrayLen) throws IOException {


nit:

private void xxx( para1: XX, para2: XX)

cloud-fan · 2017-08-20T12:51:10Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java

-      anyNullsSet = false;
-    }
-  }
+  public abstract void reset();


If ColumnVector is read-only, why do we need a reset API?

cloud-fan · 2017-08-20T12:54:55Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java

-      this.resultArray = null;
-      this.resultStruct = null;
-    }
+    this.isConstant = true;


I think isConstant should belong to MutableColumnVector, because it's used to indicate that this column vector should not be updated.

cloud-fan · 2017-08-20T13:09:03Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java

-      assert (!columns[ordinal].isConstant);
-      columns[ordinal].putNotNull(rowId);
-      columns[ordinal].putByte(rowId, value);
+      MutableColumnVector column = getColumnAsMutable(ordinal);


I'm a little worried about this per-call type cast, but the JVM should be able to optimize it perfectly, cc @kiszk

In my understanding, the cast still occurs at runtime. The cast operation may consist of a compare and a branch.
I am thinking about how we can reduce the cost of these operations.

@ueshin @cloud-fan
Since the MutableColumnVector assigned to each column in ColumnarBatch is immutable, we can create an array of MutableColumnVector by casting from ColumnVector at initialization. If a cast exception occurs, we can ignore it, since that column will never call the setter APIs. Then each setter refers to an element of the array without a cast.

What do you think?
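
A minimal sketch of this proposal, with simplified stand-in classes (`ReadVector`, `WriteVector`, `ConstVector`, and `Batch` are hypothetical names). It uses an `instanceof` check at construction time rather than catching the cast exception, which is an assumption of this sketch and not necessarily what the PR does.

```java
// Simplified stand-ins for illustration only; these are not Spark's real classes.
abstract class ReadVector {
  abstract int getInt(int rowId);
}

class WriteVector extends ReadVector {
  private final int[] data;

  WriteVector(int capacity) { data = new int[capacity]; }

  void putInt(int rowId, int value) { data[rowId] = value; }

  @Override
  int getInt(int rowId) { return data[rowId]; }
}

class ConstVector extends ReadVector {
  private final int value;

  ConstVector(int value) { this.value = value; }

  @Override
  int getInt(int rowId) { return value; }
}

class Batch {
  private final ReadVector[] columns;
  // Pre-cast view of the same columns; null where a column is not writable.
  private final WriteVector[] writableColumns;

  Batch(ReadVector[] columns) {
    this.columns = columns;
    this.writableColumns = new WriteVector[columns.length];
    for (int i = 0; i < columns.length; i++) {
      if (columns[i] instanceof WriteVector) {
        writableColumns[i] = (WriteVector) columns[i];
      }
    }
  }

  // No cast on the write path: the setter indexes the pre-cast array.
  void setInt(int ordinal, int rowId, int value) {
    writableColumns[ordinal].putInt(rowId, value);
  }

  int getInt(int ordinal, int rowId) {
    return columns[ordinal].getInt(rowId);
  }
}
```

In this sketch a write to a read-only column fails with a `NullPointerException` rather than a `ClassCastException`, so the failure mode changes even though the misuse is still caught.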

cloud-fan · 2017-08-21T05:17:39Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java

-   */
-  protected int elementsAppended;
-
-  /**
   * If this is a nested type (array or struct), the column for the child data.
   */
  protected ColumnVector[] childColumns;


Can we move this to WritableColumnVector? I think ColumnVector only needs ColumnVector getChildColumn(int ordinal), and WritableColumnVector can override it as WritableColumnVector getChildColumn(int ordinal)

We need this field for ArrowColumnVector to store its child columns for now, too.
Do you want to make the method getChildColumn(int ordinal) abstract and move the field to the more concrete classes so that they manage it themselves?

Yea, because the child columns are mostly of the same concrete column vector type.
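
The override discussed here relies on Java's covariant return types. A minimal sketch with hypothetical stand-in classes (`ReadOnlyVec` and `WritableVec` are not the classes in this PR):

```java
// Simplified stand-ins for illustration only; these are not Spark's real classes.
abstract class ReadOnlyVec {
  // The read-only API only promises a read-only child.
  abstract ReadOnlyVec getChildColumn(int ordinal);
}

class WritableVec extends ReadOnlyVec {
  // Each concrete class keeps its children in its own, more specific type.
  private final WritableVec[] childColumns;
  private int value;

  WritableVec(WritableVec... childColumns) {
    this.childColumns = childColumns;
  }

  // Covariant override: callers holding a WritableVec get a WritableVec back,
  // so no cast is needed before writing to a child.
  @Override
  WritableVec getChildColumn(int ordinal) {
    return childColumns[ordinal];
  }

  void putInt(int value) { this.value = value; }

  int getInt() { return value; }
}
```

Code that only sees the `ReadOnlyVec` type still gets the same child object, just without the write methods.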

cloud-fan · 2017-08-21T05:22:07Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java

@@ -307,64 +293,70 @@ public void update(int ordinal, Object value) {

    @Override
    public void setNullAt(int ordinal) {


One question: do the rows returned by ColumnarBatch.rowIterator have to be mutable?

It seems the rows returned by ColumnarBatch.rowIterator don't need to be mutable as far as our current tests go, but ColumnarBatch.Row still needs to be mutable because its write APIs are used in HashAggregateExec.

oh, then we really need to think about how to eliminate the per-call type cast...

SparkQA · 2017-08-21T07:04:49Z

Test build #80918 has finished for PR 18958 at commit 4d94655.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2017-08-21T07:16:04Z

Jenkins, retest this please.

SparkQA · 2017-08-21T10:00:01Z

Test build #80923 has finished for PR 18958 at commit 4d94655.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-08-21T11:28:54Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java

+  /**
+   * Initialize child columns.
+   */
+  protected void initialize() {


We could move this method into the constructor.

SparkQA · 2017-08-22T07:04:48Z

Test build #80962 has finished for PR 18958 at commit 9eb88a8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-08-22T07:05:48Z

retest this please

SparkQA · 2017-08-22T09:07:53Z

Test build #80968 has finished for PR 18958 at commit 9eb88a8.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-22T11:13:51Z

Test build #80979 has finished for PR 18958 at commit 65cd681.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2017-08-22T14:05:21Z

Jenkins, retest this please.

SparkQA · 2017-08-22T16:46:22Z

Test build #80987 has finished for PR 18958 at commit 65cd681.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-08-23T15:06:26Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java

+      column.putDecimal(rowId, value, precision);
+    }
+
+    private WritableColumnVector getColumnAsWritable(int ordinal) {


nit: getWritableColumn

cloud-fan · 2017-08-23T15:17:02Z

LGTM except for some minor comments

SparkQA · 2017-08-23T18:53:36Z

Test build #81038 has finished for PR 18958 at commit 8330870.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-08-24T09:58:01Z

LGTM

cloud-fan · 2017-08-24T13:14:11Z

thanks, merging to master!

ueshin added 2 commits August 16, 2017 15:09

Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce MutableColumnVector.

e4e2241

Modify VectorizedHashMapGenerator to use OnHeapColumnVector directly.

cd0de39

hvanhovell reviewed Aug 16, 2017

BryanCutler reviewed Aug 16, 2017

ueshin added 5 commits August 17, 2017 12:29

Add a private getter method.

05b1aa9

Address a comment about the RuntimeException constructor.

79eaecd

Move initialization logics into the parent class.

d2dda9c

Move reserveDictionaryIds logic into the parent class.

2769b39

Add static util methods to create the on/off-heap vectors.

b6ab633

cloud-fan reviewed Aug 20, 2017

Use rowId instead of row.

ae317f6

ueshin added 5 commits August 21, 2017 12:05

Fix style.

de6a87d

Remove reset() from ColumnVector.

4cd9c77

Move isConstant to MutableColumnVector.

a52d717

Remove unneeded cast.

d7b77f7

Rename MutableColumnVector to WritableColumnVector.

4d94655

ueshin changed the title ~~[SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce MutableColumnVector.~~ [SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector. Aug 21, 2017

cloud-fan reviewed Aug 21, 2017

hvanhovell reviewed Aug 21, 2017

ueshin added 3 commits August 22, 2017 13:26

Move childColumns to more concrete class.

ccb2d59

Move initializing child columns into constructor.

5782cc4

Make ArrowColumnVector reusable.

9eb88a8

Cast ColumnVector to WritableColumnVector when initializing.

65cd681

cloud-fan reviewed Aug 23, 2017

Rename a method getColumnAsWritable to getWritableColumn.

8330870

asfgit closed this in 9e33954 Aug 24, 2017
