
[SPARK-37896][SQL] Implement a ConstantColumnVector and improve performance of the hidden file metadata #35068

Closed
wants to merge 11 commits

Conversation

Yaohua628
Contributor

@Yaohua628 Yaohua628 commented Dec 30, 2021

What changes were proposed in this pull request?

Implement a new column vector named ConstantColumnVector, which avoids copying the same data into every row by storing only a single copy of it.

Also, improve the performance of the hidden file metadata columns in FileScanRDD.
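The idea can be sketched as follows (an illustrative, self-contained class, not Spark's actual API; the names here are ours): the vector keeps one copy of a value and returns it for every rowId, instead of writing one copy per row.

```java
// Illustrative sketch of a constant column vector (not Spark's actual API):
// one stored value is returned for every rowId, so no per-row copying occurs.
public class ConstantLongVector {
    private long value;
    private boolean isNull = true;

    public void setLong(long v) {
        this.value = v;
        this.isNull = false;
    }

    // Any rowId yields the same value, so the vector acts as if unbounded.
    public long getLong(int rowId) {
        return value;
    }

    public boolean isNullAt(int rowId) {
        return isNull;
    }
}
```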

Why are the changes needed?

Performance improvements.

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new test suite.

@github-actions github-actions bot added the SQL label Dec 30, 2021
@Yaohua628
Contributor Author

Hi @cloud-fan, this is a small performance-improvement PR adding a new method putByteArrays to WritableColumnVector, to avoid copying the same byte array for every row. Please take a look whenever you have time, thanks a bunch!

@c21
Contributor

c21 commented Dec 30, 2021

I am not against this change; it's an improvement on top of the current code anyway. Just want to think twice here, as we are adding an API method to WritableColumnVector, which is public.

As we discussed before, just wondering: if we change the code path to use a constant column vector in the future, won't that leave this newly added API unused?

btw @Yaohua628 just in case, please let me know if you need any help with the constant column vector implementation; I can help implement it as well.

@cloud-fan
Contributor

Yea I think a constant column vector is a better solution.

@Yaohua628
Contributor Author

I am not against this change; it's an improvement on top of the current code anyway. Just want to think twice here, as we are adding an API method to WritableColumnVector, which is public.

As we discussed before, just wondering: if we change the code path to use a constant column vector in the future, won't that leave this newly added API unused?

btw @Yaohua628 just in case, please let me know if you need any help with the constant column vector implementation; I can help implement it as well.

Yea I think a constant column vector is a better solution.

got it, yeah, makes sense! I will try to put up a simple version of the constant column vector. @c21 thanks for offering; I will reach out if I need any help. Thanks a lot!

@AmplabJenkins

Can one of the admins verify this patch?

@Yaohua628 Yaohua628 changed the title [SPARK-37770][SQL][FOLLOWUP] Implement putByteArrays for WritableColumnVector [SPARK-37770][SQL][FOLLOWUP] Implement the ConstantColumnVector for the metadata columns performance improvements Dec 31, 2021
@Yaohua628
Contributor Author

@cloud-fan @c21, I have a simple version of the constant column vector; please take a look whenever you get a chance. I appreciate any feedback and suggestions, and will add UTs for the new class if it generally looks good. Thanks!

Happy new year!

Contributor

@c21 c21 left a comment


Thanks @Yaohua628 for making the change! I have some comments.

@@ -0,0 +1,212 @@
package org.apache.spark.sql.execution.vectorized;
Contributor


Let's add the Apache license header similar to other files.

* Capacity: The vector only stores one copy of the data, and acts as an unbounded vector
* (get from any row will return the same value)
*/
public class ConstantColumnVector extends ColumnVector {
Contributor


I am wondering whether we should extend WritableColumnVector instead, so we can easily leverage the constant column vector to represent partition columns.

It seems that for partition columns, we are copying the same value into each row (Parquet and ORC). A future improvement is to use the constant column vector we are introducing here to avoid those unnecessary operations.

@cloud-fan WDYT?

Contributor Author


I was thinking of extending WritableColumnVector initially, but it seems like we would need to implement some unnecessary public methods like putLongs(rowId, count, value).


@Override
public int numNulls() {
return -1;
Contributor


why -1 here?

Contributor Author


this ConstantColumnVector is 'boundless': all values are the same and there is no capacity (no total number of rows). I am also wondering what we should return here; maybe something like an UnimplementedException? cc: @cloud-fan

Contributor


we can know the number of rows from the context, right? numNulls should be either 0 or numRows.
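That suggestion can be sketched with a minimal, self-contained class (illustrative names, not the actual Spark class): since every row shares one stored value, the null count degenerates to all-or-nothing.

```java
// Sketch: a constant vector holds one shared value for numRows rows, so
// numNulls() is either 0 (value present) or numRows (value is null).
public class ConstantVectorNullCount {
    private final int numRows;
    private boolean isNull = false;

    public ConstantVectorNullCount(int numRows) {
        this.numRows = numRows;
    }

    public void setNull() { isNull = true; }

    public void setNotNull() { isNull = false; }

    public boolean hasNull() { return isNull; }

    // Either no row is null, or every row is.
    public int numNulls() { return isNull ? numRows : 0; }
}
```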

Comment on lines 209 to 211
public void putChild(ConstantColumnVector value) {
childData = value;
}
Contributor


why we need this?

Contributor Author


my bad, I need to update this: it is for the constant struct or array type. I also need to take an ordinal; will address this.

// while internally, the TimestampType is stored in microsecond
metadataRow.update(i, currentFile.modificationTime * 1000L)
}
private def updateMetadataData(): Unit = {
Contributor


nit: the name updateMetadataData is kind of hard to read. btw, why do we delete createMetadataColumnVector and mix vector and row together here? IMO it's better to keep the code paths for non-vectorized and vectorized scans separate if possible.

Contributor Author


makes sense!

@c21
Contributor

c21 commented Jan 5, 2022

nit: the scope of this change is kind of big IMO, as we are introducing a new public column vector. It may be better to file a new JIRA instead of a followup.

public ConstantColumnVector(int numRows, DataType type) {
super(type);
this.numRows = numRows;
if (type instanceof StructType) {
Contributor


why do we handle StructType twice?

Contributor Author


ooops


@Override
public int numNulls() {
return numRows;
Contributor


should be 0 if hasNull is false

*
* Capacity: The vector only stores one copy of the data, and acts as an unbounded vector
* (get from any row will return the same value)
*/
Contributor


Can we write a UT for this new vector?

Contributor Author


sure - working on it!

@cloud-fan
Contributor

let's open a new JIRA ticket, as adding a new kind of column vector is non-trivial.


/**
* Sets up the data type of this constant column vector.
* @param type
Contributor


nit: this seems useless.

Comment on lines 138 to 148
if (metadataColumns.isEmpty || currentFile == null) return
val path = new Path(currentFile.filePath)
metadataColumns.zipWithIndex.foreach { case (attr, i) =>
attr.name match {
case FILE_PATH => metadataRow.update(i, UTF8String.fromString(path.toString))
case FILE_NAME => metadataRow.update(i, UTF8String.fromString(path.getName))
case FILE_SIZE => metadataRow.update(i, currentFile.fileSize)
case FILE_MODIFICATION_TIME =>
// the modificationTime from the file is in millisecond,
// while internally, the TimestampType is stored in microsecond
metadataRow.update(i, currentFile.modificationTime * 1000L)
Contributor


nit: unnecessary change? I didn't feel the readability improved much after negating the if condition.

}
}
}
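As an aside on the snippet above: file modification times arrive in milliseconds, while Spark's TimestampType stores microseconds, hence the * 1000L. A minimal illustration (the helper name is ours, not Spark's):

```java
// Converts a file modification time (milliseconds since the epoch) to the
// microsecond precision that TimestampType values are stored in.
public class MtimeToMicros {
    public static long toMicros(long modificationTimeMs) {
        return modificationTimeMs * 1000L;
    }
}
```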

/**
* Create a writable column vector containing all required metadata columns
* Create a constant column vector containing all required metadata columns
Contributor


nit: shouldn't it be: Create an array of constant column vectors containing ...?

private ColumnarArray arrayData;
private ColumnarMap mapData;

private int numRows;
Contributor


wondering what's the point of storing numRows here? It seems that we don't use numRows at all, e.g. to check rowId in each getXXX method.

Contributor Author


the only place is numNulls from Wenchen's suggestion: #35068 (comment)

Member


Nit: make this final

@Yaohua628 Yaohua628 changed the title [SPARK-37770][SQL][FOLLOWUP] Implement the ConstantColumnVector for the metadata columns performance improvements [SPARK-37896][SQL] Implement a ConstantColumnVector and improve performance of the hidden file metadata Jan 13, 2022

private int numRows;

public ConstantColumnVector(int numRows, DataType type) {
Member


WritableColumnVector already has a way to set constant via setIsConstant. Have you looked at it?

Contributor


It seems setIsConstant only affects reset, but doesn't change how the data is stored.

Contributor


Yeah, I actually looked at it as well. It seems more code change would be needed if we want to utilize setIsConstant from WritableColumnVector. It'd be better to start with a separate new class ConstantColumnVector here.

Member


Makes sense. Perhaps we can remove setIsConstant later and replace its usage with ConstantColumnVector.
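The distinction discussed here can be sketched as follows (illustrative, not Spark's code): a writable vector marked constant still allocates one slot per row; the flag only changes what reset() does, whereas a dedicated constant vector would store one value in total.

```java
import java.util.Arrays;

// Sketch: a writable vector marked constant keeps its per-row storage; the
// isConstant flag only makes reset() keep the data instead of clearing it.
public class WritableIntVectorSketch {
    private final int[] data;          // per-row storage, even for constants
    private boolean isConstant = false;

    public WritableIntVectorSketch(int capacity) {
        this.data = new int[capacity];
    }

    public void setIsConstant() { isConstant = true; }

    public void putInt(int rowId, int value) { data[rowId] = value; }

    public int getInt(int rowId) { return data[rowId]; }

    // Constant vectors survive reset() with their data intact.
    public void reset() {
        if (!isConstant) {
            Arrays.fill(data, 0);
        }
    }
}
```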

Comment on lines 19 to 26
import org.apache.spark.sql.types.*;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarArray;
import org.apache.spark.sql.vectorized.ColumnarMap;
import org.apache.spark.unsafe.types.UTF8String;

import java.math.BigDecimal;
import java.math.BigInteger;
Member


This order of imports doesn't follow Spark style: Java imports should come before Spark's.

Comment on lines 56 to 57
// copy and modify from WritableColumnVector
// could also putChild by users
Member


This comment looks unnecessary.

return childData[ordinal];
}

public void putChild(int ordinal, ConstantColumnVector value) {
Member


suggestion: setChild. put methods are for putting values into the vector.

Member


+1. I'm also in favor of using setXXX for the APIs.

Contributor Author


ok sure, I was considering setXXX but decided to be consistent with WritableColumnVector. setXXX definitely makes sense and is more reasonable; changing to setXXX, thanks!

return childData[ordinal];
}

public void putChild(int ordinal, ConstantColumnVector value) {
Member


For a public API, it's better to add a comment.

return childData[ordinal];
}

public void putChild(int ordinal, ConstantColumnVector value) {
Member


Where do you use putChild? I can't find it.

Contributor Author


not anywhere, for now (just like many other set methods: setMap, setArray, etc.), but I added tests verifying those methods in ConstantColumnVectorSuite.

}

@Override
public ColumnarArray getArray(int rowId) {
Member


Not sure if this can work properly. Looking at ColumnarArray, in some cases an offset into the underlying ColumnVector is required, for instance in copy, toBooleanArray, etc.

public void putUtf8String(UTF8String value) {
putByteArray(value.getBytes());
}

Member

@sunchao sunchao Jan 14, 2022


maybe add putInterval (or setInterval) too.

Contributor Author


thanks for the suggestion! I just want to include minimal support in this PR (implementing all the necessary APIs inherited from ColumnVector), and will add follow-up PRs with more high-level APIs (setStruct, setCalendarInterval, set..., etc.). thanks!

@sunchao
Member

sunchao commented Jan 14, 2022

It'd be nice if we have tests accompanying this PR.

@Yaohua628
Contributor Author

It'd be nice if we have tests accompanying this PR.

thanks for the suggestions! working on the test and addressing the comments

@Yaohua628
Contributor Author

@cloud-fan @viirya @sunchao I added a test suite and addressed the comments. please take a look whenever you have a chance, thanks a lot!

Member

@sunchao sunchao left a comment


Thanks @Yaohua628 , LGTM with one nit.


@Override
public UTF8String getUTF8String(int rowId) {
return UTF8String.fromBytes(byteArrayData);
Member


nit: we can store a UTF8String too instead of creating a new object each time, which could be expensive if this is used on a hot path.

Contributor Author


makes sense, done!
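The caching suggestion can be sketched like this, with java.lang.String standing in for Spark's UTF8String (an assumption made so the example is self-contained): decode once when the value is set, so the hot-path getter allocates nothing.

```java
import java.nio.charset.StandardCharsets;

// Sketch: store the decoded string alongside the raw bytes at set time, so
// each get returns the same cached object instead of allocating a new one.
public class ConstantStringVector {
    private byte[] byteArrayData;
    private String cachedString;   // decoded once, reused on the hot path

    public void setUtf8(byte[] bytes) {
        this.byteArrayData = bytes;
        this.cachedString = new String(bytes, StandardCharsets.UTF_8);
    }

    public String getString(int rowId) {
        return cachedString;       // no per-call decoding or allocation
    }
}
```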

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3a45981 Jan 19, 2022
@Yaohua628
Contributor Author

Thanks to all!

HyukjinKwon added a commit that referenced this pull request Jan 19, 2022
### What changes were proposed in this pull request?
This PR fixes a missing import, a logical conflict between #35068 and #35055.

### Why are the changes needed?

To fix the compilation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI should test it out in compilation.

Closes #35245 from HyukjinKwon/SPARK-37896.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
cloud-fan pushed a commit that referenced this pull request Jan 26, 2022
### What changes were proposed in this pull request?

This PR is a followup of #35068 to fix the null pointer exception when calling `ConstantColumnVector.close()`. `ConstantColumnVector.childData` can be null for e.g. non-struct data type.

### Why are the changes needed?

Fix the exception when cleaning up column vector.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified unit test in `ConstantColumnVectorSuite.scala` to exercise the code path of `ConstantColumnVector.close()` for every tested data type. Without the fix, the unit test throws NPE.

Closes #35324 from c21/constant-fix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
senthh pushed a commit to senthh/spark-1 that referenced this pull request Feb 3, 2022