[SPARK-37896][SQL] Implement a ConstantColumnVector and improve performance of the hidden file metadata #35068
Conversation
Hi @cloud-fan, a small performance improvement PR adding a new method
I am not against this change; it's an improvement on top of the current code anyway. Just want to think twice here as we are adding an API method. As we discussed before, just wondering: if we change the code path to use a constant column vector in the future, won't it render this newly added API unused? btw @Yaohua628 just in case, please let me know if any help is needed for the constant column vector implementation - I can also help implement it.
Yea, I think a constant column vector is a better solution.
got it, ya, makes sense! will try to have a simple version of the constant column vector. @c21 thanks for offering, will reach out to you if I need any help. thanks a lot!
Can one of the admins verify this patch?
@cloud-fan @c21, I have a simple version of the constant column vector; please take a look whenever you get a chance, appreciate any feedback and suggestions. Will add UTs for the new class if it is generally looking good, thanks! Happy new year!
Thanks @Yaohua628 for making the change! I have some comments.
@@ -0,0 +1,212 @@
package org.apache.spark.sql.execution.vectorized;
Let's add the Apache license header similar to other files.
 * Capacity: The vector only stores one copy of the data, and acts as an unbounded vector
 * (get from any row will return the same value)
 */
public class ConstantColumnVector extends ColumnVector {
I am wondering whether we should extend WritableColumnVector instead, so we can easily leverage the constant column vector to represent partition columns. It seems that for partition columns we are copying the same value for every row (Parquet and ORC). A future improvement is to use the constant column vector we are introducing here to avoid that unnecessary work.

@cloud-fan WDYT?
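The improvement suggested above can be sketched in plain Java. The class and method names here are illustrative stand-ins, not the PR's actual code:

```java
import java.util.Arrays;

// Illustrative sketch of the motivation: today a partition value is copied
// into every slot of a materialized column; a constant vector stores one copy.
public class PartitionColumnSketch {
    // current approach (simplified): one slot per row, same value repeated
    static long[] materialized(long partitionValue, int numRows) {
        long[] data = new long[numRows];
        Arrays.fill(data, partitionValue);  // O(numRows) work and memory
        return data;
    }

    // constant-vector approach (simplified): O(1) memory, no per-row fill
    static final class ConstantLongVector {
        private final long value;
        ConstantLongVector(long value) { this.value = value; }
        long getLong(int rowId) { return value; }  // any rowId, same value
    }

    public static void main(String[] args) {
        ConstantLongVector v = new ConstantLongVector(42L);
        // both approaches answer the same reads; only the storage differs
        System.out.println(materialized(42L, 1000)[999] == v.getLong(999));  // prints true
    }
}
```

Both representations answer `getLong(rowId)` identically; the constant form just skips the per-row fill and allocation.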
I was thinking of extending WritableColumnVector initially, but it seems we would need to implement some unnecessary public methods like putLongs(rowId, count, value).
@Override
public int numNulls() {
  return -1;
why -1 here?
this ConstantColumnVector is 'boundless': all values are the same, and there is no capacity (no total number of rows). I am also wondering what we should return here - maybe something like an UnimplementedException? cc: @cloud-fan
we can know the number of rows from the context, right? numNulls should either be 0 or numRows.
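A minimal sketch of this suggestion (illustrative names, not the PR's actual implementation): the vector records its row count and whether its single value is null, so the null count is all-or-nothing:

```java
// Illustrative sketch: since every row shares one value, numNulls is either
// 0 (a value was set) or numRows (the constant value is null).
public class ConstantIntVectorSketch {
    private final int numRows;
    private boolean isNull = true;  // no value set yet
    private int value;

    public ConstantIntVectorSketch(int numRows) { this.numRows = numRows; }

    public void setNull() { isNull = true; }
    public void setInt(int v) { isNull = false; value = v; }

    // per the review suggestion: 0 when a value is set, numRows when null
    public int numNulls() { return isNull ? numRows : 0; }

    public int getInt(int rowId) { return value; }  // same value for any rowId

    public static void main(String[] args) {
        ConstantIntVectorSketch v = new ConstantIntVectorSketch(10);
        v.setInt(7);
        System.out.println(v.numNulls());  // prints 0
        v.setNull();
        System.out.println(v.numNulls());  // prints 10
    }
}
```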
public void putChild(ConstantColumnVector value) {
  childData = value;
}
why do we need this?
my bad, I need to update this: it is for the constant struct or array type - I also need to take an ordinal; will address this.
// while internally, the TimestampType is stored in microsecond
metadataRow.update(i, currentFile.modificationTime * 1000L)
}
private def updateMetadataData(): Unit = {
nit: the name updateMetadataData is kind of hard to read. btw why do we delete createMetadataColumnVector and mix vector and row together here? IMO it's better to separate the code paths for non-vectorized and vectorized scans if possible.
makes sense!
nit: the scope of this change is kind of big IMO, as we are introducing a new public column vector. Maybe better to file a new JIRA instead of a followup.
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ConstantColumnVector.java
public ConstantColumnVector(int numRows, DataType type) {
  super(type);
  this.numRows = numRows;
  if (type instanceof StructType) {
why do we handle StructType twice?
ooops
@Override
public int numNulls() {
  return numRows;
should be 0 if hasNull is false
 *
 * Capacity: The vector only stores one copy of the data, and acts as an unbounded vector
 * (get from any row will return the same value)
 */
Can we write a UT for this new vector?
sure - working on it!
let's open a new JIRA ticket as adding a new kind of column vector is non-trivial.
/**
 * Sets up the data type of this constant column vector.
 * @param type
nit: this seems useless.
if (metadataColumns.isEmpty || currentFile == null) return
val path = new Path(currentFile.filePath)
metadataColumns.zipWithIndex.foreach { case (attr, i) =>
  attr.name match {
    case FILE_PATH => metadataRow.update(i, UTF8String.fromString(path.toString))
    case FILE_NAME => metadataRow.update(i, UTF8String.fromString(path.getName))
    case FILE_SIZE => metadataRow.update(i, currentFile.fileSize)
    case FILE_MODIFICATION_TIME =>
      // the modificationTime from the file is in millisecond,
      // while internally, the TimestampType is stored in microsecond
      metadataRow.update(i, currentFile.modificationTime * 1000L)
nit: unnecessary change? I didn't feel the readability improved much after negating the if condition.
  }
}
}
/**
- * Create a writable column vector containing all required metadata columns
+ * Create a constant column vector containing all required metadata columns
nit: shouldn't it be: Create an array of constant column vectors containing ...?
private ColumnarArray arrayData;
private ColumnarMap mapData;

private int numRows;
wondering what's the point of storing numRows here? It seems that we don't use numRows at all, e.g. for checking rowId in each getXXX method.
the only place is numNulls, from Wenchen's suggestion: #35068 (comment)
Nit: make this final
private int numRows;

public ConstantColumnVector(int numRows, DataType type) {
WritableColumnVector already has a way to set a constant via setIsConstant. Have you looked at it?
It seems setIsConstant only affects reset, but doesn't change how the data is stored.
Yeah, I actually looked at it as well. It seems more code change would be needed if we want to utilize setIsConstant from WritableColumnVector. It'd be better to start with a separate new class ConstantColumnVector here.
Makes sense. Perhaps we can remove setIsConstant later and replace its usage with ConstantColumnVector.
import org.apache.spark.sql.types.*;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarArray;
import org.apache.spark.sql.vectorized.ColumnarMap;
import org.apache.spark.unsafe.types.UTF8String;

import java.math.BigDecimal;
import java.math.BigInteger;
This order of imports does not follow Spark style. Java imports should come before Spark's.
// copy and modify from WritableColumnVector
// could also putChild by users
This comment looks unnecessary.
  return childData[ordinal];
}

public void putChild(int ordinal, ConstantColumnVector value) {
suggestion: setChild. put methods are for putting values into the vector.
+1. I'm also in favor of using setXXX for the APIs.
ok sure, I was thinking of setXXX, but decided to be consistent with WritableColumnVector. setXXX definitely makes sense and is more reasonable; changing back to setXXX, thanks!
  return childData[ordinal];
}

public void putChild(int ordinal, ConstantColumnVector value) {
For a public API, it's better to add a comment.
  return childData[ordinal];
}

public void putChild(int ordinal, ConstantColumnVector value) {
Where do you use putChild? I don't find it.
not anywhere, for now (just like many other set methods: setMap, setArray, etc.), but I added tests verifying those methods in the ConstantColumnVectorSuite
private ColumnarArray arrayData;
private ColumnarMap mapData;

private int numRows;
Nit: make this final
  return childData[ordinal];
}

public void putChild(int ordinal, ConstantColumnVector value) {
+1. I'm also in favor of using setXXX for the APIs.
}

@Override
public ColumnarArray getArray(int rowId) {
Not sure if this can work properly. Looking at ColumnarArray, in some cases an offset is required from the underlying ColumnVector, for instance in copy, toBooleanArray, etc.
public void putUtf8String(UTF8String value) {
  putByteArray(value.getBytes());
}
maybe add putInterval (or setInterval) too.
thanks for the suggestion - just want to put some minimal support in this PR (implement all necessary APIs extending from ColumnVector); will add follow-up PRs to include more high-level APIs (setStruct, setCalendarInterval, set..., etc.). thanks!
It'd be nice if we had tests accompanying this PR.
thanks for the suggestions! working on the tests and addressing comments
@cloud-fan @viirya @sunchao added a test suite and addressed comments. please take a look whenever you have a chance, thanks a lot!
Thanks @Yaohua628 , LGTM with one nit.
@Override
public UTF8String getUTF8String(int rowId) {
  return UTF8String.fromBytes(byteArrayData);
nit: we can store a UTF8String too instead of creating a new object each time, which could be expensive if this is used on a hot path.
make sense, done!
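The fix can be sketched as follows; to keep the example self-contained, a plain Java String stands in for Spark's UTF8String, and the class name is illustrative:

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of the review suggestion: decode the bytes once, in
// the setter, instead of building a new string object on every get call.
public class ConstantStringVectorSketch {
    private byte[] byteArrayData;
    private String cachedString;  // stands in for Spark's UTF8String here

    public void setByteArray(byte[] bytes) {
        this.byteArrayData = bytes;
        // decode once; every subsequent get returns the same cached object
        this.cachedString = new String(bytes, StandardCharsets.UTF_8);
    }

    public String getString(int rowId) {
        return cachedString;  // no per-row allocation on the hot path
    }

    public static void main(String[] args) {
        ConstantStringVectorSketch v = new ConstantStringVectorSketch();
        v.setByteArray("hello".getBytes(StandardCharsets.UTF_8));
        // same object is returned for every rowId
        System.out.println(v.getString(0) == v.getString(99));  // prints true
    }
}
```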
thanks, merging to master!
Thanks to all!
### What changes were proposed in this pull request?
This PR fixes the missing import. Logical conflict between #35068 and #35055.

### Why are the changes needed?
To fix the compilation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI should test it out in compilation.

Closes #35245 from HyukjinKwon/SPARK-37896.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a followup of #35068 to fix the null pointer exception when calling `ConstantColumnVector.close()`. `ConstantColumnVector.childData` can be null for e.g. a non-struct data type.

### Why are the changes needed?
Fix the exception when cleaning up the column vector.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Modified unit test in `ConstantColumnVectorSuite.scala` to exercise the code path of `ConstantColumnVector.close()` for every tested data type. Without the fix, the unit test throws NPE.

Closes #35324 from c21/constant-fix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Implement a new column vector named ConstantColumnVector, which avoids copying the same data to every row by storing only one copy of the data. Also, improve the performance of the hidden file metadata in FileScanRDD.
Why are the changes needed?
Performance improvements.
Does this PR introduce any user-facing change?
No
How was this patch tested?
A new test suite.