
[SPARK-37896][SQL] Implement a ConstantColumnVector and improve performance of the hidden file metadata #35068

Closed
wants to merge 11 commits

Conversation

Yaohua628
Contributor

@Yaohua628 Yaohua628 commented Dec 30, 2021

What changes were proposed in this pull request?

Implement a new column vector named ConstantColumnVector, which avoids copying the same data into every row by storing only a single copy of it.

Also, improve the performance of the hidden file metadata columns in FileScanRDD.
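The idea can be sketched as follows (an illustrative, self-contained class, not Spark's actual API; the names here are ours): the vector keeps one copy of a value and returns it for every rowId, instead of writing one copy per row.

```java
// Illustrative sketch of a constant column vector (not Spark's actual API):
// one stored value is returned for every rowId, so no per-row copying occurs.
public class ConstantLongVector {
    private long value;
    private boolean isNull = true;

    public void setLong(long v) {
        this.value = v;
        this.isNull = false;
    }

    // Any rowId yields the same value, so the vector acts as if unbounded.
    public long getLong(int rowId) {
        return value;
    }

    public boolean isNullAt(int rowId) {
        return isNull;
    }
}
```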

Why are the changes needed?

Performance improvements.

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new test suite.

@github-actions github-actions bot added the SQL label Dec 30, 2021
@Yaohua628
Contributor Author

Hi @cloud-fan, this is a small performance-improvement PR adding a new method putByteArrays to WritableColumnVector, to avoid copying the same byte array for every row. Please take a look whenever you have time, thanks a bunch!

@c21
Contributor

c21 commented Dec 30, 2021

I am not against this change; it's an improvement on top of the current code anyway. Just want to think twice here, as we are adding an API method to WritableColumnVector, which is public.

As we discussed before, just wondering: if we change the code path to use a constant column vector in the future, won't that leave this newly added API unused?

btw @Yaohua628 just in case, please let me know if you need any help with the constant column vector implementation; I can help implement it as well.

@cloud-fan
Contributor

Yea I think a constant column vector is a better solution.

@Yaohua628
Contributor Author

I am not against this change; it's an improvement on top of the current code anyway. Just want to think twice here, as we are adding an API method to WritableColumnVector, which is public.

As we discussed before, just wondering: if we change the code path to use a constant column vector in the future, won't that leave this newly added API unused?

btw @Yaohua628 just in case, please let me know if you need any help with the constant column vector implementation; I can help implement it as well.

Yea I think a constant column vector is a better solution.

got it, yeah, makes sense! I will try to put up a simple version of the constant column vector. @c21 thanks for offering; I will reach out if I need any help. Thanks a lot!

@AmplabJenkins

Can one of the admins verify this patch?

@Yaohua628 Yaohua628 changed the title [SPARK-37770][SQL][FOLLOWUP] Implement putByteArrays for WritableColumnVector [SPARK-37770][SQL][FOLLOWUP] Implement the ConstantColumnVector for the metadata columns performance improvements Dec 31, 2021
@Yaohua628
Contributor Author

@cloud-fan @c21, I have a simple version of the constant column vector; please take a look whenever you get a chance. I appreciate any feedback and suggestions, and will add UTs for the new class if it generally looks good. Thanks!

Happy new year!

Contributor

@c21 c21 left a comment


Thanks @Yaohua628 for making the change! I have some comments.

@@ -0,0 +1,212 @@
package org.apache.spark.sql.execution.vectorized;
Contributor


Let's add the Apache license header similar to other files.

* Capacity: The vector only stores one copy of the data, and acts as an unbounded vector
* (get from any row will return the same value)
*/
public class ConstantColumnVector extends ColumnVector {
Contributor


I am wondering whether we should extend WritableColumnVector instead, so we can easily leverage the constant column vector to represent partition columns.

It seems that for partition columns, we are copying the same value into each row (Parquet and ORC). A future improvement is to use the constant column vector we are introducing here to avoid those unnecessary operations.

@cloud-fan WDYT?

Contributor Author


I was thinking of extending WritableColumnVector initially, but it seems like we would need to implement some unnecessary public methods like putLongs(rowId, count, value).


@Override
public int numNulls() {
return -1;
Contributor


why -1 here?

Contributor Author


this ConstantColumnVector is 'boundless': all values are the same and there is no capacity (no total number of rows). I am also wondering what we should return here; maybe something like an UnimplementedException? cc: @cloud-fan

Contributor


we can know the number of rows from the context, right? numNulls should be either 0 or numRows.
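That suggestion can be sketched with a minimal, self-contained class (illustrative names, not the actual Spark class): since every row shares one stored value, the null count degenerates to all-or-nothing.

```java
// Sketch: a constant vector holds one shared value for numRows rows, so
// numNulls() is either 0 (value present) or numRows (value is null).
public class ConstantVectorNullCount {
    private final int numRows;
    private boolean isNull = false;

    public ConstantVectorNullCount(int numRows) {
        this.numRows = numRows;
    }

    public void setNull() { isNull = true; }

    public void setNotNull() { isNull = false; }

    public boolean hasNull() { return isNull; }

    // Either no row is null, or every row is.
    public int numNulls() { return isNull ? numRows : 0; }
}
```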

Comment on lines 209 to 211
public void putChild(ConstantColumnVector value) {
childData = value;
}
Contributor


why we need this?

Contributor Author


my bad, I need to update this: it is for the constant struct or array type. I also need to take an ordinal; will address this.

// while internally, the TimestampType is stored in microsecond
metadataRow.update(i, currentFile.modificationTime * 1000L)
}
private def updateMetadataData(): Unit = {
Contributor


nit: the name updateMetadataData is kind of hard to read. btw, why do we delete createMetadataColumnVector and mix vector and row together here? IMO it's better to keep the code paths for non-vectorized and vectorized scans separate if possible.

Contributor Author


makes sense!

@c21
Contributor

c21 commented Jan 5, 2022

nit: the scope of this change is kind of big IMO, as we are introducing a new public column vector. It may be better to file a new JIRA instead of a followup.

public ConstantColumnVector(int numRows, DataType type) {
super(type);
this.numRows = numRows;
if (type instanceof StructType) {
Contributor


why do we handle StructType twice?

Contributor Author


ooops


@Override
public int numNulls() {
return numRows;
Contributor


should be 0 if hasNull is false

*
* Capacity: The vector only stores one copy of the data, and acts as an unbounded vector
* (get from any row will return the same value)
*/
Contributor


Can we write a UT for this new vector?

Contributor Author


sure - working on it!

@cloud-fan
Contributor

let's open a new JIRA ticket, as adding a new kind of column vector is non-trivial.


/**
* Sets up the data type of this constant column vector.
* @param type
Contributor


nit: this seems useless.

Comment on lines 138 to 148
if (metadataColumns.isEmpty || currentFile == null) return
val path = new Path(currentFile.filePath)
metadataColumns.zipWithIndex.foreach { case (attr, i) =>
attr.name match {
case FILE_PATH => metadataRow.update(i, UTF8String.fromString(path.toString))
case FILE_NAME => metadataRow.update(i, UTF8String.fromString(path.getName))
case FILE_SIZE => metadataRow.update(i, currentFile.fileSize)
case FILE_MODIFICATION_TIME =>
// the modificationTime from the file is in millisecond,
// while internally, the TimestampType is stored in microsecond
metadataRow.update(i, currentFile.modificationTime * 1000L)
Contributor


nit: unnecessary change? I didn't feel the readability improved much after negating the if condition.

}
}
}
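As an aside on the snippet above: file modification times arrive in milliseconds, while Spark's TimestampType stores microseconds, hence the * 1000L. A minimal illustration (the helper name is ours, not Spark's):

```java
// Converts a file modification time (milliseconds since the epoch) to the
// microsecond precision that TimestampType values are stored in.
public class MtimeToMicros {
    public static long toMicros(long modificationTimeMs) {
        return modificationTimeMs * 1000L;
    }
}
```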

/**
* Create a writable column vector containing all required metadata columns
* Create a constant column vector containing all required metadata columns
Contributor


nit: shouldn't it be: Create an array of constant column vectors containing ...?

private ColumnarArray arrayData;
private ColumnarMap mapData;

private int numRows;
Contributor


wondering what's the point of storing numRows here? It seems that we don't use numRows at all, e.g. to check rowId in each getXXX method.

Contributor Author


the only place is numNulls from Wenchen's suggestion: #35068 (comment)

Member


Nit: make this final

@Yaohua628 Yaohua628 changed the title [SPARK-37770][SQL][FOLLOWUP] Implement the ConstantColumnVector for the metadata columns performance improvements [SPARK-37896][SQL] Implement a ConstantColumnVector and improve performance of the hidden file metadata Jan 13, 2022

private int numRows;

public ConstantColumnVector(int numRows, DataType type) {
Member


WritableColumnVector already has a way to set constant via setIsConstant. Have you looked at it?

Contributor


It seems setIsConstant only affects reset, but doesn't change how the data is stored.

Contributor


Yeah, I actually looked at it as well. It seems more code change would be needed if we want to utilize setIsConstant from WritableColumnVector. It'd be better to start with a separate new class ConstantColumnVector here.

Member


Makes sense. Perhaps we can remove setIsConstant later and replace its usage with ConstantColumnVector.
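The distinction discussed here can be sketched as follows (illustrative, not Spark's code): a writable vector marked constant still allocates one slot per row; the flag only changes what reset() does, whereas a dedicated constant vector would store one value in total.

```java
import java.util.Arrays;

// Sketch: a writable vector marked constant keeps its per-row storage; the
// isConstant flag only makes reset() keep the data instead of clearing it.
public class WritableIntVectorSketch {
    private final int[] data;          // per-row storage, even for constants
    private boolean isConstant = false;

    public WritableIntVectorSketch(int capacity) {
        this.data = new int[capacity];
    }

    public void setIsConstant() { isConstant = true; }

    public void putInt(int rowId, int value) { data[rowId] = value; }

    public int getInt(int rowId) { return data[rowId]; }

    // Constant vectors survive reset() with their data intact.
    public void reset() {
        if (!isConstant) {
            Arrays.fill(data, 0);
        }
    }
}
```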

Comment on lines 19 to 26
import org.apache.spark.sql.types.*;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarArray;
import org.apache.spark.sql.vectorized.ColumnarMap;
import org.apache.spark.unsafe.types.UTF8String;

import java.math.BigDecimal;
import java.math.BigInteger;
Member


This order of imports doesn't follow Spark style: Java imports should come before Spark's.

Comment on lines 56 to 57
// copy and modify from WritableColumnVector
// could also putChild by users
Member


This comment looks unnecessary.

return childData[ordinal];
}

public void putChild(int ordinal, ConstantColumnVector value) {
Member


suggestion: setChild. put methods are for putting values into the vector.

Member


+1. I'm also in favor of using setXXX for the APIs.

Contributor Author


ok sure, I was considering setXXX but decided to be consistent with WritableColumnVector. setXXX definitely makes sense and is more reasonable; changing to setXXX, thanks!

return childData[ordinal];
}

public void putChild(int ordinal, ConstantColumnVector value) {
Member


For a public API, it's better to add a comment.

return childData[ordinal];
}

public void putChild(int ordinal, ConstantColumnVector value) {
Member


Where do you use putChild? I can't find it.

Contributor Author


not anywhere, for now (just like many other set methods: setMap, setArray, etc.), but I added tests verifying those methods in ConstantColumnVectorSuite.

}

@Override
public ColumnarArray getArray(int rowId) {
Member


Not sure if this can work properly. Looking at ColumnarArray, in some cases an offset into the underlying ColumnVector is required, for instance in copy, toBooleanArray, etc.

public void putUtf8String(UTF8String value) {
putByteArray(value.getBytes());
}

Member

@sunchao sunchao Jan 14, 2022


maybe add putInterval (or setInterval) too.

Contributor Author


thanks for the suggestion! I just want to include minimal support in this PR (implementing all the necessary APIs inherited from ColumnVector), and will add follow-up PRs with more high-level APIs (setStruct, setCalendarInterval, set..., etc.). thanks!

@sunchao
Member

sunchao commented Jan 14, 2022

It'd be nice if we have tests accompanying this PR.

@Yaohua628
Contributor Author

It'd be nice if we have tests accompanying this PR.

thanks for the suggestions! working on the test and addressing the comments

@Yaohua628
Contributor Author

@cloud-fan @viirya @sunchao I added a test suite and addressed the comments. please take a look whenever you have a chance, thanks a lot!

Member

@sunchao sunchao left a comment


Thanks @Yaohua628 , LGTM with one nit.


@Override
public UTF8String getUTF8String(int rowId) {
return UTF8String.fromBytes(byteArrayData);
Member


nit: we can store a UTF8String too instead of creating a new object each time, which could be expensive if this is used on a hot path.

Contributor Author


makes sense, done!
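The caching suggestion can be sketched like this, with java.lang.String standing in for Spark's UTF8String (an assumption made so the example is self-contained): decode once when the value is set, so the hot-path getter allocates nothing.

```java
import java.nio.charset.StandardCharsets;

// Sketch: store the decoded string alongside the raw bytes at set time, so
// each get returns the same cached object instead of allocating a new one.
public class ConstantStringVector {
    private byte[] byteArrayData;
    private String cachedString;   // decoded once, reused on the hot path

    public void setUtf8(byte[] bytes) {
        this.byteArrayData = bytes;
        this.cachedString = new String(bytes, StandardCharsets.UTF_8);
    }

    public String getString(int rowId) {
        return cachedString;       // no per-call decoding or allocation
    }
}
```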

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3a45981 Jan 19, 2022
@Yaohua628
Contributor Author

Thanks to all!

HyukjinKwon added a commit that referenced this pull request Jan 19, 2022
### What changes were proposed in this pull request?
This PR fixes a missing import, a logical conflict between #35068 and #35055.

### Why are the changes needed?

To fix the compilation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI should test it out in compilation.

Closes #35245 from HyukjinKwon/SPARK-37896.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
cloud-fan pushed a commit that referenced this pull request Jan 26, 2022
### What changes were proposed in this pull request?

This PR is a followup of #35068 to fix the null pointer exception when calling `ConstantColumnVector.close()`. `ConstantColumnVector.childData` can be null for e.g. non-struct data type.

### Why are the changes needed?

Fix the exception when cleaning up column vector.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified unit test in `ConstantColumnVectorSuite.scala` to exercise the code path of `ConstantColumnVector.close()` for every tested data type. Without the fix, the unit test throws NPE.

Closes #35324 from c21/constant-fix.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
senthh pushed a commit to senthh/spark-1 that referenced this pull request Feb 3, 2022