[SPARK-36821][SQL] Make class ColumnarBatch extendable - addendum #34087
add to whitelist
okay to test
/**
 * This class wraps an array of {@link ColumnVector} and provides a row view.
 */
public final class ColumnarBatchRow extends InternalRow {
Shall we add @Evolving too?
Made the change
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #143595 has finished for PR 34087 at commit
|
cc @cloud-fan, @tgravescs
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala
sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java
cc @aokolnychyi
This basically looks OK to me, although I wonder whether it would be better to make ColumnarBatch an abstract class:
public abstract class AbstractColumnarBatch implements AutoCloseable, Iterable<InternalRow> {
  protected int numRows;

  public abstract ColumnVector[] columns();

  @Override
  public void close() {
    for (ColumnVector c : columns()) {
      c.close();
    }
  }

  public void setNumRows(int numRows) {
    this.numRows = numRows;
  }

  public int numRows() { return numRows; }

  public int numCols() { return columns().length; }

  public ColumnVector column(int ordinal) { return columns()[ordinal]; }
}
This way we give data source implementors more flexibility for the internals, and we don't have to be tied to ColumnarBatchRow or expose it.
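To illustrate the proposal above, here is a minimal sketch of how a data source might extend such an abstract class and supply its own row view. ColumnVector and InternalRow below are tiny stand-in interfaces so the example compiles without Spark; they are assumptions and not Spark's real types. IntArrayVector and SimpleBatch are hypothetical names.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Stand-ins for Spark's types so the sketch compiles on its own (assumptions, not Spark's API).
interface ColumnVector extends AutoCloseable {
    int getInt(int rowId);
    @Override void close();
}

interface InternalRow {
    int getInt(int ordinal);
}

// The abstract class from the comment above, reproduced against the stand-in types.
abstract class AbstractColumnarBatch implements AutoCloseable, Iterable<InternalRow> {
    protected int numRows;
    public abstract ColumnVector[] columns();
    @Override
    public void close() { for (ColumnVector c : columns()) c.close(); }
    public void setNumRows(int numRows) { this.numRows = numRows; }
    public int numRows() { return numRows; }
    public int numCols() { return columns().length; }
    public ColumnVector column(int ordinal) { return columns()[ordinal]; }
}

// Hypothetical int vector backed by an array; records whether it was closed.
final class IntArrayVector implements ColumnVector {
    private final int[] values;
    boolean closed = false;
    IntArrayVector(int... values) { this.values = values; }
    @Override public int getInt(int rowId) { return values[rowId]; }
    @Override public void close() { closed = true; }
}

// Hypothetical data source batch that provides its own row view
// instead of depending on ColumnarBatchRow.
final class SimpleBatch extends AbstractColumnarBatch {
    private final ColumnVector[] columns;

    SimpleBatch(int numRows, ColumnVector... columns) {
        this.columns = columns;
        setNumRows(numRows);
    }

    @Override public ColumnVector[] columns() { return columns; }

    @Override public Iterator<InternalRow> iterator() {
        return new Iterator<InternalRow>() {
            private int rowId = 0;
            @Override public boolean hasNext() { return rowId < numRows; }
            @Override public InternalRow next() {
                if (rowId >= numRows) throw new NoSuchElementException();
                final int current = rowId++;
                // Each row is a lightweight view over the column vectors.
                return ordinal -> columns[ordinal].getInt(current);
            }
        };
    }
}
```

In this sketch the subclass owns both the column storage and the row view, which is the flexibility the comment argues for.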
@sunchao, I'm OK with an abstract class. We may still expose |
I'm thinking whether they can just use
Yea. We can have the class implement |
+1 on using |
Actually it may not be so easy to use |
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #143614 has finished for PR 34087 at commit
|
/**
 * This class wraps an array of {@link ColumnVector} and provides a row view.
 */
@Evolving
I assume this class is just copied from the original code file.
Should this be DeveloperApi as well? It seems like it should match ColumnarBatch.
@cloud-fan, exactly.
@tgravescs, I'm OK with either. Will make it a DeveloperApi to be consistent.
Kubernetes integration test starting
Kubernetes integration test status failure
LGTM too.
Merged into master. Thanks!
Test build #143651 has finished for PR 34087 at commit
|
What changes were proposed in this pull request?
A follow up of #34054. Three things changed:
ColumnarBatch
ColumnarBatchRow
public.
Why are the changes needed?
A follow-up of #34054. Class ColumnarBatch needs to be extendable to support better vectorized reading in multiple data sources. For example, Iceberg needs to filter out deleted rows in a batch before Spark consumes it, in order to support row-level deletes (apache/iceberg#3141) in vectorized reads.
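As a rough illustration of the use case described above, the sketch below shows one possible way to hide deleted rows from a consumer: build a mapping from visible row positions to the original row ids, then read columns through that mapping. This is only an assumption about how such filtering could work, not the actual Iceberg or Spark implementation. DeletedRowFilter, liveRowIds, and project are hypothetical names, and plain arrays stand in for column vectors.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark or Iceberg.
final class DeletedRowFilter {
    private DeletedRowFilter() {}

    // Returns the original positions of the rows that are not deleted, in order.
    static int[] liveRowIds(boolean[] isDeleted) {
        int[] mapping = new int[isDeleted.length];
        int live = 0;
        for (int rowId = 0; rowId < isDeleted.length; rowId++) {
            if (!isDeleted[rowId]) {
                mapping[live++] = rowId;
            }
        }
        // Trim the mapping to the number of surviving rows.
        return Arrays.copyOf(mapping, live);
    }

    // Reads a column through the mapping, so deleted rows are never visible.
    static int[] project(int[] column, int[] liveRowIds) {
        int[] out = new int[liveRowIds.length];
        for (int i = 0; i < liveRowIds.length; i++) {
            out[i] = column[liveRowIds[i]];
        }
        return out;
    }
}
```

A batch subclass could apply such a mapping inside its row view and report the number of surviving rows as its row count, which is the kind of customization an extendable ColumnarBatch allows.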
Does this PR introduce any user-facing change?
No
How was this patch tested?
A new test is added.