Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-36821][SQL] Make class ColumnarBatch extendable - addendum #34087

Closed
wants to merge 4 commits into from

Conversation

flyrain
Copy link

@flyrain flyrain commented Sep 24, 2021

What changes were proposed in this pull request?

A follow up of #34054. Three things changed:

  1. Add a test for extendable class ColumnarBatch
  2. Make ColumnarBatchRow public.
  3. Change private fields to protected fields.

Why are the changes needed?

A follow up of #34054. Class ColumnarBatch need to be extendable to support better vectorized reading in multiple data sources. For example, Iceberg needs to filter out deleted rows in a batch before Spark consumes it, to support row-level delete( apache/iceberg#3141) in vectorized read.

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new test is added

@github-actions github-actions bot added the SQL label Sep 24, 2021
@dbtsai
Copy link
Member

dbtsai commented Sep 24, 2021

add to whitelist

@dbtsai
Copy link
Member

dbtsai commented Sep 24, 2021

okay to test

@dbtsai
Copy link
Member

dbtsai commented Sep 24, 2021

/**
* This class wraps an array of {@link ColumnVector} and provides a row view.
*/
public final class ColumnarBatchRow extends InternalRow {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add @Evolving too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change

@SparkQA
Copy link

SparkQA commented Sep 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48107/

@SparkQA
Copy link

SparkQA commented Sep 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48107/

@SparkQA
Copy link

SparkQA commented Sep 24, 2021

Test build #143595 has finished for PR 34087 at commit e626f9e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao huaxingao changed the title [Spark 36821][SQL] Make class ColumnarBatch extendable - addendum [Spark-36821][SQL] Make class ColumnarBatch extendable - addendum Sep 24, 2021
@dongjoon-hyun
Copy link
Member

cc @cloud-fan , @tgravescs

@viirya viirya changed the title [Spark-36821][SQL] Make class ColumnarBatch extendable - addendum [SPARK-36821][SQL] Make class ColumnarBatch extendable - addendum Sep 24, 2021
@flyrain
Copy link
Author

flyrain commented Sep 24, 2021

cc @aokolnychyi

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This basically looks OK to me, although I'm thinking whether it's better to make ColumnarBatch an abstract class:

public abstract class AbstractColumnarBatch implements AutoCloseable, Iterable<InternalRow> {
  protected int numRows;
  public abstract ColumnVector[] columns();

  @Override
  public void close() {
    for (ColumnVector c: columns()) {
      c.close();
    }
  }

  public void setNumRows(int numRows) {
    this.numRows = numRows;
  }

  public int numRows() { return numRows; }
  public int numCols() { return columns().length; }
  public ColumnVector column(int ordinal) { return columns()[ordinal]; }
}

This way we'll give data source implementors more flexibilities for the internal stuff and we don't have to be tied with ColumnarBatchRow and expose it also.

@flyrain
Copy link
Author

flyrain commented Sep 24, 2021

This basically looks OK to me, although I'm thinking whether it's better to make ColumnarBatch an abstract class:
This way we'll give data source implementors more flexibilities for the internal stuff and we don't have to be tied with ColumnarBatchRow and expose it also.

@sunchao, I'm OK with an abstract class. We may still expose ColumnarBatchRow since any class extending the abstract class still needs it. Of course, they can implement their own version, but the logic'd most likely be same. One question for your code snippet, we should add the public method rowIterator, right? It is the major interface of the ColumnarBatch.

@sunchao
Copy link
Member

sunchao commented Sep 24, 2021

We may still expose ColumnarBatchRow since any class extending the abstract class still needs it

I'm thinking whether they can just use ColumnarRow instead.

One question for your code snippet, we should add the public method rowIterator, right? It is the major interface of the ColumnarBatch.

Yea. We can have the class implement Iterable<InternalRow> which is a more standard Java API. We'll need to replace all the places that use rowInterator with iterator though.

@dbtsai
Copy link
Member

dbtsai commented Sep 24, 2021

+1 on using Iterable<InternalRow>

@sunchao
Copy link
Member

sunchao commented Sep 24, 2021

Actually it may not be so easy to use ColumnarRow, so I'm fine with exposing ColumnarBatchRow here. Eventually we might want to combine them since they look so similar ..

@SparkQA
Copy link

SparkQA commented Sep 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48126/

@SparkQA
Copy link

SparkQA commented Sep 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48126/

@SparkQA
Copy link

SparkQA commented Sep 25, 2021

Test build #143614 has finished for PR 34087 at commit 334ce1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* This class wraps an array of {@link ColumnVector} and provides a row view.
*/
@Evolving
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this class is just copied from the original code file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Developerapi as well? seems like it should match the ColumnarBatch

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan, exactly.
@tgravescs, I'm OK with both. Will make it a Developer api to be consistent.

@SparkQA
Copy link

SparkQA commented Sep 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48164/

@SparkQA
Copy link

SparkQA commented Sep 27, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48164/

@dbtsai dbtsai self-requested a review September 27, 2021 18:25
Copy link
Member

@dbtsai dbtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too.

@dbtsai dbtsai closed this in d03999a Sep 27, 2021
@dbtsai
Copy link
Member

dbtsai commented Sep 27, 2021

Merged into master. Thanks!

@SparkQA
Copy link

SparkQA commented Sep 27, 2021

Test build #143651 has finished for PR 34087 at commit eac1012.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants