Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-36814][SQL] Make class ColumnarBatch extendable #34054

Closed
wants to merge 1 commit into from

Conversation

flyrain
Copy link

@flyrain flyrain commented Sep 21, 2021

What changes were proposed in this pull request?

Change class ColumnarBatch to a non-final class

Why are the changes needed?

To support better vectorized reading in multiple data source, ColumnarBatch need to be extendable. For example, To support row-level delete( apache/iceberg#3141) in Iceberg's vectorized read, we need to filter out deleted rows in a batch, which requires ColumnarBatch to be extendable.

Does this PR introduce any user-facing change?

No

How was this patch tested?

No test needed.

@github-actions github-actions bot added the SQL label Sep 21, 2021
@dongjoon-hyun
Copy link
Member

ok to test

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making a PR, @flyrain .

cc @aokolnychyi , @cloud-fan , @tgravescs , @sunchao , @viirya

@dbtsai dbtsai self-requested a review September 21, 2021 03:48
Copy link
Member

@dbtsai dbtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This adds the flexibility to extend ColumnarBatch with minimal change. Pending CI.

@SparkQA
Copy link

SparkQA commented Sep 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47981/

@SparkQA
Copy link

SparkQA commented Sep 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47981/

@SparkQA
Copy link

SparkQA commented Sep 21, 2021

Test build #143470 has finished for PR 34054 at commit 3c7da3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class ColumnarBatch implements AutoCloseable

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@dbtsai dbtsai closed this in 688b95b Sep 21, 2021
@dbtsai
Copy link
Member

dbtsai commented Sep 21, 2021

Merged into master. Thanks. Can you create a followup PR to add tests to show how you want to extend this so future PRs will not break it?

@flyrain
Copy link
Author

flyrain commented Sep 21, 2021

Thanks all for review! Thanks @dbtsai for the commit. Will file a followup PR.

dbtsai pushed a commit that referenced this pull request Sep 27, 2021
### What changes were proposed in this pull request?
A follow up of #34054. Three things changed:
1. Add a test for extendable class `ColumnarBatch`
2. Make `ColumnarBatchRow` public.
3. Change private fields to protected fields.

### Why are the changes needed?
A follow up of #34054. Class ColumnarBatch need to be extendable to support better vectorized reading in multiple data sources. For example, Iceberg needs to filter out deleted rows in a batch before Spark consumes it, to support row-level delete( apache/iceberg#3141) in vectorized read.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A new test is added

Closes #34087 from flyrain/SPARK-36821.

Authored-by: Yufei Gu <yufei_gu@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
a0x8o added a commit to a0x8o/spark that referenced this pull request Sep 27, 2021
### What changes were proposed in this pull request?
A follow up of apache/spark#34054. Three things changed:
1. Add a test for extendable class `ColumnarBatch`
2. Make `ColumnarBatchRow` public.
3. Change private fields to protected fields.

### Why are the changes needed?
A follow up of apache/spark#34054. Class ColumnarBatch need to be extendable to support better vectorized reading in multiple data sources. For example, Iceberg needs to filter out deleted rows in a batch before Spark consumes it, to support row-level delete( apache/iceberg#3141) in vectorized read.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A new test is added

Closes #34087 from flyrain/SPARK-36821.

Authored-by: Yufei Gu <yufei_gu@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
sunchao pushed a commit to sunchao/spark that referenced this pull request Dec 8, 2021
### What changes were proposed in this pull request?

A follow up of apache#34054. Three things changed:
1. Add a test for extendable class `ColumnarBatch`
2. Make `ColumnarBatchRow` public.
3. Change private fields to protected fields.

### Why are the changes needed?

A follow up of apache#34054. Class ColumnarBatch need to be extendable to support better vectorized reading in multiple data sources. For example, Iceberg needs to filter out deleted rows in a batch before Spark consumes it, to support row-level delete( apache/iceberg#3141) in vectorized read.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new test is added
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants