[SPARK-36814][SQL] Make class ColumnarBatch extendable #34054

flyrain · 2021-09-21T01:28:50Z

What changes were proposed in this pull request?

Make class ColumnarBatch non-final so that it can be subclassed.

Why are the changes needed?

To support better vectorized reading across multiple data sources, ColumnarBatch needs to be extendable. For example, to support row-level deletes (apache/iceberg#3141) in Iceberg's vectorized read, we need to filter out deleted rows in a batch, which requires extending ColumnarBatch.
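As a rough illustration only (this is not Spark's or Iceberg's actual code), the sketch below shows the kind of subclassing a non-final batch class allows. `SimpleBatch` and `DeleteFilteredBatch` are hypothetical stand-ins invented for this example, and the `isDeleted` mask is an assumed input. The subclass hides deleted rows by mapping logical row ids onto the positions of the surviving rows.

```java
import java.util.Arrays;

// Hypothetical stand-in for a columnar batch; NOT Spark's ColumnarBatch API.
// The class is non-final, so a data source can subclass it.
class SimpleBatch {
    private final int[][] columns; // columns[c][r] holds the value of column c at row r
    private final int numRows;

    SimpleBatch(int[][] columns, int numRows) {
        this.columns = columns;
        this.numRows = numRows;
    }

    int numRows() {
        return numRows;
    }

    int getInt(int column, int rowId) {
        return columns[column][rowId];
    }
}

// Hypothetical subclass that exposes only the rows not marked as deleted.
class DeleteFilteredBatch extends SimpleBatch {
    // Physical positions of the surviving rows, in their original order.
    private final int[] livePositions;

    // Assumes isDeleted has at least numRows entries.
    DeleteFilteredBatch(int[][] columns, int numRows, boolean[] isDeleted) {
        super(columns, numRows);
        int[] live = new int[numRows];
        int count = 0;
        for (int r = 0; r < numRows; r++) {
            if (!isDeleted[r]) {
                live[count++] = r;
            }
        }
        this.livePositions = Arrays.copyOf(live, count);
    }

    @Override
    int numRows() {
        // Report only the surviving rows to the consumer.
        return livePositions.length;
    }

    @Override
    int getInt(int column, int rowId) {
        // Translate the logical row id into the physical position.
        return super.getInt(column, livePositions[rowId]);
    }
}
```

A consumer written against `SimpleBatch` can iterate from `0` to `numRows()` and never see the deleted rows. If the base class were final, this kind of filtering would have to be done by copying data into a new batch instead.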

Does this PR introduce any user-facing change?

No

How was this patch tested?

No test needed.

dongjoon-hyun · 2021-09-21T02:59:59Z

ok to test

dongjoon-hyun

Thank you for making a PR, @flyrain .

cc @aokolnychyi , @cloud-fan , @tgravescs , @sunchao , @viirya

dbtsai

LGTM. This adds the flexibility to extend ColumnarBatch with minimal change. Pending CI.

SparkQA · 2021-09-21T03:49:20Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47981/

SparkQA · 2021-09-21T04:32:57Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47981/

SparkQA · 2021-09-21T07:47:18Z

Test build #143470 has finished for PR 34054 at commit 3c7da3b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class ColumnarBatch implements AutoCloseable

AmplabJenkins · 2021-09-21T12:59:59Z

Can one of the admins verify this patch?

dbtsai · 2021-09-21T15:27:24Z

Merged into master. Thanks. Can you create a followup PR to add tests to show how you want to extend this so future PRs will not break it?

flyrain · 2021-09-21T16:12:17Z

Thanks all for review! Thanks @dbtsai for the commit. Will file a followup PR.

### What changes were proposed in this pull request?

A follow up of #34054. Three things changed:

1. Add a test for extendable class `ColumnarBatch`.
2. Make `ColumnarBatchRow` public.
3. Change private fields to protected fields.

### Why are the changes needed?

A follow up of #34054. Class `ColumnarBatch` needs to be extendable to support better vectorized reading in multiple data sources. For example, Iceberg needs to filter out deleted rows in a batch before Spark consumes it, to support row-level deletes (apache/iceberg#3141) in vectorized read.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new test is added.

Closes #34087 from flyrain/SPARK-36821.

Authored-by: Yufei Gu <yufei_gu@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>

Make class ColumnarBatch extendable.

3c7da3b

github-actions bot added the SQL label Sep 21, 2021

dongjoon-hyun reviewed Sep 21, 2021

huaxingao approved these changes Sep 21, 2021

dbtsai self-requested a review September 21, 2021 03:48

dbtsai approved these changes Sep 21, 2021

sunchao approved these changes Sep 21, 2021

viirya approved these changes Sep 21, 2021

dbtsai closed this in 688b95b Sep 21, 2021

flyrain mentioned this pull request Sep 24, 2021

[SPARK-36821][SQL] Make class ColumnarBatch extendable - addendum #34087

Closed

flyrain mentioned this pull request Sep 27, 2021

Support row-level delete in vectorized reader apache/iceberg#3141

Closed
