[SPARK-39634][SQL] Allow file splitting in combination with row index generation #40728

Closed
vkorukanti wants to merge 2 commits into apache:master from vkorukanti:SPARK-39634

@vkorukanti
Member

@vkorukanti vkorukanti commented Apr 10, 2023

What changes were proposed in this pull request?

  • Parquet 1.13.1 includes the fix for PARQUET-2161, which allows Parquet files to be split when the row index metadata column is selected. File splitting is currently disabled in that case; this PR enables it.

Why are the changes needed?

Splitting Parquet files allows better parallelization when the row index metadata column is selected.
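
For readers unfamiliar with why splitting and row indexes interact, here is a minimal conceptual sketch in plain Python. It is not Spark's or Parquet's implementation. It assumes a row group is assigned to the split that contains its byte midpoint, and that a row's index is its position in the whole file. `RowGroup`, `first_row_indexes`, and `row_indexes_for_split` are hypothetical names used only for illustration. The point is that each task must offset its row indexes by the row counts of all preceding row groups in the file, including those it does not read.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass(frozen=True)
class RowGroup:
    start_byte: int  # byte offset of the row group within the file
    num_bytes: int   # size of the row group in bytes
    num_rows: int    # number of rows stored in the row group


def first_row_indexes(row_groups):
    """File-global index of the first row of each row group (prefix sum of row counts)."""
    counts = [rg.num_rows for rg in row_groups]
    return [0] + list(accumulate(counts))[:-1]


def row_indexes_for_split(row_groups, split_start, split_end):
    """Row indexes produced by a task that reads bytes [split_start, split_end).

    Assumption: a row group belongs to the split containing its midpoint,
    so every row group is read by exactly one task.
    """
    starts = first_row_indexes(row_groups)
    indexes = []
    for rg, first in zip(row_groups, starts):
        midpoint = rg.start_byte + rg.num_bytes // 2
        if split_start <= midpoint < split_end:
            # Offsets come from the whole file, not from the split alone.
            indexes.extend(range(first, first + rg.num_rows))
    return indexes


# Hypothetical file with three row groups, read as two splits.
row_groups = [RowGroup(0, 100, 3), RowGroup(100, 100, 2), RowGroup(200, 100, 4)]
split_a = row_indexes_for_split(row_groups, 0, 150)
split_b = row_indexes_for_split(row_groups, 150, 300)
print(split_a)  # [0, 1, 2]
print(split_b)  # [3, 4, 5, 6, 7, 8]
```

In this sketch the two splits together yield each row index exactly once, which is the property a split-aware reader needs to preserve when the row index column is selected.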

Does this PR introduce any user-facing change?

No

How was this patch tested?

Uncommented the existing unit tests.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 20, 2023
@yihua

yihua commented Jul 20, 2023

Hi @vkorukanti, this is an important performance improvement for using the row index from Parquet. Is the PR targeted for Spark 3.5?

@vkorukanti vkorukanti changed the title from "[WIP][SPARK-39634][SQL] Allow file splitting in combination with row index generation" to "[SPARK-39634][SQL] Allow file splitting in combination with row index generation" Jul 20, 2023
@github-actions github-actions bot removed the BUILD label Jul 20, 2023
@vkorukanti
Member Author

@yihua Rebased the PR. If possible, it would be good to include this in the Spark 3.5 release.

Contributor

@ala ala left a comment

Thanks!

@yihua

yihua commented Jul 21, 2023

@vkorukanti @ala Great, thanks!

@github-actions github-actions bot closed this Jul 21, 2023
@cloud-fan cloud-fan removed the Stale label Jul 21, 2023
@cloud-fan cloud-fan reopened this Jul 21, 2023
@cloud-fan
Contributor

thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in 679ea56 Jul 21, 2023
cloud-fan pushed a commit that referenced this pull request Jul 21, 2023
… generation

### What changes were proposed in this pull request?
- Parquet `1.13.1` includes the fix for [PARQUET-2161](https://issues.apache.org/jira/browse/PARQUET-2161), which allows Parquet files to be split when the row index metadata column is selected. File splitting is currently disabled in that case; this PR enables it.

### Why are the changes needed?
Splitting Parquet files allows better parallelization when the row index metadata column is selected.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Uncommented the existing unit tests.

Closes #40728 from vkorukanti/SPARK-39634.

Authored-by: Venki Korukanti <venki.korukanti@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 679ea56)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
4 participants
