[Feature Request][Spark] Remove dropped columns from Parquet files in REORG TABLE (PURGE) #3228

Closed
2 of 8 tasks
johanl-db opened this issue Jun 6, 2024 · 2 comments · Fixed by #3371
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@johanl-db
Collaborator

Feature request

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Overview

REORG TABLE <table_name> (PURGE) removes soft-deleted data from a table by merging existing deletion vectors with the data files. It can be improved by also finding and removing dropped columns that are still present in the physical data files.
Columns can be dropped using the column mapping feature, which powers the ALTER TABLE <table_name> DROP COLUMN <column_name> command.

Motivation

This will allow reducing storage space when columns are dropped from a table and may also slightly increase performance.

Further details

This only requires updating the logic that identifies files to rewrite in DeltaReorgTableCommand. Dropped columns are automatically removed by the underlying OPTIMIZE run for any file that passes this filter, because the dropped columns aren't part of the read set and are ignored when rewriting files.

This could reuse the same mechanism used by type widening to rewrite files that contain a different type:
https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/commands/DeltaReorgTableCommand.scala#L138
Read the Parquet footers and identify the files that have a column that's not present in the table schema.
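
To make the footer check concrete, here is a minimal sketch of the filter logic in Python. It is not the Delta implementation: `FileFooter` and `files_with_dropped_columns` are hypothetical names, footers are modelled as plain sets of top-level physical column names rather than being read from real Parquet files, and nested columns are ignored.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class FileFooter:
    """Hypothetical stand-in for the column names found in one Parquet footer."""
    path: str
    physical_columns: FrozenSet[str]


def files_with_dropped_columns(
    footers: Iterable[FileFooter],
    table_physical_columns: Iterable[str],
) -> List[str]:
    """Return the paths of files whose footer lists a column missing from the table schema."""
    current = frozenset(table_physical_columns)
    # A file needs a rewrite only if it carries a column the schema no longer has.
    # Files that merely lack a newer column are a subset of the schema and are skipped.
    return [f.path for f in footers if not f.physical_columns <= current]


# Assumed example data: "col-c3" was dropped after part-0001 was written.
footers = [
    FileFooter("part-0001.parquet", frozenset({"col-a1", "col-b2", "col-c3"})),
    FileFooter("part-0002.parquet", frozenset({"col-a1", "col-b2"})),
]
print(files_with_dropped_columns(footers, {"col-a1", "col-b2"}))
# prints ['part-0001.parquet']
```

The comparison uses physical column names because, with column mapping, a dropped column stays in existing files under its physical name even though it is gone from the logical schema.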

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.

This is a good first task for someone not familiar with Delta. I'm happy to give guidance on how to implement this and to review PRs.

@johanl-db johanl-db added enhancement New feature or request good first issue Good for newcomers labels Jun 6, 2024
@xzhseh
Contributor

xzhseh commented Jul 11, 2024

Hi @johanl-db, I'd like to work on this issue, could you kindly assign it to me? Thanks!

cc @allisonport-db.

@johanl-db
Collaborator Author

@xzhseh Done, let me know if I can help with anything. You can ping me if you need a review.
