Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement optimize command #98

Closed
xianwill opened this issue Mar 1, 2021 · 7 comments
Closed

Implement optimize command #98

xianwill opened this issue Mar 1, 2021 · 7 comments
Labels
binding/rust Issues for the Rust crate enhancement New feature or request

Comments

@xianwill
Copy link
Collaborator

xianwill commented Mar 1, 2021

Databricks delta lake provides an optimize command that is incredibly useful for compacting small files (especially for files written by streaming jobs which are inevitably small). delta-rs should provide a similar optimize command.

I think a first pass could ignore the bin-pack and zorder features provided by Databricks and simply combine small files into large files while also setting dataChange: false in the delta log entry. An MVP might look like:

  • Query the delta log for file stats relevant to the current optimize run (based on predicate)
  • Group files that may be compacted based on common partition membership and less than optimal size (1 GB)
  • Combine files by looping through each input file and writing its record batches into the new output file.
  • Commit a new delta log entry with a single add action and a separate remove action for each compacted file. All committed add and remove actions must set the dataChange flag as false to prevent re-processing by stream consumers.

Files no longer relavant to the log may be cleaned up later by vacuum (see #97)

@houqp houqp added binding/rust Issues for the Rust crate enhancement New feature or request labels Mar 1, 2021
@MironAtHome
Copy link

While vacuum is a big step forward, it does not obviate need for optimize. These two commands have different purpose, and ( looking further out ) optimize will eventually become part of either batch or background worker, continuously operating with lower priority, as is common to all relational engines.

@MrPowers
Copy link
Collaborator

MrPowers commented Jun 1, 2022

@xianwill - The steps you've outlined sound good to me and think small file compaction via OPTIMIZE would be a great addition to this library. Also agree that the first pass should ignore z order indexing. Do you know if this work is planning on getting prioritized anytime soon?

@wjones127
Copy link
Collaborator

@MrPowers this is actively being worked on. Did you see #607?

@MrPowers
Copy link
Collaborator

MrPowers commented Jun 1, 2022

@wjones127 - oh, wow, looks like amazing progress is being made, so exciting!

@MrPowers
Copy link
Collaborator

MrPowers commented Jun 4, 2022

@wjones127 - looks like #607 was merged!! Will it be relatively easy to expose this functionality via the Python bindings?

@wjones127
Copy link
Collaborator

@MrPowers Yeah shouldn't be too difficult. I created #622 to track that.

@wjones127
Copy link
Collaborator

Resolved by #607.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants