Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

:WIP: Refactor vacuum to commit to log and add builder #669

Closed
wants to merge 11 commits into from

Conversation

Blajda
Copy link
Collaborator

@Blajda Blajda commented Jul 1, 2022

Description

Pull the vacuum operation out of the Delta table and commit to the log various metrics.

The structure of this code is similar to Optimize where the operation uses a struct which has a plan and execute component.
It also introduces a builder with reasonable defaults and helps ensure that future enhancements do not break the call API.

The current call API requires the user to specify the retention interval as hours. I have changed this to use Chrono's Duration since I find it more ergonomic then implement time conversions our selves. The Databricks implementation allows the user to pass a float value for periods smaller than a day which brings us in line.

Related Issue(s)

I want to close these related issues. Let me know if there's anything else you would like to see
#97
#667

TODO

  • Update python bindings
  • Test that an actual run deletes stuff
  • Validate log update scenario and data

@mrk-its
Copy link
Contributor

mrk-its commented Jul 3, 2022

Can you implement this commit to the log feature as optional (configurable by builer)?
Asking because I'm planning to use delta-rs as high-performant vacuum replacement for official VACUUM - but currently delta-rs is incompatible with S3 Multi Cluster setup (with DynamoDB, as described here) Correct me if I'm wrong, but it seems vacuum-only without writing to delta log should be safe.

@mrk-its
Copy link
Contributor

mrk-its commented Jul 3, 2022

@Blajda it was fast, thanks!

@Blajda
Copy link
Collaborator Author

Blajda commented Jul 3, 2022

Hi @mrk-its,
I've made the commits during vacuum configurable with the builder as suggested. You are correct that vacuums can be performed without appending to the log.

You might want to hold off on using this implementation of vacuum since the new tests show multiple issues with deleting nested partitions. It may also have integration issues with aws since we have some for Azure.

@mrk-its
Copy link
Contributor

mrk-its commented Jul 3, 2022

Sure, I'm going to be super careful, for now I'm simply running it in dry-run mode.

You might want to hold off on using this implementation of vacuum since the new tests show multiple issues with deleting nested partitions. It may also have integration issues with aws since we have some for Azure.

Can you point me to the test build with these issues for nested partitions? I probably can help with tests on AWS

@Blajda
Copy link
Collaborator Author

Blajda commented Jul 4, 2022

This PR contains new vacuum tests for various scenarios in tests/vacuum_error.rs. Currently the tests that fail with local storage are marked as ignore.

I might split this PR since it also contains new test code to easily test with different backends without having to worry about the clean up and setup. If you look into tests/common you can now specify DELTA_RS_TEST_BACKEND with either LOCALFS or AZURE_GEN2 and then setup the corresponding environment variables for the backend. If the main maintainers are okay with this approach it would be great to add some S3 support.

The main issues with vacuum right now is the list_obj for each backend. For the localfs it only lists the content of the root directory so it misses files to be deleted in partition directories.

@Blajda
Copy link
Collaborator Author

Blajda commented Jul 4, 2022

Closing this to break it down into multiple PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants