[Feature Request] Deletion Vectors to speed up DML operations #1367
Comments
## Description
- This PR makes the concrete changes proposed in #1367 to the Delta protocol specification. For details of what this proposal entails, see that issue.
- In addition, this PR makes some clarifying changes to the wording of the spec in various places, many of which were necessary to correctly reflect concepts introduced by the proposal (e.g., _logical files_, exact column stat semantics).

N/A (document-only).

## Does this PR introduce _any_ user-facing changes?
No.

Closes #1372

Signed-off-by: Scott Sandre <scott.sandre@databricks.com>
GitOrigin-RevId: 3de4c4248db7a6ae3052ea65ccd0d8ebe741c8f2
Update: With #1372 merged, the protocol spec part of this is complete. Actual implementation work is still pending.
Update: Issue #1485 has been created to track support for reading Delta tables with DVs.
The design doc mentions the MERGE command. Will it be changed to use DVs? The 2.3.0 RC release notes mention read support only. Is the plan to release MERGE/UPDATE/DELETE support in a future release? Thanks.
@felipepessoto Currently, the focus is on adding DV support to DELETE (see #1591) and on making sure that all DML commands can write correctly to tables that already have DVs, even if they do not produce new DVs themselves. We expect this to roll out over the next couple of releases. Once that is done, we would like to monitor the feedback (read: what issues we are getting) before we settle on a final design for UPDATE and MERGE. So we don't have any particular timeline in mind for UPDATE and MERGE, but the intention is that they will be supported eventually.
Marking this as finally done, as all three commands (DELETE, UPDATE, MERGE) now use deletion vectors if enabled on the table.
This is the umbrella issue for Deletion Vector support.
Motivation
During Delta DML operations, both the semantics of cloud file systems and our guarantees about transaction history in Delta prevent us from performing any in-place updates to files. When updates are small compared to the total file size (e.g., a single row per file), this leads to an enormous performance burden: the entire file has to be rewritten for a small change (aka copy-on-write). A large fraction of DML statements that update anything update only a very small percentage of the rows in the files they touch. Deletion Vectors (DVs) are a mechanism for storing such small updates more efficiently, by avoiding the expensive rewrite of the unmodified data.
Proposal
The solution proposed in this issue is to augment the Parquet files of a Delta table with separate “Deletion Vector” (DV) files, instead of rewriting them immediately. A DV is an optimized bitmap that represents a set of rows (of a particular Parquet file) that are no longer valid (“deleted”) in a particular version of a Delta table. The n-th row of a Parquet file has row index n, and is associated with the n-th bit of the DV.
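To make the row-index-to-bit mapping concrete, here is a minimal sketch in Python. It uses an arbitrary-precision integer as an uncompressed bitmap and assumes zero-based row indexes. The class and method names are hypothetical, and this is not the optimized on-disk DV format, which is defined by the protocol spec rather than by this example.

```python
class DeletionVector:
    """Toy bitmap: bit n is set when row n of one data file is deleted."""

    def __init__(self) -> None:
        # Python ints are arbitrary precision, so this acts as a growable bitset.
        self._bits = 0

    def mark_deleted(self, row_index: int) -> None:
        """Set the bit for the given (assumed zero-based) row index."""
        if row_index < 0:
            raise ValueError("row index must be non-negative")
        self._bits |= 1 << row_index

    def is_deleted(self, row_index: int) -> bool:
        """Return True if the bit for this row index is set."""
        return (self._bits >> row_index) & 1 == 1

    def cardinality(self) -> int:
        """Number of rows marked as deleted."""
        return bin(self._bits).count("1")


dv = DeletionVector()
dv.mark_deleted(2)
dv.mark_deleted(5)
```

Marking a row as deleted only flips one bit; the data file itself is left untouched, which is the point of the proposal.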
When reading a Delta table at a version that contains DVs, care must be taken to “ignore” (filter out) the invalid rows during scans. As reading files with DVs is going to be somewhat slower than reading a fully compacted file without DVs, the mechanism itself is a tradeoff between write and read performance.
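The read-side filtering can be sketched as follows. This is an illustrative Python example, not the actual reader implementation: `scan_with_dv` is a hypothetical helper, the deleted rows are modeled as a plain set of zero-based row indexes, and the file contents are a small in-memory list.

```python
from typing import Iterable, Iterator, Set, TypeVar

Row = TypeVar("Row")


def scan_with_dv(rows: Iterable[Row], deleted: Set[int]) -> Iterator[Row]:
    """Yield only the rows whose position in the file is not marked deleted."""
    for row_index, row in enumerate(rows):
        if row_index not in deleted:
            yield row


file_rows = ["a", "b", "c", "d"]
# Rows 1 and 3 are marked as deleted in this table version.
visible = list(scan_with_dv(file_rows, {1, 3}))
```

The per-row membership check is the extra read cost mentioned above: every scan of a file with a DV pays for it, whereas a file without a DV can be read as-is.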
Further details
The detailed proposal and the required protocol changes are sketched out in this doc:
https://docs.google.com/document/d/1lv35ZPfioopBbzQ7zT82LOev7qV7x4YNLkMr2-L5E_M/edit?usp=sharing