[Feature Request] Deletion Vectors to speed up DML operations #1367

Closed
tdas opened this issue Sep 1, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@tdas
Contributor

tdas commented Sep 1, 2022

This is the umbrella issue for Deletion Vector support.

Motivation

During Delta DML operations, both the semantics of cloud file systems and our guarantees about transaction history in Delta prevent us from performing any in-place updates to files. When updates are small compared to the total file size (e.g., a single row per file), this imposes an enormous performance burden: the entire file must be rewritten for a small change (a.k.a. copy-on-write). A large fraction of DML statements that update anything update only a very small percentage of the rows in the files they touch. Deletion Vectors (DVs) are a mechanism for storing such small updates more efficiently by avoiding the expensive rewrite of the unmodified data.

Proposal

The solution proposed in this issue is to augment the Parquet files of a Delta table with separate “Deletion Vector” (DV) files, instead of rewriting the Parquet files immediately. A DV is an optimized bitmap that represents the set of rows (of a particular Parquet file) that are no longer valid (“deleted”) in a particular version of a Delta table. The n-th row of a Parquet file has row index n, and is associated with the n-th bit of the DV.
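To make the row-index-to-bit mapping concrete, here is a minimal sketch in Python. It is an illustration only: the class name `DeletionVector` and its methods are hypothetical, row indexes are assumed to be zero-based, and a plain Python integer stands in for the optimized bitmap format that the proposal actually has in mind.

```python
class DeletionVector:
    """Toy deletion vector: bit n is set when row index n is deleted.

    Hypothetical helper for illustration; not the Delta implementation.
    """

    def __init__(self) -> None:
        # An arbitrary-precision int used as an uncompressed bitmap.
        self._bits = 0

    def mark_deleted(self, row_index: int) -> None:
        if row_index < 0:
            raise ValueError("row_index must be non-negative")
        self._bits |= 1 << row_index

    def is_deleted(self, row_index: int) -> bool:
        return (self._bits >> row_index) & 1 == 1

    def cardinality(self) -> int:
        # Number of rows marked as deleted.
        return bin(self._bits).count("1")


# Mark rows 2 and 5 of some (assumed) Parquet file as deleted.
dv = DeletionVector()
dv.mark_deleted(2)
dv.mark_deleted(5)
```

Marking a row as deleted only flips one bit in the side structure; the data file itself is never touched, which is the point of the proposal.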

When reading a Delta table at a version that contains DVs, care must be taken to “ignore” (filter out) the invalid rows during scans. As reading files with DVs is going to be somewhat slower than reading a fully compacted file without DVs, the mechanism itself is a tradeoff between write and read performance.
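The read-side filtering can be sketched as follows. This is a simplified illustration, not how the actual readers are implemented: the function name `scan_with_dv` is hypothetical, the file's rows are modeled as an in-memory list, and the DV is modeled as a plain set of zero-based row indexes.

```python
from typing import Iterable, Iterator, Set, TypeVar

T = TypeVar("T")


def scan_with_dv(rows: Iterable[T], deleted: Set[int]) -> Iterator[T]:
    """Yield only the rows whose row index is not marked in the DV.

    `deleted` stands in for the deletion vector of one data file.
    """
    for row_index, row in enumerate(rows):
        if row_index not in deleted:
            yield row


# Assumed contents of one data file, with rows 1 and 3 marked as deleted.
file_rows = ["a", "b", "c", "d"]
dv_for_file = {1, 3}
visible = list(scan_with_dv(file_rows, dv_for_file))
```

The per-row membership check is the extra read cost mentioned above: every scan of a file with a DV pays it, whereas a rewritten file would not.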

Further details

The detailed proposal and the required protocol changes are sketched out in this doc.

https://docs.google.com/document/d/1lv35ZPfioopBbzQ7zT82LOev7qV7x4YNLkMr2-L5E_M/edit?usp=sharing

@tdas tdas added the enhancement New feature or request label Sep 1, 2022
@tdas tdas pinned this issue Sep 1, 2022
zsxwing pushed a commit that referenced this issue Oct 19, 2022
## Description

- This PR makes the concrete changes proposed in #1367 to the Delta protocol specification. For details of what this proposal entails, see that issue.
- In addition, this PR makes some clarifying changes to the wording of the spec in various places, many of which were necessary to correctly reflect concepts introduced by the proposal (e.g., _logical files_, exact column stat semantics).

N/A (document-only).

## Does this PR introduce _any_ user-facing changes?

No.

Closes #1372

Signed-off-by: Scott Sandre <scott.sandre@databricks.com>
GitOrigin-RevId: 3de4c4248db7a6ae3052ea65ccd0d8ebe741c8f2
@larsk-db
Contributor

Update: With #1372 merged, the protocol spec part of this is complete. Actual implementation work is still pending.

@vkorukanti
Collaborator

Update: Issue #1485 has been created to support reading Delta tables with DVs.

@felipepessoto
Contributor

The design doc mentions the MERGE command. Will it be changed to use DVs?

The 2.3.0 RC release notes mention read support only. Is the plan to release MERGE/UPDATE/DELETE support in a future release?

Thanks.

@larsk-db
Contributor

@felipepessoto Currently, the focus is on adding DV support to DELETE (see #1591) and making sure that all DML commands can write correctly to tables that already have DVs, even without producing new DVs themselves. We expect this to roll out over the next couple of releases. Once that is done, we would like to monitor the feedback (read: what issues we are getting) before we settle on a final design for UPDATE and MERGE. So we don't have any particular timeline in mind for UPDATE and MERGE, but the intention is that they will be supported eventually.

@tdas
Contributor Author

tdas commented Mar 24, 2024

Marking this as finally done, as all three commands (DELETE, UPDATE, and MERGE) now use deletion vectors if they are enabled on the table.

@tdas tdas closed this as completed Mar 24, 2024