[Feature Request] Deletion Vectors to speed up DML operations #1367

tdas · 2022-09-01T17:49:03Z

This is the umbrella issue for Deletion Vector support.

Motivation

During Delta DML operations, both the semantics of cloud file systems and our guarantees about transaction history in Delta prevent us from performing any in-place updates to files. When updates are small compared to the total file size (e.g., a single row per file), this leads to an enormous performance burden: the entire file has to be rewritten for a small change (aka copy-on-write). A large fraction of DML statements that update anything update only a very small percentage of the rows in the files they touch. Deletion Vectors (DVs) are a mechanism to store such small updates more efficiently, by avoiding the expensive rewrite of the unmodified data.

Proposal

The solution proposed in this issue is to augment the Parquet files of a Delta table with separate “Deletion Vector” (DV) files, instead of rewriting them immediately. A DV is an optimized bitmap that represents the set of rows (of a particular Parquet file) that are no longer valid (“deleted”) in a particular version of a Delta table. The n-th row of a Parquet file has row index n, and is associated with the n-th bit of the DV.
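To make the row-index-to-bit mapping concrete, here is a minimal sketch in Python. It is only an illustration: the `DeletionVector` class and its method names are hypothetical, and a plain Python integer stands in for the optimized bitmap, so this does not reflect the actual on-disk DV format.

```python
class DeletionVector:
    """Toy deletion vector: bit n is set when row index n is deleted.

    Hypothetical illustration only; not the Delta implementation.
    """

    def __init__(self) -> None:
        # An arbitrary-precision int used as an uncompressed bitmap
        # (assumption: stands in for the optimized bitmap format).
        self._bits = 0

    def delete(self, row_index: int) -> None:
        """Mark the row at `row_index` as no longer valid."""
        if row_index < 0:
            raise ValueError("row index must be non-negative")
        self._bits |= 1 << row_index

    def is_deleted(self, row_index: int) -> bool:
        """Return True if bit `row_index` is set."""
        return ((self._bits >> row_index) & 1) == 1

    def cardinality(self) -> int:
        """Number of rows marked as deleted."""
        return bin(self._bits).count("1")


dv = DeletionVector()
dv.delete(2)  # row index 2 of the data file is now "deleted"
dv.delete(5)
print(dv.is_deleted(2), dv.is_deleted(3), dv.cardinality())
```

Note that the data file itself is never touched here: deleting a row only flips a bit in the separate vector.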

When reading a Delta table at a version that contains DVs, care must be taken to “ignore” (filter out) the invalid rows during scans. As reading files with DVs is going to be somewhat slower than reading a fully compacted file without DVs, the mechanism itself is a tradeoff between write and read performance.
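The read-side filtering can be sketched as follows. This is a simplified illustration in Python, not how any Delta reader is implemented: `scan_with_dv` is a hypothetical helper, the rows are an in-memory list standing in for a Parquet file, and a set of row indexes stands in for the DV.

```python
from typing import Iterable, Iterator, Set, Tuple

Row = Tuple[str, int]


def scan_with_dv(rows: Iterable[Row], deleted: Set[int]) -> Iterator[Row]:
    """Yield only rows whose row index is not marked in the DV.

    `deleted` is a hypothetical stand-in for the deletion vector.
    """
    for row_index, row in enumerate(rows):
        # The per-row membership check is the extra read-time cost
        # paid in exchange for not rewriting the file on write.
        if row_index not in deleted:
            yield row


# Stand-in for the contents of one data file (row index = list position).
file_rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
deleted_rows = {1, 3}  # rows invalidated in this table version

visible = list(scan_with_dv(file_rows, deleted_rows))
print(visible)
```

The underlying rows stay in the file; they are only hidden at scan time, which is why a compacted file without DVs remains cheaper to read.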

Further details

The detailed proposal and the required protocol changes are sketched out in this doc.

https://docs.google.com/document/d/1lv35ZPfioopBbzQ7zT82LOev7qV7x4YNLkMr2-L5E_M/edit?usp=sharing

## Description

- This PR makes the concrete changes proposed in #1367 to the Delta protocol specification. For details of what this proposal entails, see that issue.
- In addition, this PR makes some clarifying changes to the wording in the spec in various places, many of which were necessary to correctly reflect concepts introduced by the proposal (e.g., _logical files_, exact column stat semantics).

N/A (document-only).

## Does this PR introduce _any_ user-facing changes?

No.

Closes #1372

Signed-off-by: Scott Sandre <scott.sandre@databricks.com>

GitOrigin-RevId: 3de4c4248db7a6ae3052ea65ccd0d8ebe741c8f2

larsk-db · 2022-10-20T07:24:32Z

Update: With #1372 merged, the protocol spec part of this is complete. Actual implementation work is still pending.

vkorukanti · 2022-11-15T19:20:13Z

Update: Issue #1485 has been created to track support for reading Delta tables with DVs.

felipepessoto · 2023-03-24T23:08:40Z

The design doc mentions the MERGE command. Will it be changed to use DVs?

The 2.3.0 RC release notes mention read support only. Is the plan to release MERGE/UPDATE/DELETE support in a future release?

Thanks.

larsk-db · 2023-03-27T15:32:02Z

@felipepessoto Currently, the focus is on adding DV support to DELETE (see #1591) and on making sure that all DML commands can write correctly to tables that already have DVs, even if they do not produce new DVs themselves. We expect this to roll out over the next couple of releases. Once that is done, we would like to monitor the feedback (read: what issues we are getting) before we settle on a final design for UPDATE and MERGE. So we don't have any particular timeline in mind for UPDATE and MERGE, but the intention is that they will be supported eventually.

tdas · 2024-03-24T23:41:33Z

Marking this as finally done, as all three commands (DELETE, UPDATE, MERGE) now use deletion vectors if enabled on the table.

tdas added the enhancement New feature or request label Sep 1, 2022

tdas pinned this issue Sep 1, 2022

tdas mentioned this issue Sep 1, 2022

Roadmap 2022 H2 (discussion) #1307

Open

larsk-db mentioned this issue Sep 7, 2022

Update Protocol Spec for Deletion Vectors #1372

Closed

sezruby mentioned this issue Sep 14, 2022

Optimize common case: SELECT COUNT(*) FROM Table Fix #1192 #1377

Closed

dennyglee mentioned this issue Nov 10, 2022

Support deletionVector field in transaction log delta-io/delta-rs#928

Closed

vkorukanti mentioned this issue Nov 15, 2022

[Feature Request] Support reading Delta tables with Deletion Vectors #1485

Closed

3 tasks

tdas mentioned this issue Dec 1, 2022

[Feature Request] Enable ability to read stream from delta table without duplicates. #1490

Closed

3 tasks

vkorukanti mentioned this issue Feb 9, 2023

[Feature Request] Support DELETE command with Deletion Vectors #1591

Closed

3 tasks

saryeHaddadi mentioned this issue May 11, 2023

Are there plans to support merge on read mode #276

Open

tdas unpinned this issue Jun 6, 2023

xupefei mentioned this issue Jul 19, 2023

[Feature Request] Support UPDATE command with Deletion Vectors #1923

Closed

8 tasks

tdas mentioned this issue Nov 15, 2023

[Feature Request][Spark] Support MERGE command with Deletion Vectors #2296

Closed

5 tasks

tdas closed this as completed Mar 24, 2024
