[Feature Request] Support reading Delta tables with Deletion Vectors #1485

Closed · 1 of 3 tasks
vkorukanti opened this issue Nov 15, 2022 · 2 comments
Labels: enhancement (New feature or request)

vkorukanti (Collaborator) commented Nov 15, 2022

## Feature request

This task is part of the Deletion Vectors support feature. The scope here is to support reading Delta tables with deletion vectors (DVs) on the current master.

### Overview

For the DV protocol, refer to #1367.

At a high level: if a Parquet file has one or more records marked for deletion (using DVs), there is a corresponding DV file in storage that contains the indices of the records marked as deleted in that file. The DV file name is part of the `AddFile` (see the protocol).

The main changes needed for this task are:

| ID | Task description | PR | Status |
|----|------------------|----|--------|
| 1 | Custom RoaringBitmap optimized for the common case of fitting within a 32-bit bitmap. The rationale for this bitmap structure is explained in the PR. | #1486 | DONE |
| 2 | Encoding/decoding of DVs in a Base85 variant, for storing them inline in the DeltaLog | #1487 | DONE |
| 3 | Add `DeletionVectorDescriptor` to parse `AddFile.deletionVector` (the metadata describing the DV location for the file in `AddFile`) | #1528 | DONE |
| 4 | `DeletionVectorStore` to read and write DV content to disk | #1534 | DONE |
| 5 | `RowIndexFilter`, which reads the DV contents and provides an interface to look up the DV value for given row indices | #1536 | DONE |
| 6 | Update `AddFile` and `RemoveFile` with a new field `deletionVector` | #1560 | DONE |
| 7 | Planner rule to inject a Filter just after the Delta Parquet scan (plan transformation described below) | #1560 | DONE |
| 8 | `DeltaParquetFileFormat.buildReaderWithPartitionValues` to add a new column `__skip_row` to the data returned by the Parquet reader | #1542 | DONE |
| 9 | Integration tests | | |
| 10 | Throw an unsupported-operation error when DELETE/MERGE/UPDATE is run on a Delta table with DVs. This is temporary until DV write support is added. | #1603 | DONE |
| 11 | Limit pushdown updates: consider `deletionVector.cardinality` in stats to prune the file list | #1577 | DONE |
| 12 | Aggregation pushdown: consider `deletionVector.cardinality` in stats to prune the file list | #1560 | DONE |
| 13 | Vacuum support: include the DV files of GC-surviving `FileAction`s in the list of GC-surviving files | #1676 | DONE |
| 14 | Optimize support on tables with DVs | #1578 | DONE |
| 15 | Generate manifest file: throw an error if the table has DVs | #1595 | DONE |
| 16 | Shallow clone support on Delta tables with DVs | #1733 | DONE |
| 17 | Restore support on Delta tables with DVs | #1735 | DONE |
| 18 | Handle DVs in min/max optimize | #1525 | LATER |
| 19 | Delta checkpoint support: make sure the DV info is preserved | #1576 | DONE |
| 20 | Check whether stats recomputation on tables with DVs needs any correctness fixes | | DONE |

The planner rule in task 7 modifies the plan as follows:

* Before rule: `<Parent Node> -> Delta Scan (key, value)`. Here we are reading the `key` and `value` columns from the Delta table.
* After rule: `<Parent Node> -> Project(key, value) -> Filter (udf(__skip_row == 0)) -> Delta Scan (key, value, __skip_row)`
  * A new column `__skip_row` is inserted into the Delta scan. It is populated by the Parquet reader (refer to task 8) from the deletion vector corresponding to the Parquet file being read, and contains `0` if we want to keep the row.
  * The scan also disables Parquet file splitting and filter pushdowns, because generating `__skip_row` requires reading the rows of a file consecutively to derive the row index. This is a cost we pay until we upgrade to the latest Apache Spark, whose Parquet reader generates the row index correctly regardless of file splitting and filter pushdowns.
  * The scan also carries a broadcast variable with a Parquet file -> DV file map; the Parquet reader uses this map to find the DV file corresponding to the Parquet file being read.
  * The Filter keeps only the rows with `__skip_row` equal to `0`, i.e., it drops the deleted rows.
  * Finally, a Project keeps the plan node output the same as before the rule was applied.
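For intuition, here is a minimal, self-contained sketch (not the actual Delta code) of what reading with a DV means: a row survives the scan only if its row index is absent from the file's deletion vector.

```scala
import org.roaringbitmap.longlong.Roaring64Bitmap

object DvReadSketch {
  def main(args: Array[String]): Unit = {
    // Rows 1 and 3 of this hypothetical Parquet file are soft-deleted.
    val deleted = new Roaring64Bitmap
    deleted.addLong(1L)
    deleted.addLong(3L)

    val rows = Seq("r0", "r1", "r2", "r3", "r4") // stand-in for the file's rows

    // Keep a row only if its index is not in the deletion vector.
    val kept = rows.zipWithIndex.collect {
      case (row, idx) if !deleted.contains(idx.toLong) => row
    }
    println(kept) // List(r0, r2, r4)
  }
}
```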

## Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
zsxwing (Member) commented Nov 17, 2022

I added one more task (`OptimizeMetadataOnlyDeltaQuery` should support DVs) to the list, as we are going to merge #1377.

allisonport-db pushed a commit that referenced this issue Nov 28, 2022
This PR is part of the feature: Support reading Delta tables with deletion vectors (more at #1485)

Deletion vectors are stored either in a separate file or inline as part of the `AddFile` struct in the DeltaLog. More details on the format can be found [here](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vector-descriptor-schema).

This PR adds utilities to encode/decode the DV bitmap in the Base85 variant [Z85](https://rfc.zeromq.org/spec/32/) for storing it in the `AddFile` struct in the DeltaLog.

Close #1487

GitOrigin-RevId: e12b67abd7498b174cd3942b7c0ae82ffd362cc6
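For illustration, here is a minimal standalone sketch of the Z85 encoding idea referenced in the commit above. The real codec added in #1487 differs in structure and handles padding, so treat this as illustrative only.

```scala
object Z85Sketch {
  // The 85-character Z85 alphabet from https://rfc.zeromq.org/spec/32/
  private val Alphabet =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#"

  // Z85 requires the input length to be a multiple of 4; each 4-byte group
  // becomes 5 characters of the alphabet (a base-85 number, most significant
  // digit first).
  def encode(bytes: Array[Byte]): String = {
    require(bytes.length % 4 == 0, "Z85 input must be a multiple of 4 bytes")
    val sb = new StringBuilder
    bytes.grouped(4).foreach { group =>
      var value = group.foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xffL))
      val chars = new Array[Char](5)
      for (i <- 4 to 0 by -1) {
        chars(i) = Alphabet((value % 85).toInt)
        value /= 85
      }
      sb.appendAll(chars)
    }
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    // The RFC's reference test vector: these 8 bytes encode to "HelloWorld".
    val input = Array(0x86, 0x4f, 0xd2, 0x6f, 0xb5, 0x59, 0xf7, 0x5b).map(_.toByte)
    println(encode(input)) // HelloWorld
  }
}
```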
allisonport-db pushed a commit that referenced this issue Nov 28, 2022
This PR is part of the feature: Support reading Delta tables with deletion vectors (more at #1485)

Adds a new bitmap implementation called `RoaringBitmapArray`. This will be used to encode the deleted row indices. The `org.roaringbitmap` library already provides a `Roaring64Bitmap`, but this implementation is optimized for the use case of handling row indices, which are always clustered between 0 and the index of the last row of a file, as opposed to being arbitrarily sparse over the whole `Long` space.

Closes #1486

GitOrigin-RevId: c94d0cd1b4d1c179b9947e984705ab81e97a6dec
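The optimization the commit describes can be sketched as follows: keep an array of 32-bit Roaring bitmaps indexed by the high 32 bits of the row index. This is a hedged approximation of the idea behind `RoaringBitmapArray`, not its actual API.

```scala
import org.roaringbitmap.RoaringBitmap

import scala.collection.mutable.ArrayBuffer

final class RoaringBitmapArraySketch {
  // One 32-bit bitmap per "bucket" of 2^32 consecutive row indices.
  private val buckets = ArrayBuffer.empty[RoaringBitmap]

  def add(rowIndex: Long): Unit = {
    val high = (rowIndex >>> 32).toInt
    val low = rowIndex.toInt // RoaringBitmap treats ints as unsigned
    while (buckets.size <= high) buckets += new RoaringBitmap
    buckets(high).add(low)
  }

  def contains(rowIndex: Long): Boolean = {
    val high = (rowIndex >>> 32).toInt
    high < buckets.size && buckets(high).contains(rowIndex.toInt)
  }
}
```

Since most Parquet files have far fewer than 2^32 rows, `buckets` typically holds a single 32-bit bitmap, avoiding the overhead of a general sparse 64-bit bitmap.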
vkorukanti added a commit to vkorukanti/delta that referenced this issue Dec 20, 2022
This PR is part of the feature: Support reading Delta tables with deletion vectors (more details of supporting DV table reads at delta-io#1485)

This PR adds a new class called `DeletionVectorDescriptor` which is used to represent the deletion vector for a given file (`AddFile`) in `DeltaLog`. The format of the metadata is described in the [protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vector-descriptor-schema).

GitOrigin-RevId: 59979dea64d7c938d33b886eaebe0b30b2f8dd6c
vkorukanti added a commit that referenced this issue Dec 21, 2022
## Description
This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

It adds a new class called `DeletionVectorDescriptor`, which is used to represent the deletion vector for a given file (`AddFile`) in the `DeltaLog`. The format of the metadata is described in the [protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vector-descriptor-schema).

A new test suite is added.

Closes #1528

GitOrigin-RevId: 3fe23fcf8b23c9478903e52b82b42f7cc1abf94f
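For reference, a simplified sketch of the descriptor's protocol-defined fields; the real `DeletionVectorDescriptor` also carries derived helpers (unique-ID computation, path resolution, and so on).

```scala
// Field names and meanings follow PROTOCOL.md#deletion-vector-descriptor-schema;
// the class itself is an illustrative stand-in.
case class DeletionVectorDescriptorSketch(
    storageType: String,    // "u" = relative path, "p" = absolute path, "i" = inline
    pathOrInlineDv: String, // encoded path (or Z85-encoded bitmap when inline)
    offset: Option[Int],    // byte offset of this DV inside a shared DV file
    sizeInBytes: Int,       // serialized size of the DV bitmap
    cardinality: Long)      // number of rows marked deleted
```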
vkorukanti added a commit to vkorukanti/delta that referenced this issue Dec 28, 2022
This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at delta-io#1485)

It adds a `DeletionVectorStore`, which contains APIs to load DVs from and write DVs to a Hadoop-FS-compliant file system. The format of the DV file is described in the protocol [here](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vector-file-storage-format).

Added a test suite.

GitOrigin-RevId: 72340c9854f7d0376ea2aeec0c4bbba08ce78259
vkorukanti added a commit that referenced this issue Dec 29, 2022
This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

It adds a `DeletionVectorStore`, which contains APIs to load DVs from and write DVs to a Hadoop-FS-compliant file system. The format of the DV file is described in the protocol [here](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vector-file-storage-format).

Added a test suite.

Closes #1534

GitOrigin-RevId: 2941dc32e87565a97c3e1d70470a1eaf65e524c7
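A hedged sketch of the read half of such a store, assuming the length-prefixed, CRC-32-checksummed layout described in the protocol's DV file storage format; names and error handling here are illustrative, not the actual `DeletionVectorStore` API.

```scala
import java.util.zip.CRC32

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DvStoreSketch {
  // Reads the serialized bitmap bytes for one DV stored at `offset` in `dvFile`.
  def readDvBytes(dvFile: Path, offset: Int, conf: Configuration): Array[Byte] = {
    val fs = FileSystem.get(dvFile.toUri, conf)
    val in = fs.open(dvFile)
    try {
      in.seek(offset.toLong)
      val size = in.readInt()        // 4-byte big-endian length prefix
      val data = new Array[Byte](size)
      in.readFully(data)             // the serialized bitmap
      val expectedCrc = in.readInt() // 4-byte CRC-32 of the data
      val crc = new CRC32
      crc.update(data)
      require(crc.getValue.toInt == expectedCrc, s"DV checksum mismatch in $dvFile")
      data
    } finally in.close()
  }
}
```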
vkorukanti added a commit to vkorukanti/delta that referenced this issue Dec 29, 2022
This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at delta-io#1485)

It adds an interface called `RowIndexFilter`, which evaluates whether to keep a row in the output. `FilterDeletedRows` implements `RowIndexFilter` to filter out rows that are marked deleted in a deletion vector. In the final integration, this filter is applied just after fetching the rows from the data Parquet file. Refer to task IDs 7 and 8 in the [project plan](delta-io#1485).

Test suite is added.

GitOrigin-RevId: 0ba60f880f7a83304142f4b021fd71b170d74356
scottsand-db pushed a commit that referenced this issue Jan 5, 2023
This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

It adds an interface called `RowIndexFilter`, which evaluates whether to keep a row in the output. `FilterDeletedRows` implements `RowIndexFilter` to filter out rows that are marked deleted in a deletion vector. In the final integration, this filter is applied just after fetching the rows from the data Parquet file. Refer to task IDs 7 and 8 in the [project plan](#1485).

Test suite is added.

Closes #1536

GitOrigin-RevId: d2075dba21818e2889d51b4331d88bb62d192328
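Conceptually, the interface can be sketched like this; the real `RowIndexFilter` materializes into Spark column vectors, while this standalone version fills a plain byte array and is illustrative only.

```scala
import org.roaringbitmap.longlong.Roaring64Bitmap

trait RowIndexFilterSketch {
  /** Write one byte per row in [start, end): 1 = drop the row, 0 = keep it. */
  def materializeIntoArray(start: Long, end: Long, out: Array[Byte]): Unit
}

final class FilterDeletedRowsSketch(deleted: Roaring64Bitmap) extends RowIndexFilterSketch {
  override def materializeIntoArray(start: Long, end: Long, out: Array[Byte]): Unit = {
    var i = start
    while (i < end) {
      out((i - start).toInt) = if (deleted.contains(i)) 1 else 0
      i += 1
    }
  }
}
```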
vkorukanti added a commit to vkorukanti/delta that referenced this issue Jan 5, 2023
…es with DVs

This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at delta-io#1485)

It modifies `DeltaParquetFileFormat` to append an additional column called `__delta_internal_skip_row__`. This column is populated by reading the DV associated with the Parquet file. We assume the rows are returned in the order given in the file; to ensure this order, we disable file splitting and filter pushdown to the Parquet reader. This has a performance penalty for Delta tables with deletion vectors until we upgrade Delta to Spark 3.4 (whose Parquet reader can generate row indexes correctly even with file splitting and filter pushdown).

Currently only a single test is added; upcoming end-to-end tests will cover the code better.

GitOrigin-RevId: 2067958ffc770a89df15fd165c9999d49b2dd1c4
vkorukanti added a commit that referenced this issue Jan 10, 2023
…d from DV

This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

It modifies `DeltaParquetFileFormat` to append an additional column called `__delta_internal_is_row_deleted__`. This column is populated by reading the DV associated with the Parquet file. We assume the rows are returned in the order given in the file; to ensure this order, we disable file splitting and filter pushdown to the Parquet reader. This has a performance penalty for Delta tables with deletion vectors until we upgrade Delta to Spark 3.4 (whose Parquet reader can generate row indexes correctly even with file splitting and filter pushdown).

Currently only a single test is added; upcoming end-to-end tests will cover the code better.

Closes #1542

GitOrigin-RevId: b73fa628ad6d04c171f56534b80a894e9cd1220e
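A Spark-free sketch of the reader-side idea: wrap the Parquet row iterator and append a flag derived from the file's DV and a running row index. The names below are illustrative; the real change lives in `DeltaParquetFileFormat.buildReaderWithPartitionValues`.

```scala
import org.roaringbitmap.longlong.Roaring64Bitmap

object AppendIsRowDeletedSketch {
  // `T` stands in for Spark's InternalRow; the appended Byte models the
  // __delta_internal_is_row_deleted__ column (1 = deleted, 0 = live).
  // zipWithIndex is Int-indexed; fine for a sketch, the real reader tracks
  // Long row indexes.
  def withDeletedFlag[T](rows: Iterator[T], dv: Roaring64Bitmap): Iterator[(T, Byte)] =
    rows.zipWithIndex.map { case (row, rowIndex) =>
      (row, if (dv.contains(rowIndex.toLong)) 1.toByte else 0.toByte)
    }
}
```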
vkorukanti added a commit that referenced this issue Jan 26, 2023
…an output

This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

Adds a trait (used by `PrepareDeltaScan` to modify its output) that rewrites scans over DV-enabled tables to prune the deleted rows from the scan output.

A planner trait injects a Filter just after the Delta Parquet scan. The transformer modifies the plan:
 * Before rule: `<Parent Node> -> Delta Scan (key, value)`
   * Here we are reading the `key` and `value` columns from the Delta table
 * After rule: `<Parent Node> -> Project(key, value) -> Filter (udf(__skip_row == 0)) -> Delta Scan (key, value, __skip_row)`
   * A new column `__skip_row` is inserted into the Delta scan. It is populated by the Parquet reader using the DV corresponding to
     the Parquet file being read (refer [to the change](#1542)) and contains `0` if we want to keep the row.
   * The scan also disables Parquet file splitting and filter pushdowns, because generating `__skip_row` requires reading
     the rows of a file consecutively to derive the row index. This is a cost we pay until we upgrade to the latest
     Apache Spark, whose Parquet reader generates the row index correctly regardless of file splitting and filter pushdowns.
   * The scan also contains a broadcast variable with a Parquet file -> DV file map. The Parquet reader uses this map to find
     the DV file corresponding to the Parquet file.
   * The Filter keeps only the rows with `__skip_row` equal to `0`, i.e., it drops the deleted rows.
   * Finally, a `Project` keeps the plan node output the same as before the rule was applied.

In addition, this PR:
* adds the `deletionVector` field to the DeltaLog protocol objects (`AddFile`, `RemoveFile`)
* updates `OptimizeMetadataOnlyDeltaQuery` to take DVs into consideration when calculating the row count
* adds end-to-end integration of reading Delta tables with DVs in `DeletionVectorsSuite`

Follow-up PRs will add more extensive tests.

Close #1560

GitOrigin-RevId: 3d67b6240865d880493f1d15a80b00cb079dacdc
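The plan rewrite described in the commit above can be modeled with a toy, self-contained plan algebra (this is not Catalyst code):

```scala
sealed trait PlanSketch
case class ScanSketch(columns: Seq[String]) extends PlanSketch
case class FilterSketch(condition: String, child: PlanSketch) extends PlanSketch
case class ProjectSketch(columns: Seq[String], child: PlanSketch) extends PlanSketch

object InjectSkipRowFilterSketch {
  // The scan grows a __skip_row column, a Filter keeps rows where it is 0,
  // and a Project restores the original output columns.
  def apply(plan: PlanSketch): PlanSketch = plan match {
    case ScanSketch(cols) =>
      val scan = ScanSketch(cols :+ "__skip_row")
      ProjectSketch(cols, FilterSketch("udf(__skip_row == 0)", scan))
    case other => other
  }
}

// InjectSkipRowFilterSketch(ScanSketch(Seq("key", "value"))) yields
// Project(key, value) -> Filter(udf(__skip_row == 0)) -> Scan(key, value, __skip_row)
```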
vkorukanti added a commit that referenced this issue Jan 26, 2023
This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

Adds support for checkpoints when updating Delta tables containing deletion vectors: the checkpointing job reads the existing `deletionVector` in each file action and writes it to the checkpoint file.

Closes #1576

GitOrigin-RevId: 9ee23c8a876c45f539fc81cef100382f6efe6fae
vkorukanti added a commit to vkorukanti/delta that referenced this issue Jan 27, 2023
This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at delta-io#1485)

Adds support for checkpoints when updating Delta tables containing deletion vectors: the checkpointing job reads the existing `deletionVector` in each file action and writes it to the checkpoint file.

Closes delta-io#1576

GitOrigin-RevId: 9ee23c8a876c45f539fc81cef100382f6efe6fae
vkorukanti added a commit that referenced this issue Jan 30, 2023
This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

Currently the limit pushdown code doesn't take DVs into account when pruning the list of files based on the `limit` value. Update the code to take `dv.cardinality` into account when determining the number of rows in an `AddFile`.

It also adds
 * comments around the data-skipping code explaining how data skipping continues to work without giving wrong results (at a slight performance overhead) when querying tables with DVs
 * test utilities to associate DVs with an `AddFile` for testing purposes.

Closes #1577

GitOrigin-RevId: fdb0c78db9b2d0ac37cce67886110a32688fd531
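A small sketch of the adjustment: when pruning files for a LIMIT, count a file's logical rows as `numRecords` minus the DV cardinality. The class and method names are illustrative, not the actual Delta code.

```scala
case class FileStatsSketch(path: String, numRecords: Long, dvCardinality: Long) {
  // Rows a reader will actually see after the DV filters out deleted rows.
  def logicalRows: Long = numRecords - dvCardinality
}

object LimitPruningSketch {
  // Take files until their cumulative logical row count covers the limit.
  def pruneForLimit(files: Seq[FileStatsSketch], limit: Long): Seq[FileStatsSketch] = {
    var remaining = limit
    files.takeWhile { f =>
      val take = remaining > 0
      remaining -= f.logicalRows
      take
    }
  }
}
```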
vkorukanti added a commit to vkorukanti/delta that referenced this issue Jan 30, 2023
This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at delta-io#1485)

This PR adds support for running OPTIMIZE (file compaction or Z-Order By) on Delta tables with deletion vectors. It changes the following:
* Selection criteria
   * File compaction: earlier we selected only files with size below `optimize.minFileSize` for compaction. Now we also consider the ratio of deleted rows in a file: if the ratio is above `optimize.maxDeletedRowsRatio` (default 0.05), the file is also selected for compaction (which removes its DVs)
   * Z-Order: unchanged. We always select all files in the selected partitions, so if a file has a DV, the file is rewritten and the DV removed as part of the Z-Order
* Reading selected files with DVs for OPTIMIZE: we go through the same read path as a Delta table read, which removes the deleted rows (according to the DV) from the scan output
* Metrics for deleted DVs

Added tests.

GitOrigin-RevId: b64d8beec8278e6665813642753ef0a19af5c985
vkorukanti added a commit that referenced this issue Jan 31, 2023
This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

This PR adds support for running OPTIMIZE (file compaction or Z-Order By) on Delta tables with deletion vectors. It changes the following:
* Selection criteria
   * File compaction: earlier we selected only files with size below `optimize.minFileSize` for compaction. Now we also consider the ratio of deleted rows in a file: if the ratio is above `optimize.maxDeletedRowsRatio` (default 0.05), the file is also selected for compaction (which removes its DVs)
   * Z-Order: unchanged. We always select all files in the selected partitions, so if a file has a DV, the file is rewritten and the DV removed as part of the Z-Order
* Reading selected files with DVs for OPTIMIZE: we go through the same read path as a Delta table read, which removes the deleted rows (according to the DV) from the scan output
* Metrics for deleted DVs

Added tests.

Closes #1578

GitOrigin-RevId: f20d234357fa5b24e56aea098fa60f026ad1f160
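The selection criteria above boil down to a simple predicate; here is a hedged sketch with illustrative names (the config keys mirror the commit text, and the default ratio is the stated 0.05):

```scala
case class OptimizeCandidateSketch(sizeInBytes: Long, numRecords: Long, dvCardinality: Long)

object OptimizeSelectionSketch {
  // Compact a file if it is small, or if too large a fraction of its rows
  // is soft-deleted via a DV.
  def shouldCompact(
      f: OptimizeCandidateSketch,
      minFileSize: Long,
      maxDeletedRowsRatio: Double = 0.05): Boolean = {
    val deletedRatio =
      if (f.numRecords == 0) 0.0 else f.dvCardinality.toDouble / f.numRecords
    f.sizeInBytes < minFileSize || deletedRatio > maxDeletedRowsRatio
  }
}
```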
vkorukanti added a commit that referenced this issue Feb 14, 2023
This PR is part of the feature: Support reading Delta tables with deletion vectors (more details at #1485)

The manifest file contains the list of Parquet data file paths that can be consumed by clients capable of reading symlink tables. However, when DVs are present on top of the Parquet data files, there is no way to expose the DVs to a symlink table reader, so it is best to block manifest generation when DVs are present in the table.

Closes #1595

GitOrigin-RevId: 2589bcbd08c59bb65c5dc465d31b5e0e4aabf5c0
tdas added this to the 2.3.0 milestone Feb 17, 2023
vkorukanti added a commit to vkorukanti/delta that referenced this issue Apr 3, 2023
This is part of [support reading Delta tables with deletion vectors](delta-io#1485)

It adds support for running the VACUUM command on Delta tables with deletion vectors. The main change is to include the DV files of GC-surviving `FileAction`s in the list of GC-surviving files, so that valid DV files are not considered for deletion.

Added tests.

Closes delta-io#1676

Signed-off-by: Venki Korukanti <venki.korukanti@gmail.com>
GitOrigin-RevId: 0262ecade603e808044ca1e1cde8f053332a10d9
allisonport-db pushed a commit that referenced this issue Apr 5, 2023
…1485)

It adds support for running the VACUUM command on Delta tables with deletion vectors. The main change is to include the DV files of GC-surviving `FileAction`s in the list of GC-surviving files, so that valid DV files are not considered for deletion.

Added tests.

Closes #1676

Signed-off-by: Venki Korukanti <venki.korukanti@gmail.com>
GitOrigin-RevId: c5a156779701934366c36b4049648f43c7b97ebe
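A sketch of the VACUUM change: when computing the set of files that survive GC, also include each surviving file's DV file so it is never deleted. `dvAbsolutePath` is an illustrative accessor, not the real API.

```scala
case class SurvivingFileSketch(dataPath: String, dvAbsolutePath: Option[String])

object VacuumSurvivorsSketch {
  // Every surviving data file protects itself and, if present, its DV file.
  def survivingPaths(files: Seq[SurvivingFileSketch]): Set[String] =
    files.flatMap(f => f.dataPath +: f.dvAbsolutePath.toSeq).toSet
}
```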
vkorukanti added a commit to vkorukanti/delta that referenced this issue May 6, 2023
This PR is part of the feature: Support Delta tables with deletion vectors (more details at delta-io#1485)

This PR adds support for SHALLOW CLONE-ing a Delta table with deletion vectors. The main change is to convert the relative path of the DV file in `AddFile` to an absolute path when cloning the table.

Added tests

Closes delta-io#1733
vkorukanti added a commit to vkorukanti/delta that referenced this issue May 6, 2023
This PR is part of the feature: Support Delta tables with deletion vectors (more details at delta-io#1485)

It adds support for running RESTORE on a Delta table with deletion vectors. The main change is to take `AddFile.deletionVector` into consideration when comparing the target version being restored to with the current version, in order to find the list of data files to add and remove.

Added tests

Closes delta-io#1735
vkorukanti added a commit to vkorukanti/delta that referenced this issue May 6, 2023
…ectors

Given that we now have support for writing into DV tables and for table utility operations as part of delta-io#1485 and delta-io#1591, we should remove the check.

Closes delta-io#1736
allisonport-db pushed a commit to allisonport-db/delta that referenced this issue May 9, 2023
This PR is part of the feature: Support Delta tables with deletion vectors (more details at delta-io#1485)

This PR adds support for SHALLOW CLONE-ing a Delta table with deletion vectors. The main change is to convert the relative path of the DV file in `AddFile` to an absolute path when cloning the table.

Added tests

Closes delta-io#1733
allisonport-db pushed a commit to allisonport-db/delta that referenced this issue May 9, 2023
This PR is part of the feature: Support Delta tables with deletion vectors (more details at delta-io#1485)

It adds support for running RESTORE on a Delta table with deletion vectors. The main change is to take `AddFile.deletionVector` into consideration when comparing the target version being restored to with the current version, in order to find the list of data files to add and remove.

Added tests

Closes delta-io#1735
allisonport-db pushed a commit to allisonport-db/delta that referenced this issue May 9, 2023
…ectors

Given that we now have support for writing into DV tables and for table utility operations as part of delta-io#1485 and delta-io#1591, we should remove the check.

Closes delta-io#1736
allisonport-db pushed a commit that referenced this issue May 11, 2023
…ectors

Given that we now have support for writing into DV tables and for table utility operations as part of #1485 and #1591, we should remove the check.

Closes #1736

Signed-off-by: Venki Korukanti <venki.korukanti@databricks.com>
GitOrigin-RevId: 17e7e9c6796229ada77148a730c69348a55890b9
allisonport-db pushed a commit that referenced this issue May 11, 2023
This PR is part of the feature: Support Delta tables with deletion vectors (more details at #1485)

This PR adds support for SHALLOW CLONE-ing a Delta table with deletion vectors. The main change is to convert the relative path of the DV file in `AddFile` to an absolute path when cloning the table.

Added tests

Closes #1733

GitOrigin-RevId: b634496b57b93fc4b7a7cc16e33c200e3a83ba64
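A hedged sketch of the clone-time rewrite: relative DV references (storage type `"u"` in the protocol) are replaced by absolute ones (`"p"`) that point back into the source table, while inline (`"i"`) and already-absolute descriptors are left alone. The types below are illustrative.

```scala
object CloneDvPathSketch {
  // Minimal stand-in for the DV descriptor fields involved (see the protocol).
  case class DvRef(storageType: String, pathOrInlineDv: String)

  // `resolvedAbsolutePath` is the absolute location of the DV file in the
  // source table, computed by the caller; how it is derived is elided here.
  def makeAbsolute(dv: DvRef, resolvedAbsolutePath: String): DvRef =
    dv.storageType match {
      case "u" => DvRef(storageType = "p", pathOrInlineDv = resolvedAbsolutePath)
      case _   => dv // inline ("i") or already absolute ("p"): nothing to change
    }
}
```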
allisonport-db pushed a commit that referenced this issue May 11, 2023
This PR is part of the feature: Support Delta tables with deletion vectors (more details at #1485)

It adds support for running RESTORE on a Delta table with deletion vectors. The main change is to take `AddFile.deletionVector` into consideration when comparing the target version being restored to with the current version, in order to find the list of data files to add and remove.

Added tests

Closes #1735

GitOrigin-RevId: b722e0b058ede86f652cd4e4229a7217916511da
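A sketch of the comparison: keying each file by its path plus a DV identifier makes RESTORE treat a file whose DV changed as a file to replace. Field names are illustrative.

```scala
case class RestoreFileSketch(path: String, dvId: Option[String])

object RestoreDiffSketch {
  def diff(target: Set[RestoreFileSketch], current: Set[RestoreFileSketch])
      : (Set[RestoreFileSketch], Set[RestoreFileSketch]) = {
    val toAdd = target -- current    // in the restored version but not the current one
    val toRemove = current -- target // in the current version but not the restored one
    (toAdd, toRemove)
  }
}
```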
allisonport-db pushed a commit to allisonport-db/delta that referenced this issue May 11, 2023
…ectors

Given that we now have support for writing into DV tables and for table utility operations as part of delta-io#1485 and delta-io#1591, we should remove the check.

Closes delta-io#1736

Signed-off-by: Venki Korukanti <venki.korukanti@databricks.com>
GitOrigin-RevId: 17e7e9c6796229ada77148a730c69348a55890b9
(cherry picked from commit 0331737)
allisonport-db pushed a commit to allisonport-db/delta that referenced this issue May 11, 2023
This PR is part of the feature: Support Delta tables with deletion vectors (more details at delta-io#1485)

This PR adds support for SHALLOW CLONE-ing a Delta table with deletion vectors. The main change is to convert the relative path of the DV file in `AddFile` to an absolute path when cloning the table.

Added tests

Closes delta-io#1733

GitOrigin-RevId: b634496b57b93fc4b7a7cc16e33c200e3a83ba64
(cherry picked from commit 6556d6f)
allisonport-db pushed a commit to allisonport-db/delta that referenced this issue May 11, 2023
This PR is part of the feature: Support Delta tables with deletion vectors (more details at delta-io#1485)

It adds support for running RESTORE on a Delta table with deletion vectors. The main change is to take `AddFile.deletionVector` into consideration when comparing the target version being restored to with the current version, in order to find the list of data files to add and remove.

Added tests

Closes delta-io#1735

GitOrigin-RevId: b722e0b058ede86f652cd4e4229a7217916511da
(cherry picked from commit 6ef881f)
sirsha-chatterjee pushed a commit to sirsha-chatterjee/delta that referenced this issue May 16, 2023
Makes changes to support Spark 3.4. These include necessary compile changes, plus test _and_ code changes due to changes in Spark behavior.

Some of the bigger changes include
- A lot of changes regarding error classes. These include...
  - Spark 3.4 changed `class ErrorInfo` to private. This means the current approach in `DeltaThrowableHelper` can no longer work. We now use `ErrorClassJsonReader` (these are the changes to `DeltaThrowableHelper` and `DeltaThrowableSuite`).
  - Many error functions switched the first argument from `message: String` to `errorClass: String`, which **does not** cause a compile error but instead causes a "SparkException-error not found" when called. Affected examples include `ParseException(...)` and `a.failAnalysis(..)`.
  - Supports error subclasses
- Spark 3.4 supports insert-into-by-name and no longer reorders such queries to be insert-into-by-ordinal. See apache/spark#39334. In `DeltaAnalysis.scala` we need to perform schema validation checks and schema evolution for such queries; right now we only match when `!isByName`
- SPARK-27561 added support for lateral column alias. This broke our generation expression validation checks for generated columns. We now separately check for generated columns that reference other generated columns in `GeneratedColumn.scala`
- `DelegatingCatalogExtension` deprecates `createTable(..., schema: StructType, ...)` in favor of `createTable(..., columns: Array[Column], ...)`
- `_metadata.file_path` is not always encoded. We update `DeleteWithDeletionVectorsHelper.scala` to accommodate this.
- Support for SQL `REPLACE WHERE`. Tests are added to `DeltaSuite`.
-  Misc test changes due to minor changes in Spark behavior or error messages

Resolves delta-io#1696

Existing tests should suffice since there are no major Delta behavior changes _besides_ support for `REPLACE WHERE` for which we have added tests.

Yes. Spark 3.4 will be supported. `REPLACE WHERE` is supported in SQL.

GitOrigin-RevId: b282c95c4e6a7a1915c2a4ae9841b5e43ed4724d

Fix a test in DeltaVacuumSuite to pass locally

"vacuum after purging deletion vectors" in `DeltaVacuumSuite` fails locally because the local filesystem only writes modification times to second accuracy.

This means a transaction might have timestamp `1683694325000` but the tombstone for a file removed in that transaction could have deletionTimestamp `1683694325372`.

---> The test fails since we set the clock to the transaction timestamp + retention period, which isn't late enough to expire the tombstones in that transaction.

GitOrigin-RevId: 63018c48524edb0f8edd9e40f1b21cc97bc546cc

Add estLogicalFileSize to FileAction

Add estLogicalFileSize to FileAction for easier Deletion Vector processing.

GitOrigin-RevId: c7cf0ad32e378bcfc4e4c046c5d76667bb8659c7

Support insert-into-by-name for generated columns

Spark 3.4 no longer requires users to provide _all_ columns in insert-by-name queries. This means Delta can now support omitting generated columns from the column list in such queries.

This PR adds support for this and some additional tests related to the changed by-name support.

Resolves delta-io#1215

Adds unit tests.

Yes. Users will be able to omit generated columns from the column list when inserting by name.

Closes delta-io#1743

GitOrigin-RevId: 8694fab3d93b71b4230bf6f5dd0f2a21be6f3634

Implement PURGE to remove DVs from Delta tables

This PR introduces a `REORG TABLE ... APPLY (PURGE)` SQL command that can materialize soft-delete operations performed via DVs.

The command works by rewriting and bin-packing (if applicable) only the files that have DVs attached, unlike the `OPTIMIZE` command, where all files (with and without DVs) are bin-packed. To achieve this, we hack the `OPTIMIZE` logic so that files of any size with DVs are rewritten.

Follow-up:
- Set the correct commit info. Now the resulting version is marked as `optimize` rather than `purge`.
- Clean up DVs from the filesystem.

New tests.

Closes delta-io#1732

Signed-off-by: Venki Korukanti <venki.korukanti@databricks.com>
GitOrigin-RevId: 98ef156d62698986bfb54681e386971e2fec08b8
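A hedged usage sketch of the new command (`events` is an illustrative table name; `spark` is an active SparkSession with Delta configured):

```scala
// Rewrites only the files that have DVs attached, materializing the soft deletes
// so the table no longer depends on those DV files.
spark.sql("REORG TABLE events APPLY (PURGE)")
```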

Unify predicate strings in CommitInfo to record the information in a consistent way.

GitOrigin-RevId: 043a6a4181c112b9c9a45906c1275fbbdbbb1388

Minor refactoring to Delta source.

GitOrigin-RevId: 3625a5c44999139ef4976c62473b233167a4aa83

Add Option.whenNot Scala extension helper and replace usage of Option.when(!cond).

GitOrigin-RevId: e26244544cadeeff1d55862f840d4c6c5570e83b

Introduce DomainMetadata action to delta spec

We propose to introduce a new Action type called DomainMetadata to the Delta spec. In a nutshell, DomainMetadata allows specifying configurations (string key/value pairs) per metadata domain, and a custom conflict handler can be registered to a metadata domain. More details can be found in the design doc [here](https://docs.google.com/document/d/16MHP7P78cq9g8SWhAyfgFlUe-eH0xhnMAymgBVR1fCA/edit?usp=sharing).

The GitHub issue delta-io#1741 was created.

Spec-only change; no tests are needed.

Closes delta-io#1742

GitOrigin-RevId: 5d33d8b99e33c5c1e689672a8ca2ab3863feab54

DV stress test: Delete from a table of a large number of rows with DVs

This PR tests DELETE-ing from a table of 2 billion rows (`2<<31 + 10`), some of which are marked as deleted by a DV. The goal is to ensure that DVs can still be read and manipulated at this scale.

We don't `delete a large number of rows` and `materialize DV` because they run too slowly to fit in a unit test (9 and 20 minutes respectively).

GitOrigin-RevId: 1273c9372907be0345465c2176a7f76115adbb47

RESTORE support for Delta tables with deletion vectors

This PR is part of the feature: Support Delta tables with deletion vectors (more details at delta-io#1485)

It adds support for running RESTORE on a Delta table with deletion vectors. The main change is to take `AddFile.deletionVector` into consideration when comparing the target version being restored to with the current version, in order to find the list of data files to add and remove.

Added tests

Closes delta-io#1735

GitOrigin-RevId: b722e0b058ede86f652cd4e4229a7217916511da

Disallow overwriteSchema with dynamic partitions overwrite

Disallow overwriteSchema when partitionOverwriteMode is set to dynamic. Otherwise, the table might become corrupted, as schemas of newly written partitions would not match the non-overwritten partitions.

GitOrigin-RevId: 1012793448c1ffed9a3f8bde507d9fe1ee183803

SHALLOW CLONE support for Delta tables with deletion vectors.

This PR is part of the feature: Support Delta tables with deletion vectors (more details at delta-io#1485)

This PR adds support for SHALLOW CLONE-ing a Delta table with deletion vectors. The main change is to convert the relative path of the DV file in `AddFile` to an absolute path when cloning the table.

Added tests

Closes delta-io#1733

GitOrigin-RevId: b634496b57b93fc4b7a7cc16e33c200e3a83ba64

Adds tests for REPLACE WHERE SQL syntax

Spark 3.4 added REPLACE WHERE SQL support for insert. This PR adds tests for the feature after upgrading to Spark 3.4.

Closes delta-io#1737

GitOrigin-RevId: 8bf0e7423a6f0846d5f9ef4e637ee9ced9bef8d1

Fix a test in `DeltaThrowableSuite.scala`

GitOrigin-RevId: 28acd5fe8d8cadd569c479fe0f02d99dac1c13b3

Fix statistics computation issues with Delta tables with DVs

This PR makes the following changes:
- The Delta protocol [requires](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#writer-requirement-for-deletion-vectors) that every `AddFile` with a DV must have `numRecords` in its file statistics. The current implementation of DELETE with DVs violates this requirement when the source `AddFile` has no statistics to begin with. This PR fixes it by computing stats for `AddFile`s that are missing stats and have DVs generated as part of the DELETE-with-DVs operation. The stats are generated by reading the Parquet file footer.
- DELETE with DVs currently has a bug where setting `tightBounds=false` for `AddFile`s with DVs doesn't correctly set the `NULL_COUNT` for a column with all nulls. This is fixed.
- Throw an error when the stats re-computation command is run on Delta tables with DVs. This is a TODO; we need to implement it, but for now we throw an error to avoid calculating wrong statistics for Delta tables with DVs.

GitOrigin-RevId: f69968961dcf4766b6847a191b66aae7f9ff295d

Remove the check that disables writes to Delta tables with deletion vectors

Given that we now have support for writing into DV tables and for table utility operations as part of delta-io#1485 and delta-io#1591, we should remove the check.

Closes delta-io#1736

Signed-off-by: Venki Korukanti <venki.korukanti@databricks.com>
GitOrigin-RevId: 17e7e9c6796229ada77148a730c69348a55890b9

Regex based table matching in DeleteScalaSuite

Use a more reliable regex-based approach to getting a `DeltaTable` instance from a sql identifier string in `DeleteScalaSuite`.

GitOrigin-RevId: 1d0e1477a7d22373e8478d7debc3565c092090da

Enable SQL support for WHEN NOT MATCHED BY SOURCE

The SQL syntax for merge with WHEN NOT MATCHED BY SOURCE clauses shipped with Spark 3.4. Now that Delta has picked up Spark 3.4, we can enable SQL support and mix in SQL tests for WHEN NOT MATCHED BY SOURCE.

Existing tests for WHEN NOT MATCHED BY SOURCE are now run in the Merge SQL suite.

Closes delta-io#1740

GitOrigin-RevId: 1ddd1216e13f854901da47896936527618ea4dca

Minor refactor to DeltaCatalog.scala

GitOrigin-RevId: 53b083f9abf92330d253fbdd9208d2783428dd98

Correctly recurse into nested arrays & maps in add/drop columns

It is not possible today in Delta tables to add or drop nested fields under two or more levels of directly nested arrays or maps.
The following is a valid use case but fails today:
```
CREATE TABLE test (data array<array<struct<a: int>>>)
ALTER TABLE test ADD COLUMNS (data.element.element.b string)
```

This change updates helper methods `findColumnPosition`, `addColumn` and `dropColumn` in `SchemaUtils` to correctly recurse into directly nested maps and arrays.

Note that changes in Spark are also required for `ALTER TABLE ADD/CHANGE/DROP COLUMN`  to work: apache/spark#40879. The fix is merged in Spark but will only be available in Delta in the next Spark release.

In addition, `findColumnPosition`, which currently both returns the position of a nested field and the size of its parent, making it overly complex, is split into two distinct and generic methods: `findColumnPosition` and `getNestedTypeFromPosition`.

- Tests for `findColumnPosition`, `addColumn` and `dropColumn` with two levels of nested maps and arrays are added to `SchemaUtilsSuite`. Other cases for these methods are already covered by existing tests.
- Tested locally that  ALTER TABLE ADD/CHANGE/DROP COLUMN(S) works correctly with Spark fix apache/spark#40879
- Added missing tests coverage for ALTER TABLE ADD/CHANGE/DROP COLUMN(S) with a single map or array.

Closes delta-io#1731

GitOrigin-RevId: 53ed05813f4002ae986926506254d780e2ecddfa