
[Feature Request] data deduplication on existing delta table #1767

Open
mrtnstk opened this issue May 16, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@mrtnstk

mrtnstk commented May 16, 2023

Feature request

Overview

In some cases, we would prefer to do data deduplication on a regular basis rather than when upserting. A single command for this purpose would be quite useful.

Motivation

Several approaches to de-duplicating data in Delta tables have been proposed online. Some target deduplication at insert time; others propose complex merge operations on existing tables.

Further details

I propose implementing a command of the form

deltaTable.dropDuplicates(cols, where, keep="first")

such that rows that are duplicated on the columns given by the cols parameter are dropped, within the scope specified by the where parameter. The condition should be a filter, on partition columns or other columns.
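The intended semantics might be sketched in plain Python (the deltaTable.dropDuplicates API above is only a proposal, so this is a hypothetical stand-in: dedup on a column subset, restricted to rows matching the where filter, keeping the first or last row encountered):

```python
def drop_duplicates(rows, cols, where=lambda r: True, keep="first"):
    """Drop rows that duplicate an earlier row on `cols`, but only
    among rows matching `where`; rows outside the filter are kept."""
    seen = set()
    out = []
    # keep="last" is modelled by scanning in reverse, then restoring order
    ordered = rows if keep == "first" else list(reversed(rows))
    for row in ordered:
        if not where(row):
            out.append(row)       # outside the scope: always kept
            continue
        key = tuple(row[c] for c in cols)
        if key in seen:
            continue              # duplicate within scope: dropped
        seen.add(key)
        out.append(row)
    return out if keep == "first" else list(reversed(out))

rows = [
    {"id": 1, "part": 2, "v": "a"},
    {"id": 1, "part": 2, "v": "b"},   # duplicate on id within part == 2
    {"id": 1, "part": 3, "v": "c"},   # outside the where filter: kept
]
deduped = drop_duplicates(rows, cols=["id"], where=lambda r: r["part"] == 2)
# → rows "a" and "c" remain
```

On a real Delta table the same semantics would of course have to be expressed through the transaction log rather than an in-memory scan; the sketch only pins down which rows survive.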

@mrtnstk mrtnstk added the enhancement New feature or request label May 16, 2023
@tdas
Contributor

tdas commented May 16, 2023

The proposed API is a little ambiguous ... how is "first" defined?

@mrtnstk
Author

mrtnstk commented May 16, 2023

> The proposed API is a little ambiguous ... how is "first" defined?

Agree with your comment @tdas, the API needs to be better defined.

A simpler form would be to drop duplicates on all columns,
deltaTable.dropDuplicates(where)

Defining which row to keep only makes sense when dropping duplicates based on a subset of columns (as specified by the proposed "cols" parameter). Using "first" would require some kind of order within each group of duplicates, which would need to be further defined.

@mrtnstk
Author

mrtnstk commented May 16, 2023

Attempt at a second iteration:
deltaTable.dropDuplicates(cols: Union[str, List[str], None] = None, condition: Union[pyspark.sql.column.Column, str, None] = None, select: Optional[Dict[str, Union[str, pyspark.sql.column.Column]]] = None)

Parameters
cols: optional (str or list of str) – the columns to screen for duplicates; search on all columns if None.
condition: optional (str or pyspark.sql.Column) – condition restricting where to search for duplicates, e.g. "partitionCol == 2".
select: optional (dict mapping a column (str or pyspark.sql.Column) to a rule (str)) – defines which row to keep when the cols parameter is set. Example: {"numericCol": "max"} keeps the row with the maximum value of numericCol within each group of duplicates.
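One way to read the proposed select rule can be sketched in plain Python (names like numericCol are hypothetical, and only "max"/"min" rules are modelled; this is not an existing Delta Lake API):

```python
def drop_duplicates_select(rows, cols, select):
    """Group rows by the duplicate key `cols`, then keep exactly one
    row per group according to the select rule, e.g. {"ts": "max"}
    keeps the row with the largest "ts" in each group."""
    ((col, rule),) = select.items()   # assume a single-entry rule dict
    pick = max if rule == "max" else min
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in cols)
        groups.setdefault(key, []).append(row)
    return [pick(group, key=lambda r: r[col]) for group in groups.values()]

rows = [
    {"id": 1, "ts": 10},
    {"id": 1, "ts": 30},   # same id: the select rule decides which survives
    {"id": 2, "ts": 20},
]
kept = drop_duplicates_select(rows, cols=["id"], select={"ts": "max"})
# → keeps {"id": 1, "ts": 30} and {"id": 2, "ts": 20}
```

This also shows why the select rule resolves the earlier "first" ambiguity: the ordering within each duplicate group is made explicit by the chosen column and aggregation.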

@allisonport-db
Collaborator

https://github.com/MrPowers/mack has something like this implemented @MrPowers

@mrtnstk
Author

mrtnstk commented Jun 7, 2023

@allisonport-db , @MrPowers, the drop_duplicate function in mack comes with the following warning:

Warning: This method is overwriting the whole table, thus very inefficient. If you can, use the drop_duplicates_pkey method instead.

A function that drops duplicates without overwriting the whole table would be the target.
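A sketch of that idea, with plain Python dicts standing in for a partitioned Delta table: only partitions that actually contain duplicates need to be rewritten, while clean partitions are left untouched. (In Delta Lake this might map to selective partition overwrites, e.g. a replaceWhere-style write; the mapping is an assumption, not an existing API.)

```python
def dedup_by_partition(partitions, cols):
    """partitions: dict of partition value -> list of row dicts.
    Returns (new_partitions, rewritten), where `rewritten` lists only
    the partitions whose files would need to be replaced."""
    rewritten = []
    out = {}
    for part, rows in partitions.items():
        seen, deduped = set(), []
        for row in rows:
            key = tuple(row[c] for c in cols)
            if key not in seen:
                seen.add(key)
                deduped.append(row)
        out[part] = deduped
        # only partitions that lost rows would actually be rewritten
        if len(deduped) < len(rows):
            rewritten.append(part)
    return out, rewritten

parts = {
    "2023-01": [{"id": 1}, {"id": 1}],   # contains a duplicate: rewritten
    "2023-02": [{"id": 2}],              # clean: left as-is
}
new_parts, touched = dedup_by_partition(parts, cols=["id"])
# → only "2023-01" is touched
```

The cost then scales with the data in the affected partitions rather than the whole table, which is the efficiency gap the mack warning above describes.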

@nimblefox

+1 on this feature on behalf of my team
