
[Feature Request] data deduplication on existing delta table #1767

Open
mrtnstk opened this issue May 16, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@mrtnstk

mrtnstk commented May 16, 2023

Feature request

Overview

In some cases, we would prefer to do data deduplication on a regular basis rather than when upserting. A single command for this purpose would be quite useful.

Motivation

Several approaches to de-duplicating data in Delta tables have been proposed online. Some target deduplication at insert time; others propose complex merge operations on existing tables.

Further details

I propose implementing a command of the form

deltaTable.dropDuplicates(cols, where, keep="first")

such that rows that are duplicated on the columns given by the cols parameter are dropped, within the scope specified by the where parameter. The condition should be a filter, on partition columns or other columns.
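The intended semantics might be sketched in plain Python (the deltaTable.dropDuplicates API above is only a proposal, so this is a hypothetical stand-in: dedup on a column subset, restricted to rows matching the where filter, keeping the first or last row encountered):

```python
def drop_duplicates(rows, cols, where=lambda r: True, keep="first"):
    """Drop rows that duplicate an earlier row on `cols`, but only
    among rows matching `where`; rows outside the filter are kept."""
    seen = set()
    out = []
    # keep="last" is modelled by scanning in reverse, then restoring order
    ordered = rows if keep == "first" else list(reversed(rows))
    for row in ordered:
        if not where(row):
            out.append(row)       # outside the scope: always kept
            continue
        key = tuple(row[c] for c in cols)
        if key in seen:
            continue              # duplicate within scope: dropped
        seen.add(key)
        out.append(row)
    return out if keep == "first" else list(reversed(out))

rows = [
    {"id": 1, "part": 2, "v": "a"},
    {"id": 1, "part": 2, "v": "b"},   # duplicate on id within part == 2
    {"id": 1, "part": 3, "v": "c"},   # outside the where filter: kept
]
deduped = drop_duplicates(rows, cols=["id"], where=lambda r: r["part"] == 2)
# → rows "a" and "c" remain
```

On a real Delta table the same semantics would of course have to be expressed through the transaction log rather than an in-memory scan; the sketch only pins down which rows survive.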

@mrtnstk mrtnstk added the enhancement New feature or request label May 16, 2023
@tdas
Contributor

tdas commented May 16, 2023

The proposed API is a little ambiguous ... how is "first" defined?

@mrtnstk
Author

mrtnstk commented May 16, 2023

> The proposed API is a little ambiguous ... how is "first" defined?

Agree with your comment @tdas, the API needs to be better defined.

A simpler form would be to drop duplicates on all columns,
deltaTable.dropDuplicates(where)

Defining which row to keep only makes sense when dropping duplicates based on a subset of columns (as specified by the proposed "cols" parameter). Using "first" would require some kind of order within each group of duplicates, which would need to be further defined.

@mrtnstk
Author

mrtnstk commented May 16, 2023

Attempt at a second iteration:
deltaTable.dropDuplicates(cols: Union[str, List[str], None] = None, condition: Union[pyspark.sql.column.Column, str, None] = None, select: Optional[Dict[str, Union[str, pyspark.sql.column.Column]]] = None)

Parameters
cols: optional (str or list of str) – the columns to screen for duplicates; search on all columns if None.
condition: optional (str or pyspark.sql.Column) – condition restricting where to search for duplicates, e.g. "partitionCol == 2".
select: optional (dict mapping a column (str or pyspark.sql.Column) to a rule (str)) – defines which row to keep when the cols parameter is set. Example: {"numericCol": "max"} keeps the row with the maximum value of numericCol within each group of duplicates.
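One way to read the proposed select rule can be sketched in plain Python (names like numericCol are hypothetical, and only "max"/"min" rules are modelled; this is not an existing Delta Lake API):

```python
def drop_duplicates_select(rows, cols, select):
    """Group rows by the duplicate key `cols`, then keep exactly one
    row per group according to the select rule, e.g. {"ts": "max"}
    keeps the row with the largest "ts" in each group."""
    ((col, rule),) = select.items()   # assume a single-entry rule dict
    pick = max if rule == "max" else min
    groups = {}
    for row in rows:
        key = tuple(row[c] for c in cols)
        groups.setdefault(key, []).append(row)
    return [pick(group, key=lambda r: r[col]) for group in groups.values()]

rows = [
    {"id": 1, "ts": 10},
    {"id": 1, "ts": 30},   # same id: the select rule decides which survives
    {"id": 2, "ts": 20},
]
kept = drop_duplicates_select(rows, cols=["id"], select={"ts": "max"})
# → keeps {"id": 1, "ts": 30} and {"id": 2, "ts": 20}
```

This also shows why the select rule resolves the earlier "first" ambiguity: the ordering within each duplicate group is made explicit by the chosen column and aggregation.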

@allisonport-db
Collaborator

https://github.com/MrPowers/mack has something like this implemented @MrPowers

@mrtnstk
Author

mrtnstk commented Jun 7, 2023

@allisonport-db , @MrPowers, the drop_duplicate function in mack comes with the following warning:

Warning: This method is overwriting the whole table, thus very inefficient. If you can, use the drop_duplicates_pkey method instead.

A function that drops duplicates without overwriting the whole table would be the target.
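A sketch of that idea, with plain Python dicts standing in for a partitioned Delta table: only partitions that actually contain duplicates need to be rewritten, while clean partitions are left untouched. (In Delta Lake this might map to selective partition overwrites, e.g. a replaceWhere-style write; the mapping is an assumption, not an existing API.)

```python
def dedup_by_partition(partitions, cols):
    """partitions: dict of partition value -> list of row dicts.
    Returns (new_partitions, rewritten), where `rewritten` lists only
    the partitions whose files would need to be replaced."""
    rewritten = []
    out = {}
    for part, rows in partitions.items():
        seen, deduped = set(), []
        for row in rows:
            key = tuple(row[c] for c in cols)
            if key not in seen:
                seen.add(key)
                deduped.append(row)
        out[part] = deduped
        # only partitions that lost rows would actually be rewritten
        if len(deduped) < len(rows):
            rewritten.append(part)
    return out, rewritten

parts = {
    "2023-01": [{"id": 1}, {"id": 1}],   # contains a duplicate: rewritten
    "2023-02": [{"id": 2}],              # clean: left as-is
}
new_parts, touched = dedup_by_partition(parts, cols=["id"])
# → only "2023-01" is touched
```

The cost then scales with the data in the affected partitions rather than the whole table, which is the efficiency gap the mack warning above describes.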

@nimblefox

+1 on this feature on behalf of my team
