[Feature Request] data deduplication on existing delta table #1767
Comments
The proposed API is a little ambiguous ... how is "first" defined?
Agree with your comment @tdas, the API needs to be better defined. A simpler form would be to drop duplicates on all columns. Defining which row to keep only makes sense when dropping duplicates based on a subset of columns (as specified by the proposed "cols" parameter). Usage of "first" would require some kind of order within each group of duplicates, which would need to be further defined.
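To illustrate the ambiguity raised above, here is a minimal sketch in plain Python (not the Delta Lake API; the function and parameter names are mine): without an explicit ordering, which row of a duplicate group survives depends on input order, which a distributed table does not guarantee.

```python
def drop_duplicates(rows, cols, order_by=None):
    """Keep one row per distinct combination of `cols`.

    Without `order_by`, the survivor depends on input order. With
    `order_by`, "first" is well defined: the row that sorts lowest
    on that key wins within each group of duplicates.
    """
    if order_by is not None:
        rows = sorted(rows, key=lambda r: r[order_by])
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in cols)
        seen.setdefault(key, row)  # first occurrence wins
    return list(seen.values())

records = [
    {"id": 1, "value": "a", "ts": 3},
    {"id": 1, "value": "b", "ts": 1},
    {"id": 2, "value": "c", "ts": 2},
]

# With an ordering key, the outcome is deterministic: for id=1 the
# row with the smallest ts survives.
deduped = drop_duplicates(records, cols=["id"], order_by="ts")
```

This is only a single-process analogue, but it shows why the proposal would need to pin down what "first" means before the API could be implemented.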
Attempt at a second iteration: Parameters
https://github.com/MrPowers/mack has something like this implemented @MrPowers |
@allisonport-db , @MrPowers, the drop_duplicate function in mack comes with the following warning:
A function that drops duplicates without overwriting the whole table would be the target.
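A hedged sketch of what "without overwriting the whole table" could mean, in plain Python (this is not Delta Lake or mack code; the helper name is mine): first isolate only the key groups that actually contain duplicates, so that a real implementation could delete and re-insert just those rows instead of rewriting every file.

```python
from collections import Counter

def duplicate_keys(rows, cols):
    """Return the key tuples (projected onto `cols`) that occur
    more than once, i.e. the only groups a targeted rewrite would
    need to touch."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return {key for key, n in counts.items() if n > 1}

rows = [
    {"id": 1, "v": "a"},
    {"id": 1, "v": "b"},
    {"id": 2, "v": "c"},
]
# Only id=1 has duplicates, so only that group would be rewritten.
affected = duplicate_keys(rows, ["id"])
```

In Delta terms this would correspond to a delete-plus-reinsert scoped to the affected keys rather than a full-table overwrite, which is presumably why the mack warning matters here.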
+1 on this feature on behalf of my team
Feature request
Overview
In some cases, we would prefer to do data deduplication on a regular basis rather than when upserting. A single command for this purpose would be quite useful.
Motivation
Several approaches to de-duplicating delta tables are proposed online. Some target deduplication at insert time; others propose complex merge operations for existing tables.
Further details
Propose to implement a command of the form
deltaTable.dropDuplicates(cols, where, keep="first")
such that rows that are duplicated on the columns given by the cols parameter are dropped wherever the condition specified by the where parameter holds. The condition should be a filter, on partition columns or others.
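The intended semantics of the proposed command can be sketched in plain Python (this is not Delta Lake code; the signature merely mirrors the proposal, and only keep="first" is covered): rows outside the where filter are always kept, and among matching rows only the first occurrence per cols key survives.

```python
def drop_duplicates(rows, cols, where=lambda r: True, keep="first"):
    """Drop duplicates on `cols` among rows matching `where`;
    rows outside the filter are left untouched."""
    if keep != "first":
        raise ValueError("only keep='first' is sketched here")
    kept, seen = [], set()
    for row in rows:
        if not where(row):
            kept.append(row)        # outside the filter: always kept
            continue
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)        # first matching occurrence wins
    return kept

rows = [
    {"part": 1, "id": 1},
    {"part": 1, "id": 1},   # duplicate inside the filtered partition
    {"part": 2, "id": 1},   # same id, but outside the where filter
]
deduped = drop_duplicates(rows, ["id"], where=lambda r: r["part"] == 1)
```

Note that, as the comments above point out, "first" here means first in iteration order, which is only well defined once an explicit ordering within each duplicate group is specified.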