-
Notifications
You must be signed in to change notification settings - Fork 4
postgres remove duplicates sql
The standard approach is to GROUP BY on the duplicate columns and keep one remaining row using the MIN(id) or MAX(id) value.
BEGIN;
LOCK TABLE new_cust IN SHARE ROW EXCLUSIVE MODE;
CREATE TEMPORARY TABLE dups_cust AS
SELECT customerid, min(ctid) AS min_ctid
FROM new_cust
GROUP BY customerid
HAVING count(*) > 1;
DELETE FROM new_cust
USING dups_cust
WHERE new_cust.customerid = dups_cust.customerid
AND new_cust.ctid != dups_cust.min_ctid;
COMMIT;
VACUUM new_cust;
This simple way of deleting duplicate rows will not work when additional requirements have to be followed:
- Instead of deleting all duplicate rows, some should be kept. Duplicate rows may be a valid use-case for some applications but they should be limited to e.g. the last five ones created.
- The remaining row should not be the first or last one created. In some cases, additional columns set a priority to keep a row: A verified user should not be deleted to keep an unverified one.
- The rows are partitioned by the columns indicating a duplicate row. For every combination of the specified columns a partition is automatically created to collect the duplicate rows.
- Every partition is sorted by a number of columns to mark their importance. If e.g. only the last five records should be kept, the partition's rows should be sorted by their creation date in descending order.
- The sorted rows within a partition are assigned an incrementing number by the ROW_NUMBER window function.
- Any row can be deleted according to the desired number of remaining rows. When e.g. only the last five rows should be kept, any row with a row number greater than five can be deleted.
Konstrukcja w PostgreSQL, która pozwala usuwać dane z jednej tabeli na podstawie danych z drugiej (lub z CTE). Działa trochę jak JOIN w DELETE. PostgreSQL does not support the DELETE JOIN statement like MySQL . Instead, it offers the USING clause in the DELETE statement that provides similar functionality to the DELETE JOIN.
DELETE FROM table1
USING table2
WHERE condition
RETURNING returning_columns;
WITH duplicates AS (
SELECT id, ROW_NUMBER() OVER(
PARTITION BY firstname, lastname, email
ORDER BY age DESC
) AS rownum
FROM contacts
)
DELETE FROM contacts
USING duplicates
WHERE contacts.id = duplicates.id AND duplicates.rownum > 1;
The USING
clause is not a part of the SQL standard, meaning that it may not be available in other database systems. To ensure compatibility with various database products use subquery
DELETE FROM Users
WHERE ID IN (
SELECT ID FROM (
SELECT ID, ROW_NUMBER() OVER (PARTITION BY Email ORDER BY ID) AS rn
FROM Users
) sub
WHERE rn > 1
);
Zapytanie nie zadziala w Postgres bo nie mozna usuwac rekordow z CTE. Moze zadzialac w SQL Server
WITH CTE AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Email ORDER BY ID) AS rn
FROM Users
)
DELETE FROM CTE WHERE rn > 1;
MySQL
WITH duplicates AS (
SELECT id, ROW_NUMBER() OVER(
PARTITION BY firstname, lastname, email
ORDER BY age DESC
) AS rownum
FROM contacts
)
DELETE contacts
FROM contacts
JOIN duplicates USING(id)
WHERE duplicates.rownum > 1
Test