### Finding duplicate rows

Finding duplicate rows is a common task (and a common interview question).

In [None]:
# create a list of rows, with some duplicates
# to make it easier to track this visually, duplicates have v1==v2
entries = [
    [0,0], 
    [0,0],
    [1,0], 
    [1,1],
    [1,1],
    [2,1],
    [2,2],
    [2,2],
    [3,2],
    [3,3],
    [3,3],
    [4,3],
    [4,4],
    [4,4],
    [5,4],
    [5,5],
    [5,5],
    [6,5],
] 

headers = ['v1','v2']

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(entries, columns=headers)

In [None]:
df

In [None]:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

### Use a COUNT and GROUP BY

You can group by the set of columns that may contain duplicates, then count the rows in each group. Any group with a count greater than 1 holds a combination of values that appears more than once across those columns.

In [None]:
pysqldf("SELECT v1, v2, COUNT(v1) FROM df group by v1, v2 HAVING COUNT(v1) > 1")
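The same grouping can be tried without pandasql. The following is a minimal sketch using the standard-library `sqlite3` module and an in-memory database; the table name `t` and the sample rows are made up for illustration. It uses `COUNT(*)` instead of `COUNT(v1)`, since `COUNT(*)` counts every row in a group while `COUNT(v1)` skips rows where `v1` is NULL.

```python
import sqlite3

# Hypothetical sample rows: (1, 1) appears twice, (2, 2) three times.
rows = [(0, 0), (1, 0), (1, 1), (1, 1), (2, 2), (2, 2), (2, 2)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v1 INTEGER, v2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Group on both columns and keep only the groups that contain more than one row.
duplicate_groups = conn.execute("""
    SELECT v1, v2, COUNT(*) AS n
    FROM t
    GROUP BY v1, v2
    HAVING COUNT(*) > 1
    ORDER BY v1, v2
""").fetchall()
conn.close()

print(duplicate_groups)  # [(1, 1, 2), (2, 2, 3)]
```

Each result row gives the duplicated values and how many times they occur, which is also a quick way to size the problem before deciding what to do with the extra copies.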

### Use RowID

If your table has a rowid (as in SQLite) or another unique sequential row identifier, you can use MIN or MAX on it to identify rows that have duplicates.

In [None]:
# rows with duplicates will have different values for MIN and MAX rowid
pysqldf("SELECT v1, v2, min(rowid), max(rowid) FROM df GROUP BY v1, v2")

In [None]:
# leverage this to find the extra copies (i.e., rows whose rowid isn't the MIN for their group)
pysqldf("""
SELECT 
    rowid, * 
FROM 
    df
WHERE 
    rowid 
NOT IN
    (SELECT 
        min(rowid) 
    FROM df 
        GROUP BY v1, v2
    )
""")
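To see what the query above selects, the same "keep the first, report the rest" logic can be sketched in plain Python. The function name `later_copies` and the sample data are hypothetical; rows are numbered from 1 on the assumption that they were loaded into a fresh SQLite table in order, as pandasql does here.

```python
def later_copies(rows):
    """Return (row_number, row) for every row that repeats an earlier row."""
    seen = set()
    repeats = []
    # Number rows starting at 1, mirroring rowid in a freshly loaded SQLite table.
    for row_number, row in enumerate(rows, start=1):
        key = tuple(row)
        if key in seen:
            # Not the first occurrence, so its number is not the group's minimum.
            repeats.append((row_number, row))
        else:
            seen.add(key)
    return repeats


sample = [[0, 0], [0, 0], [1, 0], [1, 1], [1, 1], [2, 1]]
result = later_copies(sample)
print(result)  # [(2, [0, 0]), (5, [1, 1])]
```

The first row of each group is the one with the smallest row number, so everything the function reports corresponds to a row whose rowid is not the `MIN(rowid)` of its group.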

### Without a rowid

If you don't have a rowid (or your database doesn't auto-generate one for you), you can use a window function with PARTITION BY to number the rows within each group and pick out the duplicates.

In [None]:
pysqldf("""
WITH df_1 AS 
(
    SELECT 
        a.v1, 
        a.v2, 
        ROW_NUMBER() OVER (PARTITION BY v1, v2) as row_id 
    FROM 
        df a
)

SELECT 
    * 
FROM 
    df_1
WHERE
    -- row_id restarts at 1 within each (v1, v2) partition, so values above 1 are extra copies
    row_id > 1
""")
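A standalone sketch of the window-function approach, again with the standard-library `sqlite3` module and a hypothetical table `t`. Because `ROW_NUMBER()` restarts at 1 in every partition, filtering on `row_id > 1` is enough to select the extra copies. Window functions require SQLite 3.25.0 or newer, so the sketch checks the linked library version first.

```python
import sqlite3

# Window functions such as ROW_NUMBER() need SQLite 3.25.0 or newer.
if sqlite3.sqlite_version_info < (3, 25, 0):
    raise RuntimeError(f"SQLite {sqlite3.sqlite_version} lacks window functions")

# Hypothetical sample rows: (0, 0) appears twice, (1, 1) three times.
rows = [(0, 0), (0, 0), (1, 0), (1, 1), (1, 1), (1, 1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v1 INTEGER, v2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Number the rows inside each (v1, v2) group, then keep everything past the first.
extra_copies = conn.execute("""
    WITH numbered AS (
        SELECT v1, v2,
               ROW_NUMBER() OVER (PARTITION BY v1, v2) AS row_id
        FROM t
    )
    SELECT v1, v2, row_id
    FROM numbered
    WHERE row_id > 1
    ORDER BY v1, v2, row_id
""").fetchall()
conn.close()

print(extra_copies)  # [(0, 0, 2), (1, 1, 2), (1, 1, 3)]
```

A group with three identical rows yields two results here (row_id 2 and 3), which matches the rowid-based approach: every copy after the first is reported.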