A common analytics or data science task is to "de-dup" a data source, or at least to detect its duplicate rows.

I will discuss 5 ways to solve this in a Python/Pandas environment:

  * Knowing the ready-made Pandas solution (the `.duplicated()` method).
  * Knowing how to build a solution in Pandas (`.groupby().transform()`).
  * Using Polars.
  * Using an adapter language (the [data algebra](https://github.com/WinVector/data_algebra)).
  * Generating SQL for the same problem.

First we import our packages.

In [1]:
import numpy as np
import pandas as pd
import data_algebra
import data_algebra.db_space
import data_algebra.PostgreSQL
from IPython.display import display, HTML, Markdown
import time
import polars as pl

Now we set up our example data frame.

In [2]:
rng = np.random.default_rng(2022)

In [3]:
def generate_example(*, n_columns: int = 5, n_rows: int = 10):
    assert isinstance(n_columns, int)
    assert isinstance(n_rows, int)
    return pd.DataFrame({
        f"col_{i:03d}": rng.choice(["a", "b", "c", "d"], size=n_rows, replace=True) for i in range(n_columns)
    })

In [4]:
d = generate_example(n_columns=10, n_rows=1000)

d

Unnamed: 0,col_000,col_001,col_002,col_003,col_004,col_005,col_006,col_007,col_008,col_009
0,c,d,d,b,d,a,a,b,d,a
1,a,b,c,d,a,c,c,c,a,b
2,c,c,d,d,a,c,b,d,a,b
3,a,a,d,d,a,c,b,a,a,b
4,a,b,a,a,c,d,a,c,c,b
...,...,...,...,...,...,...,...,...,...,...
995,c,c,d,a,a,c,b,b,b,a
996,d,d,d,a,d,a,c,c,a,b
997,b,a,d,c,c,b,a,d,d,a
998,a,b,c,b,a,d,a,a,b,a


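The ready-made Pandas solution is to call the `.duplicated()` method, which marks the rows that are duplicates. Passing `keep=False` marks every member of a duplicated group, instead of leaving the first occurrence unmarked. This is the fastest of the methods timed in this note, as it is designed exactly for this task.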

In [5]:
dup_locs_1 = d.duplicated(keep=False)

In [6]:
np.where(dup_locs_1)[0]

array([ 56, 245])

In [7]:

d.loc[dup_locs_1, :]

Unnamed: 0,col_000,col_001,col_002,col_003,col_004,col_005,col_006,col_007,col_008,col_009
56,c,c,d,a,c,a,b,c,d,b
245,c,c,d,a,c,a,b,c,d,b
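
The effect of the `keep` argument is easiest to see on a tiny frame. The following is a minimal sketch; the frame `toy` and its column names are made up for illustration and are not part of the example data above.

```python
import pandas as pd

# Hypothetical toy frame: rows 0, 2 and 3 are identical.
toy = pd.DataFrame({"x": ["a", "b", "a", "a"], "y": [1, 2, 1, 1]})

# Default keep="first": the first occurrence of each group is NOT marked.
marked_first = toy.duplicated().tolist()

# keep=False: every member of a duplicated group is marked.
marked_all = toy.duplicated(keep=False).tolist()

# drop_duplicates() is the companion method that performs the de-dup itself.
deduped = toy.drop_duplicates().reset_index(drop=True)
```

With `keep=False` the mask selects whole groups of repeated rows, which is what the other methods in this note compute.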


The build-your-own Pandas solution is as follows. We count the number of rows in each group of rows defined by the column values. Notice we follow the `.groupby()` with a `.transform()` (which returns data with the same number of rows as the input data) instead of the more familiar `.agg()` (which returns one row per data group).

In [8]:
dup_locs_2 = d.groupby(list(d.columns)).transform("size") > 1


In [9]:
assert np.all(dup_locs_1 == dup_locs_2)

d.loc[dup_locs_2, :]

Unnamed: 0,col_000,col_001,col_002,col_003,col_004,col_005,col_006,col_007,col_008,col_009
56,c,c,d,a,c,a,b,c,d,b
245,c,c,d,a,c,a,b,c,d,b
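
To make the difference between aggregating and transforming concrete, here is a small sketch on made-up data (the frame `toy` is hypothetical). It is a slight variant of the cell above: it selects one column before calling `.transform("size")`, so the result is a plain Series.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical.
toy = pd.DataFrame({"x": ["a", "b", "a"], "y": ["p", "q", "p"]})
cols = list(toy.columns)

# Aggregation: one result per distinct (x, y) group.
per_group = toy.groupby(cols).size()

# Transform: one result per input row, aligned with toy's index.
per_row = toy.groupby(cols)[cols[0]].transform("size")

# Rows whose group holds more than one row are the duplicated rows.
dup_mask = per_row > 1
```

Because `per_row` shares the index of `toy`, it can be used directly as a row mask, as in `toy.loc[dup_mask, :]`.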


Our third method is to use an adapter language instead of using Pandas directly. In this case we are using the [data algebra](https://github.com/WinVector/data_algebra) which requires treating the data transformation as a, hopefully fun, puzzle to be solved over the [classic Codd relational operators](https://en.wikipedia.org/wiki/Relational_algebra) (extension, selection, projection) plus what SQL calls window functions.

In this case a solution is to take a description of our data and:

  * extend the data with a new column that indicates the size of the group each row is in.
  * select the rows where the group is larger than one row.
  * drop the count column we used in the above calculations.
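
The three steps can be sketched in plain Python, independent of any data frame library. This is only an illustration of the plan, using made-up rows; it is not how the data algebra implements it.

```python
from collections import Counter

# Hypothetical rows, each a tuple of column values.
rows = [("c", "d"), ("a", "b"), ("c", "d"), ("b", "b")]

# Step 1 (extend): attach to each row the size of the group it belongs to.
group_sizes = Counter(rows)
extended = [(row, group_sizes[row]) for row in rows]

# Step 2 (select): keep only rows whose group has more than one member.
selected = [(row, count) for row, count in extended if count > 1]

# Step 3 (drop): remove the helper count, leaving the original columns.
duplicated_rows = [row for row, _ in selected]
```
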

Once we have the solution plan above we translate it into data algebra code as follows.

In [10]:
ops = (
    data_algebra.descr(d=d)
        .extend({"count": "(1).sum()"}, partition_by=d.columns)
        .select_rows("count > 1")
        .drop_columns(["count"])
)


We can then apply the operations to the data.

In [11]:
ops_res = ops.transform(d)

The only difference in the result is that the data algebra (deliberately) does not preserve row indexes (and, like a database, isn't concerned with row order or column order).

In [12]:
assert ops_res.shape[0] == np.sum(dup_locs_1)

ops_res

Unnamed: 0,col_000,col_001,col_002,col_003,col_004,col_005,col_006,col_007,col_008,col_009
0,c,c,d,a,c,a,b,c,d,b
1,c,c,d,a,c,a,b,c,d,b


The actual advantages of the data algebra are ease of consistent composition, and that the same solution can also be run in a database.

For example, suppose our data was already in a PostgreSQL database (which we simulate by copying the data into the database). For many data science and analytics problems at scale, the data starts in a database, so being able to manipulate it there has some advantages.

In [13]:
# connect to database
db_handle = data_algebra.PostgreSQL.example_handle()

In [14]:
# build a model of a set of tables in a database
db_tables = data_algebra.db_space.DBSpace(db_handle, drop_tables_on_close=True)

The above abstraction is called a "data space." 

The data algebra natively executes over Pandas, and can generate SQL queries. With a data space adapter it *could* be made to appear to be executing over other data systems such as Dask, Nvidia Rapids, modin, datatable, Polars, or others. All that using a new data realization requires is an adapter that implements the usual Codd relational data transformations plus window functions. Note that we don't currently have such adapters; the way to build one would be to copy the structure of the current Pandas adapter.

In [15]:
# simulate the data already being in the database by inserting it.
_ = db_tables.insert(key="d", value=d)

We can now apply our solution directly in the database. This creates a new table directly in the database, without copying data to or from the database in this step.

In [16]:
res_description = db_tables.execute(ops)

We can now retrieve the result. If our task had been an aggregation or projection we would have the benefit that moving the result would be much less expensive than moving the original data.

In [17]:
db_res = db_tables.retrieve(res_description.table_name)
assert db_res.shape[0] == np.sum(dup_locs_1)

db_res

Unnamed: 0,col_000,col_001,col_002,col_003,col_004,col_005,col_006,col_007,col_008,col_009
0,c,c,d,a,c,a,b,c,d,b
1,c,c,d,a,c,a,b,c,d,b


We can also try a Polars chained solution, directly.

In [18]:
d_polar = pl.DataFrame(d)

In [19]:
orig_cols = list(d_polar.columns)
res_polar = (
    d_polar
        .lazy()
        .with_column(pl.lit(1).alias("count"))
        .with_column(pl.col("count").sum().over(orig_cols))
        .filter(pl.col("count") > 1)
        .select(pl.col(orig_cols))
        .collect()
)


In [20]:
assert res_polar.shape[0] == np.sum(dup_locs_1)

In [21]:
res_polar

col_000,col_001,col_002,col_003,col_004,col_005,col_006,col_007,col_008,col_009
str,str,str,str,str,str,str,str,str,str
"""c""","""c""","""d""","""a""","""c""","""a""","""b""","""c""","""d""","""b"""
"""c""","""c""","""d""","""a""","""c""","""a""","""b""","""c""","""d""","""b"""


For fun, we try the methods again on a larger data set. In all cases we are selecting the duplicated rows.

In [22]:
big_example = generate_example(n_columns=20, n_rows=5000000)


In [23]:
t0 = time.perf_counter()
big_res_duplicated = big_example.loc[big_example.duplicated(keep=False), :]
t1 = time.perf_counter()


In [24]:
big_shape_count = big_res_duplicated.shape[0]

big_shape_count

28

In [25]:
display(Markdown(
        f'And, the Pandas .duplicated() result took {(t1 - t0):0.1f} seconds to calculate.'
    ))


And, the Pandas .duplicated() result took 4.0 seconds to calculate.
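
The `t0`/`t1` pattern used for these timings could also be wrapped in a small helper. The following is one possible sketch; the name `timed` and the `timings` dictionary are hypothetical and are not used elsewhere in this note.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block in results[label]."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - t0


timings = {}
with timed("toy sum", timings):
    # Stand-in workload; in the notebook this would be one of the de-dup methods.
    total = sum(range(100_000))
```

Each timed step then leaves an entry such as `timings["toy sum"]`, which can be formatted the same way as the messages in this note.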

In [26]:
t0 = time.perf_counter()
big_res_transform = big_example.loc[big_example.groupby(list(big_example.columns)).transform("size") > 1, :]
t1 = time.perf_counter()

In [27]:

assert big_res_transform.shape[0] == big_shape_count

In [28]:
display(Markdown(
        f'The Pandas .transform() result took {(t1 - t0):0.1f} seconds to calculate.'
    ))

The Pandas .transform() result took 8.4 seconds to calculate.

In [29]:
dbig_opsbig = pl.DataFrame(big_example)

In [30]:
t0 = time.perf_counter()
orig_cols_big = list(dbig_opsbig.columns)
resbig_opsbig = (
    dbig_opsbig
        .lazy()
        .with_column(pl.lit(1).alias("count"))
        .with_column(pl.col("count").sum().over(orig_cols_big))
        .filter(pl.col("count") > 1)
        .select(pl.col(orig_cols_big))
        .collect()
)
t1 = time.perf_counter()

In [31]:
assert resbig_opsbig.shape[0] == big_shape_count

In [32]:
display(Markdown(
        f'The Polars result took {(t1 - t0):0.1f} seconds to calculate.'
    ))

The Polars result took 9.8 seconds to calculate.

Polars not being faster likely comes down to this being a single very direct row-selection step in Pandas. For longer sequences of operations we expect to see Polars perform faster than Pandas.

We can demonstrate a data algebra pipeline on the new large Pandas data.

In [33]:
big_ops = (
    data_algebra.descr(big_example=big_example)
        .extend({"count": "(1).sum()"}, partition_by=big_example.columns)
        .select_rows("count > 1")
        .drop_columns(["count"])
)

In [34]:
t0 = time.perf_counter()
big_res_data_algebra = big_ops.transform(big_example)
t1 = time.perf_counter()

In [35]:
assert big_res_data_algebra.shape[0] == big_shape_count

In [36]:
display(Markdown(
        f'The data algebra Pandas result took {(t1 - t0):0.1f} seconds to calculate.'
    ))

The data algebra Pandas result took 32.5 seconds to calculate.

This slowdown is because the data algebra is attempting to model immutable semantics over Pandas, causing a bit of an impedance mismatch and some extra copying. The effect can be much less for longer pipelines.

And we can run the data algebra pipeline on the new large data in the database, using SQL.

In [37]:
_ = db_tables.insert(key="big_example", value=big_example)

In [38]:
t0 = time.perf_counter()
big_res_description = db_tables.execute(big_ops)
t1 = time.perf_counter()

In [39]:
big_db_res = db_tables.retrieve(big_res_description.table_name)
assert big_db_res.shape[0] == big_shape_count

In [40]:
display(Markdown(
        f'The data algebra SQL result took {(t1 - t0):0.1f} seconds to calculate.'
    ))

The data algebra SQL result took 37.1 seconds to calculate.

We are also experimenting with a data algebra to Polars adapter (not yet complete, and not yet in the PyPI release).

In [41]:
t0 = time.perf_counter()
big_res_data_algebra_polars = big_ops.transform(dbig_opsbig)
t1 = time.perf_counter()

In [42]:
assert isinstance(big_res_data_algebra_polars, pl.DataFrame)
assert big_res_data_algebra_polars.shape[0] == big_shape_count

big_res_data_algebra_polars

col_000,col_001,col_002,col_003,col_004,col_005,col_006,col_007,col_008,col_009,col_010,col_011,col_012,col_013,col_014,col_015,col_016,col_017,col_018,col_019
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""d""","""b""","""c""","""d""","""b""","""b""","""d""","""c""","""d""","""b""","""d""","""a""","""c""","""a""","""c""","""a""","""d""","""c""","""a""","""c"""
"""a""","""b""","""d""","""b""","""a""","""d""","""a""","""d""","""d""","""a""","""c""","""d""","""d""","""a""","""a""","""c""","""b""","""c""","""d""","""a"""
"""d""","""c""","""b""","""b""","""b""","""d""","""a""","""a""","""c""","""d""","""b""","""c""","""a""","""a""","""a""","""c""","""d""","""a""","""c""","""c"""
"""d""","""c""","""c""","""a""","""c""","""b""","""d""","""b""","""d""","""b""","""b""","""b""","""c""","""b""","""a""","""a""","""b""","""d""","""a""","""a"""
"""a""","""d""","""b""","""b""","""d""","""a""","""d""","""b""","""d""","""a""","""b""","""d""","""c""","""d""","""b""","""b""","""d""","""d""","""d""","""b"""
"""c""","""d""","""a""","""d""","""c""","""b""","""c""","""a""","""a""","""d""","""c""","""c""","""c""","""a""","""b""","""a""","""c""","""a""","""b""","""d"""
"""a""","""d""","""a""","""b""","""d""","""b""","""b""","""c""","""b""","""c""","""d""","""b""","""a""","""d""","""c""","""d""","""b""","""d""","""d""","""b"""
"""a""","""b""","""d""","""b""","""a""","""d""","""a""","""d""","""d""","""a""","""c""","""d""","""d""","""a""","""a""","""c""","""b""","""c""","""d""","""a"""
"""d""","""c""","""d""","""d""","""c""","""b""","""b""","""a""","""c""","""d""","""c""","""d""","""a""","""b""","""a""","""a""","""d""","""d""","""a""","""d"""
"""a""","""d""","""b""","""b""","""d""","""a""","""d""","""b""","""d""","""a""","""b""","""d""","""c""","""d""","""b""","""b""","""d""","""d""","""d""","""b"""


In [43]:
display(Markdown(
        f'The data algebra Polars result took {(t1 - t0):0.1f} seconds to calculate.'
    ))

The data algebra Polars result took 7.9 seconds to calculate.

And here is a great payoff: programming over Polars through an adapter (such as the data algebra) is about as fast as using Polars directly! Our Polars adapter is currently incomplete, and not yet ready for use. But these initial favorable results are motivating work on it.

We can also take a look at the SQL generated by the data algebra. SQL is generated with different dialects for different databases. The SQL is long, but we didn't have to write it.

In [44]:
print(ops.to_sql(db_handle))

-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: PostgreSQLModel 1.5.0
--       string quote: '
--   identifier quote: "
WITH
 "extend_0" AS (
  SELECT  -- .extend({ 'count': '(1).sum()'}, partition_by=['col_000', 'col_001', 'col_002', 'col_003', 'col_004', 'col_005', 'col_006', 'col_007', 'col_008', 'col_009'])
   "col_000" ,
   "col_001" ,
   "col_002" ,
   "col_003" ,
   "col_004" ,
   "col_005" ,
   "col_006" ,
   "col_007" ,
   "col_008" ,
   "col_009" ,
   SUM(1) OVER ( PARTITION BY "col_000", "col_001", "col_002", "col_003", "col_004", "col_005", "col_006", "col_007", "col_008", "col_009"  )  AS "count"
  FROM
   "d"
 )
SELECT  -- .select_rows('count > 1')
 "col_000" ,
 "col_001" ,
 "col_002" ,
 "col_003" ,
 "col_004" ,
 "col_005" ,
 "col_006" ,
 "col_007" ,
 "col_008" ,
 "col_009"
FROM
 "extend_0"
WHERE
 "count" > 1
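
A query of the same shape can be tried by hand on a tiny table. The sketch below swaps in Python's built-in `sqlite3` module for PostgreSQL purely for convenience (an assumption of this example, not something the data algebra run above used), and it uses a made-up two-column table. It needs an SQLite build with window function support (version 3.25 or newer).

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "d" ("col_000" TEXT, "col_001" TEXT)')
conn.executemany(
    'INSERT INTO "d" VALUES (?, ?)',
    [("c", "d"), ("a", "b"), ("c", "d"), ("b", "b")],  # hypothetical rows
)

# Same structure as the generated SQL: a windowed count, then a row filter.
query = """
WITH "extend_0" AS (
  SELECT
    "col_000",
    "col_001",
    SUM(1) OVER (PARTITION BY "col_000", "col_001") AS "count"
  FROM "d"
)
SELECT "col_000", "col_001"
FROM "extend_0"
WHERE "count" > 1
"""
dup_rows = conn.execute(query).fetchall()
conn.close()
```

The window function plays the role of the `.extend()` step, the `WHERE` clause is the `.select_rows()` step, and leaving `"count"` out of the final `SELECT` list is the `.drop_columns()` step.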



And that is how to find duplicated rows in Pandas, Polars, the data algebra, and in SQL.

I feel the data algebra supplies a versatile development environment that can be the right tool in a number of situations.

In [45]:
db_tables.close()

In [46]:
db_handle.close()