In this note we compare a simple grouped calculation in Python using [Pandas](https://pandas.pydata.org), [Polars](https://www.pola.rs), and the [data algebra](https://github.com/WinVector/data_algebra) (over Pandas and over Polars).

What is really neat is:

  * Polars is indeed faster than Pandas
  * Programming over Polars with the data algebra is very low overhead!
  * The same data algebra operations can be used over Pandas, Polars (support still under development), or SQL.

First, let's import our packages and set up our example data.

In [1]:
import sys
import time
import numpy as np
import pandas as pd
import polars as pl
import pyarrow
import data_algebra
import data_algebra.test_util


In [2]:
rng = np.random.default_rng(2022)

In [3]:
def mk_example(*, n_rows: int, n_groups: int):
    assert n_rows > 0
    assert n_groups > 0
    groups = [f"group_{i:04d}" for i in range(n_groups)]
    d = pd.DataFrame({
        "group": rng.choice(groups, size=n_rows, replace=True),
        "value": rng.normal(size=n_rows)
    })
    return d

In [4]:
d_Pandas = mk_example(n_rows=10, n_groups=2)


In [5]:
d_Polars = pl.DataFrame(d_Pandas)

Our task: compute the minimum and maximum of `value` for each group specified by `group`.
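To make the task concrete, here is a minimal pure-Python sketch of the same grouped calculation. The function name `grouped_min_max` and the toy rows are hypothetical and used only for illustration; they are not the data or code timed in this note.

```python
from collections import defaultdict


def grouped_min_max(rows):
    """Return {group: (min_value, max_value)} for an iterable of (group, value) pairs."""
    by_group = defaultdict(list)
    for group, value in rows:
        by_group[group].append(value)
    # one (min, max) pair per distinct group
    return {g: (min(vs), max(vs)) for g, vs in by_group.items()}


# hypothetical toy data, not the example data generated above
toy_rows = [
    ("group_0000", 0.5),
    ("group_0001", -1.0),
    ("group_0000", -2.0),
    ("group_0001", 0.25),
]
toy_result = grouped_min_max(toy_rows)
print(toy_result)
```

Each of the data frame solutions that follow produces this same per-group summary, just with the grouping and aggregation pushed down into the library.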

First in Pandas.

In [6]:
res_pandas = (
    d_Pandas
        .groupby(["group"])
        .agg({"value": ["min", "max"]})
)

res_pandas

group,value (min),value (max)
group_0000,-2.931249,1.667716
group_0001,-1.440234,0.078888


Now in Polars.

In [7]:
res_Polars = (
    d_Polars
        .groupby(["group"])
        .agg([
            pl.col("value").min().alias("min_value"),
            pl.col("value").max().alias("max_value"),
        ])
)

res_Polars

group,min_value,max_value
str,f64,f64
"group_0001",-1.440234,0.078888
"group_0000",-2.931249,1.667716


Now in the data algebra.

In [8]:
ops = (
    data_algebra.descr(d=d_Pandas)
        .project(
            {
                "max_value": "value.max()",
                "min_value": "value.min()",
            },
            group_by=["group"]
        )
)

We have the data algebra working over Pandas.

In [9]:
res_data_algebra_Pandas = ops.transform(d_Pandas)

res_data_algebra_Pandas

,group,max_value,min_value
0,group_0000,1.667716,-2.931249
1,group_0001,0.078888,-1.440234


Or we can have the data algebra working over Polars.

In [10]:
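# use a distinct name so the direct Polars result above is not overwritten
res_data_algebra_Polars = ops.transform(d_Polars)
assert data_algebra.test_util.equivalent_frames(res_data_algebra_Polars.to_pandas(), res_data_algebra_Pandas)

res_data_algebra_Polars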

group,max_value,min_value
str,f64,f64
"group_0000",1.667716,-2.931249
"group_0001",0.078888,-1.440234


Let's build a larger example to get some timings.

In [11]:
d_Pandas = mk_example(n_rows=1000000, n_groups=100000)
d_Polars = pl.DataFrame(d_Pandas)
n_repetitions = 10

In [12]:
t0 = time.perf_counter()
for i in range(n_repetitions):
    res_data_algebra_Pandas = ops.transform(d_Pandas)
t1 = time.perf_counter()
time_data_algebra_Pandas = t1 - t0

In [13]:
t0 = time.perf_counter()
for i in range(n_repetitions):
    res_pandas = (
        d_Pandas
            .groupby(["group"])
            .agg({"value": ["min", "max"]})
    )
t1 = time.perf_counter()
time_Pandas = t1 - t0

In [14]:
t0 = time.perf_counter()
for i in range(n_repetitions):
    res_data_algebra_Polars = ops.transform(d_Polars)
t1 = time.perf_counter()
time_data_algebra_Polars = t1 - t0

In [15]:
t0 = time.perf_counter()
for i in range(n_repetitions):
    res_Polars = (
        d_Polars
            .groupby(["group"])
            .agg([
                pl.col("value").min().alias("min_value"),
                pl.col("value").max().alias("max_value"),
            ])
    )
t1 = time.perf_counter()
time_Polars = t1 - t0

In [16]:
t0 = time.perf_counter()
for i in range(n_repetitions):
    res_Polars_lazy = (
        d_Polars
            .lazy()
            .groupby(["group"])
            .agg([
                pl.col("value").min().alias("min_value"),
                pl.col("value").max().alias("max_value"),
            ])
            .collect()
    )
t1 = time.perf_counter()
time_Polars_lazy = t1 - t0

In [17]:
assert data_algebra.test_util.equivalent_frames(res_Polars.to_pandas(), res_data_algebra_Pandas)

In [18]:
timings = pd.DataFrame({
    "method": ["Pandas", "data_algebra_Pandas", "Polars", "Polars (lazy)", "data_algebra_Polars"],
    "time (seconds/run)": [time_Pandas, time_data_algebra_Pandas, time_Polars, time_Polars_lazy, time_data_algebra_Polars],
})
timings["time (seconds/run)"] = timings["time (seconds/run)"] / n_repetitions

timings

,method,time (seconds/run)
0,Pandas,0.279023
1,data_algebra_Pandas,0.406919
2,Polars,0.110303
3,Polars (lazy),0.112036
4,data_algebra_Polars,0.12359


What we see includes:

  * The data algebra, unfortunately, does have some overhead cost when working over Pandas.
  * Polars is faster than Pandas.
  * The data algebra has little overhead when working over Polars (and it does use the lazy interface).
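These observations can be put in rough numbers by taking ratios of the per-run times reported in the table above. The snippet below is only a sketch of that arithmetic; the helper `ratio` is a hypothetical name, and the choice to compare the data algebra over Polars against the lazy Polars run is our assumption (made because the data algebra uses the lazy interface).

```python
# per-run timings in seconds, copied from the timings table in this note
seconds_per_run = {
    "Pandas": 0.279023,
    "data_algebra_Pandas": 0.406919,
    "Polars": 0.110303,
    "Polars (lazy)": 0.112036,
    "data_algebra_Polars": 0.12359,
}


def ratio(numerator: str, denominator: str) -> float:
    """Return how many times slower `numerator` is than `denominator`."""
    return seconds_per_run[numerator] / seconds_per_run[denominator]


overhead_pandas = ratio("data_algebra_Pandas", "Pandas")
overhead_polars = ratio("data_algebra_Polars", "Polars (lazy)")
speedup_polars = ratio("Pandas", "Polars")

print(f"data algebra over Pandas vs. Pandas: {overhead_pandas:.2f}x")
print(f"data algebra over Polars vs. Polars (lazy): {overhead_polars:.2f}x")
print(f"Pandas vs. Polars: {speedup_polars:.2f}x")
```

On this single run, that is roughly a 1.5x cost for the data algebra over Pandas, about a 1.1x cost over Polars, and Polars running about 2.5 times faster than Pandas.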

For a serious study, we would want a longer task, more runs, standard deviations, and to also eliminate possible warm-start issues (though we did run data algebra first to make sure it was cold-started).
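As a sketch of what such a study might look like, the harness below runs a few untimed warm-up calls, then times each run separately so a mean and standard deviation can be reported. The names `time_runs`, `summarize`, and `example_task` are hypothetical, and the workload is a stand-in; this is not the procedure used for the timings in this note.

```python
import statistics
import time


def time_runs(f, *, n_warmup: int = 1, n_runs: int = 5):
    """Call f() n_warmup times untimed, then return a list of n_runs per-call durations (seconds)."""
    for _ in range(n_warmup):
        f()  # warm-up calls are not timed
    durations = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        f()
        durations.append(time.perf_counter() - t0)
    return durations


def summarize(durations):
    """Return mean, sample standard deviation, and minimum of the durations."""
    return {
        "mean": statistics.mean(durations),
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min": min(durations),
    }


def example_task():
    # stand-in workload, used only to exercise the harness
    return sum(i * i for i in range(10_000))


example_durations = time_runs(example_task, n_warmup=2, n_runs=5)
example_summary = summarize(example_durations)
print(example_summary)
```

In the setting of this note, one could pass something like `lambda: ops.transform(d_Polars)` as `f` and compare the summaries across methods.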

And, just for fun, we can convert the data algebra ops to SQL.

In [19]:
print(ops.to_sql())

-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: SQLiteModel 1.5.1
--       string quote: '
--   identifier quote: "
SELECT  -- .project({ 'max_value': 'value.max()', 'min_value': 'value.min()'}, group_by=['group'])
 MAX("value") AS "max_value" ,
 MIN("value") AS "min_value" ,
 "group"
FROM
 "d"
GROUP BY
 "group"

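Since the generated query targets the SQLite dialect, one way to try it is with Python's standard `sqlite3` module. The sketch below copies the query text from the output above (minus the header comments) and runs it against a small in-memory table named `d`; the toy rows and the column types in the `CREATE TABLE` statement are our assumptions, not part of the original example.

```python
import sqlite3

# query text copied from the data algebra output above
query = """
SELECT
 MAX("value") AS "max_value" ,
 MIN("value") AS "min_value" ,
 "group"
FROM
 "d"
GROUP BY
 "group"
"""

# hypothetical toy data, not the example data used in this note
toy_rows = [
    ("group_0000", 0.5),
    ("group_0001", -1.0),
    ("group_0000", -2.0),
    ("group_0001", 0.25),
]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "d" ("group" TEXT, "value" REAL)')
conn.executemany('INSERT INTO "d" ("group", "value") VALUES (?, ?)', toy_rows)
# GROUP BY does not promise an output order, so sort by group name in Python
sql_result = sorted(conn.execute(query).fetchall(), key=lambda row: row[2])
conn.close()

print(sql_result)
```

Each result row is `(max_value, min_value, group)`, matching the column order in the generated `SELECT`.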


Or even print the ops themselves.

In [20]:
print(ops.to_python(pretty=True))

(
    TableDescription(table_name="d", column_names=["group", "value"]).project(
        {"max_value": "value.max()", "min_value": "value.min()"}, group_by=["group"]
    )
)



In [21]:
# write summary results out
timings["language"] = "Python"
timings.to_csv("tgc_python_timings.csv", index=False)

The system and package versions used for this demonstration are as follows.

In [22]:
sys.version

'3.10.8 (main, Nov 24 2022, 08:09:04) [Clang 14.0.6 ]'

In [23]:
pl.__version__

'0.15.3'

In [24]:
pd.__version__

'1.5.2'

In [25]:
data_algebra.__version__

'1.5.1'

In [26]:
pyarrow.__version__

'8.0.0'

In [27]:
# compare to R run
r_timings = pd.read_csv("tgc_r_summary.csv")
overall_timings = pd.concat([timings, r_timings], ignore_index=True)
overall_timings.to_csv("tgc_overall_timings.csv", index=False)

overall_timings

,method,time (seconds/run),language
0,Pandas,0.279023,Python
1,data_algebra_Pandas,0.406919,Python
2,Polars,0.110303,Python
3,Polars (lazy),0.112036,Python
4,data_algebra_Polars,0.12359,Python
5,base_R,4.941736,R
6,data_table,0.097809,R
7,dplyr,1.054196,R
8,dtplyr,0.133225,R
9,rqdatabable,0.212543,R
