feat(python/sedonadb): add DataFrame.agg for global aggregation by jiayuasu · Pull Request #887 · apache/sedona-db

jiayuasu · 2026-05-29T04:37:58Z

First DataFrame consumer of the function-registry dispatch landed in #885. Adds global (ungrouped) aggregation; grouped aggregation (DataFrame.group_by(*keys).agg(*aggs)) is the next small PR, sharing the same Rust binding.

API

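import pandas as pd
import sedonadb
from sedonadb.expr import col

sd = sedonadb.connect()
# Include a "y" column so the count(col("y")) call below resolves.
df = sd.create_data_frame(pd.DataFrame({"x": [1, 2, 3, 4], "y": [5, 6, 7, 8]}))

# Single aggregate
df.agg(sd.funcs.sum(col("x")).alias("total"))
# Several aggregates in one call
df.agg(
    sd.funcs.sum(col("x")).alias("sum_x"),
    sd.funcs.count(col("y")).alias("n"),
    sd.funcs.min(col("x")).alias("lo"),
    sd.funcs.max(col("x")).alias("hi"),
)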

Takes varargs of aggregate Expr values, built via sd.funcs.<name>(args) from #885 (feat(python/sedonadb): Expose scalar and aggregate udfs from context registry).
Strings are not auto-promoted — a bare column isn't an aggregate.
Empty df.agg() → ValueError; non-Expr arg → TypeError.
Returns a one-row DataFrame.
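
For readers who want the argument rules above in executable form, here is a minimal pure-Python sketch. It is not the code in dataframe.py: `Expr` is a hypothetical stand-in class and `check_agg_args` is a hypothetical helper name, used only to illustrate the empty → ValueError and non-Expr → TypeError behavior.

```python
class Expr:
    """Hypothetical stand-in for sedonadb's Expr (assumption, not the real class)."""

    def __init__(self, text):
        self.text = text

    def alias(self, name):
        # Return a new expression carrying the output column name.
        return Expr(f"{self.text} AS {name}")


def check_agg_args(*exprs):
    """Illustrative version of the checks described for DataFrame.agg."""
    if not exprs:
        raise ValueError("agg() requires at least one aggregate expression")
    for e in exprs:
        # Strings are not promoted to columns; anything that is not an Expr fails.
        if not isinstance(e, Expr):
            raise TypeError(f"agg() expects Expr arguments, got {type(e).__name__}")
    return list(exprs)


checked = check_agg_args(Expr("sum(x)").alias("total"))
```

Note that this only checks `isinstance(e, Expr)`; whether an expression is actually an aggregate is left to the query planner, as discussed in the review below.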

Why this is so small

The function-registry dispatch in #885 means sd.funcs.sum, sd.funcs.count, sd.funcs.min, sd.funcs.max, sd.funcs.avg — and every other built-in / plugin / Python-registered aggregate — are already callable. This PR doesn't need any per-aggregate plumbing on either the Rust or Python side. One Rust binding, one Python method, a test file.
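
To illustrate why no per-aggregate plumbing is needed, the following is a rough sketch of name-based dispatch against a registry. It is an assumption about the general pattern, not the implementation from #885: `Funcs`, `Expr`, and the registry contents are hypothetical, and the real registry lives in the context.

```python
class Expr:
    """Hypothetical expression node recording a function name and its arguments."""

    def __init__(self, func, args):
        self.func = func
        self.args = args


class Funcs:
    """Hypothetical namespace that resolves attribute access against a registry."""

    def __init__(self, registry):
        self._registry = set(registry)

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails.
        if name.startswith("_") or name not in self._registry:
            raise AttributeError(f"no function named {name!r} in registry")

        def build(*args):
            # One generic builder serves every registered function.
            return Expr(name, args)

        return build


funcs = Funcs(["sum", "count", "min", "max", "avg"])
expr = funcs.sum("x")
```

With this shape, registering a new aggregate makes it callable without adding a method, which is the property the PR relies on.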

Implementation

File	Change
`python/sedonadb/src/dataframe.rs`	New `InternalDataFrame::aggregate(group_exprs, agg_exprs)`. Generic wrapper over DataFusion's `DataFrame::aggregate`. Shared with the upcoming `group_by` PR — that path passes a populated `group_exprs`.
`python/sedonadb/python/sedonadb/dataframe.py`	`DataFrame.agg(*exprs)`. Calls the Rust binding with an empty `group_exprs`.

Test plan

9 tests in tests/expr/test_dataframe_agg.py:

Positive: single sum; single count; paired min/max; avg over a compound expression col("x") + col("y"); four aggregates yielding a one-row four-column result.
Lazy return: isinstance(out, DataFrame).
Errors: empty agg() → ValueError; non-Expr arg → TypeError.
Plan composition: chained filter().agg() produces the right result.

All assertions use pd.testing.assert_frame_equal for outputs.

Local: 9 unit + 22 doctests + ruff format + ruff check all clean.

Copilot

Pull request overview

Adds Python DataFrame.agg(*exprs) support for global, ungrouped aggregation, using the existing function-registry expression dispatch and a new Rust binding over DataFusion aggregation.

Changes:

Adds InternalDataFrame::aggregate(group_exprs, agg_exprs) in Rust.
Adds Python DataFrame.agg() validation and lazy DataFrame return path.
Adds coverage for aggregate execution, errors, lazy return, and filter composition.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`python/sedonadb/src/dataframe.rs`	Adds the Rust aggregate binding over DataFusion `DataFrame::aggregate`.
`python/sedonadb/python/sedonadb/dataframe.py`	Adds the public Python `DataFrame.agg(*exprs)` API.
`python/sedonadb/tests/expr/test_dataframe_agg.py`	Adds tests for global aggregation behavior and validation.

jiayuasu · 2026-05-29T05:14:01Z

+        df.agg("x")
+
+
+def test_agg_chains_with_select(con):


Renamed to test_agg_chains_with_filter in 6fdc21c.

jiayuasu · 2026-05-29T05:14:03Z

+    /// The Python side guarantees `agg_exprs` is non-empty. Argument
+    /// shape validation (every entry being an aggregate-shaped `Expr`)
+    /// happens Python-side. DataFusion's plan-build raises a clear
+    /// error if a non-aggregate Expr is passed in `agg_exprs`, so we
+    /// don't try to enforce that here.


Rewrote the comment in 6fdc21c to drop the contradictory "Python-side validation" sentence — the Python wrapper only checks isinstance(e, Expr), not aggregate-shapedness, and DataFusion's plan-build catches the rest.

paleolimbot

Exciting...thank you!

Mostly nits...I'm hoping we can rename to aggregate (which is what DuckDB and Ibis call this).

paleolimbot · 2026-05-30T01:18:44Z

+        For grouped aggregation use `DataFrame.group_by(...).agg(...)`
+        (lands in a follow-up PR).
+


Suggested change

-        For grouped aggregation use `DataFrame.group_by(...).agg(...)`
-        (lands in a follow-up PR).

Dropped in 2196d03 — no more forward-reference to grouped agg in the docstring.

paleolimbot · 2026-05-30T01:19:11Z

+            >>> from sedonadb.expr import col
+            >>> sd = sedona.db.connect()
+            >>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
+            >>> df.agg(sd.funcs.sum(col("x")).alias("total")).show()


Suggested change

-            >>> from sedonadb.expr import col
-            >>> sd = sedona.db.connect()
-            >>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
-            >>> df.agg(sd.funcs.sum(col("x")).alias("total")).show()
+            >>> sd = sedona.db.connect()
+            >>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
+            >>> df.agg(sd.funcs.sum(sd.col("x")).alias("total")).show()

Switched to sd.col("x") in 2196d03 — the from sedonadb.expr import col line is gone from the doctest.

paleolimbot · 2026-05-30T01:43:09Z

+    def agg(self, *exprs: Expr) -> "DataFrame":
+        """Aggregate the entire DataFrame to a single row.


Can we call this aggregate()? (Ibis, DuckDB)

Can we expose **kwargs like is done in select()? df.aggregate(x_sum=df.x.sum()) is much more compact than df.aggregate(df.x.sum().alias("x_sum")) and is allowed by Ibis.

PySpark, Pandas and Polars all use agg. I'd like to keep it that way.

kwargs added in 2196d03 — df.agg(total=sd.funcs.sum(sd.col("x"))) now desugars to …sum(sd.col("x")).alias("total"), and positional + named can mix. Three new tests cover the kwarg path, mixed positional/kwarg, and the non-Expr kwarg value rejection.
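
A small sketch of the desugaring described above, for illustration only. `Expr` and `collect_agg_exprs` are hypothetical names, and placing positional expressions before named ones is an assumption rather than something stated in the PR.

```python
class Expr:
    """Hypothetical stand-in for sedonadb's Expr (assumption, not the real class)."""

    def __init__(self, text):
        self.text = text

    def alias(self, name):
        return Expr(f"{self.text} AS {name}")


def collect_agg_exprs(*exprs, **named_exprs):
    """Merge positional and keyword aggregates into one list of expressions."""
    collected = []
    for e in exprs:
        if not isinstance(e, Expr):
            raise TypeError(f"expected Expr, got {type(e).__name__}")
        collected.append(e)
    for name, e in named_exprs.items():
        if not isinstance(e, Expr):
            raise TypeError(f"expected Expr for {name!r}, got {type(e).__name__}")
        # total=<expr> is treated the same as <expr>.alias("total").
        collected.append(e.alias(name))
    if not collected:
        raise ValueError("agg() requires at least one aggregate expression")
    return collected


mixed = collect_agg_exprs(Expr("min(x)").alias("lo"), total=Expr("sum(x)"))
```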

paleolimbot · 2026-05-30T01:49:48Z

+    /// from `DataFrame.agg`) and grouped aggregation (called from
+    /// `DataFrame.group_by(...).agg(...)` once that lands).


Suggested change

-    /// from `DataFrame.agg`) and grouped aggregation (called from
-    /// `DataFrame.group_by(...).agg(...)` once that lands).
+    /// from `DataFrame.agg`) and grouped aggregation.

Simplified in 2196d03.

First DataFrame consumer of the function-registry dispatch landed in apache#885. Builds the call site that grouped aggregation will also use.

API:

df.agg(sd.funcs.sum(col("x")).alias("total"))
df.agg(
    sd.funcs.sum(col("x")).alias("sum_x"),
    sd.funcs.count(col("y")).alias("n"),
    sd.funcs.min(col("x")).alias("lo"),
    sd.funcs.max(col("x")).alias("hi"),
)

- Varargs of aggregate `Expr` values. Aggregate exprs come from `sd.funcs.<name>(args)` via apache#885; no per-aggregate plumbing in this PR (or any future PR; that's the whole point of the registry dispatch).
- Strings rejected: `df.agg("x")` has no meaning since a bare column isn't an aggregate. No auto-promotion.
- Empty `df.agg()` → ValueError; non-Expr arg → TypeError.
- Returns a one-row DataFrame.

Rust side: `InternalDataFrame::aggregate(group_exprs, agg_exprs)` is the generic binding for both `DataFrame.agg` (this PR: passes an empty `group_exprs`) and `DataFrame.group_by(*keys).agg(*aggs)` (next PR: same Rust call, with `group_exprs` populated). One binding serves both surfaces.

Tests: 9 covering single-aggregate (sum/count), min+max paired, avg over a compound expression, multiple-aggregates-one-row, lazy return, both error paths, and chained `filter().agg()` for plan composition.

paleolimbot

PySpark, Pandas and Polars all use agg. I'd like to keep it that way

At this point our API has little in common with PySpark, Pandas, and Polars but we can always alias it later if LLMs or their humans get confused. Either works for me 🙂

(Later we can add sedona.db.pyspark.connect() if that's a compatibility layer that is important)

Grouped aggregation on top of the registry-driven function dispatch (apache#885) and the global-aggregation binding (apache#887).

API:

df.group_by("k").agg(total=sd.funcs.sum(sd.col("v")))
df.group_by("k1", "k2").agg(
    sd.funcs.sum(col("x")).alias("sum_x"),
    n=sd.funcs.count(col("y")),
)
df.group_by(col("x") + col("y")).agg(...)
df.group_by(col("k"), "other_key").agg(...)

- `df.group_by(*keys)`: varargs of `str | Expr`. Strings auto-promote to `col(name)`; arbitrary `Expr` values are accepted as computed group keys. Empty keys → ValueError; non-str/non-Expr → TypeError.
- Returns a new `GroupedDataFrame`, a thin holder for the parent df plus the resolved group exprs. Single method `.agg(*exprs, **named_exprs)` with the same shape as `DataFrame.agg`.

Pure Python: the Rust `InternalDataFrame::aggregate(group_exprs, agg_exprs)` from apache#887 already handles the grouped case; this PR just populates `group_exprs` when constructing the aggregation.

The `GroupedDataFrame` intermediate is kept minimal (one method beyond `__init__`) so it stays a clean place to add convenience aggregates (`count`, `size`, etc.) later without polluting `DataFrame`.

Tests: 12 covering single/multi string keys, Expr keys, computed Expr keys, mixed str/Expr, positional + kwarg agg, lazy return type, and the empty/bad-type error paths for both `group_by` and its `.agg`.
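
The key handling described for `df.group_by(*keys)` can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the code in the follow-up PR: `Expr`, `col`, and `resolve_group_keys` are hypothetical stand-ins.

```python
class Expr:
    """Hypothetical stand-in for sedonadb's Expr (assumption, not the real class)."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        # Lets the sketch express a computed key such as col("x") + col("y").
        return Expr(f"({self.text} + {other.text})")


def col(name):
    """Hypothetical column constructor."""
    return Expr(name)


def resolve_group_keys(*keys):
    """Promote string keys to column expressions and validate the rest."""
    if not keys:
        raise ValueError("group_by() requires at least one key")
    resolved = []
    for k in keys:
        if isinstance(k, str):
            resolved.append(col(k))  # strings auto-promote to col(name)
        elif isinstance(k, Expr):
            resolved.append(k)  # computed keys pass through unchanged
        else:
            raise TypeError(f"group_by() keys must be str or Expr, got {type(k).__name__}")
    return resolved


keys = resolve_group_keys(col("x") + col("y"), "other_key")
```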

github-actions Bot requested a review from prantogg May 29, 2026 04:38

jiayuasu requested a review from Copilot May 29, 2026 04:39

Copilot started reviewing on behalf of jiayuasu May 29, 2026 04:39

Copilot AI reviewed May 29, 2026

jiayuasu force-pushed the feature/df-agg branch from f42bb35 to 6fdc21c May 29, 2026 05:13

paleolimbot reviewed May 30, 2026

jiayuasu force-pushed the feature/df-agg branch from 6fdc21c to 2196d03 May 30, 2026 03:24

paleolimbot approved these changes Jun 1, 2026

jiayuasu marked this pull request as ready for review June 1, 2026 04:50

jiayuasu merged commit ea969b7 into apache:main Jun 1, 2026
5 checks passed

jiayuasu mentioned this pull request Jun 1, 2026

feat(python/sedonadb): add DataFrame.group_by + GroupedDataFrame.agg #893

Open
