Skip to content

feat(python/sedonadb): add DataFrame.agg for global aggregation#887

Merged
jiayuasu merged 1 commit into
apache:mainfrom
jiayuasu:feature/df-agg
Jun 1, 2026
Merged

feat(python/sedonadb): add DataFrame.agg for global aggregation#887
jiayuasu merged 1 commit into
apache:mainfrom
jiayuasu:feature/df-agg

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

First DataFrame consumer of the function-registry dispatch landed in #885. Adds global (ungrouped) aggregation; grouped aggregation (DataFrame.group_by(*keys).agg(*aggs)) is the next small PR, sharing the same Rust binding.

API

sd = sedonadb.connect()
df = sd.create_data_frame(pd.DataFrame({"x": [1, 2, 3, 4]}))

df.agg(sd.funcs.sum(col("x")).alias("total"))
df.agg(
    sd.funcs.sum(col("x")).alias("sum_x"),
    sd.funcs.count(col("y")).alias("n"),
    sd.funcs.min(col("x")).alias("lo"),
    sd.funcs.max(col("x")).alias("hi"),
)

Why this is so small

The function-registry dispatch in #885 means sd.funcs.sum, sd.funcs.count, sd.funcs.min, sd.funcs.max, sd.funcs.avg — and every other built-in / plugin / Python-registered aggregate — are already callable. This PR doesn't need any per-aggregate plumbing on either the Rust or Python side. One Rust binding, one Python method, a test file.

Implementation

File Change
python/sedonadb/src/dataframe.rs New InternalDataFrame::aggregate(group_exprs, agg_exprs). Generic wrapper over DataFusion's DataFrame::aggregate. Shared with the upcoming group_by PR — that path passes a populated group_exprs.
python/sedonadb/python/sedonadb/dataframe.py DataFrame.agg(*exprs). Calls the Rust binding with an empty group_exprs.

Test plan

9 tests in tests/expr/test_dataframe_agg.py:

  • Positive: single sum; single count; paired min/max; avg over a compound expression col("x") + col("y"); four aggregates yielding a one-row four-column result.
  • Lazy return: isinstance(out, DataFrame).
  • Errors: empty agg()ValueError; non-Expr arg → TypeError.
  • Plan composition: chained filter().agg() produces the right result.

All assertions use pd.testing.assert_frame_equal for outputs.

Local: 9 unit + 22 doctests + ruff format + ruff check all clean.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Python DataFrame.agg(*exprs) support for global, ungrouped aggregation, using the existing function-registry expression dispatch and a new Rust binding over DataFusion aggregation.

Changes:

  • Adds InternalDataFrame::aggregate(group_exprs, agg_exprs) in Rust.
  • Adds Python DataFrame.agg() validation and lazy DataFrame return path.
  • Adds coverage for aggregate execution, errors, lazy return, and filter composition.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
python/sedonadb/src/dataframe.rs Adds the Rust aggregate binding over DataFusion DataFrame::aggregate.
python/sedonadb/python/sedonadb/dataframe.py Adds the public Python DataFrame.agg(*exprs) API.
python/sedonadb/tests/expr/test_dataframe_agg.py Adds tests for global aggregation behavior and validation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

df.agg("x")


def test_agg_chains_with_select(con):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to test_agg_chains_with_filter in 6fdc21c.

Comment thread python/sedonadb/src/dataframe.rs Outdated
Comment on lines +244 to +248
/// The Python side guarantees `agg_exprs` is non-empty. Argument
/// shape validation (every entry being an aggregate-shaped `Expr`)
/// happens Python-side. DataFusion's plan-build raises a clear
/// error if a non-aggregate Expr is passed in `agg_exprs`, so we
/// don't try to enforce that here.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote the comment in 6fdc21c to drop the contradictory "Python-side validation" sentence — the Python wrapper only checks isinstance(e, Expr), not aggregate-shapedness, and DataFusion's plan-build catches the rest.

Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting...thank you!

Mostly nits...I'm hoping we can rename to aggregate (which is what DuckDB and Ibis call this).

Comment on lines +550 to +552
For grouped aggregation use `DataFrame.group_by(...).agg(...)`
(lands in a follow-up PR).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For grouped aggregation use `DataFrame.group_by(...).agg(...)`
(lands in a follow-up PR).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped in 2196d03 — no more forward-reference to grouped agg in the docstring.

Comment on lines +559 to +562
>>> from sedonadb.expr import col
>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
>>> df.agg(sd.funcs.sum(col("x")).alias("total")).show()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
>>> from sedonadb.expr import col
>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
>>> df.agg(sd.funcs.sum(col("x")).alias("total")).show()
>>> sd = sedona.db.connect()
>>> df = sd.sql("SELECT * FROM (VALUES (1), (2), (3), (4)) AS t(x)")
>>> df.agg(sd.funcs.sum(sd.col("x")).alias("total")).show()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to sd.col("x") in 2196d03 — the from sedonadb.expr import col line is gone from the doctest.

Comment on lines +542 to +543
def agg(self, *exprs: Expr) -> "DataFrame":
"""Aggregate the entire DataFrame to a single row.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Can we call this aggregate()? (Ibis, DuckDB)
  • Can we expose **kwargs like is done in select()? df.aggregate(x_sum=df.x.sum()) is much more compact than df.aggregate(df.x.sum().alias("x_sum")) and is allowed by Ibis.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PySpark, Pandas and Polars all use agg. I'd like to keep it that way.

Copy link
Copy Markdown
Member Author

@jiayuasu jiayuasu May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kwargs added in 2196d03df.agg(total=sd.funcs.sum(sd.col("x"))) now desugars to …sum(sd.col("x")).alias("total"), and positional + named can mix. Three new tests cover the kwarg path, mixed positional/kwarg, and the non-Expr kwarg value rejection.

Comment thread python/sedonadb/src/dataframe.rs Outdated
Comment on lines +241 to +242
/// from `DataFrame.agg`) and grouped aggregation (called from
/// `DataFrame.group_by(...).agg(...)` once that lands).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// from `DataFrame.agg`) and grouped aggregation (called from
/// `DataFrame.group_by(...).agg(...)` once that lands).
/// from `DataFrame.agg`) and grouped aggregation.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified in 2196d03.

First DataFrame consumer of the function-registry dispatch landed in
apache#885. Builds the call site that grouped aggregation will also use.

API:

    df.agg(sd.funcs.sum(col("x")).alias("total"))
    df.agg(
        sd.funcs.sum(col("x")).alias("sum_x"),
        sd.funcs.count(col("y")).alias("n"),
        sd.funcs.min(col("x")).alias("lo"),
        sd.funcs.max(col("x")).alias("hi"),
    )

- Varargs of aggregate `Expr` values. Aggregate exprs come from
  `sd.funcs.<name>(args)` via apache#885; no per-aggregate plumbing in this
  PR (or any future PR — that's the whole point of the registry
  dispatch).
- Strings rejected — `df.agg("x")` has no meaning since a bare column
  isn't an aggregate. No auto-promotion.
- Empty `df.agg()` → ValueError; non-Expr arg → TypeError.
- Returns a one-row DataFrame.

Rust side: `InternalDataFrame::aggregate(group_exprs, agg_exprs)` is
the generic binding for both `DataFrame.agg` (this PR — passes an
empty `group_exprs`) and `DataFrame.group_by(*keys).agg(*aggs)`
(next PR — same Rust call, with `group_exprs` populated). One
binding serves both surfaces.

Tests: 9 covering single-aggregate (sum/count), min+max paired,
avg over a compound expression, multiple-aggregates-one-row,
lazy return, both error paths, and chained `filter().agg()` for
plan composition.
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PySpark, Pandas and Polars all use agg. I'd like to keep it that way

At this point our API has little in common with PySpark, Pandas, and Polars but we can always alias it later if LLMs or their humans get confused. Either works for me 🙂

(Later we can add sedona.db.pyspark.connect() if that's a compatibility layer that is important)

@jiayuasu jiayuasu marked this pull request as ready for review June 1, 2026 04:50
@jiayuasu jiayuasu merged commit ea969b7 into apache:main Jun 1, 2026
5 checks passed
jiayuasu added a commit to jiayuasu/sedona-db that referenced this pull request Jun 1, 2026
Grouped aggregation on top of the registry-driven function dispatch
(apache#885) and the global-aggregation binding (apache#887).

API:

    df.group_by("k").agg(total=sd.funcs.sum(sd.col("v")))
    df.group_by("k1", "k2").agg(
        sd.funcs.sum(col("x")).alias("sum_x"),
        n=sd.funcs.count(col("y")),
    )
    df.group_by(col("x") + col("y")).agg(...)
    df.group_by(col("k"), "other_key").agg(...)

- `df.group_by(*keys)` — varargs of `str | Expr`. Strings auto-promote
  to `col(name)`; arbitrary `Expr` values are accepted as computed
  group keys. Empty keys → ValueError; non-str/non-Expr → TypeError.
- Returns a new `GroupedDataFrame` — a thin holder for the parent df
  plus the resolved group exprs. Single method `.agg(*exprs,
  **named_exprs)` with the same shape as `DataFrame.agg`.

Pure Python — the Rust `InternalDataFrame::aggregate(group_exprs,
agg_exprs)` from apache#887 already handles the grouped case; this PR just
populates `group_exprs` when constructing the aggregation.

The `GroupedDataFrame` intermediate is kept minimal (one method
beyond `__init__`) so it stays a clean place to add convenience
aggregates (`count`, `size`, etc.) later without polluting
`DataFrame`.

Tests: 12 covering single/multi string keys, Expr keys, computed
Expr keys, mixed str/Expr, positional + kwarg agg, lazy return type,
and the empty/bad-type error paths for both `group_by` and its
`.agg`.
jiayuasu added a commit to jiayuasu/sedona-db that referenced this pull request Jun 1, 2026
Grouped aggregation on top of the registry-driven function dispatch
(apache#885) and the global-aggregation binding (apache#887).

API:

    df.group_by("k").agg(total=sd.funcs.sum(sd.col("v")))
    df.group_by("k1", "k2").agg(
        sd.funcs.sum(col("x")).alias("sum_x"),
        n=sd.funcs.count(col("y")),
    )
    df.group_by(col("x") + col("y")).agg(...)
    df.group_by(col("k"), "other_key").agg(...)

- `df.group_by(*keys)` — varargs of `str | Expr`. Strings auto-promote
  to `col(name)`; arbitrary `Expr` values are accepted as computed
  group keys. Empty keys → ValueError; non-str/non-Expr → TypeError.
- Returns a new `GroupedDataFrame` — a thin holder for the parent df
  plus the resolved group exprs. Single method `.agg(*exprs,
  **named_exprs)` with the same shape as `DataFrame.agg`.

Pure Python — the Rust `InternalDataFrame::aggregate(group_exprs,
agg_exprs)` from apache#887 already handles the grouped case; this PR just
populates `group_exprs` when constructing the aggregation.

The `GroupedDataFrame` intermediate is kept minimal (one method
beyond `__init__`) so it stays a clean place to add convenience
aggregates (`count`, `size`, etc.) later without polluting
`DataFrame`.

Tests: 12 covering single/multi string keys, Expr keys, computed
Expr keys, mixed str/Expr, positional + kwarg agg, lazy return type,
and the empty/bad-type error paths for both `group_by` and its
`.agg`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants