feat(python/sedonadb): add DataFrame.sort_values by jiayuasu · Pull Request #859 · apache/sedona-db

jiayuasu · 2026-05-19T05:01:55Z

Continues the Phase P2 work of #791 with pandas-style sort_values on the lazy DataFrame.

API

df.sort_values(by="x")
df.sort_values(by="x", ascending=False)
df.sort_values(by=["x", "y"])
df.sort_values(by=["x", "y"], ascending=[True, False])   # mixed
df.sort_values(by=col("x") + col("y"))                    # Expr key
df.sort_values(by=[col("x"), "y"])                        # mixed str / Expr

by accepts str | Expr | list[str | Expr]. ascending is bool | list[bool] (broadcast to by length if scalar, must match if list).

Two semantic calls worth flagging

Nulls last in both directions. Pandas's default is na_position='last' regardless of direction; DataFusion's SQL-style default is nulls-first on descending. We override DataFusion's default so pandas users porting code get the placement they expect. If the SQL behavior is needed, sd.sql("... ORDER BY ... NULLS FIRST") is still the escape hatch.

No inplace= kwarg. Our DataFrame wraps an immutable DataFusion LogicalPlan, so true in-place mutation is impossible. Rather than silently warn-and-ignore (the design-doc default, which leaves callers with an unsorted frame they ignored the return value of), the kwarg is simply not defined — Python raises the standard TypeError: sort_values() got an unexpected keyword argument 'inplace'. A test locks that contract. The design doc will be updated to match in a follow-up.

Test plan

17 tests in tests/expr/test_dataframe_sort_values.py:

Positive: single key asc/desc, by col(...) Expr, by computed Expr (col("x") + col("y")), multi-key all-asc, multi-key mixed asc/desc with deliberately scrambled input so a broken implementation cannot pass, ascending-scalar-broadcast, nulls-last asc, nulls-last desc.
Lazy return: isinstance(out, DataFrame) (not hasattr, per Dewey's policy).
Errors: empty by, length mismatch, bad by type, bad by element type (matched on a phrase only that error path uses), bad ascending type, bad ascending list element, inplace= rejected by Python.

All assertions use pd.testing.assert_frame_equal for outputs and pytest.raises(..., match="discriminating phrase") for errors. No substring-on-a-single-character patterns.

Local: 17 unit + 18 doctests + ruff format + ruff check all green.

Pandas-style sort_values over the lazy DataFrame. API: - `df.sort_values(by, ascending=True)` - `by`: str, Expr, or list of those (column name or arbitrary expression key). - `ascending`: bool or list-of-bool (broadcast to `by` length if scalar, must match length if list). Null placement: nulls go last regardless of direction, matching pandas' default `na_position='last'`. This overrides DataFusion's SQL-style nulls-first-on-descending convention; if the SQL behavior is needed it remains available via `sd.sql("... ORDER BY ... NULLS FIRST")`. No `inplace=` kwarg: SedonaDB's DataFrame wraps an immutable DataFusion LogicalPlan, so true in-place mutation is impossible. Rather than silently warn-and-ignore (which would leave callers with an unsorted frame they ignored the return value of), the kwarg is simply not defined — Python raises the standard unexpected-keyword TypeError. The test suite locks that contract. Rust side: `InternalDataFrame::sort_by_keys(exprs, ascending)` pairs the two equal-length vectors into `SortExpr`s with `nulls_first=false` for every key, then delegates to DataFusion's `DataFrame::sort`.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

paleolimbot

Since Pandas invented sort_values() there have been more elegant/composable ways to handle this that have evolved.

For our purposes, I think we should:

Expose all the bells and whistles of DataFusion's SortExpr via sedonadb.expr.sort_expr(expr, <options like asc/dsc/nulls>)
Add methods to Expr (.asc(nulls), .desc(nulls)) that return SortExpr
Accept str, Expr, or SortExpr in sort_values
Name it either sort() (DataFusion python, DuckDB) or order_by() (Ibis)
Accept multiple arguments instead of a list (auto formats with Ruff into multiple lines better, all three newer interfaces do this)

Some more elegant interfaces for reference:

jiayuasu requested a review from Copilot May 19, 2026 05:15

Copilot started reviewing on behalf of jiayuasu May 19, 2026 05:15 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

github-actions Bot requested a review from prantogg May 19, 2026 05:51

paleolimbot reviewed May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(python/sedonadb): add DataFrame.sort_values#859

feat(python/sedonadb): add DataFrame.sort_values#859
jiayuasu wants to merge 1 commit into
apache:mainfrom
jiayuasu:feature/df-sort-values

jiayuasu commented May 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

paleolimbot left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jiayuasu commented May 19, 2026

API

Two semantic calls worth flagging

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

paleolimbot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

paleolimbot left a comment •

edited

Loading