Skip to content

feat(python/sedonadb): add DataFrame.sort_values#859

Draft
jiayuasu wants to merge 1 commit into
apache:mainfrom
jiayuasu:feature/df-sort-values
Draft

feat(python/sedonadb): add DataFrame.sort_values#859
jiayuasu wants to merge 1 commit into
apache:mainfrom
jiayuasu:feature/df-sort-values

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

Continues the Phase P2 work of #791 with pandas-style sort_values on the lazy DataFrame.

API

df.sort_values(by="x")
df.sort_values(by="x", ascending=False)
df.sort_values(by=["x", "y"])
df.sort_values(by=["x", "y"], ascending=[True, False])   # mixed
df.sort_values(by=col("x") + col("y"))                    # Expr key
df.sort_values(by=[col("x"), "y"])                        # mixed str / Expr

by accepts str | Expr | list[str | Expr]. ascending is bool | list[bool] (broadcast to by length if scalar, must match if list).

Two semantic calls worth flagging

Nulls last in both directions. Pandas's default is na_position='last' regardless of direction; DataFusion's SQL-style default is nulls-first on descending. We override DataFusion's default so pandas users porting code get the placement they expect. If the SQL behavior is needed, sd.sql("... ORDER BY ... NULLS FIRST") is still the escape hatch.

No inplace= kwarg. Our DataFrame wraps an immutable DataFusion LogicalPlan, so true in-place mutation is impossible. Rather than silently warn-and-ignore (the design-doc default, which leaves callers with an unsorted frame they ignored the return value of), the kwarg is simply not defined — Python raises the standard TypeError: sort_values() got an unexpected keyword argument 'inplace'. A test locks that contract. The design doc will be updated to match in a follow-up.

Test plan

17 tests in tests/expr/test_dataframe_sort_values.py:

  • Positive: single key asc/desc, by col(...) Expr, by computed Expr (col("x") + col("y")), multi-key all-asc, multi-key mixed asc/desc with deliberately scrambled input so a broken implementation cannot pass, ascending-scalar-broadcast, nulls-last asc, nulls-last desc.
  • Lazy return: isinstance(out, DataFrame) (not hasattr, per Dewey's policy).
  • Errors: empty by, length mismatch, bad by type, bad by element type (matched on a phrase only that error path uses), bad ascending type, bad ascending list element, inplace= rejected by Python.

All assertions use pd.testing.assert_frame_equal for outputs and pytest.raises(..., match="discriminating phrase") for errors. No substring-on-a-single-character patterns.

Local: 17 unit + 18 doctests + ruff format + ruff check all green.

Pandas-style sort_values over the lazy DataFrame.

API:
- `df.sort_values(by, ascending=True)`
- `by`: str, Expr, or list of those (column name or arbitrary
  expression key).
- `ascending`: bool or list-of-bool (broadcast to `by` length if
  scalar, must match length if list).

Null placement: nulls go last regardless of direction, matching
pandas' default `na_position='last'`. This overrides DataFusion's
SQL-style nulls-first-on-descending convention; if the SQL behavior
is needed it remains available via `sd.sql("... ORDER BY ... NULLS
FIRST")`.

No `inplace=` kwarg: SedonaDB's DataFrame wraps an immutable
DataFusion LogicalPlan, so true in-place mutation is impossible.
Rather than silently warn-and-ignore (which would leave callers
with an unsorted frame they ignored the return value of), the
kwarg is simply not defined — Python raises the standard
unexpected-keyword TypeError. The test suite locks that contract.

Rust side: `InternalDataFrame::sort_by_keys(exprs, ascending)`
pairs the two equal-length vectors into `SortExpr`s with
`nulls_first=false` for every key, then delegates to DataFusion's
`DataFrame::sort`.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@github-actions github-actions Bot requested a review from prantogg May 19, 2026 05:51
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Pandas invented sort_values() there have been more elegant/composable ways to handle this that have evolved.

For our purposes, I think we should:

  • Expose all the bells and whistles of DataFusion's SortExpr via sedonadb.expr.sort_expr(expr, <options like asc/dsc/nulls>)
  • Add methods to Expr (.asc(nulls), .desc(nulls)) that return SortExpr
  • Accept str, Expr, or SortExpr in sort_values
  • Name it either sort() (DataFusion python, DuckDB) or order_by() (Ibis)
  • Accept multiple arguments instead of a list (auto formats with Ruff into multiple lines better, all three newer interfaces do this)

Some more elegant interfaces for reference:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants