feat(python/sedonadb): add DataFrame.sort_values #859
Draft
jiayuasu wants to merge 1 commit into
Pandas-style sort_values over the lazy DataFrame.
API:
- `df.sort_values(by, ascending=True)`
- `by`: str, Expr, or list of those (column name or arbitrary
expression key).
- `ascending`: bool or list-of-bool (broadcast to `by` length if
scalar, must match length if list).
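A minimal sketch of the argument normalization described above, in plain Python. `normalize_sort_args` and its error messages are hypothetical and are not taken from the PR's implementation; keys are treated as opaque objects here, so an `Expr` would pass through the same way a `str` does.

```python
from typing import Any, List, Tuple, Union


def normalize_sort_args(
    by: Any, ascending: Union[bool, List[bool]] = True
) -> List[Tuple[Any, bool]]:
    """Pair each sort key with its direction (hypothetical helper)."""
    # A single key (column name or expression object) becomes a one-element list.
    keys = list(by) if isinstance(by, list) else [by]

    if isinstance(ascending, bool):
        # Scalar: broadcast to the number of keys.
        directions = [ascending] * len(keys)
    elif isinstance(ascending, list):
        directions = list(ascending)
        if not all(isinstance(d, bool) for d in directions):
            raise TypeError("every element of 'ascending' must be a bool")
        if len(directions) != len(keys):
            raise ValueError(
                f"'ascending' has {len(directions)} entries but 'by' has {len(keys)}"
            )
    else:
        raise TypeError("'ascending' must be a bool or a list of bool")

    return list(zip(keys, directions))
```

Under these assumptions, `normalize_sort_args(["x", "y"], False)` pairs both keys with `False`, while a one-element `ascending` list against two keys raises `ValueError`.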
Null placement: nulls go last regardless of direction, matching
pandas' default `na_position='last'`. This overrides DataFusion's
SQL-style nulls-first-on-descending convention; if the SQL behavior
is needed it remains available via `sd.sql("... ORDER BY ... NULLS
FIRST")`.
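To make the two placements concrete, here is a small pure-Python illustration. `sort_column` is a hypothetical helper that models null placement on a single column; it does not call pandas, DataFusion, or SedonaDB.

```python
from typing import List, Optional


def sort_column(
    values: List[Optional[int]], ascending: bool, nulls_first: bool
) -> List[Optional[int]]:
    """Sort values, placing None entries according to nulls_first."""
    present = sorted((v for v in values if v is not None), reverse=not ascending)
    nulls = [v for v in values if v is None]
    return nulls + present if nulls_first else present + nulls


data = [3, None, 1, 2]

# Placement chosen by this PR: nulls last in both directions.
nulls_last_asc = sort_column(data, ascending=True, nulls_first=False)
nulls_last_desc = sort_column(data, ascending=False, nulls_first=False)

# SQL-style convention as described above: nulls first when descending.
sql_style_desc = sort_column(data, ascending=False, nulls_first=True)

print(nulls_last_asc)   # [1, 2, 3, None]
print(nulls_last_desc)  # [3, 2, 1, None]
print(sql_style_desc)   # [None, 3, 2, 1]
```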
No `inplace=` kwarg: SedonaDB's DataFrame wraps an immutable
DataFusion LogicalPlan, so true in-place mutation is impossible.
Rather than warn and silently ignore the kwarg (which would leave
callers who discard the return value with an unsorted frame), the
kwarg is simply not defined, so Python raises the standard
unexpected-keyword TypeError. The test suite locks that contract.
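The behavior relies only on standard Python call semantics, which a toy example can show. `LazyFrame` below is a hypothetical stand-in for an immutable, plan-wrapping frame and is not SedonaDB's class.

```python
class LazyFrame:
    """Hypothetical stand-in for an immutable, plan-wrapping DataFrame."""

    def __init__(self, plan: tuple = ()):
        self._plan = plan

    def sort_values(self, by, ascending=True):
        # Returns a new frame; self is never modified.
        return LazyFrame(self._plan + (("sort", by, ascending),))


df = LazyFrame()
message = None
try:
    # 'inplace' is not a declared parameter, so Python itself rejects the call.
    df.sort_values("x", inplace=True)
except TypeError as exc:
    message = str(exc)

print(message)
```

The exact prefix of the message varies by Python version, but it ends with `got an unexpected keyword argument 'inplace'`, and `df` is left untouched.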
Rust side: `InternalDataFrame::sort_by_keys(exprs, ascending)`
pairs the two equal-length vectors into `SortExpr`s with
`nulls_first=false` for every key, then delegates to DataFusion's
`DataFrame::sort`.
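For readers who do not want to open the Rust diff, the pairing step can be modeled in Python roughly as follows. This is an illustrative model, not the Rust code: `SortKey` and `apply_sort` are hypothetical names, keys are limited to column names, and rows are plain dicts.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SortKey:
    """Hypothetical Python mirror of an (expr, asc, nulls_first) sort key."""

    column: str
    asc: bool
    nulls_first: bool = False


def sort_by_keys(columns: List[str], ascending: List[bool]) -> List[SortKey]:
    """Zip two equal-length lists into sort keys with nulls_first=False."""
    if len(columns) != len(ascending):
        raise ValueError("columns and ascending must have the same length")
    return [SortKey(c, a, nulls_first=False) for c, a in zip(columns, ascending)]


def apply_sort(
    rows: List[Dict[str, Optional[int]]], keys: List[SortKey]
) -> List[Dict[str, Optional[int]]]:
    out = list(rows)
    # Stable passes from the last key to the first yield a multi-key order.
    for key in reversed(keys):
        present = [r for r in out if r[key.column] is not None]
        nulls = [r for r in out if r[key.column] is None]
        present.sort(key=lambda r: r[key.column], reverse=not key.asc)
        out = nulls + present if key.nulls_first else present + nulls
    return out


rows = [
    {"a": 1, "b": 2},
    {"a": None, "b": 1},
    {"a": 1, "b": None},
    {"a": 2, "b": 5},
    {"a": 1, "b": 7},
]
result = apply_sort(rows, sort_by_keys(["a", "b"], [True, False]))
```

With `a` ascending and `b` descending, the `a == 1` group comes out with `b` as 7, 2, then `None`, and the row whose `a` is `None` lands at the very end.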
paleolimbot
reviewed
May 19, 2026
Member
Since Pandas invented sort_values(), more elegant/composable ways to handle this have evolved.
For our purposes, I think we should:
- Expose all the bells and whistles of DataFusion's `SortExpr` via `sedonadb.expr.sort_expr(expr, <options like asc/dsc/nulls>)`
- Add methods to `Expr` (`.asc(nulls)`, `.desc(nulls)`) that return `SortExpr`
- Accept `str`, `Expr`, or `SortExpr` in sort_values
- Name it either `sort()` (DataFusion python, DuckDB) or `order_by()` (Ibis)
- Accept multiple arguments instead of a list (auto formats with Ruff into multiple lines better, all three newer interfaces do this)
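As a rough sketch of the shape this list points at, here is a toy model in plain Python. Every class and function in it is a hypothetical stand-in rather than SedonaDB's or DataFusion's actual API; in particular, the `nulls_first` parameter name and its default are assumptions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class SortExpr:
    """Hypothetical sort key: an expression plus direction and null placement."""

    expr: "Expr"
    ascending: bool = True
    nulls_first: bool = False


@dataclass(frozen=True)
class Expr:
    """Hypothetical column expression."""

    name: str

    def asc(self, nulls_first: bool = False) -> SortExpr:
        return SortExpr(self, ascending=True, nulls_first=nulls_first)

    def desc(self, nulls_first: bool = False) -> SortExpr:
        return SortExpr(self, ascending=False, nulls_first=nulls_first)


def col(name: str) -> Expr:
    return Expr(name)


def sort(*keys: Union[str, Expr, SortExpr]) -> List[SortExpr]:
    """Normalize variadic sort keys into SortExpr objects."""
    out = []
    for key in keys:
        if isinstance(key, SortExpr):
            out.append(key)
        elif isinstance(key, Expr):
            out.append(key.asc())
        elif isinstance(key, str):
            out.append(col(key).asc())
        else:
            raise TypeError(f"unsupported sort key: {type(key).__name__}")
    return out


# Variadic call: each key carries its own direction and null placement.
plan = sort("a", col("b").desc(), col("c").asc(nulls_first=True))
```

In this model, direction and null placement travel with each key, so no parallel `ascending` list has to be kept in sync with `by`.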
Some more elegant interfaces for reference:
Continues the Phase P2 work of #791 with pandas-style `sort_values` on the lazy `DataFrame`.
API
- `by` accepts `str | Expr | list[str | Expr]`.
- `ascending` is `bool | list[bool]` (broadcast to `by` length if scalar, must match if list).
Two semantic calls worth flagging
Nulls last in both directions. Pandas's default is `na_position='last'` regardless of direction; DataFusion's SQL-style default is nulls-first on descending. We override DataFusion's default so pandas users porting code get the placement they expect. If the SQL behavior is needed, `sd.sql("... ORDER BY ... NULLS FIRST")` is still the escape hatch.
No `inplace=` kwarg. Our `DataFrame` wraps an immutable DataFusion `LogicalPlan`, so true in-place mutation is impossible. Rather than warn and silently ignore the kwarg (the design-doc default, which leaves callers who discard the return value with an unsorted frame), the kwarg is simply not defined, so Python raises the standard `TypeError: sort_values() got an unexpected keyword argument 'inplace'`. A test locks that contract. The design doc will be updated to match in a follow-up.
Test plan
17 tests in `tests/expr/test_dataframe_sort_values.py`:
- `col(...)` Expr, by computed Expr (`col("x") + col("y")`), multi-key all-asc, multi-key mixed asc/desc with deliberately scrambled input so a broken implementation cannot pass, ascending-scalar-broadcast, nulls-last asc, nulls-last desc.
- `isinstance(out, DataFrame)` (not `hasattr`, per Dewey's policy).
- `by`, length mismatch, bad `by` type, bad `by` element type (matched on a phrase only that error path uses), bad `ascending` type, bad `ascending` list element, `inplace=` rejected by Python.
All assertions use `pd.testing.assert_frame_equal` for outputs and `pytest.raises(..., match="discriminating phrase")` for errors. No substring-on-a-single-character patterns.
Local: 17 unit + 18 doctests + `ruff format` + `ruff check` all green.
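To illustrate why the error tests match on a discriminating phrase, here is a small standalone sketch. It uses `re.search` directly, which is the check `pytest.raises(match=...)` applies to the exception's string form. `raises_matching` and both error messages are hypothetical, written for illustration only and not copied from the test file.

```python
import re


def raises_matching(fn, exc_type, pattern: str) -> bool:
    """Return True if fn() raises exc_type with a message matching pattern."""
    try:
        fn()
    except exc_type as exc:
        return re.search(pattern, str(exc)) is not None
    return False


def length_mismatch():
    # Hypothetical error text, for illustration only.
    raise ValueError("'ascending' has 1 entries but 'by' has 2")


def unrelated_failure():
    # A different error path that happens to share some characters.
    raise ValueError("unexpected failure in by-name lookup")


paths = (length_mismatch, unrelated_failure)

# A single-character pattern cannot tell the two error paths apart.
weak = [raises_matching(f, ValueError, "b") for f in paths]

# A discriminating phrase passes only for the intended path.
strong = [raises_matching(f, ValueError, r"'ascending' has \d+ entries") for f in paths]

print(weak)    # [True, True]
print(strong)  # [True, False]
```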