feat(python/sedonadb): add DataFrame.rename #878
Pandas-style column rename on the lazy DataFrame, following the
single-dict / `{old: new}` direction confirmed in the design
discussion.
API:
df.rename({"a": "x"})
df.rename({"a": "x", "b": "y"})
- Single `dict[str, str]` arg, no `columns=` kwarg (Python's standard
unexpected-keyword TypeError covers misuse).
- Direction is `{old: new}` — matches Polars and the inner shape of
pandas' `rename(columns={...})`. Not the Ibis kwarg-flipped style.
- Strings only for both keys and values.
Validation (all Python-side, locked with exact-message tests where
the message is the feature):
- Empty dict → `ValueError`.
- Non-dict arg → `TypeError`.
- Non-str key or value → `TypeError`.
- Unknown old-name → `KeyError` listing available columns. Forced by
DataFusion's `with_column_renamed` being permissive — it silently
no-ops on an unknown name, hiding typos. Same Python-side guard
pattern as `drop` from apache#871.
- Final-state collisions (e.g. `{"a": "z", "b": "z"}` or renaming
onto an already-present column) → `ValueError("duplicate column
names")`.
- Two-cycle swaps (e.g. `{"a": "b", "b": "a"}`) have a unique final
schema but DataFusion applies renames sequentially and the
intermediate state collides. Surfaces as `SedonaError` from
plan-build; locked by a test rather than caught Python-side.
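The Python-side checks listed above can be sketched as a standalone function over a plain list of column names. This is a minimal illustration, not the PR's code: `validate_rename` is a hypothetical name, and every error message except `"duplicate column names"` is made up for the example.

```python
def validate_rename(columns, mapping):
    """Hypothetical helper mirroring the checks above; returns the final column list."""
    if not isinstance(mapping, dict):
        raise TypeError(f"rename() expects a dict, got {type(mapping).__name__}")
    if not mapping:
        raise ValueError("rename() requires at least one {old: new} pair")
    for old, new in mapping.items():
        if not isinstance(old, str) or not isinstance(new, str):
            raise TypeError("rename() keys and values must be str")

    # Guard against typos: a permissive backend would silently ignore these.
    unknown = [old for old in mapping if old not in columns]
    if unknown:
        raise KeyError(f"unknown column(s) {unknown}; available: {list(columns)}")

    # Final-state collision check: only the resulting schema is inspected,
    # so a swap such as {"a": "b", "b": "a"} passes here.
    final = [mapping.get(c, c) for c in columns]
    if len(set(final)) != len(final):
        raise ValueError("duplicate column names")
    return final
```

Because only the final schema is inspected, `validate_rename(["a", "b"], {"a": "b", "b": "a"})` returns `["b", "a"]` without error, which is why the swap case is left to fail later at plan-build.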
Rust side: `InternalDataFrame::rename` folds DataFusion's per-pair
`with_column_renamed` over the mapping. Step-by-step comments
explain why we don't try to be cleverer than DataFusion's sequential
application — the per-step uniqueness check is exactly what
prevents the swap case, and trying to reorder Python-side would
either reimplement the check or miss edge cases.
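To illustrate why a sequential fold rejects swaps, here is a pure-Python simulation over a list of column names. It is a sketch under assumptions: `rename_one` and `rename_sequential` are hypothetical stand-ins for the per-pair rename and the fold, and a plain `ValueError` stands in for the `SedonaError` raised at plan-build.

```python
from functools import reduce


def rename_one(columns, pair):
    """Apply one {old: new} pair; an unknown `old` is a silent no-op."""
    old, new = pair
    renamed = [new if c == old else c for c in columns]
    # Per-step uniqueness check: this is what rejects the swap case.
    if len(set(renamed)) != len(renamed):
        raise ValueError(f"intermediate schema {renamed} has duplicate names")
    return renamed


def rename_sequential(columns, mapping):
    """Fold rename_one over the mapping in insertion order."""
    return reduce(rename_one, mapping.items(), list(columns))


# A swap fails at the first step: after a -> b the schema is ["b", "b"].
try:
    rename_sequential(["a", "b"], {"a": "b", "b": "a"})
    swap_failed = False
except ValueError:
    swap_failed = True

# Routing through a temporary name in two calls never produces a duplicate.
step1 = rename_sequential(["a", "b"], {"a": "__tmp"})
swapped = rename_sequential(step1, {"b": "a", "__tmp": "b"})
```

Here `swap_failed` ends up `True` and `swapped` is `["b", "a"]`. The temporary name `__tmp` is arbitrary; any name not already in the schema works.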
Tests: 13 covering single/multi/order-preservation, lazy return,
each error path with pinned messages, the kwarg rejection, and the
sequential-application contract for swaps.
My preference would be to skip this one for now...it can be replicated with a one-liner as a workaround (`df.select(*[col(k).alias(v) for k, v in mapping.items()])`) and there are more important APIs to surface like grouping, aggregation, join, and UDFs.
I also very much dislike the Pandas rename syntax (`rename(a="b")` is easier to type for those of us still typing this stuff)
Fine by me. Moving on to the next operator then.
Continues Phase P2 of #791 with `DataFrame.rename`. Same shape as `drop` from #871: small schema op, strings only, varargs-style (well, dict-style) API instead of pandas' `columns=` keyword.
API
- Single `dict[str, str]` arg. No `columns=` kwarg.
- Direction is `{old: new}`: matches Polars and the inner shape of `pandas.rename(columns={...})`. Not Ibis's kwarg-flipped style.
- Strings only for both keys and values; no `Expr`.
Validation
All Python-side, locked with exact-message tests where the message is the feature being verified:
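- Empty dict → `ValueError`.
- Non-dict arg, or non-str key or value → `TypeError`.
- Unknown old-name → `KeyError` listing available columns. Forced by DataFusion's `with_column_renamed` being permissive (same trap as `drop_columns`).
- Final-state collisions (`{"a": "z", "b": "z"}` or `{"a": "b"}` when `b` already exists) → `ValueError("duplicate column names")`.
Swap-pair behavior (worth flagging)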
`{"a": "b", "b": "a"}` has a unique final schema `[b, a]`, so the Python-side collision check passes. But DataFusion applies renames sequentially, and the intermediate state after `a→b` collides with the original `b`. The error surfaces as `SedonaError` from plan-build with message `Projections require unique expression names...`.
I considered building a rename-graph and detecting cycles Python-side, but the DataFusion error is correct and clear enough, and trying to reorder Python-side would either reimplement DataFusion's check or miss edge cases. Locked in `test_rename_swap_pair_raises_at_plan_build`. Users wanting a swap route through a temporary name explicitly.
Tests
13 in `tests/expr/test_dataframe_rename.py`:
- Single/multi/order-preservation; lazy return (`isinstance(out, DataFrame)`).
- Error paths, including the `columns=` kwarg; unknown old-name (exact `KeyError` message); rename-onto-existing collision; new-name-to-new-name collision; sequential-application swap.
Local: 13 unit + 20 doctests + `ruff format` + `ruff check` all clean.