feat(python/sedonadb): add DataFrame.rename #878

Closed. jiayuasu wants to merge 1 commit into apache:main from jiayuasu:feature/df-rename

@jiayuasu
Member

Continues Phase P2 of #791 with DataFrame.rename. Same shape as drop from #871: a small schema op, strings only, taking a single positional dict rather than pandas' columns= keyword.

API

    df.rename({"a": "x"})
    df.rename({"a": "x", "b": "y"})

  • Single dict[str, str] arg. No columns= kwarg.
  • Direction is {old: new} — matches Polars and the inner shape of pandas.rename(columns={...}). Not Ibis's kwarg-flipped style.
  • Strings only for both keys and values; no Expr.
  • Multi-rename in one call, applied as a single plan transformation.

Validation

All Python-side, locked with exact-message tests where the message is the feature being verified:

  • Empty dict → ValueError.
  • Non-dict arg / non-str key or value → TypeError.
  • Unknown old-name → KeyError listing available columns. Needed because DataFusion's with_column_renamed is permissive: it silently no-ops on an unknown name, which would hide typos (same trap as drop_columns).
  • Final-state collisions ({"a": "z", "b": "z"} or {"a": "b"} when b already exists) → ValueError("duplicate column names").
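
The checks above can be sketched in plain Python. This is only an illustration of the listed rules, not the code in this PR: the helper name `validate_rename`, its signature, and every error message except "duplicate column names" are assumptions made for the example.

```python
def validate_rename(columns, mapping):
    """Hypothetical helper: validate an {old: new} mapping against a schema.

    `columns` is the current list of column names; returns the final names.
    """
    if not isinstance(mapping, dict):
        # Message text is illustrative, not taken from the PR.
        raise TypeError(f"rename() expects a dict, got {type(mapping).__name__}")
    if not mapping:
        raise ValueError("rename() requires at least one {old: new} pair")
    for old, new in mapping.items():
        if not isinstance(old, str) or not isinstance(new, str):
            raise TypeError("rename() keys and values must be str")

    # Unknown old-names must be rejected here, because the engine would
    # otherwise treat them as a silent no-op.
    unknown = [old for old in mapping if old not in columns]
    if unknown:
        raise KeyError(f"unknown column(s) {unknown}; available: {list(columns)}")

    # Final-state check: apply every rename at once and look for duplicates.
    final = [mapping.get(name, name) for name in columns]
    if len(set(final)) != len(final):
        raise ValueError("duplicate column names")
    return final


# A swap has a unique final schema, so it passes this final-state check.
swapped = validate_rename(["a", "b"], {"a": "b", "b": "a"})
```

Because only the final state is inspected, a swap pair is accepted at this stage; any intermediate collision is left for the engine to report.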

Swap-pair behavior (worth flagging)

{"a": "b", "b": "a"} has a unique final schema [b, a] so the Python-side collision check passes. But DataFusion applies renames sequentially, and the intermediate state after a→b collides with the original b. The error surfaces as SedonaError from plan-build with message Projections require unique expression names....

I considered building a rename-graph and detecting cycles Python-side, but the DataFusion error is correct and clear enough, and trying to reorder Python-side would either reimplement DataFusion's check or miss edge cases. Locked in test_rename_swap_pair_raises_at_plan_build. Users who want a swap must route through a temporary name explicitly.
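
To make the sequential-application behavior concrete, here is a small pure-Python simulation. It is a sketch under assumptions: `apply_sequentially` is a hypothetical stand-in that models the schema as a list of names, raises a plain `ValueError` where the real code surfaces a `SedonaError`, and assumes the pairs are applied in dict insertion order. The temporary name `__tmp` is likewise just an example.

```python
def apply_sequentially(columns, mapping):
    """Hypothetical model: apply renames one pair at a time, checking
    name uniqueness after every step (as a per-step projection would)."""
    current = list(columns)
    for old, new in mapping.items():
        renamed = [new if name == old else name for name in current]
        if len(set(renamed)) != len(renamed):
            # Stand-in for the plan-build error about unique expression names.
            raise ValueError(f"intermediate schema {renamed} has duplicate names")
        current = renamed
    return current


# The direct swap fails at the first step: a -> b collides with the original b.
try:
    apply_sequentially(["a", "b"], {"a": "b", "b": "a"})
    swap_failed = False
except ValueError:
    swap_failed = True

# Routing through a temporary name works. Two calls are used so that every
# key names a column that exists when its call starts.
step = apply_sequentially(["a", "b"], {"a": "__tmp"})
result = apply_sequentially(step, {"b": "a", "__tmp": "b"})
```

In this model the two-call version passes every per-step uniqueness check, which is the explicit route the PR leaves to users.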

Tests

13 in tests/expr/test_dataframe_rename.py:

  • Positive: single, multi, column-order preservation.
  • Lazy return: isinstance(out, DataFrame).
  • Errors: empty dict; non-dict; non-str key; non-str value; columns= kwarg; unknown old-name (exact KeyError message); rename-onto-existing collision; new-name-to-new-name collision; sequential-application swap.

Local: 13 unit + 20 doctests + ruff format + ruff check all clean.

Pandas-style column rename on the lazy DataFrame, following the
single-dict / `{old: new}` direction confirmed in the design
discussion.

API:

    df.rename({"a": "x"})
    df.rename({"a": "x", "b": "y"})

- Single `dict[str, str]` arg, no `columns=` kwarg (Python's standard
  unexpected-keyword TypeError covers misuse).
- Direction is `{old: new}` — matches Polars and the inner shape of
  pandas' `rename(columns={...})`. Not the Ibis kwarg-flipped style.
- Strings only for both keys and values.

Validation (all Python-side, locked with exact-message tests where
the message is the feature):

- Empty dict → `ValueError`.
- Non-dict arg → `TypeError`.
- Non-str key or value → `TypeError`.
- Unknown old-name → `KeyError` listing available columns. Forced by
  DataFusion's `with_column_renamed` being permissive — it silently
  no-ops on an unknown name, hiding typos. Same Python-side guard
  pattern as `drop` from apache#871.
- Final-state collisions (e.g. `{"a": "z", "b": "z"}` or renaming
  onto an already-present column) → `ValueError("duplicate column
  names")`.
- Two-cycle swaps (e.g. `{"a": "b", "b": "a"}`) have a unique final
  schema but DataFusion applies renames sequentially and the
  intermediate state collides. Surfaces as `SedonaError` from
  plan-build; locked by a test rather than caught Python-side.

Rust side: `InternalDataFrame::rename` folds DataFusion's per-pair
`with_column_renamed` over the mapping. Step-by-step comments
explain why we don't try to be cleverer than DataFusion's sequential
application — the per-step uniqueness check is exactly what
prevents the swap case, and trying to reorder Python-side would
either reimplement the check or miss edge cases.

Tests: 13 covering single/multi/order-preservation, lazy return,
each error path with pinned messages, the kwarg rejection, and the
sequential-application contract for swaps.

github-actions bot requested a review from zhangfengcdt on May 26, 2026, 07:17

@paleolimbot left a comment
Member

My preference would be to skip this one for now...it can be replicated with a one-liner as a workaround (df.select(*[col(k).alias(v) for k, v in mapping.items()])) and there are more important APIs to surface like grouping, aggregation, join, and UDFs.

I also very much dislike the Pandas rename syntax (rename(a="b") is easier to type for those of us still typing this stuff)

@jiayuasu
Member Author

> My preference would be to skip this one for now...it can be replicated with a one-liner as a workaround (df.select(*[col(k).alias(v) for k, v in mapping.items()])) and there are more important APIs to surface like grouping, aggregation, join, and UDFs.
>
> I also very much dislike the Pandas rename syntax (rename(a="b") is easier to type for those of us still typing this stuff)

Fine by me. Moving on to the next operator then.

@jiayuasu closed this on May 26, 2026