Skip to content

feat(python/sedonadb): restrict DataFrame __getitem__ to single-column lookup#852

Open
jiayuasu wants to merge 1 commit into
apache:mainfrom
jiayuasu:feature/df-getitem-restrict
Open

feat(python/sedonadb): restrict DataFrame __getitem__ to single-column lookup#852
jiayuasu wants to merge 1 commit into
apache:mainfrom
jiayuasu:feature/df-getitem-restrict

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

Post-merge follow-up to #846 addressing @paleolimbot's review feedback.

What changes

DataFrame.__getitem__ is now strictly single-column lookup:

df["x"]    # → Expr referencing column x
df[0]      # → Expr referencing the first column
df[-1]     # → Expr referencing the last column

Previously the same method also handled df[["x","y"]] (list projection) and df[bool_expr] (filter). Those are dropped. Users go through the explicit df.select(...) and df.filter(...) entry points instead.

The motivation, mirroring Ibis's same deprecation: a single return type (Expr) lets IDEs and type checkers resolve df["x"].<method> cleanly. With the old polymorphic shape, the return type was DataFrame | Expr, which broke autocomplete on the common case.

Error paths

Key Behavior
Unknown column name KeyError, message lists available columns
Out-of-range int IndexError
bool TypeError (guarded explicitly; bool is a subclass of int in Python)
list, slice, Expr, anything else TypeError with a message pointing at select / filter

While here — import-discipline cleanup

__getitem__, select, and filter previously had lazy in-function imports of Expr, Literal, col, and _to_expr. Per the policy in this module ("lazy imports are reserved for optional dependencies like pyarrow"), those move to module level (combined) and the method annotations switch from string forward references to the runtime-imported names.

Test plan

  • 11 tests in tests/expr/test_dataframe_getitem.py cover name / positive / negative / first / out-of-range / unknown / bool / list / slice / Expr keys plus the operator-composition path on df["x"].
  • All exact repr() == ... for the positive cases, per the test-policy convention locked in earlier PRs.
  • Existing test_dataframe_select.py and test_dataframe_filter.py still pass (35 expr-dataframe tests total).
  • pytest --doctest-modules dataframe.py passes (17 doctests including the updated __getitem__ example).
  • ruff check and ruff format clean.

@github-actions github-actions Bot requested a review from paleolimbot May 17, 2026 07:08
Address post-merge feedback on apache#846.

- Restrict `DataFrame.__getitem__` to `(key: Union[str, int]) -> Expr`.
  Drops the polymorphic list-projection and bool-Expr-filter forms
  that apache#846 added. Reasons mirror Ibis's deprecation of the same
  forms: a single return type keeps IDE/type-checker inference on
  `df["x"].<method>` clean, and the multi-column / filter cases have
  obvious named entry points (`select`, `filter`).
- Integer indexing now resolves a column by 0-based position, with
  negative indexing supported. Out-of-range integers raise
  `IndexError`; unknown string names raise `KeyError` listing the
  available columns. `bool` is rejected explicitly so a stray
  `df[True]` doesn't silently mean `df[1]` (bool is a subclass of
  int in Python). List, slice, and Expr keys raise `TypeError` with
  messages pointing at `select` / `filter`.
- Move the lazy `Expr` / `Literal` / `col` / `_to_expr` imports
  inside `__getitem__`, `select`, and `filter` to module level
  (combined). Per project policy in this module, lazy imports are
  reserved for optional dependencies like pyarrow. The fix while
  here keeps the import discipline consistent across the three
  Expr-aware methods rather than churning them separately later.
- Update annotations on `select` and `filter` to use the new
  module-level names directly (no more string forward references).

Tests rewritten for the new shape: 11 cases covering name and
position lookup, negative indexing, operator composition, and the
`KeyError` / `IndexError` / `TypeError` error paths.

cc @paleolimbot
@jiayuasu jiayuasu force-pushed the feature/df-getitem-restrict branch from 6f6c081 to 1a319e1 Compare May 17, 2026 07:16
@jiayuasu jiayuasu requested a review from Copilot May 17, 2026 07:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR narrows Python DataFrame.__getitem__ to a single-column lookup API returning Expr, aligning bracket access with predictable typing while directing projection/filtering to explicit methods.

Changes:

  • Restricts df[key] to string column names and integer column positions, including negative indices.
  • Adds explicit error handling for unknown names, out-of-range indices, bools, lists, slices, and Expr keys.
  • Updates tests to cover the revised lookup behavior and rejected legacy shortcut forms.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
python/sedonadb/python/sedonadb/dataframe.py Updates __getitem__, import placement, annotations, and related select/filter type checks.
python/sedonadb/tests/expr/test_dataframe_getitem.py Replaces polymorphic getitem tests with single-column lookup and error-path coverage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jiayuasu jiayuasu marked this pull request as ready for review May 18, 2026 00:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants