Skip to content

feat(python/sedonadb): add DataFrame.select#832

Draft
jiayuasu wants to merge 4 commits into
apache:mainfrom
jiayuasu:feature/df-select
Draft

feat(python/sedonadb): add DataFrame.select#832
jiayuasu wants to merge 4 commits into
apache:mainfrom
jiayuasu:feature/df-select

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

Wire the Expr layer into the existing lazy DataFrame so users can project columns without writing SQL strings.

This is the third small PR in the Phase P1 series of #791, building on #807 (Expr foundation) and #823 (Expr operators).

What's new

from sedonadb.expr import col

df.select("x", "y")                                # bare column names
df.select(col("x"), (col("y") + 1).alias("y1"))    # Expr objects
df.select("x", (col("y") * 2).alias("y2"))         # mix
  • Strings are converted to column references via col(name) internally; the same plan is produced as the all-Expr form.
  • Empty argument list → ValueError. Non-str/non-Expr argument → TypeError. Unknown column → DataFusion plan-build error (same behavior locked in the foundation PR).

Implementation

InternalDataFrame::select (Rust) is a thin wrapper that unwraps Vec<PyExpr> to Vec<Expr> and calls DataFusion's DataFrame::select directly. No new query-engine code.

Test plan

  • 11 tests in tests/expr/test_dataframe_select.py cover string projection, Expr projection, mixed args, arithmetic Expr, alias output, literal-coercion via operators, lazy return, and the three error paths.
  • Assertions are exact column_names / to_pylist() — no substring matching — so any change in projection semantics fails loudly.
  • All 11 pass locally; no regressions in existing test_dataframe.py.

DataFrame.filter / .where come in the next small PR.

Wire the Expr layer into the existing lazy DataFrame so users can
project without writing SQL strings.

- `DataFrame.select(*exprs)` accepts a mix of column-name strings and
  Expr objects. Strings are converted to column references via
  `sedonadb.expr.col` internally, so `df.select("x", "y")` and
  `df.select(col("x"), col("y"))` produce the same plan.
- Empty argument list raises ValueError; non-str/non-Expr arguments
  raise TypeError. Column-validity errors surface at plan-build time
  from DataFusion, matching the behavior locked in the foundation PR.

Rust side: `InternalDataFrame::select` is a thin wrapper that unwraps
`Vec<PyExpr>` to `Vec<Expr>` and calls DataFusion's `DataFrame::select`
directly. No new query-engine code.

Tests assert exact `column_names` and `to_pylist()` output to lock the
projection semantics rather than substring-matching.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Python DataFrame.select() API that projects columns/expressions using the existing Python Expr layer, delegating projection to DataFusion via the Rust InternalDataFrame wrapper.

Changes:

  • Add DataFrame.select(*exprs) on the Python side, accepting str column names and/or Expr objects with input validation.
  • Expose InternalDataFrame.select(Vec<PyExpr>) in Rust and forward to DataFusion’s DataFrame::select.
  • Add a dedicated test module covering string/Expr/mixed projections, aliasing, arithmetic expressions, laziness, and error paths.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
python/sedonadb/python/sedonadb/dataframe.py Adds the user-facing DataFrame.select() method, coercing strings to col() and calling into _impl.select().
python/sedonadb/src/dataframe.rs Adds the PyO3-exposed InternalDataFrame.select() that unwraps PyExpr to datafusion_expr::Expr and delegates to DataFusion.
python/sedonadb/tests/expr/test_dataframe_select.py Introduces tests for projection behavior, output columns, laziness, and validation/error cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +25 to +36
import sedonadb
from sedonadb.expr import col


@pytest.fixture
def sd():
return sedonadb.connect()


@pytest.fixture
def df_xy(sd):
return sd.sql(
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 098393c — fixtures now consume the shared con fixture from tests/conftest.py and the local sd fixture is gone.

# DataFusion validates column references at plan-build time. The Expr
# itself is unbound (col("nonexistent") alone is fine), but selecting
# it against a frame that doesn't have that column fails immediately.
with pytest.raises(Exception, match="nonexistent"):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 098393cpytest.raises now catches sedonadb._lib.SedonaError specifically.

Comment on lines +77 to +81
def test_select_literal_via_isin_path(df_xy):
# No public lit() -> Expr in this PR; literals come in via operator
# coercion (and inside isin). This test exercises a literal-on-the-LHS
# arithmetic operation, where the int is auto-coerced to Expr.
out = df_xy.select((col("x") * 0 + 7).alias("seven")).to_arrow_table()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to test_select_literal_via_operator_coercion and rewrote the comment in 098393c to describe what's actually being exercised (right-hand scalar coercion through _to_expr), not the isin path the old name implied.

Comment on lines +88 to +90
def select(self, *exprs) -> "DataFrame":
"""Project a set of columns or expressions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 098393c*exprs: "Expr | str" (string annotation so the forward reference works without restructuring imports). This also fixes the docs-and-deploy CI failure, which was griffe rejecting the untyped vararg under strict mode.

CI:
- doctest: the `show()` example used the wrong inner column separator.
  The actual render uses `┆` (dotted vertical) between adjacent data
  columns and `│` only on the outer borders. Replaced the three lines
  with the verbatim output so the doctest passes.
- docs strict-mode: griffe rejected the untyped `*exprs`. Annotated as
  `*exprs: "Expr | str"` (string form so the forward reference works
  without restructuring imports).

Review:
- Tests now consume the shared `con` fixture from `tests/conftest.py`
  instead of spinning up a fresh `SedonaContext` per file.
- Plan-build error test catches the concrete `sedonadb._lib.SedonaError`
  rather than bare `Exception`.
- Renamed `test_select_literal_via_isin_path` to
  `test_select_literal_via_operator_coercion` and rewrote the comment
  to describe what it actually exercises (right-hand scalar coercion
  through `_to_expr`), not the `isin` path the old name implied.
Comment on lines +120 to +129
coerced = []
for e in exprs:
if isinstance(e, Expr):
coerced.append(e._impl)
elif isinstance(e, str):
coerced.append(_col(e)._impl)
else:
raise TypeError(
f"select() expects str or Expr arguments, got {type(e).__name__}"
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Literals are missing here: we can check isinstance(e, Literal) (so that literals wrapped in lit() work).

I think it will also be useful to ensure that the error that occurs when str is an invalid column lists the valid columns (DataFusion will do this for you in SQL and the error we get might already include it but also might not when calling via the DataFrame API).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 98bf0bdselect() now accepts Literal and routes it through the same _to_expr path operators use. On the error message, I checked what DataFusion already emits via plan-build: No field named nonexistent. Valid fields are .... Strengthened the test to lock that contract (asserts both Valid fields and the actual column names appear). No Python-side rewrap needed.

Comment on lines +29 to +31
@pytest.fixture
def sd():
return sedonadb.connect()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have this (the fixture is called con, because it predates inventing of sd = sedona.db.connect() 🙂 )

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to the shared con fixture in 098393c (and now in 98bf0bd each test inlines its own df from con.create_data_frame per your other comment).

Comment on lines +85 to +90
def test_select_returns_lazy_dataframe(df_xy):
out = df_xy.select("x")
# Plan should be lazy until materialization.
assert hasattr(out, "to_arrow_table")


Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can skip this one

Suggested change
def test_select_returns_lazy_dataframe(df_xy):
out = df_xy.select("x")
# Plan should be lazy until materialization.
assert hasattr(out, "to_arrow_table")

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped in 98bf0bd. Replaced with test_select_with_aliased_lit / test_select_with_unaliased_lit that exercise lit() directly now that it's exposed.

Comment on lines +78 to +80
# No public lit() -> Expr in this PR; literals come in via operator
# coercion (and inside isin). This test exercises a literal-on-the-LHS
# arithmetic operation, where the int is auto-coerced to Expr.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's no problem exposing lit() and accepting it here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 98bf0bd. sedonadb.expr.lit is re-exported at package level, select() accepts Literal, and Literal.alias(name) promotes the literal to an Expr so lit(7).alias("seven") works ergonomically.

Comment on lines +34 to +38
@pytest.fixture
def df_xy(sd):
return sd.sql(
"SELECT * FROM (VALUES (1, 10), (2, 20), (3, 30), (4, 40)) AS t(x, y)"
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it seems repetitive, but it's worth the one extra line to keep all of the test context within the test_ function for the day you have to debug a random failure:

df = sd.create_data_frame(pd.DataFrame({"x": [1, 2, 3]})

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inlined in 98bf0bd — each test now builds its own df via con.create_data_frame(pd.DataFrame({...})). The shared con fixture from tests/conftest.py stays per your other comment.

Comment on lines +42 to +44
out = df_xy.select("x").to_arrow_table()
assert out.column_names == ["x"]
assert out.column("x").to_pylist() == [1, 2, 3, 4]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there is a good reason not to, use pd.testing.assert_dataframe_equal() for these (the failures are generally very informative). GeoPandas has a similar API for the GeoDataFrame.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched in 98bf0bd — every output assertion now uses pd.testing.assert_frame_equal.

jiayuasu added 2 commits May 10, 2026 22:30
CI:
- ruff F821: forward-ref `Expr` in `select()`'s annotation is now
  resolvable through a TYPE_CHECKING import. Annotation widened to
  `Expr | str | _SedonaLit` (Literal under an alias to avoid colliding
  with typing.Literal already imported in this module).

Functionality:
- `DataFrame.select()` now accepts a third argument shape: `Literal`
  (the existing parameterized-query wrapper from `expr.literal`).
  Literals are routed through the same `_to_expr` path operators use,
  so metadata is preserved.
- `Literal.alias(name)` promotes the literal to an `Expr` and applies
  the alias there. `sedonadb.expr.lit(value).alias("foo")` is now the
  ergonomic way to project a named constant column.
- `sedonadb.expr.lit` is re-exported at the package level so the
  pattern above lives on a single import.

Tests:
- `df_xy` fixture removed. Each test builds its input DataFrame inline
  via `con.create_data_frame(pd.DataFrame(...))` so the full context
  of a failure lives in the failing function. The shared `con` fixture
  from `tests/conftest.py` stays — same engine across the suite.
- Output assertions switched to `pd.testing.assert_frame_equal` for
  column-by-column diagnostics on mismatch.
- The workaround `test_select_literal_via_operator_coercion` is gone;
  `test_select_with_aliased_lit` and `test_select_with_unaliased_lit`
  replace it by exercising `lit()` directly.
- The unknown-column test is strengthened: we now assert that the
  error message lists the valid column names (DataFusion does this
  already in plan-build; we lock the contract).
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants