feat(python/sedonadb): add DataFrame.select by jiayuasu · Pull Request #832 · apache/sedona-db

jiayuasu · 2026-05-11T03:23:05Z

Wire the Expr layer into the existing lazy DataFrame so users can project columns without writing SQL strings.

This is the third small PR in the Phase P1 series of #791, building on #807 (Expr foundation) and #823 (Expr operators).

What's new

from sedonadb.expr import col

df.select("x", "y")                                # bare column names
df.select(col("x"), (col("y") + 1).alias("y1"))    # Expr objects
df.select("x", (col("y") * 2).alias("y2"))         # mix

Strings are converted to column references via col(name) internally; the same plan is produced as the all-Expr form.
Empty argument list → ValueError. Non-str/non-Expr argument → TypeError. Unknown column → DataFusion plan-build error (same behavior locked in the foundation PR).

Implementation

InternalDataFrame::select (Rust) is a thin wrapper that unwraps Vec<PyExpr> to Vec<Expr> and calls DataFusion's DataFrame::select directly. No new query-engine code.

Test plan

11 tests in tests/expr/test_dataframe_select.py cover string projection, Expr projection, mixed args, arithmetic Expr, alias output, literal-coercion via operators, lazy return, and the three error paths.
Assertions are exact column_names / to_pylist() — no substring matching — so any change in projection semantics fails loudly.
All 11 pass locally; no regressions in existing test_dataframe.py.

DataFrame.filter / .where come in the next small PR.

Wire the Expr layer into the existing lazy DataFrame so users can project without writing SQL strings. - `DataFrame.select(*exprs)` accepts a mix of column-name strings and Expr objects. Strings are converted to column references via `sedonadb.expr.col` internally, so `df.select("x", "y")` and `df.select(col("x"), col("y"))` produce the same plan. - Empty argument list raises ValueError; non-str/non-Expr arguments raise TypeError. Column-validity errors surface at plan-build time from DataFusion, matching the behavior locked in the foundation PR. Rust side: `InternalDataFrame::select` is a thin wrapper that unwraps `Vec<PyExpr>` to `Vec<Expr>` and calls DataFusion's `DataFrame::select` directly. No new query-engine code. Tests assert exact `column_names` and `to_pylist()` output to lock the projection semantics rather than substring-matching.

Copilot

Pull request overview

Adds a Python DataFrame.select() API that projects columns/expressions using the existing Python Expr layer, delegating projection to DataFusion via the Rust InternalDataFrame wrapper.

Changes:

Add DataFrame.select(*exprs) on the Python side, accepting str column names and/or Expr objects with input validation.
Expose InternalDataFrame.select(Vec<PyExpr>) in Rust and forward to DataFusion’s DataFrame::select.
Add a dedicated test module covering string/Expr/mixed projections, aliasing, arithmetic expressions, laziness, and error paths.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
`python/sedonadb/python/sedonadb/dataframe.py`	Adds the user-facing `DataFrame.select()` method, coercing strings to `col()` and calling into `_impl.select()`.
`python/sedonadb/src/dataframe.rs`	Adds the PyO3-exposed `InternalDataFrame.select()` that unwraps `PyExpr` to `datafusion_expr::Expr` and delegates to DataFusion.
`python/sedonadb/tests/expr/test_dataframe_select.py`	Introduces tests for projection behavior, output columns, laziness, and validation/error cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

jiayuasu · 2026-05-11T04:03:53Z

+import sedonadb
+from sedonadb.expr import col
+
+
+@pytest.fixture
+def sd():
+    return sedonadb.connect()
+
+
+@pytest.fixture
+def df_xy(sd):
+    return sd.sql(


Done in 098393c — fixtures now consume the shared con fixture from tests/conftest.py and the local sd fixture is gone.

jiayuasu · 2026-05-11T04:03:54Z

+    # DataFusion validates column references at plan-build time. The Expr
+    # itself is unbound (col("nonexistent") alone is fine), but selecting
+    # it against a frame that doesn't have that column fails immediately.
+    with pytest.raises(Exception, match="nonexistent"):


Done in 098393c — pytest.raises now catches sedonadb._lib.SedonaError specifically.

jiayuasu · 2026-05-11T04:03:55Z

+def test_select_literal_via_isin_path(df_xy):
+    # No public lit() -> Expr in this PR; literals come in via operator
+    # coercion (and inside isin). This test exercises a literal-on-the-LHS
+    # arithmetic operation, where the int is auto-coerced to Expr.
+    out = df_xy.select((col("x") * 0 + 7).alias("seven")).to_arrow_table()


Renamed to test_select_literal_via_operator_coercion and rewrote the comment in 098393c to describe what's actually being exercised (right-hand scalar coercion through _to_expr), not the isin path the old name implied.

jiayuasu · 2026-05-11T04:03:56Z

+    def select(self, *exprs) -> "DataFrame":
+        """Project a set of columns or expressions.
+


Done in 098393c — *exprs: "Expr | str" (string annotation so the forward reference works without restructuring imports). This also fixes the docs-and-deploy CI failure, which was griffe rejecting the untyped vararg under strict mode.

CI: - doctest: the `show()` example used the wrong inner column separator. The actual render uses `┆` (dotted vertical) between adjacent data columns and `│` only on the outer borders. Replaced the three lines with the verbatim output so the doctest passes. - docs strict-mode: griffe rejected the untyped `*exprs`. Annotated as `*exprs: "Expr | str"` (string form so the forward reference works without restructuring imports). Review: - Tests now consume the shared `con` fixture from `tests/conftest.py` instead of spinning up a fresh `SedonaContext` per file. - Plan-build error test catches the concrete `sedonadb._lib.SedonaError` rather than bare `Exception`. - Renamed `test_select_literal_via_isin_path` to `test_select_literal_via_operator_coercion` and rewrote the comment to describe what it actually exercises (right-hand scalar coercion through `_to_expr`), not the `isin` path the old name implied.

paleolimbot · 2026-05-11T03:54:24Z

+        coerced = []
+        for e in exprs:
+            if isinstance(e, Expr):
+                coerced.append(e._impl)
+            elif isinstance(e, str):
+                coerced.append(_col(e)._impl)
+            else:
+                raise TypeError(
+                    f"select() expects str or Expr arguments, got {type(e).__name__}"
+                )


Literals are missing here: we can check isinstance(e, Literal) (so that literals wrapped in lit() work).

I think it will also be useful to ensure that the error that occurs when str is an invalid column lists the valid columns (DataFusion will do this for you in SQL and the error we get might already include it but also might not when calling via the DataFrame API).

Done in 98bf0bd — select() now accepts Literal and routes it through the same _to_expr path operators use. On the error message, I checked what DataFusion already emits via plan-build: No field named nonexistent. Valid fields are .... Strengthened the test to lock that contract (asserts both Valid fields and the actual column names appear). No Python-side rewrap needed.

paleolimbot · 2026-05-11T03:56:08Z

+@pytest.fixture
+def sd():
+    return sedonadb.connect()


We already have this (the fixture is called con, because it predates inventing of sd = sedona.db.connect() 🙂 )

Switched to the shared con fixture in 098393c (and now in 98bf0bd each test inlines its own df from con.create_data_frame per your other comment).

paleolimbot · 2026-05-11T03:57:11Z

+def test_select_returns_lazy_dataframe(df_xy):
+    out = df_xy.select("x")
+    # Plan should be lazy until materialization.
+    assert hasattr(out, "to_arrow_table")
+
+


I think you can skip this one

Suggested change

def test_select_returns_lazy_dataframe(df_xy):

out = df_xy.select("x")

# Plan should be lazy until materialization.

assert hasattr(out, "to_arrow_table")

Dropped in 98bf0bd. Replaced with test_select_with_aliased_lit / test_select_with_unaliased_lit that exercise lit() directly now that it's exposed.

paleolimbot · 2026-05-11T03:58:07Z

+    # No public lit() -> Expr in this PR; literals come in via operator
+    # coercion (and inside isin). This test exercises a literal-on-the-LHS
+    # arithmetic operation, where the int is auto-coerced to Expr.


I think there's no problem exposing lit() and accepting it here?

Done in 98bf0bd. sedonadb.expr.lit is re-exported at package level, select() accepts Literal, and Literal.alias(name) promotes the literal to an Expr so lit(7).alias("seven") works ergonomically.

paleolimbot · 2026-05-11T04:02:22Z

+@pytest.fixture
+def df_xy(sd):
+    return sd.sql(
+        "SELECT * FROM (VALUES (1, 10), (2, 20), (3, 30), (4, 40)) AS t(x, y)"
+    )


I know it seems repetitive, but it's worth the one extra line to keep all of the test context within the test_ function for the day you have to debug a random failure:

df = sd.create_data_frame(pd.DataFrame({"x": [1, 2, 3]})

Inlined in 98bf0bd — each test now builds its own df via con.create_data_frame(pd.DataFrame({...})). The shared con fixture from tests/conftest.py stays per your other comment.

paleolimbot · 2026-05-11T04:04:46Z

+    out = df_xy.select("x").to_arrow_table()
+    assert out.column_names == ["x"]
+    assert out.column("x").to_pylist() == [1, 2, 3, 4]


Unless there is a good reason not to, use pd.testing.assert_dataframe_equal() for these (the failures are generally very informative). GeoPandas has a similar API for the GeoDataFrame.

Switched in 98bf0bd — every output assertion now uses pd.testing.assert_frame_equal.

CI: - ruff F821: forward-ref `Expr` in `select()`'s annotation is now resolvable through a TYPE_CHECKING import. Annotation widened to `Expr | str | _SedonaLit` (Literal under an alias to avoid colliding with typing.Literal already imported in this module). Functionality: - `DataFrame.select()` now accepts a third argument shape: `Literal` (the existing parameterized-query wrapper from `expr.literal`). Literals are routed through the same `_to_expr` path operators use, so metadata is preserved. - `Literal.alias(name)` promotes the literal to an `Expr` and applies the alias there. `sedonadb.expr.lit(value).alias("foo")` is now the ergonomic way to project a named constant column. - `sedonadb.expr.lit` is re-exported at the package level so the pattern above lives on a single import. Tests: - `df_xy` fixture removed. Each test builds its input DataFrame inline via `con.create_data_frame(pd.DataFrame(...))` so the full context of a failure lives in the failing function. The shared `con` fixture from `tests/conftest.py` stays — same engine across the suite. - Output assertions switched to `pd.testing.assert_frame_equal` for column-by-column diagnostics on mismatch. - The workaround `test_select_literal_via_operator_coercion` is gone; `test_select_with_aliased_lit` and `test_select_with_unaliased_lit` replace it by exercising `lit()` directly. - The unknown-column test is strengthened: we now assert that the error message lists the valid column names (DataFusion does this already in plan-build; we lock the contract).

paleolimbot

Thank you!

github-actions Bot requested a review from paleolimbot May 11, 2026 03:23

jiayuasu requested a review from Copilot May 11, 2026 03:25

Copilot started reviewing on behalf of jiayuasu May 11, 2026 03:26 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

paleolimbot reviewed May 11, 2026

View reviewed changes

jiayuasu added 2 commits May 10, 2026 22:30

style: ruff format

295a14f

paleolimbot approved these changes May 11, 2026

View reviewed changes

		def select(self, *exprs) -> "DataFrame":
		"""Project a set of columns or expressions.

Conversation

jiayuasu commented May 11, 2026

What's new

Implementation

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

paleolimbot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants