feat(python/sedonadb): add DataFrame.select#832
Conversation
Wire the Expr layer into the existing lazy DataFrame so users can
project without writing SQL strings.
- `DataFrame.select(*exprs)` accepts a mix of column-name strings and
Expr objects. Strings are converted to column references via
`sedonadb.expr.col` internally, so `df.select("x", "y")` and
`df.select(col("x"), col("y"))` produce the same plan.
- Empty argument list raises ValueError; non-str/non-Expr arguments
raise TypeError. Column-validity errors surface at plan-build time
from DataFusion, matching the behavior locked in the foundation PR.
Rust side: `InternalDataFrame::select` is a thin wrapper that unwraps
`Vec<PyExpr>` to `Vec<Expr>` and calls DataFusion's `DataFrame::select`
directly. No new query-engine code.
Tests assert exact `column_names` and `to_pylist()` output to lock the
projection semantics rather than substring-matching.
There was a problem hiding this comment.
Pull request overview
Adds a Python DataFrame.select() API that projects columns/expressions using the existing Python Expr layer, delegating projection to DataFusion via the Rust InternalDataFrame wrapper.
Changes:
- Add
DataFrame.select(*exprs)on the Python side, acceptingstrcolumn names and/orExprobjects with input validation. - Expose
InternalDataFrame.select(Vec<PyExpr>)in Rust and forward to DataFusion’sDataFrame::select. - Add a dedicated test module covering string/Expr/mixed projections, aliasing, arithmetic expressions, laziness, and error paths.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
python/sedonadb/python/sedonadb/dataframe.py |
Adds the user-facing DataFrame.select() method, coercing strings to col() and calling into _impl.select(). |
python/sedonadb/src/dataframe.rs |
Adds the PyO3-exposed InternalDataFrame.select() that unwraps PyExpr to datafusion_expr::Expr and delegates to DataFusion. |
python/sedonadb/tests/expr/test_dataframe_select.py |
Introduces tests for projection behavior, output columns, laziness, and validation/error cases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| import sedonadb | ||
| from sedonadb.expr import col | ||
|
|
||
|
|
||
| @pytest.fixture | ||
| def sd(): | ||
| return sedonadb.connect() | ||
|
|
||
|
|
||
| @pytest.fixture | ||
| def df_xy(sd): | ||
| return sd.sql( |
There was a problem hiding this comment.
Done in 098393c — fixtures now consume the shared con fixture from tests/conftest.py and the local sd fixture is gone.
| # DataFusion validates column references at plan-build time. The Expr | ||
| # itself is unbound (col("nonexistent") alone is fine), but selecting | ||
| # it against a frame that doesn't have that column fails immediately. | ||
| with pytest.raises(Exception, match="nonexistent"): |
There was a problem hiding this comment.
Done in 098393c — pytest.raises now catches sedonadb._lib.SedonaError specifically.
| def test_select_literal_via_isin_path(df_xy): | ||
| # No public lit() -> Expr in this PR; literals come in via operator | ||
| # coercion (and inside isin). This test exercises a literal-on-the-LHS | ||
| # arithmetic operation, where the int is auto-coerced to Expr. | ||
| out = df_xy.select((col("x") * 0 + 7).alias("seven")).to_arrow_table() |
There was a problem hiding this comment.
Renamed to test_select_literal_via_operator_coercion and rewrote the comment in 098393c to describe what's actually being exercised (right-hand scalar coercion through _to_expr), not the isin path the old name implied.
| def select(self, *exprs) -> "DataFrame": | ||
| """Project a set of columns or expressions. | ||
|
|
There was a problem hiding this comment.
Done in 098393c — *exprs: "Expr | str" (string annotation so the forward reference works without restructuring imports). This also fixes the docs-and-deploy CI failure, which was griffe rejecting the untyped vararg under strict mode.
CI: - doctest: the `show()` example used the wrong inner column separator. The actual render uses `┆` (dotted vertical) between adjacent data columns and `│` only on the outer borders. Replaced the three lines with the verbatim output so the doctest passes. - docs strict-mode: griffe rejected the untyped `*exprs`. Annotated as `*exprs: "Expr | str"` (string form so the forward reference works without restructuring imports). Review: - Tests now consume the shared `con` fixture from `tests/conftest.py` instead of spinning up a fresh `SedonaContext` per file. - Plan-build error test catches the concrete `sedonadb._lib.SedonaError` rather than bare `Exception`. - Renamed `test_select_literal_via_isin_path` to `test_select_literal_via_operator_coercion` and rewrote the comment to describe what it actually exercises (right-hand scalar coercion through `_to_expr`), not the `isin` path the old name implied.
| coerced = [] | ||
| for e in exprs: | ||
| if isinstance(e, Expr): | ||
| coerced.append(e._impl) | ||
| elif isinstance(e, str): | ||
| coerced.append(_col(e)._impl) | ||
| else: | ||
| raise TypeError( | ||
| f"select() expects str or Expr arguments, got {type(e).__name__}" | ||
| ) |
There was a problem hiding this comment.
Literals are missing here: we can check isinstance(e, Literal) (so that literals wrapped in lit() work).
I think it will also be useful to ensure that the error that occurs when str is an invalid column lists the valid columns (DataFusion will do this for you in SQL and the error we get might already include it but also might not when calling via the DataFrame API).
There was a problem hiding this comment.
Done in 98bf0bd — select() now accepts Literal and routes it through the same _to_expr path operators use. On the error message, I checked what DataFusion already emits via plan-build: No field named nonexistent. Valid fields are .... Strengthened the test to lock that contract (asserts both Valid fields and the actual column names appear). No Python-side rewrap needed.
| @pytest.fixture | ||
| def sd(): | ||
| return sedonadb.connect() |
There was a problem hiding this comment.
We already have this (the fixture is called con, because it predates inventing of sd = sedona.db.connect() 🙂 )
| def test_select_returns_lazy_dataframe(df_xy): | ||
| out = df_xy.select("x") | ||
| # Plan should be lazy until materialization. | ||
| assert hasattr(out, "to_arrow_table") | ||
|
|
||
|
|
There was a problem hiding this comment.
I think you can skip this one
| def test_select_returns_lazy_dataframe(df_xy): | |
| out = df_xy.select("x") | |
| # Plan should be lazy until materialization. | |
| assert hasattr(out, "to_arrow_table") |
There was a problem hiding this comment.
Dropped in 98bf0bd. Replaced with test_select_with_aliased_lit / test_select_with_unaliased_lit that exercise lit() directly now that it's exposed.
| # No public lit() -> Expr in this PR; literals come in via operator | ||
| # coercion (and inside isin). This test exercises a literal-on-the-LHS | ||
| # arithmetic operation, where the int is auto-coerced to Expr. |
There was a problem hiding this comment.
I think there's no problem exposing lit() and accepting it here?
There was a problem hiding this comment.
Done in 98bf0bd. sedonadb.expr.lit is re-exported at package level, select() accepts Literal, and Literal.alias(name) promotes the literal to an Expr so lit(7).alias("seven") works ergonomically.
| @pytest.fixture | ||
| def df_xy(sd): | ||
| return sd.sql( | ||
| "SELECT * FROM (VALUES (1, 10), (2, 20), (3, 30), (4, 40)) AS t(x, y)" | ||
| ) |
There was a problem hiding this comment.
I know it seems repetitive, but it's worth the one extra line to keep all of the test context within the test_ function for the day you have to debug a random failure:
df = sd.create_data_frame(pd.DataFrame({"x": [1, 2, 3]})There was a problem hiding this comment.
Inlined in 98bf0bd — each test now builds its own df via con.create_data_frame(pd.DataFrame({...})). The shared con fixture from tests/conftest.py stays per your other comment.
| out = df_xy.select("x").to_arrow_table() | ||
| assert out.column_names == ["x"] | ||
| assert out.column("x").to_pylist() == [1, 2, 3, 4] |
There was a problem hiding this comment.
Unless there is a good reason not to, use pd.testing.assert_dataframe_equal() for these (the failures are generally very informative). GeoPandas has a similar API for the GeoDataFrame.
There was a problem hiding this comment.
Switched in 98bf0bd — every output assertion now uses pd.testing.assert_frame_equal.
CI:
- ruff F821: forward-ref `Expr` in `select()`'s annotation is now
resolvable through a TYPE_CHECKING import. Annotation widened to
`Expr | str | _SedonaLit` (Literal under an alias to avoid colliding
with typing.Literal already imported in this module).
Functionality:
- `DataFrame.select()` now accepts a third argument shape: `Literal`
(the existing parameterized-query wrapper from `expr.literal`).
Literals are routed through the same `_to_expr` path operators use,
so metadata is preserved.
- `Literal.alias(name)` promotes the literal to an `Expr` and applies
the alias there. `sedonadb.expr.lit(value).alias("foo")` is now the
ergonomic way to project a named constant column.
- `sedonadb.expr.lit` is re-exported at the package level so the
pattern above lives on a single import.
Tests:
- `df_xy` fixture removed. Each test builds its input DataFrame inline
via `con.create_data_frame(pd.DataFrame(...))` so the full context
of a failure lives in the failing function. The shared `con` fixture
from `tests/conftest.py` stays — same engine across the suite.
- Output assertions switched to `pd.testing.assert_frame_equal` for
column-by-column diagnostics on mismatch.
- The workaround `test_select_literal_via_operator_coercion` is gone;
`test_select_with_aliased_lit` and `test_select_with_unaliased_lit`
replace it by exercising `lit()` directly.
- The unknown-column test is strengthened: we now assert that the
error message lists the valid column names (DataFusion does this
already in plan-build; we lock the contract).
Wire the
Exprlayer into the existing lazyDataFrameso users can project columns without writing SQL strings.This is the third small PR in the Phase P1 series of #791, building on #807 (Expr foundation) and #823 (Expr operators).
What's new
col(name)internally; the same plan is produced as the all-Expr form.ValueError. Non-str/non-Expr argument →TypeError. Unknown column → DataFusion plan-build error (same behavior locked in the foundation PR).Implementation
InternalDataFrame::select(Rust) is a thin wrapper that unwrapsVec<PyExpr>toVec<Expr>and calls DataFusion'sDataFrame::selectdirectly. No new query-engine code.Test plan
tests/expr/test_dataframe_select.pycover string projection, Expr projection, mixed args, arithmetic Expr, alias output, literal-coercion via operators, lazy return, and the three error paths.column_names/to_pylist()— no substring matching — so any change in projection semantics fails loudly.test_dataframe.py.DataFrame.filter/.wherecome in the next small PR.