feat(python/sedonadb): add Expr foundation#807
Conversation
Introduce a Python expression layer that wraps DataFusion's logical Expr through PyO3, mirroring the pattern used in the R bindings (r/sedonadb/src/rust/src/expression.rs). This commit ships the foundation only: - `sedonadb.expr.Expr` — column-expression wrapper with `alias`, `cast`, `is_null`, `is_not_null`, `isin`, and `negate`. - `sedonadb.expr.col(name)` — reference a column by name. - `sedonadb.expr.lit(value)` — wrap a Python value as a literal, reusing the existing Literal Arrow-array coercion path. Operator overloading and DataFrame integration land in follow-up commits. Expr objects are pure syntax: they are not bound to a DataFrame at construction time, and column-validity errors surface only when an Expr is consumed by a DataFrame method.
There was a problem hiding this comment.
Pull request overview
Adds the initial Python expression layer (sedonadb.expr) that wraps DataFusion logical Expr via PyO3, establishing the base building blocks needed for upcoming pandas-like APIs (per #791 / Phase P1).
Changes:
- Introduces a Rust PyO3
InternalExprwrapper and exports constructorsexpr_col/expr_litfromsedonadb._lib. - Adds Python
sedonadb.expr.Exprpluscol()/lit()helpers and basic expression methods (alias,cast, null checks,isin,negate). - Adds a new unit test suite covering expression construction and method chaining.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| python/sedonadb/tests/expr/test_expression.py | New unit tests for the Python Expr surface. |
| python/sedonadb/src/lib.rs | Registers the new Rust expression module/functions/classes in the PyO3 module. |
| python/sedonadb/src/expr.rs | Implements InternalExpr and core expression operations + literal/column constructors. |
| python/sedonadb/python/sedonadb/expr/expression.py | Implements the public Python Expr wrapper API and helpers. |
| python/sedonadb/python/sedonadb/expr/init.py | Re-exports Expr, col, lit, and Literal. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- expr.rs: add module-level docs and step-by-step comments on every factory and method, explaining what each constructs, why we clone, why we reject Arrow extension types in cast(), and the intended shape of the layer relative to the R prior art. - expr.rs: route expr_lit() through import_arrow_scalar() instead of duplicating the length check + ScalarValue conversion + metadata handling inline. Future fixes to scalar coercion only need to live in one place. - expression.py: fix is_null() docstring. The previous wording promised both NULL and NaN matching, but the underlying DataFusion Expr::IsNull only matches SQL NULL. The pandas-style NaN-aware helper will live on the future Series type. - test_expression.py: lean on Expr.variant_name() for structural assertions and only check user-supplied identifiers (column names, literal values) inside repr() output. Avoids coupling the suite to DataFusion's Display formatting, which can change between versions without a semantic change.
paleolimbot
left a comment
There was a problem hiding this comment.
Very cool! I know you're still working here so feel free to ignore anything that was still in the works.
| /// 2. Reject Arrow extension types up front. SedonaDB's spatial types | ||
| /// are extension types over WKB; users who reach for `cast` on | ||
| /// those almost certainly want a different operation, so we surface | ||
| /// a clear error rather than silently dropping the extension. |
There was a problem hiding this comment.
👍 (I added support for these in DataFusion 53 that we can transform using an optimizer rule to a scalar function call). We can support these before DataFusion 53 here if we need to by inserting the scalar function call directly (I may give this a go in the near future since it's maybe useful for geometry/geography).
There was a problem hiding this comment.
Thanks for the heads-up. Leaving extension-type cast as a clear error in this PR. Happy to revisit with the DF53 optimizer rule once that's available — geometry/geography is the obvious payoff.
| /// Qualified columns (e.g. `t.x`) are not exposed yet; the Python | ||
| /// `col()` helper takes only a single name. When we add joins and | ||
| /// multi-table references we can grow this to accept an optional | ||
| /// table qualifier, matching the R side's `column(name, qualifier)`. | ||
| #[pyfunction] | ||
| pub fn expr_col(name: &str) -> PyExpr { |
There was a problem hiding this comment.
If it's easy you can probably just add a qualifier argument here while you're at it
There was a problem hiding this comment.
Done in 0d1a4ab — expr_col(name, qualifier=None) on the Rust side and col(name, qualifier=None) on the Python side. Test in test_col_with_qualifier.
| e = col("x") | ||
| assert isinstance(e, Expr) | ||
| assert e._impl.variant_name() == "Column" | ||
| assert "x" in repr(e) |
There was a problem hiding this comment.
Testing the exact repr output is probably easy to do and is a slightly better test
There was a problem hiding this comment.
Reverted in 0d1a4ab — exact repr substrings are back (x AS y, CAST(x AS Int32), etc.), with variant_name() checks kept alongside as structural anchors. Added a module-level comment explaining the policy so the next person doesn't re-loosen them.
| def lit(value: Any) -> Expr: | ||
| """Wrap a Python value as a literal expression. | ||
|
|
||
| Accepts the same value types as `sedonadb.expr.literal.lit`, including | ||
| Python scalars, pyarrow arrays/scalars, and Shapely geometries. Returns an | ||
| `Expr` suitable for composition with column expressions. | ||
| """ |
There was a problem hiding this comment.
FWIW there is already a lit() function that returns Literal. Literal is intentionally lazy so (the python wrappers around) different functions can interpret then differently if they want to (e.g., RS_Intersects() doesn't actually need to convert a whole rasterio object to an Arrow scalar, which is expensive, to compute a correct result). Purely theoretical, not currently used, and can change, but that's why it's like that 🙂 .
Probably you can just fold this logic into _to_expr() and leave the existing lit() (which powers parameterized queries).
There was a problem hiding this comment.
Got it, that makes sense — thanks for the context. Folded into 0d1a4ab: dropped the public lit() -> Expr and moved the Python-value-to-Expr coercion into the private _to_expr() helper. The existing sedonadb.expr.literal.lit returning Literal is untouched and stays re-exported from the package.
| def __init__(self, impl): | ||
| # impl is the underlying _lib.InternalExpr handle. Users normally | ||
| # do not construct Expr directly; use col() / lit() instead. | ||
| self._impl = impl |
There was a problem hiding this comment.
An isinstance() check here would be good so that this errors if used incorrectly
There was a problem hiding this comment.
Added in 0d1a4ab — Expr.__init__ now isinstance-checks its argument against _lib.InternalExpr and raises TypeError with a message pointing at col(). Covered by test_expr_init_rejects_wrong_type.
| def alias(self, name: str) -> "Expr": | ||
| """Return a copy of the expression with a new output name.""" | ||
| return Expr(self._impl.alias(name)) |
There was a problem hiding this comment.
I forget if we have CI checks for this, but we parameter docs and examples for most functions in the Python APIs
There was a problem hiding this comment.
Beefed up in 0d1a4ab. Public Expr methods and col() now have Args: and Examples: docstring blocks matching the style of dataframe.py:head/limit/count.
A. expr_col now accepts an optional qualifier (e.g. col("x", "t") for
t.x), mirroring SedonaDBExprFactory::column in the R bindings.
B. Tests pin the exact rendered Display form again (e.g. "x AS y",
"CAST(x AS Int32)") with a module-level comment noting the policy
so future contributors don't re-loosen them. variant_name() checks
stay alongside as structural anchors.
C. Drop the public lit() that returned an Expr; Python-side coercion
is now folded into the private _to_expr() helper. The pre-existing
lit() in sedonadb.expr.literal (returning Literal) is intentionally
lazy and powers parameterized queries; leave it untouched and keep
it re-exported.
D. Expr.__init__ now isinstance-checks its argument and raises
TypeError on misuse, with a message pointing at col().
E. Public Expr methods and col() grow Args / Examples docstring blocks
matching the style of the existing dataframe.py methods.
| from sedonadb.expr.expression import Expr, col | ||
| from sedonadb.expr.literal import Literal, lit | ||
|
|
||
| __all__ = ["Expr", "Literal", "col", "lit"] |
There was a problem hiding this comment.
Done in f633481 — lit is removed from sedonadb.expr.__init__.py's re-exports. It's a Literal constructor (the lazy parameterized-query helper), not an Expr factory, so surfacing it on the same package as col() was misleading. Users continue to from sedonadb.expr.literal import lit, matching the pre-PR layout. PR description updated to drop the obsolete bullet.
| Construct an `Expr` with `col(name)`. Plain Python values composed with an | ||
| `Expr` via operators are coerced to literal expressions automatically. |
There was a problem hiding this comment.
Good catch — fixed in f633481. The Expr docstring now describes only what this PR ships (coercion happens inside methods like isin, not via operators) and calls out that operator overloading arrives in a follow-up.
- expr/__init__.py: drop `lit` from `__all__` / re-exports. It returns a `Literal` (the parameterized-query helper), not an `Expr`, so surfacing it on the same package as `col()` was confusing. Users who need it continue to `from sedonadb.expr.literal import lit`, matching the pre-PR layout. - expression.py: rewrite the `Expr` class docstring to describe what this PR actually ships (coercion happens inside methods like `isin`, not via operators) and call out that operator overloading arrives in a follow-up.
| assert "IN" in rep | ||
| assert "Int64(1)" in rep | ||
| assert "Int64(3)" in rep |
There was a problem hiding this comment.
A few more of these that should just be checking the repr. (LLMs love to write tests that will almost definitely pass, particularly if they are testing their own implementation 🙂 )
| rep = repr(e) | ||
| assert "a" in rep | ||
| assert "Int64(2)" in rep |
| assert e._impl.variant_name() == "Negative" | ||
| assert "(- x)" in repr(e) |
| assert e._impl.variant_name() == "Alias" | ||
| assert "missing" in repr(e) | ||
| assert "IS NULL" in repr(e) |
| # Constructing an Expr referring to a non-existent column does not error. | ||
| # Errors surface only at DataFrame consumption. | ||
| e = col("nonexistent_column_xyz") | ||
| assert "nonexistent_column_xyz" in repr(e) |
Switch the five flagged tests from substring matches to exact repr() equality so any change in DataFusion's Display output fails the suite deliberately rather than slipping through. - test_isin_python_scalars - test_isin_with_expr_values - test_negate - test_chain_alias_after_predicate - test_expr_is_not_bound_to_dataframe
Adds the foundation of a Python expression layer that wraps DataFusion's logical
Exprthrough PyO3, mirroring the pattern used in the R bindings (r/sedonadb/src/rust/src/expression.rs).This is the first of four small stacked PRs that together implement Phase P1 of #791. This one ships only the foundation:
sedonadb.expr.Expr— column-expression wrapper withalias,cast,is_null,is_not_null,isin,negate.sedonadb.expr.col(name, qualifier=None)— reference a column by name (with an optional table qualifier for joins).Operator overloading and DataFrame integration land in follow-up PRs.
Exprobjects are pure syntax: they are not bound to a DataFrame at construction time, and column-validity errors surface only when anExpris consumed by a DataFrame method.The pre-existing
sedonadb.expr.literal.lit(returningLiteral, the lazy parameterized-query helper) is unchanged by this PR and is intentionally kept offsedonadb.expr's package-level surface.Test plan
tests/expr/test_expression.pycovering construction, qualified columns, alias, cast, null checks,isin,negate, chaining, and the__init__type guard.tests/test_dataframe.py.