feat(python/sedonadb): add Expr foundation by jiayuasu · Pull Request #807 · apache/sedona-db

jiayuasu · 2026-05-02T06:09:27Z

Adds the foundation of a Python expression layer that wraps DataFusion's logical Expr through PyO3, mirroring the pattern used in the R bindings (r/sedonadb/src/rust/src/expression.rs).

This is the first of four small stacked PRs that together implement Phase P1 of #791. This one ships only the foundation:

sedonadb.expr.Expr — column-expression wrapper with alias, cast, is_null, is_not_null, isin, negate.
sedonadb.expr.col(name, qualifier=None) — reference a column by name (with an optional table qualifier for joins).

Operator overloading and DataFrame integration land in follow-up PRs. Expr objects are pure syntax: they are not bound to a DataFrame at construction time, and column-validity errors surface only when an Expr is consumed by a DataFrame method.

The pre-existing sedonadb.expr.literal.lit (returning Literal, the lazy parameterized-query helper) is unchanged by this PR and is intentionally kept off sedonadb.expr's package-level surface.
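
To make the "pure syntax" behavior described above concrete, here is a minimal pure-Python stand-in. It is not the sedonadb implementation: `ToyExpr`, `toy_col`, and `toy_select` are hypothetical names, and the real `Expr` delegates to DataFusion through PyO3. The sketch only shows that construction never consults a schema and that a bad column name surfaces when the expression is consumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyExpr:
    """Hypothetical stand-in for an unbound column expression."""

    name: str
    qualifier: Optional[str] = None
    alias_name: Optional[str] = None

    def alias(self, name: str) -> "ToyExpr":
        # Return a new expression; the original is left unchanged.
        return ToyExpr(self.name, self.qualifier, name)


def toy_col(name: str, qualifier: Optional[str] = None) -> ToyExpr:
    # No schema lookup happens here, so a bad name cannot fail yet.
    return ToyExpr(name, qualifier)


def toy_select(schema: list, expr: ToyExpr) -> str:
    # Validation happens only when the expression is consumed.
    if expr.name not in schema:
        raise KeyError(f"column {expr.name!r} not found in {schema}")
    return expr.alias_name or expr.name


missing = toy_col("does_not_exist")  # succeeds: nothing is bound yet
renamed = toy_col("x", "t").alias("y")
```

Calling `toy_select(["x"], missing)` is the point at which the error would appear in this sketch.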

Test plan

Unit tests in tests/expr/test_expression.py covering construction, qualified columns, alias, cast, null checks, isin, negate, chaining, and the __init__ type guard.
No regressions in the existing tests/test_dataframe.py.
CI green.

Introduce a Python expression layer that wraps DataFusion's logical Expr through PyO3, mirroring the pattern used in the R bindings (r/sedonadb/src/rust/src/expression.rs). This commit ships the foundation only:

- `sedonadb.expr.Expr`: column-expression wrapper with `alias`, `cast`, `is_null`, `is_not_null`, `isin`, and `negate`.
- `sedonadb.expr.col(name)`: reference a column by name.
- `sedonadb.expr.lit(value)`: wrap a Python value as a literal, reusing the existing Literal Arrow-array coercion path.

Operator overloading and DataFrame integration land in follow-up commits. Expr objects are pure syntax: they are not bound to a DataFrame at construction time, and column-validity errors surface only when an Expr is consumed by a DataFrame method.

Copilot

Pull request overview

Adds the initial Python expression layer (sedonadb.expr) that wraps DataFusion logical Expr via PyO3, establishing the base building blocks needed for upcoming pandas-like APIs (per #791 / Phase P1).

Changes:

Introduces a Rust PyO3 InternalExpr wrapper and exports constructors expr_col / expr_lit from sedonadb._lib.
Adds Python sedonadb.expr.Expr plus col() / lit() helpers and basic expression methods (alias, cast, null checks, isin, negate).
Adds a new unit test suite covering expression construction and method chaining.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| python/sedonadb/tests/expr/test_expression.py | New unit tests for the Python `Expr` surface. |
| python/sedonadb/src/lib.rs | Registers the new Rust expression module/functions/classes in the PyO3 module. |
| python/sedonadb/src/expr.rs | Implements `InternalExpr` and core expression operations + literal/column constructors. |
| python/sedonadb/python/sedonadb/expr/expression.py | Implements the public Python `Expr` wrapper API and helpers. |
| python/sedonadb/python/sedonadb/expr/__init__.py | Re-exports `Expr`, `col`, `lit`, and `Literal`. |

- expr.rs: add module-level docs and step-by-step comments on every factory and method, explaining what each constructs, why we clone, why we reject Arrow extension types in cast(), and the intended shape of the layer relative to the R prior art.
- expr.rs: route expr_lit() through import_arrow_scalar() instead of duplicating the length check + ScalarValue conversion + metadata handling inline. Future fixes to scalar coercion only need to live in one place.
- expression.py: fix is_null() docstring. The previous wording promised both NULL and NaN matching, but the underlying DataFusion Expr::IsNull only matches SQL NULL. The pandas-style NaN-aware helper will live on the future Series type.
- test_expression.py: lean on Expr.variant_name() for structural assertions and only check user-supplied identifiers (column names, literal values) inside repr() output. Avoids coupling the suite to DataFusion's Display formatting, which can change between versions without a semantic change.
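
The NULL versus NaN distinction behind the is_null() docstring fix can be sketched in plain Python. This is an illustration only: `None` stands in for SQL NULL, and both function names are hypothetical rather than part of sedonadb.

```python
import math
from typing import Optional


def is_sql_null(value: Optional[float]) -> bool:
    # SQL-style check: only a missing value counts; NaN is a real float.
    return value is None


def is_na_pandas_style(value: Optional[float]) -> bool:
    # NaN-aware check: treats both missing values and NaN as missing.
    return value is None or math.isnan(value)


values = [1.0, None, float("nan")]
sql_nulls = [is_sql_null(v) for v in values]
na_flags = [is_na_pandas_style(v) for v in values]
```

The two checks agree on `1.0` and `None` and differ only on NaN, which is the case the earlier docstring wording got wrong.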

paleolimbot

Very cool! I know you're still working here so feel free to ignore anything that was still in the works.

paleolimbot · 2026-05-03T01:34:57Z

+    /// 2. Reject Arrow extension types up front. SedonaDB's spatial types
+    ///    are extension types over WKB; users who reach for `cast` on
+    ///    those almost certainly want a different operation, so we surface
+    ///    a clear error rather than silently dropping the extension.


👍 (I added support for these in DataFusion 53 that we can transform using an optimizer rule to a scalar function call). We can support these before DataFusion 53 here if we need to by inserting the scalar function call directly (I may give this a go in the near future since it's maybe useful for geometry/geography).

Thanks for the heads-up. Leaving extension-type cast as a clear error in this PR. Happy to revisit with the DF53 optimizer rule once that's available — geometry/geography is the obvious payoff.
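
A rough sketch of the "reject up front" idea, assuming the cast target is described by a type name plus Arrow field metadata. Arrow marks extension types with the `ARROW:extension:name` metadata key; everything else here (`check_cast_target`, the dict-based metadata, the error wording) is hypothetical and not the code in expr.rs.

```python
EXTENSION_NAME_KEY = "ARROW:extension:name"


def check_cast_target(type_name: str, field_metadata: dict) -> str:
    # Refuse extension-typed targets instead of silently dropping the extension.
    ext = field_metadata.get(EXTENSION_NAME_KEY)
    if ext is not None:
        raise ValueError(f"cast to extension type {ext!r} is not supported")
    return type_name


plain = check_cast_target("int32", {})

try:
    check_cast_target("binary", {EXTENSION_NAME_KEY: "geoarrow.wkb"})
except ValueError as err:
    rejected = str(err)
```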

paleolimbot · 2026-05-03T01:37:43Z

+/// Qualified columns (e.g. `t.x`) are not exposed yet; the Python
+/// `col()` helper takes only a single name. When we add joins and
+/// multi-table references we can grow this to accept an optional
+/// table qualifier, matching the R side's `column(name, qualifier)`.
+#[pyfunction]
+pub fn expr_col(name: &str) -> PyExpr {


If it's easy you can probably just add a qualifier argument here while you're at it

Done in 0d1a4ab — expr_col(name, qualifier=None) on the Rust side and col(name, qualifier=None) on the Python side. Test in test_col_with_qualifier.

paleolimbot · 2026-05-03T01:42:10Z

+    e = col("x")
+    assert isinstance(e, Expr)
+    assert e._impl.variant_name() == "Column"
+    assert "x" in repr(e)


Testing the exact repr output is probably easy to do and is a slightly better test

Reverted in 0d1a4ab — exact repr substrings are back (x AS y, CAST(x AS Int32), etc.), with variant_name() checks kept alongside as structural anchors. Added a module-level comment explaining the policy so the next person doesn't re-loosen them.

paleolimbot · 2026-05-03T01:56:20Z

+def lit(value: Any) -> Expr:
+    """Wrap a Python value as a literal expression.
+
+    Accepts the same value types as `sedonadb.expr.literal.lit`, including
+    Python scalars, pyarrow arrays/scalars, and Shapely geometries. Returns an
+    `Expr` suitable for composition with column expressions.
+    """


FWIW there is already a lit() function that returns Literal. Literal is intentionally lazy so that (the Python wrappers around) different functions can interpret them differently if they want to (e.g., RS_Intersects() doesn't actually need to convert a whole rasterio object to an Arrow scalar, which is expensive, to compute a correct result). Purely theoretical, not currently used, and can change, but that's why it's like that 🙂 .

Probably you can just fold this logic into _to_expr() and leave the existing lit() (which powers parameterized queries).

Got it, that makes sense — thanks for the context. Folded into 0d1a4ab: dropped the public lit() -> Expr and moved the Python-value-to-Expr coercion into the private _to_expr() helper. The existing sedonadb.expr.literal.lit returning Literal is untouched and stays re-exported from the package.
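
The shape of that change can be sketched as follows. This is a toy under stated assumptions: `ToyExpr`, `to_expr`, and the `lit(...)` rendering are hypothetical stand-ins, and the real `_to_expr()` builds a DataFusion literal rather than a string.

```python
from typing import Any


class ToyExpr:
    """Hypothetical stand-in for the public Expr wrapper."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __repr__(self) -> str:
        return self.text


def to_expr(value: Any) -> ToyExpr:
    # Expressions pass through untouched; anything else becomes a literal.
    if isinstance(value, ToyExpr):
        return value
    return ToyExpr(f"lit({value!r})")


def isin(expr: ToyExpr, values: list) -> ToyExpr:
    # Coercion happens inside the method, so callers can pass plain values.
    items = ", ".join(repr(to_expr(v)) for v in values)
    return ToyExpr(f"{expr!r} IN ({items})")


result = isin(ToyExpr("x"), [1, ToyExpr("y")])
```

Keeping the coercion private means there is no second public `lit()` competing with the lazy `Literal` constructor.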

paleolimbot · 2026-05-03T02:03:44Z

+    def __init__(self, impl):
+        # impl is the underlying _lib.InternalExpr handle. Users normally
+        # do not construct Expr directly; use col() / lit() instead.
+        self._impl = impl


An isinstance() check here would be good so that this errors if used incorrectly

Added in 0d1a4ab — Expr.__init__ now isinstance-checks its argument against _lib.InternalExpr and raises TypeError with a message pointing at col(). Covered by test_expr_init_rejects_wrong_type.
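
A minimal sketch of such a constructor guard, using stand-in classes. `InternalExprStub` and `GuardedExpr` are hypothetical names, and the error message is illustrative rather than the one in expression.py.

```python
class InternalExprStub:
    """Hypothetical stand-in for the `_lib.InternalExpr` handle."""


class GuardedExpr:
    def __init__(self, impl) -> None:
        # Fail fast on misuse instead of on a later attribute error.
        if not isinstance(impl, InternalExprStub):
            raise TypeError(
                "GuardedExpr expects an InternalExprStub, got "
                f"{type(impl).__name__}; use col() to build expressions"
            )
        self._impl = impl


ok = GuardedExpr(InternalExprStub())

try:
    GuardedExpr("x")
except TypeError as err:
    message = str(err)
```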

paleolimbot · 2026-05-03T02:07:04Z

+    def alias(self, name: str) -> "Expr":
+        """Return a copy of the expression with a new output name."""
+        return Expr(self._impl.alias(name))


I forget if we have CI checks for this, but we have parameter docs and examples for most functions in the Python APIs

Beefed up in 0d1a4ab. Public Expr methods and col() now have Args: and Examples: docstring blocks matching the style of dataframe.py:head/limit/count.
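
For reference, a docstring with `Args:` and `Examples:` blocks might look like the sketch below. `ToyExpr` is a hypothetical stand-in and the wording is not copied from expression.py; running the example through `doctest` is one possible way to keep it honest, not something the PR is known to do.

```python
import doctest


class ToyExpr:
    """Hypothetical stand-in used only to show the docstring layout."""

    def __init__(self, text: str) -> None:
        self._text = text

    def __repr__(self) -> str:
        return self._text

    def alias(self, name: str) -> "ToyExpr":
        """Return a copy of the expression with a new output name.

        Args:
            name: The output column name to assign.

        Examples:

            >>> ToyExpr("x").alias("y")
            x AS y
        """
        return ToyExpr(f"{self._text} AS {name}")


# Run the Examples block as a test so the docs cannot drift from the code.
parser = doctest.DocTestParser()
example = parser.get_doctest(
    ToyExpr.alias.__doc__, {"ToyExpr": ToyExpr}, "alias", None, 0
)
results = doctest.DocTestRunner(verbose=False).run(example)
```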

A. expr_col now accepts an optional qualifier (e.g. col("x", "t") for t.x), mirroring SedonaDBExprFactory::column in the R bindings.
B. Tests pin the exact rendered Display form again (e.g. "x AS y", "CAST(x AS Int32)") with a module-level comment noting the policy so future contributors don't re-loosen them. variant_name() checks stay alongside as structural anchors.
C. Drop the public lit() that returned an Expr; Python-side coercion is now folded into the private _to_expr() helper. The pre-existing lit() in sedonadb.expr.literal (returning Literal) is intentionally lazy and powers parameterized queries; leave it untouched and keep it re-exported.
D. Expr.__init__ now isinstance-checks its argument and raises TypeError on misuse, with a message pointing at col().
E. Public Expr methods and col() grow Args / Examples docstring blocks matching the style of the existing dataframe.py methods.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

jiayuasu · 2026-05-06T06:16:30Z

+from sedonadb.expr.expression import Expr, col
+from sedonadb.expr.literal import Literal, lit
+
+__all__ = ["Expr", "Literal", "col", "lit"]


Done in f633481 — lit is removed from sedonadb.expr.__init__.py's re-exports. It's a Literal constructor (the lazy parameterized-query helper), not an Expr factory, so surfacing it on the same package as col() was misleading. Users continue to from sedonadb.expr.literal import lit, matching the pre-PR layout. PR description updated to drop the obsolete bullet.

jiayuasu · 2026-05-06T06:16:32Z

+    Construct an `Expr` with `col(name)`. Plain Python values composed with an
+    `Expr` via operators are coerced to literal expressions automatically.


Good catch — fixed in f633481. The Expr docstring now describes only what this PR ships (coercion happens inside methods like isin, not via operators) and calls out that operator overloading arrives in a follow-up.

- expr/__init__.py: drop `lit` from `__all__` / re-exports. It returns a `Literal` (the parameterized-query helper), not an `Expr`, so surfacing it on the same package as `col()` was confusing. Users who need it continue to `from sedonadb.expr.literal import lit`, matching the pre-PR layout.
- expression.py: rewrite the `Expr` class docstring to describe what this PR actually ships (coercion happens inside methods like `isin`, not via operators) and call out that operator overloading arrives in a follow-up.

paleolimbot · 2026-05-07T03:01:41Z

+    assert "IN" in rep
+    assert "Int64(1)" in rep
+    assert "Int64(3)" in rep


A few more of these that should just be checking the repr. (LLMs love to write tests that will almost definitely pass, particularly if they are testing their own implementation 🙂 )
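
The difference between the two assertion styles can be sketched with a toy. `ToyExpr`, `loose_check`, and `strict_check` are hypothetical, and the rendered strings are illustrative rather than asserted to match DataFusion's Display output.

```python
class ToyExpr:
    """Hypothetical stand-in whose repr mimics a rendered expression."""

    def __init__(self, text: str) -> None:
        self._text = text

    def __repr__(self) -> str:
        return self._text


def loose_check(rep: str) -> bool:
    # Substring checks: pass as long as the fragments appear anywhere.
    return "IN" in rep and "Int64(1)" in rep and "Int64(3)" in rep


def strict_check(rep: str) -> bool:
    # Exact check: pins the whole rendered form.
    return rep == "x IN ([Int64(1), Int64(3)])"


good = repr(ToyExpr("x IN ([Int64(1), Int64(3)])"))
# A negated expression is a different result but contains the same fragments.
wrong = repr(ToyExpr("x NOT IN ([Int64(1), Int64(3)])"))
```

In this sketch `loose_check` accepts both strings, while `strict_check` accepts only the intended one.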