python/sedonadb: Add Pandas/GeoPandas-style DataFrame API #791

@jiayuasu

Status: Draft — trimmed to the relational pandas surface; GeoPandas facade deferred pending review feedback below.

0. Scope

Review feedback challenged the GeoPandas compatibility layer: most of SedonaDB's performance comes from rethinking workflows in the relational model, and the locked non-goals (no Index, no .apply, no .iloc) make a pure import-change migration impossible anyway. This issue keeps the relational pandas surface and defers the GeoPandas-specific facade (.geometry, .crs, .to_crs, .sjoin, geometry methods on Series) to a separate, demand-gated effort.

Cookbook entries that show how to rewrite GeoPandas workflows into the relational model run alongside the API build — each entry doubles as a mini design doc for the methods it exercises.

The in-flight prior art is the R bindings: PR #468 added r/sedonadb/R/expression.R (454 LoC) + r/sedonadb/src/rust/src/expression.rs (8.5 KB). The Python expression layer mirrors that pattern.

1. Motivation

SedonaDB's Python package today is SQL-driven: users get a lazy DataFrame from sd.sql("SELECT ..."), sd.read_parquet(...), or sd.create_data_frame(...), and the only way to transform it is to write another SQL string. The target audience — Python data scientists who use pandas — expects a different surface entirely: a DataFrame they can index with df["col"], a Series with .mean() / .str.lower() / .fillna(), a groupby().agg(...) chain, and df.merge(other, on=...).

Goal: make the relational surface of the SedonaDB Python interface as close to pandas as is feasible on top of a SQL-backed lazy engine. Where pandas idioms force a real semantic conflict with the engine, follow pandas at the surface and document the deviation. Where they would gut the engine, refuse, and offer a clear alternative.

Explicit non-goals (locked):

  • No row labels / Index / .loc / .iloc / .reset_index / .set_index.
  • No .apply(func, axis=1) — would force per-row Python over a SQL plan.
  • No GeoPandas facade in this scope. Tracked separately; gated on demand signal.

2. Current state

Relevant files:

  • python/sedonadb/python/sedonadb/context.py: SedonaContext.
  • python/sedonadb/python/sedonadb/dataframe.py: DataFrame (lazy SQL-driven; limit, head, count, show, to_arrow_table, to_pandas, to_parquet, to_view, with_params, schema, columns, explain).
  • python/sedonadb/src/dataframe.rs: InternalDataFrame PyO3 wrapper around datafusion::prelude::DataFrame.
  • python/sedonadb/python/sedonadb/expr/literal.py: Literal / lit() (only existing expression type).
  • r/sedonadb/R/expression.R, r/sedonadb/src/rust/src/expression.rs: prior art for the expression-translation pattern.

Working in our favor:

  • The Rust side already holds a full DataFusion DataFrame with select, filter, with_column, join, sort, aggregate, etc. We are exposing existing capability.
  • Schema already tracks geometry columns and CRS (carry-through, not feature work).
  • to_pandas() already returns a usable frame.

3. Design decisions

| # | Decision |
|---|----------|
| 1 | Pandas-first surface. Method names, signatures, and behavior follow pandas wherever feasible (groupby, merge, rename(columns=), drop_duplicates, assign, query, tail, describe, info, shape). |
| 2 | Series is a first-class type. df["x"] returns Series. Expr becomes an internal detail most users never type. |
| 3 | Same DataFrame class for tabular and geospatial. Geometry awareness via dtype + schema metadata only: no GeoDataFrame subclass, no GeoSeries type. |
| 4 | Everything is lazy, including reductions. s.mean() returns a deferred Scalar (a 0-D Expr). It auto-materializes only when coerced to a concrete value (float(), print(), if, comparison against a non-Expr). When composed back into another Series / DataFrame expression, it stays lazy and folds into the surrounding plan. |
| 5 | DataFrame transforms stay lazy. df.query(...).assign(...).groupby(...).agg(...) builds a single plan. Materialization happens only at Python-coercion of a Scalar, to_pandas(), show(), to_parquet(), etc. |
| 6 | Copy the R expression layer. Mirror r/sedonadb/src/rust/src/expression.rs into python/sedonadb/src/expr.rs. Don't pre-factor a shared crate; re-evaluate only if R/Python drift becomes painful. |
| 7 | SQL NULL semantics, not pandas NaN. .isna() matches both NULL and NaN; arithmetic propagates NULL, and comparisons and boolean operators use SQL three-valued logic. |
| 8 | GeoPandas facade deferred. Cookbook docs serve the migration story for now. Re-evaluate after we see who adopts the relational pandas surface. |
| 9 | Pandas-shaped surface ships as a submodule of the existing sedonadb package, not a separate PyPI distribution. Pandas-flavored methods live directly on sedonadb.DataFrame / sedonadb.Series; sedonadb.pandas is a discoverability namespace that re-exports the same types and top-level constructors so import sedonadb.pandas as pd; pd.read_csv(...) works for users coming from pandas. Same release cadence. Code lives under python/sedonadb/python/sedonadb/pandas/. |
| 10 | Cookbook track runs alongside the API build. Each cookbook entry is a real GeoPandas workflow rewritten in the relational model and doubles as a mini design doc for the methods that workflow exercises. Cookbook entries drive method-prioritization within P3/P4. Lives under docs/cookbook/. |

4. The user surface (this scope)

4.1 Top-level

import sedonadb as sd

df = sd.DataFrame({"x": [1, 2, 3]})  # alias for create_data_frame
df = sd.read_parquet("...")
df = sd.read_csv("...")
df = sd.read_json("...")
df = sd.sql("SELECT ...")  # SQL escape hatch — still there

4.2 DataFrame — pandas-spelled methods

# Inspection
df.shape; df.columns; df.dtypes; df.empty; df.info(); df.describe()

# Selection / projection
df["x"]                        # -> Series
df[["x", "y"]]                 # -> DataFrame
df[df["x"] > 0]                # -> DataFrame (boolean mask)
df.query("x > 0 and y < 10")   # -> DataFrame (string predicate)

# Mutation-by-rebind (immutable under the hood)
df["new"] = df["x"] + 1
df = df.assign(new=df["x"] + 1, doubled=df["x"] * 2)
df = df.drop(columns=["a", "b"])
df = df.rename(columns={"a": "x"})

# Sorting / dedup
df = df.sort_values(by="x", ascending=False)
df = df.drop_duplicates()
df = df.head(10); df = df.tail(10)

# Combine
df = df.merge(other, on="key", how="inner")  # alias: join
df = sd.concat([df1, df2])

# Group-aggregate
df.groupby("k").agg({"x": "sum", "y": "mean"})
df.groupby(["k1", "k2"]).sum()
df.groupby("k").size()

# Materialize
df.to_pandas(); df.to_arrow_table(); df.to_parquet(...); df.to_csv(...); df.to_json(...)
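
One way df.query("x > 0 and y < 10") could be lowered to a SQL predicate is to parse the string with Python's own ast module and render the tree. The following is a minimal sketch under that assumption; predicate_to_sql and its helper are hypothetical names, and the issue does not say how query(str) will actually be implemented.

```python
import ast

# Hypothetical operator tables for this sketch; not part of any SedonaDB API.
_COMPARE = {ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<", ast.LtE: "<=",
            ast.Eq: "=", ast.NotEq: "<>"}
_BOOLEAN = {ast.And: "AND", ast.Or: "OR"}


def predicate_to_sql(predicate: str) -> str:
    """Translate a pandas-style predicate string into a SQL boolean expression."""
    return _render(ast.parse(predicate, mode="eval").body)


def _render(node: ast.AST) -> str:
    if isinstance(node, ast.BoolOp):
        joiner = f" {_BOOLEAN[type(node.op)]} "
        return "(" + joiner.join(_render(value) for value in node.values) + ")"
    if isinstance(node, ast.Compare):
        # Chained comparisons such as 0 < x < 10 are rejected to keep the sketch small.
        if len(node.ops) != 1 or type(node.ops[0]) not in _COMPARE:
            raise ValueError("unsupported comparison")
        op = _COMPARE[type(node.ops[0])]
        return f"({_render(node.left)} {op} {_render(node.comparators[0])})"
    if isinstance(node, ast.Name):
        return f'"{node.id}"'  # bare names are treated as column references
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return repr(node.value)
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


sql = predicate_to_sql("x > 0 and y < 10")
print(sql)

try:
    predicate_to_sql("0 < x < 10")
except ValueError as exc:
    rejected = str(exc)
```

Rejecting anything outside a small whitelist of node types keeps the string predicate from silently diverging from pandas semantics, in line with the breadth-vs-depth stance in the Risks section.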

4.3 Series — pandas-spelled methods

s = df["x"]

# Reductions — return DEFERRED scalars (auto-materialize on Python coercion)
m = s.mean()                  # Scalar (lazy)
print(m); float(m)            # triggers materialization
df.assign(z=df["x"] - m)      # stays lazy, folds m into the plan, one scan
s.sum(); s.min(); s.max(); s.std(); s.var(); s.median()
s.count(); s.nunique()

# Element-wise (lazy — return Series)
s.isna(); s.notna(); s.fillna(0); s.astype("float64")
s.isin([1, 2, 3]); s.between(0, 10); s.clip(lower=0, upper=100)

# Frequency
s.unique(); s.value_counts()

# Boolean mask via operators
mask = (s > 0) & s.notna()

# Accessors
s.str.lower(); s.str.contains("foo"); s.str.replace("a", "b")
s.dt.year; s.dt.month; s.dt.floor("h")

4.4 What's deliberately not pandas-compatible

These raise NotImplementedError with a one-line pointer to the migration cookbook:

  • df.loc[...], df.iloc[...], df.at[...], df.iat[...]
  • df.index, df.reset_index(), df.set_index()
  • df.apply(func, axis=1) — point users to UDFs / map_batches
  • df.iterrows(), df.itertuples() — point to to_arrow_table() / to_pandas()
  • In-place mutation is the exception: inplace=True does not raise; it is ignored with a warning.
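
The refusal behavior listed above can be sketched with a stand-in class. DataFrameSketch, the message wording, and the use of UserWarning are assumptions for illustration; only the NotImplementedError / warning split and the docs/cookbook/ location come from this proposal.

```python
import warnings

COOKBOOK = "docs/cookbook/"  # location named in this proposal; the exact page is a placeholder


class DataFrameSketch:
    """Stand-in for sedonadb.DataFrame showing only the refusal behavior."""

    def __init__(self, columns):
        self.columns = list(columns)

    @property
    def iloc(self):
        raise NotImplementedError(
            f"DataFrame.iloc is not supported (no row labels); see {COOKBOOK}"
        )

    def apply(self, func, axis=0):
        raise NotImplementedError(
            f"DataFrame.apply is not supported; use a UDF instead, see {COOKBOOK}"
        )

    def drop(self, columns, inplace=False):
        if inplace:
            # Ignored rather than rejected: the caller still gets a new frame back.
            warnings.warn("inplace=True is ignored; rebind the result instead",
                          UserWarning, stacklevel=2)
        return DataFrameSketch(c for c in self.columns if c not in columns)


df = DataFrameSketch(["a", "b", "c"])

try:
    df.iloc[0]
except NotImplementedError as exc:
    message = str(exc)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    trimmed = df.drop(columns=["a"], inplace=True)
```

After this runs, df still has all three columns and trimmed holds the result, which is the behavior a user porting inplace=True code needs to notice.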

4.5 The Expr escape hatch

Expr exists for advanced users who want explicit DataFusion expressions, exposed as sd.expr.col("x") and sd.expr.lit(v). Documentation foregrounds Series / DataFrame; Expr shows up only in the advanced section.
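
To make the shape of the expression layer concrete, here is a toy version of col / lit with operator overloads. It renders SQL text as a stand-in; the planned layer would wrap datafusion_expr::Expr through PyO3 instead, and everything beyond the names col, lit, alias, and is_null is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Toy expression node; `sql` stands in for a datafusion_expr::Expr handle."""
    sql: str

    def _binary(self, op, other):
        # Plain Python values on the right-hand side are promoted to literals.
        rhs = other if isinstance(other, Expr) else lit(other)
        return Expr(f"({self.sql} {op} {rhs.sql})")

    def __add__(self, other):
        return self._binary("+", other)

    def __gt__(self, other):
        return self._binary(">", other)

    def __and__(self, other):
        return self._binary("AND", other)

    def is_null(self):
        return Expr(f"({self.sql} IS NULL)")

    def alias(self, name):
        return Expr(f'{self.sql} AS "{name}"')


def col(name):
    return Expr(f'"{name}"')


def lit(value):
    if value is None:
        return Expr("NULL")
    if isinstance(value, str):
        return Expr("'" + value.replace("'", "''") + "'")
    return Expr(repr(value))


# Parentheses are required: Python's & binds tighter than >.
predicate = (col("x") + 1 > 0) & col("y").is_null()
projection = (col("x") + lit(1)).alias("x_plus_one")
print(predicate.sql)
print(projection.sql)
```

The precedence comment matters for the Series surface too: mask = (s > 0) & s.notna() in section 4.3 only works because the comparison is parenthesized.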

5. Implementation notes

5.1 Type layering

sd.Series    ─ wraps an internal Expr bound to a DataFrame
sd.Scalar    ─ deferred 0-D Expr; auto-materializes on Python coercion
sd.DataFrame ─ wraps a DataFusion LogicalPlan
sd.expr.Expr ─ thin Python view on datafusion_expr::Expr (escape hatch)

Series is conceptually a (DataFrame, Expr) pair. Operations across two Series require that both share the same parent DataFrame; otherwise they raise a clear error (no implicit join).
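
A small sketch of the shared-parent rule, using toy Frame and Series classes. Object identity is used here as the simplest notion of "same parent"; the proposal does not pin down the exact rule, and the error type and message are assumptions.

```python
class Frame:
    """Toy parent: only tracks column names."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __getitem__(self, name):
        if name not in self.columns:
            raise KeyError(name)
        return Series(self, f'"{name}"')


class Series:
    """Conceptually a (parent DataFrame, expression) pair."""

    def __init__(self, parent, expr):
        self.parent = parent
        self.expr = expr

    def __add__(self, other):
        if isinstance(other, Series):
            if other.parent is not self.parent:
                raise ValueError(
                    "Series come from different DataFrames; merge them first"
                )
            rhs = other.expr
        else:
            rhs = repr(other)  # plain Python value becomes a literal
        return Series(self.parent, f"({self.expr} + {rhs})")


left = Frame(["x", "y"])
right = Frame(["x"])

total = left["x"] + left["y"] + 1   # same parent: composes into one expression
try:
    left["x"] + right["x"]          # different parents: refused, no implicit join
except ValueError as exc:
    error = str(exc)
```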

5.2 Deferred scalars

A reduction returns a Scalar — a 0-D Expr bound to its parent DataFrame. Two paths:

  1. Composed back into Expr / Series / DataFrame. Stays lazy and folds into the surrounding plan. df.assign(centered=df["x"] - df["x"].mean()) compiles to one scan.
  2. Coerced to a Python value (__float__, __int__, __bool__, __repr__, comparison vs. non-Expr) — triggers materialization via a one-row collect, with per-Scalar memoization so repeated coercions don't re-execute.

Boundary rule: SedonaDB-types-only operands stay lazy; coercion at any Python-protocol boundary materializes.
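
The two paths and the memoization requirement can be sketched in pure Python. Here a zero-argument callable stands in for the one-row collect, and composing two callables stands in for folding into a single plan; both are simplifications, and the class layout is an assumption rather than the planned implementation.

```python
class Scalar:
    """Toy deferred scalar: `compute` stands in for the one-row collect."""

    def __init__(self, compute, expr):
        self._compute = compute
        self.expr = expr          # textual form of the 0-D expression
        self._cache = None
        self._materialized = False

    def _value(self):
        # Memoize: run the underlying query at most once per Scalar.
        if not self._materialized:
            self._cache = self._compute()
            self._materialized = True
        return self._cache

    # Python-protocol boundaries: each one forces materialization.
    def __float__(self):
        return float(self._value())

    def __repr__(self):
        return repr(self._value())

    def __gt__(self, other):
        if isinstance(other, Scalar):
            # Both operands are deferred, so the comparison stays deferred.
            return Scalar(lambda: self._value() > other._value(),
                          f"({self.expr} > {other.expr})")
        return self._value() > other  # non-Expr operand: materialize now


executions = []


def run_mean():
    executions.append("collect")   # record every simulated query execution
    data = [1.0, 2.0, 3.0]
    return sum(data) / len(data)


m = Scalar(run_mean, 'avg("x")')
untouched = len(executions)        # nothing has run yet
first = float(m)
second = float(m)                  # served from the memoized value
above = m > 1.5                    # comparison against a plain float materializes
lazy_cmp = m > m                   # Scalar vs Scalar stays a Scalar
```

Because __repr__ materializes, typing m at a REPL prints the value, which is the mitigation proposed for the deferred-scalar surprise in the Risks section.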

5.3 Rust side (PyO3)

Add to python/sedonadb/src/:

  • expr.rs: InternalExpr holding datafusion_expr::Expr. Operator wrappers, alias, cast, is_null, etc. Pattern lifted from r/sedonadb/src/rust/src/expression.rs.
  • series.rs: InternalSeries (DataFrame handle + Expr handle); reductions drive single-column aggregations through InternalDataFrame.
  • Extend InternalDataFrame with: select, filter, with_column, drop, rename, sort, distinct, union, join, aggregate.

Each new method is a thin wrapper over the corresponding DataFusion DataFrame method — no new query-engine code.

5.4 __setitem__ semantics

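df["x"] = expr is sugar for df = df.assign(x=expr). Plans are immutable. Because Python's __setitem__ cannot rebind the caller's variable, the DataFrame wrapper would have to swap the plan it holds for the new one, so every reference to that wrapper sees the added column. inplace=True is ignored with a warning.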
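
A sketch of how __setitem__ can sit on top of immutable plans: the plan is a tuple that is never modified, and the wrapper swaps which tuple it points at. LazyFrame and the (op, args) plan encoding are hypothetical; this is one possible reading of the semantics, not the confirmed design.

```python
class LazyFrame:
    """Toy wrapper around an immutable plan, encoded as a tuple of (op, args) steps."""

    def __init__(self, plan=()):
        self._plan = tuple(plan)

    @property
    def plan(self):
        return self._plan

    def assign(self, **columns):
        # Returns a new wrapper with an extended plan; the receiver is untouched.
        step = ("with_columns", tuple(columns.items()))
        return LazyFrame(self._plan + (step,))

    def __setitem__(self, name, expr):
        # Python cannot rebind the caller's variable from here, so the wrapper
        # replaces the plan it holds. The old plan tuple itself is never mutated.
        self._plan = self.assign(**{name: expr})._plan


df = LazyFrame((("scan", "t"),))
alias = df                 # second reference to the same wrapper
before = df.plan           # handle on the original plan

df["new"] = "x + 1"
derived = df.assign(doubled="x * 2")
```

One consequence worth documenting: alias observes the new column after df["new"] = ..., which matches what pandas users expect from item assignment, while before still holds the original one-step plan.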

5.5 NULL vs NaN

SQL NULL throughout. Series.isna() matches both NULL and NaN. Arithmetic propagates NULL, and comparisons and boolean operators use SQL three-valued logic. Documented in the cookbook's "porting from pandas" section.
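
The difference from NaN can be seen with any SQL engine. The sketch below uses SQLite only because it ships with Python; it is a stand-in for illustrating standard SQL NULL rules, not a statement about DataFusion internals. The isna helper is a hypothetical rendering of the "matches both NULL and NaN" rule.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# NULL propagates through arithmetic and comparisons; AND/OR follow three-valued logic.
row = conn.execute("SELECT NULL + 1, NULL > 0, NULL AND 0, NULL OR 1").fetchone()
conn.close()
print(row)  # sqlite3 maps SQL NULL to Python None


def isna(value):
    """Sketch of the proposed rule: both NULL (None) and NaN count as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


nan = float("nan")
nan_compare = nan > 0          # IEEE comparison with NaN yields False, not "unknown"
missing = [isna(None), isna(nan), isna(0.0)]
```

The practical porting hazard is filters: a pandas mask built from NaN comparisons yields False, while the SQL comparison yields NULL. Both drop the row in a plain filter, but they diverge as soon as the mask is negated.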

6. Phases

| Phase | Content | Mirrors R prior art |
|-------|---------|---------------------|
| P1: Expression layer | InternalExpr (PyO3) wrapping datafusion_expr::Expr. col, lit, operator overloads, alias, cast, is_null, isin. Wire into existing InternalDataFrame. | r/sedonadb/src/rust/src/expression.rs + R/expression.R |
| P2: Core DataFrame ops | select via __getitem__, filter via boolean mask, query(str), assign, drop(columns=), rename(columns=), sort_values, head/tail, __setitem__. | r/sedonadb/R/dataframe.R |
| P3: Series + Scalar | Series as (DataFrame, Expr). Element-wise (isna, fillna, astype, isin, between, clip). Lazy reductions returning Scalar with auto-materializing dunders. unique, value_counts. str.*, dt.* accessors. | new (no R analog) |
| P4: Joins / groupby / combine | merge/join, groupby().agg(), drop_duplicates, concat, describe, info, shape, empty. | partial (see PR #781) |
| P5: Top-level + polish | sd.DataFrame, sd.read_csv, sd.read_json, sd.concat. NotImplementedError stubs for .loc/.iloc/.apply/.iterrows. Cookbook entries written in parallel with P3–P4. | |

GeoPandas facade becomes its own future ticket, gated on demand signal after P5 ships.

7. Risks

  • Deferred-scalar surprise. Users may not realize s.mean() is lazy. Mitigate via a repr that prints the materialized value (REPL >>> s.mean() does the right thing) and a clear note in Scalar.__doc__.
  • Coercion-boundary edge cases. Scalar × numpy scalar / Decimal / Expr from a different parent — codify in tests up front.
  • Repeated coercion of the same Scalar must not re-execute — memoize on first coercion.
  • NULL vs NaN edge cases when porting pandas code that relies on NaN propagation. Possible later: compat="pandas" mode at the Arrow boundary.
  • __eq__ overload on Series collides with set / dict membership. Standard DataFrame-library trade-off; document.
  • Breadth vs depth. Pandas surface is huge. Ship the listed methods first; non-listed methods raise NotImplementedError with a clear pointer rather than partial implementations that silently differ from pandas.
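
The __eq__ collision mentioned above is easy to reproduce with a toy column type. Col and Pred are hypothetical stand-ins; the behavior shown is standard Python data-model behavior for any class that overloads __eq__ to build expressions.

```python
class Pred:
    """Toy predicate expression produced by comparing two columns."""

    def __init__(self, sql):
        self.sql = sql


class Col:
    """Toy column whose == builds an expression instead of returning a bool."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        rhs = other.name if isinstance(other, Col) else repr(other)
        return Pred(f"({self.name} = {rhs})")


a, b = Col("a"), Col("b")

pred = a == b        # intended use: builds a predicate
found = a in [b]     # list membership calls ==, and the Pred object is truthy

try:
    {a}              # defining __eq__ without __hash__ makes instances unhashable
except TypeError as exc:
    hash_error = str(exc)
```

Here found comes out True even though a and b are different columns. One possible mitigation, as an assumption rather than a decision in this issue, is a __bool__ that raises on expression types so that accidental membership tests fail loudly.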

8. Non-goals (in this scope)

  • Row labels, Index, RangeIndex, MultiIndex.
  • .loc, .iloc, .at, .iat.
  • .apply(func, axis=1), .applymap, .itertuples, .iterrows.
  • .plot() — users go through to_pandas().
  • True in-place mutation (inplace=True is ignored with a warning).
  • GeoPandas facade: .geometry, .crs, .to_crs, .sjoin, geometry methods on Series. Tracked separately; gated on demand after P5.
  • GeoSeries, GeoDataFrame types.
