python/sedonadb: Add Pandas/GeoPandas-style DataFrame API #791

> **Status:** Draft — trimmed to the relational pandas surface; GeoPandas facade deferred pending review feedback below.

## 0. Scope

Review feedback challenged the GeoPandas compatibility layer: most of SedonaDB's performance comes from rethinking workflows in the relational model, and the locked non-goals (no `Index`, no `.apply`, no `.iloc`) make a pure import-change migration impossible anyway. This issue therefore keeps the **relational pandas surface** and defers the **GeoPandas-specific facade** (`.geometry`, `.crs`, `.to_crs`, `.sjoin`, geometry methods on `Series`) to a separate, demand-gated effort.

Cookbook entries that show how to rewrite GeoPandas workflows into the relational model run **alongside** the API build — each entry doubles as a mini design doc for the methods it exercises.

The in-flight prior art is the R bindings: PR #468 added `r/sedonadb/R/expression.R` (454 LoC) + `r/sedonadb/src/rust/src/expression.rs` (8.5 KB). The Python expression layer mirrors that pattern.

## 1. Motivation

SedonaDB's Python package today is SQL-driven: users get a lazy `DataFrame` from `sd.sql("SELECT ...")`, `sd.read_parquet(...)`, or `sd.create_data_frame(...)`, and the only way to transform it is to write another SQL string. The target audience — Python data scientists who use pandas — expects a different surface entirely: a `DataFrame` they can index with `df["col"]`, a `Series` with `.mean()` / `.str.lower()` / `.fillna()`, a `groupby().agg(...)` chain, and `df.merge(other, on=...)`.

**Goal:** make the relational surface of the SedonaDB Python interface as close to pandas as is feasible on top of a SQL-backed lazy engine. Where pandas idioms force a real semantic conflict with the engine, follow pandas at the surface and document the deviation. Where they would gut the engine, refuse, and offer a clear alternative.

Explicit non-goals (locked):

- No row labels / `Index` / `.loc` / `.iloc` / `.reset_index` / `.set_index`.
- No `.apply(func, axis=1)` — would force per-row Python over a SQL plan.
- **No GeoPandas facade in this scope.** Tracked separately; gated on demand signal.

## 2. Current state

Relevant files:

- `python/sedonadb/python/sedonadb/context.py` — `SedonaContext`.
- `python/sedonadb/python/sedonadb/dataframe.py` — `DataFrame` (lazy SQL-driven; `limit`, `head`, `count`, `show`, `to_arrow_table`, `to_pandas`, `to_parquet`, `to_view`, `with_params`, `schema`, `columns`, `explain`).
- `python/sedonadb/src/dataframe.rs` — `InternalDataFrame` PyO3 wrapper around `datafusion::prelude::DataFrame`.
- `python/sedonadb/python/sedonadb/expr/literal.py` — `Literal` / `lit()` (only existing expression type).
- `r/sedonadb/R/expression.R`, `r/sedonadb/src/rust/src/expression.rs` — prior art for the expression-translation pattern.

Working in our favor:

- The Rust side already holds a full DataFusion `DataFrame` with `select`, `filter`, `with_column`, `join`, `sort`, `aggregate`, etc. We are exposing existing capability.
- Schema already tracks geometry columns and CRS (carry-through, not feature work).
- `to_pandas()` already returns a usable frame.

## 3. Design decisions

| # | Decision |
|---|---|
| 1 | **Pandas-first surface.** Method names, signatures, and behavior follow pandas wherever feasible (`groupby`, `merge`, `rename(columns=)`, `drop_duplicates`, `assign`, `query`, `tail`, `describe`, `info`, `shape`). |
| 2 | **`Series` is a first-class type.** `df["x"]` returns `Series`. `Expr` becomes an internal detail most users never type. |
| 3 | **Same `DataFrame` class for tabular and geospatial.** Geometry awareness via dtype + schema metadata only — no `GeoDataFrame` subclass, no `GeoSeries` type. |
| 4 | **Everything is lazy, including reductions.** `s.mean()` returns a deferred `Scalar` (a 0-D Expr). Auto-materializes only when coerced to a concrete value (`float()`, `print()`, `if`, comparison against a non-Expr). When composed back into another `Series` / `DataFrame` expression, stays lazy and folds into the surrounding plan. |
| 5 | **DataFrame transforms stay lazy.** `df.query(...).assign(...).groupby(...).agg(...)` builds a single plan. Materialization only at Python-coercion of a `Scalar`, `to_pandas()`, `show()`, `to_parquet()`, etc. |
| 6 | **Copy the R expression layer.** Mirror `r/sedonadb/src/rust/src/expression.rs` into `python/sedonadb/src/expr.rs`. Don't pre-factor a shared crate; re-evaluate only if R/Python drift becomes painful. |
| 7 | **SQL NULL semantics, not pandas NaN.** `.isna()` matches both NULL and NaN; NULL propagates through arithmetic, and comparisons and boolean operators follow SQL three-valued logic. |
| 8 | **GeoPandas facade deferred.** Cookbook docs serve the migration story for now. Re-evaluate after we see who adopts the relational pandas surface. |
| 9 | **Pandas-shaped surface ships as a submodule of the existing `sedonadb` package**, not a separate PyPI distribution. Pandas-flavored methods live directly on `sedonadb.DataFrame` / `sedonadb.Series`; `sedonadb.pandas` is a discoverability namespace that re-exports the same types and top-level constructors so `import sedonadb.pandas as pd; pd.read_csv(...)` works for users coming from pandas. Same release cadence. Code lives under `python/sedonadb/python/sedonadb/pandas/`. |
| 10 | **Cookbook track runs alongside the API build.** Each cookbook entry is a real GeoPandas workflow rewritten in the relational model and doubles as a mini design doc for the methods that workflow exercises. Cookbook entries drive method-prioritization within P3/P4. Lives under `docs/cookbook/`. |

## 4. The user surface (this scope)

### 4.1 Top-level

```python
import sedonadb as sd

df = sd.DataFrame({"x": [1, 2, 3]})  # alias for create_data_frame
df = sd.read_parquet("...")
df = sd.read_csv("...")
df = sd.read_json("...")
df = sd.sql("SELECT ...")  # SQL escape hatch — still there
```

### 4.2 `DataFrame` — pandas-spelled methods

```python
# Inspection
df.shape; df.columns; df.dtypes; df.empty; df.info(); df.describe()

# Selection / projection
df["x"]                        # -> Series
df[["x", "y"]]                 # -> DataFrame
df[df["x"] > 0]                # -> DataFrame (boolean mask)
df.query("x > 0 and y < 10")   # -> DataFrame (string predicate)

# Mutation-by-rebind (immutable under the hood)
df["new"] = df["x"] + 1
df = df.assign(new=df["x"] + 1, doubled=df["x"] * 2)
df = df.drop(columns=["a", "b"])
df = df.rename(columns={"a": "x"})

# Sorting / dedup
df = df.sort_values(by="x", ascending=False)
df = df.drop_duplicates()
df = df.head(10); df = df.tail(10)

# Combine
df = df.merge(other, on="key", how="inner")  # alias: join
df = sd.concat([df1, df2])

# Group-aggregate
df.groupby("k").agg({"x": "sum", "y": "mean"})
df.groupby(["k1", "k2"]).sum()
df.groupby("k").size()

# Materialize
df.to_pandas(); df.to_arrow_table(); df.to_parquet(...); df.to_csv(...); df.to_json(...)
```
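Since every method in this listing lowers to a relational plan, the group-aggregate spelling can be read as a single `GROUP BY` statement. The sketch below is illustrative only: `groupby_agg_sql` and the `_AGG_SQL` table are hypothetical names, and SQLite from the Python standard library stands in for the DataFusion engine so the generated SQL can actually be executed.

```python
import sqlite3

# Hypothetical mapping from pandas-style aggregation names to SQL functions.
_AGG_SQL = {"sum": "SUM", "mean": "AVG", "min": "MIN", "max": "MAX", "count": "COUNT"}


def groupby_agg_sql(table, keys, spec):
    """Render groupby(keys).agg(spec) as one SQL statement.

    Identifiers are interpolated unquoted, which is fine for a toy
    but not for real user input.
    """
    key_list = ", ".join(keys)
    aggs = ", ".join(f"{_AGG_SQL[fn]}({col}) AS {col}" for col, fn in spec.items())
    return f"SELECT {key_list}, {aggs} FROM {table} GROUP BY {key_list} ORDER BY {key_list}"


# SQLite is only a stand-in engine for this sketch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (k TEXT, x INTEGER, y REAL)")
con.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("a", 1, 2.0), ("a", 3, 4.0), ("b", 5, 6.0)],
)

# Equivalent of df.groupby("k").agg({"x": "sum", "y": "mean"})
sql = groupby_agg_sql("t", ["k"], {"x": "sum", "y": "mean"})
rows = con.execute(sql).fetchall()
print(sql)
print(rows)
```

The whole chain becomes one statement, which is the property the lazy design depends on: no intermediate result is materialized between `groupby` and `agg`.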

### 4.3 `Series` — pandas-spelled methods

```python
s = df["x"]

# Reductions — return DEFERRED scalars (auto-materialize on Python coercion)
m = s.mean()                  # Scalar (lazy)
print(m); float(m)            # triggers materialization
df.assign(z=df["x"] - m)      # stays lazy, folds m into the plan, one scan
s.sum(); s.min(); s.max(); s.std(); s.var(); s.median()
s.count(); s.nunique()

# Element-wise (lazy — return Series)
s.isna(); s.notna(); s.fillna(0); s.astype("float64")
s.isin([1, 2, 3]); s.between(0, 10); s.clip(lower=0, upper=100)

# Frequency
s.unique(); s.value_counts()

# Boolean mask via operators
mask = (s > 0) & s.notna()

# Accessors
s.str.lower(); s.str.contains("foo"); s.str.replace("a", "b")
s.dt.year; s.dt.month; s.dt.floor("h")
```
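The boolean-mask spelling relies on operator overloading: each operator returns a new expression node instead of a value. A minimal sketch, assuming a toy `Expr` class that renders SQL text (the real layer would wrap `datafusion_expr::Expr`; the class, `_lit`, and the rendering format here are hypothetical):

```python
class Expr:
    """Toy expression node that renders itself as SQL text."""

    def __init__(self, sql):
        self.sql = sql

    def __gt__(self, other):
        return Expr(f"({self.sql} > {_lit(other)})")

    def __and__(self, other):
        return Expr(f"({self.sql} AND {_lit(other)})")

    def __or__(self, other):
        return Expr(f"({self.sql} OR {_lit(other)})")

    def __invert__(self):
        return Expr(f"(NOT {self.sql})")

    def notna(self):
        return Expr(f"({self.sql} IS NOT NULL)")


def _lit(value):
    """Render an operand: nested expressions keep their SQL, Python values use repr()."""
    return value.sql if isinstance(value, Expr) else repr(value)


s = Expr("x")
mask = (s > 0) & s.notna()
negated = ~mask
print(mask.sql)
print(negated.sql)
```

As in pandas, the parentheses around `s > 0` are required because Python's `&` binds more tightly than comparison operators.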

### 4.4 What's deliberately not pandas-compatible

These raise `NotImplementedError` with a one-line pointer to the migration cookbook:

- `df.loc[...]`, `df.iloc[...]`, `df.at[...]`, `df.iat[...]`
- `df.index`, `df.reset_index()`, `df.set_index()`
- `df.apply(func, axis=1)` — point users to UDFs / `map_batches`
- `df.iterrows()`, `df.itertuples()` — point to `to_arrow_table()` / `to_pandas()`
- In-place mutation is the one exception to the `NotImplementedError` rule: `inplace=True` does not raise; it is ignored with a warning.
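The stubs listed above could be generated from a small table rather than written by hand. The following is a sketch under assumptions, not the planned implementation: the `_unsupported` helper, the message wording, and the toy `DataFrame` (which only tracks column names) are all hypothetical; only the `docs/cookbook/` location comes from this issue.

```python
import warnings

COOKBOOK = "docs/cookbook/"  # location named in this issue; the exact page is not specified


def _unsupported(name, hint):
    """Build a function that always raises with a pointer to the cookbook."""
    def raiser(self, *args, **kwargs):
        raise NotImplementedError(
            f"DataFrame.{name} is not supported: {hint} (see {COOKBOOK})"
        )
    return raiser


class DataFrame:
    """Toy frame that only tracks column names."""

    def __init__(self, columns):
        self.columns = list(columns)

    # Indexer-style attributes fail as soon as they are accessed.
    loc = property(_unsupported("loc", "no row labels; use a boolean mask or query()"))
    iloc = property(_unsupported("iloc", "no row positions; use head() or a boolean mask"))
    index = property(_unsupported("index", "no row labels"))

    # Method-style entries fail when called.
    apply = _unsupported("apply", "use a UDF or map_batches")
    iterrows = _unsupported("iterrows", "use to_arrow_table() or to_pandas()")

    def drop(self, columns, inplace=False):
        if inplace:
            # Ignored with a warning instead of raising.
            warnings.warn("inplace=True is ignored; rebind the result", UserWarning, stacklevel=2)
        return DataFrame(c for c in self.columns if c not in columns)


df = DataFrame(["a", "b"])
try:
    df.iloc[0]
except NotImplementedError as exc:
    message = str(exc)
print(message)
```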

### 4.5 The `Expr` escape hatch

`Expr` exists for advanced users who want explicit DataFusion expressions, and is exposed as `sd.expr.col("x")` and `sd.expr.lit(v)`. Documentation foregrounds `Series` / `DataFrame`; `Expr` shows up only in the advanced section.

## 5. Implementation notes

### 5.1 Type layering

```
sd.Series    ─ wraps an internal Expr bound to a DataFrame
sd.Scalar    ─ deferred 0-D Expr; auto-materializes on Python coercion
sd.DataFrame ─ wraps a DataFusion LogicalPlan
sd.expr.Expr ─ thin Python view on datafusion_expr::Expr (escape hatch)
```

`Series` is conceptually a `(DataFrame, Expr)` pair. An operation across two `Series` requires both to share the same parent `DataFrame`; otherwise it raises a clear error instead of performing an implicit join.
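The shared-parent rule can be illustrated with a toy model in which a `Series` stores its parent and an expression string. This is a sketch, not the planned implementation: the string-based expressions, the `_combine` helper, and the choice of `ValueError` are assumptions made for the example.

```python
class DataFrame:
    """Toy parent frame; identity is all that matters here."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, column):
        return Series(self, column)


class Series:
    """Toy (DataFrame, Expr) pair, with the Expr kept as a string."""

    def __init__(self, parent, expr):
        self.parent = parent
        self.expr = expr

    def _combine(self, op, other):
        if isinstance(other, Series):
            if other.parent is not self.parent:
                # No implicit join: the caller must merge the frames first.
                raise ValueError(
                    f"cannot combine Series from {self.parent.name!r} and "
                    f"{other.parent.name!r}; merge the DataFrames first"
                )
            rhs = other.expr
        else:
            rhs = repr(other)  # plain Python value becomes a literal
        return Series(self.parent, f"({self.expr} {op} {rhs})")

    def __add__(self, other):
        return self._combine("+", other)

    def __sub__(self, other):
        return self._combine("-", other)


left = DataFrame("left")
right = DataFrame("right")

same_parent = left["x"] + left["y"] - 1
try:
    left["x"] + right["x"]
except ValueError as exc:
    error = str(exc)
print(same_parent.expr)
print(error)
```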

### 5.2 Deferred scalars

A reduction returns a `Scalar` — a 0-D Expr bound to its parent DataFrame. Two paths:

1. **Composed back into Expr / Series / DataFrame.** Stays lazy and folds into the surrounding plan. `df.assign(centered=df["x"] - df["x"].mean())` compiles to one scan.
2. **Coerced to a Python value** (`__float__`, `__int__`, `__bool__`, `__repr__`, comparison vs. non-Expr) — triggers materialization via a one-row collect, with per-`Scalar` memoization so repeated coercions don't re-execute.

Boundary rule: SedonaDB-types-only operands stay lazy; coercion at any Python-protocol boundary materializes.
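The coercion path with per-`Scalar` memoization can be sketched as follows. The `Scalar` here is a toy: a zero-argument callable stands in for the one-row collect, the `executions` counter exists only to make the memoization observable, and the lazy composition path is not modeled.

```python
_UNSET = object()  # sentinel: distinguishes "not computed yet" from a computed None


class Scalar:
    """Toy deferred scalar that runs its computation at most once."""

    def __init__(self, compute):
        self._compute = compute  # stand-in for a one-row collect
        self._value = _UNSET
        self.executions = 0

    def _materialize(self):
        if self._value is _UNSET:
            self.executions += 1
            self._value = self._compute()
        return self._value

    # Python-protocol boundaries: each one forces materialization.
    def __float__(self):
        return float(self._materialize())

    def __bool__(self):
        return bool(self._materialize())

    def __repr__(self):
        return repr(self._materialize())

    def __gt__(self, other):
        if isinstance(other, Scalar):
            return NotImplemented  # Scalar vs Scalar would stay lazy; not modeled here
        return self._materialize() > other


data = [1, 2, 3, 4]
m = Scalar(lambda: sum(data) / len(data))

before = m.executions  # nothing has run yet
as_float = float(m)    # first coercion executes the computation
as_text = repr(m)      # later coercions reuse the cached value
is_big = m > 2
after = m.executions
print(before, as_float, as_text, is_big, after)
```

A sentinel is used instead of `None` for the cache so that a reduction whose result is NULL is also computed only once.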

### 5.3 Rust side (PyO3)

Add to `python/sedonadb/src/`:

- `expr.rs` — `InternalExpr` holding `datafusion_expr::Expr`. Operator wrappers, `alias`, `cast`, `is_null`, etc. Pattern lifted from `r/sedonadb/src/rust/src/expression.rs`.
- `series.rs` — `InternalSeries` (DataFrame handle + Expr handle); reductions drive single-column aggregations through `InternalDataFrame`.
- Extend `InternalDataFrame` with: `select`, `filter`, `with_column`, `drop`, `rename`, `sort`, `distinct`, `union`, `join`, `aggregate`.

Each new method is a thin wrapper over the corresponding DataFusion `DataFrame` method — no new query-engine code.

### 5.4 `__setitem__` semantics

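`df["x"] = expr` is sugar for `df = df.assign(x=expr)`. The underlying plans are immutable. Because Python's `__setitem__` cannot rebind the caller's variable, the wrapper object swaps the plan it holds, so every name bound to that wrapper sees the new column. `inplace=True` is ignored with a warning.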
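Python passes the object, not the caller's variable, to `__setitem__`, so one way to get the rebinding effect is for the wrapper to replace the immutable plan it points to. The sketch below assumes that approach; `Plan`, its tuple-of-columns representation, and the `_plan` attribute are hypothetical stand-ins for the DataFusion plan held on the Rust side.

```python
class Plan:
    """Immutable stand-in for a logical plan: a tuple of (name, expression) pairs."""

    def __init__(self, columns):
        self.columns = tuple(columns)

    def with_column(self, name, expr):
        # Returns a new Plan; an existing column of the same name is replaced.
        kept = tuple(c for c in self.columns if c[0] != name)
        return Plan(kept + ((name, expr),))


class DataFrame:
    def __init__(self, plan):
        self._plan = plan

    @property
    def columns(self):
        return [name for name, _ in self._plan.columns]

    def assign(self, **exprs):
        plan = self._plan
        for name, expr in exprs.items():
            plan = plan.with_column(name, expr)
        return DataFrame(plan)  # new wrapper, original untouched

    def __setitem__(self, name, expr):
        # Sugar for df = df.assign(name=expr): swap the plan this wrapper holds.
        self._plan = self.assign(**{name: expr})._plan


df = DataFrame(Plan([("x", "x")]))
alias = df
original_plan = df._plan

df["new"] = "x + 1"
other = df.assign(doubled="x * 2")
print(df.columns, alias.columns, other.columns, original_plan.columns)
```

Under this approach the old plan object is never modified, but any other name bound to the same wrapper (`alias` above) observes the added column, which is worth stating in the docs.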

### 5.5 NULL vs NaN

SQL NULL throughout. `Series.isna()` matches both NULL and NaN. NULL propagates through arithmetic, and comparisons and boolean operators follow SQL three-valued logic. Documented in the cookbook's "porting from pandas" section.
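The difference between SQL NULL and floating-point NaN is easy to check directly. In the sketch below, SQLite from the Python standard library stands in for the engine (DataFusion's results are assumed to match for these basic cases), and `isna` is a hypothetical helper showing a predicate that matches both NULL and NaN.

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")


def sql_value(expression):
    """Evaluate a single SQL expression; SQL NULL comes back as Python None."""
    return con.execute(f"SELECT {expression}").fetchone()[0]


# NULL propagates through arithmetic and comparisons.
null_plus_one = sql_value("NULL + 1")
null_eq_null = sql_value("NULL = NULL")

# Three-valued logic: the known operand can still decide the result.
null_and_false = sql_value("NULL AND 0")
null_or_true = sql_value("NULL OR 1")
null_and_true = sql_value("NULL AND 1")

# NaN is an ordinary float value with its own comparison rule.
nan = float("nan")
nan_plus_one = nan + 1
nan_eq_nan = nan == nan


def isna(value):
    """Hypothetical predicate that treats both NULL (None) and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


print(null_plus_one, null_eq_null, null_and_false, null_or_true, null_and_true)
print(nan_plus_one, nan_eq_nan, isna(None), isna(nan), isna(0.0))
```

The practical consequence for ported code is that a filter such as `x = x` drops NULL rows in SQL, whereas pandas code often relies on NaN flowing through arithmetic and being handled at the end.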

## 6. Phases

| Phase | Content | Mirrors R prior art |
|---|---|---|
| **P1 — Expression layer** | `InternalExpr` (PyO3) wrapping `datafusion_expr::Expr`. `col`, `lit`, operator overloads, `alias`, `cast`, `is_null`, `isin`. Wire into existing `InternalDataFrame`. | `r/sedonadb/src/rust/src/expression.rs` + `R/expression.R` |
| **P2 — Core DataFrame ops** | `select` via `__getitem__`, `filter` via boolean mask, `query(str)`, `assign`, `drop(columns=)`, `rename(columns=)`, `sort_values`, `head`/`tail`, `__setitem__`. | `r/sedonadb/R/dataframe.R` |
| **P3 — `Series` + `Scalar`** | `Series` as `(DataFrame, Expr)`. Element-wise (`isna`, `fillna`, `astype`, `isin`, `between`, `clip`). Lazy reductions returning `Scalar` with auto-materializing dunders. `unique`, `value_counts`. `str.*`, `dt.*` accessors. | new — no R analog |
| **P4 — Joins / groupby / combine** | `merge`/`join`, `groupby().agg()`, `drop_duplicates`, `concat`, `describe`, `info`, `shape`, `empty`. | partial — see PR #781 |
| **P5 — Top-level + polish** | `sd.DataFrame`, `sd.read_csv`, `sd.read_json`, `sd.concat`. `NotImplementedError` stubs for `.loc`/`.iloc`/`.apply`/`.iterrows`. Cookbook entries written in parallel with P3–P4. | — |

GeoPandas facade becomes its own future ticket, gated on demand signal after P5 ships.

## 7. Risks

- **Deferred-scalar surprise.** Users may not realize `s.mean()` is lazy. Mitigate via a `repr` that prints the materialized value (REPL `>>> s.mean()` does the right thing) and a clear note in `Scalar.__doc__`.
- **Coercion-boundary edge cases.** `Scalar` × numpy scalar / `Decimal` / Expr from a different parent — codify in tests up front.
- **Repeated coercion of the same `Scalar`** must not re-execute — memoize on first coercion.
- **NULL vs NaN edge cases** when porting pandas code that relies on NaN propagation. Possible later: `compat="pandas"` mode at the Arrow boundary.
- **`__eq__` overload on `Series`** returns an expression rather than a `bool`, which breaks hashing and makes `in` / set / dict membership checks unreliable. Standard DataFrame-library trade-off; document.
- **Breadth vs depth.** Pandas surface is huge. Ship the listed methods first; non-listed methods raise `NotImplementedError` with a clear pointer rather than partial implementations that silently differ from pandas.
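The `__eq__` risk in the list above follows from two standard Python rules: a class that defines `__eq__` without `__hash__` becomes unhashable, and list membership tests fall back to the truthiness of whatever `==` returns. A minimal demonstration with a toy `Series` (the string-based expression is an assumption made for the example):

```python
class Series:
    """Toy Series whose == builds an expression instead of returning a bool."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        rhs = other.expr if isinstance(other, Series) else repr(other)
        return Series(f"({self.expr} = {rhs})")

    # No __hash__ is defined, so Python sets __hash__ to None for this class.


a = Series("x")
b = Series("y")

comparison = a == b  # an expression object, not True or False

# List membership calls == and then takes the truth value of the result.
# The result is an ordinary object, hence truthy, so unrelated Series "match".
found = a in [b]

try:
    {a}  # sets and dict keys need a hash
    hashable = True
except TypeError:
    hashable = False

print(comparison.expr, found, hashable)
```

pandas avoids the silent false positive by raising `ValueError` when the truth value of a `Series` is requested; whether the SedonaDB `Series` should do the same is left open here.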

## 8. Non-goals (in this scope)

- Row labels, `Index`, `RangeIndex`, `MultiIndex`.
- `.loc`, `.iloc`, `.at`, `.iat`.
- `.apply(func, axis=1)`, `.applymap`, `.itertuples`, `.iterrows`.
- `.plot()` — users go through `to_pandas()`.
- True in-place mutation (`inplace=True` is ignored with a warning).
- **GeoPandas facade** — `.geometry`, `.crs`, `.to_crs`, `.sjoin`, geometry methods on `Series`. Tracked separately; gated on demand after P5.
- `GeoSeries`, `GeoDataFrame` types.


