Status: Draft — trimmed to the relational pandas surface; GeoPandas facade deferred pending review feedback below.
0. Scope
Review feedback challenged the GeoPandas compatibility layer: most of SedonaDB's perf comes from rethinking workflows in the relational model, and the locked non-goals (no Index, no .apply, no .iloc) make a pure import-change migration impossible anyway. This issue keeps the relational pandas surface and defers the GeoPandas-specific facade (.geometry, .crs, .to_crs, .sjoin, geometry methods on Series) to a separate, demand-gated effort.
Cookbook entries that show how to rewrite GeoPandas workflows into the relational model run alongside the API build — each entry doubles as a mini design doc for the methods it exercises.
The in-flight prior art is the R bindings: PR #468 added r/sedonadb/R/expression.R (454 LoC) + r/sedonadb/src/rust/src/expression.rs (8.5 KB). The Python expression layer mirrors that pattern.
1. Motivation
SedonaDB's Python package today is SQL-driven: users get a lazy DataFrame from sd.sql("SELECT ..."), sd.read_parquet(...), or sd.create_data_frame(...), and the only way to transform it is to write another SQL string. The target audience — Python data scientists who use pandas — expects a different surface entirely: a DataFrame they can index with df["col"], a Series with .mean() / .str.lower() / .fillna(), a groupby().agg(...) chain, and df.merge(other, on=...).
Goal: make the relational surface of the SedonaDB Python interface as close to pandas as is feasible on top of a SQL-backed lazy engine. Where pandas idioms force a real semantic conflict with the engine, follow pandas at the surface and document the deviation. Where they would gut the engine, refuse, and offer a clear alternative.
Explicit non-goals (locked):
- No row labels / Index / .loc / .iloc / .reset_index / .set_index.
- No .apply(func, axis=1): it would force per-row Python over a SQL plan.
- No GeoPandas facade in this scope. Tracked separately; gated on demand signal.
2. Current state
Relevant files:
- python/sedonadb/python/sedonadb/context.py: SedonaContext.
- python/sedonadb/python/sedonadb/dataframe.py: DataFrame (lazy SQL-driven; limit, head, count, show, to_arrow_table, to_pandas, to_parquet, to_view, with_params, schema, columns, explain).
- python/sedonadb/src/dataframe.rs: InternalDataFrame PyO3 wrapper around datafusion::prelude::DataFrame.
- python/sedonadb/python/sedonadb/expr/literal.py: Literal / lit() (only existing expression type).
- r/sedonadb/R/expression.R, r/sedonadb/src/rust/src/expression.rs: prior art for the expression-translation pattern.
Working in our favor:
- The Rust side already holds a full DataFusion DataFrame with select, filter, with_column, join, sort, aggregate, etc. We are exposing existing capability.
- Schema already tracks geometry columns and CRS (carry-through, not feature work).
- to_pandas() already returns a usable frame.
3. Design decisions
| # | Decision |
| 1 | Pandas-first surface. Method names, signatures, and behavior follow pandas wherever feasible (groupby, merge, rename(columns=), drop_duplicates, assign, query, tail, describe, info, shape). |
| 2 | Series is a first-class type. df["x"] returns Series. Expr becomes an internal detail most users never type. |
| 3 | Same DataFrame class for tabular and geospatial. Geometry awareness via dtype + schema metadata only: no GeoDataFrame subclass, no GeoSeries type. |
| 4 | Everything is lazy, including reductions. s.mean() returns a deferred Scalar (a 0-D Expr). Auto-materializes only when coerced to a concrete value (float(), print(), if, comparison against a non-Expr). When composed back into another Series / DataFrame expression, stays lazy and folds into the surrounding plan. |
| 5 | DataFrame transforms stay lazy. df.query(...).assign(...).groupby(...).agg(...) builds a single plan. Materialization only at Python-coercion of a Scalar, to_pandas(), show(), to_parquet(), etc. |
| 6 | Copy the R expression layer. Mirror r/sedonadb/src/rust/src/expression.rs into python/sedonadb/src/expr.rs. Don't pre-factor a shared crate; re-evaluate only if R/Python drift becomes painful. |
| 7 | SQL NULL semantics, not pandas NaN. .isna() matches both NULL and NaN; arithmetic uses SQL three-valued logic. |
| 8 | GeoPandas facade deferred. Cookbook docs serve the migration story for now. Re-evaluate after we see who adopts the relational pandas surface. |
| 9 | Pandas-shaped surface ships as a submodule of the existing sedonadb package, not a separate PyPI distribution. Pandas-flavored methods live directly on sedonadb.DataFrame / sedonadb.Series; sedonadb.pandas is a discoverability namespace that re-exports the same types and toplevel constructors so import sedonadb.pandas as pd; pd.read_csv(...) works for users coming from pandas. Same release cadence. Code lives under python/sedonadb/python/sedonadb/pandas/. |
| 10 | Cookbook track runs alongside the API build. Each cookbook entry is a real GeoPandas workflow rewritten in the relational model and doubles as a mini design doc for the methods that workflow exercises. Cookbook entries drive method-prioritization within P3/P4. Lives under docs/cookbook/. |
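Decisions 4 and 5 can be illustrated with a toy model. The sketch below is pure Python and is not SedonaDB code: LazyFrame, its tuple-of-steps plan, and the lambda predicates are all hypothetical stand-ins, chosen only to show that chained transforms record steps and that work happens at a single materialization point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical stand-in for a lazy DataFrame: source rows plus recorded steps."""

    source: tuple
    steps: tuple = ()

    def query(self, predicate):
        # Record the filter; nothing runs yet.
        return LazyFrame(self.source, self.steps + (("filter", predicate),))

    def assign(self, **columns):
        # Record the new columns; nothing runs yet.
        return LazyFrame(self.source, self.steps + (("assign", columns),))

    def explain(self):
        return " -> ".join(["scan"] + [name for name, _ in self.steps])

    def to_rows(self):
        # Materialization point: replay every recorded step, in order.
        rows = [dict(row) for row in self.source]
        for name, arg in self.steps:
            if name == "filter":
                rows = [row for row in rows if arg(row)]
            else:
                rows = [{**row, **{k: fn(row) for k, fn in arg.items()}} for row in rows]
        return rows


df = LazyFrame(({"x": 1}, {"x": -2}, {"x": 3}))
plan = df.query(lambda row: row["x"] > 0).assign(doubled=lambda row: row["x"] * 2)
```

Here plan.explain() describes the whole chain as one plan, and only plan.to_rows() touches the data; df itself is unchanged by the chain.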
4. The user surface (this scope)
4.1 Top-level
import sedonadb as sd
df = sd.DataFrame({"x": [1, 2, 3]}) # alias for create_data_frame
df = sd.read_parquet("...")
df = sd.read_csv("...")
df = sd.read_json("...")
df = sd.sql("SELECT ...") # SQL escape hatch — still there
4.2 DataFrame — pandas-spelled methods
# Inspection
df.shape; df.columns; df.dtypes; df.empty; df.info(); df.describe()
# Selection / projection
df["x"] # -> Series
df[["x", "y"]] # -> DataFrame
df[df["x"] > 0] # -> DataFrame (boolean mask)
df.query("x > 0 and y < 10") # -> DataFrame (string predicate)
# Mutation-by-rebind (immutable under the hood)
df["new"] = df["x"] + 1
df = df.assign(new=df["x"] + 1, doubled=df["x"] * 2)
df = df.drop(columns=["a", "b"])
df = df.rename(columns={"a": "x"})
# Sorting / dedup
df = df.sort_values(by="x", ascending=False)
df = df.drop_duplicates()
df = df.head(10); df = df.tail(10)
# Combine
df = df.merge(other, on="key", how="inner") # alias: join
df = sd.concat([df1, df2])
# Group-aggregate
df.groupby("k").agg({"x": "sum", "y": "mean"})
df.groupby(["k1", "k2"]).sum()
df.groupby("k").size()
# Materialize
df.to_pandas(); df.to_arrow_table(); df.to_parquet(...); df.to_csv(...); df.to_json(...)
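One way df.query(...) could accept a pandas-style string is to parse it with Python's ast module and emit a relational predicate. The sketch below is an assumption, not the planned implementation: query_to_sql is a hypothetical helper, it emits a SQL string (a real version would more likely build engine expressions), and it handles only column names, numeric literals, comparisons, and and/or.

```python
import ast

_COMPARE = {ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<", ast.LtE: "<=",
            ast.Eq: "=", ast.NotEq: "<>"}
_BOOLEAN = {ast.And: "AND", ast.Or: "OR"}


def query_to_sql(text):
    """Translate a small pandas-style predicate into a SQL boolean expression."""
    return _emit(ast.parse(text, mode="eval").body)


def _emit(node):
    if isinstance(node, ast.BoolOp):
        joiner = f" {_BOOLEAN[type(node.op)]} "
        return "(" + joiner.join(_emit(value) for value in node.values) + ")"
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        op = _COMPARE.get(type(node.ops[0]))
        if op is not None:
            return f"{_emit(node.left)} {op} {_emit(node.comparators[0])}"
    if isinstance(node, ast.Name):
        return f'"{node.id}"'  # column reference, quoted as a SQL identifier
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return repr(node.value)
    # Anything outside the supported subset is rejected rather than guessed at.
    raise NotImplementedError(f"unsupported syntax in query(): {ast.dump(node)}")


predicate = query_to_sql("x > 0 and y < 10")

try:
    query_to_sql("x.upper() == 'A'")
    rejected = False
except NotImplementedError:
    rejected = True
```

Rejecting unsupported syntax outright matches the stance taken elsewhere in this issue: raise with a clear pointer rather than silently differ from pandas.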
4.3 Series — pandas-spelled methods
s = df["x"]
# Reductions — return DEFERRED scalars (auto-materialize on Python coercion)
m = s.mean() # Scalar (lazy)
print(m); float(m) # triggers materialization
df.assign(z=df["x"] - m) # stays lazy, folds m into the plan, one scan
s.sum(); s.min(); s.max(); s.std(); s.var(); s.median()
s.count(); s.nunique()
# Element-wise (lazy — return Series)
s.isna(); s.notna(); s.fillna(0); s.astype("float64")
s.isin([1, 2, 3]); s.between(0, 10); s.clip(lower=0, upper=100)
# Frequency
s.unique(); s.value_counts()
# Boolean mask via operators
mask = (s > 0) & s.notna()
# Accessors
s.str.lower(); s.str.contains("foo"); s.str.replace("a", "b")
s.dt.year; s.dt.month; s.dt.floor("h")
4.4 What's deliberately not pandas-compatible
These raise NotImplementedError with a one-line pointer to the migration cookbook:
- df.loc[...], df.iloc[...], df.at[...], df.iat[...]
- df.index, df.reset_index(), df.set_index()
- df.apply(func, axis=1): point users to UDFs / map_batches
- df.iterrows(), df.itertuples(): point to to_arrow_table() / to_pandas()
- In-place mutation is the one exception: inplace=True does not raise; it is ignored with a warning.
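The stubs above could be declared compactly with a descriptor. This is a minimal sketch under stated assumptions: the Unsupported class, the message wording, and the hint strings are hypothetical; only the docs/cookbook/ location and the UDFs / map_batches pointer come from this issue.

```python
COOKBOOK = "docs/cookbook/"  # cookbook location named in this issue


class Unsupported:
    """Descriptor for attributes such as .loc that raise on instance access."""

    def __init__(self, hint):
        self.hint = hint

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # class-level access (e.g. introspection) stays harmless
        raise NotImplementedError(
            f"DataFrame.{self.name} is not supported: {self.hint}. See {COOKBOOK}"
        )


class DataFrame:
    """Hypothetical stand-in showing only the unsupported surface."""

    loc = Unsupported("row labels are a non-goal")
    iloc = Unsupported("positional row access is a non-goal")

    def apply(self, func, axis=0):
        raise NotImplementedError(
            f"DataFrame.apply is not supported: use UDFs / map_batches. See {COOKBOOK}"
        )


try:
    DataFrame().loc
except NotImplementedError as exc:
    message = str(exc)
```

One caveat of this approach: hasattr(df, "loc") propagates the NotImplementedError instead of returning False, because hasattr only swallows AttributeError.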
4.5 The Expr escape hatch
Expr exists for advanced users who want explicit DataFusion expressions, exposed as sd.expr.col("x") and sd.expr.lit(v). Documentation foregrounds Series / DataFrame; Expr shows up only in the advanced section.
5. Implementation notes
5.1 Type layering
sd.Series ─ wraps an internal Expr bound to a DataFrame
sd.Scalar ─ deferred 0-D Expr; auto-materializes on Python coercion
sd.DataFrame ─ wraps a DataFusion LogicalPlan
sd.expr.Expr ─ thin Python view on datafusion_expr::Expr (escape hatch)
Series is conceptually (DataFrame, Expr). Operations across two Series require they share a parent — clear error otherwise (no implicit join).
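The shared-parent rule can be sketched in a few lines. Everything here is a hypothetical stand-in: Frame replaces sd.DataFrame, expressions are plain strings, and "same parent" is assumed to mean object identity, which a real implementation might relax (for example, to tolerate a Series taken before an assign).

```python
class Frame:
    """Hypothetical stand-in for sd.DataFrame; only a name is tracked."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, column):
        return Series(self, f'"{column}"')


class Series:
    """A (parent, expression) pair; the expression is a plain string in this sketch."""

    def __init__(self, parent, expr):
        self.parent = parent
        self.expr = expr

    def _binary(self, op, other):
        if isinstance(other, Series):
            if other.parent is not self.parent:
                # No implicit join: fail loudly instead of guessing an alignment.
                raise ValueError(
                    f"cannot combine Series from {self.parent.name!r} and "
                    f"{other.parent.name!r}: merge the DataFrames first"
                )
            rhs = other.expr
        else:
            rhs = repr(other)  # Python literal
        return Series(self.parent, f"({self.expr} {op} {rhs})")

    def __add__(self, other):
        return self._binary("+", other)

    def __sub__(self, other):
        return self._binary("-", other)


a, b = Frame("a"), Frame("b")
total = a["x"] + a["y"] - 1
try:
    a["x"] + b["x"]
except ValueError as exc:
    error = str(exc)
```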
5.2 Deferred scalars
A reduction returns a Scalar — a 0-D Expr bound to its parent DataFrame. Two paths:
- Composed back into Expr / Series / DataFrame. Stays lazy and folds into the surrounding plan. df.assign(centered=df["x"] - df["x"].mean()) compiles to one scan.
- Coerced to a Python value (__float__, __int__, __bool__, __repr__, comparison vs. non-Expr): triggers materialization via a one-row collect, with per-Scalar memoization so repeated coercions don't re-execute.
Boundary rule: SedonaDB-types-only operands stay lazy; coercion at any Python-protocol boundary materializes.
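The two paths and the memoization requirement can be sketched as follows. This is a pure-Python illustration, not the proposed class: the collect callable stands in for the one-row query, and fake_collect counts how often it runs.

```python
_UNSET = object()


class Scalar:
    """Deferred 0-D value: runs `collect` at most once, on first coercion."""

    def __init__(self, expr, collect):
        self.expr = expr
        self._collect = collect
        self._value = _UNSET

    def value(self):
        if self._value is _UNSET:
            self._value = self._collect()  # stands in for the one-row collect
        return self._value

    def __float__(self):
        return float(self.value())

    def __bool__(self):
        return bool(self.value())

    def __repr__(self):
        return repr(self.value())

    def __gt__(self, other):
        if isinstance(other, Scalar):
            # Both operands are deferred: compose, do not execute.
            return Scalar(f"({self.expr} > {other.expr})",
                          lambda: self.value() > other.value())
        return self.value() > other  # non-Expr operand: materialize now


runs = []


def fake_collect():
    runs.append("collect")  # count executions
    return 2.0


mean = Scalar("avg(x)", fake_collect)
lazy = mean > Scalar("0.5", lambda: 0.5)  # stays deferred
runs_before = len(runs)
first, second = float(mean), float(mean)  # second call reuses the cached value
```

In this sketch the comparison against another Scalar builds a new deferred value, while the comparison against a plain number returns a Python bool immediately, which is the boundary rule stated above.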
5.3 Rust side (PyO3)
Add to python/sedonadb/src/:
- expr.rs: InternalExpr holding datafusion_expr::Expr. Operator wrappers, alias, cast, is_null, etc. Pattern lifted from r/sedonadb/src/rust/src/expression.rs.
- series.rs: InternalSeries (DataFrame handle + Expr handle); reductions drive single-column aggregations through InternalDataFrame.
- Extend InternalDataFrame with: select, filter, with_column, drop, rename, sort, distinct, union, join, aggregate.
Each new method is a thin wrapper over the corresponding DataFusion DataFrame method — no new query-engine code.
5.4 __setitem__ semantics
df["x"] = expr is sugar for df = df.assign(x=expr). All DataFrames are immutable; __setitem__ rebinds the local name. inplace=True is ignored with a warning.
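A Python __setitem__ cannot literally rebind the caller's variable, so one plausible reading is that the wrapper object swaps its reference to an immutable plan. The sketch below assumes exactly that; the DataFrame class, the _plan attribute, and the tuple-of-steps plan are hypothetical.

```python
class DataFrame:
    """Hypothetical wrapper: the plan is an immutable tuple of (column, expression) steps."""

    def __init__(self, plan=()):
        self._plan = tuple(plan)

    def assign(self, **columns):
        # Pure: returns a new wrapper and leaves this one untouched.
        return DataFrame(self._plan + tuple(columns.items()))

    def __setitem__(self, name, expr):
        # Python cannot rebind the caller's variable from here, so this sketch
        # points the existing wrapper at the plan that assign() would build.
        self._plan = self.assign(**{name: expr})._plan


df = DataFrame()
alias = df
original_plan = df._plan
df["new"] = "x + 1"
```

Under this assumption the old plan object is never modified, but any other name bound to the same wrapper (alias above) also sees the new column, which differs from df = df.assign(...).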
5.5 NULL vs NaN
SQL NULL throughout. Series.isna() matches both NULL and NaN. Arithmetic propagates NULL, and comparisons and boolean operators follow SQL three-valued logic. Documented in the cookbook's "porting from pandas" section.
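For readers porting from pandas, the rules can be stated as three small functions. This is standard SQL behavior modeled in plain Python, with None standing in for NULL / UNKNOWN; the function names are illustrative and not part of any proposed API.

```python
import math


def isna(value):
    """True for SQL NULL (None here) and for floating-point NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def sql_add(a, b):
    """Arithmetic with a NULL operand yields NULL."""
    return None if a is None or b is None else a + b


def sql_and(a, b):
    """SQL three-valued AND; None stands for UNKNOWN."""
    if a is False or b is False:
        return False  # FALSE dominates, even against UNKNOWN
    if a is None or b is None:
        return None
    return True
```

The case most likely to surprise pandas users is sql_and(None, False), which is False rather than missing.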
6. Phases
| Phase | Content | Mirrors R prior art |
| P1: Expression layer | InternalExpr (PyO3) wrapping datafusion_expr::Expr. col, lit, operator overloads, alias, cast, is_null, isin. Wire into existing InternalDataFrame. | r/sedonadb/src/rust/src/expression.rs + R/expression.R |
| P2: Core DataFrame ops | select via __getitem__, filter via boolean mask, query(str), assign, drop(columns=), rename(columns=), sort_values, head/tail, __setitem__. | r/sedonadb/R/dataframe.R |
| P3: Series + Scalar | Series as (DataFrame, Expr). Element-wise (isna, fillna, astype, isin, between, clip). Lazy reductions returning Scalar with auto-materializing dunders. unique, value_counts. str.*, dt.* accessors. | new (no R analog) |
| P4: Joins / groupby / combine | merge/join, groupby().agg(), drop_duplicates, concat, describe, info, shape, empty. | partial (see PR #781) |
| P5: Toplevel + polish | sd.DataFrame, sd.read_csv, sd.read_json, sd.concat. NotImplementedError stubs for .loc/.iloc/.apply/.iterrows. Cookbook entries written in parallel with P3–P4. | - |
GeoPandas facade becomes its own future ticket, gated on demand signal after P5 ships.
7. Risks
- Deferred-scalar surprise. Users may not realize s.mean() is lazy. Mitigate via a repr that prints the materialized value (REPL >>> s.mean() does the right thing) and a clear note in Scalar.__doc__.
- Coercion-boundary edge cases. Scalar × numpy scalar / Decimal / Expr from a different parent: codify in tests up front.
- Repeated coercion of the same Scalar must not re-execute; memoize on first coercion.
- NULL vs NaN edge cases when porting pandas code that relies on NaN propagation. Possible later: compat="pandas" mode at the Arrow boundary.
- __eq__ overload on Series collides with set / dict membership. Standard DataFrame-library trade-off; document.
- Breadth vs depth. Pandas surface is huge. Ship the listed methods first; non-listed methods raise NotImplementedError with a clear pointer rather than partial implementations that silently differ from pandas.
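The __eq__ risk is a consequence of Python's data model, not anything specific to this design. A minimal demonstration, with hypothetical Series and Mask stand-ins:

```python
class Mask:
    """Stand-in for the boolean Series an element-wise comparison returns."""

    def __init__(self, sql):
        self.sql = sql


class Series:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Element-wise comparison: returns an expression, not a bool.
        rhs = other.name if isinstance(other, Series) else repr(other)
        return Mask(f"{self.name} = {rhs}")


mask = Series("x") == 3

# `in` calls __eq__ and takes the truthiness of the result; a plain object is
# truthy, so membership reports a match that is not there.
found = Series("a") in [Series("b")]

# Defining __eq__ without __hash__ sets __hash__ to None, so sets and dict keys fail.
try:
    {Series("x")}
except TypeError as exc:
    set_error = str(exc)
```

If Mask defined __bool__ to raise, as pandas does for ambiguous truth values, the membership test would raise instead of silently returning True; either way the behavior needs documenting.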
8. Non-goals (in this scope)
- Row labels, Index, RangeIndex, MultiIndex.
- .loc, .iloc, .at, .iat.
- .apply(func, axis=1), .applymap, .itertuples, .iterrows.
- .plot(): users go through to_pandas().
- True in-place mutation (inplace=True is ignored with a warning).
- GeoPandas facade: .geometry, .crs, .to_crs, .sjoin, geometry methods on Series. Tracked separately; gated on demand after P5.
- GeoSeries, GeoDataFrame types.