Add a native Box2D type for bounding boxes #2877

@jiayuasu

Tracking issue for adding a native Box2D bounding-box type to Sedona. Each child issue corresponds to one PR.

Background

Sedona has no first-class bounding-box value type. ST_Envelope returns a polygon Geometry, and users reconstruct bboxes via ST_XMin / ST_XMax / ST_YMin / ST_YMax. This is awkward for common operations such as bbox-from-geometry, dataset extent, GeoParquet covering columns, and partition pruning.

Sister project apache/sedona-db has an internal BoundingBox (rust/sedona-geometry/src/bounding_box.rs) but doesn't expose it as a SQL type. PostGIS has box2d / box3d. GeoParquet 1.1 standardizes a struct<xmin, ymin, xmax, ymax> bbox covering column, which Sedona already reads/writes as a raw struct (GeoParquetMetaData.scala, GeoParquetSpatialFilter.scala).

Plan

Add Box2D as a native value type. Phase 1 covers the Spark/JVM side, Python and Flink mirrors, and GeoParquet writer integration. Box3D and geography bboxes are out of scope and tracked as follow-ups.

Type

Box2DUDT is a struct-backed UDT with sqlType = struct<xmin: double, ymin: double, xmax: double, ymax: double> (all fields non-nullable). It is struct-backed rather than binary-backed so that values round-trip natively to Parquet and align zero-copy with GeoParquet 1.1 bbox covering columns.

Field names match the GeoParquet 1.1 spec and sedona-db's GeoParquet writer.

A Box2D value is always a valid finite bbox. Absence of a bbox (e.g. ST_Box2D of an empty geometry, ST_Extent over zero rows) is represented by SQL NULL at the column level, not by an in-band sentinel. This matches PostGIS behavior (where Box2D(POINT EMPTY) returns NULL) and leaves xmin > xmax reserved for future antimeridian-wraparound semantics on geography bboxes (cf. sedona-db's WraparoundInterval, S2's S2LatLngRect).
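The NULL-versus-sentinel rule can be sketched in plain Python. This is only an illustration of the semantics described above, not Sedona's implementation: the `Box2D` dataclass, `box2d_of`, and `extent` are hypothetical names, `None` stands in for SQL NULL, and two details are assumptions rather than statements from this issue: that `ymin > ymax` is rejected alongside `xmin > xmax`, and that the aggregate skips NULL inputs.

```python
from dataclasses import dataclass
from math import isfinite
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Box2D:
    """Hypothetical in-memory model of a Box2D value (not Sedona's API)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def __post_init__(self) -> None:
        coords = (self.xmin, self.ymin, self.xmax, self.ymax)
        if not all(isfinite(c) for c in coords):
            raise ValueError("Box2D coordinates must be finite")
        # xmin > xmax is reserved for possible wraparound semantics, so a
        # planar Box2D rejects it; rejecting ymin > ymax is an assumption.
        if self.xmin > self.xmax or self.ymin > self.ymax:
            raise ValueError("Box2D requires xmin <= xmax and ymin <= ymax")


def box2d_of(points: Iterable[Tuple[float, float]]) -> Optional[Box2D]:
    """Bbox of a coordinate list; None (SQL NULL) for an empty input."""
    pts = list(points)
    if not pts:
        return None
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return Box2D(min(xs), min(ys), max(xs), max(ys))


def extent(boxes: Iterable[Optional[Box2D]]) -> Optional[Box2D]:
    """Union of boxes; None when there are no non-NULL rows to aggregate."""
    result: Optional[Box2D] = None
    for b in boxes:
        if b is None:  # assumption: NULL inputs are skipped
            continue
        if result is None:
            result = b
        else:
            result = Box2D(
                min(result.xmin, b.xmin), min(result.ymin, b.ymin),
                max(result.xmax, b.xmax), max(result.ymax, b.ymax),
            )
    return result
```

Because absence is carried outside the value, every `Box2D` instance that exists can be used without checking for a sentinel such as an inverted or infinite box.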

Split Box2D / Box3D rather than a unified type with optional Z. Reasons:

  1. GeoParquet 1.1 covering columns are 2D-only. A dedicated Box2D matches the spec bit-for-bit.
  2. Storage: 32 bytes/row vs. ~56 bytes for a unified type with nullable Z. Material cost on ST_Extent shuffles.
  3. Static dispatch for dimension-specific functions (ST_Area(box2d) vs ST_Volume(box3d)).
  4. PostGIS familiarity.
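The 32 bytes/row figure in reason 2 is simply four 8-byte doubles. A minimal sketch, assuming a little-endian packing in GeoParquet field order; `BOX2D_LAYOUT`, `pack_box2d`, and `unpack_box2d` are hypothetical names, and this is not Spark's actual row encoding:

```python
import struct
from typing import Tuple

# Four float64 fields in xmin, ymin, xmax, ymax order. The byte order and
# packing are illustrative only; Spark's internal row format differs.
BOX2D_LAYOUT = struct.Struct("<4d")


def pack_box2d(xmin: float, ymin: float, xmax: float, ymax: float) -> bytes:
    """Serialize the four coordinates into a fixed-width payload."""
    return BOX2D_LAYOUT.pack(xmin, ymin, xmax, ymax)


def unpack_box2d(buf: bytes) -> Tuple[float, float, float, float]:
    """Inverse of pack_box2d."""
    return BOX2D_LAYOUT.unpack(buf)
```

`BOX2D_LAYOUT.size` evaluates to 32, which is the per-row payload a Box2D-only aggregate has to shuffle before any container overhead.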

Box3D is deferred until a concrete need (point clouds, BIM, voxel data) lands.

SQL surface (Phase 1)

| Function | Signature |
| --- | --- |
| ST_Box2D(geom) | Geometry → Box2D (NULL for empty geom) |
| ST_MakeBox2D(point, point) | (Point, Point) → Box2D |
| ST_Extent(geom) aggregate | Geometry → Box2D (NULL over zero rows) |
| ST_XMin / ST_XMax / ST_YMin / ST_YMax(box2d) | Box2D → Double (overload existing accessors) |
| CAST(box2d AS geometry) | Box2D → Polygon |
| ST_AsText(box2d) | Box2D → 'BOX(x1 y1, x2 y2)' |
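The two output conversions can be illustrated with a short sketch. The `BOX(x1 y1, x2 y2)` shape comes from the table above; the function names are hypothetical, and the number formatting and the polygon vertex order are assumptions that the eventual implementation may not share.

```python
def box2d_as_text(xmin: float, ymin: float, xmax: float, ymax: float) -> str:
    """Render a box as 'BOX(x1 y1, x2 y2)'.

    Uses Python's repr() for floats; the real number formatting is an
    assumption here.
    """
    return (
        f"BOX({float(xmin)!r} {float(ymin)!r}, "
        f"{float(xmax)!r} {float(ymax)!r})"
    )


def box2d_to_polygon_wkt(xmin: float, ymin: float, xmax: float, ymax: float) -> str:
    """Render the box as a closed five-vertex rectangle in WKT.

    The starting corner and winding order are assumptions.
    """
    ring = [
        (xmin, ymin),
        (xmin, ymax),
        (xmax, ymax),
        (xmax, ymin),
        (xmin, ymin),  # repeat the first vertex to close the ring
    ]
    coords = ", ".join(f"{float(x)!r} {float(y)!r}" for x, y in ring)
    return f"POLYGON(({coords}))"
```

For example, `box2d_as_text(1, 2, 3, 4)` yields `BOX(1.0 2.0, 3.0 4.0)` under these formatting assumptions.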

ST_Envelope keeps returning a polygon Geometry (no break). ST_Envelope_Aggr is left untouched.

Sub-issues

Foundation

SQL surface

Storage

Cross-language bindings

Out of scope (future phases)

  • ST_Expand(box, dx, dy)
  • Box predicates (ST_BoxIntersects, ST_BoxContains)
  • Implicit geometry → box2d cast
  • Box3D, ST_3DExtent, ST_3DMakeBox, ST_ZMin/ZMax
  • ST_Box2dFromGeoHash, ST_EstimatedExtent
  • Reader-side auto-materialization of GeoParquet bbox covering columns as Box2D
  • Geography bboxes (likely path: reuse Box2D with antimeridian-wraparound semantics on the X axis, encoded via xmin > xmax)
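The wraparound encoding mentioned in the last item can be sketched as follows. This illustrates only the general idea of reading xmin > xmax as an interval that crosses the antimeridian; it is not sedona-db's WraparoundInterval or any planned Sedona API, and `x_interval_contains` is a hypothetical name.

```python
def x_interval_contains(xmin: float, xmax: float, x: float) -> bool:
    """Test whether longitude x lies in [xmin, xmax].

    When xmin <= xmax the interval is ordinary. When xmin > xmax it is
    read as wrapping across the antimeridian, i.e. the union of
    [xmin, +180] and [-180, xmax].
    """
    if xmin <= xmax:
        return xmin <= x <= xmax
    return x >= xmin or x <= xmax
```

With this reading, a box with xmin = 170 and xmax = -170 covers longitude 175 but not longitude 0, which is why a planar Box2D has to keep xmin > xmax unused.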

Coordination with sedona-db

sedona-db's GeoParquet writer uses xmin/ymin/xmax/ymax (Float32), but its st_analyze_agg returns minx/miny/maxx/maxy (Float64). Worth aligning on the Parquet-spec naming as part of this work.
