feat(table): emit geo bounds in manifest DataFile #1371

Open
tanmayrauth wants to merge 2 commits into apache:main from tanmayrauth:feat/992-parquet-geo-statistics
@tanmayrauth tanmayrauth commented Jul 4, 2026

Contributor

Compute geometry/geography bounding boxes from WKB during writes and emit WKB single-point lower/upper bounds into the manifest DataFile, which is what Iceberg pruning consumes. arrow-go cannot emit Parquet column-chunk GeospatialStatistics, so bounds are computed in iceberg-go.

Also fixes a panic when writing geo columns through the stats path, and lets geo bounds round-trip via LiteralFromBytes.

Closes: #992
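
For readers unfamiliar with the bounding-box step described above, here is a minimal sketch of folding WKB vertices into an XY box. It is an illustration under stated assumptions, not the PR's implementation: `Box`, `emptyBox`, `AddWKB`, and `lineStringWKB` are hypothetical names, and only 2D Point and LineString payloads are handled.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

// Box is a hypothetical planar XY bounding box; it is not an iceberg-go type.
type Box struct{ XMin, YMin, XMax, YMax float64 }

// emptyBox starts inverted so the first vertex sets every edge.
func emptyBox() Box {
	return Box{math.Inf(1), math.Inf(1), math.Inf(-1), math.Inf(-1)}
}

// AddWKB widens the box with every vertex of a 2D WKB Point (type 1)
// or LineString (type 2). Other geometry types are rejected.
func (b *Box) AddWKB(wkb []byte) error {
	if len(wkb) < 5 {
		return errors.New("wkb: payload too short")
	}
	var order binary.ByteOrder
	switch wkb[0] {
	case 0:
		order = binary.BigEndian
	case 1:
		order = binary.LittleEndian
	default:
		return fmt.Errorf("wkb: bad byte-order flag %d", wkb[0])
	}
	r := bytes.NewReader(wkb[1:])
	var typ uint32
	if err := binary.Read(r, order, &typ); err != nil {
		return err
	}
	n := uint32(1) // a Point carries exactly one coordinate pair
	switch typ {
	case 1:
	case 2:
		if err := binary.Read(r, order, &n); err != nil {
			return err
		}
	default:
		return fmt.Errorf("wkb: unsupported geometry type %d", typ)
	}
	for i := uint32(0); i < n; i++ {
		var xy [2]float64
		if err := binary.Read(r, order, &xy); err != nil {
			return err
		}
		// POINT EMPTY is encoded as NaN coordinates; skip it so NaN
		// does not poison the running min/max.
		if math.IsNaN(xy[0]) || math.IsNaN(xy[1]) {
			continue
		}
		b.XMin, b.XMax = math.Min(b.XMin, xy[0]), math.Max(b.XMax, xy[0])
		b.YMin, b.YMax = math.Min(b.YMin, xy[1]), math.Max(b.YMax, xy[1])
	}
	return nil
}

// lineStringWKB builds a little-endian 2D LineString for the demo.
func lineStringWKB(pts ...[2]float64) []byte {
	var buf bytes.Buffer
	buf.WriteByte(1)
	binary.Write(&buf, binary.LittleEndian, uint32(2))
	binary.Write(&buf, binary.LittleEndian, uint32(len(pts)))
	binary.Write(&buf, binary.LittleEndian, pts)
	return buf.Bytes()
}

func main() {
	box := emptyBox()
	err := box.AddWKB(lineStringWKB([2]float64{1, 5}, [2]float64{-3, 2}, [2]float64{4, 0}))
	fmt.Printf("%+v err=%v\n", box, err)
}
```

A full writer would also need polygons, multi-geometries, collections, and the Z/M variants.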

@tanmayrauth tanmayrauth requested a review from zeroshade as a code owner July 4, 2026 01:28

@zeroshade zeroshade left a comment

Member

Thanks for adding geo bounds support. I think this needs changes before merge:

  • Geometry/geography manifest lower_bounds/upper_bounds are being serialized as WKB point payloads, but Iceberg v3 geo single-value bounds use concatenated little-endian float64 coordinates (XY = 16 bytes, XYZ = 24, XYM and XYZM = 32, with Z written as NaN for XYM). This will put WKB headers in manifests and break readers expecting raw coordinate bytes.
  • Geography bounds are accumulated as raw vertex min/max without type awareness, edge interpolation, or antimeridian wraparound handling. For geography, the correct longitude interval may wrap with xmin > xmax, so these bounds can be incorrect unless computed geodesically/type-aware or omitted when unsafe.
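
To make the byte layout in the first point concrete, here is a small sketch of the coordinate encoding as described in this review (X, Y, then optional Z and M, each a little-endian float64). `encodeBound` is a hypothetical helper written for illustration; it is not the function in `geo_codec.go`.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeBound is a hypothetical helper. z and m are nil when the
// corresponding dimension is absent from the column.
func encodeBound(x, y float64, z, m *float64) []byte {
	coords := []float64{x, y}
	switch {
	case z != nil && m != nil: // XYZM
		coords = append(coords, *z, *m)
	case m != nil: // XYM: the Z slot is filled with NaN
		coords = append(coords, math.NaN(), *m)
	case z != nil: // XYZ
		coords = append(coords, *z)
	}
	out := make([]byte, 8*len(coords))
	for i, c := range coords {
		binary.LittleEndian.PutUint64(out[8*i:], math.Float64bits(c))
	}
	return out
}

func main() {
	z, m := 3.0, 4.0
	fmt.Println("XY   bytes:", len(encodeBound(1, 2, nil, nil)))
	fmt.Println("XYZ  bytes:", len(encodeBound(1, 2, &z, nil)))
	fmt.Println("XYM  bytes:", len(encodeBound(1, 2, nil, &m)))
	fmt.Println("XYZM bytes:", len(encodeBound(1, 2, &z, &m)))
}
```

By contrast, a 2D WKB point is 21 bytes (1 byte-order flag, 4 type bytes, 16 coordinate bytes), so a reader expecting 16, 24, or 32 bytes cannot interpret it.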

Comment thread table/internal/geo_codec.go Outdated
Comment thread table/internal/parquet_files.go
Emit geometry manifest bounds using the Iceberg geospatial single-value
serialization (little-endian float64 X,Y[,Z][,M]) instead of WKB, and
validate the 16/24/32 byte length in LiteralFromBytes. Omit geography
bounds, whose geodesic edges make vertex min/max an unsafe box.
@tanmayrauth

Contributor Author

@zeroshade Thanks, both are good catches. I switched manifest bounds to the spec coordinate encoding and made geography omit bounds for now.

On table/internal/geo_codec.go (the encodeGeoBound/Bounds area):
Fixed — bounds are now the Iceberg geospatial single-value serialization (little-endian float64 X, Y[, Z][, M]: 16/24/32 bytes, XYM writes Z=NaN), not WKB. LiteralFromBytes for geometry/geography now validates the 16/24/32 length instead of treating the payload as opaque WKB, and the tests assert exact byte lengths and the decoded coordinates (including the XYM NaN slot).
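
A rough sketch of the length validation described above, for illustration only: `decodeBound` is a hypothetical stand-in and does not reproduce the actual `LiteralFromBytes` code path.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodeBound is a hypothetical decoder. It accepts only 16, 24, or 32
// bytes and returns the coordinates in stored order (X, Y[, Z][, M]).
func decodeBound(b []byte) ([]float64, error) {
	switch len(b) {
	case 16, 24, 32:
	default:
		return nil, fmt.Errorf("geo bound: got %d bytes, want 16, 24, or 32", len(b))
	}
	coords := make([]float64, len(b)/8)
	for i := range coords {
		coords[i] = math.Float64frombits(binary.LittleEndian.Uint64(b[8*i:]))
	}
	return coords, nil
}

func main() {
	// 16 bytes holding X=1.0 and Y=0.0 in little-endian order.
	xy := make([]byte, 16)
	binary.LittleEndian.PutUint64(xy, math.Float64bits(1.0))
	fmt.Println(decodeBound(xy))

	// A 2D WKB point is 21 bytes, so it fails the length check.
	wkbPoint := append([]byte{1, 1, 0, 0, 0}, xy...)
	fmt.Println(decodeBound(wkbPoint))
}
```

In a decoded 32-byte XYM bound, the Z slot holds NaN, so callers should test it with `math.IsNaN` rather than `==`.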

On table/internal/parquet_files.go (the AddWKB accumulation loop):
Made accumulation type-aware: the accumulator now knows geometry vs geography (from the Iceberg field type), and geography bounds are omitted entirely. Since geography edges are geodesics, vertex min/max isn't a safe box — latitude can bulge past the endpoints and longitude may need to wrap the antimeridian — so rather than emit possibly-wrong bounds that silently prune valid rows, I leave geography unbounded until we add geodesic/antimeridian-aware computation. Geometry (planar edges) still gets exact bounds.
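
To illustrate the latitude-bulge argument and the omit-for-geography behavior, here is a small sketch. It assumes a spherical model (great-circle edges), and `boundsAccumulator`, `addVertex`, `bounds`, and `midpointLat` are hypothetical names that do not mirror the code in `parquet_files.go`.

```go
package main

import (
	"fmt"
	"math"
)

type geoKind int

const (
	kindGeometry geoKind = iota
	kindGeography
)

// boundsAccumulator is a hypothetical type-aware accumulator.
type boundsAccumulator struct {
	kind                   geoKind
	xMin, yMin, xMax, yMax float64
	seen                   bool
}

func newAccumulator(kind geoKind) *boundsAccumulator {
	return &boundsAccumulator{
		kind: kind,
		xMin: math.Inf(1), yMin: math.Inf(1),
		xMax: math.Inf(-1), yMax: math.Inf(-1),
	}
}

// addVertex widens the box for geometry and ignores geography input.
func (a *boundsAccumulator) addVertex(x, y float64) {
	if a.kind == kindGeography {
		return
	}
	a.xMin, a.xMax = math.Min(a.xMin, x), math.Max(a.xMax, x)
	a.yMin, a.yMax = math.Min(a.yMin, y), math.Max(a.yMax, y)
	a.seen = true
}

// bounds reports ok=false when no safe box exists, so the caller
// omits lower/upper bounds for that column.
func (a *boundsAccumulator) bounds() (lower, upper [2]float64, ok bool) {
	if a.kind == kindGeography || !a.seen {
		return lower, upper, false
	}
	return [2]float64{a.xMin, a.yMin}, [2]float64{a.xMax, a.yMax}, true
}

// midpointLat returns the latitude in degrees of the great-circle
// midpoint between two lon/lat points on a unit sphere.
func midpointLat(lon1, lat1, lon2, lat2 float64) float64 {
	vec := func(lon, lat float64) (x, y, z float64) {
		lo, la := lon*math.Pi/180, lat*math.Pi/180
		return math.Cos(la) * math.Cos(lo), math.Cos(la) * math.Sin(lo), math.Sin(la)
	}
	x1, y1, z1 := vec(lon1, lat1)
	x2, y2, z2 := vec(lon2, lat2)
	x, y, z := x1+x2, y1+y2, z1+z2
	return math.Atan2(z, math.Hypot(x, y)) * 180 / math.Pi
}

func main() {
	// Both endpoints sit at latitude 50, yet the edge between them
	// climbs well above 50 at its midpoint.
	fmt.Printf("midpoint latitude: %.1f\n", midpointLat(-60, 50, 60, 50))

	g := newAccumulator(kindGeography)
	g.addVertex(-60, 50)
	g.addVertex(60, 50)
	_, _, ok := g.bounds()
	fmt.Println("geography bounds emitted:", ok)
}
```

A vertex-only box for that edge would claim a maximum latitude of 50 and could prune a query window that the edge actually crosses, which is why omitting the bounds is the conservative choice.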

Development

Successfully merging this pull request may close these issues.

feat(table): emit Parquet GeoStatistics on writes, parse on reads
