feat(table): emit geo bounds in manifest DataFile #1371
Compute geometry/geography bounding boxes from WKB during writes and emit WKB single-point lower/upper bounds into the manifest DataFile, which is what Iceberg pruning consumes. arrow-go cannot emit Parquet column-chunk GeospatialStatistics, so bounds are computed in iceberg-go. Also fixes a panic when writing geo columns through the stats path, and lets geo bounds round-trip via LiteralFromBytes.
zeroshade
left a comment
Thanks for adding geo bounds support. I think this needs changes before merge:
- Geometry/geography manifest lower_bounds/upper_bounds are being serialized as WKB point payloads, but Iceberg v3 geo single-value bounds use concatenated little-endian float64 coordinates (XY = 16 bytes, XYZ = 24, XYM and XYZM = 32, with Z set to NaN for XYM). This will put WKB headers in manifests and break readers expecting coordinate bytes.
- Geography bounds are accumulated as raw vertex min/max without type awareness, edge interpolation, or antimeridian wraparound handling. For geography, the correct longitude interval may wrap with xmin > xmax, so these bounds can be incorrect unless computed geodesically/type-aware or omitted when unsafe.
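To illustrate the wraparound point, here is a minimal sketch of a longitude interval test that treats xmin > xmax as an interval crossing the antimeridian. The function name `lonIntervalContains` is hypothetical and not part of iceberg-go; it only shows why raw vertex min/max can exclude longitudes that an edge actually passes through.

```go
package main

import "fmt"

// lonIntervalContains reports whether lon falls inside the longitude
// interval [xmin, xmax]. When xmin > xmax the interval is assumed to wrap
// across the antimeridian (for example 170 -> 180 -> -170).
// Hypothetical helper for illustration only.
func lonIntervalContains(xmin, xmax, lon float64) bool {
	if xmin <= xmax {
		// Ordinary interval that does not cross the antimeridian.
		return lon >= xmin && lon <= xmax
	}
	// Wrapped interval: everything east of xmin or west of xmax.
	return lon >= xmin || lon <= xmax
}

func main() {
	// An edge running from longitude 170 to -170 the short way crosses 180.
	// The wrapped interval (xmin=170, xmax=-170) contains longitude 179.
	fmt.Println(lonIntervalContains(170, -170, 179))

	// Raw vertex min/max of the same two vertices gives (-170, 170),
	// which does not contain 179 even though the edge passes through it.
	fmt.Println(lonIntervalContains(-170, 170, 179))
}
```

A writer that cannot produce the wrapped form safely can simply omit the bounds, which keeps pruning conservative.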
Emit geometry manifest bounds using the Iceberg geospatial single-value serialization (little-endian float64 X,Y[,Z][,M]) instead of WKB, and validate the 16/24/32 byte length in LiteralFromBytes. Omit geography bounds, whose geodesic edges make vertex min/max an unsafe box.
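As a rough sketch of the encoding described above, the following shows little-endian float64 serialization of a bound point and the 16/24/32 byte length check on decode. The names `encodeGeoBound` and `decodeGeoBound` here are illustrative and are not claimed to match the signatures in this PR; the caller is assumed to pass NaN for Z when encoding an XYM point.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeGeoBound serializes a bound point as concatenated little-endian
// float64 values. coords is X,Y or X,Y,Z or X,Y,Z,M; for XYM the caller is
// assumed to pass NaN as Z. Illustrative helper, not the PR's actual code.
func encodeGeoBound(coords []float64) ([]byte, error) {
	if n := len(coords); n < 2 || n > 4 {
		return nil, fmt.Errorf("geo bound needs 2 to 4 coordinates, got %d", n)
	}
	out := make([]byte, 0, 8*len(coords))
	for _, c := range coords {
		out = binary.LittleEndian.AppendUint64(out, math.Float64bits(c))
	}
	return out, nil
}

// decodeGeoBound validates the payload length (16, 24, or 32 bytes) and
// decodes the coordinates back into float64 values.
func decodeGeoBound(b []byte) ([]float64, error) {
	switch len(b) {
	case 16, 24, 32:
	default:
		return nil, fmt.Errorf("invalid geo bound length %d, want 16, 24, or 32", len(b))
	}
	coords := make([]float64, len(b)/8)
	for i := range coords {
		coords[i] = math.Float64frombits(binary.LittleEndian.Uint64(b[8*i:]))
	}
	return coords, nil
}

func main() {
	// XYM point: Z is NaN, M is 4.
	b, err := encodeGeoBound([]float64{-122.5, 37.7, math.NaN(), 4})
	if err != nil {
		panic(err)
	}
	pt, err := decodeGeoBound(b)
	fmt.Println(len(b), pt, err)
}
```

Rejecting any other length on decode means a WKB point payload (21 bytes for a 2D point) fails fast instead of being misread as coordinates.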
@zeroshade Thanks for both good catches. Switched manifest bounds to the spec coordinate encoding and made geography omit bounds for now.
On table/internal/geo_codec.go (the encodeGeoBound/Bounds area):
On table/internal/parquet_files.go (the AddWKB accumulation loop):
Closes: #992