Parquet: Read and write geometry and geography WKB values#16982
Conversation
Geometry and geography columns are stored as pure WKB in a BINARY Parquet column (the logical-type mapping landed in apache#16765). Wire the value path through ParquetValueReaders/Writers.byteBuffers, the same primitive used for BSON, since the in-memory representation is a WKB ByteBuffer. This replaces the temporary UnsupportedOperationException stubs left for the writer and the unsupported-logical-type failure on the reader. Enable the shared DataTest round-trip coverage for geospatial types in the generic Parquet reader/writer and add an explicit WKB round-trip test, including null values.
d05928f to
7ee817f
Compare
| } | ||
|
|
||
| @Override | ||
| protected boolean supportsGeospatial() { |
There was a problem hiding this comment.
🟡 Should fix: The PR description notes that both generic and internal object models inherit the BaseParquetReaders / BaseParquetWriter fix, but only TestGenericData enables geospatial coverage here. Consider also enabling supportsGeospatial() in TestInternalParquet, and applying the same GEOMETRY / GEOGRAPHY → ByteBuffer.wrap(...) handling in RandomInternalData.primitive() that you added to RandomGenericData. Without that, RandomUtil.generatePrimitive returns byte[], which is the wrong in-memory type for geo columns in the internal path.
| // geospatial value path is a separate follow-up | ||
| throw new UnsupportedOperationException("Cannot write geometry value to Parquet"); | ||
| // geometry values are pure WKB stored in a BINARY column | ||
| return Optional.of(ParquetValueWriters.byteBuffers(desc)); |
There was a problem hiding this comment.
❓ Question: ParquetAvroWriter handles BSON logical types via byteBuffers but does not yet handle geometry/geography logical types. Is that path intentionally out of scope for this PR, or should it be tracked as a follow-up?
| } | ||
| } | ||
|
|
||
| private static ByteBuffer wkbPoint(double xCoord, double yCoord) { |
There was a problem hiding this comment.
🟢 Nit: This wkbPoint helper duplicates the one added in RandomUtil. Consider reusing RandomUtil and wrapping with ByteBuffer.wrap(...) so WKB encoding lives in one place.
|
🟢 Nit (optional): Now that geo values are actually written, the geospatial branch in |
Address review feedback on the geo value path: - Enable supportsGeospatial() in TestInternalParquet and map GEOMETRY/GEOGRAPHY to ByteBuffer in RandomInternalData.primitive() and InternalTestHelpers, so the internal object model exercises the same geo round-trip as the generic model (both inherit the BaseParquetReaders/BaseParquetWriter fix). - Reuse RandomUtil.wkbPoint in TestParquetDataWriter instead of a duplicated local WKB encoder, keeping the encoding in one place.
szehon-ho
left a comment
There was a problem hiding this comment.
lgtm, let's do the unresolved additions as follow-ups, they dont affect this pr
|
Merged, thanks @huan233usc ! |
The Spark type mapping (apache#16851) and Iceberg's own Parquet value path (apache#16982) are in place, but the Spark Parquet reader/writer did not handle geo values: geometry/geography carry a LogicalTypeAnnotation with no legacy OriginalType, so the reader fell through to a raw byte[] (mis-typed for a GeometryVal/GeographyVal column) and the writer threw on the unsupported-logical-type path. Read a geo WKB BINARY column into Spark's GeometryVal/GeographyVal and write those values back as their WKB bytes, mirroring the existing binary handling. Enable the shared geospatial DataTest coverage for the Spark Parquet reader and add a Spark writer round-trip test, including null values.
Geometry and geography columns map to a BINARY Parquet column carrying a
geometry/geography logical type, the schema mapping added in #16765. That PR
deliberately left the value path as a follow-up: the writer threw
UnsupportedOperationExceptionand the reader failed on the unsupported logicaltype. This PR wires up the value path.
Geo values are pure WKB, and Iceberg represents them in memory as a
ByteBuffer(
Type.TypeID.GEOMETRY/GEOGRAPHYmap toByteBuffer.class), so the readerand writer reuse
ParquetValueReaders/ParquetValueWriters.byteBuffers— thesame primitive already used for BSON. The change is in
BaseParquetReaders/BaseParquetWriter, so both the generic and internal object models inherit it.Testing:
DataTestround-trip coverage for geospatial types in thegeneric Parquet reader/writer (
supportsGeospatial()), exercising geometry andgeography across multiple CRS and edge algorithms with randomly generated WKB
values.
TestParquetDataWriter) coveringgeometry, geography, and null values through the
DataWriterpath.