Parquet: Read and write geometry and geography WKB values #16982

Merged
szehon-ho merged 2 commits into apache:main from huan233usc:geo-parquet-value-path on Jul 3, 2026

@huan233usc

Contributor

Geometry and geography columns map to a BINARY Parquet column carrying a
geometry/geography logical type, the schema mapping added in #16765. That PR
deliberately left the value path as a follow-up: the writer threw
UnsupportedOperationException and the reader failed on the unsupported logical
type. This PR wires up the value path.

Geo values are pure WKB, and Iceberg represents them in memory as a ByteBuffer
(Type.TypeID.GEOMETRY / GEOGRAPHY map to ByteBuffer.class), so the reader
and writer reuse ParquetValueReaders/ParquetValueWriters.byteBuffers — the
same primitive already used for BSON. The change is in BaseParquetReaders /
BaseParquetWriter, so both the generic and internal object models inherit it.
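For readers unfamiliar with the in-memory shape, here is a minimal sketch of a 2D point encoded as little-endian WKB in a ByteBuffer. The class and method names (WkbPointSketch, wkbPoint, x) are hypothetical and are not the helpers used in this PR; only the standard WKB point layout and the JDK ByteBuffer API are assumed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: WkbPointSketch is a hypothetical name, not code from this PR.
final class WkbPointSketch {
  private WkbPointSketch() {}

  // Encodes a 2D point as little-endian WKB:
  // 1 byte-order byte + 4-byte geometry type + two 8-byte doubles = 21 bytes.
  static ByteBuffer wkbPoint(double x, double y) {
    ByteBuffer buffer = ByteBuffer.allocate(21).order(ByteOrder.LITTLE_ENDIAN);
    buffer.put((byte) 1); // byte-order marker: 1 = little-endian
    buffer.putInt(1); // geometry type: 1 = Point
    buffer.putDouble(x);
    buffer.putDouble(y);
    buffer.flip(); // position 0, limit 21, ready to be read
    return buffer;
  }

  // Reads the x coordinate back without moving the caller's buffer position.
  // duplicate() resets the byte order to big-endian, so it is set again here.
  static double x(ByteBuffer wkb) {
    return wkb.duplicate().order(ByteOrder.LITTLE_ENDIAN).getDouble(wkb.position() + 5);
  }
}
```

A value shaped like this is what a ByteBuffer-based writer would receive and what a ByteBuffer-based reader would hand back, with the bytes passed through unchanged.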

Testing:

  • Enables the shared DataTest round-trip coverage for geospatial types in the
    generic Parquet reader/writer (supportsGeospatial()), exercising geometry and
    geography across multiple CRS and edge algorithms with randomly generated WKB
    values.
  • Adds an explicit WKB round-trip test (TestParquetDataWriter) covering
    geometry, geography, and null values through the DataWriter path.

Geometry and geography columns are stored as pure WKB in a BINARY
Parquet column (the logical-type mapping landed in apache#16765). Wire the
value path through ParquetValueReaders/Writers.byteBuffers, the same
primitive used for BSON, since the in-memory representation is a WKB
ByteBuffer. This replaces the temporary UnsupportedOperationException
stubs left for the writer and the unsupported-logical-type failure on
the reader.

Enable the shared DataTest round-trip coverage for geospatial types in
the generic Parquet reader/writer and add an explicit WKB round-trip
test, including null values.

@huan233usc huan233usc force-pushed the geo-parquet-value-path branch from d05928f to 7ee817f on June 27, 2026 22:02

@szehon-ho szehon-ho left a comment

Member

Focused follow-up to #16765 — the byteBuffers approach for geometry/geography WKB looks correct. A few notes below.

}

@Override
protected boolean supportsGeospatial() {

Member

🟡 Should fix: The PR description notes that both the generic and internal object models inherit the BaseParquetReaders / BaseParquetWriter fix, but only TestGenericData enables geospatial coverage here. Consider also enabling supportsGeospatial() in TestInternalParquet, and applying the same GEOMETRY / GEOGRAPHY to ByteBuffer.wrap(...) handling in RandomInternalData.primitive() that you added to RandomGenericData. Without that, RandomUtil.generatePrimitive returns byte[], which is the wrong in-memory type for geo columns in the internal path.
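A rough sketch of the wrapping step suggested above, assuming the generator hands back a raw byte[] for geo types. GeoValueWrapSketch, its TypeId enum, and toInMemoryValue are hypothetical stand-ins for illustration, not the Iceberg test utilities.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a random-data helper; not the Iceberg RandomUtil API.
final class GeoValueWrapSketch {
  // Simplified type ids for illustration only.
  enum TypeId { GEOMETRY, GEOGRAPHY, OTHER }

  private GeoValueWrapSketch() {}

  // Converts a generated value to the in-memory type the object model expects.
  // Geo values are assumed to arrive as byte[] and are wrapped in a ByteBuffer.
  static Object toInMemoryValue(TypeId typeId, Object generated) {
    switch (typeId) {
      case GEOMETRY:
      case GEOGRAPHY:
        return ByteBuffer.wrap((byte[]) generated);
      default:
        return generated; // other types pass through unchanged
    }
  }
}
```

Without the wrap, a writer that casts the value to ByteBuffer would fail with a ClassCastException on the raw byte[].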

- // geospatial value path is a separate follow-up
- throw new UnsupportedOperationException("Cannot write geometry value to Parquet");
+ // geometry values are pure WKB stored in a BINARY column
+ return Optional.of(ParquetValueWriters.byteBuffers(desc));

Member

Question: ParquetAvroWriter handles BSON logical types via byteBuffers but does not yet handle geometry/geography logical types. Is that path intentionally out of scope for this PR, or should it be tracked as a follow-up?

}
}

private static ByteBuffer wkbPoint(double xCoord, double yCoord) {

Member

🟢 Nit: This wkbPoint helper duplicates the one added in RandomUtil. Consider reusing RandomUtil and wrapping with ByteBuffer.wrap(...) so WKB encoding lives in one place.

@szehon-ho

Member

🟢 Nit (optional): Now that geo values are actually written, the geospatial branch in ParquetMetrics.metricsFromFooter (counts only, no lexicographic bounds for WKB) is exercised for the first time, but no test asserts geo metrics. Since testGeospatialRoundTrip already writes null and non-null geo values, it'd be cheap to also assert the resulting DataFile value/null counts (and that no min/max bounds are produced for the geo columns). Not blocking.
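As a side illustration of why byte-wise min/max bounds would not track coordinate order for WKB, the sketch below compares two little-endian WKB points as unsigned byte strings. WkbBoundsSketch and its methods are hypothetical names, and Arrays.compareUnsigned assumes Java 9 or later; this is not the ParquetMetrics logic.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Illustrative only: shows that unsigned lexicographic order over WKB bytes
// does not follow the numeric order of the encoded coordinates.
final class WkbBoundsSketch {
  private WkbBoundsSketch() {}

  // Little-endian WKB for a 2D point: byte order, geometry type, x, y.
  static byte[] wkbPoint(double x, double y) {
    ByteBuffer buffer = ByteBuffer.allocate(21).order(ByteOrder.LITTLE_ENDIAN);
    buffer.put((byte) 1); // 1 = little-endian
    buffer.putInt(1); // 1 = Point
    buffer.putDouble(x);
    buffer.putDouble(y);
    return buffer.array();
  }

  // Unsigned byte-wise comparison, the ordering typically used for binary bounds.
  static int compareLexicographic(byte[] left, byte[] right) {
    return Arrays.compareUnsigned(left, right);
  }
}
```

Here compareLexicographic(wkbPoint(1.0, 0.0), wkbPoint(2.0, 0.0)) is positive: the little-endian bytes of 1.0 contain 0xF0 where 2.0 has 0x00, so the point with the smaller x sorts after the one with the larger x.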

Address review feedback on the geo value path:

- Enable supportsGeospatial() in TestInternalParquet and map GEOMETRY/GEOGRAPHY
  to ByteBuffer in RandomInternalData.primitive() and InternalTestHelpers, so the
  internal object model exercises the same geo round-trip as the generic model
  (both inherit the BaseParquetReaders/BaseParquetWriter fix).
- Reuse RandomUtil.wkbPoint in TestParquetDataWriter instead of a duplicated
  local WKB encoder, keeping the encoding in one place.

@szehon-ho szehon-ho left a comment

Member

LGTM, let's do the unresolved additions as follow-ups; they don't affect this PR.

@szehon-ho szehon-ho merged commit 744e811 into apache:main Jul 3, 2026
55 checks passed
@szehon-ho

Member

Merged, thanks @huan233usc!

huan233usc pushed a commit to huan233usc/iceberg that referenced this pull request Jul 3, 2026
The Spark type mapping (apache#16851) and Iceberg's own Parquet value path
(apache#16982) are in place, but the Spark Parquet reader/writer did not handle
geo values: geometry/geography carry a LogicalTypeAnnotation with no legacy
OriginalType, so the reader fell through to a raw byte[] (mis-typed for a
GeometryVal/GeographyVal column) and the writer threw on the
unsupported-logical-type path.

Read a geo WKB BINARY column into Spark's GeometryVal/GeographyVal and write
those values back as their WKB bytes, mirroring the existing binary handling.
Enable the shared geospatial DataTest coverage for the Spark Parquet reader
and add a Spark writer round-trip test, including null values.