[SPARK-57058][SQL] Introduce BinaryView and migrate geo types to it for zero-copy reads by cloud-fan · Pull Request #56104 · apache/spark

cloud-fan · 2026-05-26T00:45:40Z

What changes were proposed in this pull request?

Introduce a generic BinaryView physical-value class that holds a non-owning pointer to a contiguous chunk of bytes living either on-heap or off-heap, and migrate the GEOMETRY and GEOGRAPHY types onto it.

BinaryView is modelled on UTF8String's (Object base, long offset, int numBytes) shape so its accessors plug directly into existing Platform.copyMemory / Platform.get* call sites. Lifetime discipline matches UTF8String: callers that need to retain a value past the source buffer's lifetime must call copy().
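The lifetime rule above can be illustrated with a minimal sketch. `ByteView` below is a hypothetical, heap-only stand-in for the `(base, offset, numBytes)` shape; it is not Spark's `BinaryView`, and it uses a plain `byte[]` with an `int` offset rather than `Platform` accessors, so treat it only as an illustration of aliasing versus `copy()`.

```java
import java.util.Arrays;

/** Hypothetical heap-only view; illustrates the shape, not Spark's actual class. */
final class ByteView {
    private final byte[] base;   // backing storage that the view does NOT own
    private final int offset;    // start of the viewed range inside base
    private final int numBytes;  // length of the viewed range

    private ByteView(byte[] base, int offset, int numBytes) {
        if (offset < 0 || numBytes < 0 || offset > base.length - numBytes) {
            throw new IndexOutOfBoundsException("range outside backing array");
        }
        this.base = base;
        this.offset = offset;
        this.numBytes = numBytes;
    }

    /** Creates a view that aliases the given array; no bytes are copied. */
    static ByteView of(byte[] base, int offset, int numBytes) {
        return new ByteView(base, offset, numBytes);
    }

    int numBytes() {
        return numBytes;
    }

    byte getByte(int i) {
        if (i < 0 || i >= numBytes) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        return base[offset + i];
    }

    /** Returns a view over a freshly allocated array, independent of the source. */
    ByteView copy() {
        byte[] owned = Arrays.copyOfRange(base, offset, offset + numBytes);
        return new ByteView(owned, 0, numBytes);
    }
}

final class ByteViewDemo {
    public static void main(String[] args) {
        byte[] buffer = {10, 20, 30, 40};
        ByteView view = ByteView.of(buffer, 1, 2); // aliases buffer[1..2]
        ByteView kept = view.copy();               // owns its own bytes

        buffer[1] = 99; // simulate the source buffer being reused

        System.out.println(view.getByte(0)); // 99: the view sees the overwrite
        System.out.println(kept.getByte(0)); // 20: the copy is unaffected
    }
}
```

A caller that stores `view` past the point where `buffer` is reused observes the overwritten bytes, which is the situation `copy()` is meant to prevent.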

Spark already separates logical type from physical type. GEOMETRY and GEOGRAPHY are different logical types but share the same physical layout: an opaque chunk of bytes. So they fold into a single physical type:

- Delete PhysicalGeometryType and PhysicalGeographyType. Add PhysicalBinaryViewType whose InternalType is BinaryView. Both GeometryType and GeographyType map to it.
- Delete GeometryVal and GeographyVal (they were byte[] marker wrappers).
- SpecializedGetters.getGeometry / getGeography collapse into one getBinaryView(int), mirroring how getUTF8String is the single accessor regardless of StringType / CharType / VarcharType.
- UnsafeWriter.write(int, GeometryVal/GeographyVal) collapses into write(int, BinaryView).
- STUtils overloads that previously dispatched on GeometryVal vs GeographyVal are renamed to explicit stGeom* / stGeog* pairs; the ST expressions pick the right variant from the input's logical DataType at runtime-replacement time.
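As a rough sketch of the folding described in this list, the example below maps several logical types onto one physical type and then dispatches on the logical type, since the physical value alone no longer says whether it is a geometry or a geography. All names here (`TypeFoldingSketch`, the enums, `stGeomDescribe`, `stGeogDescribe`) are hypothetical and only stand in for Spark's real `DataType` / `PhysicalDataType` classes and `STUtils` methods.

```java
final class TypeFoldingSketch {
    // Hypothetical stand-ins for logical and physical types.
    enum LogicalType { STRING, CHAR, VARCHAR, GEOMETRY, GEOGRAPHY }
    enum PhysicalType { UTF8_STRING, BINARY_VIEW }

    /** Several logical types can share one physical representation. */
    static PhysicalType physicalOf(LogicalType t) {
        switch (t) {
            case STRING:
            case CHAR:
            case VARCHAR:
                return PhysicalType.UTF8_STRING;
            case GEOMETRY:
            case GEOGRAPHY:
                return PhysicalType.BINARY_VIEW;
            default:
                throw new IllegalArgumentException("unknown type: " + t);
        }
    }

    // Hypothetical explicit pair, replacing overloads on distinct wrapper classes.
    static String stGeomDescribe(byte[] payload) {
        return "geometry:" + payload.length;
    }

    static String stGeogDescribe(byte[] payload) {
        return "geography:" + payload.length;
    }

    /** The caller picks the variant from the logical type, not from the value's class. */
    static String describe(LogicalType t, byte[] payload) {
        switch (t) {
            case GEOMETRY:
                return stGeomDescribe(payload);
            case GEOGRAPHY:
                return stGeogDescribe(payload);
            default:
                throw new IllegalArgumentException("not a geo type: " + t);
        }
    }

    public static void main(String[] args) {
        System.out.println(physicalOf(LogicalType.GEOMETRY) == physicalOf(LogicalType.GEOGRAPHY)); // true
        System.out.println(describe(LogicalType.GEOGRAPHY, new byte[4])); // geography:4
    }
}
```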

The on-disk UnsafeRow layout is unchanged.

Zero-copy reads (the actual perf win):

- UnsafeRow.getBinaryView and UnsafeArrayData.getBinaryView now do BinaryView.fromAddress(baseObject, baseOffset + offset, size) instead of getBinary() + fromBytes(). This drops one byte[] allocation and one Platform.copyMemory per read, exactly mirroring getUTF8String.
- UnsafeWriter.write(int, BinaryView) writes via (getBaseObject, getBaseOffset, numBytes) instead of getBytes().
- The three copy() methods that materialize a GenericInternalRow from a columnar source (ColumnarRow, ColumnarBatchRow, MutableColumnarRow) now call .copy() on the BinaryView defensively, matching the existing getUTF8String(i).copy() discipline on the line right above.
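The difference between the old copying read and a view read can be sketched with standard `java.nio.ByteBuffer` as a stand-in for `BinaryView`. The sketch assumes a simplified row in which a variable-length field is located by an offset and a size packed into one 64-bit slot; `RowReadSketch`, `pack`, `readCopy`, and `readView` are hypothetical names, and this is not Spark's `UnsafeRow` code.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class RowReadSketch {
    /** Packs a field's byte offset (high 32 bits) and size (low 32 bits) into one slot. */
    static long pack(int offset, int size) {
        return ((long) offset << 32) | (size & 0xFFFFFFFFL);
    }

    /** Copying read: allocates a new array and copies the field's bytes into it. */
    static byte[] readCopy(byte[] row, long offsetAndSize) {
        int offset = (int) (offsetAndSize >>> 32);
        int size = (int) offsetAndSize;
        return Arrays.copyOfRange(row, offset, offset + size);
    }

    /** View read: wraps the same backing array, so no payload bytes are copied. */
    static ByteBuffer readView(byte[] row, long offsetAndSize) {
        int offset = (int) (offsetAndSize >>> 32);
        int size = (int) offsetAndSize;
        return ByteBuffer.wrap(row, offset, size).slice();
    }

    public static void main(String[] args) {
        byte[] row = {0, 0, 7, 8, 9, 0};
        long slot = pack(2, 3); // the field occupies row[2..4]

        byte[] copied = readCopy(row, slot);
        ByteBuffer view = readView(row, slot);

        row[2] = 42; // mutate the row buffer after both reads

        System.out.println(copied[0]);   // 7: the copy kept the old byte
        System.out.println(view.get(0)); // 42: the view aliases the row buffer
    }
}
```

The view read still allocates a small wrapper object, but it skips the payload-sized array and the byte copy. The price is that the view is only valid while the row buffer is, which is why materializing paths copy defensively.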

Why are the changes needed?

1. Today, reading GEOMETRY / GEOGRAPHY out of an UnsafeRow or UnsafeArrayData allocates a fresh byte[] and copies the bytes into it with Platform.copyMemory, even though UTF8String shows the zero-copy view pattern is already established in the same code paths.
2. GeometryVal and GeographyVal were marker classes whose only content was forwarding to byte[]. Once BinaryView exists, keeping them as separate types (and keeping separate PhysicalGeometryType / PhysicalGeographyType that differ only in the wrapper class) is circular: the physical layer distinguishes them solely because the Java type system distinguished them.
3. This gives geo the same shape as the rest of Spark: StringType / CharType / VarcharType → PhysicalStringType → UTF8String. Geo had been the odd one out.

Does this PR introduce any user-facing change?

No. GEOMETRY and GEOGRAPHY are still unreleased, and the user-facing org.apache.spark.sql.types.Geometry / org.apache.spark.sql.types.Geography APIs are unchanged. Only physical-layer internals move.

How was this patch tested?

- New BinaryViewSuite (13 tests) covering on-heap, off-heap (via MemoryAllocator.UNSAFE), slice, copy independence, primitive readers, unsigned lexicographic compareTo, equals / hashCode across heap and off-heap, ByteBuffer round-trip on both paths, writeToMemory, and Java + Kryo serialization round-trips.
- unsafe/checkstyle clean. ASCII-only spot-check on new files clean.
- Existing GeometryValSuite and GeographyValSuite were deleted (the classes are gone). Their assertions (other than the deliberately-throwing compareTo) are subsumed by BinaryViewSuite and the existing GeometryExecutionSuite / GeographyExecutionSuite / STUtilsSuite / CatalystTypeConvertersSuite / LiteralExpressionSuite / UnsafeRowWriterSuite / ArrowWriterSuite / GenerateUnsafeProjectionSuite / ParquetDelta{Byte,Length}ArrayEncodingSuite / Geo{metry,graphy}TypeSuite which have all been updated for the new API.
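For readers unfamiliar with the "unsigned lexicographic compareTo" property in the test list, here is a minimal sketch of what such an ordering means for raw bytes. It is an illustration written for this description, not the code in BinaryView, and `UnsignedCompareSketch` is a hypothetical name.

```java
final class UnsignedCompareSketch {
    /** Compares two byte arrays lexicographically, treating each byte as 0..255. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            // Mask to int so 0x80..0xFF sort after 0x00..0x7F.
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        // On a shared prefix, the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] low = {0x7F};
        byte[] high = {(byte) 0x80};

        System.out.println(compareUnsigned(low, high) < 0);   // true: 127 < 128 unsigned
        System.out.println(Byte.compare(low[0], high[0]) > 0); // true: 127 > -128 signed
    }
}
```

The two printed lines show why the distinction matters: a signed byte comparison orders `0x80` before `0x7F`, while the unsigned ordering does not.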

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude code (Opus 4.7)

Co-authored-by: Isaac

…aryView migration

Replace em-dashes/ellipses (scalastyle hazard) in new comments, correct the misleading "bytes are copied" claim on geo*FromPhysVal (BinaryView may alias), fully qualify {@link DataType} in STUtils Javadoc, and stop using {@link ...} inside Scala line comments where it has no effect.

…ve copy parity with UTF8String

Mirror the UTF8String pattern in two places that were left as copying paths:

1. Columnar zero-copy reads: WritableColumnVector.getBinaryView now delegates to a new protected abstract getBytesAsBinaryView, implemented by OnHeapColumnVector (BinaryView.fromBytes view into the on-heap byte array) and OffHeapColumnVector (BinaryView.fromAddress view into off-heap memory). Previously the base-class default fell back to getBinary(rowId), which allocates a fresh byte[] for every read.
2. Defensive copies at UTF8String-parity sites:
   - InternalRow.copyValue: new BinaryView case so GenericInternalRow.copy materializes geo fields instead of leaving them aliased to the source.
   - InternalRow.getWriterDefault: GeometryType / GeographyType branch that copies via BinaryView.
   - CodeGenerator.setColumn: include GeometryType / GeographyType in the pattern that emits value.copy(), matching the existing "may came from UnsafeRow" lifetime comment.
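The delegation described in item 1 follows a template-method shape, sketched below with `java.nio.ByteBuffer` standing in for BinaryView. The class names (`ColumnSketch`, `OnHeapColumnSketch`, `DirectColumnSketch`), the `getView` / `getBytesAsView` methods, and the offsets/lengths layout are assumptions made for illustration; they are not Spark's column vector implementation.

```java
import java.nio.ByteBuffer;

/** Hypothetical base class: the public accessor delegates storage access to subclasses. */
abstract class ColumnSketch {
    final ByteBuffer getView(int rowId) {
        return getBytesAsView(rowId);
    }

    /** Each storage kind returns a view over its own memory without copying the payload. */
    protected abstract ByteBuffer getBytesAsView(int rowId);
}

/** On-heap storage: values live back to back in one byte array. */
final class OnHeapColumnSketch extends ColumnSketch {
    private final byte[] data;
    private final int[] offsets;
    private final int[] lengths;

    OnHeapColumnSketch(byte[] data, int[] offsets, int[] lengths) {
        this.data = data; // kept by reference, not copied
        this.offsets = offsets;
        this.lengths = lengths;
    }

    @Override
    protected ByteBuffer getBytesAsView(int rowId) {
        return ByteBuffer.wrap(data, offsets[rowId], lengths[rowId]).slice();
    }
}

/** Off-heap storage: the same layout inside a direct buffer. */
final class DirectColumnSketch extends ColumnSketch {
    private final ByteBuffer data;
    private final int[] offsets;
    private final int[] lengths;

    DirectColumnSketch(ByteBuffer data, int[] offsets, int[] lengths) {
        this.data = data;
        this.offsets = offsets;
        this.lengths = lengths;
    }

    @Override
    protected ByteBuffer getBytesAsView(int rowId) {
        // duplicate() shares content but has its own position and limit.
        ByteBuffer dup = data.duplicate();
        dup.position(offsets[rowId]);
        dup.limit(offsets[rowId] + lengths[rowId]);
        return dup.slice();
    }
}

final class ColumnSketchDemo {
    public static void main(String[] args) {
        byte[] heap = {1, 2, 3, 4, 5};
        ColumnSketch col = new OnHeapColumnSketch(heap, new int[] {0, 2}, new int[] {2, 3});
        ByteBuffer second = col.getView(1); // views heap[2..4]
        heap[2] = 9;
        System.out.println(second.get(0)); // 9: the view aliases the column's array
    }
}
```

Because each view aliases column storage that may later be reused, code that retains values beyond the batch (the sites in item 2) needs the defensive copy.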

cloud-fan added 4 commits May 26, 2026 00:44

[SPARK-57058][SQL] Preserve unordered semantics for PhysicalBinaryViewType

0630358

cloud-fan wants to merge 4 commits into apache:master from cloud-fan:SPARK-57058
