[SPARK-57058][SQL] Introduce BinaryView and migrate geo types to it for zero-copy reads#56104
Open
cloud-fan wants to merge 4 commits into
Open
[SPARK-57058][SQL] Introduce BinaryView and migrate geo types to it for zero-copy reads#56104cloud-fan wants to merge 4 commits into
cloud-fan wants to merge 4 commits into
Conversation
…or zero-copy reads
### What changes were proposed in this pull request?
Introduce a generic `BinaryView` physical-value class that holds a non-owning
pointer to a contiguous chunk of bytes living either on-heap or off-heap, and
migrate the GEOMETRY and GEOGRAPHY types onto it.
`BinaryView` is modelled on `UTF8String`'s `(Object base, long offset, int numBytes)`
shape so its accessors plug directly into existing `Platform.copyMemory` / `Platform.get*`
call sites. Lifetime discipline matches `UTF8String`: callers that need to
retain a value past the source buffer's lifetime must call `copy()`.
Spark already separates logical type from physical type. GEOMETRY and GEOGRAPHY
are different logical types but share the same physical layout: an opaque chunk
of bytes. So they fold into a single physical type:
- Delete `PhysicalGeometryType` and `PhysicalGeographyType`. Add
`PhysicalBinaryViewType` whose `InternalType` is `BinaryView`. Both
`GeometryType` and `GeographyType` map to it.
- Delete `GeometryVal` and `GeographyVal` (they were `byte[]` marker wrappers).
- `SpecializedGetters.getGeometry` / `getGeography` collapse into one
`getBinaryView(int)`, mirroring how `getUTF8String` is the single accessor
regardless of `StringType` / `CharType` / `VarcharType`.
- `UnsafeWriter.write(int, GeometryVal/GeographyVal)` collapses into
`write(int, BinaryView)`.
- `STUtils` overloads that previously dispatched on `GeometryVal` vs `GeographyVal`
are renamed to explicit `stGeom*` / `stGeog*` pairs; the ST expressions pick
the right variant from the input's logical `DataType` at runtime-replacement
time.
The on-disk `UnsafeRow` layout is unchanged.
Zero-copy reads (the actual perf win):
- `UnsafeRow.getBinaryView` and `UnsafeArrayData.getBinaryView` now do
`BinaryView.fromAddress(baseObject, baseOffset + offset, size)` instead of
`getBinary()` + `fromBytes()` — drops one `byte[]` allocation + `Platform.copyMemory`
per read, exactly mirroring `getUTF8String`.
- `UnsafeWriter.write(int, BinaryView)` writes via `(getBaseObject, getBaseOffset,
numBytes)` instead of `getBytes()`.
- The three `copy()` methods that materialize a `GenericInternalRow` from a
columnar source (`ColumnarRow`, `ColumnarBatchRow`, `MutableColumnarRow`) now
call `.copy()` on the `BinaryView` defensively, matching the existing
`getUTF8String(i).copy()` discipline on the line right above.
### Why are the changes needed?
1. Today reading GEOMETRY / GEOGRAPHY out of an `UnsafeRow` or `UnsafeArrayData`
allocates a fresh `byte[]` and `Platform.copyMemory`s the bytes into it,
even though `UTF8String` shows the zero-copy view pattern is already
established in the same code paths.
2. `GeometryVal` and `GeographyVal` were marker classes whose only content was
forwarding to `byte[]`. Once `BinaryView` exists, keeping them as separate
types — and keeping separate `PhysicalGeometryType` / `PhysicalGeographyType`
that differ only in the wrapper class — is circular: the physical layer
distinguishes them solely because the Java type system distinguished them.
3. Same shape as the rest of Spark: `StringType` / `CharType` / `VarcharType`
→ `PhysicalStringType` → `UTF8String`. Geo had been the odd one out.
### Does this PR introduce _any_ user-facing change?
No. GEOMETRY and GEOGRAPHY are still unreleased, and the user-facing
`org.apache.spark.sql.types.Geometry` / `org.apache.spark.sql.types.Geography`
APIs are unchanged. Only physical-layer internals move.
### How was this patch tested?
- New `BinaryViewSuite` (13 tests) covering on-heap, off-heap (via
`MemoryAllocator.UNSAFE`), slice, copy independence, primitive readers,
unsigned lexicographic `compareTo`, `equals` / `hashCode` across heap and
off-heap, `ByteBuffer` round-trip on both paths, `writeToMemory`, and Java +
Kryo serialization round-trips.
- `unsafe/checkstyle` clean. ASCII-only spot-check on new files clean.
- Existing `GeometryValSuite` and `GeographyValSuite` were deleted (the
classes are gone). Their assertions (other than the deliberately-throwing
`compareTo`) are subsumed by `BinaryViewSuite` and the existing
`GeometryExecutionSuite` / `GeographyExecutionSuite` / `STUtilsSuite` /
`CatalystTypeConvertersSuite` / `LiteralExpressionSuite` /
`UnsafeRowWriterSuite` / `ArrowWriterSuite` / `GenerateUnsafeProjectionSuite`
/ `ParquetDelta{Byte,Length}ArrayEncodingSuite` / `Geo{metry,graphy}TypeSuite`
which have all been updated for the new API.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude code (Opus 4.7)
Co-authored-by: Isaac
…aryView migration
Replace em-dashes/ellipses (scalastyle hazard) in new comments, correct
the misleading "bytes are copied" claim on geo*FromPhysVal (BinaryView
may alias), fully-qualify {@link DataType} in STUtils Javadoc, and stop
using {@link ...} inside Scala line comments where it has no effect.
…ve copy parity with UTF8String
Mirror the UTF8String pattern in two places that were left as copying paths:
1. Columnar zero-copy reads: WritableColumnVector.getBinaryView now delegates
to a new protected abstract getBytesAsBinaryView, implemented by
OnHeapColumnVector (BinaryView.fromBytes view into the on-heap byte array)
and OffHeapColumnVector (BinaryView.fromAddress view into off-heap memory).
Previously the base-class default fell back to getBinary(rowId), which
allocates a fresh byte[] for every read.
2. Defensive copies at UTF8String-parity sites:
- InternalRow.copyValue: new BinaryView case so GenericInternalRow.copy
materializes geo fields instead of leaving them aliased to the source.
- InternalRow.getWriterDefault: GeometryType / GeographyType branch that
copies via BinaryView.
- CodeGenerator.setColumn: include GeometryType / GeographyType in the
pattern that emits value.copy(), matching the existing "may came from
UnsafeRow" lifetime comment.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Introduce a generic
BinaryViewphysical-value class that holds a non-owning pointer to a contiguous chunk of bytes living either on-heap or off-heap, and migrate the GEOMETRY and GEOGRAPHY types onto it.BinaryViewis modelled onUTF8String's(Object base, long offset, int numBytes)shape so its accessors plug directly into existingPlatform.copyMemory/Platform.get*call sites. Lifetime discipline matchesUTF8String: callers that need to retain a value past the source buffer's lifetime must callcopy().Spark already separates logical type from physical type. GEOMETRY and GEOGRAPHY are different logical types but share the same physical layout: an opaque chunk of bytes. So they fold into a single physical type:
PhysicalGeometryTypeandPhysicalGeographyType. AddPhysicalBinaryViewTypewhoseInternalTypeisBinaryView. BothGeometryTypeandGeographyTypemap to it.GeometryValandGeographyVal(they werebyte[]marker wrappers).SpecializedGetters.getGeometry/getGeographycollapse into onegetBinaryView(int), mirroring howgetUTF8Stringis the single accessor regardless ofStringType/CharType/VarcharType.UnsafeWriter.write(int, GeometryVal/GeographyVal)collapses intowrite(int, BinaryView).STUtilsoverloads that previously dispatched onGeometryValvsGeographyValare renamed to explicitstGeom*/stGeog*pairs; the ST expressions pick the right variant from the input's logicalDataTypeat runtime-replacement time.The on-disk
UnsafeRowlayout is unchanged.Zero-copy reads (the actual perf win):
UnsafeRow.getBinaryViewandUnsafeArrayData.getBinaryViewnow doBinaryView.fromAddress(baseObject, baseOffset + offset, size)instead ofgetBinary()+fromBytes()— drops onebyte[]allocation +Platform.copyMemoryper read, exactly mirroringgetUTF8String.UnsafeWriter.write(int, BinaryView)writes via(getBaseObject, getBaseOffset, numBytes)instead ofgetBytes().copy()methods that materialize aGenericInternalRowfrom a columnar source (ColumnarRow,ColumnarBatchRow,MutableColumnarRow) now call.copy()on theBinaryViewdefensively, matching the existinggetUTF8String(i).copy()discipline on the line right above.Why are the changes needed?
UnsafeRoworUnsafeArrayDataallocates a freshbyte[]andPlatform.copyMemorys the bytes into it, even thoughUTF8Stringshows the zero-copy view pattern is already established in the same code paths.GeometryValandGeographyValwere marker classes whose only content was forwarding tobyte[]. OnceBinaryViewexists, keeping them as separate types — and keeping separatePhysicalGeometryType/PhysicalGeographyTypethat differ only in the wrapper class — is circular: the physical layer distinguishes them solely because the Java type system distinguished them.StringType/CharType/VarcharType→PhysicalStringType→UTF8String. Geo had been the odd one out.Does this PR introduce any user-facing change?
No. GEOMETRY and GEOGRAPHY are still unreleased, and the user-facing
org.apache.spark.sql.types.Geometry/org.apache.spark.sql.types.GeographyAPIs are unchanged. Only physical-layer internals move.How was this patch tested?
BinaryViewSuite(13 tests) covering on-heap, off-heap (viaMemoryAllocator.UNSAFE), slice, copy independence, primitive readers, unsigned lexicographiccompareTo,equals/hashCodeacross heap and off-heap,ByteBufferround-trip on both paths,writeToMemory, and Java + Kryo serialization round-trips.unsafe/checkstyleclean. ASCII-only spot-check on new files clean.GeometryValSuiteandGeographyValSuitewere deleted (the classes are gone). Their assertions (other than the deliberately-throwingcompareTo) are subsumed byBinaryViewSuiteand the existingGeometryExecutionSuite/GeographyExecutionSuite/STUtilsSuite/CatalystTypeConvertersSuite/LiteralExpressionSuite/UnsafeRowWriterSuite/ArrowWriterSuite/GenerateUnsafeProjectionSuite/ParquetDelta{Byte,Length}ArrayEncodingSuite/Geo{metry,graphy}TypeSuitewhich have all been updated for the new API.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude code (Opus 4.7)