Skip to content

[SPARK-57058][SQL] Introduce BinaryView and migrate geo types to it for zero-copy reads#56104

Open
cloud-fan wants to merge 4 commits into
apache:masterfrom
cloud-fan:SPARK-57058
Open

[SPARK-57058][SQL] Introduce BinaryView and migrate geo types to it for zero-copy reads#56104
cloud-fan wants to merge 4 commits into
apache:masterfrom
cloud-fan:SPARK-57058

Conversation

@cloud-fan
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Introduce a generic BinaryView physical-value class that holds a non-owning pointer to a contiguous chunk of bytes living either on-heap or off-heap, and migrate the GEOMETRY and GEOGRAPHY types onto it.

BinaryView is modelled on UTF8String's (Object base, long offset, int numBytes) shape so its accessors plug directly into existing Platform.copyMemory / Platform.get* call sites. Lifetime discipline matches UTF8String: callers that need to retain a value past the source buffer's lifetime must call copy().

Spark already separates logical type from physical type. GEOMETRY and GEOGRAPHY are different logical types but share the same physical layout: an opaque chunk of bytes. So they fold into a single physical type:

  • Delete PhysicalGeometryType and PhysicalGeographyType. Add PhysicalBinaryViewType whose InternalType is BinaryView. Both GeometryType and GeographyType map to it.
  • Delete GeometryVal and GeographyVal (they were byte[] marker wrappers).
  • SpecializedGetters.getGeometry / getGeography collapse into one getBinaryView(int), mirroring how getUTF8String is the single accessor regardless of StringType / CharType / VarcharType.
  • UnsafeWriter.write(int, GeometryVal/GeographyVal) collapses into write(int, BinaryView).
  • STUtils overloads that previously dispatched on GeometryVal vs GeographyVal are renamed to explicit stGeom* / stGeog* pairs; the ST expressions pick the right variant from the input's logical DataType at runtime-replacement time.

The on-disk UnsafeRow layout is unchanged.

Zero-copy reads (the actual perf win):

  • UnsafeRow.getBinaryView and UnsafeArrayData.getBinaryView now do BinaryView.fromAddress(baseObject, baseOffset + offset, size) instead of getBinary() + fromBytes() — drops one byte[] allocation + Platform.copyMemory per read, exactly mirroring getUTF8String.
  • UnsafeWriter.write(int, BinaryView) writes via (getBaseObject, getBaseOffset, numBytes) instead of getBytes().
  • The three copy() methods that materialize a GenericInternalRow from a columnar source (ColumnarRow, ColumnarBatchRow, MutableColumnarRow) now call .copy() on the BinaryView defensively, matching the existing getUTF8String(i).copy() discipline on the line right above.

Why are the changes needed?

  1. Today reading GEOMETRY / GEOGRAPHY out of an UnsafeRow or UnsafeArrayData allocates a fresh byte[] and Platform.copyMemorys the bytes into it, even though UTF8String shows the zero-copy view pattern is already established in the same code paths.
  2. GeometryVal and GeographyVal were marker classes whose only content was forwarding to byte[]. Once BinaryView exists, keeping them as separate types — and keeping separate PhysicalGeometryType / PhysicalGeographyType that differ only in the wrapper class — is circular: the physical layer distinguishes them solely because the Java type system distinguished them.
  3. Same shape as the rest of Spark: StringType / CharType / VarcharTypePhysicalStringTypeUTF8String. Geo had been the odd one out.

Does this PR introduce any user-facing change?

No. GEOMETRY and GEOGRAPHY are still unreleased, and the user-facing org.apache.spark.sql.types.Geometry / org.apache.spark.sql.types.Geography APIs are unchanged. Only physical-layer internals move.

How was this patch tested?

  • New BinaryViewSuite (13 tests) covering on-heap, off-heap (via MemoryAllocator.UNSAFE), slice, copy independence, primitive readers, unsigned lexicographic compareTo, equals / hashCode across heap and off-heap, ByteBuffer round-trip on both paths, writeToMemory, and Java + Kryo serialization round-trips.
  • unsafe/checkstyle clean. ASCII-only spot-check on new files clean.
  • Existing GeometryValSuite and GeographyValSuite were deleted (the classes are gone). Their assertions (other than the deliberately-throwing compareTo) are subsumed by BinaryViewSuite and the existing GeometryExecutionSuite / GeographyExecutionSuite / STUtilsSuite / CatalystTypeConvertersSuite / LiteralExpressionSuite / UnsafeRowWriterSuite / ArrowWriterSuite / GenerateUnsafeProjectionSuite / ParquetDelta{Byte,Length}ArrayEncodingSuite / Geo{metry,graphy}TypeSuite which have all been updated for the new API.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude code (Opus 4.7)

cloud-fan added 4 commits May 26, 2026 00:44
…or zero-copy reads

### What changes were proposed in this pull request?

Introduce a generic `BinaryView` physical-value class that holds a non-owning
pointer to a contiguous chunk of bytes living either on-heap or off-heap, and
migrate the GEOMETRY and GEOGRAPHY types onto it.

`BinaryView` is modelled on `UTF8String`'s `(Object base, long offset, int numBytes)`
shape so its accessors plug directly into existing `Platform.copyMemory` / `Platform.get*`
call sites. Lifetime discipline matches `UTF8String`: callers that need to
retain a value past the source buffer's lifetime must call `copy()`.

Spark already separates logical type from physical type. GEOMETRY and GEOGRAPHY
are different logical types but share the same physical layout: an opaque chunk
of bytes. So they fold into a single physical type:

- Delete `PhysicalGeometryType` and `PhysicalGeographyType`. Add
  `PhysicalBinaryViewType` whose `InternalType` is `BinaryView`. Both
  `GeometryType` and `GeographyType` map to it.
- Delete `GeometryVal` and `GeographyVal` (they were `byte[]` marker wrappers).
- `SpecializedGetters.getGeometry` / `getGeography` collapse into one
  `getBinaryView(int)`, mirroring how `getUTF8String` is the single accessor
  regardless of `StringType` / `CharType` / `VarcharType`.
- `UnsafeWriter.write(int, GeometryVal/GeographyVal)` collapses into
  `write(int, BinaryView)`.
- `STUtils` overloads that previously dispatched on `GeometryVal` vs `GeographyVal`
  are renamed to explicit `stGeom*` / `stGeog*` pairs; the ST expressions pick
  the right variant from the input's logical `DataType` at runtime-replacement
  time.

The on-disk `UnsafeRow` layout is unchanged.

Zero-copy reads (the actual perf win):
- `UnsafeRow.getBinaryView` and `UnsafeArrayData.getBinaryView` now do
  `BinaryView.fromAddress(baseObject, baseOffset + offset, size)` instead of
  `getBinary()` + `fromBytes()` — drops one `byte[]` allocation + `Platform.copyMemory`
  per read, exactly mirroring `getUTF8String`.
- `UnsafeWriter.write(int, BinaryView)` writes via `(getBaseObject, getBaseOffset,
  numBytes)` instead of `getBytes()`.
- The three `copy()` methods that materialize a `GenericInternalRow` from a
  columnar source (`ColumnarRow`, `ColumnarBatchRow`, `MutableColumnarRow`) now
  call `.copy()` on the `BinaryView` defensively, matching the existing
  `getUTF8String(i).copy()` discipline on the line right above.

### Why are the changes needed?

1. Today reading GEOMETRY / GEOGRAPHY out of an `UnsafeRow` or `UnsafeArrayData`
   allocates a fresh `byte[]` and `Platform.copyMemory`s the bytes into it,
   even though `UTF8String` shows the zero-copy view pattern is already
   established in the same code paths.
2. `GeometryVal` and `GeographyVal` were marker classes whose only content was
   forwarding to `byte[]`. Once `BinaryView` exists, keeping them as separate
   types — and keeping separate `PhysicalGeometryType` / `PhysicalGeographyType`
   that differ only in the wrapper class — is circular: the physical layer
   distinguishes them solely because the Java type system distinguished them.
3. Same shape as the rest of Spark: `StringType` / `CharType` / `VarcharType`
   → `PhysicalStringType` → `UTF8String`. Geo had been the odd one out.

### Does this PR introduce _any_ user-facing change?

No. GEOMETRY and GEOGRAPHY are still unreleased, and the user-facing
`org.apache.spark.sql.types.Geometry` / `org.apache.spark.sql.types.Geography`
APIs are unchanged. Only physical-layer internals move.

### How was this patch tested?

- New `BinaryViewSuite` (13 tests) covering on-heap, off-heap (via
  `MemoryAllocator.UNSAFE`), slice, copy independence, primitive readers,
  unsigned lexicographic `compareTo`, `equals` / `hashCode` across heap and
  off-heap, `ByteBuffer` round-trip on both paths, `writeToMemory`, and Java +
  Kryo serialization round-trips.
- `unsafe/checkstyle` clean. ASCII-only spot-check on new files clean.
- Existing `GeometryValSuite` and `GeographyValSuite` were deleted (the
  classes are gone). Their assertions (other than the deliberately-throwing
  `compareTo`) are subsumed by `BinaryViewSuite` and the existing
  `GeometryExecutionSuite` / `GeographyExecutionSuite` / `STUtilsSuite` /
  `CatalystTypeConvertersSuite` / `LiteralExpressionSuite` /
  `UnsafeRowWriterSuite` / `ArrowWriterSuite` / `GenerateUnsafeProjectionSuite`
  / `ParquetDelta{Byte,Length}ArrayEncodingSuite` / `Geo{metry,graphy}TypeSuite`
  which have all been updated for the new API.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude code (Opus 4.7)

Co-authored-by: Isaac
…aryView migration

Replace em-dashes/ellipses (scalastyle hazard) in new comments, correct
the misleading "bytes are copied" claim on geo*FromPhysVal (BinaryView
may alias), fully-qualify {@link DataType} in STUtils Javadoc, and stop
using {@link ...} inside Scala line comments where it has no effect.
…ve copy parity with UTF8String

Mirror the UTF8String pattern in two places that were left as copying paths:

1. Columnar zero-copy reads: WritableColumnVector.getBinaryView now delegates
   to a new protected abstract getBytesAsBinaryView, implemented by
   OnHeapColumnVector (BinaryView.fromBytes view into the on-heap byte array)
   and OffHeapColumnVector (BinaryView.fromAddress view into off-heap memory).
   Previously the base-class default fell back to getBinary(rowId), which
   allocates a fresh byte[] for every read.

2. Defensive copies at UTF8String-parity sites:
   - InternalRow.copyValue: new BinaryView case so GenericInternalRow.copy
     materializes geo fields instead of leaving them aliased to the source.
   - InternalRow.getWriterDefault: GeometryType / GeographyType branch that
     copies via BinaryView.
   - CodeGenerator.setColumn: include GeometryType / GeographyType in the
     pattern that emits value.copy(), matching the existing "may came from
     UnsafeRow" lifetime comment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant