ArrowArrayStream.allocateNew(BufferAllocator) in shaded bundle takes wrong (relocated) parameter type, breaking interop with vanilla Apache Arrow callers
Summary
The gluten-velox bundle's org.apache.arrow.c.ArrowArrayStream.allocateNew(BufferAllocator) method takes the gluten-internal shaded org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator, not the public org.apache.arrow.memory.BufferAllocator. On clusters that load the bundle into the JVM AppClassLoader (e.g. via a wildcard extraClassPath), the bundled class shadows the user's vanilla Arrow ArrowArrayStream class. Any caller that uses Arrow C-Data (Lance, Iceberg, Snowflake JDBC, and so on) then fails with NoSuchMethodError because the public BufferAllocator it passes doesn't match gluten's shaded one.
This is a shading-config bug in package/pom.xml: org.apache.arrow.c.* is excluded from relocation (correct, because Arrow's native C-Data JNI hardcodes those class names), but org.apache.arrow.memory.* is relocated. Since org.apache.arrow.c.ArrowArrayStream references org.apache.arrow.memory.BufferAllocator in its public API signatures, the resulting class is internally inconsistent: a public, unrelocated class whose method signatures take gluten-private, relocated parameter types.
Repro
Minimal standalone repro using only Apache Arrow Java (no other dependencies):
import org.apache.arrow.c.ArrowArrayStream
import org.apache.arrow.memory.RootAllocator

object GlutenArrowConflictRepro {
  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val stream = ArrowArrayStream.allocateNew(allocator) // <-- fails here
    stream.close()
    allocator.close()
  }
}
Run as a Spark application on any cluster that has gluten's bundle on the wildcard extraClassPath (e.g. IBM CP4D spark175 engine ships gluten 1.7.0-WXD233RC1 at /opt/gluten/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.7.0-...jar).
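The output under Expected and Actual includes a probe step that the repro above does not print. A minimal sketch of such a probe follows, assuming nothing beyond JDK reflection. The class name ClassOriginProbe and its helpers are hypothetical and not part of the original repro. It reports which jar a class was loaded from and prints the matching declared methods with fully qualified parameter types.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of the original repro.
class ClassOriginProbe {

    // Returns the jar or directory a class was loaded from, or a placeholder
    // when the JVM reports no code source (e.g. for bootstrap classes).
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "<no code source>";
        }
        return src.getLocation().toString();
    }

    // Describes where className resolves from and lists its declared methods
    // named methodName, including fully qualified parameter types.
    static String describe(String className, String methodName) {
        Class<?> cls;
        try {
            cls = Class.forName(className);
        } catch (ClassNotFoundException e) {
            return className + " not found on the classpath";
        }
        StringBuilder out = new StringBuilder();
        out.append(cls.getSimpleName()).append(" class loaded from: ")
           .append(locationOf(cls)).append('\n');
        out.append("declared methods:\n");
        for (Method m : cls.getDeclaredMethods()) {
            if (m.getName().equals(methodName)) {
                out.append("  ").append(m).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("org.apache.arrow.c.ArrowArrayStream", "allocateNew"));
    }
}
```

Running this on the affected cluster before the allocateNew call should show whether the gluten bundle or the vanilla Arrow C-Data jar supplies the class.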
Expected
=== Probe: where does ArrowArrayStream resolve from? ===
ArrowArrayStream class loaded from: file:/<path-to-vanilla-arrow-c-data>.jar
declared methods:
public static org.apache.arrow.c.ArrowArrayStream
org.apache.arrow.c.ArrowArrayStream.allocateNew(org.apache.arrow.memory.BufferAllocator)
=== Attempt ===
OK
DONE
Actual
=== Probe: where does ArrowArrayStream resolve from? ===
ArrowArrayStream class loaded from: file:/opt/gluten/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.7.0-...jar
declared methods:
public static org.apache.arrow.c.ArrowArrayStream
org.apache.arrow.c.ArrowArrayStream.allocateNew(
org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator) <-- shaded type!
=== Attempt ===
FAILED: java.lang.NoSuchMethodError: org/apache/arrow/c/ArrowArrayStream.allocateNew(
Lorg/apache/arrow/memory/BufferAllocator;)Lorg/apache/arrow/c/ArrowArrayStream;
(loaded from file:/opt/gluten/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.7.0-...jar
by jdk.internal.loader.ClassLoaders$AppClassLoader)
Root cause
package/pom.xml, around lines 121-130 (same on every release since v1.0.0):
<relocation>
  <pattern>org.apache.arrow</pattern>
  <shadedPattern>${gluten.shade.packageName}.org.apache.arrow</shadedPattern>
  <!--arrow's C and dataset wrapper refers to the original class path,
      so we should not relocate here-->
  <excludes>
    <exclude>org.apache.arrow.c.*</exclude>
    <exclude>org.apache.arrow.c.jni.*</exclude>
    <exclude>org.apache.arrow.dataset.**</exclude>
  </excludes>
</relocation>
The intent is correct (don't relocate JNI-bound classes). But the public API of org.apache.arrow.c.* returns and accepts org.apache.arrow.memory.* types. The excludes only keep the org.apache.arrow.c.* classes themselves from being renamed; references inside their bytecode to relocated classes are still rewritten. So when ArrowArrayStream is included in the bundle without relocation but BufferAllocator is relocated, the bundled ArrowArrayStream's method signatures are rewritten to the shaded BufferAllocator when the bundle is shaded. The bundled class is then incompatible with anyone passing a vanilla BufferAllocator.
The same applies to other public org.apache.arrow.c.* ↔ org.apache.arrow.memory.* boundary methods: Data.exportArrayStream(BufferAllocator, ...), ArrowSchema.allocateNew(BufferAllocator), etc.
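The mismatch can be illustrated with a simplified model of relocation. The sketch below is not the maven-shade-plugin implementation: the exclude matching is an assumption (".*" covers one package level, ".**" covers a package and its subpackages), the class RelocationModel is hypothetical, and the shaded prefix is inferred from the type name in the Actual output. It builds the JVM method descriptor that the bundled allocateNew ends up with and compares it to the one a vanilla caller links against.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of package relocation; not the maven-shade-plugin code.
class RelocationModel {
    static final String PATTERN = "org.apache.arrow";
    // Assumed expansion of ${gluten.shade.packageName}.org.apache.arrow,
    // inferred from the shaded type name shown in the Actual output.
    static final String SHADED = "org.apache.gluten.shaded.org.apache.arrow";

    static final List<String> CURRENT_EXCLUDES = Arrays.asList(
        "org.apache.arrow.c.*",
        "org.apache.arrow.c.jni.*",
        "org.apache.arrow.dataset.**");

    // Assumed semantics: ".**" matches the package and all subpackages,
    // ".*" matches classes directly inside the package only.
    static boolean excluded(String className, List<String> excludes) {
        for (String ex : excludes) {
            if (ex.endsWith(".**")) {
                if (className.startsWith(ex.substring(0, ex.length() - 2))) {
                    return true;
                }
            } else if (ex.endsWith(".*")) {
                String prefix = ex.substring(0, ex.length() - 1);
                if (className.startsWith(prefix)
                        && className.indexOf('.', prefix.length()) < 0) {
                    return true;
                }
            } else if (className.equals(ex)) {
                return true;
            }
        }
        return false;
    }

    // Renames a class unless it is outside the pattern or excluded.
    static String relocate(String className, List<String> excludes) {
        if (className.startsWith(PATTERN + ".") && !excluded(className, excludes)) {
            return SHADED + className.substring(PATTERN.length());
        }
        return className;
    }

    // JVM descriptor of a method taking one object and returning one object.
    static String descriptor(String paramType, String returnType) {
        return "(L" + paramType.replace('.', '/') + ";)L"
            + returnType.replace('.', '/') + ";";
    }

    // Descriptor of allocateNew after every referenced type is relocated.
    static String bundledAllocateNew(List<String> excludes) {
        return descriptor(
            relocate("org.apache.arrow.memory.BufferAllocator", excludes),
            relocate("org.apache.arrow.c.ArrowArrayStream", excludes));
    }

    public static void main(String[] args) {
        System.out.println("caller expects: " + descriptor(
            "org.apache.arrow.memory.BufferAllocator",
            "org.apache.arrow.c.ArrowArrayStream"));
        System.out.println("bundle offers:  " + bundledAllocateNew(CURRENT_EXCLUDES));
    }
}
```

Under this model the return type stays unshaded while the parameter type is renamed, so the JVM finds no method with the descriptor the caller was compiled against. That is the lookup reported in the NoSuchMethodError above.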
Why it's been latent
The bug has been present since v1.0.0 but only fires when:
- Some other code on the same JVM calls ArrowArrayStream.allocateNew(BufferAllocator) with a vanilla BufferAllocator, AND
- The gluten bundle's class wins resolution in the AppClassLoader (it ships on extraClassPath wildcards)
Most pure-gluten workloads don't hit it because gluten's own internal callers always use the shaded type. The bug becomes user-facing whenever a Spark app pulls in another library that uses Arrow C-Data (Iceberg's Arrow vector layer, Lance's Java writer, Snowflake JDBC's Arrow result decoder, etc.).
In our case (Lance Java + IBM CP4D Spark cluster), it surfaces because Lance's LanceDataWriter calls ArrowArrayStream.allocateNew(LanceRuntime.allocator()) — which is the public BufferAllocator — to hand off batches to the native Lance writer.
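To check whether a given JVM meets the second condition, one option is to list every classpath location that provides the class. The sketch below uses only the JDK's ClassLoader.getResources; the class DuplicateClassFinder is a hypothetical helper. With the usual parent-first delegation the first entry is typically the copy that gets loaded, though that ordering is an assumption about the loader in use.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper for spotting duplicate class providers.
class DuplicateClassFinder {

    // Returns every location visible to the current class loader chain that
    // contains a .class file for className, in the order the loader reports.
    static List<String> providersOf(String className) {
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        List<String> found = new ArrayList<>();
        try {
            for (URL url : Collections.list(loader.getResources(resource))) {
                found.add(url.toString());
            }
        } catch (IOException e) {
            // An unreadable classpath entry ends the scan in this sketch;
            // whatever was collected so far is still returned.
        }
        return found;
    }

    public static void main(String[] args) {
        for (String location : providersOf("org.apache.arrow.c.ArrowArrayStream")) {
            System.out.println(location);
        }
    }
}
```

More than one line of output means two jars ship the same class name, which is the precondition for the conflict described above.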
Proposed fix
Add org.apache.arrow.memory.** (and possibly org.apache.arrow.vector.** for symmetry; see Discussion below) to the relocation excludes, so the bundled ArrowArrayStream references the public BufferAllocator and matches everyone else's API:
<relocation>
  <pattern>org.apache.arrow</pattern>
  <shadedPattern>${gluten.shade.packageName}.org.apache.arrow</shadedPattern>
  <!--arrow's C and dataset wrapper refers to the original class path,
      so we should not relocate here. Their public API takes
      org.apache.arrow.memory.* and returns org.apache.arrow.vector.*,
      which therefore must also be left unshaded so the bundled C-Data
      classes match the public Apache Arrow API.-->
  <excludes>
    <exclude>org.apache.arrow.c.*</exclude>
    <exclude>org.apache.arrow.c.jni.*</exclude>
    <exclude>org.apache.arrow.memory.**</exclude>
    <exclude>org.apache.arrow.vector.**</exclude>
    <exclude>org.apache.arrow.dataset.**</exclude>
  </excludes>
</relocation>
The smaller fix (adding only org.apache.arrow.memory.**) addresses the immediate BufferAllocator mismatch. Adding vector.** is necessary if any C-Data method returns or accepts vector types — which Data.exportVectorSchemaRoot(...) does.
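One way to decide which packages need excluding is to scan the public method signatures of the unshaded classes for types from a given package. The following is a rough sketch under stated assumptions: ApiSurfaceScan is a hypothetical helper, it looks only at public methods (not constructors, fields, or generics), and it would need to run against vanilla Arrow on the classpath to be meaningful.

```java
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: finds types from one package that appear in the
// public method signatures of another class.
class ApiSurfaceScan {

    // Collects parameter and return types of cls's public methods whose
    // fully qualified name starts with packagePrefix.
    static Set<String> referencedTypes(Class<?> cls, String packagePrefix) {
        Set<String> hits = new TreeSet<>();
        for (Method m : cls.getMethods()) {
            collect(m.getReturnType(), packagePrefix, hits);
            for (Class<?> param : m.getParameterTypes()) {
                collect(param, packagePrefix, hits);
            }
        }
        return hits;
    }

    private static void collect(Class<?> type, String prefix, Set<String> hits) {
        // Unwrap array types so that Foo[] counts as a reference to Foo.
        while (type.isArray()) {
            type = type.getComponentType();
        }
        if (type.getName().startsWith(prefix)) {
            hits.add(type.getName());
        }
    }

    // Usage: java ApiSurfaceScan <class name> <package prefix>
    public static void main(String[] args) throws Exception {
        for (String name : referencedTypes(Class.forName(args[0]), args[1])) {
            System.out.println(name);
        }
    }
}
```

For example, running it with org.apache.arrow.c.Data and the prefix org.apache.arrow.vector. would list the vector types that cross the C-Data boundary. A non-empty result for a relocated package indicates that the package also needs an exclude.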
Discussion: why not just exclude arrow.dataset and the rest of arrow?
Gluten relocates Arrow precisely to avoid version conflicts with the user's Arrow. The C-Data exclusion was a partial walk-back of that strategy because the JNI native code can't be relocated. The fix here is just the consistent extension: anything reachable through the unshaded API surface must be unshaded.
This means gluten's internal users of BufferAllocator/vector.* will see whatever Arrow version is on the user's classpath, not gluten's bundled one. That's fine if gluten's compiled-against version is API-compatible with the user's version — Arrow Java has been ABI/API stable from 7.x through 18.x for the common types.
If gluten needs a specific BufferAllocator API that the user's Arrow doesn't provide, that's a hard incompatibility and needs a separate fix (e.g., gluten provides its own non-conflicting class name).
Affected files
package/pom.xml (one block, ~3 lines added)
Tests
Add a small standalone Java test that asserts the public method signature of the bundled ArrowArrayStream. It would live under package/src/test/... so that it runs as part of the package module's tests.
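A minimal sketch of what such a check could look like, assuming no particular test framework: the class PublicSignatureCheck and its helper are hypothetical, and class names are passed as strings so the check compiles without Arrow on the compile classpath.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical signature check; not the actual test from the gluten repo.
class PublicSignatureCheck {

    // True if className declares a public static method with this name and
    // exactly these fully qualified parameter type names.
    static boolean hasPublicStaticMethod(String className, String methodName,
                                         String... paramTypeNames) {
        Class<?> cls;
        try {
            cls = Class.forName(className);
        } catch (ClassNotFoundException e) {
            return false;
        }
        for (Method m : cls.getDeclaredMethods()) {
            int mods = m.getModifiers();
            if (!m.getName().equals(methodName)
                    || !Modifier.isPublic(mods) || !Modifier.isStatic(mods)) {
                continue;
            }
            Class<?>[] params = m.getParameterTypes();
            if (params.length != paramTypeNames.length) {
                continue;
            }
            boolean same = true;
            for (int i = 0; i < params.length; i++) {
                if (!params[i].getName().equals(paramTypeNames[i])) {
                    same = false;
                }
            }
            if (same) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean ok = hasPublicStaticMethod(
            "org.apache.arrow.c.ArrowArrayStream",
            "allocateNew",
            "org.apache.arrow.memory.BufferAllocator");
        if (!ok) {
            throw new AssertionError(
                "bundled ArrowArrayStream.allocateNew does not take the public BufferAllocator");
        }
        System.out.println("OK");
    }
}
```

Run against the shaded bundle jar alone, this should fail with the current excludes and pass once org.apache.arrow.memory.** is left unshaded.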
Severity / urgency
- Medium-high: blocks any Spark workload that combines gluten with another library using Arrow C-Data
- Has been latent for ~2 years (since v1.0.0); affecting users now as more libraries adopt Arrow C-Data for native interop
- Workaround possible per-app (re-shading at fat-jar level, classpath ordering tricks) but fragile and doesn't help libraries that need to work without app-level fat-jar control
References
- Same bug visible since v1.0.0: confirmed by checking package/pom.xml on tags v1.0.0, v1.1.0, v1.2.0, v1.3.0, v1.4.0, v1.5.0, v1.6.0, main