Shaded ArrowArrayStream.allocateNew signature points at gluten-shaded BufferAllocator, breaking Arrow C-Data interop #12225

@sezruby

ArrowArrayStream.allocateNew(BufferAllocator) in shaded bundle takes wrong (relocated) parameter type, breaking interop with vanilla Apache Arrow callers

Summary

In the gluten-velox bundle, org.apache.arrow.c.ArrowArrayStream.allocateNew(BufferAllocator) takes the gluten-internal shaded type org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator instead of the public org.apache.arrow.memory.BufferAllocator. On clusters that load the bundle into the JVM AppClassLoader (e.g. via a wildcard extraClassPath), the bundled class shadows the user's vanilla Arrow ArrowArrayStream. Any caller (Lance, Iceberg, Snowflake JDBC, or anything else using Arrow C-Data) then fails with NoSuchMethodError, because the method descriptor it was compiled against names the public BufferAllocator while the bundled class only declares the overload taking gluten's shaded one.

This is a shading-config bug in package/pom.xml: org.apache.arrow.c.* is excluded from relocation (correct, because Arrow's native C-Data JNI hardcodes those class names), but org.apache.arrow.memory.* is relocated. Since org.apache.arrow.c.ArrowArrayStream references org.apache.arrow.memory.BufferAllocator in its public API signatures, the resulting class is internally inconsistent: a public, unrelocated class whose method signatures name relocated, gluten-internal types.

Repro

Minimal standalone repro in Scala, using only Apache Arrow Java (no other deps):

import org.apache.arrow.c.ArrowArrayStream
import org.apache.arrow.memory.RootAllocator

object GlutenArrowConflictRepro {
  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)
    val stream = ArrowArrayStream.allocateNew(allocator)  // <-- fails here
    stream.close()
    allocator.close()
  }
}

Run as a Spark application on any cluster that has gluten's bundle on the wildcard extraClassPath (e.g. IBM CP4D spark175 engine ships gluten 1.7.0-WXD233RC1 at /opt/gluten/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.7.0-...jar).

Expected

=== Probe: where does ArrowArrayStream resolve from? ===
ArrowArrayStream class loaded from: file:/<path-to-vanilla-arrow-c-data>.jar
declared methods:
  public static org.apache.arrow.c.ArrowArrayStream
  org.apache.arrow.c.ArrowArrayStream.allocateNew(org.apache.arrow.memory.BufferAllocator)
=== Attempt ===
OK
DONE

Actual

=== Probe: where does ArrowArrayStream resolve from? ===
ArrowArrayStream class loaded from: file:/opt/gluten/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.7.0-...jar
declared methods:
  public static org.apache.arrow.c.ArrowArrayStream
  org.apache.arrow.c.ArrowArrayStream.allocateNew(
    org.apache.gluten.shaded.org.apache.arrow.memory.BufferAllocator)   <-- shaded type!
=== Attempt ===
FAILED: java.lang.NoSuchMethodError: org/apache/arrow/c/ArrowArrayStream.allocateNew(
  Lorg/apache/arrow/memory/BufferAllocator;)Lorg/apache/arrow/c/ArrowArrayStream;
  (loaded from file:/opt/gluten/jars/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.7.0-...jar
   by jdk.internal.loader.ClassLoaders$AppClassLoader)
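
The repro above does not itself print the "Probe" lines. A minimal sketch of a probe that produces comparable output is given here in Java, using only JDK reflection. The class name ClassProbe and its helper methods are hypothetical names chosen for illustration, not part of Arrow or Gluten.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

/** Hypothetical helper: reports where a class was loaded from and what it declares. */
final class ClassProbe {

    /** Code-source URL of the class, or "<none>" when the JVM reports no code source. */
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
            ? "<none>"
            : src.getLocation().toString();
    }

    /** Builds a probe report for the named class; fails loudly if it cannot be found. */
    static String describe(String className) {
        Class<?> cls;
        try {
            // 'false' avoids running static initializers just to inspect the class.
            cls = Class.forName(className, false, ClassProbe.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("class not on classpath: " + className, e);
        }
        StringBuilder sb = new StringBuilder();
        sb.append(className).append(" class loaded from: ")
          .append(locationOf(cls)).append('\n');
        sb.append("declared methods:\n");
        for (Method m : cls.getDeclaredMethods()) {
            // Method.toString() prints fully qualified parameter types,
            // which is what exposes a shaded BufferAllocator.
            sb.append("  ").append(m).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.arrow.c.ArrowArrayStream";
        try {
            System.out.print(describe(name));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running it in the same JVM configuration as the failing application (same extraClassPath) is what makes the result meaningful, since the loaded copy depends on classpath order.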

Root cause

package/pom.xml, around lines 121-130 (same on every release since v1.0.0):

<relocation>
  <pattern>org.apache.arrow</pattern>
  <shadedPattern>${gluten.shade.packageName}.org.apache.arrow</shadedPattern>
  <!--arrow's C and dataset wrapper refers to the original class path,
      so we should not relocate here-->
  <excludes>
    <exclude>org.apache.arrow.c.*</exclude>
    <exclude>org.apache.arrow.c.jni.*</exclude>
    <exclude>org.apache.arrow.dataset.**</exclude>
  </excludes>
</relocation>

The intent is correct (don't relocate JNI-bound classes). But the public API of org.apache.arrow.c.* accepts and returns org.apache.arrow.memory.* types. The excludes only keep the org.apache.arrow.c.* classes themselves from being renamed; references inside those classes to relocated classes are still rewritten. So when ArrowArrayStream is bundled without relocation but BufferAllocator IS relocated, the bundled ArrowArrayStream's method signatures are rewritten at shade time (not at compile time) to name the shaded BufferAllocator. The bundled class is then incompatible with any caller passing a vanilla BufferAllocator.
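
The effect on a method descriptor can be illustrated with a toy model in Java. This is a simplified sketch of prefix relocation with excludes, not the shade plugin's actual implementation. The class name RelocationSketch is hypothetical, the excludes are approximated as plain package prefixes, and the shaded package name is taken from the "Actual" output above.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model: relocate class names inside a JVM method descriptor, honoring excludes. */
final class RelocationSketch {
    // Internal-form (slash-separated) prefix being relocated and its shaded target.
    static final String PATTERN = "org/apache/arrow/";
    static final String SHADED = "org/apache/gluten/shaded/org/apache/arrow/";
    // Assumption: excludes approximated as package prefixes for this sketch.
    static final List<String> EXCLUDED_PREFIXES =
        List.of("org/apache/arrow/c/", "org/apache/arrow/dataset/");

    // Matches object types in a descriptor, e.g. "Lorg/apache/arrow/c/ArrowArrayStream;".
    private static final Pattern OBJECT_TYPE = Pattern.compile("L([^;]+);");

    /** Relocates one internal class name unless it is outside the pattern or excluded. */
    static String relocateClass(String internalName) {
        if (!internalName.startsWith(PATTERN)) {
            return internalName;
        }
        for (String excluded : EXCLUDED_PREFIXES) {
            if (internalName.startsWith(excluded)) {
                return internalName;
            }
        }
        return SHADED + internalName.substring(PATTERN.length());
    }

    /** Applies relocateClass to every object type named in the descriptor. */
    static String relocateDescriptor(String descriptor) {
        Matcher m = OBJECT_TYPE.matcher(descriptor);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replaced = "L" + relocateClass(m.group(1)) + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replaced));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Descriptor of the public ArrowArrayStream.allocateNew(BufferAllocator).
        String publicDescriptor =
            "(Lorg/apache/arrow/memory/BufferAllocator;)Lorg/apache/arrow/c/ArrowArrayStream;";
        // The return type stays put (excluded); the parameter type is relocated.
        System.out.println(relocateDescriptor(publicDescriptor));
    }
}
```

In this model the return type is left alone because it falls under an exclude, while the parameter type is moved under the shaded package, which mirrors the mismatch reported in the NoSuchMethodError.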

The same applies to other public methods on the boundary between org.apache.arrow.c.* and org.apache.arrow.memory.*: Data.exportArrayStream(BufferAllocator, ...), ArrowSchema.allocateNew(BufferAllocator), etc.

Why it's been latent

The bug has been present since v1.0.0 but only fires when:

  1. Some other code on the same JVM calls ArrowArrayStream.allocateNew(BufferAllocator) with a vanilla BufferAllocator, AND
  2. The gluten bundle's class wins resolution in the AppClassLoader (it ships on extraClassPath wildcards)

Most pure-gluten workloads don't hit it because gluten's own internal callers always use the shaded type. The bug becomes user-facing whenever a Spark app pulls in another library that uses Arrow C-Data (Iceberg's Arrow vector layer, Lance's Java writer, Snowflake JDBC's Arrow result decoder, etc.).

In our case (Lance Java + IBM CP4D Spark cluster), it surfaces because Lance's LanceDataWriter calls ArrowArrayStream.allocateNew(LanceRuntime.allocator()) — which is the public BufferAllocator — to hand off batches to the native Lance writer.
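
To check whether the second condition holds on a given cluster, one option is to list every copy of the class visible to the system class loader. The following is a minimal Java sketch using only the JDK; DuplicateClassFinder is a hypothetical name. It only inspects the system class loader, so jars added through other loaders (for example application jars loaded by Spark at runtime) would not show up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: lists every classpath location that provides a given class. */
final class DuplicateClassFinder {

    /** Converts "a.b.C" to the resource path "a/b/C.class". */
    static String toResourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** All URLs under which the system class loader can see the class file. */
    static List<URL> locations(String className) {
        try {
            ClassLoader loader = ClassLoader.getSystemClassLoader();
            return Collections.list(loader.getResources(toResourcePath(className)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.arrow.c.ArrowArrayStream";
        List<URL> found = locations(name);
        System.out.println(found.size() + " copy/copies of " + name);
        for (URL url : found) {
            System.out.println("  " + url);
        }
    }
}
```

More than one entry for org.apache.arrow.c.ArrowArrayStream, with the gluten bundle among them, would indicate that the outcome depends on which copy the class loader resolves first.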

Proposed fix

Add org.apache.arrow.memory.** (and possibly org.apache.arrow.vector.** for symmetry — see Discussion below) to the relocation excludes, so the bundled ArrowArrayStream references the public BufferAllocator and matches everyone else's API:

<relocation>
  <pattern>org.apache.arrow</pattern>
  <shadedPattern>${gluten.shade.packageName}.org.apache.arrow</shadedPattern>
  <!--arrow's C and dataset wrapper refers to the original class path,
      so we should not relocate here. Their public API takes
      org.apache.arrow.memory.* and returns org.apache.arrow.vector.*,
      which therefore must also be left unshaded so the bundled C-Data
      classes match the public Apache Arrow API.-->
  <excludes>
    <exclude>org.apache.arrow.c.*</exclude>
    <exclude>org.apache.arrow.c.jni.*</exclude>
    <exclude>org.apache.arrow.memory.**</exclude>
    <exclude>org.apache.arrow.vector.**</exclude>
    <exclude>org.apache.arrow.dataset.**</exclude>
  </excludes>
</relocation>

The smaller fix (adding only org.apache.arrow.memory.**) addresses the immediate BufferAllocator mismatch. Adding vector.** is necessary if any C-Data method returns or accepts vector types — which Data.exportVectorSchemaRoot(...) does.

Discussion: why not just exclude arrow.dataset and the rest of arrow?

Gluten relocates Arrow precisely to avoid version conflicts with the user's Arrow. The C-Data exclusion was a partial walk-back of that strategy because the JNI native code can't be relocated. The fix here is just the consistent extension: anything reachable through the unshaded API surface must be unshaded.
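
One way to enumerate what is "reachable through the unshaded API surface" is to reflect over the public methods of an unshaded class and collect the packages of its parameter and return types. The Java sketch below does this with JDK reflection only. ApiSurfaceScan is a hypothetical name, and the scan covers public methods only (not constructors, fields, or supertypes), so it would understate the full surface.

```java
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: packages referenced by a class's public method signatures. */
final class ApiSurfaceScan {

    /** Package of a type, unwrapping arrays; empty string for primitives and void. */
    static String packageOf(Class<?> type) {
        while (type.isArray()) {
            type = type.getComponentType();
        }
        if (type.isPrimitive()) {
            return "";
        }
        String name = type.getName();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(0, dot);
    }

    /** Sorted packages starting with 'prefix' that appear in public method signatures. */
    static Set<String> referencedPackages(Class<?> cls, String prefix) {
        Set<String> packages = new TreeSet<>();
        for (Method m : cls.getMethods()) {
            collect(packages, m.getReturnType(), prefix);
            for (Class<?> param : m.getParameterTypes()) {
                collect(packages, param, prefix);
            }
        }
        return packages;
    }

    private static void collect(Set<String> out, Class<?> type, String prefix) {
        String pkg = packageOf(type);
        if (!pkg.isEmpty() && (pkg + ".").startsWith(prefix)) {
            out.add(pkg);
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.arrow.c.Data";
        try {
            Class<?> cls = Class.forName(name, false, ApiSurfaceScan.class.getClassLoader());
            // Any org.apache.gluten.shaded.* entry here marks a leaked shaded type.
            System.out.println(referencedPackages(cls, "org.apache."));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not on the classpath");
        }
    }
}
```

Run against the bundled org.apache.arrow.c classes, the resulting package list would be a starting point for deciding which packages need to join the excludes.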

This means gluten's internal users of BufferAllocator/vector.* will see whatever Arrow version is on the user's classpath, not gluten's bundled one. That's fine if gluten's compiled-against version is API-compatible with the user's version — Arrow Java has been ABI/API stable from 7.x through 18.x for the common types.

If gluten needs a specific BufferAllocator API that the user's Arrow doesn't provide, that's a hard incompatibility and needs a separate fix (e.g., gluten provides its own non-conflicting class name).

Affected files

  • package/pom.xml (one block, ~3 lines added)

Tests

Add a small standalone Java test that asserts the public method signature on the bundled ArrowArrayStream. It would live under package/src/test/... so that it runs as part of the package module's tests.
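
A minimal sketch of what such a check could look like, written as plain Java with reflection rather than against any particular test framework. SignatureCheck and hasPublicStaticMethod are hypothetical names, and the check compares parameter types by name so that it does not need to compile against either the public or the shaded BufferAllocator.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

/** Hypothetical check: does a class declare a public static method with these parameter types? */
final class SignatureCheck {

    static boolean hasPublicStaticMethod(
            String className, String methodName, String... paramTypeNames) {
        Class<?> cls;
        try {
            cls = Class.forName(className, false, SignatureCheck.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            return false; // class not on the classpath at all
        }
        for (Method m : cls.getDeclaredMethods()) {
            int mod = m.getModifiers();
            if (!m.getName().equals(methodName)
                    || !Modifier.isPublic(mod)
                    || !Modifier.isStatic(mod)) {
                continue;
            }
            Class<?>[] params = m.getParameterTypes();
            if (params.length != paramTypeNames.length) {
                continue;
            }
            boolean same = true;
            for (int i = 0; i < params.length; i++) {
                // Compare by fully qualified name so a shaded type is a mismatch.
                same &= params[i].getName().equals(paramTypeNames[i]);
            }
            if (same) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean ok = hasPublicStaticMethod(
            "org.apache.arrow.c.ArrowArrayStream",
            "allocateNew",
            "org.apache.arrow.memory.BufferAllocator");
        System.out.println(ok
            ? "OK: allocateNew takes the public BufferAllocator"
            : "FAIL: allocateNew(org.apache.arrow.memory.BufferAllocator) not found");
    }
}
```

In a real test this boolean would be turned into an assertion, and the check would need to run with the shaded bundle as the only provider of org.apache.arrow.c.ArrowArrayStream on the test classpath; otherwise a vanilla Arrow jar could satisfy it and hide the problem.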

Severity / urgency

  • Medium-high: blocks any Spark workload that combines gluten with another library using Arrow C-Data
  • Has been latent for ~2 years (since v1.0.0); affecting users now as more libraries adopt Arrow C-Data for native interop
  • Workaround possible per-app (re-shading at fat-jar level, classpath ordering tricks) but fragile and doesn't help libraries that need to work without app-level fat-jar control

References

  • Same bug visible since v1.0.0: confirmed by checking package/pom.xml on tags v1.0.0, v1.1.0, v1.2.0, v1.3.0, v1.4.0, v1.5.0, v1.6.0, main
