[CORE] Unbundle Arrow memory + vector from gluten-velox-bundle (Draft)#12245

Closed
sezruby wants to merge 2 commits into
apache:mainfrom
sezruby:arrow-unbundle-from-velox

@sezruby sezruby commented Jun 5, 2026

Draft. Stacked on #12244. The diff below is what the unbundle looks like in pom-form; cross-distro testing (vanilla 3.5, DBR 16.4, Cloudera, 4.0/4.1) is still TODO and gates merge.

What changes were proposed in this pull request?

Stop bundling `arrow-memory-*` and `arrow-vector` in `gluten-velox-bundle`. Mark them as `scope=provided` in `gluten-arrow/pom.xml` and rely on Spark's own Arrow distribution at runtime (`$SPARK_HOME/jars/` for Spark 3.x; declared in Spark 4.x's pom).

`arrow-c-data` and `arrow-dataset` stay bundled — Spark does not ship those.

Why

Follow-up from the #12226 discussion. The bundled-and-shaded-Arrow approach is the source of #12225 (and the similar #7423): when gluten's bundle wins classloader resolution, its class signatures collide with the user's vanilla Arrow. #12226 fixed the immediate `NoSuchMethodError` by un-shading, but as @zhztheplayer noted, relying on "Memory and vector APIs should be stable across minor versions" is a real risk worth eliminating. The cleanest fix is to not ship them at all.

Effects:

  • gluten-velox-bundle no longer contains any `org.apache.arrow.memory.*` or `org.apache.arrow.vector.*` classes. The class-shadowing problem from #12225 ("Shaded ArrowArrayStream.allocateNew signature points at gluten-shaded BufferAllocator, breaking Arrow C-Data interop") disappears by construction.
  • The `org.apache.arrow` shade-relocation block in `package/pom.xml` is removed (nothing to relocate, since memory/vector aren't bundled and c-data/dataset were already excluded).
  • `arrow-c-data` / `arrow-dataset` remain bundled. With no relocation, their public API signatures bind to vanilla `BufferAllocator` / `VectorSchemaRoot` — exactly what every other Arrow C-Data caller on the classpath expects.
  • `backends-velox/pom.xml` re-declares `arrow-memory-core` and `arrow-vector` at `provided` scope so its compile classpath still resolves them after the gluten-arrow scope flip. `gluten-ut/*` and `backends-clickhouse` already declare them locally.
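The first effect above can be checked mechanically against a built jar. The following is a minimal sketch, not the project's actual tooling: `check_bundle` and `make_jar` are hypothetical helper names, and the check only looks at unrelocated package paths (it assumes the shade-relocation block is already gone).

```python
import io
import zipfile

# Package prefixes that should be absent from the bundle after the scope flip.
FORBIDDEN_PREFIXES = ("org/apache/arrow/memory/", "org/apache/arrow/vector/")
# Prefix that should still be present, since arrow-c-data stays bundled.
REQUIRED_PREFIX = "org/apache/arrow/c/"


def check_bundle(jar):
    """Return (leaked_entries, has_c_data) for a jar path or file-like object."""
    with zipfile.ZipFile(jar) as zf:
        names = zf.namelist()
    leaked = sorted(n for n in names if n.startswith(FORBIDDEN_PREFIXES))
    has_c_data = any(n.startswith(REQUIRED_PREFIX) for n in names)
    return leaked, has_c_data


def make_jar(entries):
    """Build an in-memory jar with empty entries (for demonstration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in entries:
            zf.writestr(name, b"")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    demo = make_jar(["org/apache/arrow/c/Data.class"])
    print(check_bundle(demo))
```

Pointing `check_bundle` at the real bundle jar path instead of the in-memory demo would give the same two answers: which memory/vector entries leaked in, and whether the C-Data classes are still there.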

Open questions / why this is a Draft

  1. `arrow.version` pin per Spark distro. `<arrow.version>15.0.0</arrow.version>` matches vanilla Spark 3.5.x. DBR 16.4 ships Spark 3.5 with Arrow 12.0.1, so gluten compiled against 15 might throw `NoSuchMethodError` on DBR. Need to either (a) downgrade to the lowest-common-denominator version, 12.0.1, for the Spark-3.5 profile, (b) add a DBR-specific profile, or (c) declare gluten DBR-incompatible. Cloudera flavors need similar verification.
  2. Cross-distro test matrix. Want to actually run the gluten test suite against vanilla 3.5, DBR 16.4, Cloudera CDS, and 4.0/4.1 before merging. CI here only covers vanilla.
  3. Velox C++ side still uses bundled Arrow. The cpp side links its own Arrow (the C++ patches in `ep/build-velox/src/modify_arrow.patch`); this PR only changes JVM-side bundling. The JVM ↔ C++ exchange happens via Arrow C-Data's stable ABI, so the JVM-side Arrow version doesn't need to match the C++-side one. Worth noting in case anyone assumes the two versions should track each other.
  4. `dev/check-arrow-c-shading.sh` from #12226 ("[GLUTEN-12225][CORE] Fix arrow.c shading: exclude memory/vector packages so public API stays unshaded") still passes: the bundle still has `org/apache/arrow/c/*` classes, and their signatures now reference unshaded `org.apache.arrow.memory.*` / `org.apache.arrow.vector.*` types (which resolve from Spark's bundled Arrow at runtime).
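Option (a) in item 1 amounts to picking the lowest Arrow version among the distros a profile must support. A minimal sketch of that selection, using only the two Spark 3.5 data points quoted above (the function names are illustrative, and Cloudera is left out because its Arrow version is not stated here):

```python
def parse_version(version):
    """Turn '15.0.0' into (15, 0, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def lowest_common_version(shipped):
    """Return the lowest Arrow version among distros (dict: distro -> version)."""
    return min(shipped.values(), key=parse_version)


# Arrow versions shipped with Spark 3.5, as quoted in item 1.
SPARK_35_DISTROS = {
    "vanilla Spark 3.5.x": "15.0.0",
    "DBR 16.4": "12.0.1",
}

if __name__ == "__main__":
    print(lowest_common_version(SPARK_35_DISTROS))
```

Versions are compared as integer tuples rather than strings, because a plain string comparison would rank "7.0.0" above "15.0.0". Compiling against the lowest version only helps if newer Arrow releases keep the APIs gluten calls, which is the same stability assumption discussed under "Why".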

How was this patch tested?

Local build only. CI green needed before un-drafting.

Closes / refs

Stacked on:

sezruby added 2 commits June 4, 2026 21:20
… Apache Arrow

The custom 15.0.0-gluten artifact coordinate forced every contributor to run
dev/build-arrow.sh before they could build gluten, even though the Java side
of that build no longer carries any load-bearing modifications:

* The 883-line modify_arrow_dataset_scan_option.patch added CSV / Substrait
  dataset Java classes (CsvFragmentScanOptions, ConvertUtil, etc.). Every
  consumer of those classes inside gluten was deleted by apache#12130 along with
  the Arrow-CSV / Arrow-Dataset JVM code path. The patch is no longer applied
  to the Arrow Java build here; the file itself is kept because get-velox.sh
  still copies it into Velox's CMake Arrow EP for the C++ side.
* support_ibm_power.patch (ppc64le → ppcle_64 in JniLoader) is still
  load-bearing for ppc64le builds, but does not require an artifact rename:
  it only patches the binary-resource lookup inside the arrow-c-data JNI jar
  and is still applied by build-arrow.sh.
* The C++ patches (modify_arrow.patch, cmake-compatibility.patch) are
  unchanged.

After this change, on x86_64 / aarch64 every gluten-arrow Arrow dependency
resolves from Maven Central (arrow-c-data:15.0.0, arrow-dataset:15.0.0,
arrow-vector:15.0.0, arrow-memory-{core,unsafe,netty}:15.0.0; 18.1.0 for
the Spark 4.x profiles). ppc64le builds still rely on dev/build-arrow.sh
to produce locally-patched 15.0.0 artifacts — the local-m2 install
overrides Central as before.

Note: this PR removes the artifact-rename indirection but does not yet
unbundle Arrow from the gluten-velox bundle. The bundle still ships
unshaded Arrow (per apache#12226) at the same vanilla coordinates. Removing
the bundled Arrow in favour of Spark's bundled copy is a separate
follow-up driven by the discussion on apache#12226.

Mark arrow-memory-{core,unsafe,netty} and arrow-vector as scope=provided in
gluten-arrow/pom.xml. They are bundled in Spark's distribution
($SPARK_HOME/jars/ for Spark 3.x; declared in Spark 4.x's pom), so the user's
classpath already has them at runtime — gluten does not need to ship its own
copy.

Effects:

* The gluten-velox bundle no longer ships ANY org.apache.arrow.memory.* or
  org.apache.arrow.vector.* classes. The class-shadowing problem from apache#12225
  goes away by construction — there is no gluten-shipped copy left to shadow
  the user's vanilla Arrow.
* The org.apache.arrow shade-relocation block in package/pom.xml becomes
  redundant and is removed: arrow-memory/vector are no longer in the bundle
  to relocate, and arrow-c-data / arrow-dataset (still bundled) were already
  excluded from relocation because their JNI binds to the original class
  names.
* arrow-c-data and arrow-dataset remain at scope=compile in gluten-arrow —
  Spark does NOT ship those, so gluten still bundles them. With the relocation
  block gone, their public method signatures naturally bind to the user's
  vanilla org.apache.arrow.memory.BufferAllocator / arrow-vector types,
  exactly matching what every other Arrow C-Data caller on the classpath
  expects.

Compile-classpath touch-ups:

* backends-velox/pom.xml: re-declare arrow-memory-core and arrow-vector at
  scope=provided. The transitive route through gluten-arrow no longer carries
  them after the scope flip, so backends-velox needs its own provided
  declaration to compile.
* gluten-ut/* and backends-clickhouse already declare arrow at provided
  scope locally, so they are unaffected.

Caveats:

* Spark 3.5 and earlier do NOT declare arrow-memory/arrow-vector in their
  Maven POM (they ship them inside the binary distribution only). gluten
  builds against the version pinned in `arrow.version`. Maintainers should
  keep `arrow.version` aligned with the lowest-common-denominator Arrow
  version across supported Spark distros (DBR 16.4 ships Arrow 12.0.1 with
  Spark 3.5; vanilla Spark 3.5.x ships 15.0.0 — the 15.0.0 default here is
  fine for vanilla Spark 3.5 but may need a compat profile for DBR/Cloudera
  flavors).
* dev/check-arrow-c-shading.sh added in apache#12226 still passes — the bundle
  still contains org/apache/arrow/c/* classes whose method signatures now
  reference unshaded org.apache.arrow.memory.* / org.apache.arrow.vector.*
  types (which are no longer in the bundle, but resolve at runtime from
  Spark's Arrow).

Builds on apache#12244 (drop the 15.0.0-gluten Arrow version rename). Addresses the
follow-up direction from apache#12226 discussion: "remove Arrow from the bundled
Gluten Jar and let users rely on Spark's bundled Arrow".
github-actions Bot added the CORE (works for Gluten Core), BUILD, and VELOX labels Jun 5, 2026

github-actions Bot commented Jun 5, 2026

Run Gluten Clickhouse CI on x86

@sezruby sezruby marked this pull request as ready for review June 5, 2026 04:30

sezruby commented Jun 5, 2026

Closing this — the unbundling direction turns out to be incompatible with gluten's current Spark 3.3 / 3.4 support, and I don't think the workaround is worth the risk.

cc @zhztheplayer @FelixYBW

What CI showed. Spark 3.5 / 4.0 / 4.1 lanes were on track, but spark-test-spark33 and spark-test-spark34 (and several tpc-test-* lanes built against them) failed early. Root cause traced to the bundled-Arrow being load-bearing for older Spark:

  • Spark 3.3.1 ships Arrow 7.0.0
  • Spark 3.4.4 ships Arrow 11.0.0
  • Spark 3.5.5 ships Arrow 15.0.0
  • Spark 4.0.x / 4.1.x ship Arrow 18.x

Gluten's parent pom.xml pins <arrow.version>15.0.0</arrow.version> and uses it at compile scope. Today that works because gluten bundles its own Arrow 15 into the velox bundle, which wins classloader resolution at runtime over Spark's older Arrow.

Once arrow-memory-* / arrow-vector flip to scope=provided (this PR), the bundle stops shipping Arrow. The compile classpath still has 15, but at runtime on Spark 3.3 / 3.4 only the older Arrow (7 / 11) is on the classpath — NoSuchMethodError / NoClassDefFoundError follow.
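The mismatch can be written down as a small table check. This is only a sketch built from the versions listed above: it compares major versions, records "18.x" as 18, and assumes the compile-side pin is 15.0.0 by default and 18.1.0 for the Spark 4.x profiles (as stated in the #12244 commit message). The function names are illustrative.

```python
# Arrow major version shipped by each Spark release, per the list above.
SHIPPED_ARROW_MAJOR = {
    "3.3.1": 7,
    "3.4.4": 11,
    "3.5.5": 15,
    "4.0.x": 18,
    "4.1.x": 18,
}


def compiled_arrow_major(spark_version):
    """Arrow major gluten compiles against: 18 on Spark 4.x profiles, else 15."""
    return 18 if spark_version.startswith("4.") else 15


def at_risk(spark_version):
    """True when Spark's runtime Arrow is older than the compile-time Arrow."""
    return SHIPPED_ARROW_MAJOR[spark_version] < compiled_arrow_major(spark_version)


# Spark releases that break once Arrow memory/vector become scope=provided.
BROKEN_LANES = [v for v in SHIPPED_ARROW_MAJOR if at_risk(v)]

if __name__ == "__main__":
    print(BROKEN_LANES)
```

Under these assumptions the only releases flagged are the 3.3 and 3.4 ones, which matches the lanes that failed in CI.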

Workarounds considered.

  1. Per-Spark-profile <arrow.version> overrides (3.3→7.0, 3.4→11.0, 3.5→15.0, 4.x→18.1). Compiles, but ships gluten built against Arrow 7 on the 3.3 profile, which is exactly the "API stability across versions" concern you raised on #12226 (replying "This sounds a real risk" to "Memory and vector APIs should be stable across minor versions"), now applied across an eight-major-version gap rather than a one-or-two-version gap. The surface area is too large to be confident without per-version testing.
  2. Conditional <scope> (provided on 3.5+, compile on 3.3/3.4). Works mechanically but is ugly and leaves the bug (#12225) latent on Spark 3.3 / 3.4.
  3. Drop Spark 3.3 / 3.4 support. Out of scope for this fix.
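For reference, the decision logic behind workarounds 1 and 2 is tiny, which is part of why they work mechanically. A sketch, with hypothetical names and the per-profile versions taken from item 1:

```python
# Per-profile Arrow major versions from workaround 1 (3.x profiles only).
ARROW_MAJOR_PER_PROFILE = {"3.3": 7, "3.4": 11, "3.5": 15}

# Spread of Arrow majors the same gluten source would have to compile against.
VERSION_GAP = max(ARROW_MAJOR_PER_PROFILE.values()) - min(ARROW_MAJOR_PER_PROFILE.values())


def arrow_scope(spark_version):
    """Workaround 2: 'provided' on Spark 3.5+, 'compile' on 3.3 / 3.4."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return "provided" if (major, minor) >= (3, 5) else "compile"


if __name__ == "__main__":
    for version in ("3.3", "3.4", "3.5", "4.0"):
        print(version, arrow_scope(version))
    print("gap:", VERSION_GAP)
```

The scope switch keeps the bundled Arrow 15 on 3.3 / 3.4, so the shadowing behind #12225 stays possible there. The gap of eight major versions is what makes workaround 1 hard to trust without per-version testing.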

None feels worth it as a one-shot, especially since #12226 already neutralized the immediate NoSuchMethodError from #12225 by un-shading the boundary types.

What I'm keeping. #12244 — drop the 15.0.0-gluten artifact rename, drop the dead modify_arrow_dataset_scan_option.patch from the Arrow JVM build, depend on vanilla Apache Arrow from Maven Central. CI green there. That gives non-ppc64le contributors a faster build-from-source path without changing the runtime/bundling story.

For follow-up. If gluten ever drops Spark 3.3 / 3.4, this unbundling work is small — the diff is ~3 poms. Happy to revisit then.
