[CORE] Unbundle Arrow memory + vector from gluten-velox-bundle (Draft) #12245
sezruby wants to merge 2 commits
… Apache Arrow
The custom 15.0.0-gluten artifact coordinate forced every contributor to run dev/build-arrow.sh before they could build gluten, even though the Java side of that build no longer carries any load-bearing modifications:
* The 883-line modify_arrow_dataset_scan_option.patch added CSV / Substrait dataset Java classes (CsvFragmentScanOptions, ConvertUtil, etc.). Every consumer of those classes inside gluten was deleted by apache#12130 along with the Arrow-CSV / Arrow-Dataset JVM code path. The patch is no longer applied to the Arrow Java build here; the file itself is kept because get-velox.sh still copies it into Velox's CMake Arrow EP for the C++ side.
* support_ibm_power.patch (ppc64le → ppcle_64 in JniLoader) is still load-bearing for ppc64le builds, but does not require an artifact rename: it only patches the binary-resource lookup inside the arrow-c-data JNI jar and is still applied by build-arrow.sh.
* The C++ patches (modify_arrow.patch, cmake-compatibility.patch) are unchanged.
After this change, on x86_64 / aarch64 every gluten-arrow Arrow dependency resolves from Maven Central (arrow-c-data:15.0.0, arrow-dataset:15.0.0, arrow-vector:15.0.0, arrow-memory-{core,unsafe,netty}:15.0.0; 18.1.0 for the Spark 4.x profiles). ppc64le builds still rely on dev/build-arrow.sh to produce locally-patched 15.0.0 artifacts; the local-m2 install overrides Central as before.
Note: this PR removes the artifact-rename indirection but does not yet unbundle Arrow from the gluten-velox bundle. The bundle still ships unshaded Arrow (per apache#12226) at the same vanilla coordinates. Removing the bundled Arrow in favour of Spark's bundled copy is a separate follow-up driven by the discussion on apache#12226.
Mark arrow-memory-{core,unsafe,netty} and arrow-vector as scope=provided in
gluten-arrow/pom.xml. They are bundled in Spark's distribution
($SPARK_HOME/jars/ for Spark 3.x; declared in Spark 4.x's pom), so the user's
classpath already has them at runtime — gluten does not need to ship its own
copy.
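A minimal sketch of what the scope flip might look like in gluten-arrow/pom.xml. The coordinates are the standard Arrow ones named above and `arrow.version` is the property mentioned in the caveats; the surrounding layout of the real pom is an assumption, not a copy of the diff:

```xml
<!-- Sketch only: the actual gluten-arrow/pom.xml may organize these entries differently. -->
<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-memory-core</artifactId>
  <version>${arrow.version}</version>
  <!-- provided: on the compile classpath, but not packaged; Spark supplies it at runtime -->
  <scope>provided</scope>
</dependency>
<!-- arrow-memory-unsafe, arrow-memory-netty and arrow-vector would get the same
     <scope>provided</scope> treatment. -->
```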
Effects:
* The gluten-velox bundle no longer ships ANY org.apache.arrow.memory.* or
org.apache.arrow.vector.* classes. The class-shadowing problem from apache#12225
goes away by construction — there is no gluten-shipped copy left to shadow
the user's vanilla Arrow.
* The org.apache.arrow shade-relocation block in package/pom.xml becomes
redundant and is removed: arrow-memory/vector are no longer in the bundle
to relocate, and arrow-c-data / arrow-dataset (still bundled) were already
excluded from relocation because their JNI binds to the original class
names.
* arrow-c-data and arrow-dataset remain at scope=compile in gluten-arrow —
Spark does NOT ship those, so gluten still bundles them. With the relocation
block gone, their public method signatures naturally bind to the user's
vanilla org.apache.arrow.memory.BufferAllocator / arrow-vector types,
exactly matching what every other Arrow C-Data caller on the classpath
expects.
Compile-classpath touch-ups:
* backends-velox/pom.xml: re-declare arrow-memory-core and arrow-vector at
scope=provided. The transitive route through gluten-arrow no longer carries
them after the scope flip, so backends-velox needs its own provided
declaration to compile.
* gluten-ut/* and backends-clickhouse already declare arrow at provided
scope locally, so they are unaffected.
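The backends-velox touch-up follows from how Maven handles scopes: a `provided` dependency is not passed on transitively, so a module that uses the classes directly has to declare them itself. A hedged sketch for one of the two artifacts, assuming the version comes from the same `arrow.version` property (the real pom may instead inherit it from dependency management):

```xml
<!-- Sketch only: illustrates the re-declaration in backends-velox/pom.xml. -->
<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-vector</artifactId>
  <version>${arrow.version}</version>
  <!-- Re-declared locally because gluten-arrow's provided-scope entry is not transitive -->
  <scope>provided</scope>
</dependency>
<!-- arrow-memory-core would be re-declared the same way. -->
```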
Caveats:
* Spark 3.5 and earlier do NOT declare arrow-memory/arrow-vector in their
Maven POM (they ship them inside the binary distribution only). gluten
builds against the version pinned in `arrow.version`. Maintainers should
keep `arrow.version` aligned with the lowest-common-denominator Arrow
version across supported Spark distros (DBR 16.4 ships Arrow 12.0.1 with
Spark 3.5; vanilla Spark 3.5.x ships 15.0.0 — the 15.0.0 default here is
fine for vanilla Spark 3.5 but may need a compat profile for DBR/Cloudera
flavors).
* dev/check-arrow-c-shading.sh added in apache#12226 still passes — the bundle
still contains org/apache/arrow/c/* classes whose method signatures now
reference unshaded org.apache.arrow.memory.* / org.apache.arrow.vector.*
types (which are no longer in the bundle, but resolve at runtime from
Spark's Arrow).
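The compat profile mentioned in the first caveat could be as small as a property override. The sketch below is hypothetical: the profile id `arrow-12` is made up for illustration, and 12.0.1 is simply the version the caveat attributes to DBR 16.4; nothing like this exists in the PR:

```xml
<!-- Hypothetical profile, not part of this PR. -->
<profile>
  <id>arrow-12</id>
  <properties>
    <!-- Compile against the Arrow version the target distro is expected to provide -->
    <arrow.version>12.0.1</arrow.version>
  </properties>
</profile>
```

Such a profile would be selected at build time with Maven's standard `-P` flag (here `-Parrow-12`), leaving the 15.0.0 default untouched for other builds.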
Builds on apache#12244 (drop the 15.0.0-gluten Arrow version rename). Addresses the
follow-up direction from apache#12226 discussion: "remove Arrow from the bundled
Gluten Jar and let users rely on Spark's bundled Arrow".
Run Gluten Clickhouse CI on x86
Closing this: the unbundling direction turns out to be incompatible with gluten's current Spark 3.3 / 3.4 support, and I don't think the workaround is worth the risk.
What CI showed. Spark 3.5 / 4.0 / 4.1 lanes were on track, but
Gluten's parent
Once
Workarounds considered.
None feels worth it as a one-shot, especially since #12226 already neutralized the immediate
What I'm keeping. #12244: drop the
For follow-up. If gluten ever drops Spark 3.3 / 3.4, this unbundling work is small: the diff is ~3 poms. Happy to revisit then.
Draft. Stacked on #12244. The diff below is what the unbundle looks like in pom-form; cross-distro testing (vanilla 3.5, DBR 16.4, Cloudera, 4.0/4.1) is still TODO and gates merge.
What changes were proposed in this pull request?
Stop bundling `arrow-memory-*` and `arrow-vector` in `gluten-velox-bundle`. Mark them as `scope=provided` in `gluten-arrow/pom.xml` and rely on Spark's own Arrow distribution at runtime (`$SPARK_HOME/jars/` for Spark 3.x; declared in Spark 4.x's pom).
`arrow-c-data` and `arrow-dataset` stay bundled — Spark does not ship those.
Why
Follow-up from the #12226 discussion. The bundled-and-shaded-Arrow approach is the source of #12225 (and the similar #7423): when gluten's bundle wins classloader resolution, its class signatures collide with the user's vanilla Arrow. #12226 fixed the immediate `NoSuchMethodError` by un-shading, but as @zhztheplayer noted, relying on the assumption that "Memory and vector APIs should be stable across minor versions" is a real risk worth eliminating. The cleanest fix is to not ship them at all.
Effects:
Open questions / why this is a Draft
How was this patch tested?
Local build only. CI green needed before un-drafting.
Closes / refs
Stacked on: