[CORE] Drop 15.0.0-gluten Arrow version rename, depend on vanilla Apache Arrow #12244
sezruby wants to merge 1 commit into
… Apache Arrow

The custom 15.0.0-gluten artifact coordinate forced every contributor to run dev/build-arrow.sh before they could build gluten, even though the Java side of that build no longer carries any load-bearing modifications:

* The 883-line modify_arrow_dataset_scan_option.patch added CSV / Substrait dataset Java classes (CsvFragmentScanOptions, ConvertUtil, etc.). Every consumer of those classes inside gluten was deleted by apache#12130 along with the Arrow-CSV / Arrow-Dataset JVM code path. The patch is no longer applied to the Arrow Java build here; the file itself is kept because get-velox.sh still copies it into Velox's CMake Arrow EP for the C++ side.
* support_ibm_power.patch (ppc64le → ppcle_64 in JniLoader) is still load-bearing for ppc64le builds, but does not require an artifact rename: it only patches the binary-resource lookup inside the arrow-c-data JNI jar and is still applied by build-arrow.sh.
* The C++ patches (modify_arrow.patch, cmake-compatibility.patch) are unchanged.

After this change, on x86_64 / aarch64 every gluten-arrow Arrow dependency resolves from Maven Central (arrow-c-data:15.0.0, arrow-dataset:15.0.0, arrow-vector:15.0.0, arrow-memory-{core,unsafe,netty}:15.0.0; 18.1.0 for the Spark 4.x profiles). ppc64le builds still rely on dev/build-arrow.sh to produce locally patched 15.0.0 artifacts; the local-m2 install overrides Central as before.

Note: this PR removes the artifact-rename indirection but does not yet unbundle Arrow from the gluten-velox bundle. The bundle still ships unshaded Arrow (per apache#12226) at the same vanilla coordinates. Removing the bundled Arrow in favour of Spark's bundled copy is a separate follow-up driven by the discussion on apache#12226.
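The platform split described in the commit message can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name `needs_local_arrow_build` and its return-code convention are hypothetical and do not exist in the Gluten repository; only the architecture names and the role of dev/build-arrow.sh come from the description above.

```shell
# Hypothetical helper (not part of Gluten): decide whether a contributor still
# has to run dev/build-arrow.sh before building, based on the CPU architecture.
#   returns 0 -> locally patched Arrow Java artifacts are still required
#   returns 1 -> vanilla Arrow jars resolve from Maven Central
#   returns 2 -> architecture not covered by the PR description
needs_local_arrow_build() {
  # Use the first argument if given, otherwise the current machine's architecture.
  arch="${1:-$(uname -m)}"
  case "$arch" in
    ppc64le)
      # support_ibm_power.patch is only applied by the local build.
      return 0 ;;
    x86_64|aarch64)
      return 1 ;;
    *)
      return 2 ;;
  esac
}
```

A caller could then write `if needs_local_arrow_build; then ./dev/build-arrow.sh; fi` at the top of a bootstrap script; whether the project wants such gating is a separate decision.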
Run Gluten Clickhouse CI on x86
# Arrow Java libraries
${MVN_CMD} install -Parrow-jni -P arrow-c-data -pl c,dataset -am \
  -Darrow.c.jni.dist.dir=$ARROW_INSTALL_DIR/lib -Darrow.dataset.jni.dist.dir=$ARROW_INSTALL_DIR/lib -Darrow.cpp.build.dir=$ARROW_INSTALL_DIR/lib \
  -Dmaven.test.skip -Drat.skip -Dmaven.gitcommitid.skip -Dcheckstyle.skip -Dassembly.skipAssembly
Do we still need to build Arrow Java locally?
Mostly no — Maven Central's The reason it's still wired into
Happy to add a follow-up commit gating |
What changes were proposed in this pull request?

Drop the 15.0.0-gluten custom artifact coordinate that gluten currently builds via dev/build-arrow.sh. Switch every org.apache.arrow:* dependency to the vanilla ${arrow.version} coordinate (15.0.0 for the Spark 3.x default; 18.1.0 already in the Spark-4.0/4.1 profiles).

The arrow-gluten.version property and the versions:set -DnewVersion=15.0.0-gluten step in build-arrow.sh are removed. The modify_arrow_dataset_scan_option.patch is no longer applied to the Arrow Java build.

Why this is safe: the patch audit
build-arrow.sh previously applied four patches and renamed the resulting jars to 15.0.0-gluten. Auditing each:

* modify_arrow.patch: …
* modify_arrow_dataset_scan_option.patch: … CsvFragmentScanOptions, CsvConvertOptions, ConvertUtil, etc.; C++ file_csv / Substrait expression_internal / serde … CMake/resolve_dependency_modules/arrow/; get-velox.sh continues to copy the patch file into the Velox EP.
* cmake-compatibility.patch: …
* support_ibm_power.patch: … ppc64le → ppcle_64 arch case to JniLoader.java in arrow-c and arrow-dataset … build-arrow.sh, which now installs vanilla 15.0.0 (overriding the Central jars in their local m2).

A separate sweep (grep -rn 'org.apache.arrow.dataset' --include='*.java' --include='*.scala') shows that after #12130 the only main-source consumers of arrow-dataset are gluten-arrow/.../ArrowNativeMemoryPool and ArrowReservationListener, which use only the upstream org.apache.arrow.dataset.jni.{NativeMemoryPool, ReservationListener} types and no patched classes.

Effect on contributors
* x86_64 / aarch64: dev/build-arrow.sh is no longer required to bootstrap the build. All Arrow JVM dependencies resolve from Maven Central. CI / local builds skip the ~hour-long Arrow C++/Java compile.
* ppc64le: dev/build-arrow.sh is still required (for the patched arrow-c-data / arrow-dataset JNI binaries with ppc64le arch mapping). The script now installs arrow-vector:15.0.0 etc. into local m2, overriding Central. The dev loop is the same as before, just without the rename indirection.

Effect on shading
Independent of bundling. The bundled gluten-velox-bundle still ships unshaded org.apache.arrow.* per #12226. This PR doesn't change which Arrow artifacts end up in the bundle, only their coordinate. The follow-up to actually unbundle Arrow (use Spark's shipped Arrow at runtime) is tracked separately in the discussion under #12226.

How was this patch tested?
* mvn dependency:tree -pl gluten-arrow shows every org.apache.arrow:* resolved at vanilla 15.0.0 / 18.1.0 from Central.
* grep -rn 'arrow-gluten\.version\|15\.0\.0-gluten': no matches outside of expected pom diff context.

References

* … modify_arrow_dataset_scan_option.patch
* … zhztheplayer and FelixYBW proposed this unbundling direction
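The grep check listed under "How was this patch tested?" could be wrapped into a reusable guard, for example in a local pre-push script. The sketch below is an assumption-laden illustration: the function name `check_no_renamed_arrow_refs` is hypothetical, and it uses `grep -E` with the same two patterns as the PR description rather than the `\|` form, purely for portability across grep implementations.

```shell
# Hypothetical guard (not part of Gluten): fail if any file under the given
# directory still mentions the removed property or the renamed version.
check_no_renamed_arrow_refs() {
  # Directory to scan; defaults to the current directory.
  root="${1:-.}"
  # grep exits 0 when at least one match is found, printing file:line:text.
  if grep -rnE 'arrow-gluten\.version|15\.0\.0-gluten' "$root"; then
    echo "stale references to the renamed Arrow coordinate found" >&2
    return 1
  fi
  return 0
}
```

Run from a clean checkout, `check_no_renamed_arrow_refs .` would complement the `mvn dependency:tree -pl gluten-arrow` inspection mentioned above. It also matches inside patch files or changelogs, so a real script would probably need to narrow the search root.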