[CORE] Drop 15.0.0-gluten Arrow version rename, depend on vanilla Apache Arrow #12244

Open
sezruby wants to merge 1 commit into apache:main from sezruby:arrow-drop-gluten-rename

Contributor

@sezruby sezruby commented Jun 5, 2026

What changes were proposed in this pull request?

Drop the 15.0.0-gluten custom artifact coordinate that gluten currently builds via dev/build-arrow.sh. Switch every org.apache.arrow:* dependency to the vanilla ${arrow.version} coordinate (15.0.0 for Spark 3.x default; 18.1.0 already in the Spark-4.0/4.1 profiles).

The arrow-gluten.version property and the versions:set -DnewVersion=15.0.0-gluten step in build-arrow.sh are removed. The modify_arrow_dataset_scan_option.patch is no longer applied to the Arrow Java build.

Why this is safe — the patch audit

build-arrow.sh previously applied four patches and renamed the resulting jars to 15.0.0-gluten. Auditing each:

| Patch | Lines | Touches | Status after this PR |
| --- | --- | --- | --- |
| modify_arrow.patch | 135 | C++ only (CMakeLists, ThirdpartyToolchain, helpers.h, Java pom S3/HDFS) | Still applied; no rename needed (the CMake patch survives the jar coordinate change). |
| modify_arrow_dataset_scan_option.patch | 883 | Adds JVM classes CsvFragmentScanOptions, CsvConvertOptions, ConvertUtil, etc.; C++ file_csv / Substrait expression_internal / serde | No longer applied to the Arrow JVM build. Every gluten consumer of these JVM classes was deleted by #12130 (Arrow-CSV / Arrow-Dataset JVM code path removal). The C++ portion is still applied via Velox's CMake/resolve_dependency_modules/arrow/; get-velox.sh continues to copy the patch file into the Velox EP. |
| cmake-compatibility.patch | 34 | C++ only (CMake policy version) | Unchanged. |
| support_ibm_power.patch | 28 | Adds ppc64le→ppcle_64 arch case to JniLoader.java in arrow-c and arrow-dataset | Still applied; ppc64le CI builds still need it. The patch only adds a switch case; it doesn't change any public Arrow API and doesn't require a custom artifact coordinate. ppc64le devs still get the patched binary by running build-arrow.sh, which now installs vanilla 15.0.0 (overriding the Central jars in their local m2). |

A separate sweep (grep -rn 'org.apache.arrow.dataset' --include='*.java' --include='*.scala') shows that after #12130 the only main-source consumers of arrow-dataset are gluten-arrow/.../ArrowNativeMemoryPool and ArrowReservationListener, which use only the upstream org.apache.arrow.dataset.jni.{NativeMemoryPool, ReservationListener} types — no patched classes.
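The sweep above could be turned into a repeatable check along the following lines. This is only a sketch: the function name `find_unexpected_dataset_imports` is hypothetical, and the allow-list is simply the two upstream JNI types named in the paragraph above. Scala brace imports such as `jni.{A, B}` are deliberately reported rather than parsed, so they get a manual look.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): print every distinct
# org.apache.arrow.dataset import under a source root that is NOT one of
# the two upstream JNI types the audit expects.
find_unexpected_dataset_imports() {
  local root="${1:-.}"
  # -h: no file names, -o: only the matched import, so duplicates collapse in sort -u.
  grep -rhoE 'import org\.apache\.arrow\.dataset\.[A-Za-z0-9_.]+' \
      --include='*.java' --include='*.scala' "$root" \
    | sort -u \
    | grep -vE 'org\.apache\.arrow\.dataset\.jni\.(NativeMemoryPool|ReservationListener)$' \
    || true   # an empty result is the expected outcome, not an error
}
```

Running `find_unexpected_dataset_imports .` from the repository root should print nothing if the audit holds; any line it prints is an import worth inspecting by hand.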

Effect on contributors

  • x86_64 / aarch64: dev/build-arrow.sh is no longer required to bootstrap the build. All Arrow JVM dependencies resolve from Maven Central. CI / local builds skip the ~hour-long Arrow C++/Java compile.
  • ppc64le: dev/build-arrow.sh is still required (for the patched arrow-c-data/arrow-dataset JNI binaries with ppc64le arch mapping). The script now installs arrow-vector:15.0.0 etc. into local m2, overriding Central — same dev-loop as before, just without the rename indirection.

Effect on shading

Independent of bundling. The bundled gluten-velox-bundle still ships unshaded org.apache.arrow.* per #12226. This PR doesn't change which Arrow artifacts end up in the bundle — only their coordinate. The follow-up to actually unbundle Arrow (use Spark's shipped Arrow at runtime) is tracked separately in the discussion under #12226.

How was this patch tested?

  • mvn dependency:tree -pl gluten-arrow shows every org.apache.arrow:* resolved at vanilla 15.0.0 / 18.1.0 from Central.
  • Sweep for stale references: grep -rn 'arrow-gluten\.version\|15\.0\.0-gluten' — no matches outside of expected pom diff context.
  • Manual CI run pending.
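The stale-reference sweep listed above can be wrapped so that it fails loudly instead of relying on someone reading grep output. A minimal sketch, assuming GNU grep; the function name `check_no_stale_arrow_refs` is hypothetical, while the pattern is the one quoted in the list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return non-zero if any reference to the removed
# property or the renamed version string is still present under a directory.
check_no_stale_arrow_refs() {
  local root="${1:-.}"
  local hits
  # Same pattern as the sweep in the PR description; .git is skipped so
  # history objects do not produce false positives.
  hits=$(grep -rn --exclude-dir=.git 'arrow-gluten\.version\|15\.0\.0-gluten' "$root" || true)
  if [[ -n "$hits" ]]; then
    printf 'stale Arrow references found:\n%s\n' "$hits" >&2
    return 1
  fi
}
```

A possible follow-up once the sweep passes is `mvn dependency:tree -pl gluten-arrow -Dincludes=org.apache.arrow`, which narrows the tree to the Arrow artifacts so their resolved versions are easy to read.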

References

… Apache Arrow

The custom 15.0.0-gluten artifact coordinate forced every contributor to run
dev/build-arrow.sh before they could build gluten, even though the Java side
of that build no longer carries any load-bearing modifications:

* The 883-line modify_arrow_dataset_scan_option.patch added CSV / Substrait
  dataset Java classes (CsvFragmentScanOptions, ConvertUtil, etc.). Every
  consumer of those classes inside gluten was deleted by apache#12130 along with
  the Arrow-CSV / Arrow-Dataset JVM code path. The patch is no longer applied
  to the Arrow Java build here; the file itself is kept because get-velox.sh
  still copies it into Velox's CMake Arrow EP for the C++ side.
* support_ibm_power.patch (ppc64le → ppcle_64 in JniLoader) is still
  load-bearing for ppc64le builds, but does not require an artifact rename: it
  only patches the binary-resource lookup inside the arrow-c-data JNI jar
  and is still applied by build-arrow.sh.
* The C++ patches (modify_arrow.patch, cmake-compatibility.patch) are
  unchanged.

After this change, on x86_64 / aarch64 every gluten-arrow Arrow dependency
resolves from Maven Central (arrow-c-data:15.0.0, arrow-dataset:15.0.0,
arrow-vector:15.0.0, arrow-memory-{core,unsafe,netty}:15.0.0; 18.1.0 for
the Spark 4.x profiles). ppc64le builds still rely on dev/build-arrow.sh
to produce locally-patched 15.0.0 artifacts — the local-m2 install
overrides Central as before.

Note: this PR removes the artifact-rename indirection but does not yet
unbundle Arrow from the gluten-velox bundle. The bundle still ships
unshaded Arrow (per apache#12226) at the same vanilla coordinates. Removing
the bundled Arrow in favour of Spark's bundled copy is a separate
follow-up driven by the discussion on apache#12226.

github-actions bot added the CORE (works for Gluten Core), BUILD, and VELOX labels on Jun 5, 2026

github-actions Bot commented Jun 5, 2026

Run Gluten Clickhouse CI on x86

Comment thread dev/build-arrow.sh
Comment on lines 113 to 116
# Arrow Java libraries
${MVN_CMD} install -Parrow-jni -P arrow-c-data -pl c,dataset -am \
-Darrow.c.jni.dist.dir=$ARROW_INSTALL_DIR/lib -Darrow.dataset.jni.dist.dir=$ARROW_INSTALL_DIR/lib -Darrow.cpp.build.dir=$ARROW_INSTALL_DIR/lib \
-Dmaven.test.skip -Drat.skip -Dmaven.gitcommitid.skip -Dcheckstyle.skip -Dassembly.skipAssembly
Member

Do we still need to build Arrow Java locally?

Contributor Author

sezruby commented Jun 5, 2026

> Do we still need to build Arrow Java locally?

Mostly no — Maven Central's arrow-c-data:15.0.0 jar already ships libarrow_cdata_jni for x86_64/ (Linux/macOS/Windows) and aarch_64/ (Linux/macOS), so x86_64 / aarch64 contributors no longer need the local Java build after this PR.

The reason it's still wired into dev/builddeps-veloxbe.sh unconditionally:

  • ppc64le has no native in the Central jar. support_ibm_power.patch (kept) adds the ppc64le → ppcle_64 arch case to JniLoader.java and the local mvn install step bakes a locally-built libarrow_cdata_jni.so for ppc64le into the resulting arrow-c-data:15.0.0 jar in ~/.m2, overriding Central.
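One way to confirm which architectures a given arrow-c-data jar actually covers is to filter its entry listing. The sketch below assumes the directory names mentioned in this thread (`x86_64`, `aarch_64`, `ppcle_64`) and a library file named `arrow_cdata_jni` with a platform prefix and suffix; both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Map a `uname -m` style machine type to the arch directory names used in
# this thread. Assumption: only these mappings matter; others pass through.
arrow_arch_dir() {
  case "$1" in
    x86_64|amd64)  echo "x86_64" ;;
    aarch64|arm64) echo "aarch_64" ;;
    ppc64le)       echo "ppcle_64" ;;
    *)             echo "$1" ;;
  esac
}

# Read a jar entry listing on stdin (one path per line) and succeed only if
# it contains an arrow_cdata_jni native under the directory for that machine.
has_cdata_native() {
  local dir
  dir=$(arrow_arch_dir "$1")
  grep -qE "(^|/)${dir}/(lib)?arrow_cdata_jni\.[A-Za-z]+$"
}
```

For example, `unzip -Z1 ~/.m2/repository/org/apache/arrow/arrow-c-data/15.0.0/arrow-c-data-15.0.0.jar | has_cdata_native "$(uname -m)"` checks the jar currently sitting in the local repository, which is the one Maven will use whether it came from Central or from a local install.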

Happy to add a follow-up commit gating build_arrow_java on [[ $(uname -m) == ppc64le ]] so x86_64 / aarch64 users skip ~10 min of redundant work. Holding off in this PR because there's no ppc64le CI lane to confirm the conditional doesn't break the patched build, and I don't have a qemu setup locally to validate it either.
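The gating proposed above might look roughly like this. It is a sketch of the idea only, not the follow-up commit: `needs_local_arrow_java` and `maybe_build_arrow_java` are hypothetical names, and `build_arrow_java` is assumed to be the existing step already defined by the build scripts.

```shell
#!/usr/bin/env bash
# Succeed only for machine types that need the locally built Arrow Java jars.
# Assumption: ppc64le is the only architecture without a native in the Central jar.
needs_local_arrow_java() {
  [[ "${1:-$(uname -m)}" == ppc64le ]]
}

# Hypothetical wrapper: call the existing build_arrow_java step only when
# needed. An optional argument overrides the detected machine type.
maybe_build_arrow_java() {
  if needs_local_arrow_java "$@"; then
    build_arrow_java
  else
    echo "Skipping local Arrow Java build; using Maven Central artifacts"
  fi
}
```

Taking the machine type as an optional argument keeps the decision testable on an x86_64 host, which partly addresses the lack of a ppc64le CI lane: the branch logic can be exercised even though the patched build itself cannot.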
