Description
Summary
We observed a ~15x performance regression in BroadcastHashJoin's HashBuild step when running TPC-DS 5TB on Gluten 1.6 vs Gluten 1.2 (Velox backend).
Reproduction
- Workload: TPC-DS 5TB, Q64
- Same cluster, same configurations on both versions
| Version |
time of hash build total |
| Gluten 1.2 |
8.0 min |
| Gluten 1.6 |
2.04 hours (~15x slower) |
Spark configuration
--num-executors 500
spark.executor.cores = 4
spark.executor.memory = 2g
spark.executor.memoryOverhead = 2g
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 8g
spark.driver.cores = 4
spark.driver.memory = 10g
spark.driver.maxResultSize = 3g
spark.sql.autoBroadcastJoinThreshold = 128MB
spark.sql.adaptive.autoBroadcastJoinThreshold = 128MB
spark.sql.shuffle.partitions = 760
spark.io.compression.codec = lz4
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max = 2000m
spark.plugins = org.apache.gluten.GlutenPlugin
spark.sql.extensions = org.apache.spark.sql.gluten.extension.GlutenExtension
spark.gluten.sql.columnar.backend.lib = velox
spark.shuffle.manager = org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.gluten.enabled = true
spark.gluten.sql.columnar.shuffle.enabled = true
Spark UI screenshots
gluten1.6

gluten1.2

Expected behavior
HashBuild time on 1.6 should be on par with 1.2 for the same workload. At a minimum, it shouldn't be this much slower.
Actual behavior
HashBuild time degrades ~15x with no change in input rows.
Root cause
Bisected to PR #9521 (commit 3e9989d): [GLUTEN-9475][VL] Serialize ColumnarBatch one by one to reduce memory footprint when broadcasting
Behavior change:
|
1.2 |
1.6 |
| Driver-side JNI calls during broadcast serialize |
1 |
N |
Broadcast payload (Array[Array[Byte]]) length |
1 |
N |
Executor deserialize calls |
1 |
N |
| ColumnarBatches fed to HashBuild |
1 (large) |
N (small) |
HashBuild.addInput calls |
1 |
N |
HashBuild::addInput has fixed per-call overhead independent of row count (memory reservation check, DecodedVector init, NULL/stats locking, spill check). For Q64 with ~2B rows spread across many batches, this fixed cost gets multiplied by N and dominates total wall time.
Tradeoff
PR #9521 was introduced to reduce driver-side native memory peak for huge broadcasts (issue #9475). That tradeoff is valid for very large broadcasts that would OOM on 1.2, but for typical workloads it introduces a heavy CPU cost. The right fix is not to revert #9521 — both
behaviors have valid use cases.
Proposed fix: keep both paths behind a config switch
Add spark.gluten.velox.broadcastBuild.mergeBatches (boolean, default false):
The new JNI is additive — existing serialize(long) is preserved, so ColumnarCachedBatchSerializer and other callers are unaffected.
I have a working patch. On the same Q64 reproduction:
mergeBatches=false → HashBuild ~2.04 h (1.6 baseline)
mergeBatches=true → HashBuild ~22 min (matches 1.2)
when I set mergeBatches=true:
Both paths produce semantically equivalent broadcast results. Happy to open a PR for review.
Gluten version
None
Description
Summary
We observed a ~15x performance regression in BroadcastHashJoin's HashBuild step when running TPC-DS 5TB on Gluten 1.6 vs Gluten 1.2 (Velox backend).
Reproduction
Spark configuration
--num-executors 500
spark.executor.cores = 4
spark.executor.memory = 2g
spark.executor.memoryOverhead = 2g
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 8g
spark.driver.cores = 4
spark.driver.memory = 10g
spark.driver.maxResultSize = 3g
spark.sql.autoBroadcastJoinThreshold = 128MB
spark.sql.adaptive.autoBroadcastJoinThreshold = 128MB
spark.sql.shuffle.partitions = 760
spark.io.compression.codec = lz4
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max = 2000m
spark.plugins = org.apache.gluten.GlutenPlugin
spark.sql.extensions = org.apache.spark.sql.gluten.extension.GlutenExtension
spark.gluten.sql.columnar.backend.lib = velox
spark.shuffle.manager = org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.gluten.enabled = true
spark.gluten.sql.columnar.shuffle.enabled = true
Spark UI screenshots
gluten1.6

gluten1.2

Expected behavior
HashBuild time on 1.6 should be on par with 1.2 for the same workload. At a minimum, it shouldn't be this much slower.
Actual behavior
HashBuild time degrades ~15x with no change in input rows.
Root cause
Bisected to PR #9521 (commit 3e9989d): [GLUTEN-9475][VL] Serialize ColumnarBatch one by one to reduce memory footprint when broadcasting
Behavior change:
Array[Array[Byte]]) lengthdeserializecallsHashBuild.addInputcallsHashBuild::addInputhas fixed per-call overhead independent of row count (memory reservation check, DecodedVector init, NULL/stats locking, spill check). For Q64 with ~2B rows spread across many batches, this fixed cost gets multiplied by N and dominates total wall time.Tradeoff
PR #9521 was introduced to reduce driver-side native memory peak for huge broadcasts (issue #9475). That tradeoff is valid for very large broadcasts that would OOM on 1.2, but for typical workloads it introduces a heavy CPU cost. The right fix is not to revert #9521 — both
behaviors have valid use cases.
Proposed fix: keep both paths behind a config switch
Add
spark.gluten.velox.broadcastBuild.mergeBatches(boolean, defaultfalse):false(default): keep 1.6's per-batch serialization (preserves OOM fix from [GLUTEN-9475][VL] Serialize ColumnarBatch one by one to reduce memory footprint when broadcasting #9521)true: collect all batch handles, issue one JNI call to a newserializeAll(long[]), producing 1 combined buffer (matches 1.2 behavior)The new JNI is additive — existing
serialize(long)is preserved, soColumnarCachedBatchSerializerand other callers are unaffected.I have a working patch. On the same Q64 reproduction:
mergeBatches=false→ HashBuild ~2.04 h (1.6 baseline)mergeBatches=true→ HashBuild ~22 min (matches 1.2)when I set mergeBatches=true:
Both paths produce semantically equivalent broadcast results. Happy to open a PR for review.
Gluten version
None