[VL] BroadcastHashJoin HashBuild is significantly slower in 1.6 compared to 1.2

### Description

## Summary
We observed a ~15x performance regression in BroadcastHashJoin's HashBuild step when running TPC-DS 5TB on Gluten 1.6 vs Gluten 1.2 (Velox backend).


## Reproduction
- Workload: TPC-DS 5TB, Q64
- Same cluster, same configurations on both versions

| Version | Total hash build time |
|---------|-----------------------|
| Gluten 1.2 | 8.0 min |
| Gluten 1.6 | 2.04 hours (~15x slower) |
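As a quick sanity check on the ~15x figure, the two values from the table can be put in the same unit. This is plain arithmetic on the reported numbers, and the variable names are only illustrative:

```python
# Reported hash build totals from the table above.
gluten_12_minutes = 8.0
gluten_16_hours = 2.04

# Convert the 1.6 figure to minutes so both values share a unit.
gluten_16_minutes = gluten_16_hours * 60

# Slowdown factor of 1.6 relative to 1.2 (about 15.3).
slowdown = gluten_16_minutes / gluten_12_minutes
print(f"1.6 is ~{slowdown:.1f}x slower than 1.2")
```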

### Spark configuration

--num-executors 500
spark.executor.cores = 4
spark.executor.memory = 2g
spark.executor.memoryOverhead = 2g
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 8g
spark.driver.cores = 4
spark.driver.memory = 10g
spark.driver.maxResultSize = 3g

spark.sql.autoBroadcastJoinThreshold = 128MB
spark.sql.adaptive.autoBroadcastJoinThreshold = 128MB
spark.sql.shuffle.partitions = 760
spark.io.compression.codec = lz4
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max = 2000m

spark.plugins = org.apache.gluten.GlutenPlugin
spark.sql.extensions = org.apache.spark.sql.gluten.extension.GlutenExtension
spark.gluten.sql.columnar.backend.lib = velox
spark.shuffle.manager = org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.gluten.enabled = true
spark.gluten.sql.columnar.shuffle.enabled = true

### Spark UI screenshots
Gluten 1.6
<img width="420" height="683" alt="Image" src="https://github.com/user-attachments/assets/83956494-917b-4f94-bd44-526f0769ed27" />

Gluten 1.2
<img width="452" height="612" alt="Image" src="https://github.com/user-attachments/assets/02b617a2-174f-416f-9a32-8cd80d815bb7" />



## Expected behavior
HashBuild time on 1.6 should be on par with 1.2 for the same workload. At a minimum, it shouldn't be this much slower.

## Actual behavior
HashBuild time degrades ~15x with no change in input rows.


## Root cause
Bisected to PR #9521 (commit 3e9989da1): [GLUTEN-9475][VL] Serialize ColumnarBatch one by one to reduce memory footprint when broadcasting

Behavior change (N = number of ColumnarBatches in the broadcast relation):

| Metric | 1.2 | 1.6 |
|---|---|---|
| Driver-side JNI calls during broadcast serialize | 1 | N |
| Broadcast payload (`Array[Array[Byte]]`) length | 1 | N |
| Executor `deserialize` calls | 1 | N |
| ColumnarBatches fed to HashBuild | 1 (large) | N (small) |
| `HashBuild::addInput` calls | 1 | N |

`HashBuild::addInput` has fixed per-call overhead independent of row count (memory reservation check, DecodedVector init, NULL/stats locking, spill check). For Q64 with ~2B rows spread across many batches, this fixed cost gets multiplied by N and dominates total wall time.
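To make the scaling argument concrete, here is a toy cost model, not a measurement. `hash_build_cost` is a hypothetical helper (it is not a Velox or Gluten API), and the per-call and per-row costs are made-up placeholder values. The only point is that the fixed term grows linearly with the number of `addInput` calls while the per-row term stays the same.

```python
def hash_build_cost(total_rows, num_batches, fixed_ns_per_call, ns_per_row):
    """Toy model: N * fixed per-call overhead + rows * per-row work."""
    fixed_part = num_batches * fixed_ns_per_call
    row_part = total_rows * ns_per_row
    return fixed_part + row_part


# Placeholder inputs, chosen only for illustration (assumptions, not profiled).
TOTAL_ROWS = 1_000_000
FIXED_NS_PER_CALL = 50_000
NS_PER_ROW = 10

# Same rows, delivered as one large batch vs. many small batches.
one_batch = hash_build_cost(TOTAL_ROWS, 1, FIXED_NS_PER_CALL, NS_PER_ROW)
many_batches = hash_build_cost(TOTAL_ROWS, 10_000, FIXED_NS_PER_CALL, NS_PER_ROW)

print(one_batch)     # 10050000
print(many_batches)  # 510000000
```

With these placeholder values the row work is identical in both cases; the entire difference comes from the per-call term.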

## Tradeoff
PR #9521 was introduced to reduce driver-side native memory peak for huge broadcasts (issue #9475). That tradeoff is valid for very large broadcasts that would OOM on 1.2, but for typical workloads it introduces a heavy CPU cost. The right fix is **not** to revert #9521, because both behaviors have valid use cases.

## Proposed fix: keep both paths behind a config switch
Add `spark.gluten.velox.broadcastBuild.mergeBatches` (boolean, default `false`):
- `false` (default): keep 1.6's per-batch serialization (preserves OOM fix from #9521)
- `true`: collect all batch handles, issue **one** JNI call to a new `serializeAll(long[])`, producing 1 combined buffer (matches 1.2 behavior)

The new JNI method is **additive**: the existing `serialize(long)` is preserved, so `ColumnarCachedBatchSerializer` and other callers are unaffected.
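The switch can be sketched in a language-neutral way as follows. This is a minimal Python model of the control flow only, not the actual patch: `build_broadcast_payload` is a hypothetical name, and the two callables stand in for the existing `serialize(long)` JNI and the proposed `serializeAll(long[])`.

```python
from typing import Callable, List, Sequence


def build_broadcast_payload(
    batch_handles: Sequence[int],
    merge_batches: bool,
    serialize: Callable[[int], bytes],
    serialize_all: Callable[[Sequence[int]], bytes],
) -> List[bytes]:
    """Return the broadcast payload as a list of serialized buffers."""
    if merge_batches:
        # One call for all handles, one combined buffer (1.2-style).
        return [serialize_all(batch_handles)]
    # One call and one buffer per batch (current 1.6 behavior).
    return [serialize(handle) for handle in batch_handles]


# Stand-in serializers that only count calls; the real ones would cross JNI.
calls = {"serialize": 0, "serialize_all": 0}


def fake_serialize(handle: int) -> bytes:
    calls["serialize"] += 1
    return handle.to_bytes(8, "little")


def fake_serialize_all(handles: Sequence[int]) -> bytes:
    calls["serialize_all"] += 1
    return b"".join(h.to_bytes(8, "little") for h in handles)


handles = [1, 2, 3, 4]
per_batch = build_broadcast_payload(handles, False, fake_serialize, fake_serialize_all)
merged = build_broadcast_payload(handles, True, fake_serialize, fake_serialize_all)
```

In this model `per_batch` has one entry per handle and `merged` has a single entry, mirroring the N vs. 1 payload lengths in the behavior table.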

I have a working patch. On the same Q64 reproduction:
- `mergeBatches=false` → HashBuild ~2.04 h (1.6 baseline)
- `mergeBatches=true`  → HashBuild ~22 min (matches 1.2 behavior)

Spark UI with `mergeBatches=true`:
<img width="422" height="681" alt="Image" src="https://github.com/user-attachments/assets/57aa8a63-2821-4db5-9655-eb9bd71773d9" />

Both paths produce semantically equivalent broadcast results. Happy to open a PR for review.




### Gluten version

None
