[VL] BroadcastHashJoin HashBuild is significantly slower in 1.6 compared to 1.2 #12251

@Xtpacz

Description

Summary

We observed a ~15x performance regression in BroadcastHashJoin's HashBuild step when running TPC-DS 5TB on Gluten 1.6 vs Gluten 1.2 (Velox backend).

Reproduction

  • Workload: TPC-DS 5TB, Q64
  • Same cluster, same configurations on both versions
| Version    | Total HashBuild time     |
|------------|--------------------------|
| Gluten 1.2 | 8.0 min                  |
| Gluten 1.6 | 2.04 hours (~15x slower) |
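
As a quick sanity check on the ~15x figure, the two totals can be converted to the same unit and divided. This sketch uses only the numbers reported above; the variable names are just for illustration.

```python
# Totals reported above, converted to minutes.
gluten_1_2_min = 8.0
gluten_1_6_min = 2.04 * 60  # 2.04 hours expressed in minutes

slowdown = gluten_1_6_min / gluten_1_2_min
print(f"Gluten 1.6 HashBuild is {slowdown:.1f}x slower than 1.2")
```

The ratio comes out at roughly 15.3, consistent with the "~15x" quoted in this report.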

Spark configuration

--num-executors 500
spark.executor.cores = 4
spark.executor.memory = 2g
spark.executor.memoryOverhead = 2g
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 8g
spark.driver.cores = 4
spark.driver.memory = 10g
spark.driver.maxResultSize = 3g

spark.sql.autoBroadcastJoinThreshold = 128MB
spark.sql.adaptive.autoBroadcastJoinThreshold = 128MB
spark.sql.shuffle.partitions = 760
spark.io.compression.codec = lz4
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max = 2000m

spark.plugins = org.apache.gluten.GlutenPlugin
spark.sql.extensions = org.apache.spark.sql.gluten.extension.GlutenExtension
spark.gluten.sql.columnar.backend.lib = velox
spark.shuffle.manager = org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.gluten.enabled = true
spark.gluten.sql.columnar.shuffle.enabled = true
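
For anyone who wants to re-run the reproduction, here is a minimal sketch of turning the settings above into `spark-submit` arguments. The helper name `to_submit_args` is hypothetical, and only a subset of the settings is shown; the rest can be added to the dict the same way.

```python
# Subset of the settings listed above; extend the dict with the remaining keys as needed.
gluten_conf = {
    "spark.plugins": "org.apache.gluten.GlutenPlugin",
    "spark.gluten.sql.columnar.backend.lib": "velox",
    "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "8g",
    "spark.sql.autoBroadcastJoinThreshold": "128MB",
}


def to_submit_args(conf, num_executors):
    """Render a conf dict as a list of spark-submit arguments (hypothetical helper)."""
    args = ["--num-executors", str(num_executors)]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(gluten_conf, num_executors=500)
print(" ".join(submit_args))
```

Keeping the settings in one dict makes it easier to run both Gluten versions with identical configuration, which is what the comparison relies on.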

Spark UI screenshots

gluten1.6
Image

gluten1.2
Image

Expected behavior

HashBuild time on 1.6 should be on par with 1.2 for the same workload. At a minimum, it shouldn't be this much slower.

Actual behavior

HashBuild time degrades ~15x with no change in input rows.

Root cause

Bisected to PR #9521 (commit 3e9989d): [GLUTEN-9475][VL] Serialize ColumnarBatch one by one to reduce memory footprint when broadcasting

Behavior change:

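| Behavior                                         | 1.2       | 1.6       |
|--------------------------------------------------|-----------|-----------|
| Driver-side JNI serialize calls during broadcast | 1         | N         |
| Broadcast payload (Array[Array[Byte]]) length    | 1         | N         |
| Executor deserialize calls                       | 1         | N         |
| ColumnarBatches fed to HashBuild                 | 1 (large) | N (small) |
| HashBuild.addInput calls                         | 1         | N         |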

HashBuild::addInput has fixed per-call overhead independent of row count (memory reservation check, DecodedVector init, NULL/stats locking, spill check). For Q64 with ~2B rows spread across many batches, this fixed cost gets multiplied by N and dominates total wall time.
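
The effect can be illustrated with a toy cost model: a fixed cost per `addInput` call plus a cost proportional to the number of rows. This is only a sketch of the reasoning above. The function name and all constants are made-up placeholders, not measurements from Velox or from this reproduction.

```python
def hash_build_seconds(total_rows, num_batches, per_call_s, per_row_s):
    """Toy model: fixed overhead per addInput call plus work proportional to rows."""
    return num_batches * per_call_s + total_rows * per_row_s


# Placeholder values chosen only to show the shape of the problem.
TOTAL_ROWS = 2_000_000_000  # ~2B rows, as in Q64
PER_ROW_S = 1e-9            # assumed per-row cost
PER_CALL_S = 1e-3           # assumed fixed per-call cost

one_big = hash_build_seconds(TOTAL_ROWS, 1, PER_CALL_S, PER_ROW_S)
many_small = hash_build_seconds(TOTAL_ROWS, 500_000, PER_CALL_S, PER_ROW_S)

print(f"1 batch:        {one_big:.3f} s")
print(f"500000 batches: {many_small:.3f} s")
```

The row-proportional term is identical in both cases; only the `num_batches * per_call_s` term grows, which is why the regression shows up even though the input row count is unchanged.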

Tradeoff

PR #9521 was introduced to reduce the peak driver-side native memory for huge broadcasts (issue #9475). That tradeoff is valid for very large broadcasts that would OOM on 1.2, but for typical workloads it introduces a heavy CPU cost. The right fix is not to revert #9521, since both behaviors have valid use cases.

Proposed fix: keep both paths behind a config switch

Add spark.gluten.velox.broadcastBuild.mergeBatches (boolean, default false):

The new JNI is additive — existing serialize(long) is preserved, so ColumnarCachedBatchSerializer and other callers are unaffected.
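
A rough sketch of how such a switch could select between the two paths is shown below. It is not the actual patch: `build_broadcast_payload`, `serialize`, and `deserialize_rows` are hypothetical names, batches are modeled as plain lists of rows, and `pickle` stands in for the native serializer.

```python
import pickle


def serialize(batch):
    """Stand-in for the native serialize call; a batch is modeled as a list of rows."""
    return pickle.dumps(batch)


def build_broadcast_payload(batches, merge_batches):
    """Return the broadcast payload as a list of byte strings."""
    if merge_batches:
        # 1.2-style path: concatenate all rows and serialize once (payload length 1).
        merged = [row for batch in batches for row in batch]
        return [serialize(merged)]
    # 1.6-style path: serialize each batch on its own (payload length N).
    return [serialize(batch) for batch in batches]


def deserialize_rows(payload):
    """Flatten the payload back into rows, as the build side would consume them."""
    return [row for blob in payload for row in pickle.loads(blob)]


batches = [[(1, "a"), (2, "b")], [(3, "c")], [(4, "d")]]
merged_payload = build_broadcast_payload(batches, merge_batches=True)
split_payload = build_broadcast_payload(batches, merge_batches=False)

print(len(merged_payload), len(split_payload))
print(deserialize_rows(merged_payload) == deserialize_rows(split_payload))
```

In this model the merged path hands the build side a single large batch, while the split path keeps the lower peak memory of serializing one batch at a time. Both paths yield the same rows in the same order.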

I have a working patch. On the same Q64 reproduction:

  • mergeBatches=false → HashBuild ~2.04 h (1.6 baseline)
  • mergeBatches=true → HashBuild ~22 min (matches 1.2)
    Spark UI screenshot with mergeBatches=true:
Image

Both paths produce semantically equivalent broadcast results. Happy to open a PR for review.

Gluten version

None
