
Conversation

@zhztheplayer
Member

@zhztheplayer zhztheplayer commented Oct 31, 2025

What changes were proposed in this pull request?

The PR adds off-heap memory mode support for LongHashedRelation.

This PR only affects ShuffledHashJoin. In BroadcastHashJoin, the hashed relations are not closed explicitly but are managed by GC, so supporting off-heap allocation there will require a different approach.
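
For readers less familiar with Spark's Tungsten memory layer, a hedged sketch of the general idea (the class name, sizes, and structure below are illustrative only, not the PR's actual implementation): a relation built as a `MemoryConsumer` can let its backing storage follow whatever memory mode the task memory manager is configured with, so the same code path serves both on-heap and off-heap.

```scala
import org.apache.spark.memory.{MemoryConsumer, TaskMemoryManager}
import org.apache.spark.unsafe.array.LongArray

// Illustrative sketch only. The consumer inherits its memory mode from the
// task memory manager, so spark.memory.offHeap.enabled decides where the
// backing LongArray lives.
class OffHeapAwareLongMap(tmm: TaskMemoryManager)
  extends MemoryConsumer(tmm, tmm.pageSizeBytes(), tmm.getTungstenMemoryMode()) {

  // allocateArray() allocates in this consumer's memory mode and accounts the
  // bytes against execution memory.
  private var array: LongArray = allocateArray(1024)

  def close(): Unit = {
    if (array != null) {
      freeArray(array) // returns the acquired execution memory to the pool
      array = null
    }
  }

  // The real relation would spill or fail here; this sketch never spills.
  override def spill(size: Long, trigger: MemoryConsumer): Long = 0L
}
```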

Why are the changes needed?

  1. To avoid on-heap OOMs when the user sets spark.memory.offHeap.enabled=true and configures the JVM with a comparatively small heap size (see the configuration sketch after this list).
  2. Off-heap mode is observed to be faster than on-heap mode. See the benchmark results.
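
For context on item 1, a hedged configuration sketch with illustrative values: once off-heap mode is enabled, execution memory is drawn from `spark.memory.offHeap.size` instead of the JVM heap, so the heap itself can be kept comparatively small.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative settings only; sizes must be tuned per workload.
val spark = SparkSession.builder()
  .appName("off-heap-shuffled-hash-join-demo")
  .config("spark.memory.offHeap.enabled", "true")
  // Must be a positive size when off-heap mode is enabled.
  .config("spark.memory.offHeap.size", "4g")
  .getOrCreate()
```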

Does this PR introduce any user-facing change?

By design, when spark.memory.offHeap.enabled=true is set:

  • Required off-heap memory size may increase
  • Required on-heap memory size may decrease

How was this patch tested?

WIP

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Oct 31, 2025
cloud-fan pushed a commit that referenced this pull request Nov 3, 2025
…nSuite

### What changes were proposed in this pull request?

Add the following code in `HashedRelationSuite`, to cover the API `HashedRelation#close` in the test suite.

```scala
  protected override def afterEach(): Unit = {
    super.afterEach()
    assert(umm.executionMemoryUsed === 0)
  }
```
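
In other words, the assertion turns a forgotten `close()` into a test failure. A minimal sketch of the pattern each test is then expected to follow (variable names are illustrative, not taken from the suite):

```scala
// Build a relation against the suite's task memory manager, use it, then close it.
val relation = HashedRelation(rows, key, taskMemoryManager = mm)
try {
  // ... exercise the relation in the test body ...
} finally {
  // Without this, umm.executionMemoryUsed stays non-zero and afterEach fails.
  relation.close()
}
```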

### Why are the changes needed?

Doing this will:

1. Ensure `HashedRelation#close` is called in test code, to lower the memory footprint and avoid memory leaks when executing tests.
2. Ensure implementations of `HashedRelation#close` free the allocated memory blocks correctly.

It's a standalone effort to improve test quality, but also a prerequisite task for #52817.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It's a test PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52830 from zhztheplayer/wip-54132.

Authored-by: Hongze Zhang <hongze.zzz123@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Nov 3, 2025
(cherry picked from commit a5e866f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@zhztheplayer zhztheplayer force-pushed the wip-54116-off-heap-long-relation branch from 3351fb1 to 3686285 Compare November 10, 2025 16:59
@zhztheplayer zhztheplayer force-pushed the wip-54116-off-heap-long-relation branch from 3686285 to 842d2de Compare November 10, 2025 17:00
@zhztheplayer zhztheplayer marked this pull request as ready for review November 10, 2025 17:30
@zhztheplayer
Member Author

@cloud-fan @yaooqinn @dongjoon-hyun @HyukjinKwon @viirya

Would you kindly help review this PR? It's for making Spark SQL work more smoothly with third-party off-heap-based operators/expressions. Thanks!

@HyukjinKwon
Member

I wonder if the benchmark can be done at least. I worked on similar changes to implement off-heap stuff, and realised that it isn't necessarily fast.

@HyukjinKwon
Member

e.g., sometimes the JIT appears to be quite a bit smarter than using direct off-heap memory

@zhztheplayer
Member Author

zhztheplayer commented Nov 10, 2025

@HyukjinKwon Thank you for the quick response.

I benchmarked using the existing `HashedRelationMetricsBenchmark`.

500K Rows

Before:

OpenJDK 64-Bit Server VM 17.0.16+8-Ubuntu-0ubuntu124.04.1 on Linux 6.14.0-33-generic
18:40:31.460 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: 

Unknown processor
LongToUnsafeRowMap metrics:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
LongToUnsafeRowMap                                   55             63           4          9.1         109.8       1.0X

After (on-heap):

OpenJDK 64-Bit Server VM 17.0.16+8-Ubuntu-0ubuntu124.04.1 on Linux 6.14.0-33-generic
18:39:53.863 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: 

Unknown processor
LongToUnsafeRowMap metrics:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
LongToUnsafeRowMap                                   63            105          38          8.0         125.5       1.0X

After (off-heap):

OpenJDK 64-Bit Server VM 17.0.16+8-Ubuntu-0ubuntu124.04.1 on Linux 6.14.0-33-generic
15:13:48.380 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: 

Unknown processor
LongToUnsafeRowMap metrics:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
LongToUnsafeRowMap                                   62             68           4          8.1         123.0       1.0X

10M Rows

Before:

OpenJDK 64-Bit Server VM 17.0.16+8-Ubuntu-0ubuntu124.04.1 on Linux 6.14.0-33-generic
18:53:43.292 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: 

Unknown processor
LongToUnsafeRowMap metrics:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
LongToUnsafeRowMap                                 2955           3121         235          3.4         295.5       1.0X

After (on-heap):

OpenJDK 64-Bit Server VM 17.0.16+8-Ubuntu-0ubuntu124.04.1 on Linux 6.14.0-33-generic
18:54:10.447 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: 

Unknown processor
LongToUnsafeRowMap metrics:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
LongToUnsafeRowMap                                 3048           3336         408          3.3         304.8       1.0X

After (off-heap):

OpenJDK 64-Bit Server VM 17.0.16+8-Ubuntu-0ubuntu124.04.1 on Linux 6.14.0-33-generic
15:11:40.943 ERROR org.apache.spark.util.Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1: 

Unknown processor
LongToUnsafeRowMap metrics:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
LongToUnsafeRowMap                                 2453           2459           9          4.1         245.3       1.0X

e.g., sometimes the JIT appears to be quite a bit smarter than using direct off-heap memory

Yes, and also regarding the benchmark results, I do think the new approach comes across as a bit slower (although the numbers are close). Do you have any suggestions?

@zhztheplayer
Member Author

e.g., sometimes the JIT appears to be quite a bit smarter than using direct off-heap memory

I actually didn't expect the off-heap relation to be faster - it's more about making sure Spark can work with a relatively small heap under off-heap memory mode, so we can prevent heap OOMs.

@zhztheplayer zhztheplayer changed the title [SPARK-54116][SQL] Add off-heap mode support for LongHashedRelation [SPARK-54116][SQL] Add off-heap mode support for LongHashedRelation to avoid JVM heap OOMs under off-heap memory mode Nov 10, 2025
```scala
val got = acquireMemory(size)
if (got < size) {
  freeMemory(got)
  throw QueryExecutionErrors.cannotAcquireMemoryToBuildLongHashedRelationError(size, got)
}
```
Member

We can remove this error class

@zhztheplayer
Member Author

zhztheplayer commented Nov 11, 2025

It's interesting that, after the change, the off-heap relation looks meaningfully faster than the on-heap relation in the benchmark. I will update the benchmark results inline.

@zhztheplayer zhztheplayer changed the title [SPARK-54116][SQL] Add off-heap mode support for LongHashedRelation to avoid JVM heap OOMs under off-heap memory mode [SPARK-54116][SQL] Add off-heap mode support for LongHashedRelation Nov 11, 2025
@dongjoon-hyun
Member

Just a note to @zhztheplayer. From my understanding, this feature needs enough time to do extensive tests over various workloads. Given that, I'd like to recommend re-targeting this to Apache Spark 4.2.0 only. For 4.2.0, we can get more chances to test this in 4.2.0-preview1, 4.2.0-preview2, and so on. For Spark 4.1, it looks too late to me.

@zhztheplayer
Member Author

zhztheplayer commented Nov 12, 2025

Just a note to @zhztheplayer. From my understanding, this feature needs enough time to do extensive tests over various workloads. Given that, I'd like to recommend re-targeting this to Apache Spark 4.2.0 only. For 4.2.0, we can get more chances to test this in 4.2.0-preview1, 4.2.0-preview2, and so on. For Spark 4.1, it looks too late to me.

@dongjoon-hyun According to the developer document mentioned in the issue, I will leave the target version empty. Thank you for helping with planning the target version. 4.2 looks totally fine to me.

@zhztheplayer zhztheplayer force-pushed the wip-54116-off-heap-long-relation branch from ff3765c to b9b24d6 Compare November 12, 2025 13:18
@cloud-fan
Contributor

cloud-fan commented Nov 17, 2025

This PR only affects ShuffledHashJoin. In BroadcastHashJoin, the hashed relations are not closed explicitly but are managed by GC, so supporting off-heap allocation there will require a different approach.

How is this implemented in this PR? I don't see any branching code regarding different joins.

@zhztheplayer
Member Author

zhztheplayer commented Nov 17, 2025

Hi @cloud-fan, thanks for having a look.

How is this implemented in this PR? I don't see any branching code regarding different joins.

This is a bit subtle due to how the task memory manager is obtained when creating a hashed relation in Spark code.

For SHJ

```scala
val relation = HashedRelation(
  iter,
  buildBoundKeys,
  taskMemoryManager = context.taskMemoryManager(),
  // build-side or full outer join needs support for NULL key in HashedRelation.
  allowsNullKey = joinType == FullOuter ||
    (joinType == LeftOuter && buildSide == BuildLeft) ||
    (joinType == RightOuter && buildSide == BuildRight),
  ignoresDuplicatedKey = ignoreDuplicatedKey)
```

As seen, `context.taskMemoryManager()` is passed in, so SHJ is supposed to follow the Spark option `spark.memory.offHeap.enabled`.
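
For illustration only (the PR's exact call sites may differ), building on the snippet above: the effective memory mode can be read straight off the task memory manager that SHJ passes in, which is what lets one relation implementation serve both modes without join-specific branching.

```scala
import org.apache.spark.memory.MemoryMode

// Hedged sketch: the TaskMemoryManager from the task context already carries
// the mode implied by spark.memory.offHeap.enabled, so the relation can ask it.
val useOffHeap =
  context.taskMemoryManager().getTungstenMemoryMode() == MemoryMode.OFF_HEAP
```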

For BHJ (Driver)

The code goes through this path:

then

```scala
override def transform(
    rows: Iterator[InternalRow],
    sizeHint: Option[Long]): HashedRelation = {
  sizeHint match {
    case Some(numRows) =>
      HashedRelation(rows, key, numRows.toInt, isNullAware = isNullAware)
    case None =>
      HashedRelation(rows, key, isNullAware = isNullAware)
  }
}
```

Here, no tmm is passed when creating the hashed relation. In this case, a temporary on-heap tmm will be created and used:

```scala
val mm = Option(taskMemoryManager).getOrElse {
  new TaskMemoryManager(
    new UnifiedMemoryManager(
      new SparkConf().set(MEMORY_OFFHEAP_ENABLED.key, "false"),
      Long.MaxValue,
      Long.MaxValue / 2,
      1),
    0)
}
```

For BHJ (Executor)

Similar to the driver side, the deserialization code also uses a temporary on-heap tmm:

```scala
val taskMemoryManager = new TaskMemoryManager(
  new UnifiedMemoryManager(
    new SparkConf().set(MEMORY_OFFHEAP_ENABLED.key, "false"),
    Long.MaxValue,
    Long.MaxValue / 2,
    1),
  0)
```

@zhztheplayer
Member Author

The CI failure (it hung forever after changing to on-heap) is being addressed by another PR: #53065, which this one depends on.
