Optimizing Pre-built Hash Table Support in Velox for Presto (Prestissimo) and Spark (Gluten) #17546

JkSelf · 2026-05-18T10:44:34Z

JkSelf
May 18, 2026
Collaborator

Background and Problem

Gluten currently faces severe memory issues when implementing Broadcast Hash Join (see Gluten Issue #7548). Since Gluten's implementation follows the Shuffle Hash Join approach, each task in every executor independently builds its own hash table, which leads to:

Massive memory consumption: When the Spark broadcast threshold is increased to 100MB, Broadcast Hash Join easily encounters OOM errors.
Severe resource waste: Multiple tasks within the same executor repeatedly build identical hash tables.

Our Proposed Solution

To address these issues, we proposed two optimization strategies in our design document:

Solution 1: Executor-level Hash Table Reuse

Build one hash table per executor
All tasks within the executor share the same hash table.
Significantly reduces memory consumption (from N copies to 1, where N is the number of tasks)
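
A minimal sketch of the executor-level reuse idea, assuming a process-wide registry keyed by a broadcast identifier. `FakeHashTable`, `ExecutorTableRegistry`, and `getOrBuild` are hypothetical names used only for illustration; they are not Velox or Gluten APIs.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a built join hash table; not a Velox type.
struct FakeHashTable {
  std::unordered_multimap<int, int> rows;
};

// Hypothetical executor-wide registry holding one table per broadcast id.
class ExecutorTableRegistry {
 public:
  using Builder = std::function<std::shared_ptr<FakeHashTable>()>;

  // Returns the table for `broadcastId`. `build` runs only on the first
  // request; later callers receive the same shared table.
  std::shared_ptr<FakeHashTable> getOrBuild(
      const std::string& broadcastId,
      const Builder& build) {
    assert(build != nullptr);
    // The lock is held during the build, so concurrent tasks wait instead
    // of building duplicate copies.
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = tables_.find(broadcastId);
    if (it != tables_.end()) {
      return it->second;
    }
    auto table = build();
    ++buildCount_;
    tables_.emplace(broadcastId, table);
    return table;
  }

  // Number of times a builder was actually invoked.
  int buildCount() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return buildCount_;
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<FakeHashTable>> tables_;
  int buildCount_{0};
};
```

With this shape, N tasks asking for the same broadcast id trigger one build and share one copy. Eviction is deliberately left out of the sketch.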

Solution 2: Driver-side Pre-build (Consistent with Spark Architecture)

Pre-build the hash table on the Spark driver side
Broadcast the built hash table to all executors
Advantages:
- Hash table is built only once, dramatically reducing CPU utilization
- Consistent with Spark's native Broadcast Hash Join implementation
- Significant performance improvement

To support these two solutions at the Velox layer, we submitted PR #13041, which adds the capability for the HashBuild operator to accept pre-built hash tables.

Recent Developments and Architectural Considerations

During the review process of PR #13041, the community introduced PR #15754, which implements Presto-based Broadcast Hash Table Caching. While we appreciate this contribution and recognize its value for Presto workloads, we've identified that this approach addresses a different use case and cannot fully satisfy Gluten's requirements due to fundamental architectural differences between Spark and Presto.

Spark Architecture:

Hash table construction occurs in the BroadcastExchange operator (Build Hash Table Source)
Hash table is completed as a broadcast variable on the driver side (Broadcast Hash Table Source)
Join operator directly uses the pre-built hash table (Use Hash Table Source)

Why Both Approaches Are Needed
PR #15754's implementation is designed for Presto's architecture, where hash table construction happens within the HashJoin operator. Spark differs fundamentally: the hash table is built in the BroadcastExchange operator, before the join operator runs. Adapting Gluten to use PR #15754's approach would require:

Restructuring Spark's BroadcastExchange logic
Deviating from Spark's established design patterns
Introducing additional complexity that conflicts with Spark's architectural principles

We believe both approaches have merit and serve different architectural needs. We respectfully request that the Velox community consider supporting both solutions:

PR #13041: Provides interface support for scenarios requiring external pre-built hash tables (such as Spark/Gluten)
PR #15754: Provides built-in broadcast hash table caching optimized for Presto's architecture

@mbasmanova @pedroerp @xiaoxmeng @shrinidhijoshi @FelixYBW @zhouyuan @jinchengchenghh @rui-mo @zhli1142015 @zhztheplayer @marin-ma

Looking forward to your insights. Thanks.

mbasmanova · 2026-05-18T13:03:20Z

mbasmanova
May 18, 2026
Maintainer

@JkSelf Would you clarify how #17435 is related to this?

3 replies

JkSelf May 18, 2026
Collaborator Author

Thank you for the detailed discussion. The Spark architectural argument makes sense — building the hash table in BroadcastExchange before the join operator is a fundamental difference from Presto.

@mbasmanova Thanks for the review. I'd like to follow up on the hash table caching design in #13041. Gluten has already integrated this into our internal branch, and getting it upstreamed would significantly reduce our rebase overhead.

@JkSelf Would you clarify how #17435 is related to this?

Since #13041 was closed earlier due to a lack of review, huge thanks to @infvg for picking up the work and keeping it alive in #17435.

I have reopened this original PR (#13041) as I'd like to continue driving it to completion. I will address and resolve all the review comments from #17435 right here in #13041 until it finally gets merged.

mbasmanova May 18, 2026
Maintainer

@JkSelf To make sure I understand, are you going to close #17435 in favor of #13041? Then, address feedback from #17435 in #13041 and ping for review?

JkSelf May 18, 2026
Collaborator Author

@mbasmanova Yes, I will resolve all your feedback from this review here in #13041. I’ll ping you for another review once it's ready. Thanks!

shrinidhijoshi · 2026-05-18T16:43:37Z

shrinidhijoshi
May 18, 2026
Collaborator

@JkSelf Thanks for the detailed description

IMO, conceptually, the difference we are discussing is not in the engine architecture but in how the HashTable is built and injected. That is orthogonal to #15754, which just provides a cache to store and reference HashTables during joins, regardless of origin.

For instance,

Presto engine sources the HashTable data from Exchange and can populate the cache
Presto-on-Spark sources the HashTable data from File based exchange, builds it and populates it
Spark sources it pre-built (in-memory) (does not yet populate cache)

#15754 is agnostic to this aspect. It is just a mechanism for storing HashTables and referencing them throughout the join execution.

I wonder whether, for Spark/Gluten, it is just a matter of populating the cache by changing

 reusedHashTableInfo_(std::move(reusedHashTableInfo))

to

  auto* cache = HashTableCache::instance();
  cache->put(cacheKey(), table, joinHasNullKeys_);

Then all your HashBuild operators would get a cache hit (unlike the Presto approach, where the first HashBuild has a cache miss), and everything else should work out of the box?

3 replies

JkSelf May 19, 2026
Collaborator Author

@shrinidhijoshi I appreciate the suggestion, but there are fundamental architectural differences that make the HashTableCache approach unsuitable for Spark/Gluten scenarios. Let me explain:

Lifecycle Management Issue

HashTableCache ties the hash table lifecycle to QueryCtx through release callbacks (HashTableCache.cpp):

queryCtx->addReleaseCallback(
    [cacheKey = key]() { HashTableCache::instance()->drop(cacheKey); });

This design assumes:

One QueryCtx per cache entry
Hash table is only needed within a single query
Automatic cleanup when query completes

Spark's requirements are different:

Cross-query reuse: Query2 needs to reuse a hash table built by Query1 (which may have already completed)
Concurrent query sharing: Multiple queries running simultaneously need to share the same broadcast hash table
Reused stage semantics: Spark's AQE reuses stages across queries, but they have different QueryCtx instances
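
To make the lifecycle concern concrete, here is a toy model of a cache entry tied to a query context through a release callback. `ToyQueryCtx`, `ToyCache`, and `cacheForQuery` are hypothetical stand-ins, not the real QueryCtx or HashTableCache; the sketch only assumes that callbacks run when the query context is destroyed.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy cache: key -> shared table (an int stands in for the table).
using ToyCache = std::unordered_map<std::string, std::shared_ptr<int>>;

// Hypothetical query context that runs its callbacks on destruction.
class ToyQueryCtx {
 public:
  void addReleaseCallback(std::function<void()> callback) {
    assert(callback != nullptr);
    callbacks_.push_back(std::move(callback));
  }

  ~ToyQueryCtx() {
    for (auto& callback : callbacks_) {
      callback();
    }
  }

 private:
  std::vector<std::function<void()>> callbacks_;
};

// Caches `table` under `key` and removes the entry when `queryCtx` ends.
// `cache` must outlive `queryCtx`.
void cacheForQuery(
    ToyCache& cache,
    ToyQueryCtx& queryCtx,
    const std::string& key,
    std::shared_ptr<int> table) {
  cache[key] = std::move(table);
  queryCtx.addReleaseCallback(
      [&cache, cacheKey = key]() { cache.erase(cacheKey); });
}
```

In this model, once the first query's context is destroyed the entry is gone, so a later query looking up the same key misses unless something else re-inserts the table.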

The CacheKey Problem

Even if we could solve the lifecycle issue, determining a stable cacheKey is problematic:

In Spark, the same physical plan (same planNodeId) can be reused across different logical queries
Using planNodeId as cacheKey would cause different queries to incorrectly share state
We'd need a complex key that encodes query identity, but that defeats the purpose of reuse

Why Our Approach Works

Our implementation delegates lifecycle management to Spark/Gluten:

void* reusedHashTableAddress_; // Managed externally by Spark

This allows:

Spark to control when hash tables are built, shared, and destroyed
Reference counting at the Spark level across multiple queries
Clear separation of concerns: Velox uses the table, Spark manages its lifecycle
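
A rough sketch of what external lifecycle management could look like, assuming the Spark/Gluten side counts borrowers and hands the operator an opaque address. `ExternalTableOwner`, `FakeTable`, and `probeRowCount` are hypothetical names; the sketch is single-threaded and is not the code in #13041.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a built hash table.
struct FakeTable {
  int numRows{0};
};

// Hypothetical owner living on the Spark/Gluten side. It decides when the
// table is created and destroyed; not thread-safe.
class ExternalTableOwner {
 public:
  explicit ExternalTableOwner(int numRows)
      : table_(std::make_unique<FakeTable>()) {
    table_->numRows = numRows;
  }

  // A query borrows the table and receives an opaque address, similar in
  // spirit to reusedHashTableAddress_.
  void* acquire() {
    assert(table_ != nullptr);
    ++refs_;
    return table_.get();
  }

  // The table is destroyed when the last borrower releases it.
  void release() {
    assert(refs_ > 0);
    if (--refs_ == 0) {
      table_.reset();
    }
  }

  bool alive() const {
    return table_ != nullptr;
  }

 private:
  std::unique_ptr<FakeTable> table_;
  int refs_{0};
};

// Operator side: uses the table but never frees it.
int probeRowCount(void* reusedHashTableAddress) {
  assert(reusedHashTableAddress != nullptr);
  return static_cast<FakeTable*>(reusedHashTableAddress)->numRows;
}
```

The trade-off is that the raw address carries no ownership information, so correctness depends entirely on the external owner keeping the table alive while any operator still uses it.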

While HashTableCache is elegant for Presto's single-query coordination, Spark/Gluten needs cross-query sharing with external lifecycle management. Our approach is orthogonal to #15754 - both can coexist for their respective use cases. cc @mbasmanova

shrinidhijoshi May 19, 2026
Collaborator

Thanks for the additional context, @JkSelf. Given the new constraints you just mentioned, I would suggest that we build on and extend the existing HashTableCache.

Based on this info, it seems that once your Spark logic has the HashTable (say sharedHashTable_), it can just do the following:

  auto* cache = HashTableCache::instance();
  cache->put("QueryId-1.PlanNodeId-11", sharedHashTable_, joinHasNullKeys_);
  cache->put("QueryId-2.PlanNodeId-22", sharedHashTable_, joinHasNullKeys_);

This should solve the scenarios you mention. Your Spark logic still controls the lifecycle of the table because it holds a shared_ptr to it.

drop() does not destroy the cached table: the cache stores a shared_ptr, so the queryCtx destruction callback only drops the cache's reference and does not destroy the table while another owner holds it.

This follows from the cache API:

  /// Stores a built hash table and notifies waiting tasks.
  void put(
      const std::string& key,
      std::shared_ptr<BaseHashTable> table,
      bool hasNullKeys);

  /// Removes a cache entry.
  void drop(const std::string& key);
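
The ownership argument can be checked with a toy cache that mirrors the put/drop shape quoted above. `ToyCache` and `FakeTable` are hypothetical stand-ins; the sketch assumes the real cache holds nothing beyond the shared_ptr, which the quoted declarations suggest but do not prove.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for BaseHashTable; sets a flag when destroyed.
struct FakeTable {
  explicit FakeTable(bool* destroyed) : destroyed_(destroyed) {}
  ~FakeTable() {
    *destroyed_ = true;
  }
  bool* destroyed_;
};

// Toy cache with the same put/drop shape as the quoted declarations.
class ToyCache {
 public:
  void put(
      const std::string& key,
      std::shared_ptr<FakeTable> table,
      bool hasNullKeys) {
    assert(table != nullptr);
    entries_[key] = Entry{std::move(table), hasNullKeys};
  }

  // Removes the entry; the table survives if another owner still holds it.
  void drop(const std::string& key) {
    entries_.erase(key);
  }

  std::size_t size() const {
    return entries_.size();
  }

 private:
  struct Entry {
    std::shared_ptr<FakeTable> table;
    bool hasNullKeys{false};
  };
  std::unordered_map<std::string, Entry> entries_;
};
```

If the external owner keeps its own shared_ptr, the same table can be registered under two keys, both entries can be dropped, and the table is destroyed only when the owner resets its pointer.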

Also, can you please share what QueryId means for Gluten?
IIUC, the hierarchy is SparkAppId -> [N] Spark JobId -> [M] Spark StageId. Which one is used as the QueryId for the Velox instance/task?

It would be great to capture these constraints in the original issue, along with some examples with QueryId and PlanNodeId details showing how planNodeIds are reused and the relation between the Velox QueryId passed to Gluten, SparkAppId, SparkJobId, etc., as these are what actually drive the design here.

My pushback is merely on #13041, which seems orthogonal to these particular design aspects and is more of a low-level building block for your overall solution.

shrinidhijoshi May 19, 2026
Collaborator

Cc @xiaoxmeng ^

xiaoxmeng · 2026-05-21T17:46:33Z

xiaoxmeng
May 21, 2026
Maintainer

Discussed this with @JkSelf offline. The plan is to extend the hash table cache API to fit the Gluten use case by passing the pre-built hash table from the Gluten runtime; the hash build internal workflow with the hash table cache should remain the same (pretty much as @shrinidhijoshi suggested). cc @shrinidhijoshi @mbasmanova

0 replies

shrinidhijoshi · 2026-05-22T16:55:21Z

shrinidhijoshi
May 22, 2026
Collaborator

Thank you @xiaoxmeng @JkSelf.

0 replies
