
[VL] Remove buffering of sorted partitions in RSS writer to prevent OOM#11059

Merged
kerwin-zk merged 2 commits into apache:main from boneanxs:fix_celeborn_oom on Nov 17, 2025

Conversation

@boneanxs
Contributor

What changes are proposed in this pull request?

#10244 removed RssPartitionWriterOutputStream, which could buffer the whole partition in memory and cause an OOM.

This PR pushes data directly to rssClient when sortEvict is called, since rssClient itself also buffers data before sending it to the remote side.
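The memory effect of this change can be sketched with a small standalone model (hypothetical names such as `MockRssClient`, not the actual Gluten/Celeborn API): buffering accumulates the whole partition before a single push, while pushing on eviction bounds the writer's footprint to one evicted block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the RSS client: it keeps its own push
// buffer, so the writer does not need to accumulate a partition first.
struct MockRssClient {
  std::size_t pushed = 0;  // total bytes handed to the client
  void pushData(const std::string& block) { pushed += block.size(); }
};

// Pre-patch shape: every evicted block is appended to a partition-wide
// buffer, so the writer's peak memory grows with the partition size.
std::size_t evictBuffered(const std::vector<std::string>& blocks, MockRssClient& client) {
  std::string partitionBuffer;
  for (const auto& b : blocks) partitionBuffer += b;  // whole partition in memory
  std::size_t peak = partitionBuffer.size();
  client.pushData(partitionBuffer);
  return peak;  // peak bytes held by the writer
}

// Post-patch shape: each evicted block is pushed immediately, so the
// writer's peak memory is bounded by the largest single block.
std::size_t evictDirect(const std::vector<std::string>& blocks, MockRssClient& client) {
  std::size_t peak = 0;
  for (const auto& b : blocks) {
    peak = std::max(peak, b.size());
    client.pushData(b);  // client buffers/sends on its own schedule
  }
  return peak;
}
```

Both paths hand the same bytes to the client; only the writer-side peak differs.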

How was this patch tested?

@github-actions github-actions Bot added the VELOX label Nov 10, 2025
@boneanxs
Contributor Author

@marin-ma @kerwin-zk hey, could you please help review this? Thanks!

```cpp
ARROW_ASSIGN_OR_RAISE(
    auto rssOs, arrow::io::BufferOutputStream::Create(options_->pushBufferMaxSize, arrow::default_memory_pool()));
if (codec_ != nullptr) {
  ARROW_ASSIGN_OR_RAISE(
```
Contributor Author


Do we need to compress data here again for RSS shuffle? It looks like there's already compression inside shuffleClient: https://github.com/apache/celeborn/blob/5e4d80bb1e764b80f5d3462bb8ffb9061efc63b4/client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java#L1052

@wForget
Member, Nov 10, 2025
FYI: we have disabled compression in the RSS client (Uniffle) and only retained compression in the Gluten shuffle.

Contributor Author


Makes sense. Celeborn also doesn't need to compress again IMO; let me remove it as well.

Member


> Makes sense. Celeborn also doesn't need to compress again IMO; let me remove it as well.

Maybe disabling compression on the Celeborn side via config (if that option exists) is sufficient, and there is no need to implicitly disable it in the Gluten codebase.

Contributor


@zuston This has already been implemented in the new version of Celeborn, and I'll adapt it accordingly later.

@github-actions github-actions Bot added the RSS label Nov 10, 2025
@github-actions

Run Gluten Clickhouse CI on x86

@marin-ma
Contributor

@boneanxs Thanks for identifying this issue. I wonder if this change may affect the shuffled data size. The original design was aimed at generating a smaller compressed output by flushing the compressed data with more buffered input.

@boneanxs
Contributor Author

> @boneanxs Thanks for identifying this issue. I wonder if this change may affect the shuffled data size. The original design was aimed at generating a smaller compressed output by flushing the compressed data with more buffered input.

@marin-ma Thanks for pointing that out. I tested it and found that it still produces less shuffle data than rss_sort. I can add more tests to measure how much the data volume increases compared to buffering the entire partition. However, even if the volume does increase, I think we still can't buffer the whole partition for large ones.
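The data-size concern can be illustrated with a toy cost model (the per-block overhead constant is an assumption for illustration, not a measured Celeborn value): each separately compressed and flushed block carries fixed framing overhead, so pushing many small blocks can yield more total shuffle bytes than compressing one large buffered block.

```cpp
#include <cassert>
#include <cstddef>

// Assumed fixed per-block cost (compression framing, headers);
// purely illustrative, not measured.
constexpr std::size_t kFrameOverhead = 64;

// Total shuffle bytes when a payload is compressed and flushed in
// blocks of blockSize: the payload itself plus one overhead per block.
std::size_t shuffleBytes(std::size_t totalPayload, std::size_t blockSize) {
  std::size_t blocks = (totalPayload + blockSize - 1) / blockSize;
  return totalPayload + blocks * kFrameOverhead;
}
```

Under this model, shrinking the flush block size only adds framing overhead linearly in the block count, which matches the observation that the volume increase stays modest.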

@wForget
Member

wForget commented Nov 11, 2025

> which could buffer the whole partition in memory and cause an OOM.

Doesn't the RSS shuffle writer trigger a spill?

@boneanxs
Contributor Author

> which could buffer the whole partition in memory and cause an OOM.

> Doesn't the RSS shuffle writer trigger a spill?

Spill still can't evict these buffers: sortEvict is called, and the evicted data is then buffered in Arrow's BufferOutputStream.
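A rough standalone model of the pre-patch behavior described above (names are hypothetical, not the actual writer classes): eviction drains the sort buffer, but the bytes only move into an in-memory output stream, so the writer's total footprint does not shrink.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of the pre-patch writer: sortEvict can drain the
// sort buffer, but the bytes just move into an in-memory stream (a
// stand-in for arrow::io::BufferOutputStream), so spilling does not
// actually release any memory.
struct WriterModel {
  std::vector<std::string> sortBuffer;  // rows awaiting eviction
  std::string streamBuffer;             // in-memory output stream

  std::size_t totalBytes() const {
    std::size_t n = streamBuffer.size();
    for (const auto& r : sortBuffer) n += r.size();
    return n;
  }

  // sortEvict before the patch: drain the sort buffer into the stream.
  void sortEvictToStream() {
    for (const auto& r : sortBuffer) streamBuffer += r;
    sortBuffer.clear();
  }
};
```

The sort buffer empties, but the total held by the writer is unchanged, which is why a spill alone cannot prevent the OOM.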

@wForget
Member

wForget commented Nov 11, 2025

> which could buffer the whole partition in memory and cause an OOM.

> Doesn't the RSS shuffle writer trigger a spill?

> Spill still can't evict these buffers: sortEvict is called, and the evicted data is then buffered in Arrow's BufferOutputStream.

That's right, thank you for your explanation.

@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86

@marin-ma
Contributor

marin-ma commented Nov 12, 2025

@boneanxs Thanks for following up.

> I tested it and found that it still produces less shuffle data than rss_sort.

Could you also compare with vanilla Spark + Celeborn?

> I can add more tests to measure how much the data volume increases compared to buffering the entire partition.

It would be nice if more performance results could be shared.

@boneanxs
Contributor Author

> @boneanxs Thanks for following up.
>
> Could you also compare with vanilla Spark + Celeborn?
>
> It would be nice if more performance results could be shared.

@marin-ma I ran a test query on TPC-DS (3TB) using the following SQL:

```sql
SELECT
  ss.ss_customer_sk,
  ss.ss_item_sk,
  ss.ss_ticket_number,
  ss.ss_store_sk,
  ss.ss_promo_sk,
  ss.ss_sold_date_sk,
  c.c_customer_id,
  c.c_first_name,
  c.c_last_name,
  SUM(ss.ss_net_paid) AS total_paid
FROM
  (SELECT /*+ REPARTITION(50) */ * FROM store_sales) ss
LEFT JOIN customer c
  ON ss.ss_customer_sk = c.c_customer_sk
GROUP BY
  ss.ss_customer_sk,
  ss.ss_item_sk,
  ss.ss_ticket_number,
  ss.ss_store_sk,
  ss.ss_promo_sk,
  ss.ss_sold_date_sk,
  c.c_customer_id,
  c.c_first_name,
  c.c_last_name
LIMIT 100;
```

When comparing the first shuffle stage, I didn’t observe any performance regression with this patch applied.

1. Vanilla Spark (screenshot)
2. Buffering all partitions + 512m off-heap (screenshot)
3. With this patch + 512m off-heap (screenshot)
4. With this patch + 256m off-heap, to force more spill (screenshot)

By counting occurrences of the log line `VeloxCelebornColumnarShuffleWriter: Gluten shuffle writer: Trying to push` for the same partition, I can confirm that more spills happen when the memory is reduced.

@marin-ma
Contributor

LGTM. Thanks for confirming the performance results!

@kerwin-zk kerwin-zk merged commit f2a6870 into apache:main Nov 17, 2025
100 of 102 checks passed