[VL] Remove buffering of sorted partitions in RSS writer to prevent OOM #11059
kerwin-zk merged 2 commits into apache:main
Conversation
@marin-ma @kerwin-zk hey, could you please help review this? Thanks!
```cpp
ARROW_ASSIGN_OR_RAISE(
    auto rssOs, arrow::io::BufferOutputStream::Create(options_->pushBufferMaxSize, arrow::default_memory_pool()));
if (codec_ != nullptr) {
  ARROW_ASSIGN_OR_RAISE(
```
Do we need to compress the data here again for RSS shuffle? It looks like compression already happens inside shuffleClient: https://github.com/apache/celeborn/blob/5e4d80bb1e764b80f5d3462bb8ffb9061efc63b4/client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java#L1052
FYI: We have disabled compression in the RSS client (Uniffle) and only retained Gluten's shuffle compression.
Makes sense. Celeborn also doesn't need to compress again IMO; let me remove it as well.
Maybe disabling compression on the Celeborn side via config (if that option exists) is sufficient, so there is no need to implicitly disable it in the Gluten codebase.
@zuston This has already been implemented in the new version of Celeborn, and I'll adapt it accordingly later.
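If Celeborn does expose such a switch, this becomes a pure configuration change. The sketch below is an illustration only: the Celeborn key name is an assumption based on Celeborn's Spark-prefixed configuration convention and should be checked against the docs for the Celeborn version in use.

```properties
# Hypothetical sketch: keep Gluten's shuffle compression, skip the RSS client's.
# The Celeborn key name is an assumption; verify it against the Celeborn docs.
spark.celeborn.client.shuffle.compression.codec=none
# Spark/Gluten shuffle compression stays enabled as usual:
spark.shuffle.compress=true
```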
Run Gluten Clickhouse CI on x86
@boneanxs Thanks for identifying this issue. I wonder whether this change may affect the shuffled data size. The original design aimed to produce smaller compressed output by compressing a larger buffered input before flushing.
@marin-ma Thanks for pointing that out. I tested it, and it still produces less shuffle data than rss_sort. I can add more tests to measure how much the data volume increases compared to buffering the entire partition. However, even if the volume does increase, we still can't buffer whole partitions when they are large.
Doesn't the RSS shuffle writer trigger a spill?
Spill still can't evict these buffers: sortEvict is called, and the data is then buffered in the Arrow BufferOutputStream.
That's right, thank you for your explanation.
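To make the memory argument above concrete, here is a minimal standalone C++ sketch (function names are illustrative, not Gluten's actual API) contrasting the two strategies: accumulating every sortEvict payload in one in-memory stream versus pushing each chunk out immediately. Peak writer memory drops from the whole partition to a single chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the old path: every evicted chunk is appended to one
// in-memory output stream, so peak memory grows with the partition size.
size_t peakBytesBuffered(const std::vector<size_t>& evictedChunkSizes) {
  size_t total = 0;
  for (size_t c : evictedChunkSizes) total += c;  // all chunks accumulate
  return total;
}

// Hypothetical model of the new path: each evicted chunk is pushed to the RSS
// client right away, so the writer holds at most one chunk at a time.
size_t peakBytesDirectPush(const std::vector<size_t>& evictedChunkSizes) {
  size_t peak = 0;
  for (size_t c : evictedChunkSizes) peak = std::max(peak, c);
  return peak;
}
```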
411cacf to e1a22c3
e1a22c3 to 4cb3648
@boneanxs Thanks for following up.
Could you also compare with vanilla Spark + Celeborn?
It would be nice to have more performance results to share.
@marin-ma I ran a test query on TPC-DS (3TB) using the following SQL:

```sql
SELECT
  ss.ss_customer_sk,
  ss.ss_item_sk,
  ss.ss_ticket_number,
  ss.ss_store_sk,
  ss.ss_promo_sk,
  ss.ss_sold_date_sk,
  c.c_customer_id,
  c.c_first_name,
  c.c_last_name,
  SUM(ss.ss_net_paid) AS total_paid
FROM
  (SELECT /*+ REPARTITION(50) */ * FROM store_sales) ss
  LEFT JOIN customer c
    ON ss.ss_customer_sk = c.c_customer_sk
GROUP BY
  ss.ss_customer_sk,
  ss.ss_item_sk,
  ss.ss_ticket_number,
  ss.ss_store_sk,
  ss.ss_promo_sk,
  ss.ss_sold_date_sk,
  c.c_customer_id,
  c.c_first_name,
  c.c_last_name
LIMIT 100;
```

When comparing the first shuffle stage, I didn't observe any performance regression with this patch applied.
By counting the number of
marin-ma left a comment
LGTM. Thanks for confirming the performance results!




What changes are proposed in this pull request?
#10244 removed `RssPartitionWriterOutputStream`, which could buffer a whole partition in memory and cause OOM. This PR pushes data directly to the rssClient when sortEvict is called, since the rssClient itself also buffers data before sending it to the remote.
How was this patch tested?