
[CELEBORN-1410] Combine multiple ShuffleBlockInfo into a single ShuffleBlockInfo #2524

Closed
s0nskar wants to merge 10 commits into apache:main from s0nskar:CELEBORN-1410

Conversation

@s0nskar (Contributor) commented May 21, 2024

What changes were proposed in this pull request?

Merging smaller ShuffleBlockInfo entries belonging to the same mapId, such that the size of each combined block does not exceed celeborn.shuffle.chunk.size.

Why are the changes needed?

As sorted ShuffleBlocks are contiguous, we can compact multiple ShuffleBlockInfo entries into one as long as the size of the compacted one does not exceed half of celeborn.shuffle.chunk.size. This way we can decrease the number of ShuffleBlock objects.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UTs
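The merging idea described in this PR can be sketched roughly as follows. This is a minimal illustration with hypothetical names (`Block`, `combine`), not Celeborn's actual classes: contiguous sorted blocks for the same mapId are folded into the previous combined block while the combined length stays within the chunk-size limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ShuffleBlockInfo: a (offset, length) pair.
class Block {
  long offset;
  long length;
  Block(long offset, long length) { this.offset = offset; this.length = length; }
}

public class CombineSketch {
  // Merge contiguous sorted blocks while the combined length stays <= chunkSize.
  static List<Block> combine(List<Block> sorted, long chunkSize) {
    List<Block> out = new ArrayList<>();
    for (Block b : sorted) {
      Block last = out.isEmpty() ? null : out.get(out.size() - 1);
      if (last != null && last.length + b.length <= chunkSize) {
        // Extend the previous combined block; blocks are contiguous in the sorted file.
        last.length += b.length;
      } else {
        // Start a new combined block at this block's offset.
        out.add(new Block(b.offset, b.length));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Block> sorted = new ArrayList<>();
    sorted.add(new Block(0, 3));
    sorted.add(new Block(3, 4)); // 3 + 4 <= 8, so it merges into the first block
    sorted.add(new Block(7, 6)); // 7 + 6 > 8, so a new block starts here
    List<Block> out = combine(sorted, 8);
    System.out.println(out.size() + " blocks, lengths " + out.get(0).length + " and " + out.get(1).length);
  }
}
```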

@s0nskar (Contributor, Author) commented May 21, 2024

@waitinfuture Do these changes look OK? Also, in the ticket you mentioned:

as long as the size of compacted one does not exceeds half of celeborn.shuffle.chunk.size

Why do we want to limit the size to half of shuffleChunkSize rather than use the full shuffleChunkSize?

codecov bot commented May 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 40.81%. Comparing base (121395f) to head (817eb62).
Report is 51 commits behind head on main.

Current head 817eb62 differs from pull request most recent head 3df993c

Please upload reports for the commit 3df993c to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2524      +/-   ##
==========================================
+ Coverage   40.17%   40.81%   +0.64%     
==========================================
  Files         218      220       +2     
  Lines       13742    14104     +362     
  Branches     1214     1258      +44     
==========================================
+ Hits         5520     5755     +235     
- Misses       7905     8020     +115     
- Partials      317      329      +12     


@waitinfuture (Contributor) left a comment:

Hi @s0nskar , thanks for the PR! I left a comment, PTAL :)

sortedBlock.offset = fileIndex;
sortedBlock.length = length;
sortedShuffleBlocks.add(sortedBlock);
fileIndex += transferBlock(offset, length);
waitinfuture (Contributor):

I think we should put fileIndex += transferBlock(offset, length); below the if-else block, because we need to make sure the offset of the first sortedBlock equals 0.

s0nskar (Contributor, Author):

Nice catch, fixed it!

s0nskar (Contributor, Author):

I'm trying to set up the UTs locally; this should have been caught by them. I'm facing the below error while running the tests; have you seen this error?

scala: bad option: -P:silencer:globalFilters=.*deprecated.*

waitinfuture (Contributor):

I don't see such an error. I usually run local UTs through IntelliJ or using sbt (https://celeborn.apache.org/docs/latest/developers/sbt/#testing-with-sbt); you can try out this document.

s0nskar (Contributor, Author):

Thanks! This is working for me ✌️

@waitinfuture (Contributor) commented:

Why do we want to limit the size to half of shuffleChunkSize rather than use the full shuffleChunkSize?

The reason is to avoid the case where the next batch exceeds half of shuffleChunkSize, making the whole length exceed shuffleChunkSize. But I think this is trivial; it's OK to use the full shuffleChunkSize.

// combine multiple `ShuffleBlockInfo` into a single `ShuffleBlockInfo` of size
// less than `shuffleChunkSize`
if (!sortedShuffleBlocks.isEmpty() &&
sortedShuffleBlocks.get(sortedShuffleBlocks.size() - 1).length + length <= shuffleChunkSize) {
s0nskar (Contributor, Author):

@waitinfuture Just wanted to double-check whether the <= condition would be fine here, or if shuffleChunkSize is an exclusive upper bound.

waitinfuture (Contributor):

<= is fine here; shuffleChunkSize is not an exclusive upper bound, in fact it's not a strict bound.

s0nskar (Contributor, Author) commented May 22, 2024

Thanks for the review @waitinfuture

s0nskar marked this pull request as ready for review May 22, 2024 05:11
SteNicholas changed the title from "[CELEBORN-1410] combine multiple ShuffleBlockInfo into a single ShuffleBlockInfo" to "[CELEBORN-1410] Combine multiple ShuffleBlockInfo into a single ShuffleBlockInfo" May 22, 2024
// size of compacted `ShuffleBlockInfo` does not exceed `shuffleChunkSize`
if (!sortedShuffleBlocks.isEmpty()
&& sortedShuffleBlocks.get(sortedShuffleBlocks.size() - 1).length + length
<= shuffleChunkSize) {
A contributor commented:

The condition "<= shuffleChunkSize" may cause the returned chunk's size to approach 2 * chunkSize. This won't cause trouble, but it changes the default behavior.

waitinfuture (Contributor) commented May 22, 2024:

Yes, say shuffleChunkSize is 8m: the generated blocks can be two 7.9m chunks, and when fetching data it can return the two chunks together, which is 15.8m. That's the reason why I mentioned in the Jira to use half of shuffleChunkSize, in which case the worst case is 3.9m * 3 = 11.7m.

A contributor commented:

Maybe narrowing the threshold to some fraction * shuffleChunkSize would be safer, say 0.25. WDYT?

sortedShuffleBlocks.get(sortedShuffleBlocks.size() - 1).length + length
                    <= 0.25 * shuffleChunkSize
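The fractional-threshold suggestion above can be sketched as follows. This is a hypothetical illustration (names like `maxCombinedLength` are not Celeborn's actual code) of the same merge rule as this PR: with a threshold of fraction * chunkSize, no combined block can grow past that threshold, which in turn bounds how far a fetch-side chunk can overshoot fetchChunkSize.

```java
public class ThresholdSketch {
  // Merge contiguous block lengths while the running combined length stays
  // within fraction * chunkSize; returns the largest combined block produced.
  // Illustrative sketch only, not Celeborn's implementation.
  static long maxCombinedLength(long[] lengths, double fraction, long chunkSize) {
    long max = 0, cur = 0;
    for (long len : lengths) {
      if (cur != 0 && cur + len <= fraction * chunkSize) {
        cur += len;  // extend the current combined block
      } else {
        cur = len;   // start a new combined block
      }
      if (cur > max) max = cur;
    }
    return max;
  }

  public static void main(String[] args) {
    long[] lengths = {10, 10, 10, 10};
    // With fraction = 0.25 and chunkSize = 80, combined blocks stop growing at 20.
    System.out.println(maxCombinedLength(lengths, 0.25, 80));
    // With fraction = 1.0, the same blocks keep merging, up to 40 here.
    System.out.println(maxCombinedLength(lengths, 1.0, 80));
  }
}
```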

s0nskar (Contributor, Author):

Yeah, that sounds good. I can make this threshold configurable with a default value of 0.25 to start with.

Although I'm not super clear about the issue in the discussion: why would fetching data return two or three chunks together, making it 7.9 * 2 = 15.8m or 3.9 * 3 = 11.7m as mentioned above? Do clients cache the initially read chunk while reading the next one? If that is the case, can someone point me to this piece of code?

waitinfuture (Contributor) commented May 23, 2024:

Yeah, that sounds good. I can make this threshold configurable with a default value of 0.25 to start with.

Although I'm not super clear about the issue in the discussion: why would fetching data return two or three chunks together, making it 7.9 * 2 = 15.8m or 3.9 * 3 = 11.7m as mentioned above? Do clients cache the initially read chunk while reading the next one? If that is the case, can someone point me to this piece of code?

Hi @s0nskar, when creating ReduceFileMeta it compacts contiguous ShuffleBlockInfos into a chunk: it first adds the current ShuffleBlockInfo, then checks whether the size exceeds fetchChunkSize:

if (info.offset - sortedChunkOffset.get(sortedChunkOffset.size() - 1) >= fetchChunkSize) {

When Worker handles fetch, it sends one chunk at a time, and chunk boundaries are stored in ReduceFileMeta#chunkOffsets.
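A rough sketch of that boundary rule (simplified from the condition quoted above; the names are illustrative, not Celeborn's actual code) shows why near-chunk-sized blocks produce chunks approaching 2 * fetchChunkSize: a boundary is only cut once the current chunk already spans at least fetchChunkSize, so the block just before the boundary can push the chunk well past the target.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkBoundarySketch {
  // Simplified boundary rule: cut a new chunk boundary at a block's start
  // offset once the current chunk spans at least fetchChunkSize.
  // Blocks are contiguous {offset, length} pairs.
  static List<Long> chunkOffsets(long[][] blocks, long fetchChunkSize) {
    List<Long> offsets = new ArrayList<>();
    offsets.add(0L);
    for (long[] b : blocks) {
      if (b[0] - offsets.get(offsets.size() - 1) >= fetchChunkSize) {
        offsets.add(b[0]);
      }
    }
    return offsets;
  }

  public static void main(String[] args) {
    // Blocks of 79 units with fetchChunkSize = 80 (think 7.9m vs 8m):
    // the first boundary lands at 158, so the first chunk is ~2x the target,
    // matching the 15.8m scenario above.
    long[][] big = {{0, 79}, {79, 79}, {158, 79}, {237, 79}};
    System.out.println(chunkOffsets(big, 80));

    // Blocks of 39 units (3.9m): the first boundary lands at 117 = 3 * 39,
    // matching the 11.7m worst case above.
    long[][] half = {{0, 39}, {39, 39}, {78, 39}, {117, 39}, {156, 39}};
    System.out.println(chunkOffsets(half, 80));
  }
}
```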

s0nskar (Contributor, Author):

Got it, thanks for the context.

s0nskar (Contributor, Author):

As per the documentation, fetchChunkSize is the "max chunk size". So shouldn't we change the above condition to respect it? As you mentioned, if the ShuffleBlocks are generated close to 8MB, then a chunk can be close to 16MB.

We can change this condition to generate offsets closer to fetchChunkSize with some error factor of 10 or 20%. WDYT?

// combine multiple small length `ShuffleBlockInfo` for same mapId such that
// size of compacted `ShuffleBlockInfo` does not exceed `shuffleChunkSize`
if (!sortedShuffleBlocks.isEmpty()
&& sortedShuffleBlocks.get(sortedShuffleBlocks.size() - 1).length + length
A contributor commented:

Can we introduce a variable for sortedShuffleBlocks.get(sortedShuffleBlocks.size() - 1)?

s0nskar (Contributor, Author):

Refactored the code a little to simplify it.

@waitinfuture (Contributor) commented:

Hi @s0nskar, please run UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite to auto-update the docs; see https://github.com/apache/celeborn/blob/main/CONTRIBUTING.md

@cfmcgrady (Contributor) left a comment:

LGTM

@waitinfuture (Contributor) left a comment:

LGTM, thanks! Merging to main(v0.5.0)/branch-0.4(v0.4.2)

@s0nskar s0nskar deleted the CELEBORN-1410 branch May 23, 2024 13:29
waitinfuture pushed a commit that referenced this pull request May 23, 2024
…leBlockInfo

Merging smaller `ShuffleBlockInfo` entries belonging to the same mapId, such that the size of each block does not exceed `celeborn.shuffle.chunk.size`.

As sorted ShuffleBlocks are contiguous, we can compact multiple `ShuffleBlockInfo` entries into one as long as the size of the compacted one does not exceed half of `celeborn.shuffle.chunk.size`. This way we can decrease the number of ShuffleBlock objects.

No

Existing UTs

Closes #2524 from s0nskar/CELEBORN-1410.

Lead-authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>