[ISSUE-475][Improvement] It's unnecessary to use ConcurrentHashMap for "partitionToBlockIds" in RssShuffleWriter #480

Merged
merged 5 commits into from
Jan 16, 2023

Conversation

jiafuzha
Contributor

What changes were proposed in this pull request?

Replaced some unnecessary ConcurrentHashMap instances with HashMap.

Why are the changes needed?

Improves performance by avoiding ConcurrentHashMap overhead where the map is not modified concurrently.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested with repartition workload

…r partitionToBlockIds in RssShuffleWriter

Signed-off-by: Jifu Zhang <jiafu.zhang@intel.com>
…r partitionToBlockIds in RssShuffleWriter

replaced concurrenthashmap with hashmap for local variable in ShuffleWriteClientImpl

Signed-off-by: Jifu Zhang <jiafu.zhang@intel.com>
@jiafuzha jiafuzha marked this pull request as draft January 13, 2023 02:28
@jiafuzha jiafuzha marked this pull request as ready for review January 13, 2023 02:29
@jerqi jerqi changed the title [Improvement] It's unnecessary to use ConcurrentHashMap for "partitionToBlockIds" in RssShuffleWriter [ISSUE-475][Improvement] It's unnecessary to use ConcurrentHashMap for "partitionToBlockIds" in RssShuffleWriter Jan 13, 2023
@jerqi jerqi requested a review from advancedxy January 13, 2023 02:30
@jiafuzha
Contributor Author

@advancedxy @jerqi please help review.

codecov-commenter commented Jan 13, 2023

Codecov Report

Merging #480 (acc0d37) into master (19a8bac) will decrease coverage by 0.01%.
The diff coverage is 33.33%.

@@             Coverage Diff              @@
##             master     #480      +/-   ##
============================================
- Coverage     58.78%   58.77%   -0.02%     
  Complexity     1704     1704              
============================================
  Files           206      206              
  Lines         11471    11468       -3     
  Branches       1024     1024              
============================================
- Hits           6743     6740       -3     
  Misses         4317     4317              
  Partials        411      411              
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...he/uniffle/client/impl/ShuffleWriteClientImpl.java | 29.82% <33.33%> (ø) | |
| ...rg/apache/uniffle/server/ShuffleServerMetrics.java | 97.05% <0.00%> (-0.14%) | ⬇️ |
| ...org/apache/uniffle/server/ShuffleFlushManager.java | 84.04% <0.00%> (+0.08%) | ⬆️ |
| ...g/apache/uniffle/server/ShuffleDataFlushEvent.java | 83.67% <0.00%> (+0.34%) | ⬆️ |


…r partitionToBlockIds in RssShuffleWriter

applied changes to spark2 module

Signed-off-by: Jifu Zhang <jiafu.zhang@intel.com>
@@ -259,7 +259,7 @@ public SendShuffleDataResult sendShuffleData(String appId, List<ShuffleBlockInfo
}

// maintain the count of blocks that have been sent to the server
- Map<Long, AtomicInteger> blockIdsTracker = Maps.newConcurrentMap();
+ Map<Long, AtomicInteger> blockIdsTracker = Maps.newHashMap();
Contributor

This variable will be accessed by multiple threads in sendShuffleDataAsync.

Contributor Author

No, it's not shared, since a new instance is created each time the method is called.

Contributor Author

I overlooked the CompletableFuture part inside sendShuffleDataAsync. Let me roll back the change.

Contributor

> no, it's not shared since a new instance is created each time you call the method.

See the following line for more details:

serverToBlockIds.get(ssi).forEach(block -> blockIdsTracker.get(block).incrementAndGet());

Contributor Author

rolled back.

Contributor Author

I went through the logic and didn't find any update to "blockIdsTracker" (correct me if I am wrong) in the main thread after the "sendShuffleDataAsync" call, which runs asynchronously in the thread pool "dataTransferPool". According to the documentation of BlockingQueue (used internally by the thread pool), "...actions in a thread prior to placing an object into a BlockingQueue happen-before actions subsequent to the access or removal of that element from the BlockingQueue in another thread.".

So I think we don't need ConcurrentHashMap for "blockIdsTracker". And since you use "AtomicInteger" as the value type of "blockIdsTracker", that is enough to make the updated values visible to other threads in later code.
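
The argument above can be illustrated with a minimal sketch. This is not the Uniffle code: `TrackerSketch`, `countSends`, and the pool size are hypothetical names and values chosen for illustration. It assumes the map is fully populated on the calling thread before any task is submitted, and that worker tasks only look up entries and increment the `AtomicInteger` values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the actual ShuffleWriteClientImpl logic.
class TrackerSketch {

  // Counts how many server lists each block id appears in.
  static Map<Long, AtomicInteger> countSends(List<List<Long>> blocksPerServer) {
    // Step 1: populate the plain HashMap on the calling thread only.
    Map<Long, AtomicInteger> tracker = new HashMap<>();
    for (List<Long> blocks : blocksPerServer) {
      for (Long blockId : blocks) {
        tracker.putIfAbsent(blockId, new AtomicInteger(0));
      }
    }

    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<CompletableFuture<Void>> futures = new ArrayList<>();
      for (List<Long> blocks : blocksPerServer) {
        // Step 2: handing the task to the pool happens-before its execution,
        // so each worker sees the fully populated map.
        // Step 3: workers only call get(); they never add or remove entries.
        futures.add(CompletableFuture.runAsync(
            () -> blocks.forEach(id -> tracker.get(id).incrementAndGet()), pool));
      }
      // Step 4: wait for all workers before reading the counters.
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    } finally {
      pool.shutdown();
    }
    return tracker;
  }

  public static void main(String[] args) {
    List<List<Long>> blocksPerServer =
        Arrays.asList(Arrays.asList(1L, 2L), Arrays.asList(2L, 3L));
    System.out.println(countSends(blocksPerServer));
  }
}
```

The sketch stays safe only while no thread changes the map's structure after the first task is submitted; the counters themselves are safe to update concurrently because they are `AtomicInteger` instances.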

Contributor

You're right, but it's safer to use ConcurrentHashMap. If we modify this logic one day, we could forget to change the type back to ConcurrentHashMap. If you still think it's worthwhile to change the type, I think we could add some comments explaining why we don't use ConcurrentHashMap, as a reminder of this point.
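
For contrast, here is a hedged sketch of the kind of future change this concern refers to: if worker tasks ever insert entries themselves, the map is structurally modified from several threads and a plain HashMap is no longer safe. The class and method names are hypothetical and do not come from the Uniffle code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of logic that would require a concurrent map.
class ConcurrentInsertSketch {

  static Map<Long, AtomicInteger> countSends(List<List<Long>> blocksPerServer) {
    // Workers insert entries on demand, so the map itself must be thread-safe.
    Map<Long, AtomicInteger> tracker = new ConcurrentHashMap<>();

    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<CompletableFuture<Void>> futures = new ArrayList<>();
      for (List<Long> blocks : blocksPerServer) {
        futures.add(CompletableFuture.runAsync(
            // computeIfAbsent is atomic on ConcurrentHashMap, so two workers
            // racing on the same block id share a single counter.
            () -> blocks.forEach(id ->
                tracker.computeIfAbsent(id, k -> new AtomicInteger(0)).incrementAndGet()),
            pool));
      }
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    } finally {
      pool.shutdown();
    }
    return tracker;
  }

  public static void main(String[] args) {
    System.out.println(countSends(
        Arrays.asList(Arrays.asList(1L, 2L), Arrays.asList(2L, 3L))));
  }
}
```

The only difference from a pre-populated tracker is where insertion happens: once it moves into the worker tasks, the concurrent map type stops being optional.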

Contributor Author

I just changed it back to HashMap, with comments explaining the reason. From the code logic perspective, we are unlikely to insert or delete entries after dispatching the map to sendShuffleDataAsync.
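
One possible way to back such a comment with a runtime check, shown here only as an illustration and not as what this PR does, is to expose the tracker through `Collections.unmodifiableMap` once it is populated. Any later attempt to insert or delete entries then fails fast, while the `AtomicInteger` values remain updatable. `ReadOnlyTrackerSketch` and `buildTracker` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a read-only view guards the "no insert/delete" assumption.
class ReadOnlyTrackerSketch {

  static Map<Long, AtomicInteger> buildTracker(List<Long> blockIds) {
    // A plain HashMap is sufficient: all entries are inserted here, on one
    // thread, before the map is handed to any asynchronous task.
    Map<Long, AtomicInteger> tracker = new HashMap<>();
    for (Long id : blockIds) {
      tracker.put(id, new AtomicInteger(0));
    }
    // The unmodifiable view rejects put/remove with
    // UnsupportedOperationException; the counters can still be incremented.
    return Collections.unmodifiableMap(tracker);
  }

  public static void main(String[] args) {
    Map<Long, AtomicInteger> tracker = buildTracker(Arrays.asList(1L, 2L));
    tracker.get(1L).incrementAndGet(); // allowed: mutates the value, not the map
    try {
      tracker.put(3L, new AtomicInteger(0)); // rejected: structural change
    } catch (UnsupportedOperationException e) {
      System.out.println("structural modification rejected");
    }
  }
}
```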

…r partitionToBlockIds in RssShuffleWriter

rolled back changes to blockIdsTracker in ShuffleWriteClientImpl since it will be referenced later in different threads

Signed-off-by: Jifu Zhang <jiafu.zhang@intel.com>
jerqi
jerqi previously approved these changes Jan 13, 2023
Contributor

@jerqi jerqi left a comment

LGTM, thanks @zjf2012

…r partitionToBlockIds in RssShuffleWriter

hashmap is good here since no delete/insert to the tracker in other threads

Signed-off-by: Jifu Zhang <jiafu.zhang@intel.com>
@jiafuzha jiafuzha requested review from jerqi and removed request for advancedxy January 16, 2023 02:15
Contributor

@advancedxy advancedxy left a comment

Seems reasonable to me.

@jerqi please take another look

Contributor

@jerqi jerqi left a comment

LGTM, thanks @zjf2012 @advancedxy, merged.

@jerqi jerqi merged commit 96cf2cc into apache:master Jan 16, 2023
Successfully merging this pull request may close these issues.

[Improvement] It's unnecessary to use ConcurrentHashMap for "partitionToBlockIds" in RssShuffleWriter
4 participants