[ISSUE-468] Put unavailable servers to the end of the list when sending shuffle data #470

Merged
merged 11 commits into from
Jan 31, 2023

Conversation

xianjingfeng
Member

@xianjingfeng xianjingfeng commented Jan 11, 2023

What changes were proposed in this pull request?

Put unavailable shuffle servers to the end of the server list when sending shuffle data if replica=1.

Why are the changes needed?

If we use multiple replicas and the first shuffle server becomes unavailable, sending data takes a long time, because the client always tries the first shuffle server first. See #468.
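
The reordering idea can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `ServerOrderSketch` and `unavailableLast` are hypothetical names, and servers are modeled as plain strings instead of `ShuffleServerInfo`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of ShuffleWriteClientImpl.
final class ServerOrderSketch {
  private ServerOrderSketch() {
  }

  /**
   * Returns a copy of {@code servers} in which every server listed in
   * {@code unavailable} is moved to the end. Relative order is preserved
   * within the available group and within the unavailable group.
   */
  static List<String> unavailableLast(List<String> servers, Set<String> unavailable) {
    List<String> available = new ArrayList<>();
    List<String> deferred = new ArrayList<>();
    for (String server : servers) {
      if (unavailable.contains(server)) {
        deferred.add(server);
      } else {
        available.add(server);
      }
    }
    // Unavailable servers are still kept, only tried last.
    available.addAll(deferred);
    return available;
  }
}
```

With this ordering the client would still reach an unavailable server eventually, but only after the servers that are believed to be healthy.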

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

@codecov-commenter

codecov-commenter commented Jan 11, 2023

Codecov Report

Merging #470 (bd54e9b) into master (ebaff6a) will increase coverage by 1.17%.
The diff coverage is 79.31%.

@@             Coverage Diff              @@
##             master     #470      +/-   ##
============================================
+ Coverage     58.74%   59.92%   +1.17%     
- Complexity     1664     1785     +121     
============================================
  Files           199      205       +6     
  Lines         11236    11557     +321     
  Branches        999     1043      +44     
============================================
+ Hits           6601     6925     +324     
+ Misses         4243     4226      -17     
- Partials        392      406      +14     
Impacted Files Coverage Δ
...he/uniffle/client/impl/ShuffleWriteClientImpl.java 35.10% <79.31%> (+5.28%) ⬆️
...che/uniffle/server/storage/HdfsStorageManager.java 90.90% <0.00%> (-4.33%) ⬇️
...he/uniffle/server/storage/MultiStorageManager.java 60.71% <0.00%> (-3.44%) ⬇️
...rg/apache/uniffle/server/buffer/ShuffleBuffer.java 93.38% <0.00%> (-1.66%) ⬇️
...he/uniffle/server/storage/LocalStorageManager.java 87.00% <0.00%> (-1.36%) ⬇️
...ava/org/apache/uniffle/common/util/RetryUtils.java 71.42% <0.00%> (-1.30%) ⬇️
...g/apache/hadoop/mapred/SortWriteBufferManager.java 79.89% <0.00%> (-0.22%) ⬇️
...ache/uniffle/coordinator/SimpleClusterManager.java 86.71% <0.00%> (-0.11%) ⬇️
...pache/uniffle/server/ShuffleServerGrpcService.java 0.79% <0.00%> (-0.02%) ⬇️
...he/uniffle/coordinator/CoordinatorGrpcService.java 2.08% <0.00%> (-0.02%) ⬇️
... and 23 more

@jerqi
Contributor

jerqi commented Jan 13, 2023

…nding shuffle data

Could you modify your title and description?

@xianjingfeng xianjingfeng changed the title [ISSUE-468] Put unavailable shuffle servers to the end of the server list when se… [ISSUE-468] Put unavailable servers to the end of the list when sending shuffle data Jan 13, 2023
@xianjingfeng
Member Author

…nding shuffle data

Could you modify your title and description?

Done

advancedxy
advancedxy previously approved these changes Jan 29, 2023
Contributor

@advancedxy advancedxy left a comment

This generally looks good to me. ping @jerqi to see if he has more input/comments.

for (ShuffleServerInfo ssi : serverList) {
if (!includeBlockList && replica > 1 && !shuffleServerBlockList.isEmpty()
Contributor

Will it allocate fewer servers than replicaNum if too many nodes are excluded?

Member Author

Updated

Contributor

@kaijchen kaijchen left a comment

Thanks @xianjingfeng for the work, I have left some comments, PTAL.

kaijchen
kaijchen previously approved these changes Jan 30, 2023
Contributor

@kaijchen kaijchen left a comment

LGTM, thanks @xianjingfeng.

@@ -105,6 +106,7 @@ public class ShuffleWriteClientImpl implements ShuffleWriteClient {
private final ExecutorService dataTransferPool;
private final int unregisterThreadPoolSize;
private final int unregisterRequestTimeSec;
private Set<ShuffleServerInfo> shuffleServerBlocklist;
Contributor

Why do we use the name shuffleServerBlocklist? What's the meaning of this variable?

Contributor

Blocklist means the same as blacklist.

Contributor

@jerqi jerqi Jan 30, 2023

Block is already a concept in our system, so this name confuses me.

Contributor

@kaijchen kaijchen Jan 30, 2023

Alternatives: disallowlist, denylist, excludelist.
Or maybe just blacklist?

Member Author

Block is already a concept in our system, so this name confuses me.

I think so too, but I don't have a good idea. How about defectiveServerList? We need to unify our opinions. 😂 @jerqi @kaijchen @advancedxy

Contributor

@kaijchen kaijchen Jan 31, 2023

I think so too, but I don't have a good idea. How about defectiveServerList? We need to unify our opinions. 😂 @jerqi @kaijchen @advancedxy

Maybe just defectiveServers? Because it's actually a Set.
And withBlocklist may be changed to excludeDefectiveServers.

Contributor

I think so too, but I don't have a good idea. How about defectiveServerList? We need to unify our opinions. 😂 @jerqi @kaijchen @advancedxy

Maybe just defectiveServers? Because it's actually a Set. And withBlocklist may be changed to excludeDefectiveServers.

It's ok for me.

Contributor

+1 for defectiveServers

Member Author

Done

}

if (assignedNum < replicaNum && withBlocklist) {
genServerToBlocks(sbi, serverList, replicaNum - assignedNum,
Contributor

@jerqi jerqi Jan 30, 2023

Could this cause one shuffle server to be allocated twice?

Member Author

No. Each assigned server is added to excludeServers, so it cannot be picked again.
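
To illustrate the point in this thread, here is a rough sketch of a two-pass assignment in which defective servers are used only when the first pass comes up short, and an exclude set prevents assigning the same server twice. It is an assumption-laden illustration, not the merged code: `ReplicaAssignSketch`, `assign`, and `pick` are hypothetical names, servers are plain strings, and only `replicaNum`, `excludeServers`, and `defectiveServers` echo names from the review.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; the real logic lives in ShuffleWriteClientImpl.
final class ReplicaAssignSketch {
  private ReplicaAssignSketch() {
  }

  /** Picks up to replicaNum distinct servers, preferring non-defective ones. */
  static List<String> assign(List<String> serverList, Set<String> defectiveServers, int replicaNum) {
    List<String> assigned = new ArrayList<>();
    Set<String> excludeServers = new HashSet<>();
    // Pass 1: skip servers that are marked defective.
    pick(serverList, replicaNum, defectiveServers, excludeServers, assigned);
    // Pass 2: if replicas are still missing, allow defective servers too.
    if (assigned.size() < replicaNum) {
      pick(serverList, replicaNum, Set.of(), excludeServers, assigned);
    }
    return assigned;
  }

  private static void pick(List<String> serverList, int replicaNum, Set<String> skip,
      Set<String> excludeServers, List<String> assigned) {
    for (String server : serverList) {
      if (assigned.size() >= replicaNum) {
        return;
      }
      if (skip.contains(server) || excludeServers.contains(server)) {
        continue;
      }
      assigned.add(server);
      // Recording the server here is what prevents a second assignment in pass 2.
      excludeServers.add(server);
    }
  }
}
```

In this sketch the result can still be shorter than `replicaNum` when `serverList` itself has fewer entries than the requested number of replicas.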

Contributor

@jerqi jerqi left a comment

@jerqi jerqi merged commit ebbe2db into apache:master Jan 31, 2023
Successfully merging this pull request may close these issues.

[Improvement] Put unavailable shuffle servers to the end of the server list when sending shuffle data
5 participants