Support deploy multiple shuffle servers in a single node #166

Merged: 6 commits merged on Sep 27, 2022

xianjingfeng
Member

What changes were proposed in this pull request?

1. Shuffle servers with the same IP will not be assigned to the same partition.
2. The shuffle server start script checks whether the port is already in use.
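
As an illustration of the second point, here is a minimal Java sketch of a "port already in use" check. The PR itself does this in the shell start script; the class and method names below (`PortCheck`, `isPortInUse`) are hypothetical and not part of the patch. The sketch simply tries to bind the port and treats a bind failure as "in use".

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical helper, not code from this PR.
final class PortCheck {

    // Returns true if the port cannot be bound on this host right now.
    static boolean isPortInUse(int port) {
        try (ServerSocket probe = new ServerSocket(port)) {
            // Binding succeeded, so nothing else is listening on the port.
            return false;
        } catch (IOException e) {
            // Bind failed (typically a BindException): treat the port as taken.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Port 0 asks the OS for any free port; we hold it while probing.
        try (ServerSocket held = new ServerSocket(0)) {
            System.out.println("held port in use: " + isPortInUse(held.getLocalPort()));
        }
    }
}
```

A probe like this is inherently racy: another process can take the port between the check and the real server start, so it serves as an early, friendlier failure rather than a guarantee.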

Why are the changes needed?

If a host has a lot of memory (more than 1 TB), we need to deploy multiple shuffle servers on a single node. #77

Does this PR introduce any user-facing change?

No

How was this patch tested?

already added

codecov-commenter commented Aug 18, 2022

Codecov Report

Merging #166 (2b74e85) into master (e6b4260) will increase coverage by 0.16%.
The diff coverage is 91.30%.

@@             Coverage Diff              @@
##             master     #166      +/-   ##
============================================
+ Coverage     59.17%   59.33%   +0.16%     
- Complexity     1332     1346      +14     
============================================
  Files           160      161       +1     
  Lines          8732     8780      +48     
  Branches        819      828       +9     
============================================
+ Hits           5167     5210      +43     
- Misses         3300     3303       +3     
- Partials        265      267       +2     
| Impacted Files | Coverage Δ |
|---|---|
| ...uniffle/coordinator/AssignmentStrategyFactory.java | 69.23% <50.00%> (ø) |
| ...niffle/coordinator/AbstractAssignmentStrategy.java | 91.42% <91.42%> (ø) |
| ...e/uniffle/coordinator/BasicAssignmentStrategy.java | 96.77% <100.00%> (ø) |
| ...rg/apache/uniffle/coordinator/CoordinatorConf.java | 97.05% <100.00%> (+0.11%) ⬆️ |
| ...oordinator/PartitionBalanceAssignmentStrategy.java | 98.46% <100.00%> (ø) |
| ...he/uniffle/server/buffer/ShuffleBufferManager.java | 82.15% <0.00%> (-1.04%) ⬇️ |
| ...a/org/apache/uniffle/server/ShuffleServerConf.java | 99.18% <0.00%> (-0.01%) ⬇️ |
| ...rg/apache/uniffle/server/ShuffleServerMetrics.java | 96.22% <0.00%> (+0.10%) ⬆️ |
| ...rg/apache/uniffle/storage/common/LocalStorage.java | 44.82% <0.00%> (+1.16%) ⬆️ |
| ...ava/org/apache/uniffle/coordinator/ServerNode.java | 83.78% <0.00%> (+2.70%) ⬆️ |

@@ -158,6 +158,12 @@ public class RssBaseConf extends RssConf {
.defaultValue(true)
.withDescription("The switch for jvm metrics verbose");

public static final ConfigOption<String> RSS_SERVER_IP = ConfigOptions
Contributor

Is this config option only used for tests? We already have 6937631; could we use that?

Member Author

I think using an environment variable is not good in this case, because we sometimes need to create multiple ShuffleServers in a UT.

import java.util.Set;

public abstract class AbstractAssignmentStrategy implements AssignmentStrategy {
protected List<ServerNode> getCandidateNodes(List<ServerNode> allNodes, int expectNum) {
Contributor

For a scheduler, we usually provide an option that lets users choose whether to allocate on a single node, because sometimes there are enough nodes to choose from. To solve this problem, a scheduler usually provides three semantics.
First, don't consider this factor: when we assign servers, we don't consider whether partitions end up on the same node.
Second, prefer to consider this factor: when we assign servers, we try our best to avoid placing partitions on the same node, but if we don't have enough servers, we may assign the same node to the partitions.
Third, must consider this factor: we must avoid placing partitions on the same node. If we don't have enough servers, we return the servers directly.
You don't need to implement every semantic, but we hope the others can be implemented easily in the future when users need them. So should we add some config options or an abstraction?
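
The three semantics described above can be sketched in Java roughly as follows. This is an illustrative sketch only, not the code merged in this PR: the `Node` class is a hypothetical stand-in for `ServerNode`, the enum name `HostPolicy` and the value `MUST_DIFF` are assumptions (only `PREFER_DIFF` and `NONE` are named later in this conversation), and "return the servers directly" is interpreted here as returning however many distinct-host servers were found.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch; names are assumptions, not the PR's API.
final class HostAwareSelection {

    // NONE: ignore hosts. PREFER_DIFF: spread over hosts, fall back if short.
    // MUST_DIFF (assumed name): never reuse a host, even if that returns fewer nodes.
    enum HostPolicy { NONE, PREFER_DIFF, MUST_DIFF }

    // Hypothetical stand-in for the coordinator's ServerNode.
    static final class Node {
        final String id;
        final String ip;

        Node(String id, String ip) {
            this.id = id;
            this.ip = ip;
        }
    }

    static List<Node> getCandidateNodes(List<Node> allNodes, int expectNum, HostPolicy policy) {
        List<Node> picked = new ArrayList<>();
        Set<String> usedHosts = new HashSet<>();
        List<Node> sameHostLeftovers = new ArrayList<>();

        for (Node node : allNodes) {
            if (picked.size() >= expectNum) {
                break;
            }
            // Under NONE every node is acceptable; otherwise only the first node per host.
            if (policy == HostPolicy.NONE || usedHosts.add(node.ip)) {
                picked.add(node);
            } else {
                sameHostLeftovers.add(node);
            }
        }

        // PREFER_DIFF falls back to nodes on already-used hosts when it is still short.
        if (policy == HostPolicy.PREFER_DIFF) {
            for (Node node : sameHostLeftovers) {
                if (picked.size() >= expectNum) {
                    break;
                }
                picked.add(node);
            }
        }
        return picked;
    }
}
```

With two servers on one host and a third on another, asking for three candidates yields three nodes under `NONE` and `PREFER_DIFF` but only two under `MUST_DIFF`, which is the behavioural difference the comment is describing.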

Member Author

Done

import java.util.List;
import java.util.Set;

public abstract class AbstractAssignmentStrategy implements AssignmentStrategy {
Member

@zuston zuston Aug 19, 2022

Why do we need this? If you want to deploy multiple shuffle servers on one machine and avoid assigning partitions to the same host, I think you could implement a custom assignment strategy. There is no need to change the default strategy.

Right?

Member Author

Most cases need this logic, and the strategies may share a lot of common logic in the future.

@jerqi jerqi linked an issue Aug 22, 2022 that may be closed by this pull request
# Conflicts:
#	bin/start-shuffle-server.sh
#	bin/utils.sh
#	common/src/main/java/org/apache/uniffle/common/config/RssBaseConf.java
@@ -191,6 +191,12 @@ public class RssBaseConf extends RssConf {
.defaultValue(60L)
.withDescription("The kerberos authentication relogin interval. unit: sec");

public static final ConfigOption<String> RSS_SERVER_IP = ConfigOptions
Contributor

@jerqi jerqi Sep 22, 2022

Could we reuse this environment value? 6937631

Member Author

Could we reuse this environment value? 6937631

Yes, but using environment variables may cause some problems if we create multiple ShuffleServers in a UT. Should we reuse it?

Contributor

You can see the PR's test case; it's not a big problem for UTs. We just need to set the environment variable for one server and start that server, instead of starting all the servers at once.

Contributor

jerqi commented Sep 22, 2022

@leixm Could you help look at this flaky test? https://github.com/apache/incubator-uniffle/actions/runs/3104642429/jobs/5029351283
The error logs are as follows:
test-reports-spark3.2.0.zip

Contributor

jerqi commented Sep 22, 2022

@leixm Could you help see this flaky test? https://github.com/apache/incubator-uniffle/actions/runs/3104642429/jobs/5029351283 error logs is as follow test-reports-spark3.2.0.zip

I have fixed this in the pr #238

Contributor

leixm commented Sep 22, 2022

@leixm Could you help see this flaky test? https://github.com/apache/incubator-uniffle/actions/runs/3104642429/jobs/5029351283 error logs is as follow test-reports-spark3.2.0.zip

I have fixed this in the pr #238

Thank you. @jerqi

Contributor

jerqi commented Sep 23, 2022

Should we add documentation for this feature?

assertTrue(hasSameHost);
});
}

Contributor

@jerqi jerqi Sep 23, 2022

We had better test the PREFER_DIFF and NONE strategies, too.

Contributor

@jerqi jerqi left a comment

LGTM, thanks @xianjingfeng @zuston

Contributor

jerqi commented Sep 27, 2022

Pending CI

@jerqi jerqi merged commit f1cb43f into apache:master Sep 27, 2022
Contributor

jerqi commented Sep 27, 2022

@xianjingfeng Sorry, I forgot to update the documentation. If you have time, please raise a PR to update the related docs.

@@ -48,7 +48,9 @@ MAIN_CLASS="org.apache.uniffle.server.ShuffleServer"
HADOOP_DEPENDENCY="$("$HADOOP_HOME/bin/hadoop" classpath --glob)"

echo "Check process existence"
is_jvm_process_running "$JPS" $MAIN_CLASS
RPC_PORT=`grep '^rss.rpc.server.port' $CONF_FILE |awk '{print $2}'`
Contributor

CONF_FILE should be SHUFFLE_SERVER_CONF_FILE. I will raise a pr to fix this.
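
For readers less familiar with the `grep ... | awk '{print $2}'` idiom in the script line above, the lookup can be sketched in Java as below. This is only an illustration under assumptions: `ConfPortLookup` and `findPort` are hypothetical names, the config file is assumed to hold whitespace-separated `key value` lines, and the match is on the exact key rather than on the prefix pattern that `grep` uses. The sketch makes the "no port found" case explicit, which is what the start script has to cope with if it reads from the wrong file variable.

```java
import java.util.List;
import java.util.OptionalInt;

// Hypothetical helper, not code from this PR.
final class ConfPortLookup {

    // Returns the integer value of the first line whose first field equals `key`.
    static OptionalInt findPort(List<String> confLines, String key) {
        for (String line : confLines) {
            // Split on runs of whitespace, as awk does by default.
            String[] fields = line.split("\\s+");
            if (fields.length >= 2 && fields[0].equals(key)) {
                try {
                    return OptionalInt.of(Integer.parseInt(fields[1]));
                } catch (NumberFormatException e) {
                    // Key present but value is not a number: report "not found".
                    return OptionalInt.empty();
                }
            }
        }
        return OptionalInt.empty();
    }
}
```

A caller could read the file with `java.nio.file.Files.readAllLines` and skip the port check when the result is empty, instead of probing an empty or malformed port value.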

Development

Successfully merging this pull request may close these issues.

[Feature Request] Support deploy multiple shuffle servers in a single node
5 participants