-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support deploy multiple shuffle servers in a single node #166
Conversation
Codecov Report
@@ Coverage Diff @@
## master #166 +/- ##
============================================
+ Coverage 59.17% 59.33% +0.16%
- Complexity 1332 1346 +14
============================================
Files 160 161 +1
Lines 8732 8780 +48
Branches 819 828 +9
============================================
+ Hits 5167 5210 +43
- Misses 3300 3303 +3
- Partials 265 267 +2
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@@ -158,6 +158,12 @@ public class RssBaseConf extends RssConf { | |||
.defaultValue(true) | |||
.withDescription("The switch for jvm metrics verbose"); | |||
|
|||
public static final ConfigOption<String> RSS_SERVER_IP = ConfigOptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this config option is only used for test? We already have 6937631, could we use this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think use environment variable is not good in this case, because we need create mutil ShuffleServer in ut sometimes.
import java.util.Set; | ||
|
||
public abstract class AbstractAssignmentStrategy implements AssignmentStrategy { | ||
protected List<ServerNode> getCandidateNodes(List<ServerNode> allNodes, int expectNum) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a scheduler, we usually provide an option for users to choose whether to allocate on a single node. Because there are enough nodes to choose sometimes. To solve this problems, scheduler usually provide three semantics.
First, don't consider this factor. When we can assign the servers, we don't consider whether there are partitions on the same node.
Second, prefer considering this factor. When we can assign the servers, we try our best to avoid the partitions on the same node. But if we don't have enough the servers, we could assign the same node to the partitions.
Third, must considering this factor. We must avoid the partitions on the same node. If we don't have enough servers, we return the servers directly.
You don't need to implement the every semantics, but we hope we can implement the other semantics easily in the future when there are users which need the semantics , so should we need some config options or implement abstraction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
import java.util.List; | ||
import java.util.Set; | ||
|
||
public abstract class AbstractAssignmentStrategy implements AssignmentStrategy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need this? If u want to deploy multiple shuffle servers on one machine and hope avoid partitions assigned to same host. I think you could implement custom assignment strategy. There is no need to change default strategy.
Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most cases need this logic, and strategys may have many same logics in the future.
# Conflicts: # bin/start-shuffle-server.sh # bin/utils.sh # common/src/main/java/org/apache/uniffle/common/config/RssBaseConf.java
@@ -191,6 +191,12 @@ public class RssBaseConf extends RssConf { | |||
.defaultValue(60L) | |||
.withDescription("The kerberos authentication relogin interval. unit: sec"); | |||
|
|||
public static final ConfigOption<String> RSS_SERVER_IP = ConfigOptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we reuse this environment value? 6937631
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we reuse this environment value? 6937631
Yes, but using environment variables may cause some problems if we create mutil ShuffleServer in ut, should we reuse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see the pr's test case, it's not a big problem for ut. We just need to set the environment variable for one server and start the server instead of starting all the servers.
@leixm Could you help see this flaky test? https://github.com/apache/incubator-uniffle/actions/runs/3104642429/jobs/5029351283 |
I have fixed this in the pr #238 |
Thank you. @jerqi |
Should we add the documents for this feature? |
assertTrue(hasSameHost); | ||
}); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would better test the strategy PREFER_DIFF
and NONE
, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @xianjingfeng @zuston
Pending CI |
@xianjingfeng Sorry, I forget updating the document. If u have time, please raise a pr to update the related the doc. |
@@ -48,7 +48,9 @@ MAIN_CLASS="org.apache.uniffle.server.ShuffleServer" | |||
HADOOP_DEPENDENCY="$("$HADOOP_HOME/bin/hadoop" classpath --glob)" | |||
|
|||
echo "Check process existence" | |||
is_jvm_process_running "$JPS" $MAIN_CLASS | |||
RPC_PORT=`grep '^rss.rpc.server.port' $CONF_FILE |awk '{print $2}'` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CONF_FILE
should be SHUFFLE_SERVER_CONF_FILE
. I will raise a pr to fix this.
What changes were proposed in this pull request?
1.Sufflle server with same ip will not be assigned to same partition
2.Check whether port is in use in start script of shuffle server
Why are the changes needed?
If we have a lot of memory(more than 1T) per host, so we need deploy multiple shuffle servers in a single node. #77
Does this PR introduce any user-facing change?
No
How was this patch tested?
already added