-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#80][Part-1] feat: Add decommisson logic to shuffle server #606
Conversation
# Conflicts: # coordinator/src/main/java/org/apache/uniffle/coordinator/SimpleClusterManager.java
Codecov Report
@@ Coverage Diff @@
## master #606 +/- ##
============================================
+ Coverage 60.90% 62.87% +1.96%
+ Complexity 1799 1798 -1
============================================
Files 214 202 -12
Lines 12381 10489 -1892
Branches 1042 1051 +9
============================================
- Hits 7541 6595 -946
+ Misses 4437 3547 -890
+ Partials 403 347 -56
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
common/src/main/java/org/apache/uniffle/common/ServerStatus.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
public void decommission() { | ||
checkStatusForDecommission(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe decommission
should be idempotent?
Once the shuffle server is entering decomissioning
state, following calls should just do nothing, rather than throw an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There will be other statuses in the future, such as Upgrading
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't matter that much?
Upgrading could also been decommissoned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If current status is Upgrading
, and than invoke decommisson, the status will be changed to decommissoning
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you in mind the Upgrading
status mean? when the shuffle server would be upgrading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean if the server is not decomissioning and not in normal status(such as upgrading), decommisson command should be reject. How about if server is decomissioning, we do nothing, but if server is upgrading, throw an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/apache/uniffle/server/ShuffleServerTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xianjingfeng for the work, some suggestions here.
server/src/test/java/org/apache/uniffle/server/ShuffleServerTest.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/apache/uniffle/server/ShuffleServerTest.java
Outdated
Show resolved
Hide resolved
common/src/main/java/org/apache/uniffle/common/ServerStatus.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this pr is in good shape, and almost ready to go. Left some minor comments.
server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/apache/uniffle/server/ShuffleServerTest.java
Outdated
Show resolved
Hide resolved
Hi @xianjingfeng, sorry for not expressing it correctly. Edit: maybe just use a diff --git a/server/src/main/java/org/apache/uniffle/server/ShuffleServer.java b/server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
index ee8f0775..0c5ecd46 100644
--- a/server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
+++ b/server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
@@ -88,7 +87,7 @@ public class ShuffleServer {
private MetricReporter metricReporter;
private volatile ServerStatus serverStatus = ServerStatus.ACTIVE;
private volatile boolean running;
private ExecutorService executorService;
+ private Future<?> decommissionFuture;
public ShuffleServer(ShuffleServerConf shuffleServerConf) throws Exception {
this.shuffleServerConf = shuffleServerConf;
@@ -287,11 +286,7 @@ public class ShuffleServer {
}
serverStatus = ServerStatus.DECOMMISSIONING;
LOG.info("Shuffle Server is decommissioning.");
if (executorService == null) {
executorService = Executors.newSingleThreadExecutor(
ThreadUtils.getThreadFactory("shuffle-server-decommission-%d"));
}
- executorService.submit(this::waitDecommissionFinish);
+ decommissionFuture = executorService.submit(this::waitDecommissionFinish);
}
private void waitDecommissionFinish() {
@@ -332,8 +327,12 @@ public class ShuffleServer {
LOG.info("Shuffle server is not decommissioning. Nothing needs to be done.");
return;
}
- serverStatus = ServerStatus.ACTIVE;
- LOG.info("Decommission canceled.");
+ if (decommissionFuture.cancel(true)) {
+ serverStatus = ServerStatus.ACTIVE;
+ LOG.info("Decommission canceled.");
+ } else {
+ LOG.warn("Failed to cancel decommission.");
+ }
}
public String getIp() { |
What are the benefits? |
Keeping a thread pool (singleThreadExecutor) for an uncommon task sounds inefficient. Edit: it is not a good practice to blocking the thread using |
+1 for this. |
Done |
@xianjingfeng Hi, could you rebase you code with the latest master? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @xianjingfeng, thanks for updating the PR and docs. Some comments regarding to the "cancel decommission".
server/src/main/java/org/apache/uniffle/server/ShuffleServer.java
Outdated
Show resolved
Hide resolved
if (decommissionFuture.cancel(true)) { | ||
LOG.info("Decommission canceled."); | ||
} else { | ||
LOG.warn("Failed to cancel decommission."); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result of decommissionFuture.cancel()
doesn't matter now. Since "cancel" always brings a server back to ACTIVE. Maybe we should call executorService.shutdownNow()
here, @advancedxy WDYT?
if (decommissionFuture.cancel(true)) { | |
LOG.info("Decommission canceled."); | |
} else { | |
LOG.warn("Failed to cancel decommission."); | |
} | |
if (decommissionFuture != null) { | |
decommissionFuture.cancel(true); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Cancel is always need. And we cannot call shutdownNow
here, otherwise we cannot decommission again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving this since no blocking issues.
Co-authored-by: Kaijie Chen <ckj@apache.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -149,6 +163,7 @@ public void stopServer() throws Exception { | |||
} | |||
SecurityContextFactory.get().getSecurityContext().close(); | |||
server.stop(); | |||
running = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shutdown the executorService here?
if (executorService != null) {
executorService.shutdown()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @xianjingfeng.
Thanks @kaijchen @advancedxy @jerqi for the review |
What changes were proposed in this pull request?
Add decommisson logic to shuffle server
Why are the changes needed?
Support shuffle server decommission. It is a part of #80
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT