feat: implement fast-failover for MessageRecvManager and DataClientManager#243
feat: implement fast-failover for MessageRecvManager and DataClientManager#243coderzc merged 6 commits intoapache:masterfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## master #243 +/- ##
============================================
+ Coverage 85.83% 85.85% +0.02%
Complexity 3232 3232
============================================
Files 344 344
Lines 12072 12076 +4
Branches 1087 1088 +1
============================================
+ Hits 10362 10368 +6
+ Misses 1185 1182 -3
- Partials 525 526 +1
... and 3 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
javeme
left a comment
There was a problem hiding this comment.
Thanks for your contribution~
computer-core/src/main/java/org/apache/hugegraph/computer/core/sender/MessageSendManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/sender/QueuedMessageSender.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/worker/WorkerService.java
Outdated
Show resolved
Hide resolved
|
Hi, @javeme , thanks for your review again, i've made some changes based on your comments, PTAL : ) AND
I've reverted this part to only catch InterruptedException with original log message. |
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
|
Hi, all, really thanks for @coderzc's suggestion, I've modified the implementation in a more elegant way. Like describe above in PR description:
PTAL : ) |
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/sender/MessageSender.java
Outdated
Show resolved
Hide resolved
Hi, @coderzc , during the process of sending vertices and edges, two control message sent from the same worker share the same public void finishSend(MessageType type, boolean sendControlMessage) {
Map<Integer, MessageSendPartition> all = this.buffers.all();
MessageStat stat = this.sortAndSendLastBuffer(all, type);
Set<Integer> workerIds = all.keySet().stream()
.map(this.partitioner::workerId)
.collect(Collectors.toSet());
if (sendControlMessage == true) {
this.sendControlMessageToWorkers(workerIds, MessageType.FINISH);
}
LOG.info("Finish sending message(type={},count={},bytes={})",
type, stat.messageCount(), stat.messageBytes());
} or we separate the call of |
@Radeity Do you mean to merge send vertex and send edge into the same session? |
Yes, can we do that? |
|
@coderzc it seems that a lot to change, maybe we should keep using |
I agree with this, then only add a |
computer-core/src/main/java/org/apache/hugegraph/computer/core/sender/MessageSendManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
| this.expectedFinishMessages); | ||
| this.finishMessagesFuture = new CompletableFuture<>(); | ||
| if (!this.finishMessageCount.compareAndSet(0, this.expectedFinishMessages)) { | ||
| throw new ComputerException("The origin count must be 0"); |
There was a problem hiding this comment.
The origin finishMessageCount ...
There was a problem hiding this comment.
UT will fail if using compareAndSet() here. Also, I find it doesn't have concurrency issue, so I modify to directly use set().
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/network/DataClientManager.java
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
| "finish-messages received in %s ms in superstep %s", | ||
| e, | ||
| this.expectedFinishMessages, | ||
| this.waitFinishMessagesTimeout, |
There was a problem hiding this comment.
prefer to catch timeout exception individually
There was a problem hiding this comment.
Do you mean we should catch them with different error message?
MessageRecvManager and DataClientManager for fast-fail
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
computer-core/src/main/java/org/apache/hugegraph/computer/core/receiver/MessageRecvManager.java
Outdated
Show resolved
Hide resolved
|
@Radeity Please fix the code style, you can import https://github.com/apache/incubator-hugegraph/blob/master/hugegraph-style.xml to IDEA |
Good reminder 🤔 Update: put the idea-config-doc here first Also, |
I use IDEA for remote development, which really annoy me that I can not import files. May I ask do we have other ways to make this format script effective? BTW, maybe we can consider to use |
Purpose of the PR
MessageRecvManagerandDataClientManagerFast-Fail #211Main Changes
This PR implements method
exceptionCaughtofMessageRecvManagerandDataClientManagerin different ways.ComputerJobwill get Fail status (according to default configurationAUTO_DESTROY_PODis true, the CR will be directly deleted).I find it's the most light weight implementation, IN FACT, it can be seen as a lazy-fail, rather than block as before.
Verifying these changes
Does this PR potentially affect the following parts?
Documentation Status
Doc - TODODoc - DoneDoc - NO Need