[Improvement] Task fast fail once blocks fail to send #332
Codecov Report
@@ Coverage Diff @@
## master #332 +/- ##
============================================
+ Coverage 58.17% 58.24% +0.07%
- Complexity 1529 1543 +14
============================================
Files 192 192
Lines 10606 10636 +30
Branches 924 931 +7
============================================
+ Hits 6170 6195 +25
- Misses 4068 4073 +5
Partials 368 368
PTAL @jerqi
Is it necessary that we use
I have some concern whether the
If the request is very slow or hangs, the gRPC future can cancel the request; otherwise we have to wait.
I have no other idea to implement this feature. If using the
If we use Netty to transfer the data, will this feature bring difficulties to us?
I think this is not a problem for Netty.
I think we could shorten the thread sleep time from 100ms to 10ms in the GrpcFuture wait
Is
Sorry, I don’t get your thought
In an RPC system, there is a concept called
Got it. But in this case, we want to cancel the request instead of using a callback
PTAL @jerqi
Wait a moment. I think this is a typical mechanism. I want to find out whether any other systems have a similar mechanism.
OK.
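To make the cancellation idea in this thread concrete, here is a minimal sketch of failing fast: several sends run in parallel, and as soon as one reports failure the still-pending ones are cancelled instead of being waited on. It uses only `java.util.concurrent` and is not the code in this PR; `FastFailSend`, `sendAllOrFailFast`, and the simulated sends are hypothetical names for illustration, and the real client works with its own `GrpcFuture` rather than plain `Future`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FastFailSend {

  // Hypothetical helper (not part of Uniffle): returns true only if every send
  // succeeded. On the first failed send, the remaining sends are cancelled.
  static boolean sendAllOrFailFast(ExecutorService pool, List<Callable<Boolean>> sends)
      throws InterruptedException {
    ExecutorCompletionService<Boolean> ecs = new ExecutorCompletionService<>(pool);
    List<Future<Boolean>> futures = new ArrayList<>();
    for (Callable<Boolean> send : sends) {
      futures.add(ecs.submit(send));
    }
    try {
      for (int i = 0; i < futures.size(); i++) {
        // take() blocks until the next send finishes, in completion order.
        Future<Boolean> finished = ecs.take();
        if (!Boolean.TRUE.equals(finished.get())) {
          return false; // a send reported failure: stop waiting for the rest
        }
      }
      return true;
    } catch (ExecutionException e) {
      return false; // a send threw: treat it as a failure
    } finally {
      for (Future<Boolean> f : futures) {
        f.cancel(true); // no effect on finished sends; interrupts pending ones
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    List<Callable<Boolean>> sends = new ArrayList<>();
    sends.add(() -> false); // simulated send that fails immediately
    sends.add(() -> {       // simulated send that hangs
      Thread.sleep(60_000);
      return true;
    });
    long start = System.nanoTime();
    boolean ok = sendAllOrFailFast(pool, sends);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("success=" + ok + ", elapsed ~" + elapsedMs + " ms");
    pool.shutdownNow();
  }
}
```

The sketch waits in completion order rather than submission order, so a failure is noticed even when an earlier send is still hanging.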
client-spark/spark2/src/main/java/org/apache/spark/shuffle/RssShuffleManager.java
client/src/main/java/org/apache/uniffle/client/impl/ShuffleWriteClientImpl.java
client-spark/spark3/src/main/java/org/apache/spark/shuffle/RssShuffleManager.java
    return false;
  }

  Uninterruptibles.sleepUninterruptibly(100, TimeUnit.MILLISECONDS);
Why is this 100?
Emm… It may be better to set it to 10ms, since some RPC response times are >10ms
client-spark/spark3/src/main/java/org/apache/spark/shuffle/writer/RssShuffleWriter.java
client/src/main/java/org/apache/uniffle/client/impl/ShuffleWriteClientImpl.java
7668df7 to 38fc84e
38fc84e to a76df6d
client-spark/spark2/src/main/java/org/apache/spark/shuffle/writer/RssShuffleWriter.java
client/src/main/java/org/apache/uniffle/client/impl/ShuffleWriteClientImpl.java
client/src/main/java/org/apache/uniffle/client/impl/ShuffleWriteClientImpl.java
client/src/test/java/org/apache/uniffle/client/impl/ShuffleWriteClientImplTest.java
    return false;
  }

  Uninterruptibles.sleepUninterruptibly(10, TimeUnit.MILLISECONDS);
Should 10 be a configuration option?
I also have some concerns about performance.
Could we pass a timeout parameter when we call future.get to solve this problem?
Could we avoid sleeping on such a critical path?
Could we pass a timeout parameter when we call the future.get to solve this problem?
Could we avoid sleeping in such critical path?
Good idea!
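As an illustration of the suggestion above, the following sketch replaces a sleep-and-poll loop with a single timed `Future.get(timeout, unit)` call and cancels the request if the timeout elapses. It relies only on the standard `java.util.concurrent.Future` API; `TimedWait` and `awaitOrCancel` are hypothetical names, and whether the project's `GrpcFuture` exposes the same timed `get` is an assumption here.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedWait {

  // Hypothetical helper: blocks for at most timeoutMs waiting for the result.
  // On timeout the future is cancelled, so the caller never sleeps in a loop.
  static boolean awaitOrCancel(Future<Boolean> future, long timeoutMs) {
    try {
      return Boolean.TRUE.equals(future.get(timeoutMs, TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      future.cancel(true); // request is too slow or hung: give up on it
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      future.cancel(true);
      return false;
    } catch (ExecutionException | CancellationException e) {
      return false; // the request failed or was cancelled elsewhere
    }
  }

  public static void main(String[] args) {
    CompletableFuture<Boolean> done = CompletableFuture.completedFuture(true);
    CompletableFuture<Boolean> hung = new CompletableFuture<>(); // never completes
    System.out.println(awaitOrCancel(done, 50)); // true
    System.out.println(awaitOrCancel(hung, 50)); // false
    System.out.println(hung.isCancelled());      // true
  }
}
```

With a timed `get`, a completed request returns immediately, so there is no fixed 10ms or 100ms granularity to tune.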
a125d76 to 10cf623
We'd better consider compatibility problems with Netty RPC.
Got your point. Maybe we could remove the RPC retry in this PR and add it after the Netty design is finished. What do you think? @jerqi
Netty may not have GrpcFuture, either.
I think you misunderstood my thought. In this PR, I made three changes
So I could remove the 3rd change in this PR.
By the way, this RPC retry relies on the dedicated RPC implementation. So the
Ok.
6cd5543 to 975d187
975d187 to ee0650f
LGTM, thanks @zuston
What changes were proposed in this pull request?
[Improvement] Task fast fail once blocks fail to send
GrpcFuture to cancel it)
Why are the changes needed?
After this patch, the time to fail is limited to 2 minutes. Before, it could take 10+ minutes.
Does this PR introduce any user-facing change?
No
How was this patch tested?