[Improvement] Task fast fail once blocks fail to send #332

Merged (7 commits) on Nov 24, 2022

Conversation

@zuston zuston (Member) commented Nov 17, 2022

What changes were proposed in this pull request?

[Improvement] Task fast fail once blocks fail to send

  1. With the single-replica mechanism, a single failed batch of data should make the task fail fast.
  2. When remaining block events in dataTransferPool are still waiting to be sent, we should abandon them (see the sketch after this list).
  3. More precisely, we need to interrupt send requests belonging to failed tasks (using the GrpcFuture to cancel them).
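
A minimal sketch of the fail-fast idea in points 1 and 2, assuming a hypothetical failedTaskIds set and dataTransferPool queue; the names and structure are illustrative, not the actual Uniffle classes:

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch only; names are assumptions, not the actual Uniffle code.
class FastFailSketch {
  // Task attempts whose blocks have already failed to send (single-replica case).
  private final Set<String> failedTaskIds = ConcurrentHashMap.newKeySet();
  // Pending block-send events waiting in the transfer pool.
  private final Queue<SendEvent> dataTransferPool = new ConcurrentLinkedQueue<>();

  // Point 1: a single failed batch marks the whole task as failed.
  void onSendFailure(String taskId) {
    failedTaskIds.add(taskId);
  }

  // Point 2: abandon queued events that belong to an already-failed task.
  void drainPool() {
    SendEvent event;
    while ((event = dataTransferPool.poll()) != null) {
      if (failedTaskIds.contains(event.taskId)) {
        continue; // abandon the event instead of sending it
      }
      send(event);
    }
  }

  private void send(SendEvent event) {
    // issue the RPC for this event
  }

  static class SendEvent {
    final String taskId;
    SendEvent(String taskId) {
      this.taskId = taskId;
    }
  }
}
```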

Why are the changes needed?

  1. When a shuffle server is down, the shuffle-write client in the current codebase will block and retry too many times. It should instead fail fast once some blocks fail to send.
  2. When using a custom retry policy in the RPC layer, like feat: support stateful upgrade of shuffle server #308, this PR solves the potential problem of waiting too long when a 1-minute retry time is specified.

After this patch, the failure time is limited to about 2 minutes; before, it could take 10+ minutes.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  1. UTs
  2. Real online tests

@zuston zuston requested a review from jerqi November 17, 2022 06:54
@codecov-commenter commented Nov 17, 2022

Codecov Report

Merging #332 (a6c4a38) into master (d09b40b) will increase coverage by 0.07%.
The diff coverage is 80.48%.

@@             Coverage Diff              @@
##             master     #332      +/-   ##
============================================
+ Coverage     58.17%   58.24%   +0.07%     
- Complexity     1529     1543      +14     
============================================
  Files           192      192              
  Lines         10606    10636      +30     
  Branches        924      931       +7     
============================================
+ Hits           6170     6195      +25     
- Misses         4068     4073       +5     
  Partials        368      368              
| Impacted Files | Coverage Δ |
| --- | --- |
| ...he/uniffle/client/impl/ShuffleWriteClientImpl.java | 34.28% <80.00%> (+4.14%) ⬆️ |
| ...g/apache/hadoop/mapred/SortWriteBufferManager.java | 80.10% <100.00%> (ø) |

@zuston zuston (Member Author) commented Nov 17, 2022

PTAL @jerqi

@jerqi jerqi (Contributor) commented Nov 17, 2022

Is it necessary that we use GrpcFuture to cancel the request?

@jerqi jerqi (Contributor) commented Nov 17, 2022

I have some concerns about whether isValidTaskId is a good way to implement this.

@zuston zuston (Member Author) commented Nov 17, 2022

> Is it necessary that we use GrpcFuture to cancel the request?

If the request is very slow or hangs, the gRPC future can cancel it; otherwise we have to wait.

> I have some concerns about whether isValidTaskId is a good way to implement this.

I have no other idea for implementing this feature. If we use Thread.interrupt, it won't abandon the events that are already in dataTransferPool.
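
A minimal sketch of the cancellation idea described above, assuming the async send returns a Guava ListenableFuture (as a gRPC future stub does); the tracker class and method names are hypothetical, not the actual Uniffle API:

```java
import com.google.common.util.concurrent.ListenableFuture;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: keep an in-flight send future per task and cancel it once
// the task is marked as failed, so a slow or hanging request is not waited out.
// (A real implementation would track every outstanding request of the task.)
class InFlightSendTracker<T> {
  private final Map<String, ListenableFuture<T>> inFlight = new ConcurrentHashMap<>();

  // Record the future right after the async send call is issued.
  void register(String taskId, ListenableFuture<T> future) {
    inFlight.put(taskId, future);
  }

  // Called for a task that has already failed: cancel its pending request.
  void cancelForFailedTask(String taskId) {
    ListenableFuture<T> future = inFlight.remove(taskId);
    if (future != null) {
      future.cancel(true); // interrupt the in-flight request instead of waiting
    }
  }
}
```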

@jerqi jerqi (Contributor) commented Nov 17, 2022

If we use Netty to transfer the data, will this feature bring difficulties to us?

@zuston zuston (Member Author) commented Nov 17, 2022

> If we use Netty to transfer the data, will this feature bring difficulties to us?

I think this is not a problem for Netty.

@zuston zuston (Member Author) commented Nov 17, 2022

I think we could shorten the thread sleep time in the GrpcFuture wait from 100ms to 10ms.
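
For context, the GrpcFuture wait mentioned here is presumably a polling loop roughly like the sketch below; the method name and the isValidTask check are assumptions for illustration:

```java
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.Uninterruptibles;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Illustrative polling pattern: check the future and the task's validity in a loop,
// sleeping briefly between checks.
class PollingWaitSketch {
  static <T> boolean waitUntilDone(ListenableFuture<T> future, BooleanSupplier isValidTask) {
    while (!future.isDone()) {
      if (!isValidTask.getAsBoolean()) {
        future.cancel(true); // the task has failed: stop waiting and cancel the request
        return false;
      }
      Uninterruptibles.sleepUninterruptibly(10, TimeUnit.MILLISECONDS);
    }
    return true;
  }
}
```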

@jerqi jerqi (Contributor) commented Nov 17, 2022

Is isValidTaskId an RPC failureCallback?

@zuston zuston (Member Author) commented Nov 17, 2022

> Is isValidTaskId an RPC failureCallback?

Sorry, I don't get your point.

@jerqi jerqi (Contributor) commented Nov 17, 2022

In RPC systems there is a concept called a failureCallback, which is used to process the failure result. You can refer to brpc:
https://brpc.apache.org/docs/client/basics/#use-newcallback
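
For reference, the failure-callback style jerqi mentions could look roughly like this with Guava futures (a hypothetical sketch, not code from this PR):

```java
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.MoreExecutors;

// Illustrative sketch: react to an RPC failure through a callback instead of polling.
class FailureCallbackSketch {
  static <T> void sendWithCallback(ListenableFuture<T> rpcFuture, Runnable onTaskFailure) {
    Futures.addCallback(rpcFuture, new FutureCallback<T>() {
      @Override
      public void onSuccess(T response) {
        // normal path: record the blocks as sent
      }

      @Override
      public void onFailure(Throwable t) {
        // failure path: mark the task as failed so it can fail fast
        onTaskFailure.run();
      }
    }, MoreExecutors.directExecutor());
  }
}
```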

@zuston zuston (Member Author) commented Nov 17, 2022

Got it. But in this case, we want to cancel the request rather than register a callback.

@zuston zuston (Member Author) commented Nov 18, 2022

PTAL @jerqi

@jerqi jerqi (Contributor) commented Nov 18, 2022

Wait a moment. I think this is a typical mechanism; I want to find out whether any other systems have a similar mechanism.

@zuston zuston (Member Author) commented Nov 18, 2022

> Wait a moment. I think this is a typical mechanism; I want to find out whether any other systems have a similar mechanism.

OK.

return false;
}

Uninterruptibles.sleepUninterruptibly(100, TimeUnit.MILLISECONDS);
Contributor:

Why is this 100?

Member Author:

Emm… it may be better to set this to 10ms, since some RPC response times are > 10ms.

return false;
}

Uninterruptibles.sleepUninterruptibly(10, TimeUnit.MILLISECONDS);
Contributor:

Should 10 be a configuration option?

Contributor:

I also have some concerns about performance.
Could we pass a timeout parameter when we call future.get to solve this problem?
Could we avoid sleeping on such a critical path?

Member Author:

> Could we pass a timeout parameter when we call future.get to solve this problem?
> Could we avoid sleeping on such a critical path?

Good idea!
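
A minimal sketch of the suggested change: replace the sleep-based polling with a blocking get that carries a timeout (the timeout handling here is an illustrative assumption):

```java
import com.google.common.util.concurrent.ListenableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch: wait for the RPC result with a bounded timeout
// instead of sleeping in a polling loop.
class TimedWaitSketch {
  static <T> T waitForResult(ListenableFuture<T> future, long timeoutMs)
      throws InterruptedException, ExecutionException, TimeoutException {
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // give up on a slow or hanging request
      throw e;
    }
  }
}
```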

@jerqi jerqi (Contributor) commented Nov 22, 2022

We'd better consider compatibility problems with Netty RPC.

@zuston zuston (Member Author) commented Nov 22, 2022

> We'd better consider compatibility problems with Netty RPC.

Got your point. Maybe we could remove the RPC retry from this PR and add it back after the Netty design is finished. What do you think? @jerqi

@jerqi jerqi (Contributor) commented Nov 23, 2022

> We'd better consider compatibility problems with Netty RPC.

> Got your point. Maybe we could remove the RPC retry from this PR and add it back after the Netty design is finished. What do you think? @jerqi

Netty may not have a GrpcFuture either.

@zuston zuston (Member Author) commented Nov 23, 2022

> We'd better consider compatibility problems with Netty RPC.

> Got your point. Maybe we could remove the RPC retry from this PR and add it back after the Netty design is finished. What do you think? @jerqi

> Netty may not have a GrpcFuture either.

I think you misunderstood me. In this PR I make three changes:

  1. With the single-replica mechanism, a single failed batch of data should make the task fail fast.
  2. When remaining block events in dataTransferPool are still waiting to be sent, we should abandon them.
  3. More precisely, we need to interrupt send requests belonging to failed tasks (using the GrpcFuture to cancel them).

So I could remove the 3rd change from this PR.

> Netty may not have a GrpcFuture either.

By the way, this RPC retry relies on the dedicated RPC implementation, so the GrpcFuture is only for gRPC rather than bare Netty, and I won't use it for compatibility with bare Netty.

@jerqi jerqi (Contributor) commented Nov 23, 2022

> So I could remove the 3rd change from this PR. [...] This RPC retry relies on the dedicated RPC implementation, so the GrpcFuture is only for gRPC rather than bare Netty, and I won't use it for compatibility with bare Netty.

Ok.

@jerqi jerqi (Contributor) left a comment

LGTM, thanks @zuston

@jerqi jerqi merged commit e127253 into apache:master Nov 24, 2022