[Improvement] Task fast fail once blocks fail to send #332

Merged (7 commits) on Nov 24, 2022

Conversation

@zuston zuston (Member) commented Nov 17, 2022

What changes were proposed in this pull request?

[Improvement] Task fast fail once blocks fail to send

  1. With the single-replica mechanism, a single failed batch of data should make the task fail fast.
  2. When remaining block events in dataTransferPool are still waiting to be sent, we should abandon them (see the sketch after this list).
  3. More precisely, we need to interrupt send requests belonging to failed tasks (using the GrpcFuture to cancel them).
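
A minimal sketch of the fail-fast idea in points 1 and 2, assuming a hypothetical failedTaskIds set and dataTransferPool queue; the names and structure are illustrative, not the actual Uniffle classes:

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch only; names are assumptions, not the actual Uniffle code.
class FastFailSketch {
  // Task attempts whose blocks have already failed to send (single-replica case).
  private final Set<String> failedTaskIds = ConcurrentHashMap.newKeySet();
  // Pending block-send events waiting in the transfer pool.
  private final Queue<SendEvent> dataTransferPool = new ConcurrentLinkedQueue<>();

  // Point 1: a single failed batch marks the whole task as failed.
  void onSendFailure(String taskId) {
    failedTaskIds.add(taskId);
  }

  // Point 2: abandon queued events that belong to an already-failed task.
  void drainPool() {
    SendEvent event;
    while ((event = dataTransferPool.poll()) != null) {
      if (failedTaskIds.contains(event.taskId)) {
        continue; // abandon the event instead of sending it
      }
      send(event);
    }
  }

  private void send(SendEvent event) {
    // issue the RPC for this event
  }

  static class SendEvent {
    final String taskId;
    SendEvent(String taskId) {
      this.taskId = taskId;
    }
  }
}
```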

Why are the changes needed?

  1. When a shuffle server is down, the shuffle-write client in the current codebase will block and retry too many times. It should instead fail fast once some blocks fail to send.
  2. When using a custom retry policy in the RPC layer, like feat: support stateful upgrade of shuffle server #308, this PR solves the potential problem of waiting too long when a 1-minute retry time is specified.

After this patch, the failure time is limited to about 2 minutes; before, it could take 10+ minutes.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  1. UTs
  2. Real online tests

@zuston zuston requested a review from jerqi November 17, 2022 06:54
@codecov-commenter commented Nov 17, 2022

Codecov Report

Merging #332 (a6c4a38) into master (d09b40b) will increase coverage by 0.07%.
The diff coverage is 80.48%.

@@             Coverage Diff              @@
##             master     #332      +/-   ##
============================================
+ Coverage     58.17%   58.24%   +0.07%     
- Complexity     1529     1543      +14     
============================================
  Files           192      192              
  Lines         10606    10636      +30     
  Branches        924      931       +7     
============================================
+ Hits           6170     6195      +25     
- Misses         4068     4073       +5     
  Partials        368      368              
| Impacted Files | Coverage Δ |
| --- | --- |
| ...he/uniffle/client/impl/ShuffleWriteClientImpl.java | 34.28% <80.00%> (+4.14%) ⬆️ |
| ...g/apache/hadoop/mapred/SortWriteBufferManager.java | 80.10% <100.00%> (ø) |

@zuston zuston (Member Author) commented Nov 17, 2022

PTAL @jerqi

@jerqi jerqi (Contributor) commented Nov 17, 2022

Is it necessary that we use GrpcFuture to cancel the request?

@jerqi jerqi (Contributor) commented Nov 17, 2022

I have some concerns about whether isValidTaskId is a good way to implement this.

@zuston zuston (Member Author) commented Nov 17, 2022

> Is it necessary that we use GrpcFuture to cancel the request?

If the request is very slow or hangs, the gRPC future can cancel it; otherwise we have to wait.

> I have some concerns about whether isValidTaskId is a good way to implement this.

I have no other idea for implementing this feature. If we use Thread.interrupt, it won't abandon the events that are already in dataTransferPool.
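
A minimal sketch of the cancellation idea described above, assuming the async send returns a Guava ListenableFuture (as a gRPC future stub does); the tracker class and method names are hypothetical, not the actual Uniffle API:

```java
import com.google.common.util.concurrent.ListenableFuture;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: keep an in-flight send future per task and cancel it once
// the task is marked as failed, so a slow or hanging request is not waited out.
// (A real implementation would track every outstanding request of the task.)
class InFlightSendTracker<T> {
  private final Map<String, ListenableFuture<T>> inFlight = new ConcurrentHashMap<>();

  // Record the future right after the async send call is issued.
  void register(String taskId, ListenableFuture<T> future) {
    inFlight.put(taskId, future);
  }

  // Called for a task that has already failed: cancel its pending request.
  void cancelForFailedTask(String taskId) {
    ListenableFuture<T> future = inFlight.remove(taskId);
    if (future != null) {
      future.cancel(true); // interrupt the in-flight request instead of waiting
    }
  }
}
```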

@jerqi jerqi (Contributor) commented Nov 17, 2022

If we use Netty to transfer the data, will this feature bring difficulties to us?

@zuston zuston (Member Author) commented Nov 17, 2022

> If we use Netty to transfer the data, will this feature bring difficulties to us?

I think this is not a problem for Netty.

@zuston zuston (Member Author) commented Nov 17, 2022

I think we could shorten the thread sleep time in the GrpcFuture wait from 100ms to 10ms.
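
For context, the GrpcFuture wait mentioned here is presumably a polling loop roughly like the sketch below; the method name and the isValidTask check are assumptions for illustration:

```java
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.Uninterruptibles;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Illustrative polling pattern: check the future and the task's validity in a loop,
// sleeping briefly between checks.
class PollingWaitSketch {
  static <T> boolean waitUntilDone(ListenableFuture<T> future, BooleanSupplier isValidTask) {
    while (!future.isDone()) {
      if (!isValidTask.getAsBoolean()) {
        future.cancel(true); // the task has failed: stop waiting and cancel the request
        return false;
      }
      Uninterruptibles.sleepUninterruptibly(10, TimeUnit.MILLISECONDS);
    }
    return true;
  }
}
```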

@jerqi jerqi (Contributor) commented Nov 17, 2022

Is isValidTaskId an RPC failureCallback?

@zuston zuston (Member Author) commented Nov 17, 2022

> Is isValidTaskId an RPC failureCallback?

Sorry, I don't get your point.

@jerqi jerqi (Contributor) commented Nov 17, 2022

In RPC systems there is a concept called a failureCallback, which is used to process the failure result. You can refer to brpc:
https://brpc.apache.org/docs/client/basics/#use-newcallback
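
For reference, the failure-callback style jerqi mentions could look roughly like this with Guava futures (a hypothetical sketch, not code from this PR):

```java
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.MoreExecutors;

// Illustrative sketch: react to an RPC failure through a callback instead of polling.
class FailureCallbackSketch {
  static <T> void sendWithCallback(ListenableFuture<T> rpcFuture, Runnable onTaskFailure) {
    Futures.addCallback(rpcFuture, new FutureCallback<T>() {
      @Override
      public void onSuccess(T response) {
        // normal path: record the blocks as sent
      }

      @Override
      public void onFailure(Throwable t) {
        // failure path: mark the task as failed so it can fail fast
        onTaskFailure.run();
      }
    }, MoreExecutors.directExecutor());
  }
}
```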

@zuston zuston (Member Author) commented Nov 17, 2022

Got it. But in this case, we want to cancel the request rather than register a callback.

@zuston zuston (Member Author) commented Nov 18, 2022

PTAL @jerqi

@jerqi jerqi (Contributor) commented Nov 18, 2022

Wait a moment. I think this is a typical mechanism; I want to find out whether any other systems have a similar mechanism.

@zuston zuston (Member Author) commented Nov 18, 2022

> Wait a moment. I think this is a typical mechanism; I want to find out whether any other systems have a similar mechanism.

OK.

return false;
}

Uninterruptibles.sleepUninterruptibly(100, TimeUnit.MILLISECONDS);
Contributor:

Why is this 100?

Member Author:

Emm… it may be better to set this to 10ms, since some RPC response times are > 10ms.

return false;
}

Uninterruptibles.sleepUninterruptibly(10, TimeUnit.MILLISECONDS);
Contributor:

Should 10 be a configuration option?

Contributor:

I also have some concerns about performance.
Could we pass a timeout parameter when we call future.get to solve this problem?
Could we avoid sleeping on such a critical path?

Member Author:

> Could we pass a timeout parameter when we call future.get to solve this problem?
> Could we avoid sleeping on such a critical path?

Good idea!
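
A minimal sketch of the suggested change: replace the sleep-based polling with a blocking get that carries a timeout (the timeout handling here is an illustrative assumption):

```java
import com.google.common.util.concurrent.ListenableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch: wait for the RPC result with a bounded timeout
// instead of sleeping in a polling loop.
class TimedWaitSketch {
  static <T> T waitForResult(ListenableFuture<T> future, long timeoutMs)
      throws InterruptedException, ExecutionException, TimeoutException {
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // give up on a slow or hanging request
      throw e;
    }
  }
}
```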

@jerqi jerqi (Contributor) commented Nov 22, 2022

We'd better consider compatibility problems with Netty RPC.

@zuston zuston (Member Author) commented Nov 22, 2022

> We'd better consider compatibility problems with Netty RPC.

Got your point. Maybe we could remove the RPC retry from this PR and add it back after the Netty design is finished. What do you think? @jerqi

@jerqi jerqi (Contributor) commented Nov 23, 2022

> We'd better consider compatibility problems with Netty RPC.

> Got your point. Maybe we could remove the RPC retry from this PR and add it back after the Netty design is finished. What do you think? @jerqi

Netty may not have a GrpcFuture either.

@zuston zuston (Member Author) commented Nov 23, 2022

> We'd better consider compatibility problems with Netty RPC.

> Got your point. Maybe we could remove the RPC retry from this PR and add it back after the Netty design is finished. What do you think? @jerqi

> Netty may not have a GrpcFuture either.

I think you misunderstood me. In this PR I make three changes:

  1. With the single-replica mechanism, a single failed batch of data should make the task fail fast.
  2. When remaining block events in dataTransferPool are still waiting to be sent, we should abandon them.
  3. More precisely, we need to interrupt send requests belonging to failed tasks (using the GrpcFuture to cancel them).

So I could remove the 3rd change from this PR.

> Netty may not have a GrpcFuture either.

By the way, this RPC retry relies on the dedicated RPC implementation, so the GrpcFuture is only for gRPC rather than bare Netty, and I won't use it for compatibility with bare Netty.

@jerqi jerqi (Contributor) commented Nov 23, 2022

> So I could remove the 3rd change from this PR. [...] This RPC retry relies on the dedicated RPC implementation, so the GrpcFuture is only for gRPC rather than bare Netty, and I won't use it for compatibility with bare Netty.

Ok.

@jerqi jerqi (Contributor) left a comment

LGTM, thanks @zuston

@jerqi jerqi merged commit e127253 into apache:master Nov 24, 2022