HDDS-11291. Fix timeout logic in XceiverServerRatis.submitRequest#7046

Closed
symious wants to merge 4 commits into apache:master from symious:HDDS-11291

symious commented Aug 7, 2024

What changes were proposed in this pull request?

We hit the following issue: the Datanode command handler was executing a close container request, but the timeout logic is not correct, so the handler thread hung and blocked all subsequent commands from SCM.

The jstack output for the blocked thread is as follows:

"Command processor thread" #215 daemon prio=5 os_prio=0 tid=0x00007fcef3262000 nid=0xa56 waiting on condition [0x00007fcf63f9d000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007fd4ab6dcd38> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
        at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
        at org.apache.ratis.server.impl.RaftServerImpl.executeSubmitClientRequestAsync(RaftServerImpl.java:816)
        at org.apache.ratis.server.impl.RaftServerProxy.lambda$submitClientRequestAsync$7(RaftServerProxy.java:436)
        at org.apache.ratis.server.impl.RaftServerProxy$$Lambda$827/1961332062.apply(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
        at org.apache.ratis.server.impl.RaftServerProxy.submitClientRequestAsync(RaftServerProxy.java:436)
        at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.submitRequest(XceiverServerRatis.java:611)
        at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CloseContainerCommandHandler.handle(CloseContainerCommandHandler.java:105)
        at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:103)
        at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$3(DatanodeStateMachine.java:593)
        at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine$$Lambda$270/1788388131.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:748) 

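The direct cause is that the timeout logic never takes effect: in Ratis the executeSubmitClientRequestAsync is a join() operation, and it will block the timeout on the outer CompletableFuture. In other words, as the stack trace shows, the calling thread parks inside join() before submitClientRequestAsync returns, so the timeout applied to the returned future is never reached.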
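To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the Ozone or Ratis code: `BlockingSubmitSketch`, `submitAsync`, and the latch are hypothetical stand-ins, and it assumes the caller applies its timeout with `get(timeout, unit)` on the returned future. It only shows that such a timeout cannot fire while the call that produces the future is itself blocked in a `join()`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockingSubmitSketch {

  /** Hypothetical stand-in for a submit call that joins internally before returning its future. */
  static CompletableFuture<String> submitAsync(CountDownLatch submissionGate) {
    // Blocks the calling thread until the "submission" finishes, like a join() in the submit path.
    CompletableFuture.runAsync(() -> awaitQuietly(submissionGate)).join();
    return CompletableFuture.completedFuture("reply");
  }

  static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Returns true if the caller is still stuck inside submitAsync after waitMillis. */
  static boolean callerStillBlocked(long timeoutMillis, long waitMillis)
      throws InterruptedException {
    CountDownLatch gate = new CountDownLatch(1);
    Thread caller = new Thread(() -> {
      try {
        // The timeout clock only starts once submitAsync has returned a future.
        submitAsync(gate).get(timeoutMillis, TimeUnit.MILLISECONDS);
      } catch (InterruptedException | ExecutionException | TimeoutException e) {
        // Not reached while submitAsync itself is blocked.
      }
    });
    caller.setDaemon(true);
    caller.start();
    caller.join(waitMillis);            // wait well past the intended timeout
    boolean blocked = caller.isAlive();
    gate.countDown();                   // release the submission so the thread can finish
    return blocked;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("caller still blocked after timeout elapsed: "
        + callerStillBlocked(100, 500));
  }
}
```

With a 100 ms timeout and a 500 ms wait, the caller thread should still be alive, because it never gets as far as the `get(...)` call that carries the timeout.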

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11291

How was this patch tested?

(Please explain how this patch was tested. Ex: unit tests, manual tests, workflow run on the fork git repo.)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this.)

@symious symious requested review from adoroszlai and szetszwo August 7, 2024 23:50

szetszwo commented Aug 8, 2024

... in Ratis the executeSubmitClientRequestAsync is a join() operation, and it will block the timeout on the outer CompletableFuture.

The join() waits for the request submission, not for the completion of the returned future, so it is correct. It seems that the request submission did not go through in the Datanode. We should check why ContainerStateMachine.startTransaction(RaftClientRequest request) was blocked. Are you able to reproduce the problem?

For Ratis, I will see if there is a better way to eliminate the join().
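As a rough illustration of what eliminating the join() could mean in general, the sketch below chains the reply onto the submission with `thenCompose` instead of joining on it. This is not Ratis code and not a proposed patch: `NonBlockingSubmitSketch`, `submitAsync`, and the never-completed `stuckSubmission` future are hypothetical names used only to show that, once the call returns without blocking, a timeout on the outer future can fire even when the submission is stuck.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class NonBlockingSubmitSketch {

  /** Hypothetical: chain the reply after the submission rather than joining on the submission. */
  static CompletableFuture<String> submitAsync(CompletableFuture<Void> submission) {
    // Returns immediately; the reply stage runs only after the submission completes.
    return submission.thenCompose(ignored -> CompletableFuture.completedFuture("reply"));
  }

  /** Returns true if get(timeout) timed out while the submission was still pending. */
  static boolean timesOut(long timeoutMillis) throws InterruptedException, ExecutionException {
    // Simulates a submission that never goes through.
    CompletableFuture<Void> stuckSubmission = new CompletableFuture<>();
    try {
      submitAsync(stuckSubmission).get(timeoutMillis, TimeUnit.MILLISECONDS);
      return false;
    } catch (TimeoutException e) {
      return true;
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("timed out on stuck submission: " + timesOut(100));
  }
}
```

Note that this only makes the caller's timeout effective; it does not explain why the submission is stuck, which is the question raised above about ContainerStateMachine.startTransaction.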

symious commented Aug 10, 2024

We should check why ContainerStateMachine.startTransaction(RaftClientRequest request) was blocked. Are you able to reproduce the problem?

I am not sure whether it can be reproduced, but we suspect it is related to a peer being removed while the leader is closing a container.

@adoroszlai adoroszlai marked this pull request as draft October 18, 2024 05:36

adoroszlai commented

@symious Based on @szetszwo's comment, I think the root cause needs further investigation. Can this be closed?

@adoroszlai adoroszlai closed this Dec 4, 2024