HDDS-3291. Write operation when both OM followers are shutdown. #733
What changes were proposed in this pull request?
Add an IPC client timeout so that the client fails with a socket timeout exception when two OM nodes are down, instead of hanging.
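To illustrate the behavior this change relies on, here is a minimal sketch using plain `java.net` sockets rather than the Hadoop IPC client. The class and method names are hypothetical, and the silent local server only stands in for an OM that accepts the connection but never answers. With a read timeout set, the blocked read ends in a `SocketTimeoutException`; with a timeout of 0 it would wait indefinitely.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; not part of Ozone or Hadoop.
public class ReadTimeoutSketch {

    /**
     * Connects to a local server that accepts the connection but never
     * replies, and reports whether the read was cut short by the timeout.
     */
    public static boolean readTimesOut(int timeoutMillis) throws IOException {
        // Port 0 lets the OS pick a free port on the loopback interface.
        try (ServerSocket silentServer =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(),
                                        silentServer.getLocalPort());
             Socket accepted = silentServer.accept()) {
            // A read timeout of 0 would mean "block forever".
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // the server never writes, so this blocks
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the read gave up after timeoutMillis
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

In the real client the exception surfaces wrapped in a `ServiceException`, and the retry handler then decides whether to fail over to another OM.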
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-3291
How was this patch tested?
Tested in a docker-compose cluster with the timeout set to 1 minute; the request eventually failed instead of hanging.
To reproduce, the leader.election.time.out value must also be set to a large value: the request has to be submitted to Ratis, and the issue appears only while the Ratis server keeps retrying.
2020-03-27 16:25:27,625 [main] INFO RetryInvocationHandler:411 - com.google.protobuf.ServiceException: java.net.SocketTimeoutException: Call From c5263a1df1ad/172.22.0.3 to om2:9862 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.22.0.3:56460 remote=om2/172.22.0.4:9862]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout, while invoking $Proxy19.submitRequest over nodeId=om2,nodeAddress=om2:9862 after 15 failover attempts. Trying to failover immediately.
2020-03-27 16:25:27,626 [main] ERROR OMFailoverProxyProvider:285 - Failed to connect to OMs: [nodeId=om1,nodeAddress=om1:9862, nodeId=om3,nodeAddress=om3:9862, nodeId=om2,nodeAddress=om2:9862]. Attempted 15 failovers.
For now, the changed config affects all RPC clients in the system, because this patch reuses the existing IPC client config. If a separate timeout is needed for a particular RPC client in the future, a new config can be added.
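If a dedicated timeout is added later, one possible shape is a lookup that prefers the client-specific key and falls back to the shared one. The sketch below uses `java.util.Properties` as a stand-in for the Hadoop configuration object. It assumes the shared setting is Hadoop's `ipc.client.rpc-timeout.ms`, which this description does not name, and the dedicated key `example.om.client.rpc-timeout.ms` is invented purely for illustration.

```java
import java.util.Properties;

// Hypothetical helper; not part of Ozone or Hadoop.
public class TimeoutConfigSketch {

    // Hadoop's shared IPC client timeout key (assumed to be the config
    // this patch reuses; the PR text does not name it).
    static final String SHARED_KEY = "ipc.client.rpc-timeout.ms";

    // Invented key, for illustration only.
    static final String DEDICATED_KEY = "example.om.client.rpc-timeout.ms";

    /** Returns the dedicated timeout if set, else the shared one, else the default. */
    public static int resolveTimeoutMillis(Properties conf, int defaultMillis) {
        String value = conf.getProperty(DEDICATED_KEY);
        if (value == null) {
            value = conf.getProperty(SHARED_KEY);
        }
        return value == null ? defaultMillis : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(SHARED_KEY, "60000");
        System.out.println(resolveTimeoutMillis(conf, 0));
    }
}
```

With this arrangement, existing deployments that only set the shared key keep their current behavior, while a single client can be tuned without touching the others.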