HDDS-5216. Fix race condition causing SCM failOverProxy which is causing failover wrongly. #2247
Conversation
bshashikant
left a comment
Looks good. @bharatviswa504, can you please add some comments on why leaderId is made volatile and is being cached, and why the performFailover change is incorporated?
Thank You @bshashikant for the review.
dineshchitlangia
left a comment
Overall LGTM, minor suggestions inline regarding the comments.
Thank you @bharatviswa504 for the improvement.
...rk/src/main/java/org/apache/hadoop/hdds/scm/proxy/SCMBlockLocationFailoverProxyProvider.java
...rc/main/java/org/apache/hadoop/hdds/scm/proxy/SCMContainerLocationFailoverProxyProvider.java
...src/main/java/org/apache/hadoop/hdds/scm/proxy/SCMSecurityProtocolFailoverProxyProvider.java
Thank You @dineshchitlangia for the review. I have incorporated your comments.
Force-pushed db813a9 to 85d305a
Thank You @bshashikant and @dineshchitlangia for the review. @dineshchitlangia, I have incorporated your suggestions on the code comments, so I went ahead and committed this.
…ing failover wrongly. (apache#2247) Change-Id: I6993103b0a931d6fd44343248ffad45db5aebc99
What changes were proposed in this pull request?
In OzoneManager, the SCM client is shared across RpcHandler threads. We have observed that failover performed concurrently by multiple threads causes failover to happen incorrectly on the same SCM, exhausting the retry count.
Another thing observed: the error is "no route to scm3", but the retry happened on scm1/172.31.0.9:9863.
This is because our performFailover is a no-op; when failover is needed, we update currentSCMProxyNodeID inside shouldRetry of the RetryPolicy instead.
For example: two threads contact scm3 and both get NoRouteToHostException. shouldRetry in the first thread moves currentSCMProxyNodeID to scm1, and then the second thread moves currentSCMProxyNodeID again, to scm2.
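The buggy interleaving above can be sketched with a hypothetical provider (class and method names are illustrative, not the actual SCMBlockLocationFailoverProxyProvider code): because the node ID is advanced inside shouldRetry, every thread that failed on the same node advances it again, so a single failure of scm3 burns retries on both scm1 and scm2.

```java
// Hypothetical model of the pre-fix behavior; not the real Ozone classes.
class BuggyProvider {
    private final String[] nodes = {"scm1", "scm2", "scm3"};
    // Both threads have just failed against scm3 (index 2).
    private volatile int currentIndex = 2;

    // Bug: called from shouldRetry, so every failed thread advances the
    // shared node ID, even when another thread already failed over.
    void onFailureInShouldRetry() {
        currentIndex = (currentIndex + 1) % nodes.length;
    }

    String currentNode() {
        return nodes[currentIndex];
    }
}
```

With two threads failing on scm3, the first call moves the provider to scm1 and the second immediately moves it to scm2, so the retry that should have gone to scm1 is wasted.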
Hadoop's RetryInvocationHandler already takes care of this: if another thread is already performing failover, it will not call performFailover again. Instead we see the WARN message below, and the thread picks up the current proxy and contacts that node.
om3_1 | 2021-05-14 05:04:28,699 [IPC Server handler 34 on default port 9862] WARN retry.RetryInvocationHandler: A failover has occurred since the start of call #24329 $Proxy32.send over nodeId=scm3,nodeAddress=scm3/192.168.0.6:9863
The solution here is to update the SCM node ID in performFailover, instead of updating currentSCMProxyNodeID in shouldRetry.
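A minimal model of why moving the update into performFailover fixes the race, assuming the guard that Hadoop's RetryInvocationHandler provides (a failover counter captured when the call starts; the class and method names here are illustrative, not the real Hadoop API):

```java
// Hypothetical sketch of serialized failover; not the real Hadoop classes.
class FailoverCoordinator {
    private final String[] nodes;
    private int currentIndex = 0;
    private long failoverCount = 0; // guards against duplicate failovers

    FailoverCoordinator(String... nodes) {
        this.nodes = nodes;
    }

    synchronized long getFailoverCount() {
        return failoverCount;
    }

    synchronized String currentNode() {
        return nodes[currentIndex];
    }

    // A retrying thread passes the failover count it saw when its call
    // started. Only the first thread to report a failure actually fails
    // over; later threads see a newer count and simply re-read the proxy.
    synchronized void failoverIfNeeded(long expectedCount) {
        if (failoverCount == expectedCount) {
            currentIndex = (currentIndex + 1) % nodes.length;
            failoverCount++;
        }
    }
}
```

Two threads that both fail against the same node now advance the shared node ID exactly once, instead of once per thread.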
Also made a few more changes to make the proxy-creation logic common across classes. Opened HDDS-5227 to avoid the remaining duplication.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5216
How was this patch tested?
Tested locally; observed that it no longer performs failover repeatedly and exhausts the retry count.