Skip to content

HDDS-5216. Fix race condition causing SCM failOverProxy which is causing failover wrongly.#2247

Merged
bharatviswa504 merged 8 commits intoapache:masterfrom
bharatviswa504:HDDS-5216-2
May 17, 2021
Merged

HDDS-5216. Fix race condition causing SCM failOverProxy which is causing failover wrongly.#2247
bharatviswa504 merged 8 commits intoapache:masterfrom
bharatviswa504:HDDS-5216-2

Conversation

@bharatviswa504
Copy link
Contributor

@bharatviswa504 bharatviswa504 commented May 14, 2021

What changes were proposed in this pull request?

In OzoneManager, SCM client is shared across RpcHandler threads.
Where we have observed that failOver across multiple threads causing failover to happen incorrectly on same SCM and is exhausting retry count.

And also one thing I have observed is

If we observe the error is no route to scm3, but retry happened on scm1/172.31.0.9:9863

2021-05-11 05:59:53,202 [IPC Server handler 10 on default port 9862] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.NoRouteToHostException: No Route to Host from  om1/172.31.0.11 to scm3:9863 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost, while invoking $Proxy32.send over nodeId=scm1,nodeAddress=scm1/172.31.0.9:9863 after 9 failover attempts. Trying to failover after sleeping for 2000ms.
If we observe the error is no route to scm3, but retry happened on scm1/172.31.0.9:9863

If we observe the error is no route to scm3, but retry happened on scm1/172.31.0.9:9863

2021-05-11 05:59:59,345 [IPC Server handler 10 on default port 9862] WARN ipc.Client: Address change detected. Old: scm3/172.31.0.5:9863 New: scm3:9863
2021-05-11 05:59:59,347 [IPC Server handler 10 on default port 9862] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.NoRouteToHostException: No Route to Host from  om1/172.31.0.11 to scm3:9863 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost, while invoking $Proxy32.send over nodeId=scm2,nodeAddress=scm2/172.31.0.6:9863 after 10 failover attempts. Trying to failover after sleeping for 2000ms.

This is because our performFailOver is a no-op and if failOver is needed we update currentSCMProxyNodeID in shouldRetry in RetryPolicy.

For example
2 Threads contacted SCM3, and got NoRouteToHostException, so shouldRetry from first thread will move the currentSCMProxyNodeID to scm1 and other thread, after this move currentSCMProxyNodeID to scm2.

Hadoop Proxy RetryInvocationHandler already takes care of if there is another thread trying to perform failOver it will not call performFailOver again. We shall see below WARN message, and get the currentProxy and contact that node.

om3_1 | 2021-05-14 05:04:28,699 [IPC Server handler 34 on default port 9862] WARN retry.RetryInvocationHandler: A failover has occurred since the start of call #24329 $Proxy32.send over nodeId=scm3,nodeAddress=scm3/192.168.0.6:9863

Solution here is to use performFailOver to update scmNodeID instead of using shouldRetry to update currentSCMProxyNodeID.

And also made a few more changes to make the logic common across classes for proxy creation.

Opened a Jira to avoid duplication HDDS-5227

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5216

How was this patch tested?

Tested locally, and now observed that it will not perform failOver again and exhausting retry counts.

@bharatviswa504 bharatviswa504 changed the title Hdds 5216: Fix race condition causing failOverProxy which is causing failover wrongly. Hdds 5216: Fix race condition causing SCM failOverProxy which is causing failover wrongly. May 14, 2021
Copy link
Contributor

@bshashikant bshashikant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. @bharatviswa504 , can you please add some comments on why the laederId is made volatile and being cached and why performFailover. change is incorporated.

@bharatviswa504
Copy link
Contributor Author

Thank You @bshashikant for the review.
I have updated code with code comments.

Copy link
Contributor

@dineshchitlangia dineshchitlangia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, minor suggestions inline regarding the comments.
Thank you @bharatviswa504 for the improvement.

@bharatviswa504
Copy link
Contributor Author

bharatviswa504 commented May 17, 2021

Overall LGTM, minor suggestions inline regarding the comments.
Thank you @bharatviswa504 for the improvement.

Thank You @dineshchitlangia for the review. I have incorporated your comments.

@bharatviswa504 bharatviswa504 changed the title Hdds 5216: Fix race condition causing SCM failOverProxy which is causing failover wrongly. HDDS-5216. Fix race condition causing SCM failOverProxy which is causing failover wrongly. May 17, 2021
@bharatviswa504 bharatviswa504 merged commit 2254abf into apache:master May 17, 2021
@bharatviswa504
Copy link
Contributor Author

bharatviswa504 commented May 17, 2021

Thank You @bshashikant and @dineshchitlangia for the review.

@dineshchitlangia I have incorporated your comments on code comments so went ahead and committed this.

bharatviswa504 added a commit to bharatviswa504/hadoop-ozone that referenced this pull request Jul 25, 2021
…ing failover wrongly. (apache#2247)

Change-Id: I6993103b0a931d6fd44343248ffad45db5aebc99
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants