HDDS-5216. Fix race condition causing SCM failOverProxy which is causing failover wrongly. by bharatviswa504 · Pull Request #2247 · apache/ozone

bharatviswa504 · 2021-05-14T05:08:01Z

What changes were proposed in this pull request?

In OzoneManager, the SCM client is shared across RpcHandler threads.
We have observed that when several threads trigger failover at the same time, failover happens incorrectly on the same SCM and exhausts the retry count.

One more thing I have observed: in the log below the error is no route to scm3, but the retry happened on scm1/172.31.0.9:9863.

2021-05-11 05:59:53,202 [IPC Server handler 10 on default port 9862] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.NoRouteToHostException: No Route to Host from  om1/172.31.0.11 to scm3:9863 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost, while invoking $Proxy32.send over nodeId=scm1,nodeAddress=scm1/172.31.0.9:9863 after 9 failover attempts. Trying to failover after sleeping for 2000ms.
On the next attempt the error is again no route to scm3, but the retry happened on scm2/172.31.0.6:9863.

2021-05-11 05:59:59,345 [IPC Server handler 10 on default port 9862] WARN ipc.Client: Address change detected. Old: scm3/172.31.0.5:9863 New: scm3:9863
2021-05-11 05:59:59,347 [IPC Server handler 10 on default port 9862] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.NoRouteToHostException: No Route to Host from  om1/172.31.0.11 to scm3:9863 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see:  http://wiki.apache.org/hadoop/NoRouteToHost, while invoking $Proxy32.send over nodeId=scm2,nodeAddress=scm2/172.31.0.6:9863 after 10 failover attempts. Trying to failover after sleeping for 2000ms.

This happens because our performFailover is a no-op; when a failover is needed, we update currentSCMProxyNodeID in shouldRetry of the RetryPolicy instead.

For example, two threads contact scm3 and both get NoRouteToHostException. shouldRetry in the first thread moves currentSCMProxyNodeID to scm1, and shouldRetry in the second thread then moves it on to scm2.
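The double move in this example can be reproduced with a small model. This is a minimal sketch, not the Ozone code: the class `RacyFailoverModel` and its methods are hypothetical stand-ins, and the two handler threads are simulated as two sequential calls so the outcome is deterministic.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of the pre-fix behaviour: the retry decision itself moves the shared node pointer. */
final class RacyFailoverModel {
    private final List<String> nodeIds;
    private int currentIndex;

    RacyFailoverModel(List<String> nodeIds, String start) {
        this.nodeIds = nodeIds;
        this.currentIndex = nodeIds.indexOf(start);
    }

    String currentNodeId() {
        return nodeIds.get(currentIndex);
    }

    /** Stand-in for shouldRetry: every failed call advances the pointer unconditionally. */
    String shouldRetry() {
        currentIndex = (currentIndex + 1) % nodeIds.size();
        return currentNodeId();
    }

    public static void main(String[] args) {
        RacyFailoverModel model =
            new RacyFailoverModel(Arrays.asList("scm1", "scm2", "scm3"), "scm3");
        // Both handlers sent their request to scm3 and both failed,
        // so each of them runs the retry decision once.
        String afterFirstHandler = model.shouldRetry();
        String afterSecondHandler = model.shouldRetry();
        // One failed node led to two pointer moves.
        System.out.println(afterFirstHandler + " " + afterSecondHandler); // prints "scm1 scm2"
    }
}
```

A single unreachable node therefore consumes two failover steps, and scm1 is skipped without ever being the reason for a failure.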

Hadoop's RetryInvocationHandler already handles the case where another thread is performing a failover: it does not call performFailover again. Instead we see the WARN message below, and the call gets the current proxy and contacts that node.

om3_1 | 2021-05-14 05:04:28,699 [IPC Server handler 34 on default port 9862] WARN retry.RetryInvocationHandler: A failover has occurred since the start of call #24329 $Proxy32.send over nodeId=scm3,nodeAddress=scm3/192.168.0.6:9863

The solution is to update the SCM node ID in performFailover instead of updating currentSCMProxyNodeID in shouldRetry.
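A minimal sketch of the idea, under stated assumptions: `GuardedFailoverModel`, `shouldFailover` and `failover` are hypothetical names, and the counter-based check is only one way to model "do not fail over again if another thread already did". It is not copied from Hadoop or from this patch. The retry decision mutates nothing, and the node pointer moves only inside the failover step.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model: the retry decision is read-only, only the failover step moves the pointer. */
final class GuardedFailoverModel {
    private final List<String> nodeIds;
    private int currentIndex;
    private long failoverCount;

    GuardedFailoverModel(List<String> nodeIds, String start) {
        this.nodeIds = nodeIds;
        this.currentIndex = nodeIds.indexOf(start);
    }

    synchronized String currentNodeId() {
        return nodeIds.get(currentIndex);
    }

    synchronized long failoverCount() {
        return failoverCount;
    }

    /** Stand-in for shouldRetry: reports whether a failover is wanted and changes no state. */
    boolean shouldFailover(Exception cause) {
        return cause instanceof IOException;
    }

    /** Stand-in for performFailover: the only place where the node pointer moves. */
    private void performFailover() {
        currentIndex = (currentIndex + 1) % nodeIds.size();
    }

    /**
     * Fails over only if nobody else has done so since the caller started its call.
     * Returns true if this caller moved the pointer, false if it should just reuse the current node.
     */
    synchronized boolean failover(long expectedFailoverCount) {
        if (failoverCount != expectedFailoverCount) {
            return false; // a failover has occurred since the start of this call
        }
        performFailover();
        failoverCount++;
        return true;
    }

    public static void main(String[] args) {
        GuardedFailoverModel model =
            new GuardedFailoverModel(Arrays.asList("scm1", "scm2", "scm3"), "scm3");
        // Both handlers start their call while scm3 is current and remember the counter.
        long seenByFirst = model.failoverCount();
        long seenBySecond = model.failoverCount();
        Exception cause = new NoRouteToHostException("No route to host");
        boolean firstMoved = model.shouldFailover(cause) && model.failover(seenByFirst);
        boolean secondMoved = model.shouldFailover(cause) && model.failover(seenBySecond);
        System.out.println(firstMoved + " " + secondMoved + " " + model.currentNodeId());
        // prints "true false scm1"
    }
}
```

With the pointer update behind the guard, the second handler retries against scm1 rather than pushing the client on to scm2.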

I also made a few more changes so that the proxy creation logic is common across classes.

Opened a Jira, HDDS-5227, to avoid the duplication.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-5216

How was this patch tested?

Tested locally, and observed that the client no longer performs failover again and no longer exhausts the retry count.

bshashikant

Looks good. @bharatviswa504, can you please add some comments on why the leaderId is made volatile and cached, and why the performFailover change is incorporated?

bharatviswa504 · 2021-05-17T04:23:00Z

Thank You @bshashikant for the review.
I have updated the code with code comments.

dineshchitlangia

Overall LGTM, minor suggestions inline regarding the comments.
Thank you @bharatviswa504 for the improvement.

...rk/src/main/java/org/apache/hadoop/hdds/scm/proxy/SCMBlockLocationFailoverProxyProvider.java

...rc/main/java/org/apache/hadoop/hdds/scm/proxy/SCMContainerLocationFailoverProxyProvider.java

...src/main/java/org/apache/hadoop/hdds/scm/proxy/SCMSecurityProtocolFailoverProxyProvider.java

bharatviswa504 · 2021-05-17T05:11:59Z

Thank You @dineshchitlangia for the review. I have incorporated your comments.

bharatviswa504 · 2021-05-17T10:27:31Z

Thank You @bshashikant and @dineshchitlangia for the review.

@dineshchitlangia I have incorporated your suggestions on the code comments, so I went ahead and committed this.

bharatviswa504 requested a review from bshashikant May 14, 2021 05:08

bharatviswa504 changed the title ~~Hdds 5216: Fix race condition causing failOverProxy which is causing failover wrongly.~~ Hdds 5216: Fix race condition causing SCM failOverProxy which is causing failover wrongly. May 14, 2021

bshashikant reviewed May 14, 2021

bharatviswa504 requested a review from bshashikant May 17, 2021 04:23

dineshchitlangia requested changes May 17, 2021

bharatviswa504 requested a review from dineshchitlangia May 17, 2021 05:12

bharatviswa504 added 6 commits May 17, 2021 11:47

HDDS-5216. Fix race condition in failover proxy provider.

10fd1a1

add sync

b94436b

fix bug in logic

6944396

fix failover bug

0fc0537

address review comments

cf7473e

a

85d305a

bharatviswa504 force-pushed the HDDS-5216-2 branch from db813a9 to 85d305a Compare May 17, 2021 06:18

bharatviswa504 added 2 commits May 17, 2021 11:53

dinesh comments

dedac9c

Trigger notification

ff4cb4a

bharatviswa504 changed the title ~~Hdds 5216: Fix race condition causing SCM failOverProxy which is causing failover wrongly.~~ HDDS-5216. Fix race condition causing SCM failOverProxy which is causing failover wrongly. May 17, 2021

bharatviswa504 added the scm-ha label May 17, 2021

bshashikant approved these changes May 17, 2021

bharatviswa504 merged commit 2254abf into apache:master May 17, 2021

bharatviswa504 added a commit to bharatviswa504/hadoop-ozone that referenced this pull request Jul 25, 2021

HDDS-5216. Fix race condition causing SCM failOverProxy which is caus…

292ccaa

…ing failover wrongly. (apache#2247) Change-Id: I6993103b0a931d6fd44343248ffad45db5aebc99

bharatviswa504 merged 8 commits into apache:master from bharatviswa504:HDDS-5216-2

3 participants
