
HDDS-5116. Secure datanode/OM may exit if it cannot connect to SCM. #2162

Merged
bshashikant merged 4 commits into apache:master from bharatviswa504:HDDS-5116 on Apr 21, 2021


bharatviswa504 commented Apr 20, 2021

What changes were proposed in this pull request?

The following changes are made:

  1. The Datanode now uses the maximum retry count, so during startup it effectively retries forever until it obtains a signed certificate from SCM.
  2. OM and SCM use a fixed retry duration, so that an end user running init/bootstrap gets a response within a bounded time.
  3. Fetching the CA list, which is required during DN/OM startup, now uses the maximum retry count.
  4. Fetching a certificate from SCM, which happens during block token and OM token verification when the certificate is not in the local cache, now uses the maximum retry count.
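
The difference between the count-based retry (items 1, 3 and 4) and the duration-based retry (item 2) can be sketched as below. This is a minimal standalone illustration, not the code from this patch: `RetrySketch`, `retryUpToCount`, `retryUpToDuration` and the fake SCM call are hypothetical names, and the logs in this PR show that the real retries are driven by Hadoop's `RetryInvocationHandler` rather than hand-written loops.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not the code from this patch.
final class RetrySketch {

    // Count-bounded retry. Passing Integer.MAX_VALUE as maxRetries makes the
    // loop effectively unbounded, which is the Datanode-style behaviour.
    static <T> T retryUpToCount(Callable<T> call, int maxRetries, long sleepMs)
            throws Exception {
        int failures = 0;
        while (true) {
            try {
                return call.call();
            } catch (Exception e) {
                if (failures >= maxRetries) {
                    throw e; // retry budget exhausted
                }
                failures++;
                Thread.sleep(sleepMs);
            }
        }
    }

    // Duration-bounded retry. Gives up once maxMs has elapsed, so a user
    // running an init/bootstrap style command gets an answer in bounded time.
    static <T> T retryUpToDuration(Callable<T> call, long maxMs, long sleepMs)
            throws Exception {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxMs);
        while (true) {
            try {
                return call.call();
            } catch (Exception e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // time budget exhausted
                }
                Thread.sleep(sleepMs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Fake SCM call that fails three times before succeeding.
        AtomicInteger attempts = new AtomicInteger();
        Callable<String> flakyScm = () -> {
            if (attempts.incrementAndGet() <= 3) {
                throw new IOException("SCM not reachable yet");
            }
            return "signed-cert";
        };
        String cert = retryUpToCount(flakyScm, Integer.MAX_VALUE, 10L);
        System.out.println(cert + " after " + attempts.get() + " attempts");

        // A call that never succeeds is abandoned once the time budget is spent.
        try {
            retryUpToDuration(() -> { throw new IOException("SCM down"); }, 50L, 10L);
        } catch (IOException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

The count-based variant suits a long-running daemon that cannot do useful work until SCM is reachable, while the duration-based variant suits a command where a person is waiting for the result.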

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5116

How was this patch tested?

Tested manually: started OM and DN before SCM and confirmed that they keep retrying beyond the default retry count of 15.

docker-compose up --build om1 datanode1

After waiting longer than the default retries would take, start SCM:
docker-compose up --build scm1.org scm2.org scm3.org

om1_1 | 2021-04-20 07:15:09,675 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.NoRouteToHostException: No Route to Host from om1/172.25.0.111 to scm1.org:9863 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost, while invoking $Proxy31.send over nodeId=scm1,nodeAddress=scm1.org/172.25.0.116:9863 after 45 failover attempts. Trying to failover after sleeping for 2000ms.

datanode1_1 | 2021-04-20 07:15:35,048 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From 9cb343c107ed/172.25.0.102 to scm3.org:9961 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy17.submitRequest over nodeId=scm3,nodeAddress=scm3.org/172.25.0.118:9961 after 35 failover attempts. Trying to failover after sleeping for 2000ms.

Once SCM is up, DN and OM start up successfully.

@bshashikant bshashikant merged commit 8cdabec into apache:master Apr 21, 2021

Thanks @bharatviswa504 for the contribution.
