HDDS-9420. [Compatibility]Enabling GRPC encryption causes SCM startup failure #5561

Merged
adoroszlai merged 2 commits into apache:master from ChenSammi:HDDS-9420
Nov 10, 2023

@ChenSammi
Contributor

What changes were proposed in this pull request?

Resolve the backward compatibility issue introduced in HDDS-8588.

The root cause is that the listCA() call made during SCM startup tries to call SCM's own SCMSecurityProtocolServer API, but SCMSecurityProtocolServer is not ready at that point. The call has a max retry policy, so SCM gets stuck retrying and cannot start up.

The fix avoids the remote API call and uses the local on-disk information to build the TrustChain.
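
The difference between the two approaches can be illustrated with a small, self-contained Java sketch. It is not the code from this patch: `CaSource`, `NotReadyRemoteSource`, `LocalDiskSource`, `listWithRetry`, and the `.crt` file layout are all hypothetical names and assumptions made up for illustration, and certificates are kept as plain PEM text rather than parsed. It only shows why a bounded-retry call to a server that has not started blocks the caller, while a read from local disk does not depend on that server at all.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative only: every name in this class is hypothetical, not Ozone code. */
public class LocalTrustChainSketch {

  /** Anything that can list CA certificates (kept as PEM text for brevity). */
  interface CaSource {
    List<String> listCA() throws IOException;
  }

  /** Stands in for an RPC to a server in the same process that is not up yet. */
  static final class NotReadyRemoteSource implements CaSource {
    int calls = 0;

    @Override
    public List<String> listCA() throws IOException {
      calls++;
      throw new IOException("security protocol server is not ready");
    }
  }

  /** Reads every *.crt file in a directory, in file-name order; no network. */
  static final class LocalDiskSource implements CaSource {
    private final Path dir;

    LocalDiskSource(Path dir) {
      this.dir = dir;
    }

    @Override
    public List<String> listCA() throws IOException {
      List<Path> certFiles;
      try (Stream<Path> files = Files.list(dir)) {
        certFiles = files
            .filter(p -> p.getFileName().toString().endsWith(".crt"))
            .sorted()
            .collect(Collectors.toList());
      }
      List<String> pems = new ArrayList<>();
      for (Path p : certFiles) {
        pems.add(new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
      }
      return pems;
    }
  }

  /** Bounded retry with a fixed sleep: the caller is blocked until it gives up. */
  static List<String> listWithRetry(CaSource source, int maxAttempts, long sleepMillis)
      throws IOException, InterruptedException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return source.listCA();
      } catch (IOException e) {
        last = e;
        Thread.sleep(sleepMillis);
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    // Placeholder content, not a real certificate.
    Path dir = Files.createTempDirectory("certs");
    Files.write(dir.resolve("ca.crt"), "PEM placeholder".getBytes(StandardCharsets.UTF_8));

    System.out.println("local entries: " + new LocalDiskSource(dir).listCA().size());
    try {
      listWithRetry(new NotReadyRemoteSource(), 3, 10L);
    } catch (IOException e) {
      System.out.println("remote gave up: " + e.getMessage());
    }
  }
}
```

With a large attempt count and a long sleep, the retry path keeps the calling thread asleep for the whole retry budget, which matches the `TIMED_WAITING (sleeping)` state in the stack trace in the test steps.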

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-9420

How was this patch tested?

Tested manually with the following steps:

  1. Enable Ozone security: ozone.security.enabled
  2. Enable gRPC security: hdds.grpc.tls.enabled
  3. Install a 1.3.0 OM cluster with the above properties, run "scm --init", start SCM, and then stop SCM.
  4. Upgrade the cluster to the master branch and start SCM. SCM hangs with the following stack. Stop SCM.
 "main" #1 prio=5 os_prio=31 tid=0x0000000142009000 nid=0x2203 waiting on condition [0x000000016bf51000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:131)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:108)
        - locked <0x00000005c48670c8> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
        at com.sun.proxy.$Proxy11.submitRequest(Unknown Source)
        at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
        at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:374)
        at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.updateCAList(DefaultCertificateClient.java:952)
        at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.listCA(DefaultCertificateClient.java:940)
        at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getTrustChain(DefaultCertificateClient.java:420)
        - locked <0x00000005c107c2d8> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
        at org.apache.hadoop.hdds.security.ssl.ReloadingX509KeyManager.loadKeyManager(ReloadingX509KeyManager.java:204)
        at org.apache.hadoop.hdds.security.ssl.ReloadingX509KeyManager.<init>(ReloadingX509KeyManager.java:85)
        at org.apache.hadoop.hdds.security.ssl.PemFileBasedKeyStoresFactory.createKeyManagers(PemFileBasedKeyStoresFactory.java:83)
        at org.apache.hadoop.hdds.security.ssl.PemFileBasedKeyStoresFactory.init(PemFileBasedKeyStoresFactory.java:104)
        - locked <0x00000005c4698000> (a org.apache.hadoop.hdds.security.ssl.PemFileBasedKeyStoresFactory)
        at org.apache.hadoop.hdds.security.x509.keys.SecurityUtil.getServerKeyStoresFactory(SecurityUtil.java:103)
        at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getServerKeyStoresFactory(DefaultCertificateClient.java:967)
        - locked <0x00000005c107c2d8> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
        at org.apache.hadoop.hdds.scm.ha.HASecurityUtils.createSCMRatisTLSConfig(HASecurityUtils.java:341)
        at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.<init>(SCMRatisServerImpl.java:109)
        at org.apache.hadoop.hdds.scm.ha.SCMHAManagerImpl.<init>(SCMHAManagerImpl.java:97)
        at org.apache.hadoop.hdds.scm.server.StorageContainerManager.initializeSystemManagers(StorageContainerManager.java:650)
        at org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:403)
        at org.apache.hadoop.hdds.scm.server.StorageContainerManager.createSCM(StorageContainerManager.java:601)
        at org.apache.hadoop.hdds.scm.server.StorageContainerManager.createSCM(StorageContainerManager.java:613)
        at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:171)
        at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:145)
        at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:74)
        at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:48)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
        at picocli.CommandLine.access$1300(CommandLine.java:145)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
        at picocli.CommandLine.execute(CommandLine.java:2078)
        at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
        at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
        at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:63)
  5. Upgrade the cluster to master with this patch and start SCM; it starts successfully. The message "Key manager is loaded with certificate chain" appears in the SCM log.
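
For reference, the two properties named in the first two steps could be set in `ozone-site.xml` roughly as follows. This is only a sketch of those two settings: a real secure cluster needs further configuration (for example Kerberos principals and keytabs) that this description does not list, so the fragment is not sufficient on its own.

```xml
<configuration>
  <!-- Step 1: enable Ozone security -->
  <property>
    <name>ozone.security.enabled</name>
    <value>true</value>
  </property>
  <!-- Step 2: enable TLS for gRPC -->
  <property>
    <name>hdds.grpc.tls.enabled</name>
    <value>true</value>
  </property>
</configuration>
```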

@ChenSammi
Contributor Author

@fapifta, getTrustChain is used by SCM, OM and DN. For OM and DN, even if the root certificate is not considered, the sub-CA certificate still needs to be included in the TrustChain. So the suggested solution of removing the root certificate from the certificate bundle is not attempted here. It can be a separate improvement to the feature.

Contributor

@adoroszlai adoroszlai left a comment

Thanks @ChenSammi for the fix. I have verified it locally using the secure upgrade acceptance test (HDDS-5506, in progress).

@adoroszlai adoroszlai merged commit db55221 into apache:master Nov 10, 2023
@ChenSammi
Contributor Author

Thank you, @adoroszlai .

ibrusentsev pushed a commit to ibrusentsev/ozone that referenced this pull request Nov 14, 2023
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Feb 1, 2024
…s SCM startup failure (apache#5561)

(cherry picked from commit db55221)
Change-Id: Ic8b4fb2c83a8efa75da04a49ef9c43f89abce388