HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM. #2294

bharatviswa504 · 2021-06-01T10:15:19Z

What changes were proposed in this pull request?

After an unclean SCM shutdown, SCM may never come out of safe mode.

Attached a document to Jira with the problem statement and the proposal.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-5263

How was this patch tested?

Tested the fix on a cluster.
The testing is described here (link).
Also performed failover testing to check that the leader status is properly propagated.
Added tests for the new SCMContext API.

bharatviswa504 · 2021-06-02T09:50:59Z

With this approach we see an issue.

Suppose that, before the restart, 2 pipelines were closed and removed, and a new pipeline was created. The SCM pipeline table still has the 2 old pipelines, because the remove and create operations were not persisted to the DB when SCM was force killed.

Because we call refresh and validate on each applyTransaction, the safe mode pipeline rules are validated after every replayed transaction. SCM therefore exits safe mode right after the 2nd pipeline remove transaction, without waiting for the remaining pending transactions.

As a result, SCM comes out of safe mode early, and reads and writes may fail even though SCM is out of safe mode.

2021-06-02 05:51:04,208 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 1, healthy pipeline threshold count is 1
2021-06-02 05:51:04,208 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 1, pipeline's with at least one datanode reported threshold count is 1
2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 0, healthy pipeline threshold count is 0
2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated
2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 0, pipeline's with at least one datanode reported threshold count is 0
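
To make the early exit concrete, here is a minimal sketch of the replay described above. It is not the SCM code: the class and method names are hypothetical, the rule is reduced to a single count check, and the 10% threshold fraction is an assumption chosen only so that the numbers line up with the log lines above.

```java
final class SafeModeReplaySketch {

  // Hypothetical stand-in for a healthy-pipeline rule: the threshold is
  // recomputed from whatever pipeline count is currently known.
  static boolean ruleSatisfied(int totalPipelines, int healthyReported,
      double fraction) {
    int threshold = (int) Math.ceil(totalPipelines * fraction);
    return healthyReported >= threshold;
  }

  // Validates after every replayed transaction. Returns the index of the
  // first transaction after which the rule passes, or -1 if it never does.
  static int firstPassWhenValidatingEachTxn(int persistedPipelines,
      int[] deltas, int healthyReported, double fraction) {
    int total = persistedPipelines;
    for (int i = 0; i < deltas.length; i++) {
      total += deltas[i];
      if (ruleSatisfied(total, healthyReported, fraction)) {
        return i;
      }
    }
    return -1;
  }

  // Validates once, only after all pending transactions have been applied.
  static boolean passesAfterFullReplay(int persistedPipelines,
      int[] deltas, int healthyReported, double fraction) {
    int total = persistedPipelines;
    for (int delta : deltas) {
      total += delta;
    }
    return ruleSatisfied(total, healthyReported, fraction);
  }

  public static void main(String[] args) {
    // DB holds 2 stale pipelines; pending log: remove, remove, create.
    int[] deltas = {-1, -1, +1};
    System.out.println(firstPassWhenValidatingEachTxn(2, deltas, 0, 0.1));
    System.out.println(passesAfterFullReplay(2, deltas, 0, 0.1));
  }
}
```

With no datanode reports yet, validating per transaction passes at index 1 (the 2nd remove, when the count and threshold both drop to 0), while validating after the full replay does not pass, because the newly created pipeline still has to be reported.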

After an offline discussion with @bshashikant:

We decided to refresh the SCM safe mode rules once, after leader ready, on all SCMs, and to start the DN RPC port only after leader ready, so that SCM does not come out of safe mode early based on a DB that is not up to date.
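
The proposal can be sketched roughly as follows. This is only an illustration under assumptions: `LeaderReadyGateSketch`, `onLeaderReady`, and the two `Runnable` hooks are hypothetical names and not the SCMStateMachine API. Only the flag name `refreshedAfterLeaderReady` is taken from the patch, and `compareAndSet` is used here just to make the once-only behaviour explicit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

final class LeaderReadyGateSketch {
  // Flips to true the first time leader ready is observed.
  private final AtomicBoolean refreshedAfterLeaderReady =
      new AtomicBoolean(false);
  // Hypothetical hooks standing in for the real SCM actions.
  private final Runnable refreshSafeModeRules;
  private final Runnable startDatanodeRpc;

  LeaderReadyGateSketch(Runnable refreshSafeModeRules,
      Runnable startDatanodeRpc) {
    this.refreshSafeModeRules = refreshSafeModeRules;
    this.startDatanodeRpc = startDatanodeRpc;
  }

  // Returns true only for the call that actually performed the refresh.
  boolean onLeaderReady() {
    if (!refreshedAfterLeaderReady.compareAndSet(false, true)) {
      return false; // already refreshed; later notifications are ignored
    }
    // 1. Rebuild rule thresholds from the now up-to-date state.
    refreshSafeModeRules.run();
    // 2. Only then accept datanode registrations and reports.
    startDatanodeRpc.run();
    return true;
  }

  public static void main(String[] args) {
    List<String> events = new ArrayList<>();
    LeaderReadyGateSketch gate = new LeaderReadyGateSketch(
        () -> events.add("refresh rules"),
        () -> events.add("start DN RPC"));
    gate.onLeaderReady();
    gate.onLeaderReady(); // no effect the second time
    System.out.println(events);
  }
}
```

Ordering the refresh before the RPC start means no datanode report can be counted against thresholds computed from the stale pipeline table.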

...java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java

bharatviswa504 · 2021-06-04T07:10:38Z

Tested this on a cluster with the following scenario: close a pipeline, let the scrubber remove it and create a new pipeline, then restart SCM. (SCM still has the old pipeline, which was closed and removed, in its DB. Before the fix SCM would never come out of safe mode, because it reads the old pipeline info from the DB during rule setup.)

Once leader ready is reached, the rules are refreshed, the datanodes register, and the rules are successfully validated. (Unlike before, the pipeline rules are not validated right after the remove pipeline transaction.)

2021-06-04 07:05:28,646 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 1, healthy pipeline threshold count is 1
2021-06-04 07:05:28,646 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Refreshed Total pipeline count is 1, pipeline's with at least one datanode reported threshold count is 1

2021-06-04 07:05:29,184 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Registered Data node : 9642160f-49ab-4400-a66c-e7f4210f4ca0{ip: xx, host: bv-unsec-2.bv-unsec.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-06-04 07:05:30,038 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Registered Data node : b44ed911-3a93-4b83-9e81-88fe47369a84{ip: xx, host: bv-unsec-3.bv-unsec.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-06-04 07:05:30,098 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Registered Data node : 37d28fa1-41db-4f44-963e-45675b822884{ip: xx, host: bv-unsec-1.bv-unsec.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}

2021-06-04 07:05:30,098 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:31,146 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:32,009 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:32,059 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:33,841 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:33,888 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:33,921 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:39,331 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 1, required healthy pipeline reported count is 1
2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated
2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: ScmSafeModeManager, all rules are successfully validated
2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM exiting safe mode.

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMContext.java

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java

...erver-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/HealthyPipelineSafeModeRule.java

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java

… of SCM. (apache#2294) (cherry picked from commit ac7166b) Change-Id: I12dad469f1b395286a430a49eb0b48f2455d1fc3

vivekratnavel · 2021-08-24T22:28:05Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java

+      // 2. Start DN Rpc server.
+      if (!refreshedAfterLeaderReady.get()) {
+        scm.getScmSafeModeManager().refresh();
+        LOG.info("bharat starting from sm");


@bharatviswa504 I just noticed that this line got committed by mistake. Can we rectify this in a separate jira or combine it in your next patch?

bharatviswa504 requested review from bshashikant and GlenGeng-awx June 1, 2021 10:15

bharatviswa504 added the scm-ha label Jun 1, 2021

bshashikant reviewed Jun 3, 2021

bharatviswa504 force-pushed the HDDS-5263 branch from b5df455 to adb43b4 Compare June 3, 2021 16:26

bharatviswa504 requested a review from bshashikant June 4, 2021 07:13

bshashikant reviewed Jun 9, 2021

bharatviswa504 added 18 commits June 10, 2021 10:15

HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM.

f1a7e0a

fix cs

44e472c

fix fb

2878e92

fix ci

5ef1951

fix tests

0d8c8df

fix cs

f4346fe

add leader ready check for background services start

c6f08c0

remove leader ready check in SCMContext, as it needs to be triggered externally to make bg services work.

650e1e1

add leader ready in a daemon thread and update context only after leader ready

d3f997e

add a check to print rule validated

7395f38

add logging

ccd2b36

add java doc and remove unwanted change

24e06a0

rework to refresh after leader ready and also refreshandvalidate in apply for follower after refresh

e90a2d5

revert

4108e1b

fix cs

bd78662

fix test

fc47f58

remove leader check in apply transaction

1940c02

fix review comments

bd16d60

bharatviswa504 force-pushed the HDDS-5263 branch from d0bb6b9 to bd16d60 Compare June 10, 2021 06:21

fix review comment

55b8fe0

bharatviswa504 requested a review from bshashikant June 10, 2021 06:26

bharatviswa504 added 2 commits June 10, 2021 13:27

fix more issues found during testing and also update test

db8f8e8

fix cs

1995436

bshashikant reviewed Jun 10, 2021

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java

bshashikant reviewed Jun 10, 2021

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java

bharatviswa504 added 2 commits June 10, 2021 20:46

fix review

8ed4516

fix code causing test failure

8b3aea8

bharatviswa504 requested a review from bshashikant June 11, 2021 05:21

bshashikant approved these changes Jun 11, 2021

bshashikant merged commit ac7166b into apache:master Jun 11, 2021

vivekratnavel reviewed Aug 24, 2021
