HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM. #2294

Merged
merged 23 commits into from Jun 11, 2021

@bharatviswa504 bharatviswa504 commented Jun 1, 2021

What changes were proposed in this pull request?

After an unclean SCM shutdown, SCM may never come out of safe mode.

A document with the problem statement and the proposal is attached to the Jira.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5263

How was this patch tested?

Tested the fix on a cluster.
Testing is described here link
Also performed failover testing to check that the leader status is properly propagated.
Added tests for the new SCMContext API.

bharatviswa504 commented Jun 2, 2021

With this approach we see an issue.

Suppose that before the restart 2 pipelines were closed, then removed, and a new pipeline was created. The SCM pipeline table still has the 2 old pipelines, because the remove and create operations were not persisted to the DB when SCM was force killed.

Because we call refresh and validate the pipeline rules on every applyTransaction, the safe mode pipeline rules are validated right after the 2nd pipeline remove transaction, without waiting for the remaining pending transactions. In this case SCM comes out of safe mode early.

As a result, reads and writes may fail even after SCM is out of safe mode.

2021-06-02 05:51:04,208 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 1, healthy pipeline threshold count is 1
2021-06-02 05:51:04,208 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 1, pipeline's with at least one datanode reported threshold count is 1
2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 0, healthy pipeline threshold count is 0
2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated
2021-06-02 05:51:04,209 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 0, pipeline's with at least one datanode reported threshold count is 0
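
The log lines above show the pipeline count and the threshold both dropping to 0, at which point the rule passes with no datanode having reported. A minimal toy model of that effect is sketched below. It is not Ozone's code: `ToyPipelineRule`, `Txn`, the method names, and the 10% fraction are all assumptions made for illustration. It only shows why validating after every replayed transaction can pass vacuously, while refreshing once after the full replay does not.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model only: these are not Ozone's real classes or method signatures. */
public class EarlyExitSketch {

  /** Hypothetical pending log entries that the stale DB has not seen yet. */
  enum Txn { REMOVE_PIPELINE, CREATE_PIPELINE }

  /** Hypothetical rule: passes once healthy reports reach ceil(total * fraction). */
  static final class ToyPipelineRule {
    private final double fraction;
    private int threshold;
    // Fixed at 0 in this sketch: no datanode has reported a healthy pipeline yet.
    private final int healthyReported = 0;

    ToyPipelineRule(double fraction) {
      this.fraction = fraction;
    }

    void refresh(int totalPipelines) {
      threshold = (int) Math.ceil(totalPipelines * fraction);
    }

    boolean validate() {
      return healthyReported >= threshold;
    }
  }

  private static int apply(int count, Txn txn) {
    return txn == Txn.CREATE_PIPELINE ? count + 1 : count - 1;
  }

  /** Refresh and validate after every transaction; returns how many were applied at exit, or -1. */
  static int exitPointWhenValidatingEachTxn(int staleCount, List<Txn> pending) {
    ToyPipelineRule rule = new ToyPipelineRule(0.10); // assumed fraction
    int count = staleCount;
    for (int i = 0; i < pending.size(); i++) {
      count = apply(count, pending.get(i));
      rule.refresh(count);
      if (rule.validate()) {
        return i + 1; // rule passed while later transactions are still pending
      }
    }
    return -1;
  }

  /** Refresh once after the whole log is replayed; returns whether the rule passes. */
  static boolean passesAfterFullReplay(int staleCount, List<Txn> pending) {
    int count = staleCount;
    for (Txn txn : pending) {
      count = apply(count, txn);
    }
    ToyPipelineRule rule = new ToyPipelineRule(0.10); // assumed fraction
    rule.refresh(count);
    return rule.validate();
  }

  public static void main(String[] args) {
    List<Txn> pending =
        Arrays.asList(Txn.REMOVE_PIPELINE, Txn.REMOVE_PIPELINE, Txn.CREATE_PIPELINE);
    System.out.println(exitPointWhenValidatingEachTxn(2, pending)); // prints 2
    System.out.println(passesAfterFullReplay(2, pending));          // prints false
  }
}
```

In this model the rule passes after the 2nd remove because a threshold of 0 is met by 0 reports, which matches the sequence described above. After the full replay the threshold is 1, so the rule keeps waiting for a datanode report.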

After an offline discussion with @bshashikant, we decided to:

  1. Refresh the SCM safe mode rules once, after leader ready, on all SCMs.
  2. Start the DN RPC port only after leader ready, so that SCM does not come out of safe mode early based on a DB that is not up to date.
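
The two steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this patch: `SafeModeRules`, `DatanodeRpcServer`, `LeaderReadyGate`, and `onLeaderReady` are hypothetical names, and only the `refreshedAfterLeaderReady` flag name is borrowed from the patch. The sketch uses `compareAndSet` so that repeated leader-ready notifications trigger the refresh and the RPC start only once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch only: hypothetical types, not the actual SCM classes. */
public class LeaderReadyGateSketch {

  /** Hypothetical stand-in for the safe mode rule set. */
  interface SafeModeRules {
    void refresh();
  }

  /** Hypothetical stand-in for the datanode-facing RPC server. */
  interface DatanodeRpcServer {
    void start();
  }

  static final class LeaderReadyGate {
    private final AtomicBoolean refreshedAfterLeaderReady = new AtomicBoolean(false);
    private final SafeModeRules rules;
    private final DatanodeRpcServer dnRpc;

    LeaderReadyGate(SafeModeRules rules, DatanodeRpcServer dnRpc) {
      this.rules = rules;
      this.dnRpc = dnRpc;
    }

    /** Called whenever this SCM learns the leader is ready; acts only the first time. */
    void onLeaderReady() {
      if (refreshedAfterLeaderReady.compareAndSet(false, true)) {
        rules.refresh(); // 1. recompute thresholds from the caught-up DB
        dnRpc.start();   // 2. only now start accepting datanode reports
      }
    }
  }

  /** Fires the notification twice and records which actions actually ran. */
  static List<String> demo() {
    List<String> events = new ArrayList<>();
    LeaderReadyGate gate = new LeaderReadyGate(
        () -> events.add("refresh"),
        () -> events.add("start-dn-rpc"));
    gate.onLeaderReady();
    gate.onLeaderReady(); // second notification is a no-op
    return events;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [refresh, start-dn-rpc]
  }
}
```

Ordering the refresh before the RPC start means no datanode report can be counted against thresholds computed from a stale pipeline table.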

bharatviswa504 commented Jun 4, 2021

Tested this on a cluster with the following scenario: close a pipeline, let the scrubber remove it and create a new pipeline, then restart SCM. (SCM's DB still has the old pipeline, which was closed and removed. Before the fix, SCM would never come out of safe mode, because it reads the old pipeline info from the DB during rule setup.)

After leader ready, once the rules are refreshed and the datanodes have registered, the rules are successfully validated. (Unlike before, the pipeline rules are no longer validated right after the remove pipeline transaction.)

2021-06-04 07:05:28,646 INFO org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 1, healthy pipeline threshold count is 1
2021-06-04 07:05:28,646 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Refreshed Total pipeline count is 1, pipeline's with at least one datanode reported threshold count is 1
2021-06-04 07:05:29,184 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Registered Data node : 9642160f-49ab-4400-a66c-e7f4210f4ca0{ip: xx, host: bv-unsec-2.bv-unsec.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-06-04 07:05:30,038 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Registered Data node : b44ed911-3a93-4b83-9e81-88fe47369a84{ip: xx, host: bv-unsec-3.bv-unsec.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-06-04 07:05:30,098 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Registered Data node : 37d28fa1-41db-4f44-963e-45675b822884{ip: xx, host: bv-unsec-1.bv-unsec.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}

2021-06-04 07:05:30,098 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:31,146 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:32,009 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:32,059 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:33,841 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:33,888 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:33,921 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:39,331 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1
2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 1, required healthy pipeline reported count is 1
2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated
2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: ScmSafeModeManager, all rules are successfully validated
2021-06-04 07:05:39,332 INFO org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM exiting safe mode.

@bshashikant bshashikant merged commit ac7166b into apache:master Jun 11, 2021
bharatviswa504 added a commit to bharatviswa504/hadoop-ozone that referenced this pull request Jul 25, 2021
… of SCM. (apache#2294)

(cherry picked from commit ac7166b)
Change-Id: I12dad469f1b395286a430a49eb0b48f2455d1fc3
// 2. Start DN Rpc server.
if (!refreshedAfterLeaderReady.get()) {
scm.getScmSafeModeManager().refresh();
LOG.info("bharat starting from sm");

@bharatviswa504 I just noticed that this LOG.info line got committed by mistake. Can we fix it in a separate Jira, or fold the fix into your next patch?
