HDDS-5263. SCM may stay in safe mode forever after a unclean shutdown of SCM. #2294
With this approach we see an issue. Say that before the restart 2 pipelines were closed and removed, and a new pipeline was created. The SCM pipeline table still has the 2 old pipelines, because the remove and create operations were not persisted to the DB when SCM was force killed. Since we call refresh and validate, and the pipeline rules are validated on each applyTransaction, the safe mode pipeline rules can pass right after the 2nd pipeline remove transaction, without waiting for all the pending transactions. In this case SCM comes out of safe mode early, and reads/writes might fail even though SCM is out of safe mode.
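The early exit described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `PipelineRule`, `staleRule`, and `replay` are hypothetical names, not Ozone classes, and the rule is assumed to pass when every currently known pipeline has been reported, so it passes trivially once the stale pipelines are removed and before the new pipeline is created.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of replaying pipeline transactions after an unclean SCM restart. */
public class SafeModeReplaySketch {

  /** Simplified stand-in for a pipeline safe mode rule (not the real Ozone rule). */
  static final class PipelineRule {
    private final Set<String> knownPipelines = new HashSet<>();
    private final Set<String> reportedPipelines = new HashSet<>();

    void addPipeline(String id) {
      knownPipelines.add(id);
    }

    void removePipeline(String id) {
      knownPipelines.remove(id);
      reportedPipelines.remove(id);
    }

    /** Assumed rule: all known pipelines reported; trivially true when none are known. */
    boolean validate() {
      return reportedPipelines.containsAll(knownPipelines);
    }
  }

  /** Pipeline table as loaded from the DB: still holds the 2 stale pipelines. */
  static PipelineRule staleRule() {
    PipelineRule rule = new PipelineRule();
    rule.addPipeline("p1");
    rule.addPipeline("p2");
    return rule;
  }

  /**
   * Applies "remove:<id>" / "create:<id>" transactions in order.
   * Returns the 1-based index of the transaction after which the rule
   * passed (safe mode exit), or -1 if it never passed.
   */
  static int replay(PipelineRule rule, List<String> txns, boolean validateAfterEachTxn) {
    for (int i = 0; i < txns.size(); i++) {
      String[] parts = txns.get(i).split(":", 2);
      if (parts[0].equals("remove")) {
        rule.removePipeline(parts[1]);
      } else {
        rule.addPipeline(parts[1]);
      }
      if (validateAfterEachTxn && rule.validate()) {
        return i + 1; // exits while later transactions are still pending
      }
    }
    if (!validateAfterEachTxn && rule.validate()) {
      return txns.size();
    }
    return -1;
  }

  public static void main(String[] args) {
    List<String> pending = Arrays.asList("remove:p1", "remove:p2", "create:p3");
    // Validating on every transaction: rule passes after the 2nd remove.
    System.out.println(replay(staleRule(), pending, true));  // prints 2
    // Validating only after all pending transactions: p3 is known but unreported.
    System.out.println(replay(staleRule(), pending, false)); // prints -1
  }
}
```

In the second call the model stays in safe mode until the newly created pipeline is reported, which is the behavior the comment argues for.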
After an offline discussion with @bshashikant
Tested this on a cluster. Once the leader is ready, the rules are refreshed, and the datanodes have registered, the rules are validated successfully. (Unlike before, where the pipeline rules were not validated successfully after a pipeline was removed.)
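A minimal sketch of refreshing the safe mode rules only once after the leader becomes ready is shown below. The flag name `refreshedAfterLeaderReady` is taken from the diff excerpt in this thread; everything else (`RefreshOnceSketch`, `CountingSafeModeManager`, `onLeaderReady`) is hypothetical and exists only for illustration. The sketch uses `compareAndSet` instead of the `get()` check visible in the diff so that the check and the update happen atomically; whether the actual patch needs that is not stated in this thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical run-once guard around a safe mode refresh. */
public class RefreshOnceSketch {

  /** Stand-in for the safe mode manager; it only counts refresh calls. */
  static final class CountingSafeModeManager {
    final AtomicInteger refreshCount = new AtomicInteger();

    void refresh() {
      refreshCount.incrementAndGet();
    }
  }

  private final AtomicBoolean refreshedAfterLeaderReady = new AtomicBoolean(false);
  private final CountingSafeModeManager safeModeManager;

  RefreshOnceSketch(CountingSafeModeManager safeModeManager) {
    this.safeModeManager = safeModeManager;
  }

  /** May be called many times; only the first call triggers a refresh. */
  void onLeaderReady() {
    if (refreshedAfterLeaderReady.compareAndSet(false, true)) {
      safeModeManager.refresh();
    }
  }

  public static void main(String[] args) {
    CountingSafeModeManager manager = new CountingSafeModeManager();
    RefreshOnceSketch stateMachine = new RefreshOnceSketch(manager);
    stateMachine.onLeaderReady();
    stateMachine.onLeaderReady();
    System.out.println(manager.refreshCount.get()); // prints 1
  }
}
```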
…externally to make bg services work.
…pply for follower after refresh
d0bb6b9 to bd16d60
… of SCM. (apache#2294) (cherry picked from commit ac7166b) Change-Id: I12dad469f1b395286a430a49eb0b48f2455d1fc3
// 2. Start DN Rpc server.
if (!refreshedAfterLeaderReady.get()) {
  scm.getScmSafeModeManager().refresh();
  LOG.info("bharat starting from sm");
@bharatviswa504 I just noticed that this line got committed by mistake. Can we rectify this in a separate jira or combine it in your next patch?
What changes were proposed in this pull request?
After an unclean SCM shutdown, SCM may not come out of safe mode.
Attached a document to Jira with the problem statement and the proposal.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5263
How was this patch tested?
Tested the fix on a cluster.
Testing is described here link
Also performed failover testing to check that leader status is properly propagated.
Added tests for the new SCMContext API.