HDDS-14868. Avoid full scan of container list during refreshAndValidate of ContainerSafemodeRule. #9953
sadanand48 wants to merge 17 commits into apache:master
@sadanand48, thanks for working on this! How about refreshing the safemode rules every 5s instead of doing it in applyTransaction?
Thanks @szetszwo for the input. We could make this behaviour configurable, i.e. periodic or based on applyTransaction. I'm suggesting this because smaller clusters, or clusters without any pending logs, may be impacted by redundant refresh calls.
Refreshing the safemode rules in applyTransaction is actually a big mistake: applyTransaction is on the critical path of the StateMachine, and adding unnecessary operations there is going to slow down everything. In contrast, refreshing the safemode rules every 5s is not going to have any measurable performance impact. Hypothetically, if refreshing every 5s is not okay, then refreshing in applyTransaction is definitely much worse, since there are thousands of applyTransaction ops per second.
szetszwo left a comment:
@sadanand48, thanks for the update.
- Since the current code in SCMStateMachine uses SCMSafeModeManager to refresh, it is better to do the refresh in SCMSafeModeManager.
- When refresh is enabled, SCMStateMachine should not refresh.
- During refreshing, if it is NOT in safemode, we can stop the executor. Then, we don't need any stop method.
- It is better to create a non-mock test using MiniOzoneCluster.
See https://issues.apache.org/jira/secure/attachment/13081501/9953_review.patch
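For readers without access to the attached patch, the points above can be illustrated with a minimal, self-contained sketch. It is not the code from the patch or from Ozone: the class `PeriodicSafeModeRefresher` and its methods are hypothetical, and the safemode check and the rule refresh are passed in as plain callbacks. It only shows the shape of the idea: a periodic task that refreshes while in safemode and shuts its own executor down once safemode has been exited, so no separate stop method is needed.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a periodic refresher owned by a safemode manager. */
final class PeriodicSafeModeRefresher {
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "SafeModeRefresher");
        t.setDaemon(true); // do not keep the JVM alive for this task
        return t;
      });
  private final BooleanSupplier inSafeMode;   // assumption: cheap safemode check
  private final Runnable refreshRules;        // assumption: refreshes all rules
  private final AtomicInteger refreshCount = new AtomicInteger();

  PeriodicSafeModeRefresher(BooleanSupplier inSafeMode, Runnable refreshRules) {
    this.inSafeMode = inSafeMode;
    this.refreshRules = refreshRules;
  }

  /** Schedule the refresh with a fixed delay between runs. */
  void start(long interval, TimeUnit unit) {
    executor.scheduleWithFixedDelay(this::refreshOnce, interval, interval, unit);
  }

  private void refreshOnce() {
    if (!inSafeMode.getAsBoolean()) {
      // Out of safemode: shut down from inside the task, so the periodic
      // task is not rescheduled and no external stop method is required.
      executor.shutdown();
      return;
    }
    try {
      refreshRules.run();
      refreshCount.incrementAndGet();
    } catch (RuntimeException e) {
      // An exception escaping a periodic task suppresses all later runs,
      // so it is swallowed here; real code would log it.
    }
  }

  int getRefreshCount() {
    return refreshCount.get();
  }

  /** Wait until the executor has terminated; returns false on timeout or interrupt. */
  boolean awaitStopped(long timeout, TimeUnit unit) {
    try {
      return executor.awaitTermination(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

With this shape, the state machine's apply path does not call refresh at all when periodic refresh is enabled; the only cost on the critical path is zero.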
Thanks @szetszwo for the review; updated as per your patch.
With this, all the safemode rules will have the same behaviour. I guess that should be okay. I will add a non-mock test.
szetszwo left a comment:
@sadanand48 , thanks for the update!
Quick question:
- Would it work if we don't make the changes in AbstractContainerSafeModeRule and other code logic changes such as isScmRatisApplyCaughtUpToCommit?
If it works, this PR should only change the refreshing time (i.e. periodic refreshing instead of doing it in SCMStateMachine.) Other code logic changes/improvement can be done in a separate PR.
@sadanand48, any thoughts?
Yes @szetszwo, it should work without the other changes. The other change is only about isScmRatisApplyCaughtUpToCommit, where we don't refresh if there are no new pending transactions. This is just an optimization. I have removed it in the current revision of the patch.
szetszwo left a comment:
@sadanand48, thanks for the update. Please see the comments inlined.
@sadanand48, very sorry, this idea is actually bad, since the safeModeLogExecutor is a new feature from HDDS-14012 and I am not sure it is rock solid. Let's not use it. We may simply start a thread instead; see https://issues.apache.org/jira/secure/attachment/13081633/9953_review2.patch
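As a rough illustration of "simply start a thread", here is a minimal sketch under stated assumptions. It is not the content of 9953_review2.patch: the class `SafeModeRefreshThread` and its helper methods are hypothetical, and the safemode check and refresh are plain callbacks. The thread loops while in safemode, refreshes, sleeps for the interval, and exits on its own once safemode is left or it is interrupted.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper: a plain daemon thread that refreshes while in safemode. */
final class SafeModeRefreshThread {
  private SafeModeRefreshThread() { }

  static Thread start(BooleanSupplier inSafeMode, Runnable refreshRules,
      long intervalMillis) {
    Thread t = new Thread(() -> {
      try {
        // The loop ends by itself once safemode has been exited.
        while (inSafeMode.getAsBoolean()) {
          refreshRules.run();
          Thread.sleep(intervalMillis);
        }
      } catch (InterruptedException e) {
        // Treat interruption as a request to stop; keep the flag set.
        Thread.currentThread().interrupt();
      }
    }, "SafeModeRefreshThread");
    t.setDaemon(true); // never block JVM shutdown
    t.start();
    return t;
  }

  /** Wait for the thread to exit; returns true if it is no longer alive. */
  static boolean awaitExit(Thread t, long timeoutMillis) {
    try {
      t.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return !t.isAlive();
  }
}
```

Compared with a shared executor, the thread has no dependency on another feature's lifecycle, which matches the concern raised above.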
szetszwo left a comment:
@sadanand48 , thanks for the update! Please see the comments inlined.
szetszwo left a comment:
+1 the change looks good.
The failed test (TestContainerStateMachine) seems unrelated. Please take a look.
What changes were proposed in this pull request?
Periodic refresh: run the refresh on a ~5s (configurable) schedule instead of on every applyTransaction / refresh(false) path.
https://issues.apache.org/jira/browse/HDDS-14868
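The "~5s (configurable)" interval in the description could be resolved along the following lines. This is only a sketch: the property key is hypothetical and is not an actual Ozone configuration name, and `java.util.Properties` stands in for the real configuration object. Missing, malformed, or non-positive values fall back to the 5 second default.

```java
import java.time.Duration;
import java.util.Properties;

/** Hypothetical helper that resolves the safemode refresh interval. */
final class RefreshIntervalConfig {
  // Hypothetical key for illustration only; not a real Ozone property.
  static final String KEY = "example.safemode.refresh.interval.ms";
  static final Duration DEFAULT = Duration.ofSeconds(5);

  private RefreshIntervalConfig() { }

  static Duration refreshInterval(Properties conf) {
    String raw = conf.getProperty(KEY);
    if (raw == null || raw.trim().isEmpty()) {
      return DEFAULT; // not configured
    }
    try {
      long millis = Long.parseLong(raw.trim());
      // Reject zero and negative intervals instead of busy-looping.
      return millis > 0 ? Duration.ofMillis(millis) : DEFAULT;
    } catch (NumberFormatException e) {
      return DEFAULT; // malformed value
    }
  }
}
```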