HDDS-9075. QUASI_CLOSED Replica with incorrect sequenceID should be deleted by SCM.#5120
arp7 merged 2 commits into apache:master
siddhantsangwan
left a comment
Thanks @nandakumar131 for working on this. I took a quick look. The general idea and changes in the new Replication Manager look good. It would be great to have another test case in TestRatisContainerReplicaCount for the new behaviour. I'll do a comprehensive review tomorrow.
errose28
left a comment
LGTM but a test like @siddhantsangwan mentioned would be good to have as well. How long do we plan to keep updating the LegacyReplicationManager before removing it entirely?
It's important to note the impact of changing
For confidence, we can add unit tests to
siddhantsangwan
left a comment
We will need to modify the following part in RatisUnderReplicationHandler to ensure replication for any QUASI_CLOSED replicas with wrong sequence IDs when they're the only replicas remaining. They will be queued by RatisUnhealthyReplicationCheckHandler.
    Predicate<ContainerReplica> predicate;
    if (replicaCount.getHealthyReplicaCount() == 0) {
      predicate = replica -> replica.getState() == State.UNHEALTHY;
    } else {
      predicate = replica -> replica.getState() == State.CLOSED ||
          replica.getState() == State.QUASI_CLOSED;
    }
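One way the requested change could look is sketched below. This is only an illustration under stated assumptions, not the code that was merged: `SourceSelectionSketch`, its nested `State` and `Replica` types, `isHealthy`, and `selectSources` are hypothetical stand-ins rather than the real SCM classes, and "healthy" is assumed to mean CLOSED, or QUASI_CLOSED with a sequence ID equal to the container's. The only change relative to the snippet above is that the no-healthy-replica branch also accepts QUASI_CLOSED replicas.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical, simplified stand-in for the source selection logic;
// none of these types are the real Ozone SCM classes.
final class SourceSelectionSketch {
  enum State { CLOSED, QUASI_CLOSED, UNHEALTHY }

  static final class Replica {
    final String datanode;
    final State state;
    final long sequenceId;

    Replica(String datanode, State state, long sequenceId) {
      this.datanode = datanode;
      this.state = state;
      this.sequenceId = sequenceId;
    }
  }

  private SourceSelectionSketch() { }

  // Assumed definition: CLOSED replicas are healthy, and QUASI_CLOSED
  // replicas are healthy only when their sequence ID matches the container's.
  static boolean isHealthy(Replica r, long containerSequenceId) {
    return r.state == State.CLOSED
        || (r.state == State.QUASI_CLOSED
            && r.sequenceId == containerSequenceId);
  }

  static List<Replica> selectSources(List<Replica> replicas,
      long containerSequenceId) {
    boolean anyHealthy = replicas.stream()
        .anyMatch(r -> isHealthy(r, containerSequenceId));

    Predicate<Replica> predicate;
    if (!anyHealthy) {
      // No healthy replica is left: fall back to UNHEALTHY replicas and to
      // QUASI_CLOSED replicas (which must have a wrong sequence ID here).
      predicate = r -> r.state == State.UNHEALTHY
          || r.state == State.QUASI_CLOSED;
    } else {
      // Same as the existing branch: CLOSED or QUASI_CLOSED replicas.
      predicate = r -> r.state == State.CLOSED
          || r.state == State.QUASI_CLOSED;
    }
    return replicas.stream().filter(predicate).collect(Collectors.toList());
  }
}
```

With this shape, a container whose only remaining replica is QUASI_CLOSED with a stale sequence ID still yields a replication source, so it would not be skipped by the under-replication handler.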
@errose28, thanks for the review.
We can mark the LegacyReplicationManager as deprecated and remove it after some time. We have to keep updating the LegacyReplicationManager for as long as we have the config to switch back to it.
@siddhantsangwan, thanks for the review.
The initial idea was to make SCM inform the Datanode about the mismatch and have the Datanode move the replica to
The current change is to unblock and fix the issue without making a large code change or breaking the tests.
The definition of
I will update the PR with additional tests.
siddhantsangwan
left a comment
@nandakumar131 Thanks for adding the tests and keeping changes minimal. Looks good to me! Waiting for CI to be green.
Merged based on @siddhantsangwan's LGTM above.
… should be deleted by SCM. (apache#5120) (cherry picked from commit 52edf7a) Change-Id: I8eb312eda990fec455ba4d31f56960b7d3b9250e
What changes were proposed in this pull request?
When a Container is in CLOSED state and one of its replicas is in QUASI_CLOSED state with an incorrect sequenceID, that replica should be deleted by SCM because it is an inconsistent replica.
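The condition described above can be sketched as a small predicate. This is a minimal illustration, not the merged implementation: `InconsistentReplicaSketch` and `shouldDelete` are hypothetical names, and "incorrect sequenceID" is assumed to mean a replica sequence ID that differs from the container's.

```java
// Hypothetical stand-in for the check; not the real Ozone SCM classes.
final class InconsistentReplicaSketch {
  enum State { CLOSED, QUASI_CLOSED, UNHEALTHY }

  private InconsistentReplicaSketch() { }

  // Returns true when the container is CLOSED and the replica is
  // QUASI_CLOSED with a sequence ID different from the container's
  // (assumed meaning of "incorrect sequenceID").
  static boolean shouldDelete(State containerState, long containerSequenceId,
      State replicaState, long replicaSequenceId) {
    return containerState == State.CLOSED
        && replicaState == State.QUASI_CLOSED
        && replicaSequenceId != containerSequenceId;
  }
}
```

The sketch covers only the condition itself. It does not model ordering, such as re-replicating the container before the inconsistent replica is removed.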
What is the link to the Apache JIRA?
HDDS-9075
How was this patch tested?
Added a unit test to verify that SCM re-replicates the container when there is an inconsistent QUASI_CLOSED replica and then deletes the inconsistent replica when the container is in CLOSED state.