Skip to content

HDDS-9075. QUASI_CLOSED Replica with incorrect sequenceID should be deleted by SCM.#5120

Merged
arp7 merged 2 commits intoapache:masterfrom
nandakumar131:HDDS-9075
Jul 28, 2023
Merged

HDDS-9075. QUASI_CLOSED Replica with incorrect sequenceID should be deleted by SCM.#5120
arp7 merged 2 commits intoapache:masterfrom
nandakumar131:HDDS-9075

Conversation

@nandakumar131
Copy link
Contributor

What changes were proposed in this pull request?

When a Container is in CLOSED state with one of its replicas in QUASI_CLOSED state and has incorrect sequenceID, that replica should be deleted by SCM as it's inconsistent replica.

What is the link to the Apache JIRA

HDDS-9075

How was this patch tested?

Added unit test to verify if SCM is re-replicating the container when there is an inconsistent quasi_closed container and then deleting the inconsistent replica when the container is in CLOSED state.

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nandakumar131 for working on this. I took a quick look. The general idea and changes in the new Replication Manager look good. It would be great to have another test case in TestRatisContainerReplicaCount for the new behaviour. I'll do a comprehensive review tomorrow.

@errose28 errose28 self-requested a review July 26, 2023 18:56
Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but a test like @siddhantsangwan mentioned would be good to have as well. How long do we plan to keep updating the LegacyReplicationManager before removing it entirely?

@siddhantsangwan
Copy link
Contributor

It's important to note the impact of changing RatisContainerReplicaCount:

  1. We are changing the definition of the "unhealthy" term from meaning replicas with UNHEALTHY state to replicas with either UNHEALTHY state, or QUASI_CLOSED and sequence id not matching the container's sequence id. There's extensive existing documentation where "unhealthy" is meant for only UNHEALTHY replicas, so we'll have to change that. Perhaps in a future PR.
  2. We have a RatisUnhealthyReplicationCheckHandler whose purpose was to detect over/under replication when only UNHEALTHY replicas are present in the cluster. Since we're changing the getUnhealthyReplicaCount() API in RatisContainerReplicaCount such that it also returns QUASI_CLOSED replicas with incorrect sequence, this class's logic will now also include these replicas. This also affects RatisUnderReplicationHandler.
  3. RatisOverReplicationHandler already has logic to try and delete unhealthy replicas or replicas whose state doesn't match the container's state first. So in theory, this class should work well with this change.

For confidence, we can add unit tests to TestRatisContainerReplicaCount, TestRatisUnderReplicationHandler, TestRatisOverReplicationHandler, and TestRatisReplicationCheckHandler. Ideally something in TestRatisUnhealthyReplicationCheckHandler too. For integration testing, we have testContainerReplicationWithLegacyReplicationManagerDisabled in TestContainerReplication (though having QUASI_CLOSED replicas in an integration test is better suited for a different PR).

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need to modify the following part in RatisUnderReplicationHandler to ensure replication for any QUASI_CLOSED replicas with wrong sequence IDs when they're the only replicas remaining. They will be queued by RatisUnhealthyReplicationCheckHandler.

    Predicate<ContainerReplica> predicate;
    if (replicaCount.getHealthyReplicaCount() == 0) {
      predicate = replica -> replica.getState() == State.UNHEALTHY;
    } else {
      predicate = replica -> replica.getState() == State.CLOSED ||
          replica.getState() == State.QUASI_CLOSED;
    }

@nandakumar131
Copy link
Contributor Author

nandakumar131 commented Jul 27, 2023

@errose28, thanks for the review.

How long do we plan to keep updating the LegacyReplicationManager before removing it entirely?

We can mark the LegacyReplicationManager as deprecated and remove it after some time. We have to keep updating the LegacyReplicationManager until we have the config to switch back to the LegacyReplicationManager.

@nandakumar131
Copy link
Contributor Author

@siddhantsangwan, thanks for the review.

We are changing the definition of the "unhealthy" term from meaning replicas with UNHEALTHY state to replicas with either UNHEALTHY state, or QUASI_CLOSED and sequence id not matching the container's sequence id.

The initial idea was to make SCM inform Datanode about the mismatch and the Datanode move the replica to UNHEALTHY state. This way Datanode will be the one who marks the replica as UNHEALTHY and SCM can handle the Replica with incorrect sequence Id same as unhealthy replica.

The current change is to unblock and fix the issue without making large code change and break the test.

The definition of UNHEALTHY replica is that it's missing some data, under this definition, Replica with incorrect sequence Id for a CLOSED container is also UNHEALTHY.

I will update the PR with additional tests.

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nandakumar131 Thanks for adding the tests and keeping changes minimal. Looks good to me! Waiting for CI to be green.

@arp7 arp7 merged commit 52edf7a into apache:master Jul 28, 2023
@arp7
Copy link
Contributor

arp7 commented Jul 28, 2023

Merged based on @siddhantsangwan's LGTM above.

@nandakumar131 nandakumar131 deleted the HDDS-9075 branch July 28, 2023 04:20
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Sep 20, 2023
… should be deleted by SCM. (apache#5120)

(cherry picked from commit 52edf7a)
Change-Id: I8eb312eda990fec455ba4d31f56960b7d3b9250e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants