HDDS-9592. Replication Manager: Save UNHEALTHY replicas with highest BCSID for a QUASI_CLOSED container #5794

Merged
siddhantsangwan merged 5 commits into apache:master from siddhantsangwan:HDDS-9592
Dec 20, 2023

@siddhantsangwan
Contributor

@siddhantsangwan siddhantsangwan commented Dec 14, 2023

What changes were proposed in this pull request?

A QUASI_CLOSED container may have UNHEALTHY replicas whose sequence ID matches the container's, while no healthy replica has that sequence ID. Such UNHEALTHY replicas cannot be deleted and must be preserved.

If the DN hosting such an UNHEALTHY replica is put into decommission, the decommission stays blocked because the UNHEALTHY replica cannot be lost, yet RM currently does nothing about it. This PR makes RM act on these vulnerable UNHEALTHY replicas so that decommission can succeed.

Changes introduced:

  1. A new handler, VulnerableUnhealthyReplicasHandler, uses the existing replicaCount.getVulnerableUnhealthyReplicas API to find such UNHEALTHY replicas. If any are found, the container is marked as under-replicated and added to the under-replication queue.
  2. The under-replicated container is then handled in RatisUnderReplicationHandler, which tries to find a new target DN for each UNHEALTHY replica and sends replicate commands. The logic is similar to what legacy RM already does. Some additional changes were required to correctly determine the used and excluded nodes passed to the placement policy API when finding target DNs.
  3. Changes to the decommission monitor so that both RMs use the replicaSet.isHealthyEnoughForOffline API.
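
The selection rule from the first point can be sketched in isolation. This is a simplified illustration and not the Ozone implementation: the class VulnerableReplicaSketch, its Replica and State types, and the method findVulnerableUnhealthy are all hypothetical names, and the sketch considers only replica state and sequence ID, ignoring anything else the real getVulnerableUnhealthyReplicas may take into account.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; none of these types come from Ozone.
class VulnerableReplicaSketch {

  enum State { QUASI_CLOSED, UNHEALTHY }

  static final class Replica {
    final String datanode;
    final State state;
    final long sequenceId;

    Replica(String datanode, State state, long sequenceId) {
      this.datanode = datanode;
      this.state = state;
      this.sequenceId = sequenceId;
    }
  }

  /**
   * Returns the UNHEALTHY replicas whose sequence ID equals the container's,
   * but only when no healthy replica carries that sequence ID.
   */
  static List<Replica> findVulnerableUnhealthy(
      long containerSequenceId, List<Replica> replicas) {
    // A healthy replica with the container's sequence ID already preserves
    // the data, so in that case no UNHEALTHY replica is treated as vulnerable.
    for (Replica r : replicas) {
      if (r.state != State.UNHEALTHY
          && r.sequenceId == containerSequenceId) {
        return new ArrayList<>();
      }
    }
    // Otherwise every UNHEALTHY replica at the container's sequence ID
    // must be kept, and is a candidate for re-replication.
    List<Replica> vulnerable = new ArrayList<>();
    for (Replica r : replicas) {
      if (r.state == State.UNHEALTHY
          && r.sequenceId == containerSequenceId) {
        vulnerable.add(r);
      }
    }
    return vulnerable;
  }
}
```

Under these assumptions, a caller would treat a non-empty result as the signal to queue the container as under-replicated, which is the role the new handler plays in this PR.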

The third point above essentially resolves the issue "ReplicationManager: Unhealthy replicas of a sufficiently replicated container can block decommissioning". If required, it can be split off into its own PR, since this one is quite large.

Need to add some more tests to TestRatisUnderReplicationHandler.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9592

How was this patch tested?

New tests.

@siddhantsangwan
Contributor Author

I've added some tests to TestRatisUnderReplicationHandler as well. Ideally there should be more, but if the reviewers are satisfied then we can take that up in another Jira.

Contributor

@sodonnel sodonnel left a comment

Thanks for working on this. LGTM.

@kerneltime kerneltime requested a review from errose28 December 18, 2023 17:16
@siddhantsangwan
Contributor Author

@sodonnel Thanks for the review. Merging to the master branch.

@siddhantsangwan siddhantsangwan merged commit faa1990 into apache:master Dec 20, 2023
symious pushed a commit that referenced this pull request Dec 20, 2023
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Feb 1, 2024
…ith highest BCSID for a QUASI_CLOSED container (apache#5794)

(cherry picked from commit faa1990)
Change-Id: Id7e86699c64a48eaa0005018cfaffb348243aeef