Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-7058. EC: ReplicationManager - Implement ratis container replication check handler #3802

Merged
merged 6 commits into from
Oct 19, 2022

Conversation

sodonnel
Copy link
Contributor

@sodonnel sodonnel commented Oct 6, 2022

What changes were proposed in this pull request?

Create a handler for the new replication manager to process Ratis container and detect under / over / mis-replication issues.

The logic is largely unchanged from the LegacyReplication manager - simply packaged into the new "handler" structure.

At the moment, this code will not be executed by the new replication manager, as all non-EC container will be directed to the Legacy Replication Manager for processing.

This Jira is part of the work to remove the Legacy Replication Manager.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-7058

How was this patch tested?

New unit tests

@kerneltime
Copy link
Contributor

@aswinshakil

Copy link
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments left inline. Have some confusion on certain flows & race condition it would be really great if you could clear my doubts there. Haven't checked unit tests yet.

*
* @return True if the container is over replicated, false otherwise.
*/
public boolean isOverReplicated(boolean includePendingDelete) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be changed to getExcessRedundancyCanBeCalled(includePending)>0 to avoid redundancy of logic.

*
* @return True if the container is over replicated, false otherwise.
*/
public boolean isOverReplicated(boolean includePendingDelete) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be changed to getExcessRedundancyCanBeCalled(includePending)>0 to avoid redundancy of logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I have changed this.

@@ -106,6 +106,9 @@ public static class UnHealthyResult extends ContainerHealthResult {
private final int remainingRedundancy;
private final boolean dueToDecommission;
private final boolean sufficientlyReplicatedAfterPending;
private boolean dueToMisReplication = false;
private boolean isMisReplicated = false;
private boolean isMisReplicatedAfterPending = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have another constructor which initializes the following arguments?
public UnderReplicatedHealthResult(ContainerInfo containerInfo,
int remainingRedundancy, boolean dueToDecommission, boolean replicatedOkWithPending, boolean unrecoverable,boolean dueToMisReplication, boolean isMisReplicated, boolean isMisReplicatedAfterPending)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to change the existing constructor, as then we need to change it everywhere it is used. Adding a new constructor starts a bad pattern where each new parameter needs a new constructor, and what we really need is a builder.

At the moment I think these 3 parameters have good defaults for the common case and then using the settings when needed is a good compromise.

/**
* @return Return Excess Redundancy replica nums.
*/
public int getExcessRedundancy(boolean includePendingDelete) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a boolean with includePendingAdd as well. I see redundant duplicate code logic in sufficientlyReplicated & isOverReplicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea the logic is very similar. I have added a new private method both can call.

return false;
}

public ContainerHealthResult checkHealth(ContainerCheckRequest request) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to make this method as public?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably could add @VisibleForTesting if you want to add unit tests or could even change the access specifier using replication using reflection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will need to be public when we implement the under / over replication handler to process the under / over replicated queue. This is the same as in the ECHandler, where it has this same method public for that reason.

pendingDelete, requiredNodes, minReplicasForMaintenance);

ContainerPlacementStatus placementStatus =
getPlacementStatus(replicas, requiredNodes, Collections.EMPTY_LIST);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
getPlacementStatus(replicas, requiredNodes, Collections.EMPTY_LIST);
getPlacementStatus(replicas, requiredNodes, Collections.emptyList());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this.

replicaDns.add(op.getTarget());
} else if (op.getOpType() == ContainerReplicaOp.PendingOpType.DELETE) {
replicaDns.remove(op.getTarget());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there be a case where there is pending ADD & pending DELETE to the same node? Some kind of race condition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use a Map<DataNodeDetails,Integer> in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be able to happen, as if a node has a replica it cannot get another copy of it. For a delete to be scheduled it must have a copy which will prevent an add etc.

However I will change this to two IF statements rather than if .. else if

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sodonnel The handling logic looks good. I haven't checked the tests yet.

Comment on lines 260 to 264
public boolean isSufficientlyReplicated(boolean includePendingAdd) {
// Positive for under-rep, negative for over-rep
int delta = missingReplicas();
if (includePendingAdd) {
delta -= inFlightAdd;
}
return delta <= 0;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we deliberately not considering pending deletes here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are correct - I have missed this. We should be removing the inflight deletes as per the original method defined just above this one. I will fix this and modidy a test to validate it.

request.getReplicationQueue().enqueue(underHealth);
}
return true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Let's add a new line after this brace for better readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - I added that in.

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good! I just have 2 minor comments for the tests.

}

@Test
public void testOverReplicatedContainerDueToMaintenance() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're testing the HEALTHY case, let's change the test's name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I added IsHealthy to the end of it.

Set<ContainerReplica> replicas = createReplicas(container.containerID(),
Pair.of(IN_SERVICE, 0), Pair.of(IN_SERVICE, 0),
Pair.of(IN_SERVICE, 0), Pair.of(IN_MAINTENANCE, 0),
Pair.of(IN_MAINTENANCE, 2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The replica index should be 0 instead of 2 I think. (IN_MAINTENANCE, 0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, well spotted.

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending green CI.

Copy link
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sodonnel sodonnel merged commit 237a9a1 into apache:master Oct 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants