Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-6589. Add a new replication manager and change the existing one to legacy #3352

Merged
merged 8 commits into from May 4, 2022

Conversation

JacksonYao287
Copy link
Contributor

What changes were proposed in this pull request?

this is a initial patch for refactoring RM, in this patch
1 Rename the existing RM class to LegacyReplicationManager, and expose the processContainer() method as protected or public. Remove the scheduling from this RM.
2 Create a new replication Manager, called ReplicationManager, then have it scheduled periodically like the current one.
3 When the new RM processes a container, if it is EC, it goes down the new path. If it is not EC, we call LegacyReplicationManager.processContainer(...).

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6589

How was this patch tested?

current UT

@JacksonYao287
Copy link
Contributor Author

@sodonnel PTAL, thanks!

@kaijchen
Copy link
Contributor

@JacksonYao287 thanks for the patch. Seems the CI failure is related to the change, please check.

@sodonnel
Copy link
Contributor

This is a very big patch that moves the ReplicationManager class to LegacyReplicationManager, but there is also a lot of change inside the moved class. It would be better if this PR was a series of commits rather than one large one (this is true for any large PR), as it makes it much easier to review. Eg 1 commit for moving ReplicationManager -> LegacyReplicationManager (probably a refactor in Intellij), then a commit for each logical set of changes. It doesn't need to be a series of Jiras, just a series of commits in the PR that are not squashed until the end.

Could you re-post this as a series of commits, or do you not have the changes like that on your local branch?

@kerneltime
Copy link
Contributor

This is a very big patch that moves the ReplicationManager class to LegacyReplicationManager, but there is also a lot of change inside the moved class. It would be better if this PR was a series of commits rather than one large one (this is true for any large PR), as it makes it much easier to review. Eg 1 commit for moving ReplicationManager -> LegacyReplicationManager (probably a refactor in Intellij), then a commit for each logical set of changes. It doesn't need to be a series of Jiras, just a series of commits in the PR that are not squashed until the end.

Could you re-post this as a series of commits, or do you not have the changes like that on your local branch?

+1 Refactors without change in logic is best done as a set followed by any other code changes.

@JacksonYao287
Copy link
Contributor Author

JacksonYao287 commented Apr 27, 2022

thanks @sodonnel @kerneltime @kaijchen for the review!

Could you re-post this as a series of commits, or do you not have the changes like that on your local branch?

i do have those commits in my local branch earlier, but I accidentally squash them before creating this patch, so it is a little hard to find them back now.

though this is a big patch , the logic in this patch seems clear and not complex to review. the steps in this patch are as follows:

1 after check again , i remove "ReplicationActivityStatus.java" and "ReplicationActivityStatusMXBean.java" from this project, since they seem not used anywhere.

2 move "ReplicationManager.java" to the directory "hdds.scm/container/replication", where "ReplicationActivityStatus.java" and "ReplicationActivityStatusMXBean.java" existed, so all the replication manager related work will be in this directory.

3 copy "ReplicationManager.java" to "LegacyReplicationManager.java", so we have to two RM class now.

4 for "LegacyReplicationManager.java", i do not let it implement the "SCMService" interface, and remove all the scheduling logic from it , so now, it is only used to process non-EC container.

5 for "ReplicationManager.java" , i only keep the logic of scheduling and inject the instance of "LegacyReplicationManager " into it to process non-EC container.

6 for those public functions in ReplicatioManager class, which are called from other classes, i just keep them in the new ReplicatioManager class , and delegate the calls to LegacyReplicationManager class. these functions can be refactored or rewrite in a separate patch in the future.

7 fix all the influenced test case.

so , when reviewing , please pay more attention to "LegacyReplicationManager.java" and "ReplicationManager.java" , other parts of this patch almost have nothing to do with logic, they are all adaptive work.

@kaijchen
Copy link
Contributor

i do have those commits in my local branch earlier, but I accidentally squash them before creating this patch, so it is a little hard to find them back now.

You can try git reflog.

@JacksonYao287
Copy link
Contributor Author

i have force push this patch with a series of commits, PTAL! @kerneltime @sodonnel @kaijchen

@JacksonYao287 JacksonYao287 self-assigned this Apr 27, 2022
MoveDataNodePair mp)
throws ContainerNotFoundException, NodeNotFoundException {
CompletableFuture<MoveResult> ret = new CompletableFuture<>();
if (!isRunning()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the balancer going to keep working OK if this RM thread is not running any more? From this check, it must have needed / wanted it to be running before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the legacy RM does not have any scheduling logic, it is just a container processor now and does not know whether the new RM is running. the check is needed, and has been moved to the new RM , please see
the check in new RM

@@ -17,12 +17,29 @@
*/
package org.apache.hadoop.hdds.scm.container.replication;

import com.google.common.annotations.VisibleForTesting;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is showing as a rename from ReplicationActivityStatusMXBean.java → ...container/replication/InflightAction.java, which is a bit strange. Should this InflightAction class not be a new file? Something strange has happened as you have moved things around I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea , i notice this too.
as per our discussion , we will extract InFlightReplicaion and InFlightDeletion into a standalone class with associated test class, so i extract inflightAction a standalone class here for later refactor.

i just delete ReplicationActivityStatusMXBean.java in the commit b398eef
and add InflightAction.java in another commit f0c5034, the options are clear in these two commits. but in the summary of Filie changed, it show ReplicationActivityStatusMXBean.java is moved to InflightAction.java, i am not sure what happens here, do you have any idea about this?

Copy link
Contributor

@sodonnel sodonnel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good to me. I just have one comment on changing a notify() to interrupt(), but aside from that, it looks good.

if (t instanceof InterruptedException) {
Thread.currentThread().interrupt();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though we know the thread is exiting, I think you are still supposed to keep the line Thread.currentThread().interrupt(); incase something else needs it before the thread exits.

Copy link
Contributor

@sodonnel sodonnel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for taking the changes on.

@JacksonYao287
Copy link
Contributor Author

@sodonnel thanks for the review and comments.
seems there are some flaky test case failed. pending for a clean CI.

@umamaheswararao umamaheswararao merged commit c99fd20 into apache:master May 4, 2022
@umamaheswararao
Copy link
Contributor

Great work @JacksonYao287 and thanks for @sodonnel for thorough reviews !!!

@JacksonYao287
Copy link
Contributor Author

thanks @sodonnel for the review and @umamaheswararao for the merge

@JacksonYao287 JacksonYao287 deleted the HDDS-6589 branch May 5, 2022 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants