Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-16995: Configure replication behaviour for each replica type #2207

Merged
merged 5 commits into from
Feb 19, 2024

Conversation

pvcnt
Copy link
Contributor

@pvcnt pvcnt commented Jan 19, 2024

https://issues.apache.org/jira/browse/SOLR-16995

Description

I am resuming refactoring started earlier (#1928, #1981, #2039) with the goal of simplifying/centralising code related to replica types. The end goal would be to avoid having if (replicaType == Replica.Type.*) scattered across the codebase, and to make it easier to add new replica types.

Solution

This PR adds a new attribute to replica types that specify whether they should replicate from their leader.

Tests

Existing tests should pass, no change of behaviour.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide


// don't commit on leader version zero for PULL replicas as PULL should only get its index
// state from leader
boolean skipCommitOnLeaderVersionZero = switchTransactionLog;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we are in this method, either we are a TLOG replica and switchTransactionLog is true, hence skipCommitOnLeaderVersionZero is true, or we are a PULL replica and switchTransactionLog is false but we force skipCommitOnLeaderVersionZero to true below...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the case since @HoustonPutman's change in SOLR-14679, where the call from ZkController.rejoinShardLeaderElection() to startReplicationFromLeader() changed its parameter from false to true.

// stop replicate from old leader
zkController.stopReplicationFromLeader(coreName);
}
if (replicaType == Replica.Type.TLOG) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept this reference to Replica.Type, it's not related to replication.

@@ -1801,6 +1801,8 @@ private static Long readIntervalNs(String interval) {
// in case of TLOG replica, if leaderVersion = zero, don't do commit
// otherwise updates from current tlog won't copied over properly to the new tlog, leading to data
// loss
// don't commit on leader version zero for PULL replicas as PULL should only get its index
// state from leader
public static final String SKIP_COMMIT_ON_LEADER_VERSION_ZERO = "skipCommitOnLeaderVersionZero";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit dubious over how this parameter is used overall. When replicating from the leader, its value is always true. When recovering, it's true only for TLOG replicas (does it mean that the reason for setting it to true when replicating does not hold when recovering?).

}
log.debug(
"Replica {} skipping election because it's type is {}", coreZkNodeName, Type.PULL);
startReplicationFromLeader(coreName, false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weirdly, we were starting the replication for PULL replicas here, and lower for TLOG replicas. We were also starting unconditionally the replication for PULL replicas, even if we were entering in recovery mode (where on of the first actions is to stop the replication). Replication is now started for both replica types at the same place.

@pvcnt pvcnt force-pushed the replicate-support branch 2 times, most recently from 9e52cb8 to a3b7d72 Compare January 19, 2024 09:26
solr/core/src/java/org/apache/solr/core/CoreContainer.java Outdated Show resolved Hide resolved
Type(boolean leaderEligible, String numReplicasPropertyName) {
Type(
boolean leaderEligible,
boolean requireTransactionLog,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems leaderEligible and requireTransactionLog always have the same value (tru for TLOG and NRT, false for PULL). Do we need two parameters?

(I know ZERO replicas - see SIP-20 - in their current form might differ, but we're not there yet)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely prefer to separate by semantics, instead of assuming one is always true when the other is. Otherwise, we might as well have isNrt, isPull and isTlog booleans. :)


public final boolean leaderEligible;
public final boolean requireTransactionLog;
public final boolean replicateFromLeader;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need Javadoc for these variables.
Regarding replicaFromLeader, I believe NRT also replicates from leader during recovery (but not in steady state)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, all replica types do replica during recovery. This property might not be named correctly, do you have a suggestion for a better name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added Javadoc.

@@ -417,8 +419,10 @@ private final void doReplicateOnlyRecovery(SolrCore core) throws InterruptedExce
log.error("Error while trying to recover. core={}", coreName, e);
} finally {
if (successfulRecovery) {
log.info("Restarting background replicate from leader process");
zkController.startReplicationFromLeader(coreName, false);
if (replicaType.replicateFromLeader) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were we previously (before this PR) starting replication even for NRT that does not need it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because this method is only called for PULL replicas. But adding this as a safeguard nonetheless.

@@ -583,7 +587,7 @@ public final void doSyncOrReplicateRecovery(SolrCore core) throws Exception {
}
}

if (replicaType == Replica.Type.TLOG) {
if (replicaType.replicateFromLeader) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In many places (this class and other classes) the test for the replica being TLOG is replaced by the replica replicating from the leader (i.e. TLOG or PULL).
Are we therefore doing for PULL replicas things that previously were not done, or is the change the result of moving code execution around (but couldn't really find things we no longer do with this PR, except the unification of starting replication for TLOG and PULL in ZkController L. 1394).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because this method is only called for NRT and TLOG replicas.


// don't commit on leader version zero for PULL replicas as PULL should only get its index
// state from leader
boolean skipCommitOnLeaderVersionZero = switchTransactionLog;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the case since @HoustonPutman's change in SOLR-14679, where the call from ZkController.rejoinShardLeaderElection() to startReplicationFromLeader() changed its parameter from false to true.

@@ -2400,15 +2402,15 @@ public void rejoinShardLeaderElection(SolrParams params) {

try (SolrCore core = cc.getCore(coreName)) {
Replica.Type replicaType = core.getCoreDescriptor().getCloudDescriptor().getReplicaType();
if (replicaType == Type.TLOG) {
if (replicaType.replicateFromLeader) {
String leaderUrl =
getLeader(
core.getCoreDescriptor().getCloudDescriptor(), cloudConfig.getLeaderVoteWait());
if (!leaderUrl.equals(ourUrl)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now checked for PULL replicas as well and is always true.
Should the if test the replica's leaderEligible flag in addition to replicateFromLeader?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are in rejoinShardLeaderElection, I assumed we can ever enter in this method for a PULL (or any non leader-eligible) replica, or can we?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the worst case, it doesn't hurt, since it is only used to verify whether a replica is the leader; PULL replica will not be, we would just have wasted time checking this. But I don't think we can ever end up in this method for a PULL replica.

public final boolean requireTransactionLog;

/**
* Whether replicas of this type continuously replica from the leader, if they are not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo replica/replicate

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

committed the change

@murblanc murblanc merged commit 2ec0cbf into apache:main Feb 19, 2024
4 checks passed
@pvcnt pvcnt deleted the replicate-support branch February 20, 2024 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants