-
Notifications
You must be signed in to change notification settings - Fork 627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-16995: Configure replication behaviour for each replica type #2207
Conversation
|
||
// don't commit on leader version zero for PULL replicas as PULL should only get its index | ||
// state from leader | ||
boolean skipCommitOnLeaderVersionZero = switchTransactionLog; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we are in this method, either we are a TLOG replica and switchTransactionLog
is true, hence skipCommitOnLeaderVersionZero
is true, or we are a PULL replica and switchTransactionLog
is false but we force skipCommitOnLeaderVersionZero
to true below...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be the case since @HoustonPutman's change in SOLR-14679, where the call from ZkController.rejoinShardLeaderElection()
to startReplicationFromLeader()
changed its parameter from false
to true
.
// stop replicate from old leader | ||
zkController.stopReplicationFromLeader(coreName); | ||
} | ||
if (replicaType == Replica.Type.TLOG) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kept this reference to Replica.Type
, it's not related to replication.
@@ -1801,6 +1801,8 @@ private static Long readIntervalNs(String interval) { | |||
// in case of TLOG replica, if leaderVersion = zero, don't do commit | |||
// otherwise updates from current tlog won't copied over properly to the new tlog, leading to data | |||
// loss | |||
// don't commit on leader version zero for PULL replicas as PULL should only get its index | |||
// state from leader | |||
public static final String SKIP_COMMIT_ON_LEADER_VERSION_ZERO = "skipCommitOnLeaderVersionZero"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit dubious over how this parameter is used overall. When replicating from the leader, its value is always true. When recovering, it's true only for TLOG replicas (does it mean that the reason for setting it to true when replicating does not hold when recovering?).
} | ||
log.debug( | ||
"Replica {} skipping election because it's type is {}", coreZkNodeName, Type.PULL); | ||
startReplicationFromLeader(coreName, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weirdly, we were starting the replication for PULL replicas here, and lower for TLOG replicas. We were also starting unconditionally the replication for PULL replicas, even if we were entering in recovery mode (where on of the first actions is to stop the replication). Replication is now started for both replica types at the same place.
9e52cb8
to
a3b7d72
Compare
a3b7d72
to
8814e4b
Compare
Type(boolean leaderEligible, String numReplicasPropertyName) { | ||
Type( | ||
boolean leaderEligible, | ||
boolean requireTransactionLog, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems leaderEligible and requireTransactionLog always have the same value (tru for TLOG and NRT, false for PULL). Do we need two parameters?
(I know ZERO replicas - see SIP-20 - in their current form might differ, but we're not there yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely prefer to separate by semantics, instead of assuming one is always true when the other is. Otherwise, we might as well have isNrt, isPull and isTlog booleans. :)
|
||
public final boolean leaderEligible; | ||
public final boolean requireTransactionLog; | ||
public final boolean replicateFromLeader; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need Javadoc for these variables.
Regarding replicaFromLeader
, I believe NRT
also replicates from leader during recovery (but not in steady state)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, all replica types do replica during recovery. This property might not be named correctly, do you have a suggestion for a better name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added Javadoc.
@@ -417,8 +419,10 @@ private final void doReplicateOnlyRecovery(SolrCore core) throws InterruptedExce | |||
log.error("Error while trying to recover. core={}", coreName, e); | |||
} finally { | |||
if (successfulRecovery) { | |||
log.info("Restarting background replicate from leader process"); | |||
zkController.startReplicationFromLeader(coreName, false); | |||
if (replicaType.replicateFromLeader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were we previously (before this PR) starting replication even for NRT
that does not need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because this method is only called for PULL replicas. But adding this as a safeguard nonetheless.
@@ -583,7 +587,7 @@ public final void doSyncOrReplicateRecovery(SolrCore core) throws Exception { | |||
} | |||
} | |||
|
|||
if (replicaType == Replica.Type.TLOG) { | |||
if (replicaType.replicateFromLeader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In many places (this class and other classes) the test for the replica being TLOG
is replaced by the replica replicating from the leader (i.e. TLOG
or PULL
).
Are we therefore doing for PULL
replicas things that previously were not done, or is the change the result of moving code execution around (but couldn't really find things we no longer do with this PR, except the unification of starting replication for TLOG
and PULL
in ZkController
L. 1394).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because this method is only called for NRT and TLOG replicas.
|
||
// don't commit on leader version zero for PULL replicas as PULL should only get its index | ||
// state from leader | ||
boolean skipCommitOnLeaderVersionZero = switchTransactionLog; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be the case since @HoustonPutman's change in SOLR-14679, where the call from ZkController.rejoinShardLeaderElection()
to startReplicationFromLeader()
changed its parameter from false
to true
.
@@ -2400,15 +2402,15 @@ public void rejoinShardLeaderElection(SolrParams params) { | |||
|
|||
try (SolrCore core = cc.getCore(coreName)) { | |||
Replica.Type replicaType = core.getCoreDescriptor().getCloudDescriptor().getReplicaType(); | |||
if (replicaType == Type.TLOG) { | |||
if (replicaType.replicateFromLeader) { | |||
String leaderUrl = | |||
getLeader( | |||
core.getCoreDescriptor().getCloudDescriptor(), cloudConfig.getLeaderVoteWait()); | |||
if (!leaderUrl.equals(ourUrl)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now checked for PULL
replicas as well and is always true
.
Should the if
test the replica's leaderEligible
flag in addition to replicateFromLeader
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are in rejoinShardLeaderElection
, I assumed we can ever enter in this method for a PULL (or any non leader-eligible) replica, or can we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the worst case, it doesn't hurt, since it is only used to verify whether a replica is the leader; PULL replica will not be, we would just have wasted time checking this. But I don't think we can ever end up in this method for a PULL replica.
public final boolean requireTransactionLog; | ||
|
||
/** | ||
* Whether replicas of this type continuously replica from the leader, if they are not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo replica/replicate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
committed the change
https://issues.apache.org/jira/browse/SOLR-16995
Description
I am resuming refactoring started earlier (#1928, #1981, #2039) with the goal of simplifying/centralising code related to replica types. The end goal would be to avoid having
if (replicaType == Replica.Type.*)
scattered across the codebase, and to make it easier to add new replica types.Solution
This PR adds a new attribute to replica types that specify whether they should replicate from their leader.
Tests
Existing tests should pass, no change of behaviour.
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.