Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RATIS-1769. Avoid changing priorities in TransferCommand unless necessary #808

Merged
merged 23 commits into from
Feb 21, 2023

Conversation

kaijchen
Copy link
Contributor

@kaijchen kaijchen commented Jan 18, 2023

What changes were proposed in this pull request?

This is a followup of RATIS-1762.
Try to avoid changing priorities before transfer leadership in TransferCommand.
It will fallback to "transfer leadership by changing priority" for backward compatibility.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-1769

How was this patch tested?

Integration test: ElectionCommandIntegrationTest#testElectionTransferCommand() and ElectionCommandIntegrationTest#testElectionTransferCommandToHigherPriority()

Manually

Transfer between peers with same priority. (new feature)

$ bin/ratis sh election transfer -peers 127.0.0.1:10024,127.0.0.1:10124,127.0.0.1:11124 -address 127.0.0.1:10124
[main] INFO org.reflections.Reflections - Reflections took 122 ms to scan 1 urls, producing 5 keys and 18 values
[main] WARN org.apache.ratis.metrics.MetricRegistries - Found multiple MetricRegistries implementations: class org.apache.ratis.metrics.impl.MetricRegistriesImpl, class org.apache.ratis.metrics.dropwizard3.Dm3MetricRegistriesImpl. Using first found implementation: org.apache.ratis.metrics.impl.MetricRegistriesImpl@1e67a849
Transferring leadership to peer n1 with address 127.0.0.1:10124
Transferring leadership initiated

Backward compatibility

Ratis shell version: 3.0.0-SNAPSHOT.
Ratis server version: 2.4.1.

$ bin/ratis sh election transfer -peers 127.0.0.1:10024,127.0.0.1:10124,127.0.0.1:11124 -address 127.0.0.1:10124
[main] INFO org.reflections.Reflections - Reflections took 135 ms to scan 1 urls, producing 5 keys and 18 values
[main] WARN org.apache.ratis.metrics.MetricRegistries - Found multiple MetricRegistries implementations: class org.apache.ratis.metrics.impl.MetricRegistriesImpl, class org.apache.ratis.metrics.dropwizard3.Dm3MetricRegistriesImpl. Using first found implementation: org.apache.ratis.metrics.impl.MetricRegistriesImpl@1e67a849
Transferring leadership to peer n1 with address 127.0.0.1:10124
Changing priority of peer n1 with address 127.0.0.1:10124 to 4
Transferring leadership to peer n1 with address 127.0.0.1:10124
Changing priority of peer n1 with address 127.0.0.1:10124 to 5
Transferring leadership initiated

In case of failure

In most cases, just a retry will fix the problem. And users can also set timeout manually by -timeout option.

$ bin/ratis sh election transfer -peers 127.0.0.1:10024,127.0.0.1:10124,127.0.0.1:11124 -address 127.0.0.1:10124
[main] INFO org.reflections.Reflections - Reflections took 135 ms to scan 1 urls, producing 5 keys and 18 values
[main] WARN org.apache.ratis.metrics.MetricRegistries - Found multiple MetricRegistries implementations: class org.apache.ratis.metrics.impl.MetricRegistriesImpl, class org.apache.ratis.metrics.dropwizard3.Dm3MetricRegistriesImpl. Using first found implementation: org.apache.ratis.metrics.impl.MetricRegistriesImpl@1e67a849
Transferring leadership to peer n1 with address 127.0.0.1:10124
Failed to transfer peer n1 with address 127.0.0.1:10124: org.apache.ratis.protocol.exceptions.TransferLeadershipException: n2@group-ABB3109A44C1: Failed to transfer leadership to n1 (timed out 3000ms): current leader is n2
	at org.apache.ratis.server.impl.TransferLeadership$PendingRequest.complete(TransferLeadership.java:67)
	at org.apache.ratis.server.impl.TransferLeadership.lambda$finish$7(TransferLeadership.java:163)
	at java.util.Optional.ifPresent(Optional.java:159)
	at org.apache.ratis.server.impl.TransferLeadership.finish(TransferLeadership.java:163)
	at org.apache.ratis.server.impl.TransferLeadership.lambda$start$4(TransferLeadership.java:136)
	at org.apache.ratis.util.TimeoutTimer.lambda$onTimeout$2(TimeoutTimer.java:101)
	at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:38)
	at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:79)
	at org.apache.ratis.util.TimeoutTimer$Task.run(TimeoutTimer.java:55)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)

@kaijchen
Copy link
Contributor Author

Sadly this will break backward compatibility. But version 3.0 hasn't been released, so it might be OK.

@szetszwo @codings-dan what do you think?

@kaijchen kaijchen marked this pull request as ready for review February 3, 2023 02:28
@kaijchen
Copy link
Contributor Author

kaijchen commented Feb 3, 2023

I have come up with 3 options to preserve backward compatibility.
@szetszwo @codings-dan what do you think?

  1. Check server version first and send command accordingly.
  2. Add a new command and deprecate the old command.
  3. Just break the backward compatibility, since next version is 3.0.

Personally I prever the 2nd way.

@codings-dan
Copy link
Contributor

@kaijchen Thanks for working on this. I prefer the second option, we can add a new commad to ensure compatibility.

@kaijchen kaijchen changed the title RATIS-1769. Do not change priority in TransferCommand RATIS-1769. Add TransferLeaderCommand and deprecate TransferCommand Feb 7, 2023
@kaijchen
Copy link
Contributor Author

kaijchen commented Feb 7, 2023

@codings-dan @szetszwo PTAL, thanks.

Copy link
Contributor

@codings-dan codings-dan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaijchen Thanks for working on this, could you add a unit test for the new command? see ElectionCommandIntegrationTest#testElectionTransferCommand

@kaijchen
Copy link
Contributor Author

kaijchen commented Feb 9, 2023

could you add a unit test for the new command? see ElectionCommandIntegrationTest#testElectionTransferCommand

Thanks @codings-dan, I have added ElectionCommandIntegrationTest#testElectionTransferLeaderCommand.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaijchen , thanks a lot for working on this. The change looks good.

We should have test to transfer a leader who does not have the highest priority. Then, it should get some exception.

@kaijchen
Copy link
Contributor Author

We should have test to transfer a leader who does not have the highest priority. Then, it should get some exception.

Thanks @szetszwo for reviewing. Good idea.

@szetszwo
Copy link
Contributor

@kaijchen , I am thinking how to deal with the existing and the new commands.

  • The new election transferLeader command, since it won't change priorities, actually is yielding the leadership. How about we call it election yield?

  • The existing election transfer command is still useful since it can transfer the leader to a lower priority peer. How about we don't deprecate it?

@kaijchen
Copy link
Contributor Author

kaijchen commented Feb 14, 2023

  • The new election transferLeader command, since it won't change priorities, actually is yielding the leadership. How about we call it election yield?

I'm OK to call it election yield, but since raft paper call it transfer leadership, users may get confused.

  • The existing election transfer command is still useful since it can transfer the leader to a lower priority peer. How about we don't deprecate it?

The existing election transfer is equal to peer setPriority plus election transferLeader. User can still achieve the same functionality, and the priority will be set by user explicitly (so the user knows exactly what is going on).

Based on these reasons, I chose to deprecate the old command. @szetszwo What do you think?

@szetszwo
Copy link
Contributor

@kaijchen , I got a new idea -- we may update the election transfer as following:

  • Since we already have the getRaftGroup() returned from the server, we can check priorities to decide whether to call setConfiguration. It will call setConfiguration only if the transferee has smaller priority.
  • Then, we don't need a new command.

We do you think?

@kaijchen
Copy link
Contributor Author

  • Then, we don't need a new command.

The old server (prior to 2.4.1) always requires the transferee to have the highest priority for transfer leadership.
We are keeping the old transfer command for backward compatibility. Is there a way to get the server version?

  • Since we already have the getRaftGroup() returned from the server, we can check priorities to decide whether to call setConfiguration. It will call setConfiguration only if the transferee has smaller priority.

Personally I prefer to print a message like "transferee does not have highest priority, please call setPriority first".
Or add a -force option, to set the transferee to highest priority.

Example

let's say we have a cluster like this,

Peer A B C D E
Priority 3 2 2 1 1

New server

If we want to transfer leader from A to E, we should only raise E's priority to highest.
So if peer E is down afterwards, the old priority remains.

Peer A B C D E
Priority 3 2 2 1 3

Old server

Since the old server requires transferee to be the only peer with highest priority.
The best we can do is:

Peer A B C D E
Priority 3 2 2 1 4

However, the old command will reset every node's priority, and the old priority is erased.

Peer A B C D E
Priority 1 1 1 1 2

@szetszwo
Copy link
Contributor

@kaijchen , that's a good point. We may change the existing election transfer command as you described:

  • If the target peer already has the highest priority, do not call setConfiguration.
  • If not, call setConfiguration to change only the target peer to the highest priority.

Does it sound good?

@kaijchen
Copy link
Contributor Author

kaijchen commented Feb 15, 2023

  • If the target peer already has the highest priority, do not call setConfiguration.
  • If not, call setConfiguration to change only the target peer to the highest priority.

Does it sound good?

Yes. But how do we deal with old server (backward compatibility)?
For example, for new servers (with RATIS-1762), we don't need to change priority in the following situation.
But for old servers (without RATIS-1762), we have to increase target peer's priority to 2.

Peer A B C D E
Priority 1 1 1 1 1

Can we just drop backward compatibility?
i.e. The transfer command of Ratis shell 3.0.0 won't work with Ratis 2.4.1 servers.

@szetszwo
Copy link
Contributor

... how do we deal with old server (backward compatibility)?

@kaijchen , When a new client talking to an old server, the client in that case should get a TransferLeadershipException saying that the target server does not have the highest priority. Then, we can send the setConfiguration in that case.

@kaijchen
Copy link
Contributor Author

@kaijchen , When a new client talking to an old server, the client in that case should get a TransferLeadershipException saying that the target server does not have the highest priority. Then, we can send the setConfiguration in that case.

OK, sounds good.

@kaijchen
Copy link
Contributor Author

Manually test backward compatibility, Ratis shell version 3.0.0-SNAPSHOT, Ratis server version 2.4.1:

$ bin/ratis sh election transfer -peers 127.0.0.1:10024,127.0.0.1:10124,127.0.0.1:11124 -address 127.0.0.1:11124
[main] INFO org.reflections.Reflections - Reflections took 164 ms to scan 1 urls, producing 5 keys and 18 values
[main] WARN org.apache.ratis.metrics.MetricRegistries - Found multiple MetricRegistries implementations: class org.apache.ratis.metrics.impl.MetricRegistriesImpl, class org.apache.ratis.metrics.dropwizard3.Dm3MetricRegistriesImpl. Using first found implementation: org.apache.ratis.metrics.impl.MetricRegistriesImpl@1e67a849
Transferring leadership to server with address <127.0.0.1:11124> 
Changing priority of <127.0.0.1:11124> to 1
: caught an error when executing transfer: n0@group-ABB3109A44C1 refused to transfer leadership to peer n2 as it does not has highest priority 9: peers:[n0|rpc:127.0.0.1:10024|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER, n1|rpc:127.0.0.1:10124|admin:|client:|dataStream:|priority:0|startupRole:FOLLOWER, n2|rpc:127.0.0.1:11124|admin:|client:|dataStream:|priority:1|startupRole:FOLLOWER]|listeners:[], old=null
Transferring leadership to server with address <127.0.0.1:11124> 
Changing priority of <127.0.0.1:11124> to 2
: Transferring leadership initiated

@kaijchen kaijchen changed the title RATIS-1769. Add TransferLeaderCommand and deprecate TransferCommand RATIS-1769. Try to avoid changing priorities in TransferCommand Feb 16, 2023
@kaijchen kaijchen changed the title RATIS-1769. Try to avoid changing priorities in TransferCommand RATIS-1769. Avoid changing priorities in TransferCommand unless necessary Feb 16, 2023
@kaijchen
Copy link
Contributor Author

Note: sometimes transfer leadership succeeds, but the request timed out.
This is due to the old leader hasn't received any messages from the new leader yet.
Especially in default timeout, i.e. TransferLeadershipRequest.timeout = 0,
the timeout will default to server.randomElectionTimeout().

This will lead to TransferCommand to return -1, and cause flaky test.
So is it better to set a larger default timeout?

To fix this problem, we should wait for the new leader to be elected in the shell.
But AbstractRatisCommand seems didn't expect the leader to change.

@kaijchen
Copy link
Contributor Author

Note: sometimes transfer leadership succeeds, but the request timed out. This is due to the old leader hasn't received any messages from the new leader yet. Especially in default timeout, i.e. TransferLeadershipRequest.timeout = 0, the timeout will default to server.randomElectionTimeout().

This will lead to TransferCommand to return -1, and cause flaky test. So is it better to set a larger default timeout?

Set default timeout to 3s in the new mode. The default timeout in legacy mode (fallback) is kept as 60s.

To fix this problem, we should wait for the new leader to be elected in the shell. But AbstractRatisCommand seems didn't expect the leader to change.

Added TODO in comments. It's better to keep changes small in this PR.

@kaijchen
Copy link
Contributor Author

@szetszwo please take another look, thanks.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaijchen , thanks for the update. Please see the comments inlined.

Comment on lines 64 to 68
final long timeoutDefault = 3_000L;
// Default timeout for legacy mode matches with the legacy command (version 2.4.x and older).
final long timeoutLegacy = 60_000L;
final Optional<Long> timeout = !cl.hasOption(TIMEOUT_OPTION_NAME) ? Optional.empty() :
Optional.of(Long.parseLong(cl.getOptionValue(TIMEOUT_OPTION_NAME)) * 1000L);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use TimeDuration for supporting different units such as 100ms, 1min.

    final TimeDuration timeoutDefault = TimeDuration.valueOf(3, TimeUnit.SECONDS);
    // Default timeout for legacy mode matches with the legacy command (version 2.4.x and older).
    final TimeDuration timeoutLegacy = TimeDuration.valueOf(60, TimeUnit.SECONDS);
    final Optional<TimeDuration> timeout = !cl.hasOption(TIMEOUT_OPTION_NAME) ? Optional.empty() :
        Optional.of(TimeDuration.valueOf(cl.getOptionValue(TIMEOUT_OPTION_NAME), TimeUnit.SECONDS));

.mapToInt(RaftPeer::getPriority).max().orElse(0);
RaftPeer newLeader = getRaftGroup().getPeers().stream()
.filter(peer -> peer.getAddress().equals(strAddr)).findAny().orElse(null);
if (newLeader == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Print an error message.

      printf("Peer with address %s not found.", strAddr);

Comment on lines 92 to 108
private Throwable tryTransfer(RaftClient client, RaftPeer newLeader, int highestPriority, long timeout) {
printf("Transferring leadership to server with address <%s> %n", newLeader.getAddress());
try {
// lift the current leader to the highest priority,
if (newLeader.getPriority() < highestPriority) {
setPriority(client, newLeader.getAddress(), highestPriority);
}
RaftClientReply transferLeadershipReply =
client.admin().transferLeadership(newLeader.getId(), timeout);
processReply(transferLeadershipReply, () -> "election failed");
} catch (Throwable t) {
printf("caught an error when executing transfer: %s%n", t.getMessage());
return t;
}
println("Transferring leadership initiated");
return null;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's return a boolean to indicate if the transfer is success.

  private boolean tryTransfer(RaftClient client, RaftPeer newLeader, int highestPriority,
      TimeDuration timeout) throws IOException {
    printf("Transferring leadership to server with address <%s> %n", newLeader.getAddress());
    try {
      // lift the current leader to the highest priority,
      if (newLeader.getPriority() < highestPriority) {
        setPriority(client, newLeader.getAddress(), highestPriority);
      }
      RaftClientReply transferLeadershipReply =
          client.admin().transferLeadership(newLeader.getId(), timeout.toLong(TimeUnit.MILLISECONDS));
      processReply(transferLeadershipReply, () -> "election failed");
    } catch (TransferLeadershipException tle) {
      if (tle.getMessage().contains("it does not has highest priority")) {
        return false;
      }
      throw tle;
    }
    println("Transferring leadership initiated");
    return true;
  }

Comment on lines 79 to 87
Throwable err = tryTransfer(client, newLeader, highestPriority, timeout.orElse(timeoutDefault));
if (err instanceof TransferLeadershipException
&& err.getMessage().contains("it does not has highest priority")) {
// legacy mode, transfer leadership by setting priority.
err = tryTransfer(client, newLeader, highestPriority + 1, timeout.orElse(timeoutLegacy));
}
if (err != null) {
return -1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After changing tryTransfer to return a boolean, we could make the corresponding change as below:

    try (RaftClient client = RaftUtils.createClient(getRaftGroup())) {
      // transfer leadership
      if (!tryTransfer(client, newLeader, highestPriority, timeout.orElse(timeoutDefault))) {
        // legacy mode, transfer leadership by setting priority.
        tryTransfer(client, newLeader, highestPriority + 1, timeout.orElse(timeoutLegacy));
      }
    } catch(Throwable t) {
      printf("Failed to transfer peer %s with address %s: ",
          newLeader.getId(), newLeader.getAddress());
      t.printStackTrace(getPrintStream());
      return -1;
    }
    return 0;

@kaijchen
Copy link
Contributor Author

Thanks @szetszwo for the review, updated.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@szetszwo szetszwo merged commit 4bb35f0 into apache:master Feb 21, 2023
@kaijchen kaijchen deleted the RATIS-1769 branch February 22, 2023 02:15
szetszwo pushed a commit that referenced this pull request Mar 5, 2023
kuszz pushed a commit to kuszz/ratis that referenced this pull request Mar 7, 2023
symious pushed a commit to symious/ratis that referenced this pull request Mar 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants