RegionMigrate: Fix region becoming unavailable when migrating a region with ratisConsensus (#13178)
Conversation
Integer leaderId = configManager.getLoadManager().getRegionLeaderMap().get(regionId);
if (leaderId != -1) {
  // The migrated node is not leader, so we specify the transfer leader to current leader node
Then we do not need to send an RPC to the DataNode?
I think so; we can report success directly.
status);
configManager.getLoadManager().removeRegionCache(regionId, deprecatedLocation.getDataNodeId());
configManager.getLoadManager().getRouteBalancer().balanceRegionLeaderAndPriority();
// configManager.getLoadManager().getRouteBalancer().balanceRegionLeaderAndPriority();
Do not comment this out.
OK, I have changed it.
*/
public void transferRegionLeader(TConsensusGroupId regionId, TDataNodeLocation originalDataNode)
    throws ProcedureException, InterruptedException {
  List<TDataNodeLocation> excludeDataNode = new ArrayList<>();
Use Collections.singletonList() here.
public Optional<TDataNodeLocation> filterDataNodeWithOtherRegionReplica(
    TConsensusGroupId regionId, TDataNodeLocation filterLocation, NodeStatus... allowingStatus) {
  List<TDataNodeLocation> excludeLocations = new ArrayList<>();
  excludeLocations.add(filterLocation);
I have changed it.
regionId,
new ConsensusGroupHeartbeatSample(timestamp, newLeaderNode.get().getDataNodeId())));
configManager.getLoadManager().getRouteBalancer().balanceRegionLeaderAndPriority();
// configManager.getLoadManager().getRouteBalancer().balanceRegionLeaderAndPriority();
Optional.ofNullable(getGroupInfo(raftGroupId))
    .orElseThrow(() -> new ConsensusGroupNotExistException(groupId));

if (raftGroup.getPeers() == null) {
If we forbid leader transfer during region migration, do we still need this check?
This commit is a draft version; I'll delete this code in a future commit.
OK, waiting for yuheng's PR.
final RaftClientReply reply;
try {
  Peer leader = getLeader(groupId);
Same as above: not necessary if we handle this in the ConfigNode.
  return reply;
}

private RaftClientReply sendReconfigurationWithRetry(RaftGroup newGroupConf)
I have switched to another policy.
// The two fields are used to control the retry times and wait time for setConfiguration
// reconfiguration may take many time, so we need to set a long wait time and retry times
private final long setConfigurationWaitTime = 10000;
private final int setConfigurationRetryTimes = 50;
This code has been deleted.
        this.config.getImpl().getRetryWaitMillis(), TimeUnit.MILLISECONDS))
    .build();

regionMigrateRetryPolicy =
I'm still considering which one to use.
handler.forceUpdateRegionCache(consensusGroupId, targetDataNode, RegionStatus.Removing);
List<TDataNodeLocation> excludeDataNode = new ArrayList<>();
excludeDataNode.add(targetDataNode);
excludeDataNode.add(coordinator);
It feels like the semantics here should allow it, and in the end, if the chosen node turns out to be the same, just do not send the RPC?
This just excludes the origin node and the target node: we don't need to transfer leadership to the new member, and since the origin node will be deleted, we filter it out as well. Considering that there may be other nodes that need to be filtered, I simply used a List to avoid adding additional parameters to the function.
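The exclusion logic discussed above can be sketched as follows. This is a minimal illustration, not the PR's actual code: integer node ids stand in for TDataNodeLocation, and pickNewLeader stands in for filterDataNodeWithOtherRegionReplica.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class LeaderCandidateFilter {
    // Skip the origin node (about to be deleted) and the target node (it may
    // not have caught up yet); any remaining replica is a valid new leader.
    static Optional<Integer> pickNewLeader(List<Integer> replicas, List<Integer> exclude) {
        return replicas.stream().filter(n -> !exclude.contains(n)).findFirst();
    }

    public static void main(String[] args) {
        List<Integer> replicas = Arrays.asList(1, 2, 3, 4); // region group [a, b, c, d]
        List<Integer> exclude = Arrays.asList(3, 4);        // origin c and target d
        // Fall back to the coordinator (here node 1) if no other replica qualifies.
        int newLeader = pickNewLeader(replicas, exclude).orElse(1);
        System.out.println(newLeader);
    }
}
```

Passing a List (rather than a single node) keeps the signature stable if more nodes ever need to be excluded, which is the trade-off the author describes.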
*/
public Optional<TDataNodeLocation> filterDataNodeWithOtherRegionReplica(
    TConsensusGroupId regionId, TDataNodeLocation filterLocation) {
  List<TDataNodeLocation> filterLocations = new ArrayList<>();
Use Collections.singletonList here.
Sorry, I missed it.
public void transferRegionLeader(
    TConsensusGroupId regionId,
    TDataNodeLocation originalDataNode,
    List<TDataNodeLocation> excludeDataNode)
It seems we only need the one node that will be deleted in the future, not a list?
A List makes it convenient to check whether a node is included.
Optional.ofNullable(getGroupInfo(raftGroupId))
    .orElseThrow(() -> new ConsensusGroupNotExistException(groupId));

if (raftGroup.getPeers() == null) {
OK, waiting for yuheng's PR.
client.getRaftClient().admin().setConfiguration(new ArrayList<>(newGroupConf.getPeers()));
if (!reply.isSuccess()) {
  int basicWaitTime = 500;
  int maxWaitTime = 10000;
It feels like we can just poll periodically here, without adding idempotent logic, because it doesn't make a big difference.
That is also okay.
  break;
}
if (reply.getException() instanceof ReconfigurationInProgressException
    || reply.getException() instanceof LeaderSteppingDownException
When do we need these two exceptions? Will this cause an infinite loop?
It will loop until the reply returns success or another exception is thrown. In fact, I wanted to use RaftServer.LifeCycle as the loop condition: if the majority of the RaftServer.LifeCycle states are RUNNING, then we can return; but I can't get the RaftServers for the peers in newGroupConf.
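The retry behavior described in this exchange can be sketched as a loop that backs off exponentially (bounded by maxWaitTime) while the failure is one of the two transient reconfiguration states, and gives up on any other failure. This is a simplified stand-in, not the PR's code: Reply and the two exception classes below are placeholders for Ratis's RaftClientReply and its exceptions.

```java
import java.util.function.Supplier;

public class ReconfigRetrySketch {
    // Placeholder types standing in for org.apache.ratis classes.
    static class Reply {
        final boolean success;
        final Exception exception;
        Reply(boolean success, Exception exception) {
            this.success = success;
            this.exception = exception;
        }
    }
    static class ReconfigurationInProgressException extends Exception {}
    static class LeaderSteppingDownException extends Exception {}

    // Retry setConfiguration while the failure is transient; back off
    // exponentially from basicWaitMs up to maxWaitMs between attempts.
    static Reply retryReconfiguration(Supplier<Reply> send, long basicWaitMs, long maxWaitMs)
            throws InterruptedException {
        long wait = basicWaitMs;
        while (true) {
            Reply reply = send.get();
            if (reply.success) {
                return reply;
            }
            if (!(reply.exception instanceof ReconfigurationInProgressException
                    || reply.exception instanceof LeaderSteppingDownException)) {
                return reply; // other failures are not retried here
            }
            Thread.sleep(wait);
            wait = Math.min(wait * 2, maxWaitMs);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Simulate a reconfiguration that succeeds on the third attempt.
        Reply reply = retryReconfiguration(() -> {
            calls[0]++;
            return calls[0] < 3
                    ? new Reply(false, new ReconfigurationInProgressException())
                    : new Reply(true, null);
        }, 1, 4);
        System.out.println(reply.success + " after " + calls[0] + " attempts");
    }
}
```

The loop terminates as soon as success is reported or a non-transient exception appears, which is why the reviewer's "forever loop" concern only applies if the transient state never clears.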
...onfignode/src/main/java/org/apache/iotdb/confignode/procedure/env/RegionMaintainHandler.java
List<TDataNodeLocation> excludeDataNode = new ArrayList<>();
excludeDataNode.add(targetDataNode);
excludeDataNode.add(coordinator);
handler.transferRegionLeader(consensusGroupId, targetDataNode, excludeDataNode);
Then I'll change it again.
static {
  String str = "";
  // 50, 500ms, 40, 1000ms, 30, 1500ms, 20, 2000ms, 10, 2500ms
What does this comment mean? Describe your algorithm more clearly. And why define the algorithm in a static block?
It's a pair array, with pairs of the form (n, t): for the i-th pair, the client will retry n times, sleeping t ms each time. I concatenate these pairs into a string, and MultipleLinearRandomRetry will parse it. I want to compute it only once, so I put it in a static block, since a static block is executed only once.
I think just using "50, 500ms, 40, 1000ms, 30, 1500ms, 20, 2000ms, 10, 2500ms" directly would be clearer than building the String in a loop. But this strategy does not seem to be actually endless?
Yes, the retry count should be a large number; if the number is large enough, the loop is effectively endless.
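The string construction the author describes can be sketched like this. It is an illustration only: buildPolicyString is a hypothetical name for the logic inside the PR's static block, and the pair format (retry count, then sleep time with an "ms" suffix) follows the comment quoted in the review, which MultipleLinearRandomRetry is said to parse.

```java
public class RetryPolicyString {
    // Concatenate (n, t) pairs into "n1,t1ms,n2,t2ms,...": the client retries
    // n times for the i-th pair, sleeping t ms between attempts.
    static String buildPolicyString(int[][] pairs) {
        StringBuilder str = new StringBuilder();
        for (int[] p : pairs) {
            str.append(p[0]).append(',').append(p[1]).append("ms,");
        }
        return str.substring(0, str.length() - 1); // drop the trailing comma
    }

    public static void main(String[] args) {
        int[][] pairs = {{50, 500}, {40, 1000}, {30, 1500}, {20, 2000}, {10, 2500}};
        System.out.println(buildPolicyString(pairs));
    }
}
```

As the reviewer notes, writing the final literal string directly would be clearer than building it in a loop, since the pairs are fixed at compile time.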
// Ratis guarantees that event.getCause() is instance of IOException.
// We should allow RaftException or IOException(StatusRuntimeException, thrown by gRPC) to be
// retried.
Optional<Throwable> unexpectedCause =
    Optional.ofNullable(event.getCause())
        .filter(RaftException.class::isInstance)
        .map(Throwable::getCause)
        .filter(StatusRuntimeException.class::isInstance);
What is "RaftException or IOException (StatusRuntimeException, thrown by gRPC)"? The words in parentheses are confusing; StatusRuntimeException is not a kind of IOException.
I copied it from RatisRetryPolicy.
private final RaftClientRpc clientRpc;

private final IClientManager<RaftGroup, RatisClient> clientManager;
private final IClientManager<RaftGroup, RatisClient> reconfClientManager;
  }
}

private RatisClient getConfRaftClient(RaftGroup group) throws ClientManagerException {
private boolean isReConfiguration;

RatisClientPoolFactory(boolean isReConfiguration) {
  this.isReConfiguration = isReConfiguration;
}
@Override
public Action handleAttemptFailure(Event event) {
  if (event.getCause() instanceof ReconfigurationInProgressException) {
Why not use if (A || B || C)?
Haha, I was lazy here :) This part was a Copilot prompt, and the code was generated with one tab.
Why not change it?
I have changed it.
if (!newLeaderNode.isPresent()) {
  // If we have no choice, we use it
  newLeaderNode = Optional.of(coodinator);
}
Better than what I had in mind; nice and clean code!
// This policy is used to raft configuration change
//
Use a Javadoc comment (/** */) instead.
RatisClientPoolFactory(boolean isReConfiguration) {
  this.isReConfiguration = isReConfiguration;
  this.isReconfiguration = isReConfiguration;
You forgot something.
RatisEndlessRetryPolicy() {
  // about 1 hour wait Time.
  defaultPolicy =
      MultipleLinearRandomRetry.parseCommaSeparated(str.substring(0, str.length() - 1));
  RetryPolicies.retryForeverWithSleep(TimeDuration.valueOf(2, TimeUnit.SECONDS));
}
- Isn't retrying every 2 seconds too frequent?
- Why does the comment still say "// about 1 hour wait Time."?
- If the reply returns false, the request is resent after 2 s; it is not sent every 2 seconds unconditionally.
- Sorry, I forgot to delete the comment.
private RatisClient getConfigurationRaftClient(RaftGroup group) throws ClientManagerException {
  try {
    return reconfigurationClientManager.borrowClient(group);
Keep this consistent with clientManager.
Optional.ofNullable(getGroupInfo(raftGroupId))
    .orElseThrow(() -> new ConsensusGroupNotExistException(groupId));

if (raftGroup.getPeers() == null) {
excludeDataNode.add(coodinator);
while (System.nanoTime() - startTime < TimeUnit.SECONDS.toNanos(findNewLeaderTimeLimitSecond)) {
  newLeaderNode = filterDataNodeWithOtherRegionReplica(regionId, originalDataNode);
  newLeaderNode = filterDataNodeWithOtherRegionReplica(regionId, excludeDataNode);
Doesn't this cause a redundant sleep when the replica count is 1?
iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/ratis/RatisConsensus.java
…ion is unavailable. (apache#13178)
Suppose the current region configuration is [a, b, c], we migrate the region from c to d, and the leader is a. During the region migration, the new node d has to catch up, so if the data is large this may take a long time.
Assume the operations before the Remove phase succeed, so the region configuration at this point is [a, b, c, d], and RemoveRegion starts executing. For some reason the RemoveRegion operation does not complete, so the DELETE_REGION_PEER phase in the executeFromState function of RemoveProcedure times out and fails; the procedure then enters DELETE_OLD_REGION_PEER, while the new node d has not yet caught up with a, and the old node c is deleted.
If a leader transfer is executed in the middle, say from a to b or from a to d, the whole region group ends up in a leaderless state, and permanently so: c believes it is no longer in the region group, while the other nodes [a, b, d] believe it is; and since the new node d has not caught up with the old leader, d will not vote (due to the lifeCycle of Ratis). The region group can then never elect a leader, because no candidate will ever receive a majority of votes.
We modify the membership-change part to wait until the Raft membership change completes or throws an exception before returning.
In the previous region migration implementation, we randomly selected an available node and transferred the existing leader's privileges to it. This could select a new node whose data had not yet caught up with the previous leader; such an election is wasted, because the other nodes will never vote for that node.
In this PR we fix this issue: if the migrated node is not the leader, we do not perform the leader transfer at all.
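The decision the fix introduces can be sketched as a single guard. This is an illustrative stand-in, not the PR's code: needsLeaderTransfer is a hypothetical helper, integer ids stand in for DataNode locations, and the actual procedure works with the ConfigNode's region leader map.

```java
public class TransferLeaderDecision {
    // Transfer leadership only when the node being migrated away currently
    // holds it; otherwise no RPC needs to be sent to the DataNode and the
    // step can report success directly.
    static boolean needsLeaderTransfer(int currentLeaderId, int migratedNodeId) {
        return currentLeaderId == migratedNodeId;
    }

    public static void main(String[] args) {
        System.out.println(needsLeaderTransfer(1, 3)); // migrated node c is not the leader
        System.out.println(needsLeaderTransfer(3, 3)); // migrated node is the leader
    }
}
```

Skipping the transfer when the migrated node is not the leader avoids the wasted election described above, since the remaining replicas keep their existing leader.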