RegionMigrate: Fix region becoming unavailable when migrating a region with ratisConsensus (#13178)
Conversation
Integer leaderId = configManager.getLoadManager().getRegionLeaderMap().get(regionId);
if (leaderId != -1) {
  // The migrated node is not leader, so we specify the transfer leader to current leader node
Then we do not need to send an RPC to the DataNode?
I think so; we can report success directly.
status);
configManager.getLoadManager().removeRegionCache(regionId, deprecatedLocation.getDataNodeId());
configManager.getLoadManager().getRouteBalancer().balanceRegionLeaderAndPriority();
// configManager.getLoadManager().getRouteBalancer().balanceRegionLeaderAndPriority();
Do not comment this out.
OK, I have changed it.
*/
public void transferRegionLeader(TConsensusGroupId regionId, TDataNodeLocation originalDataNode)
    throws ProcedureException, InterruptedException {
  List<TDataNodeLocation> excludeDataNode = new ArrayList<>();
Use Collections.singletonList() here.
public Optional<TDataNodeLocation> filterDataNodeWithOtherRegionReplica(
    TConsensusGroupId regionId, TDataNodeLocation filterLocation, NodeStatus... allowingStatus) {
  List<TDataNodeLocation> excludeLocations = new ArrayList<>();
  excludeLocations.add(filterLocation);
I have changed it.
regionId,
new ConsensusGroupHeartbeatSample(timestamp, newLeaderNode.get().getDataNodeId())));
configManager.getLoadManager().getRouteBalancer().balanceRegionLeaderAndPriority();
// configManager.getLoadManager().getRouteBalancer().balanceRegionLeaderAndPriority();
Optional.ofNullable(getGroupInfo(raftGroupId))
    .orElseThrow(() -> new ConsensusGroupNotExistException(groupId));

if (raftGroup.getPeers() == null) {
If we forbid leader transfer during region migration, do we still need this check?
This commit is a draft version; I'll delete this code in a future commit.
OK, waiting for yuheng's PR.
final RaftClientReply reply;
try {
  Peer leader = getLeader(groupId);
Same as above: not necessary if we handle this in the ConfigNode.
  return reply;
}

private RaftClientReply sendReconfigurationWithRetry(RaftGroup newGroupConf)
I have switched to another policy.
// The two fields are used to control the retry times and wait time for setConfiguration
// reconfiguration may take many time, so we need to set a long wait time and retry times
private final long setConfigurationWaitTime = 10000;
private final int setConfigurationRetryTimes = 50;
This code has been deleted.
        this.config.getImpl().getRetryWaitMillis(), TimeUnit.MILLISECONDS))
    .build();

regionMigrateRetryPolicy =
I'm still considering which one to use.
handler.forceUpdateRegionCache(consensusGroupId, targetDataNode, RegionStatus.Removing);
List<TDataNodeLocation> excludeDataNode = new ArrayList<>();
excludeDataNode.add(targetDataNode);
excludeDataNode.add(coordinator);
It feels like the semantics here should allow it, and in the end, if the chosen node turns out to be the same, just do not send the RPC?
This just excludes the origin node and the target node: we don't need to transfer leadership to the new member, and since the origin node will be deleted, we filter it out as well. Considering that there may be other nodes that need to be filtered, I simply used a List to avoid adding additional parameters to the function.
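The exclusion logic discussed above can be sketched as follows. This is a minimal illustration, not the PR's actual code: integer node ids stand in for TDataNodeLocation, and pickNewLeader stands in for filterDataNodeWithOtherRegionReplica.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class LeaderCandidateFilter {
    // Skip the origin node (about to be deleted) and the target node (it may
    // not have caught up yet); any remaining replica is a valid new leader.
    static Optional<Integer> pickNewLeader(List<Integer> replicas, List<Integer> exclude) {
        return replicas.stream().filter(n -> !exclude.contains(n)).findFirst();
    }

    public static void main(String[] args) {
        List<Integer> replicas = Arrays.asList(1, 2, 3, 4); // region group [a, b, c, d]
        List<Integer> exclude = Arrays.asList(3, 4);        // origin c and target d
        // Fall back to the coordinator (here node 1) if no other replica qualifies.
        int newLeader = pickNewLeader(replicas, exclude).orElse(1);
        System.out.println(newLeader);
    }
}
```

Passing a List (rather than a single node) keeps the signature stable if more nodes ever need to be excluded, which is the trade-off the author describes.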
*/
public Optional<TDataNodeLocation> filterDataNodeWithOtherRegionReplica(
    TConsensusGroupId regionId, TDataNodeLocation filterLocation) {
  List<TDataNodeLocation> filterLocations = new ArrayList<>();
Use Collections.singletonList here.
Sorry, I missed it.
public void transferRegionLeader(
    TConsensusGroupId regionId,
    TDataNodeLocation originalDataNode,
    List<TDataNodeLocation> excludeDataNode)
It seems we only need the one node that will be deleted in the future, not a list?
A List makes it convenient to check whether a node is included.
Optional.ofNullable(getGroupInfo(raftGroupId))
    .orElseThrow(() -> new ConsensusGroupNotExistException(groupId));

if (raftGroup.getPeers() == null) {
OK, waiting for yuheng's PR.
client.getRaftClient().admin().setConfiguration(new ArrayList<>(newGroupConf.getPeers()));
if (!reply.isSuccess()) {
  int basicWaitTime = 500;
  int maxWaitTime = 10000;
It feels like we can just poll periodically here, without adding idempotent logic, because it doesn't make a big difference.
That is also okay.
  break;
}
if (reply.getException() instanceof ReconfigurationInProgressException
    || reply.getException() instanceof LeaderSteppingDownException
When do we need these two exceptions? Will this cause an infinite loop?
It will loop until the reply returns success or another exception is thrown. In fact, I wanted to use RaftServer.LifeCycle as the loop condition: if the majority of the RaftServer.LifeCycle states are RUNNING, then we can return; but I can't get the RaftServers for the peers in newGroupConf.
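The retry behavior described in this exchange can be sketched as a loop that backs off exponentially (bounded by maxWaitTime) while the failure is one of the two transient reconfiguration states, and gives up on any other failure. This is a simplified stand-in, not the PR's code: Reply and the two exception classes below are placeholders for Ratis's RaftClientReply and its exceptions.

```java
import java.util.function.Supplier;

public class ReconfigRetrySketch {
    // Placeholder types standing in for org.apache.ratis classes.
    static class Reply {
        final boolean success;
        final Exception exception;
        Reply(boolean success, Exception exception) {
            this.success = success;
            this.exception = exception;
        }
    }
    static class ReconfigurationInProgressException extends Exception {}
    static class LeaderSteppingDownException extends Exception {}

    // Retry setConfiguration while the failure is transient; back off
    // exponentially from basicWaitMs up to maxWaitMs between attempts.
    static Reply retryReconfiguration(Supplier<Reply> send, long basicWaitMs, long maxWaitMs)
            throws InterruptedException {
        long wait = basicWaitMs;
        while (true) {
            Reply reply = send.get();
            if (reply.success) {
                return reply;
            }
            if (!(reply.exception instanceof ReconfigurationInProgressException
                    || reply.exception instanceof LeaderSteppingDownException)) {
                return reply; // other failures are not retried here
            }
            Thread.sleep(wait);
            wait = Math.min(wait * 2, maxWaitMs);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Simulate a reconfiguration that succeeds on the third attempt.
        Reply reply = retryReconfiguration(() -> {
            calls[0]++;
            return calls[0] < 3
                    ? new Reply(false, new ReconfigurationInProgressException())
                    : new Reply(true, null);
        }, 1, 4);
        System.out.println(reply.success + " after " + calls[0] + " attempts");
    }
}
```

The loop terminates as soon as success is reported or a non-transient exception appears, which is why the reviewer's "forever loop" concern only applies if the transient state never clears.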
...onfignode/src/main/java/org/apache/iotdb/confignode/procedure/env/RegionMaintainHandler.java
List<TDataNodeLocation> excludeDataNode = new ArrayList<>();
excludeDataNode.add(targetDataNode);
excludeDataNode.add(coordinator);
handler.transferRegionLeader(consensusGroupId, targetDataNode, excludeDataNode);
Then I'll change it again.
static {
  String str = "";
  // 50, 500ms, 40, 1000ms, 30, 1500ms, 20, 2000ms, 10, 2500ms
What does this comment mean? Describe your algorithm more clearly. And why define the algorithm in a static block?
It's a pair array, with pairs of the form (n, t): for the i-th pair, the client will retry n times, sleeping t ms each time. I concatenate these pairs into a string, and MultipleLinearRandomRetry will parse it. I want to compute it only once, so I put it in a static block, since a static block is executed only once.
I think just using "50, 500ms, 40, 1000ms, 30, 1500ms, 20, 2000ms, 10, 2500ms" directly would be clearer than building the String in a loop. But this strategy does not seem to be actually endless?
Yes, the retry count should be a large number; if the number is large enough, the loop is effectively endless.
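The string construction the author describes can be sketched like this. It is an illustration only: buildPolicyString is a hypothetical name for the logic inside the PR's static block, and the pair format (retry count, then sleep time with an "ms" suffix) follows the comment quoted in the review, which MultipleLinearRandomRetry is said to parse.

```java
public class RetryPolicyString {
    // Concatenate (n, t) pairs into "n1,t1ms,n2,t2ms,...": the client retries
    // n times for the i-th pair, sleeping t ms between attempts.
    static String buildPolicyString(int[][] pairs) {
        StringBuilder str = new StringBuilder();
        for (int[] p : pairs) {
            str.append(p[0]).append(',').append(p[1]).append("ms,");
        }
        return str.substring(0, str.length() - 1); // drop the trailing comma
    }

    public static void main(String[] args) {
        int[][] pairs = {{50, 500}, {40, 1000}, {30, 1500}, {20, 2000}, {10, 2500}};
        System.out.println(buildPolicyString(pairs));
    }
}
```

As the reviewer notes, writing the final literal string directly would be clearer than building it in a loop, since the pairs are fixed at compile time.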
// Ratis guarantees that event.getCause() is instance of IOException.
// We should allow RaftException or IOException(StatusRuntimeException, thrown by gRPC) to be
// retried.
Optional<Throwable> unexpectedCause =
    Optional.ofNullable(event.getCause())
        .filter(RaftException.class::isInstance)
        .map(Throwable::getCause)
        .filter(StatusRuntimeException.class::isInstance);
What is "RaftException or IOException (StatusRuntimeException, thrown by gRPC)"? The words in parentheses are confusing; StatusRuntimeException is not a kind of IOException.
I copied it from RatisRetryPolicy.
private final RaftClientRpc clientRpc;

private final IClientManager<RaftGroup, RatisClient> clientManager;
private final IClientManager<RaftGroup, RatisClient> reconfClientManager;
  }
}

private RatisClient getConfRaftClient(RaftGroup group) throws ClientManagerException {
private boolean isReConfiguration;

RatisClientPoolFactory(boolean isReConfiguration) {
  this.isReConfiguration = isReConfiguration;
}
@Override
public Action handleAttemptFailure(Event event) {
  if (event.getCause() instanceof ReconfigurationInProgressException) {
Why not use if (A || B || C)?
Haha, I was lazy here :) This part was a Copilot prompt, and the code was generated with one tab.
Why not change it?
I have changed it.
if (!newLeaderNode.isPresent()) {
  // If we have no choice, we use it
  newLeaderNode = Optional.of(coodinator);
}
Better than what I had in mind; nice and clean code!
// This policy is used to raft configuration change
//
Use a Javadoc comment (/** */) instead.
RatisClientPoolFactory(boolean isReConfiguration) {
  this.isReConfiguration = isReConfiguration;
  this.isReconfiguration = isReConfiguration;
You forgot something.
RatisEndlessRetryPolicy() {
  // about 1 hour wait Time.
  defaultPolicy =
      MultipleLinearRandomRetry.parseCommaSeparated(str.substring(0, str.length() - 1));
  RetryPolicies.retryForeverWithSleep(TimeDuration.valueOf(2, TimeUnit.SECONDS));
}
- Isn't retrying every 2 seconds too frequent?
- Why does the comment still say "// about 1 hour wait Time."?
- If the reply returns false, the request is resent after 2 s; it is not sent every 2 seconds unconditionally.
- Sorry, I forgot to delete the comment.
private RatisClient getConfigurationRaftClient(RaftGroup group) throws ClientManagerException {
  try {
    return reconfigurationClientManager.borrowClient(group);
Keep this consistent with clientManager.
Optional.ofNullable(getGroupInfo(raftGroupId))
    .orElseThrow(() -> new ConsensusGroupNotExistException(groupId));

if (raftGroup.getPeers() == null) {
excludeDataNode.add(coodinator);
while (System.nanoTime() - startTime < TimeUnit.SECONDS.toNanos(findNewLeaderTimeLimitSecond)) {
  newLeaderNode = filterDataNodeWithOtherRegionReplica(regionId, originalDataNode);
  newLeaderNode = filterDataNodeWithOtherRegionReplica(regionId, excludeDataNode);
Doesn't this cause a redundant sleep when the replica count is 1?
iotdb-core/consensus/src/main/java/org/apache/iotdb/consensus/ratis/RatisConsensus.java
…ion is unavailable. (apache#13178)
Suppose the current region configuration is [a, b, c], we migrate the region from c to d, and the leader is a. During the region migration, the new node d has to catch up, so if the data is large this may take a long time.
Assume the operations before the Remove phase succeed, so the region configuration at this point is [a, b, c, d], and RemoveRegion starts executing. For some reason the RemoveRegion operation does not complete, so the DELETE_REGION_PEER phase in the executeFromState function of RemoveProcedure times out and fails; the procedure then enters DELETE_OLD_REGION_PEER, while the new node d has not yet caught up with a, and the old node c is deleted.
If a leader transfer is executed in the middle, say from a to b or from a to d, the whole region group ends up in a leaderless state, and permanently so: c believes it is no longer in the region group, while the other nodes [a, b, d] believe it is; and since the new node d has not caught up with the old leader, d will not vote (due to the lifeCycle of Ratis). The region group can then never elect a leader, because no candidate will ever receive a majority of votes.
We modify the membership-change part to wait until the Raft membership change completes or throws an exception before returning.
In the previous region migration implementation, we randomly selected an available node and transferred the existing leader's privileges to it. This could select a new node whose data had not yet caught up with the previous leader; such an election is wasted, because the other nodes will never vote for that node.
In this PR we fix this issue: if the migrated node is not the leader, we do not perform the leader transfer at all.
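The decision the fix introduces can be sketched as a single guard. This is an illustrative stand-in, not the PR's code: needsLeaderTransfer is a hypothetical helper, integer ids stand in for DataNode locations, and the actual procedure works with the ConfigNode's region leader map.

```java
public class TransferLeaderDecision {
    // Transfer leadership only when the node being migrated away currently
    // holds it; otherwise no RPC needs to be sent to the DataNode and the
    // step can report success directly.
    static boolean needsLeaderTransfer(int currentLeaderId, int migratedNodeId) {
        return currentLeaderId == migratedNodeId;
    }

    public static void main(String[] args) {
        System.out.println(needsLeaderTransfer(1, 3)); // migrated node c is not the leader
        System.out.println(needsLeaderTransfer(3, 3)); // migrated node is the leader
    }
}
```

Skipping the transfer when the migrated node is not the leader avoids the wasted election described above, since the remaining replicas keep their existing leader.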