Conversation

@JAkutenshi JAkutenshi commented Sep 30, 2024

JIRA Ticket: IGNITE-23019

The goal

The goal of this PR is to fix tests that fail due to the removal of the getLeader flag.

The reason

Currently, org.apache.ignite.internal.raft.RaftGroupServiceImpl#start consumes the getLeader flag, which leads to:

  1. Awaiting refreshLeader internally.
  2. CompletableFuture as a return type.

But this refreshLeader call is not necessary, because every public method of the raft client triggers refreshLeader internally anyway. Besides, when we start a cluster, numerous nodes and their replication groups start up and wait for raft leader election while the replication groups are being started. This waiting is not needed for startup, but it slows down the starting process.

Thus, the reasons for removing the flag are:

  1. Performance: reduced replication group startup time.
  2. Code simplification (see the sketch after this list):
    • less async chaining on the CompletableFuture result;
    • no need to collect and later handle the raft clients' start futures when stopping replication groups.
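
A hedged illustration of the second point (identifiers such as startClient, groupIds and the shutdown call are assumptions, not the actual code): without the eager refreshLeader there is no start future to collect, so a component can hold the clients directly and stop them with a plain loop.

    // Illustrative sketch only: bookkeeping before vs. after the flag removal.
    // Before, start() returned a CompletableFuture that had to be collected and
    // joined on stop; now the client can be stored and shut down directly.
    List<RaftGroupService> services = new ArrayList<>();

    for (ReplicationGroupId groupId : groupIds) {
        services.add(startClient(groupId)); // previously: CompletableFuture<RaftGroupService>
    }

    // Stopping replication groups no longer needs to chain on collected start futures.
    services.forEach(RaftGroupService::shutdown);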

The solution

The temporary solution, until IGNITE-23010 is done, is to set the getLeader flag to false inside org.apache.ignite.internal.raft.RaftGroupServiceImpl#start. With that change, several tests fail. Below is a list of the failed tests that are fixed in this PR.

ItLearnersTest#testAddLearners

Reasons

After the raft client's start we expect the leader to be the follower node, but instead we get a null value.

Solution

Instead of getting the leader directly, let's call refreshLeader first:

  assertThat(
          service1.thenCompose(service -> service.refreshLeader()
                  .thenApply(v -> service.leader())),
          willBe(follower.asPeer())
  );

ItLearnersTest#testLearnersOnTheSameNodeAsPeers

Reasons

Two raft clients are started, related to 2 nodes: one for the peer node and another for the learner node. The test checks that the peer's client sees the peer as the leader, not the learner; likewise, the learner's client must see the peer as the leader, not the learner. But the test fails because each check gets a null value as the leader.

Solution

As in the previous test case, we refresh the leader before each check (for both the peer and learner cases). The follow-up check then uses the already refreshed leader:

        assertThat(
                peerService.thenCompose(service -> service.refreshLeader()
                        .thenApply(v -> service.leader())),
                willBe(peer)
        );
        assertThat(
                // the leader is already refreshed
                peerService.thenApply(RaftGroupService::leader),
                willBe(not(learner))
        );

        assertThat(
                learnerService.thenCompose(service -> service.refreshLeader()
                        .thenApply(v -> service.leader())),
                willBe(peer)
        );
        assertThat(
                // the leader is already refreshed
                learnerService.thenApply(RaftGroupService::leader),
                willBe(not(learner))
        );

ItRaftGroupServiceTest#testTransferLeadership

Reasons

We get an NPE at the beginning of the test while trying to get the first node's leader. There is no leader yet, so null is written into the leader variable, and the NPE is triggered when consistentId is called on the null leader.

Solution

Let's refresh the leader before its usage:

  Peer leader = nodes.get(0).raftGroupService
          .thenCompose(service -> service.refreshLeader().thenApply(v -> service.leader()))
          .join();

ItLozaTest#testRaftServiceUsingSharedExecutor

Reasons

The raft client's start is not the primary scope of this test, but there is a check on the mock invocation counter for invoke inside sendWithRetry, and that method is no longer called because there are no refreshLeader calls triggering sendWithRetry anymore.

Solution

Let's add a refreshLeader call after the client's start, followed by a check that it completes successfully:

  RaftGroupService client = startClient(
          new TestReplicationGroupId(Integer.toString(i)),
          spyService.topologyService().localMember(),
          partitionsConfigurer
  );

  assertThat(client.refreshLeader(), willCompleteSuccessfully());

ItDisasterRecoverySystemViewTest

There are 2 failed tests with a common reason:

  • testGlobalPartitionStatesSystemView
  • testLocalPartitionStatesSystemView

Reasons

After leader acquisition and log replication, some partitions have the LocalPartitionStateEnum#INITIALIZING state instead of HEALTHY (see LocalPartitionStateEnumWithLogIndex#of), which leads to the UNAVAILABLE state in the system table for the corresponding partition instead of AVAILABLE (see DisasterRecoveryManager:734). That makes the check on the status column fail.

Solution

The observation itself is correct; the issue is in the test's expectations: the test wants AVAILABLE and would eventually observe it, but it doesn't wait for that now, relying instead on a quick refreshLeader completion. The solution is simply to refresh the leader on each partition and wait for successful completion. Then the test obtains the desired state:

private static void waitLeaderOnAllPartitions(String tableName, int partitionsCount) {
    IgniteImpl node = ((RestartProofIgnite) CLUSTER.node(0)).unwrap(IgniteImpl.class);

    TableImpl table = ((PublicApiThreadingTable) node.tables().table(tableName)).unwrap(TableImpl.class);

    int tableId = table.tableId();

    IntStream.range(0, partitionsCount).forEach(partId -> assertThat(
            node.replicaManager()
                    .replica(new TablePartitionId(tableId, partId))
                    .thenCompose(replica -> replica.raftClient().refreshLeader()),
            willCompleteSuccessfully()
    ));
}

ActiveActorTest#testChangeLeaderWhenActualLeft

Reasons

There are 2 issues that aren't related to getLeader directly:

  • For some reason the "localhost" check fails (I got various 192.168.*.* addresses instead, which weren't resolved as localhost), but there is no actual need to check hosts. Moreover, without the fix we end up blocked just because of the host issue, even though the leader has already changed.
  • Then, after the leader has changed, we can't get it from leaderRef -- it is no longer available there and won't be updated.

Solution

The first issue is fixed by checking the port only. The second is fixed by using leaderRefNoInitialNotify instead of leaderRef.
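
A minimal sketch of the adjusted check (expectedLeader and the surrounding assertions are assumptions; only leaderRefNoInitialNotify comes from the actual test):

    // Sketch: compare only the port, since the resolved host may be a concrete
    // interface address (192.168.*.*) rather than "localhost", and read the new
    // leader from leaderRefNoInitialNotify, which is still updated after the change.
    ClusterNode newLeader = leaderRefNoInitialNotify.get();

    assertNotNull(newLeader);
    assertEquals(expectedLeader.address().port(), newLeader.address().port());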

RaftGroupServiceTest

Reasons

These unit tests actively use the getLeader flag explicitly for different test invariants, so removing it breaks those invariants.

Solutions

The getLeader flag is removed from the raft client creation method, and a new method startRaftGroupServiceWithRefreshLeader is introduced. The method creates a raft client with a mock for leader requests and then calls refreshLeader. So, tests that check the leader use the new method, and the others keep using the original startRaftGroupService without the flag.
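
A hedged sketch of what the new helper might look like (parameters and mocking details are omitted; only the method names come from this PR):

    // Sketch: same client creation as before, plus an eager refreshLeader so that
    // tests asserting on leader() keep their original invariant.
    private RaftGroupService startRaftGroupServiceWithRefreshLeader() {
        RaftGroupService service = startRaftGroupService();

        assertThat(service.refreshLeader(), willCompleteSuccessfully());

        return service;
    }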

ItTableRaftSnapshotsTest

The most complicated test class, but in general there are 2 major issues. They showed up in the following tests, though potentially every test in the class could fail:

  • nodeCanInstallSnapshotsAfterSnapshotInstalledToIt
  • testChangeLeaderOnInstallSnapshotInMiddle
  • entriesKeepAppendedAfterSnapshotInstallation
  • snapshotInstallationRepeatsOnTimeout

AssertionError: Unexpected leadership change on group: 10_part_0

Reasons

This problem emerges in doSnapshotOn processing with the following scenario:

  1. We transfer leadership to a node, e.g. node 0.
  2. Then we want to take a snapshot on the leader node 0.
  3. Before snapshotting begins, we check that the leader hasn't changed, and the assertion fails.

It happens because primary replica acceptance is in progress in parallel. The situation is:

  1. We have nodes [0, 1, 2].
  2. Leadership is transferred to node 0.
  3. A LeaseGrantMessage goes to node 1 with the force flag (neither 0 nor 1 knows the actual RAFT leader yet, so we can't collocate the lease with the leader, which is why node 1 is forcibly chosen).
  4. Node 0 becomes the leader.
  5. The LeaseGrantMessage with force == true triggers a leadership transfer to the leaseholder -- node 1.
  6. The assert in doSnapshotOn fails because the leader was changed due to the forced lease acceptance on node 1.

Solution

The described scenario is legal for Ignite, but in the test scenario it interferes with the desired invariant checks. Thus, the simple and correct solution is to wait until the lease is accepted and only then transfer leadership, using the newly introduced waitForPrimaryReplica. It is placed after cluster initialization so that it covers each test case.
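
A hedged sketch of such a wait, assuming it goes through the placement driver's awaitPrimaryReplica; the method shape and the timeout value are illustrative:

    // Sketch: block until a lease is accepted for the group, so that a later
    // transferLeadership cannot be overridden by a forced LeaseGrantMessage.
    private static void waitForPrimaryReplica(IgniteImpl node, TablePartitionId groupId) {
        CompletableFuture<ReplicaMeta> primaryFut = node.placementDriver()
                .awaitPrimaryReplica(groupId, node.clock().now(), 30, TimeUnit.SECONDS);

        assertThat(primaryFut, willCompleteSuccessfully());
    }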

PrimaryReplicaMissException with currentEnlistmentConsistencyToken == 1

Reasons

In the failed tests (the fail rate for nodeCanInstallSnapshotsAfterSnapshotInstalledToIt is ~25%) we see the following picture:

  1. Nodes [0, 1, 2] are started.
  2. A lease is granted to node 1.
  3. The RAFT leader is transferred to node 0.
  4. Node 2 is shut down.
  5. A row is put into a table and a snapshot is made.
  6. Node 2 is restarted.
  7. Leadership is transferred to node 2.
  8. Node 0 is stopped.
  9. Trying to put a new row again fails on node 2 with the exception from the title.

The reason is that internally we have leaseStartTime, which serves as currentEnlistmentConsistencyToken in UpdateCommandResponse. This request-response pair is a raft command. When we try to put a value, the request goes to node 1 as the primary replica; it creates the command and sends it to the leader (after step 7 that is node 2). But the leader sees that internally it still has an uninitialized leaseStartTime == 1, which doesn't match the unix-time-like token from node 1's message. That's why the leader sends a response with the token-mismatch flag turned on and its currentEnlistmentConsistencyToken == 1. The remaining question is why it is uninitialized: because some meta information (such as leaseStartTime) that is stored in the raft log wasn't restored during snapshot installation when node 2 started. There is a corresponding issue: IGNITE-23379.

Solution

Since all these tests check that snapshots are installed (e.g. with snapshotInstalledLatch), they may all be affected until IGNITE-23379 is fixed. So this test class is disabled for a while.
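
Disabling the class might look as follows (a standard JUnit 5 annotation; the reason string is illustrative):

    // Sketch: the whole class is skipped until the snapshot metadata restore issue is resolved.
    @Disabled("https://issues.apache.org/jira/browse/IGNITE-23379")
    class ItTableRaftSnapshotsTest {
        // existing test methods stay unchanged
    }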

JVM crash on ItPlacementDriverReplicaSideTest

Reasons

The main reason is an unhandled exception from raftClient.refreshLeader().get(), which is actually a TimeoutException during the leader election process. We have retryTimeout=2000, but leader election may take a bit longer. Then an uncaught exception is thrown, the replicas are not stopped, and that leads to an access attempt on an already closed RocksDB instance, which crashes the JVM.

Solution

There are 2 actions:

  • Increase the timeout up to 4000 ms.
  • Move the node set initialization to the end of @BeforeEach and stopReplicationGroup(GROUP_ID, grpNodes) to the beginning of @AfterEach (see the sketch below).
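
A hedged sketch of the reordered teardown (only stopReplicationGroup, GROUP_ID and grpNodes come from the description; the rest of the method is illustrative):

    // Sketch: stop the replication group first, so that even if the test body failed
    // (e.g. with a TimeoutException from refreshLeader), replicas never touch an
    // already closed RocksDB instance and crash the JVM.
    @AfterEach
    void tearDown() throws Exception {
        stopReplicationGroup(GROUP_ID, grpNodes);

        // ... the rest of the existing cleanup (stopping nodes, executors, etc.) follows.
    }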

@JAkutenshi JAkutenshi marked this pull request as ready for review October 8, 2024 14:29
assertNotNull(raftClient);

raftClient.refreshLeader().get();
raftClient.refreshLeader().get(WAIT_TIMEOUT_MILLIS, TimeUnit.MILLISECONDS);
Contributor

Why do we need an external timeout here? I mean that refreshLeader is internally time-bounded:

            if (currentTimeMillis() >= stopTime) {
                fut.completeExceptionally(
                        new TimeoutException(format("Send with retry timed out [retryCount = {}, groupId = {}].", retryCount, groupId)));

                return;
            }

?

Contributor Author

@JAkutenshi JAkutenshi Oct 11, 2024

I didn't notice the internal limitation; I was just concerned by the get() without a timeout. Removed.

Contributor

Please check other places, e.g. on line 237.

Contributor Author

Done

@sanpwc sanpwc merged commit 89094b9 into apache:main Oct 11, 2024
@sanpwc sanpwc deleted the ignite-23019 branch October 11, 2024 11:27