merge: #10776
10776: Stop raft server when going inactive due to unrecoverable errors r=deepthidevaki a=deepthidevaki

## Description

Previously, on an unrecoverable error the raft server only transitioned to inactive. However, the member is still marked as active in the configuration. As a result, the member transitions back to active when it receives a new message from the leader. We cannot change the configuration and mark this member as inactive, because that would change the quorum. What we really require is that this partition is "dead" (at least temporarily) so that it doesn't become leader again. We also don't want it to become a follower, because this can lead to partial functionality, which can cause problems. For example, in the follower role, raft keeps replicating events, but the stream processor or snapshotting is not working because of this error, so the node is not able to compact the logs. This eventually fills up the disk and thus affects other, possibly healthy, partitions.

To fix this, in this PR we stop the raft server instead of only transitioning to inactive. The replication factor and quorum remain the same, but this node cannot become leader again until the member is restarted.
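
The difference between the two behaviours can be illustrated with a minimal, self-contained sketch. This is a toy model under stated assumptions, not the actual raft implementation: `ToyRaftMember`, its fields, and `onLeaderMessage()` are hypothetical names introduced for illustration, and only `goInactive()` and `stop()` mirror the method names touched by this PR. The sketch also assumes that a stopped server reports the inactive role.

```java
/** Toy model only: none of these types come from Zeebe or Atomix. */
final class ToyRaftMember {
  enum Role { FOLLOWER, LEADER, INACTIVE }

  private Role role = Role.FOLLOWER;
  private boolean running = true;
  // In both cases the configuration still lists the member as active,
  // so the replication factor and quorum are unchanged.
  private final boolean activeInConfiguration = true;

  /** Old behaviour as described above: only the role changes. */
  void goInactive() {
    role = Role.INACTIVE;
  }

  /** New behaviour as described above: the server stops (assumed to report INACTIVE). */
  void stop() {
    role = Role.INACTIVE;
    running = false;
  }

  /** Hypothetical hook simulating a new message from the current leader. */
  void onLeaderMessage() {
    if (!running) {
      return; // a stopped server does not process messages
    }
    if (role == Role.INACTIVE && activeInConfiguration) {
      role = Role.FOLLOWER; // the member is pulled back into an active role
    }
  }

  Role role() {
    return role;
  }
}

final class StopVsInactiveDemo {
  public static void main(String[] args) {
    ToyRaftMember inactiveOnly = new ToyRaftMember();
    inactiveOnly.goInactive();
    inactiveOnly.onLeaderMessage();
    System.out.println("goInactive, then leader message: " + inactiveOnly.role());

    ToyRaftMember stopped = new ToyRaftMember();
    stopped.stop();
    stopped.onLeaderMessage();
    System.out.println("stop, then leader message: " + stopped.role());
  }
}
```

In this model the member that only went inactive ends up as `FOLLOWER` again after the leader's message, while the stopped member stays `INACTIVE` until a new instance is created, which stands in for a restart.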

## Related issues

closes #9924



Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
zeebe-bors-camunda[bot] and deepthidevaki committed Oct 24, 2022
2 parents 23a30b9 + 346ce34 commit f3d2bb4
Showing 3 changed files with 8 additions and 18 deletions.

@@ -246,8 +246,8 @@ private boolean shouldStepDown() {
         && primary.get() != server.getMemberId();
   }

-  public CompletableFuture<Void> goInactive() {
-    return server.goInactive();
+  public CompletableFuture<Void> stop() {
+    return server.stop();
   }

   public PartitionMetadata getMetadata() {

@@ -366,16 +366,15 @@ private void handleRecoverableFailure() {
           context.getPartitionId(),
           context.getCurrentRole(),
           context.getCurrentTerm());
-      context.getRaftPartition().goInactive();
+      context.getRaftPartition().stop();
     }
   }

   private void handleUnrecoverableFailure(final Throwable error) {
     final var report = HealthReport.dead(this).withIssue(error);
     healthMetrics.setDead();
     zeebePartitionHealth.onUnrecoverableFailure(error);
-    transitionToInactive();
-    context.getRaftPartition().goInactive();
+    context.getRaftPartition().stop();
     failureListeners.forEach((l) -> l.onUnrecoverableFailure(report));
     context.notifyListenersOfBecomingInactive();
   }

@@ -232,7 +232,7 @@ public void shouldNotTriggerTransitionOnPartitionTransitionException()
     order.verify(transition).toLeader(2);
     // after failing leader transition no other
     // transitions are triggered
-    order.verify(raft, times(0)).goInactive();
+    order.verify(raft, times(0)).stop();
     order.verify(transition, times(0)).toFollower(anyLong());
   }

@@ -251,7 +251,7 @@ public void shouldGoInactiveAfterFailedFollowerTransition() throws InterruptedEx
             });
     when(raft.getRole()).thenReturn(Role.FOLLOWER);
     when(ctx.getCurrentRole()).thenReturn(Role.FOLLOWER);
-    when(raft.goInactive())
+    when(raft.stop())
         .then(
             invocation -> {
               partition.onNewRole(Role.INACTIVE, 2);
@@ -267,37 +267,28 @@ public void shouldGoInactiveAfterFailedFollowerTransition() throws InterruptedEx
     // then
     final InOrder order = inOrder(transition, raft);
     order.verify(transition).toFollower(0L);
-    order.verify(raft).goInactive();
+    order.verify(raft).stop();
     order.verify(transition).toInactive(anyLong());
   }

   @Test
   public void shouldGoInactiveIfTransitionHasUnrecoverableFailure() throws InterruptedException {
     // given
-    final CountDownLatch latch = new CountDownLatch(1);
     when(transition.toLeader(anyLong()))
         .thenReturn(
             CompletableActorFuture.completedExceptionally(new UnrecoverableException("expected")));
-    when(transition.toInactive(anyLong()))
-        .then(
-            invocation -> {
-              latch.countDown();
-              return CompletableActorFuture.completed(null);
-            });
     when(raft.getRole()).thenReturn(Role.LEADER);
     when(raft.term()).thenReturn(1L);

     // when
     schedulerRule.submitActor(partition);
     partition.onNewRole(raft.getRole(), raft.term());
     schedulerRule.workUntilDone();
-    assertThat(latch.await(30, TimeUnit.SECONDS)).isTrue();

     // then
     final InOrder order = inOrder(transition, raft);
     order.verify(transition).toLeader(0L);
-    order.verify(transition).toInactive(anyLong());
-    order.verify(raft).goInactive();
+    order.verify(raft).stop();
   }

   @Test
