When installing services fails during a transition, the partition does not recover and becomes inactive #8717

Closed · 1 task done
romansmirnov opened this issue on Feb 2, 2022 · 12 comments · Fixed by #8767
Labels: kind/bug, scope/broker, severity/low, support, version:1.3.5

Comments

@romansmirnov (Member) commented Feb 2, 2022

Describe the bug

When the Raft layer transitions to the LEADER role, it triggers a transition to the LEADER role on the Zeebe layer after committing the initial entry. However, the transition to the LEADER role might fail on the Zeebe layer. When this happens, the Zeebe layer forces the Raft layer to transition to the INACTIVE state. But transitioning to the INACTIVE state might conflict with an already ongoing transition in the Raft layer. This can leave the partition inactive on this node, so that it rejects all incoming append requests.

The following illustrates how this state can be reached:

  1. 2022-01-28 13:18:13.387: The Raft layer transitions to the LEADER role on partition 22.
     2022-01-28 13:18:13.387 [] [raft-server-0-raft-partition-partition-22] INFO 
       io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-22} - Transitioning to LEADER
  2. 2022-01-28 13:18:25.994: The Raft layer requests the Zeebe layer to transition to the LEADER role for term 2.
     2022-01-28 13:18:25.994 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-3] INFO 
       io.camunda.zeebe.broker.system - Transition to LEADER on term 2 requested.
  3. 2022-01-28 13:18:25.995: On the Zeebe layer, the transition to the LEADER role fails because the term changed from 2 to 3 in the meantime. (Please note: in the meantime, another node became the leader of partition 22 in term 3.)
     2022-01-28 13:18:25.995 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-3] ERROR
       io.camunda.zeebe.broker.system - Expected that current term '2' is same as raft term '3', but was not. Failing installation of 'LogStoragePartitionStep' on partition 22.
     java.lang.IllegalStateException: Expected that current term '2' is same as raft term '3', but was not. Failing installation of 'LogStoragePartitionStep' on partition 22.
     	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.checkAndCreateAtomixLogStorage(LogStoragePartitionTransitionStep.java:123) ~[zeebe-broker-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
     	...
  4. 2022-01-28 13:18:25.996: The Zeebe layer forces the Raft layer to transition to the INACTIVE role.
     2022-01-28 13:18:25.996 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-3] INFO 
       io.camunda.zeebe.broker.system - Unexpected failure occurred in partition 22 (role FOLLOWER, term 2), transitioning to inactive
  5. 2022-01-28 13:18:26.239: But before transitioning to the INACTIVE state, the Raft layer first handles a request from the current/new leader, recognizes that the term changed, and triggers a transition to FOLLOWER.
     2022-01-28 13:18:26.239 [] [raft-server-0-raft-partition-partition-22] INFO 
       io.atomix.raft.roles.LeaderRole - RaftServer{raft-partition-partition-22}{role=LEADER} - Received greater term from 1
     2022-01-28 13:18:26.239 [] [raft-server-0-raft-partition-partition-22] INFO 
       io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-22} - Transitioning to FOLLOWER
  6. 2022-01-28 13:18:26.275: After transitioning to FOLLOWER, the Raft layer transitions to the INACTIVE role.
     2022-01-28 13:18:26.275 [] [raft-server-0-raft-partition-partition-22] INFO 
       io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-22} - Transitioning to INACTIVE
  7. As a consequence, the Zeebe layer first tries to transition to FOLLOWER, which gets canceled, and subsequently transitions to INACTIVE:
     2022-01-28 13:18:26.278 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-1] DEBUG
       io.camunda.zeebe.broker.system - Partition role transitioning from LEADER to FOLLOWER in term 3
     ...
     2022-01-28 13:18:26.278 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-1] INFO 
       io.camunda.zeebe.broker.system - Transition to INACTIVE on term 2 requested.
     ...
     2022-01-28 13:18:26.278 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-1] INFO 
       io.camunda.zeebe.broker.system - Received cancel signal for transition to FOLLOWER on term 3
     ...
     2022-01-28 13:18:26.278 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-1] INFO 
       io.camunda.zeebe.broker.system - Cancelling transition to FOLLOWER on term 3
     ...
     2022-01-28 13:18:26.278 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-1] INFO 
       io.camunda.zeebe.broker.system - Prepare transition from FOLLOWER on term 3 to INACTIVE
     ...
     2022-01-28 13:18:26.279 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-1] INFO 
       io.camunda.zeebe.broker.system - Transition to INACTIVE on term 2 completed

As a consequence, the node becomes inactive on partition 22. That means the node will reject all incoming requests on that partition. Thus, the leader is not able to replicate any entries to that follower.

Observed Behavior:
In a nutshell, there are two problems:

  1. Raft's term might be updated asynchronously, and it is accessed directly by the LogStoragePartitionTransitionStep when transitioning to LEADER in the Zeebe layer:
    https://github.com/camunda-cloud/zeebe/blob/bba1d425614f76698a335132f4f8c2d614e97537/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/impl/steps/LogStoragePartitionTransitionStep.java#L113-L128
    Basically, the term changes so that this transition step fails in the Zeebe layer. Typically, when the term changes, the Raft layer immediately transitions (from LEADER) to FOLLOWER. But because the Zeebe layer forces the Raft layer into the inactive role, the Raft layer becomes inactive instead (a simplified sketch of the failing term check follows this list):
    https://github.com/camunda-cloud/zeebe/blob/bba1d425614f76698a335132f4f8c2d614e97537/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/ZeebePartition.java#L357-L362
  2. The transition requested by the Zeebe layer in the Raft layer is not bound to the current term. Meaning, if the Raft layer is already on term 3 but the request to transition to inactive was made on term 2, then the transition to inactive could be ignored (which is currently not the case).
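For illustration, here is a simplified sketch of the failing term check referenced in point 1. All names only approximate the linked LogStoragePartitionTransitionStep; signatures and types are illustrative, not the actual Zeebe code.

```java
// Hypothetical, simplified sketch of the term check described above (not the real class).
final class LogStorageStepSketch {

  /**
   * transitionTerm: the term the LEADER transition was requested for (known to Zeebe).
   * currentRaftTerm: Raft's term, which is updated asynchronously on the Raft thread.
   */
  void checkTerm(final long transitionTerm, final long currentRaftTerm, final int partitionId) {
    if (transitionTerm != currentRaftTerm) {
      // Another node became leader in the meantime, so the step fails the installation,
      // which in the reported scenario ends up forcing Raft into the INACTIVE role.
      throw new IllegalStateException(
          String.format(
              "Expected that current term '%d' is same as raft term '%d', but was not. "
                  + "Failing installation of 'LogStoragePartitionStep' on partition %d.",
              transitionTerm, currentRaftTerm, partitionId));
    }
    // Otherwise the step would create the log storage for this term (omitted here).
  }
}
```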

Expected behavior

  • If the installation of the Zeebe services fails because of a term change, then a specific exception should be thrown:
    • The exception indicates that the transition was aborted intentionally.
    • The exception should be handled by taking no specific action, because the state is transient and a new transition will be scheduled anyway.

Additional information:

Environment:

  • Zeebe Version: 1.4.0-SNAPSHOT

related to SUPPORT-12825

@romansmirnov added the kind/bug, scope/broker, severity/low, and Impact: Availability labels on Feb 2, 2022
@romansmirnov (Member, Author) commented

@npepinpe, I summarized what we have discussed. Feel free to adjust the description if something is missing.

@romansmirnov (Member, Author) commented

@deepthidevaki, while I understand what happened in this case, there is still one missing piece that is unclear to me: what triggered the term change so that the LogStoragePartitionTransitionStep failed?

To me, it is clear in which situations the term changes (for example, when receiving a request with a higher term), but typically, I would expect the Raft layer to transition immediately to the FOLLOWER role (if it is currently in the LEADER role). In this specific case, however, that transition didn't happen right away. It is also clear that the term changes asynchronously in the Raft layer, but looking at the timestamps, i.e., when the exception happened and when the Raft layer eventually transitioned to the FOLLOWER role, there is a gap of ~240ms. So somehow they do not match.

Do you see any case where the term changes but a role transition is not triggered? Maybe we can discuss this issue together 😄

@romansmirnov added the support label on Feb 2, 2022
@deepthidevaki (Contributor) commented

However, the transition to the LEADER role might fail on the Zeebe layer. When this happens, the Zeebe layer forces the Raft layer to transition to the INACTIVE state.

As far as I remember, this is not expected. It should transition to FOLLOWER when the transition to leader fails. If the transition to follower fails, then it should transition to INACTIVE because there is no other fallback option. So I guess there is a bug there.

@deepthidevaki (Contributor) commented

ERROR io.camunda.zeebe.broker.system - Expected that current term '2' is same as raft term '3', but was not.

This should ideally not be treated as an install failure; rather, it is an expected case. We have this check to ensure consistency, and failing the install is the way to prevent any inconsistency errors. We should handle this error differently, I guess.
We have an issue to reduce the log level of this error (#7309), but I think that is not enough. Instead we should handle this error differently, or better, prevent this case.

@deepthidevaki (Contributor) commented

Do you see any case where the term changes but a role transition are not triggered?

The term can change when it is in the follower role. Then no transitions are triggered.

But in this case, I think the following happened:

  1. Raft sets the new term 3.
  2. LogStoragePartitionTransitionStep reads the new term and fails.
  3. Raft transitions to follower in term 3.

Steps 1 and 3 are executed on the Raft thread. Step 2 is executed on a Zeebe actor thread. This might be why there is this time gap.

@romansmirnov (Member, Author) commented Feb 3, 2022

To that point:

As far as I remember, this is not expected. It should transition to FOLLOWER when the transition to leader fails. If the transition to follower fails, then it should transition to INACTIVE because there is no other fallback option.

The situation is a bit different:

  1. On the Raft layer, it is already in the LEADER role.
  2. On the Zeebe layer, it is currently transitioning from FOLLOWER to LEADER. Because the term changed in the Raft layer in the meantime, the Raft layer will eventually transition to FOLLOWER (but no corresponding transition on the Zeebe layer has been requested yet).
  3. Since the transition fails on the Zeebe layer, it submits a transition to INACTIVE in the Raft layer (because the current role in the Zeebe layer is FOLLOWER - and not LEADER - so the fallback you mentioned applies here):
    https://github.com/camunda-cloud/zeebe/blob/4d3ad96196be1bf1e3b6e83b6590fdd837bceae4/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/ZeebePartition.java#L356-L363

To avoid going into INACTIVE in the Raft layer, @npepinpe and I thought of binding the requested transition to the term. Basically, when the Zeebe layer requests the Raft layer to transition to INACTIVE, it passes its currently known term to Raft. That way, Raft can compare its term with Zeebe's passed term (a minimal sketch follows this list):

  • If Raft's term is higher than Zeebe's passed term, then ignore the transition request and stay in the current role. (In the situation above, the Raft layer would remain in the FOLLOWER role and not transition to INACTIVE.)
  • If Raft's term is equal to Zeebe's passed term, then execute the transition request (i.e., transition to INACTIVE).
  • (The case where Raft's term is lower than Zeebe's passed term does not apply here.)
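For illustration, a minimal sketch of such a term-bound transition request on the Raft side. All class, method, and field names here are hypothetical placeholders, not the actual Raft/Zeebe API.

```java
// Hypothetical sketch of binding a transition request to a term, as proposed above.
final class TermBoundTransitionHandler {

  enum Role { LEADER, FOLLOWER, INACTIVE }

  private long currentTerm;
  private Role currentRole = Role.FOLLOWER;

  /** Called with the term the Zeebe layer knew when it requested the transition. */
  synchronized void requestTransition(final Role requestedRole, final long requestedTerm) {
    if (currentTerm > requestedTerm) {
      // Raft has already moved to a newer term (e.g. it is FOLLOWER in term 3 while the
      // request targets term 2): ignore the stale request and keep the current role.
      return;
    }
    // Terms are equal: execute the requested transition (e.g. to INACTIVE).
    currentRole = requestedRole;
  }

  /** Called by Raft itself whenever it observes a higher term. */
  synchronized void onNewTerm(final long newTerm) {
    currentTerm = Math.max(currentTerm, newTerm);
  }

  synchronized Role currentRole() {
    return currentRole;
  }
}
```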

@deepthidevaki (Contributor) commented

else if (context.getCurrentRole() == Role.FOLLOWER) {

I would say this is the bug. Since transitioning to LEADER failed, it should take that into account and transition to FOLLOWER. So comparing against the currentRole, which is "follower", is incorrect here.

To avoid going into INACTIVE in the Raft layer, @npepinpe and I thought of binding the requested transition to the term. Basically, when the Zeebe layer requests the Raft layer to transition to INACTIVE, it passes its currently known term to Raft. That way, Raft can compare its term with Zeebe's passed term:

This makes sense to me. If there is already a new transition triggered by Raft, then we can ignore the transition triggered by ZeebePartition. But Zeebe should trigger a transition to Inactive only when there is no other way to recover from the failure. In this case, Zeebe should not have transitioned to inactive. So I think we should look into it.

In addition, it would also be interesting to think about how we can avoid the error "ERROR io.camunda.zeebe.broker.system - Expected that current term '2' is same as raft term '3'", which triggered this transition.

@npepinpe (Member) commented Feb 3, 2022

Considering that transitions are asynchronous and concurrent between the two layers, does it make sense for a step in the ZeebePartition transition to peek directly at the current value of the Raft term? I understand what we wanted to achieve (i.e., short-circuit appending if the term has changed since we requested its installation), but I wonder if perhaps this doesn't make it more complicated, mixing two different synchronization models (i.e., using actors/a single thread for synchronization to execute work requests between the two contexts, and directly peeking at the state, bypassing this model). Should the term of the append be propagated all the way downstream until it hits the LeaderAppender, at which point it can be rejected? Though that could also be considered a leaky abstraction... 🤔 Possibly the instance of the ZeebeLogAppender implementation carries the term (since that instance is Raft specific) and appends it to requests to the LeaderAppender? I know currently the ZeebeLogAppender is directly the LeaderRole, but we could instead create a proxy tied to the transition's term (a rough sketch follows below)? Anything to simplify the current state 😄
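For illustration, a rough sketch of that proxy idea, assuming a hypothetical appender interface; the real ZeebeLogAppender/LeaderRole types and signatures differ.

```java
// Hypothetical sketch: a per-transition appender proxy that carries the transition's term,
// so the Raft side can reject appends from a stale term instead of the transition step
// peeking at Raft's current term.
final class TermBoundLogAppenderSketch {

  /** Stand-in for the Raft-side appender (e.g. the LeaderRole in the real code). */
  interface RaftAppender {
    void append(long term, byte[] block);
  }

  private final RaftAppender delegate;
  private final long transitionTerm;

  TermBoundLogAppenderSketch(final RaftAppender delegate, final long transitionTerm) {
    this.delegate = delegate;
    this.transitionTerm = transitionTerm;
  }

  void append(final byte[] block) {
    // The delegate, running on the Raft thread, can compare transitionTerm with its current
    // term and drop the block if they no longer match.
    delegate.append(transitionTerm, block);
  }
}
```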

Anyhow, we should focus first on what you two mentioned: handling the recoverable error properly, and also tying the transition requests to a logical clock.

I will schedule this for alpha2. It's a little hard for me to estimate right now whether we should start immediately, but I think next week would be the latest if we're being conservative. Let me know if you think otherwise. I'll probably ask one of you to work on this and the other to review, but you're welcome to pair as well if you'd like 🙂

@deepthidevaki (Contributor) commented Feb 3, 2022

The check is not only there to short-circuit, but to prevent inconsistency. We had a bug where the LogStorageAppender from an older term wrote to the LeaderRole in a newer term, thus messing up the record positions. This check was added to prevent that case. I also agree that peeking at the Raft term is not an ideal way. If we can find a better way to prevent a service from an older term from accessing Raft at a newer term, that would be preferred.

@romansmirnov (Member, Author) commented

@npepinpe, a bit unrelated to the discussion so far, but when we discussed this issue yesterday, we were wondering what caused the term to change. As @deepthidevaki mentioned, the following happens asynchronously:

But in this case, I think the following happened:

  1. Raft sets the new term 3.
  2. LogStoragePartitionTransitionStep reads the new term and fails.
  3. Raft transitions to follower in term 3.

Steps 1 and 3 are executed on the Raft thread. Step 2 is executed on a Zeebe actor thread. This might be why there is this time gap.

This is exactly what happens, but I was wondering why there is a "big" time gap of ~240ms:

2022-01-28 13:18:25.996 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-3] INFO 
      io.camunda.zeebe.broker.system - Unexpected failure occurred in partition 22 (role FOLLOWER, term 2), transitioning to inactive
...
2022-01-28 13:18:26.239 [] [raft-server-0-raft-partition-partition-22] INFO 
      io.atomix.raft.roles.LeaderRole - RaftServer{raft-partition-partition-22}{role=LEADER} - Received greater term from 1
2022-01-28 13:18:26.239 [] [raft-server-0-raft-partition-partition-22] INFO 
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-22} - Transitioning to FOLLOWER

In a different context, I noticed that writes to Raft's metastore spike up to ~350ms (and more), which would explain the time gap between these two events. Broker 0 handles the vote request coming from broker 1:

https://github.com/camunda-cloud/zeebe/blob/4d3ad96196be1bf1e3b6e83b6590fdd837bceae4/atomix/cluster/src/main/java/io/atomix/raft/roles/LeaderRole.java#L464-L469

Basically, the following happens:

  1. While handling the vote request from broker 1, the Raft thread executes #updateTermAndLeader():
    https://github.com/camunda-cloud/zeebe/blob/4d3ad96196be1bf1e3b6e83b6590fdd837bceae4/atomix/cluster/src/main/java/io/atomix/raft/roles/AbstractRole.java#L116-L130
  2. The Raft thread updates the term and writes the changed term and the last vote back to the metastore (at this stage, the term has changed, but no Raft transition has been triggered yet):
    https://github.com/camunda-cloud/zeebe/blob/4d3ad96196be1bf1e3b6e83b6590fdd837bceae4/atomix/cluster/src/main/java/io/atomix/raft/impl/RaftContext.java#L1021-L1022
  3. The broker thread that executes the transition to LEADER in the Zeebe layer reads the new term, and the installation fails.
  4. After ~240ms, the Raft thread finishes writing to the metastore and eventually triggers the Raft transition to FOLLOWER.

To my knowledge, this would explain the term change and the delayed Raft transition.
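For illustration, a self-contained sketch of that sequence. The names only loosely mirror the linked Raft code, and the sleep stands in for the slow metastore write; this is an assumption-laden illustration of the timing, not the real implementation.

```java
// Hypothetical sketch of the timing described above: the term becomes visible before the
// role transition is triggered, and the metastore write in between can take hundreds of ms.
final class VoteRequestTimingSketch {

  enum Role { LEADER, FOLLOWER }

  private volatile long term = 2;           // readable by Zeebe actor threads
  private volatile Role role = Role.LEADER;

  void onVoteRequestWithHigherTerm(final long requestTerm) throws InterruptedException {
    if (requestTerm > term) {
      term = requestTerm;                   // steps 1+2: term changes and is immediately visible
      persistTermToMetaStore();             // slow, synchronous disk write (~240ms in the incident)
      role = Role.FOLLOWER;                 // step 4: transition only after the write finishes
    }
  }

  private void persistTermToMetaStore() throws InterruptedException {
    // Stand-in for the metastore flush; the delay is the gap between the failed Zeebe
    // install (which already saw the new term) and the Raft FOLLOWER transition.
    Thread.sleep(250);
  }
}
```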

@romansmirnov self-assigned this on Feb 4, 2022
@npepinpe added this to Ready in Zeebe on Feb 4, 2022
@npepinpe moved this from Ready to Review in progress in Zeebe on Feb 7, 2022
@npepinpe moved this from Review in progress to In progress in Zeebe on Feb 7, 2022
@romansmirnov (Member, Author) commented

@deepthidevaki and I discussed potential solutions. Just listing a few of them:

  1. Tie transition requests to a term
    The idea would be to pass the Zeebe layer's currently known term to the Raft layer when requesting a transition to inactive. However, this approach may cause other issues. For example, when the transition to follower fails for whatever reason in the Zeebe layer, we want the Raft layer to become inactive. If this transition-to-inactive request is tied to the term, it could happen that the Raft layer changes the term without changing its role (for example, there is a new leader, or someone started an election round, etc.). In such a scenario, Zeebe could end up in a state where some services are installed and others are not, while the node remains a follower in the Raft layer. Maybe the snapshot service is not installed, so no snapshots are taken on that partition (and no snapshots are replicated by the leader because the follower is not lagging). Without snapshots, there is no compaction. Without compaction, disk space won't be freed, so the broker might run out of disk space, which potentially impacts the other partitions on that broker.

  2. Tie LogStorageAppender append blocks to a term
    Basically, don't check the terms when initializing the LogStorage; instead, initialize the LogStorageAppender with the current term. Whenever the LogStorageAppender submits a block/fragment to the LogStorage (i.e., the LeaderRole), Raft checks whether the block to append belongs to the same term Raft is currently in. If not, the block is dropped.

  3. Don't request a transition to inactive
    Actually, the exception thrown during the installation of the LogStorage is expected when a term change happened in the meantime (see #7309, "Reduce the log level or better convey the message when leader install fails due to concurrent role transition"). The term change happens because the broker is no longer the leader for that partition, so we can be sure that a new Zeebe transition to follower gets scheduled. Eventually, the system gets into a consistent state, i.e., being a follower in both the Raft and Zeebe layers. To reach that state, we could introduce a specific exception that gets handled by the ZeebePartition by "simply" dropping it (a minimal sketch follows this list).
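For illustration, a minimal sketch of the third option, assuming a hypothetical exception type and handler; the actual types introduced by the fix may be named and wired differently.

```java
// Hypothetical sketch of option 3: a dedicated exception signals that the install failed only
// because of a concurrent term change, and the partition handles it by doing nothing, since a
// new transition (to FOLLOWER) is already scheduled by Raft.
final class TermMismatchHandlingSketch {

  /** Thrown by the log storage step when the transition's term no longer matches Raft's term. */
  static final class UnexpectedTermException extends RuntimeException {
    UnexpectedTermException(final long expectedTerm, final long raftTerm) {
      super("Expected term " + expectedTerm + " but Raft is already on term " + raftTerm);
    }
  }

  void onInstallFailure(final Throwable error) {
    if (error instanceof UnexpectedTermException) {
      // Expected and transient: do NOT force Raft to INACTIVE, just drop the failure.
      return;
    }
    // Any other failure keeps the existing fallback behaviour (step down / go inactive).
    goInactive();
  }

  private void goInactive() {
    // existing fallback, omitted
  }
}
```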

We decided to follow the third solution to resolve this problem. Also, we decided to create two separate issues:

  1. Refactor LogStoragePartitionTransitionStep so that it does not access Raft's current term
  2. Fix ZeebePartition#handleRecoverableFailure(), i.e., context.getCurrentRole() == FOLLOWER assumes that the Zeebe layer is already in the follower state, which is no longer the case (after refactoring the transition behavior in Zeebe)

@npepinpe, please let me know if you have any questions or concerns.

@npepinpe (Member) commented Feb 9, 2022

Sounds reasonable, and I understand the first approach has its own issues.

💭 I think the way the Raft layer and ZeebePartition layer communicate generally seems overly complicated. Even the proposed fixes, which are valuable, only address specific cases - it's still easy to shoot ourselves in the foot in the future by accessing the Raft layer directly in one of the transition steps, or by forgetting to throw the right exception for recoverable cases. I'm just thinking out loud here; I still think you should go ahead with your plan, but I think it will soon be time to take a step back and come up with a better approach, or improve the current one, for coordinating these state transitions to make them less error-prone.

@npepinpe moved this from In progress to Review in progress in Zeebe on Feb 24, 2022
ghost pushed a commit that referenced this issue Feb 28, 2022
8848: [Backport stable/1.3] fix: avoid transition to inactive when log storage installation fails r=romansmirnov a=romansmirnov

## Description

When the installation of the `LogStorage` fails because the targeted
term does not match Raft's term, then throw a specific exception. This
is an expected state because while installing the partition a leadership
change happened causing a new term in Raft.

That specific exception is treated differently, meaning, no Raft
transition is requested by Zeebe, because for sure a new Zeebe
transition was scheduled that will (re-)install the services.



## Related issues


backports #8767, relates to #8717



Co-authored-by: Roman <roman.smirnov@camunda.com>
ghost pushed a commit that referenced this issue Feb 28, 2022
8849: [Backport stable/1.2] fix: avoid transition to inactive when log storage installation fails r=romansmirnov a=romansmirnov

## Description

When the installation of the `LogStorage` fails because the targeted
term does not match Raft's term, then throw a specific exception. This
is an expected state because while installing the partition a leadership
change happened causing a new term in Raft.

That specific exception is treated differently, meaning, no Raft
transition is requested by Zeebe, because for sure a new Zeebe
transition was scheduled that will (re-)install the services.


## Related issues


backports #8767, relates to #8717



Co-authored-by: Roman <roman.smirnov@camunda.com>
@ghost closed this as completed in a328924 on Mar 1, 2022
Zeebe automation moved this from Review in progress to Done on Mar 1, 2022
ghost pushed a commit that referenced this issue Mar 1, 2022
8858: [Backport release-1.4.0-alpha2] fix: avoid transition to inactive when log storage installation fails r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8767 to `release-1.4.0-alpha2`.

relates to #8717

Co-authored-by: Roman <roman.smirnov@camunda.com>
@Zelldon added the Release: 1.2.11 and version:1.3.5 labels on Mar 1, 2022
@KerstinHebel removed this from Done in Zeebe on Mar 23, 2022