When installing services fails during a transition, the partition does not recover and becomes inactive #8717

Closed · 1 task done
romansmirnov opened this issue on Feb 2, 2022 · 12 comments · Fixed by #8767
Labels: kind/bug, scope/broker, severity/low, support, version:1.3.5

Comments

@romansmirnov (Member) commented Feb 2, 2022

Describe the bug

When the Raft layer transitions to the LEADER role, it triggers a transition to the LEADER role on the Zeebe layer after committing the initial entry. However, the transition to the LEADER role might fail on the Zeebe layer. When this happens, the Zeebe layer forces the Raft layer to transition to the INACTIVE state. But transitioning to the INACTIVE state might conflict with an already ongoing transition in the Raft layer. This can leave the partition inactive on this node, so that it rejects all incoming append requests.

The following illustrates how this state can be reached:

  1. 2022-01-28 13:18:13.387: The Raft layer transitions to the LEADER role on partition 22.
     2022-01-28 13:18:13.387 [] [raft-server-0-raft-partition-partition-22] INFO 
       io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-22} - Transitioning to LEADER
  2. 2022-01-28 13:18:25.994: The Raft layer requests the Zeebe layer to transition to the LEADER role for term 2.
     2022-01-28 13:18:25.994 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-3] INFO 
       io.camunda.zeebe.broker.system - Transition to LEADER on term 2 requested.
  3. 2022-01-28 13:18:25.995: On the Zeebe layer, the transition to the LEADER role fails because the term changed from 2 to 3 in the meantime. (Please note: in the meantime, another node became the leader of partition 22 in term 3.)
     2022-01-28 13:18:25.995 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-3] ERROR
       io.camunda.zeebe.broker.system - Expected that current term '2' is same as raft term '3', but was not. Failing installation of 'LogStoragePartitionStep' on partition 22.
     java.lang.IllegalStateException: Expected that current term '2' is same as raft term '3', but was not. Failing installation of 'LogStoragePartitionStep' on partition 22.
     	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.checkAndCreateAtomixLogStorage(LogStoragePartitionTransitionStep.java:123) ~[zeebe-broker-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
     	...
  4. 2022-01-28 13:18:25.996: The Zeebe layer forces the Raft layer to transition to the INACTIVE role.
     2022-01-28 13:18:25.996 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-3] INFO 
       io.camunda.zeebe.broker.system - Unexpected failure occurred in partition 22 (role FOLLOWER, term 2), transitioning to inactive
  5. 2022-01-28 13:18:26.239: But before transitioning to the INACTIVE state, the Raft layer first handles a request from the current/new leader, recognizes that the term changed, and triggers a transition to FOLLOWER.
     2022-01-28 13:18:26.239 [] [raft-server-0-raft-partition-partition-22] INFO 
       io.atomix.raft.roles.LeaderRole - RaftServer{raft-partition-partition-22}{role=LEADER} - Received greater term from 1
     2022-01-28 13:18:26.239 [] [raft-server-0-raft-partition-partition-22] INFO 
       io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-22} - Transitioning to FOLLOWER
  6. 2022-01-28 13:18:26.275: After transitioning to FOLLOWER, the Raft layer transitions to the INACTIVE role.
     2022-01-28 13:18:26.275 [] [raft-server-0-raft-partition-partition-22] INFO 
       io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-22} - Transitioning to INACTIVE
  7. As a consequence, the Zeebe layer first tries to transition to FOLLOWER, which gets canceled, and subsequently transitions to INACTIVE:
     2022-01-28 13:18:26.278 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-1] DEBUG
       io.camunda.zeebe.broker.system - Partition role transitioning from LEADER to FOLLOWER in term 3
     ...
     2022-01-28 13:18:26.278 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-1] INFO 
       io.camunda.zeebe.broker.system - Transition to INACTIVE on term 2 requested.
     ...
     2022-01-28 13:18:26.278 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-1] INFO 
       io.camunda.zeebe.broker.system - Received cancel signal for transition to FOLLOWER on term 3
     ...
     2022-01-28 13:18:26.278 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-1] INFO 
       io.camunda.zeebe.broker.system - Cancelling transition to FOLLOWER on term 3
     ...
     2022-01-28 13:18:26.278 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-1] INFO 
       io.camunda.zeebe.broker.system - Prepare transition from FOLLOWER on term 3 to INACTIVE
     ...
     2022-01-28 13:18:26.279 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-1] INFO 
       io.camunda.zeebe.broker.system - Transition to INACTIVE on term 2 completed

As a consequence, the node becomes inactive on partition 22. That means the node will reject all incoming requests on that partition. Thus, the leader is not able to replicate any entries to that follower.

Observed Behavior:
In a nutshell, there are two problems:

  1. Raft's term might be updated asynchronously, and it is accessed directly by the LogStoragePartitionTransitionStep when transitioning to LEADER in the Zeebe layer:
    https://github.com/camunda-cloud/zeebe/blob/bba1d425614f76698a335132f4f8c2d614e97537/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/impl/steps/LogStoragePartitionTransitionStep.java#L113-L128
    Basically, the term changes so that this transition step fails in the Zeebe layer. Typically, when the term changes, the Raft layer immediately transitions (from LEADER) to FOLLOWER. But because the Zeebe layer forces the Raft layer into the inactive role, the Raft layer becomes inactive instead (a simplified sketch of the failing term check follows this list):
    https://github.com/camunda-cloud/zeebe/blob/bba1d425614f76698a335132f4f8c2d614e97537/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/ZeebePartition.java#L357-L362
  2. The transition requested by the Zeebe layer in the Raft layer is not bound to the current term. Meaning, if the Raft layer is already on term 3 but the request to transition to inactive was made on term 2, then the transition to inactive could be ignored (which is currently not the case).
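For illustration, here is a simplified sketch of the failing term check referenced in point 1. All names only approximate the linked LogStoragePartitionTransitionStep; signatures and types are illustrative, not the actual Zeebe code.

```java
// Hypothetical, simplified sketch of the term check described above (not the real class).
final class LogStorageStepSketch {

  /**
   * transitionTerm: the term the LEADER transition was requested for (known to Zeebe).
   * currentRaftTerm: Raft's term, which is updated asynchronously on the Raft thread.
   */
  void checkTerm(final long transitionTerm, final long currentRaftTerm, final int partitionId) {
    if (transitionTerm != currentRaftTerm) {
      // Another node became leader in the meantime, so the step fails the installation,
      // which in the reported scenario ends up forcing Raft into the INACTIVE role.
      throw new IllegalStateException(
          String.format(
              "Expected that current term '%d' is same as raft term '%d', but was not. "
                  + "Failing installation of 'LogStoragePartitionStep' on partition %d.",
              transitionTerm, currentRaftTerm, partitionId));
    }
    // Otherwise the step would create the log storage for this term (omitted here).
  }
}
```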

Expected behavior

  • If the installation of the Zeebe services fails because of a term change, then a specific exception should be thrown:
    • The exception indicates that the transition was aborted intentionally.
    • The exception should be handled by taking no specific action, because the state is transient and a new transition will be scheduled anyway.

Additional information:

Environment:

  • Zeebe Version: 1.4.0-SNAPSHOT

related to SUPPORT-12825

@romansmirnov added the kind/bug, scope/broker, severity/low, and Impact: Availability labels on Feb 2, 2022
@romansmirnov (Member, Author) commented

@npepinpe, I summarized what we have discussed. Feel free to adjust the description if something is missing.

@romansmirnov (Member, Author) commented

@deepthidevaki, while I understand what happened in this case, there is still one missing piece that is unclear to me: what triggered the term change so that the LogStoragePartitionTransitionStep failed?

To me, it is clear in which situations the term changes (for example, when receiving a request with a higher term), but typically, I would expect the Raft layer to transition immediately to the FOLLOWER role (if it is currently in the LEADER role). In this specific case, however, that transition didn't happen right away. It is also clear that the term changes asynchronously in the Raft layer, but looking at the timestamps, i.e., when the exception happened and when the Raft layer eventually transitioned to the FOLLOWER role, there is a gap of ~240ms. So somehow they do not match.

Do you see any case where the term changes but a role transition is not triggered? Maybe we can discuss this issue together 😄

@romansmirnov added the support label on Feb 2, 2022
@deepthidevaki (Contributor) commented

However, the transition to the LEADER role might fail on the Zeebe layer. When this happens, the Zeebe layer forces the Raft layer to transition to the INACTIVE state.

As far as I remember, this is not expected. It should transition to FOLLOWER when the transition to leader fails. If the transition to follower fails, then it should transition to INACTIVE because there is no other fallback option. So I guess there is a bug there.

@deepthidevaki (Contributor) commented

ERROR io.camunda.zeebe.broker.system - Expected that current term '2' is same as raft term '3', but was not.

This should ideally not be treated as an install failure; rather, it is an expected case. We have this check to ensure consistency, and failing the install is the way to prevent any inconsistency errors. We should handle this error differently, I guess.
We have an issue to reduce the log level of this error (#7309), but I think that is not enough. Instead we should handle this error differently, or better, prevent this case.

@deepthidevaki (Contributor) commented

Do you see any case where the term changes but a role transition are not triggered?

The term can change when it is in the follower role. Then no transitions are triggered.

But in this case, I think the following happened:

  1. Raft sets the new term 3.
  2. LogStoragePartitionTransitionStep reads the new term and fails.
  3. Raft transitions to follower in term 3.

Steps 1 and 3 are executed on the Raft thread. Step 2 is executed on a Zeebe actor thread. This might be why there is this time gap.

@romansmirnov (Member, Author) commented Feb 3, 2022

To that point:

As far as I remember, this is not expected. It should transition to FOLLOWER when the transition to leader fails. If the transition to follower fails, then it should transition to INACTIVE because there is no other fallback option.

The situation is a bit different:

  1. On the Raft layer, it is already in the LEADER role.
  2. On the Zeebe layer, it is currently transitioning from FOLLOWER to LEADER. Because the term changed in the Raft layer in the meantime, the Raft layer will eventually transition to FOLLOWER (but no corresponding transition on the Zeebe layer has been requested yet).
  3. Since the transition fails on the Zeebe layer, it submits a transition to INACTIVE in the Raft layer (because the current role in the Zeebe layer is FOLLOWER - and not LEADER - so the fallback you mentioned applies here):
    https://github.com/camunda-cloud/zeebe/blob/4d3ad96196be1bf1e3b6e83b6590fdd837bceae4/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/ZeebePartition.java#L356-L363

To avoid going into INACTIVE in the Raft layer, @npepinpe and I thought of binding the requested transition to the term. Basically, when the Zeebe layer requests the Raft layer to transition to INACTIVE, it passes its currently known term to Raft. That way, Raft can compare its term with Zeebe's passed term (a minimal sketch follows this list):

  • If Raft's term is higher than Zeebe's passed term, then ignore the transition request and stay in the current role. (In the situation above, the Raft layer would remain in the FOLLOWER role and not transition to INACTIVE.)
  • If Raft's term is equal to Zeebe's passed term, then execute the transition request (i.e., transition to INACTIVE).
  • (The case where Raft's term is lower than Zeebe's passed term does not apply here.)
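For illustration, a minimal sketch of such a term-bound transition request on the Raft side. All class, method, and field names here are hypothetical placeholders, not the actual Raft/Zeebe API.

```java
// Hypothetical sketch of binding a transition request to a term, as proposed above.
final class TermBoundTransitionHandler {

  enum Role { LEADER, FOLLOWER, INACTIVE }

  private long currentTerm;
  private Role currentRole = Role.FOLLOWER;

  /** Called with the term the Zeebe layer knew when it requested the transition. */
  synchronized void requestTransition(final Role requestedRole, final long requestedTerm) {
    if (currentTerm > requestedTerm) {
      // Raft has already moved to a newer term (e.g. it is FOLLOWER in term 3 while the
      // request targets term 2): ignore the stale request and keep the current role.
      return;
    }
    // Terms are equal: execute the requested transition (e.g. to INACTIVE).
    currentRole = requestedRole;
  }

  /** Called by Raft itself whenever it observes a higher term. */
  synchronized void onNewTerm(final long newTerm) {
    currentTerm = Math.max(currentTerm, newTerm);
  }

  synchronized Role currentRole() {
    return currentRole;
  }
}
```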

@deepthidevaki (Contributor) commented

else if (context.getCurrentRole() == Role.FOLLOWER) {

I would say this is the bug. Since transitioning to LEADER failed, it should take that into account and transition to FOLLOWER. So comparing against the currentRole, which is "follower", is incorrect here.

To avoid going into INACTIVE in the Raft layer, @npepinpe and I thought of binding the requested transition to the term. Basically, when the Zeebe layer requests the Raft layer to transition to INACTIVE, it passes its currently known term to Raft. That way, Raft can compare its term with Zeebe's passed term:

This makes sense to me. If there is already a new transition triggered by Raft, then we can ignore the transition triggered by ZeebePartition. But Zeebe should trigger a transition to Inactive only when there is no other way to recover from the failure. In this case, Zeebe should not have transitioned to inactive. So I think we should look into it.

In addition, it would also be interesting to think about how we can avoid the error "ERROR io.camunda.zeebe.broker.system - Expected that current term '2' is same as raft term '3'", which triggered this transition.

@npepinpe (Member) commented Feb 3, 2022

Considering that transitions are asynchronous and concurrent between the two layers, does it make sense for a step in the ZeebePartition transition to peek directly at the current value of the Raft term? I understand what we wanted to achieve (i.e., short-circuit appending if the term has changed since we requested its installation), but I wonder if perhaps this doesn't make it more complicated, mixing two different synchronization models (i.e., using actors/a single thread for synchronization to execute work requests between the two contexts, and directly peeking at the state, bypassing this model). Should the term of the append be propagated all the way downstream until it hits the LeaderAppender, at which point it can be rejected? Though that could also be considered a leaky abstraction... 🤔 Possibly the instance of the ZeebeLogAppender implementation carries the term (since that instance is Raft specific) and appends it to requests to the LeaderAppender? I know currently the ZeebeLogAppender is directly the LeaderRole, but we could instead create a proxy tied to the transition's term (a rough sketch follows below)? Anything to simplify the current state 😄
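For illustration, a rough sketch of that proxy idea, assuming a hypothetical appender interface; the real ZeebeLogAppender/LeaderRole types and signatures differ.

```java
// Hypothetical sketch: a per-transition appender proxy that carries the transition's term,
// so the Raft side can reject appends from a stale term instead of the transition step
// peeking at Raft's current term.
final class TermBoundLogAppenderSketch {

  /** Stand-in for the Raft-side appender (e.g. the LeaderRole in the real code). */
  interface RaftAppender {
    void append(long term, byte[] block);
  }

  private final RaftAppender delegate;
  private final long transitionTerm;

  TermBoundLogAppenderSketch(final RaftAppender delegate, final long transitionTerm) {
    this.delegate = delegate;
    this.transitionTerm = transitionTerm;
  }

  void append(final byte[] block) {
    // The delegate, running on the Raft thread, can compare transitionTerm with its current
    // term and drop the block if they no longer match.
    delegate.append(transitionTerm, block);
  }
}
```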

Anyhow, we should focus first on what you two mentioned: handling the recoverable error properly, and also tying the transition requests to a logical clock.

I will schedule this for alpha2. It's a little hard for me to estimate right now whether we should start immediately, but I think next week would be the latest if we're being conservative. Let me know if you think otherwise. I'll probably ask one of you to work on this and the other to review, but you're welcome to pair as well if you'd like 🙂

@deepthidevaki (Contributor) commented Feb 3, 2022

The check is not only there to short-circuit, but to prevent inconsistency. We had a bug where the LogStorageAppender from an older term wrote to the LeaderRole in a newer term, thus messing up the record positions. This check was added to prevent that case. I also agree that peeking at the Raft term is not an ideal way. If we can find a better way to prevent a service from an older term from accessing Raft at a newer term, that would be preferred.

@romansmirnov (Member, Author) commented

@npepinpe, a bit unrelated to the discussion so far, but when we discussed this issue yesterday, we were wondering what caused the term to change. As @deepthidevaki mentioned, the following happens asynchronously:

But in this case, I think the following happened:

  1. Raft sets the new term 3.
  2. LogStoragePartitionTransitionStep reads the new term and fails.
  3. Raft transitions to follower in term 3.

Steps 1 and 3 are executed on the Raft thread. Step 2 is executed on a Zeebe actor thread. This might be why there is this time gap.

This is exactly what happens, but I was wondering why there is a "big" time gap of ~240ms:

2022-01-28 13:18:25.996 [Broker-0-ZeebePartition-22] [Broker-0-zb-actors-3] INFO 
      io.camunda.zeebe.broker.system - Unexpected failure occurred in partition 22 (role FOLLOWER, term 2), transitioning to inactive
...
2022-01-28 13:18:26.239 [] [raft-server-0-raft-partition-partition-22] INFO 
      io.atomix.raft.roles.LeaderRole - RaftServer{raft-partition-partition-22}{role=LEADER} - Received greater term from 1
2022-01-28 13:18:26.239 [] [raft-server-0-raft-partition-partition-22] INFO 
      io.atomix.raft.impl.RaftContext - RaftServer{raft-partition-partition-22} - Transitioning to FOLLOWER

In a different context, I noticed that writes to Raft's metastore spike up to ~350ms (and more), which would explain the time gap between these two events. Broker 0 handles the vote request coming from broker 1:

https://github.com/camunda-cloud/zeebe/blob/4d3ad96196be1bf1e3b6e83b6590fdd837bceae4/atomix/cluster/src/main/java/io/atomix/raft/roles/LeaderRole.java#L464-L469

Basically, the following happens:

  1. While handling the vote request from broker 1, the Raft thread executes #updateTermAndLeader():
    https://github.com/camunda-cloud/zeebe/blob/4d3ad96196be1bf1e3b6e83b6590fdd837bceae4/atomix/cluster/src/main/java/io/atomix/raft/roles/AbstractRole.java#L116-L130
  2. The Raft thread updates the term and writes the changed term and the last vote back to the metastore (at this stage, the term has changed, but no Raft transition has been triggered yet):
    https://github.com/camunda-cloud/zeebe/blob/4d3ad96196be1bf1e3b6e83b6590fdd837bceae4/atomix/cluster/src/main/java/io/atomix/raft/impl/RaftContext.java#L1021-L1022
  3. The broker thread that executes the transition to LEADER in the Zeebe layer reads the new term, and the installation fails.
  4. After ~240ms, the Raft thread finishes writing to the metastore and eventually triggers the Raft transition to FOLLOWER.

To my knowledge, this would explain the term change and the delayed Raft transition.
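For illustration, a self-contained sketch of that sequence. The names only loosely mirror the linked Raft code, and the sleep stands in for the slow metastore write; this is an assumption-laden illustration of the timing, not the real implementation.

```java
// Hypothetical sketch of the timing described above: the term becomes visible before the
// role transition is triggered, and the metastore write in between can take hundreds of ms.
final class VoteRequestTimingSketch {

  enum Role { LEADER, FOLLOWER }

  private volatile long term = 2;           // readable by Zeebe actor threads
  private volatile Role role = Role.LEADER;

  void onVoteRequestWithHigherTerm(final long requestTerm) throws InterruptedException {
    if (requestTerm > term) {
      term = requestTerm;                   // steps 1+2: term changes and is immediately visible
      persistTermToMetaStore();             // slow, synchronous disk write (~240ms in the incident)
      role = Role.FOLLOWER;                 // step 4: transition only after the write finishes
    }
  }

  private void persistTermToMetaStore() throws InterruptedException {
    // Stand-in for the metastore flush; the delay is the gap between the failed Zeebe
    // install (which already saw the new term) and the Raft FOLLOWER transition.
    Thread.sleep(250);
  }
}
```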

@romansmirnov self-assigned this on Feb 4, 2022
@npepinpe added this to Ready in Zeebe on Feb 4, 2022
@npepinpe moved this from Ready to Review in progress in Zeebe on Feb 7, 2022
@npepinpe moved this from Review in progress to In progress in Zeebe on Feb 7, 2022
@romansmirnov (Member, Author) commented

@deepthidevaki and I discussed potential solutions. Just listing a few of them:

  1. Tie transition requests to a term
    The idea would be to pass the Zeebe layer's currently known term to the Raft layer when requesting a transition to inactive. However, this approach may cause other issues. For example, when the transition to follower fails for whatever reason in the Zeebe layer, we want the Raft layer to become inactive. If this transition-to-inactive request is tied to the term, it could happen that the Raft layer changes the term without changing its role (for example, there is a new leader, or someone started an election round, etc.). In such a scenario, Zeebe could end up in a state where some services are installed and others are not, while the node remains a follower in the Raft layer. Maybe the snapshot service is not installed, so no snapshots are taken on that partition (and no snapshots are replicated by the leader because the follower is not lagging). Without snapshots, there is no compaction. Without compaction, disk space won't be freed, so the broker might run out of disk space, which potentially impacts the other partitions on that broker.

  2. Tie LogStorageAppender append blocks to a term
    Basically, don't check the terms when initializing the LogStorage; instead, initialize the LogStorageAppender with the current term. Whenever the LogStorageAppender submits a block/fragment to the LogStorage (i.e., the LeaderRole), Raft checks whether the block to append belongs to the same term Raft is currently in. If not, the block is dropped.

  3. Don't request a transition to inactive
    Actually, the exception thrown during the installation of the LogStorage is expected when a term change happened in the meantime (see #7309, "Reduce the log level or better convey the message when leader install fails due to concurrent role transition"). The term change happens because the broker is no longer the leader for that partition, so we can be sure that a new Zeebe transition to follower gets scheduled. Eventually, the system gets into a consistent state, i.e., being a follower in both the Raft and Zeebe layers. To reach that state, we could introduce a specific exception that gets handled by the ZeebePartition by "simply" dropping it (a minimal sketch follows this list).
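For illustration, a minimal sketch of the third option, assuming a hypothetical exception type and handler; the actual types introduced by the fix may be named and wired differently.

```java
// Hypothetical sketch of option 3: a dedicated exception signals that the install failed only
// because of a concurrent term change, and the partition handles it by doing nothing, since a
// new transition (to FOLLOWER) is already scheduled by Raft.
final class TermMismatchHandlingSketch {

  /** Thrown by the log storage step when the transition's term no longer matches Raft's term. */
  static final class UnexpectedTermException extends RuntimeException {
    UnexpectedTermException(final long expectedTerm, final long raftTerm) {
      super("Expected term " + expectedTerm + " but Raft is already on term " + raftTerm);
    }
  }

  void onInstallFailure(final Throwable error) {
    if (error instanceof UnexpectedTermException) {
      // Expected and transient: do NOT force Raft to INACTIVE, just drop the failure.
      return;
    }
    // Any other failure keeps the existing fallback behaviour (step down / go inactive).
    goInactive();
  }

  private void goInactive() {
    // existing fallback, omitted
  }
}
```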

We decided to follow the third solution to resolve this problem. Also, we decided to create two separate issues:

  1. Refactor LogStoragePartitionTransitionStep so that it does not access Raft's current term
  2. Fix ZeebePartition#handleRecoverableFailure(), i.e., context.getCurrentRole() == FOLLOWER assumes that the Zeebe layer is already in the follower state, which is no longer the case (after refactoring the transition behavior in Zeebe)

@npepinpe, please let me know if you have any questions or concerns.

@npepinpe (Member) commented Feb 9, 2022

Sounds reasonable, and I understand the first approach has its own issues.

💭 I think the way the Raft layer and ZeebePartition layer communicate generally seems overly complicated. Even the proposed fixes, which are valuable, only address specific cases - it's still easy to shoot ourselves in the foot in the future by accessing the Raft layer directly in one of the transition steps, or by forgetting to throw the right exception for recoverable cases. I'm just thinking out loud here; I still think you should go ahead with your plan, but I think it will soon be time to take a step back and come up with a better approach, or improve the current one, for coordinating these state transitions to make them less error-prone.

@npepinpe moved this from In progress to Review in progress in Zeebe on Feb 24, 2022
ghost pushed a commit that referenced this issue Feb 28, 2022
8848: [Backport stable/1.3] fix: avoid transition to inactive when log storage installation fails r=romansmirnov a=romansmirnov

## Description

When the installation of the `LogStorage` fails because the targeted
term does not match Raft's term, then throw a specific exception. This
is an expected state because while installing the partition a leadership
change happened causing a new term in Raft.

That specific exception is treated differently, meaning, no Raft
transition is requested by Zeebe, because for sure a new Zeebe
transition was scheduled that will (re-)install the services.



## Related issues


backports #8767, relates to #8717



Co-authored-by: Roman <roman.smirnov@camunda.com>
ghost pushed a commit that referenced this issue Feb 28, 2022
8849: [Backport stable/1.2] fix: avoid transition to inactive when log storage installation fails r=romansmirnov a=romansmirnov

## Description

When the installation of the `LogStorage` fails because the targeted
term does not match Raft's term, then throw a specific exception. This
is an expected state because while installing the partition a leadership
change happened causing a new term in Raft.

That specific exception is treated differently, meaning, no Raft
transition is requested by Zeebe, because for sure a new Zeebe
transition was scheduled that will (re-)install the services.


## Related issues


backports #8767, relates to #8717



Co-authored-by: Roman <roman.smirnov@camunda.com>
@ghost closed this as completed in a328924 on Mar 1, 2022
Zeebe automation moved this from Review in progress to Done on Mar 1, 2022
ghost pushed a commit that referenced this issue Mar 1, 2022
8858: [Backport release-1.4.0-alpha2] fix: avoid transition to inactive when log storage installation fails r=oleschoenburg a=github-actions[bot]

# Description
Backport of #8767 to `release-1.4.0-alpha2`.

relates to #8717

Co-authored-by: Roman <roman.smirnov@camunda.com>
@Zelldon added the Release: 1.2.11 and version:1.3.5 labels on Mar 1, 2022
@KerstinHebel removed this from Done in Zeebe on Mar 23, 2022