When installing services fails during a transition, the partition does not recover and becomes inactive #8717
@npepinpe, I summarized what we have discussed. Feel free to adjust the description if something is missing.
@deepthidevaki, while I understand what happened in this case, there is still one missing piece that is unclear to me. Basically, what triggered the term change so that the transition step failed? To me it is clear in which situations the term changes (for example, when receiving a request with a higher term), but typically I would expect that the Raft layer transitions immediately to the `FOLLOWER` role. Do you see any case where the term changes but a role transition is not triggered? Maybe we can discuss this issue together 😄
As far as I remember, this is not expected. It should transition to `FOLLOWER` when the transition to leader fails. If the transition to follower fails, then it should transition to `INACTIVE` because there is no other fallback option. So I guess there is a bug there.
Ideally, this should not be treated as an install failure; rather, it is an expected case. We have this check to ensure consistency, and failing the install is the way to prevent any inconsistency errors. We should handle this error differently, I guess.
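For illustration, a minimal sketch of what "handling this error differently" could look like: a dedicated exception type that marks the term mismatch as an expected, recoverable case. The class and method names here are hypothetical, not the actual Zeebe API.

```java
// Hypothetical sketch: mark the term mismatch as a recoverable,
// expected condition instead of a generic install failure.
final class RecoverableTermMismatchException extends RuntimeException {
  RecoverableTermMismatchException(final long targetTerm, final long raftTerm) {
    super(
        "Expected to install services for term "
            + targetTerm
            + ", but Raft is already at term "
            + raftTerm
            + "; a newer transition is (or will be) scheduled");
  }
}

// Inside the transition step (illustrative):
// if (targetTerm != raft.getTerm()) {
//   throw new RecoverableTermMismatchException(targetTerm, raft.getTerm());
// }
```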
The term can change when it is in the follower role. Then no transitions are triggered. But in this case, I think the following happened:

1. Raft observes a higher term (e.g., via a request from another node) and updates its term.
2. The Zeebe transition step reads the new Raft term, detects the mismatch, and fails the installation.
3. Raft transitions to `FOLLOWER`.

Steps 1 and 3 are executed in the Raft thread. Step 2 is executed in Zeebe actors. This might be why there is this time gap.
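A tiny runnable sketch of that interleaving, with illustrative names only: the Raft thread bumps the term, the Zeebe actor observes the newer term while still installing for the old one, and only afterwards does the Raft thread perform its own transition.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public final class TermRaceSketch {
  public static void main(final String[] args) throws Exception {
    final AtomicLong raftTerm = new AtomicLong(2);
    final ExecutorService raftThread = Executors.newSingleThreadExecutor();
    final ExecutorService zeebeActor = Executors.newSingleThreadExecutor();

    // Step 1 (Raft thread): a message with a higher term arrives and the
    // term is updated, but the role transition does not happen yet.
    raftThread.submit(() -> raftTerm.set(3)).get();

    // Step 2 (Zeebe actor): the install for term 2 is still running and
    // now observes the newer Raft term, so the step fails.
    final long targetTerm = 2;
    zeebeActor
        .submit(
            () ->
                System.out.println(
                    "install fails: target term " + targetTerm
                        + " != raft term " + raftTerm.get()))
        .get();

    // Step 3 (Raft thread): only now does Raft transition to FOLLOWER.
    raftThread.submit(() -> System.out.println("raft transitions to FOLLOWER")).get();

    raftThread.shutdown();
    zeebeActor.shutdown();
  }
}
```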
To that point: the situation is a bit different. To avoid going into `INACTIVE`, we could ignore the transition triggered by `ZeebePartition` if Raft has already triggered a new transition in the meantime.
I would say this is the bug. Since transitioning to `LEADER` failed, it should take that into account and transition to `FOLLOWER`. So comparing with the `currentRole`, which is already "follower", is incorrect.
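A sketch of the guard being described, with hypothetical names: the fallback should be based on the role the transition failed to reach, not on the current Raft role, which may already have changed to follower on its own.

```java
// Hypothetical sketch; Role, stepDownToFollower() and goInactive() are
// illustrative stand-ins, not the actual Zeebe API.
enum Role { LEADER, FOLLOWER, INACTIVE }

final class TransitionFailureHandler {
  void onTransitionFailed(final Role failedRole, final Role currentRaftRole) {
    // Buggy variant: if Raft already stepped down on its own,
    // currentRaftRole is FOLLOWER, the guard never fires, and the
    // partition escalates straight to INACTIVE.
    //   if (currentRaftRole == Role.LEADER) { stepDownToFollower(); return; }

    // Intended behavior: decide based on the role we failed to reach.
    if (failedRole == Role.LEADER) {
      stepDownToFollower(); // recoverable: fall back to FOLLOWER first
    } else {
      goInactive(); // last resort: no other fallback option
    }
  }

  private void stepDownToFollower() { /* request FOLLOWER transition */ }

  private void goInactive() { /* request INACTIVE transition */ }
}
```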
This makes sense to me. If there is already a new transition triggered by Raft, then we can ignore the transition triggered by `ZeebePartition`. But Zeebe should trigger a transition to `INACTIVE` only when there is no other way to recover from the failure. In this case, Zeebe should not have transitioned to inactive, so I think we should look into it. In addition, it would also be interesting to think about how we can avoid the error "ERROR io.camunda.zeebe.broker.system - Expected that current term '2' is same as raft term '3'", which triggered this transition.
Considering that transitions are asynchronous and concurrent between the two layers, does it make sense for a step in the Zeebe transition to peek at the current Raft term at all? Anyhow, we should focus first on what you two mentioned: handling the recoverable error properly, and also tying the transition requests to a logical clock. I will schedule this for alpha2. It's a little hard for me to estimate now whether we should start immediately, but I think next week would be the latest if we're conservative. Let me know if you think otherwise. I'll probably ask one of you to work on this, and the other to review, but you're welcome to pair as well if you'd like 🙂
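A minimal sketch of the "logical clock" idea mentioned above, with made-up names: every transition request carries a monotonically increasing epoch, and a request is dropped once a newer one has been issued.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch, not the actual Zeebe implementation.
final class TransitionClock {
  private final AtomicLong latestEpoch = new AtomicLong();

  /** Stamps a new transition request. */
  long nextEpoch() {
    return latestEpoch.incrementAndGet();
  }

  /** A request may only proceed if no newer request has been issued. */
  boolean isCurrent(final long requestEpoch) {
    return requestEpoch == latestEpoch.get();
  }
}
```

With such a clock, the `INACTIVE` request forced by the Zeebe layer at epoch n would be dropped as soon as Raft issues its own `FOLLOWER` transition at epoch n+1.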
The check is not only there to short-circuit, but to prevent inconsistency. We had a bug where the `LogStorageAppender` from an older term wrote to the `LeaderRole` in the newer term, thus messing up the record positions. This check was added to prevent that case. I also agree that peeking at the Raft term is not an ideal way. If we can find a better way to prevent a service from an older term from accessing Raft at a newer term, that would be preferred.
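One conceivable alternative to peeking at the term at install time is to fence at the point of use: the appender remembers the term it was created for and refuses to write once Raft has moved on. A sketch under that assumption; all names here are made up.

```java
// Illustrative term-fencing sketch; FencedAppender and RaftTermSupplier
// are hypothetical names, not the actual Zeebe API.
final class FencedAppender {
  interface RaftTermSupplier {
    long currentTerm();
  }

  private final long appenderTerm;
  private final RaftTermSupplier raft;

  FencedAppender(final long appenderTerm, final RaftTermSupplier raft) {
    this.appenderTerm = appenderTerm;
    this.raft = raft;
  }

  void append(final byte[] entry) {
    // An appender created for an older term must never write at a newer
    // term, otherwise record positions can be corrupted.
    if (raft.currentTerm() != appenderTerm) {
      throw new IllegalStateException(
          "appender for term " + appenderTerm
              + " is stale; raft term is " + raft.currentTerm());
    }
    // ... hand the entry over for replication ...
  }
}
```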
@npepinpe, a bit unrelated to the discussion so far, but when we discussed this issue yesterday, we were wondering what caused the term to change. As @deepthidevaki mentioned, the steps above happen asynchronously. This is exactly what happens, but I was wondering why there is a "big" time gap of ~240ms between the term change and the Raft transition.

In a different context, I noticed that writes to Raft's metastore spike up to ~350ms (and more). And that would explain the time gap between these two events. Broker 0 handles the vote request coming from broker 1; basically, broker 0 receives a vote request with a higher term from broker 1, persists the new term to the metastore, and only then triggers the role transition in Raft.
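For context, a sketch of why a slow metastore write delays the transition: a Raft node must durably persist a newly observed term before acting on it, so the role change waits on the disk write. The names below are illustrative stand-ins.

```java
// Illustrative sketch of the vote-request handling order; MetaStore and
// stepDownToFollower() are hypothetical names.
final class VoteRequestHandling {
  interface MetaStore {
    void storeTerm(long term); // durable (fsync'd) write
  }

  private final MetaStore metaStore;
  private long currentTerm;

  VoteRequestHandling(final MetaStore metaStore, final long currentTerm) {
    this.metaStore = metaStore;
    this.currentTerm = currentTerm;
  }

  void onVoteRequest(final long requestTerm) {
    if (requestTerm > currentTerm) {
      // The new term must be durably persisted before it is acted upon;
      // this blocking write is where the ~350ms spikes were observed.
      metaStore.storeTerm(requestTerm);
      currentTerm = requestTerm; // the term changes here ...
      stepDownToFollower();      // ... but the transition runs only after the write
    }
  }

  private void stepDownToFollower() { /* trigger the Raft role transition */ }
}
```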
To my knowledge, this would explain the term change and the delayed Raft transition.
@deepthidevaki and I discussed potential solutions. Just listing a few of them:

We decided to follow the third solution to resolve that problem: when the log storage installation fails because the targeted term does not match Raft's term, throw a specific exception that is treated as an expected, recoverable case, so that Zeebe does not request any Raft transition. Also, we decided to create two separate issues.

@npepinpe, please let me know if you have any questions or concerns.
Sounds reasonable, and I understand the first approach has its own issues. 💭 I think the way the Raft layer and ZeebePartition layer communicate generally seems overly complicated. Even the proposed fixes, which are valuable, only fix specific cases - it's still easy to shoot ourselves in the foot in the future by accessing the Raft layer directly in one of the transition steps, or by forgetting to throw the right exception for recoverable cases. I'm just thinking out loud here; I still think you should go ahead with your plan, but I think it will soon be time to take a step back and try to come up with a better approach, or improve the current one, of coordinating these state transitions to make them less error-prone.
8848: [Backport stable/1.3] fix: avoid transition to inactive when log storage installation fails r=romansmirnov a=romansmirnov

## Description

When the installation of the `LogStorage` fails because the targeted term does not match Raft's term, a specific exception is thrown. This is an expected state, because while installing the partition a leadership change happened, causing a new term in Raft. That specific exception is treated differently, meaning no Raft transition is requested by Zeebe, because a new Zeebe transition was certainly scheduled that will (re-)install the services.

## Related issues

backports #8767, relates to #8717

Co-authored-by: Roman <roman.smirnov@camunda.com>

8849: [Backport stable/1.2] fix: avoid transition to inactive when log storage installation fails r=romansmirnov a=romansmirnov

## Description

When the installation of the `LogStorage` fails because the targeted term does not match Raft's term, a specific exception is thrown. This is an expected state, because while installing the partition a leadership change happened, causing a new term in Raft. That specific exception is treated differently, meaning no Raft transition is requested by Zeebe, because a new Zeebe transition was certainly scheduled that will (re-)install the services.

## Related issues

backports #8767, relates to #8717

Co-authored-by: Roman <roman.smirnov@camunda.com>
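Read together with the fix description above, the install-failure handling could conceptually look like the following sketch. The exception and method names are hypothetical (mirroring the earlier sketch in this thread), not the actual code from #8767.

```java
// Hypothetical sketch of branching on the specific exception.
final class InstallFailureHandling {
  void onInstallFailure(final Throwable error) {
    if (error instanceof RecoverableTermMismatchException) {
      // Expected: a leadership change happened mid-install; the newer
      // Zeebe transition will (re-)install the services, so no Raft
      // transition is requested here.
      return;
    }
    // Unexpected failure: keep the previous fallback behavior, which
    // ultimately transitions the partition to INACTIVE.
    transitionToInactive();
  }

  private void transitionToInactive() { /* request INACTIVE as last resort */ }
}
```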
Describe the bug

When the Raft layer transitions to the `LEADER` role, it will trigger a transition to the `LEADER` role on the Zeebe layer after committing the initial entry. However, the transition to the `LEADER` role might fail on the Zeebe layer. When this happens, the Zeebe layer forces the Raft layer to transition to the `INACTIVE` state. But transitioning to the `INACTIVE` state might conflict with an already ongoing transition in the Raft layer. This might eventually result in a state where the partition becomes inactive on this node, so that it rejects all incoming append requests.

The following illustrates how this state can be reached:
- `2022-01-28 13:18:13.387`: The Raft layer transitions to the `LEADER` role on partition 22.
- `2022-01-28 13:18:25.994`: The Raft layer requests the Zeebe layer to transition to the `LEADER` role for term `2`.
- `2022-01-28 13:18:25.995`: On the Zeebe layer, the transition to the `LEADER` role fails because the term changed in the meantime from `2` to `3`. (Please note: in the meantime, another node became the `LEADER` for term `3` on partition 22.)
- `2022-01-28 13:18:25.996`: The Zeebe layer forces the Raft layer to transition to the `INACTIVE` role.
- `2022-01-28 13:18:26.239`: But before transitioning to the `INACTIVE` state, the Raft layer first handles a request from the current/new leader, recognizes again that the term changed, and triggers a transition to `FOLLOWER` in the Raft layer.
- `2022-01-28 13:18:26.275`: After transitioning to `FOLLOWER` in the Raft layer, it transitions to the `INACTIVE` role. On the Zeebe layer, this triggers a transition to `FOLLOWER`, which gets canceled, and subsequently it transitions to `INACTIVE`.

As a consequence, the node becomes inactive on partition 22. That means the node will reject all incoming requests on that partition. Thus, the leader is not able to replicate any entries to that follower.
Observed Behavior:

In a nutshell, there are two problems:

1. The `LogStoragePartitionTransitionStep` fails when transitioning to `LEADER` in the Zeebe layer:

   https://github.com/camunda-cloud/zeebe/blob/bba1d425614f76698a335132f4f8c2d614e97537/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/impl/steps/LogStoragePartitionTransitionStep.java#L113-L128

   Basically, the term changes so that this transition step fails in the Zeebe layer. Typically, if the term changes, the Raft layer will immediately transition (from `LEADER`) to `FOLLOWER`.

2. Because the Zeebe layer forces the Raft layer to go into an inactive role, the Raft layer will become inactive:

   https://github.com/camunda-cloud/zeebe/blob/bba1d425614f76698a335132f4f8c2d614e97537/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/ZeebePartition.java#L357-L362
Expected behavior
Additional information:
Environment:
related to SUPPORT-12825