Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with: Failing installation of 'LogStoragePartitionStep' #9040

Closed
Zelldon opened this issue Apr 1, 2022 · 7 comments · Fixed by #9122
Closed

Error with: Failing installation of 'LogStoragePartitionStep' #9040

Zelldon opened this issue Apr 1, 2022 · 7 comments · Fixed by #9122
Labels
kind/bug Categorizes an issue or PR as a bug severity/low Marks a bug as having little to no noticeable impact for the user version:1.3.8 version:8.1.0-alpha1 Marks an issue as being completely or in parts released in 8.1.0-alpha1 version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0

Comments

@Zelldon
Copy link
Member

Zelldon commented Apr 1, 2022

Describe the bug

We see an error on prod which says: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20. on broker 6.

Based on metrics we can see:

roles

that we have a leader (Broker 5), Broker 6 becomes leader, Broker 5 steps down, Broker 0 becomes Leader and Broker 6 steps down. Based on that I would say everything worked as expected ?

Error group: https://console.cloud.google.com/errors/detail/CPf-xtb-3czwbw;service=zeebe;time=P7D?project=camunda-cloud-240911

Interesting is that this error happend in the cluster multiple times on the same time:

Occurrences
 31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 13.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 13.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '146' is same as raft term '147', but was not. Failing installation of 'LogStoragePartitionStep' on partition 48.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '146' is same as raft term '147', but was not. Failing installation of 'LogStoragePartitionStep' on partition 48. 
 31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '148' is same as raft term '149', but was not. Failing installation of 'LogStoragePartitionStep' on partition 34.
 31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '148' is same as raft term '149', but was not. Failing installation of 'LogStoragePartitionStep' on partition 34.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 13.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 13.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '146' is same as raft term '147', but was not. Failing installation of 'LogStoragePartitionStep' on partition 48.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '146' is same as raft term '147', but was not. Failing installation of 'LogStoragePartitionStep' on partition 48.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '148' is same as raft term '149', but was not. Failing installation of 'LogStoragePartitionStep' on partition 34.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '148' is same as raft term '149', but was not. Failing installation of 'LogStoragePartitionStep' on partition 34.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
31/03/2022, 09:30 LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '149' is same as raft term '150', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.

To Reproduce
IDk

Expected behavior

I think this is expected that this can happen, than I would expect we log a warning instead of an error.

Since we worked on this here #8717 I also expected a different exception ?

Log/Stacktrace

https://drive.google.com/drive/folders/18XzGehQ0z2ut4inT-wXiBbBGM-DD3Zpf

Full Stacktrace

io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep$LogStorageTermMissmatchException: Expected that current term '150' is same as raft term '151', but was not. Failing installation of 'LogStoragePartitionStep' on partition 20.
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.checkAndCreateAtomixLogStorage(LogStoragePartitionTransitionStep.java:123) ~[zeebe-broker-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.lambda$createWritableLogStorage$0(LogStoragePartitionTransitionStep.java:106) ~[zeebe-broker-1.3.6.jar:1.3.6]
	at java.util.Optional.map(Unknown Source) ~[?:?]
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.createWritableLogStorage(LogStoragePartitionTransitionStep.java:105) ~[zeebe-broker-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.buildAtomixLogStorage(LogStoragePartitionTransitionStep.java:83) ~[zeebe-broker-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.transitionTo(LogStoragePartitionTransitionStep.java:50) ~[zeebe-broker-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.broker.system.partitions.impl.PartitionTransitionProcess.lambda$proceedWithTransition$1(PartitionTransitionProcess.java:80) ~[zeebe-broker-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:79) [zeebe-util-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:44) [zeebe-util-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:97) [zeebe-util-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:80) [zeebe-util-1.3.6.jar:1.3.6]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:189) [zeebe-util-1.3.6.jar:1.3.6]

Environment:
Camunda saas

@Zelldon Zelldon added kind/bug Categorizes an issue or PR as a bug severity/low Marks a bug as having little to no noticeable impact for the user labels Apr 1, 2022
@npepinpe
Copy link
Member

npepinpe commented Apr 7, 2022

@deepthidevaki can you double check that the system did not react and that it's just a logging issue? 👍

@deepthidevaki
Copy link
Contributor

It behaved as expected. ZeebePartition ignored the install failure and continued with the next transition.

2022-04-06 11:49:53.482 CEST
zeebe
"Expected that current term '159' is same as raft term '160', but was not. Failing installation of 'LogStoragePartitionStep' on partition 30."
Error
2022-04-06 11:49:53.483 CEST
zeebe
"Failed to install leader partition 30"
Info
2022-04-06 11:49:53.483 CEST
zeebe
"Aborted installation of partition 30, cause: Expected that current term '159' is same as raft term '160', but was not. Failing installation of 'LogStoragePartitionStep' on partition 30."
Info

However it was logged as an error in PartitionTransitionProcess ` https://github.com/camunda/zeebe/blob/acd6aff1a3d29959235871a1b9c0e4a9216b2b9c/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/impl/PartitionTransitionProcess.java#L87.

and also here https://github.com/camunda/zeebe/blob/660f790e932870cf2c325a8622fea5ca5a4e3a5b/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/ZeebePartition.java#L253

We can fix this by conditionally logging the errors. But of course a better solution will be to revisit the transitions and as part of it remove the term-check.

@npepinpe
Copy link
Member

npepinpe commented Apr 7, 2022

Let's reduce the log level for these recoverable errors to at least WARN, warning being setup for issues which may recover by themselves but give hints in case the operator notices something is wrong, or if the warning consistently repeats.

zeebe-bors-camunda bot added a commit that referenced this issue Apr 13, 2022
9122: fix(broker): do not log transition failure due to term mismatch as error r=deepthidevaki a=deepthidevaki

## Description

If a transition failed due to a term mismatch, there will be a new transition following. There is no need to log it as an error because conceptually it is not an error. It is expected to fail to prevent inconsistencies. When this failure happens, it is ignored and ZeebePartition continue with the next transition.

## Related issues

closes #9040 



Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Apr 13, 2022
9124: [Backport stable/1.3] fix(broker): do not log transition failure due to term mismatch as error r=deepthidevaki a=github-actions[bot]

# Description
Backport of #9122 to `stable/1.3`.

relates to #9040

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this issue Apr 13, 2022
9125: [Backport stable/8.0] fix(broker): do not log transition failure due to term mismatch as error r=deepthidevaki a=github-actions[bot]

# Description
Backport of #9122 to `stable/8.0`.

relates to #9040

9133: [Backport stable/8.0] Prevent duplicate key insertion for DMN r=remcowesterhoud a=github-actions[bot]

# Description
Backport of #9121 to `stable/8.0`.

relates to #9115

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
@deepthidevaki deepthidevaki added the version:8.1.0-alpha1 Marks an issue as being completely or in parts released in 8.1.0-alpha1 label May 3, 2022
@Zelldon
Copy link
Member Author

Zelldon commented May 10, 2022

I feel we not completely resolve the issue with the recent PR.

We can still see errors like: IllegalStateException: Not leader of partition 2

Which seem to be more or less expect if I take a look at the log.

Log:

2022-05-04 01:18:12.979 CEST
zeebe
Transition to LEADER on term 36 starting
2022-05-04 01:18:12.979 CEST
zeebe
Transition to LEADER on term 36 - transitioning LogStorage
2022-05-04 01:18:12.979 CEST
zeebe
RaftServer{raft-partition-partition-2}{role=LEADER} - Accepted VoteRequest{term=37, candidate=1, lastLogIndex=970255264, lastLogTerm=36}: candidate's log is up-to-date
2022-05-04 01:18:12.981 CEST
zeebe
Transition to FOLLOWER on term 37 requested.
2022-05-04 01:18:12.982 CEST
zeebe
Partition role transitioning from LEADER to FOLLOWER in term 37
2022-05-04 01:18:12.982 CEST
zeebe
Received cancel signal for transition to LEADER on term 36
2022-05-04 01:18:12.982 CEST
zeebe
Failed to install partition 2

2022-05-04 01:18:12.983 CEST
zeebe
RaftServer{raft-partition-partition-2} - Found leader 1
2022-05-04 01:18:12.994 CEST
zeebe
ZeebePartition-2 failed, marking it as unhealthy: Partition-2{status=UNHEALTHY, issue=ZeebePartition-2{status=UNHEALTHY, issue='Initial state'}}
2022-05-04 01:18:12.994 CEST
zeebe
Detected 'UNHEALTHY' components. The current health status of components: [ZeebePartition-2{status=UNHEALTHY, issue='Services not installed'}, raft-partition-partition-2{status=HEALTHY}]

Logs are stored as well under https://drive.google.com/drive/u/0/folders/1vYJilfkLRlrF9CqTAUwRT9dLeDqTZULb

Exception:

java.lang.IllegalStateException: Not leader of partition 2
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.lambda$createWritableLogStorage$1(LogStoragePartitionTransitionStep.java:111) ~[zeebe-broker-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at java.util.Optional.orElseGet(Optional.java:364) ~[?:?]
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.createWritableLogStorage(LogStoragePartitionTransitionStep.java:107) ~[zeebe-broker-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.buildAtomixLogStorage(LogStoragePartitionTransitionStep.java:83) ~[zeebe-broker-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.broker.system.partitions.impl.steps.LogStoragePartitionTransitionStep.transitionTo(LogStoragePartitionTransitionStep.java:50) ~[zeebe-broker-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.broker.system.partitions.impl.PartitionTransitionProcess.lambda$proceedWithTransition$1(PartitionTransitionProcess.java:80) ~[zeebe-broker-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:79) ~[zeebe-util-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:44) ~[zeebe-util-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) ~[zeebe-util-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:97) ~[zeebe-util-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:80) ~[zeebe-util-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:189) ~[zeebe-util-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]

@Zelldon Zelldon reopened this May 10, 2022
@deepthidevaki
Copy link
Contributor

This is a different issue, but related. Might make sense to open a new issue.

@deepthidevaki deepthidevaki removed their assignment May 11, 2022
@Zelldon
Copy link
Member Author

Zelldon commented Jun 10, 2022

Triage: Moved to todo in order to open a new issue with the related information, afterwards this issue should be closed.

@Zelldon
Copy link
Member Author

Zelldon commented Jun 23, 2022

Created a new issue #9586

@Zelldon Zelldon closed this as completed Jun 23, 2022
@Zelldon Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes an issue or PR as a bug severity/low Marks a bug as having little to no noticeable impact for the user version:1.3.8 version:8.1.0-alpha1 Marks an issue as being completely or in parts released in 8.1.0-alpha1 version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants