Improve coordination of Raft and Zeebe role transitions #8852
Labels
area/reliability
Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
component/partition-transitions
component/raft
component/zeebe
Related to the Zeebe component/team
kind/toil
Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc.
Description
This issue is an outcome of the discussions in #8717. In a nutshell, it is about improving the coordination of Raft and Zeebe role transitions, which currently happen asynchronously. For example, if the Raft layer transitions to `FOLLOWER`, the corresponding transition to `FOLLOWER` in the Zeebe layer happens asynchronously. The current approach causes several issues, adds things to consider, and seems to overly complicate the communication between these layers, for instance:

- https://github.com/camunda-cloud/zeebe/blob/1081002029552e091779869fc419efc400993f7b/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/impl/steps/LogStoragePartitionTransitionStep.java#L113-L128
- `ZeebePartition#handleRecoverableFailure()` assumes that `context.getCurrentRole()` returns the role that the Zeebe layer tried to transition to. But `context.getCurrentRole()` actually returns the role it was trying to transition from: https://github.com/camunda-cloud/zeebe/blob/1081002029552e091779869fc419efc400993f7b/broker/src/main/java/io/camunda/zeebe/broker/system/partitions/ZeebePartition.java#L341-L364
- The partition becomes `INACTIVE` in case of a failure during the installation of the services: When installing services fails during a transition, the partition does not recover and becomes inactive #8717

To Do
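To illustrate the `getCurrentRole()` mismatch described above, here is a minimal, self-contained sketch. It is not Zeebe code: the `Role` enum, `TransitionContext`, `getTargetRole()`, `beginTransition()`, `completeTransition()` and `roleToRecover()` are hypothetical names chosen for illustration. Only the idea that the current role still points at the role being transitioned from comes from the description. The sketch assumes that recording the target role alongside the current role would let failure handling pick the intended role.

```java
import java.util.Optional;

// Hypothetical sketch, not the actual Zeebe implementation.
final class RoleTransitionSketch {

  // Stand-in for the role type used by the Raft and Zeebe layers (assumption).
  enum Role { INACTIVE, FOLLOWER, LEADER }

  // Hypothetical context that tracks both ends of an in-flight transition.
  static final class TransitionContext {
    private Role currentRole;
    private Role targetRole; // null while no transition is in progress

    TransitionContext(final Role initialRole) {
      this.currentRole = initialRole;
    }

    // While a transition is running, this is still the role we transition FROM.
    Role getCurrentRole() {
      return currentRole;
    }

    // The role we are transitioning TO, if a transition is in progress.
    Optional<Role> getTargetRole() {
      return Optional.ofNullable(targetRole);
    }

    void beginTransition(final Role target) {
      targetRole = target;
    }

    void completeTransition() {
      if (targetRole != null) {
        currentRole = targetRole;
        targetRole = null;
      }
    }
  }

  // Failure handling should retry the intended role, not the role we came from.
  static Role roleToRecover(final TransitionContext context) {
    return context.getTargetRole().orElse(context.getCurrentRole());
  }

  public static void main(final String[] args) {
    final TransitionContext context = new TransitionContext(Role.LEADER);
    context.beginTransition(Role.FOLLOWER);

    // Imagine the transition fails at this point, before completeTransition().
    System.out.println("getCurrentRole(): " + context.getCurrentRole());
    System.out.println("roleToRecover():  " + roleToRecover(context));
  }
}
```

In this sketch, a failure handler that only reads `getCurrentRole()` during the transition would see the source role, which is the confusion the second bullet describes. Whether tracking a target role is the right fix for Zeebe is left open here.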