Partition stays in force configuration after force failover #17334

Closed
deepthidevaki opened this issue Apr 5, 2024 · 0 comments · Fixed by #17641
Assignees
Labels
component/zeebe Related to the Zeebe component/team kind/bug Categorizes an issue or PR as a bug severity/mid Marks a bug as having a noticeable impact but with a known workaround version:8.5.1 Marks an issue as being completely or in parts released in 8.5.1 version:8.6.0-alpha1 Label that represents issues released on verions 8.6.0-alpha1

Comments

@deepthidevaki

In a failed e2e multi-region failover test, failback took a long time because a partition did not come out of force configuration for almost 1 hour. The operation eventually succeeded, but by then the test had already failed because of the timeout. It is not expected to take this long: coming out of force configuration should complete within seconds after the force configuration succeeds.

Example of such a run: https://camunda.slack.com/archives/C013MEVQ4M9/p1711523599459309

We can see the following error repeated for almost 1 hour. Eventually the operation succeeds, probably after a leader change.

java.util.concurrent.CompletionException: io.atomix.primitive.PrimitiveException$Unavailable: Force configuration change is in progress. Cannot accept request from 3 which is not a member of the new configuration.
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$UniRun.tryFire(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
	at io.atomix.raft.impl.ReconfigurationHelper.lambda$joinWithRetry$5(ReconfigurationHelper.java:135) ~[zeebe-atomix-cluster-8.5.0-SNAPSHOT.jar:8.5.0-SNAPSHOT]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown Source) ~[?:?]
	at io.atomix.utils.concurrent.SingleThreadContext$WrappedRunnable.run(SingleThreadContext.java:178) ~[zeebe-atomix-utils-8.5.0-SNAPSHOT.jar:8.5.0-SNAPSHOT]
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.base/java.lang.Thread.run(Unknown Source) [?:?]
Caused by: io.atomix.primitive.PrimitiveException$Unavailable: Force configuration change is in progress. Cannot accept request from 3 which is not a member of the new configuration.
	at io.atomix.raft.RaftError$Type$5.createException(RaftError.java:139) ~[zeebe-atomix-cluster-8.5.0-SNAPSHOT.jar:8.5.0-SNAPSHOT]
	at io.atomix.raft.RaftError.createException(RaftError.java:65) ~[zeebe-atomix-cluster-8.5.0-SNAPSHOT.jar:8.5.0-SNAPSHOT]
	at io.atomix.raft.impl.ReconfigurationHelper.lambda$joinWithRetry$5(ReconfigurationHelper.java:133) ~[zeebe-atomix-cluster-8.5.0-SNAPSHOT.jar:8.5.0-SNAPSHOT]
	... 10 more

related to #16126

@deepthidevaki deepthidevaki added kind/bug Categorizes an issue or PR as a bug severity/mid Marks a bug as having a noticeable impact but with a known workaround component/zeebe Related to the Zeebe component/team labels Apr 5, 2024
@deepthidevaki deepthidevaki self-assigned this Apr 5, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 22, 2024
## Description

In the first attempt, both members receive the force configure request
and the current leader steps down. When the leader role is started, the
leader comes out of force configuration and also sends the new
configuration to the follower. However, it is possible that the first
request is interpreted as failed, because either the response was lost
or the pod was restarted before it could mark the request as completed.
When the request is retried, the follower that receives it accepts it,
overwrites its local configuration with the force configuration, and
sends the request to the current leader. The current leader sees that
the membership already matches the requested configuration, so it simply
accepts the request without overwriting its local configuration. As a
result, the follower remains in the force configuration state, because
the leader never re-sends the normal configuration.

To fix this, we remove the optimization where the leader short-circuits
the duplicate force request. Instead, the leader always overwrites its
local configuration and goes through the whole force configuration
process. This is ok because, at worst, it will go through another leader
election.
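
The scenario above can be sketched as a toy model. This is not the
actual Zeebe/Atomix code: `ForceConfigSketch`, `Member`, and every
method name are hypothetical, and leader election is collapsed into a
single "leader completes force configuration" step. It only shows why
short-circuiting a duplicate request leaves the follower stuck, while
always overwriting lets it recover.

```java
// Toy model only: all names are hypothetical and do not mirror the real Raft implementation.
final class ForceConfigSketch {

  /** Minimal stand-in for one member's local configuration state. */
  static final class Member {
    final String name;
    boolean inForceConfiguration; // true while the forced configuration is active locally

    Member(String name) {
      this.name = name;
    }
  }

  /**
   * Leader-side handling of a force configure request.
   * Returns true if the leader restarts the whole force configuration process.
   */
  static boolean leaderHandleForceRequest(
      Member leader, boolean membershipAlreadyMatches, boolean shortCircuitDuplicates) {
    if (membershipAlreadyMatches && shortCircuitDuplicates) {
      // Old behaviour: accept the duplicate without touching the local configuration.
      return false;
    }
    // New behaviour: always overwrite the local configuration and start over.
    leader.inForceConfiguration = true;
    return true;
  }

  /** Assumed simplification: the (re)started leader leaves force configuration and replicates the normal one. */
  static void leaderCompletesForceConfiguration(Member leader, Member follower) {
    leader.inForceConfiguration = false;
    follower.inForceConfiguration = false;
  }

  /** Replays "first attempt, lost response, retry" and reports whether the follower stays stuck. */
  static boolean followerStuckAfterRetry(boolean shortCircuitDuplicates) {
    Member leader = new Member("leader");
    Member follower = new Member("follower");

    // First attempt: both members enter force configuration, then the leader completes it.
    leader.inForceConfiguration = true;
    follower.inForceConfiguration = true;
    leaderCompletesForceConfiguration(leader, follower);

    // Retry: the follower accepts the request and overwrites its local configuration again.
    follower.inForceConfiguration = true;
    boolean restarted = leaderHandleForceRequest(leader, true, shortCircuitDuplicates);
    if (restarted) {
      leaderCompletesForceConfiguration(leader, follower);
    }
    return follower.inForceConfiguration;
  }

  public static void main(String[] args) {
    System.out.println("short-circuit duplicates: follower stuck = " + followerStuckAfterRetry(true));
    System.out.println("always overwrite:         follower stuck = " + followerStuckAfterRetry(false));
  }
}
```

In this model the follower stays in force configuration only when the
leader short-circuits the duplicate request.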

## Related issues

closes #17334
github-merge-queue bot pushed a commit that referenced this issue Apr 29, 2024
…gure request (#18028)

# Description
Backport of #17641 to `stable/8.5`.

relates to #17334
original author: @deepthidevaki
@Zelldon Zelldon added version:8.5.1 Marks an issue as being completely or in parts released in 8.5.1 version:8.6.0-alpha1 labels May 7, 2024