# Fixes outdated topology when no new leader is assigned #5979

## Conversation
One note: I think the fix is actually incomplete. We might need to propagate the terms on follower updates as well, to make sure that we aren't removing a leader with a newer term than the follower event we just received. To check with Deepthi or Miguel.
Thanks. Looks good. Just a small comment. 👍
Just for future reference, as we already discussed: updates from the same node are guaranteed to be delivered in order by our gossip. Hence we don't have to worry about receiving an update from a broker saying it is the follower in a previous term after the update naming it leader for a newer term.
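To make the concern concrete, here is a minimal sketch of what term-aware demotion could look like. All names here are hypothetical illustrations, not Zeebe's actual topology manager API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Zeebe's actual code: demote a recorded leader only
// when the follower event is from the same node and not from an older term.
final class PartitionLeaderTracker {
  record PartitionLeader(int nodeId, long term) {}

  private final Map<Integer, PartitionLeader> leaders = new HashMap<>();

  void onFollowerUpdate(final int partitionId, final int nodeId, final long term) {
    final PartitionLeader current = leaders.get(partitionId);
    // A stale follower event (older term) must not remove a leader that has
    // since been re-elected with a newer term.
    if (current != null && current.nodeId() == nodeId && term >= current.term()) {
      leaders.remove(partitionId);
    }
  }
}
```

With per-node ordering guaranteed by gossip, the term comparison is redundant for events from the same node, which is the argument above.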
I tightened the conditions a little and fixed some things in
```diff
@@ -30,7 +32,7 @@ public final TopologyAssert isComplete(final int clusterSize, final int partitio
   final List<BrokerInfo> brokers = actual.getBrokers();

   if (brokers.size() != clusterSize) {
-    failWithMessage("Expected broker count to be <%s> but was <%s>", clusterSize, brokers.size());
+    throw failure("Expected broker count to be <%s> but was <%s>", clusterSize, brokers.size());
```
The javadoc from `failWithMessage` actually recommends using `throw failure(...)` instead, as the compiler can then realize we're throwing an error (whereas with `failWithMessage` it thinks execution will continue).
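For illustration, here is a minimal sketch of that pattern in a custom AssertJ assert, assuming an AssertJ version where `failure(...)` is available (per the backport note further down, 3.17 does not have it):

```java
import org.assertj.core.api.AbstractAssert;

// Minimal sketch of the throw-failure pattern in a custom assert.
final class BrokerCountAssert extends AbstractAssert<BrokerCountAssert, Integer> {
  BrokerCountAssert(final Integer actual) {
    super(actual, BrokerCountAssert.class);
  }

  BrokerCountAssert hasCount(final int expected) {
    isNotNull();
    if (actual != expected) {
      // failure(...) returns the AssertionError instead of throwing it
      // internally, so the compiler sees this branch cannot fall through.
      throw failure("Expected broker count to be <%s> but was <%s>", expected, actual);
    }
    return this;
  }
}
```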
👍 Thanks.
Force-pushed from 01dddc2 to 11a295a.
bors r+
5979: Fixes outdated topology when no new leader is assigned r=npepinpe a=npepinpe

## Description

This PR fixes a bug in the gateway topology. The topology manager keeps track of who is leader and follower for each partition. This information is gossiped by all nodes in the cluster. Normally, when a node which was leader for partition 1 sends that it is now follower, another node will send that it is leader. There's an edge case, however, when no other node sends that it is the leader. In this case, we end up with a topology where a node is both leader and follower. This means that we report the wrong topology and that the gateway will keep trying to route requests to the node. The case where no new node becomes leader can happen due to network partitioning, for example, and is an expected case we should be able to tolerate.

This PR adds more test coverage and fixes the issue by removing the old partition leader if, when adding a new follower, they have the same ID.

## Related issues

closes #2501

## Definition of Done

_Not all items need to be done depending on the issue and the pull request._

Code changes:
* [x] The changes are backwards compatible with previous versions
* [x] If it fixes a bug then PRs are created to [backport](https://github.com/zeebe-io/zeebe/compare/stable/0.24...develop?expand=1&template=backport_template.md&title=[Backport%200.24]) the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. `backport stable/0.25`) to the PR; in case that fails you need to create the backports manually.

Testing:
* [x] There are unit/integration tests that verify all acceptance criteria of the issue
* [x] New tests are written to ensure backwards compatibility with further versions
* [ ] The behavior is tested manually
* [ ] The impact of the changes is verified by a benchmark

Documentation:
* [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
* [ ] New content is added to the [release announcement](https://drive.google.com/drive/u/0/folders/1DTIeswnEEq-NggJ25rm2BsDjcCQpDape)

Co-authored-by: Nicolas Pépin-Perreault <nicolas.pepin-perreault@camunda.com>
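The fix described above boils down to roughly this shape; the class and member names are illustrative stand-ins, not the actual gateway topology manager code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the described fix, not the actual gateway code.
final class PartitionTopologySketch {
  private final Map<Integer, Integer> leaders = new HashMap<>();
  private final Map<Integer, Set<Integer>> followers = new HashMap<>();

  void addFollower(final int partitionId, final int nodeId) {
    followers.computeIfAbsent(partitionId, id -> new HashSet<>()).add(nodeId);

    // If the node we just recorded as follower is still recorded as leader
    // (because no other node has claimed leadership), remove it so the
    // gateway stops routing requests to it.
    final Integer leader = leaders.get(partitionId);
    if (leader != null && leader == nodeId) {
      leaders.remove(partitionId);
    }
  }
}
```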
Build failed:
Looking at the failed container logs, it looks like we broke backwards compatibility. We were not catching this in the rolling update test before because we did not check in between that a leader was elected, just that the node was removed from/added to the topology. In this instance, it fails (sometimes, and I'm not sure why) because node 0 is up (and updated), node 1 is down, and node 2 is up (but outdated). Node 0 then prints a Kryo error saying it cannot deserialize something, and node 2 gets connection timeouts from node 0 while trying to get elected (I can see it switches to candidate, but it can never become leader).

However, I don't get why the test is flaky... I would expect, if we broke backwards compatibility with serialization, that it always fails. Maybe it depends on who was previously leader? If Zeebe 2 was already leader, maybe it doesn't matter.

Logs from node 0:

Logs from node 2:
This was an unintended side effect, and it looks like by adding the condition we may have caught an unexpected break in our rolling update. I would like to keep the condition, but the fix for it might go into another PR, so we'd need to extract the assert logic and the fix for this test into a different PR before merging this. @MiguelPires could this be related to the checksum stuff? I can't think of anything else we did, but of course it's possible we broke something else.
I think I understand the issue: VersionFieldSerializer allows newer versions to read previously written data (i.e. they can receive messages from the older nodes), but it cannot read new fields. So the older nodes cannot read data from the newer nodes, and they don't ignore the unknown fields either (why not? Good question; it seems like an easy thing to do, just skip the field if its version is higher than what you know).

Can this cause issues during updates? When we update one node, it can receive messages from the other two, and will probably not be leader. When we update the second node, the first updated node could become leader (which we see here), which will cause issues with the older node. The two updated nodes should be able to work together, but our fault tolerance guarantees are lowered, I guess, since the older node is now "useless" until it's updated.

I don't see an easy solution here. The only thing I can think of is postponing adding checksums to 0.27, as we will most likely be breaking backwards compatibility with the new workflow engine anyway; at that point we can change how we do serialization and ignore the issue. Let me know what you think.
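For context, a minimal sketch of how Kryo's `VersionFieldSerializer` behaves; the message type and field names below are made up for illustration:

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.serializers.VersionFieldSerializer;
import com.esotericsoftware.kryo.serializers.VersionFieldSerializer.Since;

public class VersionedSerializationSketch {

  // Hypothetical message type, not Zeebe's actual class.
  public static class PartitionStatus {
    int partitionId;
    int role;

    // Field added in a newer release. A newer reader tolerates its absence in
    // old data (fields above the written version are skipped), but an older
    // reader that sees a higher version on the wire fails instead of skipping
    // the unknown field.
    @Since(1)
    long checksum;
  }

  public static void main(final String[] args) {
    final Kryo kryo = new Kryo();
    kryo.register(
        PartitionStatus.class,
        new VersionFieldSerializer<PartitionStatus>(kryo, PartitionStatus.class));
  }
}
```

This matches the failure mode described above: compatibility is one-directional, newer readers of older data only.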
- fixes an issue in the gateway topology when the old leader becomes follower and no new node is elected leader yet, by removing the new follower from the leader position if it's still identified as the leader
- `TopologyAssert#isComplete` now also checks that all partitions have a leader (see the usage sketch below)
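A possible usage sketch for the strengthened assertion, assuming `TopologyAssert.assertThat` as the entry point and Awaitility for polling; the client settings and the assert's package are assumptions, not confirmed details:

```java
import static org.awaitility.Awaitility.await;

import io.zeebe.client.ZeebeClient;
import io.zeebe.test.util.asserts.TopologyAssert; // assumed package location

public class TopologyCompletenessCheck {
  public static void main(final String[] args) {
    try (final ZeebeClient client = ZeebeClient.newClientBuilder().usePlaintext().build()) {
      // Poll until all 3 brokers are in the topology and every partition has
      // a leader, per the strengthened isComplete check.
      await()
          .untilAsserted(
              () ->
                  TopologyAssert.assertThat(client.newTopologyRequest().send().join())
                      .isComplete(3, 3));
    }
  }
}
```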
Force-pushed from 11a295a to 87cddba.
bors r+
Build succeeded:
The process '/home/runner/work/_actions/zeebe-io/backport-action/master/backport.sh' failed with exit code 4
6011: [Backport stable/0.25] Fixes outdated topology when no new leader is assigned r=npepinpe a=npepinpe

## Description

Backport of #5979 to `stable/0.25`. There were some minor conflicts, and I had to bump the AssertJ version as `failure` did not exist in 3.17.

Co-authored-by: Nicolas Pépin-Perreault <nicolas.pepin-perreault@camunda.com>