Standalone gateway returns out-of-date topology when brokers go away #2501
Comments
Topology query gives a cached result. Attempts to deploy time out:
The gateway does not ask the broker for the topology. The gateway is itself part of the Atomix cluster, so it can take some time for the topology to be adjusted, since configuration changes and node events are spread via SWIM.
After 7 minutes it still hasn't updated. Is that expected?
Nope, then it is a bug I would say.
Is this still occurring? We patched Atomix so that the nodes periodically sync against each other to propagate a consistent view (every 10 seconds by default). There is still a SPoF since the Gateway initially only knows a single broker, but this is only an issue if that broker isn't there when the Gateway initially connects.
Can we check if this still occurs, and verify that this is only reporting the wrong topology and not actually using it?
Blocked by #4554
So I thought it wasn't occurring anymore, as we have many tests indirectly covering this. It turns out that, while we test the case where there are leaders, we almost never test what happens if almost all brokers leave, e.g. 2 out of 3. The gateway doesn't remove leaders, it only replaces them, so yes, we sometimes return the wrong topology. As a result, the gateway may keep trying to ping the old leader even though it is no longer leader, and thus return the wrong error.
5979: Fixes outdated topology when no new leader is assigned r=npepinpe a=npepinpe

## Description

This PR fixes a bug in the gateway topology. The topology manager keeps track of who is leader and follower for each partition. This information is gossiped by all nodes in the cluster. Normally, when a node which was leader for partition 1 sends that it is now follower, another node will send that it is leader. There's an edge case, however, when no other node sends that it is the leader. In this case, we end up with a topology where a node is both leader and follower. This means that we report the wrong topology and that the gateway will keep trying to route requests to the node.

The case where no new node becomes leader can happen due to network partitioning, for example, and is an expected case we should be able to tolerate.

This PR adds more test coverage and fixes the issue by removing the old partition leader if, when adding a new follower, they have the same ID.

## Related issues

closes #2501

## Definition of Done

_Not all items need to be done depending on the issue and the pull request._

Code changes:
* [x] The changes are backwards compatible with previous versions
* [x] If it fixes a bug then PRs are created to [backport](https://github.com/zeebe-io/zeebe/compare/stable/0.24...develop?expand=1&template=backport_template.md&title=[Backport%200.24]) the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. `backport stable/0.25`) to the PR; in case that fails you need to create backports manually.

Testing:
* [x] There are unit/integration tests that verify all acceptance criteria of the issue
* [x] New tests are written to ensure backwards compatibility with further versions
* [ ] The behavior is tested manually
* [ ] The impact of the changes is verified by a benchmark

Documentation:
* [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
* [ ] New content is added to the [release announcement](https://drive.google.com/drive/u/0/folders/1DTIeswnEEq-NggJ25rm2BsDjcCQpDape)

Co-authored-by: Nicolas Pépin-Perreault <nicolas.pepin-perreault@camunda.com>
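
For readers following along, the fix described in the PR above (drop the recorded leader of a partition when the same node is added as a follower) can be sketched roughly as below. This is a minimal illustration only and not the actual Zeebe code: `TopologySketch` and its methods are hypothetical names, and the assumption that `setLeader` also removes the node from the follower set is mine.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified stand-in for the gateway's per-partition topology view. */
final class TopologySketch {
  // partition ID -> node ID of the last known leader
  private final Map<Integer, Integer> leaders = new HashMap<>();
  // partition ID -> node IDs of the known followers
  private final Map<Integer, Set<Integer>> followers = new HashMap<>();

  /** Records nodeId as leader of the partition (assumed to also clear its follower role). */
  void setLeader(int partitionId, int nodeId) {
    leaders.put(partitionId, nodeId);
    Set<Integer> partitionFollowers = followers.get(partitionId);
    if (partitionFollowers != null) {
      partitionFollowers.remove(nodeId);
    }
  }

  /** Records nodeId as follower; if it was the recorded leader, that leader entry is dropped. */
  void addFollower(int partitionId, int nodeId) {
    followers.computeIfAbsent(partitionId, id -> new HashSet<>()).add(nodeId);
    // Map.remove(key, value) only removes the entry when the partition is
    // currently mapped to this node, so a different leader is left untouched.
    leaders.remove(partitionId, nodeId);
  }

  /** Returns the recorded leader, or null when no leader is known. */
  Integer getLeader(int partitionId) {
    return leaders.get(partitionId);
  }

  Set<Integer> getFollowers(int partitionId) {
    return followers.getOrDefault(partitionId, Collections.emptySet());
  }
}
```

With this shape, a partition whose old leader steps down without a successor ends up with no leader at all, rather than with a node listed as both leader and follower.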
Part of a series of "you're doing it wrong" tests to determine the behavior of the system when misconfigured or in a failed state.
How does the standalone gateway behave / respond when the broker behind it goes away?