Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standalone gateway returns out-of-date topology when brokers go away #2501

Closed
jwulf opened this issue May 21, 2019 · 8 comments · Fixed by #5979
Closed

Standalone gateway returns out-of-date topology when brokers go away #2501

jwulf opened this issue May 21, 2019 · 8 comments · Fixed by #5979
Assignees
Labels
kind/bug Categorizes an issue or PR as a bug scope/gateway Marks an issue or PR to appear in the gateway section of the changelog severity/low Marks a bug as having little to no noticeable impact for the user

Comments

@jwulf
Copy link
Member

jwulf commented May 21, 2019

Part of a series of "you're doing it wrong" tests to determine the behavior of the system when misconfigured or in a failed state.

How does the standalone gateway behave / respond when the broker behind it goes away?

@jwulf
Copy link
Member Author

jwulf commented May 27, 2019

Topology query gives a cached result. Attempts to deploy time out:

zeebe-docker-compose/bin on  0.18 [!] at ☸️  ZeebeCluster 
➜ ./zbctl.darwin status --address 127.0.0.1:26500
Cluster size: 3
Partitions count: 2
Replication factor: 3
Brokers:
  Broker 0 - 192.168.128.3:26501
    Partition 1 : Follower
    Partition 2 : Follower
  Broker 1 - 192.168.128.4:26501
    Partition 1 : Follower
    Partition 2 : Follower
  Broker 2 - 192.168.128.5:26501
    Partition 1 : Leader
    Partition 2 : Leader

zeebe-docker-compose/bin on  0.18 [!] at ☸️  ZeebeCluster 
➜ ./zbctl.darwin deploy ~/Downloads/diagram_1.bpmn
Error: rpc error: code = DeadlineExceeded desc = stream terminated by RST_STREAM with error code: CANCEL
Usage:
  zbctl deploy <workflowPath> [flags]

Flags:
  -h, --help   help for deploy

Global Flags:
      --address string   Specify the Zeebe addressFlag

@Zelldon
Copy link
Member

Zelldon commented May 27, 2019

The gateway will not request the broker for the topology. The Gateway is part of the Atomix cluster. It should take some time that the topology is adjusted, since the configuration changes/node events are spread over SWIM.

@jwulf
Copy link
Member Author

jwulf commented May 27, 2019

After 7 minutes it still hasn't updated. Is that expected?

@Zelldon
Copy link
Member

Zelldon commented May 27, 2019

Nope, then it is a bug I would say.

@jwulf jwulf changed the title Chaos: Standalone gateway loses broker Standalone gateway returns out-of-date topology when brokers go away May 27, 2019
@npepinpe
Copy link
Member

Is this still occurring? We patched Atomix so that the nodes periodically sync against each other to propagate a consistent view (every 10 seconds by default). There is still a SPoF since the Gateway initially only knows a single broker, but this is only an issue if that broker isn't there when the Gateway initially connects.

@npepinpe npepinpe added scope/gateway Marks an issue or PR to appear in the gateway section of the changelog Status: Needs Priority kind/bug Categorizes an issue or PR as a bug severity/low Marks a bug as having little to no noticeable impact for the user labels May 26, 2020
@npepinpe
Copy link
Member

Can we check if this still occurs, and verify that this is only reporting the wrong topology and not actually using it?

@npepinpe
Copy link
Member

npepinpe commented Nov 8, 2020

Blocked by #4554

@npepinpe npepinpe removed their assignment Nov 8, 2020
@npepinpe
Copy link
Member

npepinpe commented Dec 7, 2020

So I thought it wasn't occurring anymore, as we have many tests indirectly covering this - turns out, while we test when there are leaders, we almost never test what happens if almost all brokers leave, e.g. 2 out of 3. Turns out the gateway doesn't remove leaders, just replaces them, so yes, we sometimes return the wrong topology. This can affect that the gateway will keep trying to ping the old leader even if it isn't leader, and thus return the wrong error.

@npepinpe npepinpe self-assigned this Dec 7, 2020
zeebe-bors bot added a commit that referenced this issue Dec 9, 2020
5979: Fixes outdated topology when no new leader is assigned r=npepinpe a=npepinpe

## Description

This PR fixes a bug in the gateway topology. The topology manager keeps track of who is leader and follower for each partition. This information is gossiped by all nodes in the cluster. Normally, when a node which was leader for partition 1 sends that it is now follower, another node will send that it is leader. There's an edge case, however, when no other node sends that it is the leader. In this case, we end up with a topology where a node is both leader and follower. This means that we report the wrong topology and that the gateway will keep trying to route requests to the node. The case where no new node becomes leader can happen due to network partitioning, for example, and is an expected case we should be able to tolerate.

This PR adds more test coverage and fixes the issue by removing the old partition leader if, when adding a new follower, they have the same ID.

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #2501 

## Definition of Done

_Not all items need to be done depending on the issue and the pull request._

Code changes:
* [x] The changes are backwards compatibility with previous versions
* [x] If it fixes a bug then PRs are created to [backport](https://github.com/zeebe-io/zeebe/compare/stable/0.24...develop?expand=1&template=backport_template.md&title=[Backport%200.24]) the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. `backport stable/0.25`) to the PR, in case that fails you need to create backports manually.

Testing:
* [x] There are unit/integration tests that verify all acceptance criterias of the issue
* [x] New tests are written to ensure backwards compatibility with further versions
* [ ] The behavior is tested manually
* [ ] The impact of the changes is verified by a benchmark 

Documentation: 
* [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
* [ ] New content is added to the [release announcement](https://drive.google.com/drive/u/0/folders/1DTIeswnEEq-NggJ25rm2BsDjcCQpDape)


Co-authored-by: Nicolas Pépin-Perreault <nicolas.pepin-perreault@camunda.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes an issue or PR as a bug scope/gateway Marks an issue or PR to appear in the gateway section of the changelog severity/low Marks a bug as having little to no noticeable impact for the user
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants