Standalone gateway returns out-of-date topology when brokers go away #2501

jwulf · 2019-05-21T14:13:51Z

Part of a series of "you're doing it wrong" tests to determine the behavior of the system when misconfigured or in a failed state.

How does the standalone gateway behave / respond when the broker behind it goes away?

jwulf · 2019-05-27T07:08:45Z

Topology query gives a cached result. Attempts to deploy time out:

zeebe-docker-compose/bin on  0.18 [!] at ☸️  ZeebeCluster 
➜ ./zbctl.darwin status --address 127.0.0.1:26500
Cluster size: 3
Partitions count: 2
Replication factor: 3
Brokers:
  Broker 0 - 192.168.128.3:26501
    Partition 1 : Follower
    Partition 2 : Follower
  Broker 1 - 192.168.128.4:26501
    Partition 1 : Follower
    Partition 2 : Follower
  Broker 2 - 192.168.128.5:26501
    Partition 1 : Leader
    Partition 2 : Leader

zeebe-docker-compose/bin on  0.18 [!] at ☸️  ZeebeCluster 
➜ ./zbctl.darwin deploy ~/Downloads/diagram_1.bpmn
Error: rpc error: code = DeadlineExceeded desc = stream terminated by RST_STREAM with error code: CANCEL
Usage:
  zbctl deploy <workflowPath> [flags]

Flags:
  -h, --help   help for deploy

Global Flags:
      --address string   Specify the Zeebe addressFlag

Zelldon · 2019-05-27T07:13:35Z

The gateway does not request the topology from the broker. The gateway is itself a member of the Atomix cluster, so it can take some time for the topology to be adjusted, since configuration changes and node events are propagated via SWIM.

jwulf · 2019-05-27T07:14:39Z

After 7 minutes it still hasn't updated. Is that expected?

Zelldon · 2019-05-27T07:19:20Z

Nope, then it is a bug I would say.

npepinpe · 2020-03-16T13:39:27Z

Is this still occurring? We patched Atomix so that the nodes periodically sync against each other to propagate a consistent view (every 10 seconds by default). There is still a SPoF since the Gateway initially only knows a single broker, but this is only an issue if that broker isn't there when the Gateway initially connects.

npepinpe · 2020-05-26T07:47:47Z

Can we check if this still occurs, and verify that this is only reporting the wrong topology and not actually using it?

npepinpe · 2020-11-08T18:44:33Z

Blocked by #4554

npepinpe · 2020-12-07T19:36:10Z

So I thought it wasn't occurring anymore, as we have many tests indirectly covering this. It turns out that while we test the case where there are leaders, we almost never test what happens when almost all brokers leave, e.g. 2 out of 3. The gateway doesn't remove leaders, it only replaces them, so yes, we sometimes return the wrong topology. As a result, the gateway may keep trying to ping the old leader even though it is no longer leader, and thus return the wrong error.
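
For readers following along, here is a minimal sketch of how a "replace only, never remove" leader map can end up stale. It is a hypothetical model written for illustration: the class `NaiveTopology` and its methods are assumptions and are not Zeebe's actual gateway classes.

```python
class NaiveTopology:
    """Hypothetical model of a topology view where leaders are only replaced."""

    def __init__(self):
        self.leaders = {}    # partition id -> broker id
        self.followers = {}  # partition id -> set of broker ids

    def set_leader(self, partition, broker):
        # A new leader overwrites the previous one for that partition.
        self.leaders[partition] = broker

    def add_follower(self, partition, broker):
        # The leader entry is left untouched, even if it names this broker.
        self.followers.setdefault(partition, set()).add(broker)


topology = NaiveTopology()
topology.set_leader(1, 2)    # broker 2 leads partition 1
topology.add_follower(1, 0)
topology.add_follower(1, 1)

# Brokers 0 and 1 go away; broker 2 steps down and announces itself as follower.
# No other broker announces leadership, so nothing overwrites the old entry.
topology.add_follower(1, 2)

stale = topology.leaders[1] in topology.followers[1]
print(stale)  # True: broker 2 is listed as both leader and follower
```

In this model a request for partition 1 would still be routed to broker 2, which matches the symptom of the gateway pinging a broker that is no longer leader.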

5979: Fixes outdated topology when no new leader is assigned r=npepinpe a=npepinpe

## Description

This PR fixes a bug in the gateway topology. The topology manager keeps track of who is leader and follower for each partition. This information is gossiped by all nodes in the cluster. Normally, when a node which was leader for partition 1 sends that it is now follower, another node will send that it is leader. There's an edge case, however, when no other node sends that it is the leader. In this case, we end up with a topology where a node is both leader and follower. This means that we report the wrong topology and that the gateway will keep trying to route requests to the node.

The case where no new node becomes leader can happen due to network partitioning, for example, and is an expected case we should be able to tolerate. This PR adds more test coverage and fixes the issue by removing the old partition leader if, when adding a new follower, they have the same ID.

## Related issues

closes #2501

## Definition of Done

_Not all items need to be done depending on the issue and the pull request._

Code changes:
* [x] The changes are backwards compatible with previous versions
* [x] If it fixes a bug then PRs are created to [backport](https://github.com/zeebe-io/zeebe/compare/stable/0.24...develop?expand=1&template=backport_template.md&title=[Backport%200.24]) the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. `backport stable/0.25`) to the PR; in case that fails, you need to create backports manually.

Testing:
* [x] There are unit/integration tests that verify all acceptance criteria of the issue
* [x] New tests are written to ensure backwards compatibility with further versions
* [ ] The behavior is tested manually
* [ ] The impact of the changes is verified by a benchmark

Documentation:
* [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
* [ ] New content is added to the [release announcement](https://drive.google.com/drive/u/0/folders/1DTIeswnEEq-NggJ25rm2BsDjcCQpDape)

Co-authored-by: Nicolas Pépin-Perreault <nicolas.pepin-perreault@camunda.com>
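
The rule the PR description states (drop the recorded leader when the same node is added as a follower) can be sketched as follows. This is an illustrative model under stated assumptions, not the code from the PR: `Topology`, `set_leader`, `add_follower`, and `leader_for` are hypothetical names.

```python
class Topology:
    """Hypothetical per-partition view of leaders and followers."""

    def __init__(self):
        self.leaders = {}    # partition id -> broker id
        self.followers = {}  # partition id -> set of broker ids

    def set_leader(self, partition, broker):
        # A leader should not also be listed as a follower.
        self.followers.setdefault(partition, set()).discard(broker)
        self.leaders[partition] = broker

    def add_follower(self, partition, broker):
        # If the new follower is the recorded leader, remove the leader entry
        # so the partition is reported as leaderless rather than stale.
        if self.leaders.get(partition) == broker:
            del self.leaders[partition]
        self.followers.setdefault(partition, set()).add(broker)

    def leader_for(self, partition):
        # None means no leader is currently known for the partition.
        return self.leaders.get(partition)


fixed = Topology()
fixed.set_leader(1, 2)
fixed.add_follower(1, 2)    # broker 2 steps down, nobody takes over
print(fixed.leader_for(1))  # None
```

With a leaderless partition represented explicitly, a caller can report that no leader is available for the partition rather than routing to a node that has already stepped down.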

jwulf mentioned this issue May 27, 2019

Document how to create and configure a standalone gateway #2002

Closed

jwulf changed the title ~~Chaos: Standalone gateway loses broker~~ Standalone gateway returns out-of-date topology when brokers go away May 27, 2019

npepinpe added scope/gateway Marks an issue or PR to appear in the gateway section of the changelog Status: Needs Priority kind/bug Categorizes an issue or PR as a bug severity/low Marks a bug as having little to no noticeable impact for the user labels May 26, 2020

npepinpe added Priority: Low and removed Status: Needs Priority labels May 27, 2020

npepinpe self-assigned this May 27, 2020

npepinpe removed their assignment Nov 8, 2020

npepinpe self-assigned this Dec 7, 2020

npepinpe added Status: Planned and removed Status: Ready labels Dec 7, 2020

npepinpe mentioned this issue Dec 7, 2020

Fixes outdated topology when no new leader is assigned #5979

Merged

8 tasks

npepinpe added Status: Needs Review and removed Status: In Progress labels Dec 7, 2020

zeebe-bors bot closed this as completed in c4ab987 Dec 14, 2020

KerstinHebel removed the Status: Needs Review label Dec 14, 2020

npepinpe added the Release: 0.26.0 label Jan 5, 2021

oleschoenburg added the version:8.5.0 label Apr 4, 2024
