Simplify and correct DefaultClusterEventService implementation #4154

Closed
npepinpe opened this issue Feb 25, 2020 · 1 comment · Fixed by #4387
@npepinpe (Member)

Expected behavior

The current DefaultClusterEventService implementation is complex, buggy, and difficult to test. It works as follows:

  1. Every second, pick a node and gossip the list of subscriptions newly created since the last time we gossiped to that node.
  2. Every minute, purge all subscriptions that were marked as tombstones before the lowest update time across all nodes. This is done by tracking the last time we gossiped with each node and taking the minimum (or 0 if there is a node in the cluster we have never contacted).
  3. On subscribing locally, create a non-tombstone subscription and send it to all nodes.
  4. On receiving a remote subscription, if we have never seen a subscription for that topic and memberId, add it to our known subscriptions. If we have seen one and the remote subscription is a tombstone, add the tombstone to the known subscriptions. If the remote subscription is not a tombstone, do nothing.

So the expectation is that if a node subscribes to a topic, all nodes in the cluster eventually learn about it through gossip, and if it unsubscribes, all nodes eventually learn about that as well.
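
A minimal sketch of the merge rule in step 4, using hypothetical, simplified types (the real DefaultClusterEventService tracks more state than this):

```java
import java.util.HashMap;
import java.util.Map;

final class SubscriptionStore {

  // known subscriptions keyed by topic, then by memberId; a tombstone marks an unsubscribe
  private final Map<String, Map<String, Subscription>> known = new HashMap<>();

  void onRemoteSubscription(final Subscription remote) {
    final Map<String, Subscription> byMember =
        known.computeIfAbsent(remote.topic(), topic -> new HashMap<>());
    final Subscription existing = byMember.get(remote.memberId());

    if (existing == null) {
      // never seen this (topic, memberId) pair -> accept it
      byMember.put(remote.memberId(), remote);
    } else if (remote.isTombstone()) {
      // an incoming tombstone always wins -> record the unsubscribe
      byMember.put(remote.memberId(), remote);
    }
    // otherwise the incoming non-tombstone is ignored, which is what lets a
    // stale tombstone shadow a later re-subscription (see "Actual behavior")
  }

  record Subscription(String topic, String memberId, boolean isTombstone) {}
}
```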

Actual behavior

A tombstone subscription can prevent a node from subscribing to a topic. If I subscribe and then unsubscribe, the other nodes hold a tombstone subscription for my memberId and topic. If I re-subscribe, then, as noted in step 4 above, my new subscription may be ignored. Since we don't typically retry subscribing, the other nodes may never learn about the subscription, because gossip only resends newly created subscriptions, never the whole list.

Another bug is that a node rejoining the cluster is never sent the existing subscriptions, since the other nodes only gossip "new" subscriptions to known members. This would work if we removed members from the updateTimes map when they leave, but we never do, so a node rejoining with the same old ID keeps its old update time.
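
A rough sketch of that per-node delta gossip; the names are hypothetical, with updateTimes being the only identifier taken from the description above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DeltaGossip {

  private final Map<String, Long> updateTimes = new HashMap<>(); // memberId -> last gossip time
  private final List<TimestampedSubscription> subscriptions = new ArrayList<>();

  // only subscriptions created after the member's recorded update time are sent
  List<TimestampedSubscription> deltaFor(final String memberId, final long now) {
    final long since = updateTimes.getOrDefault(memberId, 0L);
    // the entry is never removed, even when a member leaves, so a member that
    // rejoins under the same ID keeps a stale timestamp and misses older subscriptions
    updateTimes.put(memberId, now);
    return subscriptions.stream().filter(s -> s.createdAt() > since).toList();
  }

  record TimestampedSubscription(String topic, String memberId, long createdAt) {}
}
```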

@npepinpe self-assigned this Feb 25, 2020
@npepinpe (Member Author)

The proposed solution is to have the cluster event service keep track of known topics/subscriptions, but only gossip its own topics/subscriptions (the entire set) to other nodes. This greatly simplifies the implementation at the cost of some inefficiency (we always transmit all of our local subscriptions), but since we use very few topics and the gossip interval can be relatively high (on the order of seconds), this is most likely fine. To handle a partition or crash, we can also listen for cluster membership events and remove a dead node's subscriptions.
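
A minimal sketch of that design, using hypothetical names rather than the actual Atomix types:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class SimplifiedClusterEventService {

  private final Set<String> localTopics = ConcurrentHashMap.newKeySet();
  private final Map<String, Set<String>> topicsByMember = new ConcurrentHashMap<>();

  void subscribeLocally(final String topic) {
    localTopics.add(topic);
  }

  void unsubscribeLocally(final String topic) {
    localTopics.remove(topic);
  }

  // gossip tick: always send the full local set, never a delta, so there is no
  // tombstone bookkeeping and a rejoining node simply receives everything again
  Set<String> snapshotForGossip() {
    return Set.copyOf(localTopics);
  }

  void onGossip(final String memberId, final Set<String> topics) {
    // replacing the whole set lets unsubscribes converge without tombstones
    topicsByMember.put(memberId, Set.copyOf(topics));
  }

  // hypothetical hook, wired to a cluster membership listener: forget a dead
  // member's subscriptions so they do not linger after a crash or partition
  void onMemberRemoved(final String memberId) {
    topicsByMember.remove(memberId);
  }
}
```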

@Zelldon transferred this issue from zeebe-io/atomix Mar 27, 2020
zeebe-bors bot added a commit that referenced this issue Apr 29, 2020
4387: Refactor cluster event service r=deepthidevaki a=deepthidevaki

## Description

* Use member properties to propagate subscription info instead of custom gossip
* EventService can now only `broadcast`. Methods for unicast and send are removed from the interface (a rough sketch follows below).
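
A rough sketch of what the narrowed interface could look like after this change; the signatures here are hypothetical and simplified, not the actual Atomix/Zeebe API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

interface BroadcastOnlyEventService {

  // deliver the message to every current subscriber of the topic;
  // unicast/send variants no longer exist on the interface
  <M> void broadcast(String topic, M message);

  // register a local handler; subscription info is propagated via member properties
  <M> CompletableFuture<Subscription> subscribe(String topic, Consumer<M> handler);

  interface Subscription {
    CompletableFuture<Void> close();
  }
}
```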

## Related issues


closes #4154 
closes #4307 

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
@zeebe-bors bot closed this as completed in 86d86d1 Apr 29, 2020