Feature request: Clustering support #2443

Closed
1 of 3 tasks
svagner opened this issue Jan 2, 2020 · 10 comments
Comments

@svagner
Contributor

svagner commented Jan 2, 2020

Short description

Currently, Bosun doesn't support any HA or load distribution. We should provide something that allows us to run Bosun as a highly available and scalable service.

How this feature will help you/your organisation

  • Automatic failover when the server running Bosun becomes unavailable
  • Avoid the split-brain problem
  • Distribute check execution across multiple servers

Possible solution or implementation details

One working implementation: #2441

I propose using the Raft clustering implementation from HashiCorp.
Possible roadmap:

  • Create a cluster to improve availability, with a simple master-slave configuration. We can use the silence/nochecks flags to put a node into standby. This step involves no snapshots etc., just a simple standby (see the sketch after this list).
  • Add support for snapshotting the cluster state, rotating snapshots, and recovering the cluster state.
  • The leader can distribute check tasks (keyed by check name) between nodes using consistent hashing. At that step we can stop using flags as the main instrument for managing nodes within the cluster.
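
To make the first step concrete, here is a rough sketch of how a standby-only cluster could be bootstrapped with hashicorp/raft. The node IDs, addresses, and the no-op FSM are illustrative assumptions, not the actual integration from #2441:

```go
// Minimal sketch: a single Raft node with a no-op FSM, used only for
// leader election (the "simple standby" step). Node IDs and addresses
// are made up for illustration.
package main

import (
	"fmt"
	"io"
	"os"
	"time"

	"github.com/hashicorp/raft"
)

// noopFSM ignores log entries; in this step the cluster is only used
// to decide which node is active and which nodes stand by.
type noopFSM struct{}

func (noopFSM) Apply(*raft.Log) interface{}         { return nil }
func (noopFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, fmt.Errorf("no snapshots yet") }
func (noopFSM) Restore(io.ReadCloser) error         { return nil }

func main() {
	cfg := raft.DefaultConfig()
	cfg.LocalID = raft.ServerID("bosun-1") // hypothetical node ID

	// TCP transport on this node's cluster address (hypothetical).
	trans, err := raft.NewTCPTransport("10.0.0.1:8337", nil, 3, 10*time.Second, os.Stderr)
	if err != nil {
		panic(err)
	}

	// In-memory stores keep the sketch short; a real node would persist these.
	logs := raft.NewInmemStore()
	stable := raft.NewInmemStore()
	snaps := raft.NewInmemSnapshotStore()

	r, err := raft.NewRaft(cfg, noopFSM{}, logs, stable, snaps, trans)
	if err != nil {
		panic(err)
	}

	// Bootstrap a 3-node cluster so the loss of one node can be tolerated.
	r.BootstrapCluster(raft.Configuration{Servers: []raft.Server{
		{ID: "bosun-1", Address: "10.0.0.1:8337"},
		{ID: "bosun-2", Address: "10.0.0.2:8337"},
		{ID: "bosun-3", Address: "10.0.0.3:8337"},
	}})

	// React to leadership changes: the leader runs checks, followers stand by.
	for isLeader := range r.LeaderCh() {
		if isLeader {
			fmt.Println("became leader: enable checks")
		} else {
			fmt.Println("lost leadership: go standby (silence/nochecks)")
		}
	}
}
```

With only leader election in place, a follower would simply keep its silence/nochecks behaviour until LeaderCh reports that it has become the leader.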
@svagner
Contributor Author

svagner commented Jan 2, 2020

Related issues: #2360

@johnewing1
Contributor

Hi, thanks for this contribution 👍
What's the minimum number of nodes needed for an HA solution?
In the description you talk about a master-slave solution, but my understanding was it would require at least 3 nodes for Raft to tolerate the loss of a node and elect a new master.

@svagner
Contributor Author

svagner commented Jan 9, 2020

Yes, that's correct: Raft needs a majority quorum (floor(N/2)+1 nodes), so we need at least 3 nodes to tolerate the loss of one. This will also allow us, in the future, to spread the check load between the nodes. In that case the leader will be responsible for assigning tasks to servers (I'm thinking of consistent hashing based on the alert definition name).
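
To illustrate the distribution idea, a consistent-hash ring keyed by alert name could assign each alert to one node roughly like this (a sketch; the hash function, virtual-node count, and names are placeholders, not taken from #2441 or #2472):

```go
// Sketch of assigning alerts to cluster nodes with a consistent-hash ring.
// Hash function, virtual-node count, and names are illustrative only.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	keys  []uint32          // sorted hashes of virtual nodes
	nodes map[uint32]string // hash -> node name
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(nodes []string, vnodes int) *ring {
	r := &ring{nodes: map[uint32]string{}}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			k := hashKey(fmt.Sprintf("%s#%d", n, i))
			r.keys = append(r.keys, k)
			r.nodes[k] = n
		}
	}
	sort.Slice(r.keys, func(i, j int) bool { return r.keys[i] < r.keys[j] })
	return r
}

// owner returns the node responsible for a given alert: the first
// virtual node clockwise from the alert's hash.
func (r *ring) owner(alert string) string {
	h := hashKey(alert)
	i := sort.Search(len(r.keys), func(i int) bool { return r.keys[i] >= h })
	if i == len(r.keys) {
		i = 0
	}
	return r.nodes[r.keys[i]]
}

func main() {
	r := newRing([]string{"bosun-1", "bosun-2", "bosun-3"}, 64)
	for _, alert := range []string{"high_cpu", "disk_full", "ping_loss"} {
		fmt.Printf("%s -> %s\n", alert, r.owner(alert))
	}
}
```

Because only the keys adjacent to a joining or leaving node move, most alerts keep their owner when cluster membership changes.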

@johnewing1
Contributor

I've been thinking about this approach, and one thing that would need to be resolved is how updates to the alert definitions are handled.

Currently, editing the rule definitions via the UI or API and saving them to the local disk is supported. Without further work this would lead to the definitions becoming inconsistent in a multi-node setup.

@svagner
Contributor Author

svagner commented Feb 14, 2020

Yes, it definitely is. Currently we sync configs during deployment, but I think I will add config syncing over the cluster internally.
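
One way the internal sync could work, purely as a sketch and not necessarily what #2472 implements, is to push rule-file updates through the Raft log so every node's FSM writes the same definitions to disk:

```go
// Sketch: replicating rule-definition updates through the Raft log.
// The command format and the file path are assumptions for illustration.
package cluster

import (
	"errors"
	"io"
	"os"

	"github.com/hashicorp/raft"
)

// ruleFSM applies replicated rule-file contents to the local disk, so a
// save made on the leader ends up identical on every node.
type ruleFSM struct {
	path string // e.g. a hypothetical /etc/bosun/rules.conf
}

func (f *ruleFSM) Apply(l *raft.Log) interface{} {
	// The log entry data is the full rule file saved via the UI/API.
	return os.WriteFile(f.path, l.Data, 0o644)
}

func (f *ruleFSM) Snapshot() (raft.FSMSnapshot, error) {
	return nil, errors.New("snapshots omitted in this sketch")
}

func (f *ruleFSM) Restore(r io.ReadCloser) error {
	data, err := io.ReadAll(r)
	if err != nil {
		return err
	}
	return os.WriteFile(f.path, data, 0o644)
}

// saveRules is what the UI/API save handler could call on the leader:
// the new config is committed to the Raft log instead of written directly.
func saveRules(r *raft.Raft, newConfig []byte) error {
	return r.Apply(newConfig, 0).Error()
}
```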

@svagner svagner mentioned this issue Apr 23, 2020
@svagner
Contributor Author

svagner commented Apr 23, 2020

> I've been thinking about this approach, and one thing that would need to be resolved is how updates to the alert definitions are handled.
>
> Currently, editing the rule definitions via the UI or API and saving them to the local disk is supported. Without further work this would lead to the definitions becoming inconsistent in a multi-node setup.

I've created a new pull request with rule-sync support and some improvements. We are using the clustering from #2472 in production now and haven't noticed any issues.

@stale

stale bot commented Apr 18, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Apr 18, 2021
@langerma

So no progress on this?

@stale stale bot removed the wontfix label Apr 19, 2021
@svagner
Contributor Author

svagner commented Apr 19, 2021

It looks like the community didn't want to support this cluster implementation, and I have stopped using Bosun. Perhaps someone else will take over clustering for Bosun, but it won't be me anymore. Sorry.

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Apr 16, 2022
@stale stale bot closed this as completed May 26, 2022