New safety checks on exclusions that could take a database unavailable #1292

Open
brownleej opened this issue Mar 13, 2019 · 3 comments
Labels
operations Issues or features that would interest operations/SRE teams

@brownleej
Contributor

When we're running in a multi-DC configuration, it's possible to run an exclusion that can take the database unavailable. For instance, if a database is configured to run satellite logs, but we exclude all of the satellite processes, the database would go unavailable. Should we add a safety check in the exclusion to prevent this kind of mistake?
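A minimal sketch of what such a check could look like, written in Python for brevity rather than in the C++ of fdbcli. Everything here is hypothetical and for illustration only: the `Process` record, the `satellite_dcs` set, the `required_satellite_logs` parameter, and both function names are assumptions, not FoundationDB APIs. It also assumes exclusions match on the exact `ip:port` address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    # Hypothetical view of one process in the cluster.
    address: str        # e.g. "10.0.2.1:4500"
    dc_id: str          # datacenter the process runs in
    process_class: str  # e.g. "log", "storage", "stateless"


def remaining_satellite_logs(processes, excluded, satellite_dcs):
    """Count log-class processes in satellite DCs that are not excluded."""
    return sum(
        1
        for p in processes
        if p.dc_id in satellite_dcs
        and p.process_class == "log"
        and p.address not in excluded
    )


def check_exclusion(processes, excluded, satellite_dcs, required_satellite_logs):
    """Return (ok, message) for a proposed set of excluded addresses."""
    remaining = remaining_satellite_logs(processes, excluded, satellite_dcs)
    if remaining < required_satellite_logs:
        return False, (
            f"exclusion leaves {remaining} satellite log process(es); "
            f"configuration requires {required_satellite_logs}"
        )
    return True, "ok"
```

A caller would run `check_exclusion` with the current exclusions plus the newly requested ones, and refuse the command when `ok` is false.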

On a related note, it's possible that an exclusion could put the database into a configuration that is viable but undesirable. For instance, if a database is configured to run 5 proxies, but we exclude all of the stateless class processes but 1, the database will go down to 1 proxy, and it will put that proxy on that single process along with all of the other stateless roles. Should we add safety checks for this kind of thing as well?
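For the "viable but undesirable" case, a warning rather than a hard failure may be more appropriate. The sketch below (Python, with a hypothetical function name and inputs) assumes the operator wants at least as many stateless-class processes as configured proxies; that threshold is an assumption for illustration, not a rule FoundationDB defines.

```python
def stateless_warnings(stateless_addresses, excluded, desired_proxies):
    """Return warnings for an exclusion that would crowd stateless roles.

    stateless_addresses: addresses of all stateless-class processes
    excluded:            set of addresses excluded after the command
    desired_proxies:     configured proxy count
    """
    remaining = [a for a in stateless_addresses if a not in excluded]
    warnings = []
    if not remaining:
        warnings.append("no stateless-class processes would remain")
    elif len(remaining) < desired_proxies:
        warnings.append(
            f"only {len(remaining)} stateless-class process(es) would remain "
            f"for {desired_proxies} configured proxies"
        )
    return warnings
```

With five stateless processes and four of them excluded, this returns a single warning that one process would remain for five configured proxies.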

@xumengpanda
Contributor

xumengpanda commented Mar 13, 2019

Do we have a formal definition of what a desired database configuration should be?
Based on that, can we measure the "fitness" of a configuration? (Do we have an equation to compute the "fitness" of a database configuration?)

@hgray1 hgray1 added this to the 6.2 milestone Mar 18, 2019
@etschannen etschannen modified the milestones: 6.2, 7.0 Jul 31, 2019
@etschannen etschannen removed their assignment Jul 31, 2019
@etschannen etschannen self-assigned this Dec 10, 2019
@etschannen etschannen removed their assignment Jan 13, 2020
@brownleej brownleej added the operations Issues or features that would interest operations/SRE teams label Jun 16, 2020
@apkar
Contributor

apkar commented Apr 5, 2021

Multi-DC is a good example. In general, we should add checks to exclude so that it refuses to exclude instances whose removal would make the database unavailable.

The exclude command already has checks to make sure the cluster would not run low on disk space.

https://github.com/apple/foundationdb/blob/master/fdbcli/fdbcli.actor.cpp#L2427

We don't have any checks to prevent someone from excluding too many stateless/tlog processes. Doing so can prevent the cluster controller from recruiting enough roles to form a cluster, which would cause an immediate outage.
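As a rough illustration of a count-based check, the Python sketch below compares the processes left in each class against a minimum. The function name, the `process_classes` mapping, and the `minimum_per_class` thresholds are all hypothetical. It is also a simplification: it treats each class independently and ignores any fallback where a role is recruited on a process of a different class.

```python
from collections import Counter


def recruitment_shortfalls(process_classes, excluded, minimum_per_class):
    """Report classes that would have too few processes after an exclusion.

    process_classes:   dict mapping address -> process class
    excluded:          set of addresses excluded after the command
    minimum_per_class: dict mapping class -> minimum processes needed
    Returns a dict mapping class -> (remaining, needed) for each shortfall.
    """
    remaining = Counter(
        cls for addr, cls in process_classes.items() if addr not in excluded
    )
    return {
        cls: (remaining[cls], needed)
        for cls, needed in minimum_per_class.items()
        if remaining[cls] < needed
    }
```

An empty result means no class falls below its minimum under these assumptions.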

It would be great to have some checks around this.

@jzhou77 jzhou77 modified the milestones: 6.3, 7.0, 7.1 Apr 5, 2021
@sfc-gh-mpilman
Collaborator

The way we want to do this is by calling into something like BetterMasterExists. So the high level idea is that the management API would ask the ClusterController whether a recovery after an exclude would succeed.
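The control flow described here could be sketched as follows, in Python for brevity. This is not the FoundationDB management API: `exclude`, `ExclusionRejected`, the `force` flag, and the `would_recover` callable (a stand-in for whatever query the ClusterController would answer) are hypothetical names used only to show the shape of the idea.

```python
class ExclusionRejected(Exception):
    """Raised when the proposed exclusion would leave no viable recovery."""


def exclude(targets, current_exclusions, would_recover, force=False):
    """Return the new exclusion set, or raise if recovery would fail.

    would_recover: callable taking the proposed exclusion set and returning
                   True if a recovery could still recruit all roles.
    force:         skip the safety check (assumed escape hatch for operators).
    """
    proposed = set(current_exclusions) | set(targets)
    if not force and not would_recover(proposed):
        raise ExclusionRejected(
            f"excluding {sorted(set(targets))} would prevent recovery"
        )
    return proposed
```

Keeping the viability decision behind a single callable means the exclude path does not need its own copy of the recruitment rules; it only asks the component that already owns them.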

7 participants