Protect controller from becoming unscheduleable #14

Closed
jahkeup opened this issue Nov 11, 2019 · 3 comments

Comments

@jahkeup
Member

jahkeup commented Nov 11, 2019

Also, the Operator Controller should handle killing itself, especially in a single-node cluster! It should not prevent itself from being scheduled in a cluster.

Originally posted by @jahkeup in bottlerocket-os/bottlerocket#239 (comment)

@jahkeup jahkeup changed the title dogswatch: prevent Controller from unscheduleable conditions dogswatch: protect Controller from unscheduleable conditions Nov 11, 2019
@webern webern transferred this issue from bottlerocket-os/bottlerocket Feb 26, 2020
@jahkeup jahkeup changed the title dogswatch: protect Controller from unscheduleable conditions Protect controller from becoming unscheduleable Feb 27, 2020
@jahkeup
Member Author

jahkeup commented Mar 12, 2020

One way to protect the controller could be to have the controller save its hosting node to be updated last. Once it has worked through updating the other nodes, the controller would delete its Pod so that it is rescheduled, and only once it has started elsewhere would it continue on to update that last node.

The controller's deployment should then include bottlerocket.aws/update-available in its antiAffinity weighted selector (preferring update-available==false) so that it lands on updated hosts first.
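
As a rough illustration of that scheduling preference, here is a minimal sketch that builds the affinity stanza as a plain Python dict. The label key is the one named above; expressing the preference as a weighted node affinity term, the weight of 100, and the helper name `preferred_updated_node_affinity` are all assumptions made for this example, not what the controller's deployment actually ships.

```python
import json

# Label key taken from the comment above.
UPDATE_AVAILABLE_LABEL = "bottlerocket.aws/update-available"


def preferred_updated_node_affinity(weight: int = 100) -> dict:
    """Build an affinity stanza that prefers nodes with no pending update.

    Hypothetical helper: the weight and the choice of a preferred
    node-affinity term are assumptions for illustration only.
    """
    # Kubernetes accepts weights from 1 to 100 for preferred terms.
    if not 1 <= weight <= 100:
        raise ValueError("weight must be in the range 1..100")
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": UPDATE_AVAILABLE_LABEL,
                                "operator": "In",
                                "values": ["false"],
                            }
                        ]
                    },
                }
            ]
        }
    }


if __name__ == "__main__":
    # Print the stanza so it can be compared against a Deployment spec.
    print(json.dumps(preferred_updated_node_affinity(), indent=2))
```

Because the term is only preferred, the scheduler can still place the controller on a node that has an update pending when no updated node is available.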

This method wouldn't account for a single-node cluster or one where only one node was Ready and Schedulable. The controller will have to check that it considers itself reschedulable before stopping its Pod.
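
The ordering and the reschedulability check could look roughly like the sketch below. It works on plain data rather than the Kubernetes API; the `Node` type and the `update_order` and `can_reschedule` helpers are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    """Simplified, hypothetical view of a cluster node."""
    name: str
    ready: bool
    schedulable: bool


def update_order(nodes: List[Node], host: str) -> List[Node]:
    """Return nodes in update order, with the controller's host node last."""
    others = [n for n in nodes if n.name != host]
    hosting = [n for n in nodes if n.name == host]
    return others + hosting


def can_reschedule(nodes: List[Node], host: str) -> bool:
    """True if some node other than the host is Ready and Schedulable."""
    return any(n.ready and n.schedulable for n in nodes if n.name != host)


if __name__ == "__main__":
    cluster = [
        Node("node-a", ready=True, schedulable=True),
        Node("node-b", ready=True, schedulable=True),
    ]
    print([n.name for n in update_order(cluster, host="node-a")])
    print(can_reschedule(cluster, host="node-a"))
```

In a single-node cluster `can_reschedule` returns `False`, so the controller would keep its Pod running instead of deleting it.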

@jhaynes jhaynes added this to the Backlog milestone May 21, 2021
@jhaynes jhaynes modified the milestones: Backlog, next May 21, 2021
@jhaynes jhaynes modified the milestones: next, next+1 Jul 28, 2021
@Vaishvenk Vaishvenk added this to Feature Backlog in Bottlerocket Roadmap Aug 6, 2021
@cbgbt cbgbt modified the milestones: brupop 0.1.x next, Backlog Feb 21, 2022
@cbgbt
Contributor

cbgbt commented Apr 5, 2022

Some thoughts from conversation with @somnusfish:

  • Consider evicting the controller first when doing a drain
  • Add a timeout to drains: if they get stuck, error out and trigger our new crash loop handling code (0.2.0: Handle update-reboot failures/ "crash loops" #123). Drains should never roll forward on timeouts.
  • Add PDBs to apiserver deployment to ensure we always have at least 2 running in the cluster.
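
A minimal sketch of the first two ideas, assuming eviction is a synchronous call made once per pod. The names `DrainTimeout`, `eviction_order`, and `drain` are hypothetical, and a real drain would also wait for evicted pods to terminate, which this example leaves out.

```python
import time
from typing import Callable, List


class DrainTimeout(Exception):
    """Raised when a drain passes its deadline; the caller must not proceed."""


def eviction_order(pods: List[str], controller_pod: str) -> List[str]:
    """Put the controller pod first, keeping the others in their given order."""
    # sorted() is stable, so non-controller pods keep their relative order.
    return sorted(pods, key=lambda pod: pod != controller_pod)


def drain(
    pods: List[str],
    evict: Callable[[str], None],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Evict pods in order, raising DrainTimeout once the deadline has passed.

    The deadline is checked before each eviction. On timeout the function
    raises instead of returning, so the update cannot roll forward.
    """
    deadline = clock() + timeout_s
    evicted: List[str] = []
    for pod in pods:
        if clock() >= deadline:
            raise DrainTimeout(f"drain timed out after evicting {evicted}")
        evict(pod)
        evicted.append(pod)
    return evicted


if __name__ == "__main__":
    order = eviction_order(["app-1", "controller", "app-2"], "controller")
    print(drain(order, evict=lambda pod: None, timeout_s=60.0))
```

The caller would catch `DrainTimeout` and hand the node to the failure handling path rather than continuing with the reboot.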

@cbgbt
Contributor

cbgbt commented Apr 5, 2022

We want to allow brupop to update many nodes simultaneously, which makes this more important. Adding this to the 1.0.0 release milestone.

@cbgbt cbgbt modified the milestones: Backlog, brupop 1.0.0 Apr 5, 2022
@somnusfish somnusfish assigned somnusfish and unassigned somnusfish May 2, 2022
@gthao313 gthao313 self-assigned this May 2, 2022
@gthao313 gthao313 modified the milestones: brupop 1.0.0, brupop 0.2.2 Jul 13, 2022
Projects
Bottlerocket Roadmap
Feature Backlog
Development

No branches or pull requests

5 participants