Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bootstrapper: wipe disk and reboot on non-recoverable error #2971

Merged
merged 12 commits into from
Mar 12, 2024

Conversation

daniel-weisse
Copy link
Member

@daniel-weisse daniel-weisse commented Mar 6, 2024

Context

Constellation's bootstrapper may fail at multiple stages.
Some of these failures are transient, like joining an existing cluster using kubeadm and may be retried.
Others are not recoverable, meaning one can no longer re-run constellation apply on the same node, or the node can no longer retry joining an existing cluster.
To "fix" a VM that is in this failed state, the VM needs to be rebooted, and the state-disk wiped.

Proposed change(s)

  • Change how joinClient and initServer are run
    • Both routines now have return values and are run in parallel
    • If any error occurs in run(), the VM is rebooted
  • Adapt the flow of joinClient to more easily differentiate between transient and non recoverable errors
  • In both joinClient and initServer, mark the state disk as not initialized if a non recoverable error occurs (any error after the VM was marked as initialized)
  • Remove the ability to call JoinClient.Start() and JoinClient.Stop() idempotently
    • This was and is not required for the bootstrapper
    • This was unnecessarily complicating the set up and testing of the client

Related issue

Additional info

  • AB#3930
  • Why not use systemd native ways to restart the VM when constelllation-bootstrapper.service fails?
    • On debug images, the service is expected to fail until a bootstrapper is uploaded using cdbg. This makes using debug images impossible
  • How to test?
    • This PR currently contains a commit marked with [drop] that returns an error in the joinClient which causes a fatal error and causes the VM to reboot. The state disk should then be marked as non initialized and be wiped again on start.

Checklist

Copy link

netlify bot commented Mar 6, 2024

Deploy Preview for constellation-docs ready!

Name Link
🔨 Latest commit cac3b47
🔍 Latest deploy log https://app.netlify.com/sites/constellation-docs/deploys/65f02c12070d98000894cb2b
😎 Deploy Preview https://deploy-preview-2971--constellation-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@daniel-weisse daniel-weisse force-pushed the feat/bootstrapper/join-failure-reset branch from e28aab8 to 29d25d8 Compare March 6, 2024 13:52
@daniel-weisse daniel-weisse added the bug fix Fixing a bug label Mar 6, 2024
@daniel-weisse daniel-weisse changed the title bootstrapper: wiped disk and reboot on non-recoverable error bootstrapper: wipe disk and reboot on non-recoverable error Mar 6, 2024
@daniel-weisse daniel-weisse added this to the v2.17.0 milestone Mar 6, 2024
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
@daniel-weisse daniel-weisse force-pushed the feat/bootstrapper/join-failure-reset branch from fa61e82 to b3eff1b Compare March 7, 2024 13:57
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
@daniel-weisse daniel-weisse requested a review from malt3 March 7, 2024 14:31
@daniel-weisse daniel-weisse marked this pull request as ready for review March 7, 2024 14:31
Copy link
Member

@3u13r 3u13r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

bootstrapper/internal/joinclient/joinclient.go Outdated Show resolved Hide resolved
bootstrapper/internal/joinclient/joinclient.go Outdated Show resolved Hide resolved
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
Copy link
Contributor

Coverage report

Package Old New Trend
bootstrapper/cmd/bootstrapper 0.00% 0.00% 🚧
bootstrapper/internal/diskencryption 6.50% 6.50% 🚧
bootstrapper/internal/initserver 17.90% 23.00% ↗️
bootstrapper/internal/joinclient 32.90% 40.00% ↗️

@daniel-weisse daniel-weisse merged commit 1077b7a into main Mar 12, 2024
8 checks passed
@daniel-weisse daniel-weisse deleted the feat/bootstrapper/join-failure-reset branch March 12, 2024 10:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fix Fixing a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants