Adding rollback mechanism to handle bootstrap failures #51718

Merged: 1 commit merged into ceph:main from fix_issue_57016 on Jul 11, 2023

Conversation

@rkachach (Contributor) commented May 23, 2023

This PR introduces a simple mechanism to roll back cluster files and processes in case of a failed bootstrap. Two new mutually exclusive options are provided:

  1. --cleanup-on-failure: enables the rollback, so broken cluster files are deleted automatically
  2. --no-cleanup-on-failure: disables the rollback, keeping the broken cluster files in place

These options are provided to ease backporting this new feature. In older releases the rollback mechanism will be disabled by default to keep the legacy behaviour (nothing is deleted); there, users will need to pass the --cleanup-on-failure flag if they want to enable the rollback mechanism. In main, the rollback feature will be enabled by default.
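
For illustration, a flag pair like this could be wired up with argparse roughly as follows. This is a minimal sketch, not the actual cephadm implementation; the CLEANUP_DEFAULT constant is a stand-in for whatever mechanism selects the per-release default (True on main, False on backports):

```python
import argparse

# Hypothetical per-release default: assumed True on main and
# swapped to False when backporting (not the real cephadm code).
CLEANUP_DEFAULT = True

parser = argparse.ArgumentParser(prog='cephadm bootstrap')
group = parser.add_mutually_exclusive_group()
group.add_argument(
    '--cleanup-on-failure',
    dest='cleanup_on_failure',
    action='store_true',
    default=CLEANUP_DEFAULT,
    help='delete cluster files and stop processes if bootstrap fails')
group.add_argument(
    '--no-cleanup-on-failure',
    dest='cleanup_on_failure',
    action='store_false',
    help='keep the broken cluster files if bootstrap fails')

args = parser.parse_args()
# args.cleanup_on_failure now reflects the flag (or the default),
# and passing both flags at once is rejected by argparse.
```

Because both flags share the same dest, a backport would only need to flip the default to restore the legacy behaviour while keeping the command-line interface identical.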

Fixes: https://tracker.ceph.com/issues/57016

Known issue:

In case of a keyboard interruption (Ctrl+C), the following message sometimes appears. It seems to be related to an asyncio transport that is not being closed properly:

Ceph version: ceph version 18.0.0-4138-g1657989a (1657989a957fd4a6a732417127612a248586fe16) reef (dev)
Extracting ceph user uid/gid from container image...
Creating initial keys...
^CKeyboardInterrupt: 
Deleting cluster with fsid: 140e69ac-faec-11ed-bb04-525400d333a3
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f02cc99fca0>
Traceback (most recent call last):
  File "/usr/lib64/python3.9/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/usr/lib64/python3.9/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/usr/lib64/python3.9/asyncio/unix_events.py", line 536, in close
    self._close(None)
  File "/usr/lib64/python3.9/asyncio/unix_events.py", line 560, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 746, in call_soon
    self._check_closed()
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 510, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
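
For context, this "Event loop is closed" error typically occurs when an asyncio subprocess transport is garbage-collected after its event loop has already been closed: the transport's __del__ then tries to schedule a callback on the dead loop. Below is a minimal sketch of the general pattern that avoids it (illustrative only, not cephadm's actual code; run_cmd and the sleep command are assumptions):

```python
import asyncio
import contextlib

async def run_cmd(cmd):
    # Launch a child process through asyncio. The subprocess transport
    # must be fully closed (child reaped) while the loop still runs;
    # otherwise its __del__ later schedules a callback on the closed
    # loop and asyncio prints "RuntimeError: Event loop is closed".
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await proc.communicate()
        return proc.returncode, out, err
    except asyncio.CancelledError:
        # On Ctrl+C, asyncio.run() cancels this task; reap the child
        # here so the transport is closed before the loop shuts down.
        with contextlib.suppress(ProcessLookupError):
            proc.kill()
        await proc.wait()
        raise

if __name__ == '__main__':
    try:
        # asyncio.run() cancels pending tasks and lets their cleanup
        # run to completion before closing the loop, which avoids the
        # "Event loop is closed" message on keyboard interrupts.
        print(asyncio.run(run_cmd(['sleep', '5'])))
    except KeyboardInterrupt:
        print('interrupted')
```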

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@rkachach force-pushed the fix_issue_57016 branch 3 times, most recently from bb94d83 to 00d31e6, on May 25, 2023 11:19
src/cephadm/cephadm.py: outdated review thread (resolved)
@rkachach marked this pull request as ready for review May 25, 2023 15:31
@rkachach requested a review from a team as a code owner May 25, 2023 15:31
@rkachach (Contributor, Author) commented:

jenkins test make check

@phlogistonjohn (Contributor) commented May 26, 2023

I'm still a little wary of having this auto-cleanup be opt-out rather than opt-in by default. Take this as my last official request to make this opt-in, at least initially, for at least one version. I will not harp on this subject further; this is the last I will pester you about it.

src/cephadm/cephadm.py: 2 outdated review threads (resolved)
@rkachach (Contributor, Author) commented May 30, 2023

> I'm still a little wary of having this auto-cleanup be opt-out rather than opt-in by default. Take this as my last official request to make this opt-in, at least initially, for at least one version. I will not harp on this subject further; this is the last I will pester you about it.

We haven't made a final decision yet; as you know, this is still in the proposal phase and all options are still open, so the code can easily be adjusted if needed. Let's wait for more feedback from the community and, based on that, have further discussion at the upcoming weekly meeting to decide whether to go with the opt-out or the opt-in alternative.

@rkachach changed the title from "[WIP] adding rollback mechanism to handle bootstrap failures" to "Adding rollback mechanism to handle bootstrap failures" on Jun 20, 2023
@rkachach requested a review from adk3798 June 20, 2023 14:01
src/cephadm/cephadm.py: 5 review threads (4 outdated; all resolved)
@rkachach requested a review from adk3798 June 27, 2023 11:14
@adk3798 (Contributor) left a comment

LGTM; we'll just have to be careful when backporting to swap the default value of the --cleanup-on-failure flag, as discussed in the weekly meeting.

@rkachach (Contributor, Author) commented:

Last force push is to squash all the commits.

@adk3798 (Contributor) commented Jul 11, 2023

https://pulpito.ceph.com/adking-2023-07-11_16:35:52-orch:cephadm-wip-adk-testing-2023-07-11-0946-distro-default-smithi/

Reruns of failed/dead jobs: https://pulpito.ceph.com/adking-2023-07-11_19:17:14-orch:cephadm-wip-adk-testing-2023-07-11-0946-distro-default-smithi/

After the reruns: 1 failed job, 1 dead job

  • The failed job was a known issue with jaeger-tracing: https://tracker.ceph.com/issues/59704
  • The dead job was an infra/teuthology issue: "Error reimaging machines: reached maximum tries (101) after waiting for 600 seconds"

Overall, a pretty clean run. Nothing to block merging.

@adk3798 merged commit c252248 into ceph:main on Jul 11, 2023
11 of 12 checks passed