mgr/cephadm: don't mark daemons created/removed in the last minute as stray #56957

Merged
adk3798 merged 2 commits into ceph:main from no-stray-recent-removal
May 10, 2024

@adk3798 adk3798 commented Apr 17, 2024

There is sometimes a slight delay between when the core
mgr knows a daemon has been removed and when cephadm knows
it has been removed. This can cause stray daemon warnings
to pop up for a few seconds at a time. This patch tries
to avoid that by not marking as stray any daemon that cephadm
knows it removed within the past minute.
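
The grace-period idea described above can be illustrated with a minimal sketch. This is not the code from the patch: the function name `find_stray_daemons`, its parameters, and the `recently_changed` mapping are all hypothetical, and the real logic in cephadm may be organized quite differently. Only the 60-second window comes from the description.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Optional

# "the past minute" from the PR description
GRACE_SECONDS = 60


def find_stray_daemons(
    mgr_known: Iterable[str],
    cephadm_managed: Iterable[str],
    recently_changed: Dict[str, datetime],
    now: Optional[datetime] = None,
) -> List[str]:
    """Return daemons the mgr reports that cephadm does not manage,
    skipping any that were created/removed within the grace period.

    All names here are hypothetical and only illustrate the idea.
    """
    now = now or datetime.now(timezone.utc)
    managed = set(cephadm_managed)
    strays = []
    for name in mgr_known:
        if name in managed:
            continue
        changed_at = recently_changed.get(name)
        # A daemon changed less than a minute ago may simply not have
        # been reconciled between the core mgr and cephadm yet.
        if changed_at is not None and (now - changed_at).total_seconds() < GRACE_SECONDS:
            continue
        strays.append(name)
    return strays


now = datetime(2024, 4, 17, 12, 0, 0, tzinfo=timezone.utc)
recent = {
    "nfs.foo.0": now - timedelta(seconds=10),   # removed 10s ago: suppressed
    "osd.3": now - timedelta(seconds=120),      # removed 2min ago: reported
}
result = find_stray_daemons(["mon.a", "nfs.foo.0", "osd.3"], ["mon.a"], recent, now)
```

In this sketch, `nfs.foo.0` is not reported because it changed 10 seconds ago, while `osd.3` is still flagged because its removal is older than the window.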

This will be inherently tested by the existing check for CEPHADM_STRAY_DAEMON
warnings in the tests. Such warnings are known to be present currently
(and have been confirmed to happen on occasion when removing nfs and osd
daemons).

@adk3798 adk3798 requested a review from a team as a code owner April 17, 2024 14:48
@adk3798 adk3798 force-pushed the no-stray-recent-removal branch 2 times, most recently from e663b1c to e7a9292 Compare April 17, 2024 14:53
… stray

There is sometimes a slight delay between when the core
mgr knows a daemon has been created/removed and when cephadm knows
it has been created/removed. This can cause stray daemon warnings
to pop up for a few seconds at a time. This patch tries
to avoid that by not marking as stray any daemon that cephadm
knows it created/removed within the past minute.

Signed-off-by: Adam King <adking@redhat.com>
Was adding a line to serve.py that did

if ((datetime_now() - t).total_seconds() < 60)

and this was causing the remote_executables test to fail with

ValueError: _names: unexpected type: <ast.BinOp object at 0x7f0985c8d670>

where it seems the (datetime_now() - t) was resolving to an
ast.BinOp node which had no case in _names.

This patch makes it so the remote_executables test can also
handle these BinOp nodes and the binary operations that
could be within the node

Signed-off-by: Adam King <adking@redhat.com>
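
To illustrate the kind of failure and fix described in the commit message above, here is a small sketch of a recursive name collector over Python AST nodes. It is an assumption-laden illustration, not the `_names` helper from the remote_executables test: the function `names_in` and the set of node types it handles are hypothetical. It only shows why an expression like `(datetime_now() - t).total_seconds()` reaches an `ast.BinOp` node and how recursing into both operands handles it.

```python
import ast
from typing import List


def names_in(node: ast.AST) -> List[str]:
    """Collect the names referenced by a call target (hypothetical helper)."""
    if isinstance(node, ast.Name):
        return [node.id]
    if isinstance(node, ast.Attribute):
        # e.g. <value>.total_seconds
        return names_in(node.value) + [node.attr]
    if isinstance(node, ast.Call):
        return names_in(node.func)
    if isinstance(node, ast.BinOp):
        # Without this branch, (datetime_now() - t) falls through to
        # the ValueError below, as in the failure quoted above.
        return names_in(node.left) + names_in(node.right)
    raise ValueError(f"names_in: unexpected type: {node!r}")


# Parse the condition from the commit message and take the call on the
# left-hand side of the comparison.
compare = ast.parse("(datetime_now() - t).total_seconds() < 60", mode="eval").body
call = compare.left
collected = names_in(call)
```

Here the attribute access `.total_seconds` wraps a `BinOp` whose left operand is the call `datetime_now()` and whose right operand is the name `t`, so all three names are collected instead of raising.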
@adk3798 adk3798 changed the title mgr/cephadm: don't mark daemons removed in the last minute as stray mgr/cephadm: don't mark daemons created/removed in the last minute as stray Apr 22, 2024
adk3798 commented Apr 30, 2024

jenkins test make check

adk3798 commented Apr 30, 2024

https://pulpito.ceph.com/adking-2024-04-30_05:42:49-orch:cephadm-wip-adk-testing-2024-04-29-2009-distro-default-smithi/

Most failures were cluster log failures that are still in the process of being cleaned up.

Besides that, failures were:

  • 1 instance of https://tracker.ceph.com/issues/65718 (known issue)
  • 4 instances of mds_upgrade_sequence test failing (known issue)
  • 1 instance of staggered upgrade with agent failing (known issue)
  • some dead jobs, either a reimaging issue or getting stuck before the test has actually begun (seemingly on nvme-loop task) (known issue)

@phlogistonjohn phlogistonjohn left a comment

LGTM

adk3798 commented May 10, 2024

https://pulpito.ceph.com/adking-2024-05-09_03:09:40-orch:cephadm-wip-adk-testing-2024-05-08-1927-distro-default-smithi/

failures:

  • 21 in cluster log type failures that are still being cleaned up, known issue
  • 3 mds_upgrade_sequence failures, known issue
  • 1 test failed installing ceph-fuse Status code: 503 for https://mirrors.centos.org/metalink?repo=centos-baseos-9-stream&arch=x86_64&protocol=https,http, happens on occasion, nothing to block merging over

Overall, nothing unexpected in the run

For this PR in particular, the only stray daemon warnings from the run were related to laundry "daemons", which are a specific issue unrelated to this PR

@adk3798 adk3798 merged commit d4dde96 into ceph:main May 10, 2024
11 of 13 checks passed