qa/tasks/daemonwatchdog: Stop thrashers when barking #65978

Closed
Matan-B wants to merge 1 commit into ceph:main from Matan-B:wip-matanb-crimson-watchdog

Conversation

Matan-B commented Oct 16, 2025

The watchdog would bark if:

  • a daemon has been failed for longer than daemon_timeout
  • any thrasher raised an exception

In the latter case, we would also stop and join the thrashers. For an OSD that had been failed for longer than the set timeout, we wouldn't try to stop the thrashers (even though we BARKED). This would result in failed test jobs running with failed OSDs for over 8 hours, until the job timeout is hit.
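The intended behavior can be illustrated with a minimal, self-contained sketch. This is not the code in qa/tasks/daemonwatchdog.py: `FakeThrasher`, `should_bark`, and `bark` are hypothetical names used only for illustration, and the 300s timeout is an assumed value. The point is that both bark conditions lead to the same stop-and-join path.

```python
import threading
import time


class FakeThrasher:
    """Hypothetical stand-in for a thrasher (not the real teuthology class)."""

    def __init__(self, name):
        self.name = name
        self.exception = None  # would be set if the thrasher hit an error
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Keep "thrashing" until asked to stop.
        while not self._stop_event.wait(0.01):
            pass

    def stop(self):
        self._stop_event.set()

    def join(self, timeout=None):
        self._thread.join(timeout)

    def is_alive(self):
        return self._thread.is_alive()


def should_bark(failed_since, thrashers, daemon_timeout, now):
    """Return a reason string if the watchdog should bark, else None."""
    for name, since in failed_since.items():
        if now - since > daemon_timeout:
            return f"daemon {name} failed for more than {daemon_timeout}s"
    for t in thrashers:
        if t.exception is not None:
            return f"thrasher {t.name} raised {t.exception!r}"
    return None


def bark(thrashers):
    """Stop and join every thrasher, regardless of why we barked."""
    for t in thrashers:
        t.stop()
    for t in thrashers:
        t.join(timeout=5)


thrashers = [FakeThrasher("osd-thrasher"), FakeThrasher("mon-thrasher")]
for t in thrashers:
    t.start()

# Pretend osd.0 has been down for 400s with an assumed 300s timeout.
now = time.monotonic()
reason = should_bark({"osd.0": now - 400}, thrashers, daemon_timeout=300, now=now)
if reason is not None:
    bark(thrashers)
```

In this sketch the daemon-timeout case stops the thrashers just as the thrasher-exception case would, which is the behavior this PR aims for.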

See (note the timestamps):

2025-09-03T01:32:00.471 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK!
unmounting mounts and killing all daemons
teuthology.exceptions.CommandFailedError: Command failed on smithi094
with status 124: 'sudo adjust-ulimits ceph-coverage
/home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph
pg dump --format=json'
2025-09-03T09:01:39.050 DEBUG:teuthology.exit:Finished running handlers
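As a quick check of the gap between the two timestamps in the log above, the following sketch parses them with the standard library and prints the elapsed time. It only measures the interval from the BARK line to the final "Finished running handlers" line, not the total job runtime.

```python
from datetime import datetime

# Timestamps copied from the log excerpt above.
bark_time = datetime.fromisoformat("2025-09-03T01:32:00.471")
end_time = datetime.fromisoformat("2025-09-03T09:01:39.050")

elapsed = end_time - bark_time
print(f"job kept running for {elapsed} after the bark")
```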

Fixes: https://tracker.ceph.com/issues/71417

Matan-B commented Oct 16, 2025

Not really Crimson-specific, but Crimson would largely benefit from this, as our thrash tests keep running for 8 hours even when the OSDs are failing. This results in large log files and makes it harder to find the issue.

example: http://qa-proxy.ceph.com/teuthology/xuxuehan-2025-09-03_00:08:41-crimson-rados:thrash_simple-wip-seastore-clone-range-latest-distro-crimson-debug-smithi/8478469/teuthology.log

Matan-B added this to Crimson Oct 16, 2025
Matan-B moved this to Awaits review in Crimson Oct 16, 2025
Matan-B moved this from Awaits review to Needs QA in Crimson Oct 16, 2025
Matan-B commented Oct 16, 2025

@perezjosibm
Contributor

Probably related to #65067, which was posted in https://ibm-systems-storage.slack.com/archives/C051Z80F25P/p1755274223074229; it proposes similar changes to qa/tasks/daemonwatchdog.py as well.

The watchdog would bark if:
- a daemon has been failed for longer than daemon_timeout
- any thrasher raised an exception

In the latter case, we would also stop and join the thrashers.
For an OSD that had been failed for longer than the set timeout,
we wouldn't try to stop the thrashers (even though we BARKED).
This would result in failed test jobs running with failed OSDs
for over 8 hours, until the job timeout is hit.

See (note the timestamps):
```
2025-09-03T01:32:00.471 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK!
unmounting mounts and killing all daemons
teuthology.exceptions.CommandFailedError: Command failed on smithi094
with status 124: 'sudo adjust-ulimits ceph-coverage
/home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph
pg dump --format=json'
2025-09-03T09:01:39.050 DEBUG:teuthology.exit:Finished running handlers
```

Fixes: https://tracker.ceph.com/issues/71417

Signed-off-by: Matan Breizman <mbreizma@redhat.com>
Matan-B force-pushed the wip-matanb-crimson-watchdog branch from 31b5b24 to bf40efe Oct 20, 2025 07:54
Matan-B commented Oct 20, 2025

> Probably related to #65067 that was posted in https://ibm-systems-storage.slack.com/archives/C051Z80F25P/p1755274223074229 -- proposes similar changes to qa/tasks/daemonwatchdog.py as well

The PR above proposes a similar change to the timeout handling, plus many more improvements. I don't think merging this PR would conflict with it, since all the other improvements would still be valid. I'll go ahead and merge this PR so that the nightlies are easier to debug. I'll also try to review Bill's PR. Thanks for sharing it here!


Added logs:
https://pulpito.ceph.com/matan-2025-10-20_08:02:28-crimson-rados-main-distro-crimson-debug-smithi/

Matan-B commented Oct 20, 2025

The process should also be killed, and it seems #64889 has already implemented this well! Closing.

Matan-B closed this Oct 20, 2025
Matan-B removed this from Crimson Oct 20, 2025